[00:00:57] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:01:31] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004
[00:02:42] Operations, ops-codfw, decommission, cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (Papaul) ` papaul@asw-b-codfw# run show interfaces ge-5/0/8 descriptions Interface Admin Link Description ge-5/0/8 down do...
[00:03:07] Operations, ops-codfw, decommission, cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (Papaul)
[00:04:17] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[00:04:31] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 70 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[00:04:51] RECOVERY - Check systemd state on kafka-jumbo1001 is OK: OK - running: The system is fully operational
[00:04:51] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties
[00:06:05] !log jnt push to msw switches
[00:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:01] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1003 is CRITICAL: 5.955e+06 ge 5e+06 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003
[00:07:42] (PS1) Papaul: DNS: Remove mgmt and production DNS entries for labtestvirt200[1-2] [dns] - https://gerrit.wikimedia.org/r/500637
[00:08:41] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1006 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006
[00:08:57] RECOVERY - Check systemd state on kafka-jumbo1003 is OK: OK - running: The system is fully operational
[00:09:13] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:09:22] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet -
https://phabricator.wikimedia.org/T218023 (Papaul)
[00:09:39] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001
[00:09:57] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1003 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties
[00:10:51] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[00:11:13] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:12:59] (CR) BryanDavis: [C: +1] cloudstore: add py extension to nfs-exportd and apply nfsd-ldap everywhere [puppet] - https://gerrit.wikimedia.org/r/500635 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:13:23] RECOVERY - Check systemd state on kafka-jumbo1004 is OK: OK - running: The system is fully operational
[00:13:49] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad@0 on kafka-jumbo1004 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad@0/producer\.properties
[00:16:55] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:16:57] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:17:41] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:18:53] PROBLEM - tilerator on maps2001 is CRITICAL: connect to address 10.192.0.144 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[00:19:14] !log replacing accepted-prefix-limit with prefix-limit on one ulsfo peer - T211730
[00:19:17] PROBLEM - tilerator on maps2002 is CRITICAL: connect to address 10.192.16.179 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[00:19:19] PROBLEM - tilerator on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[00:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:22] T211730: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730
[00:20:39] RECOVERY - Varnishkafka Eventlogging Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=eventlogging&var-host=All
[00:20:47] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:21:52] Operations, netops: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730 (ayounsi) Confirmed that replacing accepted-prefix-limit with prefix-limit does NOT cause the peer to bounce.
[00:22:27] RECOVERY - Kafka Broker Replica Max Lag on kafka-jumbo1003 is OK: (C)5e+06 ge (W)1e+06 ge 9.916e+05 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003
[00:24:05] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[00:25:23] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:25:35] !log replacing accepted-prefix-limit with prefix-limit on all ulsfo peers - T211730
[00:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:39] T211730: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730
[00:27:15] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:30:33] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:35:59] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[00:36:59] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:40:34] !log replacing accepted-prefix-limit with prefix-limit in [co|eq]dfw - T211730
[00:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:41] T211730: Replace accepted-prefix-limit with prefix-limit -
https://phabricator.wikimedia.org/T211730
[00:42:43] (CR) Bstorm: [C: +2] cloudstore: add py extension to nfs-exportd and apply nfsd-ldap everywhere [puppet] - https://gerrit.wikimedia.org/r/500635 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[00:43:59] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:49:07] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[00:51:34] (PS1) Ayounsi: Depooling eqsin because of eqsin-codfw link outage [dns] - https://gerrit.wikimedia.org/r/500638
[00:51:51] (CR) Ayounsi: [C: +2] Depooling eqsin because of eqsin-codfw link outage [dns] - https://gerrit.wikimedia.org/r/500638 (owner: Ayounsi)
[00:52:19] !log depool eqsin due to Telia eqsin-codfw link outage
[00:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:03] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:00:01] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:01:00] Operations, Traffic, netops: Outage on the primary codfw-eqsin link - https://phabricator.wikimedia.org/T219847 (ayounsi) p:Triage→Normal
[01:02:25] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 49.81 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:05:09] RECOVERY -
cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:10:53] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:11:59] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:14:40] !log replacing accepted-prefix-limit with prefix-limit in eqord - T211730
[01:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:14:44] T211730: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730
[01:14:44] Operations, Phabricator, Release-Engineering-Team, Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (Aklapper) @Yann: No, because the "Priority" field is not for users to express how...
[01:15:31] Operations, Traffic, netops: Outage on the primary codfw-eqsin link - https://phabricator.wikimedia.org/T219847 (ayounsi)
[01:17:59] !log replacing accepted-prefix-limit with prefix-limit on cr1-eqiad - T211730
[01:18:00] (CR) Pppery: [C: +1] Add editcontentmodel right to the templateeditor group on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/494016 (https://phabricator.wikimedia.org/T217499) (owner: Ammarpad)
[01:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:20:37] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:21:55] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:22:19] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational
[01:32:13] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:34:37] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:34:47] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:34:57] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:39:55] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:46:11] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:46:21] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[01:47:45] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 96.93 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[01:48:47] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 70.38 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:52:29] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:56:13] PROBLEM - puppet last run on ms-be2026 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[02:08:09] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:09:27] RECOVERY -
cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:12:17] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:13:23] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:14:39] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:15:43] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:18:17] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:19:27] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:21:05] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:24:39] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:24:59] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:25:11] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:28:17] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:36:37] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:37:55] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:46:59] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:48:17] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[02:49:17] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational
[02:51:27] (PS22) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] -
https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[02:53:23] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 58.44 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[02:58:37] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:01:13] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:10:13] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:10:33] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:12:05] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:12:05] Operations, Wikimedia-General-or-Unknown: Figure out why HHVM isn't using error_document404 setting - https://phabricator.wikimedia.org/T187754 (Krinkle)
[03:14:05] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:17:57] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:18:33] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:23:07] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:24:47] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[03:27:03] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 70.84 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[03:30:53] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:33:29] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:37:23] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0]
https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:38:39] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:39:10] (PS23) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[03:48:59] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:52:53] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[03:53:19] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:01:55] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:03:17] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:03:37] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:05:45] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:08:33] PROBLEM - Disk space on cloudcontrol2001-dev is CRITICAL: DISK CRITICAL - free space: / 343 MB (3% inode=86%)
[04:12:36] (PS24) BryanDavis: wmcs: Migrate tools-checker to Stretch [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243)
[04:13:45] RECOVERY - tilerator on maps2002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[04:13:47] RECOVERY - tilerator on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[04:14:37] RECOVERY - tilerator on maps2001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[04:14:50] !log restarted tilerator on maps200[1-3] - connection refused
[04:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:16:05] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:19:57] PROBLEM - Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:23:13] Operations, Maps: Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (Mathew.onipe)
[04:23:13] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:23:24] (CR) BryanDavis: "All tests currently passing. Testable at https://tools-checker-03.wmflabs.org or using `curl localhost/...` on tools-checker-03.tools.eqia" [puppet] - https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: BryanDavis)
[04:26:25] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:29:07] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:32:05] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:32:35] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:34:17] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:35:33] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:35:57] PROBLEM - puppet last run on ores1008 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[04:36:04] (PS1) Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500640
[04:41:53] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:41:54] (PS1) Andrew Bogott: cloud enc: duplicate a placeholder password from 'main' to 'eqiad1' [labs/private] - https://gerrit.wikimedia.org/r/500641
[04:42:35] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:42:38] (CR) Andrew Bogott: [V: +2 C: +2] cloud enc: duplicate a placeholder password from 'main' to 'eqiad1' [labs/private] - https://gerrit.wikimedia.org/r/500641 (owner: Andrew Bogott)
[04:44:13] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:47:03] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:47:11] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:49:03] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:49:45] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully
operational
[04:50:57] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:50:58] (PS4) Andrew Bogott: labweb: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500622
[04:51:00] (PS2) Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500640
[04:51:59] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[04:52:13] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All
[04:52:27] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:56:49] (PS5) Andrew Bogott: labweb: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500622
[04:56:51] (PS3) Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500640
[04:56:53] (PS1) Andrew Bogott: admin_scripts: add a case for Trusty [puppet] - https://gerrit.wikimedia.org/r/500642 (https://phabricator.wikimedia.org/T215407)
[04:58:03] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:01:10] (CR) Andrew Bogott: [C: +2] admin_scripts: add a case for Trusty [puppet] - https://gerrit.wikimedia.org/r/500642 (https://phabricator.wikimedia.org/T215407) (owner: Andrew Bogott) [05:02:13] (Abandoned) Giuseppe Lavagetto: Edit Project Config [docker-images/production-images] (refs/meta/config) - https://gerrit.wikimedia.org/r/500386 (owner: Giuseppe Lavagetto) [05:02:19] RECOVERY - puppet last run on ores1008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [05:02:20] (Abandoned) Giuseppe Lavagetto: Edit Project Config [docker-images] (refs/meta/config) - https://gerrit.wikimedia.org/r/500385 (owner: Giuseppe Lavagetto) [05:03:59] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:04:27] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:05:37] (PS4) Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500640 [05:06:23] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:08:57] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:08:59] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:09:19] (PS5) Andrew Bogott: labpuppetmaster:
move from 'main' region to 'eqiad1' region [puppet] - https://gerrit.wikimedia.org/r/500640 [05:10:15] elukey: I'm sure you'll find out reading the backlog this morning, but still: it seems that something might have gone (still is going?) slightly wrong with kafka during the night [05:12:26] (CR) Andrew Bogott: [C: -1] "this needs a bit more work; the diff is still bigger than it should be. I also need to double-check that this applies cleanly on VMs." [puppet] - https://gerrit.wikimedia.org/r/500640 (owner: Andrew Bogott) [05:14:40] (PS2) Marostegui: realm.pp: Add urlshortcodes to private table [puppet] - https://gerrit.wikimedia.org/r/500470 (https://phabricator.wikimedia.org/T219777) [05:16:45] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:17:33] (CR) Marostegui: [C: +2] realm.pp: Add urlshortcodes to private table [puppet] - https://gerrit.wikimedia.org/r/500470 (https://phabricator.wikimedia.org/T219777) (owner: Marostegui) [05:18:03] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:18:51] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:19:23] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:23:13] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0]
https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:24:31] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:25:14] Operations, ops-eqiad, DBA, Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (Marostegui) [05:30:59] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:32:23] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:37:29] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [05:37:52] (PS1) Giuseppe Lavagetto: New version release [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/500643 [05:38:38] (CR) Giuseppe Lavagetto: [C: +2] New version release [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/500643 (owner: Giuseppe Lavagetto) [05:40:27] (CR) jenkins-bot: New version release [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/500643 (owner: Giuseppe Lavagetto) [05:48:01] (PS1) Giuseppe Lavagetto: Release 1.1.4 [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/500644 [05:50:52] (PS1) Marostegui: db-eqiad.php: Depool pc1008 [mediawiki-config] -
https://gerrit.wikimedia.org/r/500645 [05:52:01] (CR) Marostegui: [C: +2] db-eqiad.php: Depool pc1008 [mediawiki-config] - https://gerrit.wikimedia.org/r/500645 (owner: Marostegui) [05:53:48] (Merged) jenkins-bot: db-eqiad.php: Depool pc1008 [mediawiki-config] - https://gerrit.wikimedia.org/r/500645 (owner: Marostegui) [05:54:55] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [05:55:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1008 (duration: 00m 56s) [05:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:09] !log Upgrade pc1008 [05:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:44] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Release 1.1.4 [docker-images/docker-pkg/deploy] - https://gerrit.wikimedia.org/r/500644 (owner: Giuseppe Lavagetto) [05:56:28] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (Marostegui) [05:58:19] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@2a090ef]: New version for T219778 [05:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:22] T219778: docker-pkg is unhappy on contint1001 - https://phabricator.wikimedia.org/T219778 [05:58:38] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@2a090ef]: New version for T219778 (duration: 00m 19s) [05:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:05] PROBLEM - MariaDB Slave IO: pc2 on pc2008 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@pc1008.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect
to MySQL server on pc1008.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [05:59:10] ^ me [05:59:34] (PS1) Marostegui: Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - https://gerrit.wikimedia.org/r/500648 [05:59:57] (CR) jenkins-bot: db-eqiad.php: Depool pc1008 [mediawiki-config] - https://gerrit.wikimedia.org/r/500645 (owner: Marostegui) [06:00:51] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - https://gerrit.wikimedia.org/r/500648 (owner: Marostegui) [06:01:57] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - https://gerrit.wikimedia.org/r/500648 (owner: Marostegui) [06:02:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1008 (duration: 00m 53s) [06:02:59] RECOVERY - MariaDB Slave IO: pc2 on pc2008 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [06:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:37] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:07:06] (PS1) Marostegui: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/500650 [06:09:47] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [06:11:19] (CR) jenkins-bot: Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - https://gerrit.wikimedia.org/r/500648 (owner: Marostegui) [06:11:20] lovely [06:11:54] weird I don't see anything in the graphs?
[06:12:37] yeah, there is nothing for this time [06:14:18] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/500650 (owner: Marostegui) [06:15:24] (Merged) jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/500650 (owner: Marostegui) [06:16:27] even kafka looks good [06:16:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1064 (duration: 00m 54s) [06:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:53] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:16:53] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/500651 [06:22:45] (CR) jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/500650 (owner: Marostegui) [06:23:57] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:23:59] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [06:24:22] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/500651 (owner: Marostegui) [06:25:15] RECOVERY - ElasticSearch unassigned shard check - 9200- on logstash1007 is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [06:25:17] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/500651 (owner: Marostegui)
[06:26:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1064 (duration: 00m 52s) [06:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:25] ACKNOWLEDGEMENT - HP RAID on db2070 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:6 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T219852 [06:27:30] Operations, ops-codfw: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T219852 (ops-monitoring-bot) [06:27:36] (PS1) Marostegui: db-eqiad.php: Depool db1120 [mediawiki-config] - https://gerrit.wikimedia.org/r/500652 [06:28:14] Operations, ops-codfw, DBA: Degraded RAID on db2070 - https://phabricator.wikimedia.org/T219852 (Marostegui) p:Triage→Normal a:Papaul Can we get this disk replaced? Thanks! [06:28:45] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui) [06:28:55] Operations, DBA: Predictive failures on disk S.M.A.R.T.
status - https://phabricator.wikimedia.org/T208323 (Marostegui) [06:29:17] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1120 [mediawiki-config] - https://gerrit.wikimedia.org/r/500652 (owner: Marostegui) [06:30:03] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:30:23] (Merged) jenkins-bot: db-eqiad.php: Depool db1120 [mediawiki-config] - https://gerrit.wikimedia.org/r/500652 (owner: Marostegui) [06:31:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1120 (duration: 00m 52s) [06:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:53] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - https://gerrit.wikimedia.org/r/500653 [06:33:03] PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh] [06:33:42] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/500651 (owner: Marostegui) [06:33:44] (CR) jenkins-bot: db-eqiad.php: Depool db1120 [mediawiki-config] - https://gerrit.wikimedia.org/r/500652 (owner: Marostegui) [06:34:25] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:34:27] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [06:36:08] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - https://gerrit.wikimedia.org/r/500653 (owner: Marostegui) [06:36:17] I am not getting
why it is still alarming [06:36:30] sumSeries(perSecond(varnishkafka.*.webrequest.upload.varnishkafka.kafka_drerr)) on graphite looks flat zero (that is the metric used in the alarm) [06:37:01] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [06:37:06] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - https://gerrit.wikimedia.org/r/500653 (owner: Marostegui) [06:38:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1120 (duration: 00m 50s) [06:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:34] (PS1) Marostegui: db-eqiad.php: Depool all slaves in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/500654 (https://phabricator.wikimedia.org/T143763) [06:44:45] (PS5) Giuseppe Lavagetto: Expose rsyslog_udp_port to services configs.
[puppet] - https://gerrit.wikimedia.org/r/498872 (https://phabricator.wikimedia.org/T211125) (owner: Ppchelko) [06:44:47] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1120" [mediawiki-config] - https://gerrit.wikimedia.org/r/500653 (owner: Marostegui) [06:45:59] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [06:46:15] (PS1) Ema: ATS: strip PKP headers [puppet] - https://gerrit.wikimedia.org/r/500655 [06:46:17] (PS1) Ema: ATS: test unsetting Accept-Encoding [puppet] - https://gerrit.wikimedia.org/r/500656 (https://phabricator.wikimedia.org/T213263) [06:47:17] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [06:47:59] (CR) Giuseppe Lavagetto: [C: +2] Expose rsyslog_udp_port to services configs. [puppet] - https://gerrit.wikimedia.org/r/498872 (https://phabricator.wikimedia.org/T211125) (owner: Ppchelko) [06:48:35] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [06:57:21] (CR) Marostegui: "@jcrespo does this look ok to you?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/500654 (https://phabricator.wikimedia.org/T143763) (owner: Marostegui) [06:58:54] (PS2) Ema: ATS: strip PKP headers [puppet] - https://gerrit.wikimedia.org/r/500655 (https://phabricator.wikimedia.org/T213263) [06:59:27] RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:05:47] (PS2) Muehlenhoff: Remove labs_vmbuilder [puppet] - https://gerrit.wikimedia.org/r/500407 [07:09:42] (PS3) Ema: ATS: strip PKP headers [puppet] - https://gerrit.wikimedia.org/r/500655 (https://phabricator.wikimedia.org/T213263) [07:11:18] (CR) Ema: [C: +2] ATS: strip PKP headers [puppet] - https://gerrit.wikimedia.org/r/500655 (https://phabricator.wikimedia.org/T213263) (owner: Ema) [07:13:06] (PS3) Muehlenhoff: Remove labs_vmbuilder [puppet] - https://gerrit.wikimedia.org/r/500407 [07:14:08] (CR) Muehlenhoff: [C: +2] Remove labs_vmbuilder [puppet] - https://gerrit.wikimedia.org/r/500407 (owner: Muehlenhoff) [07:15:21] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:19:55] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:21:14] (PS2) Ema: ATS: test unsetting Accept-Encoding [puppet] - https://gerrit.wikimedia.org/r/500656 (https://phabricator.wikimedia.org/T213263) [07:22:34] (CR) Ema: [C: +2] ATS: test unsetting Accept-Encoding [puppet] - https://gerrit.wikimedia.org/r/500656 (https://phabricator.wikimedia.org/T213263) (owner: Ema) [07:23:14] moritzm: ok to puppet-merge 1292208e09?
[07:23:49] (PS1) Marostegui: site.pp: Clarify labsdb1004 and 1005 status [puppet] - https://gerrit.wikimedia.org/r/500657 (https://phabricator.wikimedia.org/T216749) [07:24:08] (PS2) Marostegui: site.pp: Clarify labsdb1004 and 1005 status [puppet] - https://gerrit.wikimedia.org/r/500657 (https://phabricator.wikimedia.org/T216749) [07:24:34] moritzm: I'll leave it up to you, my change is minimal and can be merged [07:25:05] (CR) Marostegui: [C: +2] site.pp: Clarify labsdb1004 and 1005 status [puppet] - https://gerrit.wikimedia.org/r/500657 (https://phabricator.wikimedia.org/T216749) (owner: Marostegui) [07:25:12] (CR) Vgutierrez: [C: +2] redirects.dat: Get rid of domains non controlled by WMF [puppet] - https://gerrit.wikimedia.org/r/499239 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez) [07:25:20] (PS3) Vgutierrez: redirects.dat: Get rid of domains non controlled by WMF [puppet] - https://gerrit.wikimedia.org/r/499239 (https://phabricator.wikimedia.org/T213705) [07:25:42] ema: got it, moritzm can your change be merged? [07:26:57] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [07:27:05] My change can also be merged anytime, it is just a comment to clarify the status of two hsots [07:27:09] hosts [07:27:56] feel free to merge mine as well [07:28:10] nice queue XD [07:30:31] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:30:56] ema: sorry, got distracted by ms-be2026, just merged it [07:32:04] moritzm: I think all the others, ema's, vgutierrez and mine can also be merged [07:32:38] +1 [07:32:55] * vgutierrez merging [07:33:14] thanks! [07:33:27] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[07:33:33] (done) [07:34:24] thx [07:34:47] (CR) Filippo Giunchedi: [C: +1] logstash: send varnish syslogs via kafka logging pipeline [puppet] - https://gerrit.wikimedia.org/r/498467 (https://phabricator.wikimedia.org/T213899) (owner: Herron) [07:46:03] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:47:05] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [07:48:32] (PS10) Urbanecm: Test rules reference only existing wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/494188 (https://phabricator.wikimedia.org/T217541) [07:49:50] Operations, ops-codfw: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (MoritzMuehlenhoff) [07:52:27] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [07:52:57] !log removed labvirt1008 from debmonitor (T216661) [07:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:01] T216661: cloudVPS: drain and rebuild labvirt1008 as cloudvirt1008 - https://phabricator.wikimedia.org/T216661 [07:54:19] PROBLEM - Disk space on notebook1003 is CRITICAL: DISK CRITICAL - free space: /srv 4952 MB (3% inode=85%) [07:54:41] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:01:01] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:01:31] Operations, Phabricator, Release-Engineering-Team, Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large
regressions/outages - https://phabricator.wikimedia.org/T219589 (jcrespo) @Yann and @Aklapper please stop discussing that (or at least discussing... [08:02:47] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:03:04] Operations, Phabricator, Release-Engineering-Team, Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (jcrespo) [08:03:19] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:05:15] !log installing openssl1.0 security updates [08:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:20] (PS1) Giuseppe Lavagetto: monitoring: use internal graphite url [puppet] - https://gerrit.wikimedia.org/r/500665 [08:06:13] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:06:41] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:07:25] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:07:40] (CR) Elukey: "LGTM, added also Filippo!
Thanks :)" [puppet] - https://gerrit.wikimedia.org/r/500665 (owner: Giuseppe Lavagetto) [08:09:59] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:10:53] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [08:11:25] (CR) Vgutierrez: "some of the listed SNIs have several DNS issues: https://phabricator.wikimedia.org/P8325" [puppet] - https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez) [08:13:13] !log installing debdeploy updates on remaining hosts in eqiad/codfw [08:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:32] (CR) Jcrespo: [C: +1] db-eqiad.php: Depool all slaves in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/500654 (https://phabricator.wikimedia.org/T143763) (owner: Marostegui) [08:13:36] (CR) Vgutierrez: [C: +2] Allow LE issue the non-canonical redirects service certificate [dns] - https://gerrit.wikimedia.org/r/499156 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez) [08:13:37] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:13:38] \o/ [08:13:48] (PS4) Vgutierrez: Allow LE issue the non-canonical redirects service certificate [dns] - https://gerrit.wikimedia.org/r/499156 (https://phabricator.wikimedia.org/T213705) [08:13:49] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:14:03]
(CR) Marostegui: [C: +2] db-eqiad.php: Depool all slaves in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/500654 (https://phabricator.wikimedia.org/T143763) (owner: Marostegui) [08:15:06] (Merged) jenkins-bot: db-eqiad.php: Depool all slaves in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/500654 (https://phabricator.wikimedia.org/T143763) (owner: Marostegui) [08:16:03] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [08:16:25] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:16:40] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool all slaves in x1 T219777 T143763 (duration: 00m 53s) [08:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:45] T143763: Remove unused bundling DB fields - https://phabricator.wikimedia.org/T143763 [08:16:45] T219777: DBA review of UrlShortener - https://phabricator.wikimedia.org/T219777 [08:18:24] (CR) Vgutierrez: "After merging Ib064d25b82cdc1fcf9372a7881d8caece2433507 looks way better: https://phabricator.wikimedia.org/P8326" [puppet] - https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez) [08:20:31] (CR) Filippo Giunchedi: [C: +1] "no PCC but LGTM" [puppet] - https://gerrit.wikimedia.org/r/500665 (owner: Giuseppe Lavagetto) [08:20:36] !log Compress wikishared.urlshortcodes table on x1, directly on the master with replication (table has 1 row) - T219777 [08:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:55] RECOVERY - Disk space on notebook1003 is OK: DISK OK [08:24:30] (CR)
jenkins-bot: db-eqiad.php: Depool all slaves in x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/500654 (https://phabricator.wikimedia.org/T143763) (owner: Marostegui) [08:24:35] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:31:43] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:32:39] (PS13) Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - https://gerrit.wikimedia.org/r/492375 [08:34:12] (Abandoned) Gehel: WIP: experimentation with type hints [software/spicerack] - https://gerrit.wikimedia.org/r/491812 (owner: Gehel) [08:34:27] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:34:43] (CR) Filippo Giunchedi: [C: +1] "LGTM, I've cc'd service ops and ores folks for notification / heads up" [puppet] - https://gerrit.wikimedia.org/r/497614 (https://phabricator.wikimedia.org/T213899) (owner: Herron) [08:36:55] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:40:07] PROBLEM - puppet last run on scb2006 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[rsyslog-kafka] [08:41:09] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:45:37] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [08:47:22] Operations, Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (Vgutierrez) [08:48:04] Operations, Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (Vgutierrez) p:Triage→Normal [08:50:00] !log Execute schema change on db1069 x1 master with replication enabled on the following small wikis: aawiki aawikibooks aawiktionary abwiki abwiktionary acewiki advisorswiki advisorywiki adywiki afwiki T143763 [08:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:05] T143763: Remove unused bundling DB fields - https://phabricator.wikimedia.org/T143763 [08:50:39] RECOVERY - puppet last run on scb2006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:50:45] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [08:52:08] (PS2) Dzahn: k8s:proxy: remove upstart support [puppet] - https://gerrit.wikimedia.org/r/499769 [08:52:37] RECOVERY - MegaRAID on sodium is OK: OK: optimal, 1 logical, 4 physical [08:52:50] (CR) Vgutierrez: acme_chief: Issue the non-canonical redirect certificates (2 comments) [puppet] - https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez) [08:52:53] :) sodium [08:53:14] Operations, Puppet: Some jessie instances upset about rsyslog package -
https://phabricator.wikimedia.org/T219764 (10Joe) I encountered the same problem, and I think the problem lies elsewhere, specifically in the prerm script from the current `rsyslog` package: ` Unpacking rsyslog-gnutls (8.1901.0-1~bpo8... [08:53:44] (03PS3) 10Gehel: elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/500525 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [08:54:33] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool all slaves in x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500670 [08:56:33] (03CR) 10Gehel: [C: 03+2] elasticsearch: add profile for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/500525 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [08:56:45] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool all slaves in x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500670 (owner: 10Marostegui) [08:57:52] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool all slaves in x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500670 (owner: 10Marostegui) [08:58:05] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool all slaves in x1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500670 (owner: 10Marostegui) [08:58:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool all slaves in x1 T219777 T143763 (duration: 00m 53s) [08:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:00] T143763: Remove unused bundling DB fields - https://phabricator.wikimedia.org/T143763 [08:59:01] T219777: DBA review of UrlShortener - https://phabricator.wikimedia.org/T219777 [08:59:17] PROBLEM - HHVM jobrunner on mw1337 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [09:00:23] (03PS4) 10Vgutierrez: acme_chief: Issue the non-canonical redirect certificates [puppet] - 
10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) [09:00:35] RECOVERY - HHVM jobrunner on mw1337 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [09:00:37] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10jcrespo) Thanks, it installed with no issues. [09:01:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500622 (owner: 10Andrew Bogott) [09:02:59] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:03:05] (03PS2) 10Arturo Borrero Gonzalez: DNS: Remove mgmt and production DNS entries for labtestvirt200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/500637 (owner: 10Papaul) [09:03:14] <_joe_> !log uploaded patched version of bootstrap-vz to account for jessie-updates vanishing (T219683) [09:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:17] T219683: Rebuild docker-registry.wikimedia.org/wikimedia-jessie to drop jessie-update/jessie-backports - https://phabricator.wikimedia.org/T219683 [09:04:01] (03CR) 10Vgutierrez: "everything looking good now: https://phabricator.wikimedia.org/P8327" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [09:04:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] DNS: Remove mgmt and production DNS entries for labtestvirt200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/500637 (owner: 10Papaul) [09:04:49] (03PS1) 10Strainu: Enable extension SandboxLink for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500672 (https://phabricator.wikimedia.org/T210325) [09:05:01] PROBLEM - kubelet operational 
latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:06:03] (03PS2) 10Strainu: Enable extension SandboxLink for rowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500672 (https://phabricator.wikimedia.org/T219855) [09:08:05] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:09:15] (03PS2) 10Arturo Borrero Gonzalez: DNS: Remove mgmt and production DNS for cloudnet2001-dev [dns] - 10https://gerrit.wikimedia.org/r/500634 (owner: 10Papaul) [09:10:27] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: `... [09:10:52] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Issue the non-canonical redirect certificates [puppet] - 10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [09:11:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] DNS: Remove mgmt and production DNS for cloudnet2001-dev [dns] - 10https://gerrit.wikimedia.org/r/500634 (owner: 10Papaul) [09:11:43] PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100% [09:12:19] RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [09:14:13] PROBLEM - puppet last run on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused [09:14:16] (03PS1) 10Ema: ATS: add ats-backend-restart [puppet] - 10https://gerrit.wikimedia.org/r/500675 (https://phabricator.wikimedia.org/T213263) [09:14:31] PROBLEM - DPKG on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused [09:14:36] !log T219776 finally 
reimaging cloudnet2003-dev.codfw.wmnet (was labtestnet2003) [09:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:39] T219776: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 [09:14:59] PROBLEM - configured eth on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused [09:15:09] PROBLEM - MD RAID on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused [09:15:25] PROBLEM - Disk space on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused [09:15:43] PROBLEM - dhclient process on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused [09:16:45] ACKNOWLEDGEMENT - DPKG on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused Arturo Borrero Gonzalez T219776 [09:16:45] ACKNOWLEDGEMENT - Disk space on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused Arturo Borrero Gonzalez T219776 [09:16:45] ACKNOWLEDGEMENT - MD RAID on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused Arturo Borrero Gonzalez T219776 [09:16:45] ACKNOWLEDGEMENT - configured eth on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused Arturo Borrero Gonzalez T219776 [09:16:45] ACKNOWLEDGEMENT - dhclient process on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused Arturo Borrero Gonzalez T219776 [09:16:46] ACKNOWLEDGEMENT - puppet last run on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused Arturo Borrero Gonzalez T219776 [09:16:46] ACKNOWLEDGEMENT - DNS labtestnet2003.mgmt on labtestnet2003.mgmt is CRITICAL: Domain labtestnet2003.mgmt.codfw.wmnet was not found by the server Arturo Borrero Gonzalez T219776 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:17:11] 10Operations, 10ops-codfw: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10Volans) I can ssh into it via cumin. The MD raid status is this: `lang=bash $ cat /proc/mdstat Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] md0 : active raid1 sda1[0](F)... [09:18:25] (03PS1) 10Mathew.onipe: icinga: remove unwanted character from elastic check [puppet] - 10https://gerrit.wikimedia.org/r/500678 [09:19:25] (03PS1) 10Arturo Borrero Gonzalez: install_server: fix typo in partman recipe selector for cloudnet2003-dev [puppet] - 10https://gerrit.wikimedia.org/r/500679 (https://phabricator.wikimedia.org/T219776) [09:21:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] install_server: fix typo in partman recipe selector for cloudnet2003-dev [puppet] - 10https://gerrit.wikimedia.org/r/500679 (https://phabricator.wikimedia.org/T219776) (owner: 10Arturo Borrero Gonzalez) [09:22:29] (03PS2) 10Gehel: icinga: remove unwanted character from elastic check [puppet] - 10https://gerrit.wikimedia.org/r/500678 (owner: 10Mathew.onipe) [09:23:09] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [09:23:50] (03CR) 10Gehel: [C: 03+2] icinga: remove unwanted character from elastic check [puppet] - 10https://gerrit.wikimedia.org/r/500678 (owner: 10Mathew.onipe) [09:24:03] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudnet2003-dev.codfw.wmnet'] ` Of which those... 
[09:24:18] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: `... [09:25:27] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [09:25:55] 10Operations, 10Acme-chief, 10Traffic, 10Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (10Vgutierrez) 05Open→03Resolved The non-canonical certs have been issued successfully: `root@acmechief1001:~# for i in {1..4}; do openssl x509 -text -no... [09:26:03] 10Operations, 10Acme-chief, 10Traffic, 10Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (10Vgutierrez) [09:26:25] PROBLEM - SSH on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:28:47] RECOVERY - SSH on labtestnet2003 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:37:51] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (10Joe) [09:38:36] (03CR) 10Arturo Borrero Gonzalez: "sorry for that" [puppet] - 10https://gerrit.wikimedia.org/r/500642 (https://phabricator.wikimedia.org/T215407) (owner: 10Andrew Bogott) [09:39:33] 10Operations, 10Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (10MoritzMuehlenhoff) Running the steps from 
the prerm on a jessie system with 8.38 works fine: ` jmm@alsafi:~$ sudo systemctl stop syslog.socket jmm@alsafi:~$ sudo invoke-rc.d rsyslog stop j... [09:41:59] (03PS2) 10Alexandros Kosiaris: wikitech: Lock LDAP accounts when users are blocked [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497866 (https://phabricator.wikimedia.org/T168692) (owner: 10BryanDavis) [09:45:03] (03PS6) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) [09:45:05] (03PS1) 10Jcrespo: mariadb-backups: Setup dbprov2002 [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) [09:46:51] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10aborrero) [09:47:06] PROBLEM - Long running screen/tmux on labtestnet2003 is CRITICAL: connect to address 10.192.20.12 port 5666: Connection refused [09:48:09] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10aborrero) [09:49:10] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:49:35] mmmmm [09:50:03] today is not the best :D [09:50:08] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:50:24] 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Server rename: labtestnet2003 to cloudnet2003-dev, update label and switch ports descriptions, etc - https://phabricator.wikimedia.org/T219861 (10aborrero) [09:50:52] 10Operations, 10ops-codfw, 10Cloud-VPS, 
10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10aborrero) [09:50:54] 10Operations, 10ops-codfw: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10Volans) So the `dsa-check-hpssacli` check is happily returning `0` exit code and this output: ` OK: Slot 0: no logical drives --- Slot 0: no drives ` Given that IIRC we add the HP raid check only on the hosts tha... [09:51:17] ah ok a burst of request waiting to be cached [09:52:31] (03PS2) 10Ema: ATS: add ats-backend-restart [puppet] - 10https://gerrit.wikimedia.org/r/500675 (https://phabricator.wikimedia.org/T213263) [09:53:48] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [09:54:00] RECOVERY - Disk space on cloudcontrol2001-dev is OK: DISK OK [09:55:32] PROBLEM - IPMI Sensor Status on labtestnet2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.20.12: Connection reset by peer [09:57:58] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [09:59:32] PROBLEM - NTP on labtestnet2003 is CRITICAL: NTP CRITICAL: No response from NTP server https://wikitech.wikimedia.org/wiki/NTP [09:59:33] (03PS3) 10Ema: ATS: add ats-backend-restart [puppet] - 10https://gerrit.wikimedia.org/r/500675 (https://phabricator.wikimedia.org/T213263) [10:01:36] RECOVERY - dhclient process on labtestnet2003 is OK: PROCS OK: 0 processes with command name dhclient [10:01:38] RECOVERY - MD RAID on labtestnet2003 is OK: OK: Active: 8, Working: 8, Failed: 0, 
Spare: 0 [10:02:32] RECOVERY - configured eth on labtestnet2003 is OK: OK - interfaces up [10:02:34] RECOVERY - Disk space on labtestnet2003 is OK: DISK OK [10:03:12] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:03:14] (03PS1) 10Volans: RAID: hpssacli exit with correct code [puppet] - 10https://gerrit.wikimedia.org/r/500684 (https://phabricator.wikimedia.org/T219854) [10:03:58] (03CR) 10Marostegui: mariadb-backups: Setup dbprov2002 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [10:04:32] (03PS3) 10Dzahn: k8s:proxy: remove upstart support [puppet] - 10https://gerrit.wikimedia.org/r/499769 [10:04:36] RECOVERY - DPKG on labtestnet2003 is OK: All packages OK [10:04:46] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [10:06:59] 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10MoritzMuehlenhoff) I had a look at the missing packages: catch: It's self-contained and has minimal build deps and isn't used anywhere in our fleet, I think we can simply import the... 
[10:08:22] !log manually purge varnishkafka graphite alert's URL as attempt to avoid a flapping alert - T219842 [10:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:25] T219842: [Post-mortem] Kafka Jumbo cluster cannot accept connections - https://phabricator.wikimedia.org/T219842 [10:09:18] (03PS1) 10Mathew.onipe: icinga: align elastic base and nrpe check titles [puppet] - 10https://gerrit.wikimedia.org/r/500686 [10:09:38] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [10:10:00] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudnet2003-dev.codfw.wmnet'] ` and were **ALL*... 
[10:10:26] RECOVERY - puppet last run on labtestnet2003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:10:28] (03PS2) 10Mathew.onipe: icinga: align elastic base and nrpe check titles [puppet] - 10https://gerrit.wikimedia.org/r/500686 [10:14:33] (03CR) 10Vgutierrez: [C: 03+1] ATS: add ats-backend-restart [puppet] - 10https://gerrit.wikimedia.org/r/500675 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [10:15:58] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [10:18:32] (03PS2) 10Volans: RAID: hpssacli exit with correct code [puppet] - 10https://gerrit.wikimedia.org/r/500684 (https://phabricator.wikimedia.org/T219854) [10:19:28] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 3 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) [10:20:34] (03CR) 10Dzahn: [C: 03+2] k8s:proxy: remove upstart support [puppet] - 10https://gerrit.wikimedia.org/r/499769 (owner: 10Dzahn) [10:25:30] (03Abandoned) 10Muehlenhoff: kube-proxy: Remove support for Ubuntu/trusty [puppet] - 10https://gerrit.wikimedia.org/r/500404 (owner: 10Muehlenhoff) [10:25:36] RECOVERY - IPMI Sensor Status on labtestnet2003 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK [10:25:38] (03PS1) 10Arturo Borrero Gonzalez: serverpackages: mitaka: stretch: additional pinning fixes [puppet] - 10https://gerrit.wikimedia.org/r/500691 (https://phabricator.wikimedia.org/T215407) [10:28:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "pcc OK https://puppet-compiler.wmflabs.org/compiler1002/15489/" [puppet] - 10https://gerrit.wikimedia.org/r/500691 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [10:29:18] (03PS1) 10Greta WMDE: 
Increase musical notation datatype string length limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) [10:30:30] !log add debhelper 10.2.5 and dh-systemd 10.2.5 to jessie-wikimedia/backports [10:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:58] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:33:46] (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: drop file cleanup declarations [puppet] - 10https://gerrit.wikimedia.org/r/500694 [10:35:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: serverpackages: drop file cleanup declarations [puppet] - 10https://gerrit.wikimedia.org/r/500694 (owner: 10Arturo Borrero Gonzalez) [10:35:54] PROBLEM - puppet last run on cloudvirt1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:36:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Nice, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/500615 (owner: 10Alex Monk) [10:36:59] (03PS2) 10Alexandros Kosiaris: service::node: Only try to define node10 repository if it is not already defined [puppet] - 10https://gerrit.wikimedia.org/r/500615 (owner: 10Alex Monk) [10:37:44] PROBLEM - puppet last run on cloudnet2002-dev is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:38:14] PROBLEM - puppet last run on cloudnet1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:39:06] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:39:08] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:39:10] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:39:14] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:39:18] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:39:25] !log add dh-autoreconf 12 to jessie-wikimedia/backports [10:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:34] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal 
[10:39:44] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:39:51] <_joe_> uh what's up with aqs? [10:39:55] <_joe_> elukey: ^^ [10:40:00] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:40:05] oh that wikitech page is empty :( [10:40:12] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [10:40:20] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:40:21] yeah I think it is somebody using AQS -> Druid to gather edit data [10:40:44] (03PS1) 10Dzahn: wikiba.se: add Apache rewrites for www to naked domain [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) [10:40:47] yes see broker metrics in https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&orgId=1 [10:40:52] marostegui: should be https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS will fix url [10:40:56] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [10:41:02] mutante: thanks! 
[10:41:58] might be an expensive query, going to check [10:42:16] (PS1) Arturo Borrero Gonzalez: openstack: neutron-common: fix relationship with sqlite3 [puppet] - https://gerrit.wikimedia.org/r/500696 [10:42:38] !log add strip-nondeterminism 0.034 to jessie-wikimedia/backports [10:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:44] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:43:03] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: neutron-common: fix relationship with sqlite3 [puppet] - https://gerrit.wikimedia.org/r/500696 (owner: Arturo Borrero Gonzalez) [10:43:10] eh.. can't find the check in puppet.. odd.. keep looking though [10:43:18] PROBLEM - puppet last run on cloudvirt1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:43:36] PROBLEM - puppet last run on cloudnet1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:44:06] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:44:06] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [10:44:38] hosts unknown to pybal? [10:45:04] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:45:10] here it is: https://config-master.wikimedia.org/pybal/eqiad/aqs [10:45:22] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:45:35] no, wrong one.. 
but here anyways: https://config-master.wikimedia.org/pybal/eqiad/druid-public-broker [10:45:35] <_joe_> druid, not aqs [10:46:20] PROBLEM - puppet last run on cloudvirt1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:46:46] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:46:54] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:46:55] https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&panelId=43&fullscreen&orgId=1&from=now-24h&to=now [10:46:58] this is the issue [10:47:04] PROBLEM - puppet last run on cloudvirt2001-dev is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:47:34] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:47:37] !log add catch 1.10 to jessie-wikimedia/backports [10:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:08] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:48:08] PROBLEM - puppet last run on cloudvirt2003-dev is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:48:58] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:49:11] the noise from cloudvirt* .. like cloudvirt1025.. 
seems already over, puppet runs fine [10:49:42] PROBLEM - puppet last run on cloudvirt2002-dev is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:50:00] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:50:59] yes I can confirm, big queries probably coming from a bot [10:51:36] RECOVERY - puppet last run on cloudvirt1025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:51:44] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:51:58] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:52:36] (PS3) Alaa Sarhan: Add wgScoreLineWidthInches to config [mediawiki-config] - https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191) [10:52:54] marostegui: now i see what happened.. that check is a generic endpoints check for all services (as in 'scb') so the URL is built from Services/Monitoring/$name . 
will add a redirect in wiki [10:53:26] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:53:36] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:53:44] (PS7) Alaa Sarhan: Add wgMusicalNotationLineWidthInches to labs config [mediawiki-config] - https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) [10:53:58] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:54:02] RECOVERY - puppet last run on cloudnet1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:54:24] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [10:54:30] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:54:38] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:56:46] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:57:40] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:57:41] (PS1) Giuseppe Lavagetto: profile::graphite::base: prevent caching of metrics [puppet] - https://gerrit.wikimedia.org/r/500700 [10:58:04] (CR) Dzahn: [C: -1] "ah. it was the other way around. 
naked -> www right now" [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [10:58:28] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:59:09] (03PS2) 10Giuseppe Lavagetto: monitoring: use internal graphite url [puppet] - 10https://gerrit.wikimedia.org/r/500665 [10:59:18] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1100). [11:00:05] Tulsi, Urbanecm, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] Here [11:00:22] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:00:50] Amir1: want to take the entire swat today? (since there are 3 patches, 1 of them yours) :) [11:01:22] if Amir1 is not around, I can SWAT [11:01:32] Tulsi: around for swat? [11:01:39] @seen Amir1 [11:01:39] mutante: Amir1 is in here, right now [11:01:39] Tulsi|Away: around for swat? 
[11:01:49] sure [11:01:52] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:02:02] Amir1: great, swat is yours then :) [11:02:20] RECOVERY - puppet last run on cloudvirt1019 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [11:02:45] !log add rapidjson 1.1.0 to jessie-wikimedia/backports [11:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:17] Urbanecm: around? [11:03:24] you are [11:03:24] Yes Amir1 [11:03:37] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494188 (https://phabricator.wikimedia.org/T217541) (owner: 10Urbanecm) [11:03:52] does your patch need syncing? it's a test, right? [11:04:02] RECOVERY - puppet last run on cloudnet2002-dev is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:04:17] yes, but I think if you don't sync it, other devs will complain about unsynced changes, right?
[11:04:23] (yes, it is a test) [11:04:31] (03PS2) 10Giuseppe Lavagetto: profile::graphite::base: prevent caching of metrics [puppet] - 10https://gerrit.wikimedia.org/r/500700 [11:04:34] not fully sure about the procedure for getting this merged, though [11:04:43] Urbanecm: we just need to rebase it [11:04:46] (03Merged) 10jenkins-bot: Test rules reference only existing wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494188 (https://phabricator.wikimedia.org/T217541) (owner: 10Urbanecm) [11:05:01] it's fine, done it a million times before :P [11:05:17] Okay then :) [11:05:25] now it's done [11:05:28] (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: don't install python-sphinx from our repo [puppet] - 10https://gerrit.wikimedia.org/r/500701 (https://phabricator.wikimedia.org/T215407) [11:05:56] thanks [11:06:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: serverpackages: stretch: don't install python-sphinx from our repo [puppet] - 10https://gerrit.wikimedia.org/r/500701 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [11:06:20] (03CR) 10jenkins-bot: Test rules reference only existing wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494188 (https://phabricator.wikimedia.org/T217541) (owner: 10Urbanecm) [11:06:24] Tulsi, Tulsi|Away Please ping me when you're around [11:06:34] (03PS2) 10Ladsgroup: Add the 'urlshortener-manage-url' right and enable it for stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499777 (https://phabricator.wikimedia.org/T133109) [11:07:18] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499777 (https://phabricator.wikimedia.org/T133109) (owner: 10Ladsgroup) [11:08:14] (03Merged) 10jenkins-bot: Add the 'urlshortener-manage-url' right and enable it for stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499777 (https://phabricator.wikimedia.org/T133109) (owner: 10Ladsgroup) [11:08:42] RECOVERY - 
Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. [11:09:14] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. [11:09:54] RECOVERY - puppet last run on cloudnet1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:10:32] Amir1: could i ask you about wikiba.se or the best contact at WMDE [11:10:35] (03CR) 10Ema: [C: 03+1] profile::graphite::base: prevent caching of metrics [puppet] - 10https://gerrit.wikimedia.org/r/500700 (owner: 10Giuseppe Lavagetto) [11:10:53] Amir1: also added some more test URLs for shortener using Unicode chars [11:11:28] (03CR) 10Mathew.onipe: "PCC output is expected: https://puppet-compiler.wmflabs.org/compiler1002/15494/" [puppet] - 10https://gerrit.wikimedia.org/r/500686 (owner: 10Mathew.onipe) [11:11:38] mutante: sure, if I can't answer your question, I will tell you who can [11:11:39] (03CR) 10Elukey: [C: 03+1] profile::graphite::base: prevent caching of metrics [puppet] - 10https://gerrit.wikimedia.org/r/500700 (owner: 10Giuseppe Lavagetto) [11:12:18] mutante: yeah but the accepted char set doesn't have those https://github.com/wikimedia/mediawiki-extensions-UrlShortener/blob/master/extension.json#L122 :((( [11:12:37] mutante: btw. Regarding V for Vendetta, that's capital V :P [11:13:09] Amir1: currently both wikiba.se and www.wikiba.se work equally, but there are no redirects/rewrites between them. i think we should avoid serving same content from different URLs, so i would rewrite them. but which way around.. do we want to make the www canonical? [11:13:12] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:13:30] RECOVERY - puppet last run on cloudvirt2001-dev is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [11:13:33] Amir1: re: accepted char set: ooh.. 
ok, well.. it was a test:) and capital V makes sense, heh [11:14:07] hmm, I can send the lowercase v to wikivoyage [11:14:21] !log add cmake 3.6.2 to jessie-wikimedia/backports [11:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:24] (03PS19) 10Mathew.onipe: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) [11:14:26] mutante: hmm, that's a question for our PM I guess [11:14:31] let me ask her [11:14:33] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/15496/" [puppet] - 10https://gerrit.wikimedia.org/r/487895 (owner: 10Muehlenhoff) [11:14:36] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:14:39] !log T217715 Update mathoid, citoid, cxserver, eventgate grafana dashboards to use the new recording rules for the quantiles [11:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:43] T217715: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 [11:14:52] RECOVERY - puppet last run on cloudvirt1020 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:15:22] Amir1: thank you! tell her that is part of moving it to WMF prod finally. we are now unblocked. 
we have a certificate :) [11:15:29] Amir1: or https://en.wiktionary.org/wiki/v if you dont have Wiktionary yet [11:15:40] yay [11:15:55] we have some wiktionary already :D [11:15:59] Amir1: if you hack your /etc/hosts you can already see wikiba.se in wmf prod [11:16:02] RECOVERY - puppet last run on cloudvirt2002-dev is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:16:15] Amir1: now just missing the rewrite stuff and HSTS [11:16:51] YESS, I've been waiting for this for years [11:16:52] PROBLEM - HHVM jobrunner on mw1311 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:17:08] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:17:21] Amir1: :) https://phabricator.wikimedia.org/T155359#5077009 [11:17:26] (03CR) 10jenkins-bot: Add the 'urlshortener-manage-url' right and enable it for stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499777 (https://phabricator.wikimedia.org/T133109) (owner: 10Ladsgroup) [11:18:08] RECOVERY - HHVM jobrunner on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [11:18:11] Just tried it [11:18:26] it's awesome, we should document how to do deployment because I completely forgot it [11:18:28] or put "91.198.174.192 wikiba.se" in /etc/hosts [11:19:44] RECOVERY - puppet last run on cloudvirt2003-dev is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [11:21:02] !log ladsgroup@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:499777|Add the urlshortener-manage-url right and enable it for stewards (T133109)]], Part I (duration: 00m 53s) [11:21:02] PROBLEM - 
Check systemd state on ms-be2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:16] T133109: Add basic abuse prevention to UrlShortener - https://phabricator.wikimedia.org/T133109 [11:22:22] Amir1: done https://wikitech.wikimedia.org/wiki/Microsites#How_to_deploy [11:22:26] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:499777|Add the urlshortener-manage-url right and enable it for stewards (T133109)]], Part I (duration: 00m 51s) [11:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:05] \o/ You're awesome [11:23:24] !log EU SWAT is done [11:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:28] we could create a new admin group for it to be able to run puppet themselves if needed [11:30:11] (03CR) 10Jcrespo: "I have not changed the original source hosts, probably the more suitable hosts were not available when this was first deployed." [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [11:31:02] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) wikiba.se can now be viewed in WMF production by editing the local `/etc/hosts` file with f.e. `91.198.174.192 wikiba.se` Open issues... [11:31:24] (03CR) 10Jcrespo: "Of course, before deployment we will need some grant changes, too." 
[puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [11:32:11] nah, it's fine [11:32:21] ok [11:32:56] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10hashar) [11:33:17] !log contint1001: cleaning Docker containers #T219850 [11:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:21] T219850: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 [11:34:49] (03PS1) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) [11:35:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] gerrit: admins: ops -> gerritadmin [puppet] - 10https://gerrit.wikimedia.org/r/498431 (owner: 10Hashar) [11:35:45] (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [11:37:16] (03PS2) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) [11:37:36] (03PS2) 10Dzahn: wikiba.se: add Apache rewrites for www to naked domain [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) [11:37:42] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [11:38:02] (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 
10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [11:38:54] (03PS3) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) [11:39:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Lemme know when it's ready to go (what's blocking it?) and I 'll merge" [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac) [11:41:40] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10hashar) 05duplicate→03Open p:05Triage→03Unbreak! a:03hashar That task is valid it is for... [11:42:05] !log restarting parsoid on wtp1025 to pick up openssl update [11:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:44] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:42:50] RECOVERY - keystone admin endpoint port 35357 on cloudcontrol2001-dev is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:42:56] RECOVERY - keystone public endoint port 5000 on cloudcontrol2001-dev is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 757 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [11:43:20] (03PS3) 10Mathew.onipe: icinga: add mediawiki cirrus update lag check [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) 
[11:43:22] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/15493/ looks good in the compiler. Clearly much more needs to be done." [puppet] - 10https://gerrit.wikimedia.org/r/500665 (owner: 10Giuseppe Lavagetto) [11:43:24] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:44:04] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:44:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::graphite::base: prevent caching of metrics [puppet] - 10https://gerrit.wikimedia.org/r/500700 (owner: 10Giuseppe Lavagetto) [11:44:52] (03PS3) 10Giuseppe Lavagetto: profile::graphite::base: prevent caching of metrics [puppet] - 10https://gerrit.wikimedia.org/r/500700 [11:49:22] RECOVERY - Check systemd state on ms-be2018 is OK: OK - running: The system is fully operational [11:49:42] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:49:52] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:50:56] 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) Ok I hit another road block. leatherman depends on debhelper 11. I manully updated debian/compat and debian/control to try and build with debhelper 10 . The first build l... 
[11:51:48] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:54:20] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [11:55:29] (03PS3) 10Dzahn: wikiba.se: add Apache rewrites for www to naked domain [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) [11:55:31] (03PS1) 10Dzahn: wikiba.se: add HSTS header with low max_age [puppet] - 10https://gerrit.wikimedia.org/r/500711 (https://phabricator.wikimedia.org/T99531) [11:55:44] jbond42, hey, is T219803 a part of T184564? [11:55:47] T219803: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 [11:55:55] looks like a puppet 5 thing [11:56:10] (03PS4) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) [11:56:32] Krenair: checking [11:56:50] (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [11:57:50] (03CR) 10Elukey: [C: 03+1] "LGTM, assuming that there is nothing that prevents monitoring hosts (like firewall rules etc..) 
to contact graphite-in.eqiad.wmnet (don't " [puppet] - 10https://gerrit.wikimedia.org/r/500665 (owner: 10Giuseppe Lavagetto) [11:58:39] Krenair: i think they are related, however T184564 is talking more about the server side, i am only concentrating on the client side for now. puppet 5 is already running on buster systems T219803 [11:58:43] T219803 [11:58:55] (03PS5) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) [11:58:58] ^^ that ticket is about backporting the packages to stretch and jessie [11:59:37] (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1200) [12:00:32] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 312 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring [12:01:15] (03PS6) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) [12:01:25] 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) [12:01:26] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:01:28] 10Operations, 10Puppet: Plan Puppet 5 upgrade - https://phabricator.wikimedia.org/T184564 
(10jbond) [12:01:29] arturo^ bstorm_ andrewbogott [12:01:34] looking [12:02:27] (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [12:02:36] (03CR) 10BBlack: wikiba.se: add Apache rewrites for www to naked domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [12:03:05] weird, I can r/w from NFS in toolforge [12:03:22] arturo: it looks like it's an error from codfw [12:03:22] The checker must be screwed [12:03:26] PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:03:55] Did the downtime on toolschecker run out arturo ? [12:03:59] <_joe_> bstorm_: look at http://checker.tools.wmflabs.org/nfs/home [12:04:14] <_joe_> there is an error in the response [12:04:16] bstorm_: it was downtimed until today?
[12:04:25] I know bd808 is reworking toolscheker [12:04:30] <_joe_> it's looking for a file that doesn't exists [12:04:32] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [12:04:32] Yes [12:04:34] but not sure in which status is right now [12:04:34] <_joe_> *exist [12:04:44] <_joe_> I'd go touch that file on NFS [12:05:01] <_joe_> and make it writable only by root [12:05:14] ACKNOWLEDGEMENT - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 312 bytes in 0.006 second response time Arturo Borrero Gonzalez looking https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring [12:05:14] ACKNOWLEDGEMENT - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.404 second response time Arturo Borrero Gonzalez looking https://wikitech.wikimedia.org/wiki/Help:Toolforge/Monitoring [12:05:47] _joe_ toolcheckers are undergoing rewrite tho. They needed trusty [12:05:50] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [12:06:01] arturo: could it be the fact its using trusty? 
[12:06:09] So they may not be reliable [12:07:23] !log contint1001: compressing some MediaWiki debugging logs under /srv/jenkins/builds # T219850 [12:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:30] T219850: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 [12:07:41] (03PS4) 10Dzahn: wikiba.se: add Apache rewrites for www to naked domain [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) [12:07:58] PROBLEM - puppet last run on wtp1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:09:01] arturo: I know the downtime was set for this week. I suspect it just ended. trusty grid is down intentionally [12:09:27] ok, downtiming again for... 1 month [12:09:52] (03PS8) 10Alaa Sarhan: Add wgWikibaseMusicalNotationLineWidthInches to labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498660 (https://phabricator.wikimedia.org/T218191) [12:09:55] lol...should be done by then [12:10:02] arturo: hehe now if it would only fix itself right :P [12:10:20] Zppix: :-P [12:10:21] (03PS4) 10Alaa Sarhan: Add wgWikibaseMusicalNotationLineWidthInches to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191) [12:11:10] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:11:17] * bstorm_ falls asleep [12:11:42] !log icinga downtime toolschecker for 1 month T219243 [12:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:46] T219243: Migrate tools-checker system to Stretch - https://phabricator.wikimedia.org/T219243 [12:12:27] (03CR) 10Dzahn: wikiba.se: add Apache rewrites for www to naked domain (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [12:13:12] arturo: now 1004 for k8s is failing... [12:13:39] Actually nevermind i misread [12:13:42] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:13:49] Zppix: ok [12:14:13] (03PS7) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) [12:14:20] (03CR) 10Ema: [C: 03+1] monitoring: use internal graphite url [puppet] - 10https://gerrit.wikimedia.org/r/500665 (owner: 10Giuseppe Lavagetto) [12:14:56] (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [12:15:52] (03PS8) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) [12:16:04] RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:16:35] (03CR) 10jerkins-bot: [V: 04-1] openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [12:17:18] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:17:54] (03PS9) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) [12:19:41] (03PS2) 
10Jcrespo: mariadb-backups: Setup dbprov2002 [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) [12:19:54] PROBLEM - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [12:20:36] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:20:43] (03CR) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [12:21:10] RECOVERY - cache_upload: Varnishkafka Webrequest Delivery Errors per second on icinga1001 is OK: OK: Less than 1.00% above the threshold [1.0] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&orgId=1&var-instance=webrequest&var-host=All [12:22:14] (03CR) 10Marostegui: [C: 03+1] "Let's deploy and adjust if necessary whatever we might face and we haven't thought of :)" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [12:22:34] (03PS10) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) [12:23:46] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:25:16] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:25:53] (03PS1) 10Dzahn: varnish/trafficserver: add regex to cover www.wikiba.se as 
well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) [12:26:37] (03CR) 10jerkins-bot: [V: 04-1] varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [12:28:23] (03PS1) 10Vgutierrez: redirects.dat: Remove wikisource.gr [puppet] - 10https://gerrit.wikimedia.org/r/500716 (https://phabricator.wikimedia.org/T213705) [12:28:28] (03CR) 10BBlack: varnish/trafficserver: add regex to cover www.wikiba.se as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [12:29:39] (03PS11) 10Arturo Borrero Gonzalez: openstack: serverpackages: stretch: factorize negative apt pinning [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) [12:29:52] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:30:08] (03PS2) 10Dzahn: varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) [12:30:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "finally, PCC ok https://puppet-compiler.wmflabs.org/compiler1002/15501/" [puppet] - 10https://gerrit.wikimedia.org/r/500706 (https://phabricator.wikimedia.org/T215407) (owner: 10Arturo Borrero Gonzalez) [12:31:18] (03CR) 10Vgutierrez: varnish/trafficserver: add regex to cover www.wikiba.se as well (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [12:31:28] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:33:56] (03PS2) 10Vgutierrez: redirects.dat: Remove wikisource.gr [puppet] - 
10https://gerrit.wikimedia.org/r/500716 (https://phabricator.wikimedia.org/T213705) [12:35:18] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:37:09] (03CR) 10Mathew.onipe: icinga: add mediawiki cirrus update lag check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe) [12:38:59] 10Operations, 10Domains, 10Traffic, 10serviceops: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Dzahn) [12:39:34] RECOVERY - puppet last run on wtp1029 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:40:53] (03PS3) 10Gehel: icinga: align elastic base and nrpe check titles [puppet] - 10https://gerrit.wikimedia.org/r/500686 (owner: 10Mathew.onipe) [12:41:11] 10Operations, 10Domains, 10Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (10Dzahn) [12:41:48] (03CR) 10Gehel: [C: 03+2] icinga: align elastic base and nrpe check titles [puppet] - 10https://gerrit.wikimedia.org/r/500686 (owner: 10Mathew.onipe) [12:42:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/500665 (owner: 10Giuseppe Lavagetto) [12:42:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] monitoring: use internal graphite url [puppet] - 10https://gerrit.wikimedia.org/r/500665 (owner: 10Giuseppe Lavagetto) [12:42:20] (03PS3) 10Giuseppe Lavagetto: monitoring: use internal graphite url [puppet] - 10https://gerrit.wikimedia.org/r/500665 [12:43:53] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: fix partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/500718 (https://phabricator.wikimedia.org/T219626) [12:44:24] (03PS2) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: fix 
partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/500718 (https://phabricator.wikimedia.org/T219626) [12:45:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: fix partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/500718 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [12:45:51] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:46:01] 10Operations, 10Domains, 10Traffic: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (10Dzahn) Sent a mail about it to Chuck in legal who handles domain registrations. [12:46:01] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1005.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:46:07] (03CR) 10Jcrespo: [C: 03+1] "I will add the appropriate grants to the affected hosts (dbstore2*, db1115* -statistics- and misc hosts) then deploy this." 
[puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [12:46:25] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:46:27] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:46:31] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:46:33] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:46:37] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - druid-public-broker_8082: Servers druid1006.eqiad.wmnet, druid1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:47:13] elukey: should we depool them ?^ [12:48:39] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [12:48:51] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes 
in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:49:05] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10hashar) 05Open→03Resolved The jobs running MediaWiki tests no gzip the hu... [12:49:19] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([druid1005.eqiad.wmnet, druid1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [12:49:23] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2003.codfw.wmnet: rename to cloudnet2003-dev.codfw.wmnet and reimage to stretch - https://phabricator.wikimedia.org/T219776 (10aborrero) 05Open→03Resolved p:05Triage→03Normal [12:49:33] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:50:09] <_joe_> mutante: no please [12:50:46] 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10MoritzMuehlenhoff) The later cmake version in combination with the debian/rules file tries to enable position independent ELF files, which doesn't work with libcurl-openssl from stan... 
[12:51:03] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:51:05] _joe_: ok [12:51:15] (03CR) 10Jcrespo: "There is a problem somewhere: https://puppet-compiler.wmflabs.org/compiler1002/15503/" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [12:52:18] (03CR) 10Dzahn: [C: 04-2] "not now, maybe later" [puppet] - 10https://gerrit.wikimedia.org/r/500711 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [12:52:35] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:52:37] (03PS20) 10Gehel: cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:52:45] (03CR) 10Jcrespo: "[ 2019-04-02T12:49:56 ] ERROR: Unable to find facts for host dbprov2001.codfw.wmnet, skipping" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [12:52:51] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:53:12] (03CR) 10Marostegui: [C: 03+1] "> [ 2019-04-02T12:49:56 ] ERROR: Unable to find facts for host" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [12:53:37] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:53:40] (03CR) 10Volans: [C: 03+1] "LGTM" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 
(owner: 10Gehel) [12:53:45] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:54:06] 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10MoritzMuehlenhoff) >>! In T219803#5077313, @MoritzMuehlenhoff wrote: > This makes the build phase work fine (but it's failing in test suite now, but unrelated). That seems to be a k... [12:54:17] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:54:51] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:54:53] (03CR) 10Gehel: [C: 03+2] cloudelastic: Add cloudelastic configs [puppet] - 10https://gerrit.wikimedia.org/r/487129 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [12:54:58] (03CR) 10Dzahn: "the problem is the host names are new and the compiler does not know them yet, hence the 404s. 
syncing facts should fix it (https://wikite" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [12:54:59] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:55:07] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:55:47] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:56:47] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:56:51] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:57:51] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:58:14] 10Operations, 10Domains, 10Traffic, 10serviceops: contact Wikivoyage e. V. 
and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Peachey88) [12:59:07] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [12:59:09] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:59:29] PROBLEM - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:59:31] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1300) [13:01:09] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:01:16] 10Operations, 10puppet-compiler: puppet compiler - puppet facts need refreshing - https://phabricator.wikimedia.org/T110546 (10jcrespo) Don't use the above procedure, I was pointed instead to https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs [13:01:25] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:01:29] RECOVERY - LVS HTTP IPv4 on druid-public-broker.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1423 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:02:45] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:02:57] (03CR) 10Marostegui: [C: 03+1] "> the problem is the host names are new and the compiler does not" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [13:03:07] (03PS1) 10Giuseppe Lavagetto: uwsgi: allow setting routing rules [puppet] - 10https://gerrit.wikimedia.org/r/500729 [13:03:09] (03PS1) 10Giuseppe Lavagetto: graphite: correctly set Cache-control: no-store [puppet] - 10https://gerrit.wikimedia.org/r/500730 [13:03:19] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:03:27] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:03:41] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:03:53] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:03:57] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:04:31] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:05:58] (03PS1) 10BBlack: Non-chaining CNAMEs experimental option [dns] - 10https://gerrit.wikimedia.org/r/500731 [13:06:16] (03CR) 10jerkins-bot: [V: 04-1] Non-chaining CNAMEs experimental 
option [dns] - 10https://gerrit.wikimedia.org/r/500731 (owner: 10BBlack) [13:08:16] (03PS14) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [13:09:41] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:12:08] 10Operations, 10Domains, 10Traffic, 10serviceops: contact Wikivoyage e. V. and figure out status of wikivoyage-old.org / fix or park broken domain - https://phabricator.wikimedia.org/T219867 (10Dzahn) Sent email to Roland Unger (http://wikivoyage-ev.org/wiki/Kontakt) [13:13:20] (03PS15) 10Gehel: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 [13:14:43] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [13:20:42] !log updating puppet compiler facts [13:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:50] !log volans@cumin1001 START - Cookbook sre.hosts.downtime [13:20:50] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [13:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:09] Amir1: who's the wikibase PM? Lea? [13:21:22] no Lydia [13:21:36] I think she's afk for meetings and lunch [13:22:00] ok! 
no rush, i was just going to email [13:23:03] !log volans@cumin1001 START - Cookbook sre.hosts.downtime [13:23:03] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:25] (03PS1) 10Gehel: [WIP] build with maven instead of bazel [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733 [13:24:37] !log reboot ms-be2026 to see if that fixes the controller - T219854 [13:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:43] T219854: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 [13:26:03] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [13:26:43] (03PS16) 10Volans: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [13:26:45] (03PS2) 10Gehel: [WIP] build with maven instead of bazel [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733 [13:29:22] (03PS7) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) [13:29:32] (03PS3) 10Jcrespo: mariadb-backups: Setup dbprov2002 [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) [13:29:41] RECOVERY - swift-container-updater on ms-be2026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [13:29:57] (03PS1) 10Gehel: Cleanup a few warnings. 
[software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500737 [13:30:17] RECOVERY - MD RAID on ms-be2026 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:30:23] RECOVERY - Disk space on ms-be2026 is OK: DISK OK [13:30:33] RECOVERY - Check systemd state on ms-be2026 is OK: OK - running: The system is fully operational [13:31:32] (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/15504/" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [13:33:58] (03CR) 10Volans: [C: 03+2] elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [13:34:09] (03CR) 10Gehel: [WIP] build with maven instead of bazel (033 comments) [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733 (owner: 10Gehel) [13:34:31] RECOVERY - puppet last run on ms-be2026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:39:17] (03Merged) 10jenkins-bot: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [13:40:28] (03CR) 10jenkins-bot: elasticsearch: Retrieve hostname and fqdn from node attributes [software/spicerack] - 10https://gerrit.wikimedia.org/r/492375 (owner: 10Gehel) [13:44:16] gehel: Error: Could not find any hostgroup matching 'cloudelastic_eqiad' (config file '/etc/icinga/objects/puppet_hosts.cfg', starting on line 3670) [13:44:27] icinga config is not happy [13:44:27] !log anomie@mwmaint1002 Fixing empty values for 'target_author_actor' in log_search on test wikis and mediawikiwiki for T215525 [13:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:33] T215525: log_search rows with ls_field='target_author_actor' and empty ls_value are created during actor migration - https://phabricator.wikimedia.org/T215525 
[13:45:00] how hard would it be to have jenkins test icinga config validity? :D [13:45:22] impossible given the exported resources :D [13:45:33] s/impossible/quite hard/ [13:47:11] 10Operations, 10ops-codfw, 10Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10Volans) After the reboot the host is back up and running, all seems good so far. Keeping open for a bit to see if it holds. [13:47:18] (03PS1) 10Jbond: jbond home: add user files [puppet] - 10https://gerrit.wikimedia.org/r/500739 [13:47:23] RECOVERY - Long running screen/tmux on labtestnet2003 is OK: OK: No SCREEN or tmux processes detected. [13:48:25] (03CR) 10jerkins-bot: [V: 04-1] jbond home: add user files [puppet] - 10https://gerrit.wikimedia.org/r/500739 (owner: 10Jbond) [13:48:37] (03PS1) 10Andrew Bogott: bootstrap_vz firstboot: run apt-get upgrade before anything else [puppet] - 10https://gerrit.wikimedia.org/r/500740 [13:49:07] RECOVERY - Check the NTP synchronisation status of timesyncd on ms-be2026 is OK: OK: synced at Tue 2019-04-02 13:49:05 UTC. 
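The hostgroup error quoted at 13:44:16 and the question at 13:45:00 point at one narrow class of failure: a host references a hostgroup that is never defined. The following is a minimal sketch of such a check, assuming a simplified Nagios/Icinga object syntax (`define <type> { directive value }` blocks, with `hostgroups` as a comma-separated list). The function names and the sample host names are hypothetical and not taken from the Wikimedia puppet repository. It only inspects a config file that has already been generated, so it does not get around the exported-resources limitation raised at 13:45:22.

```python
import re

# Matches one Nagios/Icinga-style object block: define <type> { ... }
# Assumption: directive values never contain a literal '}'.
DEFINE_RE = re.compile(r"define\s+(\w+)\s*\{(.*?)\}", re.DOTALL)


def parse_objects(text):
    """Return a list of (object_type, {directive: value}) pairs."""
    objects = []
    for obj_type, body in DEFINE_RE.findall(text):
        directives = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop trailing ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        objects.append((obj_type, directives))
    return objects


def missing_hostgroups(text):
    """Return sorted hostgroup names that hosts reference but nothing defines."""
    objects = parse_objects(text)
    defined = {d.get("hostgroup_name") for t, d in objects if t == "hostgroup"}
    referenced = set()
    for obj_type, directives in objects:
        if obj_type != "host":
            continue
        for name in directives.get("hostgroups", "").split(","):
            if name.strip():
                referenced.add(name.strip())
    return sorted(referenced - defined)


# Hypothetical sample config: one defined hostgroup, one undefined reference.
SAMPLE = """
define hostgroup {
    hostgroup_name  elasticsearch_eqiad
}
define host {
    host_name   example-host-a
    hostgroups  cloudelastic_eqiad
}
define host {
    host_name   example-host-b
    hostgroups  elasticsearch_eqiad
}
"""

if __name__ == "__main__":
    for name in missing_hostgroups(SAMPLE):
        print(f"hostgroup referenced but not defined: {name}")
```

Running the script against the sample reports the single undefined hostgroup. A real check would more likely rely on Icinga's own config verification, as the "Check correctness of the icinga configuration" alert in this log already does.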
[13:51:02] (03CR) 10Andrew Bogott: [C: 03+2] bootstrap_vz firstboot: run apt-get upgrade before anything else [puppet] - 10https://gerrit.wikimedia.org/r/500740 (owner: 10Andrew Bogott) [13:52:40] (03PS1) 10Volans: cloudelastic: add missing monitoring clusters [puppet] - 10https://gerrit.wikimedia.org/r/500742 (https://phabricator.wikimedia.org/T214921) [13:52:42] gehel: ^^^ [13:54:18] cdanis: but what we could do is to add a check that when a hiera value 'cluster:' is modified it checks that the matching definitions exists in the monitoring.yaml file [13:54:35] not sure how to gather which DC to add there tbh though and adding all of them seems redundant and useless in most cases [13:55:16] (03CR) 10Mathew.onipe: [C: 03+1] cloudelastic: add missing monitoring clusters [puppet] - 10https://gerrit.wikimedia.org/r/500742 (https://phabricator.wikimedia.org/T214921) (owner: 10Volans) [13:55:31] (03CR) 10Volans: [C: 03+2] cloudelastic: add missing monitoring clusters [puppet] - 10https://gerrit.wikimedia.org/r/500742 (https://phabricator.wikimedia.org/T214921) (owner: 10Volans) [13:56:31] volans: sorry, that was related to merging the cloudelastic patch [13:57:26] yeah I know :) [13:58:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] uwsgi::app: Handle ensure absent [puppet] - 10https://gerrit.wikimedia.org/r/498641 (owner: 10Alex Monk) [13:58:59] (03PS7) 10Alexandros Kosiaris: uwsgi::app: Handle ensure absent [puppet] - 10https://gerrit.wikimedia.org/r/498641 (owner: 10Alex Monk) [13:59:01] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] uwsgi::app: Handle ensure absent [puppet] - 10https://gerrit.wikimedia.org/r/498641 (owner: 10Alex Monk) [13:59:28] (03PS2) 10Zoranzoki21: Add three domains at wgCopyUploadDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500154 (https://phabricator.wikimedia.org/T216886) [13:59:36] (03PS4) 10Zoranzoki21: Remove namespace 104 from FlaggedRevs configuration for arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500153 
(https://phabricator.wikimedia.org/T217507) [13:59:59] icinga config back to happy (cc gehel, onimisionipe ) [14:00:57] Request from X via cp1082 cp1082, Varnish XID 108983720 Error: 429, Too Many Requests at Tue, 02 Apr 2019 14:00:25 GMT [14:01:07] when trying to open https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Coluber_plicatilis_-_1734-1765_-_Print_-_Iconographia_Zoologica_-_Special_Collections_University_of_Amsterdam_-_UBA01_IZ12000206.tif/lossy-page1-1280px-Coluber_plicatilis_-_1734-1765_-_Print_-_Iconographia_Zoologica_-_Special_Collections_University_of_Amsterdam_-_UBA01_IZ12000206.tif.jpg [14:01:50] or any jpg thumbnail for this tiff file [14:04:24] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:05:58] (03CR) 10Alex Monk: Allow ensure absent in monitoring classes without description/nrpe_command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk) [14:06:54] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [14:09:15] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Ensure absent in this resource isn't really needed as the equivalent is achieved otherwise." [puppet] - 10https://gerrit.wikimedia.org/r/498645 (owner: 10Alex Monk) [14:11:01] (03CR) 10Alex Monk: "good point. 
I'll make a new change to get rid of the existing use of this resource with ensure absent" [puppet] - 10https://gerrit.wikimedia.org/r/498645 (owner: 10Alex Monk) [14:12:56] (03PS1) 10Alex Monk: profile::puppetdb: Remove ensure absent ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/500747 [14:12:59] (03PS1) 10Andrew Bogott: labs_bootstrapvz firstboot: do dist-upgrade instead of just apt-get upgrade [puppet] - 10https://gerrit.wikimedia.org/r/500748 [14:13:26] (03Abandoned) 10Alex Monk: ferm::service: Allow ensure absent without proto/port [puppet] - 10https://gerrit.wikimedia.org/r/498645 (owner: 10Alex Monk) [14:14:51] (03CR) 10Muehlenhoff: [C: 03+1] labs_bootstrapvz firstboot: do dist-upgrade instead of just apt-get upgrade [puppet] - 10https://gerrit.wikimedia.org/r/500748 (owner: 10Andrew Bogott) [14:21:51] 10Operations: add wdoran@wikimedia.org to cpt-leads@wikimedia.org alias - https://phabricator.wikimedia.org/T219875 (10kchapman) [14:22:58] (03PS2) 10Andrew Bogott: labs_bootstrapvz firstboot: do dist-upgrade instead of just apt-get upgrade [puppet] - 10https://gerrit.wikimedia.org/r/500748 [14:23:44] (03CR) 10DCausse: [C: 03+1] "works well after applying next patch (I'd put the cleanup before this one tho)" [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733 (owner: 10Gehel) [14:24:02] (03CR) 10Ladsgroup: [C: 04-1] Increase musical notation datatype string length limit (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE) [14:24:08] (03CR) 10DCausse: [C: 03+1] "LGTM, I don't have +2 on this repo" [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500737 (owner: 10Gehel) [14:24:46] (03CR) 10Andrew Bogott: [C: 03+2] labs_bootstrapvz firstboot: do dist-upgrade instead of just apt-get upgrade [puppet] - 10https://gerrit.wikimedia.org/r/500748 (owner: 10Andrew Bogott) [14:24:59] (03PS2) 10Gehel: Cleanup a few warnings. 
[software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500737 [14:25:01] (03PS3) 10Gehel: [WIP] build with maven instead of bazel [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733 [14:25:21] (03CR) 10Gehel: "> Patch Set 2: Code-Review+1" [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/500733 (owner: 10Gehel) [14:25:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] Allow ensure absent in monitoring classes without description/nrpe_command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk) [14:26:55] 10Operations, 10Maps (Tilerator), 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10MSantos) [14:27:06] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10MSantos) [14:27:09] 10Operations, 10Maps: Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (10MSantos) [14:27:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] profile::puppetdb: Remove ensure absent ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/500747 (owner: 10Alex Monk) [14:27:17] (03PS2) 10Alexandros Kosiaris: profile::puppetdb: Remove ensure absent ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/500747 (owner: 10Alex Monk) [14:27:28] (03PS2) 10Jbond: jbond home: add user files [puppet] - 10https://gerrit.wikimedia.org/r/500739 [14:29:35] 10Operations, 10Maps: Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (10MSantos) @Mathew.onipe this is solved and will be fixed when the stretch migration finishes. It's a known issue with the populate_admin script. 
[14:29:56] 10Operations, 10Maps (Tilerator): Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (10MSantos) [14:30:10] 10Operations, 10Maps (Tilerator): Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (10MSantos) p:05Triage→03High [14:30:47] (03PS6) 10Andrew Bogott: labweb: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500622 [14:30:56] (03PS6) 10Alex Monk: Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - 10https://gerrit.wikimedia.org/r/498655 [14:31:18] (03PS2) 10Greta WMDE: Increase musical notation datatype string length limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) [14:32:05] (03CR) 10Andrew Bogott: [C: 03+2] labweb: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500622 (owner: 10Andrew Bogott) [14:32:51] (03CR) 10Greta WMDE: Increase musical notation datatype string length limit (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE) [14:35:35] 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): Server rename: labtestnet2003 to cloudnet2003-dev, update label and switch ports descriptions, etc - https://phabricator.wikimedia.org/T219861 (10Papaul) p:05Triage→03Normal [14:36:59] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (10Papaul) a:05Papaul→03RobH @robh there is 1 check box left for this. You can take a look and resolve the task once do... 
[14:37:37] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestvirt200[12].codfw.wmnet - https://phabricator.wikimedia.org/T218023 (10Papaul) 05Open→03Resolved This is complete. [14:38:48] (03CR) 10Volans: Allow ensure absent in monitoring classes without description/nrpe_command (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk) [14:40:58] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Alert "kubelet operational latencies" - https://phabricator.wikimedia.org/T219696 (10crusnov) Minor suggestion, perhaps we could increase the alert threshold if operation isn't actually affected at these levels. Quite often kubelet will sit on the alert threshol... [14:42:07] elukey: this ok to go ahead? just adding more of those notes URLs https://gerrit.wikimedia.org/r/c/operations/puppet/+/497273 [14:43:38] (03PS8) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) [14:43:45] (03PS9) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) [14:44:07] (03CR) 10Elukey: hadoop/hue/systemd: add Icinga notes URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497273 (owner: 10Dzahn) [14:44:23] mutante: all the analytics ones are good, the systemd one might not be.. since it is a generic class [14:44:49] (03CR) 10Marostegui: [C: 03+1] "Let's go then!" [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [14:45:22] elukey: oh yea, you are absolutely right. let me just remove that one from this patch and think about it in a later one. 
for other generic ones i used URLs containing $name or something [14:45:52] super [14:46:14] (03CR) 10Dzahn: hadoop/hue/systemd: add Icinga notes URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/497273 (owner: 10Dzahn) [14:47:06] (03PS3) 10Dzahn: hadoop/hue/systemd: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497273 [14:47:50] (03CR) 10Volans: [C: 04-1] "Minor details, almost there." (0311 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov) [14:48:23] "Minor" ==> 11 comments [14:48:32] :D [14:48:34] * elukey runs away [14:48:52] (03PS1) 10Muehlenhoff: Add qemu processes/Ganeti instances to filter_services list of debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/500756 (https://phabricator.wikimedia.org/T135991) [14:49:25] (03PS2) 10Dzahn: varnish: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497512 [14:49:55] (03CR) 10Vgutierrez: [C: 03+1] varnish: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497512 (owner: 10Dzahn) [14:50:08] (03PS6) 10Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640 [14:50:10] (03PS1) 10Andrew Bogott: cloud instance hiera: prepare for removal of 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/500758 [14:50:24] (03PS3) 10Dzahn: varnish: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497512 [14:50:29] (03CR) 10jerkins-bot: [V: 04-1] varnish: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497512 (owner: 10Dzahn) [14:52:50] (03CR) 10Dzahn: [C: 03+2] varnish: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497512 (owner: 10Dzahn) [14:53:58] (03PS2) 10Andrew Bogott: cloud instance hiera: prepare for removal of 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/500758 [14:54:00] (03PS7) 10Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 
10https://gerrit.wikimedia.org/r/500640 [14:54:04] !log add leatherman 1.4 to jessie-wikimedia/backports [14:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:59] (03CR) 10Herron: [C: 03+1] profile: do not mutate level for mjolnir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500099 (https://phabricator.wikimedia.org/T213899) (owner: 10Cwhite) [14:56:11] (03PS3) 10Andrew Bogott: cloud instance hiera: prepare for removal of 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/500758 [14:56:23] (03PS8) 10Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640 [14:57:00] (03PS1) 10Esanders: VE section editing: Enable mobile AB test on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500759 (https://phabricator.wikimedia.org/T219564) [14:57:07] (03CR) 10Andrew Bogott: [C: 03+2] cloud instance hiera: prepare for removal of 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/500758 (owner: 10Andrew Bogott) [14:57:12] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:58:34] (03PS10) 10Jcrespo: mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) [14:58:49] (03PS4) 10Dzahn: hadoop/hue/systemd: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497273 [15:00:21] (03CR) 10Elukey: [C: 03+1] hadoop/hue/systemd: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497273 (owner: 10Dzahn) [15:00:45] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Allow remote dumps from cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/499997 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [15:01:11] (03PS4) 10Jcrespo: mariadb-backups: Setup dbprov2002 [puppet] - 
10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) [15:01:13] (03CR) 10Andrew Bogott: [C: 03+2] labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640 (owner: 10Andrew Bogott) [15:01:16] (03CR) 10Andrew Bogott: "compiler diffs:" [puppet] - 10https://gerrit.wikimedia.org/r/500640 (owner: 10Andrew Bogott) [15:03:26] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Setup dbprov2002 [puppet] - 10https://gerrit.wikimedia.org/r/500683 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [15:03:27] (03PS5) 10Dzahn: hadoop/hue/systemd: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497273 [15:03:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500756 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:03:40] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:05:05] (03CR) 10Dzahn: [C: 03+2] hadoop/hue/systemd: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/497273 (owner: 10Dzahn) [15:05:27] (03CR) 10Volans: Netbox module for Spicerack (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [15:07:02] (03PS1) 10Zoranzoki21: Enable Draft namespace on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500761 (https://phabricator.wikimedia.org/T214428) [15:07:20] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:07:55] !log stopped/disabled ipmievd on cumin2001 [15:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:20] (03Abandoned) 10Arturo Borrero Gonzalez: Revert "network: Allow customisation of cumin list on a 
per-project basis" [puppet] - 10https://gerrit.wikimedia.org/r/498797 (owner: 10Arturo Borrero Gonzalez) [15:10:51] (03Abandoned) 10Alex Monk: network::constants: Move hiera calls to the parameters [puppet] - 10https://gerrit.wikimedia.org/r/498796 (owner: 10Alex Monk) [15:11:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Why not calling the hiera keys `cumin_master` or something?. LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/499355 (owner: 10Alex Monk) [15:11:30] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:13:00] (03PS11) 10Volans: Make the puppetdb backend process primitive types for queries [software/cumin] - 10https://gerrit.wikimedia.org/r/474087 (https://phabricator.wikimedia.org/T207037) (owner: 10CRusnov) [15:13:46] 10Operations, 10Traffic, 10Goal, 10HTTPS, 10Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548 (10Vgutierrez) a:03Vgutierrez [15:14:47] (03CR) 10Jforrester: [C: 03+2] VE section editing: Enable mobile AB test on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500759 (https://phabricator.wikimedia.org/T219564) (owner: 10Esanders) [15:16:23] (03Merged) 10jenkins-bot: VE section editing: Enable mobile AB test on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500759 (https://phabricator.wikimedia.org/T219564) (owner: 10Esanders) [15:16:55] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Thanks for working on this." 
[puppet] - 10https://gerrit.wikimedia.org/r/500640 (owner: 10Andrew Bogott) [15:21:32] (03CR) 10jenkins-bot: VE section editing: Enable mobile AB test on all remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500759 (https://phabricator.wikimedia.org/T219564) (owner: 10Esanders) [15:21:39] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: VE: Enable mobile section editing A/B test on all remaining wikis T219564 (duration: 00m 51s) [15:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:45] T219564: Deploy Section Editing to all wikis - https://phabricator.wikimedia.org/T219564 [15:23:59] 10Operations: add wdoran@wikimedia.org to cpt-leads@wikimedia.org alias - https://phabricator.wikimedia.org/T219875 (10Dzahn) 05Open→03Resolved a:03Dzahn done! wdoran@ has been added to cpt-leads@ [master f1c100f] (dzahn) add wdoran@ to cpt-leads@ mail alias (T219875) [15:28:50] 10Operations, 10Office-IT, 10Research, 10Wikimedia-Mailing-lists: Create research-alerts mailing list - https://phabricator.wikimedia.org/T219309 (10Dzahn) Do you want the full features of mailman with user subscription and archives and your own list admins, listinfo page etc? Or do you just want a simple... [15:30:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Cmjohnson) Network ports have been set up for the servers below and added to cloud-hosts1 vlan. I need cables cloudvirt1015 and 1024 and will... 
[15:31:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Cmjohnson) [15:32:29] (03PS3) 10Jbond: jbond home: add user files [puppet] - 10https://gerrit.wikimedia.org/r/500739 [15:32:52] !log add cpp-hocon 0.1.6 to jessie-wikimedia/backports [15:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:57] (03PS1) 10Elukey: admin: remove tbayer from analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/500765 (https://phabricator.wikimedia.org/T178802) [15:34:55] (03PS1) 10Ayounsi: Revert "Depooling eqsin because of eqsin-codfw link outage" [dns] - 10https://gerrit.wikimedia.org/r/500766 [15:36:05] 10Operations, 10Analytics-Kanban, 10SRE-Access-Requests, 10Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (10elukey) >>! In T178802#5076008, @Tbayer wrote: > @elukey Sure, that totally makes sense! The end of January estimate from T178802#4647106 turned out a b... [15:36:22] (03CR) 10Ayounsi: [C: 03+2] Revert "Depooling eqsin because of eqsin-codfw link outage" [dns] - 10https://gerrit.wikimedia.org/r/500766 (owner: 10Ayounsi) [15:36:42] (03PS2) 10Ayounsi: Revert "Depooling eqsin because of eqsin-codfw link outage" [dns] - 10https://gerrit.wikimedia.org/r/500766 [15:36:45] PROBLEM - BGP status on cr1-eqsin is CRITICAL: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 316. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:37:06] (03CR) 10Nuria: [C: 03+1] "Tbayer to file a ticket if he were to needs this permits again." 
[puppet] - 10https://gerrit.wikimedia.org/r/500765 (https://phabricator.wikimedia.org/T178802) (owner: 10Elukey) [15:37:56] (03CR) 10Elukey: [C: 03+2] admin: remove tbayer from analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/500765 (https://phabricator.wikimedia.org/T178802) (owner: 10Elukey) [15:39:15] XioNoX: --^ [15:39:29] seems like that's the check having issue, I just checked the router itself and it's fine [15:39:29] there's a weirdness in the check_bgp [15:39:55] !log repool eqsin - T219847 [15:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:59] T219847: Outage on the primary codfw-eqsin link - https://phabricator.wikimedia.org/T219847 [15:40:23] (03PS1) 10Arturo Borrero Gonzalez: cloudcontrol2001-dev: use raid1-lvm.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/500768 (https://phabricator.wikimedia.org/T219626) [15:41:35] great, that script is perl [15:41:51] yay perl [15:42:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudcontrol2001-dev: use raid1-lvm.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/500768 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [15:42:09] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:42:32] perl issues should go to RT :P jk [15:43:21] 10Operations, 10Traffic, 10netops: Outage on the primary codfw-eqsin link - https://phabricator.wikimedia.org/T219847 (10ayounsi) 05Open→03Resolved a:03ayounsi Telia stabilized the situation, " Services should be stable at the moment, hands are off and we are working with the vendor to provide an RFO i... [15:48:06] (03CR) 10Gehel: [C: 04-1] "Mostly minor style issues. Feel free to disagree!" 
(037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [15:48:20] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10RobH) [15:49:13] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:49:57] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 56.56 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:50:50] (03CR) 10Gehel: [C: 04-1] "minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500422 (https://phabricator.wikimedia.org/T219601) (owner: 10Mathew.onipe) [15:50:52] ACKNOWLEDGEMENT - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet daniel_zahn https://phabricator.wikimedia.org/T219696 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:51:40] the "Varnish traffic drop between 30min ago and now at ulsfo" is normal and due to repooling eqsin [15:51:57] I checked, it was so huge, then I saw the SAL :-) [15:52:58] !log icinga - re-enabling notifications for scandium. 
setup task is resolved yet systemd is alerting, should not have been turned off anymore (T201366) [15:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:01] T201366: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 [15:55:25] !log beginning rolling upgrade of codfw ELK cluster to 5.6.15 T219571 [15:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:43] seems like that perl error went away, I'll just pretend it never happened [15:55:46] !log scandium - systemctl start parsoid-vd was failed (T201366) [15:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:10] but my guess would be some snmp packets getting lost on the way [15:56:57] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [15:57:07] !log T219626 reimaging cloudcontrol2001-dev again [15:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:11] T219626: codfw1dev: bootstrap cloudcontrol servers in mitaka/stretch - https://phabricator.wikimedia.org/T219626 [16:00:05] godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:00:23] !log icinga - schedule (30d) downtime for kubernetes operational latencies alerts (T219696) on kubernetes1004 [16:00:24] (03PS1) 10Mathew.onipe: cloudelastic: allow elastic to bind to public ip [puppet] - 10https://gerrit.wikimedia.org/r/500773 (https://phabricator.wikimedia.org/T214921) [16:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:27] T219696: Alert "kubelet operational latencies" - https://phabricator.wikimedia.org/T219696 [16:00:41] (03CR) 10Cwhite: [C: 03+1] Allow ensure absent in monitoring classes without description/nrpe_command [puppet] - 10https://gerrit.wikimedia.org/r/498655 (owner: 10Alex Monk) [16:01:27] ACKNOWLEDGEMENT - EDAC syslog messages on wtp2013 is CRITICAL: 82.02 ge 4 daniel_zahn still https://phabricator.wikimedia.org/T194174 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops [16:01:27] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2013 is CRITICAL: 1155 ge 4 daniel_zahn still https://phabricator.wikimedia.org/T194174 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops [16:02:39] !log T194174 - bump. 
started alerting again 2 days ago [16:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:46] T194174: wtp2013 memory correctable errors - https://phabricator.wikimedia.org/T194174 [16:04:04] (03PS2) 10Gehel: cloudelastic: allow elastic to bind to public ip [puppet] - 10https://gerrit.wikimedia.org/r/500773 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [16:04:06] 10Operations, 10ops-eqiad: rack/setup 3 new single cpu spare pool systems - https://phabricator.wikimedia.org/T219890 (10RobH) p:05Triage→03Normal [16:04:15] (03CR) 10Gehel: [C: 03+2] "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/15507/cloudelastic1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/500773 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [16:07:25] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10hashar) With compression of mediawiki debug logs, disk usage went down to 287... [16:09:46] (03CR) 10EBernhardson: [C: 04-1] "This needs to be split into two patches for a clean sync. 
Likely WikibaseSearchSettings.php needs to be duplicated in the first patch, syn" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [16:12:05] !log - replacing accepted-prefix-limit with prefix-limit on cr2-eqiad - T211730 [16:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:09] T211730: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730 [16:13:48] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:17:26] (03PS6) 10EBernhardson: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) [16:17:28] (03PS1) 10EBernhardson: Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) [16:17:30] (03PS1) 10EBernhardson: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) [16:18:13] (03CR) 10jerkins-bot: [V: 04-1] Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [16:18:22] (03PS7) 10EBernhardson: Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) [16:18:24] (03PS2) 10EBernhardson: Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) [16:18:26] (03PS2) 10EBernhardson: Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) [16:18:30] (03CR) 10jerkins-bot: [V: 04-1] Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) (owner: 10EBernhardson) [16:21:30] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:24:14] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:24:44] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:24:46] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 91.25 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:28:50] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:33:06] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:33:42] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:35:26] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:36:55] !log - replacing accepted-prefix-limit with prefix-limit on esams - T211730 [16:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:01] T211730: Replace accepted-prefix-limit with prefix-limit - 
https://phabricator.wikimedia.org/T211730 [16:39:14] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:39:36] !log ppchelko@deploy1001 Started deploy [restbase/deploy@6026ad1]: Switch to swagger 3 T218218 [16:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:39] T218218: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 [16:39:58] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:40:37] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10Pchelolo) [16:43:14] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:43:44] 10Operations, 10Patch-For-Review: apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (10Andrew) [16:43:54] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:44:27] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@6026ad1]: Switch to swagger 3 T218218 (duration: 04m 52s) [16:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:45] akosiaris: should we downtime those kubernetes alerts or are they useful? 
[16:47:03] i downtimed one of them linking to the ticket ..but there are more hosts [16:47:22] !log - replacing accepted-prefix-limit with prefix-limit in eqsin - T211730 [16:47:24] https://phabricator.wikimedia.org/T219696 [16:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:26] T211730: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730 [16:47:33] T21969 [16:47:34] T21969: Special:WantedCategories shows categories that are "not wanted" or already exist - https://phabricator.wikimedia.org/T21969 [16:47:42] T219696 [16:47:43] T219696: Alert "kubelet operational latencies" - https://phabricator.wikimedia.org/T219696 [16:48:12] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:48:35] grump [16:49:00] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:50:46] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:52:58] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:56:04] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:57:14] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [16:59:16] (03PS2) 10BBlack: Turn on non-chaining CNAMEs experimental option [dns] - 10https://gerrit.wikimedia.org/r/500731 (https://phabricator.wikimedia.org/T208263) [16:59:38] (03CR) 10jerkins-bot: [V: 04-1] Turn on non-chaining CNAMEs experimental option [dns] - 10https://gerrit.wikimedia.org/r/500731 (https://phabricator.wikimedia.org/T208263) (owner: 10BBlack) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1700). [17:10:16] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:13:19] 10Operations, 10netops: Replace accepted-prefix-limit with prefix-limit - https://phabricator.wikimedia.org/T211730 (10ayounsi) 05Open→03Resolved All set, no down or bouncing peers, no mentions of `accepted-prefix-limit` in Rancid [17:16:06] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) [17:16:21] 10Operations, 10Analytics, 10hardware-requests: GPU upgrade for stat1005 - https://phabricator.wikimedia.org/T216226 (10elukey) 05Stalled→03Resolved [17:18:06] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) 05Stalled→03Open Vega GPU mounted on stat1005, it looks go... 
[17:19:22] 10Operations, 10monitoring: Stop using public (cached) endpoints for checks on graphite - https://phabricator.wikimedia.org/T219902 (10Joe) [17:21:17] 10Operations, 10Discovery-Search, 10Operations-Software-Development, 10Wikidata, and 2 others: Create a cookbook to copy data between WDQS servers - https://phabricator.wikimedia.org/T213401 (10Mathew.onipe) [17:22:39] (03PS5) 10Alex Monk: Move cumin_masters out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/499355 [17:27:36] !log ppchelko@deploy1001 Started deploy [restbase/deploy@3dcf328] (dev-cluster): Upgrade swagger to v3, attempt 2, T218218 [17:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:40] T218218: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 [17:28:00] 10Operations, 10monitoring: Stop using public (cached) endpoints for checks on graphite - https://phabricator.wikimedia.org/T219902 (10Joe) Fun finding: if we eliminate either the `until=Xmin` or the `from=Xmin` we have in the request url for `check_graphite` we get back `Cache-Control: max-age=120`. If we d... 
[17:29:58] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:30:38] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@3dcf328] (dev-cluster): Upgrade swagger to v3, attempt 2, T218218 (duration: 03m 02s) [17:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:49] !log ppchelko@deploy1001 Started deploy [restbase/deploy@3dcf328]: Upgrade swagger to v3, attempt 2, T218218 [17:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:06] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.128, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:35:30] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [17:35:30] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:37:02] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:37:52] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:37:59] XioNoX: FYI mr1-eqsin ^^^ [17:38:42] volans: seems like equinix either outage or planned maintenance, no big deal for now, will keep an eye [17:40:00] volans: yup, planned maintenance - REMINDER - Scheduled Equinix Connect Software Upgrade-SG Metro Area Network Maintenance (SERVICE IMPACTING)-03-APR-2019 [5-185702129870] [17:40:09] it's april 3rd Singapore time [17:40:26] ack [17:40:33] volans: so I guess best thing to do now it so downtime it for the duration of their window [17:40:43] so it can alert if it's still down afterwards [17:42:05] given it's already down it will not re-alert after the downtime 
[17:42:06] expires [17:42:12] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 38, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:42:20] doh [17:42:32] 10Operations, 10monitoring: Stop using public (cached) endpoints for checks on graphite - https://phabricator.wikimedia.org/T219902 (10CDanis) For Prometheus, there is just a LVS service IP that goes to local Apache, which on a quick glance does not seem to have any caching modules enabled. Looking at a curl,... [17:43:02] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:43:56] (03PS1) 10Ayounsi: LLDP fact - return correct port information [puppet] - 10https://gerrit.wikimedia.org/r/500795 [17:44:52] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:45:34] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:46:26] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 224.06 ms [17:46:26] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 232.24 ms [17:50:04] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:50:44] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:51:36] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@3dcf328]: Upgrade swagger to v3, attempt 2, T218218 (duration: 20m 47s) [17:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:40] T218218: Make RESTBase spec standard compliant and 
switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 [17:54:02] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:54:40] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:55:56] (03Abandoned) 10Andrew Bogott: Rename labvirt1008 to cloudvirt1008 [puppet] - 10https://gerrit.wikimedia.org/r/499316 (owner: 10Andrew Bogott) [17:56:04] !log ppchelko@deploy1001 Started deploy [restbase/deploy@2cb53a7] (dev-cluster): Kafka logging pipeline, dev cluster only T211125 [17:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:08] T211125: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 [17:56:31] (03PS1) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 [17:57:39] (03CR) 10jerkins-bot: [V: 04-1] openstack: clientpackages: fix missing deb repo installation [puppet] - 10https://gerrit.wikimedia.org/r/500797 (owner: 10Arturo Borrero Gonzalez) [17:59:10] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:59:29] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@2cb53a7] (dev-cluster): Kafka logging pipeline, dev cluster only T211125 (duration: 03m 25s) [17:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1800) [18:00:07] (03PS1) 10Andrew Bogott: cloudvirts: update six servers to use 10Gb nics [puppet] - 10https://gerrit.wikimedia.org/r/500799 (https://phabricator.wikimedia.org/T216195) [18:00:12] PROBLEM - Check systemd state on scandium is 
CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:02:08] (03PS2) 10Andrew Bogott: cloudvirts: update six servers to use 10Gb nics [puppet] - 10https://gerrit.wikimedia.org/r/500799 (https://phabricator.wikimedia.org/T216195) [18:02:28] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:02:47] (03CR) 10Bstorm: [C: 03+1] "It seems like a more sensible and explicit approach, once the linter is happy. As long as a sane default ends up somewhere for cloud VPS" [puppet] - 10https://gerrit.wikimedia.org/r/500797 (owner: 10Arturo Borrero Gonzalez) [18:03:06] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:04:08] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10Pchelolo) The new UI has been deployed. Next step here - explore the new features in openAPI 3.0, see what we can start using,... 
[18:04:12] (03CR) 10Andrew Bogott: openstack: clientpackages: fix missing deb repo installation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/500797 (owner: 10Arturo Borrero Gonzalez) [18:04:31] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirts: update six servers to use 10Gb nics [puppet] - 10https://gerrit.wikimedia.org/r/500799 (https://phabricator.wikimedia.org/T216195) (owner: 10Andrew Bogott) [18:05:12] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "PCC fails https://puppet-compiler.wmflabs.org/compiler1002/15508/" [puppet] - 10https://gerrit.wikimedia.org/r/500797 (owner: 10Arturo Borrero Gonzalez) [18:06:23] !log ppchelko@deploy1001 Started deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, canary on restbase2010 T211125 [18:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:26] T211125: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 [18:08:45] (03PS1) 10Andrew Bogott: Rename labvirt1008 to cloudvirt1008 [dns] - 10https://gerrit.wikimedia.org/r/500800 (https://phabricator.wikimedia.org/T216661) [18:08:56] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, canary on restbase2010 T211125 (duration: 02m 33s) [18:08:58] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:08:59] (03PS2) 10Andrew Bogott: Rename labvirt1008 to cloudvirt1008 [dns] - 10https://gerrit.wikimedia.org/r/500800 (https://phabricator.wikimedia.org/T216661) [18:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:40] (03CR) 10Andrew Bogott: [C: 03+2] Rename labvirt1008 to cloudvirt1008 [dns] - 10https://gerrit.wikimedia.org/r/500800 (https://phabricator.wikimedia.org/T216661) (owner: 10Andrew Bogott) [18:10:22] (03PS2) 10Arturo Borrero Gonzalez: openstack: clientpackages: fix missing deb repo installation [puppet] - 
10https://gerrit.wikimedia.org/r/500797 [18:10:28] (03PS1) 10Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) [18:11:44] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:18:04] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:20:09] !log ppchelko@deploy1001 Started deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, full deploy T211125 [18:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:12] !log ppchelko@deploy1001 deploy aborted: Kafka logging pipeline, full deploy T211125 (duration: 00m 03s) [18:20:13] T211125: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 [18:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:22] !log ppchelko@deploy1001 Started deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, full deploy T211125 [18:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:58] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:22:01] !log cutting mediawiki branch 1.33.0-wmf.24 [18:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:28] !log cutting mediawiki branch 1.33.0-wmf.24 (T206678) [18:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:31] T206678: 1.33.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T206678 [18:23:45] (03CR) 10Bstorm: "Looks legit, though I think the compiler one should go and one other change." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/500095 (https://phabricator.wikimedia.org/T219243) (owner: 10BryanDavis) [18:25:01] 10Operations, 10monitoring, 10Patch-For-Review, 10Wikimedia-Incident: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries - https://phabricator.wikimedia.org/T217715 (10CDanis) Filippo, did you decide r494685 wasn't necessary? [18:26:02] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [18:33:40] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:35:23] 10Operations, 10Puppet, 10Packaging: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) thanks to moritz all dependencies have been built but now getting the following error while building facter ` root@boron:/tmp/buildd/facter-3.11.0# /usr/bin/c++ -DBOOST_ALL_... [18:38:54] PROBLEM - Nginx local proxy to apache on mw1283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:38:58] PROBLEM - Apache HTTP on mw1283 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:40:14] RECOVERY - Nginx local proxy to apache on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:40:16] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.030 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:40:50] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:41:10] !log 
ppchelko@deploy1001 Finished deploy [restbase/deploy@2cb53a7]: Kafka logging pipeline, full deploy T211125 (duration: 20m 49s) [18:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:14] T211125: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 [18:44:50] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:45:28] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:46:44] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:50:05] (03PS2) 10Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) [18:51:54] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10EBernhardson) hacks abound, but basically: * Added `deb [arch=amd64]... [18:51:58] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [18:55:54] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:00:04] marxarelli: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T1900). 
[19:03:59] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10greg) This is a great example of a almost-worst case scenario, sadly. Things tha... [19:05:00] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:05:20] ^ can confirm that ssh is really slow on that host, however i don't see any particular reason for it [19:09:24] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Later), and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10Pchelolo) I have deployed a new pipeline for RESTBase in production and it all looks great. Next step -... [19:11:21] (03PS1) 10Dduvall: Group0 to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500812 [19:12:57] !log dduvall@deploy1001 Started scap: testwiki to php-1.33.0-wmf.24 and rebuild l10n cache [19:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:30] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:17:26] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Kanban (Done with CPT), 10Services (done): Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10Pchelolo) a:05holger.knust→03Pchelolo [19:19:33] 10Operations, 10Citoid, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move citoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219919 (10Pchelolo) [19:19:54] 10Operations, 10Citoid, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move citoid logging to new logging pipeline - 
https://phabricator.wikimedia.org/T219919 (10Pchelolo) a:05Pchelolo→03Mvolz [19:20:44] marxarelli: Can I push out a patch after you're done to fix a Vector regression? [19:20:57] Wikipedia article "0" no longer has a
title :) [19:21:13] 10Operations, 10CX-cxserver, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move cxserver logging to new logging pipeline - https://phabricator.wikimedia.org/T219921 (10Pchelolo) [19:22:04] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:22:40] 10Operations, 10CX-cxserver, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move cxserver logging to new logging pipeline - https://phabricator.wikimedia.org/T219921 (10Pchelolo) [19:23:40] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move eventstreams logging to new logging pipeline - https://phabricator.wikimedia.org/T219922 (10Pchelolo) [19:25:02] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (10Pchelolo) [19:26:00] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:26:35] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move mobile apps logging to new logging pipeline - https://phabricator.wikimedia.org/T219924 (10Pchelolo) [19:27:20] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:27:53] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move proton logging to new logging pipeline - https://phabricator.wikimedia.org/T219925 (10Pchelolo) [19:28:59] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Move recommendation-api logging to new logging pipeline - 
https://phabricator.wikimedia.org/T219926 (10Pchelolo) [19:29:40] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:30:15] 10Operations, 10Parsoid, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move parsoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219927 (10Pchelolo) [19:31:18] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:32:01] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (10Pchelolo) [19:32:19] Krinkle: gah. of course. i'm running a bit late on the train, but the full sync is happening now [19:32:40] marxarelli: no worries, you're perfectly within the window. :) [19:33:09] i'll ping ya when i'm done! [19:33:11] marxarelli: is it alright if I start the gate tests? [19:33:42] yeah, go for it [19:33:52] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:39:50] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:42:00] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:45:06] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:51:40] (03CR) 10Hashar: "iirc that followed a discussion with Alexandros, Giuseppe and Faidon on IRC. I am not sure whom approval we need?" 
[puppet] - 10https://gerrit.wikimedia.org/r/498431 (owner: 10Hashar) [19:52:46] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 53.34, 25.00, 16.15 [19:53:02] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 56.82, 27.56, 18.15 [19:53:29] ^ current scap-cdb-rebuild most likely [19:53:34] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:55:06] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:55:22] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 19.22, 23.65, 17.19 [19:55:36] PROBLEM - Disk space on mwdebug2001 is CRITICAL: DISK CRITICAL - free space: / 1464 MB (3% inode=67%) [19:55:36] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 19.72, 25.18, 18.86 [19:56:56] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:57:17] !log dduvall@deploy1001 Finished scap: testwiki to php-1.33.0-wmf.24 and rebuild l10n cache (duration: 44m 20s) [19:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:06] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:59:51] (03CR) 10Dduvall: [C: 03+2] Group0 to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500812 (owner: 10Dduvall) [20:00:00] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:01:02] (03PS1) 10Andrew Bogott: cloudvirt1008/1009: updated hiera settings for stretch and 10Gb 
[puppet] - 10https://gerrit.wikimedia.org/r/500820 [20:01:04] (03Merged) 10jenkins-bot: Group0 to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500812 (owner: 10Dduvall) [20:01:59] (03CR) 10jenkins-bot: Group0 to 1.33.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500812 (owner: 10Dduvall) [20:02:04] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:02:25] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1008/1009: updated hiera settings for stretch and 10Gb [puppet] - 10https://gerrit.wikimedia.org/r/500820 (owner: 10Andrew Bogott) [20:03:12] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Nuria) ping @Miriam @Gilles so they know the status of this. [20:03:18] PROBLEM - Check whether ferm is active by checking the default input chain on cloudcontrol2001-dev is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [20:03:36] PROBLEM - Check systemd state on cloudcontrol2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:03:44] PROBLEM - keystone public endoint port 5000 on cloudcontrol2001-dev is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 804 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:03:47] (03PS9) 10Andrew Bogott: labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640 [20:04:06] PROBLEM - keystone admin endpoint port 35357 on cloudcontrol2001-dev is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 805 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:04:30] PROBLEM - puppet last run on cloudcontrol2001-dev is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[keystone] [20:07:40] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: Group0 to 1.33.0-wmf.24 [20:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:56] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:08:17] (03CR) 10Andrew Bogott: [C: 03+2] labpuppetmaster: move from 'main' region to 'eqiad1' region [puppet] - 10https://gerrit.wikimedia.org/r/500640 (owner: 10Andrew Bogott) [20:11:50] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:12:16] (03PS1) 10Andrew Bogott: cloud cumin: move extra cumin modules to eqiad1 profile [puppet] - 10https://gerrit.wikimedia.org/r/500822 [20:12:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10RobH) Please note I've been going through and updating the firmware of the ilom and bios for the following systems: [x] - cloudvirt1008 [x] - 
cloudvirt1009 [] -... [20:13:53] (03CR) 10Andrew Bogott: [C: 03+2] cloud cumin: move extra cumin modules to eqiad1 profile [puppet] - 10https://gerrit.wikimedia.org/r/500822 (owner: 10Andrew Bogott) [20:15:10] (03PS1) 10Alex Monk: openstack::monitor::spreadcheck: Use a list of projects [puppet] - 10https://gerrit.wikimedia.org/r/500823 [20:15:12] (03PS1) 10Alex Monk: openstack::monitor::spreadcheck: rm old renaming absent file resources [puppet] - 10https://gerrit.wikimedia.org/r/500824 [20:16:31] !log 1.33.0-wmf.24 successfully deployed to group0. errors rates look normal (T206678) [20:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:35] T206678: 1.33.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T206678 [20:16:47] Krinkle: all done :) [20:17:09] marxarelli: thx [20:17:14] * Krinkle stages on mwdebug1002 [20:19:06] (03PS1) 10Alex Monk: openstack::monitor::spreadcheck: add cloudinfra config [puppet] - 10https://gerrit.wikimedia.org/r/500825 [20:19:59] (03PS6) 10CRusnov: Add basic Ganeti RAPI module and tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 [20:20:09] (03CR) 10CRusnov: "minor updates, fixing issues raised. thanks!" 
(039 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov) [20:20:42] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.23/skins/Vector/includes/: I6e04b512d / T219864 (duration: 01m 00s) [20:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:45] T219864: Articles about zero (0) not displaying title in Vector skin - https://phabricator.wikimedia.org/T219864 [20:21:06] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:22:36] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [20:23:32] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.24/skins/Vector/includes/: I6e04b512d / T219864 (duration: 00m 59s) [20:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:23] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10jcrespo) For example https://www.mediawiki.org/wiki/How_to_report_a_bug is a very... 
[20:24:38] * Krinkle done staging on mwdebug1002 [20:24:50] (03PS3) 10Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) [20:24:58] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:25:42] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:26:23] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [20:29:26] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:30:16] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 75489 bytes in 0.171 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:30:34] (03PS4) 10Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) [20:31:26] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:31:32] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:32:12] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:32:27] (03CR) 10Bstorm: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [20:32:46] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:34:12] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:34:22] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:34:40] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational [20:34:42] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:37:33] (03CR) 10Andrew Bogott: [C: 03+1] "I can merge this when I have time to watch it apply :)" [puppet] - 10https://gerrit.wikimedia.org/r/500823 (owner: 10Alex Monk) [20:37:50] (03CR) 10Andrew Bogott: [C: 03+1] openstack::monitor::spreadcheck: rm old renaming absent file resources [puppet] - 10https://gerrit.wikimedia.org/r/500824 (owner: 10Alex Monk) [20:38:05] (03CR) 10Andrew Bogott: [C: 03+1] openstack::monitor::spreadcheck: add cloudinfra config [puppet] - 10https://gerrit.wikimedia.org/r/500825 (owner: 10Alex Monk) [20:38:21] (03PS1) 10Kosta Harlan: (wip) Enable ORES RCFilters for eswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500831 (https://phabricator.wikimedia.org/T219160) [20:40:04] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: 
instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:42:00] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[20:42:10] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[20:43:58] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:45:58] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:47:50] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:49:00] PROBLEM - Disk space on mwdebug2001 is CRITICAL: DISK CRITICAL - free space: / 1463 MB (3% inode=67%)
[20:50:16] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:52:47] (PS5) Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527)
[20:54:41] !log restarting pdns and pdns-recursor on labservices1001 and 1002 in hopes of getting those machines to act a bit less sluggish
[20:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:14] (PS1) CRusnov: profile kubernetes node: Adjust latency alert thresholds [puppet] - https://gerrit.wikimedia.org/r/500839
[20:57:26] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[20:58:28] PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/DNS
[21:00:06] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:00:46] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:01:56] (PS1) Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - https://gerrit.wikimedia.org/r/500844
[21:01:56] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:02:14] PROBLEM - Auth DNS on labservices1002 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:02:58] PROBLEM - Check for gridmaster host resolution UDP on labservices1002 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:02:58] (CR) jerkins-bot: [V: -1] labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - https://gerrit.wikimedia.org/r/500844 (owner: Alex Monk)
[21:03:40] PROBLEM - Check for gridmaster host resolution TCP on labservices1002 is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:04:10] PROBLEM - SSH on labservices1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:04:40] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:04:50] RECOVERY - Check for gridmaster host resolution TCP on labservices1002 is OK: DNS OK - 0.778 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:06:36] (PS6) Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527)
[21:06:47] (PS2) Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - https://gerrit.wikimedia.org/r/500844
[21:07:54] (CR) jerkins-bot: [V: -1] labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - https://gerrit.wikimedia.org/r/500844 (owner: Alex Monk)
[21:10:00] RECOVERY - Recursive DNS on 208.80.154.20 is OK: DNS OK: 0.061 seconds response time. www.wikipedia.org returns 208.80.154.224 https://wikitech.wikimedia.org/wiki/DNS
[21:10:21] (CR) Catrope: [C: +1] (wip) Enable ORES RCFilters for eswikiquote (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/500831 (https://phabricator.wikimedia.org/T219160) (owner: Kosta Harlan)
[21:11:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[21:14:06] Operations, ORES, Scoring-platform-team: [Epic] Deploy ORES in kubernetes cluster - https://phabricator.wikimedia.org/T182331 (Halfak)
[21:14:22] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:14:32] RECOVERY - Check for gridmaster host resolution UDP on labservices1002 is OK: DNS OK - 0.019 seconds response time (tools-sgegrid-master.tools.eqiad.wmflabs. 60 IN A 172.16.4.197) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:14:34] RECOVERY - SSH on labservices1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:15:08] RECOVERY - Auth DNS on labservices1002 is OK: DNS OK: 0.014 seconds response time. labs-ns1.wikimedia.org returns https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[21:15:16] Operations, CX-cxserver, Wikimedia-Logstash, service-runner, and 3 others: Move cxserver logging to new logging pipeline - https://phabricator.wikimedia.org/T219921 (KartikMistry) @Pchelolo Added patch. Feel free to fix :)
[21:16:11] !log rebooting labservices1002
[21:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:41] (PS3) Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - https://gerrit.wikimedia.org/r/500844
[21:17:48] (CR) jerkins-bot: [V: -1] labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - https://gerrit.wikimedia.org/r/500844 (owner: Alex Monk)
[21:19:58] (PS4) Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - https://gerrit.wikimedia.org/r/500844
[21:22:15] Operations, Core Platform Team, MediaWiki-General-or-Unknown, PHP 7.2 support: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Jdforrester-WMF) >>! In T219279#5068956, @Joe wrote: > @Anomie so you're suggesting we...
[21:22:45] Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (MarcoAurelio)
[21:23:17] Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (MarcoAurelio)
[21:23:44] Operations, Community-Tech, MediaWiki-extensions-PageAssessments, Performance, User-Banyek: Issues with purgeUnusedProjects.php cron job on mwmaint1002 (Fri Oct 26) - https://phabricator.wikimedia.org/T208231 (MarcoAurelio) This has evolved into fatals: T219935.
[21:24:28] Operations, ORES, Scap, Scoring-platform-team, and 2 others: [Epic] ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619 (Halfak)
[21:25:47] Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (Niharika) @MarcoAurelio I don't think this...
[21:26:40] (PS5) Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - https://gerrit.wikimedia.org/r/500844
[21:27:18] Operations, Community-Tech, MediaWiki-extensions-PageAssessments, Performance, User-Banyek: Issues with purgeUnusedProjects.php cron job on mwmaint1002 (Fri Oct 26) - https://phabricator.wikimedia.org/T208231 (aezell) This is interesting. T219935 seems to indicate that the query is now poorl...
[21:27:25] !log rebooting labservices1001
[21:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:12] Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (MarcoAurelio) https://phabricator.wikimedi...
[21:28:30] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[21:29:24] Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (MarcoAurelio) >>! In T219935#5079562, @Nih...
[21:29:54] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:30:52] RECOVERY - SSH on labservices1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:33:46] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:34:04] Operations, ORES, Scoring-platform-team: Investigate memory usage of ORES in kubernetes - https://phabricator.wikimedia.org/T210264 (Halfak)
[21:36:22] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[21:36:54] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[21:37:44] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:38:35] Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (aezell) >>! In T219935#5079587, @MarcoAure...
[21:39:37] (PS7) Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527)
[21:39:51] Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (Niharika) @aezell Possibly because beta do...
[21:42:32] Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, Wikimedia-production-error: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (aezell) I should have a patch shortly.
[21:42:54] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:46:12] (PS8) Bstorm: cloudstore: start refactor for role switch up around the labstores [puppet] - https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527)
[21:48:12] (CR) Bstorm: "Ok, now I think this is about ready. The thing to watch out for is the systemd stuff around maintain-dbusers. It looks like it's a funct" [puppet] - https://gerrit.wikimedia.org/r/500801 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[21:49:26] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[21:56:22] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational
[22:12:32] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:26:29] Puppet, Community-Tech, MediaWiki-extensions-PageAssessments, Beta-Cluster-reproducible, and 2 others: extensions/PageAssessments/maintenance/purgeUnusedProjects.php is causing fatals on Beta - https://phabricator.wikimedia.org/T219935 (MarcoAurelio) Open→Resolved a:aezell ` maurelio@...
[22:28:06] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:32:12] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:32:19] (PS1) Bstorm: clouddb: add DNS alias for wikilabels.db.svc.eqiad.wmflabs [puppet] - https://gerrit.wikimedia.org/r/500855 (https://phabricator.wikimedia.org/T219563)
[22:33:18] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:35:26] Operations, ops-eqiad, DC-Ops, Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (RobH) Ok, so attempting to load the following on cloudvirt1012 didn't work, when it worked just fine for cloudvirt100[89]. All are the same DL360 gen8 systems. S...
[22:41:06] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:41:28] (CR) Bstorm: [C: +2] clouddb: add DNS alias for wikilabels.db.svc.eqiad.wmflabs [puppet] - https://gerrit.wikimedia.org/r/500855 (https://phabricator.wikimedia.org/T219563) (owner: Bstorm)
[22:51:02] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:57:14] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:57:30] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190402T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:00:06] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:07:50] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:13:20] PROBLEM - SSH on labcontrol1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:15:48] RECOVERY - SSH on labcontrol1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.11 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:17:33] (PS4) Jforrester: Add 'depicts' statements to search index on testcommons [mediawiki-config] - https://gerrit.wikimedia.org/r/500080 (owner: Cparle)
[23:17:38] (CR) Jforrester: [C: +2] Add 'depicts' statements to search index on testcommons [mediawiki-config] - https://gerrit.wikimedia.org/r/500080 (owner: Cparle)
[23:19:18] (Merged) jenkins-bot: Add 'depicts' statements to search index on testcommons [mediawiki-config] - https://gerrit.wikimedia.org/r/500080 (owner: Cparle)
[23:22:08] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:22:25] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SDC: Add 'depicts' statements to search index on testcommons (duration: 00m 59s)
[23:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:47] (PS3) Jforrester: Enable extension SandboxLink for rowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/500672 (https://phabricator.wikimedia.org/T219855) (owner: Strainu)
[23:23:50] (CR) Jforrester: [C: +2] Enable extension SandboxLink for rowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/500672 (https://phabricator.wikimedia.org/T219855) (owner: Strainu)
[23:24:56] (Merged) jenkins-bot: Enable extension SandboxLink for rowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/500672 (https://phabricator.wikimedia.org/T219855) (owner: Strainu)
[23:25:41] (CR) jenkins-bot: Add 'depicts' statements to search index on testcommons [mediawiki-config] - https://gerrit.wikimedia.org/r/500080 (owner: Cparle)
[23:25:43] (CR) jenkins-bot: Enable extension SandboxLink for rowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/500672 (https://phabricator.wikimedia.org/T219855) (owner: Strainu)
[23:27:01] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT Enable SandboxLink for rowiki T219855 (duration: 00m 56s)
[23:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:04] T219855: Activate the extension "SandboxLink" for rowiki - https://phabricator.wikimedia.org/T219855
[23:27:20] (PS4) Jforrester: Add new WMCS IP range to $wgRateLimitsExcludedIps [mediawiki-config] - https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: Hashar)
[23:27:34] PROBLEM - SSH on labservices1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:28:04] (CR) Jforrester: [C: +2] Add new WMCS IP range to $wgRateLimitsExcludedIps [mediawiki-config] - https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: Hashar)
[23:29:03] (Merged) jenkins-bot: Add new WMCS IP range to $wgRateLimitsExcludedIps [mediawiki-config] - https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: Hashar)
[23:30:10] (CR) Jforrester: [C: -2] "Blocked on SRE." [mediawiki-config] - https://gerrit.wikimedia.org/r/482099 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[23:30:15] (CR) Jforrester: [C: -2] "Blocked on SRE." [mediawiki-config] - https://gerrit.wikimedia.org/r/482100 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[23:30:25] (CR) Jforrester: [C: -2] "Blocked on below." [mediawiki-config] - https://gerrit.wikimedia.org/r/482102 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[23:30:26] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT Add new WMCS IP range to wgRateLimitsExcludedIps T167432 (duration: 00m 57s)
[23:30:29] (CR) Jforrester: [C: -2] "Blocked on below." [mediawiki-config] - https://gerrit.wikimedia.org/r/482103 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[23:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:34] T167432: Run Wikibase daily browser tests on Jenkins - https://phabricator.wikimedia.org/T167432
[23:30:36] (CR) Jforrester: [C: -2] "Blocked on below." [mediawiki-config] - https://gerrit.wikimedia.org/r/482104 (https://phabricator.wikimedia.org/T212865) (owner: Jforrester)
[23:33:15] (PS6) Jforrester: enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:33:59] (CR) jerkins-bot: [V: -1] enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:36:44] (CR) jenkins-bot: Add new WMCS IP range to $wgRateLimitsExcludedIps [mediawiki-config] - https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: Hashar)
[23:36:55] (PS7) Jforrester: enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:37:24] PROBLEM - Check systemd state on scandium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:38:21] (PS8) Jforrester: enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:40:50] (CR) Jforrester: [C: +2] enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:41:55] (Merged) jenkins-bot: enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:44:17] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot T219261 (duration: 00m 58s)
[23:44:19] (PS6) Alex Monk: labs puppetmaster migration: Puppet role for encapi/labspuppet DB hosts [puppet] - https://gerrit.wikimedia.org/r/500844
[23:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:21] T219261: Enwiki configuration: remove move-categorypages from 'user' group - https://phabricator.wikimedia.org/T219261
[23:47:45] (CR) jenkins-bot: enwiki: Restrict move-categorypages to +extendedmover/+sysop/+bot [mediawiki-config] - https://gerrit.wikimedia.org/r/499042 (https://phabricator.wikimedia.org/T219261) (owner: DannyS712)
[23:47:53] Operations, ops-codfw, Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (Volans) Forgot to mention that during the reboot it printed: ` Slot 3 Port 1 : Smart Array P840 Controller - (4096 MB, V3.56) 14 Logical Drive(s) - Operation Failed - 1719-Slot 3 Drive Arra...
[23:48:24] (CR) Smalyshev: [C: +1] Disable wbcs dispatching query builder on commons (2/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/500777 (https://phabricator.wikimedia.org/T218954) (owner: EBernhardson)
[23:48:31] (CR) Smalyshev: [C: +1] Disable wbcs dispatching query builder on commons (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: EBernhardson)
[23:48:51] (CR) Smalyshev: [C: +1] Disable wbcs dispatching query builder on commons (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/500778 (https://phabricator.wikimedia.org/T218954) (owner: EBernhardson)
[23:51:16] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[23:51:56] (CR) Smalyshev: [C: -1] Disable wbcs dispatching query builder on commons (1/3) (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/500070 (https://phabricator.wikimedia.org/T218954) (owner: EBernhardson)
[23:56:48] RECOVERY - Check systemd state on scandium is OK: OK - running: The system is fully operational