[00:03:26] Operations, Analytics, Analytics-Kanban, Event-Platform, Wikimedia-production-error: Could not enqueue jobs from stream mediawiki.job.cirrusSearchIncomingLinkCount - https://phabricator.wikimedia.org/T263132 (thcipriani) p:Unbreak!→High After a few rounds of spikes, messages seem to h...
[00:11:12] PROBLEM - Too many messages in kafka logging-eqiad #o11y on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={1,2,3,4} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to
[00:11:12] datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[00:13:17] Operations, Wikimedia-Mailing-lists: Mailing list for local development discussion - https://phabricator.wikimedia.org/T263216 (jeena) Thanks!
[00:19:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:21:36] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:40:18] RECOVERY - Too many messages in kafka logging-eqiad #o11y on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[01:01:38] PROBLEM - Too many messages in kafka logging-eqiad #o11y on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={1,5} prometheus=ops site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=
[01:01:38] -topic=All&var-consumer_group=All
[01:03:34] RECOVERY - Too many messages in kafka logging-eqiad #o11y on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[01:23:02] PROBLEM - Too many messages in kafka logging-eqiad #o11y on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={0,1,5} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=n
[01:23:02] tasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[01:28:52] RECOVERY - Too many messages in kafka logging-eqiad #o11y on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[01:50:26] PROBLEM - Too many messages in kafka logging-eqiad #o11y on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3
[01:50:26] var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[03:54:05] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 13.19 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[04:19:32] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 3.106 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[06:46:50] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 12.17 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200919T0700)
[07:19:32] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 4.019 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[07:24:56] RECOVERY - Too many messages in kafka logging-eqiad #o11y on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[09:35:05] Operations, Wikimedia-Mailing-lists, I18n: Mojibake on Mailman - https://phabricator.wikimedia.org/T263248 (Ladsgroup) >>! In T263248#6474168, @jhsoby wrote: > @Ladsgroup I see in T52864 that you're involved in upgrading the lists. Do you have any idea what's causing this? Yes, as Dzahn put it, it's...
[11:41:06] (CR) Elukey: "> Patch Set 2:" [software/pywmflib] - https://gerrit.wikimedia.org/r/626380 (https://phabricator.wikimedia.org/T257905) (owner: Elukey)
[11:45:14] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:29:28] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:42:08] Operations, ops-eqiad, cloud-services-team (Kanban): cloudvirt1033 ipmi alert - https://phabricator.wikimedia.org/T263145 (Bstorm) This just alerted again. I'll downtime it if I can make my laptop work right.
[16:44:53] Operations, ops-eqiad, cloud-services-team (Kanban): cloudvirt1033 ipmi alert - https://phabricator.wikimedia.org/T263145 (Bstorm) Wait the alert may have been the old acked alert re-alerting in VictorOps. I will resolve it in victorops. The alert is still red in icinga, but it is acked so should no...
[16:49:06] Operations, ops-eqiad, cloud-services-team (Kanban): cloudvirt1033 ipmi alert - https://phabricator.wikimedia.org/T263145 (Bstorm) >>! In T263145#6476367, @Bstorm wrote: > Wait the alert may have been the old acked alert re-alerting in VictorOps. I will resolve it in victorops. The alert is still re...
[17:52:13] (PS3) Elukey: Add basic debian packaging [software/pywmflib] - https://gerrit.wikimedia.org/r/626380 (https://phabricator.wikimedia.org/T257905)
[18:38:12] (PS1) Evrifaessa: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: I9e99b766da20824391fc5111586be998c46c4331 [mediawiki-config] - https://gerrit.wikimedia.org/r/628513
[18:38:14] (PS1) Evrifaessa: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: I11ce4d27374aacb96a8b03b7d777406a63d5d5e1 [mediawiki-config] - https://gerrit.wikimedia.org/r/628514
[18:38:16] (PS1) Evrifaessa: Set timezone for wikis of the CWIRP to Europe/Rome [mediawiki-config] - https://gerrit.wikimedia.org/r/628515 (https://phabricator.wikimedia.org/T263123)
[18:39:58] (Abandoned) Evrifaessa: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: I11ce4d27374aacb96a8b03b7d777406a63d5d5e1 [mediawiki-config] - https://gerrit.wikimedia.org/r/628514 (owner: Evrifaessa)
[18:40:03] (Abandoned) Evrifaessa: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config Change-Id: I9e99b766da20824391fc5111586be998c46c4331 [mediawiki-config] - https://gerrit.wikimedia.org/r/628513 (owner: Evrifaessa)
[18:49:10] (PS1) ArielGlenn: don't get db creds unless needed for a query [dumps] - https://gerrit.wikimedia.org/r/628519 (https://phabricator.wikimedia.org/T263323)
[18:57:50] (PS2) Jforrester: Set timezone for wikis of the CWIRP to Europe/Rome [mediawiki-config] - https://gerrit.wikimedia.org/r/628515 (https://phabricator.wikimedia.org/T263123) (owner: Evrifaessa)
[18:57:57] (CR) ArielGlenn: [C: +2] don't get db creds unless needed for a query [dumps] - https://gerrit.wikimedia.org/r/628519 (https://phabricator.wikimedia.org/T263323) (owner: ArielGlenn)
[18:58:28] (CR) Jforrester: "Please make sure that your patches are based on the remote master branch before pushing, to avoid implicit merge commits. :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/628515 (https://phabricator.wikimedia.org/T263123) (owner: Evrifaessa)
[19:02:59] !log ariel@deploy1001 Started deploy [dumps/dumps@14ba6e9]: defer getting db creds until really needed
[19:03:00] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:03] !log ariel@deploy1001 Finished deploy [dumps/dumps@14ba6e9]: defer getting db creds until really needed (duration: 00m 04s)
[19:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:06:36] (PS1) Evrifaessa: Removing Wikipedia store link from enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/628521 (https://phabricator.wikimedia.org/T262329)
[19:27:19] (PS1) Urbanecm: Allow local steward group members to bigdelete [mediawiki-config] - https://gerrit.wikimedia.org/r/628522
[20:15:48] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:29:10] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:39:18] PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:43:32] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2030 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:04:12] RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:14:26] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2030 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:15:10] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:30:38] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:44:06] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:50:12] I know T261133 was declined. What about a project wanting to ban all ip edits from a content namespace? Please see https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/General_semi_protection_for_all_property_pages
[21:50:13] T261133: Ban IP editions on pt.wiki - https://phabricator.wikimedia.org/T261133
[22:30:06] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:43:32] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:16] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[23:29:34] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state