[00:02:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:04:26] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:56] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:28] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:34] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:36] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:20:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:28:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:14:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:16:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:17:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:38:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:55:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:59:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:16:31] (03CR) 10Legoktm: [C: 04-2] "Per my comments on the bug, especially T220703#5383773 and T220703#6354042." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617843 (https://phabricator.wikimedia.org/T220703) (owner: 10Ladsgroup) [06:40:14] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [06:45:58] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200802T0700) [07:10:52] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [07:16:32] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [08:24:00] 10Operations, 10Wikimedia-Site-requests, 10Chinese-Sites, 10Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10VulpesVulpes825) For the discussion, It should be considered a consensus has been reached. However, the task is still stalled due t... [09:12:10] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 59 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:18:04] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:53:35] (03PS1) 10MarcoAurelio: Add .gitreview [debs/karma] - 10https://gerrit.wikimedia.org/r/617814 [11:54:07] (03PS2) 10MarcoAurelio: Add .gitreview [debs/karma] - 10https://gerrit.wikimedia.org/r/617814 [11:54:18] (03CR) 10MarcoAurelio: [V: 03+2 C: 03+2] Add .gitreview [debs/karma] - 10https://gerrit.wikimedia.org/r/617814 (owner: 10MarcoAurelio) [12:16:28] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:16:31] 10Operations, 10Security-Team, 10Stewards-and-global-tools, 10Security, 10User-revi: Security Issue Access Request for 2020 Stewards - https://phabricator.wikimedia.org/T246449 (10Krd) >>! In T246449#5957171, @chasemp wrote: > @Krd if you would like to be included here please add MFA to your phab account... [12:22:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [13:58:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:06:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:15:34] (03PS1) 10Urbanecm: Add lld to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/617860 (https://phabricator.wikimedia.org/T259432) [16:29:52] (03PS1) 10Urbanecm: [WIP] Initial configuration for lijwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617861 (https://phabricator.wikimedia.org/T259432) [17:26:46] randomly just coming to say how much I hate exim4 [17:27:51] :D [17:27:58] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Ciell) I haven't been able to do any edits through Open Refine, and in https://phabricator.wikimedia.o... [18:33:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:37:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:39:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:42:24] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [19:11:30] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [19:11:30] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [19:17:19] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [19:17:20] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [19:31:52] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [19:31:52] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [19:34:44] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [19:34:44] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [19:54:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:01:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:09:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:16:54] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:46:13] 10Operations, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (10Ladsgroup) [21:06:02] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [21:27:14] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [21:39:29] 10Operations, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (10Krenair) This should probably just be a record under mailman.wmcloud.org ? [21:42:03] 10Operations, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (10Ladsgroup) >>! In T259444#6354493, @Krenair wrote: > This should probably just be a record under mailman.wmcloud.org ? Do you mean... [22:10:18] PROBLEM - MariaDB Replica Lag: s4 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1179.63 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:11:26] that's about 20 minutes lag... [22:30:02] 10Operations, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (10Krenair) As I recall with the meet project the project itself in OpenStack was named `meet`, therefore you automatically got a `mee... [22:31:32] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [22:37:04] 10Operations, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (10Ladsgroup) >>! In T259444#6354513, @Krenair wrote: > As I recall with the meet project the project itself in OpenStack was named `m... [22:37:19] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [23:49:40] RECOVERY - MariaDB Replica Lag: s4 on db1145 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica