[01:43:13] (CR) Andrew Bogott: [C: +2] cloud-vps: improve flavor monitoring [puppet] - https://gerrit.wikimedia.org/r/621840 (owner: Andrew Bogott)
[03:38:29] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[05:24:07] !log legoktm@mwmaint1002:~$ echo "https://releases.wikimedia.org/mediawiki/1.35/" | mwscript purgeList.php --wiki=aawiki
[05:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:32:27] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:38:27] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:46:25] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:48:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:52:15] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:19] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:54:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200822T0700)
[07:25:41] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:41] PROBLEM - Check systemd state on wdqs1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:36:11] !log restart blazegraph on wdqs1006 + depool to catchup on lag
[07:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:33] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 72 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:36:38] ryankemper: ^^
[07:37:11] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[07:39:33] RECOVERY - Check systemd state on wdqs1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:53] PROBLEM - WDQS high update lag on wdqs1006 is CRITICAL: 9.92e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:40:47] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1006 is CRITICAL: 9.92e+04 ge 4.32e+04 Gehel server depooled, catching up on lag after restart https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[07:42:31] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 45 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:06:21] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 54 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:12:21] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 44 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:28:09] Operations, Domains, Traffic: Change of nameservers for Wikimedia.org.tr - https://phabricator.wikimedia.org/T259792 (HakanIST) .tr is the country code tld (domain extension) for Turkey not for the language. The Wikimedians of Turkic Languages User Group is a language based international group from...
[10:56:10] Operations, Wikimedia-Mailing-lists: Several unreadable mailing list descriptions due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (Aklapper)
[10:56:18] Operations, Wikimedia-Mailing-lists: Several unreadable mailing list descriptions due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (Aklapper) Hi @Aftabuzzaman, thanks for taking the time to report this! Confirming. This is due to ` $:acko\> curl -Is "https:/...
[10:56:19] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[10:58:13] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[11:46:29] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 54 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:52:25] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:20:45] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17330656 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:22:41] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 16384 and 84 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:08:01] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 56 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:13:57] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:46:04] (PS2) VulpesVulpes825: Correct the wrong workmark and tagline for Chinese Wikimedia Project [mediawiki-config] - https://gerrit.wikimedia.org/r/621542 (https://phabricator.wikimedia.org/T260908)
[13:51:45] PROBLEM - Ubuntu mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[14:10:31] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (BeatEstermann) Same problem here. Using Open Refine to edit Wikidata didn't work today when I tried to...
[17:10:57] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:23:18] RECOVERY - WDQS high update lag on wdqs1006 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.122e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[18:28:21] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:31:23] RECOVERY - Ensure traffic_server is running for instance tls on cp5002 is OK: PROCS OK: 1 process with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[18:37:17] PROBLEM - Ensure traffic_server is running for instance tls on cp5002 is CRITICAL: PROCS CRITICAL: 0 processes with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[18:38:15] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[18:43:41] RECOVERY - Ubuntu mirror in sync with upstream on sodium is OK: /srv/mirrors/ubuntu is over 6 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[18:44:35] (PS7) BryanDavis: Make `webservice shell` scriptable [software/tools-webservice] - https://gerrit.wikimedia.org/r/621776 (https://phabricator.wikimedia.org/T169695)
[19:16:04] (PS1) Privacybatm: [POC5 WIP] transferpy: Multiprocess the transfers [software/transferpy] - https://gerrit.wikimedia.org/r/621898 (https://phabricator.wikimedia.org/T259327)
[19:16:39] (CR) jerkins-bot: [V: -1] [POC5 WIP] transferpy: Multiprocess the transfers [software/transferpy] - https://gerrit.wikimedia.org/r/621898 (https://phabricator.wikimedia.org/T259327) (owner: Privacybatm)
[19:20:24] (CR) Privacybatm: [C: -1] "It is not working, No need to review this. Please consider it as a reference for the future." [software/transferpy] - https://gerrit.wikimedia.org/r/621898 (https://phabricator.wikimedia.org/T259327) (owner: Privacybatm)
[19:31:05] !log pooled wdqs1006 now that lag has dissipated
[19:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:51] ah, I saw the critical had resolved but checking grafana it's still about 2 hours behind, so gonna set it back to depooled
[19:33:10] !log depooled wdqs1006 (still has 2.5 hours to catch up on)
[19:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:59] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[20:32:57] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:52:38] Operations, Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (Nintendofan885) Is this done now as the deadline was a month ago and blog.wikimedia.org/Foo is now redirecting to diff.wikimedia.org/Foo
[21:01:37] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:21:27] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:57:21] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:07:11] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:43:01] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:52:59] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 52 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:14:33] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:17:49] Operations, Wikimedia-Mailing-lists: Several unreadable mailing list descriptions due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (Aftabuzzaman)
[23:20:12] Operations, Wikimedia-Mailing-lists: Several unreadable mailing list descriptions due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (Aftabuzzaman) I don't know how to change it. Please change it for above mailing list or at least for /wikipedia-bn & /wikipedia-...