[00:06:23] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:54:09] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:35] Operations, Traffic, Wikispore, HTTPS: Make Wikispore HTTPS-only - https://phabricator.wikimedia.org/T260701 (Tgr) Sure, theft of a Wikispore account is not particularly damaging, but I doubt there are many people who cannot access Wikipedia and its sister projects but would want to access Wikisp...
[07:43:01] Operations, MediaWiki-Parser: Varnish 503 errors on page with large number of flag icons. - https://phabricator.wikimedia.org/T267804 (Izno) It is not normal for a PEIS max to cause a 503, in my experience. While it is an actual problem due to the artificial limit, we more-or-less always see the page as...
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201114T0800)
[08:16:12] Operations, MediaWiki-Parser: Varnish 503 errors on page with large number of flag icons. - https://phabricator.wikimedia.org/T267804 (Mjroots) Izno, there are no issues with my internet service, the rest of the net is working fine without any issues accessing websites and web pages. I don't understand m...
[08:51:51] Operations, MediaWiki-Parser: Varnish 503 errors on page with large number of flag icons. - https://phabricator.wikimedia.org/T267804 (Izno) I could not reproduce with Firefox, Win10, latest version (82?). >>! In T267804#6622146, @Mjroots wrote: > Izno, there are no issues with my internet service, the...
[09:09:02] Operations, MediaWiki-Parser: Varnish 503 errors on page with large number of flag icons. - https://phabricator.wikimedia.org/T267804 (Mjroots) Izno - under preferences > editing I have the following ticked:- General options Show the difference between the latest accepted version and the latest pending...
[10:15:47] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[14:56:52] Operations, Beta-Cluster-Infrastructure, Traffic, HTTPS: The certificate for upload.beta.wmflabs.org expired on November 13, 2020. - https://phabricator.wikimedia.org/T267858 (Krenair) Cert was renewed: ` root@deployment-acme-chief03:~# openssl x509 -in /var/lib/acme-chief/certs/unified/live/rsa-...
[15:01:35] Operations, Beta-Cluster-Infrastructure, Traffic, HTTPS: The certificate for upload.beta.wmflabs.org expired on November 13, 2020. - https://phabricator.wikimedia.org/T267858 (Krenair) For some reason I had to do a full restart of the `trafficserver-tls` service on the cache-upload06 VM but it ha...
[15:08:08] Operations, Beta-Cluster-Infrastructure, Traffic, HTTPS: The certificate for upload.beta.wmflabs.org expired on November 13, 2020. - https://phabricator.wikimedia.org/T267858 (Krenair) a:Krenair @Vgutierrez FYI in case this could happen in prod too, I haven't been keeping track of changes lat...
[15:17:34] Operations, Beta-Cluster-Infrastructure, Traffic, HTTPS: The certificate for upload.beta.wmflabs.org expired on November 13, 2020. - https://phabricator.wikimedia.org/T267858 (AlexisJazz) @hashar Could there be a relation with T267561? (very wild guess)
[16:40:21] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 107 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:41:59] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:03:31] PROBLEM - Disk space on Hadoop worker on an-worker1098 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[19:03:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:15:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:18:35] (CR) Ahmon Dancy: [C: +1] profile::lvs::realserver: use poolcounter for guarding service restarts [puppet] - https://gerrit.wikimedia.org/r/640928 (https://phabricator.wikimedia.org/T266055) (owner: Giuseppe Lavagetto)
[19:20:10] PROBLEM - Disk space on Hadoop worker on an-worker1100 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/n 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[19:46:27] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 55.19 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[19:49:29] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[19:51:07] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[19:51:23] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 97.35 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:12:44] Operations, Traffic: INMARSAT geolocates to the UK, leading to requests going to esams - https://phabricator.wikimedia.org/T209785 (Reedy) Open→Declined
[20:19:59] PROBLEM - Disk space on Hadoop worker on an-worker1100 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/n 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration