[00:03:06] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn)
[00:11:36] <wikibugs>	 (PS1) Dzahn: site/phabricator: apply phab role on phab1001 [puppet] - https://gerrit.wikimedia.org/r/536712 (https://phabricator.wikimedia.org/T190568)
[00:13:53] <wikibugs>	 (Abandoned) Dzahn: puppetize setting advmss (MTU) size for GRE tunnel mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) (owner: Dzahn)
[00:17:59] <wikibugs>	 (CR) Dzahn: [C: +2] beta cluster: Make deployment-mediawiki-parsoid10 a MW scap target [puppet] - https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) (owner: Subramanya Sastry)
[00:18:07] <wikibugs>	 (PS4) Dzahn: beta cluster: Make deployment-mediawiki-parsoid10 a MW scap target [puppet] - https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) (owner: Subramanya Sastry)
[00:33:07] <wikibugs>	 (PS1) Dzahn: gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714
[00:35:05] <wikibugs>	 (CR) jerkins-bot: [V: -1] gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714 (owner: Dzahn)
[00:38:33] <wikibugs>	 (PS2) Dzahn: gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714
[00:46:07] <wikibugs>	 (PS3) Dzahn: gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714
[00:48:07] <wikibugs>	 (CR) jerkins-bot: [V: -1] gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714 (owner: Dzahn)
[01:00:44] <AntiComposite>	 Don't know if this has made it over here yet or not, but there have been a few reports of editors receiving 404 errors when saving pages
[01:01:01] <AntiComposite>	 I think it's correlated to mobile edits, but not sure
[01:01:14] <AntiComposite>	 It's intermittent and I haven't seen anyone reproduce it yet
[01:08:19] <mutante>	 AntiComposite: thanks for reporting. i don't see a raised number of 4xx responses in the graphs. if there are users reporting it, please have them open a ticket if possible
[01:08:37] <mutante>	 it actually looks even lower than before
[01:08:39] <AntiComposite>	 The most recent one was asked to
[01:09:04] <mutante>	 ok, good
[01:09:44] <mutante>	 i gotta go, co-working space closes. looks all normal to me
[01:10:07] <AntiComposite>	 Is there an ATS test still running? That probably wouldn't be reported with the Varnish errors
[01:58:34] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1159 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:43:26] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:14] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 254036272 and 22 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:47:30] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 570501432 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:49:58] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 931330680 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:53:50] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 916152 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:54:44] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 78416 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:10:14] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:34] <wikibugs>	 Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: Redo /beacon/impression system (formerly Special:RecordImpression) to remove extra round trips on all FR impressions (title was: S:RI should pyroperish) - https://phabricator.wikimedia.org/T45250 (AndyRussG)
[04:37:30] <icinga-wm>	 PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[04:37:30] <icinga-wm>	 RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:37:42] <icinga-wm>	 PROBLEM - HHVM rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[04:38:44] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[04:39:25] <effie>	 !log Depool and reload mw1286
[04:39:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:46] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[04:42:08] <icinga-wm>	 RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[04:42:26] <icinga-wm>	 RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 75142 bytes in 5.572 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[04:45:28] <icinga-wm>	 PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:54:34] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:55:08] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:07:14] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:16] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:24:36] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:28:24] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:37:18] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:44] <icinga-wm>	 RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:58] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:45:12] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:40] <icinga-wm>	 PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:09] <librenms-wmf>	 Warning
[06:37:24] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:57] <wikibugs>	 Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn)
[07:52:40] <wikibugs>	 Operations, ops-eqiad, Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (Dzahn) Thanks @wiki_willy While I was the reporter caching servers are ultimately handled by the traffic team so i would like to at least cc: them if we can depool this "cache::text" server anytime...
[07:56:03] <wikibugs>	 Operations, Cloud-Services, SRE-Access-Requests, Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (Dzahn) I can confirm that (reading those mysql credentials) is what the researchers group was originally created for.
[09:12:51] <wikibugs>	 (PS1) Gergő Tisza: Update ORES filter threshold configuration for new huwiki model [mediawiki-config] - https://gerrit.wikimedia.org/r/536732 (https://phabricator.wikimedia.org/T230031)
[09:13:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] Update ORES filter threshold configuration for new huwiki model [mediawiki-config] - https://gerrit.wikimedia.org/r/536732 (https://phabricator.wikimedia.org/T230031) (owner: Gergő Tisza)
[09:23:09] <librenms-wmf>	 Warning
[09:50:49] <Nemo_bis>	 librenms-wmf got laconic
[10:26:36] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/base (Get base CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:28:04] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:46:30] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:09] <librenms-wmf>	 Warning
[12:23:09] <librenms-wmf>	 Warning
[12:38:06] <icinga-wm>	 RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:46:02] <icinga-wm>	 PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:46] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:10] <icinga-wm>	 RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:02] <icinga-wm>	 PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:13:00] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:09] <librenms-wmf>	 Warning
[14:27:08] <TheSandDoctor>	 I ran into an interesting error while trying to save an edit. The error passed after one (lost) edit.
[14:27:57] <TheSandDoctor>	 Error: 503, Backend fetch failed at Sat, 14 Sep 2019 14:22:48 GMT
[14:28:52] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:29:58] <TheSandDoctor>	 (first line of message is my IP & "varnish XID", which I will not share publicly. - though I don't know what varnish XID is... if it isn't connectable and is an error code, I'd happily share)
[14:30:04] <TheSandDoctor>	 if it helps
[14:39:26] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:39:50] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:09] <AntiComposite>	 TheSandDoctor, What page did you experience this on?
[15:14:37] <TheSandDoctor>	 PERM rollback
[15:14:42] <TheSandDoctor>	 @AntiComposite:
[15:19:45] <wikibugs>	 (PS1) Urbanecm: Lift IP cap on 2019-10-02 for Senior Citizen Write Wikipedia course - cs.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/536740 (https://phabricator.wikimedia.org/T232831)
[15:23:09] <librenms-wmf>	 Warning
[16:43:42] <wikibugs>	 (PS2) Krinkle: Gerrit: Make 'eclipse' and 'elegant' themes colorblind-friendly [puppet] - https://gerrit.wikimedia.org/r/536687 (https://phabricator.wikimedia.org/T232893)
[16:48:27] <wikibugs>	 (PS2) Krinkle: Enable logging for BlockManager channel at info level [mediawiki-config] - https://gerrit.wikimedia.org/r/531299 (https://phabricator.wikimedia.org/T230822) (owner: Urbanecm)
[16:49:10] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.o
[16:49:10] <icinga-wm>	 Monitoring/mobileapps
[16:53:56] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[17:23:10] <librenms-wmf>	 Warning
[18:09:36] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:23:09] <librenms-wmf>	 Warning
[18:36:28] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:37:40] <icinga-wm>	 PROBLEM - SSH wtp1031.mgmt on wtp1031.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:09:40] <wikibugs>	 (CR) Alex Monk: "I've removed an old copy of this gerrit change from deployment-dumps-puppetmaster02 as it was breaking the rebase, the rebase should have " [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[19:38:10] <icinga-wm>	 RECOVERY - SSH wtp1031.mgmt on wtp1031.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:47:28] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:10:53] <wikibugs>	 (PS3) Urbanecm: Enable logging for BlockManager channel at info level [mediawiki-config] - https://gerrit.wikimedia.org/r/531299 (https://phabricator.wikimedia.org/T230822)
[20:23:10] <librenms-wmf>	 Warning
[20:24:12] <wikibugs>	 (PS1) Alex Monk: ATS: Fix check for do_ocsp [puppet] - https://gerrit.wikimedia.org/r/536747
[20:24:55] <Krenair>	 I wonder if the librenms-wmf issue is deliberate
[20:37:50] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:38:26] <icinga-wm>	 RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:44:48] <icinga-wm>	 PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:23:09] <librenms-wmf>	 Warning
[21:45:52] <wikibugs>	 Operations, Cloud-Services, SRE-Access-Requests, Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (Nuria) Hardly any data can be found in mysql, we are deprecating that storage this quarter as we moved all data coll...
[22:38:34] <icinga-wm>	 RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:44:54] <icinga-wm>	 PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:47:14] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:09] <librenms-wmf>	 Warning
[23:37:50] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:42:47] <onimisionipe>	 !log force shard allocation (dewiki_content_1566659363[4]) on eqiad cluster
[23:42:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log