[00:03:06] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn)
[00:11:36] (PS1) Dzahn: site/phabricator: apply phab role on phab1001 [puppet] - https://gerrit.wikimedia.org/r/536712 (https://phabricator.wikimedia.org/T190568)
[00:13:53] (Abandoned) Dzahn: puppetize setting advmss (MTU) size for GRE tunnel mitigations [puppet] - https://gerrit.wikimedia.org/r/536401 (https://phabricator.wikimedia.org/T232602) (owner: Dzahn)
[00:17:59] (CR) Dzahn: [C: +2] beta cluster: Make deployment-mediawiki-parsoid10 a MW scap target [puppet] - https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) (owner: Subramanya Sastry)
[00:18:07] (PS4) Dzahn: beta cluster: Make deployment-mediawiki-parsoid10 a MW scap target [puppet] - https://gerrit.wikimedia.org/r/536598 (https://phabricator.wikimedia.org/T232538) (owner: Subramanya Sastry)
[00:33:07] (PS1) Dzahn: gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714
[00:35:05] (CR) jerkins-bot: [V: -1] gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714 (owner: Dzahn)
[00:38:33] (PS2) Dzahn: gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714
[00:46:07] (PS3) Dzahn: gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714
[00:48:07] (CR) jerkins-bot: [V: -1] gerrit: get LDAP servers from ldap_config, use ro server, simplify [puppet] - https://gerrit.wikimedia.org/r/536714 (owner: Dzahn)
[01:00:44] Don't know if this has made it over here yet or not, but there have been a few reports of editors receiving 404 errors when saving pages
[01:01:01] I think it's correlated to mobile edits, but not sure
[01:01:14] It's intermittent and I haven't seen anyone reproduce it yet
[01:08:19] AntiComposite: thanks for reporting. i don't see raised number of 4xx responses in graphs. if there are users saying it please make them open a ticket if possible
[01:08:37] it actually looks even lower than before
[01:08:39] The most recent one was asked to
[01:09:04] ok, good
[01:09:44] i gotta go, co-working space closes. looks all normal to me
[01:10:07] Is there an ATS test still running? That probably wouldn't be reported with the Varnish errors
[01:58:34] RECOVERY - WDQS high update lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1159 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[02:43:26] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:45:14] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 254036272 and 22 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:47:30] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 570501432 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:49:58] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 931330680 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:53:50] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 916152 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:54:44] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 78416 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:10:14] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:34] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Traffic, and 2 others: Redo /beacon/impression system (formerly Special:RecordImpression) to remove extra round trips on all FR impressions (title was: S:RI should pyroperish) - https://phabricator.wikimedia.org/T45250 (AndyRussG)
[04:37:30] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[04:37:30] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:37:42] PROBLEM - HHVM rendering on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[04:38:44] PROBLEM - Nginx local proxy to apache on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[04:39:25] !log Depool and reload mw1286
[04:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:41:46] RECOVERY - Nginx local proxy to apache on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[04:42:08] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.076 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[04:42:26] RECOVERY - HHVM rendering on mw1286 is OK: HTTP OK: HTTP/1.1 200 OK - 75142 bytes in 5.572 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[04:45:28] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:54:34] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:55:08] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:07:14] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:16] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:24:36] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:28:24] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:37:18] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:44] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:58] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[05:45:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:40] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:23:09] Warning
[06:37:24] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:49:57] Operations, Phabricator, Release-Engineering-Team-TODO, serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (Dzahn)
[07:52:40] Operations, ops-eqiad, Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (Dzahn) Thanks @wiki_willy While I was the reporter caching servers are ultimately handled by the traffic team so i would like to at least cc: them if we can depool this "cache::text" server anytime...
[07:56:03] Operations, Cloud-Services, SRE-Access-Requests, Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (Dzahn) I can confirm that (reading those mysql credentials) is what the researchers group was originally created for.
[09:12:51] (PS1) Gergő Tisza: Update ORES filter threshold configuration for new huwiki model [mediawiki-config] - https://gerrit.wikimedia.org/r/536732 (https://phabricator.wikimedia.org/T230031)
[09:13:37] (CR) jerkins-bot: [V: -1] Update ORES filter threshold configuration for new huwiki model [mediawiki-config] - https://gerrit.wikimedia.org/r/536732 (https://phabricator.wikimedia.org/T230031) (owner: Gergő Tisza)
[09:23:09] Warning
[09:50:49] librenms-wmf got laconic
[10:26:36] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/base (Get base CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:28:04] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[10:46:30] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:09] Warning
[12:23:09] Warning
[12:38:06] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:46:02] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:06:46] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:10] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:02] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:13:00] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:09] Warning
[14:27:08] I ran into an interesting error while trying to save an edit. The error passed after one (lost) edit.
[14:27:57] Error: 503, Backend fetch failed at Sat, 14 Sep 2019 14:22:48 GMT
[14:28:52] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:29:58] (first line of message is my IP & "varnish XID", which I will not share publicly. - though I don't know what varnish XID is.. if it isn't connectable and is an error code, I'd happily share)
[14:30:04] if it helps
[14:39:26] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:39:50] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:09] TheSandDoctor, What page did you experience this on?
[15:14:37] PERM rollback
[15:14:42] @AntiComposite:
[15:19:45] (PS1) Urbanecm: Lift IP cap on 2019-10-02 for Senior Citizen Write Wikipedia course - cs.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/536740 (https://phabricator.wikimedia.org/T232831)
[15:23:09] Warning
[16:43:42] (PS2) Krinkle: Gerrit: Make 'eclipse' and 'elegant' themes colorblind-friendly [puppet] - https://gerrit.wikimedia.org/r/536687 (https://phabricator.wikimedia.org/T232893)
[16:48:27] (PS2) Krinkle: Enable logging for BlockManager channel at info level [mediawiki-config] - https://gerrit.wikimedia.org/r/531299 (https://phabricator.wikimedia.org/T230822) (owner: Urbanecm)
[16:49:10] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL: Test Get media in test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.o
[16:49:10] Monitoring/mobileapps
[16:53:56] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[17:23:10] Warning
[18:09:36] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:23:09] Warning
[18:36:28] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:37:40] PROBLEM - SSH wtp1031.mgmt on wtp1031.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:09:40] (CR) Alex Monk: "I've removed an old copy of this gerrit change from deployment-dumps-puppetmaster02 as it was breaking the rebase, the rebase should have " [puppet] - https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: ArielGlenn)
[19:38:10] RECOVERY - SSH wtp1031.mgmt on wtp1031.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:47:28] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:10:53] (PS3) Urbanecm: Enable logging for BlockManager channel at info level [mediawiki-config] - https://gerrit.wikimedia.org/r/531299 (https://phabricator.wikimedia.org/T230822)
[20:23:10] Warning
[20:24:12] (PS1) Alex Monk: ATS: Fix check for do_ocsp [puppet] - https://gerrit.wikimedia.org/r/536747
[20:24:55] I wonder if the librenms-wmf issue is deliberate
[20:37:50] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:38:26] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:44:48] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:23:09] Warning
[21:45:52] Operations, Cloud-Services, SRE-Access-Requests, Developer-Advocacy (Jul-Sep 2019): Membership in "researchers" group for Srishti Sethi - https://phabricator.wikimedia.org/T232664 (Nuria) Hardly any data can be found in mysql, we are deprecating that storage this quarter as we moved all data coll...
[22:38:34] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:44:54] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:47:14] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:23:09] Warning
[23:37:50] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:42:47] !log force shard allocation (dewiki_content_1566659363[4]) on eqiad cluster
[23:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log