[00:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T0000).
[00:00:04] legoktm: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:13] (CR) Dzahn: [C: +2] parsoid::testing: switch db_host from m5-master to localhost [puppet] - https://gerrit.wikimedia.org/r/654565 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:00:23] (PS2) Dzahn: parsoid::testing: switch db_host from m5-master to localhost [puppet] - https://gerrit.wikimedia.org/r/654565 (https://phabricator.wikimedia.org/T266509)
[00:00:37] I'm here for a last minute patch too :P
[00:01:02] hi
[00:01:15] I can deploy stuff today
[00:01:26] (CR) Subramanya Sastry: [C: +1] ATS: re-add config for parsoid-rt-tests.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/654351 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:01:56] (CR) Subramanya Sastry: [C: +1] Revert "remove parsoid-rt-tests.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/653998 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:02:08] oh cool, more logo stuffs
[00:03:06] this one should work, as I literally copied and renamed arbcom-ru's one :P
[00:03:21] (CR) Legoktm: [C: +2] arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/658463 (https://phabricator.wikimedia.org/T272920) (owner: Tks4Fish)
[00:03:52] SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (ssastry)
[00:04:07] (Merged) jenkins-bot: arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/658463 (https://phabricator.wikimedia.org/T272920) (owner: Tks4Fish)
[00:04:40] dont|panic: on mwdebug1002
[00:05:07] looks good :)
[00:05:11] same
[00:06:51] (CR) Dzahn: "config changed on scandium - noop on testreduce1001" [puppet] - https://gerrit.wikimedia.org/r/654565 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:07:22] !log legoktm@deploy1001 Synchronized static/favicon/arbcom_enwiki.ico: T272920: arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico (1/2) (duration: 01m 00s)
[00:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:27] T272920: arbcom-en.wikipedia.org change favicon - https://phabricator.wikimedia.org/T272920
[00:08:17] thanks a bunch, legoktm :)
[00:08:27] hold on one more sync
[00:08:40] oh, right
[00:08:42] !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T272920: arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico (2/2) (duration: 00m 58s)
[00:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:01] yep, now it's okay, thanks :D
[00:09:05] technically you're supposed to split this into two Gerrit patches but it's okay because I'm also breaking that rule today too
[00:09:27] (PS3) Legoktm: Drop obsolete requirements.txt and setup.py [mediawiki-config] - https://gerrit.wikimedia.org/r/657954
[00:09:29] (PS3) Legoktm: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/657955
[00:09:31] (PS7) Legoktm: Add script to mostly automate logo management [mediawiki-config] - https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640)
[00:09:48] (CR) Legoktm: [C: +2] "no-op" [mediawiki-config] - https://gerrit.wikimedia.org/r/657954 (owner: Legoktm)
[00:09:58] (CR) Legoktm: [C: +2] Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/657955 (owner: Legoktm)
[00:10:06] oh, I saw the previous patch with it and thought one could do it in one patch
[00:10:13] I'll keep that in mind for the next one :)
[00:10:38] (Merged) jenkins-bot: Drop obsolete requirements.txt and setup.py [mediawiki-config] - https://gerrit.wikimedia.org/r/657954 (owner: Legoktm)
[00:10:49] (Merged) jenkins-bot: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/657955 (owner: Legoktm)
[00:10:51] https://wikitech.wikimedia.org/wiki/Backport_windows#Guidelines
[00:10:52] Single patches that require more than one sync - in other words, changes to multiple files which depend on each other.
[00:10:52] Instead, please break up the patches into multiple safe patches that can be deployed by themselves. See: task T187761
[00:10:52] T187761: Proposal: Effective immediately, disallow multi-sync patch deployment - https://phabricator.wikimedia.org/T187761
[00:11:18] yeah, not a big deal, it's just that I literally can't do it in one sync command because they're in different directories
[00:11:29] so typically you have one commit that adds the logo and the next that changes the config
[00:12:18] ohhh okay
[00:12:31] sorry, won't happen again
[00:12:53] I even abandoned a patch as it hadn't uploaded the logo lol
[00:13:18] :))
[00:14:50] !log legoktm@deploy1001 Synchronized wmf-config/logos.php: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php (1/2) (duration: 00m 56s)
[00:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:24] !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php (1/2) (duration: 01m 00s)
[00:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:00] (CR) Legoktm: [C: +2] Add script to mostly automate logo management [mediawiki-config] - https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640) (owner: Legoktm)
[00:19:06] (Merged) jenkins-bot: Add script to mostly automate logo management [mediawiki-config] - https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640) (owner: Legoktm)
[00:20:20] SRE, Parsoid, Parsoid-Tests, serviceops, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (Dzahn) Resolved→Open
[00:23:39] (PS1) Legoktm: Invalidate configuration cache when logos.php is touched too [mediawiki-config] - https://gerrit.wikimedia.org/r/658466
[00:25:16] (CR) jerkins-bot: [V: -1] Invalidate configuration cache when logos.php is touched too [mediawiki-config] - https://gerrit.wikimedia.org/r/658466 (owner: Legoktm)
[00:25:53] (PS2) Legoktm: Invalidate configuration cache when logos.php is touched too [mediawiki-config] - https://gerrit.wikimedia.org/r/658466
[00:27:58] SRE, Parsoid, Parsoid-Tests, serviceops, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry) As part of this work, scandium puppet code was split into two pieces: (a) retain app-server config on scandium (b)...
[00:28:49] (CR) Legoktm: [C: +2] Invalidate configuration cache when logos.php is touched too [mediawiki-config] - https://gerrit.wikimedia.org/r/658466 (owner: Legoktm)
[00:29:38] (Merged) jenkins-bot: Invalidate configuration cache when logos.php is touched too [mediawiki-config] - https://gerrit.wikimedia.org/r/658466 (owner: Legoktm)
[00:30:34] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:02] ok, now it's working
[00:32:22] !log legoktm@deploy1001 Synchronized wmf-config/logos.php: Add script to mostly automate logo management (duration: 00m 55s)
[00:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:06] !log legoktm@deploy1001 Synchronized wmf-config/CommonSettings.php: Invalidate configuration cache when logos.php is touched too (duration: 00m 56s)
[00:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:21] I believe that's everything
[00:37:06] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:11] SRE, Parsoid, Parsoid-Tests, serviceops, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (Dzahn) /etc/testreduce does not exist at all on scandium, so that doesn't seem to be a puppetization issue. The mysql conf...
[00:40:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:42:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:42:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2320.codfw.wmnet
[00:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:15] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2319.codfw.wmnet
[00:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2318.codfw.wmnet
[00:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2331.codfw.wmnet
[00:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:20] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for MewOphaswongse - https://phabricator.wikimedia.org/T272912 (Legoktm) Stalled→Open a:Legoktm Verified the account was created by ITS: https://meta.wikimedia.org/w/index.php?title=Special:Log&logid=39709548
[00:46:26] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2320.codfw.wmnet
[00:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2319.codfw.wmnet
[00:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:47:37] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2318.codfw.wmnet
[00:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:06] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2331.codfw.wmnet
[00:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:50:52] (PS1) Legoktm: admin: Add mewoph to list of privledged LDAP users [puppet] - https://gerrit.wikimedia.org/r/658469 (https://phabricator.wikimedia.org/T272912)
[00:56:08] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for MewOphaswongse - https://phabricator.wikimedia.org/T272912 (Legoktm) p:Triage→Medium
[01:07:59] (PS2) Legoktm: superset: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657917 (https://phabricator.wikimedia.org/T266479)
[01:09:17] (CR) Legoktm: [C: +2] superset: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657917 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[01:11:31] (PS2) Legoktm: threedtopng: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657903 (https://phabricator.wikimedia.org/T266479)
[01:12:04] (CR) Legoktm: [C: +2] threedtopng: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657903 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[01:12:46] (PS2) Legoktm: udp2log: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657904 (https://phabricator.wikimedia.org/T266479)
[01:14:21] (CR) Legoktm: [C: +2] udp2log: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657904 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[01:27:42] (PS1) Aaron Schulz: Reword wmfEtcdApplyDBConfig() comments to better match those in LBFactoryMulti [mediawiki-config] - https://gerrit.wikimedia.org/r/658473
[01:32:06] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:32:12] SRE, Parsoid, Parsoid-Tests, serviceops, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry) Looks like the config files in /etc/testreduce/ are puppetized already! I manually edited the config file /etc/te...
[01:34:23] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (Legoktm) Hi @JTannerWMF I looked and it seems like you have two Developer accounts (aka wikitech/LDAP accounts): * https://ldap.toolforge.org/user/jtanner * https://ldap.toolforge.org/us...
[01:38:44] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:44:45] (PS2) Legoktm: mailman3: Fix python package for mysql [puppet] - https://gerrit.wikimedia.org/r/657952 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[01:51:42] (CR) Legoktm: [C: +2] mailman3: Fix python package for mysql [puppet] - https://gerrit.wikimedia.org/r/657952 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[01:53:57] (PS1) Legoktm: codesearch: Configure port for puppet [puppet] - https://gerrit.wikimedia.org/r/658477 (https://phabricator.wikimedia.org/T272947)
[01:55:39] will there be a branch cut today, or is that delayed/skipped because of last week's rollback?
[01:57:00] (CR) Legoktm: [C: +2] codesearch: Configure port for puppet [puppet] - https://gerrit.wikimedia.org/r/658477 (https://phabricator.wikimedia.org/T272947) (owner: Legoktm)
[02:01:19] (hm, for some reason I thought last week ended with the train blocked, but apparently not.)
[02:07:15] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.28 [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658479
[02:09:56] SRE, Graphoid, serviceops, Platform Engineering (Icebox): Undeploy graphoid for phase 4 wiki's - https://phabricator.wikimedia.org/T270443 (Jseddon) Open→Resolved
[02:10:04] SRE, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[02:13:35] tgr_: clearly the TrainBranchBot was listening to you ^^
[02:14:31] good bot.
[02:21:41] (CR) Cwhite: [C: +1] rsyslog: send AM notifications logs to logstash [puppet] - https://gerrit.wikimedia.org/r/658308 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi)
[02:22:12] (CR) Cwhite: [C: +1] alertmanager: add JSON logging of all notifications [puppet] - https://gerrit.wikimedia.org/r/658307 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi)
[02:31:24] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:55] (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.28 [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658479 (https://phabricator.wikimedia.org/T271342) (owner: TrainBranchBot)
[02:38:28] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:44] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:56] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:30:06] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:31:52] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:10] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[03:38:56] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:26] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.478 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:53:22] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:04:48] SRE, vm-requests: : of VMs requested for - https://phabricator.wikimedia.org/T272949 (Iupparand)
[04:31:36] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:34] (PS1) Legoktm: zuul: Port zuul-test-repo to Python 3 [puppet] - https://gerrit.wikimedia.org/r/658485
[04:38:26] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:31:06] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:56] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:40:44] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[05:43:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set candidate master to weight 0 before the failover T271427', diff saved to https://phabricator.wikimedia.org/P13952 and previous config saved to /var/cache/conftool/dbconfig/20210126-054337-marostegui.json
[05:43:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:42] T271427: Switchover s4 (commonswiki) from db1081 to db1138 - https://phabricator.wikimedia.org/T271427
[06:00:15] (CR) Marostegui: mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/658211 (https://phabricator.wikimedia.org/T271427) (owner: Marostegui)
[06:00:20] (PS2) Marostegui: mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/658211 (https://phabricator.wikimedia.org/T271427)
[06:01:04] (CR) Marostegui: [C: +2] mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/658211 (https://phabricator.wikimedia.org/T271427) (owner: Marostegui)
[06:01:25] (CR) Marostegui: wmnet: Update s4-master CNAME [dns] - https://gerrit.wikimedia.org/r/658213 (https://phabricator.wikimedia.org/T271427) (owner: Marostegui)
[06:01:29] (PS2) Marostegui: wmnet: Update s4-master CNAME [dns] - https://gerrit.wikimedia.org/r/658213 (https://phabricator.wikimedia.org/T271427)
[06:31:16] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:05] In 30 minutes we'll failover s4 (commons) master
[06:38:14] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:04] jynus: what's the query at the moment? just a similar update to the one I built?
[07:16:21] grep "# update section with section name from the former slave"
[07:16:34] at /usr/lib/python3/dist-packages/wmfmariadbpy/cli_admin/switchover.py
[07:16:37] or on repo
[07:17:04] ah I see, yeah, pretty much the same as the one I tried
[07:18:24] https://phabricator.wikimedia.org/diffusion/OSMD/browse/master/wmfmariadbpy/cli_admin/switchover.py;38660e943a9167fe7d174f79806acf3a6e4d4f23$722?as=source&blame=off
[07:18:37] we can just add an order by
[07:18:40] or remove the limit
[07:18:46] (CR) Giuseppe Lavagetto: [C: -1] "This goes in the right direction, but if we start collecting data about specific endpoints, I'd rather generalize the idea a bit - see the" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727) (owner: Hnowlan)
[07:19:03] it is one of those: the query is legit but the parser cannot tell the difference
[07:19:32] as there should only be 1 row so it is a deterministic statement
[07:19:58] but it is exactly why we cannot go to row on the primary dbs for mw
[07:25:39] I am trying without the limit
[07:26:28] Although I do like the limit :)
[07:27:42] (CR) Giuseppe Lavagetto: [C: -1] "LGTM overall, but I'd really get away from calling bash scripts altogether, and use python requests instead to fetch opcache data." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657677 (owner: Legoktm)
[07:28:18] (PS3) Effie Mouzeli: service_proxy: enable ipv6 on envoy config [puppet] - https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568)
[07:28:59] we can do it without the limit and WARN if rows affected > 1
[07:30:07] I am not sure if without the limit the query would succeed though
[07:30:15] ?
[07:31:25] you think it is a db issue, not a query issue?
[07:32:37] yeah, let me confirm something
[07:33:08] strange, binlog format says STATEMENT
[07:34:07] but it should be "safe for binlog"
[07:34:23] we can also enable unsafe statements- it should be ok for tendril db
[07:35:25] or just do a set session binlog_format=row for that query: https://phabricator.wikimedia.org/P13956
[07:36:25] or go REPEATABLE-READ ?
[07:36:43] can you try that too ^ just for curiosity
[07:36:47] sure thing
[07:37:08] we may get a warning
[07:37:31] yeah, that works with the warning of: this might be unsafe
[07:37:39] (PS1) Ladsgroup: Migrate hiera() to lookup() and set datatypes in purge.pp [puppet] - https://gerrit.wikimedia.org/r/658503 (https://phabricator.wikimedia.org/T209953)
[07:37:41] so either of the 2
[07:37:54] kormat to decide :-)
[07:37:55] So either the set session binlog or set session to repeatable-read can work to get this fixed without changing other things
[07:37:58] yeah
[07:38:22] I would go for set session binlog_format="ROW"; just to avoid the warning :)
[07:38:32] OR we move zarcillo and configure the db properly
[07:38:52] in a non-tokudb way, which I think is why it is in that level
[07:39:07] one of the 3
[07:39:23] jynus: I would leave that for a medium-term approach and get db-switchover fixed with that small hack for now
[07:39:26] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/658503 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[07:39:41] i am just giving options
[07:40:04] as in, I think there is an orchestrator db
[07:40:19] it wouldn't be unthinkable to move it there
[07:40:28] but it would require editing the file anyway
[07:40:32] as config is hardcoded
[07:40:35] jynus: orchestrator db is on the zarcillo 'section'
[07:40:36] so no reason really
[07:41:01] kormat, you mean it lives on db1115?
[07:41:12] or the one on codfw maybe?
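[editor's sketch] The two workarounds raised between 07:28:59 and 07:38:22 (drop the LIMIT and warn when more than one row is affected; switch only the current session to row-based binary logging) could be combined roughly as below. This is a minimal sketch under stated assumptions, not the code in switchover.py: the table and column names in `UPDATE_SECTION_SQL`, the function name, and the `RecordingCursor` stand-in are all hypothetical, and the `%s` placeholders assume a MySQL-style DB-API driver. Only `SET SESSION binlog_format = 'ROW'` is taken from the discussion itself, and changing it usually needs elevated privileges on the server.

```python
import logging

log = logging.getLogger("switchover_sketch")

# Hypothetical statement: the real query in switchover.py is not reproduced here.
UPDATE_SECTION_SQL = "UPDATE section_instances SET section = %s WHERE instance = %s"


def update_section_row_logged(cursor, section, instance):
    """Run the section update with row-based binlogging for this session only.

    `cursor` is any DB-API style cursor exposing execute() and rowcount.
    Returns the number of rows the UPDATE reported as affected.
    """
    # Session scope: other connections keep the server-wide format (STATEMENT).
    cursor.execute("SET SESSION binlog_format = 'ROW'")
    # No LIMIT clause; the row count is checked afterwards instead.
    cursor.execute(UPDATE_SECTION_SQL, (section, instance))
    affected = cursor.rowcount
    if affected > 1:
        log.warning("expected at most 1 row for %s, updated %d", instance, affected)
    return affected


class RecordingCursor:
    """Hypothetical stand-in for a real cursor so the sketch runs without a database."""

    def __init__(self, rowcount):
        self.rowcount = rowcount
        self.statements = []

    def execute(self, sql, params=None):
        self.statements.append((sql, params))
```

With a real connection the same function would be called with that connection's cursor; the session setting disappears when the connection closes, which is what keeps this a small hack rather than a server-wide change.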
[07:41:18] on the codfw node, yeah
[07:41:36] (PS1) Marostegui: db1160: Install stretch [puppet] - https://gerrit.wikimedia.org/r/658504 (https://phabricator.wikimedia.org/T258361)
[07:41:58] so there you have 2 1/2 solutions, whatever you think is best (or manuel convicens you is best :-D)
[07:42:05] *convinces
[07:42:14] *tricks you into
[07:42:21] :-)))
[07:42:50] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:11] But obviously, kormat to decide :)
[07:43:11] jynus: I am going to close the switchover ticket and create a follow up one for db-switchover fix, does that sound good?
[07:43:14] new host has report_host, right?
[07:43:21] new as in new primary
[07:43:42] marostegui, ok if answer to question is yes :-D
[07:43:56] XD
[07:44:57] SRE, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[07:47:24] (CR) Ladsgroup: "An extra PCC: https://puppet-compiler.wmflabs.org/compiler1001/27655/" [puppet] - https://gerrit.wikimedia.org/r/658503 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[07:50:48] (PS2) Marostegui: db1160: Install stretch [puppet] - https://gerrit.wikimedia.org/r/658504 (https://phabricator.wikimedia.org/T258361)
[07:57:22] (CR) Muehlenhoff: "Thanks! There's no reason we need Py2 compat here, though. We can drop the __future__ import and simply change the shebang to python3." [puppet] - https://gerrit.wikimedia.org/r/658455 (owner: Legoktm)
[07:58:33] (PS1) Muehlenhoff: Extend access for mraish [puppet] - https://gerrit.wikimedia.org/r/658547
[08:02:23] (CR) Marostegui: [C: +2] db1160: Install stretch [puppet] - https://gerrit.wikimedia.org/r/658504 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[08:03:49] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/658469 (https://phabricator.wikimedia.org/T272912) (owner: Legoktm)
[08:03:51] (PS1) Ryan Kemper: wdqs: add cert for wdqs-internal [puppet] - https://gerrit.wikimedia.org/r/658548 (https://phabricator.wikimedia.org/T272713)
[08:05:50] (PS2) Ryan Kemper: wdqs: add cert for wdqs-internal [puppet] - https://gerrit.wikimedia.org/r/658548 (https://phabricator.wikimedia.org/T272713)
[08:05:55] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1160.eqiad.wmnet'] ` The log ca...
[08:06:50] (CR) Hashar: "The $wgSoftBlockRange fine to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/657067 (https://phabricator.wikimedia.org/T209011) (owner: Arturo Borrero Gonzalez)
[08:08:04] (PS3) Ryan Kemper: wdqs: add cert for wdqs-internal [puppet] - https://gerrit.wikimedia.org/r/658548 (https://phabricator.wikimedia.org/T272713)
[08:12:07] (PS1) Ryan Kemper: wdqs: add dummy key for new wdqs-internal cert [labs/private] - https://gerrit.wikimedia.org/r/658550 (https://phabricator.wikimedia.org/T272713)
[08:13:08] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1119.eqiad.wmnet', 'an-worker1131.eqiad...
[08:13:54] (CR) Ryan Kemper: "After generating a new cert, three things need to be done:" [labs/private] - https://gerrit.wikimedia.org/r/658550 (https://phabricator.wikimedia.org/T272713) (owner: Ryan Kemper)
[08:14:52] (CR) Giuseppe Lavagetto: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/657543 (owner: Kormat)
[08:16:14] (CR) Giuseppe Lavagetto: [C: +1] udp2log: Install bsection [puppet] - https://gerrit.wikimedia.org/r/657543 (owner: Kormat)
[08:17:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1160.eqiad.wmnet with reason: REIMAGE
[08:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:08] (CR) Giuseppe Lavagetto: [C: +2] modules/scap/templates/scap.cfg.erb: Define php_fpm_unsafe_restart_script [puppet] - https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009) (owner: Ahmon Dancy)
[08:18:49] !log upgrading OpenJDK on aqs and Hadoop systems
[08:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1160.eqiad.wmnet with reason: REIMAGE
[08:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:16] (Abandoned) Effie Mouzeli: mediawiki::php bump opcache.max_accelerated_files [puppet] - https://gerrit.wikimedia.org/r/636047 (https://phabricator.wikimedia.org/T253673) (owner: Effie Mouzeli)
[08:21:08] (CR) Effie Mouzeli: [C: +1] mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) (owner: Giuseppe Lavagetto)
[08:25:16] (CR) Giuseppe Lavagetto: [C: -1] "The change seems ok in theory, but I'd like to see some risk management in terms of adding a feature flag to control the deployment." [puppet] - https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: Effie Mouzeli)
[08:26:01] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1119.eqiad.wmnet with reason: REIMAGE
[08:26:01] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1131.eqiad.wmnet with reason: REIMAGE
[08:26:02] SRE, DBA: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1160.eqiad.wmnet'] ` and were **ALL** successful.
[08:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:03] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1131.eqiad.wmnet with reason: REIMAGE
[08:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:15] (CR) Giuseppe Lavagetto: [C: +1] "I agree with this change, but let's see what alex thinks as well - he's usually opposed to using a native version number." [docker-images/docker-report] - https://gerrit.wikimedia.org/r/657218 (owner: Legoktm)
[08:28:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1119.eqiad.wmnet with reason: REIMAGE
[08:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:52] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:59] !log swift start decom for ms-be20[17,19,21,23,24,25,26,27] - T272837
[08:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:03] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837
[08:32:35] (CR) Filippo Giunchedi: [C: +2] alertmanager: add JSON logging of all notifications [puppet] - https://gerrit.wikimedia.org/r/658307 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi)
[08:32:47] (CR) Filippo Giunchedi: [C: +2] rsyslog: send AM notifications logs to logstash [puppet] - https://gerrit.wikimedia.org/r/658308 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi)
[08:33:16] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host furud.codfw.wmnet
[08:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:42] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1131.eqiad.wmnet', 'an-worker1119.eqiad.wmnet'] ` and were **ALL** successful.
[08:36:06] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:36:13] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1119,1131].eqiad.wmnet
[08:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:57] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host furud.codfw.wmnet
[08:37:58] (CR) Giuseppe Lavagetto: "One minor comment on the dockerfile, and one field missing in the control file; otherwise LGTM" (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: Hnowlan)
[08:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1119,1131].eqiad.wmnet
[08:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host flerovium.eqiad.wmnet
[08:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:28] (PS1) ArielGlenn: use the platform-engineering group to add people to deployers [puppet] - https://gerrit.wikimedia.org/r/658552
[08:41:56] (CR) jerkins-bot: [V: -1] use the platform-engineering group to add people to deployers [puppet] - https://gerrit.wikimedia.org/r/658552 (owner: ArielGlenn)
[08:42:16] (PS1) Elukey: Add an-worker1119 and 1131 to the Hadoop backup cluster [puppet] - https://gerrit.wikimedia.org/r/658553 (https://phabricator.wikimedia.org/T260411)
[08:42:54] PROBLEM - Check systemd state on mw2259 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:42:58] (03CR) 10Elukey: [C: 03+2] Add an-worker1119 and 1131 to the Hadoop backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/658553 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [08:44:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flerovium.eqiad.wmnet [08:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:44] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 27 probes of 675 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:47:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 46 probes of 592 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:50:40] (03PS2) 10ArielGlenn: use the platform-engineering group to add people to deployment [puppet] - 10https://gerrit.wikimedia.org/r/658552 [08:51:06] (03CR) 10jerkins-bot: [V: 04-1] use the platform-engineering group to add people to deployment [puppet] - 10https://gerrit.wikimedia.org/r/658552 (owner: 10ArielGlenn) [08:51:48] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:52:24] (03PS1) 10Marostegui: db1081: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/658554 (https://phabricator.wikimedia.org/T258361) [08:53:34] (03CR) 10Marostegui: [C: 03+2] db1081: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/658554 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [08:53:43] !log Stop mysql on db1081 to clone db1160 [08:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:00] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:55:07] (03PS3) 10ArielGlenn: use the platform-engineering group to add people to deployment [puppet] - 10https://gerrit.wikimedia.org/r/658552 [09:01:52] (03PS1) 10Marostegui: mariadb: Productionize db1160 [puppet] - 10https://gerrit.wikimedia.org/r/658556 (https://phabricator.wikimedia.org/T258361) [09:02:35] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1160 [puppet] - 10https://gerrit.wikimedia.org/r/658556 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [09:02:49] jouncebot: now [09:02:49] No deployments scheduled for the next 2 hour(s) and 57 minute(s) [09:02:51] jouncebot: next [09:02:51] In 2 hour(s) and 57 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T1200) [09:04:37] (03PS1) 10Urbanecm: frwiki: Fix tagline height and width [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658557 (https://phabricator.wikimedia.org/T272907) [09:06:39] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [09:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:49] (03CR) 10Urbanecm: [C: 03+2] frwiki: Fix tagline height and width [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658557 
(https://phabricator.wikimedia.org/T272907) (owner: 10Urbanecm) [09:07:36] (03Merged) 10jenkins-bot: frwiki: Fix tagline height and width [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658557 (https://phabricator.wikimedia.org/T272907) (owner: 10Urbanecm) [09:09:05] (03CR) 10ArielGlenn: "Note that this adds gmodena and nikkin to the deployment group since they are team group members. Holding until we have a thumbs up from m" [puppet] - 10https://gerrit.wikimedia.org/r/658552 (owner: 10ArielGlenn) [09:09:56] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for mraish [puppet] - 10https://gerrit.wikimedia.org/r/658547 (owner: 10Muehlenhoff) [09:11:22] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [09:11:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078 to clone db1175 T258361', diff saved to https://phabricator.wikimedia.org/P13958 and previous config saved to /var/cache/conftool/dbconfig/20210126-091149-marostegui.json [09:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:53] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [09:12:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1078 (db1175 isn't ready yet)', diff saved to https://phabricator.wikimedia.org/P13959 and previous config saved to /var/cache/conftool/dbconfig/20210126-091236-marostegui.json [09:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:18] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: eab87780: frwiki: Fix tagline height and width (T272907) (duration: 00m 58s) [09:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:21] T272907: French Wikipedia logo: Tagline too distant from wordmark; Tagline has 24px 
height instead of 13px and wordmark is too small - https://phabricator.wikimedia.org/T272907 [09:14:44] !log reboot dbstore1004 for kernel upgrades [09:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:58] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [09:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:28] PROBLEM - Apache HTTP on mw2319 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:16:42] PROBLEM - PHP7 rendering on mw2319 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:16:42] PROBLEM - PHP7 rendering on mw2318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2501 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:17:14] PROBLEM - Apache HTTP on mw2331 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:17:34] PROBLEM - PHP7 rendering on mw2331 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:17:42] PROBLEM - Apache HTTP on mw2318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:18:20] PROBLEM - Apache HTTP on mw2320 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:19:11] 10SRE, 10MediaWiki-Containers, 10serviceops, 
10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10akosiaris) >>! In T179696#6776337, @Joe wrote: >>>! In T179696#6775314, @Legoktm wrote: >> In my testing of repeatedly issuing the same curl com... [09:19:50] I see something like [09:19:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:19:52] Jan 26 09:19:20 mw2320 php7.2-fpm: PHP Fatal error: require(): Failed opening required '/srv/mediawiki/wmf-config/logos.php' [09:19:58] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 115 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:20:01] (03PS1) 10Kormat: realm: Add translate_cache to $private_tables. [puppet] - 10https://gerrit.wikimedia.org/r/658558 (https://phabricator.wikimedia.org/T272957) [09:20:34] elukey: upps [09:20:36] I know what that is [09:20:42] I'll fix it [09:20:48] super I was about to ping you :) [09:20:49] thanks [09:21:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:21:51] (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27658/console" [puppet] - 10https://gerrit.wikimedia.org/r/658558 (https://phabricator.wikimedia.org/T272957) (owner: 10Kormat) [09:22:23] (03CR) 10Kormat: realm: Add translate_cache to $private_tables. 
[puppet] - 10https://gerrit.wikimedia.org/r/658558 (https://phabricator.wikimedia.org/T272957) (owner: 10Kormat) [09:24:26] !log urbanecm@deploy1001 Synchronized wmf-config/logos.php: Resyncing to fix mw2xxx apache loading (duration: 00m 57s) [09:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:37] (03CR) 10Kormat: [V: 03+1 C: 03+2] udp2log: Install bsection [puppet] - 10https://gerrit.wikimedia.org/r/657543 (owner: 10Kormat) [09:24:56] (03CR) 10Marostegui: [C: 03+1] "requires restarting of all sanitarium hosts: db1124, db1125, db1154, db1155, db2094, db2095" [puppet] - 10https://gerrit.wikimedia.org/r/658558 (https://phabricator.wikimedia.org/T272957) (owner: 10Kormat) [09:25:01] RECOVERY - Apache HTTP on mw2320 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:25:03] elukey: ^^that should do the trick [09:25:07] RECOVERY - PHP7 rendering on mw2331 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:25:17] RECOVERY - Apache HTTP on mw2318 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:25:27] perfect :) [09:25:33] RECOVERY - PHP7 rendering on mw2318 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:26:29] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 36 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:27:06] (03CR) 10Kormat: [C: 03+2] realm: Add translate_cache to $private_tables. 
[puppet] - 10https://gerrit.wikimedia.org/r/658558 (https://phabricator.wikimedia.org/T272957) (owner: 10Kormat) [09:27:45] (03CR) 10Jbond: [C: 03+1] admin: Add mewoph to list of privledged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/658469 (https://phabricator.wikimedia.org/T272912) (owner: 10Legoktm) [09:28:35] elukey: do you happen to know if scap pull is a standard part of reimaging MW servers? according to SAL, lego.ktm did https://sal.toolforge.org/production?p=0&q=logos.php&d= last night, and at around the same time, mu.tante reimaged mw2319 (among other affected servers). So, until I did the IS.php sync to fix an (unrelated) bug, the error was there, but not noticed. Once I changed IS.php, the mw2319 copy started to require [09:28:35] logos.php. For some reason, it seems mw2319 was using an old copy of /srv/mediawiki [09:28:51] !log reboot dbstore1003 for kernel upgrades [09:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:39] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: Restart mariadb to pick up config changes T272957 [09:29:41] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: Restart mariadb to pick up config changes T272957 [09:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:42] T272957: Mark mediawikiwiki.translate_cache as private so it doesn't replicate to wiki replicas - https://phabricator.wikimedia.org/T272957 [09:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:57] RECOVERY - Apache HTTP on mw2319 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:31:15] (03CR) 10Muehlenhoff: [C: 03+2] Adapt proxy setting in debmonitor nginx site for CAS [puppet] - 10https://gerrit.wikimedia.org/r/657782 (owner: 10Muehlenhoff) [09:32:26] 
!log disable mdadm check emails on ms-be1022 / known, and host is going to be decom'd - T267870 [09:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:29] T267870: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 [09:32:47] (03PS2) 10Muehlenhoff: debmonitor: Don't include debmonitor_static for the internal listener [puppet] - 10https://gerrit.wikimedia.org/r/657795 [09:33:11] apparently, mw2319's copy is still not in its expected state [09:33:23] resyncing CommonSettings.php to fix this [09:33:43] (03PS4) 10Jbond: sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 [09:34:32] we should have alerts on drifts of important files in /srv/mediawiki [09:34:54] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: Resync: Some mw2xxx hosts have old version (duration: 00m 55s) [09:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:58] Urbanecm: ah yes this might explain, in theory right after the reimage we should do a scap pull [09:35:12] it is not automatic IIRC, but it has been a while since I checke [09:35:14] *checked [09:35:32] (03CR) 10Muehlenhoff: [C: 03+2] debmonitor: Don't include debmonitor_static for the internal listener [puppet] - 10https://gerrit.wikimedia.org/r/657795 (owner: 10Muehlenhoff) [09:35:48] (03CR) 10jerkins-bot: [V: 04-1] sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [09:37:45] !log reboot dbstore1005 for kernel upgrades [09:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:32] (03PS16) 10Jbond: cookbook sre.misc-clusters.apt: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [09:39:03] RECOVERY - PHP7 rendering on mw2319 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.111 second response 
time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [09:39:10] (03PS15) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [09:39:17] (03PS17) 10Jbond: cookbook sre.misc-clusters.apt: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 [09:40:00] (03PS5) 10Jbond: sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 [09:41:01] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host debmonitor1002.eqiad.wmnet [09:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:12] (03PS1) 10Jbond: gitignore: ignore vi swap/tmp files [cookbooks] - 10https://gerrit.wikimedia.org/r/658560 [09:47:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:49:11] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:52:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:52:23] RECOVERY - Apache HTTP on mw2331 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers [09:53:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1002.eqiad.wmnet [09:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:25] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 213859768 and 24 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [09:56:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [09:56:34] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete tmpreaper Puppet classes [puppet] - 10https://gerrit.wikimedia.org/r/658271 (https://phabricator.wikimedia.org/T272559) (owner: 10Muehlenhoff) [09:57:33] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10MoritzMuehlenhoff) [09:57:47] (03PS4) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 [09:58:51] (03PS5) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 [10:00:19] (03CR) 10Joal: [C: 03+1] "LGTM! 
elukey should be the one to merge :)" [puppet] - 10https://gerrit.wikimedia.org/r/642411 (owner: 10DCausse) [10:01:01] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 189657152 and 87 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:01:40] (03CR) 10jerkins-bot: [V: 04-1] wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro) [10:01:49] (03PS6) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 [10:04:20] (03CR) 10jerkins-bot: [V: 04-1] wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro) [10:05:14] (03PS7) 10Jbond: dns: update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 [10:05:22] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Kormat) [10:07:31] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:09:41] (03PS1) 10Jbond: add gitignore: add vi swap/tmp files [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658562 [10:15:25] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:19:02] (03PS3) 10Hnowlan: maps: reimage maps1009 with buster. 
[puppet] - 10https://gerrit.wikimedia.org/r/656404 (https://phabricator.wikimedia.org/T238753) [10:20:18] (03CR) 10Hnowlan: [C: 03+2] maps: reimage maps1009 with buster. [puppet] - 10https://gerrit.wikimedia.org/r/656404 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [10:20:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:20:47] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 206064424 and 98 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:22:06] (03CR) 10Muehlenhoff: [C: 03+2] Enable managed adduser.conf unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/657770 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [10:23:07] (03PS5) 10Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) [10:23:51] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Joe) we're still experiencing timeouts when trying to gather the catalog list with the url `/v2/_catalog?last=releng%2Fquibble-jessie-php55&n=10... [10:24:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:25:19] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:27:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:31:32] (03PS6) 10Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) [10:35:01] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10aborrero) thanks, I will put it into service soon! [10:35:59] (03CR) 10Muehlenhoff: "Looks good to me, a couple of nits inline" (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [10:38:27] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:11] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 994656 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:52:31] (03CR) 10David Caro: "I agree with @Bstorm, though I would instead add a '--interactive' flag or similar, as I think it's useful to be able to just copy-paste a" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [10:54:53] (03CR) 10David Caro: "Looks good, would be interesting to know where did you get the info on what to change for the upgrade too ;) (aside from the test plan/res" [puppet] - 10https://gerrit.wikimedia.org/r/658416 (owner: 10Bstorm) [11:03:57] (03CR) 10David Caro: [C: 03+1] data-services: apply user variances to future creations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657890 (https://phabricator.wikimedia.org/T269399) (owner: 10Bstorm) [11:07:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] "I 'll bundle with a couple of others 
changes (to avoid too many pybal restarts) and deploy today" [puppet] - 10https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [11:08:10] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add a linkrecommendation-external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris) [11:08:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks for the review. I 'll merge and let's see how it goes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris) [11:09:51] (03Merged) 10jenkins-bot: Add a linkrecommendation-external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris) [11:11:58] (03CR) 10Elukey: sre: convert the generic reboot functions to the cookbook class API (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [11:13:00] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Joe) Looking at the logs from a failed run, it looks like no retry is attempted when a 504 is received, at least on `registry2002`. Every 504 f... 
[11:13:28] (03PS10) 10Effie Mouzeli: varnish: Set debug=1 in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) [11:16:49] (03PS2) 10ArielGlenn: handle backwards searches for bz2 blocks in tiny files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658305 [11:16:51] (03PS2) 10ArielGlenn: update tests for different distros and for split-bz2 using local binaries [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658306 [11:20:57] (03PS1) 10Elukey: sre.hosts.decommission: fix homer subprocess execution code [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 [11:21:20] 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Urbanecm) [11:21:47] (03PS3) 10ArielGlenn: handle backwards searches for bz2 blocks in tiny files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658305 [11:21:49] (03PS3) 10ArielGlenn: update tests for different distros and for split-bz2 using local binaries [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658306 [11:23:46] (03Abandoned) 10Effie Mouzeli: varnish: include X-Client-Port in X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/657416 (https://phabricator.wikimedia.org/T181368) (owner: 10Effie Mouzeli) [11:25:34] (03PS11) 10Effie Mouzeli: varnish: Set debug=1 in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) [11:26:19] (03PS4) 10ArielGlenn: update tests for different distros and for split-bz2 using local binaries [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658306 [11:27:38] (03PS1) 10Effie Mouzeli: varnish: include X-Client-Port in X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/658567 (https://phabricator.wikimedia.org/T181368) [11:29:32] !log imported jenkins 2.263.3 to apt.wikimedia.org (thirdparty/ci) [11:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:09] (03CR) 10Volans: [C: 
04-1] "Thanks for starting to explore implementing things with spicerack and cookbooks." (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro) [11:30:30] (03PS1) 10Marostegui: instances.yaml: Add db1160 [puppet] - 10https://gerrit.wikimedia.org/r/658569 (https://phabricator.wikimedia.org/T258361) [11:30:42] (03CR) 10Volans: [C: 03+2] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658562 (owner: 10Jbond) [11:31:00] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1160 [puppet] - 10https://gerrit.wikimedia.org/r/658569 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [11:31:07] (03CR) 10Volans: [C: 03+2] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/658560 (owner: 10Jbond) [11:31:41] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:10] (03PS1) 10Hnowlan: cassandra::single_instance: use dedicated hiera key, don't use 'cluster' [puppet] - 10https://gerrit.wikimedia.org/r/658572 [11:34:18] (03Merged) 10jenkins-bot: add gitignore: add vi swap/tmp files [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658562 (owner: 10Jbond) [11:34:36] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] handle backwards searches for bz2 blocks in tiny files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658305 (owner: 10ArielGlenn) [11:34:52] (03Merged) 10jenkins-bot: gitignore: ignore vi swap/tmp files [cookbooks] - 10https://gerrit.wikimedia.org/r/658560 (owner: 10Jbond) [11:34:55] (03CR) 10Volans: [C: 04-1] "Thanks for the fix, minor improvement inline." 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 (owner: 10Elukey) [11:35:10] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] update tests for different distros and for split-bz2 using local binaries [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658306 (owner: 10ArielGlenn) [11:35:57] (03PS2) 10Elukey: sre.hosts.decommission: fix homer subprocess execution code [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 [11:36:09] (03CR) 10Elukey: sre.hosts.decommission: fix homer subprocess execution code (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 (owner: 10Elukey) [11:38:04] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27660/console" [puppet] - 10https://gerrit.wikimedia.org/r/658572 (owner: 10Hnowlan) [11:38:15] (03PS3) 10Volans: sre.hosts.decommission: fix homer subprocess execution code [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 (owner: 10Elukey) [11:38:21] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:23] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 (owner: 10Elukey) [11:41:34] (03CR) 10Elukey: [C: 03+2] sre.hosts.decommission: fix homer subprocess execution code [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 (owner: 10Elukey) [11:43:34] (03Merged) 10jenkins-bot: sre.hosts.decommission: fix homer subprocess execution code [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 (owner: 10Elukey) [11:44:33] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [11:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:30] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[11:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:42] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [11:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:20] (03CR) 10David Caro: ""I think that would be easier to start with cookbooks to manage the production side of the WMCS infrastructure than toolforge"" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro) [11:49:51] (03PS1) 10ArielGlenn: version 0.1.3 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/658574 [11:51:42] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1160 is ready to replace db1081. Leaving it to replicate for 24h before pooling it. [11:53:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:55:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T1200). [12:00:04] Evrifaessa: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:06] (03PS2) 10Giuseppe Lavagetto: httpbb: Add test for gzipping of static css files. 
[puppet] - 10https://gerrit.wikimedia.org/r/658317 (https://phabricator.wikimedia.org/T272305) [12:00:08] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) [12:00:10] (03PS9) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [12:00:20] I hope only one, jouncebot :) [12:00:46] requesting deployer not present, so there's nothing to do anyway [12:00:46] (03PS7) 10Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) [12:01:24] (03CR) 10Volans: "Couple of nits inline, CI is not happy and you might need a rebase." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 (owner: 10Jbond) [12:01:47] hey [12:01:57] (03CR) 10jerkins-bot: [V: 04-1] [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [12:02:14] Urbanecm, Amir1, Lucas_WMDE awight: anyone here? [12:02:15] hi, Evrifaessa [12:02:19] I can deploy today [12:02:21] hi! [12:02:22] o/ [12:02:36] (unless Lucas_WMDE really wants to? 🙂 ) [12:02:46] you can do it if you want to :) [12:02:54] lol [12:02:59] (03CR) 10ArielGlenn: [C: 03+2] version 0.1.3 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/658574 (owner: 10ArielGlenn) [12:03:06] (03PS2) 10Urbanecm: Add namespace aliases to Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657995 (https://phabricator.wikimedia.org/T272782) (owner: 10Evrifaessa) [12:03:28] (03CR) 10Urbanecm: [C: 03+2] Add namespace aliases to Turkish Wikivoyage. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/657995 (https://phabricator.wikimedia.org/T272782) (owner: 10Evrifaessa) [12:04:21] (03Merged) 10jenkins-bot: Add namespace aliases to Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657995 (https://phabricator.wikimedia.org/T272782) (owner: 10Evrifaessa) [12:04:49] Evrifaessa: available at mwdebug1001 for testing, can you have a look? [12:04:59] (03PS4) 10Urbanecm: Add Turkish 'Powered by MediaWiki' and 'A Wikimedia project' icons for Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657994 (https://phabricator.wikimedia.org/T272781) (owner: 10Evrifaessa) [12:05:03] (03CR) 10Urbanecm: [C: 03+2] Add Turkish 'Powered by MediaWiki' and 'A Wikimedia project' icons for Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657994 (https://phabricator.wikimedia.org/T272781) (owner: 10Evrifaessa) [12:05:04] works [12:05:10] thanks, syncing [12:06:04] (03Merged) 10jenkins-bot: Add Turkish 'Powered by MediaWiki' and 'A Wikimedia project' icons for Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657994 (https://phabricator.wikimedia.org/T272781) (owner: 10Evrifaessa) [12:06:08] (03PS1) 10Alexandros Kosiaris: services: Create LVS services for linkrecommendation [puppet] - 10https://gerrit.wikimedia.org/r/658576 (https://phabricator.wikimedia.org/T265603) [12:06:12] I'd love to have T272776 deployed today too, but I couldn't optimize the SVG [12:06:13] T272776: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage - https://phabricator.wikimedia.org/T272776 [12:06:18] (03PS10) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [12:06:48] can't pull the change. for some reason my git freezes out. can you optimize the SVG? 
Urbanecm [12:06:58] this one: https://gerrit.wikimedia.org/r/c/657971 [12:07:27] Evrifaessa: try going to master, doing git pull, and then git review -d 657971 again :) [12:07:28] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: eab535fcc983d57dd36c41309162ace8aadcae1a: Add namespace aliases to Turkish Wikivoyage (T272782) (duration: 01m 00s) [12:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:33] T272782: Add namespace aliases to Turkish Wikivoyage - https://phabricator.wikimedia.org/T272782 [12:08:42] Evrifaessa: please test the other change at mwdebug1001 [12:08:46] (03CR) 10jerkins-bot: [V: 04-1] [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [12:08:59] works. [12:09:06] thank you [12:10:59] (03PS2) 10Evrifaessa: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657971 (https://phabricator.wikimedia.org/T272776) [12:12:04] Urbanecm: mind checking and deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/657971 now? 
[12:12:15] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=trwikivoyage --cluster=all [12:12:16] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 4dfc28a4a759050726561da861a9e1030b529d3e: Add Turkish Powered by MediaWiki and A Wikimedia project icons for Turkish Wikivoyage (T272781) (duration: 01m 00s) [12:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:18] I uploaded a new patchset [12:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:21] T272781: Localize the footer images in Turkish Wikivoyage - https://phabricator.wikimedia.org/T272781 [12:12:23] Evrifaessa: saw, will look :) [12:13:55] Evrifaessa: what was your optimization command? [12:14:26] svgo wikivoyage-wordmark-tr.svg --disable={cleanupIDs,convertPathData,removeDesc,removeTitle,removeViewBox,removeXMLProcInst} --enable='sortAttrs' --pretty [12:15:45] Urbanecm: ? [12:15:53] Evrifaessa: I'm writing ;) [12:15:58] oh [12:15:59] lol [12:16:18] (03PS8) 10Hnowlan: maps: make maps1009 a new, independent buster master. 
[puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) [12:17:07] Evrifaessa: and also looking at the patch itself [12:17:28] (03PS3) 10Urbanecm: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657971 (https://phabricator.wikimedia.org/T272776) (owner: 10Evrifaessa) [12:19:12] (03CR) 10Urbanecm: [C: 03+2] Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657971 (https://phabricator.wikimedia.org/T272776) (owner: 10Evrifaessa) [12:20:05] (03Merged) 10jenkins-bot: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657971 (https://phabricator.wikimedia.org/T272776) (owner: 10Evrifaessa) [12:20:28] (03PS11) 10Giuseppe Lavagetto: mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) [12:20:51] (03PS1) 10Alexandros Kosiaris: apertium: Add the TLS-enabled LVS service [puppet] - 10https://gerrit.wikimedia.org/r/658577 [12:21:12] Evrifaessa: can you check? [12:21:19] (03CR) 10Hnowlan: [C: 03+2] maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [12:21:27] Urbanecm: mwdebug1001? [12:21:31] yup [12:21:47] hmm [12:21:47] nope [12:21:53] still the default logo [12:21:59] https://tr.m.wikivoyage.org/wiki/Anasayfa [12:22:17] Evrifaessa: sorry, can you try now? 
[12:22:34] looks awesome [12:22:35] thanks [12:22:45] great, syncing [12:23:23] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27661/console" [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto) [12:23:33] (03PS3) 10Alexandros Kosiaris: services: similar-users discovery and LVS component [puppet] - 10https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [12:23:35] (03PS2) 10Alexandros Kosiaris: services: Create LVS services for linkrecommendation [puppet] - 10https://gerrit.wikimedia.org/r/658576 (https://phabricator.wikimedia.org/T265603) [12:23:37] (03PS2) 10Alexandros Kosiaris: apertium: Add the TLS-enabled LVS service [puppet] - 10https://gerrit.wikimedia.org/r/658577 [12:24:26] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/wikivoyage-wordmark-tr.svg: 080389dbac5bb2cddab7640071e43674a868e945: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage (T272776; 1/2) (duration: 01m 01s) [12:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:32] T272776: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage - https://phabricator.wikimedia.org/T272776 [12:24:40] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27662/console" [puppet] - 10https://gerrit.wikimedia.org/r/658577 (owner: 10Alexandros Kosiaris) [12:26:18] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 080389dbac5bb2cddab7640071e43674a868e945: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage (T272776; 2/2) (duration: 01m 02s) [12:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:30] Evrifaessa: should be live :) [12:26:40] 
yeah, it works. thanks :)) [12:27:16] excellent :) [12:27:57] o/ [12:28:04] (03PS6) 10Urbanecm: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612) (owner: 10A2569875) [12:28:10] (03CR) 10Urbanecm: [C: 03+2] Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612) (owner: 10A2569875) [12:29:00] (03Merged) 10jenkins-bot: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612) (owner: 10A2569875) [12:29:45] (03CR) 10Alexandros Kosiaris: "I 've added the LVS IP to the kubernetes nodes on this patch and run PCC on this and descendants. https://puppet-compiler.wmflabs.org/comp" [puppet] - 10https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [12:29:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] services: similar-users discovery and LVS component [puppet] - 10https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [12:30:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] services: Create LVS services for linkrecommendation [puppet] - 10https://gerrit.wikimedia.org/r/658576 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris) [12:30:08] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] apertium: Add the TLS-enabled LVS service [puppet] - 10https://gerrit.wikimedia.org/r/658577 (owner: 10Alexandros Kosiaris) [12:30:15] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:02] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 
11cfef4f05612771d6a7cbe27f9bb1fbb41e0e5d: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia (T271612) (duration: 01m 01s) [12:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:07] T271612: New namespace for WikiProject on zh.wikipedia - https://phabricator.wikimedia.org/T271612 [12:32:11] (03Abandoned) 10JMeybohm: Demo - don't merge: Add a new listener to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/657591 (owner: 10JMeybohm) [12:32:16] (03Abandoned) 10JMeybohm: Demo - don't merge: Enable the service-proxy-demo listener for MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/657592 (owner: 10JMeybohm) [12:33:00] (03PS1) 10Alexandros Kosiaris: similar-users, linkrecommendation: Switch to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/658579 (https://phabricator.wikimedia.org/T265603) [12:33:23] (03CR) 10jerkins-bot: [V: 04-1] similar-users, linkrecommendation: Switch to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/658579 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris) [12:33:31] 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10nshahquinn-wmf) Thanks, @sbassett, @JFishback_WMF, and @jcrespo, for the further input! Yes, it sounds like I would need... [12:34:01] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=zhwiki --fix # T271612 # P13960 [12:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:05] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:03] legoktm, _joe_: the above is because requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://docker-registry.discovery.wmnet/v2/_catalog?last=releng%2Fquibble-jessie-php55&n=100 [12:42:29] 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10jcrespo) Assuming legal approves and IT helps you on client side, we (SREs) will be able to help the person with any tra... [12:43:32] 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10nshahquinn-wmf) >>! In T271202#6737057, @Platonides wrote: > On the topic of ssh accesses, there shouldn't be a "big hea... [12:44:21] (03PS1) 10Kormat: switchover: Work-around isolation level issue [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) [12:44:53] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.11:4737]) https://wikitech.wikimedia.org/wiki/PyBal [12:46:25] (03CR) 10jerkins-bot: [V: 04-1] switchover: Work-around isolation level issue [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: 10Kormat) [12:47:24] (03PS17) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [12:47:57] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 55 connections established with conf2001.codfw.wmnet:2379 (min=56) https://wikitech.wikimedia.org/wiki/PyBal [12:48:10] 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw 
KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10nshahquinn-wmf) >>! In T271202#6777042, @jcrespo wrote: > Assuming legal approves and IT helps you on client side, we (S... [12:48:39] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27663/console" [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [12:49:41] <_joe_> volans: we know, and it's tracked in the task already [12:49:42] (03CR) 10Volans: "Couple of minor things inline." (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 (owner: 10Jbond) [12:49:56] ack, thx, sorry for the noise [12:49:58] 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10jcrespo) Analytics tools are improving every day, The data engineering team are doing a great job of offering web-based... [12:51:54] <_joe_> volans: no, sorry for the noise from that script, somehow retries seem not to be working, and I didn't look further into what's going wrong [12:52:02] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 97 connections established with conf1004.eqiad.wmnet:4001 (min=98) https://wikitech.wikimedia.org/wiki/PyBal [12:52:26] 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10jcrespo) > do y'all still need to approve it? We ask you to loop us in. For production access is always better to ask f... 
[12:55:24] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 65 connections established with conf1004.eqiad.wmnet:4001 (min=66) https://wikitech.wikimedia.org/wiki/PyBal [12:55:56] (03PS1) 10Jbond: (WIP) debdeploy: Add debdeploy functionality [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658582 [12:56:20] (03PS18) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [12:57:12] (03CR) 10Jcrespo: [C: 03+1] "I haven't tested, +1, but see comments below." (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: 10Kormat) [12:57:51] (03CR) 10jerkins-bot: [V: 04-1] (WIP) debdeploy: Add debdeploy functionality [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658582 (owner: 10Jbond) [13:03:58] (03CR) 10Marostegui: [C: 03+1] "we can test with pc1 codfw at some point if you like" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: 10Kormat) [13:05:32] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.11:4737]) https://wikitech.wikimedia.org/wiki/PyBal [13:06:30] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 75 connections established with conf2001.codfw.wmnet:2379 (min=76) https://wikitech.wikimedia.org/wiki/PyBal [13:10:20] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.11:4737]) https://wikitech.wikimedia.org/wiki/PyBal [13:21:43] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.11:4737]) https://wikitech.wikimedia.org/wiki/PyBal [13:21:49] (03CR) 10Jcrespo: [C: 03+1] "More options (if we are afraid a connection loss can happen between one statement and the 
other:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: 10Kormat) [13:29:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:30:54] !log Upgraded and restarting Jenkins on release1002 / releases2002 / contint1001 and contint2001 [13:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:04] CI jobs halting for a couple minutes [13:31:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:31:27] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:35:04] (03PS3) 10Patsagorn Y.: Create patroller user group for thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T2721499) [13:35:28] (03CR) 10David Caro: "Fixed issues introduced when linting, now playing with classes." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro) [13:35:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:36:07] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:37:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge k8s: upgrade docker and containerd [puppet] - 10https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) (owner: 10Bstorm) [13:40:31] (03PS1) 10Matthias Mullie: [WikibaseMediaInfo] MediaSearch: new set of heuristics for alternative implementation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658589 (https://phabricator.wikimedia.org/T271532) [13:40:33] (03PS1) 10Matthias Mullie: [WikibaseMediaInfo] MediaSearch: remove old, unused set of heuristics [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658590 (https://phabricator.wikimedia.org/T271532) [13:41:16] !log admin update some kubernetes-related packages in buster-wikimedia/thirdparty/kubeadm-k8s-1-17 (T263284) [13:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:23] T263284: Upgrade Toolforge K8s to 1.17 - https://phabricator.wikimedia.org/T263284 [13:44:11] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) (owner: 10Bstorm) [13:45:45] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 156823768 and 85 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:46:14] (03CR) 10Patsagorn Y.: Create patroller user group for thwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T2721499) (owner: 10Patsagorn Y.) 
[13:48:48] (03CR) 10Patsagorn Y.: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T2721499) (owner: 10Patsagorn Y.) [13:49:11] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 9.4 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [13:51:49] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [13:54:23] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 789000 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:54:53] (03CR) 10Patsagorn Y.: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T2721499) (owner: 10Patsagorn Y.) [13:56:00] (03CR) 10Majavah: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T2721499) (owner: 10Patsagorn Y.) [13:58:56] the lvs hosts are me btw. 
Proceeding with setting up new LVS services [13:59:31] (03PS2) 10Alexandros Kosiaris: similar-users, linkrecommendation: Switch to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/658579 (https://phabricator.wikimedia.org/T265603) [14:00:26] (03PS1) 10Kormat: setup.cfg: Don't specify a python_version for mypy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658592 [14:00:55] (03PS2) 10Kormat: setup.cfg: Don't specify a python_version for mypy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658592 [14:03:25] !log swift codfw-prod decrease SSD weight for ms-be20[16-27] - T272837 [14:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:30] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837 [14:03:43] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 86118936 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:04:53] (03PS16) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [14:05:17] !log Restart db1077 [14:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:03] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:06:39] (03CR) 10HitomiAkane: "Note: the topic (bug) ID in the commit message is incorrect, also need rebase to solve the merge conflict issue" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T2721499) (owner: 10Patsagorn Y.) 
[14:07:02] (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [14:07:14] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'similar-users' for release 'main' . [14:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:42] (03CR) 10Jbond: sre: convert the generic reboot functions to the cookbook class API (0310 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [14:09:09] (03PS1) 10Filippo Giunchedi: alertmanager: hide 'logger' receiver on alerts dashboard [puppet] - 10https://gerrit.wikimedia.org/r/658593 (https://phabricator.wikimedia.org/T272474) [14:10:22] (03CR) 10Volans: [C: 03+1] "LGTM, but you might want to open a bug upstream, this is quite an unwanted behaviour." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658592 (owner: 10Kormat) [14:10:48] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: hide 'logger' receiver on alerts dashboard [puppet] - 10https://gerrit.wikimedia.org/r/658593 (https://phabricator.wikimedia.org/T272474) (owner: 10Filippo Giunchedi) [14:13:34] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[14:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:46] (03CR) 10Kormat: [C: 03+2] setup.cfg: Don't specify a python_version for mypy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658592 (owner: 10Kormat) [14:15:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] similar-users, linkrecommendation: Switch to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/658579 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris) [14:16:48] (03PS17) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [14:17:26] (03Merged) 10jenkins-bot: setup.cfg: Don't specify a python_version for mypy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658592 (owner: 10Kormat) [14:18:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,pdu_sentry4} site={eqiad,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:18:48] (03PS2) 10Kormat: switchover: Work-around isolation level issue [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) [14:19:24] (03CR) 10jerkins-bot: [V: 04-1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [14:20:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:21:21] (03PS1) 10WMDE-Fisch: Enable bracket matching on the first wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658594 (https://phabricator.wikimedia.org/T270238) [14:21:24] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the patch!" 
(1 comment) [software/pywmflib] - https://gerrit.wikimedia.org/r/656141 (owner: Jbond) [14:21:25] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:21:48] !log Install mariadb 10.4.18 on pc2010 - T268457 [14:21:51] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 59 connections established with conf2001.codfw.wmnet:2379 (min=59) https://wikitech.wikimedia.org/wiki/PyBal [14:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:53] T268457: Investigate possible optimizer regression on 10.4.17 with DELETE statements - https://phabricator.wikimedia.org/T268457 [14:22:01] !log restart pybal on lvs1015, lvs1016, lvs2009, lvs2010 for picking up linkrecommendation, similar-users, apertium-tls LVS services. [14:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:11] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 234392848 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:23:07] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [14:23:07] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [14:23:07] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[14:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:41] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 69 connections established with conf1004.eqiad.wmnet:4001 (min=69) https://wikitech.wikimedia.org/wiki/PyBal [14:23:43] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:24:27] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 342200 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:25:44] (CR) Jbond: [C: +2] dns: update DNS to support multiple namservers [software/pywmflib] - https://gerrit.wikimedia.org/r/656141 (owner: Jbond) [14:28:28] (Merged) jenkins-bot: dns: update DNS to support multiple namservers [software/pywmflib] - https://gerrit.wikimedia.org/r/656141 (owner: Jbond) [14:30:23] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:19] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 60634264 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:32:08] (PS1) Elukey: Refactor Discovery's analytics airflow to be more generic [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) [14:33:35] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1079136 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:34:31] (PS2) Elukey: Refactor Discovery's analytics airflow to be more
generic [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) [14:37:21] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:55] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 79 connections established with conf2001.codfw.wmnet:2379 (min=79) https://wikitech.wikimedia.org/wiki/PyBal [14:44:06] !log reimaging maps1009 as new buster master [14:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:35] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 101 connections established with conf1004.eqiad.wmnet:4001 (min=101) https://wikitech.wikimedia.org/wiki/PyBal [14:46:21] (PS1) Filippo Giunchedi: alertmanager: force 'default' receiver as last child route [puppet] - https://gerrit.wikimedia.org/r/658598 (https://phabricator.wikimedia.org/T272474) [14:46:27] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:46:30] (CR) Kormat: "> SET STATEMENT binlog_format=STATEMENT FOR " (2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: Kormat) [14:47:01] (PS1) Alexandros Kosiaris: similar-users, linkrecommendation: Switch to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/658599 (https://phabricator.wikimedia.org/T265603) [14:47:43] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:48:19] (PS18) Jbond: cookbook sre.misc-clusters.apt: [cookbooks] - https://gerrit.wikimedia.org/r/656139 [14:48:33] (CR) Jbond: cookbook sre.misc-clusters.apt: (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/656139 (owner: Jbond) [14:48:40]
(PS1) Elukey: Add fake user/pass for Search's airflow [labs/private] - https://gerrit.wikimedia.org/r/658600 [14:49:08] (CR) Filippo Giunchedi: [C: +2] alertmanager: force 'default' receiver as last child route [puppet] - https://gerrit.wikimedia.org/r/658598 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi) [14:49:19] (PS2) Alexandros Kosiaris: similar-users, linkrecommendation: Switch to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/658599 (https://phabricator.wikimedia.org/T265603) [14:49:43] (PS3) Elukey: Refactor Discovery's analytics airflow to be more generic [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) [14:50:06] (PS3) Jbond: icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385 [14:50:25] (PS2) Elukey: Add fake user/pass for Search's airflow [labs/private] - https://gerrit.wikimedia.org/r/658600 [14:50:45] (CR) Elukey: [V: +2 C: +2] Add fake user/pass for Search's airflow [labs/private] - https://gerrit.wikimedia.org/r/658600 (owner: Elukey) [14:50:47] (CR) jerkins-bot: [V: -1] cookbook sre.misc-clusters.apt: [cookbooks] - https://gerrit.wikimedia.org/r/656139 (owner: Jbond) [14:51:16] (CR) jerkins-bot: [V: -1] Refactor Discovery's analytics airflow to be more generic [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) (owner: Elukey) [14:53:38] (PS4) Elukey: Refactor Discovery's analytics airflow to be more generic [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) [14:53:48] (CR) Kormat: "I've filed https://github.com/python/mypy/issues/9972 with upstream."
[software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/658592 (owner: Kormat) [14:54:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:55:10] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27667/console" [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) (owner: Elukey) [14:56:44] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1009.eqiad.wmnet with reason: REIMAGE [14:56:44] (CR) Elukey: [V: +1] "To be effective this change needs something like https://gerrit.wikimedia.org/r/c/labs/private/+/658600 in the private repo before merging" [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) (owner: Elukey) [14:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:52] (CR) jerkins-bot: [V: -1] icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385 (owner: Jbond) [14:57:10] (CR) Volans: [C: -2] "This is great but should go in spicerack. wmflib is a multi-purpose library that doesn't require any special permissions and can be import" [software/pywmflib] - https://gerrit.wikimedia.org/r/658582 (owner: Jbond) [14:58:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:58:43] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1009.eqiad.wmnet with reason: REIMAGE [14:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:54] (PS4) Jbond: icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385 [15:00:00] (CR) Jbond: "updated thanks" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/657385 (owner: Jbond) [15:02:36] (CR) Bstorm: "> Patch Set 6:" [puppet] - https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) (owner: Bstorm) [15:06:46] (PS1) Jbond: (WIP) debdeploy: Add debdeploy functionality [software/spicerack] - https://gerrit.wikimedia.org/r/658626 [15:06:48] (PS1) Hnowlan: maps::apps: only use nodejs10 repo on stretch [puppet] - https://gerrit.wikimedia.org/r/658627 (https://phabricator.wikimedia.org/T238753) [15:06:59] (Abandoned) Jbond: (WIP) debdeploy: Add debdeploy functionality [software/pywmflib] - https://gerrit.wikimedia.org/r/658582 (owner: Jbond) [15:07:35] (CR) jerkins-bot: [V: -1] icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385 (owner: Jbond) [15:08:04] (PS1) Alexandros Kosiaris: Absent /etc/helmfile-defaults/service-proxy.yaml [puppet] - https://gerrit.wikimedia.org/r/658628 [15:08:06] (PS1) Alexandros Kosiaris: service proxy: Add apertium [puppet] - https://gerrit.wikimedia.org/r/658629 [15:08:08] (CR) Gehel: [C: +1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/658550 (https://phabricator.wikimedia.org/T272713) (owner: Ryan Kemper) [15:08:35] (CR) Alexandros Kosiaris: [C: +2] similar-users, linkrecommendation: Switch to monitoring_setup [puppet] -
https://gerrit.wikimedia.org/r/658599 (https://phabricator.wikimedia.org/T265603) (owner: Alexandros Kosiaris) [15:09:32] (CR) Gehel: [C: +1] "LGTM as far as I understand how we manage SSL certs. A +1 from someone who has a more complete understanding would be nice." [puppet] - https://gerrit.wikimedia.org/r/658548 (https://phabricator.wikimedia.org/T272713) (owner: Ryan Kemper) [15:09:37] (CR) Gehel: "LGTM as far as I understand how we manage SSL certs. A +1 from someone who has a more complete understanding would be nice." [puppet] - https://gerrit.wikimedia.org/r/657913 (owner: Ryan Kemper) [15:12:42] (CR) jerkins-bot: [V: -1] (WIP) debdeploy: Add debdeploy functionality [software/spicerack] - https://gerrit.wikimedia.org/r/658626 (owner: Jbond) [15:12:50] (PS2) Alexandros Kosiaris: Absent /etc/helmfile-defaults/service-proxy.yaml [puppet] - https://gerrit.wikimedia.org/r/658628 [15:12:52] (PS2) Alexandros Kosiaris: service proxy: Add apertium [puppet] - https://gerrit.wikimedia.org/r/658629 [15:12:54] (PS1) Alexandros Kosiaris: similar-users, linkrecommendation: Switch to production [puppet] - https://gerrit.wikimedia.org/r/658630 (https://phabricator.wikimedia.org/T265603) [15:14:38] SRE, DBA, Platform Engineering Roadmap Decision Making, Performance-Team (Radar), User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (Krinkle) @Kormat @Marostegui I believe this is unblocked now for you to remove groups from the db configuration. At this...
[15:16:48] (PS5) Jbond: icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385 [15:26:15] (CR) jerkins-bot: [V: -1] icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385 (owner: Jbond) [15:28:04] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 174964472 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:29:32] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 580528 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:29:36] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/658552 (owner: ArielGlenn) [15:30:02] SRE, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (elukey) Created all the TLS certs and configs as described in https://wikitech.wik...
[15:30:12] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:19] (PS19) Hnowlan: start using imposm as OSM sync tool [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos) [15:30:47] (PS7) Jbond: nodegen: add cumin support [software/puppet-compiler] - https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) [15:33:15] (PS8) Jbond: nodegen: add cumin support [software/puppet-compiler] - https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) [15:33:58] (PS7) David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357 [15:33:59] (PS1) David Caro: wmcs: Move to class-based cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/658631 [15:35:30] (CR) David Caro: "The diff looks really bad xd, there aren't really many changes." [cookbooks] - https://gerrit.wikimedia.org/r/658631 (owner: David Caro) [15:36:44] (PS8) David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357 [15:36:46] (PS2) David Caro: wmcs: Move to class-based cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/658631 [15:36:48] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:44] (CR) jerkins-bot: [V: -1] nodegen: add cumin support [software/puppet-compiler] - https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: Jbond) [15:43:34] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 47542712 and 33 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:46:13] (CR) Elukey: "Hi David, I saw that code change passing by and I added a couple of comments, nothing related to the logic of the cookbooks but only to so" (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/658631 (owner: David Caro) [15:51:42] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 20528 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:53:37] (CR) David Caro: "Thanks a lot Elukey!
Will fix/added reply 😊" (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/658631 (owner: David Caro) [15:54:52] RECOVERY - Check systemd state on mw2259 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:39] (CR) Muehlenhoff: "Looks good, two nits/comments inline (but feel free to ignore)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/658407 (owner: Jbond) [15:55:43] (PS9) David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357 [15:55:45] (PS3) David Caro: wmcs: Move to class-based cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/658631 [15:58:20] (CR) jerkins-bot: [V: -1] wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357 (owner: David Caro) [16:02:02] !log installing mutt security updates on buster [16:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:01] (CR) Andrew Bogott: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) (owner: Andrew Bogott) [16:12:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:14:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:14:46] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 178367440 and 73 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:15:20] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/658627 (https://phabricator.wikimedia.org/T238753) (owner: Hnowlan) [16:15:22] (PS6) Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - https://gerrit.wikimedia.org/r/658407 [16:15:55] (CR) Bstorm: data-services: apply user variances to future creations (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657890 (https://phabricator.wikimedia.org/T269399) (owner: Bstorm) [16:16:34] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27670/console" [puppet] - https://gerrit.wikimedia.org/r/658407 (owner: Jbond) [16:18:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:18:47] (PS7) Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - https://gerrit.wikimedia.org/r/658407 [16:19:59] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27671/console" [puppet] - https://gerrit.wikimedia.org/r/658407 (owner: Jbond) [16:20:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:21:49] (CR) Giuseppe Lavagetto: [C: +2] httpbb: Add test for gzipping of static css files. [puppet] - https://gerrit.wikimedia.org/r/658317 (https://phabricator.wikimedia.org/T272305) (owner: Giuseppe Lavagetto) [16:22:04] (CR) Ahmon Dancy: [C: +2] Branch commit for wmf/1.36.0-wmf.28 [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658479 (https://phabricator.wikimedia.org/T271342) (owner: TrainBranchBot) [16:22:06] (PS8) Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - https://gerrit.wikimedia.org/r/658407 [16:23:14] (PS4) David Caro: wmcs: Move to class-based cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/658631 [16:23:16] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27672/console" [puppet] - https://gerrit.wikimedia.org/r/658407 (owner: Jbond) [16:24:25] (CR) Jbond: [V: +1 C: +2] O:idp: update apero_cas::service so its a bit more intuitive (2 comments) [puppet] - https://gerrit.wikimedia.org/r/658407 (owner: Jbond) [16:25:31] (PS1) Alexandros Kosiaris: linkrecommendation: Enable monitoring [deployment-charts] - https://gerrit.wikimedia.org/r/658636 (https://phabricator.wikimedia.org/T258978) [16:25:54] (CR) Hnowlan: [C: +2] maps::apps: only use nodejs10 repo on stretch [puppet] - https://gerrit.wikimedia.org/r/658627 (https://phabricator.wikimedia.org/T238753) (owner: Hnowlan) [16:26:02] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (Jgreen) >>! In T266481#6775156, @wiki_willy wrote: > Hi @Jgreen - it looks like we're running a bit tight on space in the Fundraising rack. In order for us to rack the servers for...
[16:26:53] (CR) RLazarus: [C: +1] add deploy1002 and deploy2002 to deployment_hosts for firewalls [puppet] - https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn) [16:27:26] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 26536 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:28:36] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (wiki_willy) Thanks @Jgreen (cc'ing @Jclark-ctr as a fyi) >>! In T266481#6777535, @Jgreen wrote: >>>! In T266481#6775156, @wiki_willy wrote: >> Hi @Jgreen - it looks like we're runn... [16:28:45] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) [16:28:50] (PS1) David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658637 [16:29:10] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for MewOphaswongse - https://phabricator.wikimedia.org/T272912 (Dzahn) Hi @mewoph could you let us know what specific thing you actually want to access? Thanks! [16:29:21] (CR) David Caro: "Hmm... something got borked here... I seem no be unable to update, patch 9 was not supposed to be there (removing the linting comments)."
[cookbooks] - https://gerrit.wikimedia.org/r/658357 (owner: David Caro) [16:29:42] (CR) David Caro: "Comes from https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/658357" [cookbooks] - https://gerrit.wikimedia.org/r/658637 (owner: David Caro) [16:30:27] (PS5) David Caro: wmcs: Move to class-based cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/658631 [16:31:02] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:21] (Abandoned) David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357 (owner: David Caro) [16:31:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:32:22] (CR) Dzahn: [C: +1] "lgtm, checked on mwmaint and ldap-corp. full time employee" [puppet] - https://gerrit.wikimedia.org/r/658469 (https://phabricator.wikimedia.org/T272912) (owner: Legoktm) [16:32:51] (CR) Dzahn: [C: +2] add deploy1002 and deploy2002 to deployment_hosts for firewalls [puppet] - https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn) [16:32:57] (PS6) Dzahn: add deploy1002 and deploy2002 to deployment_hosts for firewalls [puppet] - https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963) [16:33:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:33:40] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (elukey) Ok all nodes racked are now working! We have 6 missing node still to rack, ideally in rows not already too used. For example, this is our current d... [16:34:44] (CR) Alexandros Kosiaris: [C: +2] linkrecommendation: Enable monitoring [deployment-charts] - https://gerrit.wikimedia.org/r/658636 (https://phabricator.wikimedia.org/T258978) (owner: Alexandros Kosiaris) [16:36:05] adding new deployment hosts to firewalls, this will be a ferm change/reload on a LOT of hosts [16:36:20] (Merged) jenkins-bot: linkrecommendation: Enable monitoring [deployment-charts] - https://gerrit.wikimedia.org/r/658636 (https://phabricator.wikimedia.org/T258978) (owner: Alexandros Kosiaris) [16:37:48] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:26] PROBLEM - puppet last run on kafka-test1007 is CRITICAL: CRITICAL: Puppet last ran 6 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:38:41] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [16:38:41] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [16:38:41] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[16:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:11] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [16:40:34] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1009.eqiad.wmnet with reason: REIMAGE [16:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:43] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [16:41:50] (CR) Elukey: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: Razzi) [16:42:06] !log reimaging l33t jobrunner mw1337 [16:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:20] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [16:42:20] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [16:42:20] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[16:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:41] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1009.eqiad.wmnet with reason: REIMAGE [16:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:22] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [16:43:45] RECOVERY - puppet last run on kafka-test1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:43:55] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [16:43:56] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [16:43:56] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [16:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:07] PROBLEM - Check systemd state on kafka-test1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:19] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [16:45:24] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2321.codfw.wmnet'] ` Of... [16:45:53] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[16:46:04] (CR) Elukey: "One last little thing that I just realized, and I think we are ready to go :)" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: Razzi) [16:50:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [16:50:15] !log Deploy schema change on testwiki - T272953 [16:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:20] T272953: CentralNotice: Update DB schema on Meta for campign types feature - https://phabricator.wikimedia.org/T272953 [16:50:53] (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.28 [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658479 (https://phabricator.wikimedia.org/T271342) (owner: TrainBranchBot) [16:51:38] Am I being stupid or has the diagnosing connection problems page been moved on wikitech? [16:55:02] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1409.eqiad.wmnet with reason: REIMAGE [16:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:27] I found it [16:56:30] (CR) Ebernhardson: [C: +1] "seems sane to me" [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) (owner: Elukey) [16:56:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1408.eqiad.wmnet with reason: REIMAGE [16:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1409.eqiad.wmnet with reason: REIMAGE [16:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1408.eqiad.wmnet with reason: REIMAGE [16:58:56]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:05] jbond42 and cdanis: Dear deployers, time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T1700). [17:01:11] (03PS1) 10Dzahn: DHCP: switch all eqiad appservers to use buster installer [puppet] - 10https://gerrit.wikimedia.org/r/658642 (https://phabricator.wikimedia.org/T245757) [17:02:20] 10SRE, 10Analytics-Radar, 10Domains, 10Traffic, and 2 others: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (10Krinkle) [17:02:35] 10SRE, 10Analytics-Radar, 10Domains, 10Traffic, 10Wikimedia-General-or-Unknown: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (10Krinkle) [17:02:38] 10SRE, 10Analytics-Radar, 10Domains, 10Traffic, 10Wikimedia-General-or-Unknown: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (10Krinkle) [17:02:55] PROBLEM - Check systemd state on ms-be1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:59] (03CR) 10Dzahn: [C: 03+2] DHCP: switch all eqiad appservers to use buster installer [puppet] - 10https://gerrit.wikimedia.org/r/658642 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [17:04:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2321.codfw.wmnet with reason: REIMAGE [17:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:25] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1337.eqiad.wmnet with reason: REIMAGE [17:05:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:42] (03PS1) 10Dzahn: scap: add deploy2002 to mediawiki installation hosts [puppet] - 10https://gerrit.wikimedia.org/r/658643 (https://phabricator.wikimedia.org/T265963) [17:06:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2321.codfw.wmnet with reason: REIMAGE [17:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:07] (03PS2) 10Dzahn: scap: add deploy2002 to mediawiki installation hosts [puppet] - 10https://gerrit.wikimedia.org/r/658643 (https://phabricator.wikimedia.org/T265963) [17:07:37] 10SRE, 10ops-eqiad: ms-be1046 stuck on reboot - https://phabricator.wikimedia.org/T272396 (10Cmjohnson) a new motherboard has been dispatched, I will coordinate with Dell Tech to get this completed, Hoping for Wednesday. 
[17:08:24] 10SRE, 10puppet-compiler, 10User-jbond: puppet documentation generation is missing some compnets - https://phabricator.wikimedia.org/T271909 (10thcipriani) [17:08:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1337.eqiad.wmnet with reason: REIMAGE [17:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:59] (03PS3) 10Dzahn: scap: add deploy1002 and deploy2002 to mediawiki hosts [puppet] - 10https://gerrit.wikimedia.org/r/658643 (https://phabricator.wikimedia.org/T265963) [17:11:23] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) New DIMM has been dispatched for the server I will coordinate a time with you to power down to restore the original configuration. [17:12:13] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui) Sounds good @Cmjohnson let me know when it arrives and you plan to change it so I can stop mysql Thank you [17:12:36] 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Cmjohnson) Good news bad news, Dell dispatched a new DIMM. The bad news, is we do not know which one and it could take some time to figure that... [17:15:03] (03CR) 10JMeybohm: [C: 04-1] service proxy: Add apertium (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/658629 (owner: 10Alexandros Kosiaris) [17:17:04] (03CR) 10JMeybohm: [C: 04-1] "At least termbox still references this file (helmfile.d/services/termbox/helmfile.yaml). 
I would suggest to first remove it from there, ju" [puppet] - 10https://gerrit.wikimedia.org/r/658628 (owner: 10Alexandros Kosiaris) [17:17:09] (03PS6) 10Dzahn: Revert "remove parsoid-rt-tests.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/653998 (https://phabricator.wikimedia.org/T266509) [17:17:50] (03CR) 10Dzahn: [C: 03+2] Revert "remove parsoid-rt-tests.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/653998 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn) [17:18:04] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1028 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:18:14] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1409.eqiad.wmnet'] ` an... [17:19:13] !log ms-be1028 - running puppet to clear ferm icinga alert [17:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:40] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:19:41] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1408.eqiad.wmnet'] ` an... [17:20:38] so that kind of ferm alert is a race condition that can happen in very few cases if you make a change to ferm rules in base.. just the scale of it [17:21:47] that's why i announced it earlier. running puppet another time. 
rescheduling icinga check if it does like in this one case here [17:21:49] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1028 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:21:49] RECOVERY - Check systemd state on ms-be1028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:56] and there it is fixed again [17:27:01] (03PS2) 10Cwhite: profile: ecs indices to use a weekly rotation [puppet] - 10https://gerrit.wikimedia.org/r/657371 (https://phabricator.wikimedia.org/T234565) [17:27:16] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:27:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:28:26] (03PS20) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:30:14] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2321.codfw.wmnet'] ` an... 
[17:32:32] (03PS21) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:33:27] RECOVERY - Ensure local MW versions match expected deployment on deploy2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [17:34:03] (03PS4) 10Effie Mouzeli: service_proxy: add ipv6 config option on services_proxy config [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) [17:34:19] (03CR) 10Cwhite: [C: 03+2] profile: ecs indices to use a weekly rotation [puppet] - 10https://gerrit.wikimedia.org/r/657371 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:34:51] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:35:41] RECOVERY - Ensure local MW versions match expected deployment on deploy1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [17:36:25] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:37:07] 10SRE, 10DBA, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10thcipriani) [17:38:36] (03PS5) 10Effie Mouzeli: service_proxy: add ipv6 config option on services_proxy config [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) [17:39:06] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10wiki_willy) Hi @elukey - in looking through Netbox and talking to Chris, this is what I'm thinking, but @Cmjohnson/@Jclark-ctr/@elukey - please call me 
out... [17:39:34] 10SRE, 10ops-eqiad: sdg1 failed on ms-be1054 - https://phabricator.wikimedia.org/T269556 (10Cmjohnson) A ticket has been opened with HPE 5353126225 [17:40:34] 10SRE, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Cmjohnson) [17:40:37] 10SRE, 10ops-eqiad, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Hardware): Move or recable labstore1004 to 10Gbps rack (if needed) and ethernet - https://phabricator.wikimedia.org/T266202 (10Cmjohnson) 05Open→03Resolved [17:40:59] 10SRE, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Cmjohnson) [17:41:44] 10SRE, 10ops-eqiad, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Hardware): Move labstore1005 to 10Gbps rack and ethernet - https://phabricator.wikimedia.org/T266199 (10Cmjohnson) 05Open→03Resolved This has been completed [17:42:00] 10SRE, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Cmjohnson) 05Open→03Resolved The move has taken place, if you have work to do outside of data center, please re-open and rem... [17:42:51] (03PS1) 10Cwhite: profile: define required curator config variables [puppet] - 10https://gerrit.wikimedia.org/r/658646 [17:43:06] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Cmjohnson) @Bstorm I do have the cross-over cable on-site. Is it okay to just connect o... 
[17:43:13] (03PS6) 10Effie Mouzeli: service_proxy: add ipv6 config option on services_proxy config [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) [17:44:14] (03CR) 10Giuseppe Lavagetto: Add support for php deployments (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [17:44:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1409.eqiad.wmnet with reason: REIMAGE [17:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:30] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1408.eqiad.wmnet with reason: REIMAGE [17:44:30] (03PS4) 10Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 [17:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:09] (03CR) 10Cwhite: [C: 03+2] profile: define required curator config variables [puppet] - 10https://gerrit.wikimedia.org/r/658646 (owner: 10Cwhite) [17:46:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service_proxy: add ipv6 config option on services_proxy config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli) [17:46:19] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1409.eqiad.wmnet with reason: REIMAGE [17:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1408.eqiad.wmnet with reason: REIMAGE [17:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:50] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - 
https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1337.eqiad.wmnet'] ` an... [17:55:53] 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10JTannerWMF) Hi there sorry for the delayed response, for the sake of consistency JTanner works. [18:00:05] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T1800). [18:05:05] (03PS1) 10Ahmon Dancy: testwikis wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658647 [18:05:07] (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658647 (owner: 10Ahmon Dancy) [18:06:09] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1409.eqiad.wmnet'] ` an... [18:06:49] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658647 (owner: 10Ahmon Dancy) [18:07:02] !log dancy@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.28 [18:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:42] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1408.eqiad.wmnet'] ` an... 
[18:11:20] (03PS22) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [18:15:08] !log installing sudo security updates on Buster [18:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:03] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming after rebuild [18:22:04] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming after rebuild [18:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:02] (03CR) 10Elukey: [V: 03+1 C: 03+2] Refactor Discovery's analytics airflow to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) (owner: 10Elukey) [18:26:00] I've gotten reports of Cyberbot malfunctioning globally over the last few weeks. I suspected a transient issue, but it has not gone away. I looked at the logs today and I see an enormous number of Error: 502, Server Hangup at 2021-01-26 14:21:35 GMT [18:28:37] A lot of other requests are simply 0-byte responses. [18:29:39] !help [18:29:39] want docs? ask for "!wm-bot". all keywords? try "@regsearch .*" [18:30:30] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:15] legoktm do I ping you for this? [18:32:13] hi [18:32:26] Hi. It's been a while. :-) [18:32:30] Cyberpower678: can you file a bug please? and tag #SRE [18:32:36] Sure [18:32:54] if you could include what API requests you're making that would help too [18:33:56] I have a lot of juicy details for you. 
:-) [18:34:56] (03PS16) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) [18:36:08] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:36:21] Cyberpower678: yeah, please include particular API calls, and also originating IP address, if you can [18:37:34] Cyberpower678: thanks, I just got into a meeting so I'll look as soon as I'm done [18:37:50] !log installing sudo security updates on Stretch [18:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:56] 10SRE, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) [18:39:11] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Bstorm) It should be ok to just connect any time as long as the primary link is ok. [18:39:59] Cyberpower678: dumb question, does Cyberbot run on WMCS? [18:41:24] what User-Agent does it set? [18:43:53] cdanis yes it does, and I don't remember what the UA is. [18:44:02] Cyberbot is quite old [18:44:05] 10SRE, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10CDanis) What is the originating IP address of these requests? What User-Agent is sent by Cyberbot? [18:44:13] ah [18:44:38] But the originating IP is contained in some of the 502 HTML responses. 
[18:46:59] !log dancy@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.28 (duration: 40m 09s) [18:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:37] it seems that the User-Agent being sent is "Peachy MediaWiki Bot API Version 2.0 (alpha 8)". It would be nice to have that updated to be compliant with https://meta.wikimedia.org/wiki/User-Agent_policy [18:52:26] 10SRE, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) The IP 172.16.2.21, I don't know what the UA is. The bot is pretty old. [18:53:40] cdanis: it originates from a framework I no longer maintain. The bot is old, and whenever I get free time, I am slowly coding up replacements for the existing bot tasks. [18:55:57] I see [18:57:09] !log uploaded sudo 1.8.10p3-1+deb8u7+wmf1 to apt.wikimedia.org [18:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:15] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq... [18:58:03] !log installing sudo security updates on Jessie [18:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T1900) [19:05:55] 10SRE, 10Traffic, 10serviceops, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10CDanis) It seems the User-Agent being used is `Peachy MediaWiki Bot API Version 2.0 (alpha 8)` (which ideally should... 
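Editorial aside on the User-Agent point raised at 18:51:37: the following is a minimal sketch, not Cyberbot's or Peachy's actual code, of how a bot might assemble a more descriptive User-Agent header. It assumes the commonly cited shape "client/version (contact) library/version"; the names `build_user_agent`, `make_api_request`, `ExampleBot`, and the example.org contact details are all hypothetical, and the request is only constructed, never sent.

```python
import urllib.request


def build_user_agent(client, version, contact, library=None):
    """Return '<client>/<version> (<contact>)', optionally followed by a
    '<library>/<version>' token. All arguments are illustrative placeholders."""
    user_agent = f"{client}/{version} ({contact})"
    if library:
        user_agent += f" {library}"
    return user_agent


def make_api_request(url, user_agent):
    """Build (but do not send) a request that carries the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical bot identity; replace with real contact details before use.
ua = build_user_agent(
    "ExampleBot", "1.0", "https://example.org/bot; bot@example.org", "Peachy/2.0"
)
request = make_api_request("https://en.wikipedia.org/w/api.php", ua)
print(ua)
```

The point of the contact portion is that operators who see failing or abusive traffic can identify and reach the bot owner instead of guessing from an IP address.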
[19:06:44] (03PS1) 10Andrew Bogott: OpenStack: add config files for openstack Train [puppet] - 10https://gerrit.wikimedia.org/r/658652 (https://phabricator.wikimedia.org/T261135) [19:06:46] (03PS1) 10Andrew Bogott: Neutron: forward our dmz hacks from version Stein to Train [puppet] - 10https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135) [19:06:48] (03PS1) 10Andrew Bogott: Nova: forward our server name regex hack to version Train [puppet] - 10https://gerrit.wikimedia.org/r/658654 (https://phabricator.wikimedia.org/T261135) [19:06:50] (03PS1) 10Andrew Bogott: Nova: Very minor config update for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658655 (https://phabricator.wikimedia.org/T261135) [19:06:52] 10SRE, 10Traffic, 10serviceops, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Legoktm) [19:06:55] (03PS1) 10Andrew Bogott: Add manifests for Neutron version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658656 (https://phabricator.wikimedia.org/T261135) [19:06:57] (03PS1) 10Andrew Bogott: Add manifests for Cinder version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658657 (https://phabricator.wikimedia.org/T261135) [19:06:59] (03PS1) 10Andrew Bogott: Add manifest for Glance version Train [puppet] - 10https://gerrit.wikimedia.org/r/658658 (https://phabricator.wikimedia.org/T261135) [19:07:01] (03PS1) 10Andrew Bogott: Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) [19:07:03] (03PS1) 10Andrew Bogott: Add manifests for Nova version Train [puppet] - 10https://gerrit.wikimedia.org/r/658660 (https://phabricator.wikimedia.org/T261135) [19:07:05] (03PS1) 10Andrew Bogott: Add manifest for Barbican version Train [puppet] - 10https://gerrit.wikimedia.org/r/658661 (https://phabricator.wikimedia.org/T261135) [19:07:07] (03PS1) 
10Andrew Bogott: Add OpenStack client package manifests for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658662 (https://phabricator.wikimedia.org/T261135) [19:09:37] Cyberpower678: does Cyberbot have rate limiting built in? [19:10:05] (03CR) 10jerkins-bot: [V: 04-1] Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [19:12:32] It follows maxlag [19:13:32] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2321.codfw.wmnet [19:13:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:18] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2317.codfw.wmnet with reason: REIMAGE [19:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:28] PROBLEM - mediawiki-installation DSH group on mw2321 is CRITICAL: Host mw2321 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:18:20] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2317.codfw.wmnet with reason: REIMAGE [19:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:42] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2321 is CRITICAL: Host mw2321 is not in mediawiki-installation dsh group daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:19:52] (03CR) 10Andrew Bogott: "removing jenkin's down-vote. 
It's because the linter doesn't like us including network::constants and I'm not going to refactor that toda" [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [19:20:38] PROBLEM - Ensure local MW versions match expected deployment on mw1337 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:21:32] PROBLEM - Ensure local MW versions match expected deployment on mw2321 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:22:22] PROBLEM - Ensure local MW versions match expected deployment on deploy2002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:22:56] ^ they need a scap pull.. doing that [19:25:52] RECOVERY - Ensure local MW versions match expected deployment on mw1337 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:25:58] 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) a:03Dzahn [19:26:28] (03Abandoned) 10Jforrester: Restore hide link when viewing single AbuseLog entries [extensions/AbuseFilter] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655943 (https://phabricator.wikimedia.org/T271667) (owner: 10DannyS712) [19:26:44] RECOVERY - Ensure local MW versions match expected deployment on mw2321 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:27:21] 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) It already does the things suggested by Urbanecm, just that the Icinga check isn't running every 5 minutes but just every couple hours, i think. 
[19:27:38] RECOVERY - Ensure local MW versions match expected deployment on deploy2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [19:31:09] (03PS1) 10Andrew Bogott: toolforge: pin sudo_ldap to a newer package version [puppet] - 10https://gerrit.wikimedia.org/r/658666 [19:31:32] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:33:25] (03PS2) 10Legoktm: admin: Add mewoph to list of privledged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/658469 (https://phabricator.wikimedia.org/T272912) [19:34:05] (03CR) 10Legoktm: [C: 03+2] admin: Add mewoph to list of privledged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/658469 (https://phabricator.wikimedia.org/T272912) (owner: 10Legoktm) [19:34:12] 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) 05Open→03Resolved ` 19:25 <+icinga-wm> RECOVERY - Ensure local MW versions match expected deployment on mw1337 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Applicatio... [19:34:35] (03PS2) 10Andrew Bogott: toolforge: pin sudo_ldap to a newer package version [puppet] - 10https://gerrit.wikimedia.org/r/658666 [19:36:08] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/658666 (owner: 10Andrew Bogott) [19:37:45] (03CR) 10Andrew Bogott: [C: 03+2] toolforge: pin sudo_ldap to a newer package version [puppet] - 10https://gerrit.wikimedia.org/r/658666 (owner: 10Andrew Bogott) [19:37:58] PROBLEM - mediawiki-installation DSH group on mw1337 is CRITICAL: Host mw1337 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:40:08] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for MewOphaswongse - https://phabricator.wikimedia.org/T272912 (10Legoktm) 05Open→03Resolved You're now in the `wmf` group: https://ldap.toolforge.org/user/mewoph [19:40:27] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1337 is CRITICAL: Host mw1337 is not in mediawiki-installation dsh group daniel_zahn reimage https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [19:40:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:41:34] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2317.codfw.wmnet'] ` an... [19:42:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:42:40] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1337.eqiad.wmnet [19:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:31] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1409.eqiad.wmnet [19:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:14] 10SRE, 10ops-codfw: codfw: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268749 (10Papaul) Row B complete [19:44:40] PROBLEM - Ensure local MW versions match expected deployment on deploy1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:49:23] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw2317.codfw.wmnet [19:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:13] (03PS2) 10Andrew Bogott: Nova: forward our server name regex hack to version Train [puppet] - 10https://gerrit.wikimedia.org/r/658654 (https://phabricator.wikimedia.org/T261135) [19:51:15] (03PS2) 10Andrew Bogott: Nova: Very minor config update for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658655 (https://phabricator.wikimedia.org/T261135) [19:51:17] (03PS2) 10Andrew Bogott: Add manifests for Neutron version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658656 (https://phabricator.wikimedia.org/T261135) [19:51:19] (03PS2) 10Andrew Bogott: Add manifests for Cinder version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658657 (https://phabricator.wikimedia.org/T261135) [19:51:21] (03PS2) 10Andrew Bogott: Add manifest for Glance version Train [puppet] - 10https://gerrit.wikimedia.org/r/658658 (https://phabricator.wikimedia.org/T261135) [19:51:23] (03PS2) 10Andrew Bogott: Add manifests for Keystone version Train [puppet] - 
10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) [19:51:25] (03PS2) 10Andrew Bogott: Add manifests for Nova version Train [puppet] - 10https://gerrit.wikimedia.org/r/658660 (https://phabricator.wikimedia.org/T261135) [19:51:27] (03PS2) 10Andrew Bogott: Add manifest for Barbican version Train [puppet] - 10https://gerrit.wikimedia.org/r/658661 (https://phabricator.wikimedia.org/T261135) [19:51:29] (03PS2) 10Andrew Bogott: Add OpenStack client package manifests for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658662 (https://phabricator.wikimedia.org/T261135) [19:51:31] (03PS2) 10Andrew Bogott: Neutron: forward our dmz hacks from version Stein to Train [puppet] - 10https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135) [19:53:05] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw2317.codfw.wmnet [19:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:22] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1408.eqiad.wmnet [19:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:50] (03CR) 10jerkins-bot: [V: 04-1] Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:00:04] dancy and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - American Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T2000). [20:00:49] (here, but also in a pairing session shortly. please ping if i can be of use.) 
[20:01:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:01:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1408.eqiad.wmnet [20:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:03:31] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1409.eqiad.wmnet [20:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:48] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:09:26] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:11:50] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack: add config files for openstack Train [puppet] - 10https://gerrit.wikimedia.org/r/658652 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:12:07] (03CR) 10Andrew Bogott: [C: 03+2] Nova: forward our server name regex hack to version Train [puppet] - 10https://gerrit.wikimedia.org/r/658654 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:12:16] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2321.codfw.wmnet [20:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:29] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:13:01] (03CR) 10Andrew Bogott: [C: 03+2] Add OpenStack client package manifests for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658662 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:13:02] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2321.codfw.wmnet [20:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:07] (03CR) 10Andrew Bogott: [C: 03+2] Add manifest for Barbican version Train [puppet] - 10https://gerrit.wikimedia.org/r/658661 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:13:15] (03CR) 10Andrew Bogott: [C: 03+2] Add manifest for Glance version Train [puppet] - 10https://gerrit.wikimedia.org/r/658658 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:13:19] (03CR) 10Andrew Bogott: [C: 03+2] Add manifests for Nova version Train [puppet] - 10https://gerrit.wikimedia.org/r/658660 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:13:24] (03CR) 10Andrew Bogott: [C: 03+2] Add manifests for Neutron version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658656 
(https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:13:29] (03CR) 10Andrew Bogott: [C: 03+2] Add manifests for Cinder version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658657 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:13:36] (03CR) 10Andrew Bogott: [C: 03+2] Nova: Very minor config update for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658655 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:14:30] (03PS3) 10Andrew Bogott: Nova: Very minor config update for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658655 (https://phabricator.wikimedia.org/T261135) [20:14:49] (03PS3) 10Andrew Bogott: Add manifests for Cinder version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658657 (https://phabricator.wikimedia.org/T261135) [20:14:57] (03PS3) 10Andrew Bogott: Add manifests for Nova version Train [puppet] - 10https://gerrit.wikimedia.org/r/658660 (https://phabricator.wikimedia.org/T261135) [20:15:09] (03PS3) 10Andrew Bogott: Add manifests for Neutron version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658656 (https://phabricator.wikimedia.org/T261135) [20:15:18] (03PS3) 10Andrew Bogott: Add manifest for Glance version Train [puppet] - 10https://gerrit.wikimedia.org/r/658658 (https://phabricator.wikimedia.org/T261135) [20:15:29] (03PS3) 10Andrew Bogott: Add manifest for Barbican version Train [puppet] - 10https://gerrit.wikimedia.org/r/658661 (https://phabricator.wikimedia.org/T261135) [20:15:40] (03PS3) 10Andrew Bogott: Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) [20:15:51] (03PS3) 10Andrew Bogott: Nova: forward our server name regex hack to version Train [puppet] - 10https://gerrit.wikimedia.org/r/658654 (https://phabricator.wikimedia.org/T261135) [20:15:57] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 
10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:16:01] (03PS3) 10Andrew Bogott: Add OpenStack client package manifests for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658662 (https://phabricator.wikimedia.org/T261135) [20:16:12] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:16:36] (03CR) 10jerkins-bot: [V: 04-1] Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:17:11] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott) [20:17:13] (03CR) 10Ryan Kemper: "(reaching out to service-ops to try to find a reviewer)" [puppet] - 10https://gerrit.wikimedia.org/r/657913 (owner: 10Ryan Kemper) [20:17:23] RECOVERY - mediawiki-installation DSH group on mw2321 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [20:17:53] (03Abandoned) 10Ryan Kemper: udev_reload missing trailing sudo [puppet] - 10https://gerrit.wikimedia.org/r/634390 (owner: 10Ryan Kemper) [20:19:35] RECOVERY - Ensure local MW versions match expected deployment on deploy1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers [20:20:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:20:32] 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Urbanecm) 05Resolved→03Open Boldly reopening this. >>! In T272967#6778204, @Dzahn wrote: > It already does the things suggested by Urbanecm, just that the Icinga check isn't running every 5 minutes... [20:20:38] (03CR) 10Dzahn: [C: 03+1] "This looks good. But before you merge it you need to create the certificate for it or puppet/envoy will fail." [puppet] - 10https://gerrit.wikimedia.org/r/657913 (owner: 10Ryan Kemper) [20:22:13] 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) a:05Dzahn→03None [20:22:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:22:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1411.eqiad.wmnet with reason: REIMAGE [20:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:34] 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) also see T218412 [20:24:56] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1411.eqiad.wmnet with reason: REIMAGE [20:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:01] (03PS4) 10Ryan Kemper: Decommission relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) [20:26:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1413.eqiad.wmnet with reason: REIMAGE [20:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:11] 10SRE, 10Sustainability: Alerts for drifts in 
/srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) Probably just that downtimes expired 8 hours later. I think there is not much to fix here besides "remember to scap pull" and the other part about MW versions would be T272967 [20:28:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1413.eqiad.wmnet with reason: REIMAGE [20:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:14] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:31:15] 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Urbanecm) >>! In T272967#6778314, @Dzahn wrote: > Probably just that downtimes expired 8 hours later. > > I think there is not much to fix here besides "remember to scap pull" and the other part about M... [20:31:54] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:33:47] (03PS1) 10Legoktm: admin: Add jaz to list of privledged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/658672 (https://phabricator.wikimedia.org/T272522) [20:34:40] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:39] (03CR) 10Legoktm: [C: 03+2] admin: Add jaz to list of privledged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/658672 (https://phabricator.wikimedia.org/T272522) (owner: 10Legoktm) [20:36:00] !log T272444 (Decommission relforge100[1,2]) Downtimed `relforge100[1,2]` in Icinga cookbook for the next 26 hours [20:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:04] T272444: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444 [20:36:30] (03CR) 10Ryan Kemper: [C: 03+2] Decommission relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) (owner: 10Ryan Kemper) [20:37:47] !log T272444 (Decommission relforge100[1,2]) Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/657453 prior to running decom cookbook [20:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:13] (03PS1) 10Ottomata: Declare 5 NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658673 (https://phabricator.wikimedia.org/T271208) [20:38:32] (03CR) 10Bstorm: [C: 03+2] data-services: apply user variances to future creations [puppet] - 10https://gerrit.wikimedia.org/r/657890 (https://phabricator.wikimedia.org/T269399) (owner: 10Bstorm) [20:38:43] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10Legoktm) 05Open→03Resolved a:05JTannerWMF→03Legoktm Done, you're now a member of the `wmf` group: https://ldap.toolforge.org/user/jaz. A tip that if you're... 
[20:39:01] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission [20:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:39] (03PS1) 10Ahmon Dancy: group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658674 [20:39:41] (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658674 (owner: 10Ahmon Dancy) [20:39:48] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) [20:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:17] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission [20:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:20] (03PS2) 10Ottomata: Declare 5 NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658673 (https://phabricator.wikimedia.org/T271208) [20:40:26] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658674 (owner: 10Ahmon Dancy) [20:40:27] 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) I don't know. I think the answer to that question would best be found on T218412. [20:40:29] !log T272444 (Decommission relforge100[1,2]) Beginning decommission of `relforge1001`: `sudo -i cookbook sre.hosts.decommission relforge1001.eqiad.wmnet -t T272444` [20:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:21] !log dancy@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.28 [20:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:52] dancy: brennen i have a config change to sync, only affects testwiki. ok to sync or is train in progress? [20:43:10] ottomata: train is in progress, i believe. 
[20:43:12] group0 rollout just finished. [20:43:26] ok, dancy let me know when it is clear to sync [20:43:42] dancy: i see a maybe blocker [20:43:48] ottomata: Go for it. [20:43:53] oh k [20:43:53] MediumSpecificBagOStuff:1094 PHP Notice: Undefined offset: 1 [20:43:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1337.eqiad.wmnet with reason: REIMAGE [20:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:04] nod. seeing lots of those now.. [20:44:09] shall I wait? [20:44:12] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1338.eqiad.wmnet with reason: REIMAGE [20:44:14] yes please. [20:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:16] k [20:44:25] I'm going to roll back now. [20:45:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1411.eqiad.wmnet'] ` an... 
[20:46:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1337.eqiad.wmnet with reason: REIMAGE [20:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:08] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1411.eqiad.wmnet [20:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:13] (03PS1) 10Ahmon Dancy: Rollback group0 to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658676 [20:47:31] (03CR) 10Ahmon Dancy: [C: 03+2] Rollback group0 to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658676 (owner: 10Ahmon Dancy) [20:48:04] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1338.eqiad.wmnet with reason: REIMAGE [20:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:19] (03Merged) 10jenkins-bot: Rollback group0 to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658676 (owner: 10Ahmon Dancy) [20:49:19] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1413.eqiad.wmnet'] ` an... 
[20:49:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:50:06] !log dancy@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided) [20:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:20] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1411.eqiad.wmnet [20:50:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:37] !log group0 rolled back to 1.36.0-wmf.27 [20:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2308.codfw.wmnet with reason: REIMAGE [20:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:15] ottomata: Your turn. Lemme know when you're done. I'll file a complaint about MW errors in the meantime [20:51:21] k [20:51:23] 10SRE, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/mailman/listinfo/wikics-l has no CSS styling due to 404 URLs; Cannot subscribe due to token error - https://phabricator.wikimedia.org/T272969 (10Legoktm) 05Open→03Resolved p:05Triage→03Medium a:03Legoktm I reset the HTML theme to the Wikim... 
[20:51:30] (03CR) 10Ottomata: [C: 03+2] Declare 5 NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658673 (https://phabricator.wikimedia.org/T271208) (owner: 10Ottomata) [20:51:58] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [20:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:12] !log T272444 (Decommission relforge100[1,2]) Beginning decommission of `relforge1002`: `sudo -i cookbook sre.hosts.decommission relforge1002.eqiad.wmnet -t T272444` [20:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:15] T272444: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444 [20:52:17] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:52:20] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission [20:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:55] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1413.eqiad.wmnet [20:52:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2308.codfw.wmnet with reason: REIMAGE [20:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:33] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate 5 NavigationTiming schemas to Event Platform on testwiki - T271208 (duration: 01m 17s) [20:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:36] T271208: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208 [20:55:48] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1413.eqiad.wmnet [20:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:57] (03CR) 10Krinkle: Declare 5 NavigationTiming eventlogging streams and migrate on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658673 (https://phabricator.wikimedia.org/T271208) (owner: 10Ottomata) [20:56:08] dancy: done [20:56:15] thx! [20:56:19] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:57:20] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@2662ca2]: ship hourly link recommendations [20:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:30] 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10Ottomata) @ppelberg, this ticket needs to be done before you can access data via Presto. You don't need ssh access, but you do need to be in the `analytics-privatedata-users` group, which requires the sa... [21:04:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:05:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:05:50] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@2662ca2]: ship hourly link recommendations (duration: 08m 30s) [21:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:58] !log restart airflow-scheduler and airflow-webserver on an-airflow1001 post-deploy [21:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:09] (03PS1) 10Subramanya Sastry: Parsoid Testing: Switch rt/vd server db hosts to localhost [puppet] - 10https://gerrit.wikimedia.org/r/658679 (https://phabricator.wikimedia.org/T266509) [21:11:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2306.codfw.wmnet with reason: REIMAGE [21:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:17] 10SRE, 10Traffic, 10serviceops, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses 
when querying the API - https://phabricator.wikimedia.org/T273003 (10Legoktm) Have you made any changes to the bot recently? `lang=irc 11:09:36 Cyberpower678: does Cyberbot... [21:13:27] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2306.codfw.wmnet with reason: REIMAGE [21:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:51] maybe I'm blind, but I don't see any option to change the priority on https://phabricator.wikimedia.org/T273003 [21:14:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2308.codfw.wmnet'] ` an... [21:15:08] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [21:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2304.codfw.wmnet with reason: REIMAGE [21:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:28] 10SRE, 10Traffic, 10serviceops, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) No changes have been made to the bot whatsoever. I believe it only does maxlag on write requests, li... 
[21:15:40] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27680/" [puppet] - 10https://gerrit.wikimedia.org/r/658679 (https://phabricator.wikimedia.org/T266509) (owner: 10Subramanya Sastry) [21:17:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2304.codfw.wmnet with reason: REIMAGE [21:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:41] (03CR) 10Dzahn: "deployed on testreduce1001. puppet also triggered a refresh of the parsoid-rt service" [puppet] - 10https://gerrit.wikimedia.org/r/658679 (https://phabricator.wikimedia.org/T266509) (owner: 10Subramanya Sastry) [21:19:02] legoktm: I wonder if that missing phab priority setting is an artifact of it being marked "production error" somehow? [21:19:21] legoktm: you're not blind [21:19:25] also, this is not a "production error" report but meh [21:19:30] bd808: that form was changed today [21:19:35] for some reason it uses the "agile" story points thing instead [21:19:51] https://phabricator.wikimedia.org/T240343 [21:20:09] twentyafterfour: ^ [21:20:16] That seems to have broke something [21:20:57] okay [21:21:04] I edited the form to make Priority visible [21:21:24] 10SRE, 10Traffic, 10serviceops, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Legoktm) p:05Triage→03Medium [21:21:27] thanks for the pointer RhinosF1|NotHere [21:23:00] legoktm: np [21:27:52] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@a87a69a]: correct alter table syntax to create wbitem table [21:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:08] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2308.codfw.wmnet [21:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:45] 10SRE, 10serviceops, 
10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1338.eqiad.wmnet'] ` an... [21:29:09] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2308.codfw.wmnet [21:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:49] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1337.eqiad.wmnet'] ` an... [21:29:56] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[21:31:02] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@a87a69a]: correct alter table syntax to create wbitem table (duration: 03m 09s) [21:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:09] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:32:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1337.eqiad.wmnet [21:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:13] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw13388.eqiad.wmnet [21:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:22] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1338.eqiad.wmnet [21:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:08] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1337.eqiad.wmnet [21:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:58] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2306.codfw.wmnet'] ` an... [21:35:04] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) >>! In T179696#6776832, @Joe wrote: > Looking at the logs from a failed run, it looks like no retry is attempted when a 504 is received... [21:36:37] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:35] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq... [21:37:39] ACKNOWLEDGEMENT - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T179696 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:38:58] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2304.codfw.wmnet'] ` an... [21:40:34] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1338.eqiad.wmnet [21:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:49] (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/658682 [21:45:11] (03PS1) 10Ryan Kemper: relforge: remove decommed relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/658683 (https://phabricator.wikimedia.org/T272444) [21:48:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2302.codfw.wmnet with reason: REIMAGE [21:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:12] (03PS1) 10Legoktm: Allow talking to the registry over HTTP [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/658684 (https://phabricator.wikimedia.org/T179696) [21:50:21] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2306.codfw.wmnet [21:50:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:31] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2304.codfw.wmnet [21:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:59] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2302.codfw.wmnet with reason: REIMAGE [21:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2304.codfw.wmnet [21:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:19] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2306.codfw.wmnet [21:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:38] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2300.codfw.wmnet with reason: REIMAGE [21:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:57:23] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[21:58:40] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2300.codfw.wmnet with reason: REIMAGE [21:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:16] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:02:08] (03CR) 10Ryan Kemper: [C: 03+2] relforge: remove decommed relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/658683 (https://phabricator.wikimedia.org/T272444) (owner: 10Ryan Kemper) [22:02:47] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:07:00] 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10ppelberg) @Joe, @elukey, @Ottomata and @jcrespo: thank you responding as helpfully and responsively as y'all did...I'm sorry I left you wondering for as long as I have. Responses to, what I think are, th... 
[22:08:39] (03PS5) 10Legoktm: mediawiki: Port nrpe_check_opcache to Python [puppet] - 10https://gerrit.wikimedia.org/r/657677 [22:09:31] (03CR) 10Legoktm: mediawiki: Port nrpe_check_opcache to Python (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657677 (owner: 10Legoktm) [22:10:19] (03CR) 10Dzahn: [C: 03+2] "openssl x509 -text -noout -in wdqs-internal.discovery.wmnet.crt | grep DNS" [puppet] - 10https://gerrit.wikimedia.org/r/658548 (https://phabricator.wikimedia.org/T272713) (owner: 10Ryan Kemper) [22:10:56] (03CR) 10Dzahn: [V: 03+2 C: 03+2] wdqs: add dummy key for new wdqs-internal cert [labs/private] - 10https://gerrit.wikimedia.org/r/658550 (https://phabricator.wikimedia.org/T272713) (owner: 10Ryan Kemper) [22:12:28] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2302.codfw.wmnet'] ` an... 
[22:15:45] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2299.codfw.wmnet with reason: REIMAGE [22:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:24] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2298.codfw.wmnet with reason: REIMAGE [22:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2299.codfw.wmnet with reason: REIMAGE [22:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:29] 10ops-eqiad, 10decommission-hardware, 10Discovery-Search (Current work), 10Patch-For-Review: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444 (10RKemper) [22:18:54] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/27682/wdqs1008.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/657913 (owner: 10Ryan Kemper) [22:19:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2298.codfw.wmnet with reason: REIMAGE [22:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:18] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2300.codfw.wmnet'] ` an... 
[22:21:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2297.codfw.wmnet with reason: REIMAGE [22:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:04] (03CR) 10Dzahn: [C: 03+1] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/658548 and https://gerrit.wikimedia.org/r/c/labs/private/+/658550 are deployed.. a" [puppet] - 10https://gerrit.wikimedia.org/r/657913 (owner: 10Ryan Kemper) [22:22:06] (03PS3) 10Legoktm: openldap: Convert cross-validate-accounts to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658455 [22:23:37] (03CR) 10jerkins-bot: [V: 04-1] openldap: Convert cross-validate-accounts to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658455 (owner: 10Legoktm) [22:23:55] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2297.codfw.wmnet with reason: REIMAGE [22:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1264.eqiad.wmnet with reason: REIMAGE [22:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1264.eqiad.wmnet with reason: REIMAGE [22:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:47] (03PS4) 10Legoktm: openldap: Convert cross-validate-accounts to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658455 [22:27:07] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw2300.codfw.wmnet [22:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:40] (03PS1) 10Ppchelko: CacheTime: Extra protection for rollback unserialization [core] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658688 (https://phabricator.wikimedia.org/T273007) [22:28:47] PROBLEM - Prometheus jobs reduced 
availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:30:01] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw2300.codfw.wmnet [22:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:34:03] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@a276626]: correct execution_date_fn in ores_predictions_hourly [22:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:10] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@a276626]: correct execution_date_fn in ores_predictions_hourly (duration: 01m 07s) [22:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:19] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2298.codfw.wmnet'] ` an... [22:40:56] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2299.codfw.wmnet'] ` an... 
[22:45:51] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2297.codfw.wmnet'] ` an... [22:47:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:49:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:11:46] (03PS1) 10Dzahn: add certificate for testreduce.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/658695 (https://phabricator.wikimedia.org/T266509) [23:12:02] (03PS1) 10Dzahn: add fake cert for testreduce.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/658696 (https://phabricator.wikimedia.org/T266509) [23:12:15] (03CR) 10Dzahn: [C: 03+2] add certificate for testreduce.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/658695 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn) [23:15:50] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake cert for testreduce.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/658696 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn) [23:17:19] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1264.eqiad.wmnet'] ` an... 
[23:20:38] 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Legoktm) Running `scap pull` after reimaging will definitely take care of the immediate problem. I read through {T218412} and it looks like that's mostly what's being asked for in this ticket, figuring... [23:23:38] (03PS5) 10Legoktm: openldap: Convert cross-validate-accounts to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658455 [23:24:37] (03CR) 10Legoktm: [V: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/658455 (owner: 10Legoktm) [23:25:22] (03CR) 10Volans: "quick reply inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [23:25:24] (03PS2) 10Dzahn: ATS: re-add config for parsoid-rt-tests.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/654351 (https://phabricator.wikimedia.org/T266509) [23:26:07] (03CR) 10Dzahn: [C: 03+2] "certificate created" [puppet] - 10https://gerrit.wikimedia.org/r/654351 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn) [23:30:24] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2302.codfw.wmnet [23:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2299.codfw.wmnet [23:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:26] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2298.codfw.wmnet [23:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:47] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1264.eqiad.wmnet [23:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:03] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2297.codfw.wmnet [23:32:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:43] (03CR) 10Ahmon Dancy: [C: 03+1] "Thanks Ppchelko!" [core] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658688 (https://phabricator.wikimedia.org/T273007) (owner: 10Ppchelko) [23:37:24] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1264.eqiad.wmnet [23:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:39] one more canary server in eqiad is buster now ^ [23:38:06] trying to match canaries with reality [23:38:21] Nice [23:40:14] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2302.codfw.wmnet [23:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:30] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2298.codfw.wmnet [23:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:00] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2297.codfw.wmnet [23:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:16] 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10DannyH) I approve Peter for this, thanks. 
[23:54:16] (03PS1) 10Dzahn: add testreduce.discovery.wmnet, point to testreduce1001 [dns] - 10https://gerrit.wikimedia.org/r/658701 (https://phabricator.wikimedia.org/T266509) [23:55:46] 10SRE, 10Parsoid-Tests, 10serviceops, 10Parsoid (Tracking), 10Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (10Dzahn) [23:56:17] (03CR) 10Dzahn: [C: 03+2] add testreduce.discovery.wmnet, point to testreduce1001 [dns] - 10https://gerrit.wikimedia.org/r/658701 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn) [23:57:25] (03PS4) 10Dzahn: scap: add deploy1002 and deploy2002 to mediawiki hosts [puppet] - 10https://gerrit.wikimedia.org/r/658643 (https://phabricator.wikimedia.org/T265963)