[00:02:47] !log Began Elasticsearch reindex job on index `dewiki_content` across [`eqiad`, `codfw`, `cloudelastic`], on `rkemper@mwmaint1002` under tmux session `reindex`. Should complete in <24 hours [00:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:00] (03PS2) 10Jdlrobson: Enable Vector opt in preference everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614837 (https://phabricator.wikimedia.org/T254228) [00:23:17] (03PS2) 10Jdlrobson: Enable side bar instrumentation at 20% for all test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T254228) [00:26:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:31:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:35:28] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17096976 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:37:20] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 11976 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:19:24] PROBLEM - dump of zarcillo in codfw on icinga1001 is CRITICAL: dump for zarcillo at codfw (2020-07-21 00:57:01) is less than 977 KB: 410480 bytes https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [02:05:38] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.36.0-wmf.1 [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/614913 [02:09:43] (03CR) 10Krinkle: [C: 04-1] Enable side bar instrumentation at 20% for all test wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T254228) (owner: 10Jdlrobson) [03:32:52] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: prometheus5001.eqsin.wmnet, prometheus3001.esams.wmnet, prometheus4001.ulsfo.wmnet, testreduce1001.eqiad.wmnet, aphlict1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [04:05:34] 10Operations, 10Graphoid, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Platform Team (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jseddon) [04:06:03] 10Operations, 10Graphoid, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Platform Team (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jseddon) [04:06:05] 10Operations, 
10Graphoid, 10serviceops, 10Patch-For-Review, 10Platform Team (Icebox): Undeploy graphoid for phase 1 wiki's - https://phabricator.wikimedia.org/T257402 (10Jseddon) 05Open→03Resolved [04:11:42] 10Operations, 10Graphoid, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Platform Team (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jseddon) [04:23:34] 10Operations, 10Graphoid, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Platform Team (Icebox): Undeploy graphoid for phase 2 wiki's - https://phabricator.wikimedia.org/T258463 (10Jseddon) [04:25:25] (03PS1) 10Seddon: Undeploy graphoid for phase 2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614915 (https://phabricator.wikimedia.org/T258463) [04:26:48] 10Operations, 10Graphoid, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid for phase 2 wiki's - https://phabricator.wikimedia.org/T258463 (10Jseddon) [04:30:56] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10mb) This was announced to be fixed by 6 July 2020. It's now been 3 weeks. Is there an ETA for a fix? [05:11:21] (03PS1) 10Marostegui: db1119: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/614917 (https://phabricator.wikimedia.org/T254462) [05:13:42] (03CR) 10Marostegui: [C: 03+2] db1119: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/614917 (https://phabricator.wikimedia.org/T254462) (owner: 10Marostegui) [05:18:57] (03PS2) 10Thcipriani: Branch commit for wmf/1.36.0-wmf.1 [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/614913 (https://phabricator.wikimedia.org/T257969) (owner: 10TrainBranchBot) [06:10:13] (03PS4) 10Kormat: mariadb: Promote es1021 to es4 master. 
[puppet] - 10https://gerrit.wikimedia.org/r/612551 (https://phabricator.wikimedia.org/T257847) [06:23:22] <_joe_> !log systemctl reset-failed on both centrallogs [06:23:23] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:01] RECOVERY - Check systemd state on centrallog2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:26:23] <_joe_> !log enabling notifications for lists1001 [06:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:29] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:53] <_joe_> !log systemctl reset-failed on lists1001, a network interface was failing since 1 month [06:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:23] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P11973 and previous config saved to /var/cache/conftool/dbconfig/20200721-065430-marostegui.json [06:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:48] !log Pool db1119 into enwiki with MCR schema change done - T238966 [06:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:53] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. 
- https://phabricator.wikimedia.org/T238966 [06:54:57] !log kormat@cumin1001 dbctl commit (dc=all): 'Set es1021 to weight 50 T257847', diff saved to https://phabricator.wikimedia.org/P11974 and previous config saved to /var/cache/conftool/dbconfig/20200721-065457-kormat.json [06:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:03] T257847: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 [06:55:56] (03PS1) 10Volans: netbox: skip Ganeti sync on netbox-dev [puppet] - 10https://gerrit.wikimedia.org/r/614921 [06:59:10] (03CR) 10Kormat: [C: 03+2] mariadb: Promote es1021 to es4 master. [puppet] - 10https://gerrit.wikimedia.org/r/612551 (https://phabricator.wikimedia.org/T257847) (owner: 10Kormat) [06:59:51] !log Starting es4 failover from es1020 to es1021 T257847 [06:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] kormat and marostegui: #bothumor I � Unicode. All rise for es4 database master failover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200721T0700). [07:00:11] (03CR) 10Kormat: [C: 03+2] db-eqiad.php: Depool cluster26 (es4) from writes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612559 (https://phabricator.wikimedia.org/T257847) (owner: 10Kormat) [07:00:46] (03CR) 10DCausse: [C: 04-1] "this patch should be used to include and set OAUTH_WIKI_LOGOUT_LINK_PARAM I think" [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [07:01:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool cluster26 (es4) from writes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612559 (https://phabricator.wikimedia.org/T257847) (owner: 10Kormat) [07:03:05] (03CR) 10Volans: "To avoid unnecessary icinga checks and syncs that are failing anyway. 
The removal though will need to be done manually or we just reimage " [puppet] - 10https://gerrit.wikimedia.org/r/614921 (owner: 10Volans) [07:03:07] !log kormat@deploy1001 Synchronized wmf-config/db-eqiad.php: Disable writes to es4 T257847 (duration: 01m 00s) [07:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:39] T257847: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 [07:06:18] <_joe_> !log systemctl reset-failed on deneb, the usual known issue with releng image reporting [07:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:43] 10Operations, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, 10Traffic: Cached thumbnails and originals are sometimes not being purged correctly/quickly - https://phabricator.wikimedia.org/T256313 (10ema) 05Open→03Resolved a:03ema There's been no new report of stale images since the l... [07:07:21] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:07:59] James_F: hey. you around? [07:08:24] we're seeing some writes by a script running on mwmaint1002 to es4 [07:08:59] (03CR) 10Volans: [C: 03+2] "Self-merging to reduce icinga noise, will see how to cleanup shortly." 
[puppet] - 10https://gerrit.wikimedia.org/r/614921 (owner: 10Volans) [07:10:15] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 60 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:11:21] <_joe_> kormat: just kill [07:11:55] _joe_: i have a feeling that HR would be unhappy ;) [07:12:08] <_joe_> I meant the script, we value James_F [07:12:17] oh good :) [07:13:18] lol [07:13:41] !log killing James_F('s script) on mwmaint1002 [07:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:17] <_joe_> ahah [07:15:52] Hey, sorry. [07:15:59] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:16:01] Yeah, kill it. [07:17:05] James_F: np :) [07:18:23] It’s been slowly running for two weeks. Will restart it later. [07:18:50] James_F: can we make it to re-read the config often? [07:18:57] kormat: thanks for keeping the bar high for being human and kind with colleagues [07:19:24] :D [07:19:24] elukey: :D [07:19:31] marostegui: Maybe. It’s just an MW script. [07:19:46] I guess we could make MW re-read from time to time? [07:20:21] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10Joe) given the situation, it would be smart to start compressing older logs. I will not act today but I think adding a simple systemd timer that fires daily and c... 
[07:20:54] James_F: yeah, basically to avoid this kind of issues in the future, especially if it is an emergency failover, the script would keep trying to insert on a broken (read-only) master [07:21:18] <_joe_> James_F: well it's complicated - you'd have to re-fetch etcdconfig [07:21:27] !log kormat@cumin1001 dbctl commit (dc=all): 'Promote es1021 to es4 master T257847', diff saved to https://phabricator.wikimedia.org/P11975 and previous config saved to /var/cache/conftool/dbconfig/20200721-072127-kormat.json [07:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:33] T257847: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 [07:21:48] <_joe_> we've tailored everything to the idea of running requests, which last less than 3 minutes [07:22:07] Yeah. [07:22:20] <_joe_> James_F: it's easier on you if you just have a main script invoke a second script in a loop [07:22:31] <_joe_> like exec(php ...) [07:22:32] Right. 
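Note on the 07:22 suggestion above (a main script invoking a second script in a loop): a minimal sketch of that pattern, assuming the point is that each short-lived child process loads its configuration afresh at startup, so a long job picks up changes such as a new master between batches. The names `run_in_batches`, `demo_command` and `DONE_EXIT_CODE`, and the exit-code convention, are hypothetical and not taken from the log. The child here is a stand-in Python one-liner rather than the real MediaWiki script.

```python
import subprocess
import sys

# Hypothetical exit-code convention for the child script (an assumption, not from the log):
#   0  -> batch processed, more work may remain
#   10 -> nothing left to do
DONE_EXIT_CODE = 10


def run_in_batches(make_command, max_batches):
    """Run one fresh child process per batch; return how many batches succeeded."""
    completed = 0
    for batch in range(max_batches):
        # Each child is a new process, so it reads its configuration from scratch.
        result = subprocess.run(make_command(batch))
        if result.returncode == DONE_EXIT_CODE:
            break
        if result.returncode != 0:
            raise RuntimeError(f"batch {batch} failed with exit code {result.returncode}")
        completed += 1
    return completed


def demo_command(batch):
    # Stand-in child: succeeds for batches 0-2, then reports "done".
    # A real wrapper would build the command line of the actual script here.
    child = f"import sys; sys.exit(0 if {batch} < 3 else {DONE_EXIT_CODE})"
    return [sys.executable, "-c", child]
```

Calling `run_in_batches(demo_command, 10)` runs the stand-in child until it reports completion. The trade-off matches the one voiced in the channel: process startup cost per batch, in exchange for never holding stale configuration for long.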
[07:22:52] !log kormat@cumin1001 dbctl commit (dc=all): 'Depool es1020 from es4 T257847', diff saved to https://phabricator.wikimedia.org/P11976 and previous config saved to /var/cache/conftool/dbconfig/20200721-072251-kormat.json [07:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:57] <_joe_> which ofc is horrible, but also solves a lot of problems for you [07:24:09] PROBLEM - MariaDB read only es4 on es1021 is CRITICAL: CRIT: read_only: False, expected True: OK: Version 10.4.13-MariaDB-log, Uptime 943172s, read_only: True, 239.67 QPS, connection latency: 0.003147s, query latency: 0.000658s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [07:24:55] ^ known, we are in the middle of the maintenance [07:25:15] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to WMCS for nskaggs - https://phabricator.wikimedia.org/T258438 (10Joe) Once we have the sign-off from Birgit we can proceed. [07:25:17] RECOVERY - MariaDB read only es4 on es1021 is OK: Version 10.4.13-MariaDB-log, Uptime 943241s, read_only: False, read_only: True, 252.24 QPS, connection latency: 0.003107s, query latency: 0.000659s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [07:25:19] (03PS1) 10Kormat: Revert "db-eqiad.php: Depool cluster26 (es4) from writes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614783 [07:25:25] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to WMCS for nskaggs - https://phabricator.wikimedia.org/T258438 (10Joe) a:03Joe [07:25:37] (03CR) 10Kormat: [C: 03+2] Revert "db-eqiad.php: Depool cluster26 (es4) from writes." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614783 (owner: 10Kormat) [07:26:28] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool cluster26 (es4) from writes." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/614783 (owner: 10Kormat) [07:29:01] !log kormat@deploy1001 Synchronized wmf-config/db-eqiad.php: Re-enable writes to es4 T257847 (duration: 00m 57s) [07:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:07] T257847: Switchover es4 master from es1020 to es1021 - https://phabricator.wikimedia.org/T257847 [07:30:25] (03PS1) 10Volans: netbox: make HTTP(S) checks use dynamic hostname [puppet] - 10https://gerrit.wikimedia.org/r/615145 [07:31:27] (03CR) 10Kormat: [C: 03+2] wmnet: Update es4-master alias [dns] - 10https://gerrit.wikimedia.org/r/612560 (https://phabricator.wikimedia.org/T257847) (owner: 10Kormat) [07:33:18] (03CR) 10Volans: [C: 03+2] "Compiler looks good, self-merging to fix the checks:" [puppet] - 10https://gerrit.wikimedia.org/r/615145 (owner: 10Volans) [07:34:56] (03CR) 10Giuseppe Lavagetto: "The change LGTM, minus a small correction (see the comment)." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/614847 (https://phabricator.wikimedia.org/T258438) (owner: 10Nskaggs) [07:35:06] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Add nskaggs key and grant access to WMCS related groups [puppet] - 10https://gerrit.wikimedia.org/r/614847 (https://phabricator.wikimedia.org/T258438) (owner: 10Nskaggs) [07:36:23] (03PS4) 10Matěj Suchánek: Update several Wikidata-related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612918 [07:37:44] (03PS3) 10Giuseppe Lavagetto: admin: shell & analytics access for agaduran [puppet] - 10https://gerrit.wikimedia.org/r/613241 (https://phabricator.wikimedia.org/T258214) (owner: 10CDanis) [07:37:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P11977 and previous config saved to /var/cache/conftool/dbconfig/20200721-073757-marostegui.json [07:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:09] (03PS1) 10Kormat: 
es1020: Disable notifications for reimaging. [puppet] - 10https://gerrit.wikimedia.org/r/615147 (https://phabricator.wikimedia.org/T257284) [07:44:03] (03PS1) 10Kormat: install_server: Switch es1020 to buster. [puppet] - 10https://gerrit.wikimedia.org/r/615148 (https://phabricator.wikimedia.org/T257284) [07:45:18] (03CR) 10Marostegui: [C: 03+1] install_server: Switch es1020 to buster. [puppet] - 10https://gerrit.wikimedia.org/r/615148 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [07:45:38] (03CR) 10Marostegui: [C: 03+1] es1020: Disable notifications for reimaging. [puppet] - 10https://gerrit.wikimedia.org/r/615147 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [07:45:53] (03CR) 10Kormat: [C: 03+2] es1020: Disable notifications for reimaging. [puppet] - 10https://gerrit.wikimedia.org/r/615147 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [07:46:21] (03CR) 10Kormat: [C: 03+2] install_server: Switch es1020 to buster. [puppet] - 10https://gerrit.wikimedia.org/r/615148 (https://phabricator.wikimedia.org/T257284) (owner: 10Kormat) [07:48:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 T256685', diff saved to https://phabricator.wikimedia.org/P11978 and previous config saved to /var/cache/conftool/dbconfig/20200721-074843-marostegui.json [07:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:50] T256685: pl_namespace index on pagelinks is unique only in s8 - https://phabricator.wikimedia.org/T256685 [07:49:45] !log Deploy schema change on db1087, lag will appear on s8 (wikidata) on labsdb hosts T256685 [07:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P11979 and previous config saved to /var/cache/conftool/dbconfig/20200721-075233-marostegui.json [07:52:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:37] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move citoid to use TLS only - https://phabricator.wikimedia.org/T255868 (10JMeybohm) p:05Triage→03Medium [07:53:45] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 (10JMeybohm) p:05Triage→03Medium [07:53:52] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-analytics to use TLS only - https://phabricator.wikimedia.org/T255870 (10JMeybohm) p:05Triage→03Medium [07:53:59] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (10JMeybohm) p:05Triage→03Medium [07:54:03] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-logging-external to use TLS only - https://phabricator.wikimedia.org/T255872 (10JMeybohm) p:05Triage→03Medium [07:54:10] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-main to use TLS only - https://phabricator.wikimedia.org/T255873 (10JMeybohm) p:05Triage→03Medium [07:54:15] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventstreams to use TLS only - https://phabricator.wikimedia.org/T255874 (10JMeybohm) p:05Triage→03Medium [07:54:21] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 (10JMeybohm) p:05Triage→03Medium [07:54:25] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move mobileapps to use TLS only - https://phabricator.wikimedia.org/T255876 (10JMeybohm) p:05Triage→03Medium [07:54:35] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 (10JMeybohm) p:05Triage→03Medium [07:54:43] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move cxserver to use TLS 
only - https://phabricator.wikimedia.org/T255879 (10JMeybohm) p:05Triage→03Medium [07:57:07] (03CR) 10Marostegui: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/24011/db2093.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/614747 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [07:57:46] (03CR) 10Marostegui: "We don't really use them for any other DB, any specific reason you need them?" [dns] - 10https://gerrit.wikimedia.org/r/614751 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [07:58:11] (03PS2) 10Kormat: mariadb: Add replication monitoring for zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/614747 (https://phabricator.wikimedia.org/T257816) [07:58:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: shell & analytics access for agaduran [puppet] - 10https://gerrit.wikimedia.org/r/613241 (https://phabricator.wikimedia.org/T258214) (owner: 10CDanis) [07:58:32] (03CR) 10Marostegui: [C: 03+1] Update analytics-in(4|6) filters [homer/public] - 10https://gerrit.wikimedia.org/r/614702 (owner: 10Elukey) [07:59:41] (03CR) 10Kormat: [C: 03+2] mariadb: Add replication monitoring for zarcillo [puppet] - 10https://gerrit.wikimedia.org/r/614747 (https://phabricator.wikimedia.org/T257816) (owner: 10Kormat) [07:59:53] (03CR) 10Marostegui: mariadb: Comment out sections that do not appear in puppet. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/614697 (https://phabricator.wikimedia.org/T258376) (owner: 10Kormat) [08:01:12] (03CR) 10Elukey: "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/614751 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [08:01:32] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10Joe) 05Open→03Resolved The patch has been merged and it will be applied across the cluster in ~ 30 minutes. Resolving the t... 
[08:01:54] (03CR) 10Marostegui: [C: 03+1] "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/614751 (https://phabricator.wikimedia.org/T234826) (owner: 10Elukey) [08:02:17] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to WMCS for nskaggs - https://phabricator.wikimedia.org/T258438 (10Joe) [08:02:28] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to WMCS for nskaggs - https://phabricator.wikimedia.org/T258438 (10Joe) p:05Triage→03Medium [08:04:54] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Agaduran - https://phabricator.wikimedia.org/T258214 (10elukey) @MGerlach if this user needs kerberos please follow https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/User... [08:06:22] (03PS2) 10ZPapierski: Add logout location [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) [08:06:57] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor typo, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/614894 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [08:07:15] (03CR) 10ZPapierski: "> Patch Set 1: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [08:07:25] (03PS4) 10ZPapierski: add logout config for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [08:08:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1119', diff saved to https://phabricator.wikimedia.org/P11980 and previous config saved to /var/cache/conftool/dbconfig/20200721-080842-marostegui.json [08:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:45] (03CR) 10Alexandros Kosiaris: [C: 03+1] changeprop/changeprop-jobqueue: swap rdb2003 with rdb2007 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/614901 (https://phabricator.wikimedia.org/T255250) (owner: 10Effie Mouzeli) [08:11:36] 10Operations, 10SRE-Access-Requests: Request for SSH access to analytics-privatadata-users group - https://phabricator.wikimedia.org/T258413 (10Joe) p:05Triage→03Medium a:03Joe Hi @CBogen the procedure go get shell access is outlined in https://wikitech.wikimedia.org/wiki/Production_access. Specifically,... [08:13:12] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Joe) [08:15:40] (03PS1) 10KartikMistry: Update cxserver to 2020-07-20-200559-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/615150 (https://phabricator.wikimedia.org/T257674) [08:15:59] 10Operations, 10LDAP-Access-Requests: Add nskaggs to WMF ldap group - https://phabricator.wikimedia.org/T258437 (10Joe) p:05Triage→03Medium a:03Joe [08:16:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: 10Jeena Huneidi) [08:18:42] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1024 - https://phabricator.wikimedia.org/T257949 (10fgiunchedi) >>! 
In T257949#6321247, @Jclark-ctr wrote: > @fgiunchedi if you would like bbu swapped i am available tomorrow morning eastern time For sure, LMK on IRC or here when available [08:26:41] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [08:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:44] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:15] PROBLEM - snapshot of x1 in codfw on icinga1001 is CRITICAL: snapshot for x1 at codfw taken more than 3 days ago: Most recent backup 2020-07-18 08:07:45 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [08:29:39] (03PS3) 10Kormat: mariadb: Comment out sections that do not appear in puppet. [puppet] - 10https://gerrit.wikimedia.org/r/614697 (https://phabricator.wikimedia.org/T258376) [08:30:45] (03CR) 10Kormat: mariadb: Comment out sections that do not appear in puppet. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/614697 (https://phabricator.wikimedia.org/T258376) (owner: 10Kormat) [08:33:21] (03CR) 10Muehlenhoff: [C: 03+2] Switch yarn.wikimedia.org to CAS [puppet] - 10https://gerrit.wikimedia.org/r/614692 (https://phabricator.wikimedia.org/T159584) (owner: 10Muehlenhoff) [08:34:38] !log akosiaris@cumin1001 conftool action : set/weight=3; selector: dc=codfw,service=mobileapps,name=scb.* [08:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:36] !log increase codfw mobileapps kubernetes traffic to 47% T218733 [08:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:41] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [08:37:46] 47? [08:37:49] Not 50? 
:-) [08:38:07] 16÷(16+6×3) = 0.47 [08:38:18] 3 is the only thing I can change in that equation [08:38:23] and it has to be integers :-) [08:38:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, but I will let andrew merge & babysit." [puppet] - 10https://gerrit.wikimedia.org/r/614882 (https://phabricator.wikimedia.org/T258415) (owner: 10BryanDavis) [08:38:33] Aha, right. [08:38:34] so it's either 0.47 or 0.57 [08:38:41] * James_F nods. [08:38:47] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T258364 (10Joe) 05Open→03Invalid @CGlenn my bad, I just realized that wikimediafoundation.org is not currently managed by our google search console account. Given the site is exte... [08:39:36] next one (assuming all goes perfect) is going to be 0,727272727 :-) [08:39:59] And then 100% and we can decom the crap out of it? [08:40:00] :-) [08:44:23] 10Operations, 10LDAP-Access-Requests: Add nskaggs to WMF ldap group - https://phabricator.wikimedia.org/T258437 (10Joe) 05Open→03Resolved I didn't add the user to admin/data.yaml as we have another pending task for full shell access that should be merged quite soon. [08:47:19] 10Operations, 10serviceops: Update deprecated extension names in envoy config - https://phabricator.wikimedia.org/T258140 (10Joe) p:05Triage→03High a:03Joe I think we should just move to the v3 api as soon as possible. I'll think of how to test it easily. [08:48:54] James_F: I sure hope so [08:50:00] (03CR) 10DCausse: [C: 04-1] Add logout location (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [08:57:01] 10Operations, 10netops, 10observability, 10Patch-For-Review: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [08:57:29] Even with pcs moved over, there's still cxserver. :-( [08:57:36] But wonderful to see this progress. Thank you so much. 
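Note on the 08:38 arithmetic above (16÷(16+6×3) = 0.47): a small sketch that reproduces the quoted figures. It assumes, going only by the shape of the formula and the conftool entry setting `weight=3` on `scb.*`, that 16 is the combined weight of the Kubernetes backends and that 6 legacy hosts share one integer weight. The function name `k8s_share` is hypothetical.

```python
def k8s_share(k8s_weight: int, legacy_hosts: int, legacy_weight: int) -> float:
    """Fraction of traffic reaching Kubernetes under weighted load balancing."""
    total = k8s_weight + legacy_hosts * legacy_weight
    return k8s_weight / total


# Assumed values: Kubernetes weight 16, six legacy hosts, per-host weight stepping down.
for weight in (3, 2, 1, 0):
    print(weight, round(k8s_share(16, 6, weight), 3))
```

Because the per-host weight must be an integer, the reachable shares jump in steps (weights 3, 2 and 1 give roughly 0.47, 0.57 and 0.73), which is why an exact 50% split is not available.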
[09:00:09] James_F: cxserver has been moved over for quite some time now. I think you refer to apertium [09:00:17] which we are tackling this Q [09:02:25] (03PS1) 10Jcrespo: mariadb-backups: Adjust check parameters to get less false positives [puppet] - 10https://gerrit.wikimedia.org/r/615155 (https://phabricator.wikimedia.org/T258045) [09:04:55] (03CR) 10Kormat: [C: 03+1] "Seems sane to me." [puppet] - 10https://gerrit.wikimedia.org/r/615155 (https://phabricator.wikimedia.org/T258045) (owner: 10Jcrespo) [09:08:31] Sorry, yes, apertium. [09:10:34] (03PS2) 10Jcrespo: mariadb-backups: Adjust check parameters to get less false positives [puppet] - 10https://gerrit.wikimedia.org/r/615155 (https://phabricator.wikimedia.org/T258045) [09:11:40] 10Operations, 10Wikimedia-General-or-Unknown, 10Sustainability: Consider using Cassandra/restbase in place of external store - https://phabricator.wikimedia.org/T100705 (10Addshore) >>! In T100705#5697515, @ArielGlenn wrote: > I don't really want to revive this ticket but I do want to know if it's seriously... [09:14:09] (03PS1) 10Jbond: pontoon: switch to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/615158 [09:15:38] (03PS2) 10Jbond: pontoon: switch to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/615158 [09:16:24] (03PS1) 10Kormat: Revert "es1020: Disable notifications for reimaging." [puppet] - 10https://gerrit.wikimedia.org/r/614784 [09:17:45] (03CR) 10Kormat: [C: 03+2] Revert "es1020: Disable notifications for reimaging." 
[puppet] - 10https://gerrit.wikimedia.org/r/614784 (owner: 10Kormat) [09:18:10] 10Operations, 10RESTBase, 10Patch-For-Review: restbase: "featured" endpoint times out - https://phabricator.wikimedia.org/T257887 (10Joe) 05Open→03Resolved [09:18:35] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Adjust check parameters to get less false positives [puppet] - 10https://gerrit.wikimedia.org/r/615155 (https://phabricator.wikimedia.org/T258045) (owner: 10Jcrespo) [09:20:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "I can guess the script I used to generate this list isn't too smart:" [puppet] - 10https://gerrit.wikimedia.org/r/614820 (owner: 10BryanDavis) [09:21:27] !log kormat@cumin1001 dbctl commit (dc=all): 'Re-pool es1020 at 25% in es4 T257284', diff saved to https://phabricator.wikimedia.org/P11982 and previous config saved to /var/cache/conftool/dbconfig/20200721-092126-kormat.json [09:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:34] T257284: Upgrade es4 to debian buster + mariadb 10.4 - https://phabricator.wikimedia.org/T257284 [09:21:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:23:11] (03PS3) 10Jbond: pontoon: switch to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/615158 [09:23:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:24:26] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/615158 (owner: 10Jbond) [09:25:35] (03CR) 10Zfilipin: [C: 03+1] "@20after4 Looks like I don't have +2 on this repo 😬 Please merge if this looks ok, or let me know if anything needs to be fixed." 
[phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/613657 (https://phabricator.wikimedia.org/T255761) (owner: 10Zfilipin) [09:27:08] RECOVERY - dump of es4 in eqiad on icinga1001 is OK: Last dump for es4 at eqiad (es1022.eqiad.wmnet) taken on 2020-07-21 00:00:01 (481 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [09:27:08] RECOVERY - dump of s4 in codfw on icinga1001 is OK: Last dump for s4 at codfw (db2099.codfw.wmnet:3314) taken on 2020-07-21 01:49:44 (185 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [09:27:08] RECOVERY - dump of s4 in eqiad on icinga1001 is OK: Last dump for s4 at eqiad (db1145.eqiad.wmnet:3314) taken on 2020-07-21 00:00:01 (185 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [09:27:08] RECOVERY - dump of zarcillo in eqiad on icinga1001 is OK: Last dump for zarcillo at eqiad (db1115.eqiad.wmnet) taken on 2020-07-21 00:52:05 (0 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [09:30:08] (03PS16) 10Jbond: labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [09:30:10] (03PS1) 10Jbond: cloud - hiera5: migrate labs main environment to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/615159 [09:30:21] (03PS2) 10Ladsgroup: Load WikibaseClient from extension.json file instead of php one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613235 (https://phabricator.wikimedia.org/T256228) [09:30:24] RECOVERY - Check systemd state on prometheus4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:50] RECOVERY - Check systemd state on prometheus3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:31:10] RECOVERY - Check the last execution of generate-mysqld-exporter-config on prometheus3001 is OK: OK: Status of the systemd unit generate-mysqld-exporter-config 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:32:45] 10Operations, 10observability, 10Patch-For-Review: Move Prometheus off eqsin/ulsfo/esams bastions - https://phabricator.wikimedia.org/T243057 (10Kormat) The grants have been created, all 3 new prom hosts can now successfully run generate-mysqld-exporter-config. [09:32:56] (03PS17) 10Jbond: labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [09:35:14] (03PS1) 10Elukey: Add term idp to analytics-in4/6 filters [homer/public] - 10https://gerrit.wikimedia.org/r/615160 [09:35:42] (03PS1) 10Jbond: hiera3: remove old hiera backend files [puppet] - 10https://gerrit.wikimedia.org/r/615161 [09:37:18] RECOVERY - Check the last execution of generate-mysqld-exporter-config on prometheus5001 is OK: OK: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:37:18] (03PS1) 10Giuseppe Lavagetto: envoy: fix deprecated filter names [puppet] - 10https://gerrit.wikimedia.org/r/615162 (https://phabricator.wikimedia.org/T258140) [09:37:22] (03PS18) 10Jbond: labs - hiera: migrate to hiera version5 [puppet] - 10https://gerrit.wikimedia.org/r/613112 [09:37:22] RECOVERY - Check the last execution of generate-mysqld-exporter-config on prometheus4001 is OK: OK: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [09:38:00] (03PS2) 10Jbond: cloud - hiera5: migrate labs main environment to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/615159 [09:38:19] RECOVERY - dump of zarcillo in codfw on icinga1001 is OK: Last dump for zarcillo at codfw (db2093.codfw.wmnet) taken on 2020-07-21 09:37:04 (0 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [09:38:20] (03PS2) 10Jbond: hiera3: remove old hiera backend files [puppet] - 10https://gerrit.wikimedia.org/r/615161 [09:38:25] (03CR) 
10Muehlenhoff: [C: 03+1] "I'm not familiar with the syntax, but the data/logic looks good to me" [homer/public] - 10https://gerrit.wikimedia.org/r/615160 (owner: 10Elukey) [09:38:28] (03PS3) 10Jbond: hiera3: remove old hiera backend files [puppet] - 10https://gerrit.wikimedia.org/r/615161 [09:39:31] (03CR) 10Elukey: [C: 03+2] Add term idp to analytics-in4/6 filters [homer/public] - 10https://gerrit.wikimedia.org/r/615160 (owner: 10Elukey) [09:41:04] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi We have implemented option 1, namely adding more ingestion capacity vi... [09:44:21] (03CR) 10Marostegui: [C: 03+1] mariadb: Comment out sections that do not appear in puppet. [puppet] - 10https://gerrit.wikimedia.org/r/614697 (https://phabricator.wikimedia.org/T258376) (owner: 10Kormat) [09:44:59] !log add term 'idp' to analytics-in4/6 filters on cr1-eqiad and cr2-eqiad (ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/615160) [09:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:02] RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:03] (03PS3) 10Elukey: Update analytics-in(4|6) filters [homer/public] - 10https://gerrit.wikimedia.org/r/614702 [09:50:33] (03PS3) 10ZPapierski: Add logout location [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) [09:50:35] (03PS5) 10ZPapierski: add logout config for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [09:50:38] (03CR) 10Kormat: [C: 03+2] mariadb: Comment out sections that do not appear in puppet. 
[puppet] - 10https://gerrit.wikimedia.org/r/614697 (https://phabricator.wikimedia.org/T258376) (owner: 10Kormat) [09:50:50] (03PS1) 10Effie Mouzeli: hiera: switch nutcracker shard from rdb2003 to rdb2007 [puppet] - 10https://gerrit.wikimedia.org/r/615163 (https://phabricator.wikimedia.org/T255250) [09:52:57] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat: Database cumin aliases without a matching host - https://phabricator.wikimedia.org/T258376 (10Kormat) 05Open→03Resolved All fixed now. [09:53:14] (03CR) 10Jbond: "> Patch Set 15: Code-Review-1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613112 (owner: 10Jbond) [09:57:44] !log akosiaris@cumin1001 conftool action : set/weight=1; selector: dc=codfw,service=mobileapps,name=scb.* [09:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:01] !log increase codfw mobileapps kubernetes traffic to 72.727272% T218733 [09:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:06] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [09:58:36] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: dc=codfw,service=mobileapps [09:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:13] !log disable puppet on wtp* to merge 613307 [09:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:39] !log move all codfw mobileapps nodes (kubernetes and scb) to weight 10. Traffic level remains at 72.727272% flowing to kubernetes, the rest to scb [09:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:48] !log move all codfw mobileapps nodes (kubernetes and scb) to weight 10. 
Traffic level remains at 72.727272% flowing to kubernetes, the rest to scb T218733 [09:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:55] (03PS1) 10Filippo Giunchedi: profile: add alert on no logs ingested [puppet] - 10https://gerrit.wikimedia.org/r/615164 (https://phabricator.wikimedia.org/T257294) [10:00:03] (03CR) 10Muehlenhoff: [C: 03+2] Remove IDP defintions for logstash vhosts [puppet] - 10https://gerrit.wikimedia.org/r/607509 (https://phabricator.wikimedia.org/T246998) (owner: 10Muehlenhoff) [10:00:22] RECOVERY - snapshot of x1 in codfw on icinga1001 is OK: Last snapshot for x1 at codfw (db2101.codfw.wmnet:3320) taken on 2020-07-21 09:26:39 (219 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:01:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1119 T258480', diff saved to https://phabricator.wikimedia.org/P11983 and previous config saved to /var/cache/conftool/dbconfig/20200721-100159-marostegui.json [10:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:05] T258480: Query plan changes on enwiki.revision queries with MCR change - https://phabricator.wikimedia.org/T258480 [10:02:22] !log Analyze revision table on db1119 T258480 [10:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:20] (03CR) 10Effie Mouzeli: [C: 03+2] role::parsoid: Add missing exporters for parsoid [puppet] - 10https://gerrit.wikimedia.org/r/613307 (owner: 10Effie Mouzeli) [10:04:48] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/24019/icinga1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/615164 (https://phabricator.wikimedia.org/T257294) (owner: 10Filippo Giunchedi) [10:05:34] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Alert on no (or "few") logs indexed (was: No logs ingested in logstash7 since 2020-07-06 19:23) - 
https://phabricator.wikimedia.org/T257294 (10fgiunchedi) [10:06:44] 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi) [10:07:36] PROBLEM - cas-logstash.wikimedia.org requires authentication on logstash1007 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 401 Unauthorized https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:07:43] (03PS4) 10Alexandros Kosiaris: otrs: Set otrs1001 as OTRS role [puppet] - 10https://gerrit.wikimedia.org/r/614746 (https://phabricator.wikimedia.org/T187984) [10:07:59] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs: Set otrs1001 as OTRS role [puppet] - 10https://gerrit.wikimedia.org/r/614746 (https://phabricator.wikimedia.org/T187984) (owner: 10Alexandros Kosiaris) [10:08:28] PROBLEM - cas-logstash.wikimedia.org requires authentication on logstash2006 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:08:34] PROBLEM - Check systemd state on logstash2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:06] (03PS1) 10Marostegui: db1085: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/615165 (https://phabricator.wikimedia.org/T258360) [10:13:15] !log enable puppet on on wtp* [10:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:18:13] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "This would most surely have unintended consequences if we don't review completely the jobrunners setup." [puppet] - 10https://gerrit.wikimedia.org/r/599683 (https://phabricator.wikimedia.org/T190111) (owner: 10Dzahn) [10:18:49] (03PS1) 10Hnowlan: ratelimit: add new docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) [10:19:03] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:19:45] (03CR) 10Hnowlan: "Clarifying note: the internal repo for ratelimit has not been created yet, pushing this for review in advance." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan) [10:19:54] (03PS1) 10Jbond: analytics_test_cluster::hadoop::ui: add idp config [puppet] - 10https://gerrit.wikimedia.org/r/615169 [10:20:38] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM, thanks for this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/615158 (owner: 10Jbond) [10:20:49] !log disable puppet on P:mediawiki::mcrouter_wancache - T247956 [10:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:55] T247956: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 [10:21:27] PROBLEM - cas-logstash.wikimedia.org requires authentication on logstash1008 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 401 Unauthorized https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:22:13] (03CR) 10Jbond: "PCC (ops): https://puppet-compiler.wmflabs.org/compiler1001/24017/" [puppet] - 10https://gerrit.wikimedia.org/r/615161 (owner: 10Jbond) [10:22:39] (03CR) 10Marostegui: [C: 03+2] db1085: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/615165 (https://phabricator.wikimedia.org/T258360) (owner: 10Marostegui) [10:22:41] (03PS2) 10Alexandros Kosiaris: traffic: Add ticket-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/614759 (https://phabricator.wikimedia.org/T187984) [10:22:43] (03PS1) 10Alexandros Kosiaris: otrs: Allow disabling daemon in profile [puppet] - 10https://gerrit.wikimedia.org/r/615170 (https://phabricator.wikimedia.org/T187984) [10:23:21] (03CR) 10Jbond: [C: 03+2] pontoon: switch to hiera5 [puppet] - 10https://gerrit.wikimedia.org/r/615158 (owner: 10Jbond) [10:24:14] (03CR) 10jerkins-bot: [V: 04-1] otrs: Allow disabling daemon in profile [puppet] - 10https://gerrit.wikimedia.org/r/615170 (https://phabricator.wikimedia.org/T187984) (owner: 10Alexandros Kosiaris) [10:24:41] (03PS2) 10Alexandros Kosiaris: otrs: Allow disabling daemon in profile [puppet] - 10https://gerrit.wikimedia.org/r/615170 (https://phabricator.wikimedia.org/T187984) [10:24:43] (03PS3) 10Alexandros Kosiaris: traffic: Add ticket-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/614759 
(https://phabricator.wikimedia.org/T187984) [10:25:55] (03CR) 10jerkins-bot: [V: 04-1] otrs: Allow disabling daemon in profile [puppet] - 10https://gerrit.wikimedia.org/r/615170 (https://phabricator.wikimedia.org/T187984) (owner: 10Alexandros Kosiaris) [10:26:45] (03CR) 10Effie Mouzeli: [C: 03+2] profile::mediawiki::mcrouter_wancache: refactor [puppet] - 10https://gerrit.wikimedia.org/r/612514 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [10:26:47] (03PS1) 10DCausse: [sdoc] fix entity source base URIs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615171 (https://phabricator.wikimedia.org/T258474) [10:28:30] (03PS1) 10Privacybatm: Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) [10:29:16] (03PS1) 10Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) [10:30:37] RECOVERY - Check systemd state on logstash2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:13] PROBLEM - cas-logstash.wikimedia.org requires authentication on logstash1009 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 401 Unauthorized https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:33:30] (03PS11) 10Effie Mouzeli: profile::mediawiki::mcrouter_wancache: refactor [puppet] - 10https://gerrit.wikimedia.org/r/612514 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [10:34:03] PROBLEM - cas-logstash.wikimedia.org requires authentication on logstash2005 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 401 Unauthorized https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [10:34:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool 
db1085', diff saved to https://phabricator.wikimedia.org/P11984 and previous config saved to /var/cache/conftool/dbconfig/20200721-103430-marostegui.json [10:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:39] ^ these should recover soon, currently running puppet on icinga1001 [10:36:08] (03PS16) 10Effie Mouzeli: P:mediawiki::mcrouter_wancache: refactor parameters [puppet] - 10https://gerrit.wikimedia.org/r/612523 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [10:37:25] (03PS10) 10Effie Mouzeli: mcrouter: store defaults in module not in hiera [puppet] - 10https://gerrit.wikimedia.org/r/612532 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [10:38:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:38:52] (03CR) 10Effie Mouzeli: [C: 03+2] profile::mediawiki::mcrouter_wancache: refactor [puppet] - 10https://gerrit.wikimedia.org/r/612514 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [10:39:02] (03CR) 10JMeybohm: [C: 04-1] chartmuseum: Add systemd timer to package and push charts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613635 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [10:39:21] (03PS3) 10Alexandros Kosiaris: otrs: Allow disabling daemon in profile [puppet] - 10https://gerrit.wikimedia.org/r/615170 (https://phabricator.wikimedia.org/T187984) [10:39:23] (03PS4) 10Alexandros Kosiaris: traffic: Add ticket-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/614759 (https://phabricator.wikimedia.org/T187984) [10:40:04] (03PS1) 10Muehlenhoff: Remove cas-graphite from cache config [puppet] - 10https://gerrit.wikimedia.org/r/615175 (https://phabricator.wikimedia.org/T244861) 
[10:40:13] (03PS4) 10ZPapierski: Add logout location [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) [10:40:19] (03PS6) 10ZPapierski: add logout config for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [10:40:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC fine at https://puppet-compiler.wmflabs.org/compiler1003/24021/" [puppet] - 10https://gerrit.wikimedia.org/r/615170 (https://phabricator.wikimedia.org/T187984) (owner: 10Alexandros Kosiaris) [10:44:02] (03CR) 10Vgutierrez: [C: 03+1] hieradata: improve description of ncredir [puppet] - 10https://gerrit.wikimedia.org/r/611197 (owner: 10Effie Mouzeli) [10:44:26] 10Operations, 10Wikimedia-Apache-configuration, 10Developer Productivity, 10Patch-For-Review, 10Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost (on jobrunners) - https://phabricator.wikimedia.org/T190111 (10Joe) My take on this is that we should rathe... [10:45:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1085 T258360', diff saved to https://phabricator.wikimedia.org/P11985 and previous config saved to /var/cache/conftool/dbconfig/20200721-104546-marostegui.json [10:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:52] T258360: db1085 crashed - https://phabricator.wikimedia.org/T258360 [10:46:25] (03PS17) 10Effie Mouzeli: P:mediawiki::mcrouter_wancache: refactor parameters [puppet] - 10https://gerrit.wikimedia.org/r/612523 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [10:48:10] (03CR) 10Effie Mouzeli: [C: 03+2] P:mediawiki::mcrouter_wancache: refactor parameters [puppet] - 10https://gerrit.wikimedia.org/r/612523 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [10:49:31] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:51:16] (03PS5) 10Alexandros Kosiaris: traffic: Add ticket-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/614759 (https://phabricator.wikimedia.org/T187984) [10:51:18] (03PS1) 10Alexandros Kosiaris: otrs: vary mysql-client on debian distro version [puppet] - 10https://gerrit.wikimedia.org/r/615176 (https://phabricator.wikimedia.org/T187984) [10:51:37] (03CR) 1020after4: [C: 03+2] Add .gitreview file [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/613657 (https://phabricator.wikimedia.org/T255761) (owner: 10Zfilipin) [10:51:44] (03CR) 1020after4: [V: 03+2 C: 03+2] Add .gitreview file [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/613657 (https://phabricator.wikimedia.org/T255761) (owner: 10Zfilipin) [10:53:28] (03PS2) 1020after4: Selenium: Update to WebdriverIO v5 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [10:54:03] PROBLEM - Check systemd state on otrs1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:54:30] (03CR) 1020after4: [C: 03+1] "looks good but I have not tested yet," [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [10:54:36] (03PS1) 10Kosta Harlan: PageUpdater: fix handling of null edits [core] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/615186 (https://phabricator.wikimedia.org/T257766) [10:55:01] (03CR) 10Kosta Harlan: [C: 04-1] "Would like Catrope to review before we backport this." 
[core] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/615186 (https://phabricator.wikimedia.org/T257766) (owner: 10Kosta Harlan) [10:55:17] (03CR) 10Effie Mouzeli: [C: 03+2] mcrouter: store defaults in module not in hiera [puppet] - 10https://gerrit.wikimedia.org/r/612532 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [10:55:33] (03PS1) 10Kosta Harlan: PageUpdater: fix handling of null edits [core] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/615187 (https://phabricator.wikimedia.org/T257766) [10:55:46] (03CR) 10Kosta Harlan: "Would like Catrope to review before we backport this." [core] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/615187 (https://phabricator.wikimedia.org/T257766) (owner: 10Kosta Harlan) [10:58:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1085 T258360', diff saved to https://phabricator.wikimedia.org/P11986 and previous config saved to /var/cache/conftool/dbconfig/20200721-105852-marostegui.json [10:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:58] T258360: db1085 crashed - https://phabricator.wikimedia.org/T258360 [10:59:50] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to WMCS for nskaggs - https://phabricator.wikimedia.org/T258438 (10Bmueller) approved, thanks! [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Your horoscope predicts another unfortunate European mid-day backport window(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200721T1100). [11:00:04] jan_drewniak: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] \o [11:00:18] o/ [11:00:27] jan_drewniak: do you want to self-deploy, or should I do that for you? 
:) [11:00:41] (03PS1) 10Urbanecm: Enable botpasswords at checkuserwiki and stewardwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615177 (https://phabricator.wikimedia.org/T258358) [11:00:49] !log enable puppet on P:mediawiki::mcrouter_wancache - T247956 [11:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:55] T247956: Ensure hiera only has profile:: qualified or global hiera keys - https://phabricator.wikimedia.org/T247956 [11:01:30] (03PS1) 10Muehlenhoff: Record extended MOU date for Daniele Rama [puppet] - 10https://gerrit.wikimedia.org/r/615178 [11:01:31] Urbanecm: could you do it for me please? (It's embarrassing but even though I have the rights, I've never actually deployed a standard config myself!) [11:01:49] or I can walk you through it :) [11:02:58] (03CR) 10Urbanecm: [C: 03+2] "B&C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614837 (https://phabricator.wikimedia.org/T254228) (owner: 10Jdlrobson) [11:03:17] Urbanecm: maybe a quick rundown after the deploy? This one's gonna require some mwdebug testing on my end [11:03:29] (03PS2) 10Effie Mouzeli: hieradata: improve description of ncredir [puppet] - 10https://gerrit.wikimedia.org/r/611197 [11:03:38] okay, sure :). 
Will ping you once it's at mwdebug1001 [11:03:42] (03CR) 10Effie Mouzeli: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/24018/icinga1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/611197 (owner: 10Effie Mouzeli) [11:03:54] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615176 (https://phabricator.wikimedia.org/T187984) (owner: 10Alexandros Kosiaris) [11:04:14] (03Merged) 10jenkins-bot: Enable Vector opt in preference everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614837 (https://phabricator.wikimedia.org/T254228) (owner: 10Jdlrobson) [11:04:48] jan_drewniak: your patch is available at mwdebug1001 :) [11:05:14] (the beta part will arrive to the beta cluster within 30 minutes) [11:06:27] (03PS1) 10Privacybatm: [POC3 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601) [11:07:26] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] hieradata: improve description of ncredir [puppet] - 10https://gerrit.wikimedia.org/r/611197 (owner: 10Effie Mouzeli) [11:07:27] Urbanecm: it works! good to sync [11:07:34] thanks! syncing then :) [11:08:38] (03CR) 10Privacybatm: "In this also I am facing similar issue as POC1!" 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [11:08:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1085 T258360', diff saved to https://phabricator.wikimedia.org/P11987 and previous config saved to /var/cache/conftool/dbconfig/20200721-110854-marostegui.json [11:08:57] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 5d5bb37c342310be5ca0b0e11a8490703867f4fd: Enable Vector opt in preference everywhere (T254228) (duration: 00m 57s) [11:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:01] T258360: db1085 crashed - https://phabricator.wikimedia.org/T258360 [11:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:06] T254228: Deploy new version of vector skin to all wikis as a user preference - https://phabricator.wikimedia.org/T254228 [11:09:10] jan_drewniak: should be all done! [11:10:09] (03CR) 10Ema: [C: 03+1] "One nit, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/614763 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [11:10:19] Urbanecm: yeah it's looking good! [11:10:23] great! [11:10:37] !log Create bot_passwords table at stewardwiki (T258355) [11:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:43] T258355: Allow users to use botpasswords at stewardwiki - https://phabricator.wikimedia.org/T258355 [11:11:44] !log Create bot_passwords table at checkuserwiki (T258358) [11:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:49] T258358: Allow users to use botpasswords at checkuserwiki - https://phabricator.wikimedia.org/T258358 [11:12:38] Urbanecm: I've done custom deploys for the Wikimedia Portals (weekly) but I bet a standard config deploy is simpler than that.
Is it just logging into the deployment server, pulling down the config changes, and running `scap sync` or something? [11:13:15] 10Operations, 10DBA: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10Marostegui) 05Open→03Resolved a:03Marostegui I have fully repooled this host. It doesn't have a BBU, but s6 doesn't really have much load, so it will probably be able to keep up with replication without issues. Next follo... [11:13:17] 10Operations, 10DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (10Marostegui) [11:13:30] (03PS2) 10Privacybatm: [POC3 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601) [11:13:39] jan_drewniak: almost. You fetch that at the deployment server, then do scap pull at mwdebug1001, do your testing (and watch logstash), and then you do scap sync-file path/to/file.php 'your message' [11:13:51] (03CR) 10Privacybatm: "This change is ready for review." [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [11:14:36] jan_drewniak: https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers is the docs, feel free to ask if anything is unclear [11:15:54] (03CR) 10Ema: [C: 03+1] Remove cas-graphite from cache config [puppet] - 10https://gerrit.wikimedia.org/r/615175 (https://phabricator.wikimedia.org/T244861) (owner: 10Muehlenhoff) [11:16:14] Urbanecm: aha, one file at a time? Got it. Does the beta config need to be synced too or is that picked up automatically at some time? [11:16:14] (03PS2) 10Urbanecm: Enable botpasswords at checkuserwiki and stewardwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615177 (https://phabricator.wikimedia.org/T258358) [11:16:32] jan_drewniak: no, beta config doesn't need to be synced. 
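[editor's note] The backport steps described at 11:13:39 can be written down as an ordered checklist. The following is only a sketch under stated assumptions, not the deployers' actual tooling: the function name `backport_steps` is hypothetical, the log does not give the exact fetch command (so that step stays a description), and placing `scap sync-file` on the deployment server is inferred from the `urbanecm@deploy1001 Synchronized ...` log entries. Nothing is executed; the code only builds strings.

```python
import shlex


def backport_steps(path, message):
    """Return the ordered (host, action) steps described in the chat.

    Hypothetical helper for illustration only; it builds strings and
    never runs scap.
    """
    return [
        # The log says "fetch that at the deployment server" without a command.
        ("deployment server", "fetch the merged config change"),
        ("mwdebug1001", "scap pull"),
        ("mwdebug1001", "test the change and watch logstash"),
        # Assumption: sync-file is run from the deployment server.
        ("deployment server",
         "scap sync-file {} {}".format(shlex.quote(path), shlex.quote(message))),
    ]


if __name__ == "__main__":
    steps = backport_steps(
        "wmf-config/InitialiseSettings.php",
        "Enable Vector opt in preference everywhere (T254228)",
    )
    for host, action in steps:
        print(f"{host}: {action}")
```

Quoting the message with `shlex.quote` keeps spaces and parentheses in the sync message from being split or interpreted by the shell if the string is later pasted into a terminal.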
[11:16:51] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Check if images are debian based before generating report (031 comment) [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611251 (https://phabricator.wikimedia.org/T251918) (owner: 10JMeybohm) [11:17:27] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think you need to add a dependency on the python3-docker package for the build." [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611252 (owner: 10JMeybohm) [11:17:49] Sweet, well we're doing another one of these tomorrow so I'll try it then :) [11:18:09] cool! [11:18:12] jan_drewniak: scap sync-file path/to/directory also works, btw [11:18:43] (03CR) 10Urbanecm: [C: 03+2] Enable botpasswords at checkuserwiki and stewardwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615177 (https://phabricator.wikimedia.org/T258358) (owner: 10Urbanecm) [11:19:37] (03Merged) 10jenkins-bot: Enable botpasswords at checkuserwiki and stewardwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615177 (https://phabricator.wikimedia.org/T258358) (owner: 10Urbanecm) [11:19:45] (03PS3) 10Privacybatm: [POC3 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601) [11:22:14] 10Operations, 10Project-Admins: Rename #Operations Phab project to #WMF-SRE (or so) - https://phabricator.wikimedia.org/T258305 (10Joe) p:05Triage→03Low [11:23:03] PROBLEM - OTRS SMTP on otrs1001 is CRITICAL: connect to address 10.64.16.39 and port 25: Connection refused https://wikitech.wikimedia.org/wiki/OTRS%23Troubleshooting [11:23:15] (03CR) 10Privacybatm: "Okay, Now it is working!" 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [11:24:33] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 7b96c7ea35557888c6cec2dd19768c246bff804b: Enable botpasswords at checkuserwiki and stewardwiki (T258358, T258355) (duration: 00m 57s) [11:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:39] T258355: Allow users to use botpasswords at stewardwiki - https://phabricator.wikimedia.org/T258355 [11:24:39] T258358: Allow users to use botpasswords at checkuserwiki - https://phabricator.wikimedia.org/T258358 [11:25:07] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:25:26] !log EU B&C window done [11:25:28] (03PS1) 10Cmjohnson: Adding new cloudceph servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/615182 (https://phabricator.wikimedia.org/T251619) [11:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:47] 10Operations, 10ops-eqiad, 10Analytics-Clusters: analytics1050 host + mgmt down - https://phabricator.wikimedia.org/T258370 (10elukey) p:05Triage→03Medium [11:25:49] (03CR) 10Jbond: [C: 03+1] Record extended MOU date for Daniele Rama [puppet] - 10https://gerrit.wikimedia.org/r/615178 (owner: 10Muehlenhoff) [11:26:08] (03Abandoned) 10Jbond: varnish: Rate limit cloud providers for all requiests [puppet] - 10https://gerrit.wikimedia.org/r/609477 (owner: 10Jbond) [11:26:37] (03CR) 10Muehlenhoff: [C: 03+2] Record extended MOU date for Daniele Rama [puppet] - 10https://gerrit.wikimedia.org/r/615178 (owner: 10Muehlenhoff) [11:27:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/615175 (https://phabricator.wikimedia.org/T244861) (owner: 10Muehlenhoff) [11:27:21] (03CR) 10Privacybatm: "Please give first priority to POC3 
(https://gerrit.wikimedia.org/r/c/operations/software/transferpy/+/615179)" [software/transferpy] - 10https://gerrit.wikimedia.org/r/614744 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [11:28:37] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:29:08] 10Operations, 10Wikimedia-Apache-configuration: Encoding discrepancy in https://wikipedia.org redirects - https://phabricator.wikimedia.org/T257608 (10Joe) I'm not sure the problem is what you think it is. wikipedia.org isn't even supposed to respond to such requests IIRC. I'll dig deeper. [11:31:22] (03CR) 10Cmjohnson: [C: 03+2] Adding new cloudceph servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/615182 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson) [11:32:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:35:31] (03PS1) 10Cmjohnson: Add cloudcephosd mac addressess to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/615183 (https://phabricator.wikimedia.org/T251619) [11:36:17] (03CR) 10Cmjohnson: [C: 03+2] Add cloudcephosd mac addressess to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/615183 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson) [11:36:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:40:12] ^ looking [11:40:30] (03PS1) 10Marostegui: install_server: Reimage dbproxy1012 as Buster [puppet] - 10https://gerrit.wikimedia.org/r/615184 (https://phabricator.wikimedia.org/T255408) [11:41:50] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:41:50] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage dbproxy1012 as Buster [puppet] - 10https://gerrit.wikimedia.org/r/615184 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [11:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:25] so the alert is possibly for https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2020.07.21/mediawiki?id=AXNxKUZ7NoG2jwpwBMjp&_g=h@66534ad [11:47:08] (03CR) 10Muehlenhoff: "Ah, good catch! Ultimately Yarn/CAS won't work on the hadoop test cluster since it couples Hue/Yarn and runs into the OpenSSL/curl mess (u" [puppet] - 10https://gerrit.wikimedia.org/r/615169 (owner: 10Jbond) [11:49:46] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [11:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:53:38] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:30] (03PS5) 10Jbond: role::exim: update config to drop ldap validation [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) [11:57:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:01:46] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:35] 10Operations, 10Traffic, 10serviceops, 10affects-Kiwix-and-openZIM: ETAG response headers not always with double-quotes - https://phabricator.wikimedia.org/T256217 (10ema) As it turns out, this is due to a deliberate bug in Swift: https://bugs.launchpad.net/swift/+bug/1099087 > When swift was originally... 
[12:03:35] (03Abandoned) 10Muehlenhoff: Allow specifying a process name for services without a native systemd unit (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/592628 (owner: 10Muehlenhoff) [12:05:39] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:34] 10Operations, 10SRE-swift-storage, 10Traffic, 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10ema) [12:08:55] (03CR) 10Jbond: "> I did some experimenting with adding a condition to the callout "deny domains" ACL to check if the user is valid in otrs mysql after the" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [12:09:38] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove cas-graphite from cache config [puppet] - 10https://gerrit.wikimedia.org/r/615175 (https://phabricator.wikimedia.org/T244861) (owner: 10Muehlenhoff) [12:10:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:13:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1087', diff saved to https://phabricator.wikimedia.org/P11988 and previous config saved to /var/cache/conftool/dbconfig/20200721-121302-marostegui.json [12:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:42] 10Operations, 10SRE-swift-storage, 10Traffic, 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10fgiunchedi) >>! In T256217#6322433, @ema wrote: > As it turns out, this is due to a deliberate bug in Swift: https://bugs.launchpad.net/swift/+bu... 
[12:15:51] (03CR) 10Jcrespo: "haven't look at the code yet, and I don't mind too much what the internal variables are called, but I think a good name for the user param" [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [12:21:24] 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10CDanis) Hi @Cmjohnson -- any idea of when you might be able to get around to this? Thanks! [12:27:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:27:43] (03PS4) 10Privacybatm: [POC3 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601) [12:30:09] (03CR) 10Privacybatm: "> Patch Set 3:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm) [12:32:18] (03PS5) 10Privacybatm: [POC3 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601) [12:33:12] (03CR) 10ZPapierski: Add logout location (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [12:34:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:34:05] 10Operations, 10SRE-swift-storage, 
10Traffic, 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10ema) >>! In T256217#6322461, @fgiunchedi wrote: >>>! In T256217#6322433, @ema wrote: >> Openstack/swift 2.24.0 fixes the issue (or removes the fea... [12:41:08] (03CR) 10Muehlenhoff: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [12:41:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:42:22] (03CR) 10DCausse: Correct url and path for nginx OAuth 1.0a (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [12:45:43] (03CR) 10DCausse: [C: 03+1] Add logout location [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [12:46:48] (03CR) 10DCausse: [C: 03+1] add logout config for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [12:47:26] (03PS2) 10Ema: tlsproxy: drop websocket_support [puppet] - 10https://gerrit.wikimedia.org/r/573277 (https://phabricator.wikimedia.org/T238625) [12:48:05] (03PS12) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) [12:49:14] (03PS5) 10ZPapierski: Add logout location [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) [12:49:19] (03PS7) 10ZPapierski: add logout config for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [12:50:19] (03PS8) 
10ZPapierski: add logout config for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/612681 (https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [12:50:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:50:55] (03CR) 10ZPapierski: Correct url and path for nginx OAuth 1.0a (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski) [12:54:41] !log Stop haproxy on dbproxy1012 - T255408 [12:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:47] T255408: Upgrade dbproxyXXXX to Buster - https://phabricator.wikimedia.org/T255408 [12:55:27] !log start of ladsgroup@mwmaint1002:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T258472 T258473) [12:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:33] T258472: Add Wikidata support to arywiki - https://phabricator.wikimedia.org/T258472 [12:55:33] T258473: Add Wikidata support for lijwikisource - https://phabricator.wikimedia.org/T258473 [12:58:02] (03CR) 10Muehlenhoff: "Haven't had a closer look, only some comments inline" (033 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/614787 (owner: 10Jbond) [12:58:55] (03PS2) 10Alexandros Kosiaris: otrs: vary mysql-client on debian distro version [puppet] - 10https://gerrit.wikimedia.org/r/615176 (https://phabricator.wikimedia.org/T187984) [12:58:57] (03PS6) 10Alexandros Kosiaris: traffic: Add ticket-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/614759 (https://phabricator.wikimedia.org/T187984) [12:58:59] (03PS1) 10Alexandros Kosiaris: otrs: Remove 
demime condition from exim [puppet] - 10https://gerrit.wikimedia.org/r/615210 (https://phabricator.wikimedia.org/T187984) [12:59:40] (03PS2) 10Alexandros Kosiaris: otrs: Remove demime condition from exim [puppet] - 10https://gerrit.wikimedia.org/r/615210 (https://phabricator.wikimedia.org/T187984) [12:59:42] (03PS7) 10Alexandros Kosiaris: traffic: Add ticket-test.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/614759 (https://phabricator.wikimedia.org/T187984) [13:01:36] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:01:38] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:04] (03CR) 10Ppchelko: [C: 03+1] ratelimit: add new docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan) [13:03:07] !log draining restbase1019 for eventual reboot for kernel security update [13:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:01] (03PS13) 10ZPapierski: Correct url and path for nginx OAuth 1.0a [puppet] - 10https://gerrit.wikimedia.org/r/609909 (https://phabricator.wikimedia.org/T251498) [13:04:05] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/573277 (https://phabricator.wikimedia.org/T238625) (owner: 10Ema) [13:04:07] (03CR) 10Muehlenhoff: [C: 03+2] Remove cas-graphite from cache config [puppet] - 10https://gerrit.wikimedia.org/r/615175 (https://phabricator.wikimedia.org/T244861) (owner: 10Muehlenhoff) [13:04:10] (03PS6) 10ZPapierski: Add logout location [puppet] - 10https://gerrit.wikimedia.org/r/613186 (https://phabricator.wikimedia.org/T251498) [13:04:18] (03PS9) 10ZPapierski: add logout config for wcqs [puppet] - 10https://gerrit.wikimedia.org/r/612681 
(https://phabricator.wikimedia.org/T257314) (owner: 10Mstyles) [13:06:08] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:14] (03PS1) 10Jbond: profile::thanos::httpd: move defaults to hiera [puppet] - 10https://gerrit.wikimedia.org/r/615212 (https://phabricator.wikimedia.org/T151009) [13:06:16] (03PS1) 10Jbond: profile::thanos::frontend: Add SSO [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) [13:06:18] (03PS1) 10Jbond: profile::thanos::frontend: enable SSO on thanos-fe2003 [puppet] - 10https://gerrit.wikimedia.org/r/615214 [13:06:20] (03PS1) 10Jbond: profile::thanos::frontend: enable sso for all thanos frontends [puppet] - 10https://gerrit.wikimedia.org/r/615215 (https://phabricator.wikimedia.org/T151009) [13:06:22] (03PS1) 10Jbond: profile::thanos::frontend: only support SSO on thanos [puppet] - 10https://gerrit.wikimedia.org/r/615216 (https://phabricator.wikimedia.org/T151009) [13:07:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:10:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:18] PROBLEM - cassandra-a CQL 10.64.0.101:9042 on restbase1019 is CRITICAL: connect to address 10.64.0.101 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:10:18] PROBLEM - cassandra-b CQL 10.64.0.102:9042 on restbase1019 is CRITICAL: connect to address 10.64.0.102 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [13:11:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [13:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:10] RECOVERY - cassandra-a CQL 10.64.0.101:9042 on restbase1019 is OK: TCP OK - 0.000 second response time on 10.64.0.101 port 9042 https://phabricator.wikimedia.org/T93886 [13:12:10] RECOVERY - cassandra-b CQL 10.64.0.102:9042 on restbase1019 is OK: TCP OK - 0.000 second response time on 10.64.0.102 port 9042 https://phabricator.wikimedia.org/T93886 [13:13:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:56] (03PS3) 10Ema: tlsproxy: drop websocket_support [puppet] - 10https://gerrit.wikimedia.org/r/573277 (https://phabricator.wikimedia.org/T238625) [13:14:21] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/573277 (https://phabricator.wikimedia.org/T238625) (owner: 10Ema) [13:14:28] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Reedy) >>! 
In T257066#6321576, @mb wrote: > This was announced to be fixed by 6 July 2020. It's now... [13:15:42] !log end of ladsgroup@mwmaint1002:~$ foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T258472 T258473) [13:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:48] T258472: Add Wikidata support to arywiki - https://phabricator.wikimedia.org/T258472 [13:15:48] T258473: Add Wikidata support for lijwikisource - https://phabricator.wikimedia.org/T258473 [13:15:51] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/615176 (https://phabricator.wikimedia.org/T187984) (owner: 10Alexandros Kosiaris) [13:16:00] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs: Remove demime condition from exim [puppet] - 10https://gerrit.wikimedia.org/r/615210 (https://phabricator.wikimedia.org/T187984) (owner: 10Alexandros Kosiaris) [13:18:12] (03CR) 10Jbond: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [13:19:19] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:20:13] * Amir1 akosiaris: for when you have some time, the osm role seems to be unused in wmcs, people are asking if it's used in production and you were one of the contributors of the role, so you might know something: https://gerrit.wikimedia.org/r/c/operations/puppet/+/613727 [13:20:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:26:06] Amir1: I haven't touched that in years [13:26:26] I 'll profess my current ignorance as well on the change. [13:27:03] (03PS1) 10Ottomata: Don't use merged Hive + event schema when reading raw event data [puppet] - 10https://gerrit.wikimedia.org/r/615217 (https://phabricator.wikimedia.org/T255818) [13:28:46] All good. Thanks! [13:30:03] (03CR) 10Alexandros Kosiaris: "That role was used at some point in time to be applied (IIRC) on a set of labsdb hosts (I don't remember the exact names) that provides os" [puppet] - 10https://gerrit.wikimedia.org/r/613727 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [13:33:29] (03CR) 10Vgutierrez: [C: 03+1] "looks good, as a nitpick, please get rid of the cosmetic change on nginx.conf shown in https://puppet-compiler.wmflabs.org/compiler1001/50" [puppet] - 10https://gerrit.wikimedia.org/r/573277 (https://phabricator.wikimedia.org/T238625) (owner: 10Ema) [13:33:48] (03PS1) 10Filippo Giunchedi: icinga: add logs retention [puppet] - 10https://gerrit.wikimedia.org/r/615219 (https://phabricator.wikimedia.org/T258491) [13:35:20] (03CR) 10Ottomata: [C: 03+2] "I just manually ran this with new refinery version on test_event table, and _schema was populated correctly! Merging." [puppet] - 10https://gerrit.wikimedia.org/r/615217 (https://phabricator.wikimedia.org/T255818) (owner: 10Ottomata) [13:37:56] RECOVERY - OTRS SMTP on otrs1001 is OK: SMTP OK - 0.006 sec. 
response time https://wikitech.wikimedia.org/wiki/OTRS%23Troubleshooting [13:38:00] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:36] 10Operations, 10SRE-swift-storage, 10Traffic, 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10Kelson) @ema not really this is case which had to be handled in MWoffliner. This is all. [13:41:11] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:41:13] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:57] !log increase codfw mobileapps kubernetes traffic to 96% T218733 [13:41:59] !log akosiaris@cumin1001 conftool action : set/weight=1; selector: dc=codfw,service=mobileapps,name=scb.* [13:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:02] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [13:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:12] (03CR) 10Jbond: "thanks updated" (033 comments) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/614787 (owner: 10Jbond) [13:42:41] Amir1: per puppetdb it's used on maps1004/maps2004 [13:43:02] (03PS3) 10Jbond: (WIP) use dnsmasq: add configueration to use dnsmasq with WMF config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/614787 [13:43:38] 10Operations, 10SRE-swift-storage, 10Traffic, 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10ema) p:05Medium→03Low [13:43:55] Ooh. Thanks. Can you write it on the ticket? 
I'm on phone currently [13:44:20] (03PS2) 10Vgutierrez: vcl: Use synthetic warning for ECDHE-RSA-AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/614763 (https://phabricator.wikimedia.org/T258405) [13:44:23] 10Operations, 10SRE-swift-storage, 10Traffic, 10affects-Kiwix-and-openZIM: Swift sends ETAG without double-quotes - https://phabricator.wikimedia.org/T256217 (10ema) @Kelson: alright, this will be solved with the upgrade to Bullseye then. Thank you! [13:44:28] (03CR) 10Vgutierrez: vcl: Use synthetic warning for ECDHE-RSA-AES128-SHA pageviews (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/614763 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [13:45:22] Amir1: actually, nvm, after re-reading the change, I got the wrong class name, "osm" is used in production, but role::osm::master is not [13:45:36] (03PS4) 10Ema: tlsproxy: drop websocket_support [puppet] - 10https://gerrit.wikimedia.org/r/573277 (https://phabricator.wikimedia.org/T238625) [13:46:04] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/573277 (https://phabricator.wikimedia.org/T238625) (owner: 10Ema) [13:46:09] Aaaah, okay [13:46:31] I'll followup on the patch [13:46:38] Thanks [13:46:39] !log draining restbase1020 for eventual reboot for kernel security update [13:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:56] (03PS4) 10Jbond: (WIP) use dnsmasq: add configueration to use dnsmasq with WMF config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/614787 [13:48:03] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:36] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/24024/icinga1001.wikimedia.org/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/615219 (https://phabricator.wikimedia.org/T258491) (owner: 10Filippo Giunchedi) [13:48:38] 
(03CR) 10Ema: [C: 03+2] tlsproxy: drop websocket_support [puppet] - 10https://gerrit.wikimedia.org/r/573277 (https://phabricator.wikimedia.org/T238625) (owner: 10Ema) [13:48:50] (03PS5) 10Ema: tlsproxy: drop websocket_support [puppet] - 10https://gerrit.wikimedia.org/r/573277 (https://phabricator.wikimedia.org/T238625) [13:50:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P11991 and previous config saved to /var/cache/conftool/dbconfig/20200721-135028-marostegui.json [13:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:54] (03CR) 10Zfilipin: "> Patch Set 2: Code-Review+1" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [13:51:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:54] (03CR) 10Muehlenhoff: (WIP) use dnsmasq: add configueration to use dnsmasq with WMF config (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/614787 (owner: 10Jbond) [13:55:03] !log draining restbase1021 for eventual reboot for kernel security update [13:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:45] (03CR) 10Andrew Bogott: [C: 03+2] wmcs: Add redirector site for wmcloud.org and www.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/614882 (https://phabricator.wikimedia.org/T258415) (owner: 10BryanDavis) [13:57:19] (03PS4) 10JMeybohm: New package version [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/611252 [13:57:51] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:21] (03PS1) 10Jbond: rename: rename source package to wmf-laptop [debs/wmf-sre-laptop] - 
10https://gerrit.wikimedia.org/r/615224 [13:58:46] (03PS5) 10Jbond: (WIP) use dnsmasq: add configueration to use dnsmasq with WMF config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/614787 [14:00:24] (03CR) 10Jbond: (WIP) use dnsmasq: add configueration to use dnsmasq with WMF config (031 comment) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/614787 (owner: 10Jbond) [14:00:36] (03CR) 10Elukey: "Yes exactly this can stay with LDAP, it is not exposed anywhere, we use it via ssh when needed! :)" [puppet] - 10https://gerrit.wikimedia.org/r/615169 (owner: 10Jbond) [14:01:42] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:55] !log draining restbase1022 for eventual reboot for kernel security update [14:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:01] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/615212 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:04:44] (03PS4) 10Muehlenhoff: Switch Turnilo to CAS [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584) [14:06:34] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:06:36] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:46] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:06:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:07:24] (03CR) 10Ema: [C: 03+1] varnish::logging: move default definitions inline [puppet] - 10https://gerrit.wikimedia.org/r/605272 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [14:08:28] (03PS2) 10Filippo Giunchedi: icinga: add logs retention [puppet] - 10https://gerrit.wikimedia.org/r/615219 (https://phabricator.wikimedia.org/T258491) [14:09:03] (03CR) 10Filippo Giunchedi: [C: 03+1] profile::thanos::httpd: move defaults to hiera [puppet] - 10https://gerrit.wikimedia.org/r/615212 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:09:48] (03CR) 10Ema: [C: 03+1] trafficserver::instance: move single line scripts inline [puppet] - 10https://gerrit.wikimedia.org/r/605279 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [14:10:38] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:41] (03CR) 10Vidhi-Mody: "> Patch Set 2:" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [14:12:25] (03CR) 10Filippo Giunchedi: "Thanks for working on this! At the moment thanos-query.wikimedia.org isn't a thing yet but will be eventually for sure. 
The virtual host i" [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:13:46] (03CR) 10Zfilipin: "> Patch Set 2:" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [14:13:48] (03Abandoned) 10Vidhi-Mody: Test: Verfiy pushing a patch to Gerrit [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614828 (owner: 10Vidhi-Mody) [14:15:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:16:56] !log draining restbase1023 for eventual reboot for kernel security update [14:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:38] (03CR) 10Ema: [C: 03+1] vcl: ratelimit search API calls [puppet] - 10https://gerrit.wikimedia.org/r/609475 (owner: 10CDanis) [14:17:47] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10CBogen) [14:18:03] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:18:06] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10CBogen) @Joe done, thanks! 
[14:18:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P11992 and previous config saved to /var/cache/conftool/dbconfig/20200721-141813-marostegui.json [14:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:19:40] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:48] 10Operations, 10Wikimedia-Apache-configuration: Encoding discrepancy in https://wikipedia.org redirects - https://phabricator.wikimedia.org/T257608 (10Joe) The problem presents even if we start with `www.wikipedia.org`: ` $ curl -I -L "https://www.wikipedia.org/wiki/Special:Redirect/file/(61-365)_Can_you_imag... [14:20:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:21:18] (03CR) 10Jgreen: [V: 03+2 C: 03+2] Add icinga check for recurring contributions in a processing state [puppet] - 10https://gerrit.wikimedia.org/r/614898 (https://phabricator.wikimedia.org/T258013) (owner: 10Dwisehaupt) [14:22:51] (03CR) 10Jbond: [C: 03+2] varnish::logging: move default definitions inline [puppet] - 10https://gerrit.wikimedia.org/r/605272 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [14:23:18] (03PS2) 10Jbond: trafficserver::instance: move single line scripts inline [puppet] - 10https://gerrit.wikimedia.org/r/605279 (https://phabricator.wikimedia.org/T254480) [14:23:32] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:23:35] (03CR) 10Jbond: [C: 03+2] profile::thanos::httpd: move defaults to hiera [puppet] - 10https://gerrit.wikimedia.org/r/615212 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [14:23:49] (03CR) 10Jbond: [C: 03+2] trafficserver::instance: move single line scripts inline [puppet] - 10https://gerrit.wikimedia.org/r/605279 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [14:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:49] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:24:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:19] (03PS1) 10Muehlenhoff: Configure yarn/testcluster for LDAP auth only [puppet] - 10https://gerrit.wikimedia.org/r/615225 [14:26:32] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/613626 (https://phabricator.wikimedia.org/T159584) (owner: 10Muehlenhoff) [14:26:35] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Slowly repool db1119', diff saved to https://phabricator.wikimedia.org/P11993 and previous config saved to /var/cache/conftool/dbconfig/20200721-142634-marostegui.json [14:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:52] 10Operations, 10Wikimedia-Apache-configuration, 10serviceops: Encoding discrepancy in https://wikipedia.org redirects - https://phabricator.wikimedia.org/T257608 (10Joe) p:05Triage→03Low [14:30:01] (03CR) 10Muehlenhoff: "I've made https://gerrit.wikimedia.org/r/c/operations/puppet/+/615225/ to switch the test cluster Yarn installation explicitly to LDAP (wh" [puppet] - 10https://gerrit.wikimedia.org/r/615169 (owner: 10Jbond) [14:30:59] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/615169 (owner: 10Jbond) [14:31:15] (03Abandoned) 10Jbond: analytics_test_cluster::hadoop::ui: add idp config [puppet] - 10https://gerrit.wikimedia.org/r/615169 (owner: 10Jbond) [14:32:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1119', diff saved to https://phabricator.wikimedia.org/P11994 and previous config saved to /var/cache/conftool/dbconfig/20200721-143204-marostegui.json [14:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:09] (03PS2) 10Jbond: Configure yarn/testcluster for LDAP auth only [puppet] - 10https://gerrit.wikimedia.org/r/615225 (owner: 10Muehlenhoff) [14:32:24] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/615225 (owner: 10Muehlenhoff) [14:33:08] (03PS1) 10Vidhi-Mody: Selenium: Update to WebdriverIO v5 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615227 (https://phabricator.wikimedia.org/T255471) [14:35:11] (03CR) 10Jbond: [C: 03+1] "LGTM: PCC shows noop but thats because its broken in production" [puppet] - 10https://gerrit.wikimedia.org/r/615225 (owner: 10Muehlenhoff) [14:35:45] !log draining restbase1024 for eventual reboot for kernel 
security update [14:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:50] !log decrease codfw mobileapps kubernetes traffic to 72% T218733. Weird latency patterns exhibited when 92% was reached. See https://grafana.wikimedia.org/d/5CmeRcnMz/mobileapps?panelId=34&fullscreen&orgId=1&from=1595338489749&to=1595342071227&var-dc=codfw%20prometheus%2Fk8s&var-service=mobileapps&var-container_name=All [14:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:55] T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733 [14:35:59] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: dc=codfw,service=mobileapps,name=scb.* [14:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:16] (03PS1) 10Vidhi-Mody: Selenium: Update to WebdriverIO v5 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615228 (https://phabricator.wikimedia.org/T255471) [14:38:32] (03Abandoned) 10Vidhi-Mody: Selenium: Update to WebdriverIO v5 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615228 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [14:39:14] (03PS3) 10Elukey: Add PTR/AAAA records for db1108 [dns] - 10https://gerrit.wikimedia.org/r/614751 (https://phabricator.wikimedia.org/T234826) [14:39:24] (03CR) 10Zfilipin: "Duplicate of https://gerrit.wikimedia.org/r/c/phabricator/deployment/+/614829 ?" 
[phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615227 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [14:44:36] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:06] (03CR) 10Vidhi-Mody: "> Patch Set 1:" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615227 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [14:46:05] (03PS1) 10BBlack: Remove mobile redirect for donate.wp [puppet] - 10https://gerrit.wikimedia.org/r/615229 [14:48:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:24] (03PS1) 10Ottomata: Revert "Don't use merged Hive + event schema when reading raw event data" [puppet] - 10https://gerrit.wikimedia.org/r/615192 [14:50:35] (03CR) 10Vidhi-Mody: "I still have the terminal output! Any idea where I went wrong?" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615227 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [14:50:37] (03CR) 10jerkins-bot: [V: 04-1] Revert "Don't use merged Hive + event schema when reading raw event data" [puppet] - 10https://gerrit.wikimedia.org/r/615192 (owner: 10Ottomata) [14:51:21] !log draining restbase1025 for eventual reboot for kernel security update [14:51:22] (03PS2) 10Ottomata: Revert "Don't use merged Hive + event schema when reading ..." [puppet] - 10https://gerrit.wikimedia.org/r/615192 [14:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:36] (03CR) 10jerkins-bot: [V: 04-1] Revert "Don't use merged Hive + event schema when reading ..." 
[puppet] - 10https://gerrit.wikimedia.org/r/615192 (owner: 10Ottomata) [14:52:56] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:40] (03PS3) 10Ottomata: Revert - Refine Don't use merged Hive + event schema when reading [puppet] - 10https://gerrit.wikimedia.org/r/615192 [14:56:36] (03PS1) 10Volans: puppetdb microservice: add some filtering [puppet] - 10https://gerrit.wikimedia.org/r/615232 [14:56:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:10] (03PS2) 10Volans: puppetdb microservice: add some filtering [puppet] - 10https://gerrit.wikimedia.org/r/615232 (https://phabricator.wikimedia.org/T244153) [14:57:11] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [14:57:13] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:22] (03CR) 10Ottomata: [C: 03+2] Revert - Refine Don't use merged Hive + event schema when reading [puppet] - 10https://gerrit.wikimedia.org/r/615192 (owner: 10Ottomata) [14:57:46] (03CR) 10Zfilipin: "> Patch Set 1:" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615227 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [14:58:22] (03CR) 10jerkins-bot: [V: 04-1] puppetdb microservice: add some filtering [puppet] - 10https://gerrit.wikimedia.org/r/615232 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [14:58:30] (03PS19) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [14:58:32] (03PS1) 10Hnowlan: ratelimit: create subchart in api-gateway [deployment-charts] - 
10https://gerrit.wikimedia.org/r/615233 (https://phabricator.wikimedia.org/T254906) [14:59:16] (03CR) 10Vgutierrez: [C: 03+2] vcl: Use synthetic warning for ECDHE-RSA-AES128-SHA pageviews [puppet] - 10https://gerrit.wikimedia.org/r/614763 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [14:59:26] (03CR) 10jerkins-bot: [V: 04-1] ratelimit: create subchart in api-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/615233 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [15:00:11] !log draining restbase1026 for eventual reboot for kernel security update [15:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:17] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Nuria) @CBogen Have you pinged the data analysts about working with you in gathering this data? it exists in quite a raw form and querying it requires famil... 
[15:01:08] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:01:11] !log show a synthetic warning for traffic using ECDHE-RSA-AES128-SHA - T258405 [15:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:17] T258405: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 [15:02:19] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/615213 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond) [15:04:10] PROBLEM - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:04:15] here, looking [15:04:21] (03CR) 10Andrew Bogott: "This isn't a super comprehensive test, but looks promising: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/2" [puppet] - 10https://gerrit.wikimedia.org/r/615159 (owner: 10Jbond) [15:04:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:01] I'm here too if needed [15:05:05] +1 [15:05:08] deploy or reboot? [15:05:09] o/ [15:05:12] I just got the page, im about for assist [15:05:13] hey [15:05:13] * apergos peeks in [15:05:16] (03CR) 10Andrew Bogott: [C: 03+1] "Given that this doesn't modify eqiad1, I'm fine with you merging this whenever." [puppet] - 10https://gerrit.wikimedia.org/r/613112 (owner: 10Jbond) [15:05:50] (03CR) 10Andrew Bogott: "This looks good to me, but let's coordinate so I can watch for after-effects when you merge it." [puppet] - 10https://gerrit.wikimedia.org/r/615159 (owner: 10Jbond) [15:05:53] it looks like pybal restarted? 
[15:05:54] RECOVERY - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 24704 bytes in 0.531 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:06:09] no, the process didn't restart [15:06:35] but at 15:00 precisely it re-initialized itself [15:07:17] anyway I don't think there was much user impact for a codfw hiccup [15:07:21] no user impact anyway? [15:07:22] bgp was up all the time [15:07:32] weird [15:07:34] jynus: well, some services are active/active [15:07:54] restbase flopped too [15:07:57] https://phabricator.wikimedia.org/P11996 [15:08:00] yeah I'm sure every service did [15:08:28] this looks like pybal initializing itself from scratch, wherein it clears IPVS state on the machine (so you lose all current forwardings to realservers) [15:08:28] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime [15:08:29] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:08:37] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [15:08:37] but I have no idea... why [15:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:59] !log poweroff ms-be1024 for bbu replacement - T257949 [15:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:09] T257949: Degraded RAID on ms-be1024 - https://phabricator.wikimedia.org/T257949 [15:09:13] we didn't do anything to etcd in codfw, right? [15:09:19] * volans not [15:09:26] not that I know off [15:09:40] cdanis: nah.. 
that's just pybal reconnecting to etcd [15:09:48] hm [15:10:15] or just getting a new config for the api service [15:10:19] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [15:10:23] after a server got [de]pooled [15:10:26] !log draining restbase1027 for eventual reboot for kernel security update [15:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:32] vgutierrez: it must be the former, it happened for all services [15:10:33] the alert btw is confusing [15:10:35] (03PS3) 10Volans: puppetdb microservice: add some filtering [puppet] - 10https://gerrit.wikimedia.org/r/615232 (https://phabricator.wikimedia.org/T244153) [15:10:44] MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.codfw.wmnet is OK ? [15:10:53] which is it? codfw or eqiad ? [15:10:57] it's codfw [15:10:59] careful with the hashtag page [15:11:01] :) [15:11:02] it's possible to make sense out of it [15:11:04] but it doesn't help [15:11:15] it should be more clear when in an emergency [15:11:15] it's a bug in the macro expansion [15:11:19] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:11:21] yeah I agree [15:11:23] cdanis: so that looks like a net issue.. 
cause pybal keeps one connection per configured service to etcd [15:12:32] hm [15:12:40] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:22] (03PS4) 10Nskaggs: Add nskaggs key and grant access to WMCS related groups [puppet] - 10https://gerrit.wikimedia.org/r/614847 (https://phabricator.wikimedia.org/T258438) [15:13:37] (03CR) 10jerkins-bot: [V: 04-1] Add nskaggs key and grant access to WMCS related groups [puppet] - 10https://gerrit.wikimedia.org/r/614847 (https://phabricator.wikimedia.org/T258438) (owner: 10Nskaggs) [15:15:05] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:15:08] okay, this correlates, and tracks with that vgutierrez: Jul 21 15:00:00 conf2002 etcdmirror-conftool-eqiad-wmnet[7141]: [etcd-mirror] INFO: Replicating key /conftool/v1/pools/codfw/api_appserver/apache2/mw2338.codfw.wmnet at index 336796 [15:15:19] still doesn't explain the pybal blip [15:15:32] cdanis: hmm checking the log.. it looks like pybal reconnects to etcd every hour? [15:15:50] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10CBogen) @Nuria, @EBernhardson from the Search team is gathering the data, see T257361 [15:16:15] not every hour... 
[15:16:17] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1024 - https://phabricator.wikimedia.org/T257949 (10Jclark-ctr) finished with bbu swap [15:16:31] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:36] vgutierrez@lvs2009:~$ sudo -i journalctl -u pybal --since=today |grep config-etcd |grep cleanly|wc -l [15:16:36] 70 [15:17:22] <_joe_> I got the page now [15:17:28] <_joe_> way to go victorops [15:17:40] wow, that's not nice [15:17:50] <_joe_> bblack: my phone has low battery [15:17:57] <_joe_> so maybe that's why [15:18:08] cdanis: so pybal reconnected to etcd for apache2|nginx services at 10, 14 and 15 [15:18:14] I got it about the same time chris said "here, looking" on my IRC window [15:18:19] FWIW I got the victorops page 12 minutes ago [15:18:33] yall need to have your IRC clients highlight on # p a g e ;) [15:18:40] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Nuria) @CBogen I imagine @EBernhardson will be putting that data in superset (pinging him here so he can let us know) in which case you just need permits... [15:18:41] I do that as well ;P [15:18:46] <_joe_> pybal should only reinitialize on a restart [15:18:59] <_joe_> cdanis: I wasn't looking at irc, doing some heavy task reading :P [15:19:00] it isn't reinitializing anything [15:19:57] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10ArthurPSmith) Something seems to be going on very recently that's a different pattern - did something... [15:20:02] <_joe_> so it's both pybals at the same time? 
[15:20:24] it would only need to be the primary [15:20:27] there was no BGP transition [15:20:39] <_joe_> oh so bgp stayed up I see [15:20:59] but yeah... it's the same on lvs2010 [15:21:16] so it looks to me that something triggered a change for those services at etcd levl [15:21:20] *level [15:21:24] <_joe_> yes [15:21:29] ... on conf1005, every 50 seconds, the system is detecting a new USB hub [15:21:31] <_joe_> what time are we talking about? [15:21:32] *what* [15:21:36] _joe_: 15:00:01 [15:21:45] <_joe_> Jul 21 15:00:01 lvs2009 pybal[27818]: [api-https_443] INFO: Merged disabled server mw2338.codfw.wmnet, weight 15 [15:21:52] <_joe_> this was disabled then [15:21:53] yes [15:22:05] <_joe_> that happens some 10s of times a day [15:22:13] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:22:45] <_joe_> also, pybal in codfw connects to etcd in codfw [15:23:03] so that got me to look at android a bit. On my Android 10: Settings -> Apps&Notifications -> Advanced -> Special app access -> Battery Optimization -> (Dropdown starts at "Not Optimized", flip it to "All Apps") -> Scroll down to VictorOps, click it and select "Don't Optimize". [15:23:28] so that android can't sleep the app when it feels like it to save battery [15:23:30] <_joe_> bblack: I don't have android 10 but I'll check [15:23:40] _joe_: yes, etcd-mirror also logged those changes on conf2002 [15:23:43] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10EBernhardson) This data isn't going to be available in superset, Carly is looking for a report on the top 10k queries to a wiki. We have a script that gene... 
[15:24:12] <_joe_> cdanis: https://access.redhat.com/discussions/2914291 for the usb hub [15:24:21] <_joe_> it's the iLO [15:24:34] lovely [15:24:45] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/614847 (https://phabricator.wikimedia.org/T258438) (owner: 10Nskaggs) [15:25:40] (03CR) 10Andrew Bogott: "+1 for ripping this code out entirely" [puppet] - 10https://gerrit.wikimedia.org/r/613727 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [15:27:02] (03CR) 10CRusnov: [C: 04-1] "I would prefer this be done in the import script, I feel as though this script should be as free of logic and purpose as possible." [puppet] - 10https://gerrit.wikimedia.org/r/615232 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [15:27:04] okay well I have no idea what happened, unless Icinga will notify about that one after a single failed request (which I doubt) [15:27:53] (03CR) 10Muehlenhoff: "Confirmed, that role::osm::master is unused in production, +1 for removing wholesale" [puppet] - 10https://gerrit.wikimedia.org/r/613727 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [15:29:28] (03PS1) 10Muehlenhoff: Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/615237 [15:29:48] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Nuria) @EBernhardson let's talk about a better process for this. If all that is required is access to a file it can be placed on the stats machines on a known... 
[15:37:34] 10Operations, 10ops-eqsin, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10RobH) [15:39:13] 10Operations, 10Pywikibot, 10Traffic, 10HTTPS: Configure HTTPS for pywikibot.org - https://phabricator.wikimedia.org/T257537 (10ema) p:05Triage→03Medium [15:40:49] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Nuria) Let's please add @Cbogen to wmf ldap so she has access to superset as well. [15:44:53] (03PS3) 10Jdlrobson: Enable instrumentation for wikis in the desktop improvements testing group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T254228) [15:50:51] (03PS1) 10Ebernhardson: airflow: Include refinery python dependencies [puppet] - 10https://gerrit.wikimedia.org/r/615240 [15:50:55] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Joe) 05Open→03Stalled a:05Joe→03None >>! In T258413#6323137, @Nuria wrote: > Let's please add @Cbogen to wmf ldap so she has access to superset as w... [15:52:18] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Nuria) @CBogen can you confirm that you have access to https://superset.wikimedia.org ? (you need to use your user/password with which you log in into wikit... [15:53:46] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10CBogen) @Nuria yes, I already have access to superset, thanks. My understanding is that this data that Erik is gathering for me is considered sensitive and... 
[15:53:47] (03PS1) 10Ladsgroup: osm: Drop role::osm::master [puppet] - 10https://gerrit.wikimedia.org/r/615241 [15:54:17] (03CR) 10Ladsgroup: "Done: Id79d07198dd" [puppet] - 10https://gerrit.wikimedia.org/r/613727 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [15:57:58] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Nuria) > this data that Erik is gathering for me is considered sensitive and can't be placed in an alternate location Not quite correct, the permits you are... [15:58:38] (03CR) 10Ppchelko: "A few other thingies I've noticed, after that I think we should just merge it as soon as images are ready and address whatever else comes " (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [15:58:39] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10EBernhardson) Search queries in general are considered PII, so a report containing a list of 10k queries essentially still counts as PII. We can probably pu... [15:58:42] (03CR) 10Ppchelko: [C: 04-1] api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [16:00:05] godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200721T1600). [16:00:05] Amir1: A patch you scheduled for Puppet request window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:27] ooh, we have been discussing this patch already [16:00:40] I made this instead https://gerrit.wikimedia.org/r/c/operations/puppet/+/615241 [16:00:54] <_joe_> uh? [16:01:10] <_joe_> ahahah [16:01:19] <_joe_> is it unused though? [16:01:44] (03PS1) 10Nskaggs: Merge "maps: remove no-longer-used slave files" [labs/private] - 10https://gerrit.wikimedia.org/r/615242 [16:01:47] (03PS1) 10Nskaggs: Add nskaggs key [labs/private] - 10https://gerrit.wikimedia.org/r/615243 (https://phabricator.wikimedia.org/T255220) [16:02:21] (03Abandoned) 10Nskaggs: Merge "maps: remove no-longer-used slave files" [labs/private] - 10https://gerrit.wikimedia.org/r/615242 (owner: 10Nskaggs) [16:04:19] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Nuria) >Search queries in general are considered PII, so a report containing a list of 10k queries essentially still counts as PII. This is correct and supe... [16:06:38] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:08:05] (03PS2) 10Nskaggs: Add nskaggs key [labs/private] - 10https://gerrit.wikimedia.org/r/615243 (https://phabricator.wikimedia.org/T255220) [16:08:06] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:08:08] <_joe_> can someone look at logstash? [16:08:25] is logstash-next? 
[16:08:31] (both can ) [16:10:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] osm: Drop role::osm::master [puppet] - 10https://gerrit.wikimedia.org/r/615241 (owner: 10Ladsgroup) [16:11:46] <_joe_> Amir1: merged, thanks! [16:16:33] !log ppchelko@deploy1001 Started deploy [restbase/deploy@4f3cb41]: Add new wikis to RESTBase [16:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:00] (03PS1) 10JMeybohm: blubberoid: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615245 (https://phabricator.wikimedia.org/T256843) [16:20:02] (03PS1) 10JMeybohm: citoid: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615246 (https://phabricator.wikimedia.org/T256843) [16:20:04] (03PS1) 10JMeybohm: cxserver: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615247 (https://phabricator.wikimedia.org/T256843) [16:20:06] (03PS1) 10JMeybohm: eventgate-analytics-external: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615248 (https://phabricator.wikimedia.org/T256843) [16:20:08] (03PS1) 10JMeybohm: eventgate-analytics: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615249 (https://phabricator.wikimedia.org/T256843) [16:20:10] (03PS1) 10JMeybohm: eventgate-logging-external: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615250 (https://phabricator.wikimedia.org/T256843) [16:20:12] (03PS1) 10JMeybohm: eventgate-main: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615251 (https://phabricator.wikimedia.org/T256843) [16:20:14] (03PS1) 10JMeybohm: eventstreams: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615252 (https://phabricator.wikimedia.org/T256843) [16:20:16] (03PS1) 10JMeybohm: mobileapps: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615253 
(https://phabricator.wikimedia.org/T256843) [16:20:18] (03PS1) 10JMeybohm: proton: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615254 (https://phabricator.wikimedia.org/T256843) [16:20:20] (03PS1) 10JMeybohm: termbox: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615255 (https://phabricator.wikimedia.org/T256843) [16:20:23] (03PS1) 10JMeybohm: wikifeeds: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615256 (https://phabricator.wikimedia.org/T256843) [16:20:24] (03PS1) 10JMeybohm: zotero: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615257 (https://phabricator.wikimedia.org/T256843) [16:21:08] !log 1.36.0-wmf.1 was branched at 3a1faac3764ecae8dde813bd67a5a8e8f4975a85 for T257969 [16:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:13] T257969: 1.36.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T257969 [16:21:45] (03CR) 10Jeena Huneidi: [C: 03+2] Branch commit for wmf/1.36.0-wmf.1 [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/614913 (https://phabricator.wikimedia.org/T257969) (owner: 10TrainBranchBot) [16:22:07] (03Abandoned) 10Andrew Bogott: osm: Rename "slave" to "replica" [puppet] - 10https://gerrit.wikimedia.org/r/613727 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [16:27:10] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@4f3cb41]: Add new wikis to RESTBase (duration: 10m 37s) [16:27:10] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Nuria) Commented for a possible approach here: {T257361} [16:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:16] !log ppchelko@deploy1001 Started deploy [restbase/deploy@4f3cb41]: Add new wikis to RESTBase, take 2 [16:27:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:33] (03CR) 10Herron: [C: 03+1] profile: add alert on no logs ingested [puppet] - 10https://gerrit.wikimedia.org/r/615164 (https://phabricator.wikimedia.org/T257294) (owner: 10Filippo Giunchedi) [16:30:50] (03CR) 10Herron: [C: 03+1] "LGTM! Curious, why 780 days?" [puppet] - 10https://gerrit.wikimedia.org/r/615219 (https://phabricator.wikimedia.org/T258491) (owner: 10Filippo Giunchedi) [16:32:02] (03CR) 10RLazarus: [C: 03+1] eventgate-analytics: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615249 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:32:10] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@4f3cb41]: Add new wikis to RESTBase, take 2 (duration: 04m 54s) [16:32:13] (03CR) 10RLazarus: [C: 03+1] blubberoid: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615245 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:47] PROBLEM - proton LVS codfw on proton.svc.codfw.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton [16:32:56] (03CR) 10RLazarus: [C: 03+1] citoid: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615246 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:33:10] (03CR) 10RLazarus: [C: 03+1] eventgate-analytics-external: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615248 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:33:20] (03CR) 10RLazarus: [C: 03+1] cxserver: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615247 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) 
[16:33:40] (03CR) 10RLazarus: [C: 03+1] eventgate-logging-external: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615250 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:34:07] RECOVERY - proton LVS codfw on proton.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton [16:34:49] (03PS2) 10CDanis: Remove mobile redirect for donate.wp & update VTC [puppet] - 10https://gerrit.wikimedia.org/r/615229 (owner: 10BBlack) [16:35:29] (03CR) 10RLazarus: [C: 03+1] termbox: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615255 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:36:16] (03CR) 10RLazarus: [C: 03+1] eventstreams: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615252 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:36:59] (03PS1) 10JMeybohm: _scaffold: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615259 (https://phabricator.wikimedia.org/T256843) [16:37:50] (03CR) 10RLazarus: [C: 03+1] eventgate-main: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615251 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:38:07] (03CR) 10RLazarus: [C: 03+1] mobileapps: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615253 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:38:28] (03CR) 10RLazarus: [C: 03+1] proton: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615254 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:38:41] (03CR) 10RLazarus: [C: 03+1] zotero: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615257 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:38:55] (03CR) 10RLazarus: [C: 03+1] wikifeeds: Update envoy to 1.14.4-1 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/615256 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:39:37] (03CR) 10RLazarus: [C: 03+1] _scaffold: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615259 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm) [16:40:42] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.1 [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/614913 (https://phabricator.wikimedia.org/T257969) (owner: 10TrainBranchBot) [16:49:28] (03PS1) 10Lucas Werkmeister (WMDE): extension-list: Load WikibaseRepo via JSON [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615263 [16:50:38] (03PS18) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [16:50:51] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "Let’s not merge this before wmf.1 has reached at least group0." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615263 (owner: 10Lucas Werkmeister (WMDE)) [16:51:03] (03CR) 10jerkins-bot: [V: 04-1] customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [16:52:12] (03PS19) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [16:55:25] (03CR) 10Herron: "> > I don't have the insight into OTRS, but one other angle might be switch OTRS addresses to something like user@otrs.wikimedia.org (or w" [puppet] - 10https://gerrit.wikimedia.org/r/605568 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [16:58:34] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10dpifke) @Joe: if you gzip the logs 
before https://gerrit.wikimedia.org/r/c/performance/arc-lamp/+/613740/ lands, they won't be expired, the related output SVG file... [17:00:04] halfak and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200721T1700). [17:00:16] (03CR) 10BBlack: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615229 (owner: 10BBlack) [17:01:54] (03PS1) 10Lucas Werkmeister (WMDE): Load WikibaseClient using extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615265 [17:02:49] (03CR) 10Lucas Werkmeister (WMDE): "I don’t think this is blocked by the train – the JSON file should be usable in wmf.41 already. But we can also wait for wmf.1 if we want." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615265 (owner: 10Lucas Werkmeister (WMDE)) [17:05:34] (03CR) 10CDanis: [C: 03+2] Remove mobile redirect for donate.wp & update VTC [puppet] - 10https://gerrit.wikimedia.org/r/615229 (owner: 10BBlack) [17:10:29] !log jhuneidi@deploy1001 Pruned MediaWiki: 1.35.0-wmf.39 (duration: 16m 25s) [17:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:28] (03CR) 10Mepps: "Is there a way to know when this is deployed? Thanks for fixing it so quickly!" 
[puppet] - 10https://gerrit.wikimedia.org/r/615229 (owner: 10BBlack) [17:20:57] (03CR) 10Ladsgroup: "I already made this: Iedfb4eba3b :P" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615265 (owner: 10Lucas Werkmeister (WMDE)) [17:22:46] (03Abandoned) 10Lucas Werkmeister (WMDE): Load WikibaseClient using extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615265 (owner: 10Lucas Werkmeister (WMDE)) [17:24:45] (03PS2) 10Volans: mgmt: netbox-generated data for frack mgmt eqiad [dns] - 10https://gerrit.wikimedia.org/r/612472 (https://phabricator.wikimedia.org/T233183) [17:25:06] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615162 (https://phabricator.wikimedia.org/T258140) (owner: 10Giuseppe Lavagetto) [17:26:54] (03CR) 10CDanis: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/615229 (owner: 10BBlack) [17:28:38] (03CR) 10JMeybohm: charts for push-notification service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos) [17:30:26] (03CR) 10Volans: [C: 03+2] mgmt: netbox-generated data for frack mgmt eqiad (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/612472 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [17:30:31] (03PS3) 10MusikAnimal: Enable $wgWatchlistExpiry on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614832 (https://phabricator.wikimedia.org/T257506) [17:34:42] (03PS1) 10Cwhite: profile: install statsd_exporter and retarget statsv [puppet] - 10https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) [17:41:00] I can’t seem to connect to bast3004.wikimedia.org (esams)… [17:42:27] bast1002 (eqiad) works [17:42:45] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.36.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615271 [17:43:11] Lucas_WMDE: the host is up, can you give me a: mtr -zwb --tcp --port 22 bast3004.wikimedia.org 
[17:44:29] (03PS2) 10Jdlrobson: Enable desktop improvements for anons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614889 [17:44:30] cdanis: https://phabricator.wikimedia.org/P11999 [17:44:40] (03PS4) 10Jdlrobson: Enable instrumentation for wikis in the desktop improvements testing group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T254228) [17:45:01] wow [17:45:24] Lucas_WMDE: can you try again, but with -4 ? [17:45:28] added eqiad in a comment [17:45:30] sure [17:45:52] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:45:52] that looks better [17:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:06] (also in paste comment) [17:46:09] I don't believe Vodafone GmbH has a looking glass at which to inspect their routes [17:46:17] but it does look to be a problem within their network, or on the edge of it [17:46:57] hm, ok [17:47:03] (I didn’t even think about trying IPv4, thanks) [17:49:18] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.36.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615271 (owner: 10Jeena Huneidi) [17:50:04] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615271 (owner: 10Jeena Huneidi) [17:50:20] !log volans@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:16] !log jhuneidi@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.1 [17:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:05] (03PS3) 10Bstorm: Add nskaggs key [labs/private] - 10https://gerrit.wikimedia.org/r/615243 (https://phabricator.wikimedia.org/T255220) (owner: 10Nskaggs) [17:58:34] (03PS1) 10Herron: prometheus: introduce role::prometheus::pop [puppet] - 10https://gerrit.wikimedia.org/r/615273 (https://phabricator.wikimedia.org/T243057) 
[17:59:02] (03CR) 10Mepps: "Thanks CDanis!" [puppet] - 10https://gerrit.wikimedia.org/r/615229 (owner: 10BBlack) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200721T1800) [18:00:08] (03CR) 10Herron: prometheus: introduce role::prometheus::pop (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615273 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [18:00:36] (03PS20) 10Hnowlan: api-gateway: Basic envoy chart WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) [18:02:12] (03Abandoned) 10Catrope: PageUpdater: fix handling of null edits [core] (wmf/1.35.0-wmf.40) - 10https://gerrit.wikimedia.org/r/615186 (https://phabricator.wikimedia.org/T257766) (owner: 10Kosta Harlan) [18:02:54] 10Operations, 10Beta-Cluster-Infrastructure: deployment-puppetmaster04: git-sync-upstream is failing with a merge conflict since 2020-07-17T08:50:01Z - https://phabricator.wikimedia.org/T258451 (10Krinkle) /cc @elukey @jbond per . 
[18:03:38] (03PS1) 10Catrope: PageUpdater: fix handling of null edits [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615196 (https://phabricator.wikimedia.org/T257766) [18:08:07] !log mforns@deploy1001 Started deploy [analytics/refinery@0c25de1]: Redeploying to unbreak unique devices per domain monthly [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd] [18:08:11] (03CR) 10Catrope: [C: 03+1] PageUpdater: fix handling of null edits [core] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/615187 (https://phabricator.wikimedia.org/T257766) (owner: 10Kosta Harlan) [18:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:10] (03CR) 10Hnowlan: api-gateway: Basic envoy chart WIP (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan) [18:13:09] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10Krinkle) >>! In T257931#6321760, @Joe wrote: > given the situation, it would be smart to start compressing older logs. In avoidance of doubt, these not "log" file... [18:13:39] !log mforns@deploy1001 Finished deploy [analytics/refinery@0c25de1]: Redeploying to unbreak unique devices per domain monthly [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd] (duration: 05m 32s) [18:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:51] !log mforns@deploy1001 Started deploy [analytics/refinery@0c25de1]: Redeploying to unbreak unique devices per domain monthly - second try [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd] [18:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:59] (03CR) 10SBassett: "> Why does this need to be cherry picked?" 
[extensions/UrlShortener] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/614772 (https://phabricator.wikimedia.org/T255491) (owner: 10Ammarpad) [18:17:08] !log mforns@deploy1001 Finished deploy [analytics/refinery@0c25de1]: Redeploying to unbreak unique devices per domain monthly - second try [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd] (duration: 00m 17s) [18:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:07] 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10Krinkle) [18:21:19] (03CR) 10Krinkle: "Does this fix a problem in production? Does it allow some other migration or feature to be enabled sooner? If not, probably not worth back" [extensions/UrlShortener] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/614772 (https://phabricator.wikimedia.org/T255491) (owner: 10Ammarpad) [18:21:30] !log mforns@deploy1001 Started deploy [analytics/refinery@0c25de1]: Redeploying to unbreak unique devices per domain monthly - third try [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd] [18:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:21:42] !log mforns@deploy1001 Finished deploy [analytics/refinery@0c25de1]: Redeploying to unbreak unique devices per domain monthly - third try [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd] (duration: 00m 12s) [18:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:11] !log mforns@deploy1001 Started deploy [analytics/refinery@0c25de1] (thin): Redeploying 
to unbreak unique devices per domain monthly THIN [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd] [18:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:18] !log mforns@deploy1001 Finished deploy [analytics/refinery@0c25de1] (thin): Redeploying to unbreak unique devices per domain monthly THIN [analytics/refinery@0c25de19a3a309276654b4463cca4f574336d8fd] (duration: 00m 07s) [18:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:41] (03CR) 10Andrew Bogott: [C: 03+1] Add nskaggs key [labs/private] - 10https://gerrit.wikimedia.org/r/615243 (https://phabricator.wikimedia.org/T255220) (owner: 10Nskaggs) [18:26:44] (03CR) 10Bstorm: [V: 03+2 C: 03+2] Add nskaggs key [labs/private] - 10https://gerrit.wikimedia.org/r/615243 (https://phabricator.wikimedia.org/T255220) (owner: 10Nskaggs) [18:33:38] !log jhuneidi@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.1 (duration: 41m 22s) [18:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:36:47] (03CR) 10Catrope: [C: 03+2] PageUpdater: fix handling of null edits [core] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/615187 (https://phabricator.wikimedia.org/T257766) (owner: 10Kosta Harlan) [18:36:52] (03CR) 10Catrope: [C: 03+2] PageUpdater: fix handling of null edits [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615196 (https://phabricator.wikimedia.org/T257766) (owner: 10Catrope) [18:41:08] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/24037/" [puppet] - 10https://gerrit.wikimedia.org/r/615273 
(https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [18:42:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:49:42] (03CR) 10Ottomata: Initial debian commit (031 comment) [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [18:55:26] (03Merged) 10jenkins-bot: PageUpdater: fix handling of null edits [core] (wmf/1.35.0-wmf.41) - 10https://gerrit.wikimedia.org/r/615187 (https://phabricator.wikimedia.org/T257766) (owner: 10Kosta Harlan) [18:55:55] (03PS2) 10Cwhite: profile: install statsd_exporter and retarget statsv [puppet] - 10https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) [18:56:41] (03Merged) 10jenkins-bot: PageUpdater: fix handling of null edits [core] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615196 (https://phabricator.wikimedia.org/T257766) (owner: 10Catrope) [18:57:56] (03CR) 10Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1001/24040/" [puppet] - 10https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [18:58:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:00:04] longma and liw: How many deployers does it take to do Mediawiki train - American+European Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200721T1900). 
[19:00:33] longma: Only just syncing my patches, will be done in 2-3 mins [19:00:41] Jenkins took 20 mins <_< [19:01:19] (03PS1) 10Kaldari: Updating UploadWizard template: PD-old-70-1923->PD-old-70-expired [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615286 (https://phabricator.wikimedia.org/T258523) [19:01:32] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.41/includes/Storage/PageUpdater.php: Fix handling of null edits (T257766) (duration: 01m 11s) [19:01:34] RoanKattouw: np, thanks for the update [19:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:38] T257766: Notification emails repeated every day - https://phabricator.wikimedia.org/T257766 [19:02:51] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.1/includes/Storage/PageUpdater.php: Fix handling of null edits (T257766) (duration: 01m 06s) [19:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:01] longma: OK all done [19:03:15] Thanks! [19:05:27] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10DStrine) 05Open→03Resolved [19:07:31] (03PS1) 10Jeena Huneidi: group0 wikis to 1.36.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615287 [19:07:33] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.36.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615287 (owner: 10Jeena Huneidi) [19:08:15] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615287 (owner: 10Jeena Huneidi) [19:10:43] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.1 [19:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:30] (03PS1) 10Cwhite: profile: add prometheus instance for statsv metrics [puppet] - 
10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) [19:16:22] (03PS2) 10Cwhite: profile: add prometheus instance for statsv metrics [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) [19:17:48] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T258364 (10CGlenn) No worries @Joe ! No worries. I wasn't sure either. Should I put in a new ticket to add wikimediafoundation.org as a property to Google Search Console? [19:44:41] (03PS1) 10Catrope: GrowthExperiments: variation C/D at 50%/50% in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615291 [19:47:22] (03CR) 10Catrope: [C: 03+2] GrowthExperiments: variation C/D at 50%/50% in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615291 (owner: 10Catrope) [19:48:08] (03Merged) 10jenkins-bot: GrowthExperiments: variation C/D at 50%/50% in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615291 (owner: 10Catrope) [19:49:10] (03PS3) 10Cwhite: profile: add prometheus instance for statsv metrics [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) [19:53:30] (03PS4) 10Cwhite: profile: add prometheus instance for statsv metrics [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) [19:53:51] (03CR) 10Cwhite: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24041/" [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite) [19:55:13] (03CR) 10Cwhite: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/615273 (https://phabricator.wikimedia.org/T243057) (owner: 10Herron) [19:58:48] (03CR) 10Krinkle: [C: 04-1] Enable instrumentation for wikis in the desktop improvements testing group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T254228) (owner: 10Jdlrobson) [20:07:06] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10Krinkle) Serving the same wiki through multiple hostnames is wholly unsupported in MediaWiki and... [20:18:52] 10Operations, 10ops-eqiad, 10DBA: db1145 crashed - memory issues - https://phabricator.wikimedia.org/T258249 (10Jclark-ctr) @Marostegui Replacement Dimm has arrived please reach out to me for scheduling down time i am available for the next 2 hours but will be on site tomorrow 9:30am est [20:25:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:36:22] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020): CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) Thanks @Trizek-WMF! It took a moment to get everything else lined up, but we're moving forward with September 1. [20:37:01] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:37:37] (03PS5) 10Jdrewniak: Enable instrumentation for wikis in the desktop improvements testing group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T254228) (owner: 10Jdlrobson) [20:39:19] (03PS3) 10Jdrewniak: Enable desktop improvements for anons (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614889 (owner: 10Jdlrobson) [20:40:41] (03PS2) 10Jdrewniak: Switch test wikis to new version of vector by default (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614890 (https://phabricator.wikimedia.org/T254227) (owner: 10Jdlrobson) [20:41:35] (03PS2) 10Jdrewniak: Switch test wikis to new version of vector by default (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227) (owner: 10Jdlrobson) [20:52:49] (03CR) 10Jdlrobson: Enable instrumentation for wikis in the desktop improvements testing group (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T254228) (owner: 10Jdlrobson) [20:57:07] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [20:57:19] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [21:00:09] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[21:02:47] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog2001.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[21:05:39] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[21:24:23] (CR) Dave Pifke: [C: +1] "I cherry picked this in beta and don't see anything obviously broken on deployment-webperf11; will leave it there if anyone else wants to " (1 comment) [puppet] - https://gerrit.wikimedia.org/r/615269 (https://phabricator.wikimedia.org/T180105) (owner: Cwhite)
[21:34:24] (Abandoned) Ammarpad: UrlShortener: Remove config renaming hack [extensions/UrlShortener] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/614772 (https://phabricator.wikimedia.org/T255491) (owner: Ammarpad)
[21:40:40] (PS6) Jdrewniak: Enable instrumentation for wikis in the desktop improvements testing group [mediawiki-config] - https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T254228) (owner: Jdlrobson)
[21:40:42] (PS1) QChris: Allow “Gerrit Managers” to import history [software/envoyproxy/ratelimiter] (refs/meta/config) - https://gerrit.wikimedia.org/r/615306
[21:40:44] (CR) QChris: [V: +2 C: +2] Allow “Gerrit Managers” to import history [software/envoyproxy/ratelimiter] (refs/meta/config) - https://gerrit.wikimedia.org/r/615306 (owner: QChris)
[21:41:38] (PS1) QChris: Import done. Revoke import grants [software/envoyproxy/ratelimiter] (refs/meta/config) - https://gerrit.wikimedia.org/r/615307
[21:41:40] (CR) QChris: [V: +2 C: +2] Import done. Revoke import grants [software/envoyproxy/ratelimiter] (refs/meta/config) - https://gerrit.wikimedia.org/r/615307 (owner: QChris)
[21:43:43] (PS4) Jdrewniak: Enable desktop improvements for anons (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/614889 (owner: Jdlrobson)
[22:09:30] (CR) RLazarus: [C: +1] envoy: fix deprecated filter names [puppet] - https://gerrit.wikimedia.org/r/615162 (https://phabricator.wikimedia.org/T258140) (owner: Giuseppe Lavagetto)
[22:20:10] Operations, ops-eqiad, Analytics-Clusters, User-Elukey: replace onboard NIC in kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T236327 (Jclark-ctr) @elukey can you let me know your availability for scheduling this project?
[22:21:14] (PS7) Jdrewniak: Enable instrumentation for wikis in the desktop improvements testing group [mediawiki-config] - https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T254228) (owner: Jdlrobson)
[22:22:10] (CR) Ppchelko: "LGTM. I believe it would be wise to just try to deploy it instead of trying to catch everything in review." [deployment-charts] - https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan)
[22:22:16] (CR) Ppchelko: [C: +1] api-gateway: Basic envoy chart WIP [deployment-charts] - https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan)
[22:22:34] Operations, ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (Jclark-ctr)
[22:22:40] (CR) Ppchelko: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/613160 (https://phabricator.wikimedia.org/T256771) (owner: Hnowlan)
[22:23:16] (PS3) BryanDavis: Add Cloud VPS global root key for Sam Reed [labs/private] - https://gerrit.wikimedia.org/r/612974 (https://phabricator.wikimedia.org/T249774)
[22:23:31] (CR) Ppchelko: [C: +1] api-gateway: Restrict unauthenticated write HTTP methods, permit read HTTP methods [deployment-charts] - https://gerrit.wikimedia.org/r/613650 (https://phabricator.wikimedia.org/T256769) (owner: Hnowlan)
[22:24:24] (PS8) Jdrewniak: Enable instrumentation for wikis in the desktop improvements testing group (round 1) [mediawiki-config] - https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T254228) (owner: Jdlrobson)
[22:25:13] (PS5) Jdrewniak: Enable desktop improvements for anons (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/614889 (owner: Jdlrobson)
[22:27:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:28:54] (PS9) Jdrewniak: Enable instrumentation for wikis in the desktop improvements testing group (round 1) [mediawiki-config] - https://gerrit.wikimedia.org/r/614888 (https://phabricator.wikimedia.org/T258058) (owner: Jdlrobson)
[22:29:08] Operations, ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (Jclark-ctr) @hnowlan replacement drive arrived today can you confirm drive can just be replaced. will take care of tomorrow.
[22:29:21] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:30:04] (PS6) Jdrewniak: Enable desktop improvements for anons (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/614889 (owner: Jdlrobson)
[22:35:46] (CR) Ppchelko: "A few questions inlined" (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/615233 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan)
[22:35:50] (CR) Ppchelko: [C: -1] ratelimit: create subchart in api-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/615233 (https://phabricator.wikimedia.org/T254906) (owner: Hnowlan)
[22:47:01] (PS7) Jdrewniak: Enable desktop improvements by default for testing group (round 1) [mediawiki-config] - https://gerrit.wikimedia.org/r/614889 (owner: Jdlrobson)
[23:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200721T2300).
[23:00:04] musikanimal and kaldari: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:30] I'm here!
[23:01:01] here!
[23:01:02] hi musikanimal!
[23:01:05] I can deploy today!
[23:01:13] Yay!
[23:01:18] ready when you are :)
[23:01:43] (CR) Urbanecm: [C: +2] "B&C" [mediawiki-config] - https://gerrit.wikimedia.org/r/614832 (https://phabricator.wikimedia.org/T257506) (owner: MusikAnimal)
[23:02:45] (Merged) jenkins-bot: Enable $wgWatchlistExpiry on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/614832 (https://phabricator.wikimedia.org/T257506) (owner: MusikAnimal)
[23:03:36] musikanimal: ready for you to test at mwdebug1001
[23:03:48] doing
[23:04:46] looks good 👍
[23:05:03] syncing then!
[23:06:41] (CR) Urbanecm: [C: +2] "B&C" [mediawiki-config] - https://gerrit.wikimedia.org/r/615286 (https://phabricator.wikimedia.org/T258523) (owner: Kaldari)
[23:06:54] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 7acc9d966a07d589bb6aed5f801c9e1defc75fe1: Enable $wgWatchlistExpiry on testwiki (T257506) (duration: 01m 08s)
[23:07:00] musikanimal: here you go :)
[23:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:01] T257506: Watchlist Expiry: Enable feature on testwiki [small] - https://phabricator.wikimedia.org/T257506
[23:07:27] (Merged) jenkins-bot: Updating UploadWizard template: PD-old-70-1923->PD-old-70-expired [mediawiki-config] - https://gerrit.wikimedia.org/r/615286 (https://phabricator.wikimedia.org/T258523) (owner: Kaldari)
[23:07:32] looks good, thank you!
[23:07:43] happy to help!
[23:08:08] kaldari: ready for you to test at mwdebug1001
[23:08:14] ok ....
[23:08:49] should only take a sec....
[23:10:40] test done. looks good!
[23:10:47] thank you, syncing!
[23:12:45] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 7a50168d54b5e86834606fb8d7880eb3a923ffd5: Updating UploadWizard template: PD-old-70-1923->PD-old-70-expired (T258523) (duration: 01m 06s)
[23:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:50] T258523: UploadWizard adding deprecated PD-old-70-1923 template to new uploads - https://phabricator.wikimedia.org/T258523
[23:13:02] kaldari: done!
[23:13:09] !log Evening backport window done
[23:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:13:12] Thankye!
[23:13:17] happy to help!
[23:18:57] (PS1) Ebernhardson: Bump cirrus MLR models [mediawiki-config] - https://gerrit.wikimedia.org/r/615319
[23:19:29] I'm going to sneak ^ into the window as well (and put it on wikitech). will deploy myself, will take a minute to verify
[23:20:28] ebernhardson: go ahead then, I'm done :)
[23:28:20] (PS2) Ebernhardson: Bump cirrus MLR models [mediawiki-config] - https://gerrit.wikimedia.org/r/615319
[23:31:36] (CR) Ebernhardson: [C: +2] Bump cirrus MLR models [mediawiki-config] - https://gerrit.wikimedia.org/r/615319 (owner: Ebernhardson)
[23:31:43] (PS3) Ebernhardson: Bump cirrus MLR models [mediawiki-config] - https://gerrit.wikimedia.org/r/615319
[23:31:50] (CR) Ebernhardson: [C: +2] Bump cirrus MLR models [mediawiki-config] - https://gerrit.wikimedia.org/r/615319 (owner: Ebernhardson)
[23:32:43] (Merged) jenkins-bot: Bump cirrus MLR models [mediawiki-config] - https://gerrit.wikimedia.org/r/615319 (owner: Ebernhardson)
[23:37:46] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bump cirrus MLR models to latest (duration: 01m 06s)
[23:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:42] all done