[00:00:26] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:30] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:34] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:54] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:20] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:00:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:04:50] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:05:32] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:06:24] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:09:38] 10Operations, 10serviceops, 10Growth-Team (Current Sprint), 10Patch-For-Review, 10User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (10MMiller_WMF) @kostajh -- maybe we should do that, but I would like to hear from @nettrom_WMF about what that would mean for o... [05:09:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:09:59] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 59, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:21:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3311, db1099:3318 for reimage and MCR change', diff saved to https://phabricator.wikimedia.org/P12263 and previous config saved to /var/cache/conftool/dbconfig/20200817-052147-marostegui.json [05:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:54] (03PS1) 10Marostegui: db1099: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/620527 [05:23:58] (03CR) 10Marostegui: [C: 03+2] db1099: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/620527 (owner: 10Marostegui) [05:25:38] !log Deploy schema change on db1139:3311 [05:25:40] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:00] (03PS1) 10Marostegui: install_server: Reimage db1099 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/620528 (https://phabricator.wikimedia.org/T250666) [05:28:01] <_joe_> !log depooling mw1281 for testing for T260329 [05:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:04] T260329: Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 [05:28:16] !log Stop mysql on db1099:3311, db1099:3318 for reimage [05:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:31] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db1099 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/620528 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [05:33:04] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, 10Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (10Joe) To test the hypothesis that this is related to firejail use, we're sending 1 req/s to on... 
[05:43:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [05:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:45:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (10Marostegui) [06:03:31] (03PS1) 10Marostegui: mariadb: Allow the installation of clouddb hosts [puppet] - 10https://gerrit.wikimedia.org/r/620529 (https://phabricator.wikimedia.org/T260441) [06:26:18] <_joe_> !log stop testing on mw1281, T260329 [06:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:22] T260329: Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 [06:28:40] (03PS1) 10Evrifaessa: Define Portal and Portal talk namespace for bjnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620533 (https://phabricator.wikimedia.org/T259429) [06:29:49] (03CR) 10Evrifaessa: [C: 03+1] Define Portal and Portal talk namespace for bjnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620533 (https://phabricator.wikimedia.org/T259429) (owner: 10Evrifaessa) [06:30:03] Hello [06:30:14] Can someone get jenkins-bot to verify this? 
: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/620533 [06:31:03] <_joe_> Evrifaessa: not even a minute have passed, you might need to tune down your expectations on the quickness of CI :) [06:31:15] <_joe_> well 2 now, but still [06:31:34] _joe_: I'm not on the whitelist of jenkins-bot, so someone needs to get it to check my commits manually [06:32:27] <_joe_> uh ok, sorry, I'm just used to CI taking 5-10 minutes to run :P [06:33:54] <_joe_> Also I have no idea how to do that, I'll try something :) [06:33:58] (03CR) 10Giuseppe Lavagetto: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620533 (https://phabricator.wikimedia.org/T259429) (owner: 10Evrifaessa) [06:35:18] <_joe_> yes it worked :) [06:35:46] thanks :) [06:36:34] <_joe_> also, just because zuul and I are clearly enemies, CI ran in 1 minute for you [06:38:20] hahah :) [06:47:54] (03PS1) 10Evrifaessa: Set Portal and Portal_talk in bjnwiki as an extra namespace instead of an alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620643 (https://phabricator.wikimedia.org/T259429) [06:48:48] (03Abandoned) 10Evrifaessa: Set Portal and Portal_talk in bjnwiki as an extra namespace instead of an alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620643 (https://phabricator.wikimedia.org/T259429) (owner: 10Evrifaessa) [06:48:54] (03Restored) 10Evrifaessa: Set Portal and Portal_talk in bjnwiki as an extra namespace instead of an alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620643 (https://phabricator.wikimedia.org/T259429) (owner: 10Evrifaessa) [06:49:12] (03Abandoned) 10Evrifaessa: Define Portal and Portal talk namespace for bjnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620533 (https://phabricator.wikimedia.org/T259429) (owner: 10Evrifaessa) [06:51:06] (03CR) 10Evrifaessa: [C: 03+1] Set Portal and Portal_talk in bjnwiki as an extra namespace instead of an alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620643 
(https://phabricator.wikimedia.org/T259429) (owner: 10Evrifaessa) [06:51:31] _joe_: Can you please run it for this too? : https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/620643 [06:55:13] (03PS2) 10Evrifaessa: Set Portal and Portal_talk namespaces in bjnwiki as an extra namespace. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620643 (https://phabricator.wikimedia.org/T259429) [07:15:40] <_joe_> !log repooled mw1281 [07:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:59] <_joe_> !log running the same test on mw1381 T260329 [07:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:05] (03PS4) 10Elukey: Add basic Debian packaging [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) [07:32:44] 10Operations, 10Analytics-Radar, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10elukey) In T233073 I am experimenting with building Hue 4.7.1, that is the latest upstream and it should support python3.7. The build procedure seems good (https://gerrit.wikimedia.org/... [07:34:53] <_joe_> !log repooling mw1381 [07:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:05] <_joe_> !log running the same test on mw1382 T260329 [07:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:00] (03PS1) 10Marostegui: dbproxy1016,20: Temporary test db1132 [puppet] - 10https://gerrit.wikimedia.org/r/620648 (https://phabricator.wikimedia.org/T259589) [07:46:30] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add DVrandecic to group nda - https://phabricator.wikimedia.org/T260279 (10Vgutierrez) 05Open→03Declined as I mentioned on my previous comment, being part of the wmf LDAP group is enough. @DVrandecic could you point us to the onboarding documentat... 
[07:52:14] <_joe_> !log repooling mw1382 [07:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:42] <_joe_> Evrifaessa: sure, sorry I was doing some intense testing [07:52:58] (03CR) 10Giuseppe Lavagetto: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620643 (https://phabricator.wikimedia.org/T259429) (owner: 10Evrifaessa) [07:53:36] <_joe_> Evrifaessa: I would suggest to add for this in #wikimedia-releng, though [07:53:44] <_joe_> s/add/ask/ [07:54:32] (03CR) 10Kormat: [C: 03+1] dbproxy1016,20: Temporary test db1132 [puppet] - 10https://gerrit.wikimedia.org/r/620648 (https://phabricator.wikimedia.org/T259589) (owner: 10Marostegui) [07:54:41] kormat: <3 [07:54:49] (03CR) 10Marostegui: [C: 03+2] dbproxy1016,20: Temporary test db1132 [puppet] - 10https://gerrit.wikimedia.org/r/620648 (https://phabricator.wikimedia.org/T259589) (owner: 10Marostegui) [08:05:44] (03PS1) 10JMeybohm: Disable cgroup memory accounting [puppet] - 10https://gerrit.wikimedia.org/r/620649 (https://phabricator.wikimedia.org/T260329) [08:10:08] (03CR) 10JMeybohm: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24493/mw1380.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/620649 (https://phabricator.wikimedia.org/T260329) (owner: 10JMeybohm) [08:13:58] (03PS1) 10Jcrespo: mariadb: Disable snapshots being sent to bacula [puppet] - 10https://gerrit.wikimedia.org/r/620651 (https://phabricator.wikimedia.org/T138562) [08:14:26] (03CR) 10Jcrespo: "As mentioned in the last meeting." 
[puppet] - 10https://gerrit.wikimedia.org/r/620651 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [08:15:13] (03CR) 10Marostegui: mariadb: Disable snapshots being sent to bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/620651 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [08:16:08] (03PS2) 10Jcrespo: mariadb: Disable snapshots being sent to bacula [puppet] - 10https://gerrit.wikimedia.org/r/620651 (https://phabricator.wikimedia.org/T138562) [08:16:23] (03CR) 10Jcrespo: mariadb: Disable snapshots being sent to bacula (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/620651 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [08:17:18] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Disable cgroup memory accounting [puppet] - 10https://gerrit.wikimedia.org/r/620649 (https://phabricator.wikimedia.org/T260329) (owner: 10JMeybohm) [08:17:37] (03CR) 10Marostegui: [C: 03+1] mariadb: Disable snapshots being sent to bacula [puppet] - 10https://gerrit.wikimedia.org/r/620651 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [08:17:42] (03CR) 10Jcrespo: [C: 03+2] mariadb: Disable snapshots being sent to bacula [puppet] - 10https://gerrit.wikimedia.org/r/620651 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [08:17:50] (03PS3) 10Jcrespo: mariadb: Disable snapshots being sent to bacula [puppet] - 10https://gerrit.wikimedia.org/r/620651 (https://phabricator.wikimedia.org/T138562) [08:19:09] (03CR) 10JMeybohm: [C: 03+2] Disable cgroup memory accounting [puppet] - 10https://gerrit.wikimedia.org/r/620649 (https://phabricator.wikimedia.org/T260329) (owner: 10JMeybohm) [08:24:27] (03Abandoned) 10Jcrespo: check_mariadb.py: Update bacula logic to the latest Bacula class [puppet] - 10https://gerrit.wikimedia.org/r/552860 (owner: 10Jcrespo) [08:25:55] !log forcing a puppet run on all mw-api servers in eqiad - T260329 [08:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:27:11] 10Operations, 10SRE-Access-Requests: Request for access to analytics-privatedata-users - https://phabricator.wikimedia.org/T260450 (10fgiunchedi) cc @Nuria for approval/signoff (or other folks in Analytics can sign off too? not sure) [08:28:05] (03CR) 10Jcrespo: "We will merge this so backups flow- we will have to refactor on a later change to rename Databases and not needing a separate JobDefaults " [puppet] - 10https://gerrit.wikimedia.org/r/598005 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [08:28:15] (03PS3) 10Jcrespo: Add new pool DatabasesCodfw to backup data generated on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/598005 (https://phabricator.wikimedia.org/T79922) [08:29:11] 10Operations, 10SRE-Access-Requests: Request for access to analytics-privatedata-users - https://phabricator.wikimedia.org/T260450 (10fgiunchedi) p:05Triage→03Medium [08:29:56] (03PS4) 10Jcrespo: Add new pool DatabasesCodfw to backup data generated on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/598005 (https://phabricator.wikimedia.org/T79922) [08:31:35] 10Operations, 10LDAP-Access-Requests, 10observability, 10serviceops, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10fgiunchedi) @AMooney @jcrespo any updates on this ? thank you! [08:32:00] (03CR) 10Jcrespo: [C: 03+2] Add new pool DatabasesCodfw to backup data generated on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/598005 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [08:32:40] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): The python-build images regenerate wheels even when matching ones are already available - https://phabricator.wikimedia.org/T259611 (10fgiunchedi... 
[08:32:50] 10Operations, 10Patch-For-Review: logrotate cronspam on ms-be1040 - https://phabricator.wikimedia.org/T205974 (10fgiunchedi) p:05Triage→03Medium [08:34:37] 10Operations, 10Platform Engineering, 10serviceops: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10fgiunchedi) p:05Triage→03Medium [08:34:50] 10Operations, 10observability: Grafana/Thanos serves 503s for long-time-window requests - https://phabricator.wikimedia.org/T260241 (10fgiunchedi) p:05Triage→03High [08:34:57] 10Operations: Fix "Blog" link on noc.wikimedia.org - https://phabricator.wikimedia.org/T259978 (10fgiunchedi) p:05Triage→03Medium [08:38:52] PROBLEM - bacula director process on backup1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:51:16] ^that is me, fix coming [08:51:25] (03PS3) 10Jcrespo: [WIP] Change backup hosts into using the package version of scripts [puppet] - 10https://gerrit.wikimedia.org/r/620312 (https://phabricator.wikimedia.org/T165358) [08:51:27] (03PS1) 10Jcrespo: test [puppet] - 10https://gerrit.wikimedia.org/r/620654 [08:51:29] (03PS1) 10Jcrespo: Backups: Fix storage definition for Databases on codfw [puppet] - 10https://gerrit.wikimedia.org/r/620655 (https://phabricator.wikimedia.org/T79922) [08:51:54] (03PS2) 10Jcrespo: Backups: Fix storage definition for Databases on codfw [puppet] - 10https://gerrit.wikimedia.org/r/620655 (https://phabricator.wikimedia.org/T79922) [08:52:41] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Change backup hosts into using the package version of scripts [puppet] - 10https://gerrit.wikimedia.org/r/620312 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [08:53:32] (03CR) 10Jcrespo: [C: 03+2] Backups: Fix storage definition for Databases on codfw [puppet] - 10https://gerrit.wikimedia.org/r/620655 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [08:55:32] (03CR) 
10Kormat: [C: 03+2] "> Patch Set 5: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/619291 (owner: 10Kormat) [08:56:13] (03PS1) 10Filippo Giunchedi: hieradata: limit queries to Thanos sidecar / Prometheus to last 15d [puppet] - 10https://gerrit.wikimedia.org/r/620656 (https://phabricator.wikimedia.org/T260241) [08:56:32] RECOVERY - bacula director process on backup1001 is OK: PROCS OK: 1 process with UID = 112 (bacula), command name bacula-dir https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [08:57:01] we are back [09:03:59] (03PS2) 10Giuseppe Lavagetto: Introduce sre.host.reboot-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 [09:04:38] (03CR) 10Giuseppe Lavagetto: "This change is ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 (owner: 10Giuseppe Lavagetto) [09:05:38] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:12] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:33] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:34] (03CR) 10Filippo Giunchedi: [C: 03+2] templates: add alerts.w.o [dns] - 10https://gerrit.wikimedia.org/r/619752 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [09:10:40] (03PS2) 10Filippo Giunchedi: templates: add alerts.w.o [dns] - 10https://gerrit.wikimedia.org/r/619752 (https://phabricator.wikimedia.org/T258948) [09:13:23] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: remove unnecessary define and split mediawiki queries by channel [puppet] - 10https://gerrit.wikimedia.org/r/619574 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [09:14:27] (03PS2) 10Jcrespo: mariadb-backups: Enable eqiad backups, which are sent to codfw 
(backup2001) [puppet] - 10https://gerrit.wikimedia.org/r/620654 (https://phabricator.wikimedia.org/T79922) [09:14:29] (03CR) 10Filippo Giunchedi: prometheus: add alertmanager jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/619738 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [09:14:31] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add alertmanager jobs [puppet] - 10https://gerrit.wikimedia.org/r/619738 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [09:14:57] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [09:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:29] (03PS1) 10Marostegui: Revert "dbproxy1016,20: Temporary test db1132" [puppet] - 10https://gerrit.wikimedia.org/r/620667 [09:15:45] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Enable eqiad backups, which are sent to codfw (backup2001) [puppet] - 10https://gerrit.wikimedia.org/r/620654 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [09:16:11] <_joe_> !log upgrading packages on mw1377 [09:16:11] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Debian packaging for Grafana plugins [debs/grafana-plugins] - 10https://gerrit.wikimedia.org/r/618953 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [09:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:52] (03CR) 10Kormat: [C: 03+1] Revert "dbproxy1016,20: Temporary test db1132" [puppet] - 10https://gerrit.wikimedia.org/r/620667 (owner: 10Marostegui) [09:17:34] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1016,20: Temporary test db1132" [puppet] - 10https://gerrit.wikimedia.org/r/620667 (owner: 10Marostegui) [09:18:03] <_joe_> !log re-upgrading imagemagick on mw1378 [09:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:24] <_joe_> !log running a full apt-get upgrade on mw1379-1380 [09:18:25] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:04] (03PS3) 10Jcrespo: mariadb-backups: Enable eqiad backups, which are sent to codfw (backup2001) [puppet] - 10https://gerrit.wikimedia.org/r/620654 (https://phabricator.wikimedia.org/T79922) [09:20:49] (03PS4) 10Jcrespo: mariadb-backups: Enable eqiad backups, which are sent to codfw (backup2001) [puppet] - 10https://gerrit.wikimedia.org/r/620654 (https://phabricator.wikimedia.org/T79922) [09:21:47] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [09:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:58] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:23:00] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [09:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:33] (03PS1) 10Volans: actions: fix test for pytest regression [software/spicerack] - 10https://gerrit.wikimedia.org/r/620661 [09:27:10] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:09] (03CR) 10Marostegui: "If you've run PCC, it looks clean I assume?" 
[puppet] - 10https://gerrit.wikimedia.org/r/620654 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [09:28:33] (03CR) 10Jcrespo: [C: 04-1] "-1 I got it in reverse: https://puppet-compiler.wmflabs.org/compiler1001/24495/dbprov1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/620654 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [09:28:36] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:09] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:14] (03CR) 10Marostegui: Add new pool DatabasesCodfw to backup data generated on eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598005 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [09:30:10] (03PS5) 10Jcrespo: mariadb-backups: Enable eqiad backups, which are sent to codfw (backup2001) [puppet] - 10https://gerrit.wikimedia.org/r/620654 (https://phabricator.wikimedia.org/T79922) [09:30:17] (03CR) 10Volans: [C: 03+2] "Self-merging to unblock CI and merges on master" [software/spicerack] - 10https://gerrit.wikimedia.org/r/620661 (owner: 10Volans) [09:30:51] (03CR) 10Volans: "For the unit test failure it's a regression on pytest side, I've sent and merged https://gerrit.wikimedia.org/r/c/operations/software/spic" [software/spicerack] - 10https://gerrit.wikimedia.org/r/619781 (owner: 10Ryan Kemper) [09:32:42] (03Merged) 10jenkins-bot: actions: fix test for pytest regression [software/spicerack] - 10https://gerrit.wikimedia.org/r/620661 (owner: 10Volans) [09:32:55] (03CR) 10Jcrespo: "Answer:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598005 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [09:33:34] (03PS2) 10Filippo Giunchedi: profile: switch Grafana plugins to Debian package [puppet] - 
10https://gerrit.wikimedia.org/r/619451 (https://phabricator.wikimedia.org/T259143) [09:33:36] (03PS1) 10Filippo Giunchedi: hieradata: switch grafana.w.o to Grafana 7 [puppet] - 10https://gerrit.wikimedia.org/r/620663 (https://phabricator.wikimedia.org/T259143) [09:34:09] (03CR) 10jerkins-bot: [V: 04-1] hieradata: switch grafana.w.o to Grafana 7 [puppet] - 10https://gerrit.wikimedia.org/r/620663 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [09:36:09] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [09:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:59] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [09:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:14] (03PS1) 10Marostegui: dbproxy1016,dbproxy1020: Change m3 master [puppet] - 10https://gerrit.wikimedia.org/r/620664 (https://phabricator.wikimedia.org/T259589) [09:38:26] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:59] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:11] (03CR) 10Jbond: [C: 03+2] ferm: ensure rules always end in a semi colon [puppet] - 10https://gerrit.wikimedia.org/r/617706 (owner: 10Jbond) [09:39:33] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [09:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:21] !log updating compiler facts for cloud puppet compiler project to include new host dbprov2003 [09:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:02] PROBLEM - SSH on db2093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:45:12] PROBLEM - 
Check systemd state on mc2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:20] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [09:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:32] PROBLEM - SSH on prometheus2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:45:47] mmmh 2 ssh failing at the same time? [09:45:56] PROBLEM - Check systemd state on mc1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:19] different racks [09:46:22] B5 and D5 [09:46:23] D5: Ok so I hacked up ssh.py to use mozprocess - https://phabricator.wikimedia.org/D5 [09:46:26] PROBLEM - Check systemd state on mc1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:32] lol stashbot [09:47:35] but networks seems up [09:48:06] PROBLEM - Check systemd state on mc2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:20] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [09:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:36] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/620664 (https://phabricator.wikimedia.org/T259589) (owner: 10Marostegui) [09:49:57] * volans logging to prometheus2004 console [09:50:06] PROBLEM - Check systemd state on mc2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:46] prometheus2004 is up and running and no recent reboots [09:50:51] not even high load, looking [09:52:22] XioNoX: you around by any chance? [09:52:33] volans: I can connect through mysql to db2093, but not on ssh [09:52:46] also other ports like prometheus seems unaffected [09:52:49] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:51] interesting, and on the mc hosts that failed systemd at least one has ferm failed [09:52:54] I'm looking that one too [09:53:02] although I would bet it is not as much a port as a source host issue? [09:53:03] Empty rule before ";" not allowed [09:53:23] volans: i think this is a change i just pushed [09:53:25] one sec [09:53:35] jbond42: the ferm one? [09:53:41] https://gerrit.wikimedia.org/r/617706 [09:53:42] yes [09:53:44] might affect the ssh part too? [09:54:00] PROBLEM - Check systemd state on mc2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:10] volans: could be reverting now [09:54:27] ack [09:54:28] (03PS1) 10Jbond: Revert "ferm: ensure rules always end in a semi colon" [puppet] - 10https://gerrit.wikimedia.org/r/620668 [09:55:21] jbond42: please tell when puppet-merged that I try to run puppet on prometheus2004 [09:55:22] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [09:55:23] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:40] PROBLEM - Check systemd state on mc1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:55:44] (03CR) 10Jbond: [C: 03+2] Revert "ferm: ensure rules always end in a semi colon" [puppet] - 10https://gerrit.wikimedia.org/r/620668 (owner: 10Jbond) [09:56:21] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:22] PROBLEM - Check systemd state on mc1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:34] volans: merged [09:56:52] PROBLEM - Check systemd state on mc1027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:54] RECOVERY - SSH on prometheus2004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:57:56] running puppet on failed-only [09:58:02] RECOVERY - Check systemd state on mc1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:20] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:40] PROBLEM - Check systemd state on mc1035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:42] RECOVERY - Check systemd state on mc2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:58:45] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [09:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:02] jbond42: ack thx [09:59:12] RECOVERY - Check systemd state on mc2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:14] RECOVERY - Check systemd state on mc2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:17] prometheus seems ok now [09:59:22] RECOVERY - Check systemd state on mc1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:32] mc also fixed by the revert [09:59:46] RECOVERY - Check systemd state on mc1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:59:46] RECOVERY - Check systemd state on mc2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:00] RECOVERY - Check systemd state on mc1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:05] RECOVERY - Check systemd state on mc1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:08] RECOVERY - Check systemd state on mc1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:20] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:09] (03CR) 10Filippo Giunchedi: "build failure is unrelated to the change:" [puppet] - 10https://gerrit.wikimedia.org/r/620663 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [10:06:43] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:00] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:00] (03CR) 10Jbond: "The original change caused the following issues" [puppet] - 10https://gerrit.wikimedia.org/r/620668 (owner: 10Jbond) [10:10:19] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:12] RECOVERY - SSH on db2093 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:12:43] <_joe_> wait, did the mc hosts just lose connectivity with everything else?
[10:13:19] _joe_: ferm issue due to a puppet patch [10:13:23] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [10:13:23] already reverted and fixed [10:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:34] <_joe_> volans: ok, I was trying to assess impact [10:13:42] on some of them where puppet run ferm failed to run [10:13:54] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [10:13:55] while for prometheus and that db host ssh was not open anymore from the bastions [10:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:09] <_joe_> if the firewall rules were reverted to the default drop [10:14:25] jbond42 might fill you in with the details ^^^ [10:14:30] <_joe_> that is an incident, but I would suspect that didn't happen just by latency graphs [10:14:31] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [10:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:35] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [10:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:47] _joe_: for the mc hosts the following rule was altered /etc/ferm/conf.d/10_ferm-ipsec-esp [10:14:53] - proto esp { saddr $DOMAIN_NETWORKS ACCEPT; }; [10:14:55] + proto esp { saddr $DOMAIN_NETWORKS ACCEPT; } [10:15:09] i think ferm failed to restart the service so would have had no effect [10:15:16] <_joe_> ok [10:15:19] but would need to validate that [10:16:35] _joe_: confirmed ferm sees it as a syntax error and bails out early [10:16:45] <_joe_> ack thanks [10:29:33] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [10:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:49] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [10:29:50] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:04] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200817T1030). Please do the needful. [10:30:05] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [10:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:06] (03CR) 10Jbond: [V: 03+2 C: 03+2] "LGTM thx" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/618724 (owner: 10Hashar) [10:32:22] 10Operations, 10User-jbond: update profile::waf::apache2::administrative to use the new abuse_networks hiera key - https://phabricator.wikimedia.org/T253632 (10jbond) 05Open→03Resolved a:03jbond >>! In T253632#6362858, @Aklapper wrote: > @JBond: Both patches in Gerrit have been merged. Can this task be r... [10:35:11] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [10:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:17] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [10:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:10] (03CR) 10ZPapierski: [C: 04-1] "This requires changes to the oauth secret, horizon config and some community communication (wcqs will ask once again for authorization as " [puppet] - 10https://gerrit.wikimedia.org/r/615810 (owner: 10ZPapierski) [10:36:43] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [10:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:08] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [10:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:28] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:12] !log 
oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [10:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:55] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:21] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [10:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:06] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:35] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [10:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:38] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [10:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:44] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [10:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:47] (03PS6) 10Cparle: MediaSearch A/B test on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse) [10:57:55] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [10:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:13] <_joe_> jouncebot: next [10:59:13] In 0 hour(s) and 0 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200817T1100) [10:59:16] <_joe_> ahoem [10:59:21] * cormacparle__ waves [10:59:25] <_joe_> who's doing the backport? [10:59:40] <_joe_> can I ask y'all to wait a few minutes? 
I'm rebooting some hosts [10:59:44] sure [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200817T1100). [11:00:05] cormacparle and Evrifaessa: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:06] Evrifaessa: ^ [11:00:14] See what joe just said [11:00:28] * Lucas_WMDE acknowledges _joe_ [11:00:29] (03CR) 10Jcrespo: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/24502/" [puppet] - 10https://gerrit.wikimedia.org/r/620654 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [11:00:32] <_joe_> It's going to be 5 minutes tops [11:00:37] I'm here [11:00:38] (I’d prefer if someone else did the window anyways) [11:00:38] grand [11:01:11] <_joe_> I was too concentrated on doing reboots like a machine should and I lost the sense of time [11:01:36] (03CR) 10Volans: "The approach looks good to me. Some minor comment inline." 
(038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 (owner: 10Giuseppe Lavagetto) [11:03:28] <_joe_> cormacparle__: you may proceed [11:03:42] ace, thanks _joe_ [11:03:57] (03CR) 10Cparle: [C: 03+2] MediaSearch A/B test on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse) [11:04:46] (03Merged) 10jenkins-bot: MediaSearch A/B test on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/616530 (https://phabricator.wikimedia.org/T254388) (owner: 10DCausse) [11:08:45] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [11:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:13] !log cparle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [SDC] configure mediasearch A/B test (duration: 01m 08s) [11:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:48] ok I'm done [11:10:47] who's next? Evrifaessa ? [11:10:53] yup [11:10:53] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [11:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:23] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [11:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:43] <_joe_> cormacparle__: sigh sorry I fat-fingered a command I was preparing [11:12:55] <_joe_> :/ [11:13:09] so um, it's my first time attending a deployment, should i just wait [11:13:09] here? [11:13:34] _joe_: do I have to do something? or just ignore the log msgs? [11:13:43] Evrifaessa: yes, cormacparle__ will tell you when you can test. Have you got the wikimedia debug extension ready? 
[11:13:48] <_joe_> no, you should wait a couple minutes if possible [11:14:01] <_joe_> you can deploy to mwdebug in the meantime [11:14:04] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [11:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:18] erm ... I already deployed the patch :/ [11:14:23] RhinosF1: I guess I have it installed. Are you talking about the Chrome extension that hes settings like "mwdebug1001.eqiad.wmnet"? [11:14:27] has* [11:14:39] it's only a config change with no discernable effects (setting something up for later) [11:14:45] so I just tested I didn't break search [11:14:57] (03PS1) 10Jbond: mariadb: escape title as '-' is causes issues [puppet] - 10https://gerrit.wikimedia.org/r/620688 (https://phabricator.wikimedia.org/T257033) [11:15:04] Evrifaessa: yep [11:16:13] <_joe_> cormacparle__: you can proceed [11:16:16] <_joe_> sorry again [11:16:52] erm ... ok _joe_ I already did deploy my patch though before I realised something was wrong [11:17:02] do I need to sync-file again? [11:17:26] <_joe_> cormacparle__: did you see any errors? [11:17:30] nope [11:17:36] <_joe_> then you're ok [11:17:48] <_joe_> I did reboot by error one server after you synced [11:18:00] ok great - in that case Evrifaessa you can go ahead I guess [11:18:08] <_joe_> not really by mistake, I intended to start it after you were done :P [11:18:13] :D [11:18:24] so um, what am I supposed to do now? lol [11:18:46] i have wikimediadebug installed and enabled [11:19:51] erm ... who's running the deployment? Amir1 Lucas_WMDE awight Urbanecm are listed [11:19:58] any of you here? [11:21:02] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat, 10User-jbond: Standardize/centralize mapping from section to mariadb port/socket and prom-mysql-exporter port - https://phabricator.wikimedia.org/T257033 (10jbond) > m5 has index 15 guessing this should be "m5 has index 5"? or after reading ". So... 
[11:21:15] anyone? [11:22:05] bruh [11:22:42] sorry Evrifaessa I'd help but I'm actually on vacation today and only dropped in to do my own deployment [11:23:01] gotta leave in 10 mins [11:23:03] so.. should I just wait here for an unknown time period, or leave? [11:23:09] who will deploy these? [11:23:16] normally you deploy them yourself [11:23:23] do you have production access? [11:23:26] idk [11:23:31] i'm new here [11:23:41] i just only know how to commit and change configuration [11:23:47] ok, welcome :) [11:23:49] what team are you on? [11:23:54] what team?? [11:24:04] i probably don't belong to any team, lol [11:24:15] but if you ask for my usergroup, i'm from wikimedia usergroup turkey [11:24:22] Oh sorry, I assumed you worked for the foundation [11:24:31] nope [11:24:34] i'm a volunteer [11:24:43] ok I understand [11:25:23] Amir1 Lucas_WMDE awight Urbanecm, any of you here? or i'll have to leave i guess [11:25:42] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [11:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:12] I can deploy [11:26:17] oh [11:26:18] hello [11:26:23] Ah, ace, thanks Lucas_WMDE [11:26:35] though we might not have enough time for all five patches now [11:26:39] looking [11:27:15] Lucas_WMDE: sorry I missed the ping, I can take over if you're busy! [11:28:15] no, it’s okay now [11:28:16] Evrifaessa: oh, I'm here now too, I didn't see the ping [11:28:25] (03PS2) 10Lucas Werkmeister (WMDE): Add Turkish powered by MW and Wikimedia project icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620507 (https://phabricator.wikimedia.org/T260492) (owner: 10Evrifaessa) [11:28:37] Lucas_WMDE: but I let you to deploy :) [11:28:42] Lucas_WMDE: ty! 
[11:29:41] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add Turkish powered by MW and Wikimedia project icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620507 (https://phabricator.wikimedia.org/T260492) (owner: 10Evrifaessa) [11:29:57] restarting my IRC client, I’ll be back in a second [11:30:26] (03Merged) 10jenkins-bot: Add Turkish powered by MW and Wikimedia project icons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620507 (https://phabricator.wikimedia.org/T260492) (owner: 10Evrifaessa) [11:31:10] * Lucas_WMDE back [11:31:14] Evrifaessa: do you know how to test changes on mwdebug? [11:31:18] uh [11:31:22] it's my first time here [11:31:27] ok :) [11:31:32] but if you guide me through, i probably can [11:31:48] so basically, we first send most config changes to one or two special servers, called mwdebug [11:31:56] in this case, I pulled the change to “mwdebug1001” [11:32:18] and with a special browser extension (see https://wikitech.wikimedia.org/wiki/WikimediaDebug), you can send your requests to an mwdebug server instead of the regular servers [11:32:30] so you can test if the change works, without it affecting anyone else yet [11:32:36] alright [11:33:12] so i turned it on [11:33:19] and selected 1001 [11:33:26] sounds good [11:33:27] it seems to be working [11:33:30] yay! 
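The mwdebug routing described just above can be pictured as the browser extension tagging each request with a header that names the debug backend. The sketch below is only an illustration of that idea: the header name `X-Wikimedia-Debug`, the `backend=` value format, and the helper `debug_headers` are assumptions, not something stated in this log. Only the host name `mwdebug1001.eqiad.wmnet` comes from the conversation.

```python
# Hypothetical sketch: header name and value format are assumptions,
# not taken from the log. Only the host name appears in the chat.
DEBUG_HEADER = "X-Wikimedia-Debug"


def debug_headers(backend):
    """Build request headers asking that `backend` serve the request."""
    if not backend:
        raise ValueError("backend host name must not be empty")
    return {DEBUG_HEADER: "backend=" + backend}


# Requests carrying these headers would be meant for the debug server;
# requests without them go to the regular servers as usual.
headers = debug_headers("mwdebug1001.eqiad.wmnet")
print(headers)
```

The point of the design is that the choice of server travels with each individual request, so one person can test a change on the debug host while everyone else keeps getting the regular servers.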
[11:33:34] I also tested it here and it looks good :) [11:33:38] :) [11:33:39] so then I’ll sync the change to the rest of the servers [11:33:54] alright [11:35:19] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:620507|Add Turkish powered by MW and Wikimedia project icons (T260492)]] (duration: 00m 57s) [11:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:22] T260492: Change the footer logos in trwikimedia - https://phabricator.wikimedia.org/T260492 [11:35:42] (03PS2) 10Lucas Werkmeister (WMDE): Add Turkish powered by MW and Wikimedia project icons for Turkish Wikiquote, Turkish Wiktionary, Turkish Wikisource and Turkish Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620509 (https://phabricator.wikimedia.org/T260493) (owner: 10Evrifaessa) [11:35:48] now it also works in real-time [11:35:49] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add Turkish powered by MW and Wikimedia project icons for Turkish Wikiquote, Turkish Wiktionary, Turkish Wikisource and Turkish Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620509 (https://phabricator.wikimedia.org/T260493) (owner: 10Evrifaessa) [11:35:56] \o/ [11:36:01] ok onto the next change [11:36:32] alright [11:36:33] (03Merged) 10jenkins-bot: Add Turkish powered by MW and Wikimedia project icons for Turkish Wikiquote, Turkish Wiktionary, Turkish Wikisource and Turkish Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620509 (https://phabricator.wikimedia.org/T260493) (owner: 10Evrifaessa) [11:36:56] next change is also on mwdebug1001 now [11:38:55] everything seems to be working :) [11:39:05] I tested it on all the mentioned projects, looks good to me (new logos appear) [11:39:07] ok syncing :) [11:40:19] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:620509|Add Turkish powered by MW and Wikimedia project icons for Turkish 
Wikiquote, Turkish Wiktionary, Turkish Wikisource and Turkish Wikibooks (T260493)]] (duration: 00m 55s) [11:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:22] T260493: Change the footer logos in Turkish Wikiquote, Turkish Wiktionary, Turkish Wikisource and Turkish Wikibooks - https://phabricator.wikimedia.org/T260493 [11:40:57] (03PS3) 10Lucas Werkmeister (WMDE): Change the logo of lzh Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620510 (https://phabricator.wikimedia.org/T259006) (owner: 10Evrifaessa) [11:41:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Change the logo of lzh Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620510 (https://phabricator.wikimedia.org/T259006) (owner: 10Evrifaessa) [11:41:42] ok, a logo change can’t be tested on mwdebug IIRC [11:41:47] (03Merged) 10jenkins-bot: Change the logo of lzh Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620510 (https://phabricator.wikimedia.org/T259006) (owner: 10Evrifaessa) [11:41:52] I just need to sync it and then purge the cache [11:41:57] alright [11:42:26] Lucas_WMDE: it can, mwdebug automatically skips the varnish (or whatever it is nowadays) cache ;) [11:42:32] ah ok :) [11:42:34] then let’s try it! 
[11:42:49] thanks [11:42:53] Evrifaessa just needs to clear their own browser cache (sth like Ctrl+Shift+R should work) [11:43:01] change is on mwdebug1001 [11:43:17] yup, logo definitely looks different after Ctrl+F5 [11:44:02] yeah, it works in mwdebug1001 [11:45:13] Lucas_WMDE: I hope your connection is all right :) [11:45:22] it’s weird [11:45:23] !log lucaswerkmeister-wmde@deploy1001 Synchronized static/images/project-logos/: Config: [[gerrit:620510|Change the logo of lzh Wikipedia (T259006)]] (duration: 00m 55s) [11:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:26] T259006: Change the logo of lzh Wikipedia - https://phabricator.wikimedia.org/T259006 [11:45:28] keeps cutting out [11:45:33] but i can come back in relatively quickly at least [11:46:16] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/project-logos/zh_classicalwiki%s.png\n' '' '-1.5x' '-2x' | mwscript purgeList.php # T259006 [11:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:23] it works in production server too [11:47:27] I’ll edit the next change to keep the list in alphabetical order [11:47:46] okay [11:48:07] (03PS2) 10Lucas Werkmeister (WMDE): Add Wiktionary wordmark for eswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620513 (https://phabricator.wikimedia.org/T254059) (owner: 10Evrifaessa) [11:48:36] (03PS1) 10KartikMistry: Update cxserver to 2020-08-17-090424-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/620692 (https://phabricator.wikimedia.org/T259980) [11:48:37] not sure how to test the width/height, I guess we can just see if it works… [11:48:42] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Add Wiktionary wordmark for eswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620513 (https://phabricator.wikimedia.org/T254059) (owner: 10Evrifaessa) [11:48:45] +2ed [11:49:28] (03Merged) 10jenkins-bot: Add Wiktionary 
wordmark for eswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620513 (https://phabricator.wikimedia.org/T254059) (owner: 10Evrifaessa) [11:50:01] change is on mwdebug1001 [11:50:49] seems to work as far as I can tell (es.m.wiktionary.org, upper left corner) [11:51:01] yep, works [11:51:03] the Wikcionario is a tad larger than the Wiktionary that appears there without mwdebug [11:51:05] but that’s fine [11:51:13] it’s not stretched or anything at least :) [11:51:30] ok, so I think I need to first sync the /static file, purge that from the cache, then sync InitialiseSettings.php [11:51:32] * Lucas_WMDE does so [11:51:58] okay [11:53:15] !log lucaswerkmeister-wmde@deploy1001 Synchronized static/images/mobile/copyright/wiktionary-wordmark-es.svg: Config: [[gerrit:620513|Add Wiktionary wordmark for eswiktionary (T254059)]], part 1 (duration: 00m 56s) [11:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:18] T254059: Add localized wordmark to eswiktionary mobile frontend - https://phabricator.wikimedia.org/T254059 [11:53:24] !log lucaswerkmeister-wmde@mwmaint1002:~$ printf 'https://en.wikipedia.org/static/images/mobile/copyright/wiktionary-wordmark-es.svg\n' | mwscript purgeList.php # T254059 [11:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:52] (03PS2) 10Jbond: graphite: move graphite paramters under profile namespace [puppet] - 10https://gerrit.wikimedia.org/r/617725 (https://phabricator.wikimedia.org/T247956) [11:54:48] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:620513|Add Wiktionary wordmark for eswiktionary (T254059)]], part 2 (duration: 00m 57s) [11:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:16] ok I think we’ll have time for the last change [11:55:30] okay :) [11:56:47] so in here [11:56:56] i first added it as an alias, which was incorrect [11:57:03] (03PS3) 
10Lucas Werkmeister (WMDE): Set Portal and Portal_talk namespaces in bjnwiki as an extra namespace. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620643 (https://phabricator.wikimedia.org/T259429) (owner: 10Evrifaessa) [11:57:05] and then abandoned the commit [11:57:10] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Set Portal and Portal_talk namespaces in bjnwiki as an extra namespace. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620643 (https://phabricator.wikimedia.org/T259429) (owner: 10Evrifaessa) [11:57:16] and made another which adds it as an extra ns [11:57:39] (03PS1) 10Matthias Mullie: Fix testwikidata depicts property id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620694 (https://phabricator.wikimedia.org/T258048) [11:57:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3311', diff saved to https://phabricator.wikimedia.org/P12264 and previous config saved to /var/cache/conftool/dbconfig/20200817-115741-marostegui.json [11:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:01] (03Merged) 10jenkins-bot: Set Portal and Portal_talk namespaces in bjnwiki as an extra namespace. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/620643 (https://phabricator.wikimedia.org/T259429) (owner: 10Evrifaessa) [11:58:51] ok, the change is on mwdebug1001 [11:59:40] (03PS1) 10Marostegui: Revert "db1099: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/620671 [11:59:44] looks ok in https://bjn.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces&formatversion=2 [12:00:01] https://bjn.wikipedia.org/wiki/Pamakai:Evrifaessa/sandbox [12:00:04] okay, check here [12:00:08] it seems to be working [12:00:21] ok, syncing [12:01:17] (03CR) 10Marostegui: [C: 03+2] Revert "db1099: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/620671 (owner: 10Marostegui) [12:01:41] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:620643|Set Portal and Portal_talk namespaces in bjnwiki as an extra namespace. (T259429)]] (duration: 00m 55s) [12:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:44] T259429: "Portal" and "Portal talk" namespaces are missing from bjn.wikipedia.org - https://phabricator.wikimedia.org/T259429 [12:01:52] 10Operations, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10hashar) [12:02:23] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php bjnwiki | tee T259429-dryrun [12:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:30] 112 links to fix, 112 resolvable, 0 deleted – seems fine [12:02:40] alright [12:02:42] (03CR) 10Marostegui: Add new pool DatabasesCodfw to backup data generated on eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/598005 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [12:02:47] will you fix them automatically? 
[12:02:51] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes.php bjnwiki --fix | tee T259429-fix [12:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:55] huh, 1 was deleted this time [12:03:23] Lucas_WMDE: from table, IIRC it means that there were two duplicates links. It shouldn't delete anything on-wiki. [12:03:28] ok [12:03:33] I’ll paste the output on the task just in case [12:03:38] kk [12:04:08] Lucas_WMDE: a tip: you can redirect the output to `phaste`, which will create a Phabricator paste for you :-). [12:04:48] oh, I should look into that :) [12:04:49] thanks! [12:05:07] !log EU backport window done [12:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:14] so, is this now synced? [12:05:17] only five minutes over time (but nothing else was scheduled for now) [12:05:19] it should be, yes [12:05:25] thank you :)) [12:05:37] thanks you for helping out :) [12:05:50] have a nice day :) [12:16:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3311', diff saved to https://phabricator.wikimedia.org/P12265 and previous config saved to /var/cache/conftool/dbconfig/20200817-121600-marostegui.json [12:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:08] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:39] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:30] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:47] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:35] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3311', diff saved to https://phabricator.wikimedia.org/P12266 and previous config saved to /var/cache/conftool/dbconfig/20200817-122234-marostegui.json [12:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:29] 10Operations, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) >>! In T258978#6340838, @Joe wrote: > This service should /not/ do any caching, which should instead... [12:25:34] (03CR) 10Addshore: [C: 03+1] Fix testwikidata depicts property id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620694 (https://phabricator.wikimedia.org/T258048) (owner: 10Matthias Mullie) [12:26:50] (03PS2) 10Jbond: profile::restbase: update aqs_uri to remove aqs_site variable [puppet] - 10https://gerrit.wikimedia.org/r/617729 (https://phabricator.wikimedia.org/T247956) [12:27:28] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:46] (03CR) 10Jbond: "noop https://puppet-compiler.wmflabs.org/compiler1002/24504/" [puppet] - 10https://gerrit.wikimedia.org/r/617729 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [12:28:49] (03CR) 10Jbond: [C: 03+2] profile::restbase: update aqs_uri to remove aqs_site variable [puppet] - 10https://gerrit.wikimedia.org/r/617729 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [12:30:06] (03PS4) 10Jbond: discovery: clean up old hiera values [puppet] - 10https://gerrit.wikimedia.org/r/617580 (https://phabricator.wikimedia.org/T247956) [12:32:15] (03CR) 10Jbond: [C: 03+2] discovery: clean up old hiera values [puppet] - 10https://gerrit.wikimedia.org/r/617580 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [12:32:24] 10Operations, 
10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) [12:33:06] (03PS3) 10Giuseppe Lavagetto: Introduce sre.host.reboot-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 [12:33:22] (03CR) 10Giuseppe Lavagetto: Introduce sre.host.reboot-cluster (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 (owner: 10Giuseppe Lavagetto) [12:35:03] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:28] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [12:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:58] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [12:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:04] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:41] (03CR) 10Giuseppe Lavagetto: Introduce sre.host.reboot-cluster (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 (owner: 10Giuseppe Lavagetto) [12:38:43] (03CR) 10JMeybohm: "What about slightly increasing the wait time for icinga once again? 
From the reboot_single cookbook I would guess we will end up with at l" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 (owner: 10Giuseppe Lavagetto) [12:39:25] (03PS4) 10Giuseppe Lavagetto: Introduce sre.host.reboot-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 [12:39:57] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add alertmanagers configuration [puppet] - 10https://gerrit.wikimedia.org/r/619739 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [12:40:10] (03PS2) 10Filippo Giunchedi: prometheus: add alertmanagers configuration [puppet] - 10https://gerrit.wikimedia.org/r/619739 (https://phabricator.wikimedia.org/T258948) [12:42:31] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/617725 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [12:44:02] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1099:3311', diff saved to https://phabricator.wikimedia.org/P12267 and previous config saved to /var/cache/conftool/dbconfig/20200817-124409-marostegui.json [12:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:25] (03CR) 10Giuseppe Lavagetto: "> Patch Set 3:" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 (owner: 10Giuseppe Lavagetto) [12:44:47] (03PS5) 10Giuseppe Lavagetto: Introduce sre.host.reboot-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 [12:44:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depoool db1089 for MCR change', diff saved to https://phabricator.wikimedia.org/P12268 and previous config saved to /var/cache/conftool/dbconfig/20200817-124458-marostegui.json [12:44:59] (03CR) 10Giuseppe Lavagetto: Introduce sre.host.reboot-cluster (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/620305 (owner: 10Giuseppe Lavagetto) [12:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:13] (03PS3) 10Jbond: graphite: move graphite paramters under profile namespace [puppet] - 10https://gerrit.wikimedia.org/r/617725 (https://phabricator.wikimedia.org/T247956) [12:47:47] (03CR) 10Jbond: "re-based ; pcc: https://puppet-compiler.wmflabs.org/compiler1003/24508/" [puppet] - 10https://gerrit.wikimedia.org/r/617725 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [12:48:23] (03CR) 10Kormat: [C: 04-1] "I'm feeling uneasy about this. Section names are used in many places. E.g:" [puppet] - 10https://gerrit.wikimedia.org/r/620688 (https://phabricator.wikimedia.org/T257033) (owner: 10Jbond) [12:49:40] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat, 10User-jbond: Standardize/centralize mapping from section to mariadb port/socket and prom-mysql-exporter port - https://phabricator.wikimedia.org/T257033 (10Kormat) >>! In T257033#6388369, @jbond wrote: >> m5 has index 15 > guessing this should be "... [12:50:54] (03PS1) 10Kormat: mariadb: Fix referencing of wrong variable. 
[puppet] - 10https://gerrit.wikimedia.org/r/620699 [12:52:19] (03CR) 10Jbond: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/620688 (https://phabricator.wikimedia.org/T257033) (owner: 10Jbond) [12:52:27] (03Abandoned) 10Jbond: mariadb: escape title as '-' is causes issues [puppet] - 10https://gerrit.wikimedia.org/r/620688 (https://phabricator.wikimedia.org/T257033) (owner: 10Jbond) [12:52:52] 10Operations, 10netops, 10observability: Add alert[12]001 to network ACLs - https://phabricator.wikimedia.org/T260533 (10fgiunchedi) [12:53:00] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:13] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:28] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:39] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [12:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:40] 10Operations, 10DBA, 10Patch-For-Review, 10User-Kormat, 10User-jbond: Standardize/centralize mapping from section to mariadb port/socket and prom-mysql-exporter port - https://phabricator.wikimedia.org/T257033 (10jbond) >>! In T257033#6388644, @Kormat wrote: >>>! In T257033#6388369, @jbond wrote: >>> m5... 
[12:55:57] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/617725 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [12:56:48] (03CR) 10Jbond: [C: 03+2] graphite: move graphite paramters under profile namespace [puppet] - 10https://gerrit.wikimedia.org/r/617725 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [12:57:16] (03CR) 10Kormat: [C: 03+1] dbproxy1016,dbproxy1020: Change m3 master [puppet] - 10https://gerrit.wikimedia.org/r/620664 (https://phabricator.wikimedia.org/T259589) (owner: 10Marostegui) [12:58:30] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] Urbanecm and Amir1: I, the Bot under the Fountain, allow thee, The Deployer, to do Create new wikis deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200817T1300). [13:00:18] \o/ [13:00:47] o/ [13:00:53] <_joe_> ouch [13:00:56] <_joe_> wait please [13:01:01] <_joe_> we're rebooting the fleet [13:01:06] noted [13:01:09] _joe_: noted [13:01:11] 🛳️ [13:01:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P12269 and previous config saved to /var/cache/conftool/dbconfig/20200817-130127-marostegui.json [13:01:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:09] (03CR) 10Kormat: [C: 03+1] wmfmariadbpy: Add unit tests for resolve method [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620319 (owner: 10Jcrespo) [13:02:44] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) (owner: 10ZPapierski) [13:06:45] let us know, I go quickly eat lunch [13:06:46] legoktm: o/ while T260342 is open, any chance i could get permissions to push annotated tags to 
https://gerrit.wikimedia.org/r/admin/repos/operations/software/wmfmariadbpy please? i'm currently blocked on this for making a release [13:06:46] T260342: Request for Gerrit Managers permissions - https://phabricator.wikimedia.org/T260342 [13:08:44] the logmsgbot-test is me btw [13:09:05] <_joe_> Amir1: sure, as soon as this round of reboots is done, sorry [13:09:12] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [13:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:13] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [13:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:38] <_joe_> Amir1: I'm done [13:10:54] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [13:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:13] thanks [13:12:18] Urbanecm: around? [13:12:22] Amir1: yup [13:12:33] let's start with T259432? 
[13:12:34] T259432: Create Wikipedia Ladin - https://phabricator.wikimedia.org/T259432 [13:13:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P12270 and previous config saved to /var/cache/conftool/dbconfig/20200817-131307-marostegui.json [13:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:26] (03PS4) 10Urbanecm: Initial configuration for lijwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617861 (https://phabricator.wikimedia.org/T259432) [13:13:50] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for lijwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617861 (https://phabricator.wikimedia.org/T259432) (owner: 10Urbanecm) [13:14:58] (03Merged) 10jenkins-bot: Initial configuration for lijwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617861 (https://phabricator.wikimedia.org/T259432) (owner: 10Urbanecm) [13:15:11] fetching to deploy1001 and pulling to mwmaint1002 [13:16:02] running mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=muswiki lld wikipedia lldwiki lld.wikipedia.org [13:16:30] Urbanecm: wait [13:17:01] Amir1: what's up? [13:17:22] the patch itself is okay but the commit message has the wrong wiki code [13:17:41] lijwiki is incorrect, it should be lldwiki [13:17:45] (addWiki.php already entered, btw...) [13:18:08] Amir1: meh, at least it's not going to break the wiki [13:18:08] but it's okay, the commit itself is fine, it's not introducing lijwiki [13:18:19] first, thought it would [13:18:26] before double checking the commit [13:19:14] i see [13:19:20] Amir1: okay for me to sync config?
[13:19:25] yup [13:19:27] (database seems to be fine and at s5) [13:20:16] (03CR) 10Nintendofan885: "The name of this patch should have been 'Initial configuration for lldwiki' as it's lldwiki rather than lijwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/617861 (https://phabricator.wikimedia.org/T259432) (owner: 10Urbanecm) [13:20:31] thanks, syncing files in he task-order [13:20:41] !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: Creating lldwiki (T259432) (duration: 00m 55s) [13:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:44] T259432: Create Wikipedia Ladin - https://phabricator.wikimedia.org/T259432 [13:22:25] !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: Creating lldwiki (T259432) (duration: 00m 56s) [13:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:33] dblists going now [13:23:07] (03CR) 10JMeybohm: [C: 04-1] "> Patch Set 4:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [13:23:25] !log urbanecm@deploy1001 Synchronized dblists: Creating lldwiki (T259432) (duration: 00m 56s) [13:23:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:12] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating lldwiki (T259432) [13:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:53] (03PS1) 10Filippo Giunchedi: profile: refactor logmsgbot to follow Icinga failover [puppet] - 10https://gerrit.wikimedia.org/r/620701 (https://phabricator.wikimedia.org/T247966) [13:26:53] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Creating lldwiki (T259432) (duration: 00m 53s) [13:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:56] T259432: Create Wikipedia Ladin - https://phabricator.wikimedia.org/T259432 [13:27:04] 
!log urbanecm@deploy1001 sync-file aborted: Creating lldwiki (T259432)¨ (duration: 00m 00s) [13:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:04] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating lldwiki (T259432) (duration: 00m 55s) [13:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:10] !log urbanecm@deploy1001 Synchronized langlist: Creating lldwiki (T259432) (duration: 00m 54s) [13:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:20] Amir1: done, wiki is live! [13:30:10] (03PS3) 10Urbanecm: Initial configuration for thankyouwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619852 (https://phabricator.wikimedia.org/T259002) [13:30:21] Amir1: ready to merge https://gerrit.wikimedia.org/r/c/619852 now :) [13:30:28] <_joe_> Urbanecm: lmk when you are doen with wiki creations please [13:30:35] _joe_: sure [13:30:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P12271 and previous config saved to /var/cache/conftool/dbconfig/20200817-133043-marostegui.json [13:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:42] Urbanecm: let's go [13:31:49] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for thankyouwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619852 (https://phabricator.wikimedia.org/T259002) (owner: 10Urbanecm) [13:32:00] Amir1: thanks [13:32:41] (03Merged) 10jenkins-bot: Initial configuration for thankyouwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619852 (https://phabricator.wikimedia.org/T259002) (owner: 10Urbanecm) [13:33:37] !log Restart mysql on db2102 (testing new package) [13:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:54] Amir1: going to run `mwscript extensions/WikimediaMaintenance/addWiki.php 
--wiki=muswiki en wikipedia thankyouwiki thankyou.wikipedia.org` [13:33:59] (03CR) 10Ppchelko: "Thank you so much! :)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [13:34:18] hmm, is it in wikipedia group? [13:34:20] Why? [13:34:27] (03CR) 10Volans: [C: 03+1] "LGTM, go ahead with live testing!" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 (owner: 10Giuseppe Lavagetto) [13:34:37] !log deploy json-c security update to buster [13:34:39] just thinking out loud right now [13:34:41] !log deploy json-c security update to buster [13:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:49] Amir1: you mean, why it's thankyou.wikipedia.org? [13:35:00] IIRC the reason is "end of third party cookies" [13:35:11] no, that's the team's decision [13:35:15] the interwiki group [13:35:20] in sites table [13:35:25] aha [13:36:14] let's double check donatewiki [13:36:19] sure [13:36:25] shouldn't that match the domain name? [13:36:36] Amir1: donatewiki is with .wikimedia.org [13:36:41] ugh [13:36:49] check arbcom_cswiki or something like that, it's more similar [13:36:56] arbcom_cswiki resides at arbcom-cs.wikipedia.org [13:37:14] technically it can be anything, e.g. 
wikisource has two groups [13:37:57] i see [13:38:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Introduce sre.host.reboot-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 (owner: 10Giuseppe Lavagetto) [13:38:28] arbcom_cs has its own group [13:38:29] arbcom-cs [13:38:35] heh [13:38:45] this helps in not showing up in wikidata for example [13:38:58] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Enable eqiad backups, which are sent to codfw (backup2001) [puppet] - 10https://gerrit.wikimedia.org/r/620654 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [13:38:58] otherwise, thankyouwiki will show up in wikidata for sitelinks [13:39:00] (03Merged) 10jenkins-bot: Introduce sre.host.reboot-cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 (owner: 10Giuseppe Lavagetto) [13:39:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1088 for mysql upgrade', diff saved to https://phabricator.wikimedia.org/P12272 and previous config saved to /var/cache/conftool/dbconfig/20200817-133905-marostegui.json [13:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:13] Amir1: hmm, we don't want it to be in wikidata [13:39:28] so, let's give it a sui generis group? 
[13:39:32] !log Upgrade db1088 (s6) to a newer mysql version (10.4.14) [13:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:01] !log imported !log imported to buster-wikimedia [13:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:15] https://www.irccloud.com/pastebin/YJjd05RL/ [13:40:20] Urbanecm: ^ [13:40:45] We need to define a new group for it, just use "thankyou" instead of wikipedia in the second argument [13:41:31] !log imported td-agent-bit_1.5.3-0 to buster-wikimedia - T260536 [13:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:34] T260536: Package an up to date version of fluent-bit / td-agent-bit for buster - https://phabricator.wikimedia.org/T260536 [13:41:37] Amir1: so, `mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=muswiki en thankyou thankyouwiki thankyou.wikipedia.org`? [13:41:57] Urbanecm: ^ muswiki <3 :) [13:42:14] marostegui: yup, second wiki to be created via muswiki! [13:42:19] \o/ [13:42:19] (lldwiki is already done) [13:42:38] Urbanecm: yup [13:42:45] okay, running that Amir1 [13:42:49] Urbanecm: I will wait for the usual comment on the tasks before proceeding with the sanitization [13:42:55] <_joe_> do you need to do more code deployments? [13:42:57] let's cross our fingers [13:42:58] sure [13:43:09] _joe_: yes, I will need to sync a bunch of files [13:43:16] _joe_: yup, it'll be done in five to ten minutes [13:43:25] <_joe_> ok, I'll wait to test my script to reboot a whole cluster then :D [13:43:45] <_joe_> sorry, I'm just on a tight schedule with meetings and such [13:44:23] Amir1: script completed, going to sync the config then [13:44:43] (03CR) 10Jcrespo: "> How much work is it to make the tests work under py3.5?" 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620291 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [13:44:46] Urbanecm: okay, sync up to wikiversions.json [13:44:51] then pull it in mwdebug [13:45:14] okay [13:45:36] _joe_: sorry :( Is there anything I can help with? I assume most of the work requires root access :D [13:46:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1088 after upgrading its mysql package', diff saved to https://phabricator.wikimedia.org/P12273 and previous config saved to /var/cache/conftool/dbconfig/20200817-134604-marostegui.json [13:46:06] !log urbanecm@deploy1001 Synchronized wmf-config/db-eqiad.php: Creating thankyouwiki (T259002) (duration: 00m 55s) [13:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:12] T259002: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 [13:46:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1099:3318', diff saved to https://phabricator.wikimedia.org/P12274 and previous config saved to /var/cache/conftool/dbconfig/20200817-134619-marostegui.json [13:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:53] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/24510/" [puppet] - 10https://gerrit.wikimedia.org/r/620701 (https://phabricator.wikimedia.org/T247966) (owner: 10Filippo Giunchedi) [13:47:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1104 for MCR change', diff saved to https://phabricator.wikimedia.org/P12275 and previous config saved to /var/cache/conftool/dbconfig/20200817-134701-marostegui.json [13:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:07] !log urbanecm@deploy1001 Synchronized wmf-config/db-codfw.php: Creating 
thankyouwiki (T259002) (duration: 00m 56s) [13:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:29] !log Deploy MCR change on db1104 [13:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:06] !log urbanecm@deploy1001 Synchronized dblists: Creating thankyouwiki (T259002) (duration: 00m 55s) [13:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:08] Amir1: wiki is live at mwdebug1001! [13:49:15] \o/ [13:49:26] !log urbanecm@deploy1001 rebuilt and synchronized wikiversions files: Creating thankyouwiki (T259002) [13:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:45] Amir1: so, let's sync the rest and let pcoombe test it works properly for their use-case? [13:49:56] yeah [13:50:09] on it :) [13:50:38] is thankyouwiki being kept out of centralauth intentionally? [13:50:44] Yes [13:51:02] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: Creating thankyouwiki (T259002) (duration: 00m 55s) [13:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:07] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Creating thankyouwiki (T259002) (duration: 00m 55s) [13:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:10] T259002: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 [13:52:39] updating interwiki cache now [13:52:46] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620704 [13:52:48] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620704 (owner: 10Urbanecm) [13:53:26] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620704 (owner: 10Urbanecm) [13:54:09] !log Create account Pcoombe 
(WMF) at thankyouwiki, email set to pcoombe@wikimedia.org (T259002) [13:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:28] !log urbanecm@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 01m 52s) [13:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:32] pcoombe: ^ [13:54:38] The wiki is ready [13:54:39] Amir1: so, we should be done now? [13:54:43] yup [13:54:47] _joe_: We are done [13:54:48] good! [13:54:58] !log Creating thankyouwiki and lldwiki is done [13:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:21] I start the wikidata bot [13:55:26] thx [13:55:34] <_joe_> great, thanks [13:56:26] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm) @Pcoombe The wiki was just created. You should have an accou... [13:56:54] Thanks so much Amir1 and Urbanecm! [13:57:16] pcoombe: happy to help! Let us know if something should be changed/fixed :). 
[13:58:15] !log start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T259432) [13:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:18] T259432: Create Wikipedia Ladin - https://phabricator.wikimedia.org/T259432 [13:58:32] pcoombe: Urbanecm did most of the work, I just complained [13:59:10] _joe_: my maintenance run should not affect your work, it just changes some stuff in db [13:59:18] it'll be done quickly [14:02:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1088 after upgrading its mysql package', diff saved to https://phabricator.wikimedia.org/P12276 and previous config saved to /var/cache/conftool/dbconfig/20200817-140229-marostegui.json [14:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:08] !log Sanitize lldwiki on db1124:3315 and db2094:3315 T259436 [14:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:10] T259436: Prepare and check storage layer for lldwiki - https://phabricator.wikimedia.org/T259436 [14:03:55] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (dbprov1001, ...), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [14:09:09] !log Sanitize thankyouwiki on db1124:3315, db2094:3315 - T260551 [14:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:12] T260551: Prepare and check storage layer for thankyouwiki - https://phabricator.wikimedia.org/T260551 [14:11:16] (03PS5) 10Ppchelko: Resurrect fluent-bit image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) [14:12:18] (03PS9) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [14:12:38] 10Operations, 
10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Pcoombe) Thanks so much @Urbanecm! Logging in worked fine, I made some... [14:13:02] (03CR) 10Ppchelko: "Done. The image is not based on buster and is using a newer version of td-agent-bit. Verified it works" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [14:14:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1088 after upgrading its mysql package', diff saved to https://phabricator.wikimedia.org/P12277 and previous config saved to /var/cache/conftool/dbconfig/20200817-141449-marostegui.json [14:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:52] bacula it is me running new backups, will be fixed as soon as the first backup run finises [14:15:40] (03CR) 10Ppchelko: "This doesn't do TLS for fluent-bit sidecar -> eventgate, but I am planning to add it in a followup" [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [14:15:42] (03CR) 10JMeybohm: "> Patch Set 4:" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/620305 (owner: 10Giuseppe Lavagetto) [14:15:44] (03PS1) 10Filippo Giunchedi: role: add acmechief config for alerts.w.o [puppet] - 10https://gerrit.wikimedia.org/r/620709 (https://phabricator.wikimedia.org/T258948) [14:20:56] 10Operations, 10LDAP-Access-Requests, 10observability, 10serviceops, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10AMooney) 05Stalled→03Invalid @fgiunchedi, this ticket will be closed [14:21:13] (03CR) 10JMeybohm: [C: 04-1] Resurrect fluent-bit image (031 comment) 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [14:21:27] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:22:09] (03CR) 10Ppchelko: Resurrect fluent-bit image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [14:22:29] (03PS6) 10Ppchelko: Resurrect fluent-bit image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) [14:23:26] (03CR) 10JMeybohm: [C: 03+1] "👍" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [14:24:16] (03PS9) 10Kormat: wmfmariadbpy: Load and provide a method for section to port assignment [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620291 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [14:24:18] (03PS1) 10Filippo Giunchedi: icinga: ensure tmpfs cleanup [puppet] - 10https://gerrit.wikimedia.org/r/620710 (https://phabricator.wikimedia.org/T260521) [14:24:41] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Jan Eisfeldt (JanWMF) - https://phabricator.wikimedia.org/T260555 (10drochford) [14:25:38] (03PS1) 10Ottomata: wgEventLoggingSchemas - update bugged schema revisions (group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620711 (https://phabricator.wikimedia.org/T254606) [14:28:28] (03PS2) 10Ottomata: wgEventLoggingSchemas - update bugged schema revisions (group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620711 (https://phabricator.wikimedia.org/T254606) [14:28:37] (03CR) 10Vgutierrez: [C: 03+1] role: add acmechief config for alerts.w.o 
[puppet] - 10https://gerrit.wikimedia.org/r/620709 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [14:29:36] (03PS10) 10Kormat: wmfmariadbpy: Load and provide a method for section to port assignment [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620291 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [14:29:54] (03CR) 10Kormat: [C: 03+1] wmfmariadbpy: Load and provide a method for section to port assignment [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620291 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [14:30:23] (03PS3) 10Ottomata: wgEventLoggingSchemas - update bugged schema revisions (group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620711 (https://phabricator.wikimedia.org/T254606) [14:30:46] (03CR) 10Milimetric: [C: 03+2] wgEventLoggingSchemas - update bugged schema revisions (group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620711 (https://phabricator.wikimedia.org/T254606) (owner: 10Ottomata) [14:31:30] (03Merged) 10jenkins-bot: wgEventLoggingSchemas - update bugged schema revisions (group0) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620711 (https://phabricator.wikimedia.org/T254606) (owner: 10Ottomata) [14:35:05] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingSchemas - schema revision version bump for erroring schemas - group0 - T254606 (duration: 00m 56s) [14:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:09] T254606: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 [14:36:17] (03PS1) 10Ottomata: wgEventLoggingSchemas - update bugged schema revisions - all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620713 (https://phabricator.wikimedia.org/T254606) [14:36:34] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Jan Eisfeldt (JanWMF) - https://phabricator.wikimedia.org/T260555 (10fgiunchedi) It looks 
like the JanWMF user isn't on wikitech ATM: https://wikitech.wikimedia.org/wiki/User:JanWMF, perhaps the wikitech username is another one? [14:36:48] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Jan Eissfeldt (JanWMF) - https://phabricator.wikimedia.org/T260555 (10fgiunchedi) [14:38:46] (03CR) 10CDanis: [C: 03+1] hieradata: limit queries to Thanos sidecar / Prometheus to last 15d [puppet] - 10https://gerrit.wikimedia.org/r/620656 (https://phabricator.wikimedia.org/T260241) (owner: 10Filippo Giunchedi) [14:42:09] (03CR) 10Milimetric: [C: 03+2] wgEventLoggingSchemas - update bugged schema revisions - all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620713 (https://phabricator.wikimedia.org/T254606) (owner: 10Ottomata) [14:42:54] (03Merged) 10jenkins-bot: wgEventLoggingSchemas - update bugged schema revisions - all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620713 (https://phabricator.wikimedia.org/T254606) (owner: 10Ottomata) [14:44:09] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingSchemas - schema revision version bump for erroring schemas - all wikis - T254606 (duration: 00m 55s) [14:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:13] T254606: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 [14:47:33] Amir1, Urbanecm, the logo for lldwiki is still in English [14:51:18] (03CR) 10Ppchelko: [C: 04-1] api-gateway: strip cookie headers from requests and responses. 
(031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/620311 (https://phabricator.wikimedia.org/T259296) (owner: 10Hnowlan) [14:51:23] (03PS1) 10Giuseppe Lavagetto: Fix typo in import [cookbooks] - 10https://gerrit.wikimedia.org/r/620714 [14:52:24] (03CR) 10jerkins-bot: [V: 04-1] Fix typo in import [cookbooks] - 10https://gerrit.wikimedia.org/r/620714 (owner: 10Giuseppe Lavagetto) [14:53:53] (03PS2) 10Giuseppe Lavagetto: Fix typo in import [cookbooks] - 10https://gerrit.wikimedia.org/r/620714 [14:54:13] (03PS11) 10Jcrespo: wmfmariadbpy: Load and provide a method for section to port assignment [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620291 (https://phabricator.wikimedia.org/T165358) [14:54:36] (03CR) 10jerkins-bot: [V: 04-1] wmfmariadbpy: Load and provide a method for section to port assignment [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620291 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [14:55:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fix typo in import [cookbooks] - 10https://gerrit.wikimedia.org/r/620714 (owner: 10Giuseppe Lavagetto) [14:56:11] (03Merged) 10jenkins-bot: Fix typo in import [cookbooks] - 10https://gerrit.wikimedia.org/r/620714 (owner: 10Giuseppe Lavagetto) [14:57:18] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster [14:57:18] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [14:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:33] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingSchemas - schema revision version bump for erroring schemas - all wikis (take 2) - T254606 (duration: 00m 53s) [14:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:37] T254606: Invalid navigation timing events - https://phabricator.wikimedia.org/T254606 
[14:58:28] (03CR) 10Ppchelko: [C: 04-1] api-gateway: strip cookie headers from requests and responses. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/620311 (https://phabricator.wikimedia.org/T259296) (owner: 10Hnowlan) [15:01:57] (03CR) 10Ppchelko: [C: 03+2] Add $schema to resource_purge events [deployment-charts] - 10https://gerrit.wikimedia.org/r/620008 (owner: 10Ppchelko) [15:01:59] (03PS12) 10Jcrespo: wmfmariadbpy: Load and provide a method for section to port assignment [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620291 (https://phabricator.wikimedia.org/T165358) [15:02:57] (03Merged) 10jenkins-bot: Add $schema to resource_purge events [deployment-charts] - 10https://gerrit.wikimedia.org/r/620008 (owner: 10Ppchelko) [15:04:42] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [15:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:53] PROBLEM - Check systemd state on backup2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:02] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop' for release 'production' . [15:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:32] (03CR) 10Kormat: [C: 03+1] wmfmariadbpy: Load and provide a method for section to port assignment [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620291 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [15:07:17] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: clean up workaround and measurements put in place during Jio RPKI error - https://phabricator.wikimedia.org/T260452 (10CDanis) [15:08:26] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[15:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:22] (03CR) 10Filippo Giunchedi: "Note this is a workaround for the issue in the task, nevertheless sth we should do anyways I think" [puppet] - 10https://gerrit.wikimedia.org/r/620710 (https://phabricator.wikimedia.org/T260521) (owner: 10Filippo Giunchedi) [15:18:41] RECOVERY - Check systemd state on backup2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:52] (03CR) 10Filippo Giunchedi: [C: 03+2] role: add acmechief config for alerts.w.o [puppet] - 10https://gerrit.wikimedia.org/r/620709 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [15:19:41] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ddcecce]: T257943 T260556 T253478 T254490 T259054 [15:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:49] T254490: Consider increasing s-maxage for /page/mobile-html-offline-resources - https://phabricator.wikimedia.org/T254490 [15:19:50] T253478: mobile-html: Error previewing edits for pages with slashes in the title in the Android app - https://phabricator.wikimedia.org/T253478 [15:19:51] T260556: Add lldwiki to RESTBase - https://phabricator.wikimedia.org/T260556 [15:19:51] T257943: Create Wikipedia Kotava - https://phabricator.wikimedia.org/T257943 [15:19:51] T259054: RESTBase CORS redirect resolve should not hit frontend caches - https://phabricator.wikimedia.org/T259054 [15:22:11] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ddcecce]: T257943 T260556 T253478 T254490 T259054 (duration: 02m 30s) [15:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:25] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ddcecce]: T257943 T260556 T253478 T254490 T259054. take 2. 
feeds timed out [15:22:28] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-40] - https://phabricator.wikimedia.org/T260445 (10Ottomata) [15:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:43] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: 2020-09-15) upgrade/replace memory in stat100[58] - https://phabricator.wikimedia.org/T260448 (10Ottomata) [15:26:53] Jhs: I'll have a look [15:28:16] (03CR) 10CDanis: [C: 03+2] Revert "whitelist broken advertisements from Jio AS55836" [homer/public] - 10https://gerrit.wikimedia.org/r/620386 (https://phabricator.wikimedia.org/T260452) (owner: 10Ssingh) [15:30:36] !log ❌cdanis@cumin1001.eqiad.wmnet ~ 🕦☕ homer 'cr*-codfw*' commit 'revert skipping RPKI validation for Jio AS55836 I0fd4683 T260452' [15:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:39] T260452: clean up workaround and measurements put in place during Jio RPKI error - https://phabricator.wikimedia.org/T260452 [15:34:36] (03PS1) 10Jcrespo: mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) [15:35:40] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [15:36:41] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕦☕ homer 'cr*' commit 'revert skipping RPKI validation for Jio AS55836 I0fd4683 T260452' [15:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:44] T260452: clean up workaround and measurements put in place during Jio RPKI error - https://phabricator.wikimedia.org/T260452 [15:37:10] (03PS2) 10Jcrespo: mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 
(https://phabricator.wikimedia.org/T257033) [15:38:13] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [15:38:14] PROBLEM - Check systemd state on backup2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:28] (03PS3) 10Jcrespo: mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) [15:40:35] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [15:40:46] (03PS3) 10Ppchelko: ratelimit: crash on startup if config is invalid [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620068 [15:40:57] 10Operations, 10Traffic, 10netops: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites - https://phabricator.wikimedia.org/T260449 (10CDanis) [15:40:59] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: clean up workaround and measurements put in place during Jio RPKI error - https://phabricator.wikimedia.org/T260452 (10CDanis) 05Open→03Resolved [15:41:10] (03PS4) 10Jcrespo: mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) [15:42:18] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [15:42:30] 10Puppet, 10Analytics, 10Analytics-Kanban, 10Cloud-VPS: Puppet failing on wikistats.analytics.eqiad.wmflabs: /usr/local/sbin/x509-bundle error - https://phabricator.wikimedia.org/T255464 (10Ottomata) 05Open→03Resolved We 
realized today that we don't need this instance at all. Deleted it. [15:43:05] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ddcecce]: T257943 T260556 T253478 T254490 T259054. take 2. feeds timed out (duration: 20m 40s) [15:43:09] !log ppchelko@deploy1001 Started deploy [restbase/deploy@ddcecce]: T257943 T260556 T253478 T254490 T259054. take 3. feeds timed out [15:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:11] T254490: Consider increasing s-maxage for /page/mobile-html-offline-resources - https://phabricator.wikimedia.org/T254490 [15:43:11] T253478: mobile-html: Error previewing edits for pages with slashes in the title in the Android app - https://phabricator.wikimedia.org/T253478 [15:43:12] T260556: Add lldwiki to RESTBase - https://phabricator.wikimedia.org/T260556 [15:43:12] T257943: Create Wikipedia Kotava - https://phabricator.wikimedia.org/T257943 [15:43:12] T259054: RESTBase CORS redirect resolve should not hit frontend caches - https://phabricator.wikimedia.org/T259054 [15:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:40] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@ddcecce]: T257943 T260556 T253478 T254490 T259054. take 3. 
feeds timed out (duration: 01m 31s) [15:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:59] (03PS5) 10Jcrespo: mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) [15:46:09] RECOVERY - Check systemd state on backup2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:16] 10Operations, 10RESTBase, 10Traffic, 10Platform Team Workboards (Clinic Duty Team): RESTBase CORS redirect resolve should not hit frontend caches - https://phabricator.wikimedia.org/T259054 (10Pchelolo) 05Open→03Resolved a:03Pchelolo ` curl -i -H 'origin: test' http://restbase.discovery.wmnet:7231/fr... [15:47:03] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [15:47:27] (03PS1) 10Jgreen: nsca_frack.cfg.erb - remove check_ipsec from civicrm servers, add to fran1001 [puppet] - 10https://gerrit.wikimedia.org/r/620725 (https://phabricator.wikimedia.org/T258526) [15:47:42] (03PS6) 10Jcrespo: mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) [15:48:49] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [15:50:30] (03PS7) 10Jcrespo: mariadb: Setup section->port assignment on puppet [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) [15:50:46] 10Operations, 10Shinken: Make the Shinken IRC alert and icinga-wm bots use colors - https://phabricator.wikimedia.org/T113785 (10Andrew) 05Open→03Resolved a:03Andrew The shinken project has been closed and deleted. 
[15:51:15] 10Operations, 10Scap, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Scap configuration for WDQS should get server groups from a known source or truth - https://phabricator.wikimedia.org/T252124 (10Gehel) [15:54:57] (03CR) 10Jcrespo: "This would unblock merging:" [puppet] - 10https://gerrit.wikimedia.org/r/620722 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [15:54:59] (03CR) 10Clarakosi: Modify api-gateway access logging to conform to schema (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [15:55:18] (03CR) 10Dwisehaupt: [C: 03+1] "rusgood. shipit." [puppet] - 10https://gerrit.wikimedia.org/r/620725 (https://phabricator.wikimedia.org/T258526) (owner: 10Jgreen) [15:57:38] (03PS1) 10Filippo Giunchedi: Add prometheus::karma and profile::alertmanager::web [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) [15:57:40] (03PS1) 10Filippo Giunchedi: Enable profile::alertmanager::web on alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/620727 (https://phabricator.wikimedia.org/T258948) [15:58:43] (03CR) 10jerkins-bot: [V: 04-1] Add prometheus::karma and profile::alertmanager::web [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [15:58:55] (03PS1) 10Giuseppe Lavagetto: reboot-cluster: fix subtraction of excluded hosts. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/620728 [16:00:21] (03PS2) 10Jcrespo: wmfmariadbpy: Add unit tests for resolve method [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620319 [16:00:43] 10Operations, 10ops-codfw: mc2028 regular and mgmt interface down - https://phabricator.wikimedia.org/T260224 (10RLazarus) Yes please -- let's use the spare machine, add the memory from mc2028, and name the new host mc2037 (that is, let's use @elukey's option #1, but //not// reuse the name "mc2028", at @Volans... [16:01:10] (03CR) 10Volans: [C: 03+1] "LGTM, sorry for have missed that in the first CR" [cookbooks] - 10https://gerrit.wikimedia.org/r/620728 (owner: 10Giuseppe Lavagetto) [16:02:09] (03CR) 10Filippo Giunchedi: "Build failure is due to this:" [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:04:41] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/24513/" [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:05:49] PROBLEM - Check systemd state on backup2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:50] (03PS1) 10Dzahn: mediawiki: remove mongodb PHP extension from appservers [puppet] - 10https://gerrit.wikimedia.org/r/620729 [16:08:20] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, 10Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cdanis on cumin1001.eqiad.wmnet for hosts: ` mw1359.eqiad.wmne... [16:08:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] reboot-cluster: fix subtraction of excluded hosts. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/620728 (owner: 10Giuseppe Lavagetto) [16:08:55] (03PS2) 10Dzahn: mediawiki: remove mongodb PHP extension from appservers [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) [16:09:22] I am checking backup2001 [16:09:47] (03Merged) 10jenkins-bot: reboot-cluster: fix subtraction of excluded hosts. [cookbooks] - 10https://gerrit.wikimedia.org/r/620728 (owner: 10Giuseppe Lavagetto) [16:09:50] systemd-timedated.service failed? [16:10:04] that's weird [16:10:22] (03PS2) 10Filippo Giunchedi: Add prometheus::karma and profile::alertmanager::web [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) [16:10:24] (03PS2) 10Filippo Giunchedi: Enable profile::alertmanager::web on alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/620727 (https://phabricator.wikimedia.org/T258948) [16:10:40] (03PS1) 10Ssingh: wikidough: increase TCP connection limits for dnsrecursor [puppet] - 10https://gerrit.wikimedia.org/r/620730 (https://phabricator.wikimedia.org/T252132) [16:11:32] (03CR) 10jerkins-bot: [V: 04-1] Add prometheus::karma and profile::alertmanager::web [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:11:34] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:12:28] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster [16:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:00] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [16:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:39] RECOVERY - Check systemd state on backup2001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:06] (03PS2) 10Ottomata: Remove SearchSatisfaction from wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620051 (https://phabricator.wikimedia.org/T259163) [16:14:34] !log cdanis@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1359.* [16:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:41] (03PS1) 10Dzahn: releases:reprepro: update outdated warning comments [puppet] - 10https://gerrit.wikimedia.org/r/620731 [16:15:14] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/24515/" [puppet] - 10https://gerrit.wikimedia.org/r/620730 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:15:38] (03CR) 10Dzahn: [C: 03+2] releases:reprepro: update outdated warning comments [puppet] - 10https://gerrit.wikimedia.org/r/620731 (owner: 10Dzahn) [16:15:53] (03CR) 10Ssingh: [C: 03+2] wikidough: increase TCP connection limits for dnsrecursor [puppet] - 10https://gerrit.wikimedia.org/r/620730 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:16:23] mutante: ok to merge your changes? :) [16:17:00] (03PS1) 10Giuseppe Lavagetto: sre.hosts.reboot-cluster: fix arg name, reporting [cookbooks] - 10https://gerrit.wikimedia.org/r/620732 [16:17:05] uh [16:17:07] Contact an administrator to fix the permissions [16:17:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre.hosts.reboot-cluster: fix arg name, reporting [cookbooks] - 10https://gerrit.wikimedia.org/r/620732 (owner: 10Giuseppe Lavagetto) [16:17:20] sukhe: yes please [16:17:25] I don't have permissions to push to pywikibot/core [16:17:47] how do I get myself granted permissions? 
[16:17:57] mutante: done, thanks [16:18:11] thx [16:18:12] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.26:443]) https://wikitech.wikimedia.org/wiki/PyBal [16:18:13] (03Merged) 10jenkins-bot: sre.hosts.reboot-cluster: fix arg name, reporting [cookbooks] - 10https://gerrit.wikimedia.org/r/620732 (owner: 10Giuseppe Lavagetto) [16:18:21] _joe_: Are you an administrator? Can you grant me push permissions in pywikibot/core? [16:18:43] (03CR) 10Dave Pifke: "The MongoDB stuff from mediawiki-config should probably be removed first:" [puppet] - 10https://gerrit.wikimedia.org/r/620729 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn) [16:18:55] (03CR) 10Ottomata: [C: 03+2] Remove SearchSatisfaction from wgEventLoggingSchemas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620051 (https://phabricator.wikimedia.org/T259163) (owner: 10Ottomata) [16:19:54] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.26:443]) https://wikitech.wikimedia.org/wiki/PyBal [16:19:58] <_joe_> that's me [16:20:09] !log cdanis@cumin1001 START - Cookbook sre.hosts.downtime [16:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:33] so um, how can I get myself permissions to push to pywikibot/core? 
_joe_ [16:20:59] !log oblivian@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,dc=codfw [16:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:01] (03PS3) 10Filippo Giunchedi: Add prometheus::karma and profile::alertmanager::web [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) [16:21:02] @seen hashar [16:21:02] mutante: hashar is in here, right now [16:21:05] (03PS3) 10Filippo Giunchedi: Enable profile::alertmanager::web on alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/620727 (https://phabricator.wikimedia.org/T258948) [16:21:12] Evrifaessa: I don't know but suspect that https://www.mediawiki.org/wiki/Gerrit/Privilege_policy applies [16:21:19] !log oblivian@cumin1001 conftool action : set/pooled=inactive; selector: cluster=jobrunner,dc=codfw,name=mw2250.codfw.wmnet [16:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:58] thanks for confirming the pybal alert re: jobrunner is known, ack [16:22:06] (03CR) 10jerkins-bot: [V: 04-1] Add prometheus::karma and profile::alertmanager::web [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:22:16] !log cdanis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:32] Evrifaessa: #wikimedia-releng would be a better channel for Gerrit problems [16:22:40] alright [16:23:07] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgEventLoggingSchemas - remove unneeded override for SearchSatisfaction - T259163 (duration: 00m 56s) [16:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:09] T259163: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 [16:23:36] RECOVERY - PyBal IPVS diff check on lvs2009 is 
OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:23:42] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10Urbanecm) 05Open→03Resolved a:03Urbanecm >>! In T259002#6389047,... [16:24:05] (03PS1) 10Andrew Bogott: Update wmcs admin scripts to use HA-proxy database front end [puppet] - 10https://gerrit.wikimedia.org/r/620734 [16:24:17] ottomata & company: mind me syncing a quick security patch? [16:24:22] PROBLEM - Check systemd state on backup2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:46] 10Puppet: wmf-style lint detects variable expansion in variables as parameter declaration - https://phabricator.wikimedia.org/T260574 (10fgiunchedi) [16:24:54] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:25:00] (03CR) 10Jbond: "lgtm see inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:25:21] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:25:21] Urbanecm: sure i'm done [16:25:24] RECOVERY - Check systemd state on backup2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:28] 10Operations, 10ops-esams: esams: normalize the power outlet assignments - https://phabricator.wikimedia.org/T243088 (10RobH) 05Open→03Stalled p:05Medium→03Lowest a:05RobH→03None Setting to lowest, stalled, as this isn't worth spending cycles on, but should happen 
the next time we have one of our o... [16:25:29] thanks [16:25:54] (03CR) 10Andrew Bogott: [C: 03+2] Update wmcs admin scripts to use HA-proxy database front end [puppet] - 10https://gerrit.wikimedia.org/r/620734 (owner: 10Andrew Bogott) [16:27:41] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 00m 56s) [16:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:47] * Urbanecm is done too [16:28:01] hashar: wanna switch that docroot by any chance? [16:28:12] I am in a pres right now : ;) [16:28:20] ok, another day :) [16:28:21] 10Operations, 10DC-Ops, 10SRE-swift-storage: ms-be raid setup / evaluation (currently using swraid on top of hwraid) - https://phabricator.wikimedia.org/T211231 (10RobH) 05Open→03Resolved Wow, I had forgotten this task existed. Closing this loop: * dual ssd in hw raid1 * all lff disks as raid0 independ... [16:29:25] 10Operations, 10Fundraising-Backlog, 10MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), 10Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (10spatton) Thank you @Ladsgroup and @Urbanecm for the amazing support. [16:29:36] 10Operations, 10ops-esams: trace qfx5100-spare[12]-esams power cables - https://phabricator.wikimedia.org/T244914 (10RobH) p:05Low→03Lowest a:05RobH→03None This will need to be traced out when we schedule our next onsite work. Setting to lowest priority. 
[16:29:46] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:30:06] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:31:35] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster [16:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:37] (03PS4) 10Filippo Giunchedi: Add prometheus::karma and profile::alertmanager::web [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) [16:31:39] (03PS4) 10Filippo Giunchedi: Enable profile::alertmanager::web on alerting_host [puppet] - 10https://gerrit.wikimedia.org/r/620727 (https://phabricator.wikimedia.org/T258948) [16:32:06] PROBLEM - Ensure traffic_server is running for instance tls on cp5006 is CRITICAL: PROCS CRITICAL: 0 processes with args /srv/trafficserver/tls/bin/traffic_server -M --run-root=/srv/trafficserver/tls/runroot.yaml --httpport 443 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [16:32:42] (03CR) 10jerkins-bot: [V: 04-1] Add prometheus::karma and profile::alertmanager::web [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:33:11] (03CR) 10Filippo Giunchedi: Add prometheus::karma and profile::alertmanager::web (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:33:32] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 59 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:35:21] !log ppchelko@deploy1001 Started deploy [restbase/deploy@7f16bad]: Add thankyouwiki T259002 [16:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:25] T259002: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 [16:36:15] !log restart backup2001, backup1001 one after the other [16:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:14] 04Critical Alert for device cr3-eqsin.wikimedia.org - Primary outbound port utilisation over 80% #page [16:39:30] (03PS5) 10Jbond: Add prometheus::karma and profile::alertmanager::web [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:39:38] uhm [16:40:06] 👀 [16:40:15] (03CR) 10Jbond: "> For sure! Opened https://phabricator.wikimedia.org/T260574 for this issue" [puppet] - 10https://gerrit.wikimedia.org/r/620726 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [16:40:34] yeah that's real, outbound eqsin peering is saturating [16:43:13] (03PS10) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [16:43:32] yup [16:43:47] PROBLEM - LibreNMS has a critical alert #page on icinga1001 is CRITICAL: Primary outbound port utilisation over 80% #page (cr3-eqsin.wikimedia.org) https://bit.ly/wmf-librenms [16:43:51] I'm on my phone and 40min away from being able to help [16:43:54] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, 10Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1359.eqiad.wmnet'] ` and were **ALL** successful. 
[16:43:57] XioNoX: looks to already be subsiding [16:44:42] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 45 probes of 573 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:44:46] * apergos peeks in [16:45:21] <_joe_> o/ [16:45:54] * jbond42 also around [16:46:25] aaa [16:46:32] yep around too [16:47:07] I think things are okay now [16:47:28] (03PS1) 10Giuseppe Lavagetto: sre.hosts.reboot-cluster: further refinements [cookbooks] - 10https://gerrit.wikimedia.org/r/620741 [16:47:30] (03PS1) 10Giuseppe Lavagetto: sre.hosts.reboot-cluster: set the servers to state=inactive [cookbooks] - 10https://gerrit.wikimedia.org/r/620742 [16:48:16] I see it's coming down in the graphs... what would cause a spike there only though? [16:49:12] (03PS11) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [16:49:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr3-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% #page [16:49:39] (03CR) 10Ppchelko: Modify api-gateway access logging to conform to schema (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [16:49:48] apergos: see other channel [16:50:04] (03PS1) 10Urbanecm: Add logo files for lldwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620743 (https://phabricator.wikimedia.org/T259432) [16:50:11] yeah I was doing the backread there already, thanks [16:51:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre.hosts.reboot-cluster: further refinements [cookbooks] - 10https://gerrit.wikimedia.org/r/620741 (owner: 10Giuseppe Lavagetto) [16:52:31] (03Merged) 10jenkins-bot: sre.hosts.reboot-cluster: further refinements [cookbooks] - 
10https://gerrit.wikimedia.org/r/620741 (owner: 10Giuseppe Lavagetto) [16:53:18] (03PS1) 10Urbanecm: Change logo for lldwiki to match the requested one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620744 (https://phabricator.wikimedia.org/T259432) [16:53:31] PROBLEM - mediawiki-installation DSH group on mw2250 is CRITICAL: Host mw2250 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:53:51] RECOVERY - LibreNMS has a critical alert #page on icinga1001 is OK: OK: zero critical LibreNMS alerts https://bit.ly/wmf-librenms [16:54:05] (03CR) 10jerkins-bot: [V: 04-1] Change logo for lldwiki to match the requested one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620744 (https://phabricator.wikimedia.org/T259432) (owner: 10Urbanecm) [16:57:18] (03PS2) 10Urbanecm: Add logo files for lldwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620743 (https://phabricator.wikimedia.org/T259432) [16:58:38] (03PS2) 10Urbanecm: Change logo for lldwiki to match the requested one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620744 (https://phabricator.wikimedia.org/T259432) [17:00:04] gehel and onimisionipe: My dear minions, it's time we take the moon! Just kidding. Time for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200817T1700). 
[17:01:52] (03PS6) 10Ppchelko: Configure ratelimiter to support authenticated/anon limits for api [deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) [17:01:53] !log oblivian@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [17:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:35] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:03:41] _joe_: I hope your rebooting script worked as intended :-). [17:04:04] !log oblivian@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,dc=codfw,name=mw2246.codfw.wmnet [17:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:29] <_joe_> Urbanecm: yes, I interrupted it by hand because I'm doing tests :P [17:05:03] mutante: yeah sorry I was doing a presentation this evening. Can we do it tomorrow? My agenda is up to date :] [17:05:31] _joe_: hehe, I see :). I'd be afraid to test something that has the ability to poweroff the fleet through :) [17:06:00] hashar: sure, tomorrow then [17:06:01] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster [17:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:07] (03CR) 10ZPapierski: [C: 04-1] "For some reason reload failed. It doesn't seem to be an issue with the cron job, but just in case..." 
[puppet] - 10https://gerrit.wikimedia.org/r/619289 (https://phabricator.wikimedia.org/T251515) (owner: 10ZPapierski) [17:07:07] <_joe_> Urbanecm: well I'm testing on a cluster that's not serving any traffic for a reason :P [17:07:35] still, "cluster overflow" can be a real thing :D [17:07:35] mutante: also jenkins had a security release today but we are not affected (we run a more recent version \o/ ) [17:09:15] hashar: great, guess it was good after all that we are not on LTS :) [17:14:23] 10Operations, 10Discovery-Search (Current work): wdqs1009 has puppet changes on each run - https://phabricator.wikimedia.org/T260123 (10CBogen) [17:15:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:16:48] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Jan Eissfeldt (JanWMF) - https://phabricator.wikimedia.org/T260555 (10drochford) Aplogies @fgiunchedi - The correct username is jeissfeldt [17:16:59] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Jan Eissfeldt (jeissfeldt) - https://phabricator.wikimedia.org/T260555 (10drochford) [17:17:32] !log cdanis@cumin1001 conftool action : set/pooled=yes; selector: name=mw1359.* [17:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:56] 10Operations, 10Discovery-Search (Current work): Reshard commonswiki_file elasticsearch index - https://phabricator.wikimedia.org/T260083 (10CBogen) [17:34:50] 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet (Test Server - Keep Boxes) - https://phabricator.wikimedia.org/T260188 (10Papaul) [17:35:25] 10Operations, 10ops-codfw, 10DC-Ops, 10SRE-swift-storage: (Need By: ASAP) rack/setup/install ms-be2057.codfw.wmnet (Test Server - Keep Boxes) - 
https://phabricator.wikimedia.org/T260188 (10RobH) [17:38:29] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) So to summarize: The vd_client/vd_server that are on testreduce1001 should NOT be on it and instead the rt_cl... [17:42:37] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:43:29] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) >>! In T257906#6390190, @Dzahn wrote: > So to summarize: The vd_client/vd_server that are on testreduce1001... [17:44:15] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kubernetes2017.codfw.wmnet - https://phabricator.wikimedia.org/T258745 (10Papaul) [17:44:31] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:52:37] FYI: going to start a DRY RUN of the switchdc cookbooks shortly, no real changes (or SAL output) expected but I'll be keeping an eye here just in case [17:55:08] * volans shadowing [18:00:05] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200817T1800). [18:00:05] kaldari and Urbanecm: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:29] (03PS1) 10Dzahn: parsoid::testreduce: add rt_client profile to role [puppet] - 10https://gerrit.wikimedia.org/r/620757 (https://phabricator.wikimedia.org/T257906) [18:00:36] I can deploy today! [18:01:28] (03CR) 10Urbanecm: [C: 03+2] Add logo files for lldwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620743 (https://phabricator.wikimedia.org/T259432) (owner: 10Urbanecm) [18:02:11] (03Merged) 10jenkins-bot: Add logo files for lldwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620743 (https://phabricator.wikimedia.org/T259432) (owner: 10Urbanecm) [18:02:32] kaldari: you around too? [18:03:15] (03CR) 10Urbanecm: [C: 03+2] Change logo for lldwiki to match the requested one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620744 (https://phabricator.wikimedia.org/T259432) (owner: 10Urbanecm) [18:04:00] (03Merged) 10jenkins-bot: Change logo for lldwiki to match the requested one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620744 (https://phabricator.wikimedia.org/T259432) (owner: 10Urbanecm) [18:04:35] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: 67e8f886cd1a9cd2b63ed69761bec6c52889a5b6: Add logo files for lldwiki (T259432) (duration: 00m 56s) [18:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:40] T259432: Create Wikipedia Ladin - https://phabricator.wikimedia.org/T259432 [18:06:11] 10Operations, 10Traffic, 10netops: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites - https://phabricator.wikimedia.org/T260449 (10Krenair) >>! In T260449#6387326, @CDanis wrote: >>>! In T260449#6387317, @Josve05a wrote: >> For reference https://ticket.wikimedia.org/otrs/index.pl?Action=Age... 
[18:08:36] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 808c17d28c5ebf5ed75f70c224d66129eb2edcd8: Change logo for lldwiki to match the requested one (T259432) (duration: 00m 56s) [18:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:48] kaldari: pingie? :-) [18:08:51] (03CR) 10Jgreen: [C: 03+2] nsca_frack.cfg.erb - remove check_ipsec from civicrm servers, add to fran1001 [puppet] - 10https://gerrit.wikimedia.org/r/620725 (https://phabricator.wikimedia.org/T258526) (owner: 10Jgreen) [18:09:02] (03CR) 10Jgreen: [V: 03+2 C: 03+2] nsca_frack.cfg.erb - remove check_ipsec from civicrm servers, add to fran1001 [puppet] - 10https://gerrit.wikimedia.org/r/620725 (https://phabricator.wikimedia.org/T258526) (owner: 10Jgreen) [18:09:51] 10Operations, 10Traffic, 10netops: Users of Jio ISP (India, AS 55836) unable to reach Wikimedia sites - https://phabricator.wikimedia.org/T260449 (10CDanis) Ah, yes -- and replied to them, clarifying both the cause of their outage and what contact addresses they should use for us in the future (although I ha... 
[18:11:48] (03PS1) 10Andrew Bogott: wmcs-backup-instances: add a dict of regexps to exclude servers from backup [puppet] - 10https://gerrit.wikimedia.org/r/620758 (https://phabricator.wikimedia.org/T259192) [18:14:28] (03PS2) 10Andrew Bogott: wmcs-backup-instances: add a dict of regexps to exclude servers from backup [puppet] - 10https://gerrit.wikimedia.org/r/620758 (https://phabricator.wikimedia.org/T259192) [18:15:28] (03PS1) 10BryanDavis: wmcs: collect prometheus metrics from alertmanager in metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/620760 [18:15:53] (03CR) 10Dzahn: [C: 03+2] parsoid::testreduce: add rt_client profile to role [puppet] - 10https://gerrit.wikimedia.org/r/620757 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [18:26:00] (03PS1) 10Dzahn: parsoid::testreduce: add rt_server profile to role [puppet] - 10https://gerrit.wikimedia.org/r/620763 (https://phabricator.wikimedia.org/T257906) [18:30:04] (03CR) 10Clarakosi: [C: 03+2] Configure ratelimiter to support authenticated/anon limits for api [deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) (owner: 10Ppchelko) [18:31:17] (03Merged) 10jenkins-bot: Configure ratelimiter to support authenticated/anon limits for api [deployment-charts] - 10https://gerrit.wikimedia.org/r/619804 (https://phabricator.wikimedia.org/T254914) (owner: 10Ppchelko) [18:32:35] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [18:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:38] (03PS2) 10BryanDavis: wmcs: collect prometheus metrics from alertmanager in metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/620760 [18:34:12] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Jenkins, 10Release-Engineering-Team (CI & Testing services): Review process to fetch Jenkins Debian package from upstream - 
https://phabricator.wikimedia.org/T260282 (10greg) [18:36:55] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Jenkins, 10Release-Engineering-Team (CI & Testing services): Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (10Dzahn) > the Alternative upgrade method sectio... [18:37:10] 10Operations, 10ops-eqiad, 10netops: asw2-d1-eqiad:VCP failure - https://phabricator.wikimedia.org/T252797 (10Jclark-ctr) a:05Jclark-ctr→03ayounsi [18:39:24] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [18:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:01] (03PS3) 10BryanDavis: wmcs: collect prometheus metrics from alertmanager in metricsinfra [puppet] - 10https://gerrit.wikimedia.org/r/620760 [18:40:04] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO, 10Jenkins, 10Release-Engineering-Team (CI & Testing services): Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (10Dzahn) >>! In T260282#6380865, @hashar wrote:... [18:42:19] (03CR) 10BryanDavis: "This broke Puppet on prometheus01.metricsinfra.eqiad.wmflabs via profile::wmcs::prometheus::metricsinfra which was using alertmanager_url." [puppet] - 10https://gerrit.wikimedia.org/r/619739 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [18:43:41] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [18:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:04] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[18:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:21] (03CR) 10Dbarratt: "I feel like this change should depend on Iacbddf26d020a230f40ffefe2c49e2cee747ffd5 ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620092 (https://phabricator.wikimedia.org/T260175) (owner: 10Tchanders) [18:46:38] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@7f16bad]: Add thankyouwiki T259002 (duration: 131m 17s) [18:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:42] T259002: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 [18:46:43] (03CR) 10Dbarratt: [C: 03+1] Remove the 'investigate' right from testwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620091 (https://phabricator.wikimedia.org/T260175) (owner: 10Tchanders) [18:46:45] !log ppchelko@deploy1001 Started deploy [restbase/deploy@7f16bad]: Add thankyouwiki T259002, take 2 [18:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:38] (03CR) 10BryanDavis: "PCC shows this as a no-op which I don't think is correct. It does at least compi" [puppet] - 10https://gerrit.wikimedia.org/r/620760 (owner: 10BryanDavis) [18:50:59] (03CR) 10Dbarratt: [C: 03+1] "doh! nevermind I see that it's ontop of the other change. 
😊" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620092 (https://phabricator.wikimedia.org/T260175) (owner: 10Tchanders) [18:53:10] (03CR) 10Tchanders: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620092 (https://phabricator.wikimedia.org/T260175) (owner: 10Tchanders) [18:58:03] (03PS1) 10Ppchelko: Switch ratelimit service to V3 protocol [deployment-charts] - 10https://gerrit.wikimedia.org/r/620766 (https://phabricator.wikimedia.org/T254914) [18:58:03] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@7f16bad]: Add thankyouwiki T259002, take 2 (duration: 11m 19s) [18:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:08] T259002: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 [18:58:09] !log ppchelko@deploy1001 Started deploy [restbase/deploy@7f16bad]: Add thankyouwiki T259002, take 3 [18:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:44] (03CR) 10Clarakosi: [C: 03+2] Switch ratelimit service to V3 protocol [deployment-charts] - 10https://gerrit.wikimedia.org/r/620766 (https://phabricator.wikimedia.org/T254914) (owner: 10Ppchelko) [19:00:48] (03Merged) 10jenkins-bot: Switch ratelimit service to V3 protocol [deployment-charts] - 10https://gerrit.wikimedia.org/r/620766 (https://phabricator.wikimedia.org/T254914) (owner: 10Ppchelko) [19:01:06] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@7f16bad]: Add thankyouwiki T259002, take 3 (duration: 02m 57s) [19:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:13] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/24520/testreduce1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/620763 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [19:03:19] (03PS2) 10Dzahn: parsoid::testreduce: add rt_server profile to role [puppet] - 
10https://gerrit.wikimedia.org/r/620763 (https://phabricator.wikimedia.org/T257906) [19:10:53] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup-instances: add a dict of regexps to exclude servers from backup [puppet] - 10https://gerrit.wikimedia.org/r/620758 (https://phabricator.wikimedia.org/T259192) (owner: 10Andrew Bogott) [19:22:47] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [19:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:51] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [19:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:00] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [19:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:20] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [19:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:23] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:43:38] (03PS12) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [19:44:12] (03PS1) 10MarcoAurelio: [trwiki] Enable mapframe tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620673 [19:44:20] Pchelolo: did the deploy to api-gateway caused the error increase? [19:44:40] effie: hmmm... 
[19:44:54] I have not looked at the errors yet [19:44:58] let me check [19:45:00] unlikely, cause the gateway is not used and not doing anything [19:46:07] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.191e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:46:31] ok that is not good [19:47:43] no it is not you for sure [19:48:04] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 836.7 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:48:52] yeah effie looked at the envoy logs more, and it's not doing anything at all [19:48:53] (03Abandoned) 10MarcoAurelio: [trwiki] Enable mapframe tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620673 (owner: 10MarcoAurelio) [19:49:12] Pchelolo: doubt it is you, there was a surge in GETs [19:49:27] and memcached complains so it is something else [19:50:38] RoanKattouw: hi - you there? [19:51:50] hauskatze: Yeah what's up [19:51:55] I'm on mobile though [19:51:55] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.201e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:53:15] RoanKattouw: Hi, hope you're well. Trwiki, which is a flaggedrevs wiki, asks to enable the mapframe tag. Is that okay? [19:53:30] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [19:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:34] There are a couple of tasks mentioning that such tag had to be disabled, etc.
[19:53:53] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 61 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:53:54] I subscribed you to T260594 [19:53:55] T260594: Enable mapframe tag for trwiki - https://phabricator.wikimedia.org/T260594 [19:54:24] Yeah there's a bug that causes maps to become blank on FlaggedRevs wikis in certain cases [19:54:39] It's OK for them to enable it if they understand the risks [19:55:08] I'll write a comment explaining the issue, and if they say "yes we understand this and we still want it" then they can have it [19:55:18] Sounds good to me [19:55:20] Thanks! [19:56:24] (03CR) 10Jeena Huneidi: "> Patch Set 5:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [19:57:05] (03PS6) 10Jeena Huneidi: [WIP] Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) [19:58:57] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:00:05] halfak and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200817T2000). 
[20:02:45] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [20:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:59] kormat: yep, give me a few minutes [20:13:28] jouncebot: now [20:13:28] For the next 0 hour(s) and 46 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200817T2000) [20:13:40] jouncebot: next [20:13:40] In 0 hour(s) and 46 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200817T2100) [20:19:14] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [20:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [20:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [20:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:45] (03PS3) 10Ottomata: Revert to anaconda 2020.02, also some activation improvements [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/620144 [20:23:01] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5030 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:23:10] (03PS1) 10Ppchelko: Add api-gateway.request stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620777 (https://phabricator.wikimedia.org/T259736) [20:23:58] (03CR) 10jerkins-bot: [V: 04-1] Add api-gateway.request stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620777 (https://phabricator.wikimedia.org/T259736) (owner: 10Ppchelko) [20:24:29] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster [20:24:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:01] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 175.3 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:25:32] (03PS2) 10Ppchelko: Add api-gateway.request stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620777 (https://phabricator.wikimedia.org/T259736) [20:27:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:15] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) @ssastry @cscott rt_client and rt_server have been added to `testreduce1001.eqiad.wmnet`'. ` [testreduce100... [20:28:18] (03PS1) 10Ppchelko: Precache /api-gateway/request/1.0.0 schema in eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/620779 (https://phabricator.wikimedia.org/T259736) [20:28:51] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5073 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:28:53] _joe_: which cluster did you pick? 
i have 2 jobrunners in reboot because they still had high uptime even though it already ran on jobrunner earlier, afaict [20:30:25] (03CR) 10Ottomata: [C: 03+1] Precache /api-gateway/request/1.0.0 schema in eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/620779 (https://phabricator.wikimedia.org/T259736) (owner: 10Ppchelko) [20:30:49] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 486.7 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:31:11] (03Abandoned) 10Rxy: Enable CheckUser at Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/486843 (https://phabricator.wikimedia.org/T214820) (owner: 10Rxy) [20:32:04] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) Thanks! > It fails after a little while though because it does not have access to the database yet. I sup... [20:32:34] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) I was about to create a subtask for that. I got it. 
[20:33:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [20:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:16] (03PS4) 10Ottomata: Revert to anaconda 2020.02, also some activation improvements [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/620144 [20:36:32] (03CR) 10Ottomata: [C: 03+1] Add api-gateway.request stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620777 (https://phabricator.wikimedia.org/T259736) (owner: 10Ppchelko) [20:37:27] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) I just saw the DB appears to be running on localhost, not on a cluster, fwiw. [20:38:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [20:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:17] <_joe_> mutante: it interrupted, it's still running on the last two jobrunners in codfw [20:40:30] _joe_: ok. i was going by uptime and doing a few single ones (but not the ones listed to be --excluded in your example [20:40:57] <_joe_> but I'd suggest for now you restart the remaining apis [20:41:11] <_joe_> in eqiad [20:41:16] <_joe_> with the old method [20:41:22] <_joe_> I still have a bug to fix [20:41:24] ok, alright. will do [20:41:53] mw2240 is already running. but that's it [20:42:09] <_joe_> 2240 is not a jobrunner? [20:43:02] eh, yea. that's right. this was just from anything in codfw that had long uptime and a bunch of other mw24* are jobrunners [20:43:45] <_joe_> have you seen the etherpad? 
just reboot some apis in eqiad, up to 2 at a time [20:43:53] <_joe_> of the ones that weren't already done [20:44:48] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10ssastry) >>! In T257906#6390874, @Dzahn wrote: > I just saw the DB appears to be running on localhost, not on a clus... [20:44:50] yea, i have. i started adding codfw servers to it. and ok [20:45:54] <_joe_> codfw will be done with the new procedure for sure [20:46:04] <_joe_> and hopefully the other clusters in eqiad too [20:46:23] i guess i should also check for each one if they are memcache or scap proxies [20:46:27] legoktm: OnionsPorFavor? xD [20:46:43] hauskatze: shh, still a work in progress :) [20:46:57] legoktm: Gardening with MediaWiki? lol [20:47:17] qchris was right, best name ever [20:47:53] :-) [20:47:57] <_joe_> legoktm: oh nice :) [20:47:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [20:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:17] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 8961 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:52:13] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 48 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:53:34] (03PS1) 10Ottomata: Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) [20:54:18] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [20:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:54:43] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [20:56:07] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6548 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:57:22] (03PS1) 10Giuseppe Lavagetto: sre.hosts.reboot-cluster: remove redundant fail() [cookbooks] - 10https://gerrit.wikimedia.org/r/620785 [20:58:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre.hosts.reboot-cluster: remove redundant fail() [cookbooks] - 10https://gerrit.wikimedia.org/r/620785 (owner: 10Giuseppe Lavagetto) [20:58:04] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 29 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:58:59] (03Merged) 10jenkins-bot: sre.hosts.reboot-cluster: remove redundant fail() [cookbooks] - 10https://gerrit.wikimedia.org/r/620785 (owner: 10Giuseppe Lavagetto) [21:00:05] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200817T2100). 
[21:01:57] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 8316 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:03:55] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 26 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:04:46] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [21:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:49] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5464 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:08:07] (03PS2) 10Ottomata: Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) [21:08:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [21:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:11] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [21:09:46] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 25.33 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:12:08] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2240.codfw.wmnet [21:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:27] (03PS3) 10Ottomata: Add 
profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) [21:13:38] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [21:14:30] (03PS4) 10Ottomata: Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) [21:15:46] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [21:16:23] (03PS5) 10Ottomata: Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) [21:17:27] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [21:18:23] (03PS6) 10Ottomata: Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) [21:19:28] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [21:19:36] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1276.eqiad.wmnet [21:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:03] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [21:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:21] (03PS7) 10Ottomata: Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) [21:21:22] (03CR) 10jerkins-bot: [V: 04-1] Add profile::analytics::jupyterhub [puppet] - 
10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) (owner: 10Ottomata) [21:21:35] (03PS13) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [21:21:56] 10Operations, 10Advanced Mobile Contributions, 10Traffic, 10User-Joe: AMC – Opt-in for logged out users - https://phabricator.wikimedia.org/T215624 (10Jdlrobson) [21:22:17] (03PS8) 10Ottomata: Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) [21:22:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:15] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [21:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:40] (03PS1) 10Dzahn: testreduce: add profile parsoid::testing to role [puppet] - 10https://gerrit.wikimedia.org/r/620787 (https://phabricator.wikimedia.org/T257906) [21:26:04] (03CR) 10jerkins-bot: [V: 04-1] testreduce: add profile parsoid::testing to role [puppet] - 10https://gerrit.wikimedia.org/r/620787 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [21:27:19] (03PS2) 10Dzahn: testreduce: add profile parsoid::testing to role [puppet] - 10https://gerrit.wikimedia.org/r/620787 (https://phabricator.wikimedia.org/T257906) [21:27:20] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6196 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:28:10] (03PS9) 10Ottomata: Add profile::analytics::jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/620784 (https://phabricator.wikimedia.org/T224658) [21:29:13] RECOVERY - MediaWiki memcached error rate on 
icinga1001 is OK: (C)5000 gt (W)1000 gt 257 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:29:50] (03CR) 10Dzahn: [C: 03+2] testreduce: add profile parsoid::testing to role [puppet] - 10https://gerrit.wikimedia.org/r/620787 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [21:34:37] !log blocking temporarily traffic to mc1020 [21:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:23] Hey all - going to sync PS.php now [21:36:51] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [21:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:59] sbassett: can you please give it 2-3' ? [21:38:06] or am I too late ? [21:38:11] effie: yeah, it's going [21:38:16] oh well [21:38:28] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [21:38:28] !log sbassett@deploy1001 Synchronized private/PrivateSettings.php: Further mitigations for T257687 (duration: 00m 57s) [21:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:30] almost done. very targeted sec mitigation enhancement. [21:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:09] effie: If you want me to roll it back, I can. But it was a slight tweak to an existing sec mitigation for one project. 
[21:39:17] no no not at all [21:39:22] ok [21:39:34] I am debugging something that is all [21:41:14] PROBLEM - Memcached on mc1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Memcached [21:41:38] (03CR) 10Ppchelko: [C: 03+2] Add api-gateway.request stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620777 (https://phabricator.wikimedia.org/T259736) (owner: 10Ppchelko) [21:42:16] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99) [21:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:28] (03Merged) 10jenkins-bot: Add api-gateway.request stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620777 (https://phabricator.wikimedia.org/T259736) (owner: 10Ppchelko) [21:42:58] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1279.eqiad.wmnet [21:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [21:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:01] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-cluster [21:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:41] oh damn [21:46:44] ^ me [21:47:08] <_joe_> ? 
[21:47:23] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1278.eqiad.wmnet [21:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:57] !log ppchelko@deploy1001 sync-file aborted: Add api-gateway.request stream config T259736 (duration: 05m 01s) [21:48:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [21:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:00] T259736: Unify access log schema for Action API and API Gateway/REST API - https://phabricator.wikimedia.org/T259736 [21:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:59] _joe_: the memcached alert [21:51:26] <_joe_> oh yes that was clear, you did firewall out mc1020, right? [21:51:38] yes, I added a comment [21:51:53] I am creating a task, I am not completely sure this fixed things though [21:53:26] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add api-gateway.request stream config T259736, one host timed out (duration: 00m 55s) [21:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:54] (03CR) 10Ppchelko: [C: 03+2] Precache /api-gateway/request/1.0.0 schema in eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/620779 (https://phabricator.wikimedia.org/T259736) (owner: 10Ppchelko) [21:55:10] (03Merged) 10jenkins-bot: Precache /api-gateway/request/1.0.0 schema in eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/620779 (https://phabricator.wikimedia.org/T259736) (owner: 10Ppchelko) [21:55:28] <_joe_> well it surely didn't fix the root issue [21:56:25] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [21:56:26] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[21:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:20] (03PS1) 10Dzahn: testreduce: add Hiera keys for API URL and turn off monitoring [puppet] - 10https://gerrit.wikimedia.org/r/620792 (https://phabricator.wikimedia.org/T257906) [21:57:31] !log ladsgroup@mwmaint1002:~$ mwscript extensions/Cognate/maintenance/populateCognateSites.php --wiki=aawiktionary --site-group wiktionary (T259360) [21:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:34] T259360: Cognate doesn't properly create interwiki links for Shawiya Wiktionary (shy.wiktionary.org) - https://phabricator.wikimedia.org/T259360 [22:00:03] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [22:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:41] (03CR) 10Dzahn: [C: 03+2] testreduce: add Hiera keys for API URL and turn off monitoring [puppet] - 10https://gerrit.wikimedia.org/r/620792 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [22:02:29] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1281.eqiad.wmnet [22:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [22:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:11] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [22:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:11] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1282.eqiad.wmnet [22:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [22:09:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:24] (03PS1) 10Ppchelko: Add primary schemas to eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/620795 (https://phabricator.wikimedia.org/T259736) [22:14:03] (03CR) 10Ppchelko: [C: 03+2] "self-merging since I already broke staging.." [deployment-charts] - 10https://gerrit.wikimedia.org/r/620795 (https://phabricator.wikimedia.org/T259736) (owner: 10Ppchelko) [22:15:04] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:37] (03Merged) 10jenkins-bot: Add primary schemas to eventgate-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/620795 (https://phabricator.wikimedia.org/T259736) (owner: 10Ppchelko) [22:17:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [22:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:04] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [22:21:05] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [22:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:22] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[22:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:51] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [22:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:11] (03PS1) 10Dzahn: parsoid::testing: add support for buster by adding mariadb-client [puppet] - 10https://gerrit.wikimedia.org/r/620796 (https://phabricator.wikimedia.org/T257906) [22:26:20] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1284.eqiad.wmnet [22:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:28] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [22:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [22:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:57] (03PS14) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [22:31:08] (03CR) 10Dzahn: [C: 03+2] parsoid::testing: add support for buster by adding mariadb-client [puppet] - 10https://gerrit.wikimedia.org/r/620796 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [22:33:50] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [22:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:01] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1285.eqiad.wmnet [22:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [22:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:50] !log oblivian@cumin1001 END (PASS) - 
Cookbook sre.hosts.reboot-cluster (exit_code=0) [22:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:32] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [22:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:48] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1286.eqiad.wmnet [22:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:56] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [22:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:07] 10Operations, 10serviceops: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki) [22:45:45] 10Operations, 10serviceops: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki) [22:45:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [22:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:30] 10Operations, 10serviceops: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki) [22:55:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [22:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:52] 10Operations, 10DBA, 10Parsoid, 10serviceops, 10Parsoid-Tests: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (10Dzahn) [22:58:19] 10Operations, 10DBA, 10Parsoid, 10serviceops, 10Parsoid-Tests: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (10Dzahn) ` [scandium:~] $ mysql -h m5-master.eqiad.wmnet -u testreduce -p testreduce Enter password: Reading table information for 
completion of table and column... [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200817T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:00:17] 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) 05Open→03Stalled mariadb-client has been installed (added buster support by using that instead of outdate... [23:00:40] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [23:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:29] 10Operations, 10serviceops: High traffic on mc1020 (18 Aug) - https://phabricator.wikimedia.org/T260622 (10jijiki) [23:02:38] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [23:08:48] 10Operations, 10DBA, 10Parsoid, 10serviceops, 10Parsoid-Tests: update mysql GRANTs for testreduce - https://phabricator.wikimedia.org/T260627 (10Dzahn) So this is everything in `modules/role/templates/mariadb/grants/production-m5.sql.erb` that refers to testreduce (line 5 to 48). Please make that work t... 
[23:10:41] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1288.eqiad.wmnet [23:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [23:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [23:11:07] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [23:11:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:24] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [23:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [23:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:23] (03PS15) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [23:26:45] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [23:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:15] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1297.eqiad.wmnet [23:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:23] (03PS16) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [23:30:22] (03CR) 10jerkins-bot: [V: 04-1] Modify api-gateway access 
logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [23:30:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [23:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:17] (03PS1) 10Ppchelko: api-gateway: remove unnesessary .yaml from jwks file [deployment-charts] - 10https://gerrit.wikimedia.org/r/620803 [23:33:36] (03PS17) 10Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - 10https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [23:40:27] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [23:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1312.eqiad.wmnet [23:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:11] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [23:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:21] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [23:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:05] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [23:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:38] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1313.eqiad.wmnet [23:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [23:49:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:58:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [23:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.reboot-single [23:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log