[00:00:04] twentyafterfour: Dear deployers, time to do the Phabricator update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200813T0000).
[00:00:16] (CR) BryanDavis: [C: +2] Remove "Max X patches" from window's name [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870 (owner: Urbanecm)
[00:00:44] (Merged) jenkins-bot: Remove "Max X patches" from window's name [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/619870 (owner: Urbanecm)
[00:03:15] (PS1) Awight: admin: add ssh key for awight [puppet] - https://gerrit.wikimedia.org/r/619874
[00:06:03] (CR) Dzahn: [C: +2] "finally does what it should, jenkins enabled on contint2001 and releases1001, not on any other" [puppet] - https://gerrit.wikimedia.org/r/619855 (owner: Dzahn)
[00:06:49] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (Edtadros) @akosiaris thanks!
[00:07:26] Operations: edtadros is in nda group, but registered with a WMF account - https://phabricator.wikimedia.org/T260070 (Edtadros) @Aklapper I am a contractor indeed. Please feel free to reach out to me if you have any questions.
[00:07:45] jouncebot: now
[00:07:45] For the next 0 hour(s) and 52 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200813T0000)
[00:07:49] jouncebot: next
[00:07:49] In 6 hour(s) and 52 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200813T0700)
[00:08:33] Urbanecm: we will see how this works on Monday I guess :)
[00:14:26] !log re-enabling puppet on releases* servers
[00:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:25] (CR) Dzahn: "contint1001: noop, jenkins masked and stopped contint2001: noop, jenkins running releases1001: noop, jenkins running releases1002" [puppet] - https://gerrit.wikimedia.org/r/619855 (owner: Dzahn)
[00:32:21] (PS1) Dzahn: releases: do not monitor releases-jenkins on multiple hosts [puppet] - https://gerrit.wikimedia.org/r/619878
[00:37:53] !log removing jenkins_service_running checks from secondary servers where it's stopped, manually from icinga config, running puppet on icinga
[00:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:17] (PS2) Dzahn: releases: do not monitor releases-jenkins on multiple hosts [puppet] - https://gerrit.wikimedia.org/r/619878
[00:40:17] (CR) Dzahn: [C: +2] releases: do not monitor releases-jenkins on multiple hosts [puppet] - https://gerrit.wikimedia.org/r/619878 (owner: Dzahn)
[00:47:07] (PS1) Kaldari: Removing obsolete license definition [mediawiki-config] - https://gerrit.wikimedia.org/r/619880
[00:49:53] (CR) Dzahn: "https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=jenkins is now as it should be. monitors jenkins exactly where it's r" [puppet] - https://gerrit.wikimedia.org/r/619878 (owner: Dzahn)
[00:50:20] (PS2) Dzahn: ATS: temp. set backend for releases-jenkins to releases1001 [puppet] - https://gerrit.wikimedia.org/r/619826 (https://phabricator.wikimedia.org/T247652)
[00:54:05] (PS1) Reedy: Set wgLanguageCode for apiportalwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/619882
[00:54:28] (CR) Reedy: [C: +2] Set wgLanguageCode for apiportalwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/619882 (owner: Reedy)
[00:55:13] (Merged) jenkins-bot: Set wgLanguageCode for apiportalwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/619882 (owner: Reedy)
[01:18:41] PROBLEM - SSH on webperf2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:20:32] RECOVERY - SSH on webperf2002 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:29:23] Operations, serviceops, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (CDanis) A thing that someone daring in EUTZ might want to try: Using `perf probe`, or by modifying the `bpfcc-memleak` script, or by writing a trivial [[ https://...
[01:35:15] (PS7) Dzahn: webperf: switch xhgui_host from tungsten to xhgui1001 [puppet] - https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837)
[02:08:06] (PS1) Dave Pifke: xhgui: enable prod MariaDB, disable labs MongoDB [mediawiki-config] - https://gerrit.wikimedia.org/r/619886 (https://phabricator.wikimedia.org/T180761)
[02:21:06] (PS1) Dzahn: parsoid/testreduce: add a service_ensure parameter [puppet] - https://gerrit.wikimedia.org/r/619888
[02:22:25] (CR) jerkins-bot: [V: -1] parsoid/testreduce: add a service_ensure parameter [puppet] - https://gerrit.wikimedia.org/r/619888 (owner: Dzahn)
[02:27:42] (PS2) Dzahn: parsoid/testreduce: add a service_ensure parameter, stop on new server [puppet] - https://gerrit.wikimedia.org/r/619888 (https://phabricator.wikimedia.org/T257906)
[02:29:13] (CR) jerkins-bot: [V: -1] parsoid/testreduce: add a service_ensure parameter, stop on new server [puppet] - https://gerrit.wikimedia.org/r/619888 (https://phabricator.wikimedia.org/T257906) (owner: Dzahn)
[02:40:14] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/24472/" [puppet] - https://gerrit.wikimedia.org/r/619888 (https://phabricator.wikimedia.org/T257906) (owner: Dzahn)
[02:40:34] (CR) Dzahn: "the point here is to fix "puppet change on every run" on testreduce1001" [puppet] - https://gerrit.wikimedia.org/r/619888 (https://phabricator.wikimedia.org/T257906) (owner: Dzahn)
[02:43:22] (PS3) Dzahn: parsoid/testreduce: add a service_ensure parameter, stop on new server [puppet] - https://gerrit.wikimedia.org/r/619888 (https://phabricator.wikimedia.org/T257906)
[02:45:48] (PS4) Dzahn: parsoid/testreduce: add a service_ensure parameter, stop on new server [puppet] - https://gerrit.wikimedia.org/r/619888 (https://phabricator.wikimedia.org/T257906)
[02:47:27] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/24473/" [puppet] - https://gerrit.wikimedia.org/r/619888 (https://phabricator.wikimedia.org/T257906) (owner: Dzahn)
[02:56:44] !log testreduce1001 - systemctl reset-failed ; fix parsoid-vd systemd state and icinga alert
[02:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:02:38] (PS1) Dzahn: ci::jenkins: add data types (WIP) [puppet] - https://gerrit.wikimedia.org/r/619889
[04:35:38] (PS2) Krinkle: xhgui: enable prod MariaDB, disable labs MongoDB [mediawiki-config] - https://gerrit.wikimedia.org/r/619886 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[04:37:12] (CR) Krinkle: [C: +1] xhgui: enable prod MariaDB, disable labs MongoDB [mediawiki-config] - https://gerrit.wikimedia.org/r/619886 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke)
[05:01:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P12239 and previous config saved to /var/cache/conftool/dbconfig/20200813-050107-marostegui.json
[05:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:05:32] (CR) Marostegui: [C: +2] "> Patch Set 1: Code-Review+1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/619627 (https://phabricator.wikimedia.org/T259438) (owner: Marostegui)
[05:12:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P12240 and previous config saved to /var/cache/conftool/dbconfig/20200813-051222-marostegui.json
[05:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P12241 and previous config saved to /var/cache/conftool/dbconfig/20200813-052859-marostegui.json
[05:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:39] (PS1) Marostegui: db2135: Upgrade m5 codfw master to Buster [puppet] - https://gerrit.wikimedia.org/r/619902 (https://phabricator.wikimedia.org/T260324)
[05:38:53] (PS2) Marostegui: db2135: Upgrade m5 codfw master to Buster [puppet] - https://gerrit.wikimedia.org/r/619902 (https://phabricator.wikimedia.org/T260324)
[05:40:34] (CR) Marostegui: [C: +2] db2135: Upgrade m5 codfw master to Buster [puppet] - https://gerrit.wikimedia.org/r/619902 (https://phabricator.wikimedia.org/T260324) (owner: Marostegui)
[05:43:06] !log Stop MySQL on db2135 (codfw master), haproxy irc alert will fire T260324
[05:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:09] T260324: Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324
[05:51:34] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (Joe) p:Triage→Unbreak! I'm not 100% sure that slabs are the problem here, but I'll try to followup later. In the...
[05:52:50] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[06:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1126', diff saved to https://phabricator.wikimedia.org/P12242 and previous config saved to /var/cache/conftool/dbconfig/20200813-060135-marostegui.json
[06:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:26] * volans looked at netbox1001
[06:20:38] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:40:28] (PS2) Ema: Exclude thankyou.wikipedia.org for mobile redirect [puppet] - https://gerrit.wikimedia.org/r/619446 (https://phabricator.wikimedia.org/T259002) (owner: Ladsgroup)
[06:48:17] !log Deploy MCR change on dbstore1003:3311
[06:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200813T0700)
[07:00:09] (CR) Volans: [C: -1] "Some comments inline, feel free to ping me offline if you have any question or the comments are not clear." (5 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/619781 (owner: Ryan Kemper)
[07:14:52] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Joe)
[07:15:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1082', diff saved to https://phabricator.wikimedia.org/P12243 and previous config saved to /var/cache/conftool/dbconfig/20200813-071545-marostegui.json
[07:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:51] (PS3) Ema: Exclude thankyou.wikipedia.org for mobile redirect [puppet] - https://gerrit.wikimedia.org/r/619446 (https://phabricator.wikimedia.org/T259002) (owner: Ladsgroup)
[07:16:36] !log Stop replication on db1082 to remove triggers on sanitarium for MCR changes
[07:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P12244 and previous config saved to /var/cache/conftool/dbconfig/20200813-071943-marostegui.json
[07:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:57] Operations, Platform Engineering, serviceops: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling)
[07:45:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1082', diff saved to https://phabricator.wikimedia.org/P12246 and previous config saved to /var/cache/conftool/dbconfig/20200813-074528-marostegui.json
[07:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:35] Operations, ops-codfw, netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (ayounsi) Please make sure to update the cables in Netbox: https://netbox.wikimedia.org/dcim/devices/2133/ Now that the switches are managed by Homer/Netbox there are some...
[07:48:10] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Joe) The list of software updated that day on the appservers is at P12221
[07:50:22] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Joe)
[07:52:55] (PS1) Jcrespo: mariadb-backups: Move backup scripts to its own directory [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619953 (https://phabricator.wikimedia.org/T165358)
[07:53:42] (PS2) Jcrespo: mariadb-backups: Move backup scripts to its own directory [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619953 (https://phabricator.wikimedia.org/T165358)
[07:54:45] (PS3) Jcrespo: mariadb-backups: Move backup scripts to its own directory [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619953 (https://phabricator.wikimedia.org/T165358)
[07:55:24] <_joe_> !log downgrading curl/libcurl3/libcurl3-gnutls on mw1377 T260329
[07:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:27] T260329: Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329
[08:05:36] (PS1) Jcrespo: wmfmariadbpy: Reorganize backup scripts into its own directory [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358)
[08:06:07] (Abandoned) Jcrespo: mariadb-backups: Move backup scripts to its own directory [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619953 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo)
[08:13:27] (PS1) Kormat: Move RemoteExecution library to wmfmariadbpy [software/transferpy] - https://gerrit.wikimedia.org/r/619959 (https://phabricator.wikimedia.org/T259516)
[08:15:00] (CR) jerkins-bot: [V: -1] Move RemoteExecution library to wmfmariadbpy [software/transferpy] - https://gerrit.wikimedia.org/r/619959 (https://phabricator.wikimedia.org/T259516) (owner: Kormat)
[08:21:19] (PS1) Jcrespo: mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358)
[08:21:44] (CR) jerkins-bot: [V: -1] mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo)
[08:23:15] Operations, DBA, Patch-For-Review, User-Kormat: DBA python layout - https://phabricator.wikimedia.org/T259516 (Kormat) a:Kormat
[08:24:23] Operations, DBA, Blocked-on-schema-change, User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (Kormat)
[08:24:25] (PS2) Jcrespo: mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358)
[08:25:27] (PS1) Ayounsi: Add workaround for upstream limitations [software/homer/deploy] - https://gerrit.wikimedia.org/r/619966
[08:26:01] (CR) Ayounsi: "Tested and works as expected." [software/homer/deploy] - https://gerrit.wikimedia.org/r/619966 (owner: Ayounsi)
[08:26:58] (PS3) Jcrespo: mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358)
[08:28:13] (PS1) Ayounsi: Remove netbox_driven_interfaces feature flag for cr devices [homer/public] - https://gerrit.wikimedia.org/r/619967
[08:38:07] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[08:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:18] (PS4) Jcrespo: mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358)
[08:38:35] !log ayounsi@cumin1001 START - Cookbook sre.network.prepare-upgrade
[08:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:43] (CR) jerkins-bot: [V: -1] mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo)
[08:40:56] (PS5) Jcrespo: mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358)
[08:43:17] (PS1) Ayounsi: Depool codfw for routers upgrade [dns] - https://gerrit.wikimedia.org/r/619969 (https://phabricator.wikimedia.org/T259621)
[08:43:31] <_joe_> !log downgrading imagemagick on mw1378 T260281
[08:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:34] T260281: mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281
[08:45:23] (CR) Kormat: [C: -2] "Tests won't pass until wmfmariadbpy is published on pypi." [software/transferpy] - https://gerrit.wikimedia.org/r/619959 (https://phabricator.wikimedia.org/T259516) (owner: Kormat)
[08:45:28] <_joe_> !log downgrading imagemagick on mw1378 T260329
[08:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:31] T260329: Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329
[08:47:05] (CR) Volans: [C: +1] "Doc nits inline, seems sane to me. The get_circuits should be at some point either here or in homer and not in both (I know they are sligh" (2 comments) [software/homer/deploy] - https://gerrit.wikimedia.org/r/619966 (owner: Ayounsi)
[08:47:51] (CR) Volans: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/619967 (owner: Ayounsi)
[08:48:33] (CR) Volans: [C: +1] "LGTM, if you mention a regression would be nice to add the version(s) affected and/or link the upstream changelog." [homer/public] - https://gerrit.wikimedia.org/r/619710 (owner: Ayounsi)
[08:49:25] (CR) Volans: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/617603 (https://phabricator.wikimedia.org/T200277) (owner: Ayounsi)
[08:49:38] (PS6) Jcrespo: mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358)
[08:50:54] (PS2) Ayounsi: Add workaround for upstream limitations [software/homer/deploy] - https://gerrit.wikimedia.org/r/619966
[08:52:47] (CR) Ayounsi: Add workaround for upstream limitations (2 comments) [software/homer/deploy] - https://gerrit.wikimedia.org/r/619966 (owner: Ayounsi)
[08:53:16] (CR) Ayounsi: [V: +2 C: +2] Add workaround for upstream limitations [software/homer/deploy] - https://gerrit.wikimedia.org/r/619966 (owner: Ayounsi)
[08:55:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1082', diff saved to https://phabricator.wikimedia.org/P12247 and previous config saved to /var/cache/conftool/dbconfig/20200813-085547-marostegui.json
[08:55:48] (CR) Elukey: "Adding Valentin as he is the SRE Clinic duty of the week :)" [puppet] - https://gerrit.wikimedia.org/r/619874 (owner: Awight)
[08:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
[08:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.prepare-upgrade (exit_code=0)
[08:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:18] (CR) Vgutierrez: [C: +2] "yubikey FTW, LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/619874 (owner: Awight)
[09:01:07] (CR) Ema: [C: +2] Exclude thankyou.wikipedia.org for mobile redirect [puppet] - https://gerrit.wikimedia.org/r/619446 (https://phabricator.wikimedia.org/T259002) (owner: Ladsgroup)
[09:03:08] (CR) Awight: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/619874 (owner: Awight)
[09:12:24] (PS1) Volans: CHANGELOG: add changelogs for release v0.2.5 [software/homer] - https://gerrit.wikimedia.org/r/619973
[09:15:15] (CR) Ayounsi: [C: +1] CHANGELOG: add changelogs for release v0.2.5 [software/homer] - https://gerrit.wikimedia.org/r/619973 (owner: Volans)
[09:15:36] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v0.2.5 [software/homer] - https://gerrit.wikimedia.org/r/619973 (owner: Volans)
[09:16:52] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v0.2.5 [software/homer] - https://gerrit.wikimedia.org/r/619973 (owner: Volans)
[09:25:18] (PS1) Volans: Upstream release v0.2.5 [software/homer/deploy] - https://gerrit.wikimedia.org/r/619976
[09:26:54] (CR) Ayounsi: [V: +2 C: +2] Upstream release v0.2.5 [software/homer/deploy] - https://gerrit.wikimedia.org/r/619976 (owner: Volans)
[09:34:36] !log ayounsi@deploy1001 Started deploy [homer/deploy@89636df]: Homer release v0.2.5
[09:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:34] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Ladsgroup) For the wikibase part, I highly doubt it, the php entry point calls `wfLoadExtensi...
[09:37:39] !log ayounsi@deploy1001 Finished deploy [homer/deploy@89636df]: Homer release v0.2.5 (duration: 03m 03s)
[09:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:56] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (Joe) >>! In T260329#6382296, @Ladsgroup wrote: > For the wikibase part, I highly doubt it, th...
[09:43:48] (PS2) ArielGlenn: cleanup misc dumps that aren't stored in per-date urls [puppet] - https://gerrit.wikimedia.org/r/619571 (https://phabricator.wikimedia.org/T257782)
[09:45:03] (CR) ArielGlenn: [C: +2] cleanup misc dumps that aren't stored in per-date urls [puppet] - https://gerrit.wikimedia.org/r/619571 (https://phabricator.wikimedia.org/T257782) (owner: ArielGlenn)
[09:48:13] (PS1) Awight: admin: remove old SSH key for awight [puppet] - https://gerrit.wikimedia.org/r/619981
[09:57:23] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:58:50] (CR) Ayounsi: [C: +2] Configure transport links OSPF based on Netbox data [homer/public] - https://gerrit.wikimedia.org/r/617603 (https://phabricator.wikimedia.org/T200277) (owner: Ayounsi)
[09:59:19] (Merged) jenkins-bot: Configure transport links OSPF based on Netbox data [homer/public] - https://gerrit.wikimedia.org/r/617603 (https://phabricator.wikimedia.org/T200277) (owner: Ayounsi)
[09:59:21] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:04:05] !log re-order OSPF interfaces on all routers (now partially netbox driven)
[10:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:14] (PS1) Volans: dns: fix corner case that should not happen [software/netbox-extras] - https://gerrit.wikimedia.org/r/619982
[10:12:26] (CR) Ayounsi: [C: +2] "> Patch Set 1: Code-Review+1" [homer/public] - https://gerrit.wikimedia.org/r/619710 (owner: Ayounsi)
[10:12:50] (Merged) jenkins-bot: Workaround a Jinja regression [homer/public] - https://gerrit.wikimedia.org/r/619710 (owner: Ayounsi)
[10:14:13] (CR) Ayounsi: [C: +2] Remove netbox_driven_interfaces feature flag for cr devices [homer/public] - https://gerrit.wikimedia.org/r/619967 (owner: Ayounsi)
[10:14:37] (Merged) jenkins-bot: Remove netbox_driven_interfaces feature flag for cr devices [homer/public] - https://gerrit.wikimedia.org/r/619967 (owner: Ayounsi)
[10:17:13] !log depool mw1379 for downgrade of poppler-utils,libpoppler-glib8,libpoppler64,curl,libcurl3,libcurl3-gnutls,libpython3.5,python3.5,libpython3.5-stdlib,python3.5-minimal,libpython3.5-minimal,imagemagick-6-common,libmagickcore-6.q16-3,libmagickwand-6.q16-3,imagemagick-6.q16,imagemagick,e2fslibs,e2fsprogs,libcomerr2,libss2 and reboot - T260329
[10:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:16] T260329: Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329
[10:17:34] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm)
[10:20:34] (CR) Vgutierrez: [C: +2] admin: remove old SSH key for awight [puppet] - https://gerrit.wikimedia.org/r/619981 (owner: Awight)
[10:23:34] Operations, Gerrit-Privilege-Requests, User-Kormat: Request for Gerrit Managers permissions - https://phabricator.wikimedia.org/T260342 (Kormat)
[10:23:42] Operations, Gerrit-Privilege-Requests, User-Kormat: Request for Gerrit Managers permissions - https://phabricator.wikimedia.org/T260342 (Kormat) p:Triage→Medium
[10:27:04] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Add DVrandecic to group nda - https://phabricator.wikimedia.org/T260279 (Jdforrester-WMF) You shouldn't need `nda`; anything you want to access should be there with `wmf`.
[10:42:14] !log Moving api-gateway service from service_setup to lvs_setup and running puppet on LVS servers
[10:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:43] (CR) Vgutierrez: [C: +1] api-gateway: change service state to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/619800 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan)
[10:44:46] kormat: was it you I remember talking about how to transfer files between servers without forwarding my ssh key?
[10:45:19] PROBLEM - Host kafka-jumbo1009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[10:45:27] jayme: if you can tell me more, i can decide whether i need to deny this or not. ;)
[10:46:22] kormat: baseline is that I just need to copy some files around and don't want to pass that through my internet connection (plus I don't want to forward my SSH key ofc)
[10:46:37] (CR) Hnowlan: [C: +2] api-gateway: change service state to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/619800 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan)
[10:46:40] jayme: `transfer.py` can do this
[10:47:14] https://doc.wikimedia.org/transferpy/master/usage.html
[10:47:25] you'll want --type=file
[10:47:52] it's installed on the cumin hosts
[10:48:46] kormat: Nice. That was it I think. Will take a look, thanks!
[10:49:07] if you have any issues, please feel free to ping me^Wjynus :)
[10:51:17] RECOVERY - Host kafka-jumbo1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms
[10:53:33] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 59 connections established with conf2001.codfw.wmnet:2379 (min=60) https://wikitech.wikimedia.org/wiki/PyBal
[10:53:57] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.55:8087]) Hnowlan new service https://wikitech.wikimedia.org/wiki/PyBal
[10:53:57] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 69 connections established with conf1004.eqiad.wmnet:4001 (min=70) Hnowlan new service https://wikitech.wikimedia.org/wiki/PyBal
[10:53:57] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 101 connections established with conf1004.eqiad.wmnet:4001 (min=102) Hnowlan new service https://wikitech.wikimedia.org/wiki/PyBal
[10:53:57] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.55:8087]) Hnowlan new service https://wikitech.wikimedia.org/wiki/PyBal
[10:53:57] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 59 connections established with conf2001.codfw.wmnet:2379 (min=60) Hnowlan new service https://wikitech.wikimedia.org/wiki/PyBal
[10:53:57] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.55:8087]) Hnowlan new service https://wikitech.wikimedia.org/wiki/PyBal
[10:56:12] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.55:8087]) Hnowlan New LVS service https://wikitech.wikimedia.org/wiki/PyBal
[10:56:12] ACKNOWLEDGEMENT - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 79 connections established with conf2001.codfw.wmnet:2379 (min=80) Hnowlan New LVS service https://wikitech.wikimedia.org/wiki/PyBal
[10:56:43] that's totally expected :)
[10:57:38] away afk!
[10:57:40] ufff
[10:57:50] that was supposed to be for irssi :D
[10:59:27] elukey: I'll handle that for you
[10:59:35] elukey: just let me know your root password
[11:04:54] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single
[11:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:08] !log restarting pybal on lvs1015 T254908
[11:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:11] T254908: API Gateway LVS Endpoint - https://phabricator.wikimedia.org/T254908
[11:05:30] !log depool mw1380 for downgrade of poppler-utils,libpoppler-glib8,libpoppler64,curl,libcurl3,libcurl3-gnutls,libpython3.5,python3.5,libpython3.5-stdlib,python3.5-minimal,libpython3.5-minimal,imagemagick-6-common,libmagickcore-6.q16-3,libmagickwand-6.q16-3,imagemagick-6.q16,imagemagick,e2fslibs,e2fsprogs,libcomerr2,libss2 and reboot - T260329
[11:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:33] T260329: Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329
[11:05:57] !log restarting pybal on lvs1016 T254908
[11:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:42] !log restarting pybal on lvs2009 T254908
[11:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:05] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Add DVrandecic to group nda - https://phabricator.wikimedia.org/T260279 (Aklapper) >>! In T260279#6380777, @DVrandecic wrote: > I already am. The onboarding section says (This makes me wonder which specific onboarding docs this is about, and if they...
[11:07:42] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm) [11:07:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:09:24] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single [11:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:30] !log restarting pybal on lvs2010 T254908 [11:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:41] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:11:25] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 60 connections established with conf2001.codfw.wmnet:2379 (min=60) https://wikitech.wikimedia.org/wiki/PyBal [11:13:22] (CR) Gehel: "Looks good except for the prospector errors (see jenkins logs)" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/603731 (owner: Ryan Kemper) [11:18:54] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm) [11:20:16] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) [11:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:02] !log jayme@cumin1001 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) [11:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:27] _joe_: the cookbook is holding up it's 50% promise it seems :D [11:27:36] (CR) Gehel: [C: -1] "A few additional comments on top of Volans review..." (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/619781 (owner: Ryan Kemper) [11:34:07] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (ema) [11:45:56] (PS1) Ayounsi: Move vlans to commons, restrict untrust-screen to mr [homer/public] - https://gerrit.wikimedia.org/r/619984 [11:47:08] (CR) Ayounsi: [C: +2] Move vlans to commons, restrict untrust-screen to mr [homer/public] - https://gerrit.wikimedia.org/r/619984 (owner: Ayounsi) [11:47:31] (Merged) jenkins-bot: Move vlans to commons, restrict untrust-screen to mr [homer/public] - https://gerrit.wikimedia.org/r/619984 (owner: Ayounsi) [12:00:17] (CR) Ladsgroup: "I don't have enough knowledge to responsibly review this specially that this affects every request. I added some other people who might kn" [mediawiki-config] - https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) (owner: Urbanecm) [12:02:56] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (ema) >>! In T260281#6381768, @CDanis wrote: > attach a tracepoint to `memcg_schedule_kmem_cache_create` and gather calling... [12:13:51] (PS1) Kormat: Clean-up dependencies. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619988 [12:15:10] (PS2) Kormat: Clean-up dependencies. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619988 [12:16:46] (CR) Kormat: [C: +2] Clean-up dependencies. 
[software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619988 (owner: Kormat) [12:17:15] (Merged) jenkins-bot: Clean-up dependencies. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619988 (owner: Kormat) [12:19:49] (PS1) Kormat: Expand metadata. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619990 [12:20:05] (PS1) ProcrastinatingReader: Remove abusefilter-view right grant from wmf-config [mediawiki-config] - https://gerrit.wikimedia.org/r/619991 (https://phabricator.wikimedia.org/T255506) [12:20:15] (CR) jerkins-bot: [V: -1] Expand metadata. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619990 (owner: Kormat) [12:22:08] huh, ci runs python3.5? [12:22:29] (PS1) Ayounsi: Homer: add pfw support [puppet] - https://gerrit.wikimedia.org/r/619992 [12:24:22] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (ema) >>! In T260281#6382529, @ema wrote: > I've installed systemtap on mw1357 Nevermind, I've seen only now that mw1357... [12:29:46] (PS1) Elukey: Update README.Debian after the 0.19 release [debs/druid] (debian) - https://gerrit.wikimedia.org/r/619993 (https://phabricator.wikimedia.org/T244482) [12:30:32] (PS2) Kormat: Expand metadata. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619990 [12:30:46] (CR) Elukey: [C: +2] Update README.Debian after the 0.19 release [debs/druid] (debian) - https://gerrit.wikimedia.org/r/619993 (https://phabricator.wikimedia.org/T244482) (owner: Elukey) [12:31:38] (PS3) Kormat: Expand metadata. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619990 [12:32:40] (CR) Kormat: [C: +2] Expand metadata. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619990 (owner: Kormat) [12:33:07] (Merged) jenkins-bot: Expand metadata. 
[software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619990 (owner: Kormat) [12:41:00] (PS1) Giuseppe Lavagetto: Revert "MW firejail: blacklist /run and conf cache" [puppet] - https://gerrit.wikimedia.org/r/619622 [12:41:20] (PS4) Kormat: Update remote execution libraries from transferpy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) [12:41:55] (PS1) Giuseppe Lavagetto: Revert "Re-enable LilyPond/Score in safe mode (3rd attempt)" [mediawiki-config] - https://gerrit.wikimedia.org/r/619623 (https://phabricator.wikimedia.org/T260329) [12:45:10] <_joe_> jouncebot: next [12:45:10] In 18 hour(s) and 14 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200814T0700) [12:45:19] <_joe_> oh, nice :) [12:45:20] (CR) Ema: [C: +1] Revert "MW firejail: blacklist /run and conf cache" [puppet] - https://gerrit.wikimedia.org/r/619622 (owner: Giuseppe Lavagetto) [12:45:29] <_joe_> at least I can safely revert stuff :P [12:46:38] (CR) JMeybohm: "> Patch Set 2: Code-Review-1" [docker-images/production-images] - https://gerrit.wikimedia.org/r/619731 (owner: Addshore) [12:49:45] Operations, Analytics-Clusters, vm-requests: Create 4 new VMs to replace schema[12 - https://phabricator.wikimedia.org/T260347 (elukey) [12:50:35] (PS2) Giuseppe Lavagetto: Revert "Re-enable LilyPond/Score in safe mode (3rd attempt)" [mediawiki-config] - https://gerrit.wikimedia.org/r/619623 (https://phabricator.wikimedia.org/T260329) [12:51:13] Operations, Analytics-Clusters, vm-requests: Create 4 new VMs to replace schema[12]00[12] - https://phabricator.wikimedia.org/T260347 (elukey) p:Triage→Medium a:elukey [12:51:15] (CR) Kormat: wmfmariadbpy: Reorganize backup scripts into its own directory (3 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 
(https://phabricator.wikimedia.org/T165358) (owner: Jcrespo) [12:52:19] (CR) Kormat: [C: -1] wmfmariadbpy: Reorganize backup scripts into its own directory (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo) [12:52:59] (CR) JMeybohm: "> Patch Set 1:" [deployment-charts] - https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) (owner: Jeena Huneidi) [12:54:07] Operations, Analytics-Clusters, vm-requests: Create 4 new VMs to replace schema[12]00[12] - https://phabricator.wikimedia.org/T260347 (elukey) [12:54:09] (CR) Kormat: [C: -1] wmfmariadbpy: Reorganize backup scripts into its own directory (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo) [12:54:27] (PS1) Ayounsi: Add security log {} stanza [homer/public] - https://gerrit.wikimedia.org/r/619995 [12:56:42] (CR) Reedy: [C: +1] Revert "Re-enable LilyPond/Score in safe mode (3rd attempt)" [mediawiki-config] - https://gerrit.wikimedia.org/r/619623 (https://phabricator.wikimedia.org/T260329) (owner: Giuseppe Lavagetto) [12:58:48] (CR) Giuseppe Lavagetto: [C: +2] Revert "Re-enable LilyPond/Score in safe mode (3rd attempt)" [mediawiki-config] - https://gerrit.wikimedia.org/r/619623 (https://phabricator.wikimedia.org/T260329) (owner: Giuseppe Lavagetto) [12:58:52] (CR) Ebe123: [C: +1] Revert "Re-enable LilyPond/Score in safe mode (3rd attempt)" [mediawiki-config] - https://gerrit.wikimedia.org/r/619623 (https://phabricator.wikimedia.org/T260329) (owner: Giuseppe Lavagetto) [12:59:32] (Merged) jenkins-bot: Revert "Re-enable LilyPond/Score in safe mode (3rd attempt)" [mediawiki-config] - https://gerrit.wikimedia.org/r/619623 (https://phabricator.wikimedia.org/T260329) (owner: Giuseppe Lavagetto) [12:59:49] (CR) Jcrespo: wmfmariadbpy: 
Reorganize backup scripts into its own directory (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo) [13:01:00] Operations, Analytics-Clusters, vm-requests: Create 4 new VMs to replace schema[12]00[12] - https://phabricator.wikimedia.org/T260347 (elukey) As described in T255026#6276301: >>! In T255026#6276301, @MoritzMuehlenhoff wrote: > When you do that, please use row B/D in eqiad and row C/D in codfw to be... [13:01:31] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Joe) [13:02:19] (PS1) Hnowlan: kubernetes: add api-gateway to LVS pools [puppet] - https://gerrit.wikimedia.org/r/619996 (https://phabricator.wikimedia.org/T254908) [13:02:37] (PS1) Vgutierrez: api-gateway: Add LVS IP on kubernetes hosts [puppet] - https://gerrit.wikimedia.org/r/619997 (https://phabricator.wikimedia.org/T254908) [13:02:47] lol :) [13:02:52] I'll abandon mine [13:03:15] Operations, CommRel-Specialists-Support (Jul-Sep-2020), User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (Trizek-WMF) Great! This message will be posted on wikis on week 35. 
[13:03:25] (Abandoned) Vgutierrez: kubernetes: add api-gateway to LVS pools [puppet] - https://gerrit.wikimedia.org/r/619996 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan) [13:05:57] (PS1) Elukey: Add schema[12]00[34] A/PTR records [dns] - https://gerrit.wikimedia.org/r/619998 (https://phabricator.wikimedia.org/T260347) [13:06:54] (Restored) Vgutierrez: kubernetes: add api-gateway to LVS pools [puppet] - https://gerrit.wikimedia.org/r/619996 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan) [13:07:14] (CR) Vgutierrez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/619996 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan) [13:07:26] (CR) Hnowlan: [C: +2] kubernetes: add api-gateway to LVS pools [puppet] - https://gerrit.wikimedia.org/r/619996 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan) [13:07:45] !log oblivian@deploy1001 Synchronized wmf-config/CommonSettings.php: revert enabling of lilypond (again) T257091 T260329 (duration: 00m 59s) [13:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:49] T260329: Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 [13:09:12] (CR) Giuseppe Lavagetto: [C: +2] Revert "MW firejail: blacklist /run and conf cache" [puppet] - https://gerrit.wikimedia.org/r/619622 (owner: Giuseppe Lavagetto) [13:10:03] (PS3) Hashar: Add basic doc for python-build* images [docker-images/production-images] - https://gerrit.wikimedia.org/r/605649 [13:10:05] (PS2) Hashar: .gitignore docker-pkg-build.log [docker-images/production-images] - https://gerrit.wikimedia.org/r/619759 [13:10:07] (Abandoned) Vgutierrez: api-gateway: Add LVS IP on kubernetes hosts [puppet] - https://gerrit.wikimedia.org/r/619997 (https://phabricator.wikimedia.org/T254908) (owner: Vgutierrez) [13:10:23] (CR) Hashar: "Rebased due to a conflict in .gitignore 
!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/619759 (owner: Hashar) [13:10:47] <_joe_> !log forcing a puppet run on the api appservers in eqiad T260329 [13:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:44] (CR) Ottomata: [C: +1] Add schema[12]00[34] A/PTR records [dns] - https://gerrit.wikimedia.org/r/619998 (https://phabricator.wikimedia.org/T260347) (owner: Elukey) [13:16:24] (PS2) Jcrespo: wmfmariadbpy: Reorganize backup scripts into its own directory [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) [13:17:22] (PS1) Urbanecm: Enable subpages in NS:0 in techconductwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/620001 (https://phabricator.wikimedia.org/T260350) [13:17:28] (CR) Jcrespo: wmfmariadbpy: Reorganize backup scripts into its own directory (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo) [13:18:18] Operations, CommRel-Specialists-Support (Jul-Sep-2020), User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (Trizek-WMF) [13:19:23] (PS3) Jcrespo: wmfmariadbpy: Reorganize backup scripts into its own directory [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) [13:20:55] (CR) Jcrespo: wmfmariadbpy: Reorganize backup scripts into its own directory (3 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo) [13:21:44] (PS1) Hnowlan: api-gateway: enable monitoring setup for service [puppet] - https://gerrit.wikimedia.org/r/620004 (https://phabricator.wikimedia.org/T254908) [13:25:45] (PS1) Jcrespo: wmfbackups: Copy backup-related scripts from puppet to wmfmariadbpy [software/wmfmariadbpy] - 
https://gerrit.wikimedia.org/r/620005 (https://phabricator.wikimedia.org/T165358) [13:27:30] (PS7) Jcrespo: mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358) [13:31:07] (CR) Kormat: wmfmariadbpy: Reorganize backup scripts into its own directory (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo) [13:34:29] (CR) Elukey: [C: +2] Add schema[12]00[34] A/PTR records [dns] - https://gerrit.wikimedia.org/r/619998 (https://phabricator.wikimedia.org/T260347) (owner: Elukey) [13:34:39] (PS1) Ppchelko: Add to resource_purge events [deployment-charts] - https://gerrit.wikimedia.org/r/620008 [13:35:03] (CR) Ppchelko: [C: +1] api-gateway: enable monitoring setup for service [puppet] - https://gerrit.wikimedia.org/r/620004 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan) [13:35:37] (PS2) Ppchelko: Add $schema to resource_purge events [deployment-charts] - https://gerrit.wikimedia.org/r/620008 [13:39:30] (CR) Vgutierrez: [C: +2] zuul: stop prefixing report with the job name [puppet] - https://gerrit.wikimedia.org/r/608296 (https://phabricator.wikimedia.org/T256575) (owner: Hashar) [13:39:44] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [13:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:47] (CR) Hnowlan: [C: +2] api-gateway: enable monitoring setup for service [puppet] - https://gerrit.wikimedia.org/r/620004 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan) [13:44:41] !log Gracefully restarting Zuul [13:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:55] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [13:44:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:59] !log moving api-gateway service to monitoring_setup [13:45:59] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [13:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [13:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:09] (PS3) Addshore: golang:1.13-2, Add ca-certificates [docker-images/production-images] - https://gerrit.wikimedia.org/r/619731 [13:53:09] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [13:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:49] (CR) JMeybohm: [C: +1] "LGTM" [docker-images/production-images] - https://gerrit.wikimedia.org/r/619731 (owner: Addshore) [13:58:14] (PS1) Addshore: build: loki & ratelimit use golang:1.13-2 [docker-images/production-images] - https://gerrit.wikimedia.org/r/620015 [13:58:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [13:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:54] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm [13:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:35] !log create schema[12]00[34] in ganeti - T260347 [14:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:37] T260347: Create 4 new VMs to replace schema[12]00[12] - https://phabricator.wikimedia.org/T260347 [14:04:55] (CR) JMeybohm: [C: +1] "Adding you as reviewer to take notice. Not bumping changelog and rebuilding is okay here I guess." 
[docker-images/production-images] - https://gerrit.wikimedia.org/r/620015 (owner: Addshore) [14:05:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [14:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:32] (PS1) Elukey: Add schema[12]00[34] dhcp + role config [puppet] - https://gerrit.wikimedia.org/r/620016 (https://phabricator.wikimedia.org/T260347) [14:10:03] (CR) Elukey: [C: +2] Add schema[12]00[34] dhcp + role config [puppet] - https://gerrit.wikimedia.org/r/620016 (https://phabricator.wikimedia.org/T260347) (owner: Elukey) [14:11:15] (CR) Ottomata: [C: +1] Add schema[12]00[34] dhcp + role config [puppet] - https://gerrit.wikimedia.org/r/620016 (https://phabricator.wikimedia.org/T260347) (owner: Elukey) [14:15:02] (CR) Cwhite: [C: +1] "LGTM" [docker-images/production-images] - https://gerrit.wikimedia.org/r/620015 (owner: Addshore) [14:20:48] (PS1) Hnowlan: api-gateway: move to production [puppet] - https://gerrit.wikimedia.org/r/620019 (https://phabricator.wikimedia.org/T254908) [14:22:37] (CR) Pcoombe: [C: +1] "Config looks good to me! Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/619852 (https://phabricator.wikimedia.org/T259002) (owner: Urbanecm) [14:22:53] thanks pcoombe :) [14:23:18] np Urbanecm! Just replying on the phab task now [14:23:31] (CR) Ppchelko: [C: +1] api-gateway: move to production [puppet] - https://gerrit.wikimedia.org/r/620019 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan) [14:23:47] (PS1) Andrew Bogott: backy2: temporarily hack data dir to /var/lib/nova/instances [puppet] - https://gerrit.wikimedia.org/r/620020 [14:24:21] Operations, Fundraising-Backlog, MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (Urbanecm) >>! 
In T259002#6382206, @gerritbot wrote: > Change 619446 **... [14:24:56] (CR) Andrew Bogott: [C: +2] backy2: temporarily hack data dir to /var/lib/nova/instances [puppet] - https://gerrit.wikimedia.org/r/620020 (owner: Andrew Bogott) [14:25:20] Operations, Fundraising-Backlog, MW-1.36-notes (1.36.0-wmf.4; 2020-08-11), Patch-For-Review, and 3 others: New wiki for fundraising Thank You pages with similar config as donatewiki - https://phabricator.wikimedia.org/T259002 (Pcoombe) >>! In T259002#6381151, @Urbanecm wrote: > @Ladsgroup @Pcoomb... [14:28:18] (CR) Hnowlan: [C: +2] api-gateway: move to production [puppet] - https://gerrit.wikimedia.org/r/620019 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan) [14:28:24] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 (ema) `node_vmstat_nr_slab_unreclaimable` is going up indefinitely on nodes affected by the issue, following a pattern that... 
[14:30:53] (PS1) Hnowlan: api-gateway: Fix healthcheck path [puppet] - https://gerrit.wikimedia.org/r/620023 (https://phabricator.wikimedia.org/T254908) [14:31:37] <_joe_> !log installing kernel 4.19.0-0.bpo.9 on mw1381 T260329 [14:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:41] T260329: Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 [14:33:31] Operations, CommRel-Specialists-Support (Jul-Sep-2020), User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (Trizek-WMF) [14:33:32] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [14:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:36] (CR) Ppchelko: api-gateway: Fix healthcheck path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/620023 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan) [14:34:04] <_joe_> !log rebooting mw1381 with a newer kernel, mw1383 as control with the old kernel T260329 [14:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:14] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single [14:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:31] (CR) Hnowlan: api-gateway: Fix healthcheck path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/620023 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan) [14:35:58] (PS2) Jcrespo: wmfbackups: Copy backup-related scripts from puppet to wmfmariadbpy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/620005 (https://phabricator.wikimedia.org/T165358) [14:37:31] (PS3) Jcrespo: wmfbackups: Copy backup-related scripts from puppet to wmfmariadbpy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/620005 (https://phabricator.wikimedia.org/T165358) [14:38:52] !log reboot mw1382 with kernel 
memory accounting disabled T260281 [14:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:55] T260281: mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 [14:40:12] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:04] (PS4) Jcrespo: wmfmariadbpy: Reorganize backup scripts into its own directory [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) [14:41:38] (CR) Kormat: "One minor comment." (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/620005 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo) [14:41:43] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:43] (CR) Ppchelko: [C: +1] api-gateway: Fix healthcheck path (1 comment) [puppet] - https://gerrit.wikimedia.org/r/620023 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan) [14:44:18] (PS5) Jcrespo: wmfmariadbpy: Reorganize backup scripts into its own directory [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) [14:45:01] (CR) Kormat: wmfbackups: Copy backup-related scripts from puppet to wmfmariadbpy (1 comment) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/620005 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo) [14:45:24] !log fdans@deploy1001 Started deploy [analytics/refinery@ba1a439]: Regular analytics weekly train [14:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:54] !log repool mw1382 with kernel memory accounting disabled T260281 [14:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:58] T260281: 
mw* servers memory leaks (12 Aug) - https://phabricator.wikimedia.org/T260281 [14:46:49] (CR) Hnowlan: [C: +2] api-gateway: Fix healthcheck path [puppet] - https://gerrit.wikimedia.org/r/620023 (https://phabricator.wikimedia.org/T254908) (owner: Hnowlan) [14:47:05] (PS4) Jcrespo: wmfbackups: Copy backup-related scripts from puppet to wmfmariadbpy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/620005 (https://phabricator.wikimedia.org/T165358) [14:47:07] (PS8) Jcrespo: mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358) [14:50:55] (PS5) Jcrespo: wmfbackups: Copy backup-related scripts from puppet to wmfmariadbpy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/620005 (https://phabricator.wikimedia.org/T165358) [14:51:24] (CR) Jcrespo: "Not sure why the +x was reverted by git, given it was added last time." (2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/620005 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo) [14:52:03] Operations, Platform Engineering, Wikidata, serviceops, Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (JMeybohm) [14:53:11] (PS9) Jcrespo: mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358) [14:56:20] Operations, netops: Standardize VRRP group IDs - https://phabricator.wikimedia.org/T260363 (ayounsi) p:Triage→Low [14:56:59] !log fdans@deploy1001 Finished deploy [analytics/refinery@ba1a439]: Regular analytics weekly train (duration: 11m 34s) [14:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:36] (PS6) Jcrespo: wmfbackups: Copy backup-related scripts from puppet to wmfmariadbpy 
[software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/620005 (https://phabricator.wikimedia.org/T165358) [15:01:10] (PS10) Jcrespo: mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358) [15:01:12] (CR) Kormat: [C: +1] wmfbackups: Copy backup-related scripts from puppet to wmfmariadbpy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/620005 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo) [15:02:11] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (6) node(s) change every puppet run: schema1004.eqiad.wmnet, schema1003.eqiad.wmnet, cloudvirt1006.eqiad.wmnet, schema2004.codfw.wmnet, wdqs1009.eqiad.wmnet, schema2003.codfw.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [15:04:03] Operations, ops-eqiad, DC-Ops, netops, cloud-services-team (Hardware): (Need By: 2020-06-12) rack/setup/install WMCS 10G switches - https://phabricator.wikimedia.org/T251632 (ayounsi) I believe: > Also em0, the management ports don't have their cable info in Netbox. Is the last thing to do h... [15:04:42] Operations, Analytics-Clusters, vm-requests: Create 4 new VMs to replace schema[12]00[12] - https://phabricator.wikimedia.org/T260347 (elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_B schema1003.eqiad.wmnet --vcpus 2 --memory 2 --disk 10 START - Cookbook sre.ganeti.makevm /usr/li... [15:05:01] Operations, Analytics-Clusters, vm-requests: Create 4 new VMs to replace schema[12]00[12] - https://phabricator.wikimedia.org/T260347 (elukey) Open→Resolved All vms created and bootstrapped! 
[15:07:29] RECOVERY - Host mc2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 310.67 ms [15:08:38] (PS1) Guergana Tzatchkova: Remove $wgExtraLanguageNames from Wikidata and Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) [15:09:43] elukey: mc2028 back up [15:10:47] Operations, ops-codfw: mc2028 regular and mgmt interface down - https://phabricator.wikimedia.org/T260224 (Papaul) ` System Error 01/01/1970 00:01 01/01/1970 00:01 1 Server Critical Fault (Service Information: Power On Fault, System Board, AUX/Main EFUSE Regulator 1 (10h)) [15:13:24] (CR) Kormat: [C: +1] wmfmariadbpy: Reorganize backup scripts into its own directory [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) (owner: Jcrespo) [15:14:03] (PS1) Ottomata: Remove SearchSatisfaction from wgEventLoggingSchemas [mediawiki-config] - https://gerrit.wikimedia.org/r/620051 (https://phabricator.wikimedia.org/T259163) [15:15:03] papaul: wow thanks! [15:15:15] papaul: what was the issue?? [15:15:26] elukey: still looking into it [15:16:06] elukey: but the server was completely off when i got here [15:16:12] (PS2) Guergana Tzatchkova: Remove $wgExtraLanguageNames from Wikidata and Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/620050 (https://phabricator.wikimedia.org/T260118) [15:16:15] elukey: maybe some HW issue [15:17:06] (CR) Ottomata: "Will deploy this monday." [mediawiki-config] - https://gerrit.wikimedia.org/r/620051 (https://phabricator.wikimedia.org/T259163) (owner: Ottomata) [15:19:00] (CR) Jcrespo: [C: +1] "Note on the package description, it says it supports:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) (owner: Kormat) [15:20:49] (CR) Jcrespo: [C: +1] "Correction to previous comment." 
(2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) (owner: Kormat) [15:23:29] papaul: mmm so the mgmt console is reachable, but the server seems still down. [15:23:46] elukey: correct [15:23:53] now I can't reach console anymore [15:24:05] elukey: yes workin on it [15:24:19] ah sorry I'll leave you working then [15:24:53] let me know how it goes, if the server will stay down for days we'll have to fix some configs [15:25:43] elukey: sure [15:27:22] (PS5) Kormat: Update remote execution libraries from transferpy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) [15:27:25] (CR) Jcrespo: [C: +1] "Should RemoteExecution be its own "module/subdir" inside this repo, given that it is use/will be used by WMFMariaDB, WMFBackup AND transfe" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) (owner: Kormat) [15:27:31] (CR) Kormat: Update remote execution libraries from transferpy (2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) (owner: Kormat) [15:29:06] (CR) Kormat: "> Patch Set 4:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) (owner: Kormat) [15:30:41] PROBLEM - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:31:39] (CR) Jcrespo: "> Patch Set 5:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) (owner: Kormat) [15:32:23] (CR) Jcrespo: [C: +1] Update remote execution libraries from transferpy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) (owner: Kormat) [15:33:02] (CR) Kormat: [C: +2] Update remote execution libraries from transferpy [software/wmfmariadbpy] - 
10https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [15:33:16] (03CR) 10Jcrespo: [C: 03+1] "> Maybe let's merge as is and research on a separate patch? I remember we had issues with CI being not very friendy with with multiple sub" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [15:33:26] (03CR) 10Hnowlan: [C: 03+1] build: loki & ratelimit use golang:1.13-2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620015 (owner: 10Addshore) [15:33:28] (03Merged) 10jenkins-bot: Update remote execution libraries from transferpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619476 (https://phabricator.wikimedia.org/T259516) (owner: 10Kormat) [15:34:06] (03CR) 10Hnowlan: [C: 03+1] Add $schema to resource_purge events [deployment-charts] - 10https://gerrit.wikimedia.org/r/620008 (owner: 10Ppchelko) [15:34:39] (03CR) 10Ottomata: [C: 03+1] Add $schema to resource_purge events [deployment-charts] - 10https://gerrit.wikimedia.org/r/620008 (owner: 10Ppchelko) [15:34:55] (03PS1) 10Ottomata: Refine - temporarily exclude resource-purge from refinement [puppet] - 10https://gerrit.wikimedia.org/r/620056 [15:35:46] (03PS2) 10Ottomata: Refine - temporarily exclude resource-purge from refinement [puppet] - 10https://gerrit.wikimedia.org/r/620056 [15:36:17] RECOVERY - Host mc2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.51 ms [15:37:34] (03CR) 10Ottomata: [C: 03+2] Refine - temporarily exclude resource-purge from refinement [puppet] - 10https://gerrit.wikimedia.org/r/620056 (owner: 10Ottomata) [15:38:36] 10Operations, 10ops-codfw: mc2028 regular and mgmt interface down - https://phabricator.wikimedia.org/T260224 (10Papaul) I can't turn on the server. The power LED is solid amber and the health LED is blinking red. On iLO System Information, it is written that the status of "BIOS/Hardware Health" is "failed",... 
[15:39:31] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [15:41:26] !log restart ES on logstash1012 [15:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:08] (03PS7) 10C. Scott Ananian: Alternate configuration mechanism for Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612879 (https://phabricator.wikimedia.org/T241961) [15:47:10] (03PS2) 10Hnowlan: api-gateway: move to production [puppet] - 10https://gerrit.wikimedia.org/r/620019 (https://phabricator.wikimedia.org/T254908) [15:49:21] 10Operations, 10GrowthExperiments-NewcomerTasks, 10Product-Infrastructure-Team-Backlog, 10serviceops: Service operations setup for Add a Link project - https://phabricator.wikimedia.org/T258978 (10kostajh) [15:49:27] !log moving api-gateway service to state production. critical set to false [15:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:17] PROBLEM - Host mc2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:55:18] 10Operations, 10ops-codfw: mc2028 regular and mgmt interface down - https://phabricator.wikimedia.org/T260224 (10Papaul) @elukey system board problem on the server. 
And the server is out of warranty @wiki_willy [15:56:17] elukey: task updated for mc2028 [15:56:25] (03PS2) 10Hnowlan: api-gateway: create discovery records [dns] - 10https://gerrit.wikimedia.org/r/619798 (https://phabricator.wikimedia.org/T254908) [15:56:34] papaul: thanks, I have seen it :( [15:56:42] elukey: cool [16:02:30] 10Operations, 10ops-codfw: mc2028 regular and mgmt interface down - https://phabricator.wikimedia.org/T260224 (10elukey) This host is important for the upcoming DC switchover happening on Sept 1st :( We have in our budget the refresh of the mc hosts IIRC, but now it might be too soon (plus ordering new hosts... [16:03:07] (03CR) 10Ppchelko: "Marking as WIP to investigate alternatives to tail|curl" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: 10Ppchelko) [16:06:00] (03PS1) 10Ori.livneh: admin: add mwhist script to ~ori [puppet] - 10https://gerrit.wikimedia.org/r/620061 [16:08:13] 10Operations, 10Platform Engineering, 10Wikidata, 10serviceops, 10Sustainability (Incident Followup): Figure what change caused the ongoing memleak on mw appservers - https://phabricator.wikimedia.org/T260329 (10JMeybohm) [16:10:00] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 74.24 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [16:14:08] (03CR) 10Jcrespo: [C: 03+2] wmfmariadbpy: Reorganize backup scripts into its own directory [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [16:14:34] (03PS6) 10Jcrespo: wmfmariadbpy: Reorganize backup scripts into its own directory [software/wmfmariadbpy] - 
10https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) [16:17:09] (03CR) 10Jcrespo: [C: 03+2] wmfmariadbpy: Reorganize backup scripts into its own directory (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [16:17:35] (03Merged) 10jenkins-bot: wmfmariadbpy: Reorganize backup scripts into its own directory [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619958 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [16:19:52] (03CR) 10Jcrespo: [C: 03+2] wmfbackups: Copy backup-related scripts from puppet to wmfmariadbpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620005 (https://phabricator.wikimedia.org/T165358) (owner: 10Jcrespo) [16:21:51] (03PS1) 10Andrew Bogott: openstack clients: include python3 openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/620063 [16:21:53] (03PS1) 10Andrew Bogott: backy2: use local sqlite db [puppet] - 10https://gerrit.wikimedia.org/r/620064 [16:21:55] dpifke: i am here if you want to do something with xhgui [16:23:17] (03PS4) 10Hnowlan: Add api.wikimedia.org and api.m.wikimedia.org DNS entries [dns] - 10https://gerrit.wikimedia.org/r/599273 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup) [16:24:09] (03CR) 10Andrew Bogott: [C: 03+2] openstack clients: include python3 openstack client packages [puppet] - 10https://gerrit.wikimedia.org/r/620063 (owner: 10Andrew Bogott) [16:24:15] (03CR) 10Hnowlan: [C: 03+2] Add api.wikimedia.org and api.m.wikimedia.org DNS entries [dns] - 10https://gerrit.wikimedia.org/r/599273 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup) [16:24:31] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10RobH) [16:24:40] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install 
es10[26-34].eqiad.wmnet - https://phabricator.wikimedia.org/T260370 (10RobH) [16:26:11] !log created api.wikimedia.org [16:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:37] (03PS7) 10Jcrespo: wmfbackups: Copy backup-related scripts from puppet to wmfmariadbpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/620005 (https://phabricator.wikimedia.org/T165358) [16:28:01] (03PS3) 10Hnowlan: api-gateway: create discovery records [dns] - 10https://gerrit.wikimedia.org/r/619798 (https://phabricator.wikimedia.org/T254908) [16:29:27] (03PS2) 10Andrew Bogott: backy2: use local sqlite db [puppet] - 10https://gerrit.wikimedia.org/r/620064 [16:29:29] (03PS1) 10Andrew Bogott: wmcs-backup-instances: fix name of mwopenstackclients [puppet] - 10https://gerrit.wikimedia.org/r/620065 [16:29:30] (03PS1) 10Andrew Bogott: role::wmcs::ceph::backup: include observerenv [puppet] - 10https://gerrit.wikimedia.org/r/620066 [16:30:50] (03CR) 10Andrew Bogott: [C: 03+2] backy2: use local sqlite db [puppet] - 10https://gerrit.wikimedia.org/r/620064 (owner: 10Andrew Bogott) [16:31:06] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-backup-instances: fix name of mwopenstackclients [puppet] - 10https://gerrit.wikimedia.org/r/620065 (owner: 10Andrew Bogott) [16:31:13] (03CR) 10Andrew Bogott: [C: 03+2] role::wmcs::ceph::backup: include observerenv [puppet] - 10https://gerrit.wikimedia.org/r/620066 (owner: 10Andrew Bogott) [16:34:33] PROBLEM - Thanos query has high latency for instant queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [16:35:46] (03PS11) 10Jcrespo: mariadb-backups: Reorganize files and update paths [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/619962 (https://phabricator.wikimedia.org/T165358) [16:36:09] (03PS1) 10Hnowlan: trafficserver: route api.wikimedia.org to api-gateway service [puppet] - 
10https://gerrit.wikimedia.org/r/620067 (https://phabricator.wikimedia.org/T254908) [16:36:31] (03PS1) 10Ppchelko: ratelimit: crash on startup if config is invalid [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620068 [16:37:50] (03PS2) 10Ppchelko: ratelimit: crash on startup if config is invalid [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620068 [16:38:35] PROBLEM - Thanos query has high latency for range queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [16:39:13] RECOVERY - Thanos query has high latency for instant queries on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [16:39:41] (03CR) 10Ppchelko: ratelimit: crash on startup if config is invalid (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/620068 (owner: 10Ppchelko) [16:39:58] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10RobH) [16:40:09] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10RobH) [16:44:45] (03CR) 10CDanis: [C: 03+1] api-gateway: create discovery records [dns] - 10https://gerrit.wikimedia.org/r/619798 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [16:47:32] (03CR) 10Hnowlan: [C: 03+2] api-gateway: create discovery records [dns] - 10https://gerrit.wikimedia.org/r/619798 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [16:49:21] PROBLEM - Thanos query has high latency for range queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts 
https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [16:51:36] mutante: Thanks. I'm still on my first cup of coffee, will be ready to get started in a few. [16:53:17] (03CR) 10Dzahn: [C: 03+2] ATS: temp. set backend for releases-jenkins to releases1001 [puppet] - 10https://gerrit.wikimedia.org/r/619826 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [16:53:45] dpifke: ack, no rush [16:54:01] PROBLEM - Thanos query has high latency for instant queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [16:55:05] PROBLEM - Thanos query has many failed HTTP range queries requests on icinga1001 is CRITICAL: 7.368 ge 5 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [16:55:13] PROBLEM - Thanos query has high latency for range queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [16:59:01] RECOVERY - Thanos query has many failed HTTP range queries requests on icinga1001 is OK: (C)5 ge (W)3 ge 0 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [16:59:07] RECOVERY - Thanos query has high latency for range queries on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [16:59:52] (03PS2) 10Hnowlan: trafficserver: route api.wikimedia.org to api-gateway service [puppet] - 10https://gerrit.wikimedia.org/r/620067 (https://phabricator.wikimedia.org/T254908) [16:59:53] RECOVERY - Thanos query has high latency for instant queries on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:08:44] (03CR) 10CDanis: [C: 03+1] trafficserver: route api.wikimedia.org to api-gateway service [puppet] - 10https://gerrit.wikimedia.org/r/620067 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [17:13:18] (03CR) 10Hnowlan: [C: 03+2] trafficserver: route api.wikimedia.org to api-gateway service [puppet] - 10https://gerrit.wikimedia.org/r/620067 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [17:16:31] !log deployed ATS and varnish rules to route api.wikimedia.org [17:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (10RobH) [17:18:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (10RobH) [17:20:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (10RobH) [17:20:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (10RobH) [17:21:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frdb1004.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T260379 (10RobH) [17:29:00] (03PS2) 10Dzahn: Revert "Revert "switch releases.wikimedia.org to buster backends"" [dns] - 10https://gerrit.wikimedia.org/r/619618 [17:29:19] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts 
https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:29:27] PROBLEM - Thanos query has high latency for instant queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:29:43] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.01305 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:30:31] PROBLEM - Thanos query has many failed HTTP range queries requests on icinga1001 is CRITICAL: 26.92 ge 5 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:30:37] PROBLEM - Thanos query has high latency for range queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:30:42] hnowlan: that looks like there might be puppet errors on caching servers [17:30:53] let's check one [17:31:41] hnowlan: "No rule found for api-gateway.wikimedia.org in profile::trafficserver::backend::mapping_rules " [17:32:13] hnowlan: agh, looking [17:33:13] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 125.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [17:33:31] PROBLEM - puppet last run on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:33:42] hnowlan: so in backend.yaml you have a rule for api.wikimedia.org but in text.yaml 
you have api-gateway.wikimedia.org [17:34:07] PROBLEM - Check systemd state on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:13] PROBLEM - Check whether ferm is active by checking the default input chain on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:34:35] mutante: ah, good find. Writing a fix now [17:34:35] PROBLEM - Check size of conntrack table on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:35:01] RECOVERY - Thanos query has many failed HTTP range queries requests on icinga1001 is OK: (C)5 ge (W)3 ge 0 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:35:03] PROBLEM - Thanos query has high latency for range queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:35:11] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance=mc1028 site=eqiad tunnel=mc2028_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [17:35:21] PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:35:29] PROBLEM - Thanos query has high latency for instant queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:35:41] (03CR) 10Krinkle: [C: 03+2] xhgui: 
enable prod MariaDB, disable labs MongoDB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619886 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:35:47] (03PS1) 10Hnowlan: cache: fix naming of api.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/620090 (https://phabricator.wikimedia.org/T254908) [17:35:59] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [17:36:09] (03CR) 10Dzahn: [C: 03+1] cache: fix naming of api.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/620090 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [17:36:25] (03Merged) 10jenkins-bot: xhgui: enable prod MariaDB, disable labs MongoDB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/619886 (https://phabricator.wikimedia.org/T180761) (owner: 10Dave Pifke) [17:36:37] hnowlan: you already got a certificate for api-gateway.discovery ? [17:36:44] (03CR) 10Hnowlan: [C: 03+2] cache: fix naming of api.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/620090 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan) [17:37:23] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [17:37:43] mutante: yep [17:38:01] PROBLEM - Check the last execution of generate-mysqld-exporter-config on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:38:05] hnowlan: ok. cool. 
and puppet works again on cp1075. the others should recover. you can optionally use cumin to run puppet on all of them [17:38:20] mutante: sweet, thanks for letting me know [17:38:26] * Krinkle staging on mwdebug1002 with dpifke [17:38:54] it used to be this noisy alert but then it was changed to just one alert for "widespread failures" and i think now it is too easy to overlook, actually [17:40:26] hnowlan: https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed [17:40:57] PROBLEM - Thanos query has high latency for range queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:40:57] PROBLEM - dhclient process on prometheus1004 is CRITICAL: connect to address 10.64.16.38 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [17:40:59] volans: neat, thanks! [17:41:04] the important part is the batch option and the useful part is the -q :D [17:41:11] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance=mc1028 site=eqiad tunnel=mc2028_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [17:41:13] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is CRITICAL: 132.2 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [17:41:23] to avoid the clutter of output [17:41:29] PROBLEM - Thanos query has high latency for instant queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:41:32] heh, cool 
[17:42:04] hnowlan: you can safely run it across the whole fleet, but if you have a more specific alias/query it's quicker [17:42:32] (03PS1) 10Tchanders: Remove the 'investigate' right from testwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620091 (https://phabricator.wikimedia.org/T260175) [17:42:34] (03PS1) 10Tchanders: Remove 'investigate' from $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620092 (https://phabricator.wikimedia.org/T260175) [17:43:19] herron: hey you know where that JVM GC alert from logstash1010? is it kafka? [17:43:49] that's elasticsearch memory afaict [17:43:55] I just bounced the instance [17:44:01] ack, thanks! [17:44:37] don't forget to log the actions :) [17:44:43] PROBLEM - Thanos query has high latency for range queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:45:19] this should recover once puppet ran on cp https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1 [17:45:25] PROBLEM - Thanos query has high latency for instant queries on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:46:00] (03CR) 10Tchanders: "I think this should be separate from the patch underneath, but if not I'll squash them." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620092 (https://phabricator.wikimedia.org/T260175) (owner: 10Tchanders) [17:46:25] PROBLEM - Thanos query has many failed HTTP range queries requests on icinga1001 is CRITICAL: 5.668 ge 5 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:47:15] RECOVERY - Thanos query has high gRPC client errors on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:47:55] RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:20] (03CR) 10Dzahn: [C: 03+2] Revert "Revert "switch releases.wikimedia.org to buster backends"" [dns] - 10https://gerrit.wikimedia.org/r/619618 (owner: 10Dzahn) [17:48:23] RECOVERY - Check size of conntrack table on prometheus1004 is OK: OK: nf_conntrack is 3 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [17:48:55] RECOVERY - Check the last execution of generate-mysqld-exporter-config on prometheus1004 is OK: OK: Status of the systemd unit generate-mysqld-exporter-config https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:49:31] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:50:17] RECOVERY - Thanos query has many failed HTTP range queries requests on icinga1001 is OK: (C)5 ge (W)3 ge 0 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:50:20] hnowlan: ^ that was it. [17:50:23] RECOVERY - Thanos query has high latency for range queries on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:51:15] RECOVERY - Thanos query has high latency for instant queries on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [17:51:21] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:51:29] (03PS2) 10Dzahn: releases: allow rsyncing jenkins data between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/619822 (https://phabricator.wikimedia.org/T247652) [17:53:00] (03CR) 10Dzahn: [C: 03+2] releases: allow rsyncing jenkins data between releases servers [puppet] - 10https://gerrit.wikimedia.org/r/619822 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [17:56:04] mutante: ahh I see - I'll keep an eye out for it in future. I was looking at the time but mostly eyegrepping for specifically cache or ATS related things [17:57:10] hnowlan: yup, makes sense. just a puppet issue since it kept the service from reloading config [17:59:29] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [18:01:09] RECOVERY - Check whether ferm is active by checking the default input chain on prometheus1004 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:01:09] RECOVERY - dhclient process on prometheus1004 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:01:09] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [18:04:59] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [18:05:29] !log dpifke@deploy1001 Synchronized wmf-config/ProductionServices.php: Enabling new XHGui backend (T180761) (duration: 00m 56s) [18:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:32] T180761: Move XHGui from tungsten to xhgui-001 - https://phabricator.wikimedia.org/T180761 [18:05:43] :) [18:06:02] !log restarted ES on logstash1010 [18:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:40] (03PS2) 10Dzahn: remove fermium from DHCP,partman and acme_chief [puppet] - 10https://gerrit.wikimedia.org/r/619586 (https://phabricator.wikimedia.org/T224586) [18:11:35] (03CR) 10Herron: [C: 03+1] remove fermium from DHCP,partman and acme_chief [puppet] - 10https://gerrit.wikimedia.org/r/619586 (https://phabricator.wikimedia.org/T224586) (owner: 10Dzahn) [18:12:56] herron: thanks. should i just merge it or wait until VM is deleted for real? [18:13:34] makes sense to merge IMO [18:14:08] cool, doing [18:14:31] (03CR) 10Dzahn: [C: 03+2] remove fermium from DHCP,partman and acme_chief [puppet] - 10https://gerrit.wikimedia.org/r/619586 (https://phabricator.wikimedia.org/T224586) (owner: 10Dzahn) [18:14:44] great thanks! 
[18:15:21] the other day i replaced it in some scripts that might not be used anymore [18:15:34] now there is just the DNS record and that's it [18:20:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:20:43] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:20:55] are you already working on a patch for DNS? I can upload one if not [18:22:57] herron: eh, kind of started. you do it then :) [18:23:28] hehe ok [18:31:51] (03PS1) 10Herron: dns: remove fermium records [dns] - 10https://gerrit.wikimedia.org/r/620096 (https://phabricator.wikimedia.org/T224586) [18:40:53] (03CR) 10Dzahn: [C: 03+1] dns: remove fermium records [dns] - 10https://gerrit.wikimedia.org/r/620096 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [18:42:42] (03CR) 10Herron: [C: 03+2] dns: remove fermium records [dns] - 10https://gerrit.wikimedia.org/r/620096 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [18:54:09] (03PS1) 10Dzahn: releases: add quickdatacopy rsync on the primary as well [puppet] - 10https://gerrit.wikimedia.org/r/620099 (https://phabricator.wikimedia.org/T247652) [18:57:51] (03PS2) 10Dzahn: releases: rsync needs to be on all servers incl the primary [puppet] - 10https://gerrit.wikimedia.org/r/620099 (https://phabricator.wikimedia.org/T247652) [19:02:30] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/619982 (owner: 10Volans) [19:04:58] (03PS3) 10Dzahn: releases: rsync needs to be on all servers incl the primary [puppet] - 10https://gerrit.wikimedia.org/r/620099 (https://phabricator.wikimedia.org/T247652) [19:07:59] (03PS1) 10Andrew Bogott: wmcs/ceph/backy2 specify expiration for backups [puppet] - 10https://gerrit.wikimedia.org/r/620102 
(https://phabricator.wikimedia.org/T259192) [19:08:01] (PS1) Andrew Bogott: wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day [puppet] - https://gerrit.wikimedia.org/r/620103 (https://phabricator.wikimedia.org/T259192) [19:09:29] (CR) jerkins-bot: [V: -1] wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day [puppet] - https://gerrit.wikimedia.org/r/620103 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott) [19:12:07] (PS2) Andrew Bogott: wmcs/ceph/backy2 specify expiration for backups [puppet] - https://gerrit.wikimedia.org/r/620102 (https://phabricator.wikimedia.org/T259192) [19:12:09] (PS2) Andrew Bogott: wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day [puppet] - https://gerrit.wikimedia.org/r/620103 (https://phabricator.wikimedia.org/T259192) [19:13:32] (PS3) Andrew Bogott: wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day [puppet] - https://gerrit.wikimedia.org/r/620103 (https://phabricator.wikimedia.org/T259192) [19:13:40] (CR) jerkins-bot: [V: -1] wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day [puppet] - https://gerrit.wikimedia.org/r/620103 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott) [19:13:48] (PS1) Bartosz Dziewoński: Revert new reply API (again) [extensions/DiscussionTools] (wmf/1.35.0-wmf.4) - https://gerrit.wikimedia.org/r/620029 (https://phabricator.wikimedia.org/T259855) [19:14:22] (CR) Bartosz Dziewoński: [C: -1] Revert new reply API (again) [extensions/DiscussionTools] (wmf/1.35.0-wmf.4) - https://gerrit.wikimedia.org/r/620029 (https://phabricator.wikimedia.org/T259855) (owner: Bartosz Dziewoński) [19:14:47] (CR) jerkins-bot: [V: -1] wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day [puppet] - https://gerrit.wikimedia.org/r/620103 (https://phabricator.wikimedia.org/T259192) 
(owner: Andrew Bogott) [19:14:53] (CR) jerkins-bot: [V: -1] Revert new reply API (again) [extensions/DiscussionTools] (wmf/1.35.0-wmf.4) - https://gerrit.wikimedia.org/r/620029 (https://phabricator.wikimedia.org/T259855) (owner: Bartosz Dziewoński) [19:16:31] (Abandoned) Bartosz Dziewoński: Revert new reply API (again) [extensions/DiscussionTools] (wmf/1.35.0-wmf.4) - https://gerrit.wikimedia.org/r/620029 (https://phabricator.wikimedia.org/T259855) (owner: Bartosz Dziewoński) [19:16:55] (PS1) Bartosz Dziewoński: Revert new reply API (again) [extensions/DiscussionTools] (wmf/1.36.0-wmf.4) - https://gerrit.wikimedia.org/r/620030 (https://phabricator.wikimedia.org/T259855) [19:18:15] (PS3) Andrew Bogott: wmcs/ceph/backy2 specify expiration for backups [puppet] - https://gerrit.wikimedia.org/r/620102 (https://phabricator.wikimedia.org/T259192) [19:18:17] (PS4) Andrew Bogott: wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day [puppet] - https://gerrit.wikimedia.org/r/620103 (https://phabricator.wikimedia.org/T259192) [19:18:19] (PS1) Andrew Bogott: Remove backup role from cloudvirt1004 [puppet] - https://gerrit.wikimedia.org/r/620106 (https://phabricator.wikimedia.org/T259192) [19:18:52] (PS2) Bartosz Dziewoński: Revert new reply API (again) [extensions/DiscussionTools] (wmf/1.36.0-wmf.4) - https://gerrit.wikimedia.org/r/620030 (https://phabricator.wikimedia.org/T259855) [19:19:34] (CR) Bartosz Dziewoński: "(the latest two commits weren't actually in 1.36.0-wmf.4)" [extensions/DiscussionTools] (wmf/1.36.0-wmf.4) - https://gerrit.wikimedia.org/r/620030 (https://phabricator.wikimedia.org/T259855) (owner: Bartosz Dziewoński) [19:19:40] (CR) Andrew Bogott: [C: +2] Remove backup role from cloudvirt1004 [puppet] - https://gerrit.wikimedia.org/r/620106 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott) [19:19:45] (CR) jerkins-bot: [V: -1] wmcs/ceph/backy: add 
timer to run a backup job and a cleanup job once per day [puppet] - https://gerrit.wikimedia.org/r/620103 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott) [19:19:55] (CR) Andrew Bogott: [C: +2] wmcs/ceph/backy2 specify expiration for backups [puppet] - https://gerrit.wikimedia.org/r/620102 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott) [19:20:55] thcipriani: hey, you around? would you be able to deploy that revert in DiscussionTools? (i think James_F talked with you about it) [19:21:05] the patch is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/620030 [19:21:11] (PS5) Andrew Bogott: wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day [puppet] - https://gerrit.wikimedia.org/r/620103 (https://phabricator.wikimedia.org/T259192) [19:22:32] (CR) Andrew Bogott: [C: +2] wmcs/ceph/backy: add timer to run a backup job and a cleanup job once per day [puppet] - https://gerrit.wikimedia.org/r/620103 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott) [19:24:45] (PS1) Andrew Bogott: wmcs/ceph/backy2: use 'root' user to run backups [puppet] - https://gerrit.wikimedia.org/r/620107 (https://phabricator.wikimedia.org/T259192) [19:25:50] MatmaRex: yep I'm around. 
[19:26:17] (CR) Andrew Bogott: [C: +2] wmcs/ceph/backy2: use 'root' user to run backups [puppet] - https://gerrit.wikimedia.org/r/620107 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott) [19:26:19] (CR) Thcipriani: [C: +2] Revert new reply API (again) [extensions/DiscussionTools] (wmf/1.36.0-wmf.4) - https://gerrit.wikimedia.org/r/620030 (https://phabricator.wikimedia.org/T259855) (owner: Bartosz Dziewoński) [19:29:58] (Merged) jenkins-bot: Revert new reply API (again) [extensions/DiscussionTools] (wmf/1.36.0-wmf.4) - https://gerrit.wikimedia.org/r/620030 (https://phabricator.wikimedia.org/T259855) (owner: Bartosz Dziewoński) [19:31:45] MatmaRex: anything to check on mwdebug? It's staged on mwdebug1002 [19:31:53] thcipriani: yeah, i'll look [19:32:12] thanks [19:32:23] (PS1) Dzahn: releases: set releases1001 as primary to sync jenkins config [puppet] - https://gerrit.wikimedia.org/r/620109 (https://phabricator.wikimedia.org/T247652) [19:32:54] thcipriani: seems good [19:32:56] (CR) Dzahn: [C: +2] releases: set releases1001 as primary to sync jenkins config [puppet] - https://gerrit.wikimedia.org/r/620109 (https://phabricator.wikimedia.org/T247652) (owner: Dzahn) [19:33:02] (PS2) Dzahn: releases: set releases1001 as primary to sync jenkins config [puppet] - https://gerrit.wikimedia.org/r/620109 (https://phabricator.wikimedia.org/T247652) [19:33:49] MatmaRex: k, going live [19:33:53] (PS1) Andrew Bogott: wmcs/ceph/backy2: move our cleanup logic into a script [puppet] - https://gerrit.wikimedia.org/r/620110 (https://phabricator.wikimedia.org/T259192) [19:35:06] !log thcipriani@deploy1001 Synchronized php-1.36.0-wmf.4/extensions/DiscussionTools: [[gerrit:620030|Revert new reply API (again)]] T259855 (duration: 00m 57s) [19:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:10] T259855: DiscussionTools touched unrelated parts of the page - 
https://phabricator.wikimedia.org/T259855 [19:35:13] ^ MatmaRex should be live [19:35:22] (CR) Andrew Bogott: [C: +2] wmcs/ceph/backy2: move our cleanup logic into a script [puppet] - https://gerrit.wikimedia.org/r/620110 (https://phabricator.wikimedia.org/T259192) (owner: Andrew Bogott) [19:35:51] thcipriani: thanks [19:37:54] Thank you so much, Tyler. [19:40:25] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:53:15] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [19:57:27] https://www.irccloud.com/pastebin/WygupOIU/ [19:57:34] andrewbogott: Error: Contact group 'wmcs-email' [19:58:18] (PS6) Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [20:04:21] (CR) Dzahn: [C: +2] releases: rsync needs to be on all servers incl the primary [puppet] - https://gerrit.wikimedia.org/r/620099 (https://phabricator.wikimedia.org/T247652) (owner: Dzahn) [20:11:46] Operations, ops-codfw: mc2028 regular and mgmt interface down - https://phabricator.wikimedia.org/T260224 (Papaul) @elukey i spoke to @wiki_willy we do have a spare in place that we can use for mc2028 see link below. 
https://netbox.wikimedia.org/dcim/devices/1109/ [20:12:09] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:12:20] (CR) Cwhite: [C: +2] prometheus: add config tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/619563 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite) [20:13:20] (PS2) Cwhite: prometheus: remove unnecessary define and split mediawiki queries by channel [puppet] - https://gerrit.wikimedia.org/r/619574 (https://phabricator.wikimedia.org/T256418) [20:16:19] (CR) Cwhite: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/619574 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite) [20:16:37] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad total VRPs alert, total VRPs alert, valid ROAs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [20:28:16] mutante: looking... [20:29:43] (PS1) Andrew Bogott: backy2 alerting: use wmcs-team-email, not wmcs-email [puppet] - https://gerrit.wikimedia.org/r/620115 [20:30:04] thanks, ack! [20:31:07] (CR) Andrew Bogott: [C: +2] backy2 alerting: use wmcs-team-email, not wmcs-email [puppet] - https://gerrit.wikimedia.org/r/620115 (owner: Andrew Bogott) [20:33:56] mutante: fixed I think [20:35:25] (CR) Gergő Tisza: [C: +1] "testForAuthentication is used during login, it does not interact with account creation or authentication in any way. 
It doesn't even preve" [mediawiki-config] - https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) (owner: Urbanecm) [20:36:08] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [20:36:36] thanks tgr|away [20:36:39] andrewbogott: ^ confirmed, thx [20:38:22] (CR) Urbanecm: "> Patch Set 2: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) (owner: Urbanecm) [20:49:58] (CR) Gergő Tisza: [C: +1] "That makes sense, before the local account gets created PermissionManager can't really tell whether this is the owner of the local account" [mediawiki-config] - https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) (owner: Urbanecm) [20:53:46] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 54 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:53:57] !log dropping xhgui.xhgui on m2 [20:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:02] (PS4) Ppchelko: Resurrect fluent-bit image [docker-images/production-images] - https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) [20:57:09] (PS7) Ppchelko: Modify api-gateway access logging to conform to schema [deployment-charts] - https://gerrit.wikimedia.org/r/619341 (https://phabricator.wikimedia.org/T251812) [20:58:05] (CR) Ppchelko: "This change is ready for review." 
[docker-images/production-images] - https://gerrit.wikimedia.org/r/619512 (https://phabricator.wikimedia.org/T251812) (owner: Ppchelko) [21:00:04] (CR) Nuria: [C: +1] Refine - temporarily exclude resource-purge from refinement [puppet] - https://gerrit.wikimedia.org/r/620056 (owner: Ottomata) [21:02:44] PROBLEM - MariaDB Replica SQL: m2 on db1117 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table xhgui.xhgui doesnt exist on query. Default database: xhgui. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:04:16] ACKNOWLEDGEMENT - MariaDB Replica SQL: m2 on db1117 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table xhgui.xhgui doesnt exist on query. Default database: xhgui. [Query snipped] Kormat known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:04:44] PROBLEM - MariaDB Replica SQL: m2 on db2133 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table xhgui.xhgui doesnt exist on query. Default database: xhgui. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:05:29] this is being talked about in -databases ^ [21:05:49] affects m2 because of work on xhgui, it is known already [21:11:24] !log rsyncing /var/lib/jenkins from releases1001 to releases1002 and then all other releases* servers. 
57GB, overwriting existing data from manual config (T247652) [21:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:28] T247652: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 [21:12:26] RECOVERY - MariaDB Replica SQL: m2 on db2133 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:14:16] RECOVERY - MariaDB Replica SQL: m2 on db1117 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:31:59] (PS1) BryanDavis: domainproxy: enforce TLS by default [puppet] - https://gerrit.wikimedia.org/r/620122 (https://phabricator.wikimedia.org/T120486) [21:35:14] nice ^ [21:36:33] (CR) BryanDavis: "We really never saw any problems of note when we enabled this same functionality for tools.wmflabs.org. It feels safe to me to merge and d" [puppet] - https://gerrit.wikimedia.org/r/620122 (https://phabricator.wikimedia.org/T120486) (owner: BryanDavis) [21:37:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:41:04] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 50 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:45:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:46:03] (PS1) Dzahn: Revert "releases: set releases1001 as primary to sync jenkins config" [puppet] - https://gerrit.wikimedia.org/r/620032 [21:46:23] (CR) Dzahn: [C: +2] "revert means "as planned" here :)" [puppet] - https://gerrit.wikimedia.org/r/620032 (owner: Dzahn) [21:50:25] !log andrew@deploy1001 Started deploy [horizon/deploy@f3dcb29]: fix proxy in project-local domain --bug T260388 [21:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:29] T260388: HTTP proxy can't be created matching a Cloud VPS project name if it has a Designate zone as well - https://phabricator.wikimedia.org/T260388 [21:50:56] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:51:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:54:06] (PS1) Dave Pifke: xhgui: increase PHP memory limit to 512MB [puppet] - https://gerrit.wikimedia.org/r/620126 (https://phabricator.wikimedia.org/T180761) [21:54:17] !log andrew@deploy1001 Finished deploy [horizon/deploy@f3dcb29]: fix proxy in project-local domain --bug T260388 (duration: 03m 53s) [21:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:42] (CR) Bstorm: "Just for the sake of it, here's PCC: https://puppet-compiler.wmflabs.org/compiler1002/24489/" [puppet] - https://gerrit.wikimedia.org/r/620122 (https://phabricator.wikimedia.org/T120486) (owner: BryanDavis) [21:55:55] (CR) Dzahn: [C: +2] xhgui: increase PHP memory limit to 512MB [puppet] - https://gerrit.wikimedia.org/r/620126 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke) [21:57:36] (CR) Dzahn: "[xhgui1001:~] $ grep memory_limit /etc/php/7.3/apache2/php.ini" [puppet] - https://gerrit.wikimedia.org/r/620126 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke) [22:00:23] (CR) Dave Pifke: "Before:" [puppet] - https://gerrit.wikimedia.org/r/620126 (https://phabricator.wikimedia.org/T180761) (owner: Dave Pifke) [22:00:32] mutante: With that memory limit fix, I think we're now good to go flipping the XHGui front end. Everything else seems to be working. [22:00:55] dpifke: very nice! 
let's go [22:01:15] (PS8) Dzahn: webperf: switch xhgui_host from tungsten to xhgui1001 [puppet] - https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) [22:01:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:01:52] (CR) Dzahn: [C: +2] webperf: switch xhgui_host from tungsten to xhgui1001 [puppet] - https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) (owner: Dzahn) [22:03:41] !log switching xhgui from tungsten to xhgui1001 - ran puppet on webperf*001 - T180761 T158837 [22:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:45] T180761: Move XHGui from tungsten to xhgui-001 - https://phabricator.wikimedia.org/T180761 [22:03:45] T158837: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 [22:03:47] dpifke: should be done [22:04:20] looks at https://performance.wikimedia.org/xhgui/ [22:04:25] Yup, new instance is getting traffic. [22:04:35] :) [22:05:17] thanks for all your work on this. it has been a long way coming since we were on Mongo and jessie :) [22:07:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:07:30] No problem, thanks for all the work on your end too. [22:07:50] yw. 
of course now we wait for a couple days before we actually shut down tungsten [22:07:58] but it will be nice to remove it [22:08:10] such an old host [22:32:36] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:36:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:37:26] (PS1) Dzahn: webperf: remove the xhgui_old_host parameter [puppet] - https://gerrit.wikimedia.org/r/620128 (https://phabricator.wikimedia.org/T180761) [22:37:30] (PS1) Dzahn: remove tungsten from site, DHCP and partman [puppet] - https://gerrit.wikimedia.org/r/620129 (https://phabricator.wikimedia.org/T260395) [22:37:32] (PS1) Dzahn: delete role::xhgui::app [puppet] - https://gerrit.wikimedia.org/r/620130 (https://phabricator.wikimedia.org/T260395) [22:37:34] (PS1) Dzahn: base: remove tungsten from check-microcode.py [puppet] - https://gerrit.wikimedia.org/r/620131 (https://phabricator.wikimedia.org/T260395) [22:42:08] Operations, LDAP-Access-Requests, WMF-NDA-Requests: Add DVrandecic to group nda - https://phabricator.wikimedia.org/T260279 (Dzahn) It's either "wmf" if you are staff OR it is "nda" if you are a volunteer. 
[22:44:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:44:26] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 47 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:59:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:05:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:11:42] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:25:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:39:10] !log removing 3 files for legal compliance [23:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:42:41] (PS1) Dzahn: releases: reprepro rsync needs to be on all servers [puppet] - https://gerrit.wikimedia.org/r/620135 (https://phabricator.wikimedia.org/T247652) [23:45:04] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:46:37] (CR) Dzahn: [C: +2] releases: reprepro rsync needs to be on all servers [puppet] - https://gerrit.wikimedia.org/r/620135 (https://phabricator.wikimedia.org/T247652) (owner: Dzahn) [23:47:58] Operations, vm-requests, Performance-Team (Radar): More RAM needed for webperf1002 and webperf2002 - https://phabricator.wikimedia.org/T260192 (dpifke) [23:51:20] (PS9) Ryan Kemper: elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 [23:52:27] (CR) jerkins-bot: [V: -1] elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (owner: Ryan Kemper) [23:53:17] (CR) Ryan Kemper: "Responded to review." (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/603731 (owner: Ryan Kemper) [23:54:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets