[00:08:57] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:09:10] !log wikitech - make JBond a "content administrator" to give the ability to create server fingerprint pages
[00:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:01] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921 (Papaul) mgmt DNS removed by @Cmjohnson already https://gerrit.wikimedia.org/r/#/c/operations/dns/+/531295/
[00:19:14] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921 (Papaul)
[00:19:45] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decom californium - https://phabricator.wikimedia.org/T189921 (Papaul) Open→Resolved Complete
[00:22:28] Operations, DNS, Toolforge, Traffic, cloud-services-team (Kanban): Update authoratiative nameservers for the toolforge.org domain to point to Designate - https://phabricator.wikimedia.org/T235303 (Krenair)
[00:30:11] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:30:17] PROBLEM - Disk space on netflow2001 is CRITICAL: DISK CRITICAL - free space: / 302 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops
[01:23:50] Operations, Wikimedia-Mailing-lists, I18n, RTL: Make pipermail show RTL emails better by emitting dir=auto - https://phabricator.wikimedia.org/T235458 (Bawolff) Looking at puppet, it looks like we already have some customization in modules/mailman/files/templates/* - so at first glance, I assume...
[01:33:36] (PS1) Brian Wolff: Customize article.html for better bidi support [puppet] - https://gerrit.wikimedia.org/r/543252 (https://phabricator.wikimedia.org/T235458)
[01:34:16] (CR) Brian Wolff: "Just to be clear: mailman is a bit outside my wheel house. I think this will work, but I know very little about mailman." [puppet] - https://gerrit.wikimedia.org/r/543252 (https://phabricator.wikimedia.org/T235458) (owner: Brian Wolff)
[01:34:58] Operations, Wikimedia-Mailing-lists, I18n, Patch-For-Review, RTL: Make pipermail show RTL emails better by emitting dir=auto - https://phabricator.wikimedia.org/T235458 (crusnov) Okay cool, like I said I'm not completely versed :)
[01:53:46] Operations, Phabricator, Release-Engineering-Team-TODO, Traffic, and 2 others: Prepare Phame to support heavy traffic for a Tech Department blog - https://phabricator.wikimedia.org/T226044 (Krinkle)
[02:49:59] (PS1) BryanDavis: toolforge: exclude grid engine TMPDIR directories from tmpreaper [puppet] - https://gerrit.wikimedia.org/r/543266 (https://phabricator.wikimedia.org/T217815)
[02:53:55] (CR) Jforrester: [C: +1] "This is nice. Note that we're going to have to decide what we think the YAML build step counts as; I think it's reasonable to be in requir" [mediawiki-config] - https://gerrit.wikimedia.org/r/542658 (owner: Krinkle)
[03:05:34] (CR) BryanDavis: "See https://phabricator.wikimedia.org/T217815#5578315 for an example shell session showing how `--protect` does what the commit message of" [puppet] - https://gerrit.wikimedia.org/r/543266 (https://phabricator.wikimedia.org/T217815) (owner: BryanDavis)
[03:19:09] PROBLEM - Check the Netbox report librenms for fail status. on netbox1001 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[03:20:36] Operations, observability, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (bd808)
[03:35:37] !log mobrovac@deploy1001 Started deploy [restbase/deploy@320f3a5]: Parsoid: Use the ETag for retrieving stashed content - T235465
[03:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:35:42] T235465: Stashing: revid mismatch between URI and Etag - https://phabricator.wikimedia.org/T235465
[03:49:14] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@320f3a5]: Parsoid: Use the ETag for retrieving stashed content - T235465 (duration: 13m 37s)
[03:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:49:18] T235465: Stashing: revid mismatch between URI and Etag - https://phabricator.wikimedia.org/T235465
[03:53:11] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:06:08] Operations, MediaWiki-REST-API, Parsoid-PHP, Traffic, and 2 others: Varnish/ATS should not decode URIs for /w/rest.php - https://phabricator.wikimedia.org/T235478 (mobrovac) >>! In T235478#5576774, @ema wrote: > This quarter we will carry on with the conversion of the cache_text cluster from Varn...
[04:14:25] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:51:43] (PS1) BryanDavis: cloud: Replace SSHSessions diamond collector with prometheus [puppet] - https://gerrit.wikimedia.org/r/543268 (https://phabricator.wikimedia.org/T210993)
[04:57:14] (CR) BryanDavis: "We will need to fix any grafana dashboards that are reading SSHSessionsCollector.open_sessions from graphite after deploying this. I was k" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/543268 (https://phabricator.wikimedia.org/T210993) (owner: BryanDavis)
[05:06:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079 for schema change', diff saved to https://phabricator.wikimedia.org/P9358 and previous config saved to /var/cache/conftool/dbconfig/20191016-050627-marostegui.json
[05:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:05] !log Deploy schema change on s7 sanitarium master (db1079) this will create lag on s7 labsdb T233135 T234066
[05:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:09] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[05:08:10] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[05:11:32] !log Change s2 triggers for archive table from db1125:3312 T234704
[05:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:35] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704
[05:14:50] !log Change s7 triggers for archive table from db1125:3317 T234704
[05:14:55] !log Change s7 triggers for archive table from db1125:3317 T234704
[05:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 for schema change', diff saved to https://phabricator.wikimedia.org/P9359 and previous config saved to /var/cache/conftool/dbconfig/20191016-051812-marostegui.json
[05:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:28] !log Deploy schema change on s2 sanitarium master (db1074) this will create lag on s2 labsdb T233135 T234066
[05:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:37] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135
[05:18:37] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066
[05:21:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3311 for schema change', diff saved to https://phabricator.wikimedia.org/P9360 and previous config saved to /var/cache/conftool/dbconfig/20191016-052104-marostegui.json
[05:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:15] (PS1) Marostegui: site.pp: Remove puppet references from db2066 [puppet] - https://gerrit.wikimedia.org/r/543269 (https://phabricator.wikimedia.org/T230885)
[05:26:37] (PS1) Marostegui: wmnet: Remove production DNS entries for db2066 [dns] - https://gerrit.wikimedia.org/r/543270 (https://phabricator.wikimedia.org/T230885)
[05:26:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:56] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2066.codfw.wmnet` - db2066.codfw.wmnet (**PASS**)...
[05:27:47] (CR) Marostegui: [C: +2] site.pp: Remove puppet references from db2066 [puppet] - https://gerrit.wikimedia.org/r/543269 (https://phabricator.wikimedia.org/T230885) (owner: Marostegui)
[05:28:10] (CR) Marostegui: [C: +2] wmnet: Remove production DNS entries for db2066 [dns] - https://gerrit.wikimedia.org/r/543270 (https://phabricator.wikimedia.org/T230885) (owner: Marostegui)
[05:28:56] Operations, ops-codfw, DC-Ops, decommission: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (Marostegui) a:RobH→Papaul
[05:29:10] Operations, ops-codfw, DC-Ops, decommission: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (Marostegui) Host ready for on-site steps + switch disablement
[06:15:15] <_joe_> !log upgrading envoyproxy in production to 1.11.2 T235412
[06:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:20] T235412: Upgrade the envoyproxy package to its latest version. - https://phabricator.wikimedia.org/T235412
[06:32:13] (PS1) Vgutierrez: Testing buffer_upload experimental plugin - do not merge [debs/trafficserver] - https://gerrit.wikimedia.org/r/543271 (https://phabricator.wikimedia.org/T234887)
[06:32:37] (PS2) Muehlenhoff: Add library hint for fribidi [puppet] - https://gerrit.wikimedia.org/r/543163
[06:41:13] (CR) Muehlenhoff: [C: +2] Add library hint for fribidi [puppet] - https://gerrit.wikimedia.org/r/543163 (owner: Muehlenhoff)
[06:44:37] (PS1) Muehlenhoff: Reenable puppetmaster2002 after hardware maintenance [puppet] - https://gerrit.wikimedia.org/r/543272 (https://phabricator.wikimedia.org/T235250)
[06:49:19] (CR) Muehlenhoff: [C: +2] Reenable puppetmaster2002 after hardware maintenance [puppet] - https://gerrit.wikimedia.org/r/543272 (https://phabricator.wikimedia.org/T235250) (owner: Muehlenhoff)
[07:10:01] (CR) Muehlenhoff: [C: +1] admins: add shell account for Reuven Lazarus [puppet] - https://gerrit.wikimedia.org/r/543204 (https://phabricator.wikimedia.org/T235215) (owner: Dzahn)
[07:24:08] (CR) Muehlenhoff: [C: +1] "Looks great" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/542983 (https://phabricator.wikimedia.org/T235162) (owner: Jbond)
[07:36:27] (CR) Muehlenhoff: profile::base: add adduser module to profile:base (1 comment) [puppet] - https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) (owner: Jbond)
[07:44:41] Operations, Mail, Wikimedia-Mailing-lists: mass Yahoo / AOL bounces mailman - https://phabricator.wikimedia.org/T232417 (Lea_Lacroix_WMDE) FYI, a new "AOL spam attack" started yesterday on the wikidata mailing-list. I tried to counter it by changing the //subscribe_policy// parameter to "confirm", b...
[07:44:51] (PS2) Jbond: nrpe::check_puppetrun: update check to responde correctly with alert_master_fail [puppet] - https://gerrit.wikimedia.org/r/543127
[07:47:37] (PS1) Elukey: aqs: replace logstash host/port with rsyslog localhost/port [puppet] - https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928)
[07:48:07] (CR) Jbond: [C: +2] nrpe::check_puppetrun: update check to responde correctly with alert_master_fail [puppet] - https://gerrit.wikimedia.org/r/543127 (owner: Jbond)
[07:50:28] (PS2) Elukey: aqs: replace logstash host/port with rsyslog localhost/port [puppet] - https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928)
[07:50:30] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/542938 (owner: Jbond)
[07:55:47] (CR) Jbond: [C: +2] adduser: create module to manage /etc/adduser.conf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/542983 (https://phabricator.wikimedia.org/T235162) (owner: Jbond)
[07:55:58] (PS3) Jbond: adduser: create module to manage /etc/adduser.conf [puppet] - https://gerrit.wikimedia.org/r/542983 (https://phabricator.wikimedia.org/T235162)
[08:01:08] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/18884/aqs1008.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) (owner: Elukey)
[08:03:04] Operations, Analytics, Analytics-Kanban, Wikimedia-Logstash, and 6 others: Move AQS logging to new logging pipeline - https://phabricator.wikimedia.org/T219928 (elukey) a:elukey
[08:06:43] (CR) Alexandros Kosiaris: [C: +1] "It will indeed stop the service. Nice catch about the mask, btw" [puppet] - https://gerrit.wikimedia.org/r/543131 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[08:08:10] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:09:28] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/542940 (owner: Jbond)
[08:11:04] (PS1) Jbond: rake_module: update spec_helper to use nuyaml3 [puppet] - https://gerrit.wikimedia.org/r/543279
[08:12:31] (CR) Jbond: profile::base: add adduser module to profile:base (2 comments) [puppet] - https://gerrit.wikimedia.org/r/542984 (https://phabricator.wikimedia.org/T235162) (owner: Jbond)
[08:17:57] (CR) Alexandros Kosiaris: [C: +1] "LGTM, with a minor comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/543129 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[08:18:33] (PS1) Jbond: puppet/config-master: migrate puppet and config-master to eqiad [dns] - https://gerrit.wikimedia.org/r/543280
[08:19:22] (PS1) Jbond: puppet/config-master: migrate puppet and config-master to eqiad [puppet] - https://gerrit.wikimedia.org/r/543281
[08:20:02] (CR) jerkins-bot: [V: -1] puppet/config-master: migrate puppet and config-master to eqiad [puppet] - https://gerrit.wikimedia.org/r/543281 (owner: Jbond)
[08:21:36] (PS2) Jbond: puppet/config-master: migrate puppet and config-master to eqiad [puppet] - https://gerrit.wikimedia.org/r/543281 (https://phabricator.wikimedia.org/T235250)
[08:22:36] (PS2) Jbond: profile::base: remove puppetmaster parameter [puppet] - https://gerrit.wikimedia.org/r/542940
[08:23:25] (CR) Effie Mouzeli: hhvm: stop monitoring hhvm (1 comment) [puppet] - https://gerrit.wikimedia.org/r/543129 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[08:23:33] (CR) Effie Mouzeli: [C: +2] hhvm: stop monitoring hhvm [puppet] - https://gerrit.wikimedia.org/r/543129 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[08:23:43] (PS3) Effie Mouzeli: hhvm: stop monitoring hhvm [puppet] - https://gerrit.wikimedia.org/r/543129 (https://phabricator.wikimedia.org/T229792)
[08:24:43] (CR) Jbond: [C: +2] profile::base: remove puppetmaster parameter [puppet] - https://gerrit.wikimedia.org/r/542940 (owner: Jbond)
[08:26:10] (PS4) Effie Mouzeli: hhvm: stop monitoring hhvm [puppet] - https://gerrit.wikimedia.org/r/543129 (https://phabricator.wikimedia.org/T229792)
[08:27:27] (CR) Jbond: [C: +2] profile::base::puppet: move defaults to hiera [puppet] - https://gerrit.wikimedia.org/r/542938 (owner: Jbond)
[08:27:42] (PS3) Jbond: profile::base::puppet: move defaults to hiera [puppet] - https://gerrit.wikimedia.org/r/542938
[08:29:22] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:35:10] PROBLEM - HHVM processes on mw2251 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:35:14] PROBLEM - HHVM processes on mw2231 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:35:36] PROBLEM - HHVM processes on mw1233 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:36:02] missing a puppet run on icinga?
[08:36:08] effie: --^ they don't want to let go :D
[08:36:14] PROBLEM - HHVM processes on mw1263 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:36:14] PROBLEM - HHVM processes on mw1274 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:36:21] yep probably
[08:37:24] <_joe_> effie: yeah you need to remove the monitoring check first
[08:37:35] <_joe_> run puppet everywhere, then on icinga
[08:37:46] PROBLEM - HHVM processes on mw2208 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:37:54] PROBLEM - HHVM processes on mw2286 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:38:02] PROBLEM - HHVM processes on mw1255 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:39:50] PROBLEM - HHVM processes on mw2226 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:40:12] PROBLEM - HHVM processes on mw2203 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:40:18] ^ that is nme
[08:40:20] PROBLEM - HHVM processes on mw2283 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:40:32] PROBLEM - HHVM processes on mw2230 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:40:38] PROBLEM - HHVM processes on mw1277 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:40:46] PROBLEM - HHVM processes on mw1221 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:40:46] PROBLEM - HHVM processes on mw1297 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:40:56] _joe_: yeap I know
[08:41:02] PROBLEM - HHVM processes on mw2193 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:41:06] PROBLEM - HHVM processes on mw1234 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:41:12] PROBLEM - HHVM processes on mw2215 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:41:34] PROBLEM - HHVM processes on mw1224 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:41:38] PROBLEM - HHVM processes on mw1251 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:41:42] PROBLEM - HHVM processes on mw2185 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:41:48] PROBLEM - HHVM processes on mw1267 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:42:14] PROBLEM - HHVM processes on mw1235 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:42:48] PROBLEM - HHVM processes on mw1331 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[08:42:53] (PS1) KartikMistry: Update cxserver to 2019-10-15-091114-production [deployment-charts] - https://gerrit.wikimedia.org/r/543386 (https://phabricator.wikimedia.org/T217585)
[08:45:50] _joe_: on a brighter side, this is the last time you will ever see an HHVM alert
[08:46:03] :D
[08:46:07] <_joe_> :(
[08:46:15] <_joe_> bye bye old foe
[08:46:23] <_joe_> we have a new one now
[08:46:45] <_joe_> farewells are always nostalgic
[08:49:06] Operations, service-runner, serviceops, CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (jijiki) I will agree with the poolcounter solution :)
[08:55:31] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[08:55:32] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:58] (PS4) Jbond: profile::base::puppet: move defaults to hiera [puppet] - https://gerrit.wikimedia.org/r/542938
[09:04:34] (CR) Jbond: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/543128 (https://phabricator.wikimedia.org/T235479) (owner: Muehlenhoff)
[09:05:16] (CR) Jbond: [C: +2] profile::base::puppet: move defaults to hiera [puppet] - https://gerrit.wikimedia.org/r/542938 (owner: Jbond)
[09:05:19] (PS3) Muehlenhoff: Add DNS record for idp2001 [dns] - https://gerrit.wikimedia.org/r/543128 (https://phabricator.wikimedia.org/T235479)
[09:05:30] (PS5) Jbond: profile::base::puppet: move defaults to hiera [puppet] - https://gerrit.wikimedia.org/r/542938
[09:09:57] ACKNOWLEDGEMENT - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - enwiki_content_1546970425[3](2019-10-09T14:42:44.498Z) Gehel allocation watermark increased for a few minutes, this shard is now recovering https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:10:41] (PS3) Jbond: profile::base::puppet: ensure variables all exist in module namespace [puppet] - https://gerrit.wikimedia.org/r/542944
[09:13:44] (CR) Muehlenhoff: [C: +2] Add DNS record for idp2001 [dns] - https://gerrit.wikimedia.org/r/543128 (https://phabricator.wikimedia.org/T235479) (owner: Muehlenhoff)
[09:14:28] PROBLEM - Host tools.wmflabs.org is DOWN: CRITICAL - Host Unreachable (tools.wmflabs.org)
[09:14:52] ^ arturo
[09:15:14] :-/
[09:20:12] !log force merging commonswiki_content on elasticsearch codfw
[09:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:42] RECOVERY - ElasticSearch shard size check - 9243 on search.svc.codfw.wmnet is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23If_it_has_been_indexed
[09:20:52] (CR) Effie Mouzeli: [C: +2] hhvm: stop hhvm service from all hosts [puppet] - https://gerrit.wikimedia.org/r/543131 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[09:22:15] !log Disable puppet on mw* hosts to merge 543131
[09:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:44] RECOVERY - Host tools.wmflabs.org is UP: PING OK - Packet loss = 0%, RTA = 2.10 ms
[09:23:50] Operations, Traffic, Patch-For-Review: ATS-tls nodes on the text cluster have a slightly higher rate of failed fetches on varnish-fe - https://phabricator.wikimedia.org/T234887 (Vgutierrez) As reported to upstream [[ https://github.com/apache/trafficserver/issues/6018#issuecomment-542590620 | here ]]...
[09:25:55] (CR) Effie Mouzeli: [C: +2] "Forgot to post https://puppet-compiler.wmflabs.org/compiler1001/18882/" [puppet] - https://gerrit.wikimedia.org/r/543131 (https://phabricator.wikimedia.org/T229792) (owner: Effie Mouzeli)
[09:26:25] Operations, serviceops, Kubernetes: Upgrade the envoyproxy package to its latest version. - https://phabricator.wikimedia.org/T235412 (Joe) Open→Resolved All servers in production are upgraded.
[09:26:28] Operations, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (Joe)
[09:27:01] !log Disable puppet on all hosts running hhvm to merge 543131 - T229792
[09:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:05] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792
[09:29:42] (PS4) Effie Mouzeli: hhvm: stop hhvm service from all hosts [puppet] - https://gerrit.wikimedia.org/r/543131 (https://phabricator.wikimedia.org/T229792)
[09:34:04] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-upload site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:36:28] <_joe_> that is sparql
[09:36:50] !log restart fastnetmon on netflow2001
[09:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:54] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[09:37:04] RECOVERY - Disk space on netflow2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops
[09:37:40] (PS1) Ayounsi: Logrotate fastnetmon logs [puppet] - https://gerrit.wikimedia.org/r/543392
[09:39:47] (CR) Giuseppe Lavagetto: [C: +1] Logrotate fastnetmon logs [puppet] - https://gerrit.wikimedia.org/r/543392 (owner: Ayounsi)
[09:40:28] !log enable puppet on all hosts running hhvm - T229792
[09:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:32] T229792: Remove HHVM from production - https://phabricator.wikimedia.org/T229792
[09:43:53] (CR) Ayounsi: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/18888/" [puppet] - https://gerrit.wikimedia.org/r/543392 (owner: Ayounsi)
[09:44:03] (PS2) Ayounsi: Logrotate fastnetmon logs [puppet] - https://gerrit.wikimedia.org/r/543392
[09:44:08] (CR) Ayounsi: [C: +2] Add config.yaml env/ and output/ to gitignore [software/homer] - https://gerrit.wikimedia.org/r/543160 (owner: Ayounsi)
[09:47:11] (Merged) jenkins-bot: Add config.yaml env/ and output/ to gitignore [software/homer] - https://gerrit.wikimedia.org/r/543160 (owner: Ayounsi)
[09:47:18] PROBLEM - Check systemd state on mw1299 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:32] PROBLEM - Check systemd state on mw1296 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:44] that could be me
[09:48:46] checking
[09:50:32] ok it is hhvm taking a lot of time to be killed
[09:51:08] PROBLEM - Check systemd state on mw1335 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:08] PROBLEM - Check systemd state on mw1337 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:42] PROBLEM - Check systemd state on mw1306 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:52:02] PROBLEM - Check systemd state on mw1334 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:52:02] PROBLEM - Check systemd state on mw1302 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:52:17] sigh
[09:52:30] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:52:50] PROBLEM - Check systemd state on mw1304 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:53:12] PROBLEM - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:53:38] PROBLEM - Check systemd state on mw1336 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:54:48] RECOVERY - Check systemd state on mw1306 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:54:49] effie: this needs a "systemctl reset-failed hhvm", as the absent on the service removes the unit
[09:55:00] ACKNOWLEDGEMENT - Check systemd state on labweb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli They will recover soon, hhvm times out stopping https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:00] ACKNOWLEDGEMENT - Check systemd state on mw1296 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli They will recover soon, hhvm times out stopping https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:00] ACKNOWLEDGEMENT - Check systemd state on mw1299 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli They will recover soon, hhvm times out stopping https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:00] ACKNOWLEDGEMENT - Check systemd state on mw1302 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli They will recover soon, hhvm times out stopping https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:00] ACKNOWLEDGEMENT - Check systemd state on mw1304 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli They will recover soon, hhvm times out stopping https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:01] ACKNOWLEDGEMENT - Check systemd state on mw1334 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli They will recover soon, hhvm times out stopping https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:01] ACKNOWLEDGEMENT - Check systemd state on mw1335 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli They will recover soon, hhvm times out stopping https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:02] ACKNOWLEDGEMENT - Check systemd state on mw1336 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli They will recover soon, hhvm times out stopping https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:02] ACKNOWLEDGEMENT - Check systemd state on mw1337 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli They will recover soon, hhvm times out stopping https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:55:03] ACKNOWLEDGEMENT - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Effie Mouzeli They will recover soon, hhvm times out stopping https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:58:10] moritzm: thank you!
[09:58:58] PROBLEM - Check systemd state on mw1318 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:32] RECOVERY - Check systemd state on mw1318 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:43] that damn thing is not going down without a fight
[10:00:44] PROBLEM - Check systemd state on mw1305 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:00] PROBLEM - HHVM processes on mwdebug2002 is CRITICAL: NRPE: Command check_hhvm not defined https://wikitech.wikimedia.org/wiki/Application_servers
[10:01:26] RECOVERY - Check systemd state on mw1299 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:26] RECOVERY - Check systemd state on mw1334 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:26] RECOVERY - Check systemd state on mw1302 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:26] RECOVERY - Check systemd state on mw1336 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:06] RECOVERY - Check systemd state on mw1335 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:08] RECOVERY - Check systemd state on mw1337 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:12] RECOVERY - Check systemd state on mw1304 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:20] RECOVERY - Check systemd state on mw1305 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:39] effie: I didn't really make the correction during the code review either :-/ We had some transition code for the gradual diamond removal to properly handle the service removal for systemd as well, but at this point simply run the command via Cumin [10:02:42] RECOVERY - Check systemd state on mw1296 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:06] moritzm: that is what I did :D [10:03:31] ack :-) [10:03:38] it was funny though, it run ok on a couple of hosts I run it [10:03:48] and then it fought like a dog on others [10:04:33] at least it wasnt on all mw hosts :p [10:04:46] 10Operations, 10Research, 10SRE-Access-Requests: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10elukey) Added `dedcode` to the wmf LDAP group after getting a request from @MGerlach. [10:05:01] 10Operations, 10Research, 10SRE-Access-Requests: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10MGerlach) 05Resolved→03Open @DED who just joined the research team is having the same issue. I think he needs to be added to LDAP-group. @elukey - C... [10:05:46] 10Operations, 10Research, 10SRE-Access-Requests: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10MGerlach) 05Open→03Resolved Sorry, didnt see it was already done. Closed [10:07:33] PROBLEM - Check systemd state on mw2246 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:09:40] (03CR) 10Lucas Werkmeister (WMDE): "Can this be synced as one change or does it need to be split in order to ensure the correct order?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/543180 (https://phabricator.wikimedia.org/T87915) (owner: 10Lucas Werkmeister (WMDE)) [10:09:43] RECOVERY - Check systemd state on mw2246 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:05] PROBLEM - Check systemd state on labweb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:10] !log Stop replication on s2 codfw master for schema change and to modify sanitarium triggers T234066 T233135 T234704 [10:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:17] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [10:17:17] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [10:17:17] T234704: Remove ar_comment from sanitarium triggers - https://phabricator.wikimedia.org/T234704 [10:20:26] (03CR) 10Mobrovac: [C: 04-1] "s/@aqs/@cee/ otherwise LGTM." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) (owner: 10Elukey) [10:22:45] (03CR) 10Elukey: aqs: replace logstash host/port with rsyslog localhost/port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) (owner: 10Elukey) [10:24:05] (03PS3) 10Elukey: aqs: replace logstash host/port with rsyslog localhost/port [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) [10:27:32] (03CR) 10Elukey: "> Probably outside the scope of this patchset, but it would nice to" [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) (owner: 10Elukey) [10:31:00] !log upload prometheus-memcached-exporter 0.4.1+git20181010.2fa99eb-1+deb10u1 to buster-wikimedia - T213089 [10:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:05] T213089: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 [10:39:50] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey, 10User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) Interesting reading: https://github.com/facebook/mcrouter/wiki/Shadowing-setup [10:43:51] (03CR) 10Mobrovac: [C: 03+1] aqs: replace logstash host/port with rsyslog localhost/port [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) (owner: 10Elukey) [10:48:34] <_joe_> !log upgrading confd to 0.16.0 across the cluster. T147204. 
confd will be restarted on the next puppet run [10:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:38] T147204: Update confd package - https://phabricator.wikimedia.org/T147204 [10:51:59] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) Reminder: ` # TODO The IPv6 IP should be converted into a DNS AAAA resolve once we # enabled the DNS record on the director ` [10:53:16] (03PS1) 10Muehlenhoff: Extend wmf-userschema for additional MFA options: [puppet] - 10https://gerrit.wikimedia.org/r/543402 [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191016T1100). [11:00:04] Lucas_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:13] o/ [11:01:05] looks like the `fatalmonitor` script is gone now [11:04:44] (03PS2) 10Lucas Werkmeister (WMDE): Configure Citoid+Wikibase integration on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543145 (https://phabricator.wikimedia.org/T228412) [11:05:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543145 (https://phabricator.wikimedia.org/T228412) (owner: 10Lucas Werkmeister (WMDE)) [11:05:59] (03Merged) 10jenkins-bot: Configure Citoid+Wikibase integration on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543145 (https://phabricator.wikimedia.org/T228412) (owner: 10Lucas Werkmeister (WMDE)) [11:06:47] testing on mwdebug1002 [11:06:55] oh wait, no, we were supposed to use mwdebug1001 [11:07:03] PROBLEM - Host cp4025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:07:03] PROBLEM - Host cp4027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:07:04] <_joe_> I was about to say [11:07:09] * Lucas_WMDE rereads krinkle’s email [11:07:13] PROBLEM - Host dns4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:07:37] PROBLEM - Host ripe-atlas-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [11:07:37] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [11:07:48] well, I already did the scap pull on mwdebug1002, hopefully that doesn’t harm anything [11:07:59] PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:08:00] (I’ll have to remember to scap pull any other changes to it, if there are problems with this one) [11:08:07] PROBLEM - Host lvs4007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:08:11] PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [11:08:12] ^ expected [11:08:12] are those PROBLEMs likely to be SWAT-related? 
[11:08:17] PROBLEM - ps1-23-ulsfo-infeed-load-tower-B-single-phase on ps1-23-ulsfo is CRITICAL: SNMP CRITICAL - ps1-23-ulsfo-infeed-load-tower-B-single-phase *-1* https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:08:17] ok [11:08:37] PROBLEM - Host lvs4005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:08:53] testing on mwdebug1001, then [11:09:01] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:09:21] PROBLEM - Host cp4031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:09:25] hmmm that's the ulsfo maintenance [11:09:27] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 38, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:09:39] Lucas_WMDE: there's maintenance in the SF data center unrelated to SWAT [11:10:01] PROBLEM - Host re0.cr3-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [11:10:12] ok, thanks [11:10:27] that’s *-ulsfo and 4* hosts, right?
[11:10:35] yes :) [11:10:51] ok, then that covers all the problems above [11:11:11] PROBLEM - Host cp4021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:11:11] PROBLEM - Host cp4029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:11:47] PROBLEM - Host ganeti4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:11:49] PROBLEM - Host ganeti4003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [11:14:29] first config change seems to be working and logs look fine, syncing [11:14:33] <_joe_> !log purging confd from wtp* servers, not needed anymore [11:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:23] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:543145|Configure Citoid+Wikibase integration on Test Wikidata (T228412)]] (duration: 01m 13s) [11:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:27] T228412: Deploy Citoid Wikibase integration to Test Wikidata - https://phabricator.wikimedia.org/T228412 [11:16:35] (03PS3) 10Lucas Werkmeister (WMDE): extension-list: Load FlaggedRevs via extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543180 (https://phabricator.wikimedia.org/T87915) [11:17:36] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543180 (https://phabricator.wikimedia.org/T87915) (owner: 10Lucas Werkmeister (WMDE)) [11:18:23] (03Merged) 10jenkins-bot: extension-list: Load FlaggedRevs via extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543180 (https://phabricator.wikimedia.org/T87915) (owner: 10Lucas Werkmeister (WMDE)) [11:21:14] 04Critical Alert for device asw2-ulsfo.mgmt.ulsfo.wmnet - Juniper alarm active [11:24:14] second change seems fine on mwdebug1001 [11:24:26] and as far as I can tell it should be safe to sync in one go [11:25:02] 10Operations, 10serviceops, 10Beta-Cluster-reproducible, 10Patch-For-Review, 
10User-Joe: Update confd package - https://phabricator.wikimedia.org/T147204 (10Joe) 05Open→03Resolved All stretch+ servers in production have been updated to the newer version. Jessie hosts should go away soon. [11:26:51] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/: SWAT: [[gerrit:543180|extension-list: Load FlaggedRevs via extension.json (T87915, T139800, T140852)]] (duration: 01m 05s) [11:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:57] T139800: Update wmf-config/extension-list to use extension.json where available - https://phabricator.wikimedia.org/T139800 [11:26:58] T87915: Convert FlaggedRevs to use extension registration - https://phabricator.wikimedia.org/T87915 [11:26:58] T140852: Load all Wikimedia-deployed extensions and skins via extension registration - https://phabricator.wikimedia.org/T140852 [11:28:39] PROBLEM - IPMI Sensor Status on cp4021 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:28:53] PROBLEM - IPMI Sensor Status on cp4023 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:29:05] PROBLEM - IPMI Sensor Status on cp4024 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:29:05] PROBLEM - IPMI Sensor Status on lvs4005 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:30:12] does anyone else have patches for 
SWAT? [11:31:55] PROBLEM - IPMI Sensor Status on cp4030 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:34:31] !log EU SWAT done [11:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:51] PROBLEM - IPMI Sensor Status on lvs4006 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:38:43] PROBLEM - IPMI Sensor Status on cp4022 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:39:31] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@5b42bdf]: Revert wdqs 0.3.4-SNAPSHOT [11:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:12] Lucas_WMDE: SWAT done, right? I want to deploy cxserver fix if yes. [11:41:25] kart_: yup, go ahead [11:41:29] PROBLEM - IPMI Sensor Status on cp4032 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:41:41] PROBLEM - IPMI Sensor Status on dns4002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:42:53] OK! 
[11:43:29] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2019-10-15-091114-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/543386 (https://phabricator.wikimedia.org/T217585) (owner: 10KartikMistry) [11:43:29] PROBLEM - IPMI Sensor Status on cp4025 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:43:48] (03Merged) 10jenkins-bot: Update cxserver to 2019-10-15-091114-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/543386 (https://phabricator.wikimedia.org/T217585) (owner: 10KartikMistry) [11:43:53] PROBLEM - IPMI Sensor Status on cp4028 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:44:43] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . 
[11:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:57] PROBLEM - IPMI Sensor Status on cp4029 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:45:21] PROBLEM - IPMI Sensor Status on dns4001 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:45:53] PROBLEM - IPMI Sensor Status on bast4002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:46:18] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:33] PROBLEM - IPMI Sensor Status on cp4031 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:47:12] RECOVERY - Host ganeti4003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.13 ms [11:47:13] What's these alerts? 
[11:47:39] <_joe_> kart_: a maintenance on the power supplies in ulsfo [11:47:50] <_joe_> nothing to do with cxserver , don't worry [11:49:40] PROBLEM - IPMI Sensor Status on cp4027 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:49:43] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@5b42bdf]: Revert wdqs 0.3.4-SNAPSHOT (duration: 10m 13s) [11:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:12] PROBLEM - IPMI Sensor Status on cp4026 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:52:18] PROBLEM - IPMI Sensor Status on lvs4007 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [11:56:26] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@c90503b]: Revert to fix T235540 [11:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:30] T235540: SPARQL query causes StackOverflowError and fails to execute - https://phabricator.wikimedia.org/T235540 [11:57:01] _joe_: thanks. [11:57:13] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . 
[11:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:58] 10Operations, 10serviceops: Deploy wikidiff2 v1.9.0 - https://phabricator.wikimedia.org/T234175 (10jijiki) [12:00:13] !log Updated cxserver to 2019-10-15-091114-production (T234773, T217585) [12:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:19] T217585: CX2: ISBN doubled, one correctly formatted with {{ISBN}}, another incorrectly formatted with [[Special:BookSources]] - https://phabricator.wikimedia.org/T217585 [12:00:19] T234773: Add banwiki to cxserver - https://phabricator.wikimedia.org/T234773 [12:00:38] 10Operations, 10serviceops, 10HHVM, 10Patch-For-Review, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [12:02:54] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review: puppet: remove cluster variable - https://phabricator.wikimedia.org/T234805 (10jbond) 05Open→03Resolved [12:03:47] 10Puppet: puppetdb6 picking wrong password for replication user - https://phabricator.wikimedia.org/T235628 (10jbond) [12:04:15] (03PS1) 10Muehlenhoff: Revert "Add access for the Icinga replication check" [puppet] - 10https://gerrit.wikimedia.org/r/543407 [12:04:45] 10Operations, 10Puppet, 10PostgreSQL: puppetdb6 picking wrong password for replication user - https://phabricator.wikimedia.org/T235628 (10jbond) p:05Triage→03Normal [12:05:29] (03CR) 10Jbond: [C: 03+2] Revert "Add access for the Icinga replication check" [puppet] - 10https://gerrit.wikimedia.org/r/543407 (owner: 10Muehlenhoff) [12:05:38] (03PS2) 10Jbond: Revert "Add access for the Icinga replication check" [puppet] - 10https://gerrit.wikimedia.org/r/543407 (owner: 10Muehlenhoff) [12:10:57] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [12:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:28] (03PS3) 10Muehlenhoff: puppet/config-master: migrate puppet and 
config-master to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543281 (https://phabricator.wikimedia.org/T235250) (owner: 10Jbond) [12:15:36] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@c90503b]: Revert to fix T235540 (duration: 19m 09s) [12:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:39] T235540: SPARQL query causes StackOverflowError and fails to execute - https://phabricator.wikimedia.org/T235540 [12:20:26] !log Compress tables on db1099:3311 - T235599 [12:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:31] T235599: Recompress special slaves across eqiad and codfw - https://phabricator.wikimedia.org/T235599 [12:26:37] (03CR) 10Muehlenhoff: [C: 03+2] puppet/config-master: migrate puppet and config-master to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/543281 (https://phabricator.wikimedia.org/T235250) (owner: 10Jbond) [12:26:56] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@217cac5]: redeploy 0.3.4-SNAPSHOT - T235540 [12:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:01] T235540: SPARQL query causes StackOverflowError and fails to execute - https://phabricator.wikimedia.org/T235540 [12:27:11] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond) 05Open→03Resolved [12:27:14] 10Puppet, 10Patch-For-Review: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (10jbond) [12:27:25] (03CR) 10Jbond: [C: 03+2] profile::pupetmaster::frontend: manage ca.pem used in apache config [puppet] - 10https://gerrit.wikimedia.org/r/542954 (https://phabricator.wikimedia.org/T234332) (owner: 10Jbond) [12:27:36] (03PS3) 10Jbond: profile::pupetmaster::frontend: manage ca.pem used in apache config [puppet] - 10https://gerrit.wikimedia.org/r/542954 (https://phabricator.wikimedia.org/T234332) [12:29:59] !log jmm@cumin2001 END 
(PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [12:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:28] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@217cac5]: redeploy 0.3.4-SNAPSHOT - T235540 (duration: 03m 40s) [12:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:46] (03PS2) 10Muehlenhoff: puppet/config-master: migrate puppet and config-master to eqiad [dns] - 10https://gerrit.wikimedia.org/r/543280 (owner: 10Jbond) [12:37:39] 10Puppet, 10Patch-For-Review: ensure additional puppetmaster files are managed by puppet - https://phabricator.wikimedia.org/T234332 (10jbond) 05Open→03Resolved [12:37:44] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond) [12:38:52] !log remove tex* and math related packages from appserver canaries - T195847 [12:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:56] T195847: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 [12:40:44] (03CR) 10Muehlenhoff: [C: 03+2] puppet/config-master: migrate puppet and config-master to eqiad [dns] - 10https://gerrit.wikimedia.org/r/543280 (owner: 10Jbond) [12:43:39] 10Operations, 10Patch-For-Review: Build cergen for buster - https://phabricator.wikimedia.org/T235405 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff [12:44:01] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (10jijiki) Since we will be reimaging all mw* servers, those packages will need to be removed actually from snapshot*, deploy*, mwmaint* and labweb* [12:49:02] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [12:50:34] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [12:50:38] RECOVERY - ps1-23-ulsfo-infeed-load-tower-B-single-phase on ps1-23-ulsfo is OK: SNMP OK - ps1-23-ulsfo-infeed-load-tower-B-single-phase 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:51:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1079 after schema change', diff saved to https://phabricator.wikimedia.org/P9362 and previous config saved to /var/cache/conftool/dbconfig/20191016-125102-marostegui.json [12:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:50] RECOVERY - Host lvs4005.mgmt is UP: PING OK - Packet loss = 16%, RTA = 401.52 ms [12:51:50] RECOVERY - Host lvs4007.mgmt is UP: PING WARNING - Packet loss = 64%, RTA = 241.35 ms [12:52:24] RECOVERY - IPMI Sensor Status on cp4026 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:52:42] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:52:42] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 40, down: 0, dormant: 0, excluded: 1, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:52:46] RECOVERY - Host cp4031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.07 ms [12:53:22] RECOVERY - Host re0.cr3-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.06 ms [12:53:24] RECOVERY - IPMI Sensor Status on lvs4007 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [12:53:32] RECOVERY - Host cp4025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.14 ms [12:53:57] 10Operations, 10Puppet, 10PostgreSQL: puppetdb6 picking wrong password for replication user - https://phabricator.wikimedia.org/T235628 (10jbond) 05Open→03Resolved [12:54:01] 10Puppet, 10Patch-For-Review: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (10jbond) [12:54:16] RECOVERY - Host cp4021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.89 ms [12:54:32] RECOVERY - Host cp4029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.13 ms [12:55:02] (03PS1) 10Jbond: puppet: update serialization format to json [puppet] - 10https://gerrit.wikimedia.org/r/543415 (https://phabricator.wikimedia.org/T233643) [12:55:20] RECOVERY - Host ripe-atlas-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 74.41 ms [12:56:06] RECOVERY - Host ganeti4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 77.45 ms [12:56:18] RECOVERY - Host cp4027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.85 ms [12:56:34] RECOVERY - Host dns4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.84 ms [12:57:18] RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.43 ms [12:57:19] (03CR) 10Jbond: [C: 03+2] puppet: update serialization format to json [puppet] - 10https://gerrit.wikimedia.org/r/543415 (https://phabricator.wikimedia.org/T233643) (owner: 10Jbond) [12:57:24] RECOVERY - Host cp4023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.01 ms [12:59:27] (03PS1) 
10Muehlenhoff: Add DHCP config for idp2001 [puppet] - 10https://gerrit.wikimedia.org/r/543448 (https://phabricator.wikimedia.org/T235479) [13:00:00] RECOVERY - IPMI Sensor Status on cp4021 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:00:16] RECOVERY - IPMI Sensor Status on cp4023 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:00:28] RECOVERY - IPMI Sensor Status on lvs4005 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:00:28] RECOVERY - IPMI Sensor Status on cp4024 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:03:22] RECOVERY - IPMI Sensor Status on cp4030 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:08:22] RECOVERY - IPMI Sensor Status on lvs4006 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:09:42] 10Operations, 10ops-codfw, 10Patch-For-Review: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (10MoritzMuehlenhoff) @Papaul : puppetmaster2001 is ready to be upgraded. [13:09:54] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): maps1002: Failed power supply - https://phabricator.wikimedia.org/T235406 (10Jclark-ctr) @wiki_willy @MoritzMuehlenhoff Found loose power cable reseated cable psu is on now. 
[13:10:03] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:10:04] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1074 after schema change', diff saved to https://phabricator.wikimedia.org/P9363 and previous config saved to /var/cache/conftool/dbconfig/20191016-131010-marostegui.json [13:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:20] RECOVERY - IPMI Sensor Status on cp4022 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:11:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-ulsfo.mgmt.ulsfo.wmnet recovered from Juniper alarm active [13:11:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] helmfile_log_sal: Fix getting the user and host for logging (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/542064 (owner: 10Mobrovac) [13:11:17] 10Puppet, 10Patch-For-Review: Upgrade Puppet Masters and Puppet DB servers - https://phabricator.wikimedia.org/T228657 (10jbond) [13:11:19] 10Operations, 10Puppet, 10Patch-For-Review: occational puppet errors: Error 500 on SERVER: Server Error: Unsupported facts format - https://phabricator.wikimedia.org/T233643 (10jbond) 05Open→03Resolved [13:11:20] RECOVERY - IPMI Sensor Status on maps1002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:12:11] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): maps1002: Failed power supply - https://phabricator.wikimedia.org/T235406 (10MoritzMuehlenhoff) 05Open→03Resolved Thanks, I re-triggered the Icinga check and it recovered now. 
[13:13:08] RECOVERY - IPMI Sensor Status on cp4032 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:13:22] RECOVERY - IPMI Sensor Status on dns4002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:13:35] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Fine by me then. Feel free to deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521967 (owner: 10Aaron Schulz) [13:15:10] RECOVERY - IPMI Sensor Status on cp4025 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:15:34] RECOVERY - IPMI Sensor Status on cp4028 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:16:23] (03PS2) 10Muehlenhoff: Add DHCP config for idp2001 [puppet] - 10https://gerrit.wikimedia.org/r/543448 (https://phabricator.wikimedia.org/T235479) [13:16:38] RECOVERY - IPMI Sensor Status on cp4029 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:17:02] RECOVERY - IPMI Sensor Status on dns4001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:17:34] RECOVERY - IPMI Sensor Status on bast4002 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:18:10] RECOVERY - IPMI Sensor Status on cp4031 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:19:49] 10Operations, 10SDC General, 10Structured Data Engineering, 10Structured-Data-Backlog, and 4 others: Create puppet configs for SDC query - https://phabricator.wikimedia.org/T232297 (10Gehel) [13:21:16] RECOVERY - IPMI Sensor Status on cp4027 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:23:11] 10Operations, 10SDC General, 10Structured Data Engineering, 10Structured-Data-Backlog, and 4 others: Create puppet configs for SDC query - https://phabricator.wikimedia.org/T232297 (10Gehel) [13:26:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1103:3312 and db1094 for schema change', diff saved to https://phabricator.wikimedia.org/P9364 and previous config saved to /var/cache/conftool/dbconfig/20191016-132620-marostegui.json [13:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:29] (03PS1) 10Elukey: Add deployment-memc08 to the mcrouter config [puppet] - 10https://gerrit.wikimedia.org/r/543456 (https://phabricator.wikimedia.org/T213089) [13:27:36] (03CR) 10Krinkle: "The original motivation behind 800-ish was not to bother compressing small private uncacheable or rarely reused XML or JSON responses from" [puppet] - 10https://gerrit.wikimedia.org/r/542996 (https://phabricator.wikimedia.org/T232615) (owner: 10Ema) [13:29:54] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10akosiaris) Sorry I missed that, thanks for pinging me on T234900. >>! In T229209#5565968, @jcrespo wrote: > @akosiaris We have reached an impass. We... 
[13:37:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] [WIP] echostore: create staging deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/543212 (https://phabricator.wikimedia.org/T234376) (owner: 10Eevans) [13:45:06] (03PS2) 10Mobrovac: helmfile_log_sal: Fix getting the user and host for logging [puppet] - 10https://gerrit.wikimedia.org/r/542064 [13:46:25] !log rollback failover VRRP from cr1-eqiad to cr2-eqiad - T226782 [13:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:29] T226782: a1-eqiad pdu refresh (Tuesday 10/15 @11am UTC) - https://phabricator.wikimedia.org/T226782 [13:47:00] (03CR) 10Mobrovac: helmfile_log_sal: Fix getting the user and host for logging (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/542064 (owner: 10Mobrovac) [13:48:09] (03PS1) 10Alexandros Kosiaris: echostore: Add namespace creation stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/543463 (https://phabricator.wikimedia.org/T234376) [13:51:46] (03PS5) 10Herron: logstash: add an index for deployment related logs [puppet] - 10https://gerrit.wikimedia.org/r/542557 (https://phabricator.wikimedia.org/T234564) [13:53:53] (03CR) 10Muehlenhoff: [C: 03+2] Add DHCP config for idp2001 [puppet] - 10https://gerrit.wikimedia.org/r/543448 (https://phabricator.wikimedia.org/T235479) (owner: 10Muehlenhoff) [13:54:32] 10Puppet: Populate puppetdb1002 with live data - https://phabricator.wikimedia.org/T235655 (10jbond) [13:56:12] (03PS1) 10Jbond: puppetdb: remove activerecord db settings from servers using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/543465 (https://phabricator.wikimedia.org/T235655) [13:56:32] !log reenabling puppet on helium T229209 [13:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:36] T229209: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 [13:56:53] (03PS6) 10Herron: logstash: add an index for deployment related logs 
[puppet] - 10https://gerrit.wikimedia.org/r/542557 (https://phabricator.wikimedia.org/T234564) [13:58:13] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think you have some copy/pasta" (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/543463 (https://phabricator.wikimedia.org/T234376) (owner: 10Alexandros Kosiaris) [13:59:06] (03CR) 10Herron: [C: 03+2] logstash: add an index for deployment related logs [puppet] - 10https://gerrit.wikimedia.org/r/542557 (https://phabricator.wikimedia.org/T234564) (owner: 10Herron) [14:01:01] (03PS1) 10Jbond: Revert "profile::pupetmaster::frontend: manage ca.pem used in apache config" [puppet] - 10https://gerrit.wikimedia.org/r/543466 [14:01:22] (03PS2) 10Jbond: Revert "profile::pupetmaster::frontend: manage ca.pem used in apache config" [puppet] - 10https://gerrit.wikimedia.org/r/543466 [14:03:38] (03CR) 10Alexandros Kosiaris: "Indeed, all 3 fixed" [deployment-charts] - 10https://gerrit.wikimedia.org/r/543463 (https://phabricator.wikimedia.org/T234376) (owner: 10Alexandros Kosiaris) [14:03:45] 10Operations, 10MediaWiki-extensions-OATHAuth, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: Cannot enable 2FA on testwiki - https://phabricator.wikimedia.org/T233146 (10Reedy) 05Open→03Resolved a:03Reedy [14:03:49] (03PS2) 10Alexandros Kosiaris: echostore: Add namespace creation stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/543463 (https://phabricator.wikimedia.org/T234376) [14:03:55] (03CR) 10Jbond: [C: 03+2] Revert "profile::pupetmaster::frontend: manage ca.pem used in apache config" [puppet] - 10https://gerrit.wikimedia.org/r/543466 (owner: 10Jbond) [14:04:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] bacula: Force install bacula-director, not a dependency on buster [puppet] - 10https://gerrit.wikimedia.org/r/541523 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [14:06:16] (03CR) 10Giuseppe Lavagetto: [C: 03+2] echostore: Add namespace creation stanzas 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/543463 (https://phabricator.wikimedia.org/T234376) (owner: 10Alexandros Kosiaris) [14:06:31] (03Merged) 10jenkins-bot: echostore: Add namespace creation stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/543463 (https://phabricator.wikimedia.org/T234376) (owner: 10Alexandros Kosiaris) [14:10:27] !log installing idp2001 [14:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:46] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10Elitre) @Gilles FWIW, I just got: Request from 176.207.117.69 via cp3038 frontend, Va... [14:18:20] 10Operations, 10Release-Engineering-Team, 10Scap, 10Wikimedia-General-or-Unknown, and 2 others: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (10thcipriani) >>! In T235338#5569953, @Reedy wrote: > Current implementation: > > `lang=html >

Currently... [14:18:54] !log oblivian@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [14:18:54] !log oblivian@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [14:18:55] !log oblivian@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:03] 10Puppet, 10Patch-For-Review: ensure additional puppetmaster files are managed by puppet - https://phabricator.wikimedia.org/T234332 (10jbond) 05Resolved→03Open re-open as the previous change didn't take into account wmcs [14:19:07] 10Puppet, 10Patch-For-Review: upgrade puppet master frontends servers - https://phabricator.wikimedia.org/T234315 (10jbond) [14:19:41] 10Puppet, 10cloud-services-team, 10Patch-For-Review: ensure additional puppetmaster files are managed by puppet - https://phabricator.wikimedia.org/T234332 (10jbond) [14:22:14] (03PS1) 10Herron: logstash: apply truncate filter to all fields [puppet] - 10https://gerrit.wikimedia.org/r/543467 [14:24:04] <_joe_> !log creating namespaces and policies for echostore in codfw, T234376 [14:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:08] T234376: Provision Kask for Echo timestamp storage in k8s - https://phabricator.wikimedia.org/T234376 [14:24:20] !log oblivian@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [14:24:21] !log oblivian@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [14:24:22] !log oblivian@ helmfile [CODFW] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[14:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:14] !log power down puppetmaster2001 for HW maintenance [14:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:41] oh hey, helmfile outputs usernames now :D [14:30:13] !log akosiaris@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [14:30:14] !log akosiaris@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'coredns' . [14:30:14] !log akosiaris@ helmfile [EQIAD] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . [14:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:38] (03PS1) 10Elukey: archiva: ensure /var/run/archiva [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) [14:33:16] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Archiva relies on a tmpfs directory that is wiped after each reboot - https://phabricator.wikimedia.org/T214366 (10elukey) [14:33:20] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Archiva relies on a tmpfs directory that is wiped after each reboot - https://phabricator.wikimedia.org/T214366 (10elukey) a:03elukey [14:34:23] (03CR) 10Muehlenhoff: archiva: ensure /var/run/archiva (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) (owner: 10Elukey) [14:36:46] (03CR) 10Elukey: archiva: ensure /var/run/archiva (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) (owner: 10Elukey) [14:40:44] (03PS2) 10Elukey: archiva: ensure /var/run/archiva [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) [14:41:56] (03PS3) 10Elukey: archiva: ensure /var/run/archiva [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) [14:42:35] (03PS4) 10Elukey: archiva: ensure /var/run/archiva [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) [14:42:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] conftool: add parsoid-php service to wtp servers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [14:42:45] ok done :) [14:46:13] (03CR) 10Giuseppe Lavagetto: [C: 04-1] discovery.yaml: add parsoid-php microservice (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/542572 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [14:47:45] (03CR) 10Muehlenhoff: archiva: ensure /var/run/archiva (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) (owner: 10Elukey) [14:49:08] (03PS13) 10Dzahn: conftool: add parsoid-php service to wtp servers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) [14:49:29] (03CR) 10Dzahn: [C: 03+2] conftool: add parsoid-php service to wtp servers [puppet] - 10https://gerrit.wikimedia.org/r/541377 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [14:49:37] (03CR) 10Elukey: archiva: ensure /var/run/archiva (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) (owner: 10Elukey) [14:52:13] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (10Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range 
vlan-private1-d-codfw] - member ge-6/0/14; [edit interfaces interface-range disabled] me... [14:52:15] (03CR) 10Muehlenhoff: archiva: ensure /var/run/archiva (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) (owner: 10Elukey) [14:52:19] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2066.codfw.wmnet - https://phabricator.wikimedia.org/T230885 (10Papaul) [14:52:23] 10Operations: Puppet breakage in automation-framework VMs - https://phabricator.wikimedia.org/T234452 (10Volans) [14:53:14] 10Operations, 10serviceops, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn) [14:53:51] !log Remove tex* and math related packages from deploy*,mwmaint*,snapshot* - T195847 [14:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:55] T195847: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 [14:54:15] (03PS5) 10Elukey: archiva: ensure /var/run/archiva [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) [14:54:46] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10herron) 05Open→03Resolved a:03herron The requested group memberships have been provisioned. I'll transition this t... 
[14:56:23] (03CR) 10Muehlenhoff: archiva: ensure /var/run/archiva (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) (owner: 10Elukey) [14:56:36] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10herron) [14:56:38] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10herron) [14:56:40] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10herron) [14:56:51] moritzm: please be patient, long day [14:57:22] I hope to stop being so sloppy [14:57:36] (03PS6) 10Elukey: archiva: ensure /var/run/archiva [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) [14:57:49] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10herron) 05Open→03Resolved Transitioning this resolved as all subtasks have now been resolved. If a... 
[14:59:23] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (10jijiki) [14:59:40] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (10jijiki) [14:59:52] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp2001.codfw.wmnet,service=parsoid-php [14:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:09] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet,service=parsoid-php [15:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:45] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (10jijiki) 05Open→03Resolved a:03jijiki mw* servers will be reimaged as part of T229792, this is resolved [15:01:17] (03PS1) 10Ottomata: Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) [15:01:50] 10Operations, 10serviceops, 10HHVM, 10Patch-For-Review, 10Performance-Team (Radar): Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (10jijiki) [15:02:06] 10Operations, 10serviceops, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn) [15:02:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) (owner: 10Elukey) [15:02:55] elukey: +1d :-) [15:03:06] (03CR) 10Jbond: "Desired functionality looks good i suspect it will be better for us to just implement this all in groovy as i think this is more flexibili" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/543402 (owner: 10Muehlenhoff) [15:03:38] (03CR) 10jerkins-bot: [V: 04-1] Upload versioned Spark assembly file to HDFS [puppet] - 
10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) (owner: 10Ottomata) [15:04:07] !log wtp parsoid servers added to conftool - wtp1025 and wtp2001 pooled in new service parsoid-php (T233654) [15:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:11] T233654: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 [15:04:16] (03PS7) 10Elukey: archiva: ensure /var/run/archiva [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) [15:04:46] (03PS2) 10Ottomata: Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) [15:04:51] !log wtp1025 wtp2001 - scap pull (T233654) [15:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:04] 10Operations, 10ops-codfw: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster - https://phabricator.wikimedia.org/T235250 (10Papaul) @MoritzMuehlenhoff complete [15:06:57] (03CR) 10Effie Mouzeli: [C: 03+1] ":D" [puppet] - 10https://gerrit.wikimedia.org/r/543456 (https://phabricator.wikimedia.org/T213089) (owner: 10Elukey) [15:07:41] (03CR) 10Elukey: [C: 03+2] archiva: ensure /var/run/archiva [puppet] - 10https://gerrit.wikimedia.org/r/543469 (https://phabricator.wikimedia.org/T214366) (owner: 10Elukey) [15:09:17] !log Recreate views for protected_titles on s2 and s7 on labsdb1009 and labsdb1012 - T233135 [15:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:20] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [15:09:43] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Pchelolo) The `kad` library that the DHT 
rate limiter is based on was forked. Since it worked OK, the... [15:10:17] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Archiva relies on a tmpfs directory that is wiped after each reboot - https://phabricator.wikimedia.org/T214366 (10elukey) [15:12:59] (03PS3) 10Ottomata: Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) [15:15:07] (03CR) 10jerkins-bot: [V: 04-1] Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) (owner: 10Ottomata) [15:16:07] (03PS4) 10Ottomata: Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) [15:16:13] (03PS1) 10Dzahn: scap/dsh: add parsoid-php servers to mediawiki-installation group [puppet] - 10https://gerrit.wikimedia.org/r/543479 (https://phabricator.wikimedia.org/T233654) [15:17:27] !log Deploy schema change on dbstore1004:3312 - T234066 T233135 [15:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:33] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [15:17:34] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [15:18:36] (03PS5) 10Ottomata: Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) [15:19:08] (03PS2) 10Dzahn: scap/dsh: add parsoid-php servers to mediawiki-installation group [puppet] - 10https://gerrit.wikimedia.org/r/543479 (https://phabricator.wikimedia.org/T233654) [15:20:57] (03CR) 10jerkins-bot: [V: 04-1] Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) (owner: 10Ottomata) [15:22:47] (03CR) 10Giuseppe 
Lavagetto: [C: 04-1] "you need a separate dsh group for parsoid_php that uses the correct set of tags to find which servers are pooled or not." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/543479 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:23:03] (03CR) 10Muehlenhoff: Extend wmf-userschema for additional MFA options: (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/543402 (owner: 10Muehlenhoff) [15:26:28] 10Operations, 10Release-Engineering-Team, 10Scap, 10Wikimedia-General-or-Unknown, and 2 others: "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (10Krinkle) I thought maybe it was user-permission or working-directory related. But, looks like not.. As www... [15:28:50] (03CR) 10Dzahn: scap/dsh: add parsoid-php servers to mediawiki-installation group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/543479 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:29:10] (03PS3) 10Dzahn: scap/dsh: add parsoid-php servers to mediawiki-installation group [puppet] - 10https://gerrit.wikimedia.org/r/543479 (https://phabricator.wikimedia.org/T233654) [15:31:58] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) @wiki_willy, is there any update on this issue? We're still a bit short on capacity due to missing this host and... [15:32:11] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Andrew) @wiki_willy, is this currently in your court or ours? 
[15:33:18] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (10Papaul) [15:34:09] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (10Papaul) [15:37:26] (03PS4) 10Dzahn: scap/dsh: add parsoid-php servers to mediawiki-installation group [puppet] - 10https://gerrit.wikimedia.org/r/543479 (https://phabricator.wikimedia.org/T233654) [15:37:40] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (10Papaul) [15:39:09] (03CR) 10Giuseppe Lavagetto: "one small thing and the patch is good!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543479 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:40:27] (03PS5) 10Dzahn: scap/dsh: add parsoid-php servers to mediawiki-installation group [puppet] - 10https://gerrit.wikimedia.org/r/543479 (https://phabricator.wikimedia.org/T233654) [15:41:34] 10Operations, 10ops-codfw, 10fundraising-tech-ops: rack/setup/install frqueue2001 - https://phabricator.wikimedia.org/T232630 (10Papaul) @Dwisehaupt hello is this done? Can we resolve it? [15:41:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] scap/dsh: add parsoid-php servers to mediawiki-installation group [puppet] - 10https://gerrit.wikimedia.org/r/543479 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:41:55] (03CR) 10Subramanya Sastry: [C: 03+1] "Is this ready to go? And, maybe at this point, this can be for all wtp* hosts?" 
[puppet] - 10https://gerrit.wikimedia.org/r/541645 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:45:41] (03CR) 10Dzahn: [C: 03+2] scap/dsh: add parsoid-php servers to mediawiki-installation group [puppet] - 10https://gerrit.wikimedia.org/r/543479 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:45:50] (03PS6) 10Dzahn: scap/dsh: add parsoid-php servers to mediawiki-installation group [puppet] - 10https://gerrit.wikimedia.org/r/543479 (https://phabricator.wikimedia.org/T233654) [15:49:09] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2051,db2056 and db2068 [dns] - 10https://gerrit.wikimedia.org/r/543484 [15:53:08] (03PS3) 10Jcrespo: bacula: Force install bacula-director, not a dependency on buster [puppet] - 10https://gerrit.wikimedia.org/r/541523 (https://phabricator.wikimedia.org/T229209) [15:53:22] jouncebot: next [15:53:22] In 0 hour(s) and 6 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191016T1600) [15:53:27] (03CR) 10Jcrespo: [C: 03+2] bacula: Force install bacula-director, not a dependency on buster [puppet] - 10https://gerrit.wikimedia.org/r/541523 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [15:54:45] <_joe_> mutante: you might want to run puppet manually on deploy1001 before scap starts [15:55:25] _joe_: ack, i am doing that on the scap proxy list, adding deploy1001 [15:59:24] !log new dsh group parsoid_php created - parsoid-php servers added to scap / mediawiki-installation dsh group [15:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191016T1600). [16:00:04] RoanKattouw: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:13] _joe_: confirmed the generated /etc/dsh/group/mediawiki-installation has the 2 new servers. lgtm [16:00:20] (03CR) 10Ottomata: "Made a task:" [puppet] - 10https://gerrit.wikimedia.org/r/541775 (owner: 10Alexandros Kosiaris) [16:00:34] I can do my own SWAT [16:01:39] RoanKattouw: fyi, scap should now also deploy to the first 2 wtp servers that were just added. i ran scap pull not long ago [16:02:17] <_joe_> RoanKattouw: you're officially our canary [16:04:30] 10Operations, 10ops-codfw: Recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10Papaul) [16:04:46] 10Operations, 10ops-codfw: Recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10Papaul) [16:08:29] 10Operations, 10ops-codfw: Recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10Papaul) The pickup date is set to October 31st for now. Hi Papaul, Thank you for your take back pickup request. Please note that your pickup request for the equipment CTB11362315, 33324924905 is been receive... 
[16:08:52] 10Operations, 10ops-codfw: Recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10Papaul) [16:09:05] 10Operations, 10ops-codfw: Recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (10Papaul) [16:10:43] 10Operations, 10vm-requests: codfw: 1 VM for idp - https://phabricator.wikimedia.org/T235479 (10MoritzMuehlenhoff) 05Open→03Resolved VM has been created [16:15:46] (03PS2) 10Phamhi: Update all images based on buster (T230961) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/543124 [16:16:31] !log ganeti1003 - shutting down and removing instance moscovium.eqiad.wmnet - recreating under same name with cookbook [16:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:54] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.2/extensions/GrowthExperiments/: Fix help panel button alignment (T235578) (duration: 01m 02s) [16:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:58] T235578: [wmf.2-regression] Help panel: the cog icon is misaligned - https://phabricator.wikimedia.org/T235578 [16:18:13] Everything worked but I did get this message: [16:18:17] https://www.irccloud.com/pastebin/R6vS8Bg5/ [16:18:46] At the very end of the scap process [16:18:56] So it seems like maybe wtp2001 doesn't have the right sudoers configuration or something [16:19:02] cc mutante _joe_ [16:19:08] RoanKattouw: eh.. that seems familiar from the past but should not happen anymore. 
i'll look [16:19:29] (03PS2) 10Bstorm: toolforge: exclude grid engine TMPDIR directories from tmpreaper [puppet] - 10https://gerrit.wikimedia.org/r/543266 (https://phabricator.wikimedia.org/T217815) (owner: 10BryanDavis) [16:19:31] (03CR) 10Jcrespo: "Sadly we will need another conditional, as bacula-director is a real, and needed package on buster, but it is only virtual on lower versio" [puppet] - 10https://gerrit.wikimedia.org/r/541523 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [16:19:34] interesting it is only on the codfw host [16:19:54] puppet will fail on helium, working on a patch [16:20:09] ack jynus [16:20:31] sorry, it is one of those things "works in buster, not before" [16:20:48] <_joe_> mutante: go look at the sudoers files I guess [16:21:01] <_joe_> RoanKattouw: that is not an issue right now indeed [16:21:01] 10Operations, 10Analytics, 10Fundraising-Backlog, 10SRE-Access-Requests: Banner History and page view data access for fundraising analysts - Jerrie and Erin - https://phabricator.wikimedia.org/T233636 (10jrobell) Thank you @herron ! [16:21:06] <_joe_> but needs to be fixed [16:21:49] _joe_: yea, that probably needs "!requiretty". 
i remember we had this issue before sometimes [16:22:00] (03CR) 10Bstorm: [C: 03+2] toolforge: exclude grid engine TMPDIR directories from tmpreaper [puppet] - 10https://gerrit.wikimedia.org/r/543266 (https://phabricator.wikimedia.org/T217815) (owner: 10BryanDavis) [16:22:51] <_joe_> mutante: no I think the mwdeploy user can't run that command, plain and simple [16:22:55] <_joe_> as root I mean [16:23:42] ok, comparing with the eqiad server [16:23:52] <_joe_> that sudo rule is defined in profile::mediawiki::common so that's definitely very strange [16:24:31] (03PS1) 10Jcrespo: bacula: Fix error on bacula director install for older hosts [puppet] - 10https://gerrit.wikimedia.org/r/543489 (https://phabricator.wikimedia.org/T229209) [16:25:25] (03CR) 10jerkins-bot: [V: 04-1] bacula: Fix error on bacula director install for older hosts [puppet] - 10https://gerrit.wikimedia.org/r/543489 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [16:28:03] (03PS2) 10Jcrespo: bacula: Fix error on bacula director install for older hosts [puppet] - 10https://gerrit.wikimedia.org/r/543489 (https://phabricator.wikimedia.org/T229209) [16:29:05] (03CR) 10jerkins-bot: [V: 04-1] bacula: Fix error on bacula director install for older hosts [puppet] - 10https://gerrit.wikimedia.org/r/543489 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [16:31:39] (03PS3) 10Jcrespo: bacula: Fix error on bacula director install for older hosts [puppet] - 10https://gerrit.wikimedia.org/r/543489 (https://phabricator.wikimedia.org/T229209) [16:32:30] RECOVERY - mediawiki-installation DSH group on wtp2001 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:32:44] (03CR) 10Jcrespo: [C: 03+2] bacula: Fix error on bacula director install for older hosts [puppet] - 10https://gerrit.wikimedia.org/r/543489 (https://phabricator.wikimedia.org/T229209) (owner: 10Jcrespo) [16:33:00] _joe_: it doesnt work on either eqiad or codfw but the difference to mw servers is 
they get another sudoers.d file, "deployment" and that lets mwdeploy do everything [16:33:01] (03PS4) 10Jcrespo: bacula: Fix error on bacula director install for older hosts [puppet] - 10https://gerrit.wikimedia.org/r/543489 (https://phabricator.wikimedia.org/T229209) [16:33:20] !log upgrading Cassandra to 3.11.4, eqiad, rack a -- T200803 [16:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:25] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [16:33:28] <_joe_> mutante: wut [16:33:37] so it's working on f.e. mwdebug1001 due to that wider rule [16:33:57] while there is no difference in scap_deploy-service_parsoid [16:36:23] 04Critical Alert for device mr1-eqsin.wikimedia.org - Primary outbound port utilisation over 80% [16:37:14] that is not me, I am not touching eqsin [16:38:21] 10Operations, 10Wikimedia-Mailing-lists, 10Privacy: Potential privacy violations in emails on mailing lists (links posted in emails to external websites which track users) - https://phabricator.wikimedia.org/T213044 (10sbassett) p:05Triage→03Normal [16:41:23] 04̶C̶r̶i̶t̶i̶c̶a̶l Device mr1-eqsin.wikimedia.org recovered from Primary outbound port utilisation over 80% [16:41:24] (03PS6) 10Ottomata: Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) [16:43:31] (03CR) 10jerkins-bot: [V: 04-1] Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) (owner: 10Ottomata) [16:44:22] (03PS1) 10Eevans: cassandra: Pin Cassandra packages to version 3.11.4 [puppet] - 10https://gerrit.wikimedia.org/r/543494 (https://phabricator.wikimedia.org/T200803) [16:44:28] 10Operations, 10DBA, 10serviceops, 10Goal, 10Patch-For-Review: Strengthen backup infrastructure and support - https://phabricator.wikimedia.org/T229209 (10jcrespo) I have 
discussed with alex a plan, there is a preliminary, but timid suggestion of steps on the design (more like diary) document. For now I... [16:44:30] (03PS7) 10Ottomata: Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) [16:44:36] jouncebot: now [16:44:37] For the next 0 hour(s) and 15 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191016T1600) [16:44:46] !sal [16:44:46] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [16:52:41] (03PS1) 10Giuseppe Lavagetto: parsoid: also include php restarts profile [puppet] - 10https://gerrit.wikimedia.org/r/543496 [16:52:51] 10Operations, 10Cassandra, 10Core Platform Team Legacy (Later), 10User-Eevans: Upload 3.11.4 packages to APT repo - https://phabricator.wikimedia.org/T235675 (10Eevans) [16:53:21] 10Operations, 10Cassandra, 10Core Platform Team Legacy (Later), 10User-Eevans: Upload 3.11.4 packages to APT repo - https://phabricator.wikimedia.org/T235675 (10Eevans) [16:54:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18893/wtp1025.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/543496 (owner: 10Giuseppe Lavagetto) [16:56:50] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 56.68 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:58:09] 10Operations, 10CPT Initiatives (PHP7 (TEC4)), 10HHVM, 10MW-1.34-notes (1.34.0-wmf.22; 2019-09-10), and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [17:01:32] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 83.4 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:06:46] 10Operations, 10Icinga: dwisehaupt needs access to iginca for frack hosts - https://phabricator.wikimedia.org/T235676 (10Dwisehaupt) [17:06:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "This is a bit ad-hoc and maybe it should be done at the source, but it's good enough for now. We will need to change the channel in mediaw" [puppet] - 10https://gerrit.wikimedia.org/r/541645 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [17:06:55] (03PS2) 10Giuseppe Lavagetto: logstash: add wtp1025/wtp2001 to filter-mediawiki with parsoid-php channel [puppet] - 10https://gerrit.wikimedia.org/r/541645 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [17:08:59] <_joe_> come on jenkins. [17:09:33] 10Operations, 10Icinga, 10fundraising-tech-ops: dwisehaupt needs access to iginca for frack hosts - https://phabricator.wikimedia.org/T235676 (10Dwisehaupt) [17:10:26] RECOVERY - mediawiki-installation DSH group on wtp1025 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:10:38] 10Operations, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen - https://phabricator.wikimedia.org/T235677 (10Volker_E) [17:10:47] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Core Platform Team, 10Editing-team, and 2 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10sbassett) >>! In T230245#5577856, @Reedy wrote: > T... 
[17:11:46] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen - https://phabricator.wikimedia.org/T235677 (10Dzahn) [17:13:20] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:13:58] (03CR) 10Subramanya Sastry: "> Patch Set 1: Code-Review+2" [puppet] - 10https://gerrit.wikimedia.org/r/541645 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [17:15:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:15:36] PROBLEM - Host dns5001 is DOWN: PING CRITICAL - Packet loss = 100% [17:16:08] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:17:18] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:18:58] RECOVERY - Host dns5001 is UP: PING OK - Packet loss = 0%, RTA = 235.22 ms [17:21:44] (03PS2) 10Dzahn: discovery.yaml: add parsoid-php microservice [puppet] - 10https://gerrit.wikimedia.org/r/542572 (https://phabricator.wikimedia.org/T233654) [17:23:26] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:28:08] PROBLEM - Host cr1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [17:28:32] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:28:34] (03PS1) 10CDanis: depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/543500 [17:28:36] PROBLEM - Host mr1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [17:28:40] PROBLEM - Host upload-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [17:28:48] PROBLEM - Host text-lb.eqsin.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [17:28:55] hm what up [17:28:57] oh [17:29:26] RECOVERY - Host cr1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 237.24 ms [17:29:26] RECOVERY - Host mr1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 256.53 ms [17:29:29] (03CR) 10Giuseppe Lavagetto: [C: 03+1] depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/543500 (owner: 10CDanis) [17:29:30] RECOVERY - Host upload-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 235.15 ms [17:29:32] PROBLEM - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2684 bytes in 0.951 second response time 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:29:41] (03CR) 10CDanis: [C: 03+2] depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/543500 (owner: 10CDanis) [17:29:44] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job={varnish-text,varnish-upload} site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:29:46] what's up? [17:29:48] RECOVERY - Host text-lb.eqsin.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 235.10 ms [17:29:57] just got a blip of down and up flood from eqsin [17:30:02] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:30:10] huh [17:30:57] RECOVERY - LVS HTTPS IPv6 #page on text-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15426 bytes in 1.255 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:33:48] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [17:34:14] 08Warning Alert for device cr1-eqsin.wikimedia.org - Processor usage over 85% [17:36:10] mr1-eqsin complained before about 80% saturation [17:36:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:36:52] is that librenms alert about processor usage something useful to us, can it tell us anything? [17:39:34] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 38.56 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:39:39] 10Operations, 10ops-eqiad: rack/setup/install cloudwdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10RobH) p:05Triage→03Normal [17:39:55] 10Operations, 10ops-eqiad: rack/setup/install cloudwdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10RobH) [17:40:20] 10Operations, 10DNS, 10Domains, 10Traffic, and 2 others: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060 (10sbassett) p:05Triage→03Normal [17:41:02] 10Operations, 10ops-eqiad: rack/setup/install cloudwdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10RobH) I'm not sure if @andrew or @gehel would know this, but I assigned to @Gehel Racking Proposal: This was not answered on #procurement task T232663, and needs to be known before the h... 
[17:42:45] 10Operations, 10Research, 10The-Wikipedia-Library, 10Traffic, and 4 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276 (10sbassett) [17:42:58] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:44:14] 10Operations, 10Wikimedia-Apache-configuration, 10Privacy, 10Security: Apache 2.4 exposes server status page by default? - https://phabricator.wikimedia.org/T113090 (10sbassett) [17:44:30] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:45:18] 10Operations, 10ops-eqiad: rack/setup/install cloudwdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Gehel) a:05Gehel→03Andrew There isn't really any racking constraint on my side (as a future user of those systems). We don't have availability or redundancy constraints (those are test... 
[17:45:35] (03CR) 10Alexandros Kosiaris: [C: 03+1] Remove zookeeper terms from the Analytics filters [homer/public] - 10https://gerrit.wikimedia.org/r/543183 (https://phabricator.wikimedia.org/T217057) (owner: 10Elukey) [17:46:50] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:48:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:49:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:49:04] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics inter - https://phabricator.wikimedia.org/T235688 (10Nuria) [17:49:12] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Nuria) [17:49:33] 10Puppet, 10MediaWiki-extensions-NavigationTiming, 10Performance-Team, 10Privacy: Track state (region) - https://phabricator.wikimedia.org/T101819 (10sbassett) [17:50:03] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191016T1800) [18:00:25] 10Operations, 10Cassandra, 10Core Platform Team Legacy (Later), 10User-Eevans: Upload 3.11.4 packages to APT repo - https://phabricator.wikimedia.org/T235675 (10Eevans) p:05Triage→03Normal [18:03:01] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:03:10] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10lexnasser) Approving as the relevant Wikimedia Foundation employee. [18:03:11] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [18:03:23] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:03:41] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:03:53] PROBLEM - Check whether ferm is active by checking the default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:03:55] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:03:55] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:05:11] 10Operations, 10Wikimedia-Mailing-lists: Rename multimedia-team to structured-data-team - https://phabricator.wikimedia.org/T235550 (10MarkTraceur) @crusnov that would be great! Thanks. [18:05:17] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10Nuria) Approved on my end, i think @lexnasser needs to provide ssh keys and sign NDA per https://wikitech.wikimedia.org/wiki/Production_access [18:05:47] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:28] !log upgrading Cassandra to 3.11.4, eqiad, rack b -- T200803 [18:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:32] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [18:06:48] checking stat1007 [18:07:11] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [18:07:13] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:20] restarted the nagios service [18:07:21] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [18:07:28] the oom killer already done its job [18:07:35] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:07:53] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [18:08:07] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:08:09] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:08:09] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [18:09:15] 08̶W̶a̶r̶n̶i̶n̶g Device cr1-eqsin.wikimedia.org recovered from Processor usage over 85% [18:11:18] 10Operations, 10ops-eqiad: rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Andrew) [18:11:25] 10Operations, 10ops-eqiad: rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10Andrew) These boxes will be cloudvirts. So... 1) they should be named cloudvirt-wdqs100x 2) They need to be racked in row B with dual network hookups, just like cloudvirtXXXX [18:14:23] 10Operations, 10ops-eqiad: rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10RobH) [18:15:28] 10Operations, 10ops-eqiad: rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10RobH) a:05Andrew→03Jclark-ctr Racking Setup: These will all be cloudvirt-network-restricted hosts. They must go in 1G racks in Row B. Network Setup: (2) 1G rack connections, simi... 
[18:15:30] 10Operations, 10Analytics, 10SRE-Access-Requests: SSH access for Lex Nasser, analytics intern - https://phabricator.wikimedia.org/T235688 (10lexnasser) [18:16:32] 10Operations, 10ops-eqiad: rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10RobH) [18:28:57] !log upgrading Cassandra to 3.11.4, eqiad, rack d -- T200803 [18:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:01] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [18:31:25] I am getting 503's [18:31:37] anyone else has wikipedia down? [18:32:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:32:53] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:33:11] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:33:12] and now icinga is complaining about something too [18:33:21] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:33:21] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:33:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:33:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:33:51] chaomodus or any other ops around ? [18:34:01] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:34:07] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:34:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:34:27] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:34:45] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:34:55] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:34:55] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:35:03] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:35:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:35:35] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:35:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:37:11] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10mobrovac) So here are some options that we could consider. === Kademlia / DHT As stated above (and i... [18:37:43] Seems better now [18:39:16] 10Operations, 10Gerrit, 10Release-Engineering-Team, 10Wikimedia Design Style Guide: Automatic pickup of Gerrit clone master doesn't happen - https://phabricator.wikimedia.org/T235677 (10Dzahn) The changes made in T235013 added a requirement to have git-lfs installed and use a different command to pull data... 
[18:40:43] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 71.4 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:41:37] 10Operations, 10ops-eqiad: rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10RobH) [18:46:18] !log upgrading Cassandra to 3.11.4, codfw, rack b -- T200803 [18:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:22] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [18:50:49] (03PS3) 10Dzahn: admins: add shell account for Reuven Lazarus [puppet] - 10https://gerrit.wikimedia.org/r/543204 (https://phabricator.wikimedia.org/T235215) [18:53:46] (03CR) 10Dzahn: [C: 03+2] admins: add shell account for Reuven Lazarus [puppet] - 10https://gerrit.wikimedia.org/r/543204 (https://phabricator.wikimedia.org/T235215) (owner: 10Dzahn) [18:55:17] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): maps1002: Failed power supply - https://phabricator.wikimedia.org/T235406 (10wiki_willy) Awesome, thanks @Jclark-ctr [18:59:14] (03CR) 10Krinkle: [C: 03+1] logstash: apply truncate filter to all fields [puppet] - 10https://gerrit.wikimedia.org/r/543467 (owner: 10Herron) [19:00:04] longma: That opportune time is upon us again. Time for a MediaWiki train - American version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191016T1900). 
[19:01:03] (03CR) 10DannyS712: "Question: if you take a look at, eg, conf-labs-aawiki, there are both 'groupoverrides' and 'groupoverrides2' - the latter was (as far as I" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [19:03:39] (03CR) 10DannyS712: [C: 04-1] Drop the 'inactive' user rights grant, no longer around post-DisableAccount (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462592 (https://phabricator.wikimedia.org/T158594) (owner: 10Jforrester) [19:04:19] ACKNOWLEDGEMENT - HP RAID on db2067 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:1 - OK: 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:9 - Predictive Failure: 1I:1:8 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T235695 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathe [19:04:22] 10Operations, 10ops-codfw: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T235695 (10ops-monitoring-bot) [19:04:49] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10wiki_willy) @Andrew or @Bstorm - are you ok with us taking the machine down to troubleshoot? 
Thanks, Willy [19:05:00] (03PS1) 10Jeena Huneidi: group1 wikis to 1.35.0-wmf.2 refs T233850 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543662 [19:05:02] (03CR) 10Jeena Huneidi: [C: 03+2] group1 wikis to 1.35.0-wmf.2 refs T233850 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543662 (owner: 10Jeena Huneidi) [19:05:10] !log joal@deploy1001 Started deploy [analytics/aqs/deploy@59a97fa]: Regular analytics weekly train (top-mediarequest endpoint) [19:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:57] (03Merged) 10jenkins-bot: group1 wikis to 1.35.0-wmf.2 refs T233850 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543662 (owner: 10Jeena Huneidi) [19:06:18] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10Dzahn) [19:06:28] !log joal@deploy1001 Finished deploy [analytics/aqs/deploy@59a97fa]: Regular analytics weekly train (top-mediarequest endpoint) (duration: 01m 18s) [19:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:22] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.35.0-wmf.2 refs T233850 [19:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:26] T233850: 1.35.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T233850 [19:08:21] !log jhuneidi@deploy1001 Synchronized php: group1 wikis to 1.35.0-wmf.2 refs T233850 (duration: 00m 59s) [19:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:51] (03PS1) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [19:09:34] (03CR) 10jerkins-bot: [V: 04-1] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [19:09:48] (03PS2) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [19:10:36] (03CR) 10jerkins-bot: [V: 04-1] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [19:11:42] (03PS3) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [19:12:27] (03CR) 10jerkins-bot: [V: 04-1] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [19:13:05] !log joal@deploy1001 Started deploy [analytics/aqs/deploy@59a97fa]: Regular analytics weekly train - try 2 after fix [19:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:54] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Andrew) >>! In T230289#5581714, @wiki_willy wrote: > @Andrew or @Bstorm - are you ok with us taking the machine down to troubleshoot? Thanks, Willy... 
[19:14:56] (03CR) 10Jforrester: "> Patch Set 19:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [19:16:32] (03PS4) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [19:18:59] !log joal@deploy1001 Finished deploy [analytics/aqs/deploy@59a97fa]: Regular analytics weekly train - try 2 after fix (duration: 05m 53s) [19:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:57] (03CR) 10Herron: "will need to include ::profile::rsyslog::udp_localhost_compat somewhere (maybe in role::aqs?) to plumb rsyslog with the localhost:10514/UD" [puppet] - 10https://gerrit.wikimedia.org/r/543278 (https://phabricator.wikimedia.org/T219928) (owner: 10Elukey) [19:26:43] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10RLazarus) [19:30:59] !log jhuneidi@deploy1001 Pruned MediaWiki: 1.34.0-wmf.25 (duration: 03m 24s) [19:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr Hi @Andrew - apologies for the delay. Chris has been out, but @Jclark-ctr is g... 
[19:33:23] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [19:35:13] !log upgrading Cassandra to 3.11.4, codfw, rack c -- T200803 [19:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:17] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [19:36:28] (03CR) 10Marostegui: [C: 04-1] "You have to also add the host to db-eqiad.php for consistency (like we do with all the DBs)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [19:38:37] (03CR) 10Herron: [C: 03+1] "Overall I'm for it! Please see one question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543192 (https://phabricator.wikimedia.org/T234564) (owner: 10Hashar) [19:40:37] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001:9501 job=burrow partition={0,1,2} site=eqiad topic=udp_localhost-warning https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-c [19:40:37] iad&var-topic=All&var-consumer_group=All [19:42:16] oh fun, having a look [19:43:23] (03PS5) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [19:46:58] (03CR) 10Hashar: logstash: raise elasticsearch mapping limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543192 (https://phabricator.wikimedia.org/T234564) (owner: 10Hashar) [19:50:25] !log joal@deploy1001 Started deploy [analytics/refinery@1704fdd]: Regular 
analytics weekly train [19:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:38] 10Operations, 10Wikimedia-Mailing-lists: Create wikimedia sustainability mailing list - https://phabricator.wikimedia.org/T234999 (10mepps) @jijiki Let's go with sustainability@. Thanks! [19:54:18] (03PS1) 10Zoranzoki21: Enable transwiki import from other Wikipedias on srwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543686 (https://phabricator.wikimedia.org/T235419) [19:54:57] (03CR) 10Cwhite: "This looks good to me as is." [puppet] - 10https://gerrit.wikimedia.org/r/543467 (owner: 10Herron) [19:56:12] seeing a spike in resourceloader errors https://logstash.wikimedia.org/goto/93418b7b98e46622c2959426d0e85687 [19:57:27] which would be using the kafka topic that alerted above (udp_localhost-warning) [19:58:31] (03CR) 10DannyS712: "> > Patch Set 19:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [19:58:55] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Reuven Lazarus - https://phabricator.wikimedia.org/T235215 (10RLazarus) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191016T2000). 
[20:00:20] !log upgrading Cassandra to 3.11.4, codfw, rack d -- T200803 [20:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:23] T200803: Upgrade Cassandra 3.11.2 clusters to 3.11.4 (bugfix release) - https://phabricator.wikimedia.org/T200803 [20:01:12] no parsoid deploy today [20:07:32] !log joal@deploy1001 Finished deploy [analytics/refinery@1704fdd]: Regular analytics weekly train (duration: 17m 06s) [20:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:07] (03PS2) 10Herron: logstash: apply truncate filter to all fields [puppet] - 10https://gerrit.wikimedia.org/r/543467 [20:09:51] (03CR) 10Herron: "> This looks good to me as is." [puppet] - 10https://gerrit.wikimedia.org/r/543467 (owner: 10Herron) [20:10:16] (03PS8) 10Ottomata: Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) [20:10:32] (03PS9) 10Ottomata: Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) [20:15:58] (03CR) 10Ottomata: [C: 03+2] Upload versioned Spark assembly file to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/543474 (https://phabricator.wikimedia.org/T222253) (owner: 10Ottomata) [20:16:16] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.2/includes/resourceloader/ResourceLoaderStartUpModule.php: Expose StartupModule::getConfigSettings for internal use T235350 T229836 (duration: 00m 59s) [20:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:21] T229836: Track module registry size over time - https://phabricator.wikimedia.org/T229836 [20:16:21] (03PS6) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [20:16:21] T235350: Consider moving site-wide config data 
out of the startup module - https://phabricator.wikimedia.org/T235350 [20:16:33] 10Operations, 10Growth-Team, 10Notifications, 10serviceops, and 3 others: Provision Kask for Echo timestamp storage in k8s - https://phabricator.wikimedia.org/T234376 (10Eevans) [20:17:12] 10Operations, 10Growth-Team, 10Notifications, 10serviceops, and 2 others: Dashboards for monitoring of echostore - https://phabricator.wikimedia.org/T235558 (10Eevans) [20:19:53] (03PS1) 10Ottomata: Strip newline off of spark version in fact [puppet] - 10https://gerrit.wikimedia.org/r/543690 (https://phabricator.wikimedia.org/T222253) [20:20:33] (03CR) 10CDanis: [C: 03+1] "I think this will do what you want, but please do test on mwdebug in eqiad and codfw with some requests for production wikis. It's too ea" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [20:20:44] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Strip newline off of spark version in fact [puppet] - 10https://gerrit.wikimedia.org/r/543690 (https://phabricator.wikimedia.org/T222253) (owner: 10Ottomata) [20:24:32] 10Operations, 10ops-eqiad, 10media-storage, 10User-fgiunchedi: ms-be1020 - firmware upgrade: (was: host went down) - https://phabricator.wikimedia.org/T234698 (10wiki_willy) a:03Cmjohnson [20:25:07] (03PS7) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [20:25:37] 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops, and 2 others: Triage and resolve all outstanding Netbox report errors - https://phabricator.wikimedia.org/T223450 (10wiki_willy) a:05wiki_willy→03RobH [20:25:53] (03CR) 10jerkins-bot: [V: 04-1] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [20:26:06] (03PS2) 
10Eevans: cassandra: Pin Cassandra packages to version 3.11.4 [puppet] - 10https://gerrit.wikimedia.org/r/543494 (https://phabricator.wikimedia.org/T200803) [20:26:18] (03CR) 10Herron: [C: 03+1] logstash: raise elasticsearch mapping limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/543192 (https://phabricator.wikimedia.org/T234564) (owner: 10Hashar) [20:27:16] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Core Platform Team, 10Editing-team, and 2 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10Reedy) So this didn't work `lines=15 reedy@mwmaint... [20:30:40] (03PS8) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [20:31:11] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Core Platform Team, 10Editing-team, and 2 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10Reedy) So in theory, we can work around this with s... 
[20:31:22] (03CR) 10jerkins-bot: [V: 04-1] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [20:32:46] (03PS9) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [20:33:31] (03CR) 10jerkins-bot: [V: 04-1] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [20:35:24] (03PS5) 10Eevans: cassandra config updates for 3.11.4 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/540948 (https://phabricator.wikimedia.org/T200803) [20:37:49] (03PS2) 10Eevans: echostore: create new staging deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/543212 (https://phabricator.wikimedia.org/T234376) [20:38:40] (03PS10) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [20:39:26] (03CR) 10jerkins-bot: [V: 04-1] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [20:39:28] (03CR) 10Eevans: [V: 03+2 C: 03+2] echostore: create new staging deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/543212 (https://phabricator.wikimedia.org/T234376) (owner: 10Eevans) [20:41:48] (03PS11) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [20:42:44] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'echostore' for release 'staging' . 
[20:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:03] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Core Platform Team, 10Editing-team, and 2 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10Reedy) Fun times ` Copied 900 captchas to storage... [20:47:15] (03CR) 10CDanis: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [20:52:28] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 236.6 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [20:56:00] the increase in logstash traffic correlates quite well with the mediawiki train rollout: Synchronized php: group1 wikis to 1.35.0-wmf.2 refs T233850 (duration: 00m 59s) [20:56:00] T233850: 1.35.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T233850 [20:56:06] at 19:08 [20:56:30] (03PS12) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [20:56:43] longma: have you taken a look at logstash? [20:56:56] any idea what the new messages are? [20:56:59] I'm just about to start [20:57:15] (03CR) 10jerkins-bot: [V: 04-1] labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) (owner: 10Andrew Bogott) [20:59:06] cdanis: do you mean on the new errors dashboard or the fatal monitor? 
[20:59:32] longma: neither, there aren't new errors AFAICT, but the total volume of log messages emitted by mediawiki has gone up by a factor of 2.5 or so [20:59:57] There was a blameStartupRegistry issue but I believe it's been fixed now [21:00:02] more like 3x https://logstash.wikimedia.org/goto/9898d4fdaffc64e93cada889a113bd19 [21:00:15] this is the new errors dashboard, but omitting the "only error/exception/fatal channels" filter [21:00:20] (03PS13) 10Andrew Bogott: labtestwiki: move to a wmcs-hosted database on clouddb2001-dev [mediawiki-config] - 10https://gerrit.wikimedia.org/r/543664 (https://phabricator.wikimedia.org/T233236) [21:02:42] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [21:03:54] herron: shdubsh: do you have some idea of how much more logging is too much more logging for logstash? [21:05:27] cdanis: did you see already the resourceloader warnings? [21:05:48] https://logstash.wikimedia.org/goto/9fb4c2dca052ffc0bd6b5ef11e90b2d6 [21:05:58] herron: yeah, I just found them :) [21:06:15] it's all one normalized_message, even [21:06:28] ha! great, yeah [21:07:13] that seems to correlate with the kafka alert on codfw consumer lag [21:07:52] (03PS1) 10Eevans: echostore: create production deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/543699 (https://phabricator.wikimedia.org/T234376) [21:08:04] herron: yeah, 3x the log traffic... 
[21:08:30] yeah [21:08:44] although offhand I’m not sure why eqiad is not lagging, maybe latency since both are consuming from eqiad [21:09:17] (03CR) 10Eevans: [V: 03+2 C: 03+2] echostore: create production deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/543699 (https://phabricator.wikimedia.org/T234376) (owner: 10Eevans) [21:10:32] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team: MW 1.35.0-wmf.2 has excessive logging from ResourceLoader - https://phabricator.wikimedia.org/T235711 (10CDanis) [21:10:41] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team: MW 1.35.0-wmf.2 has excessive logging from ResourceLoader - https://phabricator.wikimedia.org/T235711 (10CDanis) p:05Triage→03High [21:10:59] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'echostore' for release 'production' . [21:11:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:10] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team: MW 1.35.0-wmf.2 has excessive logging from ResourceLoader - https://phabricator.wikimedia.org/T235711 (10Krinkle) [21:11:46] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team: MW 1.35.0-wmf.2 has excessive logging from ResourceLoader - https://phabricator.wikimedia.org/T235711 (10Krinkle) Already spotted by train operator. Fixed since: * * 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team: MW 1.35.0-wmf.2 has excessive logging from ResourceLoader - https://phabricator.wikimedia.org/T235711 (10CDanis) [21:12:33] Krinkle: ah, nice :) [21:13:45] cdanis: for my future deploys is this dashboard a good one to monitor how much logging has increased? 
https://logstash.wikimedia.org/goto/9898d4fdaffc64e93cada889a113bd19 [21:15:14] longma: I'd probably go with the simpler https://logstash.wikimedia.org/goto/722b63ca870500cd96efa4cb09e1d4b7 [21:15:45] in addition to Kibana/logstash I’d suggest this grafana dashboard https://grafana.wikimedia.org/d/000000102/production-logging [21:16:25] thanks herron & cdanis [21:16:35] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Patch-For-Review: [1.35.0-wmf.2] Excessive "Module {name} not loadable on target mobile " logging from ResourceLoader - https://phabricator.wikimedia.org/T235711 (10Krinkle) [21:16:42] 10Operations, 10MediaWiki-ResourceLoader, 10Performance-Team, 10Patch-For-Review: [1.35.0-wmf.2] Excessive "Module {name} not loadable on target mobile " logging from ResourceLoader - https://phabricator.wikimedia.org/T235711 (10Krinkle) a:03Krinkle [21:22:20] 10Operations, 10Growth-Team, 10Notifications, 10serviceops, and 2 others: Provision Kask for Echo timestamp storage in k8s - https://phabricator.wikimedia.org/T234376 (10Eevans) I'm unable to deploy to codfw; I'm seeing the following: ` $ kubectl get events LAST SEEN TYPE REASON KIND... [21:25:28] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Jclark-ctr) Dell EMC SR # 1000122167 || Service Tag: 31R9KH2 || Server Crashes under Load opened SR with Dell forward TS... [21:27:29] (03PS1) 10Jhedden: toolforge: harvest replicas for current user account state [puppet] - 10https://gerrit.wikimedia.org/r/543706 (https://phabricator.wikimedia.org/T235697) [21:30:17] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Core Platform Team, 10Editing-team, and 2 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10Reedy) ` reedy@mwmaint1002:~$ ./captchaloop.sh Gen... 
[21:30:34] (03PS1) 10Reedy: Workaround for GenerateFancyCaptcha not running as expected in prod [puppet] - 10https://gerrit.wikimedia.org/r/543707 (https://phabricator.wikimedia.org/T230245) [21:31:40] (03CR) 10SBassett: [C: 03+1] Workaround for GenerateFancyCaptcha not running as expected in prod [puppet] - 10https://gerrit.wikimedia.org/r/543707 (https://phabricator.wikimedia.org/T230245) (owner: 10Reedy) [21:31:50] (03PS1) 10Herron: logstash: apply throttle filter to all log levels [puppet] - 10https://gerrit.wikimedia.org/r/543708 [21:36:10] (03PS1) 10Eevans: echostore: remove affinity (copypasta from sessionstore) [deployment-charts] - 10https://gerrit.wikimedia.org/r/543711 (https://phabricator.wikimedia.org/T234376) [21:36:53] (03CR) 10Eevans: [V: 03+2 C: 03+2] echostore: remove affinity (copypasta from sessionstore) [deployment-charts] - 10https://gerrit.wikimedia.org/r/543711 (https://phabricator.wikimedia.org/T234376) (owner: 10Eevans) [21:37:10] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Core Platform Team, 10Editing-team, and 3 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10Reedy) [21:38:18] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10Core Platform Team, 10Editing-team, and 3 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (10Reedy) [21:39:02] * James_F twiddles thumbs waiting for code to merge. 
[21:39:44] 10Operations, 10ConfirmEdit (CAPTCHA extension), 10wikitech.wikimedia.org: ConfirmEdit seemingly erroneously enabled for some users on wikitech - https://phabricator.wikimedia.org/T233215 (10Reedy) 05Stalled→03Invalid [21:40:17] (03CR) 10BryanDavis: [C: 03+1] toolforge: harvest replicas for current user account state [puppet] - 10https://gerrit.wikimedia.org/r/543706 (https://phabricator.wikimedia.org/T235697) (owner: 10Jhedden) [21:40:58] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:41:34] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'echostore' for release 'production' . [21:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:56] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:42:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:42:42] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:42:54] !log @ helmfile [CODFW] Ran 'sync' command on namespace 
'echostore' for release 'production' . [21:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:43:04] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:43:32] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:43:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:44:02] (03CR) 10Jhedden: [C: 03+2] toolforge: harvest replicas for current user account state [puppet] - 10https://gerrit.wikimedia.org/r/543706 (https://phabricator.wikimedia.org/T235697) (owner: 10Jhedden) [21:44:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:44:16] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:44:37] !log @ helmfile [CODFW] Ran 'sync' command on namespace 'echostore' for release 'production' . [21:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:45:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:46:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:46:10] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:47:52] !log @ helmfile [CODFW] Ran 'sync' command on namespace 'echostore' for release 'production' . [21:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:40] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [21:53:53] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.2/extensions/WikiEditor: T235701 Revert removal of jquery.tabIndex (duration: 00m 59s) [21:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:57] T235701: Opening a dialog (f.e. link dialog) in 2010 wikitext editor results in JS “setTabindexes is not a function” error - https://phabricator.wikimedia.org/T235701 [21:55:27] (CR) Dzahn: Workaround for GenerateFancyCaptcha not running as expected in prod (1 comment) [puppet] - https://gerrit.wikimedia.org/r/543707 (https://phabricator.wikimedia.org/T230245) (owner: Reedy) [21:57:31] (CR) Cwhite: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/543467 (owner: Herron) [21:59:26] (CR) Cwhite: [C: +1] "IMHO this is a reasonable mitigation." 
[puppet] - https://gerrit.wikimedia.org/r/543708 (owner: Herron) [22:00:07] Operations, Growth-Team, Notifications, serviceops, and 3 others: Provision Kask for Echo timestamp storage in k8s - https://phabricator.wikimedia.org/T234376 (Eevans) From a conversation w/ @Joe on IRC, it seems the `nodeAffinity` section (copypasta from the sessionstore deployment) was likely c... [22:00:44] Operations, ops-eqiad, cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (Jclark-ctr) Cleared Foreign state on ofline drives. offline drives now list as ready [22:02:57] (CR) Reedy: Workaround for GenerateFancyCaptcha not running as expected in prod (1 comment) [puppet] - https://gerrit.wikimedia.org/r/543707 (https://phabricator.wikimedia.org/T230245) (owner: Reedy) [22:10:12] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [22:12:38] (CR) Dzahn: "puppet part looks alright https://puppet-compiler.wmflabs.org/compiler1001 /18895/mwmaint1002.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/543707 (https://phabricator.wikimedia.org/T230245) (owner: Reedy) [22:13:14] Operations, netbox: update librenms report - https://phabricator.wikimedia.org/T235716 (RobH) [22:13:15] (PS2) Reedy: Workaround for GenerateFancyCaptcha not running as expected in prod [puppet] - https://gerrit.wikimedia.org/r/543707 (https://phabricator.wikimedia.org/T230245) [22:14:40] Operations, netbox: update librenms report - https://phabricator.wikimedia.org/T235716 (crusnov) p:Triage→Normal [22:21:19] Operations, MediaWiki-ResourceLoader, Performance-Team, MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), Patch-For-Review: [1.35.0-wmf.2] Excessive "Module {name} not loadable on target mobile " logging from ResourceLoader - https://phabricator.wikimedia.org/T235711 
(Krinkle) Open→Resolved [22:27:28] (CR) CDanis: [C: +1] logstash: apply throttle filter to all log levels [puppet] - https://gerrit.wikimedia.org/r/543708 (owner: Herron) [22:28:59] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.2/includes/OutputPage.php: T235711 Lower severity of targets violation back to DEBUG (duration: 00m 59s) [22:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:04] T235711: [1.35.0-wmf.2] Excessive "Module {name} not loadable on target mobile " logging from ResourceLoader - https://phabricator.wikimedia.org/T235711 [22:29:42] James_F: thanks! already looks better [22:30:35] https://logstash.wikimedia.org/goto/14be320f1c2934077e8e883e93abf894 :) [22:30:41] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.2/resources/src/mediawiki.special/contributions.less: T235137 Don't apply styling for Special:Contributions on other pages (duration: 00m 59s) [22:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:45] T235137: "Edit watchlist" missing namespace designations - https://phabricator.wikimedia.org/T235137 [22:31:13] cdanis: Sorry it took so long to fix; CI choked for 30 mins on one of the simple bits of infrastructure and just didn't do anything. :-( [22:31:41] James_F: :( ooof. [22:31:53] Yeah, something broke in the composer fetch. Oh well. 
[22:31:56] !log mwmaint1002 - running generate-fancy-captcha-loop to work around issue with generate-captcha cron (T230245) [22:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:00] T230245: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 [22:42:43] !log Zuul: Add composer-php72-docker for wikimedia-cz/web-theme and wikimedia-cz/web-plugin [22:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:47] Operations, ConfirmEdit (CAPTCHA extension), Core Platform Team, Editing-team, and 3 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (Dzahn) I manually ran the script by @Reedy (thanks!... [22:47:57] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'echostore' for release 'production' . [22:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:32] (PS3) Dzahn: Workaround for GenerateFancyCaptcha not running as expected in prod [puppet] - https://gerrit.wikimedia.org/r/543707 (https://phabricator.wikimedia.org/T230245) (owner: Reedy) [22:49:06] (CR) Dzahn: [C: +2] "downloaded script from this change and ran it on mwmaint1002 as www-data. 
https://phabricator.wikimedia.org/T230245#5582500" [puppet] - https://gerrit.wikimedia.org/r/543707 (https://phabricator.wikimedia.org/T230245) (owner: Reedy) [22:53:02] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 119.2 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [22:59:12] Operations, ConfirmEdit (CAPTCHA extension), Core Platform Team, Editing-team, and 3 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (Reedy) Also tagging #performance-team and more spec... [22:59:43] stashbot: now [22:59:43] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [22:59:54] jouncebot: now [22:59:54] No deployments scheduled for the next 0 hour(s) and 0 minute(s) [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191016T2300). Please do the needful. [23:00:04] Zoranzoki21: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:05] Operations, ConfirmEdit (CAPTCHA extension), Core Platform Team, Editing-team, and 3 others: Mediawiki maintenance job "generate-fancycaptcha" - fatal error when trying to copy new captchas to storage - https://phabricator.wikimedia.org/T230245 (Dzahn) p:High→Normal Workaround merged and... 
[23:00:24] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.1/resources/src/mediawiki.special/contributions.less: T235137 Don't apply styling for Special:Contributions on other pages (duration: 00m 59s) [23:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:28] T235137: "Edit watchlist" missing namespace designations - https://phabricator.wikimedia.org/T235137 [23:00:44] Hi, I am here for SWAT :) [23:00:50] Zoranzoki21: hi, I can SWAT today! [23:00:56] James_F: you just synced, am I free to swat now? [23:02:34] and you also created /srv/mediawiki-stagging/q, what's that? :) [23:03:04] Urbanecm: Yeah, go ahead. [23:03:15] thanks [23:03:18] And whoops, accidental log output instead of quit. [23:03:18] (CR) Urbanecm: [C: +2] Enable transwiki import from other Wikipedias on srwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/543686 (https://phabricator.wikimedia.org/T235419) (owner: Zoranzoki21) [23:03:34] aha :) [23:04:23] (Merged) jenkins-bot: Enable transwiki import from other Wikipedias on srwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/543686 (https://phabricator.wikimedia.org/T235419) (owner: Zoranzoki21) [23:04:28] (PS1) Eevans: echostore: fixup Cassandra contact list [deployment-charts] - https://gerrit.wikimedia.org/r/543731 (https://phabricator.wikimedia.org/T234376) [23:04:45] (CR) Eevans: [V: +2 C: +2] echostore: fixup Cassandra contact list [deployment-charts] - https://gerrit.wikimedia.org/r/543731 (https://phabricator.wikimedia.org/T234376) (owner: Eevans) [23:05:01] (CR) Dzahn: [C: +1] DNS: Remove mgmt DNS for db2051,db2056 and db2068 [dns] - https://gerrit.wikimedia.org/r/543484 (owner: Papaul) [23:05:25] Zoranzoki21: syncing your command [23:05:27] *patch [23:05:44] Urbanecm: Should I test it with mwdebug? [23:05:44] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'echostore' for release 'production' . 
[23:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:57] looks safe enough, I'm syncing directly [23:06:17] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 96c87c7: Enable transwiki import from other Wikipedias on srwikisource (T235419) (duration: 00m 58s) [23:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:20] done! [23:06:21] T235419: Enable transwiki import from other Wikipedias on srwikisource - https://phabricator.wikimedia.org/T235419 [23:06:48] Urbanecm: Yes, thank you! I tested and can confirm to works [23:06:52] good! [23:07:46] (PS1) Urbanecm: New throttle rule for WMCL editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/543732 (https://phabricator.wikimedia.org/T235693) [23:09:11] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'echostore' for release 'production' . [23:09:12] (CR) Urbanecm: [C: +2] New throttle rule for WMCL editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/543732 (https://phabricator.wikimedia.org/T235693) (owner: Urbanecm) [23:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:14] (PS1) Zoranzoki21: Remove expired throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/543733 [23:09:44] (PS2) Zoranzoki21: Remove expired throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/543733 [23:09:59] (Merged) jenkins-bot: New throttle rule for WMCL editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/543732 (https://phabricator.wikimedia.org/T235693) (owner: Urbanecm) [23:10:03] (PS1) Urbanecm: Change logo for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/543734 (https://phabricator.wikimedia.org/T235710) [23:10:15] (CR) Zoranzoki21: "Follow-up patch created: 5c62f11" [mediawiki-config] - https://gerrit.wikimedia.org/r/543732 (https://phabricator.wikimedia.org/T235693) 
(owner: Urbanecm) [23:10:18] (CR) Urbanecm: [C: +2] Change logo for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/543734 (https://phabricator.wikimedia.org/T235710) (owner: Urbanecm) [23:10:58] Urbanecm: And 543733 too [23:11:06] will do in a sec [23:11:07] (Merged) jenkins-bot: Change logo for azwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/543734 (https://phabricator.wikimedia.org/T235710) (owner: Urbanecm) [23:11:31] (CR) Urbanecm: [C: +2] Remove expired throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/543733 (owner: Zoranzoki21) [23:11:45] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: 6dc4c0c: New throttle rule for WMCL editathon (T235693) (duration: 00m 59s) [23:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:49] T235693: Lift IP limit - WMCL Editathon 2019-10-25 - https://phabricator.wikimedia.org/T235693 [23:11:53] Cool! [23:12:20] (Merged) jenkins-bot: Remove expired throttle rules [mediawiki-config] - https://gerrit.wikimedia.org/r/543733 (owner: Zoranzoki21) [23:13:05] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for db2051,db2056 and db2068 [dns] - https://gerrit.wikimedia.org/r/543484 (owner: Papaul) [23:13:31] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: 9c5bcd8: Change logo for azwiki (T235710) (duration: 00m 59s) [23:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:35] T235710: Azerbaijani Wikipedia logo change request - https://phabricator.wikimedia.org/T235710 [23:13:44] (PS2) Papaul: DNS: Remove mgmt DNS for db2051,db2056 and db2068 [dns] - https://gerrit.wikimedia.org/r/543484 [23:14:13] !log Purge https://en.wikipedia.org/static/images/project-logos/azwiki.png (T235710) [23:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:22] PROBLEM - Check whether ferm is active by checking the 
default input chain on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:14:23] !log Purge https://en.wikipedia.org/static/images/project-logos/azwiki-2x.png (T235710) [23:14:26] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [23:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:26] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [23:14:36] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [23:14:41] !log Purge https://en.wikipedia.org/static/images/project-logos/azwiki-1.5x.png (T235710) [23:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:02] Ahh this stat1007 [23:15:12] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [23:15:18] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [23:15:22] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:24] PROBLEM - Check size of conntrack table on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [23:15:28] PROBLEM - Disk 
space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [23:15:50] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:16:52] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [23:16:56] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:17:00] RECOVERY - Check size of conntrack table on stat1007 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [23:17:02] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1007&var-datasource=eqiad+prometheus/ops [23:17:07] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (Papaul) [23:17:19] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Papaul) [23:17:22] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2051.codfw.wmnet - https://phabricator.wikimedia.org/T230778 (Papaul) Open→Resolved Complete [23:17:34] RECOVERY - Check whether ferm is active by checking the default input chain on stat1007 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:17:36] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [23:17:36] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [23:17:45] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: Clean expired rules (duration: 00m 58s) [23:17:46] RECOVERY - DPKG on stat1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [23:17:47] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (Papaul) [23:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:54] !log Evening SWAT done [23:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:58] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (Papaul) Complete [23:18:32] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Papaul) [23:18:34] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2056.codfw.wmnet - https://phabricator.wikimedia.org/T230777 (Papaul) Open→Resolved [23:19:17] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (Papaul) [23:19:25] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Papaul) [23:19:28] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: Decommission db2068.codfw.wmnet - https://phabricator.wikimedia.org/T235399 (Papaul) Open→Resolved Complete [23:20:47] Operations, ops-esams, Traffic: rack/setup/install cp30[50-65].esams.wmnet - https://phabricator.wikimedia.org/T233242 (Papaul) 
p:Normal→High [23:21:19] Operations, ops-codfw: Recycle Cisco old servers - https://phabricator.wikimedia.org/T235669 (Papaul) p:Triage→Normal [23:21:24] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [23:25:46] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers