[00:00:03] (03CR) 10CRusnov: [C: 03+1] "Just a thought inline, otherwise given that in mind it LGTM." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/571998 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [00:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200220T0000). [00:00:04] tgr and mooeypoo: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:17] o/ I'm here [00:00:23] o/ [00:00:26] Hopefully not breaking anything with a config change [00:00:59] 10Operations, 10Wikimedia-Logstash, 10Wikimedia-Incident: Logstash missing most messages from mediawiki (Aug 2019) - https://phabricator.wikimedia.org/T230847 (10Krinkle) Thx, I've updated the [incident page](https://wikitech.wikimedia.org/wiki/Incident_documentation/20190820-logstash). I'll close this now t... [00:01:34] (03CR) 10CRusnov: [C: 03+1] "LGTM, good to have this refactored." [cookbooks] - 10https://gerrit.wikimedia.org/r/571999 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [00:02:50] (03CR) 10CRusnov: [C: 03+1] "continues to LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/567169 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [00:03:33] Anyone deploying? [00:04:04] 10Operations, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Dzahn) a:03Dzahn [00:04:09] (03CR) 10CRusnov: [C: 03+1] "Linguistic nit, but otherwise good." 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/571997 (https://phabricator.wikimedia.org/T231068) (owner: 10Volans) [00:05:30] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [00:08:27] (03PS2) 10Gergő Tisza: Enable password-reset (requireemail pref) on test WD and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573397 (https://phabricator.wikimedia.org/T245660) (owner: 10Samwilson) [00:10:32] (03CR) 10Gergő Tisza: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573397 (https://phabricator.wikimedia.org/T245660) (owner: 10Samwilson) [00:10:34] (03CR) 10Papaul: [C: 03+2] DNS: ADD mgmt DNS for frdb2001, payments200[1-3]-a [dns] - 10https://gerrit.wikimedia.org/r/573358 (owner: 10Papaul) [00:10:58] (03PS3) 10Papaul: DNS: ADD mgmt DNS for frdb2001, payments200[1-3]-a [dns] - 10https://gerrit.wikimedia.org/r/573358 [00:11:09] (03CR) 10Papaul: [V: 03+2 C: 03+2] DNS: ADD mgmt DNS for frdb2001, payments200[1-3]-a [dns] - 10https://gerrit.wikimedia.org/r/573358 (owner: 10Papaul) [00:11:37] (03Merged) 10jenkins-bot: Enable password-reset (requireemail pref) on test WD and Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573397 (https://phabricator.wikimedia.org/T245660) (owner: 10Samwilson) [00:13:23] 10Operations, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Dzahn) [00:15:06] 10Operations, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Dzahn) All hosts have roles, weight 30, have been pooled and changed to status "Active" in Netbox. ` {"mw1349.eqiad.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=eqiad... 
[00:15:45] (03PS2) 10Gergő Tisza: Allow non-autoconfirmed users to propose OAuth apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571860 (https://phabricator.wikimedia.org/T213760) [00:16:00] 10Operations, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Dzahn) 05Open→03Resolved [00:16:01] (03CR) 10Gergő Tisza: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571860 (https://phabricator.wikimedia.org/T213760) (owner: 10Gergő Tisza) [00:16:41] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:573397|Enable password-reset (requireemail pref) on test WD and Commons (T245660)]] (duration: 01m 03s) [00:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:45] T245660: PRU: enable pru on test wikidata - https://phabricator.wikimedia.org/T245660 [00:16:48] mooeypoo: it's live [00:16:58] (03Merged) 10jenkins-bot: Allow non-autoconfirmed users to propose OAuth apps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/571860 (https://phabricator.wikimedia.org/T213760) (owner: 10Gergő Tisza) [00:17:30] Thanks!! 
[00:22:15] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:571860|Allow non-autoconfirmed users to propose OAuth apps (T213760)]] (duration: 01m 04s) [00:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:19] T213760: Rethink autoconfirmed requirement for OAuth - https://phabricator.wikimedia.org/T213760 [00:39:28] (03PS1) 10Jdlrobson: Mobile logo should fall back to PNG if no SVG support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573419 (https://phabricator.wikimedia.org/T232140) [00:39:42] (03CR) 10jerkins-bot: [V: 04-1] Mobile logo should fall back to PNG if no SVG support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573419 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [00:46:55] (03CR) 10VolkerE: [C: 04-1] "I don't think that's a) really necessary regarding our browser support and b) more maintenance burden than useful… See task" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573419 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [01:00:04] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Phabricator update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200220T0100). [01:10:48] (03CR) 10Krinkle: Merge $wgLogo into $wgLogos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [01:13:04] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [01:22:44] (03PS1) 10Bstorm: cloudstore: Update the nfs_hostlist script [puppet] - 10https://gerrit.wikimedia.org/r/573422 (https://phabricator.wikimedia.org/T224582) [01:25:47] (03CR) 10Bstorm: "This is why there's a hostlist backend for cumin in cloud only. It's for this script, and I totally forgot about it (because I don't thin" [puppet] - 10https://gerrit.wikimedia.org/r/573422 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [01:29:04] (03PS6) 10Bstorm: cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) [01:29:19] (03PS6) 10Bstorm: labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832 [02:58:00] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [02:59:43] 10Operations, 10ops-codfw, 10fundraising-tech-ops: codfw: rack/setup/install 3 new payments server for frack - https://phabricator.wikimedia.org/T244169 (10Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member "ge-[0-1]/0/8"; - member "ge-[0-1]/0/9"; - mem... 
[03:01:01] 10Operations, 10ops-codfw, 10fundraising-tech-ops: codfw: rack/setup/install 3 new payments server for frack - https://phabricator.wikimedia.org/T244169 (10Papaul) [03:03:54] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 52.45 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:06:00] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [03:13:51] 10Operations, 10ops-codfw, 10DC-Ops: (ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range vlan-fundraising] member "ge-[0-1]/0/20" { ... } + member "ge-[0-1]/0/23"; [edit interfaces interface-r... [03:15:12] 10Operations, 10ops-codfw, 10DC-Ops: (ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Papaul) [03:25:03] 10Operations, 10ops-codfw: Alert for device ps1-a8-codfw.mgmt.codfw.wmnet - Device rebooted - https://phabricator.wikimedia.org/T245164 (10Papaul) 05Open→03Resolved We replaced the PDU, all good [03:42:18] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [04:57:46] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 23.47 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:01:58] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [05:55:19] 10Operations, 10ops-codfw, 10DBA: es2022 power supply issues - https://phabricator.wikimedia.org/T245714 (10Marostegui) [05:55:37] 10Operations, 10ops-codfw, 10DBA: es2022 power supply issues - https://phabricator.wikimedia.org/T245714 (10Marostegui) p:05Triage→03Medium [05:56:10] ACKNOWLEDGEMENT - IPMI Sensor Status on es2022 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] Marostegui https://phabricator.wikimedia.org/T245714 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [05:57:14] 10Operations, 10ops-codfw, 10DBA: es2022 power supply issues - https://phabricator.wikimedia.org/T245714 (10Marostegui) [05:59:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1087 on s8, db1099:3318 back to its original weight', diff saved to https://phabricator.wikimedia.org/P10462 and previous config saved to /var/cache/conftool/dbconfig/20200220-055943-marostegui.json [05:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:00] (03PS1) 10Marostegui: db1087: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/573436 (https://phabricator.wikimedia.org/T232446) [06:02:56] (03CR) 10Marostegui: [C: 03+2] db1087: Enable notifications [puppet] - 
10https://gerrit.wikimedia.org/r/573436 (https://phabricator.wikimedia.org/T232446) (owner: 10Marostegui) [06:05:36] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [06:09:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3318 to remove revision partitions - T239453', diff saved to https://phabricator.wikimedia.org/P10463 and previous config saved to /var/cache/conftool/dbconfig/20200220-060914-marostegui.json [06:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:19] T239453: Remove partitions from revision table - https://phabricator.wikimedia.org/T239453 [06:10:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1099:3318 this host already had the partitions removed - T239453', diff saved to https://phabricator.wikimedia.org/P10464 and previous config saved to /var/cache/conftool/dbconfig/20200220-061019-marostegui.json [06:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3318 to remove revision partitions - T239453', diff saved to https://phabricator.wikimedia.org/P10465 and previous config saved to /var/cache/conftool/dbconfig/20200220-061213-marostegui.json [06:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:40] !log Remove partitions from db1101:3318 - T239453 [06:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:58] (03PS1) 10Marostegui: db1101: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/573437 (https://phabricator.wikimedia.org/T239453) [06:14:56] (03PS1) 10Marostegui: Revert "dbproxy: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/573438 [06:15:08] (03CR) 10Marostegui: [C: 03+2] db1101: 
Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/573437 (https://phabricator.wikimedia.org/T239453) (owner: 10Marostegui) [06:16:26] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/573438 (owner: 10Marostegui) [06:17:48] !log Repool labsdb1011 [06:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:15] 10Operations, 10DBA: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 (10Marostegui) Data checksum has finished without issues. So I am going to slowly repool this host so it can at least serve some traffic [06:20:34] (03PS1) 10Marostegui: Revert "db1084: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/573439 [06:21:48] (03CR) 10Marostegui: [C: 03+2] Revert "db1084: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/573439 (owner: 10Marostegui) [06:23:55] (03PS1) 10Marostegui: report_users: Remove dbproxy1007 [software] - 10https://gerrit.wikimedia.org/r/573440 [06:24:36] (03CR) 10Marostegui: [C: 03+2] report_users: Remove dbproxy1007 [software] - 10https://gerrit.wikimedia.org/r/573440 (owner: 10Marostegui) [06:24:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084 after crash - T245621', diff saved to https://phabricator.wikimedia.org/P10466 and previous config saved to /var/cache/conftool/dbconfig/20200220-062445-marostegui.json [06:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:50] T245621: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 [06:30:19] (03CR) 10Petrb: [C: 03+1] Update wm-bot hostname [puppet] - 10https://gerrit.wikimedia.org/r/572489 (owner: 10Lucas Werkmeister) [06:31:09] (03PS1) 10Marostegui: mariadb: Productionize es1021 [puppet] - 10https://gerrit.wikimedia.org/r/573441 (https://phabricator.wikimedia.org/T243052) [06:32:18] (03CR) 10Petrb: [C: 03+1] "I was aware that this 
change of hostname will cause this kind of problems but as I don't know who is actively using this feature, I had n" [puppet] - 10https://gerrit.wikimedia.org/r/572489 (owner: 10Lucas Werkmeister) [06:32:34] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize es1021 [puppet] - 10https://gerrit.wikimedia.org/r/573441 (https://phabricator.wikimedia.org/T243052) (owner: 10Marostegui) [06:46:07] !log test trafficserver 8.0.6-rc1 in cp30[64,65] [06:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:19] jouncebot: now [06:49:19] No deployments scheduled for the next 2 hour(s) and 10 minute(s) [06:49:22] jouncebot: nxt [06:49:26] jouncebot: next [06:49:26] In 2 hour(s) and 10 minute(s): m1 primary master database restart (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200220T0900) [06:50:46] (03PS1) 10Addshore: From 4k->6k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573442 (https://phabricator.wikimedia.org/T225057) [06:53:36] (03PS1) 10Marostegui: mariadb: Productionize es1022 [puppet] - 10https://gerrit.wikimedia.org/r/573443 (https://phabricator.wikimedia.org/T243052) [06:55:11] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize es1022 [puppet] - 10https://gerrit.wikimedia.org/r/573443 (https://phabricator.wikimedia.org/T243052) (owner: 10Marostegui) [06:57:04] (03CR) 10Addshore: [C: 03+2] From 4k->6k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573442 (https://phabricator.wikimedia.org/T225057) (owner: 10Addshore) [06:58:26] (03Merged) 10jenkins-bot: From 4k->6k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573442 (https://phabricator.wikimedia.org/T225057) (owner: 10Addshore) [07:00:04] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q6000 (T225057) (duration: 
01m 06s) [07:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:08] T225057: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225057 [07:01:30] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q6000 (T225057) - extra sync for cache issue (duration: 01m 04s) [07:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:01] (03PS1) 10Addshore: From 6k->8k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573446 (https://phabricator.wikimedia.org/T225057) [07:06:12] (03PS7) 10Giuseppe Lavagetto: profile::lvs: use wmflib::fetch [puppet] - 10https://gerrit.wikimedia.org/r/572215 [07:07:20] (03Abandoned) 10Elukey: wikistats: serve the v2 version of the website by default [puppet] - 10https://gerrit.wikimedia.org/r/563508 (https://phabricator.wikimedia.org/T237752) (owner: 10Elukey) [07:08:22] * addshore goes to make a cup of coffee, then will go from 6k to 8k [07:08:43] (03CR) 10Giuseppe Lavagetto: "eqiad load balancers: https://puppet-compiler.wmflabs.org/compiler1001/20917/" [puppet] - 10https://gerrit.wikimedia.org/r/572215 (owner: 10Giuseppe Lavagetto) [07:09:39] (03CR) 10jerkins-bot: [V: 04-1] profile::lvs: use wmflib::fetch [puppet] - 10https://gerrit.wikimedia.org/r/572215 (owner: 10Giuseppe Lavagetto) [07:13:07] (03PS1) 10Marostegui: mariadb: Productionize es1023 [puppet] - 10https://gerrit.wikimedia.org/r/573447 (https://phabricator.wikimedia.org/T243052) [07:13:22] (03CR) 10Addshore: [C: 03+2] From 6k->8k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573446 (https://phabricator.wikimedia.org/T225057) (owner: 10Addshore) [07:14:31] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize es1023 [puppet] - 10https://gerrit.wikimedia.org/r/573447 
(https://phabricator.wikimedia.org/T243052) (owner: 10Marostegui) [07:14:33] (03Merged) 10jenkins-bot: From 6k->8k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573446 (https://phabricator.wikimedia.org/T225057) (owner: 10Addshore) [07:15:53] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q8000 (T225057) (duration: 01m 03s) [07:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:58] T225057: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225057 [07:17:03] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q8000 (T225057) - in case of cache issue (duration: 01m 03s) [07:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:48] (03CR) 10Giuseppe Lavagetto: "codfw https://puppet-compiler.wmflabs.org/compiler1002/20918/" [puppet] - 10https://gerrit.wikimedia.org/r/572215 (owner: 10Giuseppe Lavagetto) [07:20:03] (03CR) 10DannyS712: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/556281 (owner: 10Krinkle) [07:20:12] (03PS1) 10Addshore: From 8k->10k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573448 (https://phabricator.wikimedia.org/T225057) [07:23:06] (03CR) 10Addshore: [C: 03+2] From 8k->10k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573448 (https://phabricator.wikimedia.org/T225057) (owner: 10Addshore) [07:23:27] (03PS1) 10Marostegui: es10123: Make it es5 master [puppet] - 10https://gerrit.wikimedia.org/r/573449 (https://phabricator.wikimedia.org/T243052) [07:24:04] (03Merged) 10jenkins-bot: From 8k->10k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573448 
(https://phabricator.wikimedia.org/T225057) (owner: 10Addshore) [07:24:46] (03CR) 10Giuseppe Lavagetto: "esams https://puppet-compiler.wmflabs.org/compiler1003/20919/" [puppet] - 10https://gerrit.wikimedia.org/r/572215 (owner: 10Giuseppe Lavagetto) [07:25:06] (03CR) 10Marostegui: [C: 03+2] es10123: Make it es5 master [puppet] - 10https://gerrit.wikimedia.org/r/573449 (https://phabricator.wikimedia.org/T243052) (owner: 10Marostegui) [07:25:43] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q10k (was Q8k) (T225057) (duration: 01m 03s) [07:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:48] T225057: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225057 [07:26:30] (03PS8) 10Giuseppe Lavagetto: profile::lvs: use wmflib::fetch [puppet] - 10https://gerrit.wikimedia.org/r/572215 [07:26:54] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q10k (was Q8k) (T225057) - in case of cache issue (duration: 01m 01s) [07:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:17] jouncebot: next [07:29:17] In 1 hour(s) and 30 minute(s): m1 primary master database restart (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200220T0900) [07:32:55] (03PS9) 10Giuseppe Lavagetto: profile::lvs: use wmflib::fetch [puppet] - 10https://gerrit.wikimedia.org/r/572215 [07:34:30] (03PS1) 10Addshore: From 10k->15k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573452 (https://phabricator.wikimedia.org/T225057) [07:41:27] (03PS2) 10ArielGlenn: turn snapshot1010 into an xml dumps testbed [puppet] - 10https://gerrit.wikimedia.org/r/573343 (https://phabricator.wikimedia.org/T241794) [07:43:17] (03CR) 10ArielGlenn: [C: 03+2] turn snapshot1010 into an xml 
dumps testbed [puppet] - 10https://gerrit.wikimedia.org/r/573343 (https://phabricator.wikimedia.org/T241794) (owner: 10ArielGlenn) [07:44:17] (03CR) 10Addshore: [C: 03+2] From 10k->15k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573452 (https://phabricator.wikimedia.org/T225057) (owner: 10Addshore) [07:45:14] (03Merged) 10jenkins-bot: From 10k->15k for items to read from the new store in clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573452 (https://phabricator.wikimedia.org/T225057) (owner: 10Addshore) [07:45:42] icinga will whine about puppet for snapshot1010, just ignore it. thanks [07:45:51] (new host, testbed anyways) [07:46:34] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q15k (was Q10k) (T225057) (duration: 01m 03s) [07:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:39] T225057: Switch `tmpItemTermsMigrationStages` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225057 [07:47:48] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Start reading for the new term store for clients up to Q15k (was Q10k) (T225057) - in case of cache issues (duration: 01m 03s) [07:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:47] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [07:55:29] (03PS3) 10Muehlenhoff: Switch more WMCS systems to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/572196 (https://phabricator.wikimedia.org/T156955) [07:59:05] (03PS1) 10ArielGlenn: add fake keys for snapshot1010.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/573453 [08:01:16] (03PS2) 10ArielGlenn: add fake keys for snapshot1010.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/573453 [08:02:59] PROBLEM - Check systemd state on snapshot1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:04:37] RECOVERY - Check systemd state on snapshot1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:07:30] (03CR) 10Muehlenhoff: [C: 03+2] Switch more WMCS systems to standard Partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/572196 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:11:49] (03CR) 10Muehlenhoff: "Looks good, one comment inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572343 (owner: 10Dzahn) [08:12:05] (03Abandoned) 10Muehlenhoff: Remove system::role from role::noc::site [puppet] - 10https://gerrit.wikimedia.org/r/573246 (owner: 10Muehlenhoff) [08:13:13] (03CR) 10Muehlenhoff: [C: 03+2] Add system::role for role::logging::webrequest::ops [puppet] - 10https://gerrit.wikimedia.org/r/573230 (owner: 10Muehlenhoff) [08:15:33] (03PS1) 10ArielGlenn: add cumin canary alias for the new dumps snapshot host [puppet] - 10https://gerrit.wikimedia.org/r/573460 [08:16:08] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/573460 (owner: 10ArielGlenn) [08:18:03] (03PS2) 10ArielGlenn: add cumin canary alias for the new dumps snapshot host [puppet] - 10https://gerrit.wikimedia.org/r/573460 [08:18:20] (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] add fake keys for snapshot1010.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/573453 (owner: 10ArielGlenn) [08:19:17] (03PS1) 10Marostegui: mariadb: Productionize es1024 [puppet] - 10https://gerrit.wikimedia.org/r/573461 (https://phabricator.wikimedia.org/T243052) [08:20:47] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize es1024 [puppet] - 10https://gerrit.wikimedia.org/r/573461 (https://phabricator.wikimedia.org/T243052) (owner: 10Marostegui) [08:31:07] (03PS10) 10Giuseppe Lavagetto: profile::lvs: use wmflib::fetch [puppet] - 10https://gerrit.wikimedia.org/r/572215 [08:31:09] (03PS1) 10Giuseppe Lavagetto: role::elasticsearch::cloudelastic: use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/573515 [08:31:11] (03PS1) 10Giuseppe Lavagetto: lvs::configuration: drop the lvs service hashes [puppet] - 10https://gerrit.wikimedia.org/r/573516 [08:34:16] (03CR) 10ArielGlenn: [C: 03+2] add cumin canary alias for the new dumps snapshot host [puppet] - 10https://gerrit.wikimedia.org/r/573460 (owner: 10ArielGlenn) [08:35:57] !log Upgrade mysql on db1135 without restart T244238 [08:36:01] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:02] T244238: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 [08:40:25] !log disable puppet and stop bacula service T244238 [08:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:01] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:00:04] marostegui, jynus, and akosiaris: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) m1 primary master database restart deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200220T0900). [09:00:13] ready [09:00:26] !log Restart m1 database master db1135 (etherpad will not be available for around 1 minute) - T244238 [09:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:31] T244238: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 [09:00:44] will it have a version upgrade? [09:00:49] yep [09:01:49] akosiaris jynus, all done [09:02:18] what is m1 current active proxy, just for curiosity? [09:02:31] !log restart etherpad-lite on etherpad1002 T244238 [09:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:42] jynus: 1012 [09:03:41] akosiaris: etherpad working fine [09:03:42] so it caught it [09:03:46] indeed [09:04:05] I am going to test the rest of the services [09:04:58] 25 seconds of unavailability? maybe a bit more? 
[09:05:29] I will check replication unless you did already [09:05:44] that is done [09:05:52] all services seem to be working fine too [09:07:13] db1117 complained at 2020-02-20 9:00:28 but for some reason it didn't log the reconnect [09:07:38] marostegui: unrelated, but db1117 needs the package upgrade [09:07:45] yep, known [09:07:51] all the multi instance do [09:07:57] mentioned it because I saw the error on the log [09:08:09] feel free to update it if you like [09:08:11] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10Marostegui) This was done successfully. Downtime was from 09:00:28 to 09:01:14 [09:08:59] wait, I will start bacula, was leaving it last as it is not time-sensitive [09:09:00] 10Operations, 10DBA, 10Wikimedia-Etherpad: Upgrade and restart m1 master (db1135) - https://phabricator.wikimedia.org/T244238 (10Marostegui) 05Open→03Resolved Closing this, thanks @jcrespo and @akosiaris for being around to take care of the related services that live in m1. [09:09:02] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [09:09:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/20922/ LGTM. I'll wait for at least one review before going on with this though." 
[puppet] - 10https://gerrit.wikimedia.org/r/572215 (owner: 10Giuseppe Lavagetto) [09:09:26] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [09:12:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084 after crash - T245621', diff saved to https://phabricator.wikimedia.org/P10467 and previous config saved to /var/cache/conftool/dbconfig/20200220-091233-marostegui.json [09:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:37] T245621: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 [09:14:49] (03CR) 10Alexandros Kosiaris: Migrate changeprop & cpjobqueue to kubernetes (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [09:15:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:15:51] (03CR) 10Alexandros Kosiaris: "Since @ottomata chimed in anyway, what's your take on" [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [09:17:18] (03CR) 10Alexandros Kosiaris: "> @Alex - is it ok if we merge this one before helmfile.d protion is done, or the two should be merged together?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [09:18:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] "+2ing and merging. 
I 'll package the chart and update index in a subsequent patch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [09:19:01] (03Merged) 10jenkins-bot: Migrate changeprop & cpjobqueue to kubernetes [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [09:19:14] (03PS1) 10Alexandros Kosiaris: Package changeprop charts and update index [deployment-charts] - 10https://gerrit.wikimedia.org/r/573521 (https://phabricator.wikimedia.org/T220399) [09:21:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] Package changeprop charts and update index [deployment-charts] - 10https://gerrit.wikimedia.org/r/573521 (https://phabricator.wikimedia.org/T220399) (owner: 10Alexandros Kosiaris) [09:23:08] 10Operations, 10Dumps-Generation: (Need By Jan 25) rack/setup/install snapshot1010.eqiad.wmnet - https://phabricator.wikimedia.org/T241794 (10ArielGlenn) 05Open→03Resolved We're up and running! Thanks everyone. [09:26:02] (03PS1) 10Vgutierrez: systemd: Provide support for multiple intervals on systemd::job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 [09:27:38] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.20/extensions/WikimediaMaintenance/storage/make-all-blobs: (no justification provided) (duration: 01m 01s) [09:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:56] (03PS1) 10Muehlenhoff: Add CAS authentication to tendril [puppet] - 10https://gerrit.wikimedia.org/r/573527 [09:28:03] (03CR) 10jerkins-bot: [V: 04-1] systemd: Provide support for multiple intervals on systemd::job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 (owner: 10Vgutierrez) [09:29:52] (03PS1) 10ArielGlenn: Move kowiki to xml dumps bigwikis list with appropriate settings [puppet] - 10https://gerrit.wikimedia.org/r/573528 (https://phabricator.wikimedia.org/T245721) [09:31:04] (03CR) 10jerkins-bot: [V: 04-1] Add CAS authentication to tendril [puppet] - 10https://gerrit.wikimedia.org/r/573527 
(owner: 10Muehlenhoff) [09:38:06] (03PS2) 10Muehlenhoff: Add CAS authentication to tendril [puppet] - 10https://gerrit.wikimedia.org/r/573527 [09:39:09] 10Operations, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Technical-Debt: Investigate whether GD is still needed on appservers - https://phabricator.wikimedia.org/T227734 (10Joe) 05Open→03Declined I concur with @Jdforrester-WMF - let's not... [09:40:18] (03PS2) 10Vgutierrez: systemd: Provide support for multiple intervals on systemd::job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) [09:40:20] (03CR) 10Marostegui: [C: 03+1] tendril: change Icinga monitoring from HTTP to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/573034 (owner: 10Dzahn) [09:42:22] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.20/extensions/WikimediaMaintenance/storage/make-all-blobs: (no justification provided) (duration: 01m 03s) [09:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:08] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I declined the task, I don't think this is valuable enough for the risk it brings." [puppet] - 10https://gerrit.wikimedia.org/r/526255 (https://phabricator.wikimedia.org/T227734) (owner: 10Jforrester) [09:43:13] (03CR) 10Jcrespo: "Ok with the idea, but I think it should be moved within the if block below, so to not have duplicated checks on the pasive host?" 
[puppet] - 10https://gerrit.wikimedia.org/r/573034 (owner: 10Dzahn) [09:43:32] (03CR) 10jerkins-bot: [V: 04-1] systemd: Provide support for multiple intervals on systemd::job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [09:44:41] (03CR) 10Giuseppe Lavagetto: [C: 04-1] service::node: Switch to apt::package_from_component (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/566490 (owner: 10Muehlenhoff) [09:52:03] (03PS1) 10Muehlenhoff: Switch dbproxy1021 to standard Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/573538 (https://phabricator.wikimedia.org/T156955) [09:57:37] (03CR) 10Marostegui: [C: 03+1] Switch dbproxy1021 to standard Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/573538 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [09:59:02] (03PS5) 10Muehlenhoff: profile::url_downloader: Add types and switch to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/562472 [10:02:03] (03PS3) 10Vgutierrez: systemd: Provide support for multiple intervals on systemd::job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) [10:02:35] (03PS4) 10Vgutierrez: systemd: Provide support for multiple intervals on systemd::job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) [10:05:29] (03CR) 10jerkins-bot: [V: 04-1] systemd: Provide support for multiple intervals on systemd::job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [10:08:04] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.20/extensions/WikimediaMaintenance/storage/make-all-blobs: (no justification provided) (duration: 01m 04s) [10:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:57] !log created $wikidb.blobs_cluster26 on es1020 - T245720 [10:09:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:00] T245720: Create new blobs_clusterXX tables on new es4 and es5 sections - https://phabricator.wikimedia.org/T245720 [10:09:11] Reedy: <3, checking [10:12:26] !log created $wikidb.blobs_cluster27 on es1023 - T245720 [10:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:23] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/573282 (https://phabricator.wikimedia.org/T245612) (owner: 10Jbond) [10:13:41] (03PS5) 10Vgutierrez: systemd: Provide support for multiple intervals on systemd::job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) [10:20:32] (03PS6) 10Vgutierrez: systemd: Support multiple intervals on job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) [10:21:45] (03PS1) 10ArielGlenn: add the new dumps xml testbed host to the dumps cluster for grafana [puppet] - 10https://gerrit.wikimedia.org/r/573543 [10:23:22] (03CR) 10ArielGlenn: [C: 03+2] add the new dumps xml testbed host to the dumps cluster for grafana [puppet] - 10https://gerrit.wikimedia.org/r/573543 (owner: 10ArielGlenn) [10:42:17] (03CR) 10Vgutierrez: "looks good, could we get the extra empty line from pybal.conf removed? that way it would be a NOOP on pybal itself" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/572215 (owner: 10Giuseppe Lavagetto) [10:42:39] (03CR) 10Muehlenhoff: "Are these stretch instances gone?" [puppet] - 10https://gerrit.wikimedia.org/r/563469 (owner: 10Muehlenhoff) [10:43:56] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [10:45:27] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "One nitpick, and please expand briefly the commit message. 
Otherwise the patch is good." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [10:47:23] (03PS1) 10Muehlenhoff: Add library hints for postgresql-11 [puppet] - 10https://gerrit.wikimedia.org/r/573549 [10:51:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1084 after crash - T245621', diff saved to https://phabricator.wikimedia.org/P10468 and previous config saved to /var/cache/conftool/dbconfig/20200220-105117-marostegui.json [10:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:22] T245621: db1084 crashed due to BBU failure - https://phabricator.wikimedia.org/T245621 [10:53:33] (03CR) 10Muehlenhoff: [C: 03+2] Add library hints for postgresql-11 [puppet] - 10https://gerrit.wikimedia.org/r/573549 (owner: 10Muehlenhoff) [10:54:34] (03CR) 10Filippo Giunchedi: [C: 03+1] "Patch itself LGTM, could we add all dbproxy* hosts to this too ?" [puppet] - 10https://gerrit.wikimedia.org/r/573538 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [10:55:13] (03PS7) 10Vgutierrez: systemd: Support multiple intervals on job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) [10:56:23] (03CR) 10Vgutierrez: systemd: Support multiple intervals on job::timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [10:59:20] (03PS8) 10Vgutierrez: systemd: Support multiple intervals on job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) [11:03:50] (03PS1) 10Muehlenhoff: Add Cumin aliases for Netbox [puppet] - 10https://gerrit.wikimedia.org/r/573550 [11:03:52] (03CR) 10Ema: [C: 03+1] systemd: Support multiple intervals on job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [11:04:07] 
(03CR) 10Vgutierrez: [C: 03+2] systemd: Support multiple intervals on job::timer [puppet] - 10https://gerrit.wikimedia.org/r/573526 (https://phabricator.wikimedia.org/T245616) (owner: 10Vgutierrez) [11:07:15] 10Operations: Integrate Buster 10.3 point update - https://phabricator.wikimedia.org/T244693 (10MoritzMuehlenhoff) [11:08:33] !log installing boost update from Buster point release [11:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:02] PROBLEM - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100% [11:18:39] mw1280 has broken memory [11:19:56] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1280.eqiad.wmnet [11:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:18] 10Operations, 10ops-eqiad: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187 (10MoritzMuehlenhoff) The server went down with the following error today: ` Record: 216 Date/Time: 02/20/2020 11:16:09 Source: system Severity: Critical Description: Correcta... [11:22:25] <_joe_> ouch [11:22:59] ACKNOWLEDGEMENT - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff T240187 [11:39:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, with one minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573381 (owner: 10Jhedden) [11:46:33] (03PS1) 10Hnowlan: MWScript: Allow MWScript to be invoked via the debugger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) [11:49:44] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [11:50:43] (03CR) 10Reedy: MWScript: Allow MWScript to be invoked via the debugger (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [11:58:18] (03CR) 10Jcrespo: "Question, will this change the 401 error code for a 3XX (followed by a 401)? I am asking it because we have a "tendril asks for password c" [puppet] - 10https://gerrit.wikimedia.org/r/573527 (owner: 10Muehlenhoff) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200220T1200). [12:00:04] No GERRIT patches in the queue for this window AFAICS. [12:00:40] I'll push sth [12:00:57] (03CR) 10Urbanecm: [C: 03+2] Add logos for ngwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565724 (https://phabricator.wikimedia.org/T242416) (owner: 10Majavah) [12:01:02] (03CR) 10Jcrespo: "The other question is that I don't want to maintain cruft at httpd-tendril.erb when most of that is generic. 
Could that be generalized in " [puppet] - 10https://gerrit.wikimedia.org/r/573527 (owner: 10Muehlenhoff) [12:01:50] (03PS1) 10Muehlenhoff: Add logstash/kibana IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/573560 [12:01:54] (03Merged) 10jenkins-bot: Add logos for ngwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565724 (https://phabricator.wikimedia.org/T242416) (owner: 10Majavah) [12:03:20] (03PS2) 10Urbanecm: Configure logo for ngwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565725 (https://phabricator.wikimedia.org/T242416) (owner: 10Majavah) [12:04:54] (03CR) 10Urbanecm: [C: 03+2] Configure logo for ngwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565725 (https://phabricator.wikimedia.org/T242416) (owner: 10Majavah) [12:05:00] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: 64240e1: Add logos for ngwikimedia (T242416) (duration: 01m 04s) [12:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:05] T242416: Upload logo for ng.wikimedia - https://phabricator.wikimedia.org/T242416 [12:05:47] (03Merged) 10jenkins-bot: Configure logo for ngwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565725 (https://phabricator.wikimedia.org/T242416) (owner: 10Majavah) [12:09:35] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 728d739: Configure logo for ngwikimedia (T242416) (duration: 01m 04s) [12:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:29] !log EU SWAT done [12:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:58] (03CR) 10Hnowlan: MWScript: Allow MWScript to be invoked via the debugger (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [12:25:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::lvs: use wmflib::fetch (032 
comments) [puppet] - 10https://gerrit.wikimedia.org/r/572215 (owner: 10Giuseppe Lavagetto) [12:26:59] (03CR) 10Jbond: [C: 04-1] "LGTM but a couple of errors and a few nits" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573401 (owner: 10Ayounsi) [12:28:12] !log installing PHP 7.0 security updates [12:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:57] (03CR) 10Jbond: [C: 03+2] user:dwisehaupt: add alias [puppet] - 10https://gerrit.wikimedia.org/r/572940 (https://phabricator.wikimedia.org/T244901) (owner: 10Jbond) [12:30:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops group for dwisehaupt - https://phabricator.wikimedia.org/T244901 (10jbond) 05Open→03Resolved a:03jbond no problem :) > Do you still need @mepps approval on this to move forward? No this has already been conigured please... [12:31:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/573560 (owner: 10Muehlenhoff) [12:34:07] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/573527 (owner: 10Muehlenhoff) [12:34:53] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/572997 (owner: 10EBernhardson) [12:35:41] (03PS11) 10Giuseppe Lavagetto: profile::lvs: use wmflib::fetch [puppet] - 10https://gerrit.wikimedia.org/r/572215 [12:35:43] (03PS2) 10Giuseppe Lavagetto: role::elasticsearch::cloudelastic: use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/573515 [12:35:52] (03PS2) 10Giuseppe Lavagetto: lvs::configuration: drop the lvs service hashes [puppet] - 10https://gerrit.wikimedia.org/r/573516 [12:40:02] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. 
https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [12:40:26] (03PS12) 10Giuseppe Lavagetto: profile::lvs: use wmflib::fetch [puppet] - 10https://gerrit.wikimedia.org/r/572215 [12:40:28] (03PS3) 10Giuseppe Lavagetto: role::elasticsearch::cloudelastic: use profile::lvs::realserver [puppet] - 10https://gerrit.wikimedia.org/r/573515 [12:40:30] (03PS3) 10Giuseppe Lavagetto: lvs::configuration: drop the lvs service hashes [puppet] - 10https://gerrit.wikimedia.org/r/573516 [12:43:03] (03PS5) 10ArielGlenn: weekly dump of machine vision tables from commonswiki [puppet] - 10https://gerrit.wikimedia.org/r/573351 (https://phabricator.wikimedia.org/T236431) [12:44:08] (03CR) 10Reedy: MWScript: Allow MWScript to be invoked via the debugger (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [12:48:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add configuration for a flowspec controller (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573401 (owner: 10Ayounsi) [12:50:53] (03CR) 10Ayounsi: "Thanks, answer inline." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573401 (owner: 10Ayounsi) [12:51:27] (03PS4) 10Ayounsi: Add configuration for a flowspec controller [puppet] - 10https://gerrit.wikimedia.org/r/573401 [12:53:45] !log installing PHP updates on matomo1001/piwik [12:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:10] (03CR) 10Ayounsi: "And thanks to you too!" 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573401 (owner: 10Ayounsi) [13:01:52] (03CR) 10Aklapper: "I still like the idea, however it seems that nobody really feels like understanding what needs to be done here, and there are no docs I co" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439436 (https://phabricator.wikimedia.org/T165773) (owner: 10Aklapper) [13:02:38] (03PS5) 10Ayounsi: Add configuration for a flowspec controller [puppet] - 10https://gerrit.wikimedia.org/r/573401 [13:06:45] (03CR) 10Jbond: [C: 03+1] "LGTM please add PCC just to make sure the change to firewall.def is a noop" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/573401 (owner: 10Ayounsi) [13:08:08] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/573516 (owner: 10Giuseppe Lavagetto) [13:08:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] "fine by me, let's see what fleet wide PCC says" [puppet] - 10https://gerrit.wikimedia.org/r/573516 (owner: 10Giuseppe Lavagetto) [13:25:28] PROBLEM - DPKG on mwmaint1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:32:39] mwmaint1002 is me, will recover soon [13:33:50] RECOVERY - DPKG on mwmaint1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [13:35:21] (03PS6) 10Ayounsi: Add configuration for a flowspec controller [puppet] - 10https://gerrit.wikimedia.org/r/573401 [13:36:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] conftool::scripts: remove compatibility, disable draining [puppet] - 10https://gerrit.wikimedia.org/r/573291 (https://phabricator.wikimedia.org/T245594) (owner: 10Giuseppe Lavagetto) [13:41:10] (03PS7) 10Ayounsi: Add configuration for a flowspec controller [puppet] - 10https://gerrit.wikimedia.org/r/573401 [13:42:59] 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: Tracking task: 2020-02-04 kartotherian outage - https://phabricator.wikimedia.org/T244278 (10akosiaris) 
[13:43:51] 10Operations, 10Patch-For-Review, 10Wikimedia-Incident: Tracking task: 2020-02-04 kartotherian outage - https://phabricator.wikimedia.org/T244278 (10akosiaris) p:05Unbreak!→03Medium I 'll remove the UBN, the issue has been mitigated for now. The decision about 3rd parties being able to use the infrastruc... [13:46:44] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [13:49:35] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/20931/" [puppet] - 10https://gerrit.wikimedia.org/r/573401 (owner: 10Ayounsi) [13:49:49] (03CR) 10Ottomata: "> Patch Set 13:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/554576 (owner: 10Holger Knust) [13:51:58] (03PS3) 10Ottomata: airflow: Expand sudo rights to analytics-search user [puppet] - 10https://gerrit.wikimedia.org/r/572997 (owner: 10EBernhardson) [13:54:09] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10MoritzMuehlenhoff) I had missed the followup. sorry. These two spare hosts would be fine as test hosts! 
[13:55:21] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 with the minor comment about regenerating the index (I doubt a rebase will help)" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [13:56:28] (03CR) 10Alexandros Kosiaris: "+1, let's try it out" [puppet] - 10https://gerrit.wikimedia.org/r/573335 (https://phabricator.wikimedia.org/T206951) (owner: 10Jbond) [14:05:02] (03PS4) 10Ottomata: Include analytics-search-admins in airflow::search role [puppet] - 10https://gerrit.wikimedia.org/r/572997 (owner: 10EBernhardson) [14:05:24] (03PS5) 10Ottomata: Include analytics-search-users in airflow::search role [puppet] - 10https://gerrit.wikimedia.org/r/572997 (owner: 10EBernhardson) [14:08:40] (03PS6) 10Ottomata: New eventgate-analytics-external instance using remote EventStreamConfig API [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) [14:09:02] (03CR) 10Ottomata: [C: 03+2] Include analytics-search-users in airflow::search role [puppet] - 10https://gerrit.wikimedia.org/r/572997 (owner: 10EBernhardson) [14:10:06] (03CR) 10Ottomata: New eventgate-analytics-external instance using remote EventStreamConfig API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [14:11:02] (03CR) 10Ottomata: [C: 03+2] New eventgate-analytics-external instance using remote EventStreamConfig API [deployment-charts] - 10https://gerrit.wikimedia.org/r/563211 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [14:15:24] (03PS2) 10Ottomata: Set Superset UPLOAD_FOLDER to /tmp/superset_uploads/ [puppet] - 10https://gerrit.wikimedia.org/r/573393 (https://phabricator.wikimedia.org/T245679) [14:17:50] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1003/20932/analytics-tool1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/573393 
(https://phabricator.wikimedia.org/T245679) (owner: 10Ottomata) [14:17:56] (03PS3) 10Ottomata: Set Superset UPLOAD_FOLDER to /tmp/superset_uploads/ [puppet] - 10https://gerrit.wikimedia.org/r/573393 (https://phabricator.wikimedia.org/T245679) [14:19:26] (03CR) 10Ottomata: [C: 03+2] Set Superset UPLOAD_FOLDER to /tmp/superset_uploads/ [puppet] - 10https://gerrit.wikimedia.org/r/573393 (https://phabricator.wikimedia.org/T245679) (owner: 10Ottomata) [14:20:00] (03PS2) 10Jhedden: openstack: Update cloud-init for virtio-scsi devices [puppet] - 10https://gerrit.wikimedia.org/r/573381 [14:21:04] (03CR) 10Jhedden: openstack: Update cloud-init for virtio-scsi devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573381 (owner: 10Jhedden) [14:24:28] (03PS2) 10RLazarus: Refactor for better multi-host execution: [software/httpbb] - 10https://gerrit.wikimedia.org/r/567147 [14:25:10] (03CR) 10RLazarus: [C: 03+2] Refactor for better multi-host execution: (033 comments) [software/httpbb] - 10https://gerrit.wikimedia.org/r/567147 (owner: 10RLazarus) [14:26:33] (03Merged) 10jenkins-bot: Refactor for better multi-host execution: [software/httpbb] - 10https://gerrit.wikimedia.org/r/567147 (owner: 10RLazarus) [14:32:26] (03PS2) 10Hnowlan: MWScript: Allow MWScript to be invoked via the debugger [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) [14:35:37] (03Abandoned) 10Alexandros Kosiaris: ganeti1009: Reserve ganeti1009 for k8s tests [puppet] - 10https://gerrit.wikimedia.org/r/571958 (owner: 10Alexandros Kosiaris) [14:37:30] 10Operations, 10Security-Team, 10User-jbond: Icinga check for CAS-protected web services - https://phabricator.wikimedia.org/T245743 (10MoritzMuehlenhoff) [14:38:08] !log [dry-run; mwmaint1002] foreachwiki extensions/AbuseFilter/maintenance/fixOldLogEntries.php --dry-run --verbose (T228655) [14:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:14] T228655: 
Dry-run fixOldLogEntries for AbuseFilter - https://phabricator.wikimedia.org/T228655 [14:38:22] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:40:22] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:44:57] (03PS1) 10Alexandros Kosiaris: http: Log X-Client-IP header if it exists [puppet] - 10https://gerrit.wikimedia.org/r/573578 [14:46:09] 10Operations, 10ORES, 10Scoring-platform-team, 10vm-requests: New node request: oresrdb[12]003 - https://phabricator.wikimedia.org/T210582 (10Halfak) 05Declined→03Stalled Sounds reasonable. Thanks for circling back @akosiaris [14:47:31] (03PS2) 10Alexandros Kosiaris: httpd: Log X-Client-IP header if it exists [puppet] - 10https://gerrit.wikimedia.org/r/573578 [14:47:56] (03PS1) 10Ottomata: refinery data_purge - Too many backslashes in regex escape [puppet] - 10https://gerrit.wikimedia.org/r/573579 (https://phabricator.wikimedia.org/T245124) [14:50:38] (03CR) 10Alexandros Kosiaris: "An alternative approach that used the remoteip module for more or less the same purpose is up at https://gerrit.wikimedia.org/r/#/c/operat" [puppet] - 10https://gerrit.wikimedia.org/r/573578 (owner: 10Alexandros Kosiaris) [14:50:51] (03CR) 10jerkins-bot: [V: 04-1] refinery data_purge - Too many backslashes in regex escape [puppet] - 10https://gerrit.wikimedia.org/r/573579 (https://phabricator.wikimedia.org/T245124) (owner: 10Ottomata) [14:50:54] (03CR) 10Muehlenhoff: "@Jaime: Thanks for the pointer wrt the Icinga check, I missed that in the tendril::webserver class. 
The CAS auth code when enabled would i" [puppet] - 10https://gerrit.wikimedia.org/r/573527 (owner: 10Muehlenhoff) [14:52:30] (03CR) 10RLazarus: [C: 03+1] "See nit, but otherwise looks good. I'm fuzzy on the header semantics -- is this information not already in X-Forwarded-For?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573578 (owner: 10Alexandros Kosiaris) [14:57:26] (03PS2) 10Ottomata: refinery data purge - fix command checksums on refine event drop jobs [puppet] - 10https://gerrit.wikimedia.org/r/573579 (https://phabricator.wikimedia.org/T245124) [14:58:30] (03CR) 10Ottomata: "> Patch Set 2: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573578 (owner: 10Alexandros Kosiaris) [14:58:56] (03CR) 10Alexandros Kosiaris: "> See nit, but otherwise looks good. I'm fuzzy on the header semantics -- is this information not already in X-Forwarded-For?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573578 (owner: 10Alexandros Kosiaris) [14:59:50] (03PS3) 10Alexandros Kosiaris: httpd: Log X-Client-IP header if it exists [puppet] - 10https://gerrit.wikimedia.org/r/573578 [15:00:16] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1003/20934/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/573579 (https://phabricator.wikimedia.org/T245124) (owner: 10Ottomata) [15:00:24] (03PS3) 10Ottomata: refinery data purge - fix command checksums on refine event drop jobs [puppet] - 10https://gerrit.wikimedia.org/r/573579 (https://phabricator.wikimedia.org/T245124) [15:02:56] (03CR) 10Ottomata: [C: 03+2] refinery data purge - fix command checksums on refine event drop jobs [puppet] - 10https://gerrit.wikimedia.org/r/573579 (https://phabricator.wikimedia.org/T245124) (owner: 10Ottomata) [15:02:59] (03CR) 10RLazarus: [C: 03+1] "Oh got it, thanks!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573578 (owner: 10Alexandros Kosiaris) [15:03:24] (03CR) 10Ottomata: [V: 03+2 C: 03+2] refinery data purge - fix command checksums on refine event drop jobs [puppet] - 10https://gerrit.wikimedia.org/r/573579 (https://phabricator.wikimedia.org/T245124) (owner: 10Ottomata) [15:12:31] 10Operations, 10LDAP-Access-Requests: Revoke LDAP access for Tobias Schumann (WMDE) - https://phabricator.wikimedia.org/T245747 (10kai.nissen) [15:13:31] (03CR) 10Hnowlan: MWScript: Allow MWScript to be invoked via the debugger (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [15:19:37] !log fdans@deploy1001 Started deploy [analytics/aqs/deploy@cbc3241]: deploying aqs [15:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:44] !log fdans@deploy1001 Finished deploy [analytics/aqs/deploy@cbc3241]: deploying aqs (duration: 04m 06s) [15:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:48] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10RobH) This is still pending mgmt approval for allocation of these two spares: >>! In T214024#5823975, @RobH wrote: > Ok, wmf5175 was ordered and can be allocated as the dual cpu spare pool system curren... 
[15:32:01] !log fdans@deploy1001 Started deploy [analytics/aqs/deploy@95a7999]: deploying aqs [15:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:49] !log fdans@deploy1001 Finished deploy [analytics/aqs/deploy@95a7999]: deploying aqs (duration: 00m 48s) [15:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:02] 10Operations, 10LDAP-Access-Requests: Revoke LDAP access for Tobias Schumann (WMDE) - https://phabricator.wikimedia.org/T245747 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [15:38:13] (03PS1) 10Muehlenhoff: Remove LDAP access for Tobias Schumann [puppet] - 10https://gerrit.wikimedia.org/r/573598 (https://phabricator.wikimedia.org/T245747) [15:39:25] (03CR) 10Jcrespo: "So I am ok with any change + promise to improve in the future, as long as whoever decides to go forward with it understands there is a sec" [puppet] - 10https://gerrit.wikimedia.org/r/573527 (owner: 10Muehlenhoff) [15:40:20] !log Poweroff es2022 T245714 [15:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:25] T245714: es2022 power supply issues - https://phabricator.wikimedia.org/T245714 [15:41:18] (03CR) 10Jcrespo: "Just to be clear, the deletion/update of the check is when it is effectively deployed, of course, doesn't have to happen on this patch." [puppet] - 10https://gerrit.wikimedia.org/r/573527 (owner: 10Muehlenhoff) [15:43:57] (03CR) 10Muehlenhoff: "Ack, I'll make a separate patch to tweak the check via a separate Hiera flag before this goes live. 
And fully agreed, wrt ownership: This " [puppet] - 10https://gerrit.wikimedia.org/r/573527 (owner: 10Muehlenhoff) [15:44:04] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for Tobias Schumann [puppet] - 10https://gerrit.wikimedia.org/r/573598 (https://phabricator.wikimedia.org/T245747) (owner: 10Muehlenhoff) [15:45:19] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Revoke LDAP access for Tobias Schumann (WMDE) - https://phabricator.wikimedia.org/T245747 (10MoritzMuehlenhoff) 05Open→03Resolved Thanks for opening a task. I've removed him from the nda and wmde groups. [15:45:23] (03PS1) 10Herron: add dhcp/netboot entries for lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/573600 (https://phabricator.wikimedia.org/T224586) [15:48:13] 10Operations, 10Security-Team, 10User-jbond: Icinga check for CAS-protected web services - https://phabricator.wikimedia.org/T245743 (10chasemp) p:05Triage→03Medium [15:49:03] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573600 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [15:50:23] (03CR) 10Herron: add dhcp/netboot entries for lists1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/573600 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [15:50:46] (03PS1) 10Alexandros Kosiaris: eventgate-analytics-external: Add k8s token [puppet] - 10https://gerrit.wikimedia.org/r/573602 (https://phabricator.wikimedia.org/T233629) [15:50:59] (03PS2) 10Herron: add dhcp/netboot entries for lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/573600 (https://phabricator.wikimedia.org/T224586) [15:52:20] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10HMarcus) @MoritzMuehlenhoff @chasemp Thank you for joining the technical call today and helping to push this further along. Moritz, it sounde... 
[15:52:58] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Missing reverse DNS (PTR records)" [dns] - 10https://gerrit.wikimedia.org/r/573362 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [15:55:45] 10Operations, 10ops-codfw, 10DBA: es2022 power supply issues - https://phabricator.wikimedia.org/T245714 (10Papaul) 05Open→03Resolved A reboot fixed the problem [15:57:14] !log fdans@deploy1001 Started deploy [analytics/aqs/deploy@125cffa]: deploying aqs, third time is the charm [15:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:33] 10Operations, 10ops-codfw, 10DBA: es2022 power supply issues - https://phabricator.wikimedia.org/T245714 (10Marostegui) Thank you! ` /admin1/system1 properties ElementName = EnabledState = 2 (Enabled) HealthState = 5 (OK) OperationalStatus[0] = 2 (OK) ` [16:03:18] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [16:03:30] !log fdans@deploy1001 Finished deploy [analytics/aqs/deploy@125cffa]: deploying aqs, third time is the charm (duration: 06m 15s) [16:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:26] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [16:10:29] (03PS1) 10RobH: adding new dell skus [software] - 10https://gerrit.wikimedia.org/r/573611 [16:10:55] (03PS1) 10Herron: assigin lists1001 role::lists [puppet] - 10https://gerrit.wikimedia.org/r/573612 (https://phabricator.wikimedia.org/T224586) [16:11:46] (03PS2) 10Herron: assigin lists1001 role::lists [puppet] - 10https://gerrit.wikimedia.org/r/573612 (https://phabricator.wikimedia.org/T224586) [16:12:20] (03CR) 
10Jhedden: [C: 03+2] openstack: Update cloud-init for virtio-scsi devices [puppet] - 10https://gerrit.wikimedia.org/r/573381 (owner: 10Jhedden) [16:12:53] !log installing postgres security updates on netboxdb* [16:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:05] (03PS1) 10Elukey: Refactor statistics mountpoints to be included in all stat roles [puppet] - 10https://gerrit.wikimedia.org/r/573616 (https://phabricator.wikimedia.org/T243934) [16:16:07] !log stop, upgrade and restart db1140 [16:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:16] RECOVERY - IPMI Sensor Status on es2022 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:21:55] (03PS2) 10Elukey: Refactor statistics mountpoints to be included in all stat roles [puppet] - 10https://gerrit.wikimedia.org/r/573616 (https://phabricator.wikimedia.org/T243934) [16:22:47] (03CR) 10Muehlenhoff: [C: 03+1] add dhcp/netboot entries for lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/573600 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [16:23:10] (03PS3) 10Herron: add dhcp/netboot entries for lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/573600 (https://phabricator.wikimedia.org/T224586) [16:23:40] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:23:50] !log installing Java security updates on Hadoop/Kafka Jumbo/AQS/Druid [16:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:26:08] !log stop, upgrade and restart dbprov1002 [16:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:58] (03PS3) 10Elukey: Refactor statistics mountpoints to be included in all stat roles [puppet] - 10https://gerrit.wikimedia.org/r/573616 (https://phabricator.wikimedia.org/T243934) [16:29:22] (03CR) 10Herron: [C: 03+2] add dhcp/netboot entries for lists1001 [puppet] - 10https://gerrit.wikimedia.org/r/573600 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [16:32:42] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/20936/" [puppet] - 10https://gerrit.wikimedia.org/r/573616 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [16:34:24] (03CR) 10CRusnov: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/573550 (owner: 10Muehlenhoff) [16:38:00] (03CR) 10Herron: [C: 03+2] assigin lists1001 role::lists [puppet] - 10https://gerrit.wikimedia.org/r/573612 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [16:40:21] (03PS1) 10Alexandros Kosiaris: eventgate-analytics-external: Add the namespace and calico rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/573624 (https://phabricator.wikimedia.org/T233629) [16:40:56] (03PS2) 10Muehlenhoff: Add Cumin aliases for Netbox [puppet] - 10https://gerrit.wikimedia.org/r/573550 [16:41:21] (03Abandoned) 10Jforrester: mediawiki::php: Don't install gd any more, ZeroBanner is gone [puppet] - 10https://gerrit.wikimedia.org/r/526255 (https://phabricator.wikimedia.org/T227734) (owner: 10Jforrester) [16:44:49] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin aliases for Netbox [puppet] - 10https://gerrit.wikimedia.org/r/573550 (owner: 10Muehlenhoff) [16:45:01] !log stop, upgrade and restart dbprov2002 [16:45:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:47] 
10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10akosiaris) FYI, I'll also be piggybacking some k8s tests on these hosts as my local env doesn't have enough memory anymore [16:53:47] (03CR) 10EBernhardson: [C: 03+1] "Thanks for putting this together, should be useful!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [16:58:29] (03CR) 10Jhedden: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/573422 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [16:59:53] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add configuration for a flowspec controller [puppet] - 10https://gerrit.wikimedia.org/r/573401 (owner: 10Ayounsi) [17:00:03] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10faidon) a:05faidon→03RobH OK, it sounds like @akosiaris and @MoritzMuehlenhoff have coordinated with each other and they can share those two hosts as SRE test hosts. This allocation is approved. @RobH... [17:00:04] godog and _joe_: That opportune time is upon us again. Time for a Puppet SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200220T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS.
[17:04:29] (03PS2) 10Bstorm: cloudstore: Update the nfs_hostlist script [puppet] - 10https://gerrit.wikimedia.org/r/573422 (https://phabricator.wikimedia.org/T224582) [17:05:20] PROBLEM - novaadmin has roles in every project on cloudcontrol1003 is CRITICAL: In cloudinfra, user novaadmin should have roles [user, projectadmin] but has [uprojectadmin, uuser, uadmin] https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:08:33] 10Operations, 10ops-eqiad, 10DC-Ops: (No date provided) setup/install sretest100[12].wikimedia.org - https://phabricator.wikimedia.org/T245754 (10RobH) [17:08:44] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10RobH) [17:08:46] 10Operations, 10ops-eqiad, 10DC-Ops: (No date provided) setup/install sretest100[12].wikimedia.org - https://phabricator.wikimedia.org/T245754 (10RobH) [17:09:49] (03PS1) 10Andrew Bogott: check_keystone_roles.py: exclude 'cloudinfra' from role monitoring [puppet] - 10https://gerrit.wikimedia.org/r/573630 [17:09:56] 10Operations, 10ops-eqiad, 10DC-Ops: (No date provided) setup/install sretest100[12].wikimedia.org - https://phabricator.wikimedia.org/T245754 (10RobH) @MoritzMuehlenhoff: I asked in IRC, and both @akosiaris and I don't have strong feelings on the hostname use for these two hosts. Can you advise if you pre... [17:10:34] (03CR) 10Jhedden: [C: 03+1] check_keystone_roles.py: exclude 'cloudinfra' from role monitoring [puppet] - 10https://gerrit.wikimedia.org/r/573630 (owner: 10Andrew Bogott) [17:10:36] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10RobH) 05Open→03Resolved These will be setup via T245754. Resolving this allocation task. 
[17:10:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] check_keystone_roles.py: exclude 'cloudinfra' from role monitoring [puppet] - 10https://gerrit.wikimedia.org/r/573630 (owner: 10Andrew Bogott) [17:11:06] (03CR) 10Andrew Bogott: [C: 03+2] check_keystone_roles.py: exclude 'cloudinfra' from role monitoring [puppet] - 10https://gerrit.wikimedia.org/r/573630 (owner: 10Andrew Bogott) [17:11:33] (03CR) 10Bstorm: [C: 03+2] cloudstore: Update the nfs_hostlist script [puppet] - 10https://gerrit.wikimedia.org/r/573422 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [17:12:31] 10Operations, 10netops, 10Patch-For-Review: Add monitoring for BGP peers exceeding prefix-limit - https://phabricator.wikimedia.org/T239256 (10ayounsi) This surfaced that the v6 sessions to the Equinix router servers in Dallas and Ashburn have been down for quite a while. [17:13:46] RECOVERY - novaadmin has roles in every project on cloudcontrol1003 is OK: novaadmin has the correct roles in all projects. https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:17:12] 10Operations, 10SRE-Access-Requests: Requesting access to stat1007 for jmorgan - https://phabricator.wikimedia.org/T244785 (10Capt_Swing) @jbond is endorsement from @leila and @Ottomata sufficient here? [17:18:28] 10Operations, 10cloud-services-team (Kanban): Netbox: Usage guidelines for WMCS - https://phabricator.wikimedia.org/T208576 (10aborrero) [17:19:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, I 'd like to do the merge after the releases have been deployed." [puppet] - 10https://gerrit.wikimedia.org/r/573365 (https://phabricator.wikimedia.org/T233629) (owner: 10Ottomata) [17:19:43] (03PS1) 10Bstorm: nfs_hostlist: apply black formatting [puppet] - 10https://gerrit.wikimedia.org/r/573631 [17:20:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Andrew will be merging/babysitting this patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/572213 (https://phabricator.wikimedia.org/T243766) (owner: 10Arturo Borrero Gonzalez) [17:22:04] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:24:06] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ProtocolError(Connection aborted., ConnectionResetError(104, Connection reset by peer)): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [17:25:06] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 51.98 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:25:36] (03CR) 10Bstorm: [C: 03+2] nfs_hostlist: apply black formatting [puppet] - 10https://gerrit.wikimedia.org/r/573631 (owner: 10Bstorm) [17:27:26] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - ncredirlb6_443: Servers ncredir3001.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:27:50] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:28:53] 10Operations, 10ORES, 10Scoring-platform-team (Current): ORES uwsgi consumes a large amount of memory and CPU when shutting down (as part of a restart) - https://phabricator.wikimedia.org/T242705 (10Halfak) @akosiaris, any updates on testing with gunicorn? 
[17:29:32] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:30:28] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:31:24] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 91.55 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:34:40] (03PS1) 10BBlack: depool codfw + ulsfo from geodns [dns] - 10https://gerrit.wikimedia.org/r/573636 [17:36:12] (03PS1) 10Ayounsi: prepending in esams/knams, MSS clamping in eqiad, eqord, esams, knams, eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/573637 [17:37:37] (03CR) 10BBlack: [C: 03+2] depool codfw + ulsfo from geodns [dns] - 10https://gerrit.wikimedia.org/r/573636 (owner: 10BBlack) [17:38:19] !log pushed codfw+ulsfo geodns depool [17:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:40] 10Operations, 10cloud-services-team (Kanban): Netbox: Usage guidelines for WMCS - https://phabricator.wikimedia.org/T208576 (10bd808) [17:45:15] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 392 probes of 603 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:47:09] 10Operations, 10netbox: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10faidon) >>! In T244849#5873311, @crusnov wrote: > On a practical level we already maintain a fork, so if any changes are needed they can be integrated into our fork (we should wait until the post-upgrade ~this week... 
[17:50:03] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 30 probes of 603 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:50:21] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 86 probes of 607 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:52:58] 10Operations, 10netops, 10cloud-services-team (Kanban): CloudVPS: enable BGP in the neutron transport network - https://phabricator.wikimedia.org/T245606 (10aborrero) Additional tests related to this are blocked on missing backported packages for the stretck-pike combo: `python3-os-ken` and `neutron-dynamic-... [17:55:09] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 23 probes of 607 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:58:07] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle, AS1299/IPv4: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:58:41] (03CR) 10Ayounsi: [C: 03+2] prepending in esams/knams, MSS clamping in eqiad, eqord, esams, knams, eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/573637 (owner: 10Ayounsi) [17:58:48] (03CR) 10Ayounsi: [C: 03+2] "Pushed." [homer/public] - 10https://gerrit.wikimedia.org/r/573637 (owner: 10Ayounsi) [17:59:00] (03Merged) 10jenkins-bot: prepending in esams/knams, MSS clamping in eqiad, eqord, esams, knams, eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/573637 (owner: 10Ayounsi) [18:00:04] cscott, arlolra, subbu, halfak, and accraze: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200220T1800). [18:00:21] no deploys please, we are in the middle of an incident [18:00:53] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 276 probes of 607 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:08:40] !log fdans@deploy1001 Started deploy [analytics/refinery@e05ae16]: deploying refinery [18:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:57] (03CR) 10Elukey: [C: 03+2] Refactor statistics mountpoints to be included in all stat roles [puppet] - 10https://gerrit.wikimedia.org/r/573616 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [18:16:15] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle, AS1299/IPv4: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:19:18] jouncebot: now [18:19:19] For the next 0 hour(s) and 40 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200220T1800) [18:20:11] !log fdans@deploy1001 Finished deploy [analytics/refinery@e05ae16]: deploying refinery (duration: 11m 31s) [18:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:29] fdans: we are in the middle of an incident [18:24:20] folks, you should check in the channel before moving ahead, just on gp [18:25:35] Routing issues, it looks like? [18:26:00] (03CR) 10Ottomata: [C: 03+1] "Does this also allow for access to api.svc? It would be fine to also allow eventgate-analytics to access that too."
[deployment-charts] - 10https://gerrit.wikimedia.org/r/573624 (https://phabricator.wikimedia.org/T233629) (owner: 10Alexandros Kosiaris) [18:27:31] (03PS1) 10Ayounsi: Add prepending in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/573650 [18:31:57] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 6 probes of 607 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:34:39] 10Operations, 10ops-eqiad, 10DC-Ops: (No date provided) setup/install sretest100[12].wikimedia.org - https://phabricator.wikimedia.org/T245754 (10MoritzMuehlenhoff) sretest100[12] is fine with me, but given that these are meant for various tests let's rather use an internal IP, unless @akosiaris has specific... [18:34:49] PROBLEM - Host db2127.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:41:01] RECOVERY - Host db2127.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.01 ms [18:42:16] (03PS7) 10Bstorm: cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) [18:54:03] (03CR) 10Ayounsi: [C: 03+2] Add prepending in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/573650 (owner: 10Ayounsi) [18:54:22] (03Merged) 10jenkins-bot: Add prepending in eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/573650 (owner: 10Ayounsi) [18:56:24] (03PS7) 10Bstorm: labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832 [19:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200220T1900). Please do the needful. [19:00:04] Pchelolo: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:49] (Deploys still on hold.) [19:01:30] And I was going to say I'd skip this one regardless [19:02:09] Ha, OK. [19:02:26] * James_F has the train in 58 minutes' time. Would be nice to get clarity by then. [19:08:46] (03PS1) 10Herron: add lists1001 to authorized hosts for lists cert [puppet] - 10https://gerrit.wikimedia.org/r/573659 (https://phabricator.wikimedia.org/T224586) [19:10:41] deploys can proceed as normal for now [19:10:45] we have a handle on the other stuff [19:13:24] Thanks, bblack. [19:14:45] (03CR) 10Andrew Bogott: [C: 03+2] cloud: refresh names for DNS servers in eqiad1/codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/572213 (https://phabricator.wikimedia.org/T243766) (owner: 10Arturo Borrero Gonzalez) [19:19:03] 10Operations, 10Security-Team, 10User-jbond: Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP - https://phabricator.wikimedia.org/T244792 (10chasemp) //My take aways// **Jumpcloud**: - does not offer syncrepl access - is rfc 2307 compliant - does offer LDAP bind and search... [19:23:16] (03PS4) 10Dzahn: tendril: change Icinga monitoring from HTTP to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/573034 [19:31:30] (03PS2) 10Herron: acme_cheif: add lists1001 to authorized hosts for lists cert [puppet] - 10https://gerrit.wikimedia.org/r/573659 (https://phabricator.wikimedia.org/T224586) [19:31:52] (03PS1) 10BBlack: Revert "depool codfw + ulsfo from geodns" [dns] - 10https://gerrit.wikimedia.org/r/573666 [19:32:49] (03CR) 10BBlack: [C: 03+2] Revert "depool codfw + ulsfo from geodns" [dns] - 10https://gerrit.wikimedia.org/r/573666 (owner: 10BBlack) [19:33:13] Hello, why [19:33:13] beta-mediawiki-config-update-eqiad have problems again? 
[19:33:23] operations/mediawiki-config [19:33:23] 573345,4 [19:33:23] 0 min [19:33:23] 24 hr 14 min [19:33:23] beta-mediawiki-config-update-eqiad [19:33:31] See postmerge on https://integration.wikimedia.org/zuul/ [19:33:44] !log codfw+ulsfo repooled in geodns [19:33:45] (03PS1) 10MarcoAurelio: Add .gitreview [software/atskafka] - 10https://gerrit.wikimedia.org/r/573668 [19:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:12] (03CR) 10MarcoAurelio: [V: 03+2 C: 03+2] Add .gitreview [software/atskafka] - 10https://gerrit.wikimedia.org/r/573668 (owner: 10MarcoAurelio) [19:36:16] (03CR) 10Herron: [C: 03+2] acme_cheif: add lists1001 to authorized hosts for lists cert [puppet] - 10https://gerrit.wikimedia.org/r/573659 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [19:45:15] 10Operations, 10Phabricator, 10Security-Team: Adjust onboarding/offboarding scirpts to accomodate subgroups of acl*security (and name change!) - https://phabricator.wikimedia.org/T245771 (10chasemp) p:05Triage→03Medium [19:46:37] (https://phabricator.wikimedia.org/T245770 opened) [19:46:48] (03CR) 10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/573034 (owner: 10Dzahn) [19:47:07] 10Operations, 10Phabricator, 10Security-Team, 10Security: Adjust onboarding/offboarding scirpts to accomodate subgroups of acl*security (and name change!) - https://phabricator.wikimedia.org/T245771 (10chasemp) p:05Medium→03Triage Setting to NT as I'm not sure what the ops workflow is and I don't want... 
[19:47:47] (03CR) 10Dzahn: [C: 03+2] tendril: change Icinga monitoring from HTTP to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/573034 (owner: 10Dzahn) [19:47:56] (03PS5) 10Dzahn: tendril: change Icinga monitoring from HTTP to HTTPS [puppet] - 10https://gerrit.wikimedia.org/r/573034 [19:48:22] 10Operations, 10Phabricator, 10Security-Team, 10Security: Adjust onboarding/offboarding logic to accommodate changes to #security (now acl*security) - https://phabricator.wikimedia.org/T245771 (10chasemp) [19:51:10] !log deploying phabricator hotfix: https://phabricator.wikimedia.org/rPHEX2f36eee7ce67eb0c09e9bb0e79b42fc3b41d3597 for T244165 [19:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:14] T244165: Convert #Security to acl*Security - https://phabricator.wikimedia.org/T244165 [19:51:20] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 54.86 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:53:41] 10Operations, 10ops-eqiad, 10DC-Ops: (No date provided) setup/install sretest100[12].wikimedia.org - https://phabricator.wikimedia.org/T245754 (10akosiaris) >>! In T245754#5901753, @MoritzMuehlenhoff wrote: > sretest100[12] is fine with me, but given that these are meant for various tests let's rather use an... [19:54:23] (03CR) 10Alexandros Kosiaris: "> Does this also allow for access to api.svc? It would be fine to also allow eventgate-analytics to access that too." [deployment-charts] - 10https://gerrit.wikimedia.org/r/573624 (https://phabricator.wikimedia.org/T233629) (owner: 10Alexandros Kosiaris) [19:55:13] !log hotfix deployed [19:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:44] (03CR) 10Dzahn: "ran puppet on dbmonitor[12]001 and icinga1001." 
[puppet] - 10https://gerrit.wikimedia.org/r/573034 (owner: 10Dzahn) [20:00:04] James_F and longma: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200220T2000). [20:00:50] (One moment.) [20:01:03] 10Operations, 10netbox: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10crusnov) >>! In T244849#5901566, @faidon wrote: >>>! In T244849#5873311, @crusnov wrote: >> On a practical level we already maintain a fork, so if any changes are needed they can be integrated into our fork (we sho... [20:01:35] (03PS1) 10Jforrester: all wikis to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573678 [20:01:37] (03CR) 10Jforrester: [C: 03+2] all wikis to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573678 (owner: 10Jforrester) [20:02:41] (03Merged) 10jenkins-bot: all wikis to 1.35.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573678 (owner: 10Jforrester) [20:03:09] Now on canaries. [20:04:08] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.20 [20:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:28] Nothing's blown up immediately. [20:06:20] longma: Nothing exceptional to my eyes; do you concur? [20:06:28] yup [20:07:23] !log Train 1.35.0-wmf.20 provisionally looks OK on all wikis. Closing T233868. 
[20:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:27] T233868: 1.35.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T233868 [20:08:29] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 74.56 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:12:27] (03PS1) 10Herron: lists1001 override lists interface alias [puppet] - 10https://gerrit.wikimedia.org/r/573681 [20:16:48] (03CR) 10Dzahn: "https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=dbmonitor1001&service=HTTPS-dbtree" [puppet] - 10https://gerrit.wikimedia.org/r/573034 (owner: 10Dzahn) [20:16:50] 10Operations, 10ops-codfw, 10DC-Ops: (ASAP) rack/setup/install frdb2001 - https://phabricator.wikimedia.org/T245566 (10Papaul) a:05Papaul→03Jgreen @Jgreen All yours [20:18:03] 10Operations, 10ops-codfw, 10DC-Ops: (no date provided) rack/setup/install ganeti20[19-24] - https://phabricator.wikimedia.org/T244783 (10Papaul) [20:21:35] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&v [20:21:35] g-eqiad&var-topic=All&var-consumer_group=All [20:25:49] there looks to be a recent spike here in mediawiki logs of level info https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=logging-eqiad&var-cluster=logstash&var-kafka_broker=All&var-disk_device=All&from=now-3h&to=now [20:30:01] 10Operations, 10ops-eqiad, 10DC-Ops: (No date provided) setup/install 
sretest100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T245754 (10RobH) a:05MoritzMuehlenhoff→03RobH [20:33:26] https://logstash.wikimedia.org/goto/7618b08032132948faa201d3fbe5de33 - loads of "Use of ResourceLoaderSkinModule::getAvailableLogos with $wgLogoHD set instead of $wgLogos was deprecated in MediaWiki 1.35." [20:37:07] (03CR) 10Jforrester: [C: 04-1] "Ahead of meeting tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572051 (owner: 10C. Scott Ananian) [20:41:59] (03PS8) 10Bstorm: cloudstore: remove dependency on bind mounts [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) [20:44:17] (03CR) 10Bstorm: cloudstore: remove dependency on bind mounts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/571821 (https://phabricator.wikimedia.org/T224582) (owner: 10Bstorm) [20:49:04] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (10chasemp) Sent to list: > Per https://phabricator.wikimedia.org/T230951 if this list is not still active the Security Team is plannin... 
[20:49:13] (03PS8) 10Bstorm: labstore: introduce a firewall for the old primary NFS cluster [puppet] - 10https://gerrit.wikimedia.org/r/571832 [20:52:23] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 302.9 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [20:53:21] (03PS1) 10Jforrester: [mediawikiwiki] Deny the 'flow-hide' right to logged out and non-autopatrolled users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573700 (https://phabricator.wikimedia.org/T245780) [20:59:47] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/20939/" [puppet] - 10https://gerrit.wikimedia.org/r/573681 (owner: 10Herron) [20:59:52] (03CR) 10Jforrester: [C: 03+2] [mediawikiwiki] Deny the 'flow-hide' right to logged out and non-autopatrolled users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573700 (https://phabricator.wikimedia.org/T245780) (owner: 10Jforrester) [21:00:52] (03Merged) 10jenkins-bot: [mediawikiwiki] Deny the 'flow-hide' right to logged out and non-autopatrolled users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573700 (https://phabricator.wikimedia.org/T245780) (owner: 10Jforrester) [21:03:16] ema: your atskafka repo is now up and running. 
[21:04:56] (03CR) 10Herron: [C: 03+2] lists1001 override lists interface alias [puppet] - 10https://gerrit.wikimedia.org/r/573681 (owner: 10Herron) [21:05:57] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T245780 [mediawikiwiki] Deny the 'flow-hide' right to logged out and non-autoconfirmed users (duration: 00m 56s) [21:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:04] T245780: Deny the 'flow-hide' right to logged out and non-autoconfirmed users on MediaWiki.org - https://phabricator.wikimedia.org/T245780 [21:10:03] 10Operations, 10Security-Team, 10SecTeam Discussion, 10Security: Password Vault for Security Team - https://phabricator.wikimedia.org/T185236 (10chasemp) [21:11:06] 10Operations, 10ContentSecurityPolicy, 10Gerrit, 10Phabricator, and 2 others: Add gerrit.wikimedia.org to the Phabricator CSP - https://phabricator.wikimedia.org/T218308 (10chasemp) [21:13:29] (03CR) 10Dzahn: role::noc::site: refactor role/profile, stop duplicate include (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/572343 (owner: 10Dzahn) [21:16:46] (03PS1) 10Herron: hieradata: move lists interface alias definitions to host yaml [puppet] - 10https://gerrit.wikimedia.org/r/573711 (https://phabricator.wikimedia.org/T224586) [21:18:40] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/20940/" [puppet] - 10https://gerrit.wikimedia.org/r/573711 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [21:20:26] 10Operations, 10Continuous-Integration-Infrastructure: beta-mediawiki-config-update-eqiad no works - https://phabricator.wikimedia.org/T245770 (10Zoranzoki21) [21:23:31] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for mw236[6-9], mw237[0-6] [dns] - 10https://gerrit.wikimedia.org/r/573713 [21:26:33] (03PS1) 10Andrew Bogott: pdns alias: move the puppet alias to point to the floating IP [puppet] - 10https://gerrit.wikimedia.org/r/573715 
(https://phabricator.wikimedia.org/T235218) [21:29:04] (03CR) 10jerkins-bot: [V: 04-1] mailman: ensure apache mod cgi is enabled [puppet] - 10https://gerrit.wikimedia.org/r/573717 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [21:31:44] (03PS2) 10Herron: mailman: ensure apache mod cgi is enabled [puppet] - 10https://gerrit.wikimedia.org/r/573717 (https://phabricator.wikimedia.org/T224586) [21:32:59] (03CR) 10jerkins-bot: [V: 04-1] mailman: ensure apache mod cgi is enabled [puppet] - 10https://gerrit.wikimedia.org/r/573717 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [21:35:01] (03PS3) 10Herron: mailman: ensure apache mod cgi is enabled [puppet] - 10https://gerrit.wikimedia.org/r/573717 (https://phabricator.wikimedia.org/T224586) [21:38:10] (03CR) 10jerkins-bot: [V: 04-1] mailman: ensure apache mod cgi is enabled [puppet] - 10https://gerrit.wikimedia.org/r/573717 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [21:39:44] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (10chasemp) a:03chasemp [21:42:05] (03PS4) 10Herron: mailman: ensure apache mod cgi is enabled [puppet] - 10https://gerrit.wikimedia.org/r/573717 (https://phabricator.wikimedia.org/T224586) [21:44:37] 10Operations, 10Security-Team, 10Wikimedia-Mailing-lists: Transfer ownership of mediawiki-security mailman list to Security Team - https://phabricator.wikimedia.org/T230951 (10Tgr) The list had 23 emails during its entire existence; excluding yours now, the last one was in 2016. [21:46:39] (03CR) 10Herron: "good grief! it passed!" 
[puppet] - 10https://gerrit.wikimedia.org/r/573717 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [21:48:42] (03CR) 10Ottomata: [C: 03+1] Refactor statistics mountpoints to be included in all stat roles [puppet] - 10https://gerrit.wikimedia.org/r/573616 (https://phabricator.wikimedia.org/T243934) (owner: 10Elukey) [21:50:41] (03CR) 10Ottomata: [C: 03+1] "Great +1 proceed! :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/573624 (https://phabricator.wikimedia.org/T233629) (owner: 10Alexandros Kosiaris) [21:52:22] (03PS5) 10Dzahn: role::noc::site: refactor role/profile, stop duplicate include [puppet] - 10https://gerrit.wikimedia.org/r/572343 [21:53:49] (03PS1) 10Andrew Bogott: pdns alias: move the puppet alias to the internal IP of the new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/573725 (https://phabricator.wikimedia.org/T235218) [21:53:54] 10Operations, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Should 'doc' machines (i.e. doc1001) have contint-roots as a group? - https://phabricator.wikimedia.org/T245691 (10Dzahn) Could you give an example of a specific thing on doc1001 that needed fixing as root? We already spent quite some tim... 
[21:55:12] (03CR) 10Alex Monk: [C: 03+1] pdns alias: move the puppet alias to the internal IP of the new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/573725 (https://phabricator.wikimedia.org/T235218) (owner: 10Andrew Bogott) [21:55:25] (03CR) 10Andrew Bogott: [C: 03+2] pdns alias: move the puppet alias to the internal IP of the new puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/573725 (https://phabricator.wikimedia.org/T235218) (owner: 10Andrew Bogott) [21:58:38] (03CR) 10Herron: "...but introduces a duplicate resource conflict https://puppet-compiler.wmflabs.org/compiler1002/20941/" [puppet] - 10https://gerrit.wikimedia.org/r/573717 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [21:59:29] (03PS5) 10Herron: mailman: ensure apache mod cgi is enabled [puppet] - 10https://gerrit.wikimedia.org/r/573717 (https://phabricator.wikimedia.org/T224586) [22:02:20] (03CR) 10Dzahn: DNS: Add mgmt and production DNS for mw236[6-9], mw237[0-6] (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/573713 (owner: 10Papaul) [22:03:07] (03PS6) 10Herron: mailman: ensure apache mod cgi is enabled [puppet] - 10https://gerrit.wikimedia.org/r/573717 (https://phabricator.wikimedia.org/T224586) [22:07:50] (03PS1) 10Herron: role::lists ensure apache mod_cgi enabled [puppet] - 10https://gerrit.wikimedia.org/r/573732 (https://phabricator.wikimedia.org/T224586) [22:10:12] 10Operations, 10Continuous-Integration-Infrastructure: beta-mediawiki-config-update-eqiad no works - https://phabricator.wikimedia.org/T245770 (10Zoranzoki21) 05Open→03Invalid Magic happened and jobs are clean. [22:16:09] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [22:18:57] 10Operations, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Should 'doc' machines (i.e. 
doc1001) have contint-roots as a group? - https://phabricator.wikimedia.org/T245691 (10Jdforrester-WMF) 05Open→03Stalled This was a hold-over discussion item that came up when the git ownership got broken in... [22:19:25] (03PS1) 10BryanDavis: toolschecker: do not verify TLS cert for k8s API checks [puppet] - 10https://gerrit.wikimedia.org/r/573737 [22:19:27] (03PS1) 10BryanDavis: toolschecker: Add tools-k8s-etcd-[456] to checks [puppet] - 10https://gerrit.wikimedia.org/r/573738 [22:19:30] (03CR) 10Herron: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1002/20944/" [puppet] - 10https://gerrit.wikimedia.org/r/573732 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [22:20:05] (03Abandoned) 10Herron: mailman: ensure apache mod cgi is enabled [puppet] - 10https://gerrit.wikimedia.org/r/573717 (https://phabricator.wikimedia.org/T224586) (owner: 10Herron) [22:23:16] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@8908dd1]: daemons: Install stack printing signal handler on SIGUSR1 [22:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:05] (03PS1) 10Dzahn: site: add mw2317-24 in rack A3 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/573742 (https://phabricator.wikimedia.org/T241852) [22:24:16] (03PS1) 10Alex Monk: cloud eqiad1: configure new puppetmaster to only use new puppet as backend [puppet] - 10https://gerrit.wikimedia.org/r/573743 [22:25:58] (03PS2) 10Alex Monk: cloud eqiad1: configure new puppetmaster to only use new puppet as backend [puppet] - 10https://gerrit.wikimedia.org/r/573743 [22:26:10] (03CR) 10Dzahn: [C: 03+2] site: add mw2317-24 in rack A3 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/573742 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [22:28:21] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@8908dd1]: daemons: Install stack printing signal handler on SIGUSR1 (duration: 05m 05s) [22:28:24] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:37] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:43] !log restart mjolnir-kafka-bulk-daemon across eqiad [22:28:48] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 8 host(s) and their services with reason: new_install ` mw[2317-2324].codfw.wmnet ` [22:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:13] (03CR) 10RLazarus: [C: 03+1] site: add mw2317-24 in rack A3 as API appservers [puppet] - 10https://gerrit.wikimedia.org/r/573742 (https://phabricator.wikimedia.org/T241852) (owner: 10Dzahn) [22:38:10] (03CR) 10Bstorm: [C: 03+2] toolschecker: do not verify TLS cert for k8s API checks [puppet] - 10https://gerrit.wikimedia.org/r/573737 (owner: 10BryanDavis) [22:38:18] (03PS2) 10Bstorm: toolschecker: do not verify TLS cert for k8s API checks [puppet] - 10https://gerrit.wikimedia.org/r/573737 (owner: 10BryanDavis) [22:39:30] (03CR) 10Bstorm: [V: 03+2 C: 03+2] toolschecker: do not verify TLS cert for k8s API checks [puppet] - 10https://gerrit.wikimedia.org/r/573737 (owner: 10BryanDavis) [22:47:27] (03PS3) 10Jforrester: MWScript: Allow MWScript to be invoked via phpdbg as well as the cli [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [22:47:39] (03CR) 10Jforrester: [C: 03+1] "Let's land this?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/573558 (https://phabricator.wikimedia.org/T244549) (owner: 10Hnowlan) [22:47:59] (03CR) 10Bstorm: "This one is not likely to work:" [puppet] - 10https://gerrit.wikimedia.org/r/573738 (owner: 10BryanDavis) [22:48:11] (03CR) 10Bstorm: [C: 04-1] toolschecker: Add tools-k8s-etcd-[456] to checks [puppet] - 10https://gerrit.wikimedia.org/r/573738 (owner: 10BryanDavis) [22:51:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/572343 (owner: 10Dzahn) [22:51:26] (03CR) 10BryanDavis: [C: 04-1] "> etcd uses client certs on the new setup, so I suspect it will need" [puppet] - 10https://gerrit.wikimedia.org/r/573738 (owner: 10BryanDavis) [22:58:45] (03CR) 10Thcipriani: [C: 03+2] "Was able to build and update personal testing instance" [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/562363 (owner: 10Paladox) [22:59:01] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:28] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 8 host(s) and their services with reason: new_install ` mw[2317-2324].codfw.wmnet ` [23:01:51] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/20945/mwmaint1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/572343 (owner: 10Dzahn) [23:02:38] (03Abandoned) 10Jforrester: Stop setting wgLogoHD for back-compat. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/572981 (owner: 10Jforrester) [23:04:09] (03CR) 10Jforrester: [C: 04-1] Merge $wgLogo into $wgLogos (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [23:04:44] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.20/includes/pager/IndexPager.php: IndexPager: Limit offset params to the max of the indices available (duration: 00m 56s) [23:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:36] (03CR) 10Dzahn: "noop except the role got removed. Notice: /Stage[main]/Motd/File[/etc/update-motd.d/05-role-noc--site]/ensure: removed" [puppet] - 10https://gerrit.wikimedia.org/r/572343 (owner: 10Dzahn) [23:06:19] (03Merged) 10jenkins-bot: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/562363 (owner: 10Paladox) [23:08:38] (03PS1) 10Paladox: Update delete-project [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/573759 [23:08:51] (03PS1) 10RLazarus: Convert all the apache-fast-test URLs to httpbb tests. [puppet] - 10https://gerrit.wikimedia.org/r/573760 [23:10:02] (03PS2) 10Jforrester: Stop setting wgVectorPrintLogo for back-compat. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572980 [23:10:04] (03PS7) 10Jforrester: Merge $wgLogo and $wgLogoHD into $wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) [23:10:06] (03PS3) 10Jforrester: [DNM] Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 [23:12:23] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [23:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:43] (03CR) 10Jforrester: [C: 04-1] "Deploy order: CS, then IS. This will mean that LogoHD / Logos['2x'] will be briefly unset." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/570379 (https://phabricator.wikimedia.org/T232140) (owner: 10Jforrester) [23:14:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [23:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:49] 10Operations, 10ops-codfw, 10serviceops, 10Patch-For-Review: rack/setup/install new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10ops-monitoring-bot) Icinga downtime for 1:00:00 set by dzahn@cumin1001 on 8 host(s) and their services with reason: new_install ` mw[2317-2324].codfw.wmnet ` [23:17:14] 10Operations, 10Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10Paladox) Per discussion me and @thcipriani just had, we found lfs objects using 15G, so if we remove /srv/dbdump and recreate it as a 20g partition, we can move the objects there. [23:19:04] (03PS1) 10Majavah: Add noindex for NS_USER and NS_USER_TALK for nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573761 (https://phabricator.wikimedia.org/T245787) [23:19:48] 10Operations, 10Gerrit: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 (10Paladox) Actually, we will use that partition for db readonly, so i think a new /srv/lfs partition 18g would do. 
[23:21:05] (03PS1) 10Dwisehaupt: Plumb in frdb2001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/573763 (https://phabricator.wikimedia.org/T245566) [23:25:52] !log ganeti1003 - adding another virtual 20G disk to gerrit1002 (T243808) [23:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:56] T243808: gerrit1002 running out of space - https://phabricator.wikimedia.org/T243808 [23:26:43] (03PS4) 10Jforrester: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 [23:28:01] (03PS5) 10Jforrester: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 [23:32:03] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw2381[7-9].codfw.wmnet [23:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:46] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw231[7-9].codfw.wmnet [23:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:01] !log dzahn@cumin1001 conftool action : set/weight=15; selector: name=mw232[0-4].codfw.wmnet [23:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:44] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw231[7-9].codfw.wmnet [23:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:16] (03PS6) 10Jforrester: Merge wgMinervaCustomLogos into wgLogos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572998 [23:35:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:37:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:39:09] (03CR) 10Jforrester: "Why not just do I2463fecfafbb4c08d80f624adf4cd47a6fb4e660?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573419 (https://phabricator.wikimedia.org/T232140) (owner: 10Jdlrobson) [23:40:37] (03CR) 10Jforrester: [C: 03+2] Stop setting wgVectorPrintLogo for back-compat. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572980 (owner: 10Jforrester) [23:41:32] (03Merged) 10jenkins-bot: Stop setting wgVectorPrintLogo for back-compat. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/572980 (owner: 10Jforrester) [23:45:10] !log gerrit1002 - test VM - rebooting for new disk [23:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:55] (03PS1) 10Andrew Bogott: puppetmaster apache config: update allowed path for the master API [puppet] - 10https://gerrit.wikimedia.org/r/573772 [23:45:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw232[0-4].codfw.wmnet [23:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:43] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Stop setting wgVectorPrintLogo for back-compat., not read since wmf.19 (duration: 00m 56s) [23:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:51] (03PS2) 10Jforrester: Add noindex for NS_USER and NS_USER_TALK for nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573761 (https://phabricator.wikimedia.org/T245787) (owner: 10Majavah) [23:46:57] (03CR) 10Jforrester: [C: 03+2] Add noindex for NS_USER and NS_USER_TALK for nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573761 (https://phabricator.wikimedia.org/T245787) (owner: 10Majavah) [23:47:44] (03CR) 10Jforrester: "This is no longer blocked on scap, right?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/508724 (https://phabricator.wikimedia.org/T99740) (owner: 10Jforrester) [23:47:51] (03Merged) 10jenkins-bot: Add noindex for NS_USER and NS_USER_TALK for nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/573761 (https://phabricator.wikimedia.org/T245787) (owner: 10Majavah) [23:49:00] (03CR) 10Alex Monk: [C: 03+1] "suspect it's a prefix match but sure" [puppet] - 10https://gerrit.wikimedia.org/r/573772 (owner: 10Andrew Bogott) [23:49:52] (03CR) 10Andrew Bogott: [C: 03+2] puppetmaster apache config: update allowed path for the master API [puppet] - 10https://gerrit.wikimedia.org/r/573772 (owner: 10Andrew Bogott) [23:50:49] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T245787 [nlwiki] Add noindex for NS_USER and NS_USER_TALK (duration: 00m 56s) [23:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:53] T245787: Keep userspace of nlwikipedia out of search engines - https://phabricator.wikimedia.org/T245787 [23:52:28] James_F: I already scheduled that for SWAT, but thanks :D [23:52:48] tassu: No reason to wait 10 minutes. :-) [23:54:36] (03PS6) 10Jforrester: [BETA] Enable LCStoreStaticArray format on Beta Cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508724 (https://phabricator.wikimedia.org/T99740) [23:58:17] (03CR) 10Jforrester: [C: 03+2] [BETA] Enable LCStoreStaticArray format on Beta Cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508724 (https://phabricator.wikimedia.org/T99740) (owner: 10Jforrester) [23:59:11] (03Merged) 10jenkins-bot: [BETA] Enable LCStoreStaticArray format on Beta Cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508724 (https://phabricator.wikimedia.org/T99740) (owner: 10Jforrester)