[00:00:04] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200910T0000).
[00:00:45] (CR) Bstorm: "This change is ready for review." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/626235 (https://phabricator.wikimedia.org/T218426) (owner: Nskaggs)
[00:04:58] (PS2) CRusnov: base/apt-upgrade-activity.py: Port to Python3 [puppet] - https://gerrit.wikimedia.org/r/624732 (https://phabricator.wikimedia.org/T247364)
[00:06:09] (CR) jerkins-bot: [V: -1] base/apt-upgrade-activity.py: Port to Python3 [puppet] - https://gerrit.wikimedia.org/r/624732 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:06:29] (CR) CRusnov: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/624732 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:09:21] !log deploying phabricator update 2020-09-10 https://phabricator.wikimedia.org/project/view/4755/
[00:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:36] Operations, DNS, Traffic: 'skip_first' feature flag for gdnsd GeoIP plugin - https://phabricator.wikimedia.org/T261340 (BBlack) This is implemented "upstream" in https://github.com/gdnsd/gdnsd/commit/b17bb0b073b4a9c6e2a65d2ddee2e5bc39f1b717 which is released with v3.3.0, so we're over halfway there....
[00:17:08] (PS3) CRusnov: base/apt-upgrade-activity.py: Port to Python3 [puppet] - https://gerrit.wikimedia.org/r/624732 (https://phabricator.wikimedia.org/T247364)
[00:22:13] twentyafterfour just got `Run the storage upgrade script to upgrade databases (host "m3-master.eqiad.wmnet" is out of date). Missing patches:` - is this part of the update or an issue caused by it?
[00:22:43] DannyS712: unintended but fixing it now
[00:23:17] !log applying database migrations to phabricator db
[00:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:36] works now
[00:23:44] !log done. Phabricator update complete
[00:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:28] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@00b0e20]: Update to current master
[00:24:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:10] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@00b0e20]: Update to current master (duration: 06m 42s)
[00:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:11] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) @Marostegui Dell wants for us to run onboard hardware diagnostics which can take many hours before completion . ` Papaul, Apolog...
[00:42:01] Operations, ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (Papaul) @wiki_willy this since is out of warranty since 2018 and i have no spare or decommissioned server onsite to pull the BBU from. I am requesting approval to purchase new BBU. Thanks
[00:44:13] Operations, MW-on-K8s, TechCom-RFC, serviceops, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) Note that Shellbox depends on monolog and core will depend on shellbox, so this will make core indirectly depend on...
[00:49:33] Operations, ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (wiki_willy) Hi @Papaul - when I look at Netbox, it shows that ms-be2019 was purchased 5yrs ago.
https://netbox.wikimedia.org/dcim/devices/240/ @fgiunchedi - isn't this one going to be refreshed, as soon as...
[00:58:01] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) is CRITICAL: Test Get media list from test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[00:59:55] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[01:03:32] (PS1) Ryan Kemper: elasticsearch: new --write-queue-datacenter flag [software/spicerack] - https://gerrit.wikimedia.org/r/626240 (https://phabricator.wikimedia.org/T261239)
[01:03:34] (PS12) Ryan Kemper: elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239)
[01:04:44] (CR) jerkins-bot: [V: -1] elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[01:05:03] (PS13) Ryan Kemper: elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239)
[01:05:49] (CR) jerkins-bot: [V: -1] elasticsearch: new --write-queue-datacenter flag [software/spicerack] - https://gerrit.wikimedia.org/r/626240 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[01:06:02] (CR) jerkins-bot: [V: -1] elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[01:08:30] Operations, ops-codfw, DC-Ops, fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install
frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (Papaul) ` papaul@fasw-c-codfw# show | compare [edit interfaces interface-range disabled] - member...
[01:09:18] Operations, ops-codfw, DC-Ops, fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (Papaul)
[01:20:25] (PS14) Ryan Kemper: elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239)
[01:21:19] (CR) jerkins-bot: [V: -1] elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[01:25:26] (CR) Jeena Huneidi: [C: +1] "Thanks! This is really helpful" [deployment-charts] - https://gerrit.wikimedia.org/r/625910 (owner: Hashar)
[01:31:50] (PS1) Papaul: DNS: Add production DNS for frmx2001 and frdata2001 [dns] - https://gerrit.wikimedia.org/r/626242
[01:34:20] (CR) Papaul: [C: +2] DNS: Add production DNS for frmx2001 and frdata2001 [dns] - https://gerrit.wikimedia.org/r/626242 (owner: Papaul)
[01:35:34] Operations, ops-codfw, DC-Ops, fundraising-tech-ops, Patch-For-Review: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (Papaul)
[01:37:11] Operations, ops-codfw, DC-Ops, fundraising-tech-ops, Patch-For-Review: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (Papaul) @Jgreen @Dwisehaupt this is done on my end, please fell free to take o...
[02:04:53] (PS2) Ryan Kemper: elasticsearch: new --write-queue-datacenter flag [software/spicerack] - https://gerrit.wikimedia.org/r/626240 (https://phabricator.wikimedia.org/T261239)
[02:07:01] (CR) jerkins-bot: [V: -1] elasticsearch: new --write-queue-datacenter flag [software/spicerack] - https://gerrit.wikimedia.org/r/626240 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[02:13:18] (PS15) Ryan Kemper: elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239)
[02:15:02] (CR) jerkins-bot: [V: -1] elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[02:16:51] (PS16) Ryan Kemper: elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239)
[02:19:24] Operations, MW-on-K8s, TechCom-RFC, serviceops, Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (tstarling) Regarding restrictions again. I'm not a big fan of Firejail after my recent code review and bug reports, so I'm loo...
[02:22:25] (CR) Ryan Kemper: "Last thing for me to address is switching from --write-queue-datacenter (string, single dc) to --write-queue-datacenters (tuple, one or mo" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[03:16:55] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:20:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:39:24] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (Marostegui) From what I can see this host isn't assigned to a partman recipe, but I am going to leave this to @jcrespo as this host is going to th...
[04:43:48] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) Thank you @Papaul - let me know when you want me to have the host ready for you and I will make sure to have MySQL stopped there.
[04:47:51] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[04:51:41] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[05:13:30] (PS17) Ryan Kemper: elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239)
[05:23:46] (PS3) Ryan Kemper: elasticsearch: new --write-queue-datacenter flag [software/spicerack] - https://gerrit.wikimedia.org/r/626240 (https://phabricator.wikimedia.org/T261239)
[05:25:30] (CR) Giuseppe Lavagetto: [C: +1] "LGTM, thanks for digging!" [puppet] - https://gerrit.wikimedia.org/r/624231 (https://phabricator.wikimedia.org/T261373) (owner: Ryan Kemper)
[05:28:42] (PS18) Ryan Kemper: elasticsearch: Let spicerack handle wait for all write queues to clear [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239)
[05:29:50] !log Deploy schema change on s3 master - T260476
[05:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:57] T260476: Extend sites.site_global_key on WMF production - https://phabricator.wikimedia.org/T260476
[05:36:11] (CR) Ryan Kemper: "See inline" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[05:38:40] (PS1) Volans: spicerack: remove double require of same profile [puppet] - https://gerrit.wikimedia.org/r/626266
[05:44:58] (CR) Volans: [C: +2] "Compiler looks happy, merging: https://puppet-compiler.wmflabs.org/compiler1002/25012/cumin1001.eqiad.wmnet/index.html" [puppet]
- https://gerrit.wikimedia.org/r/626266 (owner: Volans)
[05:49:31] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 240, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:50:09] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:58:28] (CR) Volans: [C: +2] dns: generate records for all VMs [software/netbox-extras] - https://gerrit.wikimedia.org/r/624154 (https://phabricator.wikimedia.org/T244153) (owner: Volans)
[06:04:07] (PS1) Marostegui: db1093,db1109,db1123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/626267
[06:04:50] (CR) Marostegui: [C: +2] db1093,db1109,db1123: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/626267 (owner: Marostegui)
[06:04:56] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (RKemper) @Papaul Sorry for the delayed response! Thinking out loud, since we're using SSDs I imagine the penalty for not performing sequential reads is no...
[06:16:57] Operations, Product-Infrastructure-Team-Backlog, Traffic, serviceops, Wikimedia-production-error: [Bug] Page content service is deployed with localhost links to the CSS and JS, breaking all pages that have been edited recently - https://phabricator.wikimedia.org/T262437 (Joe) A short-form inc...
[06:26:49] (PS1) Matthias Mullie: Enable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/626256
[06:32:17] (CR) Giuseppe Lavagetto: envoy: add a new endpoint for services calling restbase (3 comments) [puppet] - https://gerrit.wikimedia.org/r/626158 (owner: Giuseppe Lavagetto)
[06:34:37] (CR) Giuseppe Lavagetto: envoy: add a new endpoint for services calling restbase (1 comment) [puppet] - https://gerrit.wikimedia.org/r/626158 (owner: Giuseppe Lavagetto)
[06:35:17] (PS2) Giuseppe Lavagetto: envoy: add a new endpoint for services calling restbase [puppet] - https://gerrit.wikimedia.org/r/626158
[06:38:42] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/25013/" [puppet] - https://gerrit.wikimedia.org/r/626158 (owner: Giuseppe Lavagetto)
[06:43:50] (PS2) Volans: scripts: allocate IPs, add Cassandra support [software/netbox-extras] - https://gerrit.wikimedia.org/r/624087 (https://phabricator.wikimedia.org/T258729)
[06:49:26] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (jcrespo) @Papaul, as a general rule, al db* hosts with the same spec, as far as first install, they should have the custom/db.cfg recipe. I believ...
[06:53:05] (CR) Muehlenhoff: "I'merging this one as-is, we can still add something like netbox-testing when needed."
[puppet] - https://gerrit.wikimedia.org/r/626147 (owner: Muehlenhoff)
[06:53:07] (CR) Muehlenhoff: [C: +2] Update netbox Cumin alias [puppet] - https://gerrit.wikimedia.org/r/626147 (owner: Muehlenhoff)
[06:57:15] (CR) Muehlenhoff: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/626137 (https://phabricator.wikimedia.org/T253957) (owner: Jbond)
[06:57:47] (PS4) Muehlenhoff: Enable CAS for Hue [puppet] - https://gerrit.wikimedia.org/r/617385
[07:02:11] (PS1) Giuseppe Lavagetto: mobileapps: use restbase-for-services in staging [deployment-charts] - https://gerrit.wikimedia.org/r/626269
[07:02:13] (PS1) Giuseppe Lavagetto: mobileapps: use restbase-for-services everywhere [deployment-charts] - https://gerrit.wikimedia.org/r/626270
[07:03:38] !log resize search-loader vms (+4 vcores +4GB of ram) on Ganeti - T262385
[07:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:45] T262385: Increase vcores and ram on search-loader VMs - https://phabricator.wikimedia.org/T262385
[07:08:57] (PS2) Elukey: Bump msearch daemon parallelism [puppet] - https://gerrit.wikimedia.org/r/626105 (https://phabricator.wikimedia.org/T260305) (owner: ZPapierski)
[07:09:39] (PS3) Elukey: mjolnir: Bump msearch daemon parallelism [puppet] - https://gerrit.wikimedia.org/r/626105 (https://phabricator.wikimedia.org/T260305) (owner: ZPapierski)
[07:10:16] (CR) Elukey: [C: +2] mjolnir: Bump msearch daemon parallelism [puppet] - https://gerrit.wikimedia.org/r/626105 (https://phabricator.wikimedia.org/T260305) (owner: ZPapierski)
[07:10:58] Operations, Discovery, Discovery-Search (Current work): Increase vcores and ram on search-loader VMs - https://phabricator.wikimedia.org/T262385 (elukey) Open→Resolved a:elukey
[07:11:52] Operations, observability: Enable CAS authentication for Grafana - https://phabricator.wikimedia.org/T262512 (MoritzMuehlenhoff)
[07:14:30] (CR) Giuseppe Lavagetto: wikifeeds: use the service proxy in staging (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/626132 (https://phabricator.wikimedia.org/T255878) (owner: Giuseppe Lavagetto)
[07:16:49] (PS1) Muehlenhoff: Add Cumin alias for analytics-launcher [puppet] - https://gerrit.wikimedia.org/r/626271
[07:18:10] (PS2) Muehlenhoff: Add Cumin alias for analytics-launcher [puppet] - https://gerrit.wikimedia.org/r/626271
[07:24:30] (CR) Muehlenhoff: "The Python parts LGTM, but some unrelated Puppet patch to monitor.pp seems to have crept it :-)" [puppet] - https://gerrit.wikimedia.org/r/624732 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[07:25:10] Operations, ops-eqiad, DBA, DC-Ops: Thur, Sept 10 PDU Upgrade 12pm-4pm UTC- Racks D7 and D8 - https://phabricator.wikimedia.org/T261454 (wiki_willy) Apologies for the last minute change, the upgrades for these 2x PDUs will be postponed until a later date. Both dc-ops engineers at eqiad are recov...
[07:25:52] (CR) Matthias Mullie: [C: +1] "Ready for backport" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/626256 (owner: Matthias Mullie)
[07:27:08] Operations, ops-eqiad, DBA, DC-Ops: Wed, Sept 9 PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (wiki_willy) Latest update: Due to another separate injury, the upgrades for these 2x PDUs will be postponed again for a later date. No PDU upgrades for the rest...
[07:27:55] (CR) Gehel: [C: -1] "minor comment inline" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[07:27:59] (PS1) Marostegui: Revert "db1093,db1109,db1123: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/626257
[07:28:44] (CR) Marostegui: [C: +2] Revert "db1093,db1109,db1123: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/626257 (owner: Marostegui)
[07:31:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2014 after cloning es2026', diff saved to https://phabricator.wikimedia.org/P12550 and previous config saved to /var/cache/conftool/dbconfig/20200910-073107-marostegui.json
[07:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:19] Operations, netops: Intermittent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (ayounsi) Open→Resolved As far as I know this didn't reproduce since.
1/ has been solved by removing IGMP and 2/ by disabling the fpc3-fpc8 link
[07:33:18] (PS1) Marostegui: es2014: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/626273
[07:34:27] (CR) Marostegui: [C: +2] es2014: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/626273 (owner: Marostegui)
[07:44:22] (PS1) JMeybohm: Remove etcd100[123] hosts [puppet] - https://gerrit.wikimedia.org/r/626274 (https://phabricator.wikimedia.org/T239835)
[07:45:07] (PS2) JMeybohm: Remove etcd100[123] hosts [puppet] - https://gerrit.wikimedia.org/r/626274 (https://phabricator.wikimedia.org/T239835)
[07:45:47] (PS4) KartikMistry: Update cxserver to 2020-08-30-011854-production [deployment-charts] - https://gerrit.wikimedia.org/r/623475 (https://phabricator.wikimedia.org/T253439)
[07:46:34] (PS1) Jcrespo: mariadb: Remove db1133 from full reimage, add db2141 & db1150 [puppet] - https://gerrit.wikimedia.org/r/626275 (https://phabricator.wikimedia.org/T260817)
[07:47:03] (CR) KartikMistry: "@Alex, can you check if with new deployment method, this is sufficient for updating cxserver.
If yes, Please +1, I'll deploy it sometime l" [deployment-charts] - https://gerrit.wikimedia.org/r/623475 (https://phabricator.wikimedia.org/T253439) (owner: KartikMistry)
[07:49:27] (PS2) Jcrespo: mariadb: Remove db1133 from full reimage, add db2141 & db1150 [puppet] - https://gerrit.wikimedia.org/r/626275 (https://phabricator.wikimedia.org/T260817)
[07:52:44] (CR) Jcrespo: [C: +2] mariadb: Remove db1133 from full reimage, add db2141 & db1150 [puppet] - https://gerrit.wikimedia.org/r/626275 (https://phabricator.wikimedia.org/T260817) (owner: Jcrespo)
[07:57:47] (CR) JMeybohm: [C: +1] "> Patch Set 4:" [deployment-charts] - https://gerrit.wikimedia.org/r/623475 (https://phabricator.wikimedia.org/T253439) (owner: KartikMistry)
[07:58:21] (PS1) Muehlenhoff: Remove expiry date/contact for eyener [puppet] - https://gerrit.wikimedia.org/r/626336
[07:59:45] (CR) Muehlenhoff: "You can also remove the then obsolete role::etcd::kubernetes" [puppet] - https://gerrit.wikimedia.org/r/626274 (https://phabricator.wikimedia.org/T239835) (owner: JMeybohm)
[08:00:04] (CR) Muehlenhoff: [C: +2] Remove expiry date/contact for eyener [puppet] - https://gerrit.wikimedia.org/r/626336 (owner: Muehlenhoff)
[08:03:14] (PS1) JMeybohm: Remove etcd100[123] hosts [dns] - https://gerrit.wikimedia.org/r/626337 (https://phabricator.wikimedia.org/T239835)
[08:08:30] (PS3) JMeybohm: Remove etcd100[123] hosts [puppet] - https://gerrit.wikimedia.org/r/626274 (https://phabricator.wikimedia.org/T239835)
[08:23:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2014 after cloning es2026', diff saved to https://phabricator.wikimedia.org/P12551 and previous config saved to /var/cache/conftool/dbconfig/20200910-082304-marostegui.json
[08:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:02] (PS3) Vgutierrez: varnishkafka 1.0.15 [software/varnish/varnishkafka] (debian) -
https://gerrit.wikimedia.org/r/626177 (https://phabricator.wikimedia.org/T261632)
[08:25:11] (CR) jerkins-bot: [V: -1] varnishkafka 1.0.15 [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/626177 (https://phabricator.wikimedia.org/T261632) (owner: Vgutierrez)
[08:27:09] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: WDQS Categories reload is failing on thankyouwiki - https://phabricator.wikimedia.org/T261097 (Zbyszko) I went with the fix for wikidata/query/rdf scripts - it made most sense to me, since issues...
[08:31:37] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts: ` ['db2141.codfw.wmnet'] ` The log can be fo...
[08:44:03] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (jcrespo)
[08:47:08] (CR) Hashar: [C: +1] logspam-watch: display seconds and refresh each cycle [puppet] - https://gerrit.wikimedia.org/r/626224 (https://phabricator.wikimedia.org/T260826) (owner: Brennen Bearnes)
[08:47:21] (PS2) Arturo Borrero Gonzalez: cumin: aliases: include A:cloudceph in the general A:cloud-eqiad1 alias [puppet] - https://gerrit.wikimedia.org/r/623982
[08:47:29] (PS3) Arturo Borrero Gonzalez: cumin: aliases: include A:cloudceph in the general A:cloud-eqiad1 alias [puppet] - https://gerrit.wikimedia.org/r/623982
[08:47:36] !log jynus@cumin2001 START - Cookbook sre.hosts.downtime
[08:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:29] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/624732 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[08:48:38]
(CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/626274 (https://phabricator.wikimedia.org/T239835) (owner: JMeybohm)
[08:48:56] (CR) Arturo Borrero Gonzalez: [C: +2] cumin: aliases: include A:cloudceph in the general A:cloud-eqiad1 alias [puppet] - https://gerrit.wikimedia.org/r/623982 (owner: Arturo Borrero Gonzalez)
[08:49:44] !log jynus@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[08:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:19] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: k8s: run nginx-ingress on ingress dedicated nodes [puppet] - https://gerrit.wikimedia.org/r/626133 (https://phabricator.wikimedia.org/T250172) (owner: Arturo Borrero Gonzalez)
[08:56:26] Operations, netops: Prioritize underdog IXP - https://phabricator.wikimedia.org/T262517 (ayounsi) p:Triage→Medium
[09:00:17] (PS1) Jcrespo: mariadb-backups: Add db2141 to the dbstore role for backup source [puppet] - https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551)
[09:04:10] (PS1) Giuseppe Lavagetto: cxserver: use the restbase-for-services endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/626341
[09:12:00] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db2141.codfw.wmnet'] ` and were **ALL** successful.
[09:13:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es2014 after cloning es2026', diff saved to https://phabricator.wikimedia.org/P12554 and previous config saved to /var/cache/conftool/dbconfig/20200910-091335-marostegui.json
[09:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:15] (CR) Hnowlan: [C: +1] Allow public access to API Portal main page for private launch [mediawiki-config] - https://gerrit.wikimedia.org/r/626229 (https://phabricator.wikimedia.org/T262480) (owner: Cicalese)
[09:17:17] (CR) Gehel: [C: -1] "See comments inline. Ping me on IRC if you want to have a more complete rational." (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/626240 (https://phabricator.wikimedia.org/T261239) (owner: Ryan Kemper)
[09:18:02] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/625631 (https://phabricator.wikimedia.org/T253957) (owner: Jbond)
[09:19:29] (CR) Alexandros Kosiaris: [C: +1] cxserver: use the restbase-for-services endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/626341 (owner: Giuseppe Lavagetto)
[09:23:40] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (jcrespo)
[09:24:18] !log creating missing cirrus indices for jawikivoyage T260228
[09:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:25] T260228: Add supportive gestures to Wikipedia Preview - Gallery part - https://phabricator.wikimedia.org/T260228
[09:25:21] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (jcrespo) @Papaul, this is all completed after my patch. Only leaving it open so you can see it (e.g. in case you need to do something else not on...
[09:25:26] damn wrong ticket
[09:25:37] Operations, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Patch-For-Review: WDQS Categories reload is failing on thankyouwiki - https://phabricator.wikimedia.org/T261097 (Zbyszko) @RKemper we should be able to retry categories reload after deploying this.
[09:26:19] !log creating missing cirrus indices for jawikivoyage T262518
[09:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:24] T262518: Search not working on Japanese Wikivoyage - https://phabricator.wikimedia.org/T262518
[09:27:34] dcausse: do you know if that was a step was skipped on the wiki creation steps, or maybe not documented, or something that would be actionable?
[09:27:56] jynus: yes we have a task for this I was about to link it to the ticket
[09:28:06] cool, so you got it then
[09:28:14] (PS2) Arturo Borrero Gonzalez: toolforge: k8s: run nginx-ingress on ingress dedicated nodes [puppet] - https://gerrit.wikimedia.org/r/626133 (https://phabricator.wikimedia.org/T250172)
[09:28:16] yes it's a recurring problem :(
[09:28:26] I think this created frustration on some users, I got involved because I so one lost
[09:28:33] (PS1) Hnowlan: api-gateway: Fix isser to match one used by meta [deployment-charts] - https://gerrit.wikimedia.org/r/626342 (https://phabricator.wikimedia.org/T235277)
[09:28:37] and tried to push it forward
[09:28:48] s/so/saw/
[09:28:52] yes, thanks for pinging our tags
[09:29:29] (CR) Alexandros Kosiaris: [C: +1] "+1 from me as well" [deployment-charts] - https://gerrit.wikimedia.org/r/623475 (https://phabricator.wikimedia.org/T253439) (owner: KartikMistry)
[09:30:15] (CR) Hnowlan: [C: +2] api-gateway: Fix isser to match one used by meta [deployment-charts] - https://gerrit.wikimedia.org/r/626342 (https://phabricator.wikimedia.org/T235277) (owner: Hnowlan)
[09:31:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es2014
after cloning es2026', diff saved to https://phabricator.wikimedia.org/P12555 and previous config saved to /var/cache/conftool/dbconfig/20200910-093106-marostegui.json
[09:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:14] (CR) Jcrespo: "FYI" [puppet] - https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) (owner: Jcrespo)
[09:31:35] (Merged) jenkins-bot: api-gateway: Fix isser to match one used by meta [deployment-charts] - https://gerrit.wikimedia.org/r/626342 (https://phabricator.wikimedia.org/T235277) (owner: Hnowlan)
[09:32:04] (PS2) Jcrespo: mariadb-backups: Add db2141 to the dbstore role for backup source [puppet] - https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551)
[09:34:26] (CR) Jcrespo: "To elaborate on the FYI, the extra backup source will provide extra reliablity: no more "a backup source had a hw problem, now we cannot t" [puppet] - https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) (owner: Jcrespo)
[09:35:01] (CR) Jcrespo: "CC also manager" [puppet] - https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) (owner: Jcrespo)
[09:35:44] (PS6) Jbond: base::expose_puppet_certs: add ability to expose p12 cert [puppet] - https://gerrit.wikimedia.org/r/626137 (https://phabricator.wikimedia.org/T253957)
[09:37:26] (CR) Jbond: [C: +2] base::expose_puppet_certs: add ability to expose p12 cert (1 comment) [puppet] - https://gerrit.wikimedia.org/r/626137 (https://phabricator.wikimedia.org/T253957) (owner: Jbond)
[09:40:33] (CR) Giuseppe Lavagetto: [C: +2] cxserver: use the restbase-for-services endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/626341 (owner: Giuseppe Lavagetto)
[09:40:42] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[09:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:43] (03Merged) 10jenkins-bot: cxserver: use the restbase-for-services endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/626341 (owner: 10Giuseppe Lavagetto) [09:42:13] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [09:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:24] (03CR) 10Jbond: [C: 03+2] profile::java: add the puppet CA cert to the java truststore by default [puppet] - 10https://gerrit.wikimedia.org/r/625631 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [09:42:30] (03PS6) 10Jbond: profile::java: add the puppet CA cert to the java truststore by default [puppet] - 10https://gerrit.wikimedia.org/r/625631 (https://phabricator.wikimedia.org/T253957) [09:43:11] (03CR) 10Jbond: [V: 03+2 C: 03+2] profile::java: add the puppet CA cert to the java truststore by default [puppet] - 10https://gerrit.wikimedia.org/r/625631 (https://phabricator.wikimedia.org/T253957) (owner: 10Jbond) [09:43:19] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[09:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:25] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) [09:43:49] (03Abandoned) 10ZPapierski: Remove unnecessary daemon definitions [puppet] - 10https://gerrit.wikimedia.org/r/622355 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [09:43:49] <_joe_> hnowlan: we should really convert api-gateway to the new layout :) [09:44:55] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.1288 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:45:01] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [09:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:27] _joe_: yep, have a PR for it but I just want to get to point where the config is churning less [09:49:16] (03CR) 10Filippo Giunchedi: mediawiki: update alerts on logstash logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625982 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [09:51:35] jbond42: not sure if fixed yet but the p12 change is causing lots of puppet failures [09:51:47] e.g. Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/etc/rsyslog/ssl/server.p12],File[/etc/debmonitor/ssl/server.p12] [09:52:42] (03PS1) 10Giuseppe Lavagetto: cxserver: bump chart version to pick up the service proxy addition [deployment-charts] - 10https://gerrit.wikimedia.org/r/626345 [09:54:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cxserver: bump chart version to pick up the service proxy addition [deployment-charts] - 10https://gerrit.wikimedia.org/r/626345 (owner: 10Giuseppe Lavagetto) [09:55:23] (03Merged) 10jenkins-bot: cxserver: bump chart version to pick up the service proxy addition [deployment-charts] - 10https://gerrit.wikimedia.org/r/626345 (owner: 10Giuseppe Lavagetto) [09:56:36] godog: thanks ill take a look [09:58:40] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [09:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:16] (03PS2) 10Mvolz: Update citoid to 2020-09-08-122926-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/626138 (https://phabricator.wikimedia.org/T248571) [09:59:21] not sure why it's doing that on debmonitor, it's not using java::profile? [10:00:04] mvolz: It is that lovely time of the day again! You are hereby commanded to deploy Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200910T1000). [10:00:15] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:39] (03PS1) 10Jbond: base::puppet: opnl.y manage p12 file if source exists [puppet] - 10https://gerrit.wikimedia.org/r/626346 [10:00:46] moritzm: the p12 source file dosn;t exist. 
i think that even though the file resource has ensure => absent, it still needs the source file to exist [10:01:13] (03PS2) 10Jbond: base::puppet: only manage p12 file if source exists [puppet] - 10https://gerrit.wikimedia.org/r/626346 [10:02:13] ack, makes sense [10:02:28] (03CR) 10Jbond: [C: 03+2] base::puppet: only manage p12 file if source exists [puppet] - 10https://gerrit.wikimedia.org/r/626346 (owner: 10Jbond) [10:03:12] (03CR) 10Mvolz: [C: 03+2] Update citoid to 2020-09-08-122926-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/626138 (https://phabricator.wikimedia.org/T248571) (owner: 10Mvolz) [10:03:54] ok fixed now, re-running puppet on failed nodes [10:04:39] (03CR) 10jerkins-bot: [V: 04-1] Update citoid to 2020-09-08-122926-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/626138 (https://phabricator.wikimedia.org/T248571) (owner: 10Mvolz) [10:05:59] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:13] (03PS2) 10Giuseppe Lavagetto: cxserver: enable the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/626110 (https://phabricator.wikimedia.org/T255879) [10:06:20] (03CR) 10jerkins-bot: [V: 04-1] cxserver: enable the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/626110 (https://phabricator.wikimedia.org/T255879) (owner: 10Giuseppe Lavagetto) [10:07:08] (03PS3) 10Giuseppe Lavagetto: cxserver: enable the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/626110 (https://phabricator.wikimedia.org/T255879) [10:09:25] _joe_: any idea why this failed? 
https://integration.wikimedia.org/ci/job/helm-lint/2482/console, https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/626138/2/helmfile.d/services/citoid/values.yaml [10:09:36] seems unrelated to the actual change [10:09:48] <_joe_> mvolz: I think you hit a known bug [10:10:05] nooo :) [10:10:20] <_joe_> mvolz: good news is - it's a race condition in the CI process [10:10:38] (03CR) 10Giuseppe Lavagetto: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626138 (https://phabricator.wikimedia.org/T248571) (owner: 10Mvolz) [10:10:50] <_joe_> this time it should go better :P [10:11:04] <_joe_> but point taken, I will fix this horror today. [10:11:54] <_joe_> uhm "recheck" doesn't work? [10:12:09] <_joe_> yes it did [10:12:23] <_joe_> mvolz: you should be able to submit the patch yourself now [10:13:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: run nginx-ingress on ingress dedicated nodes [puppet] - 10https://gerrit.wikimedia.org/r/626133 (https://phabricator.wikimedia.org/T250172) (owner: 10Arturo Borrero Gonzalez) [10:14:33] (03CR) 10Mvolz: [C: 03+2] Update citoid to 2020-09-08-122926-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/626138 (https://phabricator.wikimedia.org/T248571) (owner: 10Mvolz) [10:15:48] (03Merged) 10jenkins-bot: Update citoid to 2020-09-08-122926-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/626138 (https://phabricator.wikimedia.org/T248571) (owner: 10Mvolz) [10:21:01] (03CR) 10Jbond: "lgtm to minor nit & question" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [10:21:09] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:21:17] !log mvolz@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' 
. [10:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:37] 10Operations, 10observability: Enable CAS authentication for Grafana - https://phabricator.wikimedia.org/T262512 (10jbond) > https://grafana.com/docs/grafana/latest/auth/auth-proxy/ > > We export the CN and UID of the user logging in in CAS (this is already used for Superset), so depending on what Grafana uses... [10:28:19] !log move VRRP master to cr2-esams [10:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:00] !log mvolz@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [10:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:25] 10Operations, 10netops: cr3-esams linecard diversity issue - https://phabricator.wikimedia.org/T262524 (10ayounsi) p:05Triage→03Medium [10:34:51] !log mvolz@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[10:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:29] 10Operations, 10netops: cr3-esams linecard diversity issue - https://phabricator.wikimedia.org/T262524 (10ayounsi) [10:37:23] (03CR) 10Hashar: git: allow multiple calls to git::systemconfig (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [10:37:30] (03PS5) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [10:38:33] (03CR) 10jerkins-bot: [V: 04-1] netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [10:41:42] (03PS6) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [10:42:55] !log daniel@mwmaint2001:~$ mwscript maintenance/findBadBlobs.php jvwiki --revisions 214173 --mark T262457 [10:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:02] T262457: BlobAccessException: Failed to load data blob from tt:209199: Bad data in text row 209199 - https://phabricator.wikimedia.org/T262457 [10:45:57] 10Operations, 10Wikimedia-Mailing-lists: Create functionaries-ru Mailing list. - https://phabricator.wikimedia.org/T262525 (10Carn) [10:54:37] 10Operations, 10observability: Enable CAS authentication for Grafana - https://phabricator.wikimedia.org/T262512 (10MoritzMuehlenhoff) p:05Triage→03Medium [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European mid-day backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200910T1100). 
[11:00:04] matthiasmullie: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:09] o/ [11:01:02] matthiasmullie: wanna self-service, or should I deploy? [11:01:32] that's ok, I'll take care of it myself :) [11:01:46] (03CR) 10Matthias Mullie: [C: 03+2] Enable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626256 (owner: 10Matthias Mullie) [11:02:18] Okay, cool :) [11:02:42] Btw, why do you have config in code? Wouldn't it make sense to use configuration variables? [11:04:42] (03Merged) 10jenkins-bot: Enable MediaSearch A/B test [extensions/WikimediaEvents] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626256 (owner: 10Matthias Mullie) [11:05:59] Urbanecm: I don't know - I know nothing about that extension; simply followed instructions and didn't ask questions :D [11:06:28] Okay, thanks :) [11:10:03] (03CR) 10JMeybohm: [C: 04-1] cxserver: enable the service proxy everywhere (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/626110 (https://phabricator.wikimedia.org/T255879) (owner: 10Giuseppe Lavagetto) [11:10:46] (03PS7) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [11:11:09] !log mlitn@deploy1001 Synchronized php-1.36.0-wmf.8//extensions/WikimediaEvents/: WikimediaEvents: Enable MediaSearch A/B test (duration: 01m 06s) [11:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:01] !log uploaded git 2.20.1-2+deb10u3~wmf1 to stretch-wikimedia/main T262244 [11:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:06] T262244: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 [11:13:27] !log Euro B&C done [11:13:30] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:59] (03CR) 10Effie Mouzeli: Enable OpenAPI spec on push-notifications service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/625832 (https://phabricator.wikimedia.org/T261635) (owner: 10Jgiannelos) [11:34:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1101:3317, db1101:3318', diff saved to https://phabricator.wikimedia.org/P12556 and previous config saved to /var/cache/conftool/dbconfig/20200910-113426-marostegui.json [11:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1106', diff saved to https://phabricator.wikimedia.org/P12557 and previous config saved to /var/cache/conftool/dbconfig/20200910-113758-marostegui.json [11:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:35] (03PS2) 10Volans: sre.ganeti.makevm: adapt to Netbox DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/623545 (https://phabricator.wikimedia.org/T258729) [11:41:02] (03CR) 10Volans: "Updated to support the transition period that requires a manual patch." [cookbooks] - 10https://gerrit.wikimedia.org/r/623545 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [11:41:54] (03PS8) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [11:49:45] (03PS4) 10JMeybohm: Remove etcd100[123] hosts [puppet] - 10https://gerrit.wikimedia.org/r/626274 (https://phabricator.wikimedia.org/T239835) [11:54:15] there was obviously some deployment of new stylesheets (idk whether core or extension) recently (like about last 12ish hours) which significantly changed or even broke displaying of various things, could that be reverted for now until fixed please? [11:55:26] what is broken? 
[11:55:57] hr's are not shown [11:56:30] because the original rule with height 1 is overridden by the new rule with height 0 [11:57:53] Danny_B: known issue yes [11:58:02] red links are shown in #d33 (a lighter, more "marker" color) instead of the original (actually again overridden #ba0000) [11:58:19] for
the task is https://phabricator.wikimedia.org/T262507 [11:58:24] (03PS9) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [11:58:35] and I guess the issue with redlinks is similar [11:58:44] some stylesheets are not loaded in the proper order [11:58:48] if something new breaks the current behavior, it should be promptly reverted until fixed [11:59:55] 10Puppet, 10DBA, 10SRE-tools, 10conftool, and 2 others: Alerting spam and wrong state of primary dc source info on databases while switching dc from eqiad -> codfw - https://phabricator.wikimedia.org/T261767 (10Marostegui) @RLazarus what do you want to do with this task? is this something that needs fixing... [12:01:25] !log upgrading deployment servers to git 2.20 T262244 [12:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:31] T262244: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 [12:01:37] ^ hashar [12:01:52] \o/ [12:04:14] (03CR) 10Jbond: "I have done a bit of a refresh of this CR including getting the script to work on my local machine" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [12:10:40] (03CR) 10Jgiannelos: Enable OpenAPI spec on push-notifications service (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/625832 (https://phabricator.wikimedia.org/T261635) (owner: 10Jgiannelos) [12:10:49] (03Abandoned) 10Jgiannelos: Enable OpenAPI spec on push-notifications service [deployment-charts] - 10https://gerrit.wikimedia.org/r/625832 (https://phabricator.wikimedia.org/T261635) (owner: 10Jgiannelos) [12:13:49] (03CR) 10Jbond: git: allow multiple calls to git::systemconfig (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [12:17:15] 10Operations, 10DBA, 
10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [12:19:45] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) s6 eqiad progress: [] db1085.eqiad.wmnet [] db1088.eqiad.wmnet [] db1093.eqiad.wmnet [] db1096.eqiad.wmnet [] db1098.eqiad.wmnet [] db111... [12:20:18] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [12:20:53] (03CR) 10Hashar: "Thanks for the review and explanations! Next patchset adds a type to $settings parameter and remove the tag=>'gitconfig'." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [12:21:08] (03PS4) 10Hashar: git: allow multiple calls to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) [12:25:35] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: introduce bonding+trunking [puppet] - 10https://gerrit.wikimedia.org/r/626363 (https://phabricator.wikimedia.org/T261724) [12:27:40] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: introduce bonding+trunking [puppet] - 10https://gerrit.wikimedia.org/r/626363 (https://phabricator.wikimedia.org/T261724) [12:30:31] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: introduce bonding+trunking [puppet] - 10https://gerrit.wikimedia.org/r/626363 (https://phabricator.wikimedia.org/T261724) [12:33:43] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: introduce bonding+trunking [puppet] - 10https://gerrit.wikimedia.org/r/626363 (https://phabricator.wikimedia.org/T261724) [12:33:51] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 26856016 and 1 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:44] !log installing firejail updates on maps/thumbor/restbase [12:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:45] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 286768 and 88 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:36:47] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: introduce bonding+trunking [puppet] - 10https://gerrit.wikimedia.org/r/626363 (https://phabricator.wikimedia.org/T261724) [12:38:04] (03PS6) 10Arturo Borrero Gonzalez: cloudgw: introduce bonding+trunking [puppet] - 10https://gerrit.wikimedia.org/r/626363 (https://phabricator.wikimedia.org/T261724) [12:39:01] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/25020/labtestvirt2003.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/626363 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [12:41:22] (03CR) 10Jbond: "LGTM will merge" [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [12:41:32] (03CR) 10Jbond: [C: 03+2] git: allow multiple calls to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [12:43:51] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:45:45] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:47:43] (03PS1) 10Klausman: modules/admin/files/home/klausman: add useful dotfiles 
[puppet] - 10https://gerrit.wikimedia.org/r/626367 [12:49:37] (03PS1) 10Jbond: Revert "git: allow multiple calls to git::systemconfig" [puppet] - 10https://gerrit.wikimedia.org/r/626258 [12:50:08] (03PS1) 10Ostrzyciel: EditPage: Fix member call on boolean when undo is impossible [core] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626259 (https://phabricator.wikimedia.org/T262463) [12:51:16] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "git: allow multiple calls to git::systemconfig" [puppet] - 10https://gerrit.wikimedia.org/r/626258 (owner: 10Jbond) [12:52:51] (03PS3) 10Jcrespo: mariadb-backups: Add db2141 to the dbstore role for backup source [puppet] - 10https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) [12:52:55] (03PS2) 10Klausman: modules/admin/files/home/klausman: add useful dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/626367 [12:53:22] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Add db2141 to the dbstore role for backup source [puppet] - 10https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [12:55:01] (03CR) 10Klausman: [C: 03+2] modules/admin/files/home/klausman: add useful dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/626367 (owner: 10Klausman) [12:55:04] (03PS1) 10Jbond: git: allow multiple calls to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626260 (https://phabricator.wikimedia.org/T262244) [12:56:12] (03CR) 10jerkins-bot: [V: 04-1] git: allow multiple calls to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626260 (https://phabricator.wikimedia.org/T262244) (owner: 10Jbond) [12:57:17] (03PS4) 10Jcrespo: mariadb-backups: Add db2141 to the dbstore role for backup source [puppet] - 10https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) [12:57:52] !log Ran puppet-merge to get my dotfiles from https://gerrit.wikimedia.org/r/c/operations/puppet/+/626367 out [12:57:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:53] klausman: thanks for the transparency, but FYI there is no need to ! log every time you puppet-merge ;) [12:59:03] Ok fine :-P [13:00:04] longma and liw: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American+European Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200910T1300). [13:00:34] (03PS1) 10Klausman: home/klausman: delete some trailing WS [puppet] - 10https://gerrit.wikimedia.org/r/626369 [13:03:47] (03PS5) 10Jcrespo: mariadb-backups: Add db2141 to the dbstore role for backup source [puppet] - 10https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) [13:03:58] (03PS1) 10Klausman: admin: Remove some comments and trailing ws in dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/626371 [13:04:18] (03CR) 10jerkins-bot: [V: 04-1] mariadb-backups: Add db2141 to the dbstore role for backup source [puppet] - 10https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [13:04:21] (03Abandoned) 10Klausman: home/klausman: delete some trailing WS [puppet] - 10https://gerrit.wikimedia.org/r/626369 (owner: 10Klausman) [13:07:08] (03CR) 10Klausman: [C: 03+2] admin: Remove some comments and trailing ws in dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/626371 (owner: 10Klausman) [13:07:17] (03PS6) 10Jbond: mariadb-backups: Add db2141 to the dbstore role for backup source [puppet] - 10https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [13:07:50] I iz teh gud with gti [13:10:57] !log delete lldwiki_{content|general} indices from search.svc.{eqiad|codfw}.wmnet:9643 (psi), they should be on 9443 (omega) [13:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:08] (03PS7) 10Jcrespo: mariadb-backups: Add db2141 to the dbstore role for backup source [puppet] - 
10https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) [13:12:58] (03PS1) 10Klausman: admin: fix spurious backslash in my bashrc [puppet] - 10https://gerrit.wikimedia.org/r/626373 [13:15:55] (03CR) 10Klausman: [C: 03+2] admin: fix spurious backslash in my bashrc [puppet] - 10https://gerrit.wikimedia.org/r/626373 (owner: 10Klausman) [13:16:46] (03PS1) 10Volans: CHANGELOG: add changelogs for release v4.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/626374 [13:17:34] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [13:18:49] (03CR) 10jerkins-bot: [V: 04-1] CHANGELOG: add changelogs for release v4.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/626374 (owner: 10Volans) [13:19:23] (03PS1) 10Ottomata: Default to using API json formatversion=2 [extensions/EventStreamConfig] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626375 (https://phabricator.wikimedia.org/T251609) [13:20:10] (03PS2) 10Jbond: git: allow multiple calls to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626260 (https://phabricator.wikimedia.org/T262244) [13:20:20] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) After our IRC chat, this is scheduled for Monday 14th [13:20:58] (03CR) 10Jbond: [C: 04-1] "@hashar seems that gitconfig include doesn't support globs" [puppet] - 10https://gerrit.wikimedia.org/r/626260 (https://phabricator.wikimedia.org/T262244) (owner: 10Jbond) [13:24:14] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [13:24:29] !log installing rake security updates on stretch [13:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:29] (03PS8) 10Jcrespo: mariadb-backups: Add db2141 to the dbstore 
role for backup source [puppet] - 10https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) [13:25:45] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (10Papaul) 05Open→03Resolved [13:26:58] (03PS9) 10Jcrespo: mariadb-backups: Add db2141 to the dbstore role for backup source [puppet] - 10https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) [13:27:25] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [13:27:42] (03PS1) 10Volans: tests: fix newly failing tests [software/cumin] - 10https://gerrit.wikimedia.org/r/626377 [13:27:50] 10Operations, 10ops-codfw, 10serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (10Papaul) a:03Papaul [13:32:55] (03CR) 10Volans: [C: 03+2] tests: fix newly failing tests [software/cumin] - 10https://gerrit.wikimedia.org/r/626377 (owner: 10Volans) [13:34:10] (03PS22) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [13:34:58] (03Merged) 10jenkins-bot: tests: fix newly failing tests [software/cumin] - 10https://gerrit.wikimedia.org/r/626377 (owner: 10Volans) [13:35:37] (03PS2) 10Volans: CHANGELOG: add changelogs for release v4.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/626374 [13:36:05] (03PS23) 10Ottomata: Canary events refinery job [puppet] - 10https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) [13:39:52] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v4.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/626374 (owner: 10Volans) [13:41:39] (03PS1) 10Elukey: Remove some "Spicerack" references from the doc strings [software/pywmflib] - 
10https://gerrit.wikimedia.org/r/626378 (https://phabricator.wikimedia.org/T257905) [13:41:50] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v4.0.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/626374 (owner: 10Volans) [13:42:00] (03PS1) 10Holger Knust: mathoid: Update the version number to latest for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/626379 [13:42:06] !log rebooting etherpad1002 (etherpad.wikimedia.org) for kernel update [13:42:08] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] mathoid: Update the version number to latest for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/626379 (owner: 10Holger Knust) [13:43:15] (03CR) 10Holger Knust: [C: 03+2] mathoid: Update the version number to latest for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/626379 (owner: 10Holger Knust) [13:43:17] (03PS1) 10Elukey: Add basic debian packaging [software/pywmflib] - 10https://gerrit.wikimedia.org/r/626380 (https://phabricator.wikimedia.org/T257905) [13:44:18] (03CR) 10Volans: [C: 03+1] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/626378 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [13:44:21] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:44:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:32] (03Merged) 10jenkins-bot: mathoid: Update the version number to latest for deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/626379 (owner: 10Holger Knust) [13:45:06] (03CR) 10Elukey: [C: 03+2] Remove some "Spicerack" references from the doc strings [software/pywmflib] - 10https://gerrit.wikimedia.org/r/626378 
(https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [13:48:08] (03PS1) 10Volans: Upstream release v4.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/626382 [13:48:55] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [13:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:24] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: introduce bonding+trunking [puppet] - 10https://gerrit.wikimedia.org/r/626363 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [13:51:28] (03CR) 10Volans: [C: 03+2] Upstream release v4.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/626382 (owner: 10Volans) [13:52:42] 4.0?? :P [13:52:43] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' . [13:52:46] err :O [13:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:37] (03Merged) 10jenkins-bot: Upstream release v4.0.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/626382 (owner: 10Volans) [13:54:00] (03PS7) 10Elukey: Add basic Debian packaging [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) [13:54:37] (03CR) 10Elukey: "Thanks!" (033 comments) [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [13:55:22] elukey: ? [13:56:39] I was impressed by the 4.0 version :) [13:56:47] nothing changed :) [13:56:55] we're running the 4.0.0rc1 since like 3 months :D [13:57:10] I decided maybe was time to promote it :-P [13:57:15] volans: then I had no idea that we were at 4.x :) [13:58:04] lol [13:58:11] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' . 
[13:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:32] hashar: btw https://doc.wikimedia.org/cumin/master/ still got 0.1.dev2 as version (was just built after a new commit) [14:04:48] !log uploaded cumin_4.0.0 to apt.wikimedia.org buster-wikimedia (no code changes) [14:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:56] 10Operations, 10Analytics-Radar, 10Patch-For-Review: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 (10elukey) My pull requests were merged, but I opened https://github.com/cloudera/hue/issues/1262 too. [14:06:02] (03PS1) 10Klausman: prometheus: Add more stats to AMD ROCm GPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/626386 (https://phabricator.wikimedia.org/T262427) [14:06:32] (03PS1) 10Papaul: DNS: Add production DNS for es2027-es2034 [dns] - 10https://gerrit.wikimedia.org/r/626387 [14:08:04] (03CR) 10Papaul: [C: 03+2] DNS: Add production DNS for es2027-es2034 [dns] - 10https://gerrit.wikimedia.org/r/626387 (owner: 10Papaul) [14:10:09] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [14:11:31] (03CR) 10Hashar: [C: 04-1] "The parent change ended up being reverted cause it did not work as intended." 
[puppet] - 10https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [14:11:40] (03CR) 10LSobanski: [C: 03+2] mariadb-backups: Add db2141 to the dbstore role for backup source [puppet] - 10https://gerrit.wikimedia.org/r/626339 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [14:12:27] (03CR) 10Elukey: [C: 03+1] prometheus: Add more stats to AMD ROCm GPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/626386 (https://phabricator.wikimedia.org/T262427) (owner: 10Klausman) [14:15:06] (03CR) 10Klausman: [C: 03+2] prometheus: Add more stats to AMD ROCm GPU exporter [puppet] - 10https://gerrit.wikimedia.org/r/626386 (https://phabricator.wikimedia.org/T262427) (owner: 10Klausman) [14:19:56] jynus: Ok to puppet-merge your change for db2141.yaml and site.pp? [14:20:04] yes [14:20:09] I was on cmd line [14:20:09] alrighty [14:20:21] ah, you have the lock. go ahead [14:20:34] ah, then I do it [14:21:01] (yay race conditions :)) [14:21:14] all done, no errors [14:22:26] Gracias! 
[14:33:08] (03PS2) 10Filippo Giunchedi: base: add remote syslog queues [puppet] - 10https://gerrit.wikimedia.org/r/626186 (https://phabricator.wikimedia.org/T226703) [14:35:42] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/25021/prometheus1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/626186 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [14:37:10] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Thank-You-Page, and 2 others: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10Pcoombe) 05Open→03Resolved [14:39:25] (03PS4) 10Vgutierrez: varnishkafka 1.0.15 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/626177 (https://phabricator.wikimedia.org/T261632) [14:39:57] (03CR) 10jerkins-bot: [V: 04-1] varnishkafka 1.0.15 [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/626177 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [14:40:06] 10Operations, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (10Vgutierrez) [14:40:23] 10Operations, 10MassMessage, 10MediaWiki-JobQueue: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Samwalton9) This happened again today with my delivery of Books & Bytes. The initial delivery (starting at https://en.wikipedia.org/w/index.php?title=Special:Contribu... 
[14:44:07] (03PS1) 10Volans: cli: add a -n/--no-colors option [software/cumin] - 10https://gerrit.wikimedia.org/r/626389 (https://phabricator.wikimedia.org/T212783) [14:44:09] (03PS1) 10Volans: cli: in dry-run send the list of hosts to stdout [software/cumin] - 10https://gerrit.wikimedia.org/r/626390 (https://phabricator.wikimedia.org/T212783) [14:45:06] kormat: ^^^ :-P (with the recent changes released with v4 it was easy :-P) [14:45:36] (03CR) 10Herron: [C: 03+1] "LGTM!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626186 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [14:48:59] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10fgiunchedi) >>! In T262182#6449311, @wiki_willy wrote: > Hi @Papaul - when I look at Netbox, it shows that ms-be2019 was purchased 5yrs ago. > > https://netbox.wikimedia.org/dcim/devices/240/ > > @fgiunched... [14:49:48] (03PS1) 10Herron: kibana: enable logging by default [puppet] - 10https://gerrit.wikimedia.org/r/626391 [14:49:51] 10Operations, 10observability: Logstash-next fails to load properly. - https://phabricator.wikimedia.org/T262492 (10colewhite) 05Open→03Resolved a:03colewhite It does not appear to be reproducible today. Will reopen if it comes back. [14:53:09] (03CR) 10Volans: "Adding in CC those that asked for it first back in the first cumin release :-P" [software/cumin] - 10https://gerrit.wikimedia.org/r/626389 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [14:53:12] 10Operations, 10Puppet, 10observability, 10Patch-For-Review, and 2 others: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10fgiunchedi) [14:53:37] 10Operations, 10Puppet, 10observability, 10Patch-For-Review, and 2 others: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10fgiunchedi) Adding #observability for tracking/visibility, +1 on the use cases! 
[14:54:55] (03PS1) 10Herron: logstash: increase jvm heap memory to 2g [puppet] - 10https://gerrit.wikimedia.org/r/626393 [14:57:12] 10Operations, 10Puppet, 10observability, 10Patch-For-Review, and 2 others: Puppet: get row/rack info from Netbox - https://phabricator.wikimedia.org/T229397 (10crusnov) [15:00:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:03:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:07:28] (03CR) 10Cwhite: [C: 03+1] "LGTM! Will be helpful!" [puppet] - 10https://gerrit.wikimedia.org/r/626391 (owner: 10Herron) [15:09:36] (03PS3) 10Cwhite: mediawiki: update alerts on logstash logs [puppet] - 10https://gerrit.wikimedia.org/r/625982 (https://phabricator.wikimedia.org/T256418) [15:10:03] (03CR) 10Cwhite: mediawiki: update alerts on logstash logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625982 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [15:10:24] (03CR) 10Cwhite: [C: 03+1] logstash: increase jvm heap memory to 2g [puppet] - 10https://gerrit.wikimedia.org/r/626393 (owner: 10Herron) [15:16:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one nit inline" (031 comment) [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [15:17:57] (03CR) 10Cwhite: [C: 03+1] "haven't tested it, but if it does what it says on the tin, SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/623966 (https://phabricator.wikimedia.org/T261633) (owner: 10Filippo 
Giunchedi) [15:17:59] (03PS1) 10Ppchelko: Api-gateway: various improvements [deployment-charts] - 10https://gerrit.wikimedia.org/r/626395 [15:21:59] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10observability, 10serviceops: illegal_argument_exception - https://phabricator.wikimedia.org/T262429 (10fgiunchedi) Translated in english, the message `[detail] is defined as an object in mapping [fcm_send_failed] but this... [15:27:17] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Proton, 10Wikimedia-Logstash, and 3 others: Move proton logging to new logging pipeline - https://phabricator.wikimedia.org/T219925 (10fgiunchedi) 05Stalled→03Resolved a:03fgiunchedi Thank you for the update @Mholloway ! That indeed clears up thi... [15:27:22] 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, 10Patch-For-Review: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10fgiunchedi) [15:29:15] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/625982 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [15:29:24] (03PS3) 10Muehlenhoff: Add Cumin alias for analytics-launcher [puppet] - 10https://gerrit.wikimedia.org/r/626271 [15:29:38] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: increase jvm heap memory to 2g [puppet] - 10https://gerrit.wikimedia.org/r/626393 (owner: 10Herron) [15:29:44] (03CR) 10Filippo Giunchedi: [C: 03+1] kibana: enable logging by default [puppet] - 10https://gerrit.wikimedia.org/r/626391 (owner: 10Herron) [15:30:31] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:47] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:58] (03PS2) 10Andrew Bogott: Openstack Module: remove 
instance_info_dumper.pp [puppet] - 10https://gerrit.wikimedia.org/r/626220 [15:40:00] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: add support for .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/626398 [15:40:02] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: ignore a couple of cursed service records [puppet] - 10https://gerrit.wikimedia.org/r/626399 [15:40:04] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: make into a python3 script [puppet] - 10https://gerrit.wikimedia.org/r/626400 [15:40:06] (03PS1) 10Andrew Bogott: toolforge_canary_list.txt: use new .eqiad1.wikimedia.cloud names [puppet] - 10https://gerrit.wikimedia.org/r/626401 (https://phabricator.wikimedia.org/T260614) [15:40:33] (03PS1) 10Filippo Giunchedi: switchdc: include swift and thanos [cookbooks] - 10https://gerrit.wikimedia.org/r/626403 [15:41:03] (03CR) 10Filippo Giunchedi: "As per AIs in https://wikitech.wikimedia.org/wiki/Incident_documentation/20200901-data-center-switchover" [cookbooks] - 10https://gerrit.wikimedia.org/r/626403 (owner: 10Filippo Giunchedi) [15:41:35] (03PS2) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: add support for .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/626398 [15:41:37] (03PS2) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: ignore a couple of cursed service records [puppet] - 10https://gerrit.wikimedia.org/r/626399 [15:41:39] (03PS2) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: make into a python3 script [puppet] - 10https://gerrit.wikimedia.org/r/626400 [15:41:41] (03PS3) 10Andrew Bogott: Openstack Module: remove instance_info_dumper.pp [puppet] - 10https://gerrit.wikimedia.org/r/626220 [15:41:43] (03PS2) 10Andrew Bogott: toolforge_canary_list.txt: use new .eqiad1.wikimedia.cloud names [puppet] - 10https://gerrit.wikimedia.org/r/626401 (https://phabricator.wikimedia.org/T260614) [15:41:46] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [15:43:02] (03CR) 
10Arturo Borrero Gonzalez: [C: 03+1] "The proper fix is to change the record name. That's a major operation, but we have to do it!" [puppet] - 10https://gerrit.wikimedia.org/r/626399 (owner: 10Andrew Bogott) [15:44:20] (03PS1) 10Papaul: DHCP: Add MAC address for es2027-es2034 [puppet] - 10https://gerrit.wikimedia.org/r/626404 (https://phabricator.wikimedia.org/T260373) [15:46:10] (03PS3) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: ignore a couple of cursed service records [puppet] - 10https://gerrit.wikimedia.org/r/626399 [15:46:12] (03PS3) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: make into a python3 script [puppet] - 10https://gerrit.wikimedia.org/r/626400 [15:46:14] (03PS4) 10Andrew Bogott: Openstack Module: remove instance_info_dumper.pp [puppet] - 10https://gerrit.wikimedia.org/r/626220 [15:46:16] (03PS3) 10Andrew Bogott: toolforge_canary_list.txt: use new .eqiad1.wikimedia.cloud names [puppet] - 10https://gerrit.wikimedia.org/r/626401 (https://phabricator.wikimedia.org/T260614) [15:48:34] (03PS1) 10Cwhite: raid, smart: bypass facter timeout by calling facter script directly [puppet] - 10https://gerrit.wikimedia.org/r/626405 (https://phabricator.wikimedia.org/T251293) [15:49:11] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for analytics-launcher [puppet] - 10https://gerrit.wikimedia.org/r/626271 (owner: 10Muehlenhoff) [15:49:28] (03CR) 10jerkins-bot: [V: 04-1] raid, smart: bypass facter timeout by calling facter script directly [puppet] - 10https://gerrit.wikimedia.org/r/626405 (https://phabricator.wikimedia.org/T251293) (owner: 10Cwhite) [15:51:10] (03PS2) 10Cwhite: raid, smart: bypass facter timeout by calling facter script directly [puppet] - 10https://gerrit.wikimedia.org/r/626405 (https://phabricator.wikimedia.org/T251293) [15:51:57] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for es2027-es2034 [puppet] - 10https://gerrit.wikimedia.org/r/626404 (https://phabricator.wikimedia.org/T260373) (owner: 10Papaul) [15:52:01] (03CR) 10jerkins-bot: [V: 
04-1] raid, smart: bypass facter timeout by calling facter script directly [puppet] - 10https://gerrit.wikimedia.org/r/626405 (https://phabricator.wikimedia.org/T251293) (owner: 10Cwhite) [15:56:43] (03PS26) 10CRusnov: customscripts/interface_automation.py: Add Interface and IP Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [15:57:49] 10Operations, 10Analytics-Radar, 10Traffic, 10Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (10Vgutierrez) @ema I've added to the task description the CRs required to get the packages of all the vmods and varnishkafka, I've seen that we have varnish-modules compiled on... [16:00:04] jbond42 and cdanis: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200910T1600). [16:01:15] (03PS1) 10Muehlenhoff: Install git instead of git-core [puppet] - 10https://gerrit.wikimedia.org/r/626406 [16:01:58] (03PS3) 10Cwhite: raid, smart: bypass facter timeout by calling facter script directly [puppet] - 10https://gerrit.wikimedia.org/r/626405 (https://phabricator.wikimedia.org/T251293) [16:02:09] (03CR) 10CRusnov: [C: 03+1] "LGTM!" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/624087 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [16:04:24] !log reprepro: uploaded gdnsd-3.3.0-1~wmf1 - T261340 [16:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:31] T261340: 'skip_first' feature flag for gdnsd GeoIP plugin - https://phabricator.wikimedia.org/T261340 [16:04:39] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` es2027.codfw.wmnet ` The... 
[16:06:16] !log dns4001 - upgrade gdnsd to 3.3.0-1~wmf1 [16:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:30] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:17:46] 10Operations, 10observability, 10Patch-For-Review: Evaluate/integrate rasdaemon as a replacement for mcelog - https://phabricator.wikimedia.org/T205396 (10JMeybohm) We would like to roll out Kernel 4.19 on some stretch hosts and if I got this right we will need the backported rasdaemon version rather than th... [16:17:56] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:20:33] (03PS1) 10Krinkle: resourceloader: Fix incorrect order of feature stylesheets [core] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626264 (https://phabricator.wikimedia.org/T262507) [16:21:48] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:22:33] (03CR) 10Krinkle: [C: 04-1] "To be confirmed in Beta first." 
[core] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626264 (https://phabricator.wikimedia.org/T262507) (owner: 10Krinkle) [16:23:38] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2027.codfw.wmnet'] ` Of which those **FAILED**: ` ['es2027.codfw.wmne... [16:23:40] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:25:00] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:25:06] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` es2027.codfw.wmnet ` The... 
[16:25:36] !log authdns1001 - upgrade gdnsd to 3.3.0-1~wmf1 [16:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:20] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:29:30] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:30:40] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:31:14] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:34:05] (03PS1) 10Ottomata: Release 2020.02~wmf2 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/626410 [16:34:49] (03CR) 10Ottomata: [C: 03+2] Release 2020.02~wmf2 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/626410 (owner: 10Ottomata) [16:34:51] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Release 2020.02~wmf2 [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/626410 (owner: 10Ottomata) [16:35:10] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:37:09] (03CR) 10CRusnov: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/623545 (https://phabricator.wikimedia.org/T258729) (owner: 
10Volans) [16:37:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:38:31] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10Jgreen) a:05Papaul→03Jgreen [16:41:34] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 40 [16:41:34] ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:43:28] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:45:54] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:46:23] ^ looking [16:47:48] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:48:20] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [16:48:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:58] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:50:21] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:50:26] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [16:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:52] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:51:42] (03PS3) 10Volans: scripts: allocate IPs, add Cassandra support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/624087 (https://phabricator.wikimedia.org/T258729) [16:51:53] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-dnsleaks.py: add support for .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/626398 (owner: 10Andrew Bogott) [16:52:16] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-dnsleaks.py: ignore a couple of cursed service records [puppet] - 10https://gerrit.wikimedia.org/r/626399 (owner: 10Andrew Bogott) [16:52:18] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:52:26] (03PS4) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: ignore a couple of cursed service records [puppet] - 
10https://gerrit.wikimedia.org/r/626399 [16:52:49] (03CR) 10CRusnov: [C: 03+1] "LGTM still, now with better grammar :)" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/624087 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [16:53:46] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-dnsleaks.py: make into a python3 script [puppet] - 10https://gerrit.wikimedia.org/r/626400 (owner: 10Andrew Bogott) [16:53:54] (03PS4) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: make into a python3 script [puppet] - 10https://gerrit.wikimedia.org/r/626400 [16:54:17] (03CR) 10Volans: [C: 03+2] scripts: allocate IPs, add Cassandra support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/624087 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [16:54:50] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [16:55:26] I need to get on a meeting [16:58:15] (03CR) 10Bstorm: wmcs-novastats-dnsleaks.py: ignore a couple of cursed service records (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626399 (owner: 10Andrew Bogott) [16:58:50] if anyone can have a look, please do [17:00:04] chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200910T1700). 
[17:01:19] (03PS5) 10Andrew Bogott: Openstack Module: remove instance_info_dumper.pp [puppet] - 10https://gerrit.wikimedia.org/r/626220 [17:01:21] (03PS4) 10Andrew Bogott: toolforge_canary_list.txt: use new .eqiad1.wikimedia.cloud names [puppet] - 10https://gerrit.wikimedia.org/r/626401 (https://phabricator.wikimedia.org/T260614) [17:01:23] (03PS1) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: added a comment linking a hack to a task [puppet] - 10https://gerrit.wikimedia.org/r/626447 [17:04:56] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:05:04] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:05:58] (03PS1) 10Ottomata: Install anaconda-wmf on hadoop workerrs and clients [puppet] - 10https://gerrit.wikimedia.org/r/626448 (https://phabricator.wikimedia.org/T251006) [17:06:35] (03PS2) 10Ottomata: Install anaconda-wmf on hadoop workerrs and clients [puppet] - 10https://gerrit.wikimedia.org/r/626448 (https://phabricator.wikimedia.org/T251006) [17:08:08] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:08:11] Pchelolo: can you please take a look at restbase ? 
[17:08:29] https://grafana.wikimedia.org/d/000000068/restbase?viewPanel=16&orgId=1&from=now-1h&to=now [17:08:44] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:08:52] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:10:18] effie: what's wrong? recommendation api seem to be in trouble, restbase is fine no? [17:10:24] (03CR) 10Ottomata: [C: 03+2] Install anaconda-wmf on hadoop workerrs and clients [puppet] - 10https://gerrit.wikimedia.org/r/626448 (https://phabricator.wikimedia.org/T251006) (owner: 10Ottomata) [17:10:36] Pchelolo: looking at the restbase graph, it looks related [17:10:50] I don't know much about those services, so I am not sure where to start debugging [17:10:58] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2027.codfw.wmnet'] ` and were **ALL** successful. [17:11:29] I'd look at mw api latencies, not rb [17:11:43] rb is seing problems where it's proxying to mw api [17:12:22] alright thank you! [17:13:29] (03PS2) 10Andrew Bogott: wmcs-novastats-dnsleaks.py: added a comment linking a hack to a task [puppet] - 10https://gerrit.wikimedia.org/r/626447 [17:13:31] (03PS6) 10Andrew Bogott: Openstack Module: remove instance_info_dumper.pp [puppet] - 10https://gerrit.wikimedia.org/r/626220 [17:13:33] (03PS5) 10Andrew Bogott: toolforge_canary_list.txt: use new .eqiad1.wikimedia.cloud names [puppet] - 10https://gerrit.wikimedia.org/r/626401 (https://phabricator.wikimedia.org/T260614) [17:14:14] longma, there seems to ber a lot of database lock timeouts :( [17:14:42] kormat: around? 
[17:14:46] looking [17:14:58] https://logstash.wikimedia.org/goto/44f6a581677a5b21b90f1c781983c70d [17:15:28] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-novastats-dnsleaks.py: added a comment linking a hack to a task [puppet] - 10https://gerrit.wikimedia.org/r/626447 (owner: 10Andrew Bogott) [17:16:49] I don't know if rolling back will fix this issue but I can go ahead and do a roll-back [17:17:19] longma, I don't know either if a rollback will help, might be good to ask a DBA [17:18:12] ot started at ~16:10 UTC [17:18:17] but I don't see anything on SAL [17:20:08] PROBLEM - SSH on wtp1047.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:20:46] 10Operations, 10ops-codfw, 10DC-Ops, 10Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (10RKemper) @Papaul We should be fine with **256k**. [17:24:00] (03PS1) 10Andrew Bogott: wmcs pdns recursors: add zone forwarding for .cloud lookups [puppet] - 10https://gerrit.wikimedia.org/r/626450 (https://phabricator.wikimedia.org/T260614) [17:24:04] effie, longma: hey. having a look [17:24:13] Thanks! 
[17:24:23] kormat: I am going to check logs on an api server, but I am in a meeting as well [17:24:41] it is not sure it is a DB issue [17:25:08] (03CR) 10jerkins-bot: [V: 04-1] wmcs pdns recursors: add zone forwarding for .cloud lookups [puppet] - 10https://gerrit.wikimedia.org/r/626450 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [17:25:12] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:26:12] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [17:26:14] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:26:30] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:27:16] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [17:27:30] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:27:38] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:28:38] (03PS2) 10Andrew Bogott: wmcs pdns recursors: add zone forwarding for .cloud lookups [puppet] - 10https://gerrit.wikimedia.org/r/626450 (https://phabricator.wikimedia.org/T260614) [17:28:40] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:29:29] effie: it seems to be related to the x1 db shard: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2096&var-port=9104 [17:30:02] large increase in UPDATEs starting at around 15:46 [17:30:29] does it make sense to find what triggered that?
[17:30:58] (03CR) 10Andrew Bogott: "pcc results: https://puppet-compiler.wmflabs.org/compiler1002/25024/" [puppet] - 10https://gerrit.wikimedia.org/r/626450 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [17:31:24] i think at this point we need to ping marostegui (o/) [17:31:32] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=86%): /tmp 0 MB (0% inode=86%): /var/tmp 0 MB (0% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops [17:31:41] not that guy [17:31:48] i know, he's the worst [17:32:35] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [17:33:00] I will ping him [17:33:13] !log dns servers: upgrading remainder of fleet to gdnsd-3.3.0-1~wmf1 [17:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:32] (03PS27) 10CRusnov: customscripts/interface_automation.py: Add Interface and IP Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [17:33:58] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Module: remove instance_info_dumper.pp [puppet] - 10https://gerrit.wikimedia.org/r/626220 (owner: 10Andrew Bogott) [17:34:15] the top queries against that shard are all ReadingLists. 
https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb2096&hours=1 [17:35:10] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:35:22] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [17:36:33] (03CR) 10Bstorm: wmcs pdns recursors: add zone forwarding for .cloud lookups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626450 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [17:38:08] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:38:54] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:39:00] (03CR) 10Andrew Bogott: wmcs pdns recursors: add zone forwarding for .cloud lookups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626450 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [17:40:14] (03PS3) 10Andrew Bogott: wmcs pdns recursors: add zone forwarding for .cloud lookups [puppet] - 10https://gerrit.wikimedia.org/r/626450 (https://phabricator.wikimedia.org/T260614) [17:40:32] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:40:38] (03CR) 10Dzahn: [C: 
03+2] wikistats (cloud): update SQL schema [puppet] - 10https://gerrit.wikimedia.org/r/625565 (owner: 10Dzahn) [17:41:14] (03CR) 10Dzahn: [C: 03+2] wikistats (cloud): add import cron for neoseeker wikis [puppet] - 10https://gerrit.wikimedia.org/r/625349 (https://phabricator.wikimedia.org/T262113) (owner: 10Dzahn) [17:41:24] (03PS2) 10Dzahn: wikistats (cloud): add import cron for neoseeker wikis [puppet] - 10https://gerrit.wikimedia.org/r/625349 (https://phabricator.wikimedia.org/T262113) [17:41:43] (03CR) 10Bstorm: wmcs pdns recursors: add zone forwarding for .cloud lookups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626450 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [17:43:20] (03PS4) 10Dzahn: wikistats (cloud): update SQL schema [puppet] - 10https://gerrit.wikimedia.org/r/625565 [17:45:04] (03CR) 10Andrew Bogott: [C: 03+2] wmcs pdns recursors: add zone forwarding for .cloud lookups [puppet] - 10https://gerrit.wikimedia.org/r/626450 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [17:45:32] (03CR) 10Dzahn: [V: 03+1 C: 03+2] webperf: add parameter to disable timing beacon monitoring in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/624873 (owner: 10Dzahn) [17:46:18] andrewbogott: wanna merge both? 
[17:46:26] sure [17:46:30] thx [17:47:26] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpect [17:47:26] pecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:49:16] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:49:52] (03CR) 10Dzahn: "after running puppet on webperf and then icinga1001, icinga alert in eqiad has been removed" [puppet] - 10https://gerrit.wikimedia.org/r/624873 (owner: 10Dzahn) [17:50:14] (03CR) 10Ryan Kemper: [C: 03+2] cloudelastic: we do want to use conf_tool [puppet] - 10https://gerrit.wikimedia.org/r/624231 (https://phabricator.wikimedia.org/T261373) (owner: 10Ryan Kemper) [17:50:40] (03CR) 10Dave Pifke: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/624873 (owner: 10Dzahn) [17:52:43] (03CR) 10Dzahn: "nice! 
one comment though: since only the require line actually changes, can we just set that to a variable and not repeat the rest of the " [puppet] - 10https://gerrit.wikimedia.org/r/626035 (owner: 10Paladox) [17:56:00] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpect [17:56:00] pecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:57:02] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:57:12] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:58:14] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:59:18] (03CR) 10Dzahn: [C: 03+2] profile::backup: remove helium from ferm directors [puppet] - 10https://gerrit.wikimedia.org/r/621042 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [17:59:24] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200910T1800). Please do the needful. [18:00:04] Ostrzyciel and ottomata: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:17] o/ [18:00:19] !log helium (former backup host) is being removed from ferm rules on all hosts, it was replaced by backup1001 (T260717) [18:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:25] T260717: decom helium and heze - https://phabricator.wikimedia.org/T260717 [18:00:32] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:00:32] (03CR) 10Bstorm: "So most of this doesn't change the replicas that currently exist, rather it sets up some things for the new ones. However, if I merge it, " [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [18:00:34] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:01:29] Ostrzyciel: ottomata: I can deploy, unless one of you prefers to self-deploy? 
:) [18:01:56] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:02:04] I'm new to the procedure and I'll probably break something :P [18:02:33] Ostrzyciel: heh, if you're not a deployer (shell access to production), you won't be able to do it anyway :-) [18:02:48] Ostrzyciel: i've sync-filed often, never regular sync [18:02:52] please proceed if you don't mind! [18:03:01] riiight, that's what I guessed [18:03:07] I'm a volunteer. :) [18:03:07] sure [18:03:08] i mean: Urbanecm [18:03:31] ottomata: sure :) [18:03:38] (03CR) 10Urbanecm: [C: 03+2] EditPage: Fix member call on boolean when undo is impossible [core] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626259 (https://phabricator.wikimedia.org/T262463) (owner: 10Ostrzyciel) [18:04:04] (03CR) 10Urbanecm: [C: 03+2] Default to using API json formatversion=2 [extensions/EventStreamConfig] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626375 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:05:07] (03PS1) 10Andrew Bogott: clush toolforge_canary_list: update! [puppet] - 10https://gerrit.wikimedia.org/r/626456 [18:05:07] Ostrzyciel: could you please install the mwdebug extension (https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_extensions), so you can test the patch once it's ready? [18:05:09] (03PS1) 10Andrew Bogott: toolforge_canary_list.txt: use new .eqiad1.wikimedia.cloud names [puppet] - 10https://gerrit.wikimedia.org/r/626457 (https://phabricator.wikimedia.org/T260614) [18:05:22] sure [18:05:32] Ostrzyciel: and, I see that group0 is now at wmf.6. Is it intentional to backport only to wmf.8? [18:06:08] ottomata: same for you (the wmf.6 q), actually :-) [18:06:39] Urbanecm: yeah i see train is blocked....fortunately this API is really only used from metawiki [18:06:41] which is group1 [18:06:44] Urbanecm: oh, that's because of a missed train...
I guess not, we can deploy this to previous versions too [18:06:46] so wmf.8 is fine for me :) [18:07:03] my patch is useful to group2, mostly [18:07:05] (03Abandoned) 10Andrew Bogott: toolforge_canary_list.txt: use new .eqiad1.wikimedia.cloud names [puppet] - 10https://gerrit.wikimedia.org/r/626401 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [18:07:06] ottomata: okay, just checking if it's not forgotten :-) [18:07:11] ty [18:07:22] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` es2028.codfw.wmnet ` The... [18:07:24] Where are you seeing group0 at wmf.6? [18:07:26] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:07:53] longma: sorry, I meant *group2* [18:07:58] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:07:59] ah okay :) [18:08:14] Ostrzyciel: I'll upload a .6 backport too then [18:08:23] okay [18:08:50] (03PS1) 10Urbanecm: EditPage: Fix member call on boolean when undo is impossible [core] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/626429 (https://phabricator.wikimedia.org/T262463) [18:08:58] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: 
action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [18:08:58] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:09:29] (03CR) 10Urbanecm: [C: 03+2] EditPage: Fix member call on boolean when undo is impossible [core] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/626429 (https://phabricator.wikimedia.org/T262463) (owner: 10Urbanecm) [18:09:40] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [18:11:02] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:11:12] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:11:28] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [18:12:06] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:13:37] (03PS3) 10Paladox: Fix "Could not find resource 'User[deploy-devtools]'" [puppet] - 10https://gerrit.wikimedia.org/r/626035 [18:13:46] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:14:34] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:14:46] (03CR) 10jerkins-bot: [V: 04-1] Fix "Could not find resource 'User[deploy-devtools]'" [puppet] - 10https://gerrit.wikimedia.org/r/626035 (owner: 10Paladox) [18:16:58] (03PS4) 10Paladox: Fix "Could not find resource 'User[deploy-devtools]'" [puppet] - 10https://gerrit.wikimedia.org/r/626035 [18:19:06] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:20:04] RECOVERY - SSH on wtp1047.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:21:31] (03PS1) 10Urbanecm: Add throttle rule for Czech senior citizens course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626459 (https://phabricator.wikimedia.org/T262415) [18:21:33] mea culpa: i believe the elevated DB pressure / MW error rate is from a maintenance script i was running for T259740 [18:21:34] T259740: ReadingListRepository::deleteListEntryQuery: BIGINT UNSIGNED value is out of range in '`wikishared`.`reading_list`.`rl_size` - 1' - https://phabricator.wikimedia.org/T259740 [18:21:59] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:22:02] (03CR) 10Urbanecm: [C: 03+2] Add throttle rule for Czech senior citizens course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626459 (https://phabricator.wikimedia.org/T262415) (owner: 10Urbanecm) [18:22:47] (03Merged) 10jenkins-bot: Add throttle rule for Czech senior citizens course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626459 (https://phabricator.wikimedia.org/T262415) (owner: 10Urbanecm) [18:22:51] (03CR) 10Krinkle: "Confirmed in beta." 
[core] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626264 (https://phabricator.wikimedia.org/T262507) (owner: 10Krinkle) [18:23:05] marostegui: kormat: jynus: fyi mdholloway's message a few lines above [18:23:08] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:24:00] mdholloway: did you just stop it? [18:24:22] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [18:24:23] it exited. [18:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:45] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: 0cde0b15fc1daca2cef904bc7add7e9a1c58e3c9: Add throttle rule for Czech senior citizens course (T262415) (duration: 01m 05s) [18:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:51] T262415: Throttle rule for 2020-09-14 - Senior Citizens Write Wikipedia course - https://phabricator.wikimedia.org/T262415 [18:24:53] but yes, just finished [18:25:25] (03PS5) 10Paladox: Fix "Could not find resource 'User[deploy-devtools]'" [puppet] - 10https://gerrit.wikimedia.org/r/626035 [18:25:35] (03PS6) 10Paladox: Fix "Could not find resource 'User[deploy-devtools]'" [puppet] - 10https://gerrit.wikimedia.org/r/626035 [18:25:58] i'll create a task to fix or remove it. [18:26:14] (03Merged) 10jenkins-bot: EditPage: Fix member call on boolean when undo is impossible [core] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626259 (https://phabricator.wikimedia.org/T262463) (owner: 10Ostrzyciel) [18:26:26] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:46] (03PS7) 10Paladox: Fix "Could not find resource 'User[deploy-devtools]'" [puppet] - 10https://gerrit.wikimedia.org/r/626035 [18:26:51] mdholloway: can you add me to that task?
there are a few things that need urgent fixing [18:27:02] Ostrzyciel: would you be able to test it at a group0 or group1 wiki? wmf.8 just got merged, but I'm still waiting for CI to merge wmf.6 [18:27:03] yes, will do [18:27:13] mdholloway: thanks [18:28:02] mdholloway: can we use the existing task for what happened? [18:28:23] Urbanecm: I'll have a look [18:28:38] ping me once you know :) [18:28:52] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [18:29:06] (03Merged) 10jenkins-bot: Default to using API json formatversion=2 [extensions/EventStreamConfig] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626375 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:29:11] (03Merged) 10jenkins-bot: EditPage: Fix member call on boolean when undo is impossible [core] (wmf/1.36.0-wmf.6) - 10https://gerrit.wikimedia.org/r/626429 (https://phabricator.wikimedia.org/T262463) (owner: 10Urbanecm) [18:29:17] please share the maintenance script, it is unclear on the ticket [18:29:34] to understand where the overhead came from [18:29:35] well, the second just got merged Ostrzyciel, so it doesn't matter now :) [18:29:57] (03PS1) 10Dzahn: site: remove backup host role from helium [puppet] - 10https://gerrit.wikimedia.org/r/626460 (https://phabricator.wikimedia.org/T260717) [18:29:58] ottomata: Ostrzyciel: Im currently pulling your patches to a mwdebug host, will ping you once done [18:30:03] ok ty [18:30:46] jynus: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/ReadingLists/+/refs/heads/master/maintenance/fixListSize.php [18:31:30] thanks mdholloway [18:31:37] Urbanecm: still broken on test wiki; did the update go through? 
[18:31:50] Ostrzyciel: I didn't pull anything to the mwdebug host yet ;) [18:32:13] Urbanecm: ooooh, sorry, I missed your line, nvm [18:32:22] Ostrzyciel: ottomata: your patches are pulled onto mwdebug2001 for testing. could you test them and let me know please? :-) [18:32:30] RECOVERY - Disk space on an-coord1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops [18:32:35] Ostrzyciel: please let me know if you face any issues with the mwdebug extension, so I can help you to debug :) [18:32:59] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 40 [18:32:59] ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:34:13] effie: sorry, missed your ping to me. i'll create a new task and link it here [18:34:33] Urbanecm: verified on mwdebug2001, it works :) [18:34:35] (03PS2) 10DannyS712: resourceloader: Fix incorrect order of feature stylesheets [core] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626264 (https://phabricator.wikimedia.org/T262507) (owner: 10Krinkle) [18:34:54] Ostrzyciel: thanks! I'll sync it then :) [18:35:28] mdholloway: please add me so I can add some info [18:36:02] looks great Urbanecm [18:36:03] thank you [18:36:19] proceed! [18:36:34] thanks, will sync :) [18:37:20] effie: placeholder task created as T262575.
not sure if it makes more sense for me to start a description or leave it to an SRE [18:37:20] T262575: fixListSize.php creates excessive DB load in production - https://phabricator.wikimedia.org/T262575 [18:37:36] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.6/includes/EditPage.php: 824094428c5f41dc9eef7d65c8440dadda4d4dbd: EditPage: Fix member call on boolean when undo is impossible (T262463) (duration: 01m 07s) [18:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:43] T262463: Call to a member function preSaveTransform() on boolean - https://phabricator.wikimedia.org/T262463 [18:39:16] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:39:24] thank you mdholloway [18:40:02] no problem, and sorry i didn't notice the errors earlier [18:40:03] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.8/includes/EditPage.php: 824094428c5f41dc9eef7d65c8440dadda4d4dbd: EditPage: Fix member call on boolean when undo is impossible (T262463) (duration: 01m 03s) [18:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:14] Ostrzyciel: should be live on production :) [18:40:30] mdholloway: no prob, I will send you the bill at the end of the billing cycle [18:40:42] fair enough :) [18:40:50] (03PS2) 10Dzahn: scap: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624343 [18:41:57] (03CR) 10jerkins-bot: [V: 04-1] scap: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624343 (owner: 10Dzahn) [18:42:02] ottomata: I'm sorry, I just noticed that. There's an i18n change in your backport. To deploy it, we'd need a full scap, which takes...long. Can I just sync the code with sync-file, and leave the message for train? 
You said it's used only in meta, so we can override locally at worse [18:42:11] yeah, I think it works on enwiki, thanks :) [18:42:16] cool! [18:43:50] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:43:55] (03PS3) 10Dzahn: scap: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624343 [18:44:33] (03PS4) 10Dzahn: scap: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624343 (https://phabricator.wikimedia.org/T209953) [18:45:10] ottomata: ping? 🙂 [18:48:05] (03PS2) 10Dzahn: monitoring: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624369 (https://phabricator.wikimedia.org/T209953) [18:48:30] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:48:31] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2028.codfw.wmnet'] ` and were **ALL** successful. 
[18:49:11] (03CR) 10jerkins-bot: [V: 04-1] monitoring: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624369 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:49:32] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:49:51] ottomata: ping? 🙂 [18:50:04] (03PS2) 10Dzahn: nrpe: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624376 [18:50:16] (03PS3) 10Dzahn: nrpe: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624376 (https://phabricator.wikimedia.org/T209953) [18:50:54] mdholloway: we are still seeing some API 500 errors, do you know if there is something more? [18:51:24] (03CR) 10jerkins-bot: [V: 04-1] nrpe: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624376 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:52:19] (03PS2) 10Andrew Bogott: clush toolforge_canary_list: update! [puppet] - 10https://gerrit.wikimedia.org/r/626456 [18:52:21] (03PS2) 10Andrew Bogott: toolforge_canary_list.txt: use new .eqiad1.wikimedia.cloud names [puppet] - 10https://gerrit.wikimedia.org/r/626457 (https://phabricator.wikimedia.org/T260614) [18:52:23] (03PS1) 10Andrew Bogott: Openstack Module: remove instance_info_dumper.pp [puppet] - 10https://gerrit.wikimedia.org/r/626463 [18:52:25] (03PS1) 10Andrew Bogott: tools-clush-generator: use eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/626464 (https://phabricator.wikimedia.org/T260614) [18:52:27] Hello, can I add one config patch for window now? 
[18:52:27] (03PS1) 10Andrew Bogott: scap: add support for .eqiad1.wikimedia.cloud targets [puppet] - 10https://gerrit.wikimedia.org/r/626465 (https://phabricator.wikimedia.org/T260614) [18:52:29] (03PS1) 10Andrew Bogott: trafficserver: update to use a .wikimedia.cloud dns name [puppet] - 10https://gerrit.wikimedia.org/r/626466 (https://phabricator.wikimedia.org/T260614) [18:52:31] (03PS1) 10Andrew Bogott: base::remote_syslog: use .wikimedia.cloud naming for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/626467 (https://phabricator.wikimedia.org/T260614) [18:52:33] (03PS1) 10Andrew Bogott: toolschecker: use .eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/626468 (https://phabricator.wikimedia.org/T260614) [18:52:37] Zoranzoki21: try that :) [18:52:51] (03PS1) 10Urbanecm: Add a new namespace to frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626469 (https://phabricator.wikimedia.org/T262398) [18:52:57] (03PS3) 10Dzahn: monitoring: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624369 (https://phabricator.wikimedia.org/T209953) [18:53:19] (03CR) 10Urbanecm: [C: 03+2] Add a new namespace to frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626469 (https://phabricator.wikimedia.org/T262398) (owner: 10Urbanecm) [18:53:44] 20:42 ottomata: I'm sorry, I just noticed that. There's an i18n change in your backport. To deploy it, we'd need a full scap, which takes...long. Can I just sync the code with sync-file, and leave the message for train? 
You said it's used only in meta, so we can override locally at worse [18:53:47] I had problems with account here, but I've fixed it :) [18:53:50] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:54:06] effie: nothing as far as i know [18:54:50] ok then there is something else happening as well [18:56:14] (03Merged) 10jenkins-bot: Add a new namespace to frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626469 (https://phabricator.wikimedia.org/T262398) (owner: 10Urbanecm) [18:56:32] 10Operations, 10MassMessage, 10MediaWiki-JobQueue: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) I hope this doesn't happen to my delivery as well :((( [18:56:34] (03PS1) 10Urbanecm: Revert "Default to using API json formatversion=2" [extensions/EventStreamConfig] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626433 (https://phabricator.wikimedia.org/T251609) [18:56:36] (03CR) 10jerkins-bot: [V: 04-1] monitoring: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624369 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [18:57:18] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:57:24] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:57:33] 
(03PS3) 10Dzahn: site: add new POP install servers with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/601342 (https://phabricator.wikimedia.org/T252526) [18:58:06] ottomata: I'm sorry, but I have to revert your backport, as I didn't see a reply to my q [18:58:08] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 09e487e76158026ba161acffad277928d2603891: Add a new namespace to frwiktionary (T262398) (duration: 01m 04s) [18:58:12] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Revert "Default to using API json formatversion=2" [extensions/EventStreamConfig] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626433 (https://phabricator.wikimedia.org/T251609) (owner: 10Urbanecm) [18:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:15] T262398: New Namespace for the French Wiktionary : "Rime" - https://phabricator.wikimedia.org/T262398 [18:58:42] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:58:55] !log [urbanecm@mwmaint2001 ~]$ mwscript namespaceDupes.php --wiki=frwiktionary --fix # T262398 [18:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:19] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Module: remove instance_info_dumper.pp [puppet] - 10https://gerrit.wikimedia.org/r/626463 (owner: 10Andrew Bogott) [19:00:04] longma and liw: That opportune time is upon us again. Time for a Mediawiki train - American+European Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200910T1900). 
[19:00:17] Urbanecm [19:00:20] Oops [19:00:24] Urbanecm: It is ready https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/626432 [19:00:35] Zoranzoki21: But train window is now, so I can't do that now [19:01:16] the train is blocked at the moment, but hoping to resume shortly [19:01:39] longma: does that mean I can deploy Zoran's patch now? [19:02:08] I think that would be okay if it doesn't take long [19:02:17] it should be about five minutes :) [19:02:23] (03PS3) 10Urbanecm: Set $wgCategoryCollation = uca-tr on trwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626432 (https://phabricator.wikimedia.org/T262163) (owner: 10Zoranzoki21) [19:02:23] thanks [19:02:26] sure [19:02:29] (03CR) 10Urbanecm: [C: 03+2] Set $wgCategoryCollation = uca-tr on trwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626432 (https://phabricator.wikimedia.org/T262163) (owner: 10Zoranzoki21) [19:02:38] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article t [19:02:38] unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:02:41] Thanks longma and Urbanecm <3 [19:02:50] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:02:52] (03PS2) 10Dzahn: aphlict: add second envoy TLS terminator for client port [puppet] - 10https://gerrit.wikimedia.org/r/616917 (https://phabricator.wikimedia.org/T238593) [19:02:56] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/616917 
(https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [19:03:16] (03Merged) 10jenkins-bot: Set $wgCategoryCollation = uca-tr on trwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626432 (https://phabricator.wikimedia.org/T262163) (owner: 10Zoranzoki21) [19:03:24] (03PS2) 10Dzahn: puppetmaster::backend: replace hiera with lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/624342 (https://phabricator.wikimedia.org/T209953) [19:03:57] (03CR) 10jerkins-bot: [V: 04-1] aphlict: add second envoy TLS terminator for client port [puppet] - 10https://gerrit.wikimedia.org/r/616917 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn) [19:04:04] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:04:28] (03PS4) 10Dzahn: nrpe: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624376 (https://phabricator.wikimedia.org/T209953) [19:04:44] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::backend: replace hiera with lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/624342 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:04:46] Urbanecm: Should I add patch in the calendar or? [19:04:51] yes please [19:05:12] Urbanecm: Okay, after testing.. :) [19:05:18] Is patch ready for mwdebug? [19:05:29] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 95d2b575c683a1c5c2972a9bf0cf3b87059fbd74: Set $wgCategoryCollation = uca-tr on trwiktionary (T262163) (duration: 01m 05s) [19:05:31] I'm syncing it, as I'd have to run mwscript updateCollation.php [19:05:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:35] T262163: Set $wgCategoryCollation = uca-tr on trwiktionary - https://phabricator.wikimedia.org/T262163 [19:05:48] Urbanecm: Oh, yes. [19:06:14] I'm adding patch in calendar now... 
[19:07:19] !log Start of [urbanecm@mwmaint2001 ~]$ mwscript updateCollation.php --wiki=trwiktionary --previous-collation=uppercase # T262163 [19:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:27] Urbanecm: Patch added in calendar [19:08:56] thx [19:09:17] longma: I'm done [19:09:23] thanks Urbanecm [19:09:45] no problem [19:10:30] oh, btw Urbanecm did you revert the backport requiring localization changes? [19:10:38] longma: yes [19:10:53] alright thanks [19:10:58] Thanks Urbanecm :) [19:11:13] mwscript is done? [19:11:25] longma: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventStreamConfig/+/626433 [19:11:28] (03PS3) 10Dzahn: aphlict: add second envoy TLS terminator for client port [puppet] - 10https://gerrit.wikimedia.org/r/616917 (https://phabricator.wikimedia.org/T238593) [19:11:47] Zoranzoki21: no, the script is still running, but that doesn't block next deploys [19:11:56] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:12:04] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Reading List Service: fixListSize.php creates excessive DB load in production - https://phabricator.wikimedia.org/T262575 (10jijiki) [19:12:12] Urbanecm: Ok, I'm waiting [19:12:14] longma: if train is still blocked for the moment, mind if i restart recommendation-api to see if it clears up the issue causing alerts? [19:12:25] sure, go ahead [19:12:40] thanks, doing now! 
[19:12:57] !log mholloway-shell@deploy1001 Started restart [recommendation-api/deploy@db7fd80]: (no justification provided) [19:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:00] (03PS1) 10Ebernhardson: admin: Update ebernhardson home files [puppet] - 10https://gerrit.wikimedia.org/r/626474 [19:14:46] 10Operations, 10DBA, 10observability: Prometheus/MariaDB counts a 'SELECT ... FOR UPDATE' query as an UPDATE query - https://phabricator.wikimedia.org/T262579 (10jijiki) [19:14:56] longma: done. looks like the problem is at the db level and not the service level... [19:15:07] alright, thanks mdholloway [19:15:37] (03PS4) 10Dzahn: monitoring: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624369 (https://phabricator.wikimedia.org/T209953) [19:16:13] 10Operations, 10DBA, 10observability: Prometheus/MariaDB counts a 'SELECT ... FOR UPDATE' query as an UPDATE query - https://phabricator.wikimedia.org/T262579 (10jijiki) [19:16:43] (03CR) 10jerkins-bot: [V: 04-1] monitoring: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624369 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:18:10] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:18:36] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:19:12] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:19:40] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:19:41] (03CR) 10Cwhite: [C: 03+2] mediawiki: update alerts on logstash logs [puppet] - 10https://gerrit.wikimedia.org/r/625982 (https://phabricator.wikimedia.org/T256418) (owner: 10Cwhite) [19:19:49] (03PS4) 10Cwhite: mediawiki: update alerts on logstash logs [puppet] - 10https://gerrit.wikimedia.org/r/625982 (https://phabricator.wikimedia.org/T256418) [19:21:20] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:21:25] (03PS3) 10Dzahn: puppetmaster::backend: replace hiera with lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/624342 (https://phabricator.wikimedia.org/T209953) [19:22:04] 10Operations, 10MassMessage, 10Platform Engineering, 10WMF-JobQueue: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Krinkle) >>! [Jul 23, 2020] In T93049#6328594, @Joe added projects: Platform Engineering: > > @Pchelolo @Ottomata do we have any way to verify how... 
[19:22:26] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:22:51] 10Operations, 10Editing-team, 10MassMessage, 10Platform Engineering, 10WMF-JobQueue: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10greg) Adding #editing-team per https://www.mediawiki.org/wiki/Developers/Maintainers [19:26:04] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:26:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:30:00] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:30:04] (03CR) 10Hashar: [C: 03+1] resourceloader: Fix incorrect order of feature stylesheets [core] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626264 (https://phabricator.wikimedia.org/T262507) (owner: 10Krinkle) [19:32:04] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:32:30] (03Abandoned) 10Dzahn: puppetmaster: (re)move hiera lookup for scripts to profiles [puppet] - 10https://gerrit.wikimedia.org/r/624335 (owner: 10Dzahn) [19:34:06] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, 
excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:34:10] (03PS2) 10Dzahn: puppetdb: (re)move hiera lookup for db pass to profile [puppet] - 10https://gerrit.wikimedia.org/r/624340 [19:34:19] (03CR) 10Krinkle: [C: 03+2] resourceloader: Fix incorrect order of feature stylesheets [core] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626264 (https://phabricator.wikimedia.org/T262507) (owner: 10Krinkle) [19:35:38] (03Restored) 10Dzahn: puppetmaster: (re)move hiera lookup for scripts to profiles [puppet] - 10https://gerrit.wikimedia.org/r/624335 (owner: 10Dzahn) [19:36:36] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:36:46] (03PS1) 10Razzi: Add geoip::data::puppet to profile::piwik::instance [puppet] - 10https://gerrit.wikimedia.org/r/626481 (https://phabricator.wikimedia.org/T213741) [19:37:10] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:37:37] (03CR) 10Hashar: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/625910 (owner: 10Hashar) [19:37:53] (03CR) 10jerkins-bot: [V: 04-1] Add geoip::data::puppet to profile::piwik::instance [puppet] - 10https://gerrit.wikimedia.org/r/626481 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [19:38:24] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:38:49] (03PS5) 10Dzahn: monitoring: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/624369 (https://phabricator.wikimedia.org/T209953) [19:39:04] (03CR) 10jerkins-bot: [V: 04-1] update_version: improve tox.ini [deployment-charts] - 10https://gerrit.wikimedia.org/r/625910 (owner: 10Hashar) [19:39:08] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:40:08] (03CR) 10Razzi: "I tested this by adding the class to a new node, and confirmed that it got the right .mmdb files. Could still use a catalog compiler run a" [puppet] - 10https://gerrit.wikimedia.org/r/626481 (https://phabricator.wikimedia.org/T213741) (owner: 10Razzi) [19:40:10] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:41:18] (03PS2) 10Razzi: Add geoip::data::puppet to profile::piwik::instance [puppet] - 10https://gerrit.wikimedia.org/r/626481 (https://phabricator.wikimedia.org/T213741) [19:41:42] 10Operations, 10Recommendation-API, 10serviceops: recommendation-api alerting and api errors - https://phabricator.wikimedia.org/T262587 (10jijiki) [19:41:46] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:41:59] (03PS1) 10Hashar: update_version: replace python 3.4 with 3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/626483 [19:42:10] mdholloway: I created https://phabricator.wikimedia.org/T262587, I need to sign off and I can't debug it [19:42:27] it 
could be an API issue, I looked around a bit to no avail [19:42:42] I've just been investigating. it does seem to be related to 503s from the API [19:43:09] That particular endpoint fires off a lot of API requests in parallel, and I believe it throws an error if any of them fail [19:43:10] (03CR) 10Jeena Huneidi: [C: 03+2] "hashar to the rescue" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626483 (owner: 10Hashar) [19:43:16] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:43:26] (03PS2) 10Hashar: update_version: improve tox.ini [deployment-charts] - 10https://gerrit.wikimedia.org/r/625910 [19:44:07] (03CR) 10Dzahn: [C: 03+1] Fix "Could not find resource 'User[deploy-devtools]'" [puppet] - 10https://gerrit.wikimedia.org/r/626035 (owner: 10Paladox) [19:44:08] I mentioned above a few minutes ago that I thought it was DB-related, but that was mistaken, I think. (Local config mistake.) 
[19:44:25] (03PS8) 10Dzahn: scap: Fix "Could not find resource 'User[deploy-devtools]'" in cloud [puppet] - 10https://gerrit.wikimedia.org/r/626035 (owner: 10Paladox) [19:44:33] (03Merged) 10jenkins-bot: update_version: replace python 3.4 with 3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/626483 (owner: 10Hashar) [19:44:36] !log End of [urbanecm@mwmaint2001 ~]$ mwscript updateCollation.php --wiki=trwiktionary --previous-collation=uppercase # T262163 [19:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:42] T262163: Set $wgCategoryCollation = uca-tr on trwiktionary - https://phabricator.wikimedia.org/T262163 [19:44:52] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:45:00] (03CR) 10Hashar: "Rebased on top of https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/626483 which drops python 3.4 ;]" [deployment-charts] - 10https://gerrit.wikimedia.org/r/625910 (owner: 10Hashar) [19:45:09] * Krinkle takes deploy lock [19:45:20] * Krinkle staging on mwdebug2001 [19:45:43] (03CR) 10Jeena Huneidi: [C: 03+2] update_version: improve tox.ini [deployment-charts] - 10https://gerrit.wikimedia.org/r/625910 (owner: 10Hashar) [19:45:50] effie: in any case, thanks for filing that! 
[19:45:54] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:47:18] (03Merged) 10jenkins-bot: update_version: improve tox.ini [deployment-charts] - 10https://gerrit.wikimedia.org/r/625910 (owner: 10Hashar) [19:47:38] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:48:14] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:49:28] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:50:14] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:51:56] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:52:38] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:53:30] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 
200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:53:44] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:54:01] (03Merged) 10jenkins-bot: resourceloader: Fix incorrect order of feature stylesheets [core] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/626264 (https://phabricator.wikimedia.org/T262507) (owner: 10Krinkle) [19:54:28] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:54:41] (03PS3) 10Catrope: Enable and configure GrowthExperiments on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625963 (https://phabricator.wikimedia.org/T254239) [19:55:18] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:59:17] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:00:58] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:02:17] !log krinkle@deploy1001 Synchronized php-1.36.0-wmf.8/includes/resourceloader/ResourceLoaderSkinModule.php: Ibe2c9f8d024f6 (duration: 01m 05s) [20:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:03:47] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:04:48] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:05:42] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` es2029.codfw.wmnet ` The... [20:09:32] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:17:12] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:19:04] I will proceed with the train now [20:19:06] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:20:05] !log deploying 1.36.0-wmf.8 to all wikis [20:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:50] !log correction: T257976 - 1.36.0-wmf.8 to all wikis [20:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:56] T257976: 1.36.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T257976 [20:21:07] (03CR) 
10BryanDavis: scap: add support for .eqiad1.wikimedia.cloud targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626465 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [20:21:39] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [20:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:48] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:22:41] (03PS1) 10Jeena Huneidi: all wikis to 1.36.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626490 [20:22:43] (03CR) 10Jeena Huneidi: [C: 03+2] all wikis to 1.36.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626490 (owner: 10Jeena Huneidi) [20:23:26] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626490 (owner: 10Jeena Huneidi) [20:23:45] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:03] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.8 [20:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:38] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:28:22] PROBLEM - PHP7 rendering on mw2296 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - header X-Powered-By: PHP/7. 
not found on http://en.wikipedia.org:80/wiki/Main_Page - 1310 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:28:22] PROBLEM - Apache HTTP on mw2296 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1310 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:32:15] 10Operations, 10Puppet: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609 (10Volans) p:05Triage→03Medium [20:33:55] looks like 503s are back down as of the train deploy finishing [20:36:33] 10Operations, 10Recommendation-API, 10serviceops: recommendation-api alerting and api errors - https://phabricator.wikimedia.org/T262587 (10Mholloway) Looks like 503s from the API appservers are back down following today's MW train deployment. The root cause appears to have been T262575 but it's not clear t... [20:44:30] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2029.codfw.wmnet'] ` and were **ALL** successful. 
[20:46:29] (03PS1) 10Dzahn: site: add parsoid role to parsoid2020.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/626496 (https://phabricator.wikimedia.org/T247441) [20:47:18] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [20:49:19] (03PS28) 10CRusnov: customscripts/interface_automation.py: Add Interface and IP Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [20:49:36] (03CR) 10Dzahn: [C: 03+2] site: add parsoid role to parsoid2020.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/626496 (https://phabricator.wikimedia.org/T247441) (owner: 10Dzahn) [20:50:00] (03PS29) 10CRusnov: customscripts/interface_automation.py: Add Interface and IP Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [20:50:06] (03CR) 10Herron: [C: 03+2] logstash: increase jvm heap memory to 2g [puppet] - 10https://gerrit.wikimedia.org/r/626393 (owner: 10Herron) [20:51:30] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` es2030.codfw.wmnet ` The... 
[20:52:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:52:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:27] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:33] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:40] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10Papaul) [21:11:54] 10Operations, 10Puppet: First puppet run after reimage slow (connection timeout) - https://phabricator.wikimedia.org/T262609 (10MoritzMuehlenhoff) From what I can tell that's a known issue/longstanding behaviour with enabling the mapped addressed, the puppet run stalls/times out with the "ifup" call with does... [21:18:00] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10wiki_willy) [21:19:43] 10Operations, 10ops-codfw: Degraded RAID on ms-be2019 - https://phabricator.wikimedia.org/T262182 (10wiki_willy) Thanks for the info @fgiunchedi. T262614 has been created to order the new part. Thanks, Willy >> @fgiunchedi - isn't this one going to be refreshed, as soon as you're done testing out the Dell 7... [21:20:07] (03CR) 10Andrew Bogott: scap: add support for .eqiad1.wikimedia.cloud targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626465 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [21:20:25] (03PS3) 10Andrew Bogott: clush toolforge_canary_list: update! 
[puppet] - 10https://gerrit.wikimedia.org/r/626456 [21:20:27] (03PS3) 10Andrew Bogott: toolforge_canary_list.txt: use new .eqiad1.wikimedia.cloud names [puppet] - 10https://gerrit.wikimedia.org/r/626457 (https://phabricator.wikimedia.org/T260614) [21:20:29] (03PS2) 10Andrew Bogott: tools-clush-generator: use eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/626464 (https://phabricator.wikimedia.org/T260614) [21:20:31] (03PS2) 10Andrew Bogott: scap: add support for .eqiad1.wikimedia.cloud targets [puppet] - 10https://gerrit.wikimedia.org/r/626465 (https://phabricator.wikimedia.org/T260614) [21:20:33] (03PS2) 10Andrew Bogott: trafficserver: update to use a .wikimedia.cloud dns name [puppet] - 10https://gerrit.wikimedia.org/r/626466 (https://phabricator.wikimedia.org/T260614) [21:20:35] (03PS2) 10Andrew Bogott: base::remote_syslog: use .wikimedia.cloud naming for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/626467 (https://phabricator.wikimedia.org/T260614) [21:20:37] (03PS2) 10Andrew Bogott: toolschecker: use .eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/626468 (https://phabricator.wikimedia.org/T260614) [21:31:21] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2030.codfw.wmnet'] ` and were **ALL** successful. [21:33:21] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` es2031.codfw.wmnet ` The... 
[21:47:15] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: es2029.codfw.wmnet, webperf1002.eqiad.wmnet, es2030.codfw.wmnet, webperf2002.codfw.wmnet, wdqs1009.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [21:47:57] taking a look at the webperf part of that [21:48:38] (that alert triggers when the global number of hosts with puppet issues is over a threshold, so not clear which is the new one but all should be fixed) [21:49:21] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [21:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:26] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [21:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:33] Operations, MW-on-K8s, serviceops, Patch-For-Review, and 2 others: Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (daniel) a:tstarling→None [21:57:51] Operations, MW-on-K8s, serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (daniel) [22:06:20] Operations, MW-on-K8s, serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (jeena) [22:06:32] !log added mcrouter cert for parse2020, ran mcrouter_generate_certs [22:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:36] Operations, MW-on-K8s, serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (jeena) I added php support to blubber.
An example what would be added to the variant in your blubber... [22:13:10] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2031.codfw.wmnet'] ` and were **ALL** successful. [22:14:12] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` es2032.codfw.wmnet ` The... [22:17:11] Operations, MW-on-K8s, serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (jeena) [22:17:23] (PS1) Jeena Huneidi: blubberoid: Update to 2020-09-02-201016-production [deployment-charts] - https://gerrit.wikimedia.org/r/626511 (https://phabricator.wikimedia.org/T261783) [22:28:16] (CR) BryanDavis: [C: +1] scap: add support for .eqiad1.wikimedia.cloud targets [puppet] - https://gerrit.wikimedia.org/r/626465 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott) [22:28:40] (CR) Dduvall: [C: +2] blubberoid: Update to 2020-09-02-201016-production [deployment-charts] - https://gerrit.wikimedia.org/r/626511 (https://phabricator.wikimedia.org/T261783) (owner: Jeena Huneidi) [22:30:01] (Merged) jenkins-bot: blubberoid: Update to 2020-09-02-201016-production [deployment-charts] - https://gerrit.wikimedia.org/r/626511 (https://phabricator.wikimedia.org/T261783) (owner: Jeena Huneidi) [22:31:11] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [22:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:49] !log jhuneidi@deploy1001 helmfile [staging] Ran 'sync' command on namespace
'blubberoid' for release 'staging' . [22:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:19] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [22:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:48] !log jhuneidi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [22:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:09] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul) [22:50:48] !log jhuneidi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [22:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:06] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2032.codfw.wmnet'] ` and were **ALL** successful. [22:55:59] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` es2033.codfw.wmnet ` The... [23:00:05] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200910T2300) [23:00:05] No GERRIT patches in the queue for this window AFAICS.
[23:00:06] Operations, MW-on-K8s, serviceops, Release Pipeline (Blubber), Release-Engineering-Team (Pipeline): Deployment infrastructure for PHP microservices - https://phabricator.wikimedia.org/T261369 (jeena) [23:11:57] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime [23:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:06] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [23:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:29] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2033.codfw.wmnet'] ` and were **ALL** successful. [23:44:39] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` es2034.codfw.wmnet ` The...