[00:00:05] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191219T0000).
[00:00:05] niedzielski: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:40] o/ hello RoanKattouw Niharika Urbanecm!
[00:07:56] Hmm... Is anyone around that can SWAT the removal of a period?
[00:08:01] https://gerrit.wikimedia.org/r/#/c/mediawiki/skins/MinervaNeue/+/559226/1/resources/skins.minerva.scripts/menu/MainMenu.js
[00:10:11] niedzielski: I can probably help with that
[00:10:49] Hooray! Thank you twentyafterfour!!
[00:28:26] (PS1) Brennen Bearnes: logspam.pl: Shorten paths and include fatals [puppet] - https://gerrit.wikimedia.org/r/559246
[00:28:28] wow, 13 minutes to get a single punctuation through CI
[00:28:58] I've gotten physical products delivered by Amazon faster.
[00:29:07] To be fair though, I am a Prime member.
[00:30:34] (PS2) Brennen Bearnes: logspam.pl: Shorten paths and include fatals [puppet] - https://gerrit.wikimedia.org/r/559246
[00:31:26] niedzielski: haha nice
[00:33:07] niedzielski: it's on mwdebug1001 if you don't mind testing?
[00:33:18] I'm seeing it there right now.
[00:34:03] This seems correct to me, twentyafterfour!
[00:35:42] niedzielski: awesome thank you for testing
[00:36:01] Looks good on testwiki and hewiki. Thank you, twentyafterfour!
[00:36:19] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.457e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:36:50] uhm
[00:36:55] !log twentyafterfour@deploy1001 Synchronized php-1.35.0-wmf.11/skins/MinervaNeue/resources/skins.minerva.scripts/menu/MainMenu.js: sync https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/559226 for SWAT (duration: 01m 02s)
[00:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:15] hmm that error rate went up just slightly before my sync-file, so I think unrelated?
[00:38:50] (PS3) Bstorm: toolforge-k8s: add a script to grant "observer" access to a tool [puppet] - https://gerrit.wikimedia.org/r/559212 (https://phabricator.wikimedia.org/T233372)
[00:39:55] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5285 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:40:30] I should hope so. I'm not seeing a spike in our JS errors: https://grafana.wikimedia.org/d/000000566/overview?orgId=1
[00:41:20] just memcached errors, but quite a lot of them
[00:43:29] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 4 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:44:27] ok I guess it was just a transient problem, error rate of memcached has returned to normal
[00:44:55] Yay! Thank you again, twentyafterfour!
[00:52:27] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.884e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:53:29] ugh. That memcached error seems kinda serious if it's gonna keep coming back. I don't see much in the way of details in logstash. It just says SERVER ERROR without saying what kind of error
[00:59:33] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 429 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:00:04] twentyafterfour: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191219T0100).
[01:03:07] but I was just daydreaming about phabricator
[01:04:16] Isn't that the one perf and sre have been talking about on and off?
[01:08:39] (PS1) Krinkle: mediawiki: Capture shutdown/destruct backtrace in php7-fatal-error.php [puppet] - https://gerrit.wikimedia.org/r/559262 (https://phabricator.wikimedia.org/T241097)
[01:14:12] (CR) EBernhardson: [C: -1] "needs more than this, still working something out with a test install on stat1007" [puppet] - https://gerrit.wikimedia.org/r/558687 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson)
[01:17:06] niedzielski: the spike in MobileView rate does look interesting there
[01:17:06] https://grafana.wikimedia.org/d/000000566/overview?orgId=1&from=now%2Fw&to=now
[01:17:14] nearly double?
[01:20:58] (PS1) EBernhardson: analytics/hive: Support for kerberos sasl auth [puppet] - https://gerrit.wikimedia.org/r/559266
[01:24:44] Krinkle: good eye. Let me ask the team.
[01:25:28] * Krinkle adds req/s unit to graph
[01:28:47] I think it's primarily the native apps using this endpoint. I'm asking around.
[01:33:33] Krinkle: looking further back in time, I don't think this oscillation is so unusual.
[01:33:44] https://grafana.wikimedia.org/d/000000566/overview?orgId=1&from=now-1M%2FM&to=now-1M%2FM&fullscreen&panelId=17
[01:35:21] Last iOS release was ~2 weeks ago according to Monte.
[01:35:51] Android release was yesterday though.
[01:38:04] (PS1) Eevans: deployment-prep: Create new test instances of Kask [mediawiki-config] - https://gerrit.wikimedia.org/r/559279 (https://phabricator.wikimedia.org/T218609)
[01:44:50] !log deploying phabricator update (tagged release/2019-12-19/1)
[01:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:35] Operations, Research, SRE-Access-Requests: Requesting access to RESOURCE for Aroraakhil - https://phabricator.wikimedia.org/T241096 (Aklapper) Assuming this is about #sre-access-requests - what's the shell account name?
[01:50:49] niedzielski: looks like its a bot
[01:50:57] https://w.wiki/E7X
[01:51:20] anyhow, yeah, not too extreme I suppose, but such a spike and for several hours on a regular basis suggests something unorganic to me.
[01:51:40] could be worth looking into to see if there's something broken but yeah, seems fine I guess
[01:56:03] (CR) Alex Monk: [C: +1] deployment-prep: Create new test instances of Kask [mediawiki-config] - https://gerrit.wikimedia.org/r/559279 (https://phabricator.wikimedia.org/T218609) (owner: Eevans)
[01:57:08] !log phabricator update completed
[01:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:35] I just came to report that phab was saying "upstream connect error or disconnect/reset before headers. reset reason: connection failure", but I guess that was part of a phab update?
[02:09:21] Thanks Krinkle.
I've shared your link with the Android folks (since the bot is called Java/...).
[02:20:58] (PS2) EBernhardson: analytics/hive: Support for kerberos sasl auth [puppet] - https://gerrit.wikimedia.org/r/559266
[03:05:25] (PS1) Herron: ganeti300[123] disable notifications during setup [puppet] - https://gerrit.wikimedia.org/r/559297 (https://phabricator.wikimedia.org/T236216)
[03:05:27] (PS1) Herron: ganeti500[123] disable notifications during setup [puppet] - https://gerrit.wikimedia.org/r/559298 (https://phabricator.wikimedia.org/T228099)
[03:05:52] (CR) BryanDavis: [C: +1] toolforge-k8s: add a script to grant "observer" access to a tool [puppet] - https://gerrit.wikimedia.org/r/559212 (https://phabricator.wikimedia.org/T233372) (owner: Bstorm)
[03:09:13] (CR) Herron: [C: +2] ganeti300[123] disable notifications during setup [puppet] - https://gerrit.wikimedia.org/r/559297 (https://phabricator.wikimedia.org/T236216) (owner: Herron)
[03:10:03] (CR) Herron: [C: +2] ganeti500[123] disable notifications during setup [puppet] - https://gerrit.wikimedia.org/r/559298 (https://phabricator.wikimedia.org/T228099) (owner: Herron)
[03:48:02] !log volker-e@deploy1001 Started deploy [design/style-guide@5cecb37]: Deploy design/style-guide:
[03:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:48:10] !log volker-e@deploy1001 Finished deploy [design/style-guide@5cecb37]: Deploy design/style-guide: (duration: 00m 07s)
[03:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:51:28] (PS3) Dzahn: Phragile: Added PHP extensions needed by PHP 7 dependencies [puppet] - https://gerrit.wikimedia.org/r/558476 (https://phabricator.wikimedia.org/T211228) (owner: WMDE-leszek)
[03:52:15] (CR) Dzahn: [C: +2] ""labs"-only" [puppet] - https://gerrit.wikimedia.org/r/558476 (https://phabricator.wikimedia.org/T211228) (owner: WMDE-leszek)
[04:03:13] !log LDAP - added
mstyles to archiva-deployers (T240865)
[04:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:03:19] T240865: Give deployer access to Archiva for Maryum Styles - https://phabricator.wikimedia.org/T240865
[04:10:06] !log herron@cumin1001 START - Cookbook sre.hosts.downtime
[04:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:12:19] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[04:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:12:39] (CR) KartikMistry: [C: +2] Update cxserver to 2019-12-11-144337-production [deployment-charts] - https://gerrit.wikimedia.org/r/558778 (https://phabricator.wikimedia.org/T233405) (owner: KartikMistry)
[04:14:00] (PS2) KartikMistry: Update cxserver to 2019-12-11-144337-production [deployment-charts] - https://gerrit.wikimedia.org/r/558778 (https://phabricator.wikimedia.org/T233405)
[04:15:41] (CR) Ayounsi: [C: +2] Prepare Puppet to apply netinsights to all sites [puppet] - https://gerrit.wikimedia.org/r/559155 (owner: Ayounsi)
[04:16:23] (PS1) Herron: ganeti: assign ganeti300[123] role::ganeti [puppet] - https://gerrit.wikimedia.org/r/559313 (https://phabricator.wikimedia.org/T236216)
[04:19:36] !log kartik@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' .
[04:19:40] (PS1) Herron: add dummy esams and eqsin ganeti keys to pacify PCC [labs/private] - https://gerrit.wikimedia.org/r/559315 (https://phabricator.wikimedia.org/T236216)
[04:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:20:08] (CR) Herron: [V: +2 C: +2] add dummy esams and eqsin ganeti keys to pacify PCC [labs/private] - https://gerrit.wikimedia.org/r/559315 (https://phabricator.wikimedia.org/T236216) (owner: Herron)
[04:21:34] !log kartik@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[04:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:50] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler1003/20063/" [puppet] - https://gerrit.wikimedia.org/r/559313 (https://phabricator.wikimedia.org/T236216) (owner: Herron)
[04:25:54] !log kartik@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' .
[04:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:29] (PS2) Ayounsi: CR: add apply-groups [ re0 re1 ]; if multiple REs [homer/public] - https://gerrit.wikimedia.org/r/549690
[04:29:33] !log Update cxserver to 2019-12-11-144337-production (T233405, T238118)
[04:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:39] T238118: Add gcrwiki to cxserver - https://phabricator.wikimedia.org/T238118
[04:29:39] T233405: Reference shown duplicated in the source document - https://phabricator.wikimedia.org/T233405
[04:29:54] (CR) Ayounsi: [V: +2 C: +2] CR: add apply-groups [ re0 re1 ]; if multiple REs [homer/public] - https://gerrit.wikimedia.org/r/549690 (owner: Ayounsi)
[04:33:28] (PS1) Herron: add ganeti01.svc.esams.wmnet forward/reverse ipv4 records [dns] - https://gerrit.wikimedia.org/r/559324 (https://phabricator.wikimedia.org/T236216)
[04:42:29] (PS1) Herron: add misc cluster to eqsin and ulsfo [puppet] - https://gerrit.wikimedia.org/r/559330 (https://phabricator.wikimedia.org/T226444)
[04:58:09] !log herron@cumin1001 START - Cookbook sre.hosts.downtime
[04:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:39] !log herron@cumin1001 START - Cookbook sre.hosts.downtime
[04:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:16] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:26] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:10] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[05:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:22] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime
(exit_code=0)
[05:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:18] (PS1) Minhducsun2002: Upload HD logos for hi, la and no wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/559344 (https://phabricator.wikimedia.org/T150618)
[05:48:10] (PS1) Minhducsun2002: Add entry for hi, la and no wikibooks in wmf-config/InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/559347
[06:02:59] (PS1) Andrew Bogott: nova firstboot.sh: remove a no-longer-needed apt hack for codf1dev [puppet] - https://gerrit.wikimedia.org/r/559349 (https://phabricator.wikimedia.org/T239347)
[06:03:01] (PS1) Andrew Bogott: cloud-vps: allow per-deploy customization of site name in resolv.conf [puppet] - https://gerrit.wikimedia.org/r/559350
[06:04:11] (CR) Andrew Bogott: [C: +2] nova firstboot.sh: remove a no-longer-needed apt hack for codf1dev [puppet] - https://gerrit.wikimedia.org/r/559349 (https://phabricator.wikimedia.org/T239347) (owner: Andrew Bogott)
[06:05:15] (CR) jerkins-bot: [V: -1] cloud-vps: allow per-deploy customization of site name in resolv.conf [puppet] - https://gerrit.wikimedia.org/r/559350 (owner: Andrew Bogott)
[06:07:11] (CR) Andrew Bogott: [V: +2 C: +2] cloud-vps: allow per-deploy customization of site name in resolv.conf [puppet] - https://gerrit.wikimedia.org/r/559350 (owner: Andrew Bogott)
[06:13:03] !log Upgrade db2122, db2084
[06:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:12] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[06:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:33] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[06:15:21] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:15:25] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:06] (PS1) Marostegui: db-eqiad.php: Depool pc1008 [mediawiki-config] - https://gerrit.wikimedia.org/r/559352
[06:26:38] Operations, DBA: Upgrade BIOS and firmware on db2084 - https://phabricator.wikimedia.org/T241103 (Marostegui)
[06:27:09] Operations, DBA: Upgrade BIOS and firmware on db2084 - https://phabricator.wikimedia.org/T241103 (Marostegui) p:Triage→Normal
[06:31:28] (CR) Dzahn: "Added new 4096-bit host key in private repo which should now be accessible under "ssh_host_rsa_key" as used in this patch." [puppet] - https://gerrit.wikimedia.org/r/556265 (owner: Paladox)
[06:34:31] (CR) Marostegui: [C: +2] db-eqiad.php: Depool pc1008 [mediawiki-config] - https://gerrit.wikimedia.org/r/559352 (owner: Marostegui)
[06:35:02] (Merged) jenkins-bot: db-eqiad.php: Depool pc1008 [mediawiki-config] - https://gerrit.wikimedia.org/r/559352 (owner: Marostegui)
[06:36:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1008 for upgrade (duration: 01m 03s)
[06:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:14] !log Upgrade pc1008
[06:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:48] (CR) Ammarpad: [C: -1] Add initial configuration for ng.wikimedia.org (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/559218 (https://phabricator.wikimedia.org/T240771) (owner: IAmNetx)
[06:45:05] (PS1) Marostegui: Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - https://gerrit.wikimedia.org/r/559353
[06:46:40] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - https://gerrit.wikimedia.org/r/559353 (owner: Marostegui)
[06:46:59] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791
(Marostegui)
[06:47:32] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool pc1008" [mediawiki-config] - https://gerrit.wikimedia.org/r/559353 (owner: Marostegui)
[06:48:54] (CR) Giuseppe Lavagetto: [C: +1] "LGTM; did you test it?" [puppet] - https://gerrit.wikimedia.org/r/558158 (https://phabricator.wikimedia.org/T240824) (owner: Effie Mouzeli)
[06:49:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1008 after upgrade (duration: 01m 02s)
[06:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:59] !log cr2-eqdfw:delete chassis alarm management-ethernet - T241105
[06:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:06] T241105: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105
[06:52:10] Operations, netops: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105 (ayounsi) Open→Resolved p:Triage→Normal
[06:52:18] (PS1) Marostegui: install_server: Allow install es202[2-5] [puppet] - https://gerrit.wikimedia.org/r/559354 (https://phabricator.wikimedia.org/T235820)
[06:52:48] (PS4) IAmNetx: Add initial configuration for ng.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/559218 (https://phabricator.wikimedia.org/T240771)
[06:53:43] (CR) Marostegui: [C: +2] install_server: Allow install es202[2-5] [puppet] - https://gerrit.wikimedia.org/r/559354 (https://phabricator.wikimedia.org/T235820) (owner: Marostegui)
[06:54:40] (CR) Ammarpad: "This patch is incomplete.
You've not uploaded the files for these changes" (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (owner: Minhducsun2002)
[06:54:54] (CR) Ammarpad: [C: -1] Add entry for hi, la and no wikibooks in wmf-config/InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (owner: Minhducsun2002)
[07:05:38] (CR) Dzahn: "correction: reduced to 2048 bits again and also added ed25519 and 3 different ecdsa keys, see ticket comment." [puppet] - https://gerrit.wikimedia.org/r/556265 (owner: Paladox)
[07:06:32] (CR) Dzahn: [C: -1] "please rename ssh_host_ecdsa_key to ssh_host_ecdsa_256_key (that's how i did it in private repo for consistency)" [labs/private] - https://gerrit.wikimedia.org/r/556271 (owner: Paladox)
[07:07:13] Operations, netops: No Juniper alarms in SNMP for MX204 - https://phabricator.wikimedia.org/T241105 (ayounsi) Resolved→Open As a test, I pushed the following: ` [edit chassis] - alarm { - management-ethernet { - link-down ignore; - } - } ` As cr2-eqdfw doesn't have a mgm...
[07:07:40] (CR) Dzahn: [C: -1] Gerrit: Add ed25519 and ecdsa ssh host keys (1 comment) [puppet] - https://gerrit.wikimedia.org/r/556270 (owner: Paladox)
[07:08:52] (CR) Dzahn: "arrr.. do we have to match the exact names as upstream uses them? https://github.com/GerritCodeReview/gerrit/blob/8189f32d580ead2dea0648a" [labs/private] - https://gerrit.wikimedia.org/r/556271 (owner: Paladox)
[07:08:56] (CR) Dzahn: [C: -1] "arrr.. do we have to match the exact names as upstream uses them?
https://github.com/GerritCodeReview/gerrit/blob/8189f32d580ead2dea0648a" [puppet] - https://gerrit.wikimedia.org/r/556270 (owner: Paladox)
[07:10:36] !log Upgrade db1115 (this will make dbtree fail for a few minutes)
[07:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:30] (CR) Gergő Tisza: [cirrus] add elastic mapping for ores drafttopics (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/558577 (https://phabricator.wikimedia.org/T240550) (owner: DCausse)
[07:13:00] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:14:01] (PS2) Dzahn: Set vcs user password to '*' [puppet] - https://gerrit.wikimedia.org/r/556471 (owner: 20after4)
[07:15:13] PROBLEM - mediawiki-installation DSH group on mw1321 is CRITICAL: Host mw1321 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[07:15:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076, s2 candidate master, for upgrade', diff saved to https://phabricator.wikimedia.org/P9947 and previous config saved to /var/cache/conftool/dbconfig/20191219-071514-marostegui.json
[07:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:25] ^ that is me
[07:15:39] ack, thx
[07:17:04] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[07:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:24] (CR) Dzahn: [C: +2] "!
means disabled and * means "password never established" (but keys would work)" [puppet] - https://gerrit.wikimedia.org/r/556471 (owner: 20after4)
[07:22:55] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:24:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1076', diff saved to https://phabricator.wikimedia.org/P9948 and previous config saved to /var/cache/conftool/dbconfig/20191219-072413-marostegui.json
[07:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:39] (PS1) Ema: cache: reimage cp1085 as text_ats [puppet] - https://gerrit.wikimedia.org/r/559356 (https://phabricator.wikimedia.org/T227432)
[07:24:41] (PS1) Ema: cache: reimage cp1087 as text_ats [puppet] - https://gerrit.wikimedia.org/r/559357 (https://phabricator.wikimedia.org/T227432)
[07:24:43] (PS1) Ema: Revert "Revert "cache: reimage cp3064 as text_ats"" [puppet] - https://gerrit.wikimedia.org/r/559358 (https://phabricator.wikimedia.org/T227432)
[07:24:45] (PS1) Ema: Revert "Revert "cache: reimage cp2023 as text_ats"" [puppet] - https://gerrit.wikimedia.org/r/559359 (https://phabricator.wikimedia.org/T227432)
[07:24:53] (PS3) Ema: ATS: disable compress plugin on ats-be [puppet] - https://gerrit.wikimedia.org/r/559053 (https://phabricator.wikimedia.org/T238494)
[07:25:27] (CR) Dzahn: "though strictly it is just by convention and the manpage says both !
and * just mean "user will not be able to use a unix password" (man 5" [puppet] - https://gerrit.wikimedia.org/r/556471 (owner: 20after4)
[07:27:01] (CR) Ema: [C: +2] ATS: disable compress plugin on ats-be [puppet] - https://gerrit.wikimedia.org/r/559053 (https://phabricator.wikimedia.org/T238494) (owner: Ema)
[07:27:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1014 for upgrade', diff saved to https://phabricator.wikimedia.org/P9949 and previous config saved to /var/cache/conftool/dbconfig/20191219-072728-marostegui.json
[07:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:22] (CR) Dzahn: [C: +2] "MariaDB [phabricator_maniphest]> SELECT COUNT(*) AS '' FROM maniphest_task WHERE" [puppet] - https://gerrit.wikimedia.org/r/557008 (owner: Aklapper)
[07:30:04] (CR) Dzahn: [C: +2] "MariaDB [phabricator_maniphest]> SELECT DISTINCT CONCAT("https://phabricator.wikimedia.org/p/", u.userName) AS userName," [puppet] - https://gerrit.wikimedia.org/r/557210 (https://phabricator.wikimedia.org/T227388) (owner: Aklapper)
[07:30:26] !log cp: rolling ats-backend-restart to disable compress plugin everywhere T238494
[07:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:32] T238494: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494
[07:31:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1076', diff saved to https://phabricator.wikimedia.org/P9950 and previous config saved to /var/cache/conftool/dbconfig/20191219-073151-marostegui.json
[07:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:09] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:33:47] (PS1) Andrew Bogott: cloud-vps nova firstboot.sh: Install curl, puppet,
nscd [puppet] - https://gerrit.wikimedia.org/r/559361 (https://phabricator.wikimedia.org/T181375)
[07:34:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1014', diff saved to https://phabricator.wikimedia.org/P9951 and previous config saved to /var/cache/conftool/dbconfig/20191219-073430-marostegui.json
[07:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:19] (CR) Andrew Bogott: [C: +2] cloud-vps nova firstboot.sh: Install curl, puppet, nscd [puppet] - https://gerrit.wikimedia.org/r/559361 (https://phabricator.wikimedia.org/T181375) (owner: Andrew Bogott)
[07:35:48] Operations: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (jcrespo) +1
[07:39:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1076', diff saved to https://phabricator.wikimedia.org/P9952 and previous config saved to /var/cache/conftool/dbconfig/20191219-073907-marostegui.json
[07:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1014', diff saved to https://phabricator.wikimedia.org/P9953 and previous config saved to /var/cache/conftool/dbconfig/20191219-074122-marostegui.json
[07:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:22] (Abandoned) Ema: ATS: lower compress plugin minimum-content-length [puppet] - https://gerrit.wikimedia.org/r/542996 (https://phabricator.wikimedia.org/T232615) (owner: Ema)
[07:43:16] (Abandoned) Ema: VCL: add support for X-Applayer-Cost [puppet] - https://gerrit.wikimedia.org/r/359419 (owner: Ema)
[07:44:01] (PS3) Dzahn: gerrit: Add ed25519, ecdsa and rsa fake ssh host keys [labs/private] - https://gerrit.wikimedia.org/r/556271 (owner: Paladox)
[07:44:47] (Abandoned) Ema: WIP: vcl: size-based cutoff for exp caching policy [puppet] -
https://gerrit.wikimedia.org/r/393227 (https://phabricator.wikimedia.org/T144187) (owner: Ema)
[07:44:56] (CR) Dzahn: [C: +2] "I renamed ssh_host_ecdsa_key to ssh_host_ecdsa_256_key and added the ssh_host_rsa.key as well. This matches private repo now." [labs/private] - https://gerrit.wikimedia.org/r/556271 (owner: Paladox)
[07:45:21] (CR) Dzahn: [V: +2 C: +2] gerrit: Add ed25519, ecdsa and rsa fake ssh host keys [labs/private] - https://gerrit.wikimedia.org/r/556271 (owner: Paladox)
[07:47:03] (Abandoned) Ema: text-be: remove code adding X-F-P to Vary [puppet] - https://gerrit.wikimedia.org/r/434644 (https://phabricator.wikimedia.org/T53700) (owner: Ema)
[07:48:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1076', diff saved to https://phabricator.wikimedia.org/P9954 and previous config saved to /var/cache/conftool/dbconfig/20191219-074800-marostegui.json
[07:48:03] (CR) Dzahn: [C: -1] "I already added the ssh_host_rsa_key in gerrit:556271 but we can use this to delete the old one later when we are done switching it."
[labs/private] - https://gerrit.wikimedia.org/r/556268 (owner: Paladox)
[07:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1014', diff saved to https://phabricator.wikimedia.org/P9955 and previous config saved to /var/cache/conftool/dbconfig/20191219-074840-marostegui.json
[07:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:21] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:53:23] !log depool cp1085 and reimage as text_ats T227432
[07:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:29] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432
[07:54:33] (CR) Ema: [C: +2] cache: reimage cp1085 as text_ats [puppet] - https://gerrit.wikimedia.org/r/559356 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[07:54:54] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/559313 (https://phabricator.wikimedia.org/T236216) (owner: Herron)
[07:56:05] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1085.eqiad.wmnet'] ` The log can be found in `/var/log/wm...
[07:57:55] (CR) Muehlenhoff: ganeti: allow ssh between cluster regardless of ganeti_cluster fact (1 comment) [puppet] - https://gerrit.wikimedia.org/r/559172 (owner: Herron)
[07:59:54] (CR) Muehlenhoff: [C: -1] "We don't use ganeti-instance-debootstrap; we PXE-boot our regular netinst images as we do for baremetal servers."
[puppet] - https://gerrit.wikimedia.org/r/559166 (https://phabricator.wikimedia.org/T226444) (owner: Herron)
[08:05:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es1014', diff saved to https://phabricator.wikimedia.org/P9956 and previous config saved to /var/cache/conftool/dbconfig/20191219-080519-marostegui.json
[08:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:36] (PS1) Marostegui: dbproxy: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/559366
[08:09:12] !log ema@cumin1001 START - Cookbook sre.hosts.downtime
[08:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:30] (CR) Marostegui: [C: +2] dbproxy: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/559366 (owner: Marostegui)
[08:11:20] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:43] (PS1) Marostegui: Revert "dbproxy: Depool labsdb1010" [puppet] - https://gerrit.wikimedia.org/r/559367
[08:12:04] (CR) Marostegui: [V: +2 C: +2] Revert "dbproxy: Depool labsdb1010" [puppet] - https://gerrit.wikimedia.org/r/559367 (owner: Marostegui)
[08:16:04] (PS1) Marostegui: dbproxy: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/559406
[08:17:02] (CR) Marostegui: [C: +2] dbproxy: Depool labsdb1010 [puppet] - https://gerrit.wikimedia.org/r/559406 (owner: Marostegui)
[08:17:56] !log Restart mysql on labsdb1010 (after depooling it)
[08:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:58] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime
[08:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:21:12] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:27] (PS2) Gehel: Reduce osmosis maxInterval in half [puppet] - https://gerrit.wikimedia.org/r/559158 (https://phabricator.wikimedia.org/T239728) (owner: MSantos) [08:21:50] PROBLEM - MariaDB read only wikireplica on labsdb1010 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:22:06] ^that's me [08:22:08] expected [08:22:28] (CR) Gehel: [C: +2] Reduce osmosis maxInterval in half [puppet] - https://gerrit.wikimedia.org/r/559158 (https://phabricator.wikimedia.org/T239728) (owner: MSantos) [08:22:56] RECOVERY - MariaDB read only wikireplica on labsdb1010 is OK: Version 10.1.43-MariaDB, Uptime 85s, read_only: True, 1230.83 QPS, connection latency: 0.002667s, query latency: 0.000813s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:23:06] (CR) Muehlenhoff: [C: +1] "Looks great" [software/debmonitor] - https://gerrit.wikimedia.org/r/559181 (https://phabricator.wikimedia.org/T237978) (owner: Volans) [08:23:11] (PS1) Marostegui: Revert "dbproxy: Depool labsdb1010" [puppet] - https://gerrit.wikimedia.org/r/559407 [08:23:31] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1085.eqiad.wmnet'] ` and were **ALL** successful. 
[08:23:36] (CR) Volans: [C: +2] Images: add support for names with slashes [software/debmonitor] - https://gerrit.wikimedia.org/r/559181 (https://phabricator.wikimedia.org/T237978) (owner: Volans) [08:24:31] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) [08:24:34] (CR) Marostegui: [C: +2] Revert "dbproxy: Depool labsdb1010" [puppet] - https://gerrit.wikimedia.org/r/559407 (owner: Marostegui) [08:26:20] (Merged) jenkins-bot: Images: add support for names with slashes [software/debmonitor] - https://gerrit.wikimedia.org/r/559181 (https://phabricator.wikimedia.org/T237978) (owner: Volans) [08:27:36] !log addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --batch-size=5 --from-id=30398836 # T237984 [08:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:42] T237984: Some property labels are not displayed on Item pages - https://phabricator.wikimedia.org/T237984 [08:31:18] !log pool cp1085 with ATS backend T227432 [08:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:23] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [08:31:44] (CR) Filippo Giunchedi: [C: +1] add misc cluster to eqsin and ulsfo [puppet] - https://gerrit.wikimedia.org/r/559330 (https://phabricator.wikimedia.org/T226444) (owner: Herron) [08:36:33] Operations, ops-codfw, DBA: Upgrade BIOS and firmware on db2084 - https://phabricator.wikimedia.org/T241103 (Marostegui) [08:41:55] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/559330 (https://phabricator.wikimedia.org/T226444) (owner: Herron) [08:41:59] Operations, Wikibugs: wikibugs needs restart almost everyday - 
https://phabricator.wikimedia.org/T241109 (Marostegui) [08:43:17] Operations, Research, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aroraakhil - https://phabricator.wikimedia.org/T241096 (Peachey88) [08:43:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1081, s4 candidate master, for upgrade', diff saved to https://phabricator.wikimedia.org/P9957 and previous config saved to /var/cache/conftool/dbconfig/20191219-084346-marostegui.json [08:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:10] Operations, Mail: Add security-team@wikimedia.org as recipient of abuse@wikimedia.org emails - https://phabricator.wikimedia.org/T241078 (Peachey88) [08:45:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1084 - depooled by mistake', diff saved to https://phabricator.wikimedia.org/P9958 and previous config saved to /var/cache/conftool/dbconfig/20191219-084518-marostegui.json [08:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1081, s4 candidate master, for upgrade', diff saved to https://phabricator.wikimedia.org/P9959 and previous config saved to /var/cache/conftool/dbconfig/20191219-084544-marostegui.json [08:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:23] !log depool cp1087 and reimage as text_ats T227432 [08:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:29] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [08:55:48] (PS2) Ema: cache: reimage cp1087 as text_ats [puppet] - https://gerrit.wikimedia.org/r/559357 (https://phabricator.wikimedia.org/T227432) [08:58:05] (CR) Ema: [C: +2] cache: reimage cp1087 as text_ats [puppet] - https://gerrit.wikimedia.org/r/559357 
(https://phabricator.wikimedia.org/T227432) (owner: Ema) [09:00:18] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1087.eqiad.wmnet'] ` The log can be found in `/var/log/wm... [09:03:53] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [09:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1081', diff saved to https://phabricator.wikimedia.org/P9960 and previous config saved to /var/cache/conftool/dbconfig/20191219-090455-marostegui.json [09:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:58] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) [09:07:14] PROBLEM - Aggregate IPsec Tunnel Status esams on icinga1001 is CRITICAL: instance=cp3064:9536 site=esams tunnel={cp1087_v4,cp1087_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [09:07:28] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance=cp2023:9536 site=codfw tunnel={cp1087_v4,cp1087_v6} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [09:08:56] (CR) Minhducsun2002: "I uploaded it here :" [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (owner: Minhducsun2002) [09:09:28] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance=cp2023:9536 
site=codfw tunnel={cp1087_v4,cp1087_v6} Ema reimaging cp1087 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [09:09:28] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status esams on icinga1001 is CRITICAL: instance=cp3064:9536 site=esams tunnel={cp1087_v4,cp1087_v6} Ema reimaging cp1087 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [09:12:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1081', diff saved to https://phabricator.wikimedia.org/P9961 and previous config saved to /var/cache/conftool/dbconfig/20191219-091205-marostegui.json [09:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:15] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [09:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:35] (PS2) Gehel: [wdqs] enable asynchronous imports on wdqs1004 [puppet] - https://gerrit.wikimedia.org/r/558526 (https://phabricator.wikimedia.org/T238045) (owner: DCausse) [09:15:01] !log running maps osm-replicate process manually on maps1004 - T239728 [09:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:07] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. 
- https://phabricator.wikimedia.org/T239728 [09:15:24] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:59] (CR) Gehel: [C: +2] [wdqs] enable asynchronous imports on wdqs1004 [puppet] - https://gerrit.wikimedia.org/r/558526 (https://phabricator.wikimedia.org/T238045) (owner: DCausse) [09:23:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1081', diff saved to https://phabricator.wikimedia.org/P9962 and previous config saved to /var/cache/conftool/dbconfig/20191219-092257-marostegui.json [09:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:07] !log depool cp3064 and reimage as text_ats T227432 [09:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:12] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [09:24:09] (PS2) Ema: Revert "Revert "cache: reimage cp3064 as text_ats"" [puppet] - https://gerrit.wikimedia.org/r/559358 (https://phabricator.wikimedia.org/T227432) [09:24:29] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1087.eqiad.wmnet'] ` and were **ALL** successful. 
[09:25:16] (CR) Ema: [C: +2] Revert "Revert "cache: reimage cp3064 as text_ats"" [puppet] - https://gerrit.wikimedia.org/r/559358 (https://phabricator.wikimedia.org/T227432) (owner: Ema) [09:26:28] (PS1) Elukey: Remove Spark SASL encryption options from Hadoop test [puppet] - https://gerrit.wikimedia.org/r/559418 (https://phabricator.wikimedia.org/T240934) [09:26:38] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3064.esams.wmnet'] ` The log can be found in `/var/log/wm... [09:29:17] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=ipsec site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:29:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1081', diff saved to https://phabricator.wikimedia.org/P9963 and previous config saved to /var/cache/conftool/dbconfig/20191219-092920-marostegui.json [09:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:14] (CR) Arturo Borrero Gonzalez: Fastnetmon: add thresholds overrides (1 comment) [puppet] - https://gerrit.wikimedia.org/r/559125 (https://phabricator.wikimedia.org/T240789) (owner: Ayounsi) [09:30:51] ACKNOWLEDGEMENT - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance=cp1089:9536 site=eqiad tunnel={cp3064_v4,cp3064_v6} Ema cp3064 reimaging https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [09:31:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1093, s6 candidate master, for upgrade', diff saved to https://phabricator.wikimedia.org/P9964 and previous config saved to 
/var/cache/conftool/dbconfig/20191219-093116-marostegui.json [09:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:25] (CR) Elukey: [C: +2] Remove Spark SASL encryption options from Hadoop test [puppet] - https://gerrit.wikimedia.org/r/559418 (https://phabricator.wikimedia.org/T240934) (owner: Elukey) [09:32:53] PROBLEM - check_trafficserver_log_fifo_tls_tls on cp1087 is CRITICAL: CRITICAL: /srv/trafficserver/tls/var/log/tls.pipe - fifo-log-demux not reading from pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:32:57] RECOVERY - Aggregate IPsec Tunnel Status esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [09:34:35] (CR) Elukey: [C: +2] "Makes sense, this is what I also experienced while working on snakebite + sasl pypi package (that is a wrapper for cyrus sasl). This shoul" [puppet] - https://gerrit.wikimedia.org/r/559266 (owner: EBernhardson) [09:36:49] RECOVERY - check_trafficserver_log_fifo_tls_tls on cp1087 is OK: OK: TS_MAIN writing to and fifo-log-demux reading from /srv/trafficserver/tls/var/log/tls.pipe https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:37:31] (CR) Elukey: airflow: Enable kerberos configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/558687 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson) [09:39:19] !log pool cp1087 with ATS backend T227432 [09:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:28] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [09:39:45] RECOVERY - mediawiki-installation DSH group on mw1321 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:39:54] (PS2) Giuseppe Lavagetto: First version of the debmonitor client [docker-images/docker-report] 
- https://gerrit.wikimedia.org/r/559165 [09:40:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1093', diff saved to https://phabricator.wikimedia.org/P9965 and previous config saved to /var/cache/conftool/dbconfig/20191219-093959-marostegui.json [09:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:12] (PS1) Muehlenhoff: Release v0.2.1 [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/559433 [09:41:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2084:3314, db2084:3315 T241103', diff saved to https://phabricator.wikimedia.org/P9966 and previous config saved to /var/cache/conftool/dbconfig/20191219-094135-marostegui.json [09:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:41] T241103: Upgrade BIOS and firmware on db2084 - https://phabricator.wikimedia.org/T241103 [09:42:12] Operations, ops-codfw, DBA: Upgrade BIOS and firmware on db2084 - https://phabricator.wikimedia.org/T241103 (Marostegui) I have depooled this host. So before acting on it we just need to stop downtime + stop MySQL [09:42:34] (CR) Muehlenhoff: [V: +2 C: +2] Release v0.2.1 [software/debmonitor/deploy] - https://gerrit.wikimedia.org/r/559433 (owner: Muehlenhoff) [09:46:55] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM. I would suggest to introduce the 'k8s' keyword in the script name for easier future context identification." 
[puppet] - https://gerrit.wikimedia.org/r/559212 (https://phabricator.wikimedia.org/T233372) (owner: Bstorm) [09:47:00] (PS3) Jcrespo: admin: Provide access to kzimmerman (kzeta) to production analytics [puppet] - https://gerrit.wikimedia.org/r/558316 (https://phabricator.wikimedia.org/T240732) [09:47:12] !log jmm@deploy1001 Started deploy [debmonitor/deploy@c056c3c]: debmonitor release v0.2.1 - T237978 [09:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:18] T237978: Extend debmonitor with image tracking support - https://phabricator.wikimedia.org/T237978 [09:48:32] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [09:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:35] (CR) Jcrespo: [C: +2] admin: Provide access to kzimmerman (kzeta) to production analytics [puppet] - https://gerrit.wikimedia.org/r/558316 (https://phabricator.wikimedia.org/T240732) (owner: Jcrespo) [09:49:36] !log jmm@deploy1001 Finished deploy [debmonitor/deploy@c056c3c]: debmonitor release v0.2.1 - T237978 (duration: 02m 24s) [09:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1093', diff saved to https://phabricator.wikimedia.org/P9967 and previous config saved to /var/cache/conftool/dbconfig/20191219-094945-marostegui.json [09:49:50] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1321.eqiad.wmnet', 'mw1320.eqiad.wmnet', 'mw1314.eqiad.wmnet', 'mw2271.codfw.wmnet', 'mw2255.codfw.wmnet'] ` and were **ALL** successful. 
[09:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:45] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1104, s8 candidate master, for upgrade', diff saved to https://phabricator.wikimedia.org/P9968 and previous config saved to /var/cache/conftool/dbconfig/20191219-095158-marostegui.json [09:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:52] PROBLEM - Nginx local proxy to apache on mw1320 is CRITICAL: connect to address 10.64.32.41 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [09:53:12] that server was reimaged, checking ^ [09:53:22] !log depool mw1320 [09:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:39] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) [09:55:38] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:55:56] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [09:56:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1093', diff saved to https://phabricator.wikimedia.org/P9969 and previous config saved to /var/cache/conftool/dbconfig/20191219-095559-marostegui.json [09:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:14] (CR) Ayounsi: Fastnetmon: add thresholds overrides (1 comment) [puppet] - https://gerrit.wikimedia.org/r/559125 (https://phabricator.wikimedia.org/T240789) (owner: Ayounsi) [09:56:45] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw1286.eqiad.wmnet', 'mw1269.eqiad.wmnet', 'mw2235.codfw.wmnet', 'mw2216.codfw.wmnet'] ` The log can... [09:56:59] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3064.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3064.esams.wmnet'] ` [09:58:34] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to stat1004, stat1007, stat1006, notebook1003, notebook1004 for Kate Zimmerman - https://phabricator.wikimedia.org/T240732 (jcrespo) @kzimmerman Access has been deployed, please wait at least 30 minutes after this comment so the cha... 
[10:00:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P9970 and previous config saved to /var/cache/conftool/dbconfig/20191219-095959-marostegui.json [10:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1093', diff saved to https://phabricator.wikimedia.org/P9971 and previous config saved to /var/cache/conftool/dbconfig/20191219-100938-marostegui.json [10:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:00] Operations, Data-Services, Discovery-Search, Wikidata, and 2 others: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (Gehel) >>! In T222349#5751029, @ArielGlenn wrote: > How fast a download do folks want? As fast as possible! using an external mirror, we... [10:10:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P9972 and previous config saved to /var/cache/conftool/dbconfig/20191219-101024-marostegui.json [10:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:10] RECOVERY - DPKG on analytics1055 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:18:52] Operations, Data-Services, Discovery-Search, Wikidata, and 2 others: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (ArielGlenn) >>! In T222349#5753559, @Gehel wrote: >>>! In T222349#5751029, @ArielGlenn wrote: >> How fast a download do folks want? > >... 
[10:18:54] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [10:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:03] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1104', diff saved to https://phabricator.wikimedia.org/P9973 and previous config saved to /var/cache/conftool/dbconfig/20191219-102316-marostegui.json [10:23:20] Operations, Data-Services, Discovery-Search, Wikidata, and 2 others: Do not rate limit dumps from internal network - https://phabricator.wikimedia.org/T222349 (Gehel) In the case of WDQS, we don't really have a schedule. It's an on demand requirement, whenever we need to do a data reload, which c... [10:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:19] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 78328112 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:35:42] (CR) Volans: [C: -1] "Thanks for the code! Few comments inline, a bunch of nitpicks and some question." 
(15 comments) [docker-images/docker-report] - https://gerrit.wikimedia.org/r/559165 (owner: Giuseppe Lavagetto) [10:38:11] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 12448 and 71 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [10:38:45] RECOVERY - Nginx local proxy to apache on mw1320 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:40:03] !log pool mw1320 [10:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:42] Operations, Performance-Team, Traffic, Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (ema) Today at 7:30ish we've disabled the compress plugin everywhere. It's clearly buggy and [[ https://grafana.wikimedia.org/d/7-ZqK8-Wz/... [10:46:31] (PS6) Jforrester: Variant configuration: Replace symfony/yaml with spyc [mediawiki-config] - https://gerrit.wikimedia.org/r/554967 [10:50:42] !log pool cp3064 with ATS backend T227432 [10:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:48] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [10:54:13] (PS2) Ema: Revert "Revert "cache: reimage cp2023 as text_ats"" [puppet] - https://gerrit.wikimedia.org/r/559359 (https://phabricator.wikimedia.org/T227432) [10:54:15] (PS1) Ema: cache: reimage cp1089 as text_ats [puppet] - https://gerrit.wikimedia.org/r/559440 (https://phabricator.wikimedia.org/T227432) [11:04:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1104', diff saved to https://phabricator.wikimedia.org/P9974 and previous config saved to /var/cache/conftool/dbconfig/20191219-110404-marostegui.json [11:04:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:57] !log removing kubestagetcd1001-1003 from debmonitor T224568 [11:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:02] T224568: Migrate etcd cluster for Kubernetes staging cluster to Stretch/Buster - https://phabricator.wikimedia.org/T224568 [11:07:53] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff) [11:16:54] (PS1) Mathew.onipe: Increase replication frequency [puppet] - https://gerrit.wikimedia.org/r/559442 (https://phabricator.wikimedia.org/T239728) [11:18:17] (PS2) Mathew.onipe: maps: Increase replication frequency [puppet] - https://gerrit.wikimedia.org/r/559442 (https://phabricator.wikimedia.org/T239728) [11:19:51] Operations: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (MoritzMuehlenhoff) Do we have non-roots who're using this actively? We could simply switch that to cn=ops otherwise? [11:26:35] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [11:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:45] Operations: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (jcrespo) I believe mediawiki deployers use it sometimes (I cannot say, we can ask). The issue is that the tool is not segmented for mediawiki-only, it monitors the whole fleet, including sensitive serve... 
[11:28:45] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:18] (PS1) Jbond: ca nrpe checking: exclude managing public certificates [puppet] - https://gerrit.wikimedia.org/r/559443 (https://phabricator.wikimedia.org/T238833) [11:32:04] Operations, Puppet, Patch-For-Review, User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (jbond) >>! In T238833#5749025, @Ottomata wrote: >> Just so that you aren't caught off guard >> `file {'/srv/private/secret/secrets/... [11:43:49] !log addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --batch-size=5 --from-id=4089887 # T237984 - Will stop after P433 [11:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:55] T237984: Some property labels are not displayed on Item pages - https://phabricator.wikimedia.org/T237984 [11:48:54] (CR) Urbanecm: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/559062 (https://phabricator.wikimedia.org/T240771) (owner: IAmNetx) [11:49:14] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 80243456 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:50:09] Operations, DNS, Traffic: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Aklapper) [11:53:24] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 114966256 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [11:57:21] (CR) Urbanecm: "> Patch Set 1:" [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (owner: Minhducsun2002) [12:00:04] Amir1, Lucas_WMDE, 
awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191219T1200). [12:00:04] No GERRIT patches in the queue for this window AFAICS. [12:00:17] nothing to do \o/! [12:02:06] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 57160 and 33 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:03:04] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 57392 and 90 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:03:17] !log installing netflow4001 [12:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:13] (PS3) Urbanecm: Use editautopatrolprotected right for pages protected for autopatrollers [mediawiki-config] - https://gerrit.wikimedia.org/r/529043 (https://phabricator.wikimedia.org/T230103) [12:07:00] (CR) Urbanecm: [C: -1] "otherwise, lgtm" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (owner: Minhducsun2002) [12:07:04] (CR) Urbanecm: [C: -1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (owner: Minhducsun2002) [12:08:16] (CR) jerkins-bot: [V: -1] Add entry for hi, la and no wikibooks in wmf-config/InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (owner: Minhducsun2002) [12:08:49] (PS4) Urbanecm: Use editautopatrolprotected right for pages protected for autopatrollers [mediawiki-config] - https://gerrit.wikimedia.org/r/529043 (https://phabricator.wikimedia.org/T230103) [12:09:08] (CR) Urbanecm: [C: -1] Upload HD logos for hi, la and no wikibooks (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/559344 (https://phabricator.wikimedia.org/T150618) (owner: 
Minhducsun2002) [12:09:11] (CR) Urbanecm: [C: -1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/559344 (https://phabricator.wikimedia.org/T150618) (owner: Minhducsun2002) [12:11:07] (PS5) Urbanecm: Use editautopatrolprotected right for pages protected for autopatrollers [mediawiki-config] - https://gerrit.wikimedia.org/r/529043 (https://phabricator.wikimedia.org/T230103) [12:12:44] Operations, Puppet, Patch-For-Review, User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (jbond) >>! In T236954#5749658, @colewhite wrote: > The changesets look great and appear to do the right thing. > > The only other thing I could think of doing is to have... [12:14:20] (CR) Dzahn: [C: +2] "per ISO 3166" [dns] - https://gerrit.wikimedia.org/r/559062 (https://phabricator.wikimedia.org/T240771) (owner: IAmNetx) [12:15:46] (PS3) Dzahn: Add ng.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/559062 (https://phabricator.wikimedia.org/T240771) (owner: IAmNetx) [12:16:03] (PS1) Minhducsun2002: Add entry for hi, la and no wikibooks in wmf-config/InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/559446 (https://phabricator.wikimedia.org/T150618) [12:17:21] (Abandoned) Minhducsun2002: Add entry for hi, la and no wikibooks in wmf-config/InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/559446 (https://phabricator.wikimedia.org/T150618) (owner: Minhducsun2002) [12:20:03] (CR) Arturo Borrero Gonzalez: [C: +1] cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [12:20:41] (PS6) Urbanecm: Use editautopatrolprotected right for pages protected for autopatrollers [mediawiki-config] - https://gerrit.wikimedia.org/r/529043 (https://phabricator.wikimedia.org/T230103) [12:20:43] (PS2) Urbanecm: Use 
editeditorprotected for protecting pages for editors [mediawiki-config] - https://gerrit.wikimedia.org/r/529046 (https://phabricator.wikimedia.org/T230103) [12:21:24] (CR) jerkins-bot: [V: -1] Use editautopatrolprotected right for pages protected for autopatrollers [mediawiki-config] - https://gerrit.wikimedia.org/r/529043 (https://phabricator.wikimedia.org/T230103) (owner: Urbanecm) [12:21:57] (PS2) Minhducsun2002: Add wgLogoHD entry for hi, la and no wikibooks in wmf-config/InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) [12:22:42] (CR) Minhducsun2002: "This should clarify more." [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) (owner: Minhducsun2002) [12:22:47] (PS7) Urbanecm: Use editautopatrolprotected right for pages protected for autopatrollers [mediawiki-config] - https://gerrit.wikimedia.org/r/529043 (https://phabricator.wikimedia.org/T230103) [12:24:38] (CR) Jbond: [C: -1] "looks good to me, a minor nit and a question. the -1 is for the lookup call" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/559125 (https://phabricator.wikimedia.org/T240789) (owner: Ayounsi) [12:25:38] (PS5) Urbanecm: Add initial configuration for ng.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/559218 (https://phabricator.wikimedia.org/T240771) (owner: IAmNetx) [12:27:01] Urbanecm: OK for me to do some clean-up config deploys? [12:27:16] James_F: yes, I'm just preparing :-) [12:27:20] Cool. 
[12:27:29] (PS2) Jforrester: CommonSettings.php: Move core DB/SQL-related config closer together [mediawiki-config] - https://gerrit.wikimedia.org/r/558768 (owner: Krinkle) [12:27:49] (CR) Jforrester: [C: +2] CommonSettings.php: Move core DB/SQL-related config closer together [mediawiki-config] - https://gerrit.wikimedia.org/r/558768 (owner: Krinkle) [12:28:14] (CR) Jforrester: [C: +1] CommonSettings.php: Remove 'SERVER_SOFTWARE' override [mediawiki-config] - https://gerrit.wikimedia.org/r/558763 (https://phabricator.wikimedia.org/T232563) (owner: Krinkle) [12:28:22] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/559104 (https://phabricator.wikimedia.org/T235418) (owner: Muehlenhoff) [12:28:34] (Merged) jenkins-bot: CommonSettings.php: Move core DB/SQL-related config closer together [mediawiki-config] - https://gerrit.wikimedia.org/r/558768 (owner: Krinkle) [12:28:55] (PS2) Jforrester: CommonSettings.php: Remove very old 'error_append_string' INI override [mediawiki-config] - https://gerrit.wikimedia.org/r/558761 (owner: Krinkle) [12:29:06] (PS3) Jbond: profile::puppetdb: refactor [puppet] - https://gerrit.wikimedia.org/r/554516 [12:30:20] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Cleanup: Move core DB/SQL-related config closer together (duration: 01m 02s) [12:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:55] (CR) Jforrester: [C: +2] CommonSettings.php: Remove very old 'error_append_string' INI override [mediawiki-config] - https://gerrit.wikimedia.org/r/558761 (owner: Krinkle) [12:31:25] (Merged) jenkins-bot: CommonSettings.php: Remove very old 'error_append_string' INI override [mediawiki-config] - https://gerrit.wikimedia.org/r/558761 (owner: Krinkle) [12:31:33] (PS3) Jforrester: CommonSettings.php: Remove CLI 'display_errors=stderr' setting [mediawiki-config] - 
https://gerrit.wikimedia.org/r/558758 (owner: Krinkle) [12:32:54] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Cleanup: Remove very old 'error_append_string' INI override (duration: 01m 02s) [12:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:09] (CR) Jbond: [C: +2] profile::puppetdb: refactor [puppet] - https://gerrit.wikimedia.org/r/554516 (owner: Jbond) [12:33:33] (CR) Jforrester: [C: +2] CommonSettings.php: Remove CLI 'display_errors=stderr' setting [mediawiki-config] - https://gerrit.wikimedia.org/r/558758 (owner: Krinkle) [12:34:40] (Merged) jenkins-bot: CommonSettings.php: Remove CLI 'display_errors=stderr' setting [mediawiki-config] - https://gerrit.wikimedia.org/r/558758 (owner: Krinkle) [12:35:04] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [12:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:09] (PS7) Jbond: netbox: create netbox_frontend global variables [puppet] - https://gerrit.wikimedia.org/r/554526 [12:35:43] (CR) Urbanecm: [C: -1] Add wgLogoHD entry for hi, la and no wikibooks in wmf-config/InitialiseSettings.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) (owner: Minhducsun2002) [12:35:45] (CR) Urbanecm: [C: -1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) (owner: Minhducsun2002) [12:35:47] (CR) Jforrester: [C: +1] "Are we doing this?" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/559224 (https://phabricator.wikimedia.org/T240691) (owner: Krinkle) [12:37:01] (CR) jerkins-bot: [V: -1] Add wgLogoHD entry for hi, la and no wikibooks in wmf-config/InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) (owner: Minhducsun2002) [12:37:08] Operations, DNS, Mail, Traffic: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Reedy) Did this ever work? Or did someone just start using the email and expect it to work? [12:37:10] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:37:13] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Cleanup: Remove CLI 'display_errors=stderr' setting (duration: 01m 01s) [12:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:21] (CR) Jforrester: [C: +1] "Nice chain." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/558774 (owner: Krinkle) [12:37:46] (PS1) Phamhi: wmcs: make cloudmetrics1002 the primary instead of labmon1001 [dns] - https://gerrit.wikimedia.org/r/559448 (https://phabricator.wikimedia.org/T224585) [12:38:03] (CR) Jbond: [C: +2] netbox: create netbox_frontend global variables [puppet] - https://gerrit.wikimedia.org/r/554526 (owner: Jbond) [12:38:54] (CR) Phamhi: "ar" [dns] - https://gerrit.wikimedia.org/r/559448 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [12:39:47] (CR) Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 (1 comment) [dns] - https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [12:39:52] (CR) Urbanecm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/559073 (https://phabricator.wikimedia.org/T240771) (owner: IAmNetx) [12:40:42] Operations, DNS, Mail, Traffic: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (HakanIST) It's been working since 2016. I've just checked the queue and latest received email is dated 08/01/2019. [12:41:08] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM thanks!" 
[dns] - https://gerrit.wikimedia.org/r/559448 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [12:41:17] (CR) Phamhi: [C: +2] wmcs: make cloudmetrics1002 the primary instead of labmon1001 [dns] - https://gerrit.wikimedia.org/r/559448 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [12:42:34] (Abandoned) Addshore: Role and profile for wdcm dashboards [puppet] - https://gerrit.wikimedia.org/r/387211 (owner: Addshore) [12:43:20] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/559218 (https://phabricator.wikimedia.org/T240771) (owner: IAmNetx) [12:47:51] Operations, DNS, Mail, Traffic: wikimedia.community domain name is not resolving an mx record - https://phabricator.wikimedia.org/T241132 (Dzahn) Looks like this changed in https://gerrit.wikimedia.org/r/c/operations/dns/+/533219 + @Vgutierrez [12:48:48] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.29e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [12:48:53] (CR) Dzahn: "see https://phabricator.wikimedia.org/T241132" [dns] - https://gerrit.wikimedia.org/r/533219 (https://phabricator.wikimedia.org/T133548) (owner: Vgutierrez) [12:50:39] (PS3) Minhducsun2002: Add wgLogoHD entry for hi, la and no wikibooks in IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) [12:52:29] !log depool cp2023 and cp1089 for ATS reimages T227432. 
Reimaged together because of T238817 [12:52:31] !log jforrester@deploy1001 Synchronized php-1.35.0-wmf.11/extensions/WikiLove/includes/ApiWikiLove.php: T241094 ApiWikiLove: Don't pass null to implode(), but fall back to [] (duration: 01m 02s) [12:52:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:42] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [12:52:42] T238817: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 [12:52:44] !log failover ganeti master in ulsfo to ganeti4002 for a test [12:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:48] T241094: PHP Warning: "implode(): Invalid arguments passed" from ApiWikiLove.php - https://phabricator.wikimedia.org/T241094 [12:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1136, s7 candidate master, for upgrade', diff saved to https://phabricator.wikimedia.org/P9975 and previous config saved to /var/cache/conftool/dbconfig/20191219-125314-marostegui.json [12:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:34] (CR) Jforrester: Clean up unused configs in Wikibase.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/559161 (https://phabricator.wikimedia.org/T238154) (owner: Ladsgroup) [12:54:19] (CR) Ema: [C: +2] Revert "Revert "cache: reimage cp2023 as text_ats"" [puppet] - https://gerrit.wikimedia.org/r/559359 (https://phabricator.wikimedia.org/T227432) (owner: Ema) [12:54:45] (PS2) Ema: cache: reimage cp1089 as text_ats [puppet] - https://gerrit.wikimedia.org/r/559440 (https://phabricator.wikimedia.org/T227432) [12:55:38] (CR) Ema: [C: +2] cache: reimage cp1089 as text_ats [puppet] - https://gerrit.wikimedia.org/r/559440 
(https://phabricator.wikimedia.org/T227432) (owner: Ema) [12:57:01] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) [12:57:21] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) All candidate masters in eqiad and codfw have been restarted (and upgraded) [12:57:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1136', diff saved to https://phabricator.wikimedia.org/P9976 and previous config saved to /var/cache/conftool/dbconfig/20191219-125748-marostegui.json [12:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:57] (CR) Alexandros Kosiaris: [C: +1] ganeti: allow ssh between cluster regardless of ganeti_cluster fact [puppet] - https://gerrit.wikimedia.org/r/559172 (owner: Herron) [12:58:25] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp2023.codfw.wmnet'] ` The log can be found in `/var/log/wm... [12:59:42] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1089.eqiad.wmnet'] ` The log can be found in `/var/log/wm... [12:59:44] Operations, Traffic, Patch-For-Review: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (MoritzMuehlenhoff) I did a reinstall of netflow4001 (had missed this task update and thought it was a botched install) and tested migrations/draining a node, a master failover and a r... 
[13:00:04] James_F and longma: Time to snap out of that daydream and deploy Mediawiki train - European Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191219T1300). [13:00:19] (PS1) Jforrester: all wikis to 1.35.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/559455 [13:00:21] (CR) Jforrester: [C: +2] all wikis to 1.35.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/559455 (owner: Jforrester) [13:00:59] (PS8) Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [dns] - https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) [13:01:14] (PS13) Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [13:01:27] (Merged) jenkins-bot: all wikis to 1.35.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/559455 (owner: Jforrester) [13:02:48] (PS2) Muehlenhoff: ganeti: assign ganeti300[123] role::ganeti [puppet] - https://gerrit.wikimedia.org/r/559313 (https://phabricator.wikimedia.org/T236216) (owner: Herron) [13:02:53] !log jforrester@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.35.0-wmf.11 [13:02:54] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:10] (CR) MSantos: [C: +1] maps: Increase replication frequency (1 comment) [puppet] - https://gerrit.wikimedia.org/r/559442 (https://phabricator.wikimedia.org/T239728) (owner: Mathew.onipe) [13:04:10] Train seems stable. 
[13:04:16] (CR) Arturo Borrero Gonzalez: [C: +1] cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [dns] - https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [13:07:27] (CR) Muehlenhoff: [C: +2] ganeti: assign ganeti300[123] role::ganeti [puppet] - https://gerrit.wikimedia.org/r/559313 (https://phabricator.wikimedia.org/T236216) (owner: Herron) [13:08:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1136', diff saved to https://phabricator.wikimedia.org/P9977 and previous config saved to /var/cache/conftool/dbconfig/20191219-130832-marostegui.json [13:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:49] (PS1) Ema: Revert "ATS: enable xdebug plugin on 3 hosts" [puppet] - https://gerrit.wikimedia.org/r/559457 (https://phabricator.wikimedia.org/T241001) [13:11:42] (CR) Ladsgroup: Clean up unused configs in Wikibase.php (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/559161 (https://phabricator.wikimedia.org/T238154) (owner: Ladsgroup) [13:12:52] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [13:12:54] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [13:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:39] (PS2) Ladsgroup: Clean up unused config in InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/559162 (https://phabricator.wikimedia.org/T238154) [13:15:02] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:50] !log mwscript emptyUserGroup.php --wiki=mediawikiwiki oauthadmin T241142 [13:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:56] T241142: Remove 'oauthadmin' from 
mediawiki.org accounts - https://phabricator.wikimedia.org/T241142 [13:17:12] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1136', diff saved to https://phabricator.wikimedia.org/P9978 and previous config saved to /var/cache/conftool/dbconfig/20191219-131832-marostegui.json [13:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:20] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1089.eqiad.wmnet'] ` and were **ALL** successful. [13:25:47] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2023.codfw.wmnet'] ` and were **ALL** successful. 
[13:27:24] (PS1) Muehlenhoff: Re-enable notifications for ganeti/ulsfo [puppet] - https://gerrit.wikimedia.org/r/559462 (https://phabricator.wikimedia.org/T226444) [13:33:03] !log pool cp2023 with ATS backend T227432 [13:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:10] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [13:34:01] !log pool cp1089 with ATS backend T227432 [13:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1136', diff saved to https://phabricator.wikimedia.org/P9979 and previous config saved to /var/cache/conftool/dbconfig/20191219-133525-marostegui.json [13:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:19] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/20067/" [puppet] - https://gerrit.wikimedia.org/r/559172 (owner: Herron) [13:38:45] cp1089 was the last one, varnish-be is now gone from the fleet :) [13:39:36] (CR) Phamhi: [C: +2] cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [13:39:45] ema: \o [13:39:48] \o/ [13:40:51] PROBLEM - mediawiki-installation DSH group on mw1269 is CRITICAL: Host mw1269 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [13:41:01] !log phamhi@cumin1001 START - Cookbook sre.hosts.downtime [13:41:02] !log phamhi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:28] phamhi: hi! 
LMK how the reimage goes and if you run into troubles with the new partman recipes [13:42:04] will do; thanks, godog [13:43:25] (PS14) Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [13:45:32] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [13:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:54] (CR) Phamhi: [C: +2] cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [dns] - https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [13:48:09] (CR) Jbond: "LGTM some minor comments/questions." (8 comments) [puppet] - https://gerrit.wikimedia.org/r/558620 (owner: Giuseppe Lavagetto) [13:48:56] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:34] (CR) Alexandros Kosiaris: [C: +1] "> We don't use ganeti-instance-debootstrap; we PXE-boot our regular netinst images as we do for baremetal servers." [puppet] - https://gerrit.wikimedia.org/r/559166 (https://phabricator.wikimedia.org/T226444) (owner: Herron) [13:53:38] Operations, Cloud-VPS (Debian Jessie Deprecation), Patch-For-Review, cloud-services-team (Kanban): Migrate labmon* to Buster - https://phabricator.wikimedia.org/T224585 (ops-monitoring-bot) Script wmf-auto-reimage was launched by phamhi on cumin1001.eqiad.wmnet for hosts: ` labmon1001.eqiad.wmnet... 
[13:56:21] (PS5) DCausse: [cirrus] move similarity settings to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/558576 [13:57:20] Operations, Performance-Team, Traffic: Improve ATS backend connection reuse against origin servers - https://phabricator.wikimedia.org/T241145 (Vgutierrez) [13:57:59] Operations, Wikimedia-Site-requests, Chinese-Sites, Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (WhitePhosphorus) [13:58:52] Operations, Research, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aroraakhil - https://phabricator.wikimedia.org/T241096 (Aroraakhil) [14:00:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:00:54] Operations, Research, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aroraakhil - https://phabricator.wikimedia.org/T241096 (Aroraakhil) @Aklapper and @leila added the public ssh-key in the task description. Also, the preferred shell-username is: aarora (as stated in the ta... [14:03:35] (PS1) Vgutierrez: ATS: Enable SO_KEEPALIVE and TCP_FASTOPEN for outgoing connections on ats-be [puppet] - https://gerrit.wikimedia.org/r/559478 (https://phabricator.wikimedia.org/T241145) [14:04:32] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:07:01] (CR) Vgutierrez: "pcc shows a sane output: https://puppet-compiler.wmflabs.org/compiler1002/20068/" [puppet] - https://gerrit.wikimedia.org/r/559478 (https://phabricator.wikimedia.org/T241145) (owner: Vgutierrez) [14:07:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:09:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:10:12] !log phamhi@cumin1001 START - Cookbook sre.hosts.downtime [14:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:49] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ema) [14:12:11] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ema) Open→Resolved a:ema cp2023 and cp1089 were the last two hosts running Varnish as backend cache. We now have exclusively ats-be across the fleet! 
[14:12:18] Operations, serviceops, Kubernetes, Patch-For-Review: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (Joe) [14:12:21] !log phamhi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:14:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:15:52] Operations, serviceops, Kubernetes, Patch-For-Review: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (Joe) Port reservations are for now indicated here: https://wikitech.wikimedia.org/wiki/Service_ports [14:16:53] (PS1) Elukey: Disable SASL fallback for Yarn Spark Shuffle service in Hadoop test [puppet] - https://gerrit.wikimedia.org/r/559488 (https://phabricator.wikimedia.org/T240934) [14:17:00] Operations, Traffic: Request routing to active/passive services active in codfw only stopped working - https://phabricator.wikimedia.org/T238817 (ema) Open→Resolved a:ema Having finished the transition to ATS T227432, there is no routing between cache backends anymore. 
[14:17:40] (PS1) Giuseppe Lavagetto: lvs::configuration: add cxserver-https [puppet] - https://gerrit.wikimedia.org/r/559489 (https://phabricator.wikimedia.org/T235411) [14:18:02] (CR) Elukey: [C: +2] Disable SASL fallback for Yarn Spark Shuffle service in Hadoop test [puppet] - https://gerrit.wikimedia.org/r/559488 (https://phabricator.wikimedia.org/T240934) (owner: Elukey) [14:21:22] Operations, Cloud-VPS (Debian Jessie Deprecation), cloud-services-team (Kanban): Migrate labmon* to Buster - https://phabricator.wikimedia.org/T224585 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudmetrics1001.eqiad.wmnet'] ` and were **ALL** successful. [14:22:31] Operations, Research, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (Ottomata) [14:22:44] Operations, Research, SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (Ottomata) Added researchers group too just for good measure :) [14:23:36] moritzm: we'd need to change procedures to create a kerberos principal in tasks like --^ [14:25:40] Operations, Puppet, Patch-For-Review, User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (Ottomata) Ok! As is now, cergen doesn't do anything to chmod after it creates the files, so they should be created with the default... [14:27:51] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1286.eqiad.wmnet', 'mw1269.eqiad.wmnet', 'mw2235.codfw.wmnet', 'mw2216.codfw.wmnet'] ` and were **ALL** successful. 
[14:29:36] (CR) Vgutierrez: [C: +1] lvs::configuration: add cxserver-https [puppet] - https://gerrit.wikimedia.org/r/559489 (https://phabricator.wikimedia.org/T235411) (owner: Giuseppe Lavagetto) [14:29:54] Operations, Performance-Team, Traffic, Patch-For-Review: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 (ema) >>! In T238494#5753680, @ema wrote: > There are thus two fronts to work on now: (1) increase connection reuse, and (2) decrease the c... [14:31:24] (CR) Jbond: "looks good to me but not too familiar with how all theses are used (however i see Valentin has allready +1ed)" [puppet] - https://gerrit.wikimedia.org/r/559489 (https://phabricator.wikimedia.org/T235411) (owner: Giuseppe Lavagetto) [14:31:28] (CR) Jbond: [C: +1] lvs::configuration: add cxserver-https [puppet] - https://gerrit.wikimedia.org/r/559489 (https://phabricator.wikimedia.org/T235411) (owner: Giuseppe Lavagetto) [14:32:36] (CR) Giuseppe Lavagetto: [C: +2] lvs::configuration: add cxserver-https [puppet] - https://gerrit.wikimedia.org/r/559489 (https://phabricator.wikimedia.org/T235411) (owner: Giuseppe Lavagetto) [14:35:18] (CR) Ema: [C: +1] trafficserver::backend: switch to https for cxserver [puppet] - https://gerrit.wikimedia.org/r/559495 (https://phabricator.wikimedia.org/T235411) (owner: Giuseppe Lavagetto) [14:36:36] (CR) Jhedden: "This change is ready for review." 
[dns] - https://gerrit.wikimedia.org/r/558707 (https://phabricator.wikimedia.org/T240715) (owner: Jhedden) [14:36:38] (CR) Ema: [C: +2] Revert "ATS: enable xdebug plugin on 3 hosts" [puppet] - https://gerrit.wikimedia.org/r/559457 (https://phabricator.wikimedia.org/T241001) (owner: Ema) [14:37:00] (PS1) Lens0021: Have ExtensionDistributor treat REL1_34 as stable [mediawiki-config] - https://gerrit.wikimedia.org/r/559502 [14:37:23] (CR) CDanis: "driveby comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/559489 (https://phabricator.wikimedia.org/T235411) (owner: Giuseppe Lavagetto) [14:38:03] (CR) Herron: "> > We don't use ganeti-instance-debootstrap; we PXE-boot our regular" [puppet] - https://gerrit.wikimedia.org/r/559166 (https://phabricator.wikimedia.org/T226444) (owner: Herron) [14:39:38] (CR) Muehlenhoff: [C: +1] ganeti: ensure package ganeti-instance-debootstrap installed [puppet] - https://gerrit.wikimedia.org/r/559166 (https://phabricator.wikimedia.org/T226444) (owner: Herron) [14:40:46] <_joe_> !log restarting pybal on the backup low-traffic in eqiad,codfw [14:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:10] !log cp1075, cp4028: ats-backend-restart to disable xdebug plugin T241001 [14:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:15] T241001: cp3050 depooled due to explosion in CPU usage and inuse sockets - https://phabricator.wikimedia.org/T241001 [14:41:24] (CR) Herron: [C: +2] ganeti: ensure package ganeti-instance-debootstrap installed [puppet] - https://gerrit.wikimedia.org/r/559166 (https://phabricator.wikimedia.org/T226444) (owner: Herron) [14:42:32] (PS1) Arturo Borrero Gonzalez: toolforge: new k8s: add kube-state-metrics.yaml [puppet] - https://gerrit.wikimedia.org/r/559506 (https://phabricator.wikimedia.org/T237643) [14:45:39] (PS2) Herron: add misc cluster to eqsin and ulsfo 
[puppet] - 10https://gerrit.wikimedia.org/r/559330 (https://phabricator.wikimedia.org/T226444) [14:45:41] (03CR) 10Reedy: [C: 03+2] Have ExtensionDistributor treat REL1_34 as stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559502 (owner: 10Lens0021) [14:46:45] (03Merged) 10jenkins-bot: Have ExtensionDistributor treat REL1_34 as stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559502 (owner: 10Lens0021) [14:47:22] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) (owner: 10Minhducsun2002) [14:47:33] <_joe_> !log restart pybal on the primary low-traffic balancers in eqiad, codfw [14:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:55] 10Operations, 10Diffusion, 10Packaging, 10Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10zeljkofilipin) Thanks @Dzahn! [14:48:39] (03CR) 10jerkins-bot: [V: 04-1] Add wgLogoHD entry for hi, la and no wikibooks in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) (owner: 10Minhducsun2002) [14:49:13] (03CR) 10Herron: [C: 03+2] add misc cluster to eqsin and ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/559330 (https://phabricator.wikimedia.org/T226444) (owner: 10Herron) [14:49:38] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 51 connections established with conf2001.codfw.wmnet:2379 (min=52) https://wikitech.wikimedia.org/wiki/PyBal [14:50:08] _joe_: race condition? 
^^^ [14:50:18] <_joe_> uhm strange [14:50:34] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Mark REL1_34 stable in ExtensionDistributor (duration: 00m 53s) [14:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:46] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.18:4002]) https://wikitech.wikimedia.org/wiki/PyBal [14:50:52] <_joe_> volans: possibly, yes [14:51:00] <_joe_> oh sigh I see [14:51:01] (03PS2) 10Arturo Borrero Gonzalez: toolforge: new k8s: add kube-state-metrics.yaml [puppet] - 10https://gerrit.wikimedia.org/r/559506 (https://phabricator.wikimedia.org/T237643) [14:51:10] <_joe_> I'm an idiot, i did puppet agent -tv [14:51:18] <_joe_> with an && afterwards [14:51:25] <_joe_> and puppet-sigh [14:51:58] (03PS3) 10Arturo Borrero Gonzalez: toolforge: new k8s: add kube-state-metrics.yaml [puppet] - 10https://gerrit.wikimedia.org/r/559506 (https://phabricator.wikimedia.org/T237643) [14:52:00] on the bright side, we've confirmed that the icinga checks do the right thing :) [14:52:04] you wrote run-puppet-agent :D [14:52:10] old age :) [14:52:32] <_joe_> ema: only if they now recover [14:53:04] (03CR) 10Giuseppe Lavagetto: lvs::configuration: add cxserver-https (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/559489 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [14:54:26] (03PS1) 10Muehlenhoff: Readd the late-install hack until WMCS switched to Puppet 5 / Facter 3 as well [puppet] - 10https://gerrit.wikimedia.org/r/559509 (https://phabricator.wikimedia.org/T239832) [14:54:54] (03PS2) 10Muehlenhoff: Re-enable notifications for ganeti/ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/559462 (https://phabricator.wikimedia.org/T226444) [14:55:10] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:55:24] RECOVERY - 
PyBal connections to etcd on lvs2006 is OK: OK: 52 connections established with conf2001.codfw.wmnet:2379 (min=52) https://wikitech.wikimedia.org/wiki/PyBal [14:55:59] yuhu! [14:56:26] (03CR) 10Muehlenhoff: [C: 03+2] Re-enable notifications for ganeti/ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/559462 (https://phabricator.wikimedia.org/T226444) (owner: 10Muehlenhoff) [14:58:08] 10Operations, 10Research, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (10leila) Excellent. this task is ready on my end to be picked up by SRE-Access-Request. I remove myself as assignee. [14:58:09] (03PS3) 10Herron: ganeti: apply ferm regardless of ganeti_cluster fact [puppet] - 10https://gerrit.wikimedia.org/r/559172 [14:58:17] 10Operations, 10Research, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (10leila) a:05leila→03None [15:02:47] (03CR) 10Giuseppe Lavagetto: [C: 03+2] trafficserver::backend: switch to https for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/559495 (https://phabricator.wikimedia.org/T235411) (owner: 10Giuseppe Lavagetto) [15:03:03] <_joe_> ema: merging ^^ [15:03:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/559324 (https://phabricator.wikimedia.org/T236216) (owner: 10Herron) [15:04:14] _joe_: ack [15:04:38] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/20069/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/559172 (owner: 10Herron) [15:05:33] (03PS2) 10Herron: add ganeti01.svc.esams.wmnet forward/reverse ipv4 records [dns] - 10https://gerrit.wikimedia.org/r/559324 (https://phabricator.wikimedia.org/T236216) [15:06:53] (03CR) 10Herron: [C: 03+2] add ganeti01.svc.esams.wmnet forward/reverse ipv4 records [dns] - 10https://gerrit.wikimedia.org/r/559324 
(https://phabricator.wikimedia.org/T236216) (owner: 10Herron) [15:08:33] (03PS4) 10Bstorm: toolforge-k8s: add a script to grant "observer" access to a tool [puppet] - 10https://gerrit.wikimedia.org/r/559212 (https://phabricator.wikimedia.org/T233372) [15:09:57] 10Operations, 10Research, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (10elukey) I added some info to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups to remember about requesting a K... [15:12:10] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/559172 (owner: 10Herron) [15:13:04] 10Operations, 10netops, 10observability: Determine & implement near-term method for escalating network alerts - https://phabricator.wikimedia.org/T237587 (10CDanis) a:03CDanis [15:13:10] 10Operations, 10Traffic, 10netops, 10observability: Network port utilization alerts should be paging - https://phabricator.wikimedia.org/T224888 (10CDanis) a:03CDanis [15:13:49] (03CR) 10Herron: [C: 03+2] ganeti: apply ferm regardless of ganeti_cluster fact [puppet] - 10https://gerrit.wikimedia.org/r/559172 (owner: 10Herron) [15:17:01] PROBLEM - Check systemd state on ganeti4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:31] alerts recently re-enabled for that host, looking [15:18:50] oh joy, it’s the networking unit. 
going to downtime and fix [15:20:57] this error rings a bell, digging [15:21:45] it was a stale error, just needed a restart, but wanted to downtime just in case [15:22:13] RECOVERY - Check systemd state on ganeti4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:33] "Error: argument "private" is wrong: dev is invalid" from a service restart with a mistake in /etc/network/interfaces, but I ran ifup on the device after that [15:23:47] herron: see https://phabricator.wikimedia.org/T233906 [15:24:46] ah, yeah I experienced that as well and updated from pre-up to up on the ulsfo hosts [15:25:07] and actually wanted to talk some about configuring the bridge(s) at OS install time [15:28:07] PROBLEM - Host netflow4001 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:58] we already have VMs on the new cluster? :) [15:29:43] RECOVERY - Host netflow4001 is UP: PING OK - Packet loss = 0%, RTA = 74.77 ms [15:30:05] yes, netflow4001 is the first [15:30:10] awesome [15:30:16] (03CR) 10Giuseppe Lavagetto: "This needs to run as a user that can access both debmonitor and docker, so yes most likely root. I plan to add, later, checks for that." 
[docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/559165 (owner: 10Giuseppe Lavagetto) [15:32:05] yeah, forgot to downtime netflow4001, I tested a gnt-instance reboot on it [15:33:18] and esams is not too far behind, working on init that cluster and building a netflow VM there today [15:33:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:33:28] herron: the error is in fact stale from the initial boot, but later on bypassed by the workaround applied in Puppet, I'll reboot ganeti4001 to confirm that after the next reboot all is fine [15:33:40] sounds good! [15:34:00] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:34:01] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:55] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:41:36] herron: in fact all fine after a reboot [15:42:09] RECOVERY - mediawiki-installation DSH group on mw1269 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:43:39] jouncebot: now [15:43:39] No deployments scheduled for the next 1 hour(s) and 16 minute(s) [15:43:43] jouncebot: nex [15:43:44] jouncebot: next [15:43:44] In 1 hour(s) and 16 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191219T1700) [15:44:08] addshore: Want to do something fun? [15:44:25] got a fix for this one https://phabricator.wikimedia.org/T237984, but will probably just do it in morning swat [15:44:42] could do it earlier if there is free time, but right now still going through some review and merging motions [15:45:04] (03PS1) 10Elukey: Sync Yarn and Spark2 encryption config in Hadoop Test [puppet] - 10https://gerrit.wikimedia.org/r/559517 (https://phabricator.wikimedia.org/T240934) [15:46:17] (03PS2) 10Elukey: Sync Yarn and Spark2 encryption config in Hadoop Test [puppet] - 10https://gerrit.wikimedia.org/r/559517 (https://phabricator.wikimedia.org/T240934) [15:46:39] ottomata: --^ [15:47:47] you want to set that in prod cluster? [15:47:49] OH [15:47:55] nm [15:48:09] that is test cluster [15:48:13] right cool [15:48:14] yes yes [15:48:25] elukey: maybe we don't need to set some of the things that are already defaults in Spark? [15:48:53] ottomata: probably not but I wanted to be explicit just in case [15:48:57] keyLength, keyFactoryAlgorithm, [15:48:57] >? 
[15:48:58] ok [15:49:09] (03CR) 10Ottomata: [C: 03+1] Sync Yarn and Spark2 encryption config in Hadoop Test [puppet] - 10https://gerrit.wikimedia.org/r/559517 (https://phabricator.wikimedia.org/T240934) (owner: 10Elukey) [15:49:15] (03CR) 10Elukey: [C: 03+2] Sync Yarn and Spark2 encryption config in Hadoop Test [puppet] - 10https://gerrit.wikimedia.org/r/559517 (https://phabricator.wikimedia.org/T240934) (owner: 10Elukey) [15:49:56] (03CR) 10Mforns: analytics::search::jobs.pp: Move last deletion timers to drop-older-than (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539094 (https://phabricator.wikimedia.org/T204735) (owner: 10Mforns) [15:51:41] (03CR) 10Mforns: "Checked and this one can also be merged :]" [puppet] - 10https://gerrit.wikimedia.org/r/539151 (https://phabricator.wikimedia.org/T208612) (owner: 10Mforns) [15:53:05] (03CR) 10Bstorm: "Ok, I updated the name and have a paste of a script that can be used by any tool account to set the -obs service account." [puppet] - 10https://gerrit.wikimedia.org/r/559212 (https://phabricator.wikimedia.org/T233372) (owner: 10Bstorm) [15:53:53] !log enable netflow sampling in ulsfo [15:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:03] (03PS5) 10Bstorm: toolforge-k8s: add a script to grant "observer" access to a tool [puppet] - 10https://gerrit.wikimedia.org/r/559212 (https://phabricator.wikimedia.org/T233372) [15:54:33] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954 (10colewhite) Great idea. Lets raise it at the next SRE meeting. 
[15:55:48] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: add a script to grant "observer" access to a tool [puppet] - 10https://gerrit.wikimedia.org/r/559212 (https://phabricator.wikimedia.org/T233372) (owner: 10Bstorm) [15:56:01] (03CR) 10BryanDavis: toolforge-k8s: add a script to grant "observer" access to a tool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/559212 (https://phabricator.wikimedia.org/T233372) (owner: 10Bstorm) [15:56:06] (03PS6) 10Bstorm: toolforge-k8s: add a script to grant "observer" access to a tool [puppet] - 10https://gerrit.wikimedia.org/r/559212 (https://phabricator.wikimedia.org/T233372) [15:58:19] (03CR) 10Herron: [C: 03+2] "this didn't work as expected. I realize now that the fact is used also in the hiera lookup" [puppet] - 10https://gerrit.wikimedia.org/r/559172 (owner: 10Herron) [16:00:07] (03PS4) 10Mforns: analytics::refinery::job::data_purge: Add timer to delete old MWH dumps [puppet] - 10https://gerrit.wikimedia.org/r/539151 (https://phabricator.wikimedia.org/T208612) [16:00:13] (03PS5) 10Ottomata: Remove all references to Wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: 10Mforns) [16:00:21] (03PS2) 10Mforns: analytics::search::jobs.pp: Move last deletion timers to drop-older-than [puppet] - 10https://gerrit.wikimedia.org/r/539094 (https://phabricator.wikimedia.org/T204735) [16:00:33] (03PS3) 10Mforns: analytics::refinery::job::data_purge: Add growth deletion timers [puppet] - 10https://gerrit.wikimedia.org/r/556232 (https://phabricator.wikimedia.org/T237124) [16:00:55] (03PS1) 10Muehlenhoff: Add a comment on ganeti-instance-debootstrap [puppet] - 10https://gerrit.wikimedia.org/r/559524 [16:00:56] (03PS6) 10Mforns: Remove all references to Wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) [16:01:03] (03PS1) 10Ayounsi: Enable netflow sampling in ulsfo [homer/public] - 
10https://gerrit.wikimedia.org/r/559525 [16:01:07] (03PS4) 10Minhducsun2002: Add wgLogoHD entry for hi, la and no wikibooks in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) [16:03:16] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Buster - https://phabricator.wikimedia.org/T224585 (10Phamhi) [16:04:10] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Enable netflow sampling in ulsfo [homer/public] - 10https://gerrit.wikimedia.org/r/559525 (owner: 10Ayounsi) [16:04:43] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [16:04:48] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Buster - https://phabricator.wikimedia.org/T224585 (10Phamhi) [16:04:49] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on netflow4001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {ssbd, md_clear} https://wikitech.wikimedia.org/wiki/Microcode [16:05:16] (03CR) 10Ottomata: [C: 03+2] Remove all references to Wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: 10Mforns) [16:05:38] (03PS1) 10Herron: Revert "ganeti: apply ferm regardless of ganeti_cluster fact" [puppet] - 10https://gerrit.wikimedia.org/r/559527 [16:06:26] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: add a script to grant "observer" access to a tool [puppet] - 10https://gerrit.wikimedia.org/r/559212 (https://phabricator.wikimedia.org/T233372) (owner: 10Bstorm) [16:06:57] (03CR) 10Ottomata: "Oh, Luca is right, we do need to remove metrics.wm.org routing from varnish." 
[puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: 10Mforns) [16:07:41] (03CR) 10jerkins-bot: [V: 04-1] Revert "ganeti: apply ferm regardless of ganeti_cluster fact" [puppet] - 10https://gerrit.wikimedia.org/r/559527 (owner: 10Herron) [16:08:20] (03PS2) 10Herron: Revert "ganeti: apply ferm regardless of ganeti_cluster fact" [puppet] - 10https://gerrit.wikimedia.org/r/559527 [16:08:36] (03PS1) 10Ottomata: Remove metrics.wm.org (wikimetrics is gone) [puppet] - 10https://gerrit.wikimedia.org/r/559528 (https://phabricator.wikimedia.org/T211835) [16:10:39] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [16:10:40] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:12] (03CR) 10BBlack: [C: 03+1] Remove metrics.wm.org (wikimetrics is gone) [puppet] - 10https://gerrit.wikimedia.org/r/559528 (https://phabricator.wikimedia.org/T211835) (owner: 10Ottomata) [16:12:53] (03CR) 10Ottomata: [C: 03+2] Remove metrics.wm.org (wikimetrics is gone) [puppet] - 10https://gerrit.wikimedia.org/r/559528 (https://phabricator.wikimedia.org/T211835) (owner: 10Ottomata) [16:15:46] (03CR) 10Herron: [C: 03+2] Revert "ganeti: apply ferm regardless of ganeti_cluster fact" [puppet] - 10https://gerrit.wikimedia.org/r/559527 (owner: 10Herron) [16:17:23] (03CR) 10Ammarpad: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) (owner: 10Minhducsun2002) [16:21:38] (03PS1) 10Herron: add netflow3001 forward/reverse ipv4 records [dns] - 10https://gerrit.wikimedia.org/r/559531 [16:21:52] !log Processed up to page 36567013 (P4152) [16:21:54] bah [16:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:01] !log addshore@mwmaint1002:~$ mwscript 
extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --batch-size=2 --from-id=36185524 # T237984 (For `P4155`, then will stop) [16:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:07] T237984: Some property labels are not displayed on Item pages - https://phabricator.wikimedia.org/T237984 [16:23:48] (03PS5) 10Effie Mouzeli: mediawiki::php::admin memory optimisation for lib.php [puppet] - 10https://gerrit.wikimedia.org/r/558158 (https://phabricator.wikimedia.org/T240824) [16:25:15] 10Operations, 10OCG-General, 10Readers-Community-Engagement, 10Core Platform Team Legacy (Watching / External), and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871 (10Jdlrobson) [16:29:57] (03PS1) 10Ottomata: Remove metrics.wm.org [dns] - 10https://gerrit.wikimedia.org/r/559533 (https://phabricator.wikimedia.org/T211835) [16:30:34] James_F: my "fun" thing is nearly ready :) https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/559529/ [16:30:47] (03CR) 10Ottomata: [C: 03+2] Remove metrics.wm.org [dns] - 10https://gerrit.wikimedia.org/r/559533 (https://phabricator.wikimedia.org/T211835) (owner: 10Ottomata) [16:31:11] (03CR) 10Ayounsi: [C: 03+1] add netflow3001 forward/reverse ipv4 records [dns] - 10https://gerrit.wikimedia.org/r/559531 (owner: 10Herron) [16:31:46] And if no one is currently deploying or touching MediaWiki things, I would like to deploy it in the 30 minutes before Puppet SWAT. [16:32:21] (03CR) 10Herron: [C: 03+2] add netflow3001 forward/reverse ipv4 records [dns] - 10https://gerrit.wikimedia.org/r/559531 (owner: 10Herron) [16:32:24] (03PS2) 10Herron: add netflow3001 forward/reverse ipv4 records [dns] - 10https://gerrit.wikimedia.org/r/559531 [16:34:59] addshore: Go for it. 
[16:35:11] :) few more mins in CI [16:36:38] (03PS1) 10Phamhi: cloudvps: cleanup labmon1001 dns records [dns] - 10https://gerrit.wikimedia.org/r/559535 (https://phabricator.wikimedia.org/T224585) [16:38:14] (03PS2) 10Phamhi: cloudvps: cleanup labmon1001 dns records [dns] - 10https://gerrit.wikimedia.org/r/559535 (https://phabricator.wikimedia.org/T224585) [16:39:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] cloudvps: cleanup labmon1001 dns records [dns] - 10https://gerrit.wikimedia.org/r/559535 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [16:39:35] (03CR) 10Phamhi: [C: 03+2] cloudvps: cleanup labmon1001 dns records [dns] - 10https://gerrit.wikimedia.org/r/559535 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [16:42:06] (03PS1) 10Jbond: service definitions: add custome type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 [16:42:08] (03PS1) 10Jbond: service_definitions: add defined ports [puppet] - 10https://gerrit.wikimedia.org/r/559537 [16:42:43] (03CR) 10jerkins-bot: [V: 04-1] service definitions: add custome type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 (owner: 10Jbond) [16:43:01] (03CR) 10jerkins-bot: [V: 04-1] service_definitions: add defined ports [puppet] - 10https://gerrit.wikimedia.org/r/559537 (owner: 10Jbond) [16:44:03] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Buster - https://phabricator.wikimedia.org/T224585 (10Phamhi) [16:52:33] 10Operations, 10Core Platform Team, 10TechCom, 10User-mobrovac: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825 (10Eevans) >>! In T122825#5752420, @Joe wrote: > I think most of the issues described here have been in the meantime solved by the implementation of the [[ https://w... 
[16:52:39] (03PS2) 10Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 [16:53:37] (03CR) 10jerkins-bot: [V: 04-1] service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 (owner: 10Jbond) [16:53:59] !log addshore@deploy1001 Synchronized php-1.35.0-wmf.11/extensions/Wikibase/lib/includes/Store/Sql/Terms/DatabaseTermIdsCleaner.php: T237984 [[gerrit:559529]] Fix incorrect deletion of rows in DatabaseTermIdsCleaner (duration: 00m 56s) [16:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:06] T237984: Some property labels are not displayed on Item pages - https://phabricator.wikimedia.org/T237984 [16:55:04] well, that should be it [16:55:30] (03PS1) 10Herron: install_server: add netflow3001 dhcp entry [puppet] - 10https://gerrit.wikimedia.org/r/559540 [16:56:30] * James_F crosses his fingers for addshore. [16:56:37] :) [16:57:16] (03CR) 10Herron: [C: 03+2] install_server: add netflow3001 dhcp entry [puppet] - 10https://gerrit.wikimedia.org/r/559540 (owner: 10Herron) [16:57:35] !log addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --batch-size=10 # T237984, Full pass (33 rows missing currently) [16:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:46] (03PS3) 10Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 [16:58:05] (03PS11) 10CRusnov: netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [16:58:39] (03CR) 10jerkins-bot: [V: 04-1] service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 (owner: 10Jbond) [17:00:04] (03CR) 10jerkins-bot: [V: 04-1] netbox: Add automation 
git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [17:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191219T1700). Please do the needful. [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:44] (03PS4) 10Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 [17:03:13] (03PS2) 10Jbond: service_definitions: add defined ports [puppet] - 10https://gerrit.wikimedia.org/r/559537 [17:04:27] (03PS12) 10CRusnov: netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [17:05:15] (03CR) 10Dzahn: [C: 03+1] Add ng.wikimedia.org as chapter site [puppet] - 10https://gerrit.wikimedia.org/r/559073 (https://phabricator.wikimedia.org/T240771) (owner: 10IAmNetx) [17:05:26] (03CR) 10jerkins-bot: [V: 04-1] service_definitions: add defined ports [puppet] - 10https://gerrit.wikimedia.org/r/559537 (owner: 10Jbond) [17:06:31] (03CR) 10jerkins-bot: [V: 04-1] netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [17:07:49] (03PS13) 10CRusnov: netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [17:09:07] (03PS3) 10Jbond: service_definitions: add defined ports [puppet] - 10https://gerrit.wikimedia.org/r/559537 [17:09:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add a comment on ganeti-instance-debootstrap [puppet] - 10https://gerrit.wikimedia.org/r/559524 (owner: 10Muehlenhoff) [17:11:01] (03PS4) 10Jbond: service_definitions: add defined ports [puppet] - 10https://gerrit.wikimedia.org/r/559537 [17:12:04] (03PS5) 10Jbond: service definitions: add custom type/provider to 
manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 [17:13:40] (03PS6) 10Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 [17:14:57] (03CR) 10Krinkle: Variant configuration: Replace symfony/yaml with spyc (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554967 (owner: 10Jforrester) [17:16:20] (03PS14) 10CRusnov: netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [17:16:37] (03PS1) 10Filippo Giunchedi: install_server: remove unused raid1-30G.cfg [puppet] - 10https://gerrit.wikimedia.org/r/559547 (https://phabricator.wikimedia.org/T156955) [17:16:39] (03PS1) 10Filippo Giunchedi: install_server: use raid10-8dev standard recipe [puppet] - 10https://gerrit.wikimedia.org/r/559548 (https://phabricator.wikimedia.org/T156955) [17:16:41] (03PS1) 10Filippo Giunchedi: install_server: use raid10-6dev standard recipe [puppet] - 10https://gerrit.wikimedia.org/r/559549 (https://phabricator.wikimedia.org/T156955) [17:16:43] (03PS1) 10Filippo Giunchedi: install_server: deprecate raid10-gpt.cfg [puppet] - 10https://gerrit.wikimedia.org/r/559550 (https://phabricator.wikimedia.org/T156955) [17:16:45] (03PS1) 10Filippo Giunchedi: install_server: deprecate raid10-gpt-srv-ext4.cfg [puppet] - 10https://gerrit.wikimedia.org/r/559551 (https://phabricator.wikimedia.org/T156955) [17:16:47] (03PS1) 10Filippo Giunchedi: install_server: deprecate raid10-gpt-srv-lvm-xfs.cfg [puppet] - 10https://gerrit.wikimedia.org/r/559552 (https://phabricator.wikimedia.org/T156955) [17:16:49] (03PS1) 10Filippo Giunchedi: install_server: deprecate raid10-gpt-srv-lvm-ext4.cfg [puppet] - 10https://gerrit.wikimedia.org/r/559553 (https://phabricator.wikimedia.org/T156955) [17:17:01] sorry for the spam! 
[17:20:46] (03PS5) 10Ammarpad: Add wgLogoHD entry for hi, la and no wikibooks in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) (owner: 10Minhducsun2002) [17:21:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] toolforge: Add CORS header to docker registry [puppet] - 10https://gerrit.wikimedia.org/r/558220 (https://phabricator.wikimedia.org/T232135) (owner: 10BryanDavis) [17:21:45] (03CR) 10Krinkle: "The lib looks a bit scary imho, also virtually no documentation and all classes below each other in a single file. It does both read, pars" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554967 (owner: 10Jforrester) [17:23:45] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: Relabel labmon1001.eqiad.wmnet to cloudmetrics1001eqiad.wmnet and labmon1002.eqiad.wmnet to cloudmetrics1002eqiad.wmnet - https://phabricator.wikimedia.org/T241155 (10Phamhi) [17:23:55] (03PS15) 10CRusnov: netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [17:25:09] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10cloud-services-team (Kanban): Migrate labmon* to Buster - https://phabricator.wikimedia.org/T224585 (10Phamhi) [17:25:38] (03PS7) 10Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 [17:25:54] (03PS8) 10Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 [17:26:37] (03PS9) 10Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 [17:28:37] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:28:43] 
(03CR) 10Reedy: "We already have this in vendor for OpenStackManager historically, but also for translate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554967 (owner: 10Jforrester) [17:30:25] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:32:21] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:32:23] (03PS16) 10CRusnov: netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [17:33:11] (03CR) 10Jforrester: "> Patch Set 6:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554967 (owner: 10Jforrester) [17:34:07] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:36:01] (03PS17) 10CRusnov: netbox: Add automation git machinery [puppet] - 10https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [17:38:03] !log Running `mwscript deleteTag.php --wiki=testwiki --batch-size 100 HHVM` for T75181 final testing [17:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:09] T75181: Remove HHVM and PHP7 revision tags - https://phabricator.wikimedia.org/T75181 [17:39:12] (03CR) 10Bstorm: "kubectl and other tools using the same go libraries can automatically mount service account credentials as though it is a kubeconfig...and" [puppet] - 10https://gerrit.wikimedia.org/r/559506 
(https://phabricator.wikimedia.org/T237643) (owner: Arturo Borrero Gonzalez) [17:44:17] Operations, Puppet, User-jbond: puppet: Custom type providers - https://phabricator.wikimedia.org/T241160 (jbond) [17:44:52] (PS10) Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) [17:44:58] (PS5) Jbond: service_definitions: add defined ports [puppet] - https://gerrit.wikimedia.org/r/559537 (https://phabricator.wikimedia.org/T241160) [17:49:12] (PS1) Mholloway: MachineVision: Remove new upload labeling job delay [mediawiki-config] - https://gerrit.wikimedia.org/r/559559 (https://phabricator.wikimedia.org/T241072) [17:51:40] Operations, Puppet, Patch-For-Review, User-jbond: puppet: Custom type providers - https://phabricator.wikimedia.org/T241160 (jbond) [17:59:34] (CR) Mholloway: [C: +2] MachineVision: Remove new upload labeling job delay [mediawiki-config] - https://gerrit.wikimedia.org/r/559559 (https://phabricator.wikimedia.org/T241072) (owner: Mholloway) [18:00:04] cscott, arlolra, subbu, halfak, and accraze: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191219T1800). 
[18:00:43] (Merged) jenkins-bot: MachineVision: Remove new upload labeling job delay [mediawiki-config] - https://gerrit.wikimedia.org/r/559559 (https://phabricator.wikimedia.org/T241072) (owner: Mholloway) [18:02:55] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: MachineVision: Remove new upload labeling job delay (duration: 00m 57s) [18:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:35] !log Running `foreachwiki maintenance/deleteTag.php --batch-size 500 HHVM` on mwmaint1002 for T75181 [18:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:41] T75181: Remove HHVM and PHP7 revision tags - https://phabricator.wikimedia.org/T75181 [18:11:45] (PS18) CRusnov: netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [18:16:44] !log mforns@deploy1001 Started deploy [analytics/refinery@e7200d2]: deploying analytics-refinery together with refinery-source v0.0.109 [18:16:45] Operations, ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (herron) The esams ganeti cluster is now up and running, and netflow3001 has been created there as a first VM. After @MoritzMuehlenhoff has a chance to double check that all looks good, and we've re-enab... 
[18:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:32] (CR) CRusnov: "PCC output: https://puppet-compiler.wmflabs.org/compiler1002/20078/" [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov) [18:24:41] !log mforns@deploy1001 Finished deploy [analytics/refinery@e7200d2]: deploying analytics-refinery together with refinery-source v0.0.109 (duration: 07m 58s) [18:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:36] !log mforns@deploy1001 Started deploy [analytics/refinery@e7200d2] (thin): deploying analytics-refinery together with refinery-source v0.0.109 (thin) [18:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:42] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:43] !log mforns@deploy1001 Finished deploy [analytics/refinery@e7200d2] (thin): deploying analytics-refinery together with refinery-source v0.0.109 (thin) (duration: 00m 06s) [18:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:34] (CR) Ammarpad: [C: +1] "Looks good." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/559347 (https://phabricator.wikimedia.org/T150618) (owner: Minhducsun2002) [18:36:24] (PS2) Gehel: scap: point discovery analytics at proper repository [puppet] - https://gerrit.wikimedia.org/r/558695 (owner: EBernhardson) [18:37:35] (CR) Gehel: [C: +2] scap: point discovery analytics at proper repository [puppet] - https://gerrit.wikimedia.org/r/558695 (owner: EBernhardson) [18:41:58] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on netflow3001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {md_clear, ssbd} https://wikitech.wikimedia.org/wiki/Microcode [18:43:59] (PS2) EBernhardson: airflow: Enable kerberos configuration [puppet] - https://gerrit.wikimedia.org/r/558687 (https://phabricator.wikimedia.org/T236180) [18:44:50] (CR) BryanDavis: [C: +1] "Should be safe to deploy now that Ica4f8c338abc2116560aeea6ec72703c937ceeed is done" [puppet] - https://gerrit.wikimedia.org/r/554041 (https://phabricator.wikimedia.org/T238641) (owner: Arturo Borrero Gonzalez) [18:48:25] (CR) EBernhardson: airflow: Enable kerberos configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/558687 (https://phabricator.wikimedia.org/T236180) (owner: EBernhardson) [18:48:34] (PS3) EBernhardson: airflow: Enable kerberos configuration [puppet] - https://gerrit.wikimedia.org/r/558687 (https://phabricator.wikimedia.org/T236180) [18:53:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:54:54] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:00:02] (CR) Effie Mouzeli: "> LGTM; did you test it?" [puppet] - https://gerrit.wikimedia.org/r/558158 (https://phabricator.wikimedia.org/T240824) (owner: Effie Mouzeli) [19:00:04] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191219T1900). [19:00:04] urandom: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:31] o/ [19:02:46] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:03:32] ^ that is me [19:03:55] Operations, netbox: Sync new ganeti clusters with netbox - https://phabricator.wikimedia.org/T241166 (herron) p:Triage→Normal [19:04:32] Operations, netbox: Sync new ganeti clusters with netbox - https://phabricator.wikimedia.org/T241166 (herron) esams and ulsfo are online now, and eqsin should be shortly. Not sure if it's best to do all at once, or per-site, but wanted to get a task created to keep tabs on it. 
[19:04:43] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@99a25c0]: glent: Remove unused esbulk cli parameters [19:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:03] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@99a25c0]: glent: Remove unused esbulk cli parameters (duration: 01m 20s) [19:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:34] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:10:34] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:11:21] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (jijiki) [19:11:42] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (jijiki) Open→Resolved a:jijiki FIN! 
[19:11:46] Operations, serviceops, HHVM, MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (jijiki) [19:12:19] Operations, serviceops, HHVM, MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), and 2 others: Remove HHVM from production - https://phabricator.wikimedia.org/T229792 (jijiki) [19:20:40] jouncebot: now [19:20:41] For the next 0 hour(s) and 39 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191219T1900) [19:22:40] Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (colewhite) [19:24:28] (CR) Bstorm: [C: -1] "I edited this in place in toolsbeta and found that you can simply remove all CLI args from the command and the kubeconf volume mount. It " [puppet] - https://gerrit.wikimedia.org/r/559506 (https://phabricator.wikimedia.org/T237643) (owner: Arturo Borrero Gonzalez) [19:26:50] (CR) Gehel: maps: Increase replication frequency (1 comment) [puppet] - https://gerrit.wikimedia.org/r/559442 (https://phabricator.wikimedia.org/T239728) (owner: Mathew.onipe) [19:27:51] (CR) Cwhite: "> Patch Set 2: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/558732 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite) [19:27:56] (PS3) Gehel: maps: Increase replication frequency [puppet] - https://gerrit.wikimedia.org/r/559442 (https://phabricator.wikimedia.org/T239728) (owner: Mathew.onipe) [19:28:33] (CR) Bstorm: [C: -1] "I'd say we could use it on the current iteration after we remove those pieces. We could even leave it in kube-system for the time being i" [puppet] - https://gerrit.wikimedia.org/r/559506 (https://phabricator.wikimedia.org/T237643) (owner: Arturo Borrero Gonzalez) [19:28:34] No SWAT? 
[19:29:59] (CR) Gehel: [C: +2] maps: Increase replication frequency (1 comment) [puppet] - https://gerrit.wikimedia.org/r/559442 (https://phabricator.wikimedia.org/T239728) (owner: Mathew.onipe) [19:32:22] urandom: seems no swatter arrived [19:32:27] i can swat your patches if you want :) [19:32:50] Urbanecm: it only affects deployment-prep [19:33:08] Urbanecm: if you could, that would be awesome [19:33:20] (CR) Urbanecm: [C: +2] deployment-prep: Create new test instances of Kask [mediawiki-config] - https://gerrit.wikimedia.org/r/559279 (https://phabricator.wikimedia.org/T218609) (owner: Eevans) [19:33:24] sure :) [19:33:30] I need to shutdown some VMs, and can't do so until after that's deployed [19:33:40] end of year housekeeping :) [19:33:44] i see [19:34:12] (Merged) jenkins-bot: deployment-prep: Create new test instances of Kask [mediawiki-config] - https://gerrit.wikimedia.org/r/559279 (https://phabricator.wikimedia.org/T218609) (owner: Eevans) [19:34:28] urandom: here you are! [19:34:48] Urbanecm: that's it? [19:34:51] Urbanecm: thanks! [19:35:23] urandom: yes, deployment-prep-only patches can just be +2'ed (and fetched at deploy1001 to not confuse others) - they arrive at beta in the same way code changes do [19:35:47] Urbanecm: auh, OK, that's good to know [19:39:29] (PS1) Ammarpad: Add sandboxlink for eswikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/559576 (https://phabricator.wikimedia.org/T241163) [19:48:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:50:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:50:48] ^ is that graph always so periodic? [19:53:20] iirc no [19:55:49] (PS19) CRusnov: netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [19:58:19] (CR) CRusnov: "Updated https://puppet-compiler.wmflabs.org/compiler1001/20079/" [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov) [20:06:18] (PS1) Mathew.onipe: maps: Use correct puppet cron syntax [puppet] - https://gerrit.wikimedia.org/r/559581 (https://phabricator.wikimedia.org/T239728) [20:08:49] (CR) Mathew.onipe: "PCC is happy: https://puppet-compiler.wmflabs.org/compiler1001/20080/" [puppet] - https://gerrit.wikimedia.org/r/559581 (https://phabricator.wikimedia.org/T239728) (owner: Mathew.onipe) [20:53:53] (PS1) RobH: decom old kafka machines [dns] - https://gerrit.wikimedia.org/r/559587 (https://phabricator.wikimedia.org/T226517) [20:54:04] Operations, MediaWiki-General, observability: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (colewhite) We recently had a conversation about this. - There's no clear way to translate from an arbitrary string to a more Prometheus-friendly metric. Usage within MediaWiki i... 
[20:54:23] (CR) RobH: [C: +2] decom old kafka machines [dns] - https://gerrit.wikimedia.org/r/559587 (https://phabricator.wikimedia.org/T226517) (owner: RobH) [20:56:16] Operations, ops-eqiad, Analytics, DC-Ops, decommission: Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (RobH) [21:02:08] (PS2) Mathew.onipe: maps: Use correct puppet cron syntax [puppet] - https://gerrit.wikimedia.org/r/559581 (https://phabricator.wikimedia.org/T239728) [21:06:05] Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 (colewhite) [21:06:26] (PS3) Mathew.onipe: maps: Use correct puppet cron syntax [puppet] - https://gerrit.wikimedia.org/r/559581 (https://phabricator.wikimedia.org/T239728) [21:09:34] (PS4) Mathew.onipe: maps: Use correct puppet cron syntax [puppet] - https://gerrit.wikimedia.org/r/559581 (https://phabricator.wikimedia.org/T239728) [21:11:17] (CR) Mathew.onipe: "https://puppet-compiler.wmflabs.org/compiler1001/20082/" [puppet] - https://gerrit.wikimedia.org/r/559581 (https://phabricator.wikimedia.org/T239728) (owner: Mathew.onipe) [21:11:57] (CR) Gehel: [C: +2] maps: Use correct puppet cron syntax [puppet] - https://gerrit.wikimedia.org/r/559581 (https://phabricator.wikimedia.org/T239728) (owner: Mathew.onipe) [21:51:31] (CR) MSantos: maps: Increase replication frequency (2 comments) [puppet] - https://gerrit.wikimedia.org/r/559442 (https://phabricator.wikimedia.org/T239728) (owner: Mathew.onipe) [21:54:06] (PS1) Ottomata: Fix kafka-dev chart to work with docker-desktop [deployment-charts] - https://gerrit.wikimedia.org/r/559607 [22:13:15] (CR) Alexandros Kosiaris: [C: +1] "Sure. 
The decision was taken in https://phabricator.wikimedia.org/T211881, specifically look at" [puppet] - https://gerrit.wikimedia.org/r/558732 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite) [22:23:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:25:13] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:27:15] (PS2) Ottomata: Fix kafka-dev chart to work with docker-desktop [deployment-charts] - https://gerrit.wikimedia.org/r/559607 [22:33:04] (PS1) NicholasG04: Added throttle rule for University of Derby mini-editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/559612 (https://phabricator.wikimedia.org/T240845) [22:37:02] (PS1) Mstyles: Add new MLR models [mediawiki-config] - https://gerrit.wikimedia.org/r/559614 (https://phabricator.wikimedia.org/T219534) [22:38:28] (PS20) CRusnov: netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [22:42:02] (PS1) RobH: remove silver mgmt dns [dns] - https://gerrit.wikimedia.org/r/559616 (https://phabricator.wikimedia.org/T191357) [22:44:05] (CR) RobH: [C: +2] remove silver mgmt dns [dns] - https://gerrit.wikimedia.org/r/559616 (https://phabricator.wikimedia.org/T191357) (owner: RobH) [22:45:27] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decom silver/WMF3434 - https://phabricator.wikimedia.org/T191357 (RobH) [22:48:03] (CR) Ammarpad: [C: -1] 
Added throttle rule for University of Derby mini-editathon (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/559612 (https://phabricator.wikimedia.org/T240845) (owner: NicholasG04) [22:48:09] Operations, ops-eqiad, DC-Ops, decommission, Patch-For-Review: decom silver/WMF3434 - https://phabricator.wikimedia.org/T191357 (RobH) Please note that while ' - switch port assignment noted on this task (for later removal)' is checked, it wasn't listed on this sheet. I checked ALL 4 switch... [22:48:39] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.074e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [22:48:50] sec-deploying T240502 now [22:49:24] (PS2) NicholasG04: Added throttle rule for University of Derby mini-editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/559612 (https://phabricator.wikimedia.org/T240845) [22:50:36] Operations, ops-eqiad, DC-Ops, decommission: decom silver/WMF3434 - https://phabricator.wikimedia.org/T191357 (RobH) [22:50:54] (CR) NicholasG04: "Thank you for pointing out my total idiocy. I clearly forgot to change one of the dates - a total oversight on my part. This has now been " [mediawiki-config] - https://gerrit.wikimedia.org/r/559612 (https://phabricator.wikimedia.org/T240845) (owner: NicholasG04) [22:52:30] (CR) Ammarpad: "It seems you've already added the task number in the commit summary, so forget about my comment on that part." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/559612 (https://phabricator.wikimedia.org/T240845) (owner: NicholasG04) [22:54:19] !log sbassett@deploy1001 Synchronized php-1.35.0-wmf.11/extensions/MobileFrontend/extension.json: Deploy security patch for T240502 (pushed through gerrit) (duration: 00m 55s) [22:54:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:29] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [23:03:58] (CR) Ammarpad: [C: +1] "> Thank you for pointing out my total idiocy. I clearly forgot to" [mediawiki-config] - https://gerrit.wikimedia.org/r/559612 (https://phabricator.wikimedia.org/T240845) (owner: NicholasG04) [23:04:20] (PS21) CRusnov: netbox: Add automation git machinery [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) [23:04:23] (PS1) RobH: dbproxy1005 mgmt dns removal [dns] - https://gerrit.wikimedia.org/r/559619 (https://phabricator.wikimedia.org/T231967) [23:04:58] (CR) RobH: [C: +2] dbproxy1005 mgmt dns removal [dns] - https://gerrit.wikimedia.org/r/559619 (https://phabricator.wikimedia.org/T231967) (owner: RobH) [23:06:54] (CR) NicholasG04: "Thanks for your help and understanding of the mistakes I have made due to being new to this platform. You have helped me a lot!" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/559612 (https://phabricator.wikimedia.org/T240845) (owner: NicholasG04) [23:09:11] Operations, ops-eqiad, DC-Ops, decommission: Decommission dbproxy1005.eqiad.wmnet - https://phabricator.wikimedia.org/T231967 (RobH) Open→Resolved [23:11:02] (PS1) Jhedden: ceph: add support for dedicated cluster network [puppet] - https://gerrit.wikimedia.org/r/559620 (https://phabricator.wikimedia.org/T240965) [23:12:24] (PS2) Jhedden: ceph: add support for dedicated cluster network [puppet] - https://gerrit.wikimedia.org/r/559620 (https://phabricator.wikimedia.org/T240965) [23:13:05] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:14:23] (CR) jerkins-bot: [V: -1] ceph: add support for dedicated cluster network [puppet] - https://gerrit.wikimedia.org/r/559620 (https://phabricator.wikimedia.org/T240965) (owner: Jhedden) [23:14:51] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:19:13] (CR) Volans: [C: -1] "Couple of minor things and few questions inline. Compiler looks ok." 
(6 comments) [puppet] - https://gerrit.wikimedia.org/r/555715 (https://phabricator.wikimedia.org/T233183) (owner: CRusnov) [23:21:56] (PS3) Jhedden: ceph: add support for dedicated cluster network [puppet] - https://gerrit.wikimedia.org/r/559620 (https://phabricator.wikimedia.org/T240965) [23:34:18] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick - https://phabricator.wikimedia.org/T240917 (colewhite) p:Triage→Normal a:colewhite [23:35:48] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick - https://phabricator.wikimedia.org/T240917 (colewhite) [23:40:31] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users, researchers & wmf for Shay Nowick - https://phabricator.wikimedia.org/T240917 (colewhite) [23:45:19] (PS1) Cwhite: admin: add shay to analytics-privatedata-users and researchers [puppet] - https://gerrit.wikimedia.org/r/559630 (https://phabricator.wikimedia.org/T240917) [23:50:05] (Abandoned) Cwhite: scb: add graphoid matching rules and deploy statsd exporter to scb cluster [puppet] - https://gerrit.wikimedia.org/r/558732 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite) [23:53:56] !log volker-e@deploy1001 Started deploy [design/style-guide@e9bf493]: Deploy design/style-guide: [23:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:05] !log volker-e@deploy1001 Finished deploy [design/style-guide@e9bf493]: Deploy design/style-guide: (duration: 00m 09s) [23:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:56] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/559630 (https://phabricator.wikimedia.org/T240917) (owner: Cwhite)