[00:00:23] (PS3) Dzahn: mediawiki: allow installing php7 only [puppet] - https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: Giuseppe Lavagetto)
[00:19:07] Operations, netops: Standardize cross confederation BGP policies - https://phabricator.wikimedia.org/T227808 (ayounsi) Open→Resolved This is done and pushed to all the sites.
[00:19:09] Operations, netops: Cleanup confed BGP peerings and policies - https://phabricator.wikimedia.org/T167841 (ayounsi)
[00:20:22] (CR) Dzahn: "PS2/3: rebased on top of prod and fixed issue with variable names in webserver.pp" [puppet] - https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: Giuseppe Lavagetto)
[00:23:56] (PS4) Dzahn: mediawiki: allow installing php7 only, disable hhvm on scandium [puppet] - https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: Giuseppe Lavagetto)
[00:43:18] (PS1) Ayounsi: Have syslog.eqiad/codfw point to the anycast name [dns] - https://gerrit.wikimedia.org/r/526287
[00:45:36] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/17644/" [puppet] - https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: Giuseppe Lavagetto)
[01:01:42] (PS1) Dzahn: parsoid::testing: add mediawiki appserver profiles to role [puppet] - https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069)
[01:02:56] (CR) jerkins-bot: [V: -1] parsoid::testing: add mediawiki appserver profiles to role [puppet] - https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069) (owner: Dzahn)
[01:11:00] (PS2) Dzahn: parsoid::testing: add mediawiki appserver profiles to role [puppet] - https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069)
[01:12:13] (CR) jerkins-bot: [V: -1] parsoid::testing: add mediawiki appserver profiles to role
[puppet] - https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069) (owner: Dzahn)
[01:15:25] (CR) Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/526289" [puppet] - https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: Giuseppe Lavagetto)
[01:16:10] (PS3) Dzahn: parsoid::testing: add mediawiki appserver profiles to role [puppet] - https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069)
[01:25:56] (PS1) Dzahn: role::mediawiki::appserver: merge role::mediawiki::common in [puppet] - https://gerrit.wikimedia.org/r/526290
[01:29:01] (PS2) Dzahn: role::mediawiki::appserver: merge role::mediawiki::common in [puppet] - https://gerrit.wikimedia.org/r/526290
[01:31:50] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/17645/ shows changes but they don't look like actual changes to me" [puppet] - https://gerrit.wikimedia.org/r/526290 (owner: Dzahn)
[01:42:21] Operations: URL shortener subdomains for useful Wikimedia infrastructure - https://phabricator.wikimedia.org/T223319 (Dzahn) |short link | maps to | example target URL | notes |----|----|---- |sal.w.wiki/$FOO | https://twitter.com/wikimediatech/status/$FOO | ex. https://twitter.com/wikimediatech/status/1155...
[01:45:41] Operations: URL shortener subdomains for useful Wikimedia infrastructure - https://phabricator.wikimedia.org/T223319 (Dzahn) |short link | maps to | example target URL | notes |----|----|---- |incident.w.wiki/$FOO | https://wikitech.wikimedia.org/wiki/Incident_documentation/$FOO | https://wikitech.wikimedia...
[01:45:47] Operations: URL shortener subdomains for useful Wikimedia infrastructure - https://phabricator.wikimedia.org/T223319 (CDanis) >>! In T223319#5375108, @Dzahn wrote: > > |short link | maps to | example target URL | notes > |----|----|---- > |sal.w.wiki/$FOO | https://twitter.com/wikimediatech/status/$FOO | ex...
[01:49:02] (PS11) CDanis: dbctl: diff PHP vs dbctl configs [puppet] - https://gerrit.wikimedia.org/r/526154
[01:50:47] (PS12) CDanis: dbctl: diff PHP vs dbctl configs [puppet] - https://gerrit.wikimedia.org/r/526154
[01:55:30] (PS13) CDanis: dbctl: diff PHP vs dbctl configs [puppet] - https://gerrit.wikimedia.org/r/526154
[02:07:31] Operations, Wikimedia-Mailing-lists: LGBT mailing list moderator password reset - https://phabricator.wikimedia.org/T225787 (Quiddity) Open→Resolved a:MoritzMuehlenhoff→Quiddity > Can you reset it again for me? Done. :)
[02:08:21] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1011.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[02:12:39] PROBLEM - puppet last run on ganeti2018 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:17:45] Operations: URL shortener subdomains for useful Wikimedia infrastructure - https://phabricator.wikimedia.org/T223319 (CDanis) p:High→Normal
[02:31:37] (CR) CDanis: dbctl: diff PHP vs dbctl configs (4 comments) [puppet] - https://gerrit.wikimedia.org/r/526154 (owner: CDanis)
[02:40:47] RECOVERY - puppet last run on ganeti2018 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:50:25] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:50:50] (CR) CDanis: [C: +2] dbctl: diff PHP vs dbctl configs [puppet] - https://gerrit.wikimedia.org/r/526154 (owner:
CDanis)
[02:54:56] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:55:44] (CR) CDanis: "Manuel, this new alert should fire when you do the planned master failover tomorrow. That's fine -- please let it do so as a test ;)" [puppet] - https://gerrit.wikimedia.org/r/526154 (owner: CDanis)
[03:04:41] (PS1) CDanis: cdanis dotfiles: time 🕘 for prompt updates ✔️ [puppet] - https://gerrit.wikimedia.org/r/526297
[03:06:13] (CR) CDanis: [C: +2] cdanis dotfiles: time 🕘 for prompt updates ✔️ [puppet] - https://gerrit.wikimedia.org/r/526297 (owner: CDanis)
[03:12:50] (CR) Tim Starling: Initial canary of dbctl, db config from etcd (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: CDanis)
[03:17:16] (CR) CDanis: [C: -2] Initial canary of dbctl, db config from etcd (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: CDanis)
[03:22:30] (PS1) CDanis: cdanis dotfiles: compatibility issue on <=stretch [puppet] - https://gerrit.wikimedia.org/r/526300
[03:23:23] (CR) CDanis: [C: +2] cdanis dotfiles: compatibility issue on <=stretch [puppet] - https://gerrit.wikimedia.org/r/526300 (owner: CDanis)
[03:26:34] (CR) Tim Starling: Initial canary of dbctl, db config from etcd (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: CDanis)
[03:29:34] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[03:29:54] (CR) CDanis: [C: -2] Initial
canary of dbctl, db config from etcd (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: CDanis)
[03:54:12] PROBLEM - snapshot of s7 in codfw on db1115 is CRITICAL: snapshot for s7 at codfw taken more than 4 days ago: Most recent backup 2019-07-26 03:43:39 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[04:12:01] (CR) Marostegui: "> Manuel, this new alert should fire when you do the planned master" [puppet] - https://gerrit.wikimedia.org/r/526154 (owner: CDanis)
[04:12:37] (PS4) Marostegui: db-eqiad.php: Promote db1104 to s8 master [mediawiki-config] - https://gerrit.wikimedia.org/r/526011 (https://phabricator.wikimedia.org/T227062)
[04:12:47] (PS6) Marostegui: db-eqiad.php: Set s8 (wikidata) into read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/525513 (https://phabricator.wikimedia.org/T227062)
[04:13:01] (PS5) Marostegui: mariadb: Promote db1104 to s8 master [puppet] - https://gerrit.wikimedia.org/r/524411 (https://phabricator.wikimedia.org/T227062)
[04:15:19] !log Start pre-steps for s8 primary master failover - T227062
[04:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:15:28] T227062: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062
[04:26:01] (CR) Marostegui: [C: +2] mariadb: Promote db1104 to s8 master [puppet] - https://gerrit.wikimedia.org/r/524411 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[04:30:12] I am going to take over the puppet and the mediawiki config repo, please coordinate with me before pushing anything.
I will let you all know once it is fine to merge normally, once I am done with the s8 failover
[04:37:24] (CR) Marostegui: db-eqiad.php: Set s8 (wikidata) into read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/525513 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[04:37:32] (CR) Marostegui: [C: +2] db-eqiad.php: Set s8 (wikidata) into read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/525513 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[04:38:25] (Merged) jenkins-bot: db-eqiad.php: Set s8 (wikidata) into read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/525513 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[04:38:41] (CR) jenkins-bot: db-eqiad.php: Set s8 (wikidata) into read-only [mediawiki-config] - https://gerrit.wikimedia.org/r/525513 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[04:39:02] (PS5) Marostegui: db-eqiad.php: Promote db1104 to s8 master [mediawiki-config] - https://gerrit.wikimedia.org/r/526011 (https://phabricator.wikimedia.org/T227062)
[04:40:27] (CR) Marostegui: [C: +2] db-eqiad.php: Promote db1104 to s8 master [mediawiki-config] - https://gerrit.wikimedia.org/r/526011 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[04:41:20] (Merged) jenkins-bot: db-eqiad.php: Promote db1104 to s8 master [mediawiki-config] - https://gerrit.wikimedia.org/r/526011 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[04:41:32] (PS1) Marostegui: Revert "db-eqiad.php: Set s8 (wikidata) into read-only" [mediawiki-config] - https://gerrit.wikimedia.org/r/526305
[04:41:36] (CR) jenkins-bot: db-eqiad.php: Promote db1104 to s8 master [mediawiki-config] - https://gerrit.wikimedia.org/r/526011 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[04:41:38] (PS2) Marostegui: Revert "db-eqiad.php: Set s8 (wikidata) into read-only" [mediawiki-config] -
https://gerrit.wikimedia.org/r/526305
[05:00:04] marostegui: It is that lovely time of the day again! You are hereby commanded to deploy s8 database master failover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190730T0500).
[05:00:08] \o/
[05:00:16] !log Starting s8 failover from db1071 to db1104 - T227062
[05:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:24] T227062: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062
[05:00:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Set s8 on read-only T227062 (duration: 00m 26s)
[05:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:22] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Set s8 (wikidata) into read-only" [mediawiki-config] - https://gerrit.wikimedia.org/r/526305 (owner: Marostegui)
[05:01:41] (CR) jenkins-bot: Revert "db-eqiad.php: Set s8 (wikidata) into read-only" [mediawiki-config] - https://gerrit.wikimedia.org/r/526305 (owner: Marostegui)
[05:01:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Switchover s8 master eqiad from db1071 to db1104 T227062 (duration: 00m 24s)
[05:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove s8 ready only T227062 (duration: 00m 24s)
[05:02:23] switchover is completed, read only is off now
[05:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:37] I can edit fine
[05:02:49] I am going to deploy a mediawiki core change shortly: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/526306/
[05:03:13] TimStarling: can you give me 5 minutes?
[05:03:24] yes, as long as you like
[05:03:33] sure, 5 minutes to make sure it is all fine should be enough :)
[05:04:17] should I hold off on merging the change? or just on deploying it?
[05:04:51] TimStarling: let's hold the merge too just in case
[05:05:01] so far it is looking all good, just want to do a few more checks
[05:05:03] shouldn't take long
[05:08:24] TimStarling: I think you are good to go, thanks for waiting :)
[05:08:43] Failover was done, deployments can resume as normal
[05:08:54] (CR) Marostegui: [C: +2] wmnet: Update CNAME for s8 master [dns] - https://gerrit.wikimedia.org/r/526013 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:15:39] I didn't really wait for you marostegui: I gave it +2 at 5:00, switched to -2 at 5:03, marostegui says OK to go at 5:08, back to +2 at 5:11, still waiting from jenkins which is presumably running since 5:00
[05:16:41] hahaha
[05:17:10] TimStarling: https://www.easyaslinux.com/memes/sometimes-waiting-for-a-jenkins-job-to-complete-be-like-forever/ ? :)
[05:20:57] merged at 3:17, yes jenkins takes forever
[05:21:47] I mean 5:17 (forgot I was using UTC)
[05:22:19] (PS1) Marostegui: db1071: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/526307 (https://phabricator.wikimedia.org/T217396)
[05:23:29] TimStarling: is it utc +11 for you in summer?
[05:23:41] yes
[05:23:46] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.15/includes/upload/UploadBase.php: T228929 (duration: 00m 48s)
[05:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:54] T228929: "Internal error: Server failed to store temporary file" when trying to upload images with upload wizard - https://phabricator.wikimedia.org/T228929
[05:24:08] (CR) Marostegui: [C: +2] db1071: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/526307 (https://phabricator.wikimedia.org/T217396) (owner: Marostegui)
[05:24:33] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.15/includes/api/ApiUpload.php: T228929 (duration: 00m 47s)
[05:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:20] !log tstarling@deploy1001 Synchronized php-1.34.0-wmf.15/includes/jobqueue/jobs/AssembleUploadChunksJob.php: T228929 (duration: 00m 46s)
[05:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:09] ACKNOWLEDGEMENT - snapshot of s7 in codfw on db1115 is CRITICAL: snapshot for s7 at codfw taken more than 4 days ago: Most recent backup 2019-07-26 03:43:39 Marostegui checking https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[05:31:07] (PS1) BryanDavis: toolforge: modernize updatetools script [puppet] - https://gerrit.wikimedia.org/r/526309 (https://phabricator.wikimedia.org/T164971)
[05:32:50] I had forgotten that meme, bookmarked
[05:36:16] !log Disable puppet on cumin2001 to investigate a backups issue
[05:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:56] (CR) BryanDavis: "The updatetools.py script itself has been tested manually.
You can copy the file to /data/project/admin on Toolforge and run it with:" [puppet] - https://gerrit.wikimedia.org/r/526309 (https://phabricator.wikimedia.org/T164971) (owner: BryanDavis)
[05:43:12] (PS1) KartikMistry: Update cxserver to 2019-07-29-154005-production [deployment-charts] - https://gerrit.wikimedia.org/r/526311
[05:48:37] Operations, DBA, Patch-For-Review: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (Marostegui)
[05:49:49] (PS2) KartikMistry: Update cxserver to 2019-07-29-154005-production [deployment-charts] - https://gerrit.wikimedia.org/r/526311 (https://phabricator.wikimedia.org/T227493)
[05:50:17] (PS1) Marostegui: db-eqiad.php: db1109 is now the candidate master for s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/526315
[06:05:44] PROBLEM - HP RAID on db2063 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:11 - OK: 1I:1:1, 1I:1:10, 1I:1:12, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:05:46] ACKNOWLEDGEMENT - HP RAID on db2063 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:11 - OK: 1I:1:1, 1I:1:10, 1I:1:12, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T229302 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:05:49] Operations, ops-codfw: Degraded RAID on db2063 - https://phabricator.wikimedia.org/T229302 (ops-monitoring-bot)
[06:05:56] (PS8) Giuseppe Lavagetto: New library to interact with poolcounter from python [software/python-poolcounter] - https://gerrit.wikimedia.org/r/517828
[06:05:58] (PS6) Giuseppe Lavagetto: Add debian package build [software/python-poolcounter] -
https://gerrit.wikimedia.org/r/517979
[06:17:13] (CR) Giuseppe Lavagetto: [C: -2] "Including a role in another is not /against/ the style guide. What is against it is declaring more than one role in a node stanza." [puppet] - https://gerrit.wikimedia.org/r/526290 (owner: Dzahn)
[06:28:32] (CR) Giuseppe Lavagetto: [C: -2] "The approach of adding the role here as an inclusion wasn't a bad idea, and is not against the style guide." [puppet] - https://gerrit.wikimedia.org/r/526289 (https://phabricator.wikimedia.org/T228069) (owner: Dzahn)
[06:29:04] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:31:12] PROBLEM - puppet last run on wdqs1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/puppetlabs/facter/facter.conf] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:32:46] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/globalsign-2018-ecdsa-unified.crt] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:33:52] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:37:06] PROBLEM - Check the Netbox report-s- puppetdb for fail status.
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[06:55:18] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:56:22] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:57:04] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:59:10] RECOVERY - puppet last run on wdqs1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:07:34] PROBLEM - Device not healthy -SMART- on db2063 is CRITICAL: cluster=mysql device=cciss,11 instance=db2063:9100 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2063&var-datasource=codfw+prometheus/ops
[07:10:25] (CR) Filippo Giunchedi: [C: +1] Have syslog.eqiad/codfw point to the anycast name [dns] - https://gerrit.wikimedia.org/r/526287 (owner: Ayounsi)
[07:12:50] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 53.77, 26.89, 15.84 https://wikitech.wikimedia.org/wiki/Application_servers
[07:12:50] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 60.19, 28.47, 16.32 https://wikitech.wikimedia.org/wiki/Application_servers
[07:12:52] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 67.46, 32.50, 18.49 https://wikitech.wikimedia.org/wiki/Application_servers
[07:12:52] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average:
79.67, 39.45, 20.75 https://wikitech.wikimedia.org/wiki/Application_servers
[07:13:14] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 84.18, 42.86, 22.24 https://wikitech.wikimedia.org/wiki/Application_servers
[07:13:30] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 55.98, 26.58, 15.20 https://wikitech.wikimedia.org/wiki/Application_servers
[07:13:42] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 48.83, 25.98, 15.24 https://wikitech.wikimedia.org/wiki/Application_servers
[07:14:40] (CR) Elukey: [C: +1] tlsproxy: conditionally add ssl_ecdhe_curve to XCP [puppet] - https://gerrit.wikimedia.org/r/526147 (https://phabricator.wikimedia.org/T228730) (owner: Ema)
[07:15:54] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 67.34, 33.76, 18.31 https://wikitech.wikimedia.org/wiki/Application_servers
[07:15:54] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 64.15, 34.21, 19.08 https://wikitech.wikimedia.org/wiki/Application_servers
[07:15:55] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2063 is CRITICAL: cluster=mysql device=cciss,11 instance=db2063:9100 job=node site=codfw Marostegui smart errors https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2063&var-datasource=codfw+prometheus/ops
[07:16:15] Operations, DBA: Predictive failures on disk S.M.A.R.T.
status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[07:16:27] (PS1) Elukey: sre.kafka.roll-restart-brokers.py: improvements to the procedure [cookbooks] - https://gerrit.wikimedia.org/r/526376 (https://phabricator.wikimedia.org/T229003)
[07:16:52] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:16:56] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 76.08, 45.52, 24.60 https://wikitech.wikimedia.org/wiki/Application_servers
[07:17:23] (PS3) Ema: tlsproxy: conditionally add ssl_ecdhe_curve to XCP [puppet] - https://gerrit.wikimedia.org/r/526147 (https://phabricator.wikimedia.org/T228730)
[07:17:30] (CR) Marostegui: [C: +2] db-eqiad.php: db1109 is now the candidate master for s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/526315 (owner: Marostegui)
[07:18:32] (Merged) jenkins-bot: db-eqiad.php: db1109 is now the candidate master for s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/526315 (owner: Marostegui)
[07:18:38] (CR) Ema: [C: +2] tlsproxy: conditionally add ssl_ecdhe_curve to XCP [puppet] - https://gerrit.wikimedia.org/r/526147 (https://phabricator.wikimedia.org/T228730) (owner: Ema)
[07:18:40] (CR) Marostegui: "The alerts hasn't been fired, or at least I don't see it on Icinga criticals or warnings" [puppet] - https://gerrit.wikimedia.org/r/526154 (owner: CDanis)
[07:18:49] (CR) jenkins-bot: db-eqiad.php: db1109 is now the candidate master for s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/526315 (owner: Marostegui)
[07:18:58] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed
out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:20:02] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:20:56] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:21:22] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 73.65, 38.31, 21.93 https://wikitech.wikimedia.org/wiki/Application_servers
[07:22:10] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:22:30] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:22:52] Operations, ops-codfw: Degraded RAID on db2063 - https://phabricator.wikimedia.org/T229302 (Marostegui) p:Triage→Normal a:Papaul Let's get a new disk for this? Thanks!
[07:23:06] Operations, DBA: Predictive failures on disk S.M.A.R.T.
status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[07:24:06] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 86.14, 47.18, 26.40 https://wikitech.wikimedia.org/wiki/Application_servers
[07:24:38] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:26:12] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:26:21] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[07:28:04] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:28:42] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:29:22] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:29:34] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:30:33] !log bounce hhvm on mw1221
[07:30:41] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:56] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:31:17] (PS2) Elukey: sre.kafka.roll-restart-brokers.py: improvements to the procedure [cookbooks] - https://gerrit.wikimedia.org/r/526376 (https://phabricator.wikimedia.org/T229003)
[07:31:52] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:38:27] (PS1) Volans: dbctl: fix temporary alarm [puppet] - https://gerrit.wikimedia.org/r/526377
[07:39:32] (PS1) Marostegui: mariadb: Allow reimage of db2131 [puppet] - https://gerrit.wikimedia.org/r/526378 (https://phabricator.wikimedia.org/T229251)
[07:40:59] (CR) Marostegui: [C: +2] mariadb: Allow reimage of db2131 [puppet] - https://gerrit.wikimedia.org/r/526378 (https://phabricator.wikimedia.org/T229251) (owner: Marostegui)
[07:45:32] RECOVERY - Check systemd state on analytics-tool1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:14] Operations, ops-eqiad, DBA, Patch-For-Review: (2019-08-31)rack/setup/install db2131.codfw.wmnet - https://phabricator.wikimedia.org/T229251 (Marostegui) @RobH @Papaul I have merged: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/526378/ The only changes pending from your side to be able t...
[07:46:22] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:47:04] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:48:04] (PS2) Volans: dbctl: fix temporary diff alarm [puppet] - https://gerrit.wikimedia.org/r/526377
[07:49:11] (CR) Marostegui: [C: +1] dbctl: fix temporary diff alarm [puppet] - https://gerrit.wikimedia.org/r/526377 (owner: Volans)
[07:49:39] (PS3) Volans: dbctl: fix temporary diff alarm [puppet] - https://gerrit.wikimedia.org/r/526377
[07:50:56] (CR) Volans: [C: +2] dbctl: fix temporary diff alarm [puppet] - https://gerrit.wikimedia.org/r/526377 (owner: Volans)
[07:53:34] (PS1) Marostegui: db2128: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/526381 (https://phabricator.wikimedia.org/T228969)
[07:55:12] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 9.64, 11.60, 23.66 https://wikitech.wikimedia.org/wiki/Application_servers
[07:55:32] (CR) Alexandros Kosiaris: [V: +2 C: +1] Update cxserver to 2019-07-29-154005-production [deployment-charts] - https://gerrit.wikimedia.org/r/526311 (https://phabricator.wikimedia.org/T227493) (owner: KartikMistry)
[07:56:52] (PS2) Marostegui: db2128: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/526381 (https://phabricator.wikimedia.org/T228969)
[07:58:04] (CR) Marostegui: [C: +2] db2128: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/526381 (https://phabricator.wikimedia.org/T228969) (owner: Marostegui)
[08:00:30] (PS5) Giuseppe Lavagetto: mediawiki: allow installing php7 only [puppet] -
10https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) [08:01:06] RECOVERY - High CPU load on API appserver on mw1221 is OK: OK - load average: 8.86, 10.12, 22.61 https://wikitech.wikimedia.org/wiki/Application_servers [08:01:27] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: allow installing php7 only [puppet] - 10https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: 10Giuseppe Lavagetto) [08:03:14] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 10.47, 10.40, 23.76 https://wikitech.wikimedia.org/wiki/Application_servers [08:04:09] !log revoke and deactivate orespoolcounter{1,2}00{1,2} T227640 [08:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:16] T227640: Migrate ORES pool counters to Buster - https://phabricator.wikimedia.org/T227640 [08:04:21] !log delete orespoolcounter{1,2}00{1,2} T227640 [08:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:18] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) a:03Marostegui [08:08:05] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 8.64, 9.62, 22.93 https://wikitech.wikimedia.org/wiki/Application_servers [08:08:22] (03PS3) 10Elukey: sre.kafka.roll-restart-brokers.py: improvements to the procedure [cookbooks] - 10https://gerrit.wikimedia.org/r/526376 (https://phabricator.wikimedia.org/T229003) [08:08:41] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 8.23, 9.93, 20.73 https://wikitech.wikimedia.org/wiki/Application_servers [08:10:07] !log Remove db2038 from tendril and zarcillo T227565 [08:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:14] T227565: decommission db2038 - https://phabricator.wikimedia.org/T227565 [08:10:17] (03PS4) 10Elukey: sre.kafka.roll-restart-brokers.py: improvements to the 
procedure [cookbooks] - 10https://gerrit.wikimedia.org/r/526376 (https://phabricator.wikimedia.org/T229003) [08:10:31] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 11.10, 9.49, 19.65 https://wikitech.wikimedia.org/wiki/Application_servers [08:10:52] (03PS1) 10Marostegui: mariadb: Decommission db2038 [puppet] - 10https://gerrit.wikimedia.org/r/526382 (https://phabricator.wikimedia.org/T227565) [08:11:11] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [08:12:21] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 9.93, 9.55, 19.45 https://wikitech.wikimedia.org/wiki/Application_servers [08:12:29] (03PS5) 10Elukey: sre.kafka.roll-restart-brokers.py: improvements to the procedure [cookbooks] - 10https://gerrit.wikimedia.org/r/526376 (https://phabricator.wikimedia.org/T229003) [08:13:54] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2038 [puppet] - 10https://gerrit.wikimedia.org/r/526382 (https://phabricator.wikimedia.org/T227565) (owner: 10Marostegui) [08:14:31] RECOVERY - puppet last run on analytics-tool1004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:16:11] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 8.64, 8.77, 16.28 https://wikitech.wikimedia.org/wiki/Application_servers [08:17:09] !log Stop MySQL on db2038 T227565 [08:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:15] T227565: decommission db2038 - https://phabricator.wikimedia.org/T227565 [08:19:17] RECOVERY - snapshot of s7 in codfw on db1115 is OK: snapshot for s7 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-07-30 07:04:00 from db2100.codfw.wmnet:3317 (841 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:21:00] 
10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2038 - https://phabricator.wikimedia.org/T227565 (10Marostegui) a:05Marostegui→03RobH [08:21:20] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2038 - https://phabricator.wikimedia.org/T227565 (10Marostegui) This host is ready for #dc-ops to finish its decommission [08:23:49] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 8.03, 8.27, 12.56 https://wikitech.wikimedia.org/wiki/Application_servers [08:23:49] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 9.90, 9.20, 13.37 https://wikitech.wikimedia.org/wiki/Application_servers [08:24:28] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2038 - https://phabricator.wikimedia.org/T227565 (10Marostegui) [08:27:39] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 11.53, 10.29, 13.46 https://wikitech.wikimedia.org/wiki/Application_servers [08:29:59] (03PS2) 10Mforns: analytics::refinery::job::data_purge Remove timer for WDQS extract [puppet] - 10https://gerrit.wikimedia.org/r/519688 (https://phabricator.wikimedia.org/T226862) [08:30:44] (03PS4) 10Mforns: analytics::refinery::job::data_purge Migrate mediawiki timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519690 (https://phabricator.wikimedia.org/T226862) [08:33:08] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::data_purge Migrate mediawiki timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519690 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns) [08:35:02] (03PS3) 10Mforns: analytics::refinery::job::data_purge Remove timer for WDQS extract [puppet] - 10https://gerrit.wikimedia.org/r/519688 (https://phabricator.wikimedia.org/T226862) [08:37:17] (03PS4) 10Elukey: analytics::refinery::job::data_purge Remove timer for WDQS extract [puppet] - 10https://gerrit.wikimedia.org/r/519688 (https://phabricator.wikimedia.org/T226862) (owner:
10Mforns) [08:39:31] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::data_purge Remove timer for WDQS extract [puppet] - 10https://gerrit.wikimedia.org/r/519688 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns) [08:40:15] (03PS1) 10Volans: dbctl: print to sdout [puppet] - 10https://gerrit.wikimedia.org/r/526383 [08:40:21] (03CR) 10Alexandros Kosiaris: "Fleet wide PCC gave an all ok on https://puppet-compiler.wmflabs.org/compiler1002/245/. So this is ok to go when we decide it's time to me" [puppet] - 10https://gerrit.wikimedia.org/r/526104 (https://phabricator.wikimedia.org/T228657) (owner: 10Alexandros Kosiaris) [08:40:56] (03PS4) 10Alexandros Kosiaris: helmfile_sal: Detect sudo usage for logging [puppet] - 10https://gerrit.wikimedia.org/r/526108 [08:41:19] (03PS1) 10Ema: vcl: remove backend-specific code [puppet] - 10https://gerrit.wikimedia.org/r/526384 (https://phabricator.wikimedia.org/T226589) [08:41:32] (03CR) 10Volans: [C: 03+2] dbctl: print to sdout [puppet] - 10https://gerrit.wikimedia.org/r/526383 (owner: 10Volans) [08:41:46] (03PS2) 10Volans: dbctl: print to sdout [puppet] - 10https://gerrit.wikimedia.org/r/526383 [08:43:14] (03PS2) 10Ema: vcl: remove upload-specific backend code [puppet] - 10https://gerrit.wikimedia.org/r/526384 (https://phabricator.wikimedia.org/T226589) [08:43:16] (03PS4) 10Elukey: Introduce role::statistics::explorer::gpu for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/526115 [08:44:33] (03CR) 10Elukey: [C: 03+2] Introduce role::statistics::explorer::gpu for stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/526115 (owner: 10Elukey) [08:44:53] (03PS5) 10Alexandros Kosiaris: helmfile_sal: Detect sudo usage for logging [puppet] - 10https://gerrit.wikimedia.org/r/526108 [08:46:05] PROBLEM - dbctl differs from mediawiki-config in eqiad- did you forget to update both- on cumin1001 is CRITICAL: Mismatched masters for section s8: PHP db1104 vs dbctl db1071 
https://wikitech.wikimedia.org/wiki/Dbctl%23Configuration_deltas_vs_PHP [08:46:13] (03PS1) 10Mforns: analytics::refinery::job::data_purge: Remove WDQS extract timer after absent [puppet] - 10https://gerrit.wikimedia.org/r/526386 (https://phabricator.wikimedia.org/T226862) [08:46:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/526108 (owner: 10Alexandros Kosiaris) [08:46:32] (03CR) 10Marostegui: [C: 04-1] "According to this: https://puppet-compiler.wmflabs.org/compiler1001/17651/db1101.eqiad.wmnet/ some of the checks will be losing the sms?" [puppet] - 10https://gerrit.wikimedia.org/r/525535 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [08:46:38] marostegui: finally ^^^ :) [08:46:45] volans: \o/ [08:46:53] volans: let's work in private to push the recover? [08:46:57] yep [08:48:47] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::data_purge: Remove WDQS extract timer after absent [puppet] - 10https://gerrit.wikimedia.org/r/526386 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns) [08:48:54] (03PS2) 10Elukey: analytics::refinery::job::data_purge: Remove WDQS extract timer after absent [puppet] - 10https://gerrit.wikimedia.org/r/526386 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns) [08:50:50] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:53:34] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/17650/mw1348.eqiad.wmnet/ shows that while the code compiles correctly, the removal of HH" [puppet] - 10https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: 10Giuseppe Lavagetto) [08:55:37] akosiaris: ok to merge your change? 
[08:55:45] (puppet-merge I mean) [08:56:18] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:56:28] elukey: yes, thank you [08:57:00] ack :) [08:57:39] (03CR) 10Ema: "pcc noop: https://puppet-compiler.wmflabs.org/compiler1002/17652/" [puppet] - 10https://gerrit.wikimedia.org/r/526384 (https://phabricator.wikimedia.org/T226589) (owner: 10Ema) [08:57:46] (03PS3) 10Ema: vcl: remove upload-specific backend code [puppet] - 10https://gerrit.wikimedia.org/r/526384 (https://phabricator.wikimedia.org/T226589) [08:58:00] (03CR) 10Filippo Giunchedi: "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/525535 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [09:01:38] (03PS1) 10Giuseppe Lavagetto: mtail: fix mediawiki access log metrics name [puppet] - 10https://gerrit.wikimedia.org/r/526388 [09:02:38] (03PS2) 10Giuseppe Lavagetto: mtail: fix mediawiki access log metrics name [puppet] - 10https://gerrit.wikimedia.org/r/526388 [09:02:40] (03CR) 10jerkins-bot: [V: 04-1] mtail: fix mediawiki access log metrics name [puppet] - 10https://gerrit.wikimedia.org/r/526388 (owner: 10Giuseppe Lavagetto) [09:05:05] (03CR) 10Ema: [C: 03+2] vcl: remove upload-specific backend code [puppet] - 10https://gerrit.wikimedia.org/r/526384 (https://phabricator.wikimedia.org/T226589) (owner: 10Ema) [09:08:12] (03PS1) 10Alexandros Kosiaris: Add anycast recdns to calico filters [deployment-charts] - 10https://gerrit.wikimedia.org/r/526389 (https://phabricator.wikimedia.org/T228190) [09:09:03] (03PS1) 10Elukey: aptrepo: add thirdparty/cloudera to buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/526390 [09:10:56] (03CR) 10Elukey: [C: 03+2] aptrepo: add thirdparty/cloudera to buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/526390 (owner: 10Elukey) [09:12:02] PROBLEM - 
HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [09:13:43] ema: --^ [09:14:20] last link seems also broken [09:15:10] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/app/kibana%23/dashboard/Varnish-Webrequest-50X [09:16:07] (03CR) 10Giuseppe Lavagetto: "Thanks a lot \o/ abandoning this patch" [puppet] - 10https://gerrit.wikimedia.org/r/526116 (owner: 10Giuseppe Lavagetto) [09:16:31] (03Abandoned) 10Giuseppe Lavagetto: tox: exclude mitaka admin_scripts [puppet] - 10https://gerrit.wikimedia.org/r/526116 (owner: 10Giuseppe Lavagetto) [09:18:23] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add anycast recdns to calico filters [deployment-charts] - 10https://gerrit.wikimedia.org/r/526389 (https://phabricator.wikimedia.org/T228190) (owner: 10Alexandros Kosiaris) [09:19:30] (03PS1) 10Volans: dbctl: add missing instances [puppet] - 10https://gerrit.wikimedia.org/r/526393 (https://phabricator.wikimedia.org/T229070) [09:21:07] (03CR) 10Marostegui: [C: 03+1] dbctl: add missing instances [puppet] - 10https://gerrit.wikimedia.org/r/526393 (https://phabricator.wikimedia.org/T229070) (owner: 10Volans) [09:21:10] (03PS1) 10Elukey: aptrepo: add cloudera-jessie-pull to buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/526394 [09:22:23] elukey: indeed, should be a # but it is urlencoded [09:22:34] (03CR) 10Volans: [C: 03+2] dbctl: add missing instances [puppet] - 10https://gerrit.wikimedia.org/r/526393 
(https://phabricator.wikimedia.org/T229070) (owner: 10Volans) [09:22:44] (03CR) 10Marostegui: [C: 03+1] "> > Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/525535 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [09:22:57] (03CR) 10Elukey: [C: 03+2] aptrepo: add cloudera-jessie-pull to buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/526394 (owner: 10Elukey) [09:23:05] (03PS2) 10Elukey: aptrepo: add cloudera-jessie-pull to buster-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/526394 [09:23:15] I know I am ashamed by --^ but it is the only way [09:23:58] (excluding getting all the source packages and rebuild) [09:27:30] !log add thirdparty/cloudera to buster-wikimedia and import packages to it (pull from the jessie component) [09:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:48] !log akosiaris@ helmfile [STAGING] Ran 'apply' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[09:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Not yet, we still haven't moved the eqiad/codfw clusters to helmfile" [puppet] - 10https://gerrit.wikimedia.org/r/526114 (owner: 10Alexandros Kosiaris) [09:40:26] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:40:46] ^ that is being worked out [09:42:48] (03PS4) 10Filippo Giunchedi: Consolidate 'critical' and 'contact groups' logic [puppet] - 10https://gerrit.wikimedia.org/r/525535 (https://phabricator.wikimedia.org/T228878) [09:44:47] (03CR) 10Filippo Giunchedi: [C: 03+2] Consolidate 'critical' and 'contact groups' logic [puppet] - 10https://gerrit.wikimedia.org/r/525535 (https://phabricator.wikimedia.org/T228878) (owner: 10Filippo Giunchedi) [09:48:56] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is CRITICAL: CRITICAL - Uncommitted dbctl configuration changes, check dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:49:36] !log upload python-snakebite to buster-wikimedia (rebuilt for buster from source) [09:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:56] (03PS1) 10Elukey: profile::analytics::cluster::client: use s-nail in Buster [puppet] - 10https://gerrit.wikimedia.org/r/526395 [09:51:46] 10Operations: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10jijiki) @elukey I am up for attempting to patch it and upload to stretch-wikimedia, I will try to do it next week [09:52:24] 10Operations, 10serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10jijiki) [09:54:11] (03CR) 10Elukey: [C: 03+2] profile::analytics::cluster::client: use s-nail 
in Buster [puppet] - 10https://gerrit.wikimedia.org/r/526395 (owner: 10Elukey) [09:56:39] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2019-07-29-154005-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/526311 (https://phabricator.wikimedia.org/T227493) (owner: 10KartikMistry) [09:57:08] (03CR) 10KartikMistry: [V: 03+2 C: 03+2] Update cxserver to 2019-07-29-154005-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/526311 (https://phabricator.wikimedia.org/T227493) (owner: 10KartikMistry) [10:08:17] (03PS1) 10Ema: labs: set prometheus::varnishkafka_exporter::stats_default [puppet] - 10https://gerrit.wikimedia.org/r/526397 (https://phabricator.wikimedia.org/T196066) [10:10:56] (03PS1) 10Jbond: buster-backports: add the buster backports repository to puppet [puppet] - 10https://gerrit.wikimedia.org/r/526398 [10:12:01] the dbctl alerts are expected, we're actually testing that they fire as expected [10:12:08] we'll fix the underlying data shortly [10:21:02] (03CR) 10Jbond: [C: 03+1] "LGTM, vote to work with upstream and merge the python code [in a further CR] when fixed there" [puppet] - 10https://gerrit.wikimedia.org/r/520643 (https://phabricator.wikimedia.org/T186550) (owner: 10Ayounsi) [10:21:31] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/525847 (https://phabricator.wikimedia.org/T229124) (owner: 10RobH) [10:23:06] (03CR) 10Fsero: mtail: fix mediawiki access log metrics name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526388 (owner: 10Giuseppe Lavagetto) [10:23:56] (03PS11) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. 
[puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [10:25:17] (03CR) 10Jbond: [C: 03+2] lookup checks: add checks to warn against using hiera and advice lookup [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [10:25:45] (03CR) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: 10Arturo Borrero Gonzalez) [10:26:00] (03PS4) 10Jbond: puppetmaster: use mod rewrite for conditional proxypass [puppet] - 10https://gerrit.wikimedia.org/r/525516 [10:27:43] (03CR) 10Jbond: [C: 03+2] puppetmaster: use mod rewrite for conditional proxypass [puppet] - 10https://gerrit.wikimedia.org/r/525516 (owner: 10Jbond) [10:31:33] 10Operations, 10Traffic: Remove X-Wikimedia-Security-Audit VCL support - https://phabricator.wikimedia.org/T229320 (10ema) [10:31:42] 10Operations, 10Traffic: Remove X-Wikimedia-Security-Audit VCL support - https://phabricator.wikimedia.org/T229320 (10ema) p:05Triage→03Normal [10:35:38] (03PS1) 10Jbond: puppetmaster1003: move sarni and neodymium to the new puppet master [puppet] - 10https://gerrit.wikimedia.org/r/526402 [10:37:11] 10Operations, 10Traffic: Consider removing X-Wikimedia-Security-Audit VCL support - https://phabricator.wikimedia.org/T229320 (10ema) [10:37:24] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: move sarni and neodymium to the new puppet master [puppet] - 10https://gerrit.wikimedia.org/r/526402 (owner: 10Jbond) [10:37:28] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:37:49] (03PS12) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration.
[puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [10:44:52] (03PS1) 10Jbond: puppetmaster1003: add puppetmaster1003 as a canary puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/526405 [10:45:47] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: add puppetmaster1003 as a canary puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/526405 (owner: 10Jbond) [10:46:01] (03CR) 10Gergő Tisza: Fix AddGroups/RemoveGroups for editor/autoreview (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518759 (https://phabricator.wikimedia.org/T226410) (owner: 10Reedy) [10:52:01] (03PS1) 10Elukey: Introduce cdh::mysql_jdbc [puppet] - 10https://gerrit.wikimedia.org/r/526406 [10:56:05] (03PS3) 10KartikMistry: Update cxserver to 2019-07-29-154005-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/526311 (https://phabricator.wikimedia.org/T227493) [10:57:14] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17654/" [puppet] - 10https://gerrit.wikimedia.org/r/526406 (owner: 10Elukey) [10:58:44] (03CR) 10KartikMistry: [V: 03+2 C: 03+2] Update cxserver to 2019-07-29-154005-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/526311 (https://phabricator.wikimedia.org/T227493) (owner: 10KartikMistry) [10:59:17] (03CR) 10Giuseppe Lavagetto: mtail: fix mediawiki access log metrics name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526388 (owner: 10Giuseppe Lavagetto) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190730T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS.
[11:03:28] nothing to do, the moon is safe for today ^^ [11:04:11] 10Operations, 10Security-Team, 10Traffic: Consider removing X-Wikimedia-Security-Audit VCL support - https://phabricator.wikimedia.org/T229320 (10Peachey88) [11:05:16] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:07:34] (03PS13) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [11:13:34] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:19:15] (03CR) 10Giuseppe Lavagetto: "After further analysis, none of the packages we're removing are needed for running mediawiki at all. So the patch should be good as-is." [puppet] - 10https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: 10Giuseppe Lavagetto) [11:27:14] (03CR) 10Fsero: [C: 03+1] mtail: fix mediawiki access log metrics name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526388 (owner: 10Giuseppe Lavagetto) [11:27:27] Lucas_WMDE: great. I can go for cxserver update. [11:28:23] go ahead :) [11:28:25] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [11:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:54] PROBLEM - puppet last run on mw1340 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/tmpreaper.conf] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:29:04] 10Operations, 10serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10elukey) I have created tmpreaper_1.6.13+nmu1+deb9u1+wmf1_amd64.deb on boron, with the following patch: ` elukey@boron:~/tmpreaper-1.6.13+nmu1+deb9u1$ cat patches/no-log-enoent.patch Index: tmpreaper-1.... [11:29:06] First time with new method.. [11:30:31] !log Depool mw1348 and pool back [11:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:32] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:46] Keeps me on hold till all containers upgraded? :) [11:34:37] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'cxserver' for release 'production' . [11:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:50] (03PS1) 10Jakob: Update termbox version to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/526412 [11:39:07] (03CR) 10CDanis: [C: 03+1] dbctl: print to sdout [puppet] - 10https://gerrit.wikimedia.org/r/526383 (owner: 10Volans) [11:39:29] (03CR) 10CDanis: [C: 03+1] "🤦" [puppet] - 10https://gerrit.wikimedia.org/r/526377 (owner: 10Volans) [11:41:24] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:47:07] (03CR) 10Jakob: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/526412 (owner: 10Jakob) [11:49:08] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:51:29] (03PS2) 10Tarrow: Update termbox version to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/526412 (owner: 10Jakob) [11:52:28] (03CR) 10Tarrow: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/526412 (owner: 10Jakob) [11:54:26] 10Operations, 10serviceops: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (10faidon) We can start by responding to [[ https://bugs.debian.org/763858 | Debian bug #763858 ]] with your fix and see if the maintainer is willing to incorporate this! [11:56:48] RECOVERY - puppet last run on mw1340 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:58:33] (03CR) 10Jakob: [V: 03+2 C: 03+2] Update termbox version to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/526412 (owner: 10Jakob) [12:03:14] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/l10nupdate] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:05:29] jouncebot: next [12:05:30] In 2 hour(s) and 54 minute(s): SecureLinkFixer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190730T1500) [12:07:41] (03PS4) 10CDanis: Initial canary of dbctl, db config from etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) [12:08:18] PROBLEM - puppet last run on an-worker1085 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:08:18] PROBLEM - puppet last run on elastic1035 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. 
Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:08:48] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:09:04] PROBLEM - puppet last run on an-worker1095 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:09:10] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' . [12:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:28] PROBLEM - puppet last run on restbase1018 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:09:30] PROBLEM - puppet last run on wdqs1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:09:38] PROBLEM - puppet last run on druid1005 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:09:38] PROBLEM - puppet last run on cloudvirt1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:10:04] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:10:12] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:10:34] PROBLEM - puppet last run on restbase-dev1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:10:47] (03PS14) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [12:10:50] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:56] PROBLEM - puppet last run on cloudstore1009 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:10:58] PROBLEM - puppet last run on wtp1029 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:11:00] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:11:08] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:11:20] PROBLEM - puppet last run on analytics1066 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:11:28] PROBLEM - puppet last run on orespoolcounter1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:12:12] ^^^ i think these were caused by some testing i have been doing on the puppet master. the half a dozen i have checked have no issues and run fine on the next run [12:12:14] PROBLEM - puppet last run on ganeti1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:12:22] PROBLEM - puppet last run on db1074 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:12:30] PROBLEM - puppet last run on druid1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:12:42] PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:12:42] PROBLEM - puppet last run on an-master1002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:12:52] PROBLEM - puppet last run on db1120 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:13:02] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:13:04] !log while testing some changes on the puppet master a bad config caused a small blip in catalog compilation [12:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:20] PROBLEM - puppet last run on analytics1074 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:13:20] jbond42: so already fixed? [12:13:24] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'test' . [12:13:26] yes already fixed [12:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:33] ok thanks [12:13:34] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:15:02] RECOVERY - puppet last run on restbase1018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:15:06] RECOVERY - puppet last run on wdqs1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:15:29] (03PS15) 10Arturo Borrero Gonzalez: toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) [12:15:56] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' . [12:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:08] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:17:26] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:18:36] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:19:02] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:21:19] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'termbox' for release 'production' . 
[12:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: add nginx-ingress configuration. [puppet] - 10https://gerrit.wikimedia.org/r/524759 (https://phabricator.wikimedia.org/T228500) (owner: 10Arturo Borrero Gonzalez) [12:33:37] (03PS1) 10Jbond: puppetmaster: Fix canary host configueration [puppet] - 10https://gerrit.wikimedia.org/r/526414 [12:34:54] (03CR) 10Jbond: [C: 03+2] puppetmaster: Fix canary host configueration [puppet] - 10https://gerrit.wikimedia.org/r/526414 (owner: 10Jbond) [12:35:06] RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:35:40] RECOVERY - puppet last run on analytics1074 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:35:54] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:36:10] RECOVERY - puppet last run on an-worker1085 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:36:12] RECOVERY - puppet last run on elastic1035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:36:31] !log marostegui@cumin1001 dbctl commit of MediaWiki config (dc=all), diff saved to 'https://phabricator.wikimedia.org/P8824', previous config saved to /var/cache/conftool/dbconfig/20190730-123630-marostegui.json [12:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:56] RECOVERY - puppet last run on an-worker1095 is OK: OK: Puppet is currently enabled, 
last run 35 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:37:15] \o/ [12:37:32] RECOVERY - puppet last run on druid1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:37:32] RECOVERY - puppet last run on cloudvirt1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:37:34] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin1001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:37:54] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:37:59] cdanis: ^^^ :D (the ! log) [12:38:04] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:38:10] volans: I know, it's great :D [12:38:22] volans: the one thing I want to fix is remove the '' around the phab URL :D [12:38:30] RECOVERY - puppet last run on restbase-dev1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:38:40] yeah [12:38:49] oh nice and the recovery for uncommitted changes [12:38:52] RECOVERY - puppet last run on cloudstore1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:38:54] RECOVERY - puppet last run on wtp1029 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:38:56] RECOVERY - puppet last run 
on mw1245 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:39:06] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:39:20] RECOVERY - puppet last run on analytics1066 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:39:28] RECOVERY - puppet last run on orespoolcounter1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:39:30] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:42] cdanis volans I want to add some doc to our mariadb page in wikitech with the most common commands so those can be linked from the alerts, is it ok if I work on an email for you guys with some questions and commands examples so you can verify those for me? [12:39:53] marostegui: please! [12:40:01] sure! 
[12:40:12] RECOVERY - puppet last run on ganeti1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:40:13] there is also a stub at https://wikitech.wikimedia.org/wiki/Dbctl that I was going to work on this week marostegui [12:40:20] RECOVERY - puppet last run on db1074 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:40:28] RECOVERY - puppet last run on druid1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:40:32] cdanis: sure, I can add it there or linked it, it doesn't matter :) [12:40:40] RECOVERY - puppet last run on an-master1002 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:40:50] RECOVERY - puppet last run on db1120 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [12:40:51] either is ok by me [12:40:54] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2001 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [12:40:56] cdanis: I was mentioning to volans that it would be nice to have a way to add a message to the commit so it gets show on the !log ie: promote db1104 to master - T131513215 [12:41:05] aha [12:41:07] yeah [12:41:38] RECOVERY - dbctl differs from mediawiki-config in eqiad- did you forget to update both- on cumin1001 is OK: OK: configurations match https://wikitech.wikimedia.org/wiki/Dbctl%23Configuration_deltas_vs_PHP [12:41:50] yay recovery [12:43:10] (03CR) 10Filippo Giunchedi: "LGTM, we'd likely want more stats in labs but good enough for now" [puppet] - 
10https://gerrit.wikimedia.org/r/526397 (https://phabricator.wikimedia.org/T196066) (owner: 10Ema) [12:44:00] (03CR) 10Filippo Giunchedi: [C: 03+1] Anycast: Add Prometheus exporter to Bird (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/526203 (owner: 10Ayounsi) [12:44:18] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:48:03] (03CR) 10Filippo Giunchedi: [C: 04-1] mtail: fix mediawiki access log metrics name (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526388 (owner: 10Giuseppe Lavagetto) [12:59:36] (03PS1) 10Jbond: puppetmaster - canary master: Raise an error if canary hosts cant beresolved [puppet] - 10https://gerrit.wikimedia.org/r/526415 [13:02:04] (03PS2) 10Jbond: puppetmaster - canary master: Raise an error if canary hosts cant be resolved [puppet] - 10https://gerrit.wikimedia.org/r/526415 [13:02:58] (03CR) 10Jbond: [C: 03+2] puppetmaster - canary master: Raise an error if canary hosts cant be resolved [puppet] - 10https://gerrit.wikimedia.org/r/526415 (owner: 10Jbond) [13:03:05] (03CR) 10Ema: [C: 03+1] "A few suggestions for documentation improvements, really nice work." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526110 (owner: 10Giuseppe Lavagetto) [13:06:50] PROBLEM - High lag on wdqs1009 is CRITICAL: 3635 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [13:09:02] PROBLEM - puppet last run on elastic1039 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:09:46] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:56] (03CR) 10Giuseppe Lavagetto: mtail: fix mediawiki access log metrics name (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526388 (owner: 10Giuseppe Lavagetto) [13:12:23] * volans looking at wdqs1009 [13:13:13] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [13:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:34] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:49] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [13:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:39] 10Operations, 10ops-eqiad, 10DC-Ops: ps1 eqiad Icinga UNKNOWNs - https://phabricator.wikimedia.org/T229328 (10ema) [13:22:08] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:23:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [13:30:04] (03PS1) 10Jbond: hiera backends: update the config and hiera backend with the correct names [puppet] - 10https://gerrit.wikimedia.org/r/526420 [13:30:37] (03PS3) 10Giuseppe Lavagetto: mtail: fix mediawiki access log metrics name [puppet] - 10https://gerrit.wikimedia.org/r/526388 [13:31:43] (03CR) 10jerkins-bot: [V: 04-1] mtail: fix mediawiki 
access log metrics name [puppet] - 10https://gerrit.wikimedia.org/r/526388 (owner: 10Giuseppe Lavagetto) [13:33:59] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (10elukey) [13:35:49] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (10elukey) 05Stalled→03Open [13:36:01] (03CR) 10Jbond: "PCC (prod): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/17655/" [puppet] - 10https://gerrit.wikimedia.org/r/526420 (owner: 10Jbond) [13:36:22] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [13:36:32] <_joe_> cdanis: ^^ [13:36:39] _joe_: <3 [13:36:58] RECOVERY - puppet last run on elastic1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:37:11] I'm getting some more coffee and then will begin soon [13:38:30] !log Move db2094:3315 from db2066 to db2128 - T228258 [13:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:37] T228258: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 [13:39:09] (03PS4) 10Giuseppe Lavagetto: mtail: fix mediawiki access log metrics name [puppet] - 10https://gerrit.wikimedia.org/r/526388 [13:40:08] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:56] (03PS3) 10Ottomata: Use schema aware refine for revision score and resource change [puppet] - 10https://gerrit.wikimedia.org/r/526180 (https://phabricator.wikimedia.org/T211248) [13:44:25] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Use schema aware refine for
revision score and resource change [puppet] - 10https://gerrit.wikimedia.org/r/526180 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [13:44:58] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:11] 10Operations, 10ops-codfw, 10DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Papaul) @Marostegui This system crashed again . This time the error is on DIMM A1 see below. ` "Correctable memory error logging disabled for a memory device at location DIMM_A1. Mon 29 Jul 2019... [13:47:24] 10Operations, 10ops-codfw, 10DBA: pc2010 possibly broken memory - https://phabricator.wikimedia.org/T227552 (10Marostegui) @Papaul interesting, that crash didn't make MySQL or the host to frozen this time. Good catch! It did kill other processes: ` [Tue Jul 30 00:47:38 2019] mce: Uncorrected hardware memory... 
[13:48:49] (03PS3) 10Alexandros Kosiaris: anycast recdns: Add to calico filters [puppet] - 10https://gerrit.wikimedia.org/r/526178 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [13:48:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] anycast recdns: Add to calico filters [puppet] - 10https://gerrit.wikimedia.org/r/526178 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [13:49:47] re:wdqs1009 it's a test instance, downtimed linking to the related task (see details in -discovery) [13:52:10] (03CR) 10Volans: [C: 03+1] "couple of nit inline, looks good otherwise" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/526376 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [13:55:26] (03PS2) 10Filippo Giunchedi: Have syslog.eqiad/codfw point to the anycast name [dns] - 10https://gerrit.wikimedia.org/r/526287 (owner: 10Ayounsi) [13:55:33] (03CR) 10Filippo Giunchedi: [C: 03+2] Have syslog.eqiad/codfw point to the anycast name [dns] - 10https://gerrit.wikimedia.org/r/526287 (owner: 10Ayounsi) [13:58:16] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Have syslog.eqiad/codfw point to the anycast name [dns] - 10https://gerrit.wikimedia.org/r/526287 (owner: 10Ayounsi) [14:00:57] (03PS1) 10Elukey: role::druid::turnilo: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/526428 (https://phabricator.wikimedia.org/T227860) [14:02:34] (03CR) 10Ayounsi: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/520643 (https://phabricator.wikimedia.org/T186550) (owner: 10Ayounsi) [14:03:45] (03PS2) 10Elukey: role::druid::turnilo: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/526428 (https://phabricator.wikimedia.org/T227860) [14:09:20] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:57] !log refreshing calico policy from code in codfw [14:14:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:04] (03PS5) 10CDanis: Initial canary of dbctl, db config from etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) [14:14:19] !log refreshing calico policy from code in eqiad [14:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:05] (03CR) 10Ottomata: [C: 03+1] Introduce cdh::mysql_jdbc [puppet] - 10https://gerrit.wikimedia.org/r/526406 (owner: 10Elukey) [14:16:09] (03PS3) 10Elukey: role::druid::turnilo: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/526428 (https://phabricator.wikimedia.org/T227860) [14:16:45] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Merged and applied in eqiad+codfw (staging is now handle via deployment-charts repo)" [puppet] - 10https://gerrit.wikimedia.org/r/526178 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [14:22:47] (03PS1) 10Filippo Giunchedi: WIP: alert on widespread puppet failures [puppet] - 10https://gerrit.wikimedia.org/r/526431 [14:22:54] (03PS4) 10Elukey: role::druid::turnilo: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/526428 (https://phabricator.wikimedia.org/T227860) [14:23:49] jouncebot: next [14:23:49] In 0 hour(s) and 36 minute(s): SecureLinkFixer (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190730T1500) [14:25:54] (03PS5) 10Elukey: role::druid::turnilo: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/526428 (https://phabricator.wikimedia.org/T227860) [14:29:21] (03CR) 10CDanis: [C: 03+2] Initial canary of dbctl, db config from etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:29:39] (03PS1) 10Effie Mouzeli: hieradata: enable php72_only on mw1347 and mw2136 [puppet] - 10https://gerrit.wikimedia.org/r/526434 (https://phabricator.wikimedia.org/T219150) [14:29:54] (03CR) 10Elukey: [C: 03+2] 
"https://puppet-compiler.wmflabs.org/compiler1002/17661/analytics-tool1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/526428 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [14:30:01] (03PS6) 10Elukey: role::druid::turnilo: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/526428 (https://phabricator.wikimedia.org/T227860) [14:30:58] (03PS1) 10Ottomata: Remove unsed refine job refine_eventlogging_eventbus [puppet] - 10https://gerrit.wikimedia.org/r/526435 (https://phabricator.wikimedia.org/T211248) [14:31:17] (03Merged) 10jenkins-bot: Initial canary of dbctl, db config from etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:31:55] (03PS1) 10Ema: ATS: add support for the compress plugin and enable it [puppet] - 10https://gerrit.wikimedia.org/r/526436 (https://phabricator.wikimedia.org/T227432) [14:33:08] (03CR) 10jerkins-bot: [V: 04-1] ATS: add support for the compress plugin and enable it [puppet] - 10https://gerrit.wikimedia.org/r/526436 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:33:21] (03PS2) 10Effie Mouzeli: hieradata: enable php72_only on mw1347 and mw2136 [puppet] - 10https://gerrit.wikimedia.org/r/526434 (https://phabricator.wikimedia.org/T219150) [14:33:21] !log cdanis@deploy1001 Synchronized docroot/noc/db.php: Ie98a8d9e dbctl canary on mwdebug1001 (duration: 00m 48s) [14:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:28] argh [14:33:30] (03CR) 10Andrew Bogott: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/526398 (owner: 10Jbond) [14:33:50] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:34:21] (03CR) 10jenkins-bot: Initial canary of dbctl, db config from etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525684 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:34:28] (03CR) 10Ottomata: [C: 03+2] Remove unsed refine job refine_eventlogging_eventbus [puppet] - 10https://gerrit.wikimedia.org/r/526435 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [14:34:36] (03PS2) 10Ottomata: Remove unsed refine job refine_eventlogging_eventbus [puppet] - 10https://gerrit.wikimedia.org/r/526435 (https://phabricator.wikimedia.org/T211248) [14:34:38] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove unsed refine job refine_eventlogging_eventbus [puppet] - 10https://gerrit.wikimedia.org/r/526435 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [14:34:51] !log cdanis@deploy1001 Synchronized wmf-config/etcd.php: Ie98a8d9e dbctl canary on mwdebug1001 (duration: 00m 47s) [14:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:01] (03PS2) 10Ema: ATS: add support for the compress plugin and enable it [puppet] - 10https://gerrit.wikimedia.org/r/526436 (https://phabricator.wikimedia.org/T227432) [14:36:02] !log cdanis@deploy1001 Synchronized wmf-config/CommonSettings.php: Ie98a8d9e dbctl canary on mwdebug1001 (duration: 00m 47s) [14:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:11] (03PS3) 10Effie Mouzeli: hieradata: enable php72_only on mw1347 and mw2136 [puppet] - 10https://gerrit.wikimedia.org/r/526434 (https://phabricator.wikimedia.org/T219150) [14:38:53] (03PS3) 10Ema: ATS: add support for the compress plugin and enable it [puppet] - 10https://gerrit.wikimedia.org/r/526436 
(https://phabricator.wikimedia.org/T227432) [14:40:25] (03PS1) 10CDanis: dbctl: enable on mwdebug* and two canaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526437 (https://phabricator.wikimedia.org/T229070) [14:40:51] (03PS4) 10Effie Mouzeli: hieradata: enable php72_only on mw1347 and mw2136 [puppet] - 10https://gerrit.wikimedia.org/r/526434 (https://phabricator.wikimedia.org/T219150) [14:43:22] (03CR) 10CDanis: [C: 03+2] dbctl: enable on mwdebug* and two canaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526437 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:44:19] (03PS2) 10Elukey: Introduce cdh::mysql_jdbc [puppet] - 10https://gerrit.wikimedia.org/r/526406 [14:44:43] (03PS6) 10Elukey: sre.kafka.roll-restart-brokers.py: improvements to the procedure [cookbooks] - 10https://gerrit.wikimedia.org/r/526376 (https://phabricator.wikimedia.org/T229003) [14:45:21] (03CR) 10Elukey: "Volas: fixed both your last comments (thanks!) and also added a fence to avoid values too low for sleep time." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/526376 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [14:45:54] elukey: typo, Volans :-P [14:46:02] ahhahaha [14:46:04] * volans hides [14:46:08] -1 [14:46:23] (03Merged) 10jenkins-bot: dbctl: enable on mwdebug* and two canaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526437 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:46:41] (03CR) 10jenkins-bot: dbctl: enable on mwdebug* and two canaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526437 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [14:47:54] !log cdanis@deploy1001 Synchronized wmf-config/etcd.php: I17c55428 dbctl canary on mwdebug*, mw1261, mw1276 (duration: 00m 47s) [14:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:38] (03PS1) 10Elukey: role::analytics_cluster::webserver: add TLS proxy [puppet] - 10https://gerrit.wikimedia.org/r/526438 (https://phabricator.wikimedia.org/T227860) [14:51:57] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17666/thorium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/526438 (https://phabricator.wikimedia.org/T227860) (owner: 10Elukey) [14:52:46] (03CR) 10Ema: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1001/17665/cp1076.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/526436 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [14:53:18] PROBLEM - puppet last run on wtp1037 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:54:51] (03PS2) 10Ema: labs: set prometheus::varnishkafka_exporter::stats_default [puppet] - 10https://gerrit.wikimedia.org/r/526397 (https://phabricator.wikimedia.org/T196066) [14:55:38] (03PS1) 10Jbond: puppet: fix config permissions on puppetdir [puppet] - 10https://gerrit.wikimedia.org/r/526441 (https://phabricator.wikimedia.org/T228805) [14:56:19] (03CR) 10Ema: [C: 03+2] labs: set prometheus::varnishkafka_exporter::stats_default [puppet] - 10https://gerrit.wikimedia.org/r/526397 (https://phabricator.wikimedia.org/T196066) (owner: 10Ema) [14:56:44] (03PS4) 10BBlack: anycast recdns: use for all hosts at edge sites [puppet] - 10https://gerrit.wikimedia.org/r/526169 (https://phabricator.wikimedia.org/T228190) [14:56:58] (03PS2) 10Jbond: puppet: fix config permissions on puppetdir [puppet] - 10https://gerrit.wikimedia.org/r/526441 (https://phabricator.wikimedia.org/T228805) [14:58:03] (03CR) 10Andrew Bogott: puppet: fix config permissions on puppetdir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526441 (https://phabricator.wikimedia.org/T228805) (owner: 10Jbond) [14:58:29] (03CR) 10jerkins-bot: [V: 04-1] puppet: fix config permissions on puppetdir [puppet] - 10https://gerrit.wikimedia.org/r/526441 (https://phabricator.wikimedia.org/T228805) (owner: 10Jbond) [14:58:52] (03CR) 10BBlack: "Yeah 555 seems appropriate here, unless there are security-sensitive sensitive filenames within." [puppet] - 10https://gerrit.wikimedia.org/r/526441 (https://phabricator.wikimedia.org/T228805) (owner: 10Jbond) [14:59:59] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/526441 (https://phabricator.wikimedia.org/T228805) (owner: 10Jbond) [15:00:04] legoktm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) SecureLinkFixer deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190730T1500). 
[15:00:58] (03PS3) 10Jbond: puppet: fix config permissions on puppetdir [puppet] - 10https://gerrit.wikimedia.org/r/526441 (https://phabricator.wikimedia.org/T228805) [15:01:23] I'll start in a few minutes [15:01:32] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[nginx-reload] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:02:01] should be fixed --^ [15:02:38] (03PS4) 10Jbond: puppet: fix config permissions on puppetdir [puppet] - 10https://gerrit.wikimedia.org/r/526441 (https://phabricator.wikimedia.org/T228805) [15:02:40] (03PS3) 10Elukey: Introduce cdh::mysql_jdbc [puppet] - 10https://gerrit.wikimedia.org/r/526406 [15:02:51] (03PS3) 10Legoktm: Enable SecureLinkFixer everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525347 (https://phabricator.wikimedia.org/T200751) [15:03:00] (03CR) 10Legoktm: [C: 03+2] Enable SecureLinkFixer everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525347 (https://phabricator.wikimedia.org/T200751) (owner: 10Legoktm) [15:03:15] (03CR) 10BBlack: [C: 03+1] puppet: fix config permissions on puppetdir [puppet] - 10https://gerrit.wikimedia.org/r/526441 (https://phabricator.wikimedia.org/T228805) (owner: 10Jbond) [15:03:17] (03CR) 10Andrew Bogott: [C: 03+1] puppet: fix config permissions on puppetdir [puppet] - 10https://gerrit.wikimedia.org/r/526441 (https://phabricator.wikimedia.org/T228805) (owner: 10Jbond) [15:03:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] hieradata: enable php72_only on mw1347 and mw2136 [puppet] - 10https://gerrit.wikimedia.org/r/526434 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [15:04:10] (03CR) 10Jbond: [C: 03+2] puppet: fix config permissions on puppetdir [puppet] - 10https://gerrit.wikimedia.org/r/526441 (https://phabricator.wikimedia.org/T228805) (owner: 10Jbond) [15:04:10] (03CR) 10Elukey: [C: 03+2] Introduce 
cdh::mysql_jdbc [puppet] - 10https://gerrit.wikimedia.org/r/526406 (owner: 10Elukey) [15:04:20] (03PS5) 10Jbond: puppet: fix config permissions on puppetdir [puppet] - 10https://gerrit.wikimedia.org/r/526441 (https://phabricator.wikimedia.org/T228805) [15:04:34] (03Merged) 10jenkins-bot: Enable SecureLinkFixer everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525347 (https://phabricator.wikimedia.org/T200751) (owner: 10Legoktm) [15:05:04] (03PS1) 10CRusnov: Upgrade netbox to v2.6.1-wmf1 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/526447 [15:05:33] (03CR) 10jenkins-bot: Enable SecureLinkFixer everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/525347 (https://phabricator.wikimedia.org/T200751) (owner: 10Legoktm) [15:06:33] !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable SecureLinkFixer everywhere (T200751) (duration: 00m 47s) [15:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:41] T200751: Review and deploy SecureLinkFixer extension - https://phabricator.wikimedia.org/T200751 [15:06:50] wee !!!! [15:06:58] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:07:09] legoktm: nice work ! [15:07:28] thedj: thanks :)) [15:07:51] \o/ [15:07:59] James_F: when you added SecureLinkFixer to extension-list, did you scap afterwards? 
[15:08:45] (03CR) 10Alexandros Kosiaris: [C: 04-1] "IPs assigned, and a minor inline comment" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/521584 (https://phabricator.wikimedia.org/T223953) (owner: 10Ppchelko) [15:09:18] (03PS1) 10Alexandros Kosiaris: Assign restrouter LVS IPs [dns] - 10https://gerrit.wikimedia.org/r/526448 (https://phabricator.wikimedia.org/T223953) [15:09:21] (03PS1) 10Alexandros Kosiaris: Activate restrouter discovery records [dns] - 10https://gerrit.wikimedia.org/r/526449 (https://phabricator.wikimedia.org/T223953) [15:09:33] !log legoktm@deploy1001 Started scap: Rebuild l10n cache for SecureLinkFixer message [15:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:47] (03PS1) 10Elukey: profile::analytics::cluster::packages::hadoop: limit snakebite's deploy [puppet] - 10https://gerrit.wikimedia.org/r/526451 [15:13:22] (03CR) 10Ottomata: [C: 03+1] profile::analytics::cluster::packages::hadoop: limit snakebite's deploy [puppet] - 10https://gerrit.wikimedia.org/r/526451 (owner: 10Elukey) [15:13:54] !log remove snakebite from buster-wikimedia (not needed anymore) [15:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:11] (03CR) 10Elukey: [C: 03+2] profile::analytics::cluster::packages::hadoop: limit snakebite's deploy [puppet] - 10https://gerrit.wikimedia.org/r/526451 (owner: 10Elukey) [15:16:17] (03CR) 10Elukey: [C: 03+2] sre.kafka.roll-restart-brokers.py: improvements to the procedure [cookbooks] - 10https://gerrit.wikimedia.org/r/526376 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [15:16:44] (03CR) 10Jbond: [C: 03+2] buster-backports: add the buster backports repository to puppet [puppet] - 10https://gerrit.wikimedia.org/r/526398 (owner: 10Jbond) [15:16:52] (03PS2) 10Jbond: buster-backports: add the buster backports repository to puppet [puppet] - 10https://gerrit.wikimedia.org/r/526398 [15:17:17] (03CR) 10Effie Mouzeli: [C: 03+1] "Nitpicks 
otherwise LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526173 (https://phabricator.wikimedia.org/T226331) (owner: 10CRusnov) [15:18:28] !log Disable puppet on mw1347 and mw2136, depool and pool back - T219150 [15:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:35] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [15:19:37] woo progress [15:19:59] RECOVERY - puppet last run on wtp1037 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:20:30] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable php72_only on mw1347 and mw2136 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526434 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [15:20:40] (03PS5) 10Effie Mouzeli: hieradata: enable php72_only on mw1347 and mw2136 [puppet] - 10https://gerrit.wikimedia.org/r/526434 (https://phabricator.wikimedia.org/T219150) [15:21:03] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers [15:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:27] round 2 :) [15:26:59] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:27:51] (03PS3) 10Jbond: buster-backports: add the buster backports repository to puppet [puppet] - 10https://gerrit.wikimedia.org/r/526398 [15:28:01] legoktm: No, I was waiting for today's full scap to build it. [15:28:24] !log legoktm@deploy1001 Finished scap: Rebuild l10n cache for SecureLinkFixer message (duration: 18m 51s) [15:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:36] So, good guess. 
[15:29:31] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/526420 (owner: 10Jbond) [15:31:38] James_F: yeah, it was just 1 message so there was no urgency whatsoever but since I had the time... [15:32:08] * James_F nods. [15:34:19] (03PS1) 10Elukey: Add stat1005 back in the pool of statistics_servers [puppet] - 10https://gerrit.wikimedia.org/r/526457 [15:34:58] * legoktm puts down the deploy stick [15:36:11] (03CR) 10Elukey: [C: 03+2] Add stat1005 back in the pool of statistics_servers [puppet] - 10https://gerrit.wikimedia.org/r/526457 (owner: 10Elukey) [15:38:19] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:41:24] !log elukey@cumin1001 END (FAIL) - Cookbook sre.kafka.roll-restart-brokers (exit_code=99) [15:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:49] (03CR) 10CRusnov: netbox: Add configuration for REDIS (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/526173 (https://phabricator.wikimedia.org/T226331) (owner: 10CRusnov) [15:51:38] (03PS1) 10Pmiazga: Enable reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526461 (https://phabricator.wikimedia.org/T227793) [15:52:53] (03CR) 10jerkins-bot: [V: 04-1] Enable reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526461 (https://phabricator.wikimedia.org/T227793) (owner: 10Pmiazga) [15:53:53] legoktm: btw, are you all done? [15:54:04] cdanis: yep [15:54:13] ty :) [15:54:16] _joe_: godog: looks like nothing planned for Puppet SWAT today? I will probably take over the window then [15:54:38] <_joe_> cdanis: go on [15:57:11] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:59:15] (03PS2) 10Pmiazga: Enable reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526461 (https://phabricator.wikimedia.org/T227793) [16:00:04] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190730T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:09] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:00:11] (03CR) 10jerkins-bot: [V: 04-1] Enable reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526461 (https://phabricator.wikimedia.org/T227793) (owner: 10Pmiazga) [16:03:28] (03PS5) 10BBlack: anycast recdns: use for all hosts at edge sites [puppet] - 10https://gerrit.wikimedia.org/r/526169 (https://phabricator.wikimedia.org/T228190) [16:03:30] (03PS1) 10BBlack: recdns: refactor and rationalize resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/526465 (https://phabricator.wikimedia.org/T228190) [16:06:53] (03PS1) 10CDanis: dbctl: enable on all canaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526468 (https://phabricator.wikimedia.org/T229070) [16:07:42] (03PS1) 10Giuseppe Lavagetto: utils: add run_ci_locally.sh [puppet] - 10https://gerrit.wikimedia.org/r/526469 [16:12:53] (03PS1) 10Jbond: puppetdb (buster): dont install the puppetdb4 component on buster servers [puppet] - 10https://gerrit.wikimedia.org/r/526470 [16:22:45] !log bounce rsyslog on centrallog1001 - T199406 [16:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:53] T199406: rsyslog's in:imtcp thread stuck on recvfrom loop from down/rebooted hosts - https://phabricator.wikimedia.org/T199406 
[16:23:27] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [16:24:20] (03PS1) 10Elukey: sre.kafka.roll-restart-brokers: source /etc/profile.d/kafka.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/526472 (https://phabricator.wikimedia.org/T229003) [16:24:28] (03PS3) 10CRusnov: netbox: Add configuration for REDIS [puppet] - 10https://gerrit.wikimedia.org/r/526173 (https://phabricator.wikimedia.org/T226331) [16:25:03] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1791 days) https://wikitech.wikimedia.org/wiki/Logs [16:25:09] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:26:25] (03PS2) 10Elukey: sre.kafka.roll-restart-brokers: source /etc/profile.d/kafka.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/526472 (https://phabricator.wikimedia.org/T229003) [16:26:58] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/526472 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [16:28:10] (03CR) 10Effie Mouzeli: [C: 03+1] "> (2 comments)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526173 (https://phabricator.wikimedia.org/T226331) (owner: 10CRusnov) [16:30:32] (03CR) 10CRusnov: [C: 03+2] netbox: Add configuration for REDIS [puppet] - 10https://gerrit.wikimedia.org/r/526173 (https://phabricator.wikimedia.org/T226331) (owner: 10CRusnov) [16:30:52] (03CR) 10BBlack: [C: 04-1] "I don't think 6 new services to monitor is going to kill us on upload-lb." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson) [16:31:02] (03CR) 10Elukey: [C: 03+2] sre.kafka.roll-restart-brokers: source /etc/profile.d/kafka.sh [cookbooks] - 10https://gerrit.wikimedia.org/r/526472 (https://phabricator.wikimedia.org/T229003) (owner: 10Elukey) [16:31:03] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_syslog.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [16:33:04] (03CR) 10BBlack: [C: 04-1] Add cloudelastic LVS to DNS (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/512924 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson) [16:33:27] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers [16:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:10] round 3 [16:36:58] the rsyslog delivery failure is me btw [16:37:42] !log cutting 1.34-wmf.16 [16:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:03] (03CR) 10CDanis: [C: 03+2] dbctl: enable on all canaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526468 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [16:39:26] (03Merged) 10jenkins-bot: dbctl: enable on all canaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526468 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [16:40:09] (03CR) 10jenkins-bot: dbctl: enable on all canaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526468 (https://phabricator.wikimedia.org/T229070) (owner: 10CDanis) [16:41:21] !log cdanis@deploy1001 Synchronized wmf-config/etcd.php: Icf57a2ab enable dbctl on all mw canaries (duration: 00m 47s) [16:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:33] 
PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 313 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Netbox [16:42:48] not unexpected, fixing [16:46:23] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 346 bytes in 1.852 second response time https://wikitech.wikimedia.org/wiki/Netbox [16:46:25] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_syslog.codfw.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [16:46:51] !log adding port 9105 to term prometheus in filter labs-in4 - T225296 [16:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:58] T225296: High Prometheus TCP retransmits - https://phabricator.wikimedia.org/T225296 [16:50:48] robh: you should update the clinic duty topic :) [16:51:06] oh, is it my week? [16:51:11] i didnt attend the meeting yesterday [16:51:34] the notes say so [16:51:59] yep [16:54:27] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [16:54:35] (03CR) 10BBlack: [C: 03+2] "Re-checked this a few different ways, and compiler output seems right as well: https://puppet-compiler.wmflabs.org/compiler1001/17670/" [puppet] - 10https://gerrit.wikimedia.org/r/526465 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [16:54:47] (03PS2) 10BBlack: recdns: refactor and rationalize resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/526465 (https://phabricator.wikimedia.org/T228190) [16:57:29] (03PS1) 10Thcipriani: gerrit: UseStringDeduplication in jvm [puppet] - 10https://gerrit.wikimedia.org/r/526478 [16:58:23] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 54.11, 32.39, 25.33 https://wikitech.wikimedia.org/wiki/Application_servers [16:59:43] PROBLEM - SSH access on cobalt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit [17:00:02] (03PS3) 10Pmiazga: Enable reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526461 (https://phabricator.wikimedia.org/T227793) [17:00:04] cscott, arlolra, subbu, halfak, and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190730T1700). 
[17:00:28] !log gerrit restart incoming -- gc time increasing causing timeouts [17:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:45] no parsoid deploy today [17:01:09] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.15.14-16-g855b179b5f (SSHD-CORE-1.6.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit [17:01:24] (03CR) 10jerkins-bot: [V: 04-1] Enable reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526461 (https://phabricator.wikimedia.org/T227793) (owner: 10Pmiazga) [17:03:09] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 64.43, 38.46, 26.84 https://wikitech.wikimedia.org/wiki/Application_servers [17:03:53] (03CR) 10Giuseppe Lavagetto: [C: 03+1] hiera backends: update the config and hiera backend with the correct names [puppet] - 10https://gerrit.wikimedia.org/r/526420 (owner: 10Jbond) [17:04:21] (03CR) 10Paladox: [C: 03+1] gerrit: UseStringDeduplication in jvm [puppet] - 10https://gerrit.wikimedia.org/r/526478 (owner: 10Thcipriani) [17:05:45] PROBLEM - puppet last run on db2095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:05:50] (03PS1) 10Ayounsi: Depool ulsfo for cr3/4-ulsfo upgrade [dns] - 10https://gerrit.wikimedia.org/r/526487 (https://phabricator.wikimedia.org/T227886) [17:06:11] PROBLEM - puppet last run on kafka-main2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:06:25] git pull errors must be because of the gerrit restart [17:06:55] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:06:55] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:07:37] (03PS1) 10Filippo Giunchedi: hieradata: use wezen.codfw.wmnet instead of syslog CNAME [puppet] - 10https://gerrit.wikimedia.org/r/526488 [17:07:57] PROBLEM - puppet last run on kafka-main2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:09:03] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_jenkins CI Composer] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:09:03] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: use wezen.codfw.wmnet instead of syslog CNAME [puppet] - 10https://gerrit.wikimedia.org/r/526488 (owner: 10Filippo Giunchedi) [17:09:15] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:09:23] (03PS3) 10BPirkle: Specify CentralAuth and OAuth session storage separately from per-wiki session storage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521409 (https://phabricator.wikimedia.org/T227097) [17:10:12] bblack: ok to merge your patch ? 
[17:10:33] I'm a little intimidated by the commit message :) [17:11:24] godog: please go ahead, I got stuck on my puppet-merge failing on slow-ass gerrit and spaced out :) [17:11:48] haha ok! merging now [17:12:36] {{done}} [17:15:11] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action={fwd_centrallog1001.eqiad.wmnet:6514,fwd_syslog.codfw.wmnet:6514} https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [17:15:18] known/expected ^ [17:15:43] !log use wezen.codfw.wmnet instead of syslog.codfw.wmnet for production hosts [17:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:53] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 16.98, 19.91, 23.49 https://wikitech.wikimedia.org/wiki/Application_servers [17:17:00] (03PS6) 10BBlack: anycast recdns: use for all hosts at edge sites [puppet] - 10https://gerrit.wikimedia.org/r/526169 (https://phabricator.wikimedia.org/T228190) [17:19:30] (03CR) 10Ayounsi: [C: 03+2] Depool ulsfo for cr3/4-ulsfo upgrade [dns] - 10https://gerrit.wikimedia.org/r/526487 (https://phabricator.wikimedia.org/T227886) (owner: 10Ayounsi) [17:20:08] !log depool ulsfo for routers upgrades - T227886 [17:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:35] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [17:23:43] (03CR) 10BBlack: [C: 03+2] "Looks correct now!" 
[puppet] - 10https://gerrit.wikimedia.org/r/526169 (https://phabricator.wikimedia.org/T228190) (owner: 10BBlack) [17:27:03] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 15.36, 17.17, 23.78 https://wikitech.wikimedia.org/wiki/Application_servers [17:28:08] RECOVERY - puppet last run on db2095 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:30:17] RECOVERY - puppet last run on kafka-main2001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:30:39] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 50.44 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:31:44] (03PS1) 10Urbanecm: flaggedrevs.php: Allow wikis to remove ability to promote to/demote from autoreview/editor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526492 (https://phabricator.wikimedia.org/T229346) [17:32:28] it sure would be nice to make icinga's traffic drop alert for a site have a dependency on "is the site depooled" 😂 [17:32:51] +1 [17:33:13] also different levels of pooled, dns-pooled/conftool-pooled/maybe more ? 
[17:34:09] RECOVERY - puppet last run on kafka-main2002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:34:41] (03PS1) 10Brennen Bearnes: Group0 to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526496 [17:34:49] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:34:49] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:34:57] godog: for this one i think just dns-pooled matters, but yea [17:36:55] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:37:03] cdanis: indeed, dns only for this one [17:37:09] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:37:43] spitballing here, but since that alert is from prometheus, if gdnsd (already?) exports pooled status we could hack together some inihibition [17:37:52] (03PS5) 10Giuseppe Lavagetto: mtail: fix mediawiki access log metrics names [puppet] - 10https://gerrit.wikimedia.org/r/526388 [17:39:35] godog: the HTTP-level traffic stats for codfw/ulsfo seem to show the traffic shift (like the alert above)... [17:40:22] hmmm nevermind, there was a followup question to that, but I think I've figured it out [17:42:57] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [17:47:44] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@af8b471]: Update mobileapps to ec865a7 [17:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:51] (03PS9) 10EBernhardson: LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) [17:52:09] (03CR) 10Dzahn: [C: 03+2] "https://bugs.eclipse.org/bugs/show_bug.cgi?id=490341 has some info about this flag and memory savings in other projects (eclipse)" [puppet] - 10https://gerrit.wikimedia.org/r/526478 (owner: 10Thcipriani) [17:52:34] (03PS3) 10EBernhardson: Add cloudelastic LVS to DNS [dns] - 10https://gerrit.wikimedia.org/r/512924 (https://phabricator.wikimedia.org/T224324) [17:52:41] (03CR) 10jerkins-bot: [V: 04-1] LVS for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson) [17:53:15] (03PS2) 10Dzahn: gerrit: UseStringDeduplication in jvm [puppet] - 10https://gerrit.wikimedia.org/r/526478 (owner: 10Thcipriani) [17:53:29] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@af8b471]: Update mobileapps to ec865a7 (duration: 05m 45s) [17:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:46] (03PS4) 10EBernhardson: Add cloudelastic LVS to DNS [dns] - 10https://gerrit.wikimedia.org/r/512924 (https://phabricator.wikimedia.org/T224324) [17:54:18] (03CR) 10EBernhardson: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/512925 (https://phabricator.wikimedia.org/T224324) (owner: 10EBernhardson) [17:55:19] !log brennen@deploy1001 Pruned MediaWiki: 1.34.0-wmf.11 (duration: 07m 40s) [17:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:54] !log brennen@deploy1001 Started scap: testwiki to php-1.34.0-wmf.16 and rebuild l10n 
cache [17:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190730T1800) [18:06:06] !log failover VRRP master to cr4-ulsfo [18:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:41] !log deactivate transit BGP groups on cr3-ulsfo [18:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:08] msg XioNoX I had a passing thought that has probably already occurred to you, and I know it's a bit late and you're already going, but: [18:08:16] heh msg fail [18:08:40] I may as well continue here to avoid the curiosity-spam [18:09:17] XioNoX: any expected impact on OIT routing office traffic through ulsfo? Do we think their fallback to their other link or whatever works fine now? [18:10:34] bblack: they use the other link as primary link, and the DC link for wiki only (by default), they have a session with both routers so it should be transparent for them, and if it goes totally down they will go through their main for wiki sites [18:11:10] ok sorry for the interrupt, I just had a sudden memory of past problems in this area :) [18:12:28] was worth it though, so I could double check it [18:13:31] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.07, 23.47, 17.75 https://wikitech.wikimedia.org/wiki/Application_servers [18:13:56] ^ rebuilding cdb files now for train, FYI [18:14:36] !log bump cr3-ulsfo<->cr2-eqord ospf metric [18:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:07] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 21.01, 21.39, 17.60 https://wikitech.wikimedia.org/wiki/Application_servers [18:15:17] !log brennen@deploy1001 Finished scap: testwiki to php-1.34.0-wmf.16 and rebuild l10n cache (duration: 18m 23s) [18:15:23] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:24] !log restart cr3-ulsfo [18:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:17] woop, no alert so far? [18:17:53] did I win at the "downtime all the proper things" game? [18:20:19] PROBLEM - OSPF status on mr1-ulsfo is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:21:34] (03CR) 10Dzahn: "i am kind of surprised about that. when looking at https://phabricator.wikimedia.org/T147718#2885950 it says inheritance between roles is " [puppet] - 10https://gerrit.wikimedia.org/r/526290 (owner: 10Dzahn) [18:22:18] ah! forgot that one [18:22:24] anyway cr3-ulsfo is back [18:22:25] (03PS33) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [18:23:33] RECOVERY - OSPF status on mr1-ulsfo is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:24:56] (03PS2) 10Mforns: analytics::refinery::job::data_purge Migrate banner timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519691 (https://phabricator.wikimedia.org/T226862) [18:25:10] !log rollback - bump cr3-ulsfo<->cr2-eqord ospf metric [18:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:53] !log activate transit BGP groups on cr3-ulsfo [18:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:38] !log failover VRRP master to cr3-ulsfo [18:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:06] (03CR) 10Dzahn: "not even "Class[Mediawiki::Mwrepl]" is needed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/525584 (https://phabricator.wikimedia.org/T228976) (owner: 10Giuseppe Lavagetto) [18:29:47] (03PS3) 10Elukey: analytics::refinery::job::data_purge Migrate banner timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519691 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns) [18:31:20] Hello. Confirming: to get code review for a production config change, I should tag with Operations and Patch-For-Review? This is on T227097. [18:31:21] T227097: Make sure that we're taking CentralAuth into consideration for staging release - https://phabricator.wikimedia.org/T227097 [18:32:21] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::data_purge Migrate banner timers to new script [puppet] - 10https://gerrit.wikimedia.org/r/519691 (https://phabricator.wikimedia.org/T226862) (owner: 10Mforns) [18:38:39] !log bump cr4-ulsfo<->cr1-codfw ospf metric [18:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) [18:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:20] !log restart cr4-ulsfo [18:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:45] bpirkle: what's the gerrit change in question? [18:40:28] yesss (for the cookbook!) 
[18:41:50] cdanis: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/521409/ [18:42:46] bpirkle: I'm no expert but this seems like a kind of change that could be done during a SWAT window [18:43:10] cr4-ulsfo is back on the prompt [18:43:21] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:43:25] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:43:25] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:43:26] time for routing to start working [18:43:42] cdanis: Ok, thanks. Will schedule there. [18:43:45] ah, yup forgot those, no big deal [18:44:23] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 58.56 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:45:12] that can be ignored ^ [18:45:53] alright [18:46:09] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 73 probes of 488 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:46:33] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:46:37] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:46:37] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:46:55] PROBLEM - PyBal BGP sessions are established on lvs4007 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=ulsfo+prometheus/ops [18:46:56] alright fully back [18:47:35] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 84.48 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:47:39] (03PS1) 10CRusnov: netbox: move config to /etc/netbox [puppet] - 10https://gerrit.wikimedia.org/r/526507 [18:48:31] RECOVERY - PyBal BGP sessions are established on lvs4007 is OK: (C)0 le (W)0 le 1 https://wikitech.wikimedia.org/wiki/PyBal https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=ulsfo+prometheus/ops [18:48:44] !log rollback bump cr4-ulsfo<->cr1-codfw ospf metric [18:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:30] !log rollback vrrp priority changes on cr4-ulsfo [18:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:47] (03PS2) 10CRusnov: Upgrade netbox to v2.6.1-wmf1 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/526447 [18:50:05] (03PS2) 10CRusnov: netbox: move config to /etc/netbox [puppet] - 10https://gerrit.wikimedia.org/r/526507 [18:50:15] (03PS1) 10Ayounsi: Revert "Depool ulsfo for cr3/4-ulsfo upgrade" [dns] - 10https://gerrit.wikimedia.org/r/526508 [18:51:16] alright, other than the ripe atlas check, everything is green [18:51:24] and that check should be back to normal very soon [18:51:47] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 4 probes of 488 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:54:00] yep ^ 
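Note on the Varnish traffic-drop alerts above: the check output reads "58.56 le 60" when CRITICAL and "(C)60 le (W)70 le 84.48" when OK. A minimal sketch of that threshold logic follows, assuming "le" means "less than or equal" and that the value is compared against the critical bound first. The function name and default arguments are hypothetical; the real Icinga check's implementation is not shown in this log.

```python
def classify_traffic_ratio(value, crit=60.0, warn=70.0):
    """Classify a traffic-ratio reading against critical/warning bounds.

    Assumption: a reading at or below `crit` is CRITICAL, at or below
    `warn` is WARNING, anything higher is OK. The 60/70 defaults are the
    (C) and (W) bounds printed in the alert text.
    """
    if value <= crit:
        return "CRITICAL"
    if value <= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Readings taken from the ulsfo alerts in the log.
    for reading in (58.56, 84.48):
        print(reading, classify_traffic_ratio(reading))
```

With the two ulsfo readings from the log, 58.56 falls at or below the critical bound and 84.48 sits above the warning bound, which matches the PROBLEM and RECOVERY messages.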
[18:54:32] XioNoX: 👍 😎 [18:56:26] !emoji is https://en.wikipedia.org/wiki/List_of_emoticons#Western [18:56:27] Key was added [19:00:04] brennen: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - American version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190730T1900). [19:03:24] (03PS1) 10Thcipriani: scap: prep and clean git ops for /srv/patches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526509 (https://phabricator.wikimedia.org/T222240) [19:07:10] (03CR) 10Brennen Bearnes: [C: 03+2] Group0 to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526496 (owner: 10Brennen Bearnes) [19:08:07] (03Merged) 10jenkins-bot: Group0 to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526496 (owner: 10Brennen Bearnes) [19:08:22] (03CR) 10jenkins-bot: Group0 to 1.34.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526496 (owner: 10Brennen Bearnes) [19:13:23] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.34.0-wmf.16 [19:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:17] PROBLEM - Check systemd state on restbase2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:21] PROBLEM - cassandra-b SSL 10.192.48.122:7001 on restbase2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [19:16:19] PROBLEM - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.122 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [19:16:21] PROBLEM - cassandra-b service on restbase2017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:19:35] (03CR) 10Ayounsi: [C: 03+2] Revert "Depool ulsfo for cr3/4-ulsfo upgrade" [dns] - 10https://gerrit.wikimedia.org/r/526508 (owner: 10Ayounsi) [19:19:39] RECOVERY - cassandra-b service on restbase2017 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:19:46] !log repool ulsfo [19:19:52] !log restbase2017 - sudo systemctl start cassandra-b after it had failed for unknown reason [19:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:15] RECOVERY - Check systemd state on restbase2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:11] RECOVERY - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is OK: TCP OK - 0.036 second response time on 10.192.48.122 port 9042 https://phabricator.wikimedia.org/T93886 [19:21:53] RECOVERY - cassandra-b SSL 10.192.48.122:7001 on restbase2017 is OK: SSL OK - Certificate restbase2017-b valid until 2020-11-29 09:26:18 +0000 (expires in 487 days) https://phabricator.wikimedia.org/T120662 [19:29:34] (03CR) 10Jforrester: [C: 03+1] scap: prep and clean git ops for /srv/patches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526509 
(https://phabricator.wikimedia.org/T222240) (owner: 10Thcipriani) [19:35:05] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.28 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:38:03] hah [19:39:21] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.34.0-wmf.15 [19:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:11] (03CR) 10Ayounsi: [C: 03+2] Anycast: Add Prometheus exporter to Bird (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/526203 (owner: 10Ayounsi) [19:41:15] PROBLEM - puppet last run on dns4002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:41:20] (03PS2) 10Ayounsi: Anycast: Add Prometheus exporter to Bird (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/526203 [19:42:09] (03PS1) 10Brennen Bearnes: Revert "Group0 to 1.34.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526513 [19:42:11] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "Group0 to 1.34.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526513 (owner: 10Brennen Bearnes) [19:44:03] (03Merged) 10jenkins-bot: Revert "Group0 to 1.34.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526513 (owner: 10Brennen Bearnes) [19:44:18] (03CR) 10jenkins-bot: Revert "Group0 to 1.34.0-wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526513 (owner: 10Brennen Bearnes) [20:02:17] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 78.23 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:08:32] the varnish alert is expected (due to the ulsfo repool), the dns4002 not. looking [20:09:15] RECOVERY - puppet last run on dns4002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [20:09:19] XioNoX: puppet run showed no error at all, right [20:09:35] yeah, all good it seems [20:09:37] hmm. we keep having these [20:09:44] XioNoX: that flavor of puppet failure is usually a dumb transient failure we haven't tracked down yet [20:09:47] on seemingly random hosts, like it's a master thing [20:09:53] yep, that [20:10:49] (03CR) 10Ayounsi: [C: 03+1] netbox: move config to /etc/netbox [puppet] - 10https://gerrit.wikimedia.org/r/526507 (owner: 10CRusnov) [20:10:59] (03CR) 10Ayounsi: [C: 03+1] Upgrade netbox to v2.6.1-wmf1 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/526447 (owner: 10CRusnov) [20:11:41] (03PS2) 10RobH: adding jclark to shell and dc ops group [puppet] - 10https://gerrit.wikimedia.org/r/525847 (https://phabricator.wikimedia.org/T229124) [20:14:59] (03PS1) 10Pmiazga: Enable MobileWebUIActionsTracking schema with 50% sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526516 (https://phabricator.wikimedia.org/T220016) [20:15:50] (03CR) 10jerkins-bot: [V: 04-1] Enable MobileWebUIActionsTracking schema with 50% sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526516 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [20:16:47] (03PS2) 10Pmiazga: Enable MobileWebUIActionsTracking schema with 50% sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526516 (https://phabricator.wikimedia.org/T220016) [20:20:10] (03PS1) 10CDanis: dbctl: don't quote phaste urls [software/conftool] - 10https://gerrit.wikimedia.org/r/526518 [20:22:07] (03CR) 10CRusnov: [C: 03+2] netbox: move config to /etc/netbox 
[puppet] - 10https://gerrit.wikimedia.org/r/526507 (owner: 10CRusnov) [20:22:17] (03PS3) 10CRusnov: netbox: move config to /etc/netbox [puppet] - 10https://gerrit.wikimedia.org/r/526507 [20:33:28] (03CR) 10Pmiazga: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526461 (https://phabricator.wikimedia.org/T227793) (owner: 10Pmiazga) [20:33:45] (03PS1) 10Ayounsi: Prometheus, set bird class name to profile::bird::anycast [puppet] - 10https://gerrit.wikimedia.org/r/526521 [20:34:32] (03CR) 10jerkins-bot: [V: 04-1] Enable reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526461 (https://phabricator.wikimedia.org/T227793) (owner: 10Pmiazga) [20:34:52] (03CR) 10CDanis: [C: 03+1] Prometheus, set bird class name to profile::bird::anycast [puppet] - 10https://gerrit.wikimedia.org/r/526521 (owner: 10Ayounsi) [20:35:15] (03CR) 10Ayounsi: [C: 03+2] Prometheus, set bird class name to profile::bird::anycast [puppet] - 10https://gerrit.wikimedia.org/r/526521 (owner: 10Ayounsi) [20:35:26] (03PS2) 10Ayounsi: Prometheus, set bird class name to profile::bird::anycast [puppet] - 10https://gerrit.wikimedia.org/r/526521 [20:37:04] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Upgrade netbox to v2.6.1-wmf1 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/526447 (owner: 10CRusnov) [20:58:29] (03PS1) 10Ottomata: Release 2.4.3 for Debian Buster [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/526527 (https://phabricator.wikimedia.org/T222253) [21:11:53] (03PS1) 10Andrew Bogott: bootstrap-vz: remove swap partitions from Stretch and Buster images [puppet] - 10https://gerrit.wikimedia.org/r/526529 (https://phabricator.wikimedia.org/T229372) [21:12:30] (03PS2) 10Jdlrobson: Update wgSkipSkins to experiment with not showing skins to users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511321 (https://phabricator.wikimedia.org/T223824) [21:16:53] (03CR) 10Andrew Bogott: [C: 03+2] bootstrap-vz: remove swap partitions from Stretch and 
Buster images [puppet] - 10https://gerrit.wikimedia.org/r/526529 (https://phabricator.wikimedia.org/T229372) (owner: 10Andrew Bogott) [21:19:12] (03PS3) 10Jdlrobson: Update wgSkipSkins to experiment with not showing skins to users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/511321 (https://phabricator.wikimedia.org/T223824) [21:22:18] (03PS4) 10Pmiazga: Enable reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526461 (https://phabricator.wikimedia.org/T227793) [21:29:47] (03PS5) 10Pmiazga: Enable editor gender surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526461 (https://phabricator.wikimedia.org/T227793) [21:33:25] (03PS1) 10Ayounsi: Prometheus add bird prefix export count to global metrics [puppet] - 10https://gerrit.wikimedia.org/r/526536 [21:42:02] !log ppchelko@deploy1001 Started deploy [restbase/deploy@c7e0e33]: Enable language variants filter for PCS endpoints. T229060 [21:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:09] T229060: Enable language_variants_filter for PCS endpoints - https://phabricator.wikimedia.org/T229060 [21:48:31] (03PS1) 10Mholloway: Add MachineVision to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348) [21:48:33] (03PS1) 10Mholloway: Add wmgUseMachineVision default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526542 (https://phabricator.wikimedia.org/T227348) [21:48:35] (03PS1) 10Mholloway: Enable MachineVision on (beta) commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526543 (https://phabricator.wikimedia.org/T227348) [21:48:42] (03PS1) 10Mholloway: Load MachineVision extension if enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526544 (https://phabricator.wikimedia.org/T227348) [21:49:22] (03CR) 10Mholloway: [C: 04-2] "Hold this series until ready to deploy to Beta." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [21:50:00] (03CR) 10jerkins-bot: [V: 04-1] Load MachineVision extension if enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526544 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [21:52:28] (03CR) 10Mholloway: [C: 04-2] "Does this depend on the extension submodule being present in mediawiki/extensions? That won't happen until next week (assuming https://ger" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [21:54:19] (03PS2) 10Mholloway: Load MachineVision extension if enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526544 (https://phabricator.wikimedia.org/T227348) [21:56:14] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [21:58:43] (03CR) 10Volans: [C: 03+2] dbctl: don't quote phaste urls [software/conftool] - 10https://gerrit.wikimedia.org/r/526518 (owner: 10CDanis) [22:00:43] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@c7e0e33]: Enable language variants filter for PCS endpoints. T229060 (duration: 18m 40s) [22:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:50] T229060: Enable language_variants_filter for PCS endpoints - https://phabricator.wikimedia.org/T229060 [22:00:56] !log ppchelko@deploy1001 Started deploy [restbase/deploy@c7e0e33]: Enable language variants filter for PCS endpoints. 
T229060, take 2, feeds timed out [22:01:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:11] (03Merged) 10jenkins-bot: dbctl: don't quote phaste urls [software/conftool] - 10https://gerrit.wikimedia.org/r/526518 (owner: 10CDanis) [22:01:58] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@c7e0e33]: Enable language variants filter for PCS endpoints. T229060, take 2, feeds timed out (duration: 01m 03s) [22:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:08] (03Abandoned) 10Urbanecm: Fix adding vendor files by default for commiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471533 (https://phabricator.wikimedia.org/T207058) (owner: 10Zoranzoki21) [22:11:57] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@76b6639]: Report 400 errors by default. T229277 [22:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:04] T229277: Change-Prop should report 400 errors from endpoints - https://phabricator.wikimedia.org/T229277 [22:13:26] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@76b6639]: Report 400 errors by default. 
T229277 (duration: 01m 29s) [22:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:11] !log crusnov@deploy1001 Started deploy [netbox/deploy@b76139e]: Upgrade Netbox to v2.6.1 - T226331 [22:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:21] T226331: Upgrade Netbox to 2.6.1 - https://phabricator.wikimedia.org/T226331 [22:18:31] !log crusnov@deploy1001 Finished deploy [netbox/deploy@b76139e]: Upgrade Netbox to v2.6.1 - T226331 (duration: 00m 20s) [22:18:32] !log crusnov@deploy1001 Started deploy [netbox/deploy@b76139e]: Upgrade Netbox to v2.6.1 - T226331 [22:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:19] !log crusnov@deploy1001 Finished deploy [netbox/deploy@b76139e]: Upgrade Netbox to v2.6.1 - T226331 (duration: 00m 47s) [22:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:34] !log crusnov@deploy1001 Started deploy [netbox/deploy@b76139e]: Upgrade Netbox to v2.6.1 (pass 2) - T226331 [22:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:42] T226331: Upgrade Netbox to 2.6.1 - https://phabricator.wikimedia.org/T226331 [22:23:44] !log crusnov@deploy1001 Finished deploy [netbox/deploy@b76139e]: Upgrade Netbox to v2.6.1 (pass 2) - T226331 (duration: 00m 10s) [22:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:46] !log crusnov@deploy1001 Started deploy [netbox/deploy@b76139e]: Upgrade Netbox to v2.6.1 (pass 3) - T226331 [22:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:56] !log crusnov@deploy1001 Finished deploy [netbox/deploy@b76139e]: Upgrade Netbox to v2.6.1 (pass 3) - T226331 (duration: 00m 09s) [22:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:12] jouncebot, 
next [22:42:13] In 0 hour(s) and 17 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190730T2300) [22:43:17] (03PS1) 10CRusnov: Rebuild artifacts and fix src rev [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/526554 [22:46:16] (03CR) 10Ayounsi: [C: 03+1] Rebuild artifacts and fix src rev [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/526554 (owner: 10CRusnov) [22:46:51] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Rebuild artifacts and fix src rev [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/526554 (owner: 10CRusnov) [23:00:04] MaxSem, RoanKattouw, and Niharika: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190730T2300). [23:00:04] bpirkle and jdlrobson: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:31] here ! [23:00:34] I can SWAT [23:00:34] I'm here [23:00:49] AndyRussG: You also have a patch for SWAT (but jouncebot forgot to ping you) [23:01:22] RoanKattouw: thanks much!!! 
yeah I might have gotten it in seconds after the hour [23:01:50] (03CR) 10Catrope: [C: 03+2] Enable MobileWebUIActionsTracking schema with 50% sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526516 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [23:01:54] (03CR) 10Dzahn: "@cdanis Here are the pages since i promised i would create them:" [puppet] - 10https://gerrit.wikimedia.org/r/496830 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:02:36] (03Merged) 10jenkins-bot: Enable MobileWebUIActionsTracking schema with 50% sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526516 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [23:02:54] (03CR) 10jenkins-bot: Enable MobileWebUIActionsTracking schema with 50% sampling rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526516 (https://phabricator.wikimedia.org/T220016) (owner: 10Pmiazga) [23:03:22] RoanKattouw: I'm not 100% sure if I'm creating the core branches correctly, since the CentralNotice deploy method was fixed [23:04:06] jdlrobson: Your patch is on mwdebug1002, please test [23:04:14] on it [23:04:22] AndyRussG: Oh, do we finally have wmf.N branches now instead of wmf_deploy branches? [23:04:53] RoanKattouw: we now have both [23:05:19] the wmf.N branches track the wmf_deploy branch [23:05:44] RoanKattouw: we're good to go [23:06:04] (03PS4) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [23:06:20] (03PS4) 10Catrope: Specify CentralAuth and OAuth session storage separately from per-wiki session storage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521409 (https://phabricator.wikimedia.org/T227097) (owner: 10BPirkle) [23:06:39] (03CR) 10Catrope: [C: 03+2] Specify CentralAuth and OAuth session storage separately from per-wiki session storage. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/521409 (https://phabricator.wikimedia.org/T227097) (owner: 10BPirkle) [23:06:53] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable MobileWebUIActionsTracking schema with 50% sampling rate (T220016) (duration: 00m 48s) [23:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:01] T220016: Create new MobileWebUIActionsTracking schema - https://phabricator.wikimedia.org/T220016 [23:07:40] (03Merged) 10jenkins-bot: Specify CentralAuth and OAuth session storage separately from per-wiki session storage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521409 (https://phabricator.wikimedia.org/T227097) (owner: 10BPirkle) [23:08:16] bpirkle: Your patch is on mwdebug1002, please test [23:08:18] (03CR) 10jenkins-bot: Specify CentralAuth and OAuth session storage separately from per-wiki session storage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521409 (https://phabricator.wikimedia.org/T227097) (owner: 10BPirkle) [23:11:37] Good to go [23:12:05] AndyRussG: You were right to be skeptical, your wmf_deploy cherry-picks were merged but don't show up when I git pull on the deployment host. Trying a single cherry-pick to wmf.16 now to see if that works. 
If it does, I'll cherry-pick it to wmf.15 as well, and also create both cherry-picks for the other commit [23:12:30] RoanKattouw: This one is the head in the wmf_deploy branch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralNotice/+/525677 [23:12:45] RoanKattouw: yeah I know it's not fully standard [23:13:04] Hey if CentralNotice is now using wmf.N instead of wmf_deploy that would make me super happy [23:13:11] Yeah it is [23:13:14] YAY [23:13:17] indeed [23:13:20] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Specify CentralAuth and OAuth session storage separately from per-wiki session storage (T227097, T227696) (duration: 00m 47s) [23:13:25] thanks to Tyler for the fix btw [23:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:29] T227696: OAuth extension uses session object store directly - https://phabricator.wikimedia.org/T227696 [23:13:29] T227097: Make sure that we're taking CentralAuth into consideration for staging release - https://phabricator.wikimedia.org/T227097 [23:14:07] Although... it looks like those branches branch from wmf_deploy instead of master? [23:14:21] RoanKattouw: basically we still just update the wmf_deploy CN branch when we want stuff to go out, and something automatic takes the latest commits from there instead of master when it makes the wmf.N branches [23:14:24] yes exactly [23:14:44] OK! That's nonstandard but in a more manageable way [23:15:08] yep [23:15:54] RoanKattouw: also, you'll get CI failures on all the new patches I cherry-picker to wmf_deploy *except* the last one [23:16:13] (03CR) 10jerkins-bot: [V: 04-1] table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [23:17:02] Ugh. Which is the one I need to cherry-pick to make CI pass? 
[23:17:09] The last one [23:17:22] " Banner history logger: remove loading of schema module" ? [23:17:27] There are actually 4 patches added on to wmf_deploy, but two are no-ops for production, just fixing tests, plus the other 2 that are mentioned in the Deployments page on wikitech [23:17:40] Well the one before that I think passes too [23:17:48] But that's the one to update the submodule pointer to [23:18:15] Hmm it doesn't really want to work that way but OK [23:19:12] I'm cherry-picking " Make CNDeviceTarget::addDeviceTarget() use DB_MASTER" in the hope that that one will pass CI cleanly [23:19:49] Yeah that should pass [23:19:52] Although if I need both that and the phan upgrade change, I might have to cherry-pick + force-merge the phan change first. We'll see what Jenkins says [23:20:05] yes also you'd need both of those [23:20:22] Sorry [23:20:27] yeah both are needed [23:22:12] I guess I was somehow thinking you'd just make the wmf.N branches point to the same SHA as the wmf_deploy branch [23:22:14] RoanKattouw: I need to revert that change [23:22:18] it's throwing errors in logstash [23:23:11] (and that would have just included everything, the prod stuff and the CI stuff) [23:26:06] (03PS1) 10Jdlrobson: Revert "Enable MobileWebUIActionsTracking schema with 50% sampling rate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526560 [23:26:14] ^ RoanKattouw sorry abotu that. Are you still around? [23:26:26] (03PS1) 10Dzahn: mediawiki: use a better notes_url for the "DSH groups" Icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/526561 (https://phabricator.wikimedia.org/T227547) [23:26:26] jdlrobson: OK, submit a revert to Gerrit and link it to me? 
[23:28:18] (03PS2) 10Catrope: Revert "Enable MobileWebUIActionsTracking schema with 50% sampling rate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526560 (owner: 10Jdlrobson) [23:28:23] (03CR) 10Catrope: [C: 03+2] Revert "Enable MobileWebUIActionsTracking schema with 50% sampling rate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526560 (owner: 10Jdlrobson) [23:29:23] (03Merged) 10jenkins-bot: Revert "Enable MobileWebUIActionsTracking schema with 50% sampling rate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526560 (owner: 10Jdlrobson) [23:29:38] (03CR) 10jenkins-bot: Revert "Enable MobileWebUIActionsTracking schema with 50% sampling rate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526560 (owner: 10Jdlrobson) [23:31:24] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert "Enable MobileWebUIActionsTracking schema with 50% sampling rate" (T220016) (duration: 00m 47s) [23:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:32] T220016: Create, and deploy working MobileWebUIActionsTracking schema - https://phabricator.wikimedia.org/T220016 [23:32:26] AndyRussG: Ugh, James explained to me that I need to do a bunch of force-merging. Give me a minute [23:32:50] (03PS5) 10Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [23:35:53] RoanKattouw: ok no worries [23:36:02] thanks Cam11598 [23:36:10] thanks RoanKattouw rather [23:36:17] (Catrope autocomplete fail) [23:36:30] I'll take credit [23:36:52] looks like https://grafana.wikimedia.org/d/000000566/overview?orgId=1&panelId=16&fullscreen is recovering :) [23:37:00] haha [23:37:01] You're welcome easiest thing I've ever done from the beach without a computer. 
What it is idk but I took credit for it [23:37:18] https://usercontent.irccloud-cdn.com/file/NDIQppWe/IMG_20190730_132155_042.jpg [23:38:09] If anyone needs me, I'll be with Cam11598 😉 [23:39:00] bpirkle: Monterey is amazing this time of year beat vacation ever 10/10 would recommend this resort [23:42:01] (03PS1) 10CRusnov: netbox: Fix swift CA errors. [puppet] - 10https://gerrit.wikimedia.org/r/526562 [23:42:59] (03CR) 10jerkins-bot: [V: 04-1] table-properties: Initial commit [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [23:43:19] (03CR) 10Gergő Tisza: [C: 03+1] Add MachineVision to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526541 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [23:43:45] (03CR) 10Gergő Tisza: [C: 03+1] Add wmgUseMachineVision default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526542 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [23:45:17] (03CR) 10Gergő Tisza: Enable MachineVision on (beta) commonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526543 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [23:45:47] (03CR) 10Gergő Tisza: [C: 03+1] Load MachineVision extension if enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526544 (https://phabricator.wikimedia.org/T227348) (owner: 10Mholloway) [23:46:15] OK, I got the wmf.N branches of CentralNotice to do what I want [23:46:43] Now just waiting for Jenkins to process the remaining patches [23:47:06] RoanKattouw: ah cool thanks! [23:48:45] RoanKattouw: I guess for future reference, for a SWAT deploy, I should cherry pick each patch added to wmf_deploy to the appropriate CentralNotice wmf.N branch(es), right? [23:48:56] Yes [23:49:01] and then no core Gerrit change to make?
[23:49:09] Today was weird because CI was broken on the wmf.N branches [23:49:23] yeah we had two simultaneous CI problems [23:49:23] But when that's not the case, then it works as follows [23:50:17] In the regular flow of things, commits get merged into master, then your team periodically merges/cherry-picks things into wmf_deployment, and every Tuesday a new wmf.N branch is created with what's in wmf_deployment at that time [23:51:24] yes that's how we've done it since the wmf.N branches were created [23:51:24] If you need to emergency-fix something, you'll first have to commit+merge it into master if it isn't there already, then cherry-pick it over into wmf_deploy if it isn't there already and merge that; then to prep the SWAT, create cherry-picks into wmf.N and wmf.{N-1} (this week: wmf.16 and wmf.15), and let the SWATter merge those [23:51:50] RoanKattouw: ok got it cool [23:51:59] And yeah submodule update changes aren't needed, they haven't been for a long time thanks to a nice new(ish) Gerrit feature [23:52:33] Ahhh ok nice [23:53:22] Also the protocol is that the requestor creates the cherry-picks to the wmf.N branches in Gerrit [23:54:10] okok [23:54:11] It's easy to do from the UI, but we make it the requestor's task, so that if the UI fails with a conflict, that 1) is discovered early enough and 2) is the requestor's problem to fix, not the deployer's :) [23:54:24] yeee makes sense [23:54:29] Also, your patches (all 8 of them) are live on mwdebug1002, please test :) [23:54:46] (03CR) 10CRusnov: "`" [puppet] - 10https://gerrit.wikimedia.org/r/526562 (owner: 10CRusnov) [23:54:57] So basically like https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Updating_the_deployment_branch just cherry-pick from wmf_deploy instead of master [23:54:59] ok testing [23:55:19] Exactly. The CentralNotice process works exactly like every other extension, just with s/master/wmf_deploy/g [23:58:23] RoanKattouw: lgtm! 
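Note on the emergency-fix flow RoanKattouw describes above (master, then wmf_deploy, then cherry-picks into wmf.N and wmf.{N-1}): a rough sketch of how a requestor might work out the branches to touch is given below. The function names and the version-string format passed in are hypothetical illustrations, not part of any Wikimedia tooling, and the full Gerrit branch names are deliberately not guessed at.

```python
import re
from typing import List


def swat_cherry_pick_targets(current_version: str) -> List[str]:
    """Return the two deployed versions a SWAT fix is cherry-picked to.

    Assumption: versions end in "-wmf.<N>" and the previous deployed
    version is simply N-1 (this week: wmf.16 and wmf.15). The first
    branch of a release series is not handled.
    """
    match = re.fullmatch(r"(.*wmf\.)(\d+)", current_version)
    if match is None:
        raise ValueError(f"not a wmf.N version: {current_version!r}")
    prefix, n = match.group(1), int(match.group(2))
    if n < 2:
        raise ValueError("no wmf.{N-1} within this series")
    return [f"{prefix}{n}", f"{prefix}{n - 1}"]


def emergency_fix_order(current_version: str) -> List[str]:
    """List branches in the order described in the log for CentralNotice."""
    return ["master", "wmf_deploy"] + swat_cherry_pick_targets(current_version)


if __name__ == "__main__":
    print(emergency_fix_order("1.34.0-wmf.16"))
```

Per the log, the requestor creates the last two cherry-picks in Gerrit and the SWATter merges them; for other extensions the same order applies with master in place of wmf_deploy.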
[23:59:18] OK, deploying [23:59:39] PROBLEM - HHVM rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [23:59:55] PROBLEM - Nginx local proxy to apache on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers