[00:00:04] Deploy window NO DEPLOYS (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190712T0000)
[00:01:01] RECOVERY - puppet last run on restbase1017 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[00:01:04] (CR) CDanis: "PCC looks good (noops on several different server types):" [puppet] - https://gerrit.wikimedia.org/r/522217 (owner: CDanis)
[00:01:25] !log eevans@deploy1001 Started deploy [cassandra/metrics-collector@df909a1]: deploy logback to restbase1017 (T222960)
[00:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:01:31] T222960: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960
[00:01:51] !log eevans@deploy1001 Finished deploy [cassandra/metrics-collector@df909a1]: deploy logback to restbase1017 (T222960) (duration: 00m 25s)
[00:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:02:45] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[00:02:51] (PS2) Dzahn: analytics: add notes_urls for cluster client and mysql [puppet] - https://gerrit.wikimedia.org/r/520959
[00:03:15] !log eevans@deploy1001 Started deploy [cassandra/metrics-collector@df909a1]: deploy logback to restbase1017 (T222960)
[00:03:18] !log eevans@deploy1001 Finished deploy [cassandra/metrics-collector@df909a1]: deploy logback to restbase1017 (T222960) (duration: 00m 03s)
[00:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:23] (CR) Dzahn: analytics: add notes_urls for cluster client and mysql (3 comments) [puppet] - https://gerrit.wikimedia.org/r/520959 (owner: Dzahn)
[00:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:57] (PS3) Dzahn: analytics: add notes_urls for cluster client and mysql [puppet] - https://gerrit.wikimedia.org/r/520959
[00:08:31] (PS4) Krinkle: Sort wmgMonologChannels alphabetically [mediawiki-config] - https://gerrit.wikimedia.org/r/521901
[00:08:38] RoanKattouw: all done?
[00:08:42] (just in case)
[00:08:44] Yes
[00:08:53] (PS4) Krinkle: Remove dead 'wmgMonologChannels' entries [mediawiki-config] - https://gerrit.wikimedia.org/r/521902
[00:09:02] (CR) Krinkle: [C: +2] Sort wmgMonologChannels alphabetically [mediawiki-config] - https://gerrit.wikimedia.org/r/521901 (owner: Krinkle)
[00:10:15] (Merged) jenkins-bot: Sort wmgMonologChannels alphabetically [mediawiki-config] - https://gerrit.wikimedia.org/r/521901 (owner: Krinkle)
[00:10:31] (CR) jenkins-bot: Sort wmgMonologChannels alphabetically [mediawiki-config] - https://gerrit.wikimedia.org/r/521901 (owner: Krinkle)
[00:11:02] * Krinkle staging on mwdebug1002
[00:11:41] (CR) Dzahn: [C: +2] analytics: add notes_urls for cluster client and mysql [puppet] - https://gerrit.wikimedia.org/r/520959 (owner: Dzahn)
[00:12:18] (CR) Krinkle: [C: +2] Remove dead 'wmgMonologChannels' entries [mediawiki-config] - https://gerrit.wikimedia.org/r/521902 (owner: Krinkle)
[00:13:04] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: f309856f0912 (duration: 00m 50s)
[00:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:17] (Merged) jenkins-bot: Remove dead 'wmgMonologChannels' entries [mediawiki-config] - https://gerrit.wikimedia.org/r/521902 (owner: Krinkle)
[00:13:32] (CR) jenkins-bot: Remove dead 'wmgMonologChannels' entries [mediawiki-config] - https://gerrit.wikimedia.org/r/521902 (owner: Krinkle)
[00:15:36] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: e4bd91f71b (duration: 00m 50s)
[00:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:13] * Krinkle done staging
[00:30:01] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:30:55] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked
by Puppet. Could be an interrupted request or a dependency cycle.
[00:31:03] (CR) Dzahn: [C: +2] contint: remove colordiff [puppet] - https://gerrit.wikimedia.org/r/517091 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[00:31:05] (PS2) Dzahn: contint: remove colordiff [puppet] - https://gerrit.wikimedia.org/r/517091 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[00:32:59] mutante: none of the other machines in the restbase cluster have ferm rules for 1017, that ring a bell?
[00:33:23] (PS2) Dzahn: contint: drop contint::tmpfs [puppet] - https://gerrit.wikimedia.org/r/517098 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[00:33:27] James_F: collabwiki / [5f8b962507b9c54dcfb48d68] /rpc/RunSingleJob.php MWUnknownContentModelException from line 266 of /srv/mediawiki/php-1.34.0-wmf.13/includes/content/ContentHandler.php: The content model 'Graph.JsonConfig' is not registered on this wiki.
[00:33:35] Looks like maybe too much was disabled?
[00:33:42] MaxSem: ^
[00:34:26] urandom: no. but earlier i was trying to find the gerrit change where it got disabled and didn't
[00:35:05] urandom: sounds like they would get created from a list in hiera though
[00:35:11] Krinkle: Hmm. Possibly CollabWiki was never configured to have JsonContent on but were using it?
[00:35:15] * James_F looks.
[00:35:23] yeah, and that list wouldn't have changed, would it?
[00:35:36] (CR) Dzahn: [C: +2] contint: drop contint::tmpfs [puppet] - https://gerrit.wikimedia.org/r/517098 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[00:36:12] urandom: i kind of expected a change that removed it / commented it before this IP move started
[00:36:26] Krinkle: `$wmgUseGraph` is default true, which is the only thing that gates `wfLoadExtension( 'JsonConfig' );`
[00:38:06] James_F: Used to be gated by ZeroBanner as well, which was on for all wikipedia.dblist
[00:38:23] Which CollabWiki isn't?
[00:38:42] maybe.
special/wiki is... special.
[00:38:57] Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (Papaul) p:Triage→Normal a:Papaul
[00:38:59] urandom: what port do they need open?
[00:39:02] But anyway, if anything we've expanded the places that get JsonConfig.
[00:39:12] mutante: umm... a bunch
[00:39:13] In any event, there are pages on there that have that db content model specified, which is only possible if it was used there, and it can only be broken now if it isn't enabled anymore.
[00:39:23] still looking for more details..
[00:39:30] Please file a UBN.
[00:40:00] (PS6) Smalyshev: Set up dumps for mediainfo RDF generation [puppet] - https://gerrit.wikimedia.org/r/516444 (https://phabricator.wikimedia.org/T221917)
[00:40:43] urandom: looking. i see tcp dpts:7199:7202
[00:41:08] (PS2) Dzahn: contint: remove MySQL related configuration [puppet] - https://gerrit.wikimedia.org/r/517095 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[00:42:53] urandom: it's $cassandra_hosts_ferm and that comes from $ferm_seeds and that from $all_seeds and that from profile/cassandra/seeds.erb
[00:43:53] and that uses $all_instance.each do ...
[00:44:02] @all_instances
[00:44:38] finally .. $all_instances = hiera('profile::cassandra::instances'),
[00:45:54] but ..
it's in the list
[00:47:04] (CR) Dzahn: [C: +2] contint: remove MySQL related configuration [puppet] - https://gerrit.wikimedia.org/r/517095 (https://phabricator.wikimedia.org/T225735) (owner: Hashar)
[00:50:12] !log restbase1018 - restart ferm service
[00:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:30] urandom: restarting ferm service makes it appear :) see 1018 now
[00:50:48] so puppet needs to be told to do that
[00:51:16] i noticed it appeared already in /etc/ferm/conf.d but not in iptables -L
[00:51:54] can you do systemctl restart ferm or is it too many hosts
[00:53:07] mutante: no, I can do it
[00:53:26] odd though, did it not restart to remove them?
[00:53:44] yea. i don't know, i wasn't here when it got removed
[00:53:51] maybe somebody used cumin
[00:54:06] I'm not going to lock myself out if I restart, am I? :)
[00:54:17] or the puppet issue is just when adding new stuff
[00:54:35] nm, seems fine
[00:54:47] urandom: nah, i am still good on 1018 where i did it
[00:55:00] you can examine /etc/ferm/conf.d/ if you want
[00:55:16] 10_bastion-ssh should be unchanged and first
[00:56:31] RECOVERY - Restbase root url on restbase1017 is OK: HTTP OK: HTTP/1.1 200 - 16254 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[00:56:53] :)
[00:57:06] that looks good. i need to run though. kind of getting kicked out and low battery
[00:57:19] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[00:57:24] mutante: done restarting, thanks for your help
[00:57:30] you can try ssh in a second shell before disconnecting the first. and worst case if you stop ferm service it opens it again
[00:57:36] you're welcome.
cu later
[00:58:58] !log bootstrapping restbase1017-a -- T222960
[00:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:03] T222960: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960
[00:59:49] !log mw1342 is generating strange PHP errors (php7 only), ref T224491
[00:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:59:54] T224491: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491
[01:01:42] !log mw1342 generated some ~ 11,500 additional PHP errors over a 4 hour period (18:00-22:30 UTC), ref T224491
[01:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:02:05] (PS1) CDanis: WIP: dbctl [puppet] - https://gerrit.wikimedia.org/r/522227
[01:03:02] Operations, Wikimedia-production-error (Shared Build Failure): Everything fails with unable to load the docker file - https://phabricator.wikimedia.org/T227833 (DannyS712) Sorry, didn't mean to add that
[01:04:52] (PS2) CDanis: WIP: dbctl [puppet] - https://gerrit.wikimedia.org/r/522227
[01:05:51] (CR) jerkins-bot: [V: -1] WIP: dbctl [puppet] - https://gerrit.wikimedia.org/r/522227 (owner: CDanis)
[01:08:06] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/522227 (owner: CDanis)
[01:11:54] (CR) jerkins-bot: [V: -1] WIP: dbctl [puppet] - https://gerrit.wikimedia.org/r/522227 (owner: CDanis)
[01:20:55] RECOVERY - cassandra-a CQL 10.64.16.126:9042 on restbase1017 is OK: TCP OK - 0.000 second response time on 10.64.16.126 port 9042 https://phabricator.wikimedia.org/T93886
[01:48:40] Operations, Cassandra, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 4 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (Eevans) Current snafu: None of data volumes are mounted (entries are
missing from fstab). @...
[02:06:41] Operations, ops-codfw, netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (ayounsi) T84333 is when the msw1<->cr connection has been made. The other option would be to add a 10G extension module to msw1 (EX-UM-4X4SFP), @papaul, do you have any spares? As the MPC3D doesn't suppor...
[02:13:34] Operations, ops-eqiad, ops-ulsfo, DC-Ops: Audit down ports - https://phabricator.wikimedia.org/T218751 (ayounsi) Note that from https://librenms.wikimedia.org/ports/state=down/hostname=asw/format=list_basic/ there are still & new down ports on A/B/C.
[02:13:51] PROBLEM - Disk space on restbase1017 is CRITICAL: DISK CRITICAL - free space: / 1016 MB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[02:21:12] (PS3) CDanis: WIP: dbctl [puppet] - https://gerrit.wikimedia.org/r/522227
[02:33:03] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[02:36:27] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 97315800 and 8 seconds
[02:37:29] (PS4) CDanis: dbctl: install package on cluster management hosts [puppet] - https://gerrit.wikimedia.org/r/522227
[02:42:19] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 42986272 and 3 seconds
[02:43:13] (CR) CDanis: "PCC looks proper" [puppet] - https://gerrit.wikimedia.org/r/522227 (owner: CDanis)
[02:47:39] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17097680 and 1 seconds
[02:48:13] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 844408 and 0 seconds
[02:49:09] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 132744 and 35 seconds
[02:49:25] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[03:08:26] Operations, ops-codfw, netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (Papaul) @ayounsi no we do not have any EX-UM-4X4SFP spares onsite. we will have to order some like we did when we did the switch re-fresh. All switches in rack A8, B8 and D8 didn't have the 10GB Uplink Mo...
[04:48:26] (PS1) CDanis: dbctl: add support for --version [software/conftool] - https://gerrit.wikimedia.org/r/522235
[04:50:39] (CR) CDanis: "% dbctl --version" [software/conftool] - https://gerrit.wikimedia.org/r/522235 (owner: CDanis)
[05:09:15] Operations, Traffic: Wikipedia is unavailable on Symbian phone's browsers - https://phabricator.wikimedia.org/T227828 (Ft1978Bp) It is a Nokia C7-00 with Symbian^3.
(The Opera Mobile version is 12.00.2256)
[05:45:09] !log sudo -i /usr/local/sbin/restart-php7.2-fpm on mwdebug* to clear opcache
[05:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:35] (CR) Vgutierrez: [C: +2] Release 0.19 [software/acme-chief] - https://gerrit.wikimedia.org/r/521834 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[05:50:40] (CR) jenkins-bot: Release 0.19 [software/acme-chief] - https://gerrit.wikimedia.org/r/521834 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[05:57:16] Operations, ops-eqiad, Analytics: Broken disk on analytics1072 - https://phabricator.wikimedia.org/T226467 (elukey) For some reason the disk doesn't show as failed by megacli but: ` elukey@analytics1072:~$ ls /var/lib/hadoop/data/b ls: reading directory '/var/lib/hadoop/data/b': Input/output error `
[06:08:32] (PS1) Vgutierrez: acme_chief: Avoid retrying too eagerly on CERTIFICATE_STAGED status [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/522237 (https://phabricator.wikimedia.org/T225945)
[06:08:34] (PS1) Vgutierrez: Release 0.19 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/522238 (https://phabricator.wikimedia.org/T225945)
[06:08:36] (PS1) Vgutierrez: debian: Add release 0.19 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/522239 (https://phabricator.wikimedia.org/T225945)
[06:12:23] (CR) Vgutierrez: [C: +2] acme_chief: Avoid retrying too eagerly on CERTIFICATE_STAGED status [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/522237 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[06:12:31] (CR) Vgutierrez: [C: +2] Release 0.19 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/522238 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[06:15:04] (Merged) jenkins-bot: acme_chief: Avoid retrying too eagerly on
CERTIFICATE_STAGED status [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/522237 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[06:15:16] (Merged) jenkins-bot: Release 0.19 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/522238 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[06:17:20] (CR) Vgutierrez: [C: +2] debian: Add release 0.19 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/522239 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[06:17:47] (CR) jenkins-bot: acme_chief: Avoid retrying too eagerly on CERTIFICATE_STAGED status [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/522237 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[06:18:00] (CR) jenkins-bot: Release 0.19 [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/522238 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[06:20:03] (Merged) jenkins-bot: debian: Add release 0.19 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/522239 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[06:22:52] (CR) jenkins-bot: debian: Add release 0.19 to changelog [software/acme-chief] (debian) - https://gerrit.wikimedia.org/r/522239 (https://phabricator.wikimedia.org/T225945) (owner: Vgutierrez)
[06:26:24] (CR) Jcrespo: [C: +1] "This has 2 blockers, manual deploy (REVOKE) and a decision of what to do with existing data, if it has to be deleted or archived."
[puppet] - https://gerrit.wikimedia.org/r/502172 (https://phabricator.wikimedia.org/T198939) (owner: Dzahn)
[06:27:50] (CR) Jcrespo: [C: +1] mariadb::ferm_misc: remove firewall rule for servermon [puppet] - https://gerrit.wikimedia.org/r/502176 (https://phabricator.wikimedia.org/T198939) (owner: Dzahn)
[06:28:46] !log uploaded acme-chief 0.19 to apt.wikimedia.org (buster) - T225945
[06:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:51] T225945: acme-chief staging time not working as expected - https://phabricator.wikimedia.org/T225945
[06:30:10] (CR) Jcrespo: "The reason this doesn't exist is that this is only a "theoretical" role, and it is not implemented anywhere on production." [puppet] - https://gerrit.wikimedia.org/r/521382 (owner: Dzahn)
[06:30:21] PROBLEM - puppet last run on theemin is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:31:07] PROBLEM - puppet last run on scandium is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:31:39] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:35:11] !log upgrading acme-chief to version 0.19 in acme-chief test instances - T225945
[06:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:16] T225945: acme-chief staging time not working as expected - https://phabricator.wikimedia.org/T225945
[06:36:40] Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (jcrespo) See also recent T217755
[06:37:05] Operations, DBA: Predictive failures on disk S.M.A.R.T.
status - https://phabricator.wikimedia.org/T208323 (jcrespo)
[06:41:13] (CR) Volans: "It would be nice to have --version on all binaries, just yesterday I tried on confctl and noticed it was not there." [software/conftool] - https://gerrit.wikimedia.org/r/522235 (owner: CDanis)
[06:42:16] (CR) Muehlenhoff: "The data is ephemeral (facts from currently running systems) and doesn't need to be archived, the database can be deleted." [puppet] - https://gerrit.wikimedia.org/r/502172 (https://phabricator.wikimedia.org/T198939) (owner: Dzahn)
[06:43:59] (CR) Volans: "LGTM, I'm just wondering if we should add the conftool-data to this same CR" [puppet] - https://gerrit.wikimedia.org/r/522227 (owner: CDanis)
[06:45:28] (CR) Muehlenhoff: "That server and the entry in site.pp should stick around until the MySQL grant is deployed, otherwise some other server might pick up 10.6" [puppet] - https://gerrit.wikimedia.org/r/502171 (https://phabricator.wikimedia.org/T198939) (owner: Dzahn)
[06:55:32] (CR) Muehlenhoff: [C: +1] mariadb::ferm_misc: remove firewall rule for servermon [puppet] - https://gerrit.wikimedia.org/r/502176 (https://phabricator.wikimedia.org/T198939) (owner: Dzahn)
[06:57:35] RECOVERY - puppet last run on theemin is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:19] RECOVERY - puppet last run on scandium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:21] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/502174 (https://phabricator.wikimedia.org/T198939) (owner: Dzahn)
[06:58:53] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:30] Operations, ops-eqiad, Analytics, Analytics-EventLogging, DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (jcrespo) Chris, you will need
to coordinate with @elukey principally, as he is the person in touch directly with users affected to...
[07:03:14] Operations, ops-eqiad, Analytics, Analytics-EventLogging, DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (elukey) We can do it anytime with 10/15 mins of heads up Chris (I need to stop replication and traffic to db1107 before you can op...
[07:11:04] Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (jcrespo) @Papaul We will ask you to replace a disk here from T226406, when they arrive.
[07:11:20] Operations, ops-codfw: Degraded RAID on db2044 - https://phabricator.wikimedia.org/T227829 (jcrespo)
[07:12:28] (PS1) Muehlenhoff: Fix Cumin success_threshold in debdeploy [debs/debdeploy] - https://gerrit.wikimedia.org/r/522353
[07:29:03] (CR) Jcrespo: [C: +1] "Let's schedule a time then to do this (DBA + someone that can check nothing breaks at app level), preferably next week." [puppet] - https://gerrit.wikimedia.org/r/502172 (https://phabricator.wikimedia.org/T198939) (owner: Dzahn)
[07:30:33] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2044 is CRITICAL: cluster=mysql device=cciss,11 instance=db2044:9100 job=node site=codfw Jcrespo https://phabricator.wikimedia.org/T227829 https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2044&var-datasource=codfw+prometheus/ops
[07:32:25] (CR) Elukey: Introduce openldap_config in hiera (1 comment) [puppet] - https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: Elukey)
[07:37:19] Operations, ops-codfw, serviceops, User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (MoritzMuehlenhoff) We don't use a lot of disk space on mw servers, let's go with option 2.
[07:46:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[07:47:05] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5
[07:47:42] lovely
[07:47:51] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:47:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:49:59] failed fetches in esams https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-1h&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend
[07:50:04] not see anything strange in eqiad
[07:50:10] *seeing
[07:50:19] yeah, and it's all backends at the same time
[07:50:36] errors going down now
[07:51:01] the big spike seems to have cp3043 ints mostly
[07:52:15] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:52:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:56:33] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-cache_type=varnish-text&var-status_type=5
[07:57:17] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?var-site=esams&var-status_type=5
[07:59:51] (CR) Muehlenhoff: "How about:" [puppet] - https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: Elukey)
[08:04:53] PROBLEM - cassandra-a CQL 10.64.16.126:9042 on restbase1017 is CRITICAL: connect to address 10.64.16.126 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[08:04:57] PROBLEM - cassandra-a service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:05:13] PROBLEM - Check systemd state on restbase1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:06:03] PROBLEM - cassandra-a SSL 10.64.16.126:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[08:06:03] iirc restbase1017 is not in prod, yes?
[08:06:21] it's being reimaged, apparently the downtime expired
[08:06:43] it is bootstrapping, see -sre
[08:06:44] I'll extend it for another 24h
[08:06:53] +1 thanks
[08:07:06] ah, yes
[08:07:41] Operations, Cassandra, serviceops, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 4 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (fgiunchedi) >>! In T222960#5327124, @Eevans wrote: > Current snafu: None of data volumes are...
[08:07:48] thanks !
[08:08:17] Operations, Cassandra, RESTBase, RESTBase-Cassandra, and 3 others: Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (MoritzMuehlenhoff)
[08:09:03] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[08:09:36] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[08:13:39] (CR) Hashar: "CI being added with https://gerrit.wikimedia.org/r/#/c/integration/config/+/522358/ :)" [software/varnish/libvmod-uuid] (debian) - https://gerrit.wikimedia.org/r/508524 (https://phabricator.wikimedia.org/T221977) (owner: Ema)
[08:17:00] (PS2) Muehlenhoff: Remove Diamond from production hosts [puppet] - https://gerrit.wikimedia.org/r/522075 (https://phabricator.wikimedia.org/T212231)
[08:18:46] !log enable puppet on mw1222
[08:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:50] (CR) Volans: [C: +1] "LGTM, thanks for the fix!"
[debs/debdeploy] - https://gerrit.wikimedia.org/r/522353 (owner: Muehlenhoff)
[08:25:11] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (wiki_willy) a:Cmjohnson
[08:30:01] RECOVERY - Check systemd state on restbase1017 is OK: OK - running: The system is fully operational
[08:30:10] (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [software/varnish/libvmod-uuid] (debian) - https://gerrit.wikimedia.org/r/522365
[08:31:45] (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [software/service-checker] (jessie) - https://gerrit.wikimedia.org/r/522366
[08:32:48] (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [software/service-checker] (stretch) - https://gerrit.wikimedia.org/r/522367
[08:33:59] (CR) Hashar: "recheck" [software/python-poolcounter] - https://gerrit.wikimedia.org/r/517828 (owner: Giuseppe Lavagetto)
[08:34:23] PROBLEM - Check systemd state on restbase1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:34:30] (CR) Muehlenhoff: [V: +2 C: +2] Fix Cumin success_threshold in debdeploy [debs/debdeploy] - https://gerrit.wikimedia.org/r/522353 (owner: Muehlenhoff)
[08:35:24] (CR) Hashar: "recheck" [debs/python-git-archive-all] - https://gerrit.wikimedia.org/r/503156 (owner: markahershberger)
[08:35:52] (CR) jerkins-bot: [V: -1] New library to interact with poolcounter from python [software/python-poolcounter] - https://gerrit.wikimedia.org/r/517828 (owner: Giuseppe Lavagetto)
[08:36:41] (PS1) Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/helmfile] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/522369
[08:38:03] (CR) Muehlenhoff: [C: +2] Remove Diamond from production hosts [puppet] - https://gerrit.wikimedia.org/r/522075 (https://phabricator.wikimedia.org/T212231) (owner: Muehlenhoff)
[08:47:02] (CR) Filippo Giunchedi: [C: +1] "LGTM!"
[debs/prometheus-varnishkafka-exporter] - https://gerrit.wikimedia.org/r/521580 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite)
[08:50:16] Operations, Traffic, observability, User-fgiunchedi: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668 (fgiunchedi)
[08:52:01] PROBLEM - Check systemd state on cloudvirt1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:53:29] RECOVERY - Check systemd state on cloudvirt1013 is OK: OK - running: The system is fully operational
[08:54:01] cloudvirt1013 is a race which sometimes hits when Diamond gets removed from a host
[08:55:07] PROBLEM - puppet last run on cloudvirt1013 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond]
[08:57:43] PROBLEM - Check systemd state on ldap-codfw-replica02 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:59:36] Operations, Code-Stewardship-Reviews, Graphoid, Core Platform Team Backlog (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (Bouzinac) Excuse me, I'm a simple guy : I'd like to know why a direct javascript (vega based) isn't possi...
[09:00:23] RECOVERY - cassandra-a service on restbase1017 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:00:31] PROBLEM - Check systemd state on cloudvirt1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:00:59] PROBLEM - puppet last run on ldap-codfw-replica02 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond]
[09:01:07] PROBLEM - puppet last run on cloudvirt1026 is CRITICAL: CRITICAL: Puppet has 2 failures.
Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond] [09:01:17] PROBLEM - puppet last run on labpuppetmaster1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:01:25] PROBLEM - DPKG on schema1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:01:31] PROBLEM - puppet last run on schema1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond] [09:02:17] PROBLEM - Check systemd state on schema1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:03:15] PROBLEM - Check systemd state on db1073 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:03:40] somewhat poetic of diamond to get notifications as last thing before getting decom [09:04:32] (03PS1) 10Muehlenhoff: Rename DNS entries for LDAP replicas in codfw to be more in line with our server naming policies [dns] - 10https://gerrit.wikimedia.org/r/522383 (https://phabricator.wikimedia.org/T227778) [09:04:47] PROBLEM - cassandra-a service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:05:39] PROBLEM - puppet last run on db1073 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[diamond],Package[python-diamond] [09:05:47] RECOVERY - DPKG on schema1002 is OK: All packages OK [09:07:39] RECOVERY - Check systemd state on db1073 is OK: OK - running: The system is fully operational [09:07:59] RECOVERY - Check systemd state on ldap-codfw-replica02 is OK: OK - running: The system is fully operational [09:09:17] RECOVERY - Check systemd state on cloudvirt1026 is OK: OK - running: The system is fully operational [09:10:09] PROBLEM - puppet last run on labpuppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:11:05] RECOVERY - puppet last run on db1073 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:11:55] RECOVERY - puppet last run on ldap-codfw-replica02 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:12:03] PROBLEM - puppet last run on cloudvirt1024 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond] [09:12:03] RECOVERY - puppet last run on cloudvirt1026 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:12:31] PROBLEM - Check systemd state on cloudvirt1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:12:43] PROBLEM - Check systemd state on cloudvirt1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:13:09] PROBLEM - Check systemd state on lvs2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:13:59] RECOVERY - Check systemd state on cloudvirt1020 is OK: OK - running: The system is fully operational [09:14:11] RECOVERY - Check systemd state on cloudvirt1024 is OK: OK - running: The system is fully operational [09:14:31] PROBLEM - puppet last run on lvs2009 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[diamond],Package[python-diamond] [09:17:29] RECOVERY - puppet last run on cloudvirt1024 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:21:59] RECOVERY - Check systemd state on lvs2009 is OK: OK - running: The system is fully operational [09:22:27] RECOVERY - puppet last run on cloudvirt1013 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:24:13] RECOVERY - Check systemd state on schema1002 is OK: OK - running: The system is fully operational [09:25:23] RECOVERY - puppet last run on lvs2009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:28:53] RECOVERY - puppet last run on schema1002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [09:29:39] (03CR) 10Elukey: "> How about:" [puppet] - 10https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [09:35:38] (03PS1) 10Hashar: dpkg-source to ignore .gitreview file [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/522387 [09:35:50] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/522369 (owner: 10Hashar) [09:38:23] (03PS1) 10Hashar: Add uploader full name [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/522389 [09:39:37] (03CR) 10Hashar: "Else dpkg-source complains:" [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/522387 (owner: 10Hashar) [09:40:33] PROBLEM - Host 
ldap-codfw-replica01 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:33] (03CR) 10Hashar: "There are other lintian warnings but CI does not enforce them. With this change and the parent change, the build pass and would let us m" [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/522389 (owner: 10Hashar) [09:43:01] !log shut down ldap-codfw-replica01/ldap-codfw-replica02 (pending reimage) [09:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:34] (03CR) 10Fsero: [C: 03+2] dpkg-source to ignore .gitreview file [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/522387 (owner: 10Hashar) [09:43:37] (03CR) 10Hashar: "filled as T227859" [software/service-checker] (stretch) - 10https://gerrit.wikimedia.org/r/522367 (owner: 10Hashar) [09:43:41] (03CR) 10Fsero: [C: 03+2] Add uploader full name [debs/helmfile] (debian/buster-wikimedia) - 10https://gerrit.wikimedia.org/r/522389 (owner: 10Hashar) [09:44:30] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [software/varnish/libvmod-uuid] (debian) - 10https://gerrit.wikimedia.org/r/522365 (owner: 10Hashar) [09:48:14] 10Operations, 10Analytics, 10Traffic: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (10ema) [09:48:25] 10Operations, 10Analytics, 10Traffic: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (10ema) p:05Triage→03Normal [09:52:30] 10Operations, 10Analytics, 10Traffic, 10User-Elukey: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (10elukey) [09:57:55] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/502177 (https://phabricator.wikimedia.org/T220355) (owner: 10Dzahn) [10:04:02] (03PS1) 10Filippo Giunchedi: wmcs: swap check_graphite for Diamond metrics with check_prometheus [puppet] - 10https://gerrit.wikimedia.org/r/522397 
(https://phabricator.wikimedia.org/T212231) [10:04:53] (03CR) 10jerkins-bot: [V: 04-1] wmcs: swap check_graphite for Diamond metrics with check_prometheus [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) (owner: 10Filippo Giunchedi) [10:06:38] 10Operations, 10ops-codfw, 10DBA: db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10jcrespo) [10:07:00] 10Operations, 10ops-codfw, 10DBA: db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10jcrespo) I will try to force a relearn process/reboot, in case that works. [10:08:21] (03PS2) 10Filippo Giunchedi: wmcs: swap check_graphite for Diamond metrics with check_prometheus [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) [10:09:09] (03CR) 10jerkins-bot: [V: 04-1] wmcs: swap check_graphite for Diamond metrics with check_prometheus [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) (owner: 10Filippo Giunchedi) [10:09:55] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:10:59] (03PS3) 10Filippo Giunchedi: wmcs: swap check_graphite for Diamond metrics with check_prometheus [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) [10:14:31] 10Operations, 10ops-codfw, 10DBA: db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10jcrespo) Based on https://noc.wikimedia.org/conf/highlight.php?file=db-codfw.php&1 and T184888 I will switchover codfw master to db2069. 
[10:14:50] (03PS1) 10Ema: cache: add role::cache::text_ats [puppet] - 10https://gerrit.wikimedia.org/r/522398 (https://phabricator.wikimedia.org/T227432) [10:19:13] (03PS4) 10Filippo Giunchedi: wmcs: swap check_graphite for Diamond metrics with check_prometheus [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) [10:20:09] (03PS1) 10Jcrespo: switchover.py: Fix small formatting bug when printing ROW format [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/522400 [10:20:24] (03PS2) 10Jcrespo: switchover.py: Fix small formatting bug when printing ROW format [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/522400 [10:22:44] (03CR) 10Jcrespo: [C: 03+2] switchover.py: Fix small formatting bug when printing ROW format [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/522400 (owner: 10Jcrespo) [10:23:50] !log switchover x1 codfw master from db2045 to db2069 [10:23:52] (03PS3) 10Fsero: helmfile.d: adding eqiad,codfw admin helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/522098 (https://phabricator.wikimedia.org/T212130) [10:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:03] (03CR) 10Jbond: "looks good but a policy/styling question" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522217 (owner: 10CDanis) [10:24:22] !log switchover x1 codfw master from db2045 to db2069 T227862 [10:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:27] T227862: db2045 failed battery - https://phabricator.wikimedia.org/T227862 [10:27:33] 10Operations, 10ops-codfw, 10DBA: db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10jcrespo) Everything went well except: ` Updating tendril... [WARNING] Old master not found on tendril server list Updating zarcillo... 
[WARNING] Old master not found on zarcillo master list ` [10:27:50] 10Operations, 10ops-codfw, 10DBA: db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10jcrespo) p:05Triage→03Normal [10:30:12] (03PS3) 10Jbond: hiera backends: update hiera.yaml file to work with puppet 4.9 [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) [10:31:23] (03PS1) 10Jcrespo: mariadb: Promote db2069 to be the new x1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/522403 (https://phabricator.wikimedia.org/T227862) [10:32:07] (03CR) 10Jcrespo: [C: 03+2] mariadb: Promote db2069 to be the new x1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/522403 (https://phabricator.wikimedia.org/T227862) (owner: 10Jcrespo) [10:32:14] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/502176 (https://phabricator.wikimedia.org/T198939) (owner: 10Dzahn) [10:34:12] (03CR) 10Jbond: [C: 03+1] "all fine to go from my point of view" [puppet] - 10https://gerrit.wikimedia.org/r/502174 (https://phabricator.wikimedia.org/T198939) (owner: 10Dzahn) [10:35:11] (03PS1) 10Muehlenhoff: Remove CherryPickCounter Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/522405 (https://phabricator.wikimedia.org/T210993) [10:35:51] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/502172 (https://phabricator.wikimedia.org/T198939) (owner: 10Dzahn) [10:37:55] (03CR) 10Jbond: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/522383 (https://phabricator.wikimedia.org/T227778) (owner: 10Muehlenhoff) [10:42:09] (03PS5) 10Filippo Giunchedi: wmcs: swap check_graphite for Diamond metrics with check_prometheus [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) [10:42:19] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/522405 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [10:42:22] (03CR) 
10Muehlenhoff: [C: 03+2] Rename DNS entries for LDAP replicas in codfw to be more in line with our server naming policies [dns] - 10https://gerrit.wikimedia.org/r/522383 (https://phabricator.wikimedia.org/T227778) (owner: 10Muehlenhoff) [10:44:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522405 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [10:44:48] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 3 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10dr0ptp4kt) Hi @Bouzinac, usually the software tries to make basic web page rendering functional without a... [10:46:44] (03PS6) 10Filippo Giunchedi: wmcs: swap check_graphite for Diamond metrics with check_prometheus [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) [10:47:59] (03PS1) 10Muehlenhoff: Remove jessie support from Kafka class [puppet] - 10https://gerrit.wikimedia.org/r/522406 [10:48:44] (03CR) 10Muehlenhoff: Remove CherryPickCounter Diamond collector (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522405 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [10:50:32] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/17346/" [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) (owner: 10Filippo Giunchedi) [10:54:34] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (untested)" [puppet] - 10https://gerrit.wikimedia.org/r/522101 (https://phabricator.wikimedia.org/T227779) (owner: 10Jbond) [10:55:58] (03PS1) 10Jcrespo: mariadb: Promote db2069 to be the new x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522409 (https://phabricator.wikimedia.org/T227862) [10:56:27] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [software/logstash-logback-encoder] - 
10https://gerrit.wikimedia.org/r/522218 (https://phabricator.wikimedia.org/T222960) (owner: 10Eevans) [10:57:46] (03PS2) 10Jcrespo: mariadb: Promote db2069 to be the new x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522409 (https://phabricator.wikimedia.org/T227862) [11:00:15] RECOVERY - cassandra-a service on restbase1017 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:00:21] RECOVERY - Check systemd state on restbase1017 is OK: OK - running: The system is fully operational [11:03:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) (owner: 10Filippo Giunchedi) [11:04:35] PROBLEM - cassandra-a service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:04:41] PROBLEM - Check systemd state on restbase1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:04:43] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) (owner: 10Filippo Giunchedi) [11:05:33] (03CR) 10Jcrespo: [C: 03+2] mariadb: Promote db2069 to be the new x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522409 (https://phabricator.wikimedia.org/T227862) (owner: 10Jcrespo) [11:06:51] (03Merged) 10jenkins-bot: mariadb: Promote db2069 to be the new x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522409 (https://phabricator.wikimedia.org/T227862) (owner: 10Jcrespo) [11:07:25] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10jijiki) [11:07:32] (03CR) 10jenkins-bot: mariadb: Promote db2069 to be the new x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522409 (https://phabricator.wikimedia.org/T227862) (owner: 10Jcrespo) [11:08:22] (03PS4) 10Fsero: helmfile.d: adding eqiad,codfw admin helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/522098 (https://phabricator.wikimedia.org/T212130) [11:09:00] (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile.d: adding eqiad,codfw admin helmfiles [deployment-charts] - 10https://gerrit.wikimedia.org/r/522098 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [11:09:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10jijiki) p:05Triage→03Normal [11:09:19] !log jynus@deploy1001 Synchronized wmf-config/db-codfw.php: Switchover db2045 x1 codfw master to db2069 (duration: 00m 51s) [11:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:35] (03CR) 10Fsero: [V: 03+2 C: 03+2] "removed graphoid mentions!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/522098 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [11:09:53] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 16 ge 4 Effie Mouzeli Task filed: T227867 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops [11:11:25] (03PS2) 10Muehlenhoff: Remove CherryPickCounter Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/522405 (https://phabricator.wikimedia.org/T210993) [11:13:26] 10Operations, 10ops-codfw, 10DBA: db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10jcrespo) >>! In T227862#5327825, @jcrespo wrote: > Everything went well except: > ` > Updating tendril... > [WARNING] Old master not found on tendril server list > Updating zarcillo... > [WARNING] Old maste... [11:14:21] (03CR) 10Muehlenhoff: [C: 03+2] Remove CherryPickCounter Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/522405 (https://phabricator.wikimedia.org/T210993) (owner: 10Muehlenhoff) [11:16:08] 10Operations, 10ops-codfw, 10DBA: db2045 failed battery - https://phabricator.wikimedia.org/T227862 (10jcrespo) ` root@db1115.eqiad.wmnet[zarcillo]> update masters set instance='db2069' where section='x1' and dc='codfw'; Query OK, 1 row affected (0.00 sec) Rows matched: 1 Changed: 1 Warnings: 0 root@db111... 
[11:20:51] (03PS1) 10Muehlenhoff: Drop jessie support from tor class [puppet] - 10https://gerrit.wikimedia.org/r/522415 [11:21:03] RECOVERY - puppet last run on labpuppetmaster1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [11:23:03] RECOVERY - puppet last run on labpuppetmaster1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:28:04] (03PS1) 10Muehlenhoff: Remove support for jessie from Phabricator classes [puppet] - 10https://gerrit.wikimedia.org/r/522420 [11:28:55] (03CR) 10jerkins-bot: [V: 04-1] Remove support for jessie from Phabricator classes [puppet] - 10https://gerrit.wikimedia.org/r/522420 (owner: 10Muehlenhoff) [11:35:59] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [11:36:16] (03CR) 10Muehlenhoff: "| I'd go for a slight variation, namely:" [puppet] - 10https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [11:38:31] (03CR) 10Hashar: "recheck" [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517979 (owner: 10Giuseppe Lavagetto) [11:40:18] (03CR) 10jerkins-bot: [V: 04-1] Add debian package build [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/517979 (owner: 10Giuseppe Lavagetto) [11:46:55] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [11:50:58] (03PS1) 10Muehlenhoff: Remove sudo rule for removed PowerDNSRecursor Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/522426 [11:52:50] (03PS1) 10Hashar: dpkg-source to ignore .gitreview file [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/522428 [11:53:24] (03CR) 10Gehel: [C: 04-1] cookbook API: add class API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) 
(owner: 10Volans) [11:53:49] (03CR) 10Hashar: "That is to prevent:" [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/522428 (owner: 10Hashar) [11:54:03] (03CR) 10Hashar: "And that make the debian-glue Jenkins job to pass :-]" [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/522428 (owner: 10Hashar) [12:00:03] RECOVERY - cassandra-a service on restbase1017 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:01:00] !log recreating termbox staging namespace T227775 [12:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:06] T227775: recreate staging cluster namespaces using helmfile - https://phabricator.wikimedia.org/T227775 [12:04:20] (03CR) 10Gehel: [C: 04-1] cookbook API: add class API (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [12:04:25] PROBLEM - cassandra-a service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:05:49] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' . [12:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:58] !log recreating citoid staging namespace T227775 [12:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:04] T227775: recreate staging cluster namespaces using helmfile - https://phabricator.wikimedia.org/T227775 [12:08:45] 10Operations, 10Traffic, 10CommRel-Specialists-Support (Apr-Jun-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Elitre) This was published in https:... 
[12:09:06] 10Operations, 10Traffic, 10CommRel-Specialists-Support (Jul-Sep-2019), 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10Elitre) [12:10:14] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' . [12:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:37] !log recreating sessionstore,cxserver and mathoid staging namespaces T227775 [12:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:26] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [12:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:29] (03PS1) 10Muehlenhoff: Sync docker packages to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/522434 (https://phabricator.wikimedia.org/T226236) [12:17:59] (03CR) 10jerkins-bot: [V: 04-1] Sync docker packages to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/522434 (https://phabricator.wikimedia.org/T226236) (owner: 10Muehlenhoff) [12:18:40] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [12:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:46] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [12:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:59] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . [12:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:12] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'mathoid' for release 'staging' . 
[12:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:24] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'mathoid' for release 'staging' . [12:22:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:56] !log recreating eventgate-* and blubberoid staging namespaces T227775 [12:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:01] T227775: recreate staging cluster namespaces using helmfile - https://phabricator.wikimedia.org/T227775 [12:29:00] (03PS1) 10Muehlenhoff: Remove standard::diamond and fold into profile::wmcs::instance [puppet] - 10https://gerrit.wikimedia.org/r/522436 (https://phabricator.wikimedia.org/T212231) [12:29:25] (03PS2) 10Muehlenhoff: Sync docker packages to thirdparty/ci [puppet] - 10https://gerrit.wikimedia.org/r/522434 (https://phabricator.wikimedia.org/T226236) [12:29:47] (03CR) 10jerkins-bot: [V: 04-1] Remove standard::diamond and fold into profile::wmcs::instance [puppet] - 10https://gerrit.wikimedia.org/r/522436 (https://phabricator.wikimedia.org/T212231) (owner: 10Muehlenhoff) [12:33:14] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [12:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:24] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [12:33:27] PROBLEM - puppet last run on icinga1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[12:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:01] (03PS2) 10Muehlenhoff: Remove standard::diamond and fold into profile::wmcs::instance [puppet] - 10https://gerrit.wikimedia.org/r/522436 (https://phabricator.wikimedia.org/T212231) [12:34:14] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove sudo rule for removed PowerDNSRecursor Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/522426 (owner: 10Muehlenhoff) [12:36:54] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . [12:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:51] (03PS2) 10Muehlenhoff: Remove sudo rule for removed PowerDNSRecursor Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/522426 [12:38:51] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [12:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:05] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . 
[12:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:24] (03CR) 10Muehlenhoff: [C: 03+2] Remove sudo rule for removed PowerDNSRecursor Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/522426 (owner: 10Muehlenhoff) [12:39:46] !log recreating ci staging namespaces T227775 [12:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:51] T227775: recreate staging cluster namespaces using helmfile - https://phabricator.wikimedia.org/T227775 [12:40:43] (03CR) 10Gehel: [C: 04-1] cookbook API: add class API (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/506956 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [12:44:19] (03CR) 10Jhedden: [C: 03+1] wmcs: swap check_graphite for Diamond metrics with check_prometheus [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) (owner: 10Filippo Giunchedi) [12:45:15] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [12:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:47] RECOVERY - puppet last run on icinga1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:53:53] (03CR) 10Ottomata: [C: 03+1] Remove jessie support from Kafka class [puppet] - 10https://gerrit.wikimedia.org/r/522406 (owner: 10Muehlenhoff) [12:59:02] (03PS1) 10Muehlenhoff: Remove sudo user for already removed Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/522449 [13:05:25] (03CR) 10Jhedden: [C: 03+1] "Nice addition. I think overall it's a good option to expose." 
[puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) (owner: 10BryanDavis) [13:09:19] (03CR) 10Jhedden: [C: 03+1] Remove standard::diamond and fold into profile::wmcs::instance [puppet] - 10https://gerrit.wikimedia.org/r/522436 (https://phabricator.wikimedia.org/T212231) (owner: 10Muehlenhoff) [13:11:15] (03PS1) 10Muehlenhoff: Remove obsolete NutcrackerCollector [puppet] - 10https://gerrit.wikimedia.org/r/522459 [13:17:57] (03PS2) 10CDanis: conftool::client: convert hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/522217 [13:17:59] (03PS5) 10CDanis: dbctl: install package on cluster management hosts [puppet] - 10https://gerrit.wikimedia.org/r/522227 [13:18:06] (03CR) 10CDanis: conftool::client: convert hiera->lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522217 (owner: 10CDanis) [13:21:14] (03PS1) 10Muehlenhoff: planet: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/522465 [13:29:51] PROBLEM - HHVM rendering on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:31:11] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 79091 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:37:05] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:40:03] (03CR) 10CDanis: [C: 03+2] conftool::client: convert hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/522217 (owner: 10CDanis) [13:40:37] (03CR) 10CDanis: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/522227 (owner: 10CDanis) [13:40:58] (03CR) 10CDanis: [C: 03+2] dbctl: install package on cluster management hosts [puppet] - 10https://gerrit.wikimedia.org/r/522227 (owner: 10CDanis) [13:45:00] (03CR) 10Filippo Giunchedi: [C: 03+2] wmcs: swap check_graphite for Diamond metrics with check_prometheus [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) (owner: 10Filippo Giunchedi) [13:45:07] (03PS7) 10Filippo Giunchedi: wmcs: swap check_graphite for Diamond metrics with check_prometheus [puppet] - 10https://gerrit.wikimedia.org/r/522397 (https://phabricator.wikimedia.org/T212231) [13:45:57] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (10Physikerwelt) [13:50:08] (03PS1) 10Jbond: check_prometheus: done allow unescaped exclamation marks [puppet] - 10https://gerrit.wikimedia.org/r/522469 (https://phabricator.wikimedia.org/T227100) [13:51:10] (03CR) 10jerkins-bot: [V: 04-1] check_prometheus: done allow unescaped exclamation marks [puppet] - 10https://gerrit.wikimedia.org/r/522469 (https://phabricator.wikimedia.org/T227100) (owner: 10Jbond) [13:53:11] (03PS1) 10CDanis: admin: cdanis dotfiles: .zshrc tweaks [puppet] - 10https://gerrit.wikimedia.org/r/522471 [13:53:13] 10Operations, 10observability, 10Patch-For-Review, 10good first bug: monitoring::check_prometheus should error on an unquoted ! in the query - https://phabricator.wikimedia.org/T227100 (10jbond) being as its a Friday i thought i would have a go at this however i didn't read the original message correctly a... 
[13:53:56] (03CR) 10CDanis: [C: 03+2] admin: cdanis dotfiles: .zshrc tweaks [puppet] - 10https://gerrit.wikimedia.org/r/522471 (owner: 10CDanis) [13:54:11] (03PS2) 10CDanis: admin: cdanis dotfiles: .zshrc tweaks [puppet] - 10https://gerrit.wikimedia.org/r/522471 [13:56:32] (03PS1) 10Effie Mouzeli: WIP: jobrunners: Enable php7_only feature flags [puppet] - 10https://gerrit.wikimedia.org/r/522472 (https://phabricator.wikimedia.org/T219148) [13:57:11] (03PS2) 10Jbond: check_prometheus: done allow unescaped exclamation marks [puppet] - 10https://gerrit.wikimedia.org/r/522469 (https://phabricator.wikimedia.org/T227100) [14:01:09] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/522436 (https://phabricator.wikimedia.org/T212231) (owner: 10Muehlenhoff) [14:04:50] 10Operations, 10Puppet: Add a CI check for the use of hiera() function - https://phabricator.wikimedia.org/T220820 (10jbond) the lookup function allows one to call it in multiple ways e.g. ` lookup('tcpircbot_host', {'default_value' => 'icinga.wikimedia.org'}) lookup({'name => 'tcpircbot_host', 'default_va... [14:05:36] (03CR) 10Jbond: conftool::client: convert hiera->lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522217 (owner: 10CDanis) [14:05:37] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [14:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:23] 10Operations, 10Math: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (10jcrespo) [14:11:50] 10Operations, 10observability, 10Patch-For-Review, 10good first bug: monitoring::check_prometheus should error on an unquoted ! in the query - https://phabricator.wikimedia.org/T227100 (10CDanis) >>! In T227100#5328397, @jbond wrote: > being as its a Friday i thought i would have a go at this however i did... 
[14:12:26] (03PS2) 10Hashar: contint: remove unused contint::packages::python [puppet] - 10https://gerrit.wikimedia.org/r/517092 (https://phabricator.wikimedia.org/T225735) [14:12:28] (03PS3) 10Hashar: contint: remove several unused packages [puppet] - 10https://gerrit.wikimedia.org/r/517093 (https://phabricator.wikimedia.org/T225735) [14:12:30] (03PS3) 10Hashar: contint: remove unneeded profile::ci::hhvm [puppet] - 10https://gerrit.wikimedia.org/r/517094 (https://phabricator.wikimedia.org/T225735) [14:17:28] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 13.83 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [14:19:46] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 10.97 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [14:20:05] that used to be a graphite alert I just replaced with a prometheus check ^ might be false positive [14:23:10] PROBLEM - Persistent high iowait on cloudstore1008 is CRITICAL: 135.2 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [14:23:20] PROBLEM - Persistent high iowait on labstore1004 is CRITICAL: 26.37 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [14:24:59] looking into the alerts btw [14:28:25] ACKNOWLEDGEMENT - Persistent high iowait on cloudstore1008 is CRITICAL: 131.2 ge 10 andrew bogott presumed side-effect of https://gerrit.wikimedia.org/r/#/c/522397/ https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [14:31:16] (03CR) 10CDanis: "thanks for doing this!! LGTM with some nits." 
(036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/522469 (https://phabricator.wikimedia.org/T227100) (owner: 10Jbond) [14:35:04] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) Recap of what it has been done so fare in various (sub) tasks:... [14:37:57] (03CR) 10Elukey: [C: 03+1] Remove obsolete NutcrackerCollector [puppet] - 10https://gerrit.wikimedia.org/r/522459 (owner: 10Muehlenhoff) [14:39:32] (03PS1) 10Filippo Giunchedi: wmcs: scale iowait to number of CPUs [puppet] - 10https://gerrit.wikimedia.org/r/522483 (https://phabricator.wikimedia.org/T212231) [14:39:59] andrewbogott: ^ [14:40:37] 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Consider adding per-shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (10elukey) @aaron I added two new rows to https://grafana.wikimedia.org/dashboard/db/mcrouter wit... [14:40:59] (03CR) 10Andrew Bogott: [C: 03+1] "wfm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/522483 (https://phabricator.wikimedia.org/T212231) (owner: 10Filippo Giunchedi) [14:41:52] (03CR) 10Filippo Giunchedi: [C: 03+2] wmcs: scale iowait to number of CPUs [puppet] - 10https://gerrit.wikimedia.org/r/522483 (https://phabricator.wikimedia.org/T212231) (owner: 10Filippo Giunchedi) [14:42:01] thanks, will deploy [14:42:54] (03PS3) 10Jbond: check_prometheus: done allow unescaped exclamation marks [puppet] - 10https://gerrit.wikimedia.org/r/522469 (https://phabricator.wikimedia.org/T227100) [14:43:52] (03CR) 10Jbond: "Thanks all updated/added" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/522469 (https://phabricator.wikimedia.org/T227100) (owner: 10Jbond) [14:44:02] (03CR) 10jerkins-bot: [V: 04-1] check_prometheus: done allow unescaped exclamation marks [puppet] - 10https://gerrit.wikimedia.org/r/522469 (https://phabricator.wikimedia.org/T227100) (owner: 10Jbond) [14:45:23] (03PS1) 10Muehlenhoff: netmon: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/522490 [14:45:59] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10elukey) >>! In T214275#5313635, @Andrew wrote: >>>! In T214275#5307951, @elukey wrote: >> >> The latter should be doabl... [14:46:34] jbond42: not sure which one is failing since I don't know Ruby string literal semantics offhand, but I'd expect one of the two test cases you added to fail -- the regex as written wants to see an odd number of \s before a ! 
(which is correct) [14:46:58] RECOVERY - Persistent high iowait on cloudstore1008 is OK: (C)10 ge (W)5 ge 3.597 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [14:46:58] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10elukey) [14:47:34] (03PS5) 10Andrew Bogott: wmcs-cold-migrate: use 'virsh undefine' to cleanup old VMs [puppet] - 10https://gerrit.wikimedia.org/r/518748 (https://phabricator.wikimedia.org/T226415) [14:47:41] (03CR) 10Andrew Bogott: [C: 03+1] "Thanks! I will test." [puppet] - 10https://gerrit.wikimedia.org/r/518748 (https://phabricator.wikimedia.org/T226415) (owner: 10Andrew Bogott) [14:48:04] 10Operations, 10User-Elukey: memkeys segfaults on Debian Stretch - https://phabricator.wikimedia.org/T223863 (10elukey) [14:48:07] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey, 10User-jijiki: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) [14:49:54] cdanis: yes you are right it's failing on `\\!` which is valid. however i have just done some tests and `\!` also works just as well. '!' is not an escape character so in that case '\' is just evaluated as a literal [14:50:24] i think the way forward is to either only support '\!' and not '\\!' or update the already horrible regex [14:50:45] the regex is correct, the test is wrong [14:50:50] Ruby string quoting semantics are different from Puppet's [14:51:00] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove obsolete NutcrackerCollector [puppet] - 10https://gerrit.wikimedia.org/r/522459 (owner: 10Muehlenhoff) [14:51:16] jbond42: in Puppet you need to write '\\!' which then gets parsed into '\!' 
by the time the test sees it [14:51:23] (this is all horrible) [14:51:32] actually it's the other way round [14:51:36] wait really [14:51:44] in puppet your right \! and it gets to \\! when ruby evaluates [14:52:05] in that case I've just read the documentation for each completely wrong [14:53:08] wait a minute it's not erroring in the place i thought it was give me 10 mins [14:53:52] (03PS2) 10Muehlenhoff: Remove obsolete NutcrackerCollector [puppet] - 10https://gerrit.wikimedia.org/r/522459 [14:54:25] some of the recent errors are I think just mismatched spellings in error message vs expected error message [14:54:49] ahh yes parameter vs paramater [14:55:59] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete NutcrackerCollector [puppet] - 10https://gerrit.wikimedia.org/r/522459 (owner: 10Muehlenhoff) [14:59:10] RECOVERY - Persistent high iowait on labstore1004 is OK: (C)10 ge (W)5 ge 0.1 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring [15:00:04] RECOVERY - Check systemd state on restbase1017 is OK: OK - running: The system is fully operational [15:00:14] RECOVERY - cassandra-a service on restbase1017 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:01:18] (03PS1) 10Muehlenhoff: mariadb::config: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/522496 [15:03:58] PROBLEM - Check systemd state on restbase1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
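Aside on the '\!' versus '\\!' exchange above: the Ruby sketch below shows how single-quoted literals treat the backslash, and one way a check for an unescaped '!' could look. The constant UNESCAPED_BANG and the helper unescaped_bang? are hypothetical names used only for illustration. This is not the regex from the check_prometheus change under review, just an assumption about what such a check might do.

```ruby
# In a single-quoted Ruby literal only \\ and \' are escape sequences,
# so '\!' and '\\!' both produce the two characters: backslash, bang.
a = '\!'
b = '\\!'
puts a == b     # => true
puts a.length   # => 2

# Hypothetical check (an assumption, not the production regex):
# a '!' counts as unescaped when it is preceded by an even number
# of backslashes, zero included.
UNESCAPED_BANG = /(?<!\\)(?:\\\\)*!/

def unescaped_bang?(query)
  query.match?(UNESCAPED_BANG)
end

puts unescaped_bang?('up != 1')   # => true  (bare bang)
puts unescaped_bang?('up \!= 1')  # => false (bang is escaped)
```

Because both literals evaluate to the same string, a spec that feeds either spelling to the define exercises the same input, which would explain two test cases behaving identically.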
[15:04:08] PROBLEM - cassandra-a service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:05:58] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:06:23] (03PS4) 10Jbond: check_prometheus: done allow unescaped exclamation marks [puppet] - 10https://gerrit.wikimedia.org/r/522469 (https://phabricator.wikimedia.org/T227100) [15:07:06] took way to long to fix one letter :( https://gerrit.wikimedia.org/r/c/operations/puppet/+/522469/3..4/modules/monitoring/spec/defines/check_prometheus_spec.rb [15:08:43] 10Operations, 10observability, 10Patch-For-Review, 10good first bug: monitoring::check_prometheus should error on an unquoted ! in the query - https://phabricator.wikimedia.org/T227100 (10jbond) ahh ok so in that case `\\!` and `\!` are the same. As `!` is not an escape character the '\' is treated treate... [15:09:24] jbond42: it looks like '\!' and '\\!' are exactly the same string in ruby [15:09:39] so that's why the other tests aren't failing :) [15:09:40] yes i think i was wrong before \\! just ends up being \! [15:10:07] (03CR) 10CDanis: [C: 03+1] "LGTM! thanks!!" [puppet] - 10https://gerrit.wikimedia.org/r/522469 (https://phabricator.wikimedia.org/T227100) (owner: 10Jbond) [15:10:17] irb(main):004:0> '\!' == '\\!' [15:10:20] => true [15:10:23] ack [15:10:40] (03PS5) 10Jbond: check_prometheus: done allow unescaped exclamation marks [puppet] - 10https://gerrit.wikimedia.org/r/522469 (https://phabricator.wikimedia.org/T227100) [15:11:40] (03CR) 10Cwhite: "Should we consider ensure=>absent before unmanaging the resources, or is the plan to clean up the files and sudo rule manually?" 
[puppet] - 10https://gerrit.wikimedia.org/r/522449 (owner: 10Muehlenhoff) [15:12:21] i was testing the string with regex so my passing test was checking for /\\!/ do many levels of escaping :/ [15:12:38] s/do/too/ [15:12:50] (03CR) 10Jbond: [C: 03+2] check_prometheus: done allow unescaped exclamation marks [puppet] - 10https://gerrit.wikimedia.org/r/522469 (https://phabricator.wikimedia.org/T227100) (owner: 10Jbond) [15:16:10] ACKNOWLEDGEMENT - Check systemd state on restbase1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. eevans T222960 [15:16:10] ACKNOWLEDGEMENT - Disk space on restbase1017 is CRITICAL: DISK CRITICAL - free space: / 10 MB (0% inode=95%): eevans T222960 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [15:16:10] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.16.126:9042 on restbase1017 is CRITICAL: connect to address 10.64.16.126 and port 9042: Connection refused eevans T222960 https://phabricator.wikimedia.org/T93886 [15:16:10] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.16.126:7001 on restbase1017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans T222960 https://phabricator.wikimedia.org/T120662 [15:16:10] ACKNOWLEDGEMENT - cassandra-a service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed eevans T222960 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:18:52] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:19:15] (03PS1) 10Muehlenhoff: Adapt netboot.cfg/DHCP to new names of LDAP replicas in codfw [puppet] - 10https://gerrit.wikimedia.org/r/522498 (https://phabricator.wikimedia.org/T227778) [15:19:21] (03CR) 10Jcrespo: [C: 03+1] "Yay! 
+1 but also queing for next week if you are ok with it. Thanks Moritz!" [puppet] - 10https://gerrit.wikimedia.org/r/522496 (owner: 10Muehlenhoff) [15:19:30] (03PS1) 10Gehel: wdqs: update response time check to new prometheus metrics. [puppet] - 10https://gerrit.wikimedia.org/r/522499 [15:21:05] (03CR) 10Muehlenhoff: "Sure, no hurry at all :-)" [puppet] - 10https://gerrit.wikimedia.org/r/522496 (owner: 10Muehlenhoff) [15:21:10] (03PS6) 10Andrew Bogott: wmcs-cold-migrate: use 'virsh undefine' to cleanup old VMs [puppet] - 10https://gerrit.wikimedia.org/r/518748 (https://phabricator.wikimedia.org/T226415) [15:21:35] (03CR) 10Jcrespo: [C: 03+1] "Adding a reminder, to do on a separate patch." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/522496 (owner: 10Muehlenhoff) [15:23:27] (03PS1) 10Jbond: rspec: update the standard module to use the global spec test [puppet] - 10https://gerrit.wikimedia.org/r/522500 [15:25:05] !log rebooting cloudvirt1018.eqiad.wmnet T216040 [15:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:11] T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown - https://phabricator.wikimedia.org/T216040 [15:27:10] (03CR) 10jerkins-bot: [V: 04-1] wdqs: update response time check to new prometheus metrics. [puppet] - 10https://gerrit.wikimedia.org/r/522499 (owner: 10Gehel) [15:28:44] (03PS2) 10Gehel: wdqs: update response time check to new prometheus metrics. [puppet] - 10https://gerrit.wikimedia.org/r/522499 [15:29:34] (03PS2) 10Elukey: Introduce a ldap config in hiera [puppet] - 10https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) [15:31:14] (03CR) 10jerkins-bot: [V: 04-1] wdqs: update response time check to new prometheus metrics. 
[puppet] - 10https://gerrit.wikimedia.org/r/522499 (owner: 10Gehel) [15:32:19] (03CR) 10Jbond: [C: 03+2] rspec: update the standard module to use the global spec test [puppet] - 10https://gerrit.wikimedia.org/r/522500 (owner: 10Jbond) [15:32:28] (03PS2) 10Jbond: rspec: update the standard module to use the global spec test [puppet] - 10https://gerrit.wikimedia.org/r/522500 [15:32:52] (03CR) 10Elukey: "> How about "ldap:" as a prefix, we don't even need to refer to the" [puppet] - 10https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [15:34:28] (03PS3) 10Gehel: wdqs: update response time check to new prometheus metrics. [puppet] - 10https://gerrit.wikimedia.org/r/522499 [15:36:58] (03CR) 10markahershberger: [C: 03+2] dpkg-source to ignore .gitreview file [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/522428 (owner: 10Hashar) [15:37:58] (03CR) 10Jbond: "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480957 (owner: 10Hashar) [15:39:32] 10Operations, 10Analytics, 10Discovery, 10Research-Backlog: Make oozie swift upload emit event to Kafka about swift object upload complete - https://phabricator.wikimedia.org/T227896 (10Ottomata) [15:41:32] (03CR) 10markahershberger: [C: 03+2] Address concerns with already-merged code [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 (owner: 10markahershberger) [15:42:18] (03CR) 10Jbond: [C: 04-1] "I have just remembered that the .fixtures file is used by the CI to dynamically build a list of module dependencies and ensure that there " [puppet] - 10https://gerrit.wikimedia.org/r/480957 (owner: 10Hashar) [15:42:20] (03CR) 10markahershberger: [C: 03+2] "Conveying @legoktm's approval" [debs/python-git-archive-all] - 10https://gerrit.wikimedia.org/r/503156 (owner: 10markahershberger) [15:43:36] 10Operations, 10observability, 10Patch-For-Review, 10good first bug: monitoring::check_prometheus should error on an unquoted ! 
in the query - https://phabricator.wikimedia.org/T227100 (10CDanis) 05Open→03Resolved a:03jbond [15:55:37] (03CR) 10BryanDavis: [C: 03+1] toolforge: install fish shell [puppet] - 10https://gerrit.wikimedia.org/r/522161 (https://phabricator.wikimedia.org/T219054) (owner: 10Jhedden) [15:57:13] (03CR) 10Jhedden: [C: 03+2] toolforge: install fish shell [puppet] - 10https://gerrit.wikimedia.org/r/522161 (https://phabricator.wikimedia.org/T219054) (owner: 10Jhedden) [15:57:27] (03PS2) 10Jhedden: toolforge: install fish shell [puppet] - 10https://gerrit.wikimedia.org/r/522161 (https://phabricator.wikimedia.org/T219054) [15:57:37] (03CR) 10Jhedden: [V: 03+2 C: 03+2] toolforge: install fish shell [puppet] - 10https://gerrit.wikimedia.org/r/522161 (https://phabricator.wikimedia.org/T219054) (owner: 10Jhedden) [15:58:39] 10Operations, 10Machine vision, 10serviceops, 10Service-deployment-requests, 10Services (watching): Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (10Mholloway) a:03Mholloway [15:59:00] 10Operations, 10Machine vision, 10serviceops, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (10Mholloway) [16:00:12] RECOVERY - Check systemd state on restbase1017 is OK: OK - running: The system is fully operational [16:00:24] RECOVERY - cassandra-a service on restbase1017 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:00:30] huzzah [16:00:49] the restbase1017 saga comes to an end =) [16:02:17] urandom: godog: https://github.com/prometheus/docs/pull/1379/files merged and live https://prometheus.io/docs/practices/naming/ 🎉 [16:03:18] cdanis: \o/ very nice [16:03:43] chaomodus: I like your optimism! [16:04:34] PROBLEM - Check systemd state on restbase1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:04:46] PROBLEM - cassandra-a service on restbase1017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:04:58] aha [16:05:01] I jinxed it :( [16:05:03] always look on the bright side of life? [16:05:06] yah [16:06:59] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [16:08:37] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/522498 (https://phabricator.wikimedia.org/T227778) (owner: 10Muehlenhoff) [16:09:45] (03PS1) 10Jbond: ntp: move the include of standard::ntp out of role and into profile [puppet] - 10https://gerrit.wikimedia.org/r/522510 [16:11:27] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10kzimmerman) @ArielGlenn wanted to make sure you saw this; Maya is blocked on tasks until she has database access. @RStal... [16:11:48] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:11:54] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:12:59] cdanis: oh, excellent; it's official! 
🎉 [16:18:03] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "Increase swift proxy connection timeout to 1s" [puppet] - 10https://gerrit.wikimedia.org/r/520727 (https://phabricator.wikimedia.org/T226373) (owner: 10Effie Mouzeli) [16:18:12] (03PS4) 10Effie Mouzeli: Revert "Increase swift proxy connection timeout to 1s" [puppet] - 10https://gerrit.wikimedia.org/r/520727 (https://phabricator.wikimedia.org/T226373) [16:20:06] sigh [16:21:59] jeh: should I merge your puppet change ? [16:22:10] this one [16:22:12] https://github.com/wikimedia/puppet/commit/f84861685ab218df01eefc713b14ff2b4db1c387 [16:22:50] jeh: this is a bit urgent [16:23:14] ok, I thought I already merged it though: https://gerrit.wikimedia.org/r/c/operations/puppet/+/522161 [16:23:32] it is not merged on the puppetmaster [16:23:40] oh, sorry. please [16:23:43] ok tx [16:25:42] !log Rolling restart swift proxy on ms-fe* [16:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:24] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:26:30] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [16:27:46] RECOVERY - Check systemd state on wdqs1010 is OK: OK - running: The system is fully operational [16:29:28] RECOVERY - Disk space on restbase1017 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [16:32:08] !log bootstrapping restbase1017-a -- T222960 [16:32:08] RECOVERY - Check systemd state on restbase1017 is OK: OK - running: The system is fully operational [16:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:13] T222960: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 [16:32:22] RECOVERY - cassandra-a service on restbase1017 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:32:24] RECOVERY - cassandra-a SSL 10.64.16.126:7001 on restbase1017 is OK: SSL OK - Certificate restbase1017-a valid until 2020-06-24 13:01:17 +0000 (expires in 347 days) https://phabricator.wikimedia.org/T120662 [16:34:38] (03CR) 10Ori.livneh: "Looks good on Puppet compiler: https://puppet-compiler.wmflabs.org/compiler1002/17350/. Could someone cherry-pick it on beta? I no longer " [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [16:50:15] 10Operations, 10Analytics, 10Analytics-Kanban, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10Krinkle) Yes, I mostly changed our practice of deleting mirrors to archiving them (which means it's fully ready-only, including no pull requests or other sources of notifica... 
[16:55:14] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2022 misbehaving / error on boot - https://phabricator.wikimedia.org/T227667 (10Papaul) p:05Triage→03Normal [16:56:48] 10Operations, 10ops-codfw: SSH to mw2269.mgmt not working - https://phabricator.wikimedia.org/T227548 (10Papaul) a:03Papaul [17:08:11] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10RStallman-legalteam) @kzimmerman - seems like Maya may have signed an NDA through T&C. I can check with them. I don't hav... [17:20:40] (03CR) 10Muehlenhoff: "Yeah, it's just two hosts, I'll remove it via Cumin." [puppet] - 10https://gerrit.wikimedia.org/r/522449 (owner: 10Muehlenhoff) [17:21:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, but let's wait for Alex to chime in." [puppet] - 10https://gerrit.wikimedia.org/r/522073 (https://phabricator.wikimedia.org/T227611) (owner: 10Elukey) [17:31:09] Krinkle: hi! this has been alerting for 5h - https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=navigation-timing-alerts+grafana+alert /cc T203485 [17:31:09] T203485: Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 [17:35:11] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), 10User-Elukey: Allow async foreign set/delete WAN cache operations in mcrouter - https://phabricator.wikimedia.org/T225642 (10aaron) For generic key testing, there is always: ` php maintenance/mctest.php --... [17:39:12] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10aaron) >>! In T214275#5328625, @elukey wrote: >>>! In T214275#5313635, @Andrew wrote: >>>>! In T214275#5307951, @elukey... 
[17:42:51] XioNoX: This is a perf-team Grafana alert, which ops is meant to ignore per T203485. We handle these via Grafana and our team e-mail alias (which Icinga has notified already). [17:42:52] T203485: Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 [17:43:24] Krinkle: but they show up in the outstanding Icinga alert lists [17:44:12] XioNoX: I'm aware, see ticket for context. Whatever options are available to prevent that, we're listening :) [17:44:40] We didn't plan to use Icinga at all originally, but were told we should not have Grafana send e-mails/irc messages, hence the current situation. [17:45:53] Krinkle: can you ack the alert in Icinga when you get notified? [17:45:56] or downtime it? [17:46:24] see the task :) [17:46:24] XioNoX: No, we don't use Icinga, and don't plan to log in every time to ack things, which will happen normally throughout the week. [17:47:40] 10Operations, 10observability, 10Performance-Team (Radar): Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10Krinkle) In recent weeks there have been three or four occasions an SRE kindly informed us about an "on-going alert". I'm not sure if something changed in Ici... 
[17:50:33] ah yeah, I skimmed the task too fast [17:51:37] ok, yeah it all makes sense [17:57:59] RECOVERY - cassandra-a CQL 10.64.16.126:9042 on restbase1017 is OK: TCP OK - 0.000 second response time on 10.64.16.126 port 9042 https://phabricator.wikimedia.org/T93886 [17:58:06] (03PS3) 10Dzahn: delete servermon role and module [puppet] - 10https://gerrit.wikimedia.org/r/502174 (https://phabricator.wikimedia.org/T198939) [17:58:35] (03CR) 10Dzahn: [C: 03+2] delete servermon role and module [puppet] - 10https://gerrit.wikimedia.org/r/502174 (https://phabricator.wikimedia.org/T198939) (owner: 10Dzahn) [18:01:25] we should add an auto-ACK to these checks via eventhandler or so [18:01:35] re: perf-team [18:02:04] !log bootstrapping restbase1017-b -- T222960 [18:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:10] T222960: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 [18:02:39] good to see ^ :) [18:02:51] mutante: agreed :) [18:03:07] RECOVERY - cassandra-b service on restbase1017 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:03:11] RECOVERY - cassandra-b SSL 10.64.16.127:7001 on restbase1017 is OK: SSL OK - Certificate restbase1017-b valid until 2020-06-24 13:01:18 +0000 (expires in 347 days) https://phabricator.wikimedia.org/T120662 [18:05:37] (03CR) 10Cwhite: [C: 03+1] "Great! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/522449 (owner: 10Muehlenhoff) [18:16:37] !log Remove bogus Graphite data at frontend.navtiming2.requet (typo from Nov 2018), graphite1004/2003 [18:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:05] (03PS1) 10Jbond: lookup checks: add checks to warn against using hiera and advice lookup [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) [18:18:24] (03CR) 10jerkins-bot: [V: 04-1] lookup checks: add checks to warn against using hiera and advice lookup [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [18:28:30] (03PS2) 10Jbond: lookup checks: add checks to warn against using hiera and advice lookup [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) [18:28:40] (03CR) 10Cwhite: [C: 03+2] set up debian packaging [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/521580 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [18:34:04] (03PS3) 10Dzahn: traffic servers: remove netmon1003 director and backend [puppet] - 10https://gerrit.wikimedia.org/r/502177 (https://phabricator.wikimedia.org/T220355) [18:36:19] 10Operations, 10Traffic, 10HTTPS, 10Security: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298 (10Legoktm) >>! In T92298#2795655, @BBlack wrote: > > My current thinking on this is that it's best to wait on TLSv1.3's padding mechanism to be av... [18:39:28] (03CR) 10Alex Monk: "Done, I could give you deployment-prep access if you're likely to use it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [18:39:48] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10Cmjohnson) This is a dell server, I will try and put in a ticket with Dell but all h/w is showing that there isn't a problem so I may have trouble with Dell giv... [18:40:48] (03CR) 10Jbond: lookup checks: add checks to warn against using hiera and advice lookup (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [18:41:19] 10Operations, 10netops: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (10ayounsi) Still no news, asked to escalate the case. [18:41:39] 10Operations, 10Traffic, 10HTTPS, 10Security: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298 (10CDanis) >>! In T92298#5329217, @Legoktm wrote: > Do we support TLS 1.3 yet? I'm apparently connecting over 1.2 still. No -- {T170567} [18:42:33] 10Operations, 10ops-eqiad, 10Operations-Software-Development, 10observability: ms-be1043 sdk failed - https://phabricator.wikimedia.org/T218544 (10Cmjohnson) here is the Dell task You have successfully submitted request SR994463101. 
[18:42:44] (03CR) 10Jforrester: Even more invariant config moved over to CommonSettings (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 (owner: 10Jforrester) [18:43:04] (03PS5) 10Jforrester: Even more invariant config moved over to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 [18:43:06] (03PS1) 10Jforrester: Introduce wmgEnableJsonConfigDataMode so we can scrap wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522530 [18:43:08] (03PS1) 10Jforrester: Use wmgEnableJsonConfigDataMode instead of wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522531 [18:43:08] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10Performance: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567 (10Legoktm) [18:43:11] (03PS1) 10Jforrester: Drop wmgEnableTabularData and wmgEnableMapData, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522532 [18:43:13] 10Operations, 10Traffic, 10HTTPS, 10Security: Investigate our mitigation strategy for HTTPS response length attacks - https://phabricator.wikimedia.org/T92298 (10Legoktm) [18:43:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: mw1239 memory errors - https://phabricator.wikimedia.org/T227867 (10Cmjohnson) This server is out of warranty, I can reseat the DIMM but will need the server to taken down. 
[18:44:06] (03CR) 10jerkins-bot: [V: 04-1] Introduce wmgEnableJsonConfigDataMode so we can scrap wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522530 (owner: 10Jforrester) [18:44:15] (03CR) 10jerkins-bot: [V: 04-1] Use wmgEnableJsonConfigDataMode instead of wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522531 (owner: 10Jforrester) [18:44:29] (03CR) 10jerkins-bot: [V: 04-1] Drop wmgEnableTabularData and wmgEnableMapData, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522532 (owner: 10Jforrester) [18:44:45] (03CR) 10Dzahn: lookup checks: add checks to warn against using hiera and advice lookup (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [18:46:43] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10Cmjohnson) I am not sure what I was looking at yesterday but this server is out of warranty. However, I think I have a 4TB disks that I can replace it with. I will confirm when I get back to eqiad next week. 
[18:47:19] (03PS1) 10Jforrester: Stop setting wgNonincludableNamespaces to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522535 [18:47:21] (03PS1) 10Jforrester: Stop setting wgGraphIsTrusted to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522536 [18:48:27] (03CR) 10jerkins-bot: [V: 04-1] Stop setting wgNonincludableNamespaces to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522535 (owner: 10Jforrester) [18:48:46] (03CR) 10Dzahn: [C: 03+2] traffic servers: remove netmon1003 director and backend [puppet] - 10https://gerrit.wikimedia.org/r/502177 (https://phabricator.wikimedia.org/T220355) (owner: 10Dzahn) [18:48:52] (03CR) 10jerkins-bot: [V: 04-1] Stop setting wgGraphIsTrusted to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522536 (owner: 10Jforrester) [18:49:43] !log setting CPU governor to performance for wdqs1010 - T225713 [18:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:49] T225713: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 [18:53:04] !log cp1072 - enabling notifications for service checks in icinga, they were disabled but all green and no SAL/ticket. looked like forgotten from the past [18:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:23] (03CR) 10Jbond: lookup checks: add checks to warn against using hiera and advice lookup (031 comment) [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/522526 (https://phabricator.wikimedia.org/T220820) (owner: 10Jbond) [18:56:06] 10Operations, 10ops-eqiad, 10Analytics: Broken disk on analytics1072 - https://phabricator.wikimedia.org/T226467 (10Cmjohnson) You have successfully submitted request SR994463766 is the Dell ticket created. I did see the disk in megacli so I am not sure the TSR report I sent them will include the disk. I d... 
[18:59:17] 10Operations, 10ops-eqsin: msw1-eqsin/msw2-eqsin missing serial number - https://phabricator.wikimedia.org/T227911 (10faidon) [18:59:47] 10Operations, 10ops-codfw, 10ops-eqiad, 10DC-Ops, and 2 others: Triage and resolve all outstanding Netbox report errors - https://phabricator.wikimedia.org/T223450 (10faidon) [19:04:33] 10Operations, 10ops-eqsin: msw1-eqsin/msw2-eqsin missing serial number - https://phabricator.wikimedia.org/T227911 (10wiki_willy) a:03RobH [19:05:43] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10Cmjohnson) [19:05:45] RECOVERY - cassandra-b CQL 10.64.16.127:9042 on restbase1017 is OK: TCP OK - 0.000 second response time on 10.64.16.127 port 9042 https://phabricator.wikimedia.org/T93886 [19:05:47] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Cmjohnson) [19:08:08] (03PS3) 10Andrew Bogott: mwopenstackclients: 300 second timeouts for all http actions [puppet] - 10https://gerrit.wikimedia.org/r/522204 (https://phabricator.wikimedia.org/T227785) [19:08:28] !log rebooting cloudvirt1018.eqiad.wmnet T216040 [19:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:34] T216040: Start/shutdown VMs automatically on hypervisor boot/shutdown - https://phabricator.wikimedia.org/T216040 [19:08:50] (03PS2) 10Jforrester: Introduce wmgEnableJsonConfigDataMode so we can scrap wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522530 [19:09:28] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients: 300 second timeouts for all http actions [puppet] - 10https://gerrit.wikimedia.org/r/522204 (https://phabricator.wikimedia.org/T227785) (owner: 10Andrew Bogott) [19:14:31] (03PS2) 10Jforrester: Use wmgEnableJsonConfigDataMode instead of wmgEnableTabularData and wmgEnableMapData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522531 [19:14:39] (03PS2) 
10Jforrester: Drop wmgEnableTabularData and wmgEnableMapData, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522532 [19:14:49] (03PS2) 10Jforrester: Stop setting wgNonincludableNamespaces to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522535 [19:14:58] (03PS2) 10Jforrester: Stop setting wgGraphIsTrusted to the default; never varied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/522536 [19:15:08] !log bootstrapping restbase1017-c -- T222960 [19:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:14] T222960: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 [19:15:19] RECOVERY - cassandra-c service on restbase1017 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:15:41] RECOVERY - cassandra-c SSL 10.64.16.128:7001 on restbase1017 is OK: SSL OK - Certificate restbase1017-c valid until 2020-06-24 13:01:19 +0000 (expires in 347 days) https://phabricator.wikimedia.org/T120662 [19:17:04] !log add prometheus-varnishkafka-exporter 0.1 to apt repo T196066 [19:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:09] T196066: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 [19:19:11] (03PS1) 10Gehel: wdqs: manual deployement for wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/522538 [19:23:42] (03PS4) 10Dzahn: traffic servers: remove netmon1003 director and backend [puppet] - 10https://gerrit.wikimedia.org/r/502177 (https://phabricator.wikimedia.org/T220355) [19:24:45] (03CR) 10Smalyshev: [C: 03+1] wdqs: manual deployement for wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/522538 (owner: 10Gehel) [19:24:54] (03CR) 10Gehel: [C: 03+2] wdqs: manual deployement for wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/522538 (owner: 10Gehel) [19:25:52] (03CR) 10Andrew Bogott: [C: 03+1] "Looks 
good although I'm tempted to say we should add some Very Big default, like 12 hours." [puppet] - 10https://gerrit.wikimedia.org/r/522208 (https://phabricator.wikimedia.org/T227830) (owner: 10BryanDavis) [19:26:17] (03PS1) 10Smalyshev: Move wdqs1009 to local config for loading [puppet] - 10https://gerrit.wikimedia.org/r/522541 [19:28:23] (03CR) 10Andrew Bogott: [C: 03+1] "fine w/me!" [puppet] - 10https://gerrit.wikimedia.org/r/522510 (owner: 10Jbond) [19:30:46] (03PS2) 10Gehel: Move wdqs1009 to local config for loading [puppet] - 10https://gerrit.wikimedia.org/r/522541 (owner: 10Smalyshev) [19:30:48] (03CR) 10Ori.livneh: "Thanks Alex -- and no, I don't need access." [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [19:32:39] (03CR) 10Gehel: [C: 03+2] Move wdqs1009 to local config for loading [puppet] - 10https://gerrit.wikimedia.org/r/522541 (owner: 10Smalyshev) [19:34:37] (03PS5) 10Dzahn: traffic servers: remove netmon1003 director and backend [puppet] - 10https://gerrit.wikimedia.org/r/502177 (https://phabricator.wikimedia.org/T220355) [19:41:15] !log Disabled 2FA for MSchottlender-WMF for device reset. 
[19:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:41] (03PS2) 10Dzahn: netmon: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/522490 (owner: 10Muehlenhoff) [20:02:36] (03CR) 10Dzahn: [C: 03+2] netmon: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/522490 (owner: 10Muehlenhoff) [20:04:55] (03PS2) 10Dzahn: planet: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/522465 (owner: 10Muehlenhoff) [20:05:03] (03PS1) 10Jhedden: openstack: resume VM state on host reboot [puppet] - 10https://gerrit.wikimedia.org/r/522548 (https://phabricator.wikimedia.org/T216040) [20:05:25] (03CR) 10Dzahn: [C: 03+2] planet: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/522465 (owner: 10Muehlenhoff) [20:07:05] (03PS2) 10Jhedden: openstack: resume VM state on host reboot [puppet] - 10https://gerrit.wikimedia.org/r/522548 (https://phabricator.wikimedia.org/T216040) [20:10:12] (03CR) 10Dzahn: [C: 03+2] Drop jessie support from tor class [puppet] - 10https://gerrit.wikimedia.org/r/522415 (owner: 10Muehlenhoff) [20:10:22] (03PS2) 10Dzahn: Drop jessie support from tor class [puppet] - 10https://gerrit.wikimedia.org/r/522415 (owner: 10Muehlenhoff) [20:35:59] RECOVERY - cassandra-c CQL 10.64.16.128:9042 on restbase1017 is OK: TCP OK - 0.000 second response time on 10.64.16.128 port 9042 https://phabricator.wikimedia.org/T93886 [20:38:37] 10Operations, 10Machine vision, 10serviceops, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (10Mholloway) @joe Thanks (belatedly) for the comments. I've got a fork at https://github.com/mdh... 
[20:47:36] 10Operations, 10Machine vision, 10serviceops, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) [20:48:26] 10Operations, 10Machine vision, 10serviceops, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) [20:55:00] 10Operations, 10Machine vision, 10serviceops, 10Reading-Infrastructure-Team-Backlog (Kanban), and 2 others: Update open_nsfw-- for Wikimedia production deployment - https://phabricator.wikimedia.org/T225664 (10Mholloway) On the subject of race conditions and accepting raw image data, I imagine the scoring... [20:57:50] (03PS1) 10CDanis: admin: cdanis zshrc: helper _on_wmf_prod & last-puppet-run [puppet] - 10https://gerrit.wikimedia.org/r/522554 [20:58:55] (03CR) 10CDanis: [C: 03+2] admin: cdanis zshrc: helper _on_wmf_prod & last-puppet-run [puppet] - 10https://gerrit.wikimedia.org/r/522554 (owner: 10CDanis) [21:20:16] (03PS1) 10Cwhite: prometheus: wire up prometheus-varnishkafka-exporter for deploy [puppet] - 10https://gerrit.wikimedia.org/r/522556 (https://phabricator.wikimedia.org/T196066) [21:49:51] (03PS3) 10Thcipriani: blubberoid: Add policy file [deployment-charts] - 10https://gerrit.wikimedia.org/r/517573 (https://phabricator.wikimedia.org/T215319) [21:49:53] (03PS1) 10Thcipriani: Blubberoid: enable policy, bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/522561 [21:57:25] (03PS21) 10CRusnov: Add LibreNMS parity check report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [22:07:07] (03PS22) 10CRusnov: Add LibreNMS parity check report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [22:08:58] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Add LibreNMS parity 
check report (033 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov) [22:10:53] (03PS23) 10CRusnov: Add LibreNMS parity check report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [22:11:15] (03CR) 10jerkins-bot: [V: 04-1] Add LibreNMS parity check report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov) [22:16:23] (03PS24) 10CRusnov: Add LibreNMS parity check report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [22:17:39] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10ArielGlenn) @kzimmerman The clinic duty person handles these (though not this weekend). I don't remember who that will be... [22:18:02] (03CR) 10CRusnov: [C: 03+2] Add LibreNMS parity check report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov) [22:24:18] (03PS1) 10CRusnov: netbox : Add Hiera data for automatic LibreNMS Netbox report [puppet] - 10https://gerrit.wikimedia.org/r/522562 [22:28:10] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for Mayakpwiki - https://phabricator.wikimedia.org/T227633 (10kzimmerman) Thanks @ArielGlenn - we hadn't heard anything since Maya submitted this on Tuesday and I remember you respond... 
[22:29:32] (03CR) 10CRusnov: "https://puppet-compiler.wmflabs.org/compiler1001/17353/netmon1002.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/522562 (owner: 10CRusnov) [22:58:34] (03PS1) 10Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - 10https://gerrit.wikimedia.org/r/522567 [23:00:00] (03CR) 10jerkins-bot: [V: 04-1] wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - 10https://gerrit.wikimedia.org/r/522567 (owner: 10Dzahn) [23:00:46] before i look.. bet it will be pep8 saying the line is too long .. [23:00:50] (03PS2) 10Dzahn: Remove support for jessie from Phabricator classes [puppet] - 10https://gerrit.wikimedia.org/r/522420 (owner: 10Muehlenhoff) [23:01:25] E501 line too long (158 > 100 characters) :p [23:01:32] (03CR) 10jerkins-bot: [V: 04-1] Remove support for jessie from Phabricator classes [puppet] - 10https://gerrit.wikimedia.org/r/522420 (owner: 10Muehlenhoff) [23:01:43] now now jerkins.. that too ?? [23:03:25] (03PS2) 10Dzahn: wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - 10https://gerrit.wikimedia.org/r/522567 [23:04:11] (03CR) 10jerkins-bot: [V: 04-1] wmf_auto_reimage: Adjust message about waiting for puppet [puppet] - 10https://gerrit.wikimedia.org/r/522567 (owner: 10Dzahn) [23:17:41] (03CR) 10Dzahn: [C: 03+2] mariadb::ferm_misc: remove firewall rule for servermon [puppet] - 10https://gerrit.wikimedia.org/r/502176 (https://phabricator.wikimedia.org/T198939) (owner: 10Dzahn) [23:17:50] (03PS3) 10Dzahn: mariadb::ferm_misc: remove firewall rule for servermon [puppet] - 10https://gerrit.wikimedia.org/r/502176 (https://phabricator.wikimedia.org/T198939) [23:23:57] (03CR) 10Jeena Huneidi: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/522561 (owner: 10Thcipriani) [23:28:35] !log netmon1003 - stopping apache2 service (decom of servermon.wikimedia.org) [23:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:11] 
10Operations: decom netmon1003 - https://phabricator.wikimedia.org/T220355 (10Dzahn) 05Stalled→03Open a:03Dzahn [23:31:15] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10Dzahn) [23:31:44] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [23:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:51] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [23:31:54] 10Operations: decom netmon1003 - https://phabricator.wikimedia.org/T220355 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `netmon1003.wikimedia.org` - netmon1003.wikimedia.org - Removed from Puppet master and PuppetDB - Downtimed host on Icinga - No manageme... [23:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:57] !log netmon1003 - shutdown -h now after it's gone from Icinga now [23:36:00] and out... [23:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:34] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [23:41:11] expected [23:46:49] okey dokey, have a good weekend chaomodus [23:46:56] you too :) [23:48:26] off [23:53:30] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.85% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1