[00:02:00] (PS1) Dzahn: icinga: tell rsync server to use IPv6 [puppet] - https://gerrit.wikimedia.org/r/463392 (https://phabricator.wikimedia.org/T202782)
[00:02:42] (CR) jerkins-bot: [V: -1] icinga: tell rsync server to use IPv6 [puppet] - https://gerrit.wikimedia.org/r/463392 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:03:38] (PS2) Dzahn: icinga: tell rsync server to use IPv6 [puppet] - https://gerrit.wikimedia.org/r/463392 (https://phabricator.wikimedia.org/T202782)
[00:04:01] (CR) jerkins-bot: [V: -1] icinga: tell rsync server to use IPv6 [puppet] - https://gerrit.wikimedia.org/r/463392 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:04:11] (PS3) Dzahn: icinga: tell rsync server to use IPv6 [puppet] - https://gerrit.wikimedia.org/r/463392 (https://phabricator.wikimedia.org/T202782)
[00:05:24] (CR) Dzahn: "nice that we just converted all these includes to class instances.. otherwise we would have to do it now to add the parameter" [puppet] - https://gerrit.wikimedia.org/r/463392 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:05:57] (CR) Dzahn: [C: 2] icinga: tell rsync server to use IPv6 [puppet] - https://gerrit.wikimedia.org/r/463392 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:08:29] (PS4) Dzahn: icinga: tell rsync server to use IPv6 [puppet] - https://gerrit.wikimedia.org/r/463392 (https://phabricator.wikimedia.org/T202782)
[00:12:10] meh.. there will be an alert or 2 about puppet failing there.. already on it
[00:12:33] nothing critical
[00:14:36] (CR) Dzahn: [C: 2] "actually doesn't work because you can't reassign variables in puppet :(" [puppet] - https://gerrit.wikimedia.org/r/456522 (owner: Dzahn)
[00:15:48] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:16:32] ACKNOWLEDGEMENT - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn puppet issue in rsyncd. on it.
[00:27:52] (PS1) Dzahn: rsync::server: fix handling of use_ipv6 parameter [puppet] - https://gerrit.wikimedia.org/r/463394
[00:31:59] (PS1) Dzahn: icinga: hot fix for puppet issue with rsync and use_ipv6 [puppet] - https://gerrit.wikimedia.org/r/463395 (https://phabricator.wikimedia.org/T202782)
[00:33:09] (CR) Dzahn: [C: 2] icinga: hot fix for puppet issue with rsync and use_ipv6 [puppet] - https://gerrit.wikimedia.org/r/463395 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:36:01] (CR) Dzahn: [C: 2] "didn't work -> https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463394/ & https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/46" [puppet] - https://gerrit.wikimedia.org/r/463392 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:37:27] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:38:00] (CR) Dzahn: "for now i did https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463395/ instead" [puppet] - https://gerrit.wikimedia.org/r/463394 (owner: Dzahn)
[00:38:37] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 11, down: 0, shutdown: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:41:17] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:41:48] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:50:05] (PS2) Dzahn: icinga::plugins: set user/group for nagios_common::commands [puppet] - https://gerrit.wikimedia.org/r/463374 (https://phabricator.wikimedia.org/T202782)
[00:50:57] (PS3) Dzahn: icinga::plugins: set user/group for nagios_common::commands [puppet] - https://gerrit.wikimedia.org/r/463374 (https://phabricator.wikimedia.org/T202782)
[00:53:00] (CR) Dzahn: "identical to the other uses of the variable throughout the file. i will follow-up with a change that removes the out-of-scope variables al" [puppet] - https://gerrit.wikimedia.org/r/463374 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:53:13] (CR) Dzahn: [C: 2] icinga::plugins: set user/group for nagios_common::commands [puppet] - https://gerrit.wikimedia.org/r/463374 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[00:55:44] (CR) Dzahn: [C: 2] "Compilation results for einsteinium.wikimedia.org: no change" [puppet] - https://gerrit.wikimedia.org/r/463374 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:00:03] (CR) Dzahn: [C: 2] "yay! all the check commands on icinga1001 got created after this. lots of red and errors gone from puppet run output. only very issues lef" [puppet] - https://gerrit.wikimedia.org/r/463374 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:02:51] (CR) Dzahn: "yep, you are right. done instead in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463316/3/modules/icinga/manifests/init.pp and" [puppet] - https://gerrit.wikimedia.org/r/463180 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:02:57] (Abandoned) Dzahn: nagios_common: make user/group configurable from Hiera [puppet] - https://gerrit.wikimedia.org/r/463180 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:07:27] (PS1) Dzahn: alerting_host/icinga: move "mapped v6" snippet back to role level [puppet] - https://gerrit.wikimedia.org/r/463397
[01:08:19] (CR) jerkins-bot: [V: -1] alerting_host/icinga: move "mapped v6" snippet back to role level [puppet] - https://gerrit.wikimedia.org/r/463397 (owner: Dzahn)
[01:11:48] (PS1) Dzahn: alerting_host/tcpircbot: remove useless role, include only profile [puppet] - https://gerrit.wikimedia.org/r/463398
[01:16:37] PROBLEM - Apache HTTP on mw2254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:17:04] (PS1) Dzahn: certspotter/alerting_host: don't include role inside role, -> profile [puppet] - https://gerrit.wikimedia.org/r/463400
[01:17:28] RECOVERY - Apache HTTP on mw2254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.121 second response time
[01:24:09] (PS1) Dzahn: icinga: set user/group for event_handlers::raid [puppet] - https://gerrit.wikimedia.org/r/463401 (https://phabricator.wikimedia.org/T202782)
[01:24:47] (CR) Dzahn: "but.. the current style check hates it in both roles and profiles.. are we ok having it in modules (nowadays) ?" [puppet] - https://gerrit.wikimedia.org/r/463397 (owner: Dzahn)
[01:27:17] (PS2) Dzahn: icinga: set user/group for event_handlers::raid [puppet] - https://gerrit.wikimedia.org/r/463401 (https://phabricator.wikimedia.org/T202782)
[01:32:00] (PS1) Dzahn: icinga::plugins: add user/group param and avoid out-of-scope vars [puppet] - https://gerrit.wikimedia.org/r/463403
[01:33:00] (CR) Dzahn: "i think this is what you wanted me to do instead Alex" [puppet] - https://gerrit.wikimedia.org/r/463403 (owner: Dzahn)
[01:41:42] (PS1) Dzahn: icinga::naggen/web/raid/ores: avoid out-of-scope-vars everywhere [puppet] - https://gerrit.wikimedia.org/r/463404
[01:46:19] (PS1) Dzahn: icinga::web: don't include ::icinga [puppet] - https://gerrit.wikimedia.org/r/463405
[01:55:47] (CR) Dzahn: [C: 2] "compiler shows einsteinium unaffected, fixes issue on icinga1001 https://puppet-compiler.wmflabs.org/compiler1002/12666/" [puppet] - https://gerrit.wikimedia.org/r/463401 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:58:21] (CR) Dzahn: [C: 2] "puppet run on icinga1001 is ALL GREEN for the first time now since it's stretch and using just the package to setup the user :))" [puppet] - https://gerrit.wikimedia.org/r/463401 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[02:00:52] Operations, monitoring, Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (Dzahn) Finally there are no more puppet errors on icinga1001 now, that's the first time on stretch and letting just the package setup user and group. All th...
[02:10:08] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:30:28] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[03:07:09] (PS6) Andrew Bogott: tools: Update usage of ::elasticsearch (take 2) [puppet] - https://gerrit.wikimedia.org/r/463386 (https://phabricator.wikimedia.org/T198351) (owner: BryanDavis)
[03:08:08] (CR) Andrew Bogott: [C: 2] tools: Update usage of ::elasticsearch (take 2) [puppet] - https://gerrit.wikimedia.org/r/463386 (https://phabricator.wikimedia.org/T198351) (owner: BryanDavis)
[03:08:20] (CR) Mathew.onipe: Add elasticsearch_cluster module (15 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[03:08:39] (PS36) Mathew.onipe: Add elasticsearch_cluster module [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885)
[03:08:41] (PS7) Mathew.onipe: elasticsearch_cluster: adding new features [software/spicerack] - https://gerrit.wikimedia.org/r/461267 (https://phabricator.wikimedia.org/T202885)
[03:29:18] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 880.53 seconds
[04:02:17] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 252.33 seconds
[05:13:28] * mdholloway notes the mobileapps alerts last night (utc+5), will investigate
[05:19:11] Operations, ops-eqiad: Degraded RAID on db1069 - https://phabricator.wikimedia.org/T205649 (Marostegui)
[05:19:16] Operations, ops-eqiad, DBA, DC-Ops, User-Banyek: db1069 has errored disk in slot 7 - https://phabricator.wikimedia.org/T205253 (Marostegui)
[05:22:57] (PS1) Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/463414
[05:24:55] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/463414 (owner: Marostegui)
[05:26:00] (Merged) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/463414 (owner: Marostegui)
[05:27:13] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1094 (duration: 00m 59s)
[05:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:53] !log Stop replication in sync on db1094 and dbstore1002:s7
[05:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:07] (CR) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/463414 (owner: Marostegui)
[05:35:30] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/463416
[05:40:38] (CR) Marostegui: "Will you include the prometheus config in a different commit?" [puppet] - https://gerrit.wikimedia.org/r/463268 (https://phabricator.wikimedia.org/T196376) (owner: Jcrespo)
[05:41:38] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/463416 (owner: Marostegui)
[05:42:40] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/463416 (owner: Marostegui)
[05:47:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1094 (duration: 00m 55s)
[05:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:29] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/463416 (owner: Marostegui)
[05:54:41] !log Deploy schema change on s7 eqiad, this will generate lag - T203709
[05:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:50] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[06:02:32] (PS1) Marostegui: db-codfw.php: Depool db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/463417
[06:04:39] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/463417 (owner: Marostegui)
[06:05:47] (Merged) jenkins-bot: db-codfw.php: Depool db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/463417 (owner: Marostegui)
[06:06:00] (CR) jenkins-bot: db-codfw.php: Depool db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/463417 (owner: Marostegui)
[06:07:13] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2055 (duration: 00m 55s)
[06:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:20] !log Deploy schema change on db2055 - T203709
[06:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:27] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[06:19:45] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@bf09080]: Update mobileapps to 7878ffc
[06:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:29] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@bf09080]: Update mobileapps to 7878ffc (duration: 00m 44s)
[06:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:40] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@bf09080]: Update mobileapps to 7878ffc
[06:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:21] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@bf09080]: Update mobileapps to 7878ffc (duration: 06m 40s)
[06:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:10] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@bf09080]: Update mobileapps to 7878ffc
[06:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:17] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 503 (expecting: 200): /{domain}/v1
[06:28:17] ns/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an unexpected value for header vary: Accept, Accept-Encoding
[06:29:39] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/faidon]
[06:29:40] <_joe_> mdholloway: is this supposed to fix the mobileapps swagger issue?
[06:29:55] <_joe_> I mean your deploys right now
[06:30:06] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@bf09080]: Update mobileapps to 7878ffc (duration: 01m 57s)
[06:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:31] <_joe_> mdholloway: you around? :)
[06:31:36] _joe_: i'm here
[06:31:57] yes, fixing the swagger issue was the original intent
[06:31:57] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:31:59] <_joe_> ok, we have this alert about mobileapps sending back a different Vary: header than expected
[06:32:05] right.
[06:32:06] <_joe_> ok, thanks :)
[06:32:21] <_joe_> <3
[06:32:57] went to deploy a fix for that and now having scap issues
[06:33:34] so rolled back, and maybe better to just ack the spec alerts for now
[06:34:32] <_joe_> mdholloway: ok, I'm doing that
[06:34:39] <_joe_> can I help with the scap issues maybe?
[06:35:06] _joe_: sure, i'll create a paste with the scap log
[06:35:44] <_joe_> do you have a ticket for the MCS issue, btw?
[06:36:17] _joe_: not yet
[06:36:18] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:37:01] <_joe_> heh I don't think it's strictly needed, I just wanted to reference it in the icinga acknowledgement if there was one
[06:37:26] _joe_: https://phabricator.wikimedia.org/P7600
[06:38:20] <_joe_> fatal: reference is not a tree: 7878ffcaa4ba1c7a87d5075868612bfbe3393dce
[06:38:29] <_joe_> interesting, a git error
[06:38:51] <_joe_> maybe you committed the wrong sha1 for the submodule?
[06:39:00] * _joe_ perplexed
[06:40:25] (PS1) Marostegui: Revert "db-codfw.php: Depool db2055" [mediawiki-config] - https://gerrit.wikimedia.org/r/463422
[06:41:06] _joe_: the first time i deployed, in my haste, i neglected to `git submodule update`, and hit trouble. so i rolled back, updated the src submodule, and retried, and now this
[06:41:20] <_joe_> oh I see, ok
[06:41:46] as shown in the log i pasted above, i tried with `--force` to force a fetch and checkout again but that didn't help either
[06:41:48] <_joe_> I fear we might need to start fresh on all servers
[06:42:01] <_joe_> or better
[06:42:13] <_joe_> lemme check on the servers
[06:42:22] ok, thanks
[06:44:19] <_joe_> I
[06:47:14] ? :)
[06:47:25] <_joe_> aah sorry, wrong window
[06:47:37] <_joe_> I'm trying to figure out what's happening
[06:47:42] <_joe_> or better, how to fix it
[06:49:04] * mdholloway tries to dig up some convo with thcipriani from last time scap went out of whack during a deploy
[06:52:48] <_joe_> so there is an easy way out
[06:52:54] <_joe_> but I'd like to avoid it
[06:56:15] <_joe_> mdholloway: can you paste me the whole list of scap commands you sent?
[06:56:35] <_joe_> because it seems the command that failed for you works if I launch it on a node
[06:56:56] <_joe_> (/usr/bin/scap deploy-local -v --repo mobileapps/deploy fetch --force -g canary --refresh-config)
[06:57:09] <_joe_> but ofc it refers to the old deploy
[06:58:12] _joe_: here's my bash history for the session:
[06:58:27] cd /srv/deployment/mobileapps/deploy/
[06:58:27] git log
[06:58:27] git pull
[06:58:27] git log
[06:58:27] scap deploy "`git log --pretty=format:'%s' -n 1`"
[06:58:27] git submodule update --init
[06:58:27] scap deploy "`git log --pretty=format:'%s' -n 1`"
[06:58:28] git log
[06:58:28] git submodule update --init
[06:58:29] git branch
[06:58:29] scap deploy --force "`git log --pretty=format:'%s' -n 1`"
[06:59:05] <_joe_> ok, I *think* I might have found a way
[07:00:09] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:01:38] <_joe_> mdholloway: can you try again?
[07:01:48] _joe_: sure, trying now
[07:02:38] _joe_: with `--force`?
[07:02:47] <_joe_> shouldn't be necessary
[07:02:53] k
[07:03:05] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@bf09080]: Update mobileapps to 7878ffc
[07:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:14] <_joe_> --force doesn't necessarily do what you want
[07:03:15] <_joe_> btw
[07:04:14] <_joe_> current -> revs/bf090806febeb034a9340fb4bfdb80a738efc26b seems we're past the point where you had the issue
[07:05:05] it's hanging on the promote and restart_service stage in a way that doesn't bode well...
[07:05:22] <_joe_> ok, that doesn't mean the issue is with git anymore
[07:06:04] https://www.irccloud.com/pastebin/99R9udOB/
[07:06:12] <_joe_> Process: 28618 ExecStart=/usr/bin/firejail --blacklist=/root --blacklist=/home --caps --seccomp /usr/bin/nodejs src/server.js -c /etc/mobileapps/config.yaml (code=exited, status=1/FAILURE)
[07:06:18] <_joe_> the service failed to restart
[07:08:45] <_joe_> Error while reading config file: Error: ENOENT: no such file or directory, open '/etc/mobileapps/config.yaml'
[07:08:48] <_joe_> ARGH
[07:08:56] just saw that in syslog
[07:08:57] (wat)
[07:09:04] <_joe_> I hate things that don't act like a FSM
[07:09:18] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:11:37] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[07:14:55] ok, so on scb2001 /etc/mobileapps/config.yaml links to /srv/deployment/mobileapps/deploy-cache/revs/bf090806febeb034a9340fb4bfdb80a738efc26b/.git/config-files/etc/mobileapps/config.yaml which doesn't exist because /srv/deployment/mobileapps/deploy-cache/revs/ is empty
[07:15:13] <_joe_> mdholloway: that's because I'm re-deploying it from scratch
[07:15:14] <_joe_> via puppet
[07:15:22] ah, ok
[07:16:15] <_joe_> grrr this thing is so hard to get right once someone does the tiniest mistake
[07:16:35] i was just thinking the same... this shouldn't be as big a footgun as it is
[07:17:20] <_joe_> note I still could not fix the issue
[07:18:49] <_joe_> mdholloway: third time's the charm
[07:19:00] <_joe_> mdholloway: please try to scap now
[07:19:08] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy
[07:19:08] _joe_: \o/
[07:19:11] ok, trying again
[07:20:39] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@bf09080]: Update mobileapps to 7878ffc
[07:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:37] _joe_: all right, all seems well so far
[07:21:43] thanks so much for helping with this
[07:21:47] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[07:21:48] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
[07:21:48] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[07:21:49] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[07:21:57] <_joe_> mdholloway: thank you for fixing the issue
[07:22:14] Operations, ops-eqiad, DBA, DC-Ops, User-Banyek: db1069 has errored disk in slot 7 - https://phabricator.wikimedia.org/T205253 (Banyek) Open>Resolved The rebuild finished: Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name : RAID Level...
[07:22:47] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[07:22:48] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[07:22:56] _joe_: i think the alerts were originally related to the parsoid deployment, btw, but it needs more investigation
[07:22:58] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[07:22:58] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[07:23:07] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:23:07] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy
[07:23:08] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[07:23:16] <_joe_> mdholloway: I'm pretty sure they were
[07:23:22] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@bf09080]: Update mobileapps to 7878ffc (duration: 02m 43s)
[07:23:22] Operations: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (MoritzMuehlenhoff) p:Triage>High
[07:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:30] we only recently added header checks to the spec, so that changes trigger the alerts
[07:24:04] (i'm not sure if all spec x-ample checks should necessary trigger a critical alert)
[07:24:08] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
[07:24:09] *necessarily
[07:24:37] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2055" [mediawiki-config] - https://gerrit.wikimedia.org/r/463422 (owner: Marostegui)
[07:24:54] Operations: rsync module doesnt work on trusty - https://phabricator.wikimedia.org/T132532 (MoritzMuehlenhoff) Open>declined Closing, we have a number of trusty instances which are using rsync successfully (e.g. the labvirts to sync Nova images), and otherwise trusty is deprecated.
[07:25:41] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2055" [mediawiki-config] - https://gerrit.wikimedia.org/r/463422 (owner: Marostegui)
[07:26:28] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[07:26:59] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2055 (duration: 00m 56s)
[07:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:05] (PS1) Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/463425
[07:35:27] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2055" [mediawiki-config] - https://gerrit.wikimedia.org/r/463422 (owner: Marostegui)
[07:36:02] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/463425 (owner: Marostegui)
[07:37:05] (Merged) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/463425 (owner: Marostegui)
[07:40:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 (duration: 00m 55s)
[07:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:34] !log Stop replication in sync on dbstore1002 and db1078
[07:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:50] <_joe_> mdholloway: well we can discuss what should be critical and what shouldn't be
[07:50:57] (CR) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/463425 (owner: Marostegui)
[08:00:47] (PS2) Jcrespo: mariadb backups: Convert db1116 into an eqiad backup source host [puppet] - https://gerrit.wikimedia.org/r/463268 (https://phabricator.wikimedia.org/T196376)
[08:05:51] (PS1) Muehlenhoff: Only allow HTTP port for Hue [puppet] - https://gerrit.wikimedia.org/r/463428
[08:07:24] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/463429
[08:07:42] Operations, ops-codfw, DBA: Issues with mgmt interface on es2001 host - https://phabricator.wikimedia.org/T204928 (Banyek) I had experiences with some Sun Fire servers, their mgmt interfaces were constantly broken, but mostly a powercycle solved it. I don't know if we can afford to fully power down...
[08:08:03] (CR) Marostegui: [C: 1] mariadb backups: Convert db1116 into an eqiad backup source host [puppet] - https://gerrit.wikimedia.org/r/463268 (https://phabricator.wikimedia.org/T196376) (owner: Jcrespo)
[08:09:40] Operations, ops-codfw, DBA: Issues with mgmt interface on es2001 host - https://phabricator.wikimedia.org/T204928 (jcrespo) @Banyek, please read https://wikitech.wikimedia.org/wiki/Management_Interfaces
[08:10:20] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/463429 (owner: Marostegui)
[08:10:41] (CR) Jcrespo: [C: 2] mariadb backups: Convert db1116 into an eqiad backup source host [puppet] - https://gerrit.wikimedia.org/r/463268 (https://phabricator.wikimedia.org/T196376) (owner: Jcrespo)
[08:11:10] !log Deploy schema change on s3 eqiad, this will generate lag - T203709
[08:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:15] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709
[08:11:26] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/463429 (owner: Marostegui)
[08:12:31] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1078 (duration: 00m 54s)
[08:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:00] Operations, ops-codfw, DBA: Issues with mgmt interface on es2001 host - https://phabricator.wikimedia.org/T204928 (Banyek) a:Banyek I can try these
[08:16:08] (PS1) Jcrespo: mariadb: Correct wrong hiera key identifiers [puppet] - https://gerrit.wikimedia.org/r/463430
[08:16:44] Operations, ops-codfw, DBA: Issues with mgmt interface on es2001 host - https://phabricator.wikimedia.org/T204928 (jcrespo) Don't, this requires a power drain, you cannot help with this.
[08:16:53] Operations, ops-codfw, DBA: Issues with mgmt interface on es2001 host - https://phabricator.wikimedia.org/T204928 (jcrespo) a:Banyek>None
[08:17:11] (PS1) Muehlenhoff: yarn_http: Restrict to caches [puppet] - https://gerrit.wikimedia.org/r/463431
[08:17:56] (CR) Jcrespo: [C: 2] mariadb: Correct wrong hiera key identifiers [puppet] - https://gerrit.wikimedia.org/r/463430 (owner: Jcrespo)
[08:19:27] Operations, Patch-For-Review: Register and identify icinga-wm - https://phabricator.wikimedia.org/T205526 (MoritzMuehlenhoff) Can you please add the nickserv password to pwstore?
[08:19:54] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/463429 (owner: Marostegui)
[08:34:45] (PS1) Jcrespo: mariadb: Depool db1089, db1104 to setup backup source for s7,s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392)
[08:35:47] (PS2) Jcrespo: mariadb: Depool db1086, db1104 to setup backup source for s7,s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392)
[08:37:39] (CR) Jcrespo: "Could I get at least a +1 or -1 (with reasons) from banyek? Reviews are a very critical part of a DBA job." [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[08:39:13] (CR) Banyek: [C: 1] "sure you can" [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[08:39:39] (CR) Jcrespo: "The important part is the "reasons" :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[08:41:31] (CR) Banyek: [C: 1] "> The important part is the "reasons" :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[08:42:37] (CR) Jcrespo: "> > The important part is the "reasons" :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[08:44:33] (CR) Jcrespo: "> > > The important part is the "reasons" :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[08:52:55] (CR) Banyek: [C: 1] mariadb: Depool db1086, db1104 to setup backup source for s7,s8 (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[08:56:03] (CR) Jcrespo: mariadb: Depool db1086, db1104 to setup backup source for s7,s8 (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:05:39] (CR) Banyek: [C: 1] mariadb: Depool db1086, db1104 to setup backup source for s7,s8 (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:11:24] (PS1) Jonas Kress (WMDE): Enable WBQualityConstraintsSuggestionsBetaFeature in production [mediawiki-config] - https://gerrit.wikimedia.org/r/463439 (https://phabricator.wikimedia.org/T202712)
[09:18:51] Operations, Wikimedia-Mailing-lists, Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (MoritzMuehlenhoff) This needs additional information: What is the email address of the mailing list administrator?
[09:18:58] Operations, Wikimedia-Mailing-lists, Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (MoritzMuehlenhoff) p:Triage>Normal
[09:19:23] Operations, MediaWiki-extensions-CodeReview, Patch-For-Review, Wikimedia-production-error: Exec error "Possibly missing executable file: svn diff" from Special:Code - https://phabricator.wikimedia.org/T204801 (MoritzMuehlenhoff) p:Triage>Normal
[09:21:52] Operations, Traffic, HTTPS: WMF servers support ESNI? - https://phabricator.wikimedia.org/T205378 (MoritzMuehlenhoff) p:Triage>Normal
[09:24:18] (CR) Jonas Kress (WMDE): "@Jforrester could you please confirm" [mediawiki-config] - https://gerrit.wikimedia.org/r/463439 (https://phabricator.wikimedia.org/T202712) (owner: Jonas Kress (WMDE))
[09:26:30] (CR) Jcrespo: [C: 2] mariadb: Depool db1086, db1104 to setup backup source for s7,s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:30:53] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1086, db1104 (duration: 00m 57s)
[09:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:47] (CR) jenkins-bot: mariadb: Depool db1086, db1104 to setup backup source for s7,s8 [mediawiki-config] - https://gerrit.wikimedia.org/r/463434 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[09:34:00] !log converting wikishared.cx_corpora to TokuDB on host dbstore1002 (T205544)
[09:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:05] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544
[09:37:10] Operations, TechCom: change my email address in the techcom alias - https://phabricator.wikimedia.org/T205661 (MoritzMuehlenhoff) Open>Resolved p:Triage>Normal a:MoritzMuehlenhoff I've changed your address. And welcome :-)
[09:37:19] (CR) Vgutierrez: "let's move this CR forward, @legotkm do you see any standing issue to not merge it?" [software/certcentral] (debian) - https://gerrit.wikimedia.org/r/458554 (owner: Alex Monk)
[09:39:46] Operations, Beta-Cluster-Infrastructure, Wikimedia-Logstash, Release-Engineering-Team (Watching / External): logstash-beta.wmflab throws multiple "Error: Could not locate that visualization" - https://phabricator.wikimedia.org/T204845 (MoritzMuehlenhoff) p:Triage>Normal
[09:40:13] Operations, Puppet, Analytics-Kanban, Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (MoritzMuehlenhoff) p:Triage>Normal
[09:40:57] Operations, SRE-Access-Requests, Patch-For-Review, User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (gabriel-wmde) declined>Open Sorry I totally missed this. I would like to have the option of both formats.
[09:41:13] !log installing ca-certificates updates on trusty/stretch [09:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:14] 10Operations, 10Scap, 10Datacenter-Switchover-2018, 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: Scap is checking canary servers in dormant instead of active-dc - https://phabricator.wikimedia.org/T204907 (10akosiaris) a:03akosiaris [09:45:25] !log stop db1086 and db1104 for cloning to db1116 [09:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:58] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page had an unexpected value for header vary: Accept, Accept-Encoding [09:51:58] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page had an unexpected value for header vary: Accept, Accept-Encoding [09:52:00] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page had an unexpected value for header vary: Accept, Accept-Encoding [09:52:28] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page had an unexpected value for header vary: Accept, Accept-Encoding [09:52:29] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page had an unexpected value for header vary: Accept, Accept-Encoding [09:52:29] PROBLEM - mobileapps endpoints health on 
scb2003 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page had an unexpected value for header vary: Accept, Accept-Encoding [09:52:30] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page had an unexpected value for header vary: Accept, Accept-Encoding [09:52:48] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page had an unexpected value for header vary: Accept, Accept-Encoding [09:52:48] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page had an unexpected value for header vary: Accept, Accept-Encoding [09:52:48] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page had an unexpected value for header vary: Accept, Accept-Encoding [09:52:49] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page had an unexpected value for header vary: Accept, Accept-Encoding [09:52:57] ^on it... 
[09:52:58] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/references/{title}{/revision}{/tid} (Get references of a test page) is CRITICAL: Test Get references of a test page had an unexpected value for header vary: Accept, Accept-Encoding [09:53:04] same issue as earlier, different endpoint [09:53:05] sigh [10:09:29] !log reimaging mw2150 to test router ACLs on cumin1001 [10:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:20] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1086, db1104 to setup backup source for s7,s8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463443 [10:16:49] (03CR) 10Jcrespo: [C: 04-2] "Not until both instances are backup up and replicating." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463443 (owner: 10Jcrespo) [10:20:17] (03PS2) 10Giuseppe Lavagetto: rake_modules/specdeps: fix logic in resolving specs that need running [puppet] - 10https://gerrit.wikimedia.org/r/463293 [10:20:32] <_joe_> mdholloway: heh [10:20:38] PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid] [10:20:43] <_joe_> it looks like we changed the values for that header [10:21:00] <_joe_> mdholloway: maybe remove that check completely? do you care much about the Vary: header? 
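Editorial sketch: the alerts above fail because the check compares the whole `vary` header value against one expected string. A looser check would only ask whether the required field names are present. The function below is a minimal illustration in Python, not the monitoring code that produced these alerts; the name `vary_includes` is hypothetical.

```python
from typing import Iterable


def vary_includes(header_value: str, required: Iterable[str]) -> bool:
    """Return True if every required field name is listed in a Vary header value.

    Field names in Vary are case-insensitive and comma-separated, so the value
    is split on commas and compared in lower case rather than as a raw string.
    """
    present = {part.strip().lower() for part in header_value.split(",") if part.strip()}
    return all(name.strip().lower() in present for name in required)
```

With this kind of test, a response sending `Accept, Accept-Encoding` would satisfy a check that only requires `Accept`, while a response sending only `Accept-Encoding` would still fail it.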
[10:22:36] _joe_: probably best to disable that check for now [10:22:43] 10Operations, 10Traffic, 10HTTPS, 10Upstream: Enable ESNI support on Wikimedia servers - https://phabricator.wikimedia.org/T205378 (10Krenair) [10:23:07] will update shortly [10:23:08] <_joe_> mdholloway: to be clear - the Vary: header is important [10:23:26] <_joe_> but if you don't have any feature that depends on controlling the cache in varnish with it [10:23:36] <_joe_> I wouldn't check its value in the swagger spec [10:24:31] 10Operations, 10Wikimedia-Mailing-lists: I get a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T205694 (10Klein) [10:26:38] (03PS8) 10Arturo Borrero Gonzalez: cloudvps: add prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/462455 (https://phabricator.wikimedia.org/T203177) [10:28:49] _joe_: we do care about it (or will sometime hopefully soon) for working language variants, but it's not essential to check at the moment [10:28:58] *working with [10:29:43] <_joe_> so either you change it to be [Accept, Accept-Encoding] everywhere [10:30:34] <_joe_> or, well, we just miss having the ability to do a "in" test for headers [10:30:59] (03CR) 10Giuseppe Lavagetto: [C: 032] rake_modules/specdeps: fix logic in resolving specs that need running [puppet] - 10https://gerrit.wikimedia.org/r/463293 (owner: 10Giuseppe Lavagetto) [10:35:52] 10Operations, 10Citoid, 10Services, 10Patch-For-Review, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10Mvolz) >>! In T201611#4622404, @akosiaris wrote: > https://gerrit.wikimedia.org/g/mediawiki/services/zotero/+/refs/heads/master would be the repository @Mvolz I... [10:36:51] (03PS1) 10Arturo Borrero Gonzalez: d/service: fix templates leftovers [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/463445 [10:37:45] (03PS3) 10Giuseppe Lavagetto: mediawiki::web::vhost: generalize the upload_rewrite mechanism. 
[puppet] - 10https://gerrit.wikimedia.org/r/463217 [10:43:18] 10Operations, 10TechCom: change my email address in the techcom alias - https://phabricator.wikimedia.org/T205661 (10daniel) Thank you :) [10:50:49] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:52:27] 10Operations, 10netops: Enable cumin1001 in router ACLs - https://phabricator.wikimedia.org/T205513 (10MoritzMuehlenhoff) Works like a charm! [10:55:29] !log converting dewiki.flaggedtemplates to TokuDB on host dbstrore1002 (T205544) [10:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:34] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 [10:57:30] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:58:42] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@961aa5a]: Update mobileapps to 38271fa [10:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:58] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [11:00:18] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [11:00:19] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [11:00:38] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [11:00:48] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [11:01:09] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [11:01:19] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [11:01:19] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [11:01:19] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [11:01:28] RECOVERY - mobileapps 
endpoints health on scb2002 is OK: All endpoints are healthy [11:01:39] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [11:01:48] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@961aa5a]: Update mobileapps to 38271fa (duration: 03m 05s) [11:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:08] PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdh1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sdh1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [11:02:19] 10Operations, 10Scap, 10Datacenter-Switchover-2018, 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: Scap is checking canary servers in dormant instead of active-dc - https://phabricator.wikimedia.org/T204907 (10akosiaris) [11:02:28] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [11:04:34] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10faidon) >>! In T41785#4622337, @Andrew wrote: > ``` > Andrews-MacBook-Pro-3:~ andrew$ dig +short -x 185.15.56.18 > mx-out01.cloudinfra.wmflabs.org. > mx-out01.wmflabs.... 
[11:05:59] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational [11:07:17] (03PS2) 10Arturo Borrero Gonzalez: d/service: fix templates leftovers [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/463445 [11:09:20] 10Operations, 10Scap, 10Datacenter-Switchover-2018, 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: Scap is checking canary servers in dormant instead of active-dc - https://phabricator.wikimedia.org/T204907 (10akosiaris) As far as solving the logstash URL I think the best approac... [11:10:46] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Krenair) >>! In T41785#4624771, @faidon wrote: >>>! In T41785#4622337, @Andrew wrote: >> ``` >> Andrews-MacBook-Pro-3:~ andrew$ dig +short -x 185.15.56.18 >> mx-out01.... [11:11:49] (03PS1) 10Alexandros Kosiaris: scap: Update logstash URL for mediawiki canaries [puppet] - 10https://gerrit.wikimedia.org/r/463453 (https://phabricator.wikimedia.org/T204907) [11:16:09] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:19:13] 10Operations: Integrate Stretch 9.5 point release - https://phabricator.wikimedia.org/T199670 (10MoritzMuehlenhoff) These updates have been fully deployed: ``` openldap shared-mime-info base-files discover redis ``` [11:21:19] !log installing php5 security updates [11:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:17] (03CR) 10Alexandros Kosiaris: [C: 032] scap: Update logstash URL for mediawiki canaries [puppet] - 10https://gerrit.wikimedia.org/r/463453 (https://phabricator.wikimedia.org/T204907) (owner: 10Alexandros Kosiaris) [11:30:38] (03PS9) 10Arturo Borrero Gonzalez: cloudvps: add prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/462455 (https://phabricator.wikimedia.org/T203177) [11:31:19] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] d/service: fix templates leftovers [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/463445 (owner: 10Arturo Borrero Gonzalez) [11:31:32] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:32:05] (03PS3) 10Arturo Borrero Gonzalez: d/service: fix templates leftovers [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/463445 (https://phabricator.wikimedia.org/T203177) [11:32:28] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] d/service: fix templates leftovers [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/463445 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [11:38:33] !log add prometheus-openstack-exporter 0.0.8-2 to reprepro (T203177) [11:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:38] T203177: cloudvps: metrics and analytics - https://phabricator.wikimedia.org/T203177 [11:41:12] (03PS1) 10Muehlenhoff: Add gbirke to researchers and analytics-users [puppet] - 
10https://gerrit.wikimedia.org/r/463456 (https://phabricator.wikimedia.org/T202072) [11:42:09] (03CR) 10Muehlenhoff: [C: 032] Add gbirke to researchers and analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/463456 (https://phabricator.wikimedia.org/T202072) (owner: 10Muehlenhoff) [11:45:38] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting Access to view EventLogging data for gabriel-wmde / gbirke - https://phabricator.wikimedia.org/T202072 (10MoritzMuehlenhoff) 05Open>03Resolved a:05gabriel-wmde>03MoritzMuehlenhoff @gabriel-wmde I've enabled your ac... [11:45:50] (03Abandoned) 10Muehlenhoff: add Gabriel Birke to analytics-users and researchers groups [puppet] - 10https://gerrit.wikimedia.org/r/456161 (https://phabricator.wikimedia.org/T202072) (owner: 10ArielGlenn) [11:46:16] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: generate entry for 0.0.8-2 jessie-wikimedia [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/463457 [11:46:34] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] d/changelog: generate entry for 0.0.8-2 jessie-wikimedia [debs/prometheus-openstack-exporter] - 10https://gerrit.wikimedia.org/r/463457 (owner: 10Arturo Borrero Gonzalez) [11:47:42] I can't just push tags into gerrit git repos? [11:49:12] (03PS10) 10Arturo Borrero Gonzalez: cloudvps: add prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/462455 (https://phabricator.wikimedia.org/T203177) [11:50:01] arturo: you can [11:56:20] paladox: how? 
[11:56:52] arturo: I believe you can do: [11:57:00] git tag [11:57:57] git push origin HEAD:refs/heads/ [11:57:57] arturo, there's a way in the web UI to easily create them [11:58:09] Or that ^^ [11:58:18] may have to set up "push annotated tag" permissions but that should be doable if you're a repo owner [11:58:43] He’s in ldap/ops [11:58:47] So he’s a admin [11:58:58] yeah should be able to change any permissions then [11:59:13] i think you need to give --tags argument to git push [11:59:47] when i tag things i do: git tag -s tagname; git push --tags [11:59:50] https://www.irccloud.com/pastebin/PJrk540N/ [12:00:01] prohibited by gerrit bawolff [12:00:24] arturo: which repo is this? [12:00:47] paladox: is in the paste [12:00:51] 10Operations, 10Citoid, 10Services, 10Patch-For-Review, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10mobrovac) >>! In T201611#4624686, @Mvolz wrote: > Great, I'll test it today/Monday with citoid and hopefully get a patch up soon. That repo points to and uses tw... [12:00:55] Sounds permission related. Its definitely worked for me on repos i maintain on gerrit [12:01:14] lemme see if I can fix the permissions [12:01:54] !log installint php security updates on einsteinium (icinga.wikimedia.org) [12:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:19] arturo, https://gerrit.wikimedia.org/r/#/c/operations/debs/prometheus-openstack-exporter/+/463458/ [12:02:34] (03PS1) 10Alex Monk: Allow ops to create tags [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/463458 [12:02:58] ok thanks Krenair, shall I just merge that? 
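Editorial sketch: the commands quoted above are truncated in the log, so the exact invocation is unknown. As a rough illustration, tags live under `refs/tags/` rather than `refs/heads/`, and a single tag can be pushed by naming that ref explicitly. The helper below only builds the argument lists; the tag name `v1.0.0`, the tag message, and the function name are hypothetical, and Gerrit must still grant the tag-push permission discussed in the chat.

```python
def tag_push_commands(tag, remote="origin", sign=False):
    """Build the git commands to create an annotated (or signed) tag and push it.

    Returns two argument lists suitable for subprocess.run(); nothing is executed here.
    """
    # -s creates a GPG-signed tag, -a an unsigned annotated tag.
    create = ["git", "tag", "-s" if sign else "-a", tag, "-m", f"Release {tag}"]
    # Pushing refs/tags/<tag> sends only this tag, unlike `git push --tags`.
    push = ["git", "push", remote, f"refs/tags/{tag}"]
    return [create, push]
```

Pushing one named ref instead of using `--tags` avoids sending unrelated local tags to the remote.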
[12:02:59] might need push to refs/tags/* but let's see how this goes [12:03:00] yeah [12:03:08] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] Allow ops to create tags [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/463458 (owner: 10Alex Monk) [12:03:19] alright now try [12:03:33] Krenair: same -_- [12:03:36] bah [12:03:50] what if you do it through https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/prometheus-openstack-exporter,tags arturo ? [12:04:12] mmm moritzm how do you do git tags for deb packaging repos? [12:05:26] I did get this working at some point [12:05:52] arturo: usually only for the releases in the upstream branch [12:06:13] moritzm: and how do you push tags to gerrit? [12:06:22] (03PS1) 10Paladox: Modify access rules [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/463460 [12:06:48] arturo: https://gerrit.wikimedia.org/r/#/c/operations/debs/prometheus-openstack-exporter/+/463460/ [12:07:18] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] Modify access rules [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/463460 (owner: 10Paladox) [12:07:39] paladox: please rebase :-P [12:07:52] (merge conflict) [12:08:06] Ok [12:08:06] https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki,access is what mediawiki has and i can definitely push tags from things that inherit from that [12:09:29] (03PS2) 10Paladox: Modify access rules [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/463460 [12:09:43] (03PS3) 10Paladox: Modify access rules [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/463460 [12:10:16] (03PS4) 10Paladox: Modify access rules [debs/prometheus-openstack-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/463460 [12:10:37] arturo: done [12:11:11] great paladox !! 
it works now [12:11:31] :) [12:28:42] !log downtime cloudcontrol1004.wikimedia.org for 2H (tests related to T203177) [12:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:47] T203177: cloudvps: metrics and analytics - https://phabricator.wikimedia.org/T203177 [12:36:27] (03PS11) 10Arturo Borrero Gonzalez: cloudvps: add prometheus-openstack-exporter [puppet] - 10https://gerrit.wikimedia.org/r/462455 (https://phabricator.wikimedia.org/T203177) [12:39:17] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compilation is OK:" [puppet] - 10https://gerrit.wikimedia.org/r/462455 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [12:39:21] !log converting wikidatawiki.text to TokuDB on host dbstrore1002 (T205544) [12:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:26] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 [12:42:13] !log downtime cloudcontrol1003.wikimedia.org for 2H (tests related to T203177) [12:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:18] T203177: cloudvps: metrics and analytics - https://phabricator.wikimedia.org/T203177 [12:43:01] godog: you around? 
[12:46:11] nevermind [12:46:55] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: fix ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/463465 (https://phabricator.wikimedia.org/T203177) [12:47:40] (03CR) 10Arturo Borrero Gonzalez: [C: 032] prometheus-openstack-exporter: fix ferm syntax [puppet] - 10https://gerrit.wikimedia.org/r/463465 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [12:49:13] !log mobrovac@deploy1001 Started deploy [restbase/deploy@7caf4d8]: Update metrics top endpoints [12:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:49] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: fix comma in ferm array [puppet] - 10https://gerrit.wikimedia.org/r/463466 (https://phabricator.wikimedia.org/T203177) [12:51:36] (03CR) 10Arturo Borrero Gonzalez: [C: 032] prometheus-openstack-exporter: fix comma in ferm array [puppet] - 10https://gerrit.wikimedia.org/r/463466 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [12:57:54] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: typo in config template [puppet] - 10https://gerrit.wikimedia.org/r/463467 (https://phabricator.wikimedia.org/T203177) [12:59:19] (03CR) 10Arturo Borrero Gonzalez: [C: 032] prometheus-openstack-exporter: typo in config template [puppet] - 10https://gerrit.wikimedia.org/r/463467 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [13:00:07] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@7caf4d8]: Update metrics top endpoints (duration: 10m 55s) [13:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:16] !log mobrovac@deploy1001 Started deploy [restbase/deploy@7caf4d8]: Update metrics top endpoints, take #2 [13:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:21] (03CR) 10Mark Bergsma: [C: 032] Test Server invariants [debs/pybal] - 
10https://gerrit.wikimedia.org/r/445207 (https://phabricator.wikimedia.org/T184715) (owner: 10Mark Bergsma) [13:01:10] (03Merged) 10jenkins-bot: Test Server invariants [debs/pybal] - 10https://gerrit.wikimedia.org/r/445207 (https://phabricator.wikimedia.org/T184715) (owner: 10Mark Bergsma) [13:02:40] (03PS1) 10Alexandros Kosiaris: Rearrange conftool mediawiki node stanzas [puppet] - 10https://gerrit.wikimedia.org/r/463468 [13:02:42] (03PS1) 10Alexandros Kosiaris: Use conftool to populate mw canaries in scap [puppet] - 10https://gerrit.wikimedia.org/r/463469 (https://phabricator.wikimedia.org/T204907) [13:04:03] _joe_: I 'd like to hear your thoughts ^ [13:04:21] <_joe_> akosiaris: hah yeah I'll look at it in a second [13:04:29] I considered an alternative approach per the comment, but it seemed to duplicate information [13:04:38] I am also wondering whether I can ditch the dummy port [13:05:14] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@7caf4d8]: Update metrics top endpoints, take #2 (duration: 04m 57s) [13:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:19] !log mobrovac@deploy1001 Started deploy [restbase/deploy@7caf4d8]: Update metrics top endpoints, take #3 [13:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:02] (03CR) 10Mark Bergsma: [C: 032] Remove Server.modified and refresh preexisting servers individually [debs/pybal] - 10https://gerrit.wikimedia.org/r/446614 (owner: 10Mark Bergsma) [13:06:42] (03Merged) 10jenkins-bot: Remove Server.modified and refresh preexisting servers individually [debs/pybal] - 10https://gerrit.wikimedia.org/r/446614 (owner: 10Mark Bergsma) [13:07:25] <_joe_> akosiaris: I like your approach [13:07:43] <_joe_> it will mean we will have one additional service on those hosts, as far as conftool is concerned [13:07:59] <_joe_> and they will be depooled/pooled like normal appservers [13:08:09] <_joe_> that doesn't make scap pick the right 
ones, though [13:08:58] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@7caf4d8]: Update metrics top endpoints, take #3 (duration: 03m 39s) [13:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:02] !log mobrovac@deploy1001 Started deploy [restbase/deploy@7caf4d8]: Update metrics top endpoints, take #4 [13:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:22] <_joe_> what was your idea there? [13:09:32] <_joe_> I had a lamer solution in mind tbh [13:10:05] what do you mean it does not make scap pick the right ones ? [13:10:07] could anybody help me spot the typo? [13:10:14] https://www.irccloud.com/pastebin/NII6Jwlz/ [13:10:15] aaah that it still is hardcoded [13:10:25] instead of being discovered ... [13:10:34] (I can't discover what's missing in puppet) [13:10:37] <_joe_> so, to solve the immediate issue [13:10:59] <_joe_> I would've created two static dsh files [13:11:35] arturo: @listen_port ? [13:11:37] <_joe_> or better, a confd file [13:11:56] lol thanks akosiaris [13:12:11] arturo: actually all variables need an @ there [13:12:13] <_joe_> with both sets, which would print either group based on the mwconfig variable (in conftool) for the master DC [13:12:22] yes yes [13:13:04] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: fix syntax in variables used in template [puppet] - 10https://gerrit.wikimedia.org/r/463470 (https://phabricator.wikimedia.org/T203177) [13:13:50] (03CR) 10Arturo Borrero Gonzalez: [C: 032] prometheus-openstack-exporter: fix syntax in variables used in template [puppet] - 10https://gerrit.wikimedia.org/r/463470 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [13:15:18] _joe_: two static dsh files ? 
[13:15:32] <_joe_> < _joe_> or better, a confd file [13:15:38] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@7caf4d8]: Update metrics top endpoints, take #4 (duration: 06m 36s) [13:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:45] yeah that one I am still parsing it [13:15:56] trying to imagine how to implement it [13:16:05] <_joe_> the confd template would be [13:16:08] possibly in conjuction with my idea [13:16:56] <_joe_> if wmfMasterDatacenter == 'eqiad' { of course this is go text/template [13:17:22] <_joe_> so it will look something like [13:18:27] 10Operations, 10Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (10MoritzMuehlenhoff) bacula-fd is installed on number of servers, but is not a sensible candidate for automated restarts; if the Director attempts to connect to connect... [13:18:45] <_joe_> {{ dc := getv <%= @conftool_prefix %>/mediawiki-config/common/wmfMasterDatacenter }}{{ if dc == 'eqiad'}} [13:18:46] (03PS7) 10Alex Monk: Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 [13:19:07] 10Operations, 10Patch-For-Review: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991 (10MoritzMuehlenhoff) [13:19:11] <_joe_> akosiaris: and it's downright impossible to talk here :/ [13:19:46] (03CR) 10Mark Bergsma: [C: 032] Don't recalculate server.up in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447766 (owner: 10Mark Bergsma) [13:19:57] what ? you don't like our new robot overlords ? 
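Editorial sketch: the confd idea above amounts to keeping one canary list per datacenter and emitting whichever list matches the master datacenter value stored in conftool. The Python below only illustrates that selection logic under stated assumptions; it is not the confd template itself, and the host names and the function name `active_canaries` are hypothetical.

```python
def active_canaries(master_dc, canaries_by_dc):
    """Return the canary hosts for the datacenter currently acting as master.

    canaries_by_dc maps a datacenter name to its list of canary hosts.
    """
    try:
        return sorted(canaries_by_dc[master_dc])
    except KeyError:
        # Fail loudly rather than silently checking canaries in the wrong site.
        raise ValueError(f"no canary group defined for datacenter {master_dc!r}") from None


# Hypothetical data: real host lists would come from conftool, not from code.
EXAMPLE_GROUPS = {
    "eqiad": ["canary-b.eqiad.example", "canary-a.eqiad.example"],
    "codfw": ["canary-a.codfw.example"],
}
```

Raising on an unknown datacenter is a design choice for the sketch: an empty canary list could otherwise pass unnoticed during a switchover.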
[13:20:27] (03Merged) 10jenkins-bot: Don't recalculate server.up in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447766 (owner: 10Mark Bergsma) [13:24:52] (03PS2) 10Alexandros Kosiaris: Rearrange conftool mediawiki node stanzas [puppet] - 10https://gerrit.wikimedia.org/r/463468 [13:24:57] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Rearrange conftool mediawiki node stanzas [puppet] - 10https://gerrit.wikimedia.org/r/463468 (owner: 10Alexandros Kosiaris) [13:26:37] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: fix subvars usage [puppet] - 10https://gerrit.wikimedia.org/r/463472 (https://phabricator.wikimedia.org/T203177) [13:27:01] !log rebooting tungsten for kernel security update [13:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:31] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Weird compiler output: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/12669/console" [puppet] - 10https://gerrit.wikimedia.org/r/463472 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [13:28:50] (03PS1) 10Muehlenhoff: Remove additional role include [puppet] - 10https://gerrit.wikimedia.org/r/463473 [13:28:58] (03CR) 10Vgutierrez: [C: 032] Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 (owner: 10Alex Monk) [13:30:39] (03Merged) 10jenkins-bot: Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 (owner: 10Alex Monk) [13:32:02] PROBLEM - High CPU load on API appserver on mw2139 is CRITICAL: CRITICAL - load average: 58.44, 21.32, 12.32 [13:32:14] (03CR) 10jenkins-bot: Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 (owner: 10Alex Monk) [13:33:03] RECOVERY - High CPU load on API appserver on mw2139 is OK: OK - load average: 23.49, 18.21, 11.85 [13:40:49] 
10Operations, 10Recommendation-API, 10Research, 10SCB, 10Services (next): Setup access from service to mysql - https://phabricator.wikimedia.org/T205452 (10bmansurov) Friendly ping @Joe @fgiunchedi. Can you please help with this task? Thanks! [14:02:11] 10Operations, 10ops-codfw: wtp2020: correctable memory errors - https://phabricator.wikimedia.org/T205712 (10MoritzMuehlenhoff) [14:02:37] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2020 is CRITICAL: 5 ge 4 Muehlenhoff T205712 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw%2520prometheus%252Fops [14:08:38] (03Abandoned) 10Alexandros Kosiaris: Display etcd /mediawiki-config values in noc.w.o [puppet] - 10https://gerrit.wikimedia.org/r/455578 (owner: 10Alexandros Kosiaris) [14:18:04] (03PS3) 10Mark Bergsma: Modernize and cleanup Coordinator [debs/pybal] - 10https://gerrit.wikimedia.org/r/447775 [14:22:09] !log converting dewiki.flaggedimages to TokuDB on host dbstore1002 (T205544) [14:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:15] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 [14:23:34] (03PS1) 10Urbanecm: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) [14:24:39] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) (owner: 10Urbanecm) [14:26:53] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1086, db1104 to setup backup source for s7,s8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463443 (owner: 10Jcrespo) [14:27:53] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] 
https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [14:28:00] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1086, db1104 to setup backup source for s7,s8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463443 (owner: 10Jcrespo) [14:29:32] (03PS3) 10Alex Monk: Change how DNS update commands work to handle problematic values [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 [14:29:38] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1086, db1104 (duration: 00m 55s) [14:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:55] (03PS1) 10Urbanecm: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) [14:32:12] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [14:33:13] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546) (owner: 10Urbanecm) [14:37:32] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1086, db1104 to setup backup source for s7,s8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463443 (owner: 10Jcrespo) [14:37:53] (03PS1) 10Jcrespo: mariadb: Setup db1116 for backup generation on eqiad of s7 and s8 [puppet] - 10https://gerrit.wikimedia.org/r/463484 (https://phabricator.wikimedia.org/T201392) [14:41:28] (03CR) 10Jcrespo: "Aside from deploying this, it requires to add an additional account for backups running as mentioned here: T111929" [puppet] - 10https://gerrit.wikimedia.org/r/463484 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [14:42:39] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to 
multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704 (10jcrespo) [14:42:58] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10herron) Ok, I've deleted the A records `mx-outNN.cloudinfra.wmflabs.org`. FWIW was not intending to have two sets of records. Like I mentioned in T41785#4615256 goin... [14:46:58] (03CR) 10Cwhite: "Looks good, save one inline comment." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463404 (owner: 10Dzahn) [14:47:59] (03PS2) 10Giuseppe Lavagetto: parsoid: connect to MediaWiki via https everywhere [puppet] - 10https://gerrit.wikimedia.org/r/458475 [14:48:01] (03PS3) 10Giuseppe Lavagetto: service::node::config::scap3: get rid of confd-controlled configs [puppet] - 10https://gerrit.wikimedia.org/r/458476 [14:48:03] (03PS1) 10Giuseppe Lavagetto: profile::lvs::realserver: new profile to substitute the role [puppet] - 10https://gerrit.wikimedia.org/r/463488 [14:48:05] (03PS1) 10Giuseppe Lavagetto: parsoid: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463489 [14:48:07] (03PS1) 10Giuseppe Lavagetto: parsoid: remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/463490 [14:48:22] (03CR) 10Cwhite: [C: 031] icinga::plugins: add user/group param and avoid out-of-scope vars [puppet] - 10https://gerrit.wikimedia.org/r/463403 (owner: 10Dzahn) [14:49:02] (03CR) 10jerkins-bot: [V: 04-1] parsoid: connect to MediaWiki via https everywhere [puppet] - 10https://gerrit.wikimedia.org/r/458475 (owner: 10Giuseppe Lavagetto) [14:49:20] (03CR) 10Cwhite: [C: 031] certspotter/alerting_host: don't include role inside role, -> profile [puppet] - 10https://gerrit.wikimedia.org/r/463400 (owner: 10Dzahn) [14:49:25] (03CR) 10jerkins-bot: [V: 04-1] profile::lvs::realserver: new profile to substitute the role [puppet] - 10https://gerrit.wikimedia.org/r/463488 (owner: 10Giuseppe Lavagetto) [14:50:54] 
10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Krenair) >>! In T41785#4625330, @herron wrote: > mail traffic will continue flowing through the various public outbound NAT IPs depending on the instance config Hm, I... [14:52:09] (03PS2) 10Giuseppe Lavagetto: profile::lvs::realserver: new profile to substitute the role [puppet] - 10https://gerrit.wikimedia.org/r/463488 [14:54:35] (03PS3) 10Mark Bergsma: Extend testConfigServerRemoval test case. [debs/pybal] - 10https://gerrit.wikimedia.org/r/447770 [14:56:00] (03CR) 10Giuseppe Lavagetto: [C: 032] "This fixes the behaviour of the plugin with newer versions of puppet-lint" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/463314 (owner: 10Cwhite) [14:58:06] (03PS4) 10Mark Bergsma: Extend testConfigServerRemoval test case. [debs/pybal] - 10https://gerrit.wikimedia.org/r/447770 [14:59:03] (03CR) 10Cwhite: "Giuseppe mentioned that the interface::add_ip6_mapped declaration belongs in site.pp. I checked and the linter does not appear to complai" [puppet] - 10https://gerrit.wikimedia.org/r/463397 (owner: 10Dzahn) [14:59:55] (03PS1) 10Muehlenhoff: Restrict Icinga rsync access [puppet] - 10https://gerrit.wikimedia.org/r/463492 [15:01:22] (03CR) 10Cwhite: [C: 031] alerting_host/tcpircbot: remove useless role, include only profile [puppet] - 10https://gerrit.wikimedia.org/r/463398 (owner: 10Dzahn) [15:03:05] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Andrew) @herron, the IP ranges for VMs are: eqiad1: 172.16.0.0/21 eqiad: 10.68.16.0/21 We often just use 10.0.0.0/8 for eqiad since production is unable to connect... 
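(editor's note) The two VM ranges quoted in the task comment above, and the broader 10.0.0.0/8 shortcut, can be checked with a short sketch. This is only an illustration using Python's standard `ipaddress` module; the helper name `matching_ranges` and the sample addresses in the usage note are hypothetical and do not come from the log.

```python
import ipaddress

# Ranges exactly as quoted in the task comment (T41785).
VM_RANGES = {
    "eqiad1": ipaddress.ip_network("172.16.0.0/21"),
    "eqiad": ipaddress.ip_network("10.68.16.0/21"),
}

# The broad shortcut mentioned for eqiad.
BROAD_EQIAD = ipaddress.ip_network("10.0.0.0/8")


def matching_ranges(address):
    """Return the names of the quoted VM ranges that contain `address`."""
    ip = ipaddress.ip_address(address)
    return sorted(name for name, net in VM_RANGES.items() if ip in net)


# Which of the quoted ranges would the broad 10.0.0.0/8 rule also cover?
covered_by_broad = {
    name: net.subnet_of(BROAD_EQIAD) for name, net in VM_RANGES.items()
}
```

For instance, `matching_ranges("10.68.17.5")` falls in the `eqiad` range, while `covered_by_broad` shows that a 10.0.0.0/8 rule covers the `eqiad` range but not `eqiad1`, so a rule written for the former would presumably need a separate entry for 172.16.0.0/21.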
[15:07:20] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/12670/" [puppet] - 10https://gerrit.wikimedia.org/r/463492 (owner: 10Muehlenhoff) [15:09:02] (03CR) 10Cwhite: [C: 032] naggen2: restrict generated defines to valid options [puppet] - 10https://gerrit.wikimedia.org/r/462791 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [15:09:15] (03PS6) 10Cwhite: naggen2: restrict generated defines to valid options [puppet] - 10https://gerrit.wikimedia.org/r/462791 (https://phabricator.wikimedia.org/T202782) [15:10:38] !log activate Equinix peering sessions on cr4-ulsfo [15:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:54] (03PS1) 10Mark Bergsma: Add .vscode/ to .gitignore [debs/pybal] - 10https://gerrit.wikimedia.org/r/463494 [15:12:21] (03CR) 10Mark Bergsma: [C: 032] Add .vscode/ to .gitignore [debs/pybal] - 10https://gerrit.wikimedia.org/r/463494 (owner: 10Mark Bergsma) [15:13:03] (03Merged) 10jenkins-bot: Add .vscode/ to .gitignore [debs/pybal] - 10https://gerrit.wikimedia.org/r/463494 (owner: 10Mark Bergsma) [15:13:32] (03Abandoned) 10Paladox: Gerrit: Make PolyGerrit the default ui [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox) [15:13:56] (03CR) 10Cwhite: [C: 031] Restrict Icinga rsync access [puppet] - 10https://gerrit.wikimedia.org/r/463492 (owner: 10Muehlenhoff) [15:17:32] (03PS3) 10Giuseppe Lavagetto: profile::lvs::realserver: new profile to substitute the role [puppet] - 10https://gerrit.wikimedia.org/r/463488 [15:17:34] (03PS2) 10Giuseppe Lavagetto: parsoid: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463489 [15:19:26] (03PS3) 10Giuseppe Lavagetto: parsoid: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463489 [15:21:08] (03CR) 10Cwhite: naggen2: python3 and remove activerecord support (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463133 
(https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [15:21:15] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/12672/wtp1045.eqiad.wmnet/ shows the change is a noop" [puppet] - 10https://gerrit.wikimedia.org/r/463489 (owner: 10Giuseppe Lavagetto) [15:22:16] (03PS2) 10Andrew Bogott: Updates to Product Analytics profiles and roles [puppet] - 10https://gerrit.wikimedia.org/r/458907 (owner: 10Bearloga) [15:23:04] (03CR) 10Andrew Bogott: [C: 032] Updates to Product Analytics profiles and roles [puppet] - 10https://gerrit.wikimedia.org/r/458907 (owner: 10Bearloga) [15:38:40] (03PS2) 10Jcrespo: mariadb: Setup db1116 for backup generation on eqiad of s7 and s8 [puppet] - 10https://gerrit.wikimedia.org/r/463484 (https://phabricator.wikimedia.org/T201392) [15:47:55] (03PS1) 10Banyek: user: some dotfiles for user banyek [puppet] - 10https://gerrit.wikimedia.org/r/463502 [15:50:01] (03CR) 10Jcrespo: user: some dotfiles for user banyek (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463502 (owner: 10Banyek) [15:51:04] !log ladsgroup@mwmaint2001:~$ mwscript extensions/CentralAuth/maintenance/deleteLocalPasswords.php --delete on mediawiki.org and testwiki [15:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:56] \o/ [15:53:32] \o/ [15:53:41] (03PS2) 10Banyek: user: some dotfiles for user banyek [puppet] - 10https://gerrit.wikimedia.org/r/463502 [16:01:15] !log ladsgroup@mwmaint2001:~$ mwscript extensions/CentralAuth/maintenance/deleteLocalPasswords.php --wiki=fawiki --prefix (T201009) [16:01:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:20] T201009: Run deleteLocalPasswords.php in WMF prod (Central Auth wikis only!) 
after 1.32.0-wmf.16 is everywhere - https://phabricator.wikimedia.org/T201009 [16:05:34] !log compressing tables at db1116:3317, stopping replication [16:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:19] PROBLEM - confd service on deploy1001 is CRITICAL: CRITICAL - Expecting active but unit confd is activating [16:07:28] RECOVERY - confd service on deploy1001 is OK: OK - confd is active [16:12:06] that's ^ me ignore [16:18:09] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10herron) >>! In T41785#4625386, @Andrew wrote: > @herron, the IP ranges for VMs are: > > eqiad1: 172.16.0.0/21 > eqiad: 10.68.16.0/21 > > We often just use 10.0.0.0/... [16:18:49] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:19:28] (03CR) 10Jcrespo: [C: 031] "+1 as in, this seems correct/non dangerous, but of course this is all pure personal preferences. From my view the jedi tm" [puppet] - 10https://gerrit.wikimedia.org/r/463502 (owner: 10Banyek) [16:19:58] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:22:44] (03PS1) 10Bstorm: openstack: add case for stretch and newtron in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463506 (https://phabricator.wikimedia.org/T203254) [16:23:00] (03CR) 10Jcrespo: [C: 04-2] "Grants deployed, waiting on completed table compression." 
[puppet] - 10https://gerrit.wikimedia.org/r/463484 (https://phabricator.wikimedia.org/T201392) (owner: 10Jcrespo) [16:23:03] ACKNOWLEDGEMENT - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Active Ayounsi ulsfo DC move (waiting on Equinix to update their MAC filter) https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:23:38] (03PS2) 10Daniel Kinzler: Enable injection of RC records on wikidata org. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370966 [16:25:04] !log add HKBN BGP sessions to esams and eqsin [16:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:10] (03PS2) 10Bstorm: openstack: add case for stretch and newtron in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463506 (https://phabricator.wikimedia.org/T203254) [16:25:14] (03CR) 10Banyek: ":) I use tmux since it exists, because I wanted vertical split - and then I stayed there" [puppet] - 10https://gerrit.wikimedia.org/r/463502 (owner: 10Banyek) [16:25:50] (03CR) 10Banyek: "but i won't merge this until monday" [puppet] - 10https://gerrit.wikimedia.org/r/463502 (owner: 10Banyek) [16:27:24] (03PS3) 10Bstorm: openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463506 (https://phabricator.wikimedia.org/T203254) [16:33:31] (03CR) 10Andrew Bogott: [C: 032] Labs monitoring: Authorise new shinken host [puppet] - 10https://gerrit.wikimedia.org/r/461957 (https://phabricator.wikimedia.org/T204562) (owner: 10Alex Monk) [16:33:38] (03PS2) 10Andrew Bogott: Labs monitoring: Authorise new shinken host [puppet] - 10https://gerrit.wikimedia.org/r/461957 (https://phabricator.wikimedia.org/T204562) (owner: 10Alex Monk) [16:34:08] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:38:59] 10Operations, 10Wikimedia-Mailing-lists: I get a "403 Forbidden" error when subscribing to a list - 
https://phabricator.wikimedia.org/T205694 (10herron) Hi @Klein, this often means that the IP address being used to subscribe is present on a spam list. Do you encounter the same issue when attempting to subscri... [16:39:33] 10Operations, 10Core Platform Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10dmaza) Is there any particular reason why we are doing APC instead of an alternative considering that APC is unmaintained? [16:43:36] 10Operations, 10Core Platform Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10jcrespo) > mysqli driver interaction with MediaWiki working as expected One small comment, the mysqli driver had issues in th... [16:43:59] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:44:56] 10Operations, 10Core Platform Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Legoktm) >>! In T176370#4625612, @dmaza wrote: > Is there any particular reason why we are doing APC instead of an alternative... [16:45:17] 10Operations, 10Core Platform Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Legoktm) [16:45:37] (03PS1) 10Paladox: WIP: Update gerrit to 2.16 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/463509 [16:46:43] 10Operations, 10Core Platform Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10dmaza) >>! In T176370#4625630, @Legoktm wrote: >>>! 
In T176370#4625612, @dmaza wrote: >> Is there any particular reason why we... [16:47:29] (03PS2) 10Paladox: WIP: Update gerrit to 2.16 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/463509 [16:58:14] 10Operations, 10Wikimedia-Mailing-lists: I get a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T205694 (10Aklapper) 05Open>03stalled [16:59:52] (03PS1) 10Gehel: wdqs: collect JMX metrics from ConcurrentHttpRequestsFilter [puppet] - 10https://gerrit.wikimedia.org/r/463511 (https://phabricator.wikimedia.org/T204364) [17:00:47] (03Abandoned) 10Jcrespo: Revert "multiinstance.pp: Page based on the number of processess" [puppet] - 10https://gerrit.wikimedia.org/r/449712 (owner: 10Jcrespo) [17:00:52] 10Operations, 10Core Platform Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) >>! In T176370#4625626, @jcrespo wrote: >> mysqli driver interaction with MediaWiki working as expected > > One smal... [17:01:39] Krinkle: it is ok to leave it ticket, it was a TODO for me comment [17:01:50] *ticked [17:02:22] I don't think issues would be worse than they are now, if something is going to work, it is the mysql driver [17:02:33] 10Operations, 10Core Platform Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [17:02:37] It's still good just to run a couple tests. [17:02:48] We can make something together, just to understand its behaviour beyond doubt. [17:02:54] and compare to HHVM. [17:03:13] Maybe we'll find some differences we can configure, or some differences we cannot control but should know about. 
[17:03:53] we have pending with tim to run several controlled breakage scenarios [17:04:15] (servers DROP, server REJECTs packages, lag, replication broken) [17:04:33] we could do them on both HHVM and PHP [17:16:44] (03PS2) 10Dzahn: Remove additional role include [puppet] - 10https://gerrit.wikimedia.org/r/463473 (owner: 10Muehlenhoff) [17:16:56] (03CR) 10Dzahn: [C: 032] Remove additional role include [puppet] - 10https://gerrit.wikimedia.org/r/463473 (owner: 10Muehlenhoff) [17:19:35] (03CR) 10Dzahn: [C: 032] "tungsten shows an unexpected puppet error .. which hardly could be related to this.. Invalid relationship: Exec[git_clone_operations/softw" [puppet] - 10https://gerrit.wikimedia.org/r/463473 (owner: 10Muehlenhoff) [17:21:18] (03CR) 10Dzahn: [C: 032] "oh.. well it is. test includes standard, standard includes git and xhgui does NOT include standard but it should.. will fix it there" [puppet] - 10https://gerrit.wikimedia.org/r/463473 (owner: 10Muehlenhoff) [17:22:38] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:23:28] ACKNOWLEDGEMENT - puppet last run on tungsten is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463473/ [17:26:10] (03PS1) 10Dzahn: xhgui::app: add standard and firewall includes to role [puppet] - 10https://gerrit.wikimedia.org/r/463513 [17:26:28] 10Operations, 10Puppet: Puppet agent takes a long time to finish when adding IPv6 addresses - https://phabricator.wikimedia.org/T205577 (10herron) Did the host have an existing IPv6 address when the puppet run was started? If so puppet changing the IP mid-run is probably enough to interrupt the run in progress. 
[17:26:45] 10Operations, 10Puppet: Puppet agent takes a long time to finish when adding IPv6 addresses - https://phabricator.wikimedia.org/T205577 (10herron) p:05Triage>03Normal [17:26:54] (03CR) 10Dzahn: [C: 032] xhgui::app: add standard and firewall includes to role [puppet] - 10https://gerrit.wikimedia.org/r/463513 (owner: 10Dzahn) [17:27:56] 10Operations, 10Puppet: Puppet agent takes a long time to finish when adding IPv6 addresses - https://phabricator.wikimedia.org/T205577 (10Dzahn) I don't think it's actually puppet hanging. I think what is happening is that in the moment puppet adds the new IP to the interface your existing connecting gets int... [17:29:35] (03PS3) 10Paladox: WIP: Update gerrit to 2.16 [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/463509 [17:32:48] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:34:21] 10Operations, 10Puppet: Puppet agent takes a long time to finish when adding IPv6 addresses - https://phabricator.wikimedia.org/T205577 (10herron) >>! In T205577#4625894, @Dzahn wrote: > I don't think it's actually puppet hanging. I think what is happening is that in the moment puppet adds the new IP to the in... 
[17:36:48] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 11, down: 0, shutdown: 91 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:41:03] (03PS23) 10Bstorm: WIP toolforge: write/move a sonofgridengine module and toolforge profile [puppet] - 10https://gerrit.wikimedia.org/r/448791 (https://phabricator.wikimedia.org/T200557) [17:44:50] (03PS1) 10Herron: ircecho: restart service on change [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) [17:45:28] (03CR) 10jerkins-bot: [V: 04-1] ircecho: restart service on change [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) (owner: 10Herron) [17:46:34] (03PS2) 10Herron: ircecho: restart service on change [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) [17:46:36] picky picky [17:47:31] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/12673/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/463515 (https://phabricator.wikimedia.org/T205539) (owner: 10Herron) [17:48:58] (03CR) 10Dzahn: [C: 032] "needed follow-up https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463513/" [puppet] - 10https://gerrit.wikimedia.org/r/463473 (owner: 10Muehlenhoff) [17:49:10] 10Operations, 10IRCecho, 10Patch-For-Review: Puppet doesn't restart ircecho when the code changes - https://phabricator.wikimedia.org/T205539 (10herron) p:05Triage>03Normal [17:49:41] 10Operations, 10Wikimedia-Mailing-lists: I get a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T205694 (10herron) p:05Triage>03Normal [17:50:24] (03CR) 10Dzahn: "re: tungsten This changed now with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463473/ and https://gerrit.wikimedia.org/r/#/" [puppet] - 10https://gerrit.wikimedia.org/r/460064 (owner: 10Dzahn) [17:54:23] (03PS4) 10Paladox: WIP: Update gerrit to 2.16 [software/gerrit] 
(deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/463509 [17:59:26] (03PS4) 10Dzahn: icinga: disable notifications for hosts using role(test) [puppet] - 10https://gerrit.wikimedia.org/r/460064 [18:01:08] (03CR) 10Dzahn: [C: 032] "with the latest changes this now only affects 2 hosts, cp1099 which was specifically triggering this as we did _not_ expect notifications " [puppet] - 10https://gerrit.wikimedia.org/r/460064 (owner: 10Dzahn) [18:01:12] (03PS1) 10Bearloga: Add chelsyx to analytics-search-users group [puppet] - 10https://gerrit.wikimedia.org/r/463517 (https://phabricator.wikimedia.org/T204415) [18:02:27] (03CR) 10Dzahn: [C: 032] "eeden will not show a difference because puppet is disabled there with a comment that it will be reimaged shortly" [puppet] - 10https://gerrit.wikimedia.org/r/460064 (owner: 10Dzahn) [18:03:20] (03PS2) 10Bstorm: tools mail: write RBL check warning to file [puppet] - 10https://gerrit.wikimedia.org/r/463144 (https://phabricator.wikimedia.org/T202558) (owner: 10GTirloni) [18:03:35] (03CR) 10Dzahn: [C: 032] "@bblack this should now switch off the icinga alerts for cp1099 and anything in the future that will use role(test)." [puppet] - 10https://gerrit.wikimedia.org/r/460064 (owner: 10Dzahn) [18:04:03] (03CR) 10Bstorm: [C: 032] tools mail: write RBL check warning to file [puppet] - 10https://gerrit.wikimedia.org/r/463144 (https://phabricator.wikimedia.org/T202558) (owner: 10GTirloni) [18:04:04] any ops here who can +2 https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463517/? please and thank you. 
chelsyx needs it as soon as possible [18:05:09] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:05:50] bearloga, neither of the linked tickets appear to be access requests [18:06:17] you need to open an access request ticket and wait for SRE meeting review which I think is on monday [18:06:29] Krenair: she should have been in that group before [18:06:32] well [18:06:34] she's not at the moment [18:06:48] so unless there was an approved ticket that somehow missed the group... [18:07:19] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:07:56] (i.e., the group was approved along with other ones but then missed from the patch or something) [18:09:22] I'm reasonably sure that group was created without a ticket when Andrew Otto put together analytics-search user and he just forgot to include her in it. unfortunately he doesn't appear to be online atm [18:09:58] I don't think that's supposed to happen [18:10:08] changes to the admin module have to go through a process, we can't just merge it, sorry. but what we can do is set high priority to make it happen soon [18:10:31] mutante, it's a sudo request [18:10:54] unless it's urgent enough to bother ma.rk it has to wait for the meeting and the priority of the task is not taken into account afaik? [18:11:17] yea, since it's Friday it would mean Monday for sudo or non-sudo at the least [18:11:50] how did the original group membership come into being without SRE meeting review? [18:12:03] maybe there was one, idk [18:13:36] analytics-search-users is almost 3 years old [18:13:40] we have a rotation for this, each week one member of SRE handles these requests. 
so on Monday the topic here will change to a new name (Ops Clinic Duty) and you can ping that person directly to ensure it comes up in the meeting [18:13:46] and looks like it did get approved: https://phabricator.wikimedia.org/T122620#1962302 [18:14:12] (03PS1) 10Paladox: Gerrit: Add flogger javaopts [puppet] - 10https://gerrit.wikimedia.org/r/463519 [18:15:36] (03PS2) 10Paladox: Gerrit: Add flogger javaopts [puppet] - 10https://gerrit.wikimedia.org/r/463519 [18:15:55] (03CR) 10Paladox: "This is needed for gerrit 2.16 otherwise we may lose logging." [puppet] - 10https://gerrit.wikimedia.org/r/463519 (owner: 10Paladox) [18:16:51] Krenair mutante: she's gonna create a request :) [18:17:53] ok [18:18:21] it doesn't necessarily need to come from the person getting access afaik, but sure [18:18:30] 10Operations, 10Puppet: Puppet agent takes a long time to finish when adding IPv6 addresses - https://phabricator.wikimedia.org/T205577 (10Dzahn) Also see this epic ticket to solve the entire thing "once and for all" :) -> T102099 [18:20:14] bearloga: yep, cool. what Krenair said and setting Prio to High is fine. somebody will bring it to the meeting and i will also try to keep an eye on it [18:20:32] thanks! [18:21:20] (03PS2) 10Cwhite: naggen2: python3 and remove activerecord support [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) [18:22:12] (03PS3) 10Cwhite: naggen2: python3 and remove activerecord support [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) [18:23:06] (03PS4) 10Cwhite: naggen2: python3 and remove activerecord support [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) [18:23:54] (03CR) 10Dzahn: [C: 032] "i don't see the class used in https://tools.wmflabs.org/openstack-browser/puppetclass/ either .. 
so going ahead" [puppet] - 10https://gerrit.wikimedia.org/r/463398 (owner: 10Dzahn) [18:24:58] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:25:08] (03CR) 10Dzahn: "but should it really be added _before_ 2.16 or should it wait until the migration and serve as a reminder only?" [puppet] - 10https://gerrit.wikimedia.org/r/463519 (owner: 10Paladox) [18:25:36] 10Operations, 10Puppet: Puppet agent takes a long time to finish when adding IPv6 addresses - https://phabricator.wikimedia.org/T205577 (10GTirloni) The host did have an IPv6 address in the same network but using the MAC address instead of the IPv4 address like we wanted. I didn't have to reconnect SSH. The s... [18:25:47] (03CR) 10Paladox: "I'm going to test to see whether it will work in 2.15 (but be ignored) otherwise yeh it will have to wait until the day we do the 2.16 upgr" [puppet] - 10https://gerrit.wikimedia.org/r/463519 (owner: 10Paladox) [18:27:09] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:28:42] (03CR) 10Dzahn: "as with all design changes involving colors there is probably going to be a lot of bike shedding here. 
what is considered to "look profess" [puppet] - 10https://gerrit.wikimedia.org/r/458593 (https://phabricator.wikimedia.org/T200739) (owner: 10Paladox) [18:28:44] (03PS3) 10Paladox: Gerrit: Add flogger javaopts [puppet] - 10https://gerrit.wikimedia.org/r/463519 (https://phabricator.wikimedia.org/T200739) [18:28:52] 10Operations, 10Puppet: Puppet agent takes a long time to finish when adding IPv6 addresses - https://phabricator.wikimedia.org/T205577 (10GTirloni) If there are bigger plans for the IPv6 config code, I think we can close this one. [18:30:13] (03PS2) 10Dzahn: alerting_host/tcpircbot: remove useless role, include only profile [puppet] - 10https://gerrit.wikimedia.org/r/463398 [18:32:52] 10Operations, 10SRE-Access-Requests: Requesting access to to `stats`, `analytics-search-users`, `statistics-privatedata-users` for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10chelsyx) p:05Triage>03High [18:34:46] (03CR) 10Dzahn: [C: 032] "well.. this did remove the extra line from motd indicating that tcpircbot runs on this server.. unsure if we are missing that. could move " [puppet] - 10https://gerrit.wikimedia.org/r/463398 (owner: 10Dzahn) [18:35:33] (03PS2) 10Dzahn: certspotter/alerting_host: don't include role inside role, -> profile [puppet] - 10https://gerrit.wikimedia.org/r/463400 [18:37:01] (03CR) 10Dzahn: [C: 032] certspotter/alerting_host: don't include role inside role, -> profile [puppet] - 10https://gerrit.wikimedia.org/r/463400 (owner: 10Dzahn) [18:38:46] 10Operations, 10Wikimedia-Mailing-lists, 10Bengali-Sites: Set up mailing list for Bengali Wikibooks - https://phabricator.wikimedia.org/T203736 (10Shahadat) sorry for my late reply. email address of the mailing list administrator: 1. hello2shahadat@gmail.com 2. info@shahadathossain.net 3. shahadatsmailbox@gm... 
[18:39:22] (03PS2) 10Bearloga: Add chelsyx to analytics-search-users group [puppet] - 10https://gerrit.wikimedia.org/r/463517 (https://phabricator.wikimedia.org/T205736) [18:41:25] (03PS1) 10Dzahn: tcpircbot:: add system::role motd snippet in profile [puppet] - 10https://gerrit.wikimedia.org/r/463525 [18:44:09] (03CR) 10jerkins-bot: [V: 04-1] tcpircbot:: add system::role motd snippet in profile [puppet] - 10https://gerrit.wikimedia.org/r/463525 (owner: 10Dzahn) [18:45:04] (03Abandoned) 10Dzahn: tcpircbot:: add system::role motd snippet in profile [puppet] - 10https://gerrit.wikimedia.org/r/463525 (owner: 10Dzahn) [19:03:11] !log phab2001 - scheduled downtime, rebooting for kernel [19:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:47] (03PS1) 10Bstorm: wiki replicas: Remove most comment joins from non-compat tables [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) [19:11:15] (03CR) 10Bstorm: "So far, I'm presuming that I should not remove the joins from image and revision so that the temp tables are "behind the scenes". However" [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [19:17:09] Phabricator will be inaccessible for a minute. The server needs to be rebooted. It will be back asap. 
[19:18:09] !log phab1001 (Phabricator), scheduled downtime, reboot for maintenance [19:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:15] (03PS2) 10Dzahn: Restrict Icinga rsync access [puppet] - 10https://gerrit.wikimedia.org/r/463492 (owner: 10Muehlenhoff) [19:34:54] (03CR) 10Dzahn: [C: 032] Restrict Icinga rsync access [puppet] - 10https://gerrit.wikimedia.org/r/463492 (owner: 10Muehlenhoff) [19:35:28] (03PS1) 10MSantos: Fix: Regenerate map tiles up to zoom level 9 [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) [19:36:09] (03CR) 10jerkins-bot: [V: 04-1] Fix: Regenerate map tiles up to zoom level 9 [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: 10MSantos) [19:37:25] hmm https://commons.wikimedia.org/wiki/Main_Page images are missing, is that known? [19:38:07] https://phabricator.wikimedia.org/F26231755 [19:38:12] paladox: i see them. could it be a parental filter due to yesterday's image of the day [19:38:26] it was working earlier today though [19:38:30] ie an hour or 2 ago [19:39:08] even after a hard reload / in another browser? [19:39:13] works now [19:39:17] holding shift during reload? ok [19:39:18] good :) [19:41:03] (03PS2) 10MSantos: Fix: Regenerate map tiles up to zoom level 9 [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) [19:42:28] (03CR) 10Dzahn: [C: 032] "+hosts allow = tegmen.wikimedia.org icinga1001.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/463492 (owner: 10Muehlenhoff) [19:49:45] (03CR) 10BearND: "Really minor nit inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463542 (https://phabricator.wikimedia.org/T202201) (owner: 10MSantos) [20:06:41] (03PS1) 10Dzahn: network::constancts: add icinga1001 to monitoring_hosts [puppet] - 10https://gerrit.wikimedia.org/r/463546 (https://phabricator.wikimedia.org/T202782) [20:14:47] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12674/" [puppet] - 10https://gerrit.wikimedia.org/r/463546 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:25:11] (03CR) 1020after4: [C: 031] Implement all the SSH agent bits and stop proxying [software/keyholder] - 10https://gerrit.wikimedia.org/r/458236 (owner: 10Faidon Liambotis) [20:29:24] (03CR) 10Dzahn: [C: 032] "this fixed the rsync over IPv6 on icinga1001 _from_ einsteinium. but we expected our more specific rsync rules to already allow that.. the" [puppet] - 10https://gerrit.wikimedia.org/r/463546 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:34:52] (03PS2) 10Dzahn: icinga::plugins: add user/group param and avoid out-of-scope vars [puppet] - 10https://gerrit.wikimedia.org/r/463403 [20:37:03] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12675/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/463403 (owner: 10Dzahn) [20:45:03] PROBLEM - Disk space on analytics1003 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: No such file or directory [20:50:32] PROBLEM - Disk space on analytics1003 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: No such file or directory [20:52:50] (03PS2) 10Dzahn: alerting_host/icinga: move "mapped v6" snippet back to site [puppet] - 10https://gerrit.wikimedia.org/r/463397 [20:54:52] RECOVERY - Disk space on analytics1003 is OK: DISK OK [20:56:50] (03PS1) 10Bearloga: profile::product_analytics::base: fix package name [puppet] - 10https://gerrit.wikimedia.org/r/463552 [20:57:09] !log analytics1003 - unmounted and remounted /mnt/hdfs after Icinga 
alerts that it was not accessible - commands from https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Fixing_HDFS_mount_at_/mnt/hdfs - like it happened before on stat1004 and others (T182342) [20:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:15] T182342: stat1004 - /mnt/hdfs is not accessible - https://phabricator.wikimedia.org/T182342 [20:58:07] 10Operations, 10Analytics, 10Analytics-Cluster: stat1004 - /mnt/hdfs is not accessible - https://phabricator.wikimedia.org/T182342 (10Dzahn) 16:45 < icinga-wm> PROBLEM - Disk space on analytics1003 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: No such file or directory 16:50 < icinga-wm> PROBLEM... [20:58:59] (03CR) 10Dzahn: "ok! doing that and moving it to site level" [puppet] - 10https://gerrit.wikimedia.org/r/463397 (owner: 10Dzahn) [21:02:39] (03CR) 10Dzahn: [C: 032] "Compilation results for einsteinium.wikimedia.org: no change" [puppet] - 10https://gerrit.wikimedia.org/r/463397 (owner: 10Dzahn) [21:02:52] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:06:53] (03CR) 10Dzahn: [C: 032] "confirmed package names, and i see this is cloud only in openstack-browser" [puppet] - 10https://gerrit.wikimedia.org/r/463552 (owner: 10Bearloga) [21:07:03] (03PS2) 10Dzahn: profile::product_analytics::base: fix package name [puppet] - 10https://gerrit.wikimedia.org/r/463552 (owner: 10Bearloga) [21:07:13] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:08:54] (03CR) 10Dzahn: icinga::naggen/web/raid/ores: avoid out-of-scope-vars everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463404 
(owner: 10Dzahn) [21:10:07] (03PS2) 10Dzahn: icinga::naggen/web/raid/ores: avoid out-of-scope-vars everywhere [puppet] - 10https://gerrit.wikimedia.org/r/463404 [21:17:12] (03Abandoned) 10Dzahn: monitoring:: add action_url next to notes_url parameter [puppet] - 10https://gerrit.wikimedia.org/r/459645 (owner: 10Dzahn) [21:19:45] (03PS4) 10Dzahn: cache::text: replace (commented) mwmaint1001 with mwmaint1002 [puppet] - 10https://gerrit.wikimedia.org/r/462036 (https://phabricator.wikimedia.org/T201343) [21:20:01] (03CR) 10Dzahn: [C: 032] "comment only - mentioned in service ops meeting too" [puppet] - 10https://gerrit.wikimedia.org/r/462036 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [21:42:58] (03PS1) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [21:43:37] (03CR) 10jerkins-bot: [V: 04-1] mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [21:43:49] (03PS2) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [21:44:24] (03CR) 10jerkins-bot: [V: 04-1] mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [21:46:07] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) This should be all done now. The last open change is the one above to add a temp hack to avoid that both mwmaint servers in eqiad become activated at the same time w... 
[21:50:32] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:52:11] (03CR) 10Cwhite: [C: 031] icinga::naggen/web/raid/ores: avoid out-of-scope-vars everywhere [puppet] - 10https://gerrit.wikimedia.org/r/463404 (owner: 10Dzahn) [21:53:43] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:49:12] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [23:02:12] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [23:05:23] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet operation_type={create_container,remove_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:06:32] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:24:22] (03PS1) 10BryanDavis: cloud: Enable systemd unit for MediaWiki-Vagrant [puppet] - 10https://gerrit.wikimedia.org/r/463576 [23:25:36] (03CR) 10jerkins-bot: [V: 04-1] cloud: Enable systemd unit for MediaWiki-Vagrant [puppet] - 10https://gerrit.wikimedia.org/r/463576 (owner: 10BryanDavis) [23:27:08] (03PS2) 10BryanDavis: cloud: Enable systemd unit for MediaWiki-Vagrant [puppet] - 10https://gerrit.wikimedia.org/r/463576 [23:27:12] (03CR) 10Smalyshev: [C: 031] wdqs: don't send nginx logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/463248 (https://phabricator.wikimedia.org/T200563) (owner: 10Gehel) [23:28:14] (03PS3) 10BryanDavis: cloud: Enable systemd unit for MediaWiki-Vagrant [puppet] - 10https://gerrit.wikimedia.org/r/463576 [23:36:09] (03Abandoned) 10BryanDavis: Redirect careers and jobs vanity domains to new location [puppet] - 10https://gerrit.wikimedia.org/r/449743 (owner: 10BryanDavis)