[00:09:37] 10Operations, 10monitoring, 10Patch-For-Review: how to structure wiki pages for Icinga reaction play books - https://phabricator.wikimedia.org/T197873 (10Dzahn) [00:10:52] (03PS1) 10Dzahn: icinga: have a default notes_url for all services [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) [00:12:00] (03CR) 10jerkins-bot: [V: 04-1] icinga: have a default notes_url for all services [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [00:25:41] (03PS1) 10Dzahn: monitoring::service: fix $cluster FIXME [puppet] - 10https://gerrit.wikimedia.org/r/459660 [00:28:02] (03PS3) 10Dzahn: tor: add an additional relay instance [puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [00:30:08] (03CR) 10Dzahn: "fixed the lint issue, compiled it. we are getting a " Duplicate declaration: Monitoring::Service[tor_orport] is already declared"" [puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [00:32:52] (03PS4) 10Dzahn: tor: add an additional relay instance [puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [00:32:54] (03CR) 10Dzahn: "adding $instance_name to resource names of monitoring::service to avoid issue in https://puppet-compiler.wmflabs.org/compiler1002/12395/to" [puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [00:33:25] (03CR) 10jerkins-bot: [V: 04-1] tor: add an additional relay instance [puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [00:36:28] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [00:37:26] (03PS5) 10Dzahn: tor: add an additional relay instance [puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [00:38:31] (03CR) 10Dzahn: "compiles now to 
https://puppet-compiler.wmflabs.org/compiler1002/12397/torrelay1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [00:41:37] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [00:51:03] (03PS1) 10Alex Monk: Change how DNS update commands work to handle problematic values [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 [00:52:40] (03CR) 10jerkins-bot: [V: 04-1] Change how DNS update commands work to handle problematic values [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 (owner: 10Alex Monk) [00:55:58] (03PS1) 10Andrew Bogott: Add hiera defs for cloudvirt1019 and 1020 [puppet] - 10https://gerrit.wikimedia.org/r/459663 (https://phabricator.wikimedia.org/T204003) [00:56:34] (03CR) 10Andrew Bogott: [C: 032] Add hiera defs for cloudvirt1019 and 1020 [puppet] - 10https://gerrit.wikimedia.org/r/459663 (https://phabricator.wikimedia.org/T204003) (owner: 10Andrew Bogott) [01:00:59] !log krinkle@mwmaint1001$ mwscript deleteEqualMessages.php --wiki fixcopyrightwiki --delete --lang-code='*' [01:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:08] PROBLEM - Check systemd state on cloudvirt1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[01:15:17] RECOVERY - Check systemd state on cloudvirt1019 is OK: OK - running: The system is fully operational [01:23:37] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1020 - https://phabricator.wikimedia.org/T194855 (10Andrew) [01:23:47] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Andrew) [01:24:49] ACKNOWLEDGEMENT - HP RAID on cloudvirt1019 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 andrew bogott T196507 [01:25:29] ACKNOWLEDGEMENT - HP RAID on cloudvirt1020 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 andrew bogott T196507 [01:35:08] 10Operations, 10ops-eqiad, 10DC-Ops: Rename labvirt1019 and cloudvirt1020 to cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T204004 (10Andrew) a:05Andrew>03Cmjohnson I reimaged these and made all the puppet/dns changes needed. All that remains is the datacenter bits. 
[01:36:06] (03CR) 10Andrew Bogott: [C: 032] Clean up old labvirt1019/1020 entries [dns] - 10https://gerrit.wikimedia.org/r/459652 (https://phabricator.wikimedia.org/T204004) (owner: 10Andrew Bogott) [01:38:16] ACKNOWLEDGEMENT - HP RAID on cloudvirt1019 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T204011 [01:38:18] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T204011 (10ops-monitoring-bot) [01:40:46] ACKNOWLEDGEMENT - HP RAID on cloudvirt1019 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 andrew bogott T204011 [01:41:16] ACKNOWLEDGEMENT - HP RAID on cloudvirt1020 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 andrew bogott T194855 [01:41:56] ACKNOWLEDGEMENT - HP RAID on cloudvirt1020 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T204012 [01:41:59] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1020 - https://phabricator.wikimedia.org/T204012 (10ops-monitoring-bot) [02:03:04] PROBLEM - HHVM rendering on mw1279 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [02:04:14] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 77193 bytes in 0.127 second response time [02:05:58] possible bad config change? 
mainspace article was just created by new user on enwp with no permissions https://en.wikipedia.org/wiki/How_can_i_create_new_article%3F%3F [02:25:13] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [02:30:14] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [02:34:47] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.20) (duration: 13m 27s) [02:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:31] !log krinkle@mwmaint1001$ mwscript deleteEqualMessages.php --wiki fixcopyrightwiki --delete --lang-code='*' [02:44:34] CindyCicaleseWMF: [02:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:44:41] 71 pages in the MediaWiki namespace override messages. [02:44:41] 4 pages are equal to the default message and were deleted. [02:45:31] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Sep 11 02:45:31 UTC 2018 (duration 10m 44s) [02:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:01] Krinkle: Nice - thank you! After the next translatewiki.net patch, more of the messages should be able to go away. [03:10:49] enterprisey: Thanks. Are you sure page creation is (meant to be) disabled for new users? I thought it was only disabled for anons. [03:11:09] Krinkle: at least on enwp, it should be disabled for new users as well [03:11:13] I've opened a phab ticket [03:12:51] Krinkle: We're getting some strange behavior where messages that are clearly translated in the i18n json in master are still showing up in English on the wiki even if you look on Special:AllMessages. Could it need some cache clearing magic? 
[03:14:13] beware: The LTA is monitoring this channel [03:14:29] they left, revi [03:14:31] (for now) [03:14:34] log is publoc [03:14:36] public* [03:14:39] Oh [03:14:43] he can sneak into the log page without being here [03:14:50] The problem extension seems to be https://www.mediawiki.org/wiki/Extension:GettingStarted ? [03:15:28] Oh no, nevermind. [03:29:24] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 821.15 seconds [03:48:04] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 193.79 seconds [03:48:54] CindyCicaleseWMF: did you do a full scap? Message cache needs to be manually regenerated, which is done by a full scap but not by scap sync-file [03:50:02] oh i didn't read backscroll [03:50:15] so what i said was kind of stupid [03:51:09] or maybe not, i don't really know [04:28:09] CindyCicaleseWMF: do you have a specific example I can examine? [05:18:10] (03CR) 10Marostegui: Labs: Config template generation for pt-kill (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [05:48:59] 10Operations, 10Phabricator, 10Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (10elukey) @Dzahn What are the steps missing to failover to phab1002? 
This task seems close to be able to finally do it :) [05:49:29] (03PS1) 10Marostegui: db-codfw.php: Depool db2075 and db2084:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459675 (https://phabricator.wikimedia.org/T200509) [05:54:30] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2075 and db2084:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459675 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [05:55:44] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2075 and db2084:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459675 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [05:56:48] (03PS13) 10Marostegui: mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) [05:57:05] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2084:3315 and db2075 (duration: 00m 51s) [05:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:28] legoktm: filed as https://phabricator.wikimedia.org/T204018 which has an example [05:58:17] tgr: I think it's being overridden by https://fixcopyright.wikimedia.org/w/index.php?title=MediaWiki:Eucc-option-protect-public-domain-text-v3&action=history [05:59:01] d'oh, didn't think of that [05:59:12] https://fixcopyright.wikimedia.org/w/index.php?title=Special%3AAllMessages&prefix=Eucc&filter=modified&lang=en&limit=100 [05:59:27] that checks out, it affects the messages which had amendment numbers added to them [05:59:58] how are the other translations not broken though? all hand-copied to the fixcopyright wiki? [06:00:14] https://fixcopyright.wikimedia.org/w/index.php?title=Special%3AAllMessages&prefix=Eucc&filter=modified&lang=de&limit=100 seems like... :( [06:00:20] fr, es, de apparently yes [06:00:56] and the other translations are broken, e.g. sv [06:01:33] are these local overrides still needed? 
the AM numbers are in translatewiki now [06:03:06] (03CR) 10jenkins-bot: db-codfw.php: Depool db2075 and db2084:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459675 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [06:04:46] tgr: are they in git? [06:06:15] legoktm: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EUCopyrightCampaign/+/459412/2/i18n/en.json [06:06:30] if that was indeed the reason for the override [06:07:02] hmm [06:07:34] I think we should just cherry-pick all the l10n updates into wmf.20, scap, then look at AllMessages, see if there's a diff, and run deleteEqualMessages so they stop overriding [06:09:31] does that sound sensible/reasonable? [06:09:56] legoktm: would it be possible to create XXX/en instead of XXX locally? that wouldn't interfere with translations [06:10:43] but yeah just deploying those messages the normal way would be better [06:11:00] I don't remember how /en works given that en is the default language [06:14:31] tgr: are you going to be around for a while? I was planning to sleep soon, so if possible I don't want to stay up for a full scap + post deploy babysitting [06:15:26] legoktm: yeah, I just got up [06:15:59] ok [06:16:12] I don't know how annoying it'll be to cherry-pick everything, you could also try merging master into wmf.20 [06:16:14] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [06:17:52] legoktm: there is some kind of legal review, did master already pass that? 
[06:17:59] yes [06:18:26] stuff that cindy submitted or +2'd was already reviewed [06:18:50] (because as soon as it gets into master, then LU could have deployed it) [06:21:23] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 318 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [06:22:09] legoktm: re /en, it does not appear to work [06:28:58] legoktm: it seems the non-en versions do not have the AM numbers, not even the ones that have local overrides (except en) so it's not going to be as easy as that [06:30:23] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:31:14] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/nova/policy.json] [06:31:40] all 13 translations have the numbers on TW [06:31:58] is there a way to get those into git? [06:32:07] should I make a patch by hand? [06:32:28] I guess so? [06:33:17] I can push l10n update patches to gerrit on demand [06:34:06] Nikerabbit: I'm just not sure about the legal review if we take all changes from twn [06:35:59] Nikerabbit: in any case, could you create a patch? 
just having it in gerrit can't possibly cause harm [06:36:13] and at least we'll see how much change we are talking about [06:37:57] tgr: okay [06:38:06] greg-g: just to get this on your radar: < legoktm> I think we should just cherry-pick all the l10n updates into wmf.20, scap, then look at AllMessages, see if there's a diff, and run deleteEqualMessages so they stop overriding [06:38:25] context is https://phabricator.wikimedia.org/T204018 [06:47:59] tgr: to patches in gerrit now [06:48:19] thx Nikerabbit [06:55:44] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:56:35] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:11] Nikerabbit: do you know what's up with changes like https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EUCopyrightCampaign/+/459681/2/i18n/de.json#b72 ? [06:57:23] the stuff that gets removed there does not come from twn [06:58:25] (03CR) 10Jcrespo: "Compiler results: https://puppet-compiler.wmflabs.org/compiler1002/12399/" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [06:58:44] I guess https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EUCopyrightCampaign/+/459638 / https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EUCopyrightCampaign/+/459653 was done outside twn? [06:59:46] (03CR) 10Jcrespo: "> Compiler results: https://puppet-compiler.wmflabs.org/compiler1002/12399/" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [07:01:47] (03CR) 10Jcrespo: "See error below on comment file" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458810 (owner: 10Banyek) [07:04:14] tgr: yeah it could be that we hadn't imported those yet, sorry about that [07:05:59] no worries, I just wanted to make sure you know about it. 
I'll just merge those into the L10n-bot patch [07:22:05] (03PS3) 10Fomafix: Rename language codes sr-ec and sr-el to sr-cyrl and sr-latn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375616 (https://phabricator.wikimedia.org/T117845) [07:24:32] hashar: zeljkof: I'm about to do an emergency deploy to refresh language on fixcopyright.wikimedia.org, per https://phabricator.wikimedia.org/T204018#4573210 [07:24:41] anyone I should ping / clear that with? [07:27:23] !log Disable puppet on all the DBs for alert testing - https://phabricator.wikimedia.org/T200509 [07:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:21] tgr: you should probably talk to Greg, but he's not around for hours; if there's nothing on the calendar and the ops don't mind... [07:31:07] (03CR) 10Marostegui: [C: 032] mariadb: Set pages for multi-instance hosts [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [07:31:35] godog: _joe_: any concerns about that? 
[07:33:19] !log Stop replication on db2084:3315 for alert testing T200509 [07:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:25] T200509: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 [07:33:32] it's hours until the first switchover, and this seems like an easy/trivial deploy, so I'd say go for it tgr [07:35:02] <_joe_> +1 [07:37:44] (03PS5) 10Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) [07:37:59] tgr: not sure if you saw ^^ [07:38:12] paravoid: yes, thanks [07:39:03] none of us here is really involved in the fixcopyright campain itself, but a bunch of us are awake and working anyway, so let us know if we can help in anything [07:39:36] (03PS2) 10Muehlenhoff: Install backup2001 with 4.14 kernel [puppet] - 10https://gerrit.wikimedia.org/r/459546 [07:41:00] (03CR) 10Muehlenhoff: [C: 032] Install backup2001 with 4.14 kernel [puppet] - 10https://gerrit.wikimedia.org/r/459546 (owner: 10Muehlenhoff) [07:41:43] (03CR) 10Mathew.onipe: "> Patch Set 7:" (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [07:42:40] Nikerabbit: how does the localization for EUCopyrightCampaign work, exactly? the i18n json files in wmf.20 are super old, for example hu.json is not even in that branch yet, but uselang=hu works [07:42:43] (03PS11) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [07:43:09] (03CR) 10Faidon Liambotis: [C: 04-1] "Thanks for fixing this up Daniel! 
We'll also need to fix the Family field in the main tor config to cover this extra relay and unfortunate" [puppet] - 10https://gerrit.wikimedia.org/r/399972 (owner: 10Faidon Liambotis) [07:43:30] (03CR) 10Giuseppe Lavagetto: "Hi Fomafix, I think we should wait for the end of the quarter, when I will have completed the currently ongoing apache config transition," [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: 10Fomafix) [07:44:15] oh, right, LocalisationUpdate [07:44:26] (03PS4) 10Fomafix: Add language codes sr-ec and sr-el next to sr-cyrl and sr-latn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375616 (https://phabricator.wikimedia.org/T117845) [07:45:27] tgr: yep [07:47:13] PROBLEM - HHVM jobrunner on mw1305 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [07:48:14] RECOVERY - HHVM jobrunner on mw1305 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [07:49:16] !log Stop replication on db2075 for alert testing T200509 [07:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:21] T200509: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 [07:50:02] tgr: sorry I was looking elsewhere [07:54:45] Nikerabbit: can you vet https://phabricator.wikimedia.org/T204018#4573272 ? do I need to worry about LocalisationUpdate or will scap do whathever needs to be done there? 
[07:54:55] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2075 and db2084:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459685 [07:56:13] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 701.85 seconds [07:56:23] ^ that is me testing a new alert [07:59:03] (03CR) 10Mobrovac: [C: 031] Replace the semver patch version in Accept with x [puppet] - 10https://gerrit.wikimedia.org/r/455036 (https://phabricator.wikimedia.org/T202682) (owner: 10Ppchelko) [08:00:33] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 661.91 seconds [08:00:36] ^ that is me testing a new alert [08:01:44] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [08:02:27] they did not page, if that is what you are testing [08:02:33] yeah :-) [08:02:41] tgr: good and difficult question about scap and l10nupdate... I would expect that full scap will override l10nupdate additions [08:02:43] (03CR) 10Giuseppe Lavagetto: [C: 031] sre.switchdc.mediawiki: parsoid skip broken host [cookbooks] - 10https://gerrit.wikimedia.org/r/459607 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:02:44] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.14 seconds [08:02:48] apergos: I will now test two, that should page [08:02:55] okey dokey [08:02:55] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2075 and db2084:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459685 (owner: 10Marostegui) [08:04:42] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2075 and db2084:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459685 (owner: 10Marostegui) [08:05:31] tgr: FYI, it looks like raimond is doing regular translation exports within the next hour [08:05:43] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2084:3315 and db2075 (duration: 00m 49s) 
[08:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:22] Nikerabbit: you mean I should wait for that? [08:07:09] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2075 and db2084:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459685 (owner: 10Marostegui) [08:07:30] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3315, db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459687 (https://phabricator.wikimedia.org/T200509) [08:09:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096:3315, db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459687 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [08:11:01] (03CR) 10Mobrovac: [C: 04-1] parsoid: connect to MediaWiki via https everywhere (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458475 (owner: 10Giuseppe Lavagetto) [08:11:17] tgr: there doesn't seem to be any new translations since my export, but that would have the import issue fixed, but I saw you already did that manually so I suppose you can just go ahead and ignore the new patch [08:11:38] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3315, db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459687 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [08:12:21] !log marostegui@deploy1001 sync-file aborted: Depool db1096:3315, db1100 (duration: 00m 08s) [08:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:37] (03CR) 10Filippo Giunchedi: [C: 031] mtail: add exim tls ciphersuite metrics [puppet] - 10https://gerrit.wikimedia.org/r/458289 (https://phabricator.wikimedia.org/T203260) (owner: 10Herron) [08:13:07] 10Operations, 10Performance-Team: Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10elukey) The last one to be unaware of this was me while checking un-handled criticals in Icinga :) Since they were all about timings spiking over the weekend 
I thought to ping people b... [08:13:13] !log stopping replication on db2034 to test dc switch replication sync step [08:13:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1096:3315, db1100 (duration: 00m 49s) [08:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:20] !log Stop replication on db1096:3315 for new alert testing (this should generate a page) T200509 [08:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:26] T200509: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 [08:18:15] (03PS12) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [08:20:17] (03CR) 10Volans: [C: 032] sre.switchdc.mediawiki: parsoid skip broken host [cookbooks] - 10https://gerrit.wikimedia.org/r/459607 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:21:03] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: parsoid skip broken host [cookbooks] - 10https://gerrit.wikimedia.org/r/459607 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:23:14] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3315, db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459687 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [08:25:16] PROBLEM - MariaDB Slave Lag: s5 on db1096 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 650.11 seconds [08:25:25] ^ that is me testing [08:25:30] !log restarting replication on db2034 after testing dc switch replication sync step [08:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:50] A page should've arrived and I haven't arrived :( [08:26:03] it hasn't arrived :( [08:27:56] !log Stop replication on db1100 for new alert testing (this should generate a page) 
T200509 [08:28:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:02] T200509: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 [08:28:17] RECOVERY - MariaDB Slave Lag: s5 on db1096 is OK: OK slave_sql_lag Replication lag: 0.10 seconds [08:29:38] so far I am remaining page free fyi [08:30:00] ? [08:31:39] as far as maro stegui's testing [08:32:29] 10Operations, 10Horizon, 10Traffic, 10Upstream: Horizon Designate dashboard not allowing creation of NS records - https://phabricator.wikimedia.org/T204013 (10ema) p:05Triage>03Normal [08:33:03] (03CR) 10Giuseppe Lavagetto: [C: 031] "Makes sense: at debug level, you really want to see the debug logs of your application plus the libraries you call directly, nothing else." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459606 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:35:39] (03PS3) 10Giuseppe Lavagetto: authdns: fix spec tests [puppet] - 10https://gerrit.wikimedia.org/r/458751 [08:35:53] (03CR) 10Giuseppe Lavagetto: [C: 032] authdns: fix spec tests [puppet] - 10https://gerrit.wikimedia.org/r/458751 (owner: 10Giuseppe Lavagetto) [08:35:59] apergos: can you help debug that maybe? [08:36:43] check if the messages left nagios, aql, etc.? [08:36:56] I can sure take a look [08:37:04] thanks! 
[08:39:03] (03PS1) 10Marostegui: Revert "mariadb: Set pages for multi-instance hosts" [puppet] - 10https://gerrit.wikimedia.org/r/459730 [08:39:11] (03CR) 10Giuseppe Lavagetto: mariadb: Set pages for multi-instance hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/449711 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [08:39:13] (03PS2) 10Marostegui: Revert "mariadb: Set pages for multi-instance hosts" [puppet] - 10https://gerrit.wikimedia.org/r/459730 [08:39:29] (03CR) 10Volans: "reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459606 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:39:32] !log repair xfs on sdh/sdc on ms-be2040 - T199198 [08:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:37] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [08:42:00] (03PS1) 10Ladsgroup: Revert "Revert "labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459732 [08:43:53] (03CR) 10Ladsgroup: [C: 032] "labs only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459732 (owner: 10Ladsgroup) [08:45:12] (03Merged) 10jenkins-bot: Revert "Revert "labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459732 (owner: 10Ladsgroup) [08:45:28] (03CR) 10jenkins-bot: Revert "Revert "labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459732 (owner: 10Ladsgroup) [08:46:26] rebased on deploy1001 ^ [08:47:36] the recent changes in dewiki in beta looks fine [08:47:42] will double check later [08:50:12] (03PS1) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/459733 [08:53:09] 10Operations, 10Traffic, 10Patch-For-Review: Deploy initial ATS test clusters in core DCs - 
https://phabricator.wikimedia.org/T199720 (10ema) [08:53:12] 10Operations, 10Traffic: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853 (10ema) 05Open>03Resolved a:03ema This can be closed now that we have: deployed two test clusters running ATS and routing traffic to all our applications, gained basic operational experience with it, verifi... [08:54:18] (03Abandoned) 10Marostegui: Revert "mariadb: Set pages for multi-instance hosts" [puppet] - 10https://gerrit.wikimedia.org/r/459730 (owner: 10Marostegui) [08:56:03] (03PS1) 10Marostegui: multiinstance.pp: Make multi-instance slaves page [puppet] - 10https://gerrit.wikimedia.org/r/459734 (https://phabricator.wikimedia.org/T200509) [08:56:24] PROBLEM - Filesystem available is greater than filesystem size on ms-be1043 is CRITICAL: cluster=swift device=/dev/sdd1 fstype=xfs instance=ms-be1043:9100 job=node mountpoint=/srv/swift-storage/sdd1 site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1043&var-datasource=eqiad%2520prometheus%252Fops [08:57:43] PROBLEM - High lag on wdqs1009 is CRITICAL: 3630 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [08:58:43] PROBLEM - High lag on wdqs1010 is CRITICAL: 3605 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:02:38] (03PS1) 10Gilles: Add python-geoip2 to stat machines [puppet] - 10https://gerrit.wikimedia.org/r/459735 (https://phabricator.wikimedia.org/T187299) [09:05:05] (03PS2) 10Marostegui: multiinstance.pp: Make multi-instance slaves page [puppet] - 10https://gerrit.wikimedia.org/r/459734 (https://phabricator.wikimedia.org/T200509) [09:06:12] gehel: FYI wdqs lags ^^^ [09:06:22] volans: ack [09:06:26] (03CR) 10Elukey: Add python-geoip2 to stat machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459735 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [09:06:58] (03PS1) 
10Elukey: profile::analytics::refinery::job::data_purge: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/459736 (https://phabricator.wikimedia.org/T172532) [09:06:58] volans: that's the test cluster, so I was just finishing something before having a look, but I should at least acknowledge that it isn't an issue [09:07:20] ah, sorry didn't realize it is the test cluster [09:07:50] (03CR) 10Marostegui: "This looks better now: https://puppet-compiler.wmflabs.org/compiler1002/12401/" [puppet] - 10https://gerrit.wikimedia.org/r/459734 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [09:11:48] (03PS3) 10Marostegui: multiinstance.pp: Make multi-instance slaves page [puppet] - 10https://gerrit.wikimedia.org/r/459734 (https://phabricator.wikimedia.org/T200509) [09:12:48] volans: it still should not lag like this. We added some metrics recently, I might learn something [09:13:06] gehel: ack :) [09:15:56] (03CR) 10Gilles: Add python-geoip2 to stat machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459735 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [09:16:38] (03PS2) 10Gilles: Add python-geoip2 to stat machines [puppet] - 10https://gerrit.wikimedia.org/r/459735 (https://phabricator.wikimedia.org/T187299) [09:17:54] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:18:46] (03PS2) 10Elukey: profile::analytics::refinery::job::data_purge: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/459736 (https://phabricator.wikimedia.org/T172532) [09:19:37] tgr: I can't tell if you just +2ed that patch 7 times, or if the IRC bot is broken..... (see #mediawiki-feed ) [09:20:27] tgr: oh, they are all different patches..... just your comment was the same for each one! 
[09:20:30] (03PS3) 10Gehel: admins: create shell user for Mathew Onipe [puppet] - 10https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) (owner: 10Dzahn) [09:20:50] (03CR) 10jerkins-bot: [V: 04-1] admins: create shell user for Mathew Onipe [puppet] - 10https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) (owner: 10Dzahn) [09:20:50] addshore: yeah, I'm merging master into wmf.20 basically [09:20:53] PROBLEM - HHVM jobrunner on mw1309 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [09:21:51] (03PS3) 10Elukey: profile::analytics::refinery::job::data_purge: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/459736 (https://phabricator.wikimedia.org/T172532) [09:21:54] RECOVERY - HHVM jobrunner on mw1309 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [09:23:06] (03CR) 10Jcrespo: "I think the alternative is easier," (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459734 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [09:24:43] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:25:05] !log installing curl security updates [09:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:13] RECOVERY - High lag on wdqs1009 is OK: (C)3600 ge (W)1200 ge 898 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:29:37] (03PS4) 10Gehel: admins: create shell user for Mathew Onipe [puppet] - 10https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) (owner: 10Dzahn) [09:30:29] (03CR) 10Banyek: "I don't know why this appear as a new patch, I just commit --amend -ed to the previous one" [puppet] - 10https://gerrit.wikimedia.org/r/459733 (owner: 
10Banyek) [09:30:59] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::data_purge: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/459736 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [09:31:41] 10Operations, 10Citoid, 10Services (watching): Support meta tag refresh redirects in citoid to support elsevier's linking hub - https://phabricator.wikimedia.org/T204032 (10Mvolz) p:05Triage>03High [09:33:01] (03PS4) 10Jcrespo: multiinstance.pp: Make multi-instance slaves page [puppet] - 10https://gerrit.wikimedia.org/r/459734 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [09:33:40] (03CR) 10DCausse: Elasticsearch shard size check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [09:33:54] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 51 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:37:52] (03PS5) 10Jcrespo: multiinstance.pp: Make multi-instance slaves page [puppet] - 10https://gerrit.wikimedia.org/r/459734 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [09:37:57] (03PS1) 10Ladsgroup: Revert "Revert "Revert "labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459754 [09:38:43] RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [09:41:31] (03CR) 10Ladsgroup: [C: 032] Revert "Revert "Revert "labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459754 (owner: 10Ladsgroup) [09:42:50] (03Merged) 10jenkins-bot: Revert "Revert "Revert "labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459754 (owner: 10Ladsgroup) [09:44:13] ^ rebased [09:45:05] (03PS6) 10Marostegui: multiinstance.pp: Make multi-instance slaves page [puppet] - 10https://gerrit.wikimedia.org/r/459734 (https://phabricator.wikimedia.org/T200509) [09:45:07] (03CR) 10Mathew.onipe: Elasticsearch shard size check (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [09:45:36] Amir1: are you working on tin? [09:45:47] no, it was just a rebase [09:45:51] labs patch [09:45:54] ok [09:45:58] (03CR) 10Marostegui: [C: 032] multiinstance.pp: Make multi-instance slaves page [puppet] - 10https://gerrit.wikimedia.org/r/459734 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [09:45:59] I'm about to do a scap [09:47:14] (03PS13) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [09:47:56] (03CR) 10jerkins-bot: [V: 04-1] Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [09:48:36] (03CR) 10Jcrespo: "Try adjusting the change id back to Change-Id: I2e46d2319d5c3e2d81b2c134f91d1d31512ddab4" [puppet] - 10https://gerrit.wikimedia.org/r/459733 (owner: 10Banyek) [09:52:28] (03CR) 10Muehlenhoff: [C: 031] admins: create shell user for Mathew Onipe [puppet] - 10https://gerrit.wikimedia.org/r/458877 
(https://phabricator.wikimedia.org/T202708) (owner: 10Dzahn) [09:52:54] (03CR) 10Muehlenhoff: [C: 031] admins: add Mathew Onipe as member of elasticsearch-roots and wdqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/459556 (https://phabricator.wikimedia.org/T202708) (owner: 10Gehel) [09:53:22] (03CR) 10Alexandros Kosiaris: "let's just remember to revert this after the switchover" [cookbooks] - 10https://gerrit.wikimedia.org/r/459607 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:54:10] !log Stop replication on db1096:3315 for paging testing [09:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:18] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor inline comment" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459606 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:55:31] (03PS22) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) [09:55:59] 10Operations, 10ops-eqiad, 10DC-Ops: Replace wtp1043's sda - https://phabricator.wikimedia.org/T196886 (10Volans) Please revert https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/459607 once this has been fixed and the host is back in the pool. [09:56:08] (03CR) 10Volans: "> Patch Set 1:" [cookbooks] - 10https://gerrit.wikimedia.org/r/459607 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:56:22] (03CR) 10jerkins-bot: [V: 04-1] Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [09:56:36] (03CR) 10jenkins-bot: Revert "Revert "Revert "labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459754 (owner: 10Ladsgroup) [09:56:50] (03CR) 10Banyek: "Switching the Change-Id helped. I think I accidentally removed it from commit message when I rewrote it." 
[puppet] - 10https://gerrit.wikimedia.org/r/459733 (owner: 10Banyek) [09:56:58] (03PS14) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [09:57:47] (03PS23) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) [09:57:51] (03CR) 10Alexandros Kosiaris: [C: 031] "An inline question, otherwise LGTM" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459595 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:59:02] (03CR) 10Volans: "reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459606 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [09:59:24] !log tgr@deploy1001 Started scap: T204018 update i18n on fixcopyrightwiki [09:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:30] T204018: fixcopyright wiki translations are missing some messages - https://phabricator.wikimedia.org/T204018 [10:00:57] (03CR) 10Volans: "reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459595 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:02:54] (03CR) 10Jcrespo: "> Switching the Change-Id helped. I think I accidentally removed it" [puppet] - 10https://gerrit.wikimedia.org/r/459733 (owner: 10Banyek) [10:03:35] !log Stop replication on db2084:3315 for alert testing [10:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:20] (03CR) 10Banyek: "!!!This change is abandoned, because this shouldn't be the one we need!!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/459733 (owner: 10Banyek) [10:04:46] (03Abandoned) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/459733 (owner: 10Banyek) [10:06:48] (03CR) 10Gehel: [C: 04-1] Elasticsearch shard size check (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [10:07:12] PROBLEM - puppet last run on db2088 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:07:53] ^ me testing [10:08:09] (03CR) 10Alexandros Kosiaris: [C: 031] "ok then" [software/spicerack] - 10https://gerrit.wikimedia.org/r/459606 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [10:10:29] if you hoped for a page, it wasn't just you that didn't get one (well at least I did not get one still) [10:10:46] no, that one shouldn't page :) [10:10:52] hoping for a db1096 page :) [10:11:26] PROBLEM - MariaDB Slave Lag: s5 on db1096 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1023.61 seconds [10:11:30] ^me testing [10:11:42] oooh a page a page [10:11:46] \o/ [10:11:46] got it [10:11:47] <_joe_> paged [10:11:50] that page is a test [10:12:05] worked [10:12:14] sorry for the noise, but the now code actually minimizes false positives [10:12:17] thanks and sorry for the inconveniences [10:12:18] *new [10:12:22] RECOVERY - puppet last run on db2088 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:12:36] so it is one page but it will save you a lot afterwards [10:12:37] RECOVERY - MariaDB Slave Lag: s5 on db1096 is OK: OK slave_sql_lag Replication lag: 0.30 seconds [10:12:51] two pages :-P [10:13:01] ok, 2, fine [10:13:08] I need also to generate a page for db1100 now, to make sure the new thing didn't break the old things - sorry about that too, that should be the last one [10:13:15] :-D [10:13:16] 4? 
[10:13:20] heh [10:13:41] I prefer to make sure we didn't break anything and a normal replica also pages, better to find out now than in an outage :) [10:14:12] !log Stop replication on db1100 to test the paging [10:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:02] PROBLEM - MariaDB Slave Lag: s5 on db2084 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 676.32 seconds [10:15:05] ^ that is me testing and that should not page [10:16:04] !log Stop replication on db2075 to test the paging (should not page) [10:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:21] RECOVERY - MariaDB Slave Lag: s5 on db2084 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:18:15] 10Operations, 10Puppet: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10mark) Indeed, let's go with a "proper" Debian package, imho the cleanest way to go and conforming to how we do things. [10:22:55] 10Operations, 10Puppet: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Marostegui) Hopefully Percona also will fix the pt-kill bug in a definitive way (there have been two attempts T183983#3983838 T183983#4145153 T183983#4267429 ) so we'll not need to u... [10:25:26] PROBLEM - MariaDB Slave Lag: s5 on db1100 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 660.41 seconds [10:25:28] db1100 is a page test [10:25:47] woo hoo! success [10:25:56] \o/ [10:26:28] (03PS3) 10Elukey: Add python-geoip2 to stat machines [puppet] - 10https://gerrit.wikimedia.org/r/459735 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [10:26:36] RECOVERY - MariaDB Slave Lag: s5 on db1100 is OK: OK slave_sql_lag Replication lag: 0.03 seconds [10:26:48] No more tests pages are expected - I have finished with the tests. 
Sorry for the noise [10:27:41] !log db1096:3315 and db1100 were test pages - NO MORE TEST PAGES ARE EXPECTED FROM NOW ON - T200509 [10:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:48] T200509: Make sure multi-instance slaves page - https://phabricator.wikimedia.org/T200509 [10:27:52] PROBLEM - MariaDB Slave Lag: s5 on db2075 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 694.16 seconds [10:27:55] ^ test and that should not page [10:28:19] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1096:3315, db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459761 [10:29:02] RECOVERY - MariaDB Slave Lag: s5 on db2075 is OK: OK slave_sql_lag Replication lag: 0.44 seconds [10:30:25] !log tgr@deploy1001 Finished scap: T204018 update i18n on fixcopyrightwiki (duration: 31m 01s) [10:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:31] T204018: fixcopyright wiki translations are missing some messages - https://phabricator.wikimedia.org/T204018 [10:33:36] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1096:3315, db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459761 (owner: 10Marostegui) [10:34:22] (03PS3) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: add compat network link with main deployment network [puppet] - 10https://gerrit.wikimedia.org/r/459573 (https://phabricator.wikimedia.org/T202636) [10:35:15] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315, db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459761 (owner: 10Marostegui) [10:35:33] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: eqiad1: add compat network link with main deployment network [puppet] - 10https://gerrit.wikimedia.org/r/459573 (https://phabricator.wikimedia.org/T202636) (owner: 10Arturo Borrero Gonzalez) [10:36:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1096:3315, db1100 (duration: 00m 49s) [10:36:39] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:41] !log Disable GTID on all codfw masters (sX, x1, esX) (not in db2040 as it is not enabled there) T189107 [10:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:46] T189107: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 [10:43:11] 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) The following masters had GTID enabled - I have disabled it: ``` db2048 Using_Gtid: Slave_Pos db2035 Using_Gtid: Slave_P... [10:45:32] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315, db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459761 (owner: 10Marostegui) [10:54:10] Oh my goodness! tgr: nikerabbit: legoktm: anybody else who helped fix the overnight translation issue on fixcopyright: THANK YOU! [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180911T1100). [11:00:04] Pikne: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:15] o/ [11:00:37] I can SWAT today [11:00:41] Pikne: around for swat? [11:00:48] Hi [11:02:00] Pikne: I'll let you know when your patch is at mwdebug1002, ready for testing, do you know how to test there? [11:02:03] (03PS1) 10GTirloni: cloud: Add gtirloni to shinken instance [puppet] - 10https://gerrit.wikimedia.org/r/459763 (https://phabricator.wikimedia.org/T203489) [11:02:05] (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! 
:) To learn how to get your code changes reviewed faster and more likely to get" [puppet] - 10https://gerrit.wikimedia.org/r/459763 (https://phabricator.wikimedia.org/T203489) (owner: 10GTirloni) [11:02:14] (03PS1) 10Marostegui: [WIP]: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) [11:02:46] I don't, sorry [11:05:05] (03CR) 10Marostegui: "Looks like this is all we need: https://puppet-compiler.wmflabs.org/compiler1002/12411/" [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [11:05:09] (03PS2) 10Marostegui: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) [11:05:10] Pikne: no problem, there are docs https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions [11:05:22] you need to install the extension, let me know when you are done [11:05:49] (03CR) 10jerkins-bot: [V: 04-1] Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [11:07:37] I installed it [11:08:09] enable the extension (it's off by default), select mwdebug1002 [11:08:31] something like this https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#/media/File:X-Wikimedia-Debug_demo.png [11:08:45] Yes, found it [11:08:49] it should look like that ^ [11:08:58] ok, the patch will be there in a few minutes [11:09:32] I'll let you know, you just now use the wiki (estonian wikipedia?) 
and the extension will make sure you reach the debug server [11:09:36] so you can test there [11:09:45] (03PS2) 10Zfilipin: Set category collation to 'uca-et-u-kn' on Estonian-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455804 (https://phabricator.wikimedia.org/T202977) (owner: 10Gerrit Patch Uploader) [11:09:54] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455804 (https://phabricator.wikimedia.org/T202977) (owner: 10Gerrit Patch Uploader) [11:11:08] (03Merged) 10jenkins-bot: Set category collation to 'uca-et-u-kn' on Estonian-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455804 (https://phabricator.wikimedia.org/T202977) (owner: 10Gerrit Patch Uploader) [11:11:16] Pikne: I'll need your help for running the script https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#updateCollation [11:12:08] Previous collation was 'etwiki' => 'uca-et-u-kn', // T202977 [11:12:08] T202977: Numeric sorting for category collation on Estonian projects - https://phabricator.wikimedia.org/T202977 [11:12:21] Pikne: ok, the patch is at mwdebug1002, please test on all affected wikis, if possible [11:12:34] (copy/paste error) The revious collation was xx-uca-et [11:12:58] Dereckson: just what I wanted to ask :) xx-uca-et [11:13:13] so the script will always be `mwscript updateCollation.php --wiki= mwscript updateCollation.php --wiki=xx-uca-et` [11:13:18] grr [11:13:28] `mwscript updateCollation.php --wiki= mwscript updateCollation.php --previous-collation=xx-uca-et` [11:13:40] Pikne, Dereckson: so I need to run updateCollation.php for all et wikis? 
the patch changes collation for 6 [11:14:04] PROBLEM - Filesystem available is greater than filesystem size on ms-be2041 is CRITICAL: cluster=swift device=/dev/sdc1 fstype=xfs instance=ms-be2041:9100 job=node mountpoint=/srv/swift-storage/sdc1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [11:14:13] in a screen/tmux, at least for etwiki, as it will take time [11:14:55] Dereckson: thanks for the reminder, will do [11:15:10] https://etherpad.wikimedia.org/p/collation ? [11:16:11] (03CR) 10Volans: [C: 032] logging: minor improvements and a fix [software/spicerack] - 10https://gerrit.wikimedia.org/r/459606 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:16:17] Dereckson: thanks, just what I had in a text editor ready for pasting somewhere :) [11:17:08] seems good so [11:17:19] (03Merged) 10jenkins-bot: logging: minor improvements and a fix [software/spicerack] - 10https://gerrit.wikimedia.org/r/459606 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:17:22] Pikne: so, does the patch work? did you test it? I'm not sure how well the wikis would work before I run the script... [11:17:37] (03CR) 10jenkins-bot: Set category collation to 'uca-et-u-kn' on Estonian-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455804 (https://phabricator.wikimedia.org/T202977) (owner: 10Gerrit Patch Uploader) [11:18:40] as long as the code is correct: the collation is time intensitive, but won't break pages: at worst you create categories sorted strangely [11:18:53] if the code isn't correct: an exception should be shown on a category page [11:18:55] I see there now is category section "0-9", numeric sorting itself doesn't seem to be in place yet. It needs running the script, right? 
[11:19:02] yes, indeed [11:19:38] (03CR) 10Jcrespo: "I don't know about this, but the reason why it was set as critical by default was to mimic core.pp, whatever is the decision, they both sh" [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [11:20:28] Pikne, Dereckson: ok, so the patch seems to work fine, can I deploy it? [11:20:35] I'll run the script after deployment [11:20:40] (03PS2) 10Mark Bergsma: Increase ATLAS probe unreachability threshold to 25 [puppet] - 10https://gerrit.wikimedia.org/r/459560 [11:22:40] (03PS2) 10Alexandros Kosiaris: traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/458806 (https://phabricator.wikimedia.org/T203776) [11:22:44] As far as category sections are concerned it looks good to me. Letter 'w' and 'v' also seem to be different as it was previously done for xx-uca-et hack. [11:24:12] ok, as long as things are not broken, I'll deploy and run the scripts [11:24:36] Pikne: you can disable the extension now [11:25:14] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:455804|Set category collation to uca-et-u-kn on Estonian-language wikis (T202977)]] (duration: 00m 50s) [11:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:22] T202977: Numeric sorting for category collation on Estonian projects - https://phabricator.wikimedia.org/T202977 [11:27:06] argh, my connection dropped for a few seconds while running the deployment :/ [11:27:16] (03CR) 10Mark Bergsma: [C: 032] Increase ATLAS probe unreachability threshold to 25 [puppet] - 10https://gerrit.wikimedia.org/r/459560 (owner: 10Mark Bergsma) [11:27:24] and I was not using tmux/screen, as it never happened before :/ [11:27:42] I'll deploy one more time, just in case [11:27:43] 11:25:14 <+logmsgbot> !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:455804|Set 
category collation to uca-et-u-kn on [11:27:46] Estonian-language wikis (T202977)]] (duration: 00m 50s) [11:27:49] worked [11:28:14] yes, my terminal window also refreshed, looks like it worked, ok, no redundant deployment then [11:28:44] Pikne, Dereckson: running scripts, with tmux, of course [11:28:57] especially after this :/ [11:30:10] (03PS2) 10Volans: mediawiki: ignore exit codes on stop_cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/459595 (https://phabricator.wikimedia.org/T199079) [11:31:49] (03CR) 10Volans: [C: 032] mediawiki: ignore exit codes on stop_cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/459595 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:33:01] (03Merged) 10jenkins-bot: mediawiki: ignore exit codes on stop_cronjobs [software/spicerack] - 10https://gerrit.wikimedia.org/r/459595 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:37:18] Pikne, Dereckson: etwiktionary is half done, after it I'll run etwiki, all other wikis are already done https://phabricator.wikimedia.org/T202977#4573746 [11:37:35] !log restarting hhvm on mw1261-mw1265 to pick up curl security updates [11:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:49] (03PS1) 10Volans: Upstream release v0.0.8 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/459771 (https://phabricator.wikimedia.org/T199079) [11:41:07] (03CR) 10Volans: [C: 032] Upstream release v0.0.8 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/459771 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:41:43] Pikne, Dereckson: all done except etwiki, it will take a while https://phabricator.wikimedia.org/T202977#4573758 [11:42:03] Pikne: please test all wikis except etwiki, let me know if anything looks broken [11:42:07] (03Merged) 10jenkins-bot: Upstream release v0.0.8 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/459771 
(https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [11:42:15] Pikne: I'll let you know when etwiki script finishes [11:45:02] (03PS1) 10GTirloni: Add gtirloni to cloud-wide root [labs/private] - 10https://gerrit.wikimedia.org/r/459773 (https://phabricator.wikimedia.org/T203489) [11:47:33] zeljkof: I'm running the snippet from https://wikitech.wikimedia.org/wiki/How_to_deploy_code#A_note_on_JavaScript_and_CSS [11:47:42] should not interfere with SWAT AFAIK [11:49:01] tgr: thanks for letting me know, I'm just running a script, done with deployments [11:49:35] !log uploaded spicerack_0.0.8-1{,+deb9u1} to apt.wikimedia.org {jessie,stretch}-wikimedia - T199079 [11:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:42] T199079: Refactor the switchdc script - https://phabricator.wikimedia.org/T199079 [11:54:28] (03PS1) 10Dereckson: Update Informatika SZŠ Chomutov throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459775 (https://phabricator.wikimedia.org/T203909) [11:54:46] All looks good. 
Numeric sorting works fine in categories like https://et.wiktionary.org/wiki/Kategooria:Keeltevaheline and https://et.wikipedia.org/wiki/Kategooria:Euroraha [11:56:35] (03CR) 10Dereckson: [C: 032] Update Informatika SZŠ Chomutov throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459775 (https://phabricator.wikimedia.org/T203909) (owner: 10Dereckson) [11:58:07] 10Operations, 10Puppet: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Banyek) But we'll still need the service definition after - which reads the config file [11:58:26] (03Merged) 10jenkins-bot: Update Informatika SZŠ Chomutov throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459775 (https://phabricator.wikimedia.org/T203909) (owner: 10Dereckson) [11:58:29] zeljkof, tgr: I'm going to deploy this throttle rule update ^ [12:00:16] looks good on mwdebug1002 [12:00:45] !log dereckson@deploy1001 sync-file aborted: Update Informatika SZŠ Chomutov throttle rule (duration: 00m 04s) [12:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:51] !log dereckson@deploy1001 Synchronized wmf-config/throttle.php: Update Informatika SZŠ Chomutov throttle rule (T203909) (duration: 00m 50s) [12:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:57] T203909: Allow IP for creating account for school project for 30 days - Informatika SZŠ Chomutov - https://phabricator.wikimedia.org/T203909 [12:03:28] (03PS1) 10GTirloni: cloud: Add gtirloni to sms contact group [puppet] - 10https://gerrit.wikimedia.org/r/459779 (https://phabricator.wikimedia.org/T203489) [12:04:24] Pikne: all scripts done, please check the wikis https://phabricator.wikimedia.org/T202977#4573758 [12:06:02] zeljkof: They seem all good (examples above) [12:06:38] Thank you! 
I'll now see if I can come up with a patch that removes xx-uca-et workaround (needs no SWAT deploy) [12:06:46] (03CR) 10jenkins-bot: Update Informatika SZŠ Chomutov throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459775 (https://phabricator.wikimedia.org/T203909) (owner: 10Dereckson) [12:09:04] !log installing jq security updates on trusty [12:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:06] Pikne: great! [12:14:03] (03PS24) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) [12:18:42] (03PS25) 10Banyek: Labs: Config template generation for pt-kill [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) [12:22:28] PROBLEM - tilerator on maps1003 is CRITICAL: connect to address 10.64.32.117 and port 6534: Connection refused [12:22:37] PROBLEM - tilerator on maps1002 is CRITICAL: connect to address 10.64.16.42 and port 6534: Connection refused [12:22:47] PROBLEM - tilerator on maps1004 is CRITICAL: connect to address 10.64.48.154 and port 6534: Connection refused [12:22:48] PROBLEM - tilerator on maps1001 is CRITICAL: connect to address 10.64.0.79 and port 6534: Connection refused [12:23:22] * gehel is looking at tilerator [12:28:28] !log restarting tilerator on maps1* (eqiad) - heap memory exceeded [12:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:47] RECOVERY - tilerator on maps1003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.182 second response time [12:37:57] RECOVERY - tilerator on maps1002 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.043 second response time [12:38:07] RECOVERY - tilerator on maps1004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.058 second response time [12:39:08] RECOVERY - tilerator on maps1001 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.041 second response time [12:44:04] (03CR) 
10Alexandros Kosiaris: [C: 031] icinga/ripeatlas: add playbook link as notes_url in Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/459626 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [12:45:09] (03PS6) 10Alex Monk: Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 [12:45:28] PROBLEM - Apache HTTP on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:46:27] RECOVERY - Apache HTTP on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.024 second response time [12:49:15] (03PS6) 10Alex Monk: _trigger_dns_zone_update: Also pass in remote servers to sync to [software/certcentral] - 10https://gerrit.wikimedia.org/r/459581 [12:49:29] (03PS2) 10Alex Monk: Change how DNS update commands work to handle problematic values [software/certcentral] - 10https://gerrit.wikimedia.org/r/459662 [12:49:45] (03CR) 10Jcrespo: "This looks ok: https://puppet-compiler.wmflabs.org/compiler1002/12415/labsdb1009.eqiad.wmnet/ But we need to send the review to cloud peop" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [12:51:51] 10Operations, 10Maps (Tilerator): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10Gehel) [12:53:58] (03PS7) 10Herron: mtail: add exim tls ciphersuite metrics [puppet] - 10https://gerrit.wikimedia.org/r/458289 (https://phabricator.wikimedia.org/T203260) [12:54:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] icinga: have a default notes_url for all services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [12:55:15] (03CR) 10Herron: [C: 032] mtail: add exim tls ciphersuite metrics [puppet] - 10https://gerrit.wikimedia.org/r/458289 (https://phabricator.wikimedia.org/T203260) (owner: 10Herron) [12:57:26] 10Operations, 10Maps (Tilerator): investigate tilerator crash on maps 
eqiad - https://phabricator.wikimedia.org/T204047 (10Gehel) p:05Triage>03High [12:58:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] icinga: add notes_url parameter to NRPE monitor service (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/459641 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [12:58:12] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (10abian) In case you want to analyze the situation, wikiba.se is down right now. [12:59:04] (03PS4) 10Elukey: Add python-geoip2 to stat machines [puppet] - 10https://gerrit.wikimedia.org/r/459735 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [13:00:21] elukey: why do we need python-geoip2? [13:00:46] (03CR) 10Faidon Liambotis: "Why do we need python-geoip2 rather than python-maxminddb?" [puppet] - 10https://gerrit.wikimedia.org/r/459735 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [13:00:48] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (10Addshore) Indeed it is: {F25772293} $ ping wikiba.se ``` Pinging wikiba.se [89.31.143.100] with 32 bytes of data: Request timed out. R... 
[13:02:09] !log upgraded spicerack to version 0.0.8 on sarin/neodymium - T199079 [13:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:15] T199079: Refactor the switchdc script - https://phabricator.wikimedia.org/T199079 [13:02:31] (03CR) 10Elukey: [C: 032] Add python-geoip2 to stat machines [puppet] - 10https://gerrit.wikimedia.org/r/459735 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [13:02:43] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (10Krenair) Got a bunch of patches open that need review, in no particular order: * This one is needed to implement the outcome of the abo... [13:03:26] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [13:04:06] (03CR) 10Marostegui: "> I don't know about this, but the reason why it was set as critical" [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [13:04:52] (03PS3) 10Marostegui: Get only active replicas to page for mysqld process number [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) [13:05:35] (03CR) 10Jcrespo: "My point is just to be consistent, either we deploy both changes or none, not too worried, but this has an easy amend. 
Maybe we should loo" [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [13:05:41] (03CR) 10Marostegui: "I won't merge this today, as I want to actually see the puppet compiler results once codfw is active tomorrow, as it should change and onl" [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [13:06:35] 10Operations, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog, 10Maps (Tilerator): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10MSantos) a:03MSantos [13:07:01] 10Operations, 10cloud-services-team, 10Patch-For-Review: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10GTirloni) [13:07:30] 10Operations, 10Maps-Sprint, 10Maps (Tilerator), 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10MSantos) [13:08:44] !log performing some additional switchdc live test [13:08:47] (03CR) 10Marostegui: "> My point is just to be consistent, either we deploy both changes or" [puppet] - 10https://gerrit.wikimedia.org/r/459764 (https://phabricator.wikimedia.org/T200509) (owner: 10Marostegui) [13:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:50] (03CR) 10Filippo Giunchedi: Elasticsearch shard size check (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [13:09:15] (03PS1) 10Alex Monk: api: Also handle SIGHUP signals to the API process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 [13:10:43] 10Operations, 10cloud-services-team, 10Patch-For-Review: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10GTirloni) [13:10:59] (03CR) 10jerkins-bot: [V: 04-1] api: Also handle SIGHUP signals to the API process [software/certcentral] - 
10https://gerrit.wikimedia.org/r/459785 (owner: 10Alex Monk) [13:12:27] !log START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (volans@sarin) [13:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:34] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) (volans@sarin) [13:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:48] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[python3-geoip2] [13:13:04] ah of course! [13:13:08] this is my fault [13:13:32] (03PS4) 10Jcrespo: mariadb: Enable read_only monitoring on core mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/450228 (https://phabricator.wikimedia.org/T172489) [13:14:28] !log START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (volans@sarin) [13:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:37] !log END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) (volans@sarin) [13:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:57] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:18:57] (03PS2) 10Alex Monk: api: Also handle SIGHUP signals to the API process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 [13:20:45] (03CR) 10jerkins-bot: [V: 04-1] api: Also handle SIGHUP signals to the API process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 (owner: 10Alex Monk) [13:21:12] !log START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (volans@sarin) [13:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:20] !log END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) (volans@sarin) [13:21:24] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:15] (03PS3) 10Alex Monk: api: Also handle SIGHUP signals to the API process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459785 [13:30:43] (03PS1) 10Herron: mtail: restart on change to exim program [puppet] - 10https://gerrit.wikimedia.org/r/459789 [13:31:52] 10Operations, 10DBA, 10MediaWiki-extensions-Translate, 10Technical-Debt: Query returned 22222 row(s): query: SELECT * FROM `translate_metadata` on Metawiki - https://phabricator.wikimedia.org/T204026 (10mark) [13:37:46] (03PS1) 10Alexandros Kosiaris: Add restbase-async.svc.(eqiad|codfw).wmnet entries [dns] - 10https://gerrit.wikimedia.org/r/459790 [13:37:57] (03CR) 10Andrew Bogott: [C: 031] "You'll also need to add your contact info to the private repo; I'll talk you through that later today." [puppet] - 10https://gerrit.wikimedia.org/r/459779 (https://phabricator.wikimedia.org/T203489) (owner: 10GTirloni) [13:38:26] (03CR) 10Andrew Bogott: [C: 031] "Looks good to me! You will not enjoy the results of this :/" [puppet] - 10https://gerrit.wikimedia.org/r/459763 (https://phabricator.wikimedia.org/T203489) (owner: 10GTirloni) [13:38:45] (03CR) 10Andrew Bogott: [C: 031] Add gtirloni to cloud-wide root [labs/private] - 10https://gerrit.wikimedia.org/r/459773 (https://phabricator.wikimedia.org/T203489) (owner: 10GTirloni) [13:43:30] (03PS1) 10Volans: dnsdisc: fine tune update_ttl and check_ttl [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) [13:44:38] (03PS2) 10Alexandros Kosiaris: Add restbase-async.svc.(eqiad|codfw).wmnet entries [dns] - 10https://gerrit.wikimedia.org/r/459790 [13:45:13] (03CR) 10Volans: "This can perfectly be merged *after* the switchover to codfw, it's just a small improvement in check and logging." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:45:22] (03CR) 10Alexandros Kosiaris: [C: 032] Add restbase-async.svc.(eqiad|codfw).wmnet entries [dns] - 10https://gerrit.wikimedia.org/r/459790 (owner: 10Alexandros Kosiaris) [13:51:16] (03CR) 10Gehel: "I'm not entirely convinced by this fine tuning (but not entirely against either). Dry run modes are hard!" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:54:39] (03CR) 10Alexandros Kosiaris: dnsdisc: fine tune update_ttl and check_ttl (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:55:29] 10Operations, 10Electron-PDFs, 10Proton, 10Patch-For-Review, and 3 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748 (10phuedx) [13:58:11] (03CR) 10Mark Bergsma: [C: 04-1] "Besides some issues and style suggestions, it seems the logic is broken as return_code doesn't appear to be set correctly." 
(039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) (owner: 10Ayounsi) [14:01:54] (03CR) 10Gehel: dnsdisc: fine tune update_ttl and check_ttl (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:04:04] 10Operations, 10Domains, 10Traffic: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10tramm) [14:06:37] (03PS35) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [14:07:44] (03PS2) 10Volans: dnsdisc: improve TTL checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) [14:08:01] (03CR) 10Volans: "done" (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:11:28] (03PS36) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [14:12:58] (03CR) 10Gehel: "You should probably ignore my comments until dc switch is complete, but still, I could not resist." 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:16:12] (03PS1) 10Elukey: profile::analytics::refinery::job::camus: ease labs testing [puppet] - 10https://gerrit.wikimedia.org/r/459795 (https://phabricator.wikimedia.org/T204060) [14:17:42] (03PS2) 10Elukey: profile::analytics::refinery::job::camus: ease labs testing [puppet] - 10https://gerrit.wikimedia.org/r/459795 (https://phabricator.wikimedia.org/T204060) [14:20:10] (03CR) 10Ottomata: [C: 031] profile::analytics::refinery::job::camus: ease labs testing [puppet] - 10https://gerrit.wikimedia.org/r/459795 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [14:20:30] (03PS3) 10Elukey: profile::analytics::refinery::job::camus: ease labs testing [puppet] - 10https://gerrit.wikimedia.org/r/459795 (https://phabricator.wikimedia.org/T204060) [14:20:58] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) these are the contact options listed on the site at http://wikiba.se/contact/ (in case it's down) to find an admin: //It's also po... 
[14:21:03] (03CR) 10Elukey: [C: 032] "pcc looks good https://puppet-compiler.wmflabs.org/compiler1002/12418/analytics1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/459795 (https://phabricator.wikimedia.org/T204060) (owner: 10Elukey) [14:30:04] Deploy window Datacenter Switchover - Services (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180911T1430) [14:31:20] !log START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (akosiaris@sarin) [14:31:21] !log END (FAIL) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=99) (akosiaris@sarin) [14:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:43] !log START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (akosiaris@sarin) [14:31:44] !log END (FAIL) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=99) (akosiaris@sarin) [14:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:25] <_joe_> akosiaris: sudo -i [14:32:46] <_joe_> akosiaris: can I go on? [14:32:48] !log START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (akosiaris@sarin) [14:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:43] there's a host named sarin? 
[14:33:56] yes [14:34:05] it's a star [14:34:17] <_joe_> Krenair: after https://en.wikipedia.org/wiki/Delta_Herculis [14:34:25] hm, yes, codfw [14:34:29] <_joe_> well Delta Herculis Aa, specifically [14:37:30] (03PS6) 10Bstorm: wiki replicas: moving compatibility views to $table_compat [puppet] - 10https://gerrit.wikimedia.org/r/447654 (https://phabricator.wikimedia.org/T174047) [14:38:08] !log END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) (akosiaris@sarin) [14:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:20] moving on to next step [14:38:25] 01-switch-dc [14:38:27] <_joe_> +1 [14:38:29] !log START - Cookbook sre.switchdc.services.01-switch-dc (akosiaris@sarin) [14:38:30] !log Switching services parsoid, restbase, restbase-async, mobileapps, apertium, citoid, cxserver, eventstreams, graphoid, mathoid, proton, pdfrender, recommendation-api, zotero, eventbus, ores, wdqs, wdqs-internal: eqiad => codfw (akosiaris@sarin) [14:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:51] !log END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0) (akosiaris@sarin) [14:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:05] <_joe_> rather uneventful :P [14:39:53] thumbor.discovery.wmnet not used currently, right ? 
[14:40:01] !log START - Cookbook sre.switchdc.services.02-restore-ttl (akosiaris@sarin) [14:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:07] <_joe_> akosiaris: no [14:40:14] <_joe_> it's unused indeed [14:40:14] actually pretty sure it isn't, just double double checking [14:40:22] !log END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) (akosiaris@sarin) [14:40:25] <_joe_> yeah I did double-check last week [14:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:14] 10Operations, 10Citoid, 10Services (watching): Support meta tag refresh redirects in citoid to support elsevier's linking hub - https://phabricator.wikimedia.org/T204032 (10danstillman) We've added support for this in translation-server: https://github.com/zotero/translation-server/commit/aa7179e3ced515ce10... [14:41:23] we are done with this step [14:41:35] in 20mins, it's swift's turn [14:42:09] <_joe_> traffic for services is at 19:00Z, right? 
[14:42:13] yes [14:42:14] Confirmed we're seeing a rise in CODFW activity in ORES [14:42:29] <_joe_> halfak: thanks for checking :) [14:43:43] (03PS1) 10Alex Monk: Be a lot more verbose about problems in the ACME process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459798 [14:45:27] (03PS1) 10Alex Monk: Log command we run for DNS zone updates [software/certcentral] - 10https://gerrit.wikimedia.org/r/459799 [14:45:31] (03CR) 10jerkins-bot: [V: 04-1] Be a lot more verbose about problems in the ACME process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459798 (owner: 10Alex Monk) [14:46:54] (03PS15) 10Mathew.onipe: Elasticsearch shard size check [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) [14:47:50] (03PS2) 10Alex Monk: Be a lot more verbose about problems in the ACME process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459798 [14:47:52] (03CR) 10jerkins-bot: [V: 04-1] Log command we run for DNS zone updates [software/certcentral] - 10https://gerrit.wikimedia.org/r/459799 (owner: 10Alex Monk) [14:48:14] (03CR) 10Faidon Liambotis: [C: 04-1] Add SNMP classes (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [14:49:34] (03CR) 10jerkins-bot: [V: 04-1] Be a lot more verbose about problems in the ACME process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459798 (owner: 10Alex Monk) [14:50:33] paravoid: akosiaris: does the switchover interfere with MediaWiki deploys? the copyright has some urgent i18n changes (and probably will have more) [14:50:51] a deploy is not the only way to get them out but it's the easiest one [14:51:22] (03CR) 10Bstorm: [C: 032] wiki replicas: moving compatibility views to $table_compat [puppet] - 10https://gerrit.wikimedia.org/r/447654 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [14:51:30] tgr: small urgent deploys are fine. 
But otherwise there is a full train freeze this week [14:51:35] just urgent fixes [14:51:39] (03PS3) 10Alex Monk: Be a lot more verbose about problems in the ACME process [software/certcentral] - 10https://gerrit.wikimedia.org/r/459798 [14:52:17] akosiaris: small in what sense? it's a bunch of i18n file changes, but since it's i18n it requires a full scap [14:52:51] small in the sense no new functionality, no new code paths etc [14:53:05] <_joe_> tgr: i18n changes surely qualify as "small" [14:53:07] it it's just data files for translations it's fine [14:53:27] ack, thanks [14:53:45] the change is small, I just wasn't sure if the scap process if affected somehow [14:54:33] no it is not affected [14:55:42] (03PS2) 10Alex Monk: Log command we run for DNS zone updates [software/certcentral] - 10https://gerrit.wikimedia.org/r/459799 [14:56:26] (03CR) 10Mathew.onipe: "> Patch Set 14:" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (https://phabricator.wikimedia.org/T203546) (owner: 10Mathew.onipe) [14:58:09] (03CR) 10Dzahn: "@Moritz no date needed after all?" [puppet] - 10https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) (owner: 10Dzahn) [14:58:28] (03PS2) 10Filippo Giunchedi: cache::upload: Switch swift temporarily to active/active [puppet] - 10https://gerrit.wikimedia.org/r/458794 (https://phabricator.wikimedia.org/T203776) (owner: 10Alexandros Kosiaris) [14:59:34] (03CR) 10Muehlenhoff: [C: 031] "What do you mean? There a date in the patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) (owner: 10Dzahn) [14:59:50] (03CR) 10Filippo Giunchedi: [C: 032] cache::upload: Switch swift temporarily to active/active [puppet] - 10https://gerrit.wikimedia.org/r/458794 (https://phabricator.wikimedia.org/T203776) (owner: 10Alexandros Kosiaris) [15:00:04] Deploy window Datacenter Switchover - Media storage/Swift (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180911T1500) [15:00:04] !log begin switching swift to codfw [15:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:35] (03CR) 10Elukey: [C: 032] "> Why do we need python-geoip2 rather than python-maxminddb?" [puppet] - 10https://gerrit.wikimedia.org/r/459735 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [15:01:55] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Andrew) I renamed these servers but they're still complaining about missing batteries. 
[15:04:20] (03PS2) 10Dzahn: netops::check: add 3 playbook links to Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/459630 (https://phabricator.wikimedia.org/T197873) [15:04:28] (03PS2) 10Filippo Giunchedi: cache::upload: Move swift to codfw [puppet] - 10https://gerrit.wikimedia.org/r/458795 (https://phabricator.wikimedia.org/T203776) (owner: 10Alexandros Kosiaris) [15:05:19] (03CR) 10Filippo Giunchedi: [C: 032] cache::upload: Move swift to codfw [puppet] - 10https://gerrit.wikimedia.org/r/458795 (https://phabricator.wikimedia.org/T203776) (owner: 10Alexandros Kosiaris) [15:05:21] (03CR) 10jerkins-bot: [V: 04-1] netops::check: add 3 playbook links to Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/459630 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [15:06:22] !log serve switch originals and thumbs from codfw only [15:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:58] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package of Blubber (0.5.0-1) - https://phabricator.wikimedia.org/T203121 (10thcipriani) Poking this for ETA. This one should unblock us on graphoid as well as a add a builder to help support generic jobs in CI... 
[15:11:59] 10Operations, 10Maps-Sprint, 10Maps (Tilerator), 10Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (10Jhernandez) [15:17:20] (03CR) 10Muehlenhoff: Print group memberships which granted Hadoop access to check for HDFS cleanups (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) (owner: 10Muehlenhoff) [15:17:52] (03PS2) 10Muehlenhoff: Print group memberships which granted Hadoop access to check for HDFS cleanups [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) [15:21:42] (03PS3) 10Dzahn: netops::check: add 3 playbook links to Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/459630 (https://phabricator.wikimedia.org/T197873) [15:21:58] (03PS3) 10Volans: dnsdisc: improve TTL checks [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) [15:22:00] (03PS1) 10Volans: exceptions: add SpicerackCheckError [software/spicerack] - 10https://gerrit.wikimedia.org/r/459804 (https://phabricator.wikimedia.org/T199079) [15:22:02] (03PS1) 10Volans: dnsdisc: catch dnspython exceptions [software/spicerack] - 10https://gerrit.wikimedia.org/r/459805 (https://phabricator.wikimedia.org/T199079) [15:22:37] (03CR) 10Volans: "done" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/459791 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:24:26] (03CR) 10Volans: "To be merged *after* the switchover" [software/spicerack] - 10https://gerrit.wikimedia.org/r/459804 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:24:30] (03CR) 10Volans: "To be merged *after* the switchover" [software/spicerack] - 10https://gerrit.wikimedia.org/r/459805 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [15:24:49] (03PS2) 10Dzahn: icinga: add notes_url parameter to NRPE monitor service [puppet] - 
10https://gerrit.wikimedia.org/r/459641 (https://phabricator.wikimedia.org/T197873) [15:26:44] (03CR) 10Dzahn: icinga: have a default notes_url for all services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [15:26:53] 10Operations, 10Wikimedia-Mailing-lists: Open Foundation West Africa (OFWA) mailing list - https://phabricator.wikimedia.org/T203966 (10Flixtey) [15:30:11] (03PS2) 10Dzahn: icinga: have a default notes_url for all services [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) [15:31:00] (03CR) 10jerkins-bot: [V: 04-1] icinga: have a default notes_url for all services [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [15:31:11] (03CR) 10Dzahn: [C: 04-1] "not sure yet if we would actually use it" [puppet] - 10https://gerrit.wikimedia.org/r/459645 (owner: 10Dzahn) [15:35:40] (03PS1) 10Ottomata: Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) [15:36:15] (03CR) 10jerkins-bot: [V: 04-1] Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: 10Ottomata) [15:48:13] (03PS1) 10Alex Monk: [WIP] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 [15:49:01] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk) [15:50:07] 10Operations, 10ops-eqiad: Interface errors for stat1006 - https://phabricator.wikimedia.org/T203576 (10Cmjohnson) @elukey is it okay to resolve this task? 
[15:50:58] 10Operations, 10ops-eqiad: Interface errors for stat1006 - https://phabricator.wikimedia.org/T203576 (10elukey) 05Open>03Resolved Yep didn't see any more issues, will re open if necessary. Thanks! [15:51:12] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1020 - https://phabricator.wikimedia.org/T204012 (10Cmjohnson) 05Open>03declined existing task already [15:51:14] (03PS2) 10Alex Monk: [WIP] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 [15:51:45] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T204011 (10Cmjohnson) 05Open>03declined existing task [15:52:52] 10Operations, 10Analytics, 10hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10Cmjohnson) [15:55:38] 10Operations, 10cloud-services-team, 10Patch-For-Review: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (10GTirloni) [15:56:40] Ganeti VMs have switch ports? [15:59:28] context? :) [15:59:42] they don't, but what do you mean? [16:00:05] godog and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180911T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:00:10] (03PS2) 10Ottomata: Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) [16:00:39] (03CR) 10jerkins-bot: [V: 04-1] Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: 10Ottomata) [16:02:11] (03PS3) 10Ottomata: Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) [16:02:41] (03CR) 10jerkins-bot: [V: 04-1] Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: 10Ottomata) [16:02:57] (03PS1) 10Alex Monk: setup.py test dependencies: Remove pylint maximum version [software/certcentral] - 10https://gerrit.wikimedia.org/r/459811 [16:03:54] paravoid, https://phabricator.wikimedia.org/T203087 [16:04:16] someone just copied a checklist [16:05:48] (03PS4) 10Ottomata: Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) [16:06:46] 10Operations, 10Analytics, 10hardware-requests: Decommission Ganeti vm meitnerium.wikimedia.org (old Archiva host) - https://phabricator.wikimedia.org/T203087 (10Krenair) No switch port disabling step for VMs either [16:12:31] (03PS5) 10Ottomata: Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) [16:16:40] !log repair sdd1 on ms-be1043 - T199198 [16:16:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:46] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [16:16:57] (03PS2) 10GTirloni: cloud: Add gtirloni to shinken instance [puppet] - 
10https://gerrit.wikimedia.org/r/459763 (https://phabricator.wikimedia.org/T203489) [16:17:21] !log correction, sdk1 on ms-be1041 - T199198 [16:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:57] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:19:12] (03CR) 10GTirloni: [C: 032] cloud: Add gtirloni to shinken instance [puppet] - 10https://gerrit.wikimedia.org/r/459763 (https://phabricator.wikimedia.org/T203489) (owner: 10GTirloni) [16:19:14] looks like a spike that is gone? [16:19:37] ack, i see it but it's already over in the graph, afaict [16:21:35] (03PS2) 10GTirloni: cloud: Add gtirloni to sms contact group [puppet] - 10https://gerrit.wikimedia.org/r/459779 (https://phabricator.wikimedia.org/T203489) [16:22:18] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:23:24] gtirloni: hi, did you get your icinga contact yet? [16:23:33] (03CR) 10Volans: "A question and a couple of nitpicks in the python script, see inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: 10Ottomata) [16:23:53] mutante: yeah, andrew just walked me through it, thanks. we were just unsure about that acl* [16:24:25] gtirloni: seems like you even have 2 contacts but that is normal for wmcs [16:24:49] gtirloni: yea, just replied to your email. i am not sure about the Phabricator ACL either, but i think Robh might know [16:25:04] ? [16:25:25] Are you asking about #acl*sre-team? 
[16:25:26] robh: it's about the "acl*operations-team" in Phab [16:25:34] heh, i renamed it a few months ago ;D [16:25:38] (03CR) 10GTirloni: [C: 032] cloud: Add gtirloni to sms contact group [puppet] - 10https://gerrit.wikimedia.org/r/459779 (https://phabricator.wikimedia.org/T203489) (owner: 10GTirloni) [16:25:40] anyone in it can add new members though [16:25:55] that controls access to S4 and possibly other things not sure [16:25:57] that would explain, yep :) [16:26:13] ha! cool, thank you :) [16:26:16] s4 access = acl*sre-team and acl*procurement-review [16:26:29] so yeah, if in sre team only add to the former [16:26:44] if they arent in sre team then I tend to add to the latter when there is something they need to review (as long as they are staff) [16:26:49] as only staff are allowed access to s4. [16:26:58] (well, staff + contractors cuz they are really staff) [16:27:09] !log added gtirloni to acl*sre-team on Phabricator (T203489) [16:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:15] T203489: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 [16:27:54] robh mutante: thank you! [16:28:32] welcome [16:29:22] (03CR) 10Ottomata: Whitelist EventLogging schemas for ingestion into MySQL (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: 10Ottomata) [16:30:03] (03PS6) 10Ottomata: Whitelist EventLogging schemas for ingestion into MySQL [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) [16:30:25] (03CR) 10Ottomata: Whitelist EventLogging schemas for ingestion into MySQL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: 10Ottomata) [16:36:41] all: I'm going to cause a page in labtest to check gtirloni's contact settings. Please disregard. 
[16:37:12] (PS7) Ottomata: Whitelist EventLogging schemas for ingestion into MySQL [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) [16:37:22] (CR) Volans: Whitelist EventLogging schemas for ingestion into MySQL (2 comments) [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: Ottomata) [16:37:44] (CR) jerkins-bot: [V: -1] Whitelist EventLogging schemas for ingestion into MySQL [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: Ottomata) [16:38:15] (CR) Ottomata: "Switched to list in after all. We can change back to regex if/when we need to." [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: Ottomata) [16:38:42] (PS8) Ottomata: Whitelist EventLogging schemas for ingestion into MySQL [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) [16:39:46] PROBLEM - nova-scheduler process on labtestcontrol2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler [16:40:01] Operations, DBA, Research, Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (mark) [16:44:20] (PS1) Arturo Borrero Gonzalez: cloudvps: main: don't offer the static route in DHCP for the compat network [puppet] - https://gerrit.wikimedia.org/r/459816 (https://phabricator.wikimedia.org/T202636) [16:44:33] (CR) Ottomata: "Tested in deployment-prep, works a-ok after deploying https://gerrit.wikimedia.org/r/#/c/eventlogging/+/459815/" [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: Ottomata) [16:44:37] PROBLEM - nova-compute proc minimum on labvirt1018 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [16:46:47] RECOVERY - 
nova-compute proc minimum on labvirt1018 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [16:47:40] (PS2) Andrew Bogott: cloudvps: main: don't offer the static route in DHCP for the compat network [puppet] - https://gerrit.wikimedia.org/r/459816 (https://phabricator.wikimedia.org/T202636) (owner: Arturo Borrero Gonzalez) [16:47:45] (PS3) Arturo Borrero Gonzalez: cloudvps: main: don't offer the static route in DHCP for the compat network [puppet] - https://gerrit.wikimedia.org/r/459816 (https://phabricator.wikimedia.org/T202636) [16:48:23] (CR) Arturo Borrero Gonzalez: [C: 2] cloudvps: main: don't offer the static route in DHCP for the compat network [puppet] - https://gerrit.wikimedia.org/r/459816 (https://phabricator.wikimedia.org/T202636) (owner: Arturo Borrero Gonzalez) [16:52:26] RECOVERY - nova-scheduler process on labtestcontrol2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-scheduler [16:53:19] !log repair sdd on ms-be1043 - T199198 [16:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:24] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [16:54:32] (CR) Ayounsi: [C: 1] icinga/ripeatlas: add playbook link as notes_url in Icinga checks [puppet] - https://gerrit.wikimedia.org/r/459626 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [16:55:04] (CR) Ayounsi: [C: 1] netops::check: add 3 playbook links to Icinga checks [puppet] - https://gerrit.wikimedia.org/r/459630 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [16:55:41] Operations, DBA, Research, Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (Joe) AIUI, the reason why we're not using MySQL (which would probably fit this storage model as well, if not better than cassandra) is just that we don't have libra... 
[16:55:42] (PS5) Gehel: admins: create shell user for Mathew Onipe [puppet] - https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) (owner: Dzahn) [16:56:28] (CR) Gehel: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) (owner: Dzahn) [16:56:37] (CR) Gehel: [C: 2] admins: create shell user for Mathew Onipe [puppet] - https://gerrit.wikimedia.org/r/458877 (https://phabricator.wikimedia.org/T202708) (owner: Dzahn) [16:57:21] (PS2) Gehel: admins: add Mathew Onipe as member of elasticsearch-roots and wdqs-admins [puppet] - https://gerrit.wikimedia.org/r/459556 (https://phabricator.wikimedia.org/T202708) [16:57:32] (PS1) GTirloni: cloud: Add GTirloni as prod icinga contact [puppet] - https://gerrit.wikimedia.org/r/459817 (https://phabricator.wikimedia.org/T203489) [16:58:37] (CR) Gehel: [C: 2] admins: add Mathew Onipe as member of elasticsearch-roots and wdqs-admins [puppet] - https://gerrit.wikimedia.org/r/459556 (https://phabricator.wikimedia.org/T202708) (owner: Gehel) [16:59:03] onimisionipe: ^ [16:59:15] Yea.. [16:59:26] Thanks [16:59:43] (CR) Andrew Bogott: [C: 1] "I predict that this will improve your ability to interact with icinga but won't get you pages. We'll see!" [puppet] - https://gerrit.wikimedia.org/r/459817 (https://phabricator.wikimedia.org/T203489) (owner: GTirloni) [16:59:58] (CR) GTirloni: [C: 2] cloud: Add GTirloni as prod icinga contact [puppet] - https://gerrit.wikimedia.org/r/459817 (https://phabricator.wikimedia.org/T203489) (owner: GTirloni) [17:00:19] (CR) Volans: [C: 1] "The python part LGTM, I'll leave the remaining bits to those that have more context than me. 
I've left some totally minor nitpicks that ar" (5 comments) [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: Ottomata) [17:00:25] (PS2) GTirloni: cloud: Add GTirloni as prod icinga contact [puppet] - https://gerrit.wikimedia.org/r/459817 (https://phabricator.wikimedia.org/T203489) [17:03:06] (CR) Gilles: "I just assumed that the geoip2 library has useful features in addition to the underlying maxminddb one. I haven't tried to use either yet." [puppet] - https://gerrit.wikimedia.org/r/459735 (https://phabricator.wikimedia.org/T187299) (owner: Gilles) [17:03:45] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 59.37 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:05:56] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 71.72 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:06:14] (PS9) Ottomata: Whitelist EventLogging schemas for ingestion into MySQL [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) [17:06:43] (CR) Ottomata: Whitelist EventLogging schemas for ingestion into MySQL (4 comments) [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: Ottomata) [17:07:56] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:08:24] * gehel is looking at wdqs [17:08:32] _joe_: thanks for your comment at T203039#4574709. A quick note that bmansurov is OoO for the rest of the week and it may take some days before you see a response. just fyi. 
[17:08:32] T203039: Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 [17:08:44] (CR) Volans: [C: 1] "LGTM for the python part, thanks for all the fixes!" [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: Ottomata) [17:09:08] <_joe_> leila: I have my hands full anyways, I just wanted to be sure I got the issue right [17:10:00] <_joe_> and I think mobrovac can answer my question, actually :) [17:10:05] !log delete BGP sessions with old AS10089 router on cr1-eqsin [17:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:33] (PS10) Ottomata: Whitelist EventLogging schemas for ingestion into MySQL [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) [17:12:04] (CR) Ottomata: Whitelist EventLogging schemas for ingestion into MySQL (1 comment) [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: Ottomata) [17:12:13] (CR) jerkins-bot: [V: -1] Whitelist EventLogging schemas for ingestion into MySQL [puppet] - https://gerrit.wikimedia.org/r/459807 (https://phabricator.wikimedia.org/T203596) (owner: Ottomata) [17:12:41] Operations, DBA, Research, Services (designing): Storage of data for recommendation API - https://phabricator.wikimedia.org/T203039 (Pchelolo) > don't have libraries and abstractions for accessing MySQL from our nodejs services. Is that correct? That's the easy part, node has great support for M... [17:14:39] _joe_: sounds good. [17:18:06] RECOVERY - Filesystem available is greater than filesystem size on ms-be1041 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops [17:24:48] Operations, cloud-services-team, Patch-For-Review: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (Dzahn) [17:28:58] (PS4) Dzahn: netops::check: add 3 playbook links to Icinga checks [puppet] - https://gerrit.wikimedia.org/r/459630 (https://phabricator.wikimedia.org/T197873) [17:30:36] Operations, Scap, Release-Engineering-Team (Kanban): Scap should use Eval.Jit=1 when calling rebuildLocalisationCache.php via HHVM - https://phabricator.wikimedia.org/T203680 (thcipriani) [17:31:25] (CR) Dzahn: [C: 2] netops::check: add 3 playbook links to Icinga checks [puppet] - https://gerrit.wikimedia.org/r/459630 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [17:32:00] (PS1) Thcipriani: Scap: Update config to use PHP=hhvm -vEval.Jit=1 [puppet] - https://gerrit.wikimedia.org/r/459828 (https://phabricator.wikimedia.org/T191921) [17:32:07] (PS4) Dzahn: icinga/ripeatlas: add playbook link as notes_url in Icinga checks [puppet] - https://gerrit.wikimedia.org/r/459626 (https://phabricator.wikimedia.org/T197873) [17:33:19] (PS2) Thcipriani: Scap: Update config to use PHP=hhvm -vEval.Jit=1 [puppet] - https://gerrit.wikimedia.org/r/459828 (https://phabricator.wikimedia.org/T203680) [17:34:12] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (thcipriani) [17:36:09] (CR) Dzahn: [C: 2] icinga/ripeatlas: add playbook link as notes_url in Icinga checks [puppet] - https://gerrit.wikimedia.org/r/459626 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [17:37:09] (PS5) Dzahn: netops::check: add 3 playbook links to Icinga checks [puppet] - https://gerrit.wikimedia.org/r/459630 
(https://phabricator.wikimedia.org/T197873) [17:37:47] (CR) GTirloni: [C: 2] Add gtirloni to cloud-wide root [labs/private] - https://gerrit.wikimedia.org/r/459773 (https://phabricator.wikimedia.org/T203489) (owner: GTirloni) [17:38:01] (CR) GTirloni: [V: 2 C: 2] Add gtirloni to cloud-wide root [labs/private] - https://gerrit.wikimedia.org/r/459773 (https://phabricator.wikimedia.org/T203489) (owner: GTirloni) [17:38:05] Operations, DBA, MediaWiki-extensions-Translate, Wikimedia-production-error: DBPerformance warning "Query returned 22186 rows: query: SELECT * FROM `translate_metadata`" on Meta-Wiki - https://phabricator.wikimedia.org/T204026 (Krinkle) [17:38:16] (CR) GTirloni: [C: 2] Add gtirloni to cloud-wide root [labs/private] - https://gerrit.wikimedia.org/r/459773 (https://phabricator.wikimedia.org/T203489) (owner: GTirloni) [17:38:41] (CR) GTirloni: [V: 2 C: 2] Add gtirloni to cloud-wide root [labs/private] - https://gerrit.wikimedia.org/r/459773 (https://phabricator.wikimedia.org/T203489) (owner: GTirloni) [17:39:57] Operations, Scap, Patch-For-Review, Release-Engineering-Team (Kanban): Scap should use Eval.Jit=1 when calling rebuildLocalisationCache.php via HHVM - https://phabricator.wikimedia.org/T203680 (thcipriani) I've merged a patch to scap to allow setting `php_version` in `scap.cfg` that will be used... 
[17:41:02] PROBLEM - High lag on wdqs1003 is CRITICAL: 3607 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:45:42] PROBLEM - High lag on wdqs1004 is CRITICAL: 3632 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:46:26] Operations, cloud-services-team, Patch-For-Review: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (GTirloni) [17:46:51] gehel: FYI ^^^ (is the main cluster, this time I checked ;) ) [17:47:11] volans: yep, the public cluster, looking [17:48:31] let me know if I can help [17:49:42] PROBLEM - High lag on wdqs1005 is CRITICAL: 3657 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:52:06] we're having those no data responses from wikidata again [17:53:22] dc switch might be making things worse also given the error rate, not much worth [17:53:31] s/worth/worse/ [17:53:40] :( [17:53:44] there were temp. outages of wikiba.se reported earlier [17:53:53] those could potentially affect wikidata [17:54:40] is there a dependency between wikiba.se and wikidata? 
[17:54:58] current issue looks just like T202764 [17:54:59] T202764: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 [17:56:23] ACKNOWLEDGEMENT - High lag on wdqs1003 is CRITICAL: 4233 ge 3600 Gehel New occurence of https://phabricator.wikimedia.org/T202764 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:56:23] ACKNOWLEDGEMENT - High lag on wdqs1004 is CRITICAL: 4029 ge 3600 Gehel New occurence of https://phabricator.wikimedia.org/T202764 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:56:24] ACKNOWLEDGEMENT - High lag on wdqs1005 is CRITICAL: 3933 ge 3600 Gehel New occurence of https://phabricator.wikimedia.org/T202764 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [17:56:55] gehel: not sure enough, but wikiba.se lists the query service as an app [17:57:12] RECOVERY - Filesystem available is greater than filesystem size on ms-be1043 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1043&var-datasource=eqiad%2520prometheus%252Fops [17:58:28] the issue here is related to recent changes, I doubt very much that recent changes on wikidata has a synchronous call to wikiba.se. But there might be other more subtle dependencies. [17:59:49] Operations, Performance-Team, Wikidata, Wikidata-Query-Service, User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Gehel) The issue as seen from WDQS can be followed on [[ https://logstash.wikimedia.org/goto/6eb9fc0ee56... [18:07:30] (PS1) Smalyshev: Allow kafka updater to have options [puppet] - https://gerrit.wikimedia.org/r/459831 [18:07:36] No time to file a bug due to RL issues. 
[18:07:37] https://en.wikipedia.org/wiki/File:Dr._Richard_Pierzchajlo,_Head_Shot,_2017.jpg [18:07:41] Cannot be deleted [18:07:53] Gives some inconsistent state internal error. [18:08:07] Operations, ops-codfw, fundraising-tech-ops: move/setup/install frauth2001.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (RobH) p:Triage>Normal [18:08:08] That's it [18:09:29] Krenair: ^ [18:12:27] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational [18:12:48] Operations, ops-codfw, fundraising-tech-ops: move/setup/install frauth2001.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (ayounsi) [18:13:36] Cyberpower678: it is really helpful if you copy-paste the error message rather than describing it. (probably faster for you, too) [18:14:55] The file "mwstore://local-multiwrite/local-deleted/q/q/2/qq22rrn3rkspulntfholqkxlpie4wo1.jpg" is in an inconsistent state within the internal storage backends [18:14:57] MatmaRex: ^ [18:16:03] Operations, cloud-services-team, Patch-For-Review: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (GTirloni) [18:16:04] i guess you can add this one to https://phabricator.wikimedia.org/T141704 . not really my area, sorry [18:16:46] Operations, Performance-Team, Wikidata, Wikidata-Query-Service, User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Gehel) It looks like there is a correlation between bot activity on wikidata query service (T202765) and... [18:17:52] Operations, Commons, MediaWiki-Database, Multimedia, and 4 others: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704 (MusikAnimal) Got this on enwiki when attempting to delete https://en.wikipedia.org/wiki/File:Dr._Richard_Pierzchajlo,_Hea... 
[18:18:14] Operations, ops-codfw, fundraising-tech-ops: move/setup/install frauth2001.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (ayounsi) [18:20:40] Operations, Release Pipeline, Release-Engineering-Team (Watching / External): Update Debian package of Blubber (0.5.0-1) - https://phabricator.wikimedia.org/T203121 (akosiaris) >>! In T203121#4574382, @thcipriani wrote: > Poking this for ETA. > > This one should unblock us on graphoid as well as a a... [18:23:05] Operations, cloud-services-team, Patch-For-Review: Onboard gtirloni to WMF - https://phabricator.wikimedia.org/T203489 (GTirloni) Open>Resolved [18:27:47] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 51090 MB (10% inode=99%) [18:28:56] RECOVERY - Disk space on elastic1023 is OK: DISK OK [18:30:37] I have no idea what to do with this [18:33:00] Is anybody available to do an urgent deployment for fixcopyright.wikimedia.org? It is mostly i18n with one small change to HTML. [18:35:17] jouncebot, next [18:35:17] In 0 hour(s) and 24 minute(s): Datacenter Switchover - Traffic (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180911T1900) [18:35:21] pretty close [18:35:54] maybe legoktm? [18:37:47] CindyCicaleseWMF, maybe greg-g can find someone [18:39:48] <_joe_> CindyCicaleseWMF: can you point me to the change? 
[18:40:19] _joe_: be aware, l10nupdates can make scap take around 50 minutes or so [18:40:39] <_joe_> greg-g: I know, that's why I wanted to see the change [18:40:56] <_joe_> I would advise to merge it after the switchover is done, tbh [18:40:57] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 50250 MB (10% inode=99%) [18:42:21] <_joe_> if l10nupdates need to run, that is [18:42:42] that was my thinking [18:43:47] https://gerrit.wikimedia.org/r/c/mediawiki/skins/EUCopyrightCampaignSkin/+/459679 [18:44:06] _joe_: ^^ [18:44:28] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EUCopyrightCampaign/+/459825 [18:44:33] https://gerrit.wikimedia.org/r/c/mediawiki/skins/EUCopyrightCampaignSkin/+/459824 [18:45:30] _joe_: the two right above are only i18n [18:46:10] <_joe_> yeah I don't think we have the time to deploy those before the traffic swtichover; I'd ask you to wait until after that's done in hopefully 1 hour or so [18:46:21] <_joe_> you have time to find a deployer in the meanwhile [18:46:43] <_joe_> (It's 9 pm here and I have to work on the switchover, so I'm not available) [18:47:03] got it - that's fine - thanks! 
[18:51:18] (CR) Alexandros Kosiaris: [C: -1] icinga: have a default notes_url for all services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [18:51:27] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:51:57] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:51:57] CindyCicaleseWMF: hey, I'm here now [18:52:21] we can do it after the switchover I guess [18:52:37] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:53:01] XioNoX: how do we feel about that flap? ^ [18:53:06] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:53:27] looking [18:54:38] legoktm: thanks! [18:55:30] ema: no planned maintenance, so some transport provider issue. Not an issue for the failover (that's like the 3rd redundant path between codfw and eqiad [18:55:53] XioNoX: awesome, thanks [18:56:40] XioNoX: or 4th? 
;) [18:56:42] now [18:58:36] RECOVERY - Disk space on elastic1023 is OK: DISK OK [18:59:10] (PS1) Reedy: Remove wikimedia.ee [dns] - https://gerrit.wikimedia.org/r/459835 (https://phabricator.wikimedia.org/T204056) [18:59:48] (CR) Ema: [C: 2] traffic: route esams via codfw [puppet] - https://gerrit.wikimedia.org/r/458808 (https://phabricator.wikimedia.org/T203776) (owner: Alexandros Kosiaris) [19:00:05] Deploy window Datacenter Switchover - Traffic (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180911T1900) [19:04:16] (PS2) Alexandros Kosiaris: traffic: route esams via codfw [puppet] - https://gerrit.wikimedia.org/r/458808 (https://phabricator.wikimedia.org/T203776) [19:04:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:04:50] (CR) Alexandros Kosiaris: [V: 2] traffic: route esams via codfw [puppet] - https://gerrit.wikimedia.org/r/458808 (https://phabricator.wikimedia.org/T203776) (owner: Alexandros Kosiaris) [19:04:56] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-upload site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:05:24] is that expected? 
[19:06:15] it probably is related to the transport provider flap mentioned above [19:06:20] <_joe_> Krenair: not related to any action,, we didn't run puppet anywhere [19:06:27] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [19:06:29] greg-g: _joe_: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1802619&oldid=1802498 [19:06:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:06:37] !log route esams via codfw [19:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:49] <_joe_> legoktm: not now, sorry [19:06:58] legoktm: ack [19:06:58] just an fyi [19:07:05] _joe_: it's in 2 hours, no worries [19:07:06] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:07:19] legoktm: thanks for the heads up [19:07:28] mhm :) [19:07:35] <_joe_> no I was just saying I'm not looking at anything apart from the switchover :P [19:07:39] <_joe_> I didn't open the link [19:07:45] lol, ok [19:07:54] ah okay, I just put the fixcopyright stuff on the calendar after the traffic slot [19:08:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:08:12] <_joe_> eheheh ok sorry, I'm looking at 3 grafana dashboards at the same time [19:15:16] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [19:19:39] (PS3) Ema: traffic: Depool eqiad from user traffic for switchover [dns] - https://gerrit.wikimedia.org/r/458806 (https://phabricator.wikimedia.org/T203776) (owner: Alexandros Kosiaris) [19:19:47] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:20:17] !log depool eqiad from edge traffic [19:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:30] (CR) Ema: [C: 2] traffic: Depool eqiad from user traffic for switchover [dns] - https://gerrit.wikimedia.org/r/458806 (https://phabricator.wikimedia.org/T203776) (owner: Alexandros Kosiaris) [19:21:06] (CR) Dzahn: [C: -1] "-1 just to set it to "stalled". first needs domain transfer and then the new owner changes the NS and after that we can clean up here" [dns] - https://gerrit.wikimedia.org/r/459835 (https://phabricator.wikimedia.org/T204056) (owner: Reedy) [19:22:50] (Abandoned) Volans: sre.switchdc.mediawiki: wait TTL expiration [cookbooks] - https://gerrit.wikimedia.org/r/457936 (https://phabricator.wikimedia.org/T199079) (owner: Volans) [19:23:09] OpenSSL 1.1.1 came out today [19:23:32] exciting! 
[19:23:38] https://www.openssl.org/blog/blog/2018/09/11/release111/ [19:23:39] yep [19:23:49] (PS2) Ayounsi: Icinga: add check_vcp (part 1) [puppet] - https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) [19:23:51] (CR) Ayounsi: "Addressing Mark's comments" (9 comments) [puppet] - https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) (owner: Ayounsi) [19:23:55] TLS 1.3 support [19:24:31] (CR) jerkins-bot: [V: -1] Icinga: add check_vcp (part 1) [puppet] - https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) (owner: Ayounsi) [19:25:01] Question: How late tommorow we would be read only/ [19:25:42] <_joe_> Father_of_Lies: at 14:00 UTC approximately [19:25:45] Father_of_Lies: shortly after 14:00 UTC, see https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Schedule_for_2018_switch [19:25:58] Thanks! [19:26:02] <_joe_> some minutes between 14:00 and 15:00 [19:28:26] RECOVERY - High lag on wdqs1005 is OK: (C)3600 ge (W)1200 ge 1137 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:28:57] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is CRITICAL: 58.73 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:29:03] (PS2) Ema: cache::text: Switch restbase active/active [puppet] - https://gerrit.wikimedia.org/r/458802 (https://phabricator.wikimedia.org/T203776) (owner: Alexandros Kosiaris) [19:29:25] (PS1) Alex Monk: Compatibility with new flask version [software/certcentral] - https://gerrit.wikimedia.org/r/459841 [19:29:42] (PS3) Ayounsi: Icinga: add check_vcp (part 1) [puppet] - https://gerrit.wikimedia.org/r/458850 (https://phabricator.wikimedia.org/T201097) [19:30:22] (CR) Ema: [C: 2] cache::text: Switch restbase active/active [puppet] - https://gerrit.wikimedia.org/r/458802 
(https://phabricator.wikimedia.org/T203776) (owner: Alexandros Kosiaris) [19:31:01] !log switch restbase to active/active [19:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:11] (PS1) Herron: mtail: update exim ciphersuite regex [puppet] - https://gerrit.wikimedia.org/r/459842 [19:36:29] Operations, Domains, Traffic, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Dzahn) a:Dzahn [19:37:15] (PS2) Ema: cache::text, cache::upload: Switch services to codfw [puppet] - https://gerrit.wikimedia.org/r/458803 (https://phabricator.wikimedia.org/T203776) (owner: Alexandros Kosiaris) [19:38:26] RECOVERY - High lag on wdqs1004 is OK: (C)3600 ge (W)1200 ge 1050 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:38:55] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921 (hashar) Seems scap learned to set the `PHP` environment variable for mwscript. I guess we can now try `hhvm -d hh... [19:38:56] !log switch all services to codfw only [19:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:04] (CR) Ema: [C: 2] cache::text, cache::upload: Switch services to codfw [puppet] - https://gerrit.wikimedia.org/r/458803 (https://phabricator.wikimedia.org/T203776) (owner: Alexandros Kosiaris) [19:39:33] Operations, Security, Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606 (Reedy) Open>stalled [19:41:36] Operations, Security, Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606 (Reedy) Re-tagging. If it's decided that "we" (Wikimedia) actually want to use it, then a review should task should be created per the standard guidelines. 
This task is kinda vague, suggesting it needs... [19:58:59] Operations, Maps-Sprint, Maps (Tilerator), Reading-Infrastructure-Team-Backlog (Kanban): investigate tilerator crash on maps eqiad - https://phabricator.wikimedia.org/T204047 (MSantos) Here is what I learned so far: 1) Tilerator stop responds every Tuesday at noon. The worker dies and tries to r... [20:16:22] Operations, Wikidata, wikiba.se: wikibase_shared/-wikidatawiki-hhvm:CacheAwarePropertyInfoStore memcached key not well distributed, causing excessive traffic - https://phabricator.wikimedia.org/T204083 (akosiaris) [20:17:51] Operations, Performance-Team, Wikidata, wikiba.se: wikibase_shared/-wikidatawiki-hhvm:CacheAwarePropertyInfoStore memcached key not well distributed, causing excessive traffic - https://phabricator.wikimedia.org/T204083 (akosiaris) p:Triage>High [20:31:39] Operations, Performance-Team, Wikidata, wikiba.se: wikibase_shared/-wikidatawiki-hhvm:CacheAwarePropertyInfoStore memcached key not well distributed, causing excessive traffic - https://phabricator.wikimedia.org/T204083 (akosiaris) https://grafana.wikimedia.org/dashboard/db... 
[20:33:17] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:38:27] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on einsteinium is OK: (C)60 le (W)70 le 71.15 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:41:55] 10Operations, 10Performance-Team, 10Wikidata, 10wikiba.se, 10wikidata-tech-focus: wikibase_shared/-wikidatawiki-hhvm:CacheAwarePropertyInfoStore memcached key not well distributed, causing excessive traffic - https://phabricator.wikimedia.org/T204083 (10Addshore) [20:41:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:52:36] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1135 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [21:00:05] legoktm and CindyCicaleseWMF: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) fixcopyright deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180911T2100). [21:00:57] lol, that's a great choice of messages [21:01:04] hah [21:01:05] it knoooows [21:03:23] lol! it DOES know! 
[21:03:30] !log restarted apache on mwdebug1002, running puppet [21:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:35] (03CR) 10Dzahn: mediawiki::web::prod_sites: make includes explicit in more wikis (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/451257 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [21:11:13] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @kaldari Indeed. Problem is that the number of affected pages was... [21:13:00] (03CR) 10Dzahn: "I created an URL file with a bunch of test URLs this could be affecting on login.wikimedia.org / wb.wikimedia / nl.wikimedia and then used" [puppet] - 10https://gerrit.wikimedia.org/r/451257 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [21:13:25] (03CR) 10Dzahn: [C: 04-1] "would affect /api/ urls on chapter wikis" [puppet] - 10https://gerrit.wikimedia.org/r/451257 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [21:17:28] 10Operations, 10Performance-Team, 10Wikidata, 10wikiba.se, 10wikidata-tech-focus: wikibase_shared/-wikidatawiki-hhvm:CacheAwarePropertyInfoStore memcached key not well distributed, causing excessive traffic - https://phabricator.wikimedia.org/T204083 (10mark) T97368 appears to be a... 
[21:18:21] (03CR) 10Dzahn: [C: 04-1] "where "foo" and "bar" are result lines from apache-fast-test that contain "404" the diff is:" [puppet] - 10https://gerrit.wikimedia.org/r/451257 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [21:20:46] 10Operations, 10Performance-Team, 10Wikidata, 10wikiba.se, 10wikidata-tech-focus: wikibase_shared/-wikidatawiki-hhvm:CacheAwarePropertyInfoStore memcached key not well distributed, causing excessive traffic - https://phabricator.wikimedia.org/T204083 (10Krinkle) [21:21:55] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Performance-Team, and 2 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Krinkle) [21:22:07] !log legoktm@deploy1001 Synchronized php-1.32.0-wmf.20/skins/EUCopyrightCampaignSkin/: add og:image meta tag - https://gerrit.wikimedia.org/r/459836 (duration: 00m 51s) [21:22:11] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Performance-Team, and 3 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Krinkle) [21:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:23] CindyCicaleseWMF: ^^ [21:23:31] [21:23:56] !log legoktm@deploy1001 Started scap: i18n updates for fixcopyright [21:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:42] cool [21:27:54] (03PS1) 10Dduvall: Parameterize profile::labs::lvm::srv volume size [puppet] - 10https://gerrit.wikimedia.org/r/459850 (https://phabricator.wikimedia.org/T203842) [21:28:44] (03CR) 10jerkins-bot: [V: 04-1] Parameterize profile::labs::lvm::srv volume size [puppet] - 10https://gerrit.wikimedia.org/r/459850 (https://phabricator.wikimedia.org/T203842) (owner: 10Dduvall) [21:28:46] and there's the scap [21:29:08] have a
quiet night/day, folks [21:33:09] (03PS2) 10Dduvall: Parameterize profile::labs::lvm::srv volume size [puppet] - 10https://gerrit.wikimedia.org/r/459850 (https://phabricator.wikimedia.org/T203842) [21:34:14] (03CR) 10Dzahn: "as Luca said, in compiler it looks there is a diff like for example "AllowEncodedSlashes On" and "UseCanonicalName Off" but maybe let's fi" [puppet] - 10https://gerrit.wikimedia.org/r/451258 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [21:36:29] (03CR) 10Dzahn: "> Since pcc will not help a lot in here, we could also take these replacements one step further and remove all the annoying extra spaces" [puppet] - 10https://gerrit.wikimedia.org/r/451259 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [21:36:47] 10Operations, 10Puppet: exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Jdlrobson) [21:37:06] 10Operations, 10Puppet, 10Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q1): exported puppet resources are not queryable: cannot create grafana graphs of EventLogging running in beta cluster - https://phabricator.wikimedia.org/T204088 (10Jdlrobson) [21:44:38] (03CR) 10Dzahn: [C: 031] "compared every single one (new lines vs the combination of the 3 included snippets). yes, all identical except wikivoyage is different in " [puppet] - 10https://gerrit.wikimedia.org/r/451260 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [21:50:08] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 52791 MB (10% inode=99%) [21:56:17] !log legoktm@deploy1001 Finished scap: i18n updates for fixcopyright (duration: 32m 21s) [21:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:24] CindyCicaleseWMF: ^^ [21:57:19] CindyCicaleseWMF: should I do a run of deleteEqualMessages now? 
[22:01:09] (03PS3) 10Dduvall: Parameterize profile::labs::lvm::srv volume size [puppet] - 10https://gerrit.wikimedia.org/r/459850 (https://phabricator.wikimedia.org/T203842) [22:01:29] legoktm: yes please! [22:05:32] (03CR) 10Dzahn: [C: 031] "yes, compared all of them with actual diff to the combination of the different includes. all good" [puppet] - 10https://gerrit.wikimedia.org/r/452322 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [22:05:41] 61 pages in the MediaWiki namespace override messages. [22:05:41] 20 pages are equal to the default message (+ 0 talk pages). [22:05:52] so 41 pages out of sync [22:06:18] !log legoktm@mwmaint1001:~$ mwscript deleteEqualMessages.php --wiki fixcopyrightwiki --delete --lang-code='*' [22:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:31] legoktm: sounds about right, thanks [22:07:40] (03CR) 10Dzahn: [C: 031] mediawiki::web::prod_sites: enable HHVM on some sites(!!!) [puppet] - 10https://gerrit.wikimedia.org/r/452325 (owner: 10Giuseppe Lavagetto) [22:10:37] 10Operations, 10DBA, 10JADE, 10Epic, and 2 others: [Epic] Extension:JADE scalability concerns - https://phabricator.wikimedia.org/T196547 (10awight) [22:16:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [22:19:06] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [22:19:36] (03CR) 10Dzahn: "why not add species to the list of remnant_simple_wikis above? looking at the diff between species and f.e. 
outreach i see they are both i" [puppet] - 10https://gerrit.wikimedia.org/r/452636 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [22:20:28] (03CR) 10Dzahn: "so i think either we can treat it just like another "simple_remnant" or the upload rule part would be missing after this" [puppet] - 10https://gerrit.wikimedia.org/r/452636 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [22:28:43] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Nemo_bis) Google is known to not respect our canonical URLs, see {T93550}. T... [22:33:06] RECOVERY - Disk space on elastic1023 is OK: DISK OK [22:39:22] 10Operations, 10Performance-Team: Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10Dzahn) > according to @Dzahn only CRITs get IRC notifications. I don't know if that is configurable. But wouldn't that be a feature? I mean, don't you want to disable IRC notification... [22:41:49] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10kaldari) >Problem is that the number of affected pages was somewhere > 500,0... [22:46:21] (03PS1) 10Dzahn: monitoring: enable using notes_url with grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) [22:46:52] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) I'm making some changes to the proposal, which I hope emphasize the role of Judgment pages to car... 
[22:47:07] (03CR) 10jerkins-bot: [V: 04-1] monitoring: enable using notes_url with grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [22:47:35] (03PS2) 10Dzahn: monitoring: enable using notes_url with grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) [22:48:28] (03CR) 10jerkins-bot: [V: 04-1] monitoring: enable using notes_url with grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [22:52:38] (03PS1) 10Dzahn: icinga::performance: remind users to ignore checks using notes_url [puppet] - 10https://gerrit.wikimedia.org/r/459864 (https://phabricator.wikimedia.org/T203485) [22:53:14] (03CR) 10jerkins-bot: [V: 04-1] icinga::performance: remind users to ignore checks using notes_url [puppet] - 10https://gerrit.wikimedia.org/r/459864 (https://phabricator.wikimedia.org/T203485) (owner: 10Dzahn) [22:57:10] (03PS2) 10Dzahn: icinga::performance: remind users to ignore checks using notes_url [puppet] - 10https://gerrit.wikimedia.org/r/459864 (https://phabricator.wikimedia.org/T203485) [22:57:13] (03PS1) 10Alex Monk: Remove maximum version for acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/459866 [22:58:00] (03PS3) 10Dzahn: monitoring: enable using notes_url with grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) [22:59:01] (03CR) 10jerkins-bot: [V: 04-1] monitoring: enable using notes_url with grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [22:59:54] (03CR) 10jerkins-bot: [V: 04-1] Remove maximum version for acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/459866 (owner: 10Alex Monk) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and 
zeljkof: How many deployers does it take to do Evening SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180911T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:02:19] 10Operations, 10Performance-Team, 10Patch-For-Review: Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10Krinkle) >>! In T203485#4575744, @Dzahn wrote: >> according to @Dzahn only CRITs get IRC notifications. I don't know if that is configurable. > > But wouldn't tha... [23:02:43] (03CR) 10Krinkle: [C: 031] icinga::performance: remind users to ignore checks using notes_url [puppet] - 10https://gerrit.wikimedia.org/r/459864 (https://phabricator.wikimedia.org/T203485) (owner: 10Dzahn) [23:10:50] 10Operations, 10Performance-Team, 10Patch-For-Review: Revisit Grafana/Icinga notification strategy - https://phabricator.wikimedia.org/T203485 (10Dzahn) Ok, gotcha. I think the best solution is probably to use event_handlers to let Icinga auto-ACK the checks. That would keep info on IRC (and in email) as is... 
[23:12:52] (03CR) 10Krinkle: monitoring: enable using notes_url with grafana_alert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:15:31] (03PS4) 10Dzahn: monitoring: enable using notes_url with grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) [23:16:23] (03CR) 10Alex Monk: [C: 04-1] "Sep 11 23:14:47 deployment-certcentral03 uwsgi-certcentral[6792]: Traceback (most recent call last):" [software/certcentral] - 10https://gerrit.wikimedia.org/r/459841 (owner: 10Alex Monk) [23:16:25] (03CR) 10Dzahn: monitoring: enable using notes_url with grafana_alert (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:18:16] (03PS5) 10Ayounsi: Add SNMP classes [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [23:18:30] (03CR) 10Ayounsi: "Addressing Faidon's comments." (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [23:19:46] (03CR) 10Krinkle: [C: 031] monitoring: enable using notes_url with grafana_alert [puppet] - 10https://gerrit.wikimedia.org/r/459862 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:20:10] mutante: auto-ack sounds fine with me. 
[23:22:18] Krinkle: ok, yep, i think it's the best solution, just a little more work than the above [23:22:26] i want both [23:23:17] i was adding those notes_urls in other places too, and a suggestion is to even create it in base module for all checks [23:26:03] there would be a standardized URL on wikitech for each check and either there would be content there, like a runbook or ticket links or longer comments (talk: pages, heh) or it would be a link to an empty wiki page [23:38:05] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) Here's a new proposal for the anatomy of a judgment (in this case, of a diff): ``` wikitext (main slot): n... [23:40:00] (03PS2) 10Alex Monk: Remove maximum version for acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/459866 [23:40:49] (03CR) 10Alex Monk: "It looks like https://github.com/certbot/certbot/commit/83f7e72fefb8d9087a5ad488153a644e1b905572 is not playing nicely with our tests."
[software/certcentral] - 10https://gerrit.wikimedia.org/r/459866 (owner: 10Alex Monk) [23:42:01] (03CR) 10jerkins-bot: [V: 04-1] Remove maximum version for acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/459866 (owner: 10Alex Monk) [23:45:15] (03PS3) 10Alex Monk: Remove maximum version for acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/459866 [23:47:09] (03CR) 10Dzahn: "imagine a wiki page like the one we have for networking specific alerts at https://wikitech.wikimedia.org/wiki/Network_monitoring#Icinga_a" [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:47:37] (03CR) 10jerkins-bot: [V: 04-1] Remove maximum version for acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/459866 (owner: 10Alex Monk) [23:49:18] (03PS4) 10Alex Monk: Remove maximum version for acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/459866 [23:49:34] (03CR) 10Dzahn: "i'm not sure yet if this is the best possible URL structure but it should be on wikitech and be a section or a page for each service_check" [puppet] - 10https://gerrit.wikimedia.org/r/459659 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:50:11] (03PS6) 10Ayounsi: Add SNMP classes [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [23:51:05] (03CR) 10jerkins-bot: [V: 04-1] Remove maximum version for acme library [software/certcentral] - 10https://gerrit.wikimedia.org/r/459866 (owner: 10Alex Monk) [23:53:01] (03PS7) 10Ayounsi: Add SNMP classes [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis) [23:53:16] (03CR) 10Ayounsi: "Tested in cloud, everything seems to be working fine." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/364753 (owner: 10Faidon Liambotis)