[00:01:15] ostriches: kinda [00:01:52] puppet merge of what? [00:02:03] Wanna tweak gerrit replication config (https://gerrit.wikimedia.org/r/#/c/283585/). I was working on getting replication to lead working today, but I'm having troubles debugging the SSH on it and am calling it a day. [00:02:14] Wanna shut off so it doesn't fail and spam the logs all night [00:02:47] you know lead came up in this channel earlier today, too [00:03:10] it went down out of nowhere. lost ethernet to the switch when an unrelated switch config change was made. [00:03:22] due to a juniper bug + it's not in any interface group, so I guess it's defaulting on vlan [00:03:32] might all be inter-related with whatever network issues you're having [00:04:30] also, seems like a very strange coincidence that grrrit-wm died the moment I merged your gerrit-related patch heh [00:04:36] death via self-reference? :) [00:05:12] anyways, it's merged [00:06:45] Krenair: I kicked [00:07:01] Krenair: tools admins can kick it, https://wikitech.wikimedia.org/wiki/Grrrit-wm [00:08:11] 06Operations, 10Traffic, 07Beta-Cluster-reproducible: PHP fatal errors causing Varnish to return 503 - "Junk after gzip data" - https://phabricator.wikimedia.org/T125938#2208836 (10BBlack) >>! In T125938#2005042, @BBlack wrote: > In general, it's probably best to disable gzip output compression in the applic... [00:08:44] YuviPanda, they have to sudo su yuvipanda? 
[00:10:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [00:10:36] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [00:13:17] bblack: Thx [00:29:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:29:26] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [02:21:26] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 08m 28s) [02:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Apr 15 02:30:21 UTC 2016 (duration 8m 55s) [02:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:58:04] Krenair: "sudo -u yuvipanda" is the best way to make a cheese sandwich [03:05:41] 06Operations, 10Traffic, 07HTTPS: status.wikimedia.org has no (valid) HTTPS - https://phabricator.wikimedia.org/T34796#2208959 (10Dzahn) The actual blocker for 2. was that Catchpoint was able to replace almost all features of Watchmouse, _except_ that it doesn't have that kind of status page. So maybe an opt... [03:13:39] 06Operations, 10Traffic, 07HTTPS: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2208960 (10Dzahn) I think we should go with 4. short term. Then for mid/long-term maybe we want to have this redundant, one in each DC and that could possible solve the chicken-eg... [03:26:01] 06Operations, 10Analytics-Wikistats, 07Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2208982 (10Dzahn) I would recommend to accept the regression as a feature. We actually like it when the wikipedia.org domain just... 
[03:28:14] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2208983 (10MZMcBride) Thank you for the explanations and clarifications here. I really appreciate them. [03:29:18] 06Operations, 07Need-volunteer: smokeping config puppetization issue? - https://phabricator.wikimedia.org/T131326#2208984 (10Dzahn) [03:31:38] 06Operations, 10Analytics-Wikistats, 07Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2208987 (10Dzahn) Interestingly right after i say this i see the discussion on T13240 to introduce analytics.wikimedia.org. And i... [03:35:12] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2197243 (10Dzahn) also see T126281 (i think we should not fix/redirect stats.wikipedia.org, but say that there is just stats.wikimedia.org and this new analytics.wikimedia.org [03:38:34] 06Operations, 07Puppet, 07Need-volunteer: MaxClients on puppetmaster - https://phabricator.wikimedia.org/T97466#2208996 (10Dzahn) [03:42:54] 06Operations, 10Monitoring, 10netops: graph interface drops in ganglia - https://phabricator.wikimedia.org/T80515#2209008 (10Dzahn) [03:43:45] 06Operations, 10Monitoring, 10netops: graph interface drops in ganglia - https://phabricator.wikimedia.org/T80515#876457 (10Dzahn) Are we still interested in graphing the interface drops in Ganglia nowadays? [03:45:30] 06Operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#2209011 (10Dzahn) What now? Can archive.org people help or something? [04:01:44] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2209015 (10MZMcBride) >>! 
In T118548#2201380, @GWicke wrote: > We'll initially deploy this without caching for the `?redirect=... [04:08:34] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2209016 (10Pchelolo) >>! In T118548#2209015, @MZMcBride wrote: > The mailing list post mentioned `?redirect=false`. Will any f... [04:09:44] 06Operations, 10Traffic, 07HTTPS: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2209017 (10Chmarkine) I suggest we use Let's Encrypt. It can issue SAN certificates. > Can I get a certificate for multiple domain names (SAN certificates)? > Yes, the same certi... [04:12:21] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2209018 (10MZMcBride) >>! In T118548#2209015, @MZMcBride wrote: > I have some vague memory that the value of some URL paramete... [04:15:00] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2209019 (10Pchelolo) > What happens with `?redirect=yes` or other truthy values? It's considered to be true and the redirect... [04:30:30] 06Operations, 10Analytics-Wikistats, 07Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (10Nemo_bis) >>! In T126281#2208987, @Dzahn wrote: > Interestingly right after i say this i see the discussion on T132407... 
[05:11:41] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0] [05:39:48] (03PS1) 10KartikMistry: Fix cxserver on deployment-cxserver03 returning 404s [puppet] - 10https://gerrit.wikimedia.org/r/283596 (https://phabricator.wikimedia.org/T132733) [05:59:39] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [06:21:35] 06Operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#2209096 (10tstarling) For a long time, the MW configuration files in wmf-config were not under version control. Domas introduced conf-svn: a local subversion repository for those files with an automatic,... [06:29:38] (03PS2) 10Muehlenhoff: Enable base::firewall on tungsten [puppet] - 10https://gerrit.wikimedia.org/r/283441 [06:30:46] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:05] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:25] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:25] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:46] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:46] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:56] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:25] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:25] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:36] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:57] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:19] (03PS2) 10Muehlenhoff: Add gnome-pkg-tools to package_builder base file 
list [puppet] - 10https://gerrit.wikimedia.org/r/283451 [06:49:49] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on tungsten [puppet] - 10https://gerrit.wikimedia.org/r/283441 (owner: 10Muehlenhoff) [06:56:06] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:56:16] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:56:16] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:56:36] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:42] moritzm, can we prepare an all-eqiad-firewall deployment for mysql? [06:56:56] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:16] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:35] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:45] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:55] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:55] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:56] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:03:57] jynus: good morning, I hope you had a nice vacation. you mean that we apply base::firewall to eqiad when the dc is switched to codfw? 
sure, I'll prepare patches for that [07:04:49] 06Operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#2209107 (10tstarling) So I have a lot of files, I haven't really deleted anything, but it is more like clutter recursively hidden than a principled archive. I have partial copies of the MW configuration f... [07:14:52] moritzm exactly that :-) [07:15:37] nice, I'll prepare patches today [07:15:38] one nice and easy step :-). I think that would mean moving some classes to the role instead of the node [07:16:06] we may want to remove iron from that at the same time [07:16:11] yeah, for some we can probably enable it directly in the role [07:16:20] sounds good wrt iron [07:16:26] misc will not be failed over [07:16:42] so that is coredb, or whatever it is called [07:17:05] do not worry about non-mariadb classes, those will disappear [07:17:29] I mean mariadb::core is what we want, coredb will almost disappear [07:17:41] ok [07:17:53] !log rebooting oxygen for kernel update to 4.4 [07:17:53] I will help with that mess [07:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:18:21] we may not have time to upgrade to jessie, however [07:18:53] except s2 and s3 [07:19:09] but we will upgrade from precise to trusty [07:21:02] nice, let me know if I can help with some of the non-DBA tasks during the switchover time window [07:21:49] there is actually not much to be done, I would just ask for help with the firewall, mainly because I will be busy with the DBA things [07:22:38] that is by design: I will not be doing anything but essential maintenance [07:26:28] ok! [07:38:36] 06Operations, 10ops-codfw, 06DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2209124 (10Volans) db2017 and db2018 RAID is back to optimal, re-enabled notifications on icinga for RAID checks. Leaving the task open for the remaining hosts. 
[07:40:52] !log rebooting oresrdb* to Linux 4.4 [07:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:43:19] (03PS4) 10Giuseppe Lavagetto: role::jobqueue_redis: add monitoring of the redis instances [puppet] - 10https://gerrit.wikimedia.org/r/282950 [07:44:40] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] role::jobqueue_redis: add monitoring of the redis instances [puppet] - 10https://gerrit.wikimedia.org/r/282950 (owner: 10Giuseppe Lavagetto) [07:47:12] (03PS1) 10Volans: Depool db1042 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283601 (https://phabricator.wikimedia.org/T111654) [07:48:38] (03CR) 10Hashar: [C: 031] Fix cxserver on deployment-cxserver03 returning 404s [puppet] - 10https://gerrit.wikimedia.org/r/283596 (https://phabricator.wikimedia.org/T132733) (owner: 10KartikMistry) [07:50:11] (03PS1) 10Volans: MariaDB: use Puppet cert for s4 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283602 (https://phabricator.wikimedia.org/T111654) [07:51:29] (03Abandoned) 10Giuseppe Lavagetto: cassandra: do not manage the service via puppet [puppet] - 10https://gerrit.wikimedia.org/r/250682 (https://phabricator.wikimedia.org/T103134) (owner: 10Giuseppe Lavagetto) [07:51:50] (03Abandoned) 10Giuseppe Lavagetto: k8s: switch to using systems' CA [puppet] - 10https://gerrit.wikimedia.org/r/243662 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [07:57:34] (03CR) 10Volans: "Diff looks good: https://puppet-compiler.wmflabs.org/2463/" [puppet] - 10https://gerrit.wikimedia.org/r/283602 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [08:00:04] !log removing empty log archives from Fluorine (T132324) [08:00:06] T132324: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324 [08:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:02:41] 06Operations, 06Services, 07Tracking: Move Node.JS 
services to Jessie and Node 4 (tracking) - https://phabricator.wikimedia.org/T124989#2209166 (10hashar) #beta-cluster migration is tracked by T125003 [08:05:41] !log starting TLS upgrade for shard s4 T111654 [08:05:42] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [08:05:44] (03PS1) 10EBernhardson: Make mwrepl a little more user friendly [puppet] - 10https://gerrit.wikimedia.org/r/283604 [08:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:05:48] jynus: FYI --^ [08:06:21] (03PS2) 10EBernhardson: Make mwrepl a little more user friendly [puppet] - 10https://gerrit.wikimedia.org/r/283604 [08:07:15] (03CR) 10Volans: [C: 032] Depool db1042 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283601 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [08:07:41] (03Merged) 10jenkins-bot: Depool db1042 for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283601 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [08:08:04] (03CR) 10Gehel: [C: 031] "Always nice to give useful feedback to the user!" 
[puppet] - 10https://gerrit.wikimedia.org/r/283604 (owner: 10EBernhardson) [08:09:05] good, good [08:09:58] !log volans@tin Synchronized wmf-config/db-eqiad.php: Depool db1042 to upgrade TLS on s4 - T111654 (duration: 00m 36s) [08:09:58] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [08:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:10:46] (03PS3) 10EBernhardson: Make mwrepl a little more user friendly [puppet] - 10https://gerrit.wikimedia.org/r/283604 [08:11:14] (03CR) 10EBernhardson: "PS3 adds a minor tweak, as I realized -eq only works with integers" [puppet] - 10https://gerrit.wikimedia.org/r/283604 (owner: 10EBernhardson) [08:11:36] (03PS1) 10Muehlenhoff: Add ferm service for debug proxy [puppet] - 10https://gerrit.wikimedia.org/r/283606 [08:11:38] (03PS1) 10Muehlenhoff: Enable base::firewall for hassaleh/hassium [puppet] - 10https://gerrit.wikimedia.org/r/283607 [08:17:30] 06Operations: librsvg path patch needs to be applied for jessie - https://phabricator.wikimedia.org/T132584#2209209 (10MoritzMuehlenhoff) 05Open>03Resolved librsvg has been built with the patch and uploaded to carbon. 
[08:17:32] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2209211 (10MoritzMuehlenhoff) [08:18:40] (03CR) 10Volans: [C: 032] MariaDB: use Puppet cert for s4 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283602 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [08:20:36] (03PS2) 10Alexandros Kosiaris: Fix cxserver on deployment-cxserver03 returning 404s [puppet] - 10https://gerrit.wikimedia.org/r/283596 (https://phabricator.wikimedia.org/T132733) (owner: 10KartikMistry) [08:20:41] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix cxserver on deployment-cxserver03 returning 404s [puppet] - 10https://gerrit.wikimedia.org/r/283596 (https://phabricator.wikimedia.org/T132733) (owner: 10KartikMistry) [08:26:11] (03PS2) 10Muehlenhoff: Enable base::firewall for hassaleh/hassium [puppet] - 10https://gerrit.wikimedia.org/r/283607 [08:29:17] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "so, this is most often caused by the clean step reusing the same make files as the rest of the targets, which seems DRY but does have the " [puppet] - 10https://gerrit.wikimedia.org/r/283451 (owner: 10Muehlenhoff) [08:29:20] (03PS1) 10Mobrovac: Mathoid: Use Scap3 as the deployment method [puppet] - 10https://gerrit.wikimedia.org/r/283609 (https://phabricator.wikimedia.org/T116338) [08:29:22] (03PS3) 10Alexandros Kosiaris: Add gnome-pkg-tools to package_builder base file list [puppet] - 10https://gerrit.wikimedia.org/r/283451 (owner: 10Muehlenhoff) [08:29:26] (03CR) 10Alexandros Kosiaris: [V: 032] Add gnome-pkg-tools to package_builder base file list [puppet] - 10https://gerrit.wikimedia.org/r/283451 (owner: 10Muehlenhoff) [08:35:03] (03PS1) 10Volans: Repool db1042 after TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283610 (https://phabricator.wikimedia.org/T111654) [08:35:26] 06Operations, 07HHVM: upgrade HHVM to 3.12.1 on terbium - 
https://phabricator.wikimedia.org/T132751#2209239 (10Gehel) [08:36:10] 06Operations, 10DBA: upgrade db servers to jessie - https://phabricator.wikimedia.org/T125028#2209252 (10jcrespo) I gladly explain, I think it was clear for everyone involved in this ticket: "Upgrade db servers to jessie" to me is as clear or useful as a ticket saying "have a recent kernel installed on all ma... [08:37:07] 06Operations, 10DBA: upgrade db servers to jessie - https://phabricator.wikimedia.org/T125028#2209255 (10jcrespo) >>! In T125028#2209252, @jcrespo wrote: > I gladly explain, I think it was clear for everyone involved in this ticket: > > "Upgrade db servers to jessie" to me is as clear or useful as a ticket sa... [08:50:52] (03PS1) 10Muehlenhoff: Enable base::firewall for mariadb s1 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283611 [08:53:17] (03PS2) 10Gehel: Revert "remove wdqs1002 from varnish during reinstall / fix" [puppet] - 10https://gerrit.wikimedia.org/r/283485 (https://phabricator.wikimedia.org/T132387) [08:54:00] (03PS1) 10Muehlenhoff: Enable base::firewall for db1065 [puppet] - 10https://gerrit.wikimedia.org/r/283612 [08:56:28] (03PS2) 10Filippo Giunchedi: revised yaml instance descriptor format [puppet] - 10https://gerrit.wikimedia.org/r/283574 (owner: 10Eevans) [08:56:35] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] revised yaml instance descriptor format [puppet] - 10https://gerrit.wikimedia.org/r/283574 (owner: 10Eevans) [08:57:29] (03PS3) 10Gehel: Revert "remove wdqs1002 from varnish during reinstall / fix" [puppet] - 10https://gerrit.wikimedia.org/r/283485 (https://phabricator.wikimedia.org/T132387) [09:00:08] (03CR) 10Gehel: [C: 032] Revert "remove wdqs1002 from varnish during reinstall / fix" [puppet] - 10https://gerrit.wikimedia.org/r/283485 (https://phabricator.wikimedia.org/T132387) (owner: 10Gehel) [09:00:20] (03CR) 10Filippo Giunchedi: "nitpick, also +1 on limiting to jessie for now" (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/283459 (https://phabricator.wikimedia.org/T131961) (owner: 10Ema) [09:00:25] (03PS1) 10Muehlenhoff: Enable base::firewall for mariadb s4 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283614 [09:01:04] !log reenabling wdqs1002 in varnish rotation after reinstall (T132387) [09:01:05] T132387: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387 [09:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:02:46] (03PS1) 10Muehlenhoff: Enable base::firewall for mariadb s5 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283615 [09:05:35] (03PS1) 10Muehlenhoff: Enable base::firewall for mariadb s6 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283616 [09:07:50] (03PS1) 10Muehlenhoff: Enable base::firewall for mariadb s7 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283617 [09:09:32] (03CR) 10Alexandros Kosiaris: [C: 032] Mathoid: Use Scap3 as the deployment method [puppet] - 10https://gerrit.wikimedia.org/r/283609 (https://phabricator.wikimedia.org/T116338) (owner: 10Mobrovac) [09:09:40] (03PS2) 10Alexandros Kosiaris: Mathoid: Use Scap3 as the deployment method [puppet] - 10https://gerrit.wikimedia.org/r/283609 (https://phabricator.wikimedia.org/T116338) (owner: 10Mobrovac) [09:09:51] (03CR) 10Alexandros Kosiaris: [V: 032] Mathoid: Use Scap3 as the deployment method [puppet] - 10https://gerrit.wikimedia.org/r/283609 (https://phabricator.wikimedia.org/T116338) (owner: 10Mobrovac) [09:17:21] (03PS1) 10Gehel: Increase client_max_body_size to 100M in nginx [puppet] - 10https://gerrit.wikimedia.org/r/283619 [09:23:21] (03CR) 10DCausse: [C: 031] Increase client_max_body_size to 100M in nginx [puppet] - 10https://gerrit.wikimedia.org/r/283619 (owner: 10Gehel) [09:23:42] !log Re-arrange s3 replica topology: making codfw replicate from db1075 - T111654 [09:23:43] T111654: Set up TLS for MariaDB replication - 
https://phabricator.wikimedia.org/T111654 [09:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:25:02] will you choose db1075? it is to update the ticket [09:25:31] we chose db1075 in the Google Doc [09:25:50] actually I'm depooling the new ones first, just in case [09:25:58] I do not care about the actual server, I just wanted to double check [09:26:15] and will update the s3 master ticket [09:26:23] ok [09:26:53] https://phabricator.wikimedia.org/T128353 [09:27:46] 06Operations, 10Monitoring, 10netops: graph interface drops in ganglia - https://phabricator.wikimedia.org/T80515#876457 (10fgiunchedi) I don't think so: if it is network device interface drops, those are in librenms; if it is host interface drops, those are in graphite [09:27:52] (03PS1) 10Volans: Depool new s3 slaves for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283620 (https://phabricator.wikimedia.org/T111654) [09:34:05] (03PS1) 10Muehlenhoff: Enable base::firewall for mariadb es1 shard in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/283622 [09:35:19] 06Operations, 10Traffic, 07HTTPS: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2209325 (10faidon) Option 4 sounds like the sanest (and easiest) to me too. apt and mirrors/ubuntu are different services really and might be split in the future (cf. T84817) so I... [09:36:04] (03PS2) 10Gehel: Increase client_max_body_size to 100M in nginx [puppet] - 10https://gerrit.wikimedia.org/r/283619 [09:36:54] (03PS1) 10Alexandros Kosiaris: servermon: Add Krenair (Alex Monk) to servermon users [puppet] - 10https://gerrit.wikimedia.org/r/283623 [09:41:41] is jenkins having long delays for everyone? Or is it just me? 
(https://gerrit.wikimedia.org/r/#/c/283619/) [09:42:56] for me too gehel [09:42:56] (03PS3) 10Giuseppe Lavagetto: Log all write activity to an irc bot [software/conftool] - 10https://gerrit.wikimedia.org/r/280843 [09:42:59] I was about to ask the same [09:43:15] seems to have been on vacation for ~1h, it worked fine for me before [09:43:56] checking it [09:44:59] volans: what are you checking? (I have no idea how CI works here...) [09:45:17] gehel: me neither, but I know jenkins a bit, so from the UI [09:45:46] volans: from https://integration.wikimedia.org/zuul/ I see 48 events in queue (but not sure what that means) [09:46:10] you have to go to the tab that has the jobs that should run on your CR [09:46:29] also noticed that, it's not really catching up ATM [09:46:44] mine is on mediawiki-config for example, and the jobs are under Ops [09:46:54] but there is no pending job [09:46:59] so it seems they were not scheduled at all [09:47:55] (03CR) 10Volans: [C: 032] Repool db1042 after TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283610 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [09:48:10] If I understood correctly, the scheduling is actually done by Zuul, not directly Jenkins... [09:48:23] ok, then the problem could be there too [09:49:33] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2209336 (10BBlack) I guess the question remains, though: if the Varnish redirect-stripper sees ?redirect with a non-falsey val... [09:50:31] _joe_, godog maybe one of you has more insight into what can be wrong with the CI? 
[09:50:51] seems that the CI jobs are not scheduled in jenkins [09:51:14] (03PS1) 10Faidon Liambotis: install_server: dhcpd.conf cleanups [puppet] - 10https://gerrit.wikimedia.org/r/283626 [09:51:16] (03PS1) 10Faidon Liambotis: install_server: avoid DHCP next-servers to self [puppet] - 10https://gerrit.wikimedia.org/r/283627 [09:52:09] 06Operations, 10Monitoring, 10netops: graph interface drops in ganglia - https://phabricator.wikimedia.org/T80515#2209339 (10akosiaris) 05Open>03declined I concur. Declining [09:52:17] I'm a bit stuck... the mediawiki-config merge process is done by Jenkins :( [09:52:27] (03CR) 10Faidon Liambotis: [C: 032 V: 032] install_server: dhcpd.conf cleanups [puppet] - 10https://gerrit.wikimedia.org/r/283626 (owner: 10Faidon Liambotis) [09:53:10] (03CR) 10Faidon Liambotis: [C: 032 V: 032] install_server: avoid DHCP next-servers to self [puppet] - 10https://gerrit.wikimedia.org/r/283627 (owner: 10Faidon Liambotis) [09:55:03] volans: I can't ssh to gallium (Zuul server). Probably a wrong SSH config on my side. There is some info on debugging at https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Debugging [09:56:18] gehel: I'm in there, let me see [09:57:58] volans: I'm in there too, wrong bastion... [09:59:47] not much that I understand... 
[10:00:14] same here :) [10:08:01] 06Operations, 10DBA: db1024 (s2 master) will run out of disk space in ~4 months - https://phabricator.wikimedia.org/T122048#2209355 (10mark) [10:10:45] !log on helium: scheduled restore of home_pmtpa to bast4001 [10:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:24] (03PS1) 10Muehlenhoff: Add recently assigned CVE to changelog (already applied in older stable patch set) [debs/linux] - 10https://gerrit.wikimedia.org/r/283632 [10:16:23] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [5000000.0] [10:21:22] <_joe_> !log restarted zuul, zuul-merger on gallium [10:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:22:31] (03CR) 10Faidon Liambotis: [C: 04-1] "If the package name is the same (elasticsearch), and you're including the elasticsearch-2.x repo everywhere that we use the elasticsearch " [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) (owner: 10Gehel) [10:23:40] (03CR) 10Faidon Liambotis: "Have you tested that? Doesn't it need a "satisfy any" to make those conditions or'ed with each other?" [puppet] - 10https://gerrit.wikimedia.org/r/283623 (owner: 10Alexandros Kosiaris) [10:25:27] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add recently assigned CVE to changelog (already applied in older stable patch set) [debs/linux] - 10https://gerrit.wikimedia.org/r/283632 (owner: 10Muehlenhoff) [10:29:21] (03CR) 10Volans: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283620 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [10:29:24] (03CR) 10Faidon Liambotis: [C: 04-1] Override kafkatee's default logrotate/rsyslog configuration. 
(033 comments) [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) (owner: 10Elukey) [10:29:32] (03CR) 10Volans: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283620 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [10:30:34] (03CR) 10Volans: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283610 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [10:30:58] <_joe_> volans: I see those jobs being submitted by zuul [10:31:25] _joe_: yes, first results too, seems working fine [10:31:33] 06Operations, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#2209395 (10mark) @akosiaris so can we resolve this task? [10:32:12] (03CR) 10Faidon Liambotis: [C: 031] "Woohooo :)" [dns] - 10https://gerrit.wikimedia.org/r/283364 (https://phabricator.wikimedia.org/T124482) (owner: 10BBlack) [10:34:03] (03PS2) 10Ema: Workaround for mdadm boot-time race condition [puppet] - 10https://gerrit.wikimedia.org/r/283459 (https://phabricator.wikimedia.org/T131961) [10:34:07] (03CR) 10Volans: Repool db1042 after TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283610 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [10:34:17] (03CR) 10Volans: [C: 032] Repool db1042 after TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283610 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [10:34:52] (03Merged) 10jenkins-bot: Repool db1042 after TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283610 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [10:37:21] !log volans@tin Synchronized wmf-config/db-eqiad.php: Repool db1042 after TLS upgrade on s4 - T111654 (duration: 00m 30s) [10:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:38:32] T111654: Set up TLS for MariaDB replication - 
https://phabricator.wikimedia.org/T111654 [10:39:29] elukey: finish=15223.3min speed=1465K/sec [10:39:34] aqs again [10:39:37] this is ridiculous :) [10:39:49] (03PS2) 10Volans: Depool new s3 slaves for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283620 (https://phabricator.wikimedia.org/T111654) [10:41:45] 06Operations, 10DBA: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T129452#2209403 (10jcrespo) This is not a datacenter ops issue, yet. [10:41:56] (03CR) 10Giuseppe Lavagetto: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/280843 (owner: 10Giuseppe Lavagetto) [10:42:18] thanks _joe_:) [10:42:47] 06Operations, 10Traffic, 07HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2209409 (10BBlack) [10:44:43] 06Operations, 07HHVM: upgrade HHVM to 3.12.1 on terbium - https://phabricator.wikimedia.org/T132751#2209239 (10Joe) I don't see why upgrading hhvm would impact running processes. The repo schema is going to be changed, so there is not really any possibility that different versions running at the same time inte... 
[10:45:23] (03PS1) 10Muehlenhoff: Also enable base::firewall on rcs1001 [puppet] - 10https://gerrit.wikimedia.org/r/283637 [10:45:28] 06Operations, 07HHVM: upgrade HHVM to 3.12.1 on terbium - https://phabricator.wikimedia.org/T132751#2209420 (10Joe) p:05Triage>03Normal a:03Joe [10:45:47] (03CR) 10Volans: [C: 032] Depool new s3 slaves for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283620 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [10:46:51] 06Operations, 10puppet-compiler: puppet compiler: NoneType' object is not iterable with node auto-select feature - https://phabricator.wikimedia.org/T117278#2209429 (10Joe) 05Open>03Resolved [10:47:17] 06Operations, 10puppet-compiler: puppet compiler: NoneType' object is not iterable with node auto-select feature - https://phabricator.wikimedia.org/T117278#1770219 (10Joe) @akosiaris fixed this a long time ago [10:48:47] (03Merged) 10jenkins-bot: Depool new s3 slaves for TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283620 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [10:50:59] !log volans@tin Synchronized wmf-config/db-eqiad.php: Depool new db1075,1077,1078 to upgrade TLS on s3 - T111654 (duration: 00m 41s) [10:50:59] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [10:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:01:20] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2209442 (10mobrovac) >>! In T118548#2209336, @BBlack wrote: > I guess the question remains, though: if the Varnish redirect-st... 
[11:04:10] (03CR) 10Gehel: "Recheck" [puppet] - 10https://gerrit.wikimedia.org/r/283619 (owner: 10Gehel) [11:05:49] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [11:13:16] (03CR) 10Muehlenhoff: [C: 032 V: 032] Also enable base::firewall on rcs1001 [puppet] - 10https://gerrit.wikimedia.org/r/283637 (owner: 10Muehlenhoff) [11:16:34] 06Operations, 10Traffic, 07HTTPS: enable https for (carbon|ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2209456 (10BBlack) [11:17:05] 06Operations, 10Traffic, 07HTTPS: enable https for (carbon|ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2198925 (10BBlack) Added carbon to the list, since that actually is the HTTP hostname we use for some of the access to this service (contents are the same as apt.wm.o though). [11:17:30] (03PS1) 10BBlack: refactor install_server web stuff towards SSL config [puppet] - 10https://gerrit.wikimedia.org/r/283638 (https://phabricator.wikimedia.org/T132450) [11:17:32] (03PS1) 10BBlack: mirrors::serve: split mirrors/ubuntu site configs [puppet] - 10https://gerrit.wikimedia.org/r/283639 (https://phabricator.wikimedia.org/T132450) [11:42:18] (03PS2) 10BBlack: mirrors::serve: split mirrors/ubuntu site configs [puppet] - 10https://gerrit.wikimedia.org/r/283639 (https://phabricator.wikimedia.org/T132450) [11:42:20] (03PS2) 10BBlack: refactor install_server web stuff towards SSL config [puppet] - 10https://gerrit.wikimedia.org/r/283638 (https://phabricator.wikimedia.org/T132450) [11:45:33] (03PS1) 10Mobrovac: Zotero: Use Scap3 for deployment [puppet] - 10https://gerrit.wikimedia.org/r/283641 (https://phabricator.wikimedia.org/T129140) [11:46:57] (03PS1) 10Giuseppe Lavagetto: Make output more readable [software/conftool] - 10https://gerrit.wikimedia.org/r/283642 [11:50:32] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail [11:52:52] (03PS2) 10Mobrovac: Zotero: Use 
Scap3 for deployment [puppet] - 10https://gerrit.wikimedia.org/r/283641 (https://phabricator.wikimedia.org/T129140) [11:56:01] (03PS1) 10Urbanecm: Add Resolution: namespace to foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283643 (https://phabricator.wikimedia.org/T132746) [11:58:25] (03CR) 10Peachey88: [C: 04-1] Add Resolution: namespace to foundationwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283643 (https://phabricator.wikimedia.org/T132746) (owner: 10Urbanecm) [11:59:16] bah gerrit ui [11:59:37] (03PS2) 10Urbanecm: Add Resolution: namespace to foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283643 (https://phabricator.wikimedia.org/T132746) [12:00:16] (03CR) 10Urbanecm: "Upps, fixed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283643 (https://phabricator.wikimedia.org/T132746) (owner: 10Urbanecm) [12:00:22] !log remove maintenance from wdqs1002 [12:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:01:48] (03PS3) 10BBlack: mirrors::serve: split mirrors/ubuntu site configs [puppet] - 10https://gerrit.wikimedia.org/r/283639 (https://phabricator.wikimedia.org/T132450) [12:01:51] (03PS3) 10BBlack: refactor install_server web stuff towards SSL config [puppet] - 10https://gerrit.wikimedia.org/r/283638 (https://phabricator.wikimedia.org/T132450) [12:02:18] (03PS3) 10Mobrovac: Zotero: Use Scap3 for deployment [puppet] - 10https://gerrit.wikimedia.org/r/283641 (https://phabricator.wikimedia.org/T129140) [12:03:42] (03PS3) 10Gehel: Increase client_max_body_size to 100M in nginx [puppet] - 10https://gerrit.wikimedia.org/r/283619 [12:06:22] (03CR) 10Gehel: [C: 032] Increase client_max_body_size to 100M in nginx [puppet] - 10https://gerrit.wikimedia.org/r/283619 (owner: 10Gehel) [12:08:22] (03PS4) 10BBlack: refactor install_server web stuff towards SSL config [puppet] - 10https://gerrit.wikimedia.org/r/283638 
(https://phabricator.wikimedia.org/T132450) [12:08:31] (03CR) 10BBlack: [C: 032 V: 032] refactor install_server web stuff towards SSL config [puppet] - 10https://gerrit.wikimedia.org/r/283638 (https://phabricator.wikimedia.org/T132450) (owner: 10BBlack) [12:08:57] !log experimenting on carbon HTTP config (apt/mirrors/ubuntu.wm.o) - watch out for installer / package-update issues! [12:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:12:03] (03PS1) 10Urbanecm: Disable MoodBar on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283647 (https://phabricator.wikimedia.org/T131685) [12:12:38] (03PS2) 10Giuseppe Lavagetto: Make output more readable [software/conftool] - 10https://gerrit.wikimedia.org/r/283642 [12:14:45] !log Re-arrange s3 replica topology: making codfw replicate from db1075 (this time for real) - T111654 [12:14:46] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [12:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:15:52] !log increase max request size for elasticsearch [12:15:55] (03PS4) 10BBlack: mirrors::serve: split mirrors/ubuntu site configs [puppet] - 10https://gerrit.wikimedia.org/r/283639 (https://phabricator.wikimedia.org/T132450) [12:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:16:15] (03CR) 10BBlack: [C: 032 V: 032] mirrors::serve: split mirrors/ubuntu site configs [puppet] - 10https://gerrit.wikimedia.org/r/283639 (https://phabricator.wikimedia.org/T132450) (owner: 10BBlack) [12:16:42] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:19:40] (03PS4) 10Mobrovac: Zotero: Use Scap3 for deployment [puppet] - 10https://gerrit.wikimedia.org/r/283641 (https://phabricator.wikimedia.org/T129140) [12:24:28] 06Operations: Split carbon's install/mirror roles, provision install1001 - 
https://phabricator.wikimedia.org/T132757#2209585 (10faidon) [12:24:37] (03PS5) 10Mobrovac: Zotero: Use Scap3 for deployment [puppet] - 10https://gerrit.wikimedia.org/r/283641 (https://phabricator.wikimedia.org/T129140) [12:24:39] 06Operations: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2209603 (10faidon) [12:25:22] 06Operations: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2209585 (10faidon) [12:25:24] 06Operations, 13Patch-For-Review: Migrate carbon to jessie - https://phabricator.wikimedia.org/T123733#1936623 (10faidon) [12:28:38] (03PS1) 10Volans: MariaDB: use Puppet cert for s3 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283648 (https://phabricator.wikimedia.org/T111654) [12:29:46] 06Operations: Setting up a mirror serv{er,ice} - https://phabricator.wikimedia.org/T84817#2209610 (10faidon) I moved the splitting stuff to a different task, T132757 and put it as a blocker to this task. As far as mirrors go: - As @mark mentioned, we are already an Ubuntu official mirror. That's good to know and... [12:30:07] bblack: ^^^ [12:30:10] all that :) [12:30:31] I'm hoping mutante will save the day :P [12:32:05] ok :) [12:34:21] 06Operations, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: url-downloader should be set up more redundantly - https://phabricator.wikimedia.org/T122134#2209620 (10akosiaris) 05Open>03stalled There is still the issue of having one url-downloader per DC (manual switchover) and the fact we have a DC rack... 
[12:34:39] (03PS6) 10Mobrovac: Zotero: Use Scap3 for deployment [puppet] - 10https://gerrit.wikimedia.org/r/283641 (https://phabricator.wikimedia.org/T129140) [12:36:15] (03PS6) 10Ladsgroup: [WIP] ores: Add support for running preached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [12:38:35] (03CR) 10Alexandros Kosiaris: [C: 032] Zotero: Use Scap3 for deployment [puppet] - 10https://gerrit.wikimedia.org/r/283641 (https://phabricator.wikimedia.org/T129140) (owner: 10Mobrovac) [12:38:36] 06Operations, 10MediaWiki-JobRunner, 13Patch-For-Review: jobchron logs are not rotated - https://phabricator.wikimedia.org/T96132#2209627 (10fgiunchedi) 05Open>03Resolved this is deployed, thanks @Matanya ! ``` mw1002.eqiad.wmnet: -rw-r----- 1 root root 1643102 Apr 15 12:37 /var/log/upstart/jobchron.log... [12:39:06] akosiaris: heh too quick, still waiting for the compiler to confirm all's good [12:39:31] oh, already ? I am doing the same thing [12:39:39] I haven't merged yet [12:41:32] heheh [12:41:41] let's see who wins the race [12:41:59] akosiaris: https://puppet-compiler.wmflabs.org/2476/ [12:42:17] looking good [12:42:36] the change on iridium is just a simple sudo::user resource name rename [12:42:45] my change failed ... [12:42:53] 502 Proxy Error [12:43:11] it failed for me earlier too [12:43:14] I love how our install return an html to the jenkins api [12:43:39] cause obviously that's parseable by the jenkins api clients ... 
[12:43:40] sigh [12:43:40] :) [12:43:41] (03PS1) 10Faidon Liambotis: Kill wmftest.org DYNA, remove text-addrs-v4 [dns] - 10https://gerrit.wikimedia.org/r/283650 [12:43:44] bblack: ^ too [12:44:34] mobrovac: ok merging [12:46:19] (03CR) 10BBlack: [C: 031] Kill wmftest.org DYNA, remove text-addrs-v4 [dns] - 10https://gerrit.wikimedia.org/r/283650 (owner: 10Faidon Liambotis) [12:47:41] (03CR) 10Faidon Liambotis: [C: 032] Kill wmftest.org DYNA, remove text-addrs-v4 [dns] - 10https://gerrit.wikimedia.org/r/283650 (owner: 10Faidon Liambotis) [12:49:23] mobrovac: seems like we are ok on sca1001 [12:49:30] \o/ [12:49:39] I 'd let the change propagate to the other 3 before running a deploy though [12:49:57] defaults.js btw is not deployable (still owned by root) but IIRC it's not in the repo [12:49:59] yeah, i need to fix some git problems with zotero/translators first [12:50:00] :/ [12:50:12] but rather generated by puppet [12:50:20] worst case scenario we just remove it from the repo [12:50:29] we rewrote the history, but now git pull is giving me hell [12:50:49] ah good point, lemme check [12:51:04] (03CR) 10Luke081515: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283647 (https://phabricator.wikimedia.org/T131685) (owner: 10Urbanecm) [12:51:18] akosiaris: that's going to be a problem [12:51:42] scap3 switches symlinks, and because of this file being owned by root, that's not gonna fly i'm afraid [12:52:05] (03CR) 10Luke081515: [C: 04-1] Disable MoodBar on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283647 (https://phabricator.wikimedia.org/T131685) (owner: 10Urbanecm) [12:52:10] or rather, the host won't let scap3 remove the dir in the first place [12:52:32] (03CR) 10Luke081515: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283643 (https://phabricator.wikimedia.org/T132746) (owner: 10Urbanecm) [12:53:29] mobrovac: not following...
default.js is a file [12:53:32] not a dir [12:53:38] defaults.jar [12:53:42] damn [12:53:49] defaults.js .. grrr javascript/java [12:53:57] ok, but it won't let it delete the file [12:54:15] why will scap3 want to delete the file ? [12:54:28] it just switches symlinks, no ? [12:54:37] because it needs to delete /srv/deployment/zotero/translation-server in order to create the symlink [12:55:54] (03CR) 10Luke081515: [C: 031] Add Resolution: namespace to foundationwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283643 (https://phabricator.wikimedia.org/T132746) (owner: 10Urbanecm) [12:56:12] ah, it does a rm -rf on the migration ? [12:56:44] quite sure it does [12:57:00] i don't see how else would it do it [12:57:19] (03PS1) 10BBlack: add https to ferm rules on installserver [puppet] - 10https://gerrit.wikimedia.org/r/283651 [12:58:24] akosiaris: it's not obvious to me how could we get defaults.js outside of the repo [12:58:25] (03CR) 10BBlack: [C: 032 V: 032] add https to ferm rules on installserver [puppet] - 10https://gerrit.wikimedia.org/r/283651 (owner: 10BBlack) [12:58:39] and then point zotero to it [12:58:45] there's no mention of that file [12:59:05] so it must be a xulrunner thing [12:59:41] (03PS2) 10Urbanecm: Disable MoodBar on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283647 (https://phabricator.wikimedia.org/T131685) [13:00:17] (03PS2) 10Volans: MariaDB: use Puppet cert for s3 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/283648 (https://phabricator.wikimedia.org/T111654) [13:00:42] (03CR) 10Luke081515: [C: 031] Disable MoodBar on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283647 (https://phabricator.wikimedia.org/T131685) (owner: 10Urbanecm) [13:02:14] (03CR) 10Urbanecm: "Ok, I commented the line as I could see at tawiki's line." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/283647 (https://phabricator.wikimedia.org/T131685) (owner: 10Urbanecm) [13:07:24] (03CR) 10Alexandros Kosiaris: "Nope. https://httpd.apache.org/docs/2.2/mod/core.html#require (we got 2.2 on netmon1001) says:" [puppet] - 10https://gerrit.wikimedia.org/r/283623 (owner: 10Alexandros Kosiaris) [13:08:16] mobrovac: it is [13:08:20] (03PS1) 10Urbanecm: Add domain *.natmus.dk to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283653 (https://phabricator.wikimedia.org/T132748) [13:08:44] akosiaris: ok, so that effectively means we can't deploy it with scap3 [13:09:10] even if set the owner of that file to deploy-service that's useless because scap3 will remove it anyway [13:09:54] we could change the position of the file and ship a symlink to say /etc/zotero/defaults.js in scap3 [13:09:54] (03CR) 10Filippo Giunchedi: [C: 031] "another nitpick, LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/283459 (https://phabricator.wikimedia.org/T131961) (owner: 10Ema) [13:10:14] a dangling one in the repo which will only be a non dangling one in the target nodes [13:10:41] akosiaris: good idea! [13:10:43] that's the easy way out.. assuming xulrunner will not complain [13:10:49] which I doubt it will [13:11:00] akosiaris: i'll add the symlink to the repo and you modify ops/puppet? 
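The workaround agreed on above is to ship a symlink in the deployed repo that dangles until the host provides its target. A minimal Python sketch of that idea, assuming temporary stand-in directories (`repo`, `etc-zotero`) rather than the real deployment and /etc/zotero paths, and a hypothetical helper name `make_config_symlink`:

```python
import os
import tempfile


def make_config_symlink(repo_dir, target_path, name="defaults.js"):
    """Create repo_dir/name as a symlink to target_path.

    The target does not need to exist yet: os.symlink only records the path,
    so the link may dangle until something else creates the target.
    """
    link_path = os.path.join(repo_dir, name)
    os.symlink(target_path, link_path)
    return link_path


with tempfile.TemporaryDirectory() as root:
    # Hypothetical stand-ins for the repo checkout and the host config dir.
    repo = os.path.join(root, "repo")
    etc = os.path.join(root, "etc-zotero")
    os.makedirs(repo)

    target = os.path.join(etc, "defaults.js")
    link = make_config_symlink(repo, target)

    # In the repo the link dangles: it is a symlink, but it resolves to nothing.
    dangling_before = os.path.islink(link) and not os.path.exists(link)

    # On the target node, config management is assumed to create the real file.
    os.makedirs(etc)
    with open(target, "w") as fh:
        fh.write("// config managed outside the repo\n")

    # Now the same link resolves without the repo owning the file.
    resolves_after = os.path.exists(link)
```

Because the repo only owns the link, a deploy tool that replaces the whole checkout never has to touch a root-owned file; whether the consuming program follows symlinks is a separate assumption that the chat itself flags.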
[13:11:05] sure [13:15:04] akosiaris: done in https://gerrit.wikimedia.org/r/#/c/283656/ [13:16:47] (03PS1) 10Alexandros Kosiaris: zotero: change the path of the zotero defaults.js config [puppet] - 10https://gerrit.wikimedia.org/r/283657 [13:17:09] (03PS7) 10Ladsgroup: ores: Add support for running preached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [13:17:50] (03CR) 10jenkins-bot: [V: 04-1] zotero: change the path of the zotero defaults.js config [puppet] - 10https://gerrit.wikimedia.org/r/283657 (owner: 10Alexandros Kosiaris) [13:20:50] (03PS2) 10Alexandros Kosiaris: zotero: change the path of the zotero defaults.js config [puppet] - 10https://gerrit.wikimedia.org/r/283657 [13:22:46] mobrovac: should we test it ? in beta ? [13:22:57] do we even have zotero in beta ? [13:22:58] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2209749 (10BBlack) [13:23:09] akosiaris: we do, deployment-zotero01 [13:23:18] (03PS1) 10BBlack: add dhparam to install web_server [puppet] - 10https://gerrit.wikimedia.org/r/283658 (https://phabricator.wikimedia.org/T132450) [13:23:20] (03PS1) 10BBlack: add LE cert nginx config for carbon [puppet] - 10https://gerrit.wikimedia.org/r/283659 (https://phabricator.wikimedia.org/T132450) [13:23:23] akosiaris: i need a couple of mins more to bring the translators up to date with scap3 [13:23:35] heh, I am the creator of that VM.. imagine how hard I want to erase it from my memory [13:23:47] akosiaris: oh, we can't test in beta i think [13:24:13] akosiaris: https://phabricator.wikimedia.org/T132666 [13:24:18] we can try though [13:25:06] what a mess [13:26:43] (03CR) 10Chad: "Ouch, I didn't see how deep the rabbit hole went. Can abandon if it's not worth trying to do right now." 
[puppet] - 10https://gerrit.wikimedia.org/r/283577 (owner: 10Chad) [13:26:49] (03CR) 10BBlack: [C: 032] add dhparam to install web_server [puppet] - 10https://gerrit.wikimedia.org/r/283658 (https://phabricator.wikimedia.org/T132450) (owner: 10BBlack) [13:27:27] (03CR) 10BBlack: [C: 032] add LE cert nginx config for carbon [puppet] - 10https://gerrit.wikimedia.org/r/283659 (https://phabricator.wikimedia.org/T132450) (owner: 10BBlack) [13:28:10] akosiaris: curious to hear your thoughts on https://phabricator.wikimedia.org/T132747 while we're at it [13:33:05] (03PS3) 10Ema: Workaround for mdadm boot-time race condition [puppet] - 10https://gerrit.wikimedia.org/r/283459 (https://phabricator.wikimedia.org/T131961) [13:35:13] akosiaris: i updated both repos on tin [13:35:16] (03CR) 10Ladsgroup: "Tested, Works like a charm :)" [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [13:35:37] akosiaris: I think we go ahead and deploy the translators with scap3 [13:37:16] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2209853 (10BBlack) So, carbon now has working certs for apt, mirrors, and ubuntu, from Letsencrypt. I ran the cert generation manually, and that part's not... [13:39:15] 06Operations, 10ops-eqiad, 06DC-Ops: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2209878 (10Cmjohnson) a:05Cmjohnson>03RobH This card is a one-off and I agree we should stick with what we know. Assigning back to @robh to comment or resolve. 
[13:40:11] 06Operations, 10ops-eqiad, 06DC-Ops: Broken disk on aqs1001.eqiad.wmnet - https://phabricator.wikimedia.org/T130816#2209883 (10Cmjohnson) 05Open>03Resolved The disk has been replaced, resolving [13:41:49] 06Operations, 10Phabricator, 10Traffic, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2209885 (10mark) I think a backup Phabricator host in codfw would make a lot of sense, and is something we strive for (nearly) every service, anyway. - cod... [13:46:20] !log Starting TLS for shard s3 T111654 [13:46:52] mobrovac: where are you? [13:46:55] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [13:47:02] sorry... was intended for morebots [13:47:03] :) [13:48:08] seems like morebots is on holidays today :) [13:48:54] yeah, was working before... stashbot too took 35s to get the phab task title [13:50:28] (03PS1) 10Muehlenhoff: Extend Hiera data for yubiauth role [puppet] - 10https://gerrit.wikimedia.org/r/283662 [13:50:30] (03PS1) 10Muehlenhoff: Configure an rsync server which is used to synchronise the AEAD key files between the auth servers [puppet] - 10https://gerrit.wikimedia.org/r/283663 [13:54:53] mobrovac: ok let's do the translators then [13:56:18] <_joe_> oh tcpircbot is not working? 
[13:56:42] <_joe_> and I thought my software was in a bad shape :P [13:58:50] (03CR) 10BBlack: [C: 031] Workaround for mdadm boot-time race condition [puppet] - 10https://gerrit.wikimedia.org/r/283459 (https://phabricator.wikimedia.org/T131961) (owner: 10Ema) [13:59:50] 06Operations, 10Analytics-EventLogging: "Throughput of EventLogging NavigationTiming events" UNKNOWN - https://phabricator.wikimedia.org/T132770#2209918 (10fgiunchedi) [14:00:25] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0] [14:03:24] 06Operations, 06Services: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2209932 (10faidon) [14:04:37] elukey: "EventLogging overall insertion rate from MySQL consumer" is WARN, any idea why? [14:04:45] elukey: also that kafka CRIT above [14:06:14] (03PS1) 10Ladsgroup: toollabs: flake8 [puppet] - 10https://gerrit.wikimedia.org/r/283664 [14:06:26] !log start decommission of restbase1009-a.eqiad.wmnet : T95253 [14:06:27] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [14:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:15] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2209952 (10BBlack) @faidon noted on IRC https://github.com/diafygi/acme-tiny might be a better client option, and is debianized already for stretch+ [14:08:13] urandom: you should probably fix your phab profile to mention your name [14:08:30] urandom: I was trying to reference you and @Eric...
wasn't matching anything [14:09:26] paravoid: {{done}} [14:09:31] :D [14:09:34] paravoid: thanks, i didn't realize [14:10:38] (03PS1) 10Muehlenhoff: Add Yubico AEADs to backup [puppet] - 10https://gerrit.wikimedia.org/r/283665 [14:10:43] urandom: https://phabricator.wikimedia.org/T132771#2209932 is where I referenced you fwiw :) [14:11:00] paravoid: yes. i saw. [14:12:45] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [14:13:59] (03CR) 10Alexandros Kosiaris: [C: 031] Add Yubico AEADs to backup [puppet] - 10https://gerrit.wikimedia.org/r/283665 (owner: 10Muehlenhoff) [14:14:19] (03PS1) 10Ladsgroup: mw_rc_irc: flake8 [puppet] - 10https://gerrit.wikimedia.org/r/283666 [14:17:10] (03PS1) 10Faidon Liambotis: Kill eventlogging_NavigationTiming_throughput alert [puppet] - 10https://gerrit.wikimedia.org/r/283667 (https://phabricator.wikimedia.org/T132770) [14:17:24] akosiaris: deployment successful! [14:17:46] akosiaris: are we bold enough to try translation-server as well? [14:18:07] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Kill eventlogging_NavigationTiming_throughput alert [puppet] - 10https://gerrit.wikimedia.org/r/283667 (https://phabricator.wikimedia.org/T132770) (owner: 10Faidon Liambotis) [14:18:12] 06Operations, 10DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2143705 (10Volans) a:03Volans [14:18:58] bast1001 fingerprint changed? https://wikitech.wikimedia.org/w/index.php?title=Help%3ASSH_Fingerprints%2Fbast1001.wikimedia.org&type=revision&diff=435718&oldid=192159 [14:19:22] yes AndyRussG, see ops-l [14:19:55] 06Operations, 10DBA: es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2209986 (10jcrespo) p:05Normal>03High [14:21:20] mobrovac: sure, why not ? [14:21:34] mobrovac: K thx! [14:21:40] mobrovac: want to avoid deployed to say sca1002 ? 
[14:21:51] s/deployed/deploying/ [14:22:07] akosiaris: we can first try to deploy only to codfw [14:22:16] that would work too [14:22:25] scap3 ftw! [14:22:33] ok lemme merge that patch then [14:22:42] akosiaris: your move-defaults-to-/etc patch has been merged and applied? [14:22:44] ah ok [14:22:46] hehe [14:22:57] it should be a noop until I delete the file manually [14:23:19] (03PS3) 10Alexandros Kosiaris: zotero: change the path of the zotero defaults.js config [puppet] - 10https://gerrit.wikimedia.org/r/283657 [14:23:21] noop? shouldn't the file go into /etc/zotero? [14:23:27] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] zotero: change the path of the zotero defaults.js config [puppet] - 10https://gerrit.wikimedia.org/r/283657 (owner: 10Alexandros Kosiaris) [14:23:36] yes, but the original won't be deleted [14:24:06] 06Operations, 06Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#2209990 (10Multichill) >>! In T95714#2207876, @Andrew wrote: > This ticket has a terrible, unclear title, and even after reading the ticket I'm not 100% sure what it's about. > I think we ca... [14:24:17] akosiaris: seems to be missing require => File['/etc/zotero'] [14:24:23] but i guess we can get away with it [14:24:31] it's being auto required [14:24:35] it's an implicit dependency [14:24:43] don't you love puppet ? ;-) [14:25:08] * mobrovac <3 puppet with all its quirks! [14:25:10] there aren't many implicit dependencies btw, and that's the only one I remember [14:25:31] or, in the case of puppet, quirks == features [14:25:44] windows me anyone ?
[14:25:47] urandom: hey, quick question [14:25:57] paravoid: shoot [14:25:58] urandom: we have a check for highestMax(cassandra.restbase10*.org.apache.cassandra.metrics.ColumnFamily.all.SSTablesPerReadHistogram.99percentile, 1) [14:26:07] it's tripping a lot right now, warns at 15 [14:26:12] yeah :( [14:26:20] (critical is at 30, not sure if that's tripping too) [14:26:28] should we just bump the thresholds a little bit? [14:26:36] +1 [14:26:49] we already raised it once, maybe we should raise it again, but it is also indicative of an actual problem [14:27:27] imho all of these thresholds should be revisited anyway once we reach the holy grail - a stable cassandra install [14:27:52] s/all/most/ [14:28:00] holy grail indeed since it's a moving target :) [14:28:15] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0] [14:28:19] even if it is indicative of a real problem, there is little point in having non-actionable alerts for it all the time [14:28:22] yah, indiana jones had an easier task [14:28:47] paravoid: yeah, fair enough. [14:29:40] any suggestions? [14:29:49] fortunately there's > 1 FTE available for it ;) [14:30:03] mark: ?? [14:30:14] ~ 1 FTE [14:30:44] paravoid: looking... [14:31:17] thank you :) [14:31:33] mobrovac: to avoid any unnecessary pages, I 'd say we do sca2001, sca1001 [14:31:55] I 'll stop zotero on them and remove the file so you can deploy [14:31:55] paravoid: bumping those values by 10 should paper over it for a while; i can submit a gerrit [14:32:03] akosiaris: why touching eqiad at this point when we don't have to? [14:32:22] ok sca2001 only then ? 
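The threshold exchange above (warns at 15, critical at 30, bump both by 10) can be illustrated with a small Python sketch. This is not the actual check implementation: the function name `classify`, the 50% share rule, and the sample values are all assumptions made for illustration; only the 15/30 thresholds and the +10 bump come from the conversation.

```python
def classify(datapoints, warn, crit, percent=50.0):
    """Classify a series by the share of datapoints above each threshold.

    Returns 'CRITICAL' if at least `percent` of the points exceed `crit`,
    'WARNING' if at least `percent` exceed `warn`, 'OK' otherwise, and
    'UNKNOWN' when there is no data at all.
    """
    values = [v for v in datapoints if v is not None]
    if not values:
        return "UNKNOWN"

    def share_above(limit):
        return 100.0 * sum(1 for v in values if v > limit) / len(values)

    if share_above(crit) >= percent:
        return "CRITICAL"
    if share_above(warn) >= percent:
        return "WARNING"
    return "OK"


# Hypothetical 99th-percentile SSTables-per-read samples, hovering above 15.
samples = [14, 17, 18, 16, 19, 21]

before = classify(samples, warn=15, crit=30)   # thresholds quoted in the chat
after = classify(samples, warn=25, crit=40)    # both bumped by 10
```

With these made-up samples the check warns under the old thresholds and goes quiet under the bumped ones, which is the "paper over it for a while" effect described; the underlying read amplification is unchanged.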
[14:32:30] akosiaris: we might as well go only with sca2001 [14:32:32] yup [14:32:34] ok [14:32:55] (03PS1) 10Faidon Liambotis: Bump Cassandra SSTables per-read alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/283668 [14:32:55] i don't think you need to stop zotero, just remove the file [14:32:56] ok, I 'll stop zotero and remove the file then on sca2001 [14:33:01] ah yes [14:33:02] ok [14:33:17] done [14:33:21] kk, deploying [14:33:33] (03PS2) 10Faidon Liambotis: Bump Cassandra SSTables per-read alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/283668 [14:33:39] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Bump Cassandra SSTables per-read alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/283668 (owner: 10Faidon Liambotis) [14:33:48] akosiaris: seemed to have worked! [14:33:53] checking [14:34:37] paravoid: thanks [14:34:49] thank you :) [14:35:18] elukey: here? [14:35:20] akosiaris: yup, it works! [14:35:40] ok [14:35:46] (03CR) 10Dereckson: Add domain *.natmus.dk to wgCopyUploadsDomains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283653 (https://phabricator.wikimedia.org/T132748) (owner: 10Urbanecm) [14:35:49] so, let's do the rest ? [14:35:49] akosiaris: remove the file on sca[12]00[12] as well to do a full deploy? [14:35:52] yes [14:36:18] done [14:36:32] k, deploying everywhere [14:36:47] akosiaris: {{done}} ! [14:36:53] that was fast [14:37:01] ok [14:37:15] and without even an alert ... nice [14:37:27] (03PS1) 10Alex Monk: deployment-prep shinken: fix check_command for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/283669 (https://phabricator.wikimedia.org/T132733) [14:37:52] that's a very very good start of the week-end! [14:38:24] :-) [14:39:35] 06Operations, 10Traffic: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2210011 (10BBlack) Getting a little closer on the standard front!
It's `Submitted to IESG for Publication` and the IESG state is `On agenda of 2016-05-05 IESG telechat // Needs 9 more YES or NO OBJEC... [14:39:38] <_joe_> this is me ^^ [14:39:43] (03PS2) 10Urbanecm: Add domain *.natmus.dk to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283653 (https://phabricator.wikimedia.org/T132748) [14:39:45] <_joe_> (logmsgbot restarting) [14:40:23] (03CR) 10EBernhardson: "This upgrade needs to happen in one datacenter, we need to do a software switchover in mediawiki to a new version that supports 2.x, then " [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) (owner: 10Gehel) [14:40:35] <_joe_> meh, I don't seem to be able to log to it from conftool [14:40:39] !log reindexing all wikis after the switch to codfw (T132762) [14:40:40] T132762: Reindex all pages edited since Apr 7 2016 - 14h00 UTC - https://phabricator.wikimedia.org/T132762 [14:40:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:15] 06Operations, 06Services: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2210015 (10fgiunchedi) * see also {T113733} for a related discussion to cassandra and graphite disk usage, the median/percentile/etc metrics are old artifacts, listing from a recently-created instance: ```... [14:42:31] (03CR) 10Gehel: "I don't think that we are using elasticsearch on Precise. I'll remove that part." [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) (owner: 10Gehel) [14:43:15] <_joe_> uhm this must've been puppet reverting my change [14:43:35] (03CR) 10Ottomata: Override kafkatee's default logrotate/rsyslog configuration. (031 comment) [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/283411 (https://phabricator.wikimedia.org/T132322) (owner: 10Elukey) [14:43:47] ottomata: morning! [14:43:56] good morning! 
:) [14:44:06] ottomata: "EventLogging overall insertion rate from MySQL consumer" is warning, "Kafka Broker Replica Max Lag" is critical [14:44:20] paravoid seems to be haunting people today :) [14:44:22] Kafka Broker Replica Max lag [14:44:28] is the most annoying thing [14:44:29] mobrovac: not just today :P [14:44:33] i sent an email about it a months ago [14:44:39] it should go away after upgarde [14:44:40] upgrade [14:44:50] we adjusted thresholds once to get rid of it, but it came back :/ [14:44:52] looking at maxlag... [14:44:54] sorry [14:44:55] uh [14:44:57] insertion rate [14:46:07] ottomata: there was also an eventlogging UNKNOWN btw, opened https://phabricator.wikimedia.org/T132770 for it [14:46:14] s/was/is/ [14:46:22] was [14:46:24] I killed it [14:46:26] * mark will buy paravoid a whip [14:46:33] and expense it [14:46:42] hehehe [14:46:55] lol [14:47:04] https://www.youtube.com/watch?v=ggXbzjnffAo [14:47:07] ottomata: and if maxlag isn't useful until kafka 0.9, let's kill it? [14:47:18] (03PS2) 10Gehel: Import elasticsearch 2.x into our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) [14:47:22] well, its not that its not useful, its that it is spiking and flapping [14:47:38] if it alarmed and stayed alarmed for a long period and got worse, then it would be useful [14:47:42] but ja [14:47:50] we should maybe adjust thresholds more [14:48:11] paravoid: https://phabricator.wikimedia.org/T121407 [14:48:15] 06Operations, 06Services: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2210075 (10mobrovac) +1 for getting rid of `meta` metrics. #RESTBase reads those tables only once on start-up and can't start without them. Also of note is that each of these tables holds only one record (t...
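The point made above about the Kafka max-lag alert (a spike that flaps is noise, a breach that stays is useful) can be sketched as a "sustained breach" rule. This is only an illustration: the function name `sustained_breach`, the window of three checks, and both sample series are hypothetical; the 5000000 threshold is the critical value shown in the alert text earlier in the log.

```python
def sustained_breach(values, threshold, min_consecutive):
    """Return True only if the most recent `min_consecutive` values all exceed `threshold`.

    A single spike followed by recovery does not count, which suppresses
    flapping alerts while still catching a lag that stays high.
    """
    if min_consecutive <= 0 or len(values) < min_consecutive:
        return False
    return all(v > threshold for v in values[-min_consecutive:])


CRIT = 5_000_000  # critical threshold quoted in the alert messages

# Hypothetical lag series: one that spikes and recovers, one that stays high.
spiky = [0, 6_000_000, 0, 7_000_000, 0, 0]
stuck = [0, 5_500_000, 6_000_000, 6_500_000, 7_000_000, 8_000_000]

spiky_alerts = sustained_breach(spiky, CRIT, min_consecutive=3)
stuck_alerts = sustained_breach(stuck, CRIT, min_consecutive=3)
```

The trade-off is detection delay: requiring several consecutive breaches means a real problem is reported a few check intervals later than with a single-sample rule.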
[14:48:34] 06Operations, 10RESTBase-Cassandra, 06Services: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2210078 (10mobrovac) [14:50:21] (03CR) 10Faidon Liambotis: "(This is for reprepro, not apt, so pinning isn't at play here.)" [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) (owner: 10Gehel) [14:50:28] (03CR) 10Faidon Liambotis: [C: 04-1] Import elasticsearch 2.x into our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) (owner: 10Gehel) [14:52:30] (03CR) 10Dereckson: [C: 031] Add domain *.natmus.dk to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283653 (https://phabricator.wikimedia.org/T132748) (owner: 10Urbanecm) [14:52:34] PROBLEM - puppet last run on mw2090 is CRITICAL: CRITICAL: Puppet has 1 failures [14:53:35] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 1 failures [14:53:43] (03CR) 10Gehel: "So is there no way to have both 1.7.x and 2.x available at the same time? We could have 2.x available only in Trusty at first (CirrusSearc" [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) (owner: 10Gehel) [14:56:13] (03CR) 10EBernhardson: "the debian way to have both is to name the package after the versions, such as making an elasticsearch2 package. Might be reasonable?" [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) (owner: 10Gehel) [14:58:26] 06Operations, 10RESTBase-Cassandra, 06Services: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2210190 (10faidon) Thanks to both, these are both useful and will cut down our metrics significantly. Let's do those :) More questions: - Good to know these were artifacts. Still, d... [15:11:26] paravoid: re: the puppetization of tune2fs to remove reserved space, do you have any suggestions for how to go about that? 
Pointers to something similar I could use as an example would work. [15:12:00] urandom: modules/swift/manifests/init_device.pp is doing mkfs [15:12:01] (03PS3) 10Giuseppe Lavagetto: Make output more readable [software/conftool] - 10https://gerrit.wikimedia.org/r/283642 [15:12:03] (03PS4) 10Giuseppe Lavagetto: Log all write activity to an irc bot [software/conftool] - 10https://gerrit.wikimedia.org/r/280843 [15:12:07] not the same, but not that different I think [15:15:09] paravoid: auh, is it the 'unless' there that prevents this from running more than one time? [15:15:54] that was that part i was wondering how to go about [15:17:30] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:18:40] RECOVERY - puppet last run on mw2090 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:19:45] (03PS2) 10Muehlenhoff: Add Yubico AEADs to backup [puppet] - 10https://gerrit.wikimedia.org/r/283665 [15:19:52] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add Yubico AEADs to backup [puppet] - 10https://gerrit.wikimedia.org/r/283665 (owner: 10Muehlenhoff) [15:20:02] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:24:07] (03PS4) 10Ema: Workaround for mdadm boot-time race condition [puppet] - 10https://gerrit.wikimedia.org/r/283459 (https://phabricator.wikimedia.org/T131961) [15:24:16] (03CR) 10Ema: [C: 032 V: 032] Workaround for mdadm boot-time race condition [puppet] - 10https://gerrit.wikimedia.org/r/283459 (https://phabricator.wikimedia.org/T131961) (owner: 10Ema) [15:28:47] (03PS1) 10Ottomata: Adjust eventlogging icinga alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/283673 (https://phabricator.wikimedia.org/T132770) [15:29:01] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: puppet fail [15:30:10] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: puppet fail [15:32:13] RECOVERY - puppet 
last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:34:01] !log remove cassandra metrics for restbase100[1234]* restbase100[789] restbase2004 - T132771 [15:34:02] T132771: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771 [15:34:03] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:33] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2198925 (10Krenair) I'd like to see it puppetised for T97593#2115226 [15:36:32] (03CR) 10BryanDavis: "The way that Nick designed for the frequent updates we were doing in the early days was to scp the deb(s) as part of the rolling restart p" [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) (owner: 10Gehel) [15:37:45] (03PS2) 10Ottomata: Adjust eventlogging icinga alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/283673 (https://phabricator.wikimedia.org/T132770) [15:38:27] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2210442 (10BBlack) >>! In T132450#2210430, @Krenair wrote: > I'd like to see it puppetised for T97593#2115226 Yeah me too for a lot of things, but labs will... 
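On the earlier tune2fs question (how to keep the command from running more than once): a minimal Python sketch of the same guard idea that an `unless` clause provides, not Puppet code and not what modules/swift/manifests/init_device.pp actually does. It assumes the usual `tune2fs -l` listing with a "Reserved block count" line; the function names and the sample device are hypothetical.

```python
import re


def reserved_block_count(tune2fs_listing: str) -> int:
    """Read the 'Reserved block count' value out of `tune2fs -l <device>` output."""
    match = re.search(r"^Reserved block count:\s*(\d+)\s*$", tune2fs_listing, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Reserved block count' line found")
    return int(match.group(1))


def plan_commands(device: str, tune2fs_listing: str) -> list:
    """Return the commands still needed; an empty list means nothing to do.

    This mirrors an `unless`-style guard: the change is only planned while
    the filesystem still has reserved blocks, so repeated runs are no-ops.
    """
    if reserved_block_count(tune2fs_listing) > 0:
        # `tune2fs -m 0` sets the reserved-blocks percentage to zero.
        return [["tune2fs", "-m", "0", device]]
    return []


# Hypothetical listings, trimmed to the lines that matter here.
BEFORE = "Block count:              2621440\nReserved block count:     131072\n"
AFTER = "Block count:              2621440\nReserved block count:     0\n"

print(plan_commands("/dev/sdb1", BEFORE))
print(plan_commands("/dev/sdb1", AFTER))
```

The point of the design is that the check reads current state rather than remembering that the command ran, so it stays correct after a re-image or a manual change.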
[15:39:36] (03CR) 10Ottomata: [C: 032] Adjust eventlogging icinga alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/283673 (https://phabricator.wikimedia.org/T132770) (owner: 10Ottomata) [15:44:15] (03PS1) 10Ottomata: Disable MaxLag icinga check in kafka::server::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/283675 (https://phabricator.wikimedia.org/T121407) [15:44:35] (03CR) 10jenkins-bot: [V: 04-1] Disable MaxLag icinga check in kafka::server::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/283675 (https://phabricator.wikimedia.org/T121407) (owner: 10Ottomata) [15:45:48] (03PS2) 10Ottomata: Disable MaxLag icinga check in kafka::server::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/283675 (https://phabricator.wikimedia.org/T121407) [15:46:09] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2210476 (10Krenair) >>! In T132450#2210442, @BBlack wrote: > there's no "webroot" to go stuff a file in and have it appear publicly Varnish should be able t... 
[15:46:26] (03CR) 10Ottomata: [C: 032] Disable MaxLag icinga check in kafka::server::monitoring [puppet/kafka] - 10https://gerrit.wikimedia.org/r/283675 (https://phabricator.wikimedia.org/T121407) (owner: 10Ottomata) [15:47:37] (03PS1) 10Ottomata: Update kafka submodule disabling MaxLax icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/283677 (https://phabricator.wikimedia.org/T121407) [15:48:00] (03CR) 10Ottomata: [C: 032] Update kafka submodule disabling MaxLax icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/283677 (https://phabricator.wikimedia.org/T121407) (owner: 10Ottomata) [15:48:08] (03CR) 10Ottomata: [V: 032] Update kafka submodule disabling MaxLax icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/283677 (https://phabricator.wikimedia.org/T121407) (owner: 10Ottomata) [15:48:20] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2210480 (10Krenair) >>! In T132450#2210476, @Krenair wrote: >>>! In T132450#2210442, @BBlack wrote: >> there's no "webroot" to go stuff a file in and have it... [15:51:52] (03PS1) 10Ema: Reuse update-initramfs in lvs::balancer and interface::rps [puppet] - 10https://gerrit.wikimedia.org/r/283678 [15:52:13] qchris_: You about? [15:52:43] Is it about the gerrit change? [15:52:52] I saw it and I'll look at it over the weekend. [15:53:08] The package? No. I was having an issue yesterday trying to setup a new replication destination. [15:53:17] Ah. Ok. [15:53:34] I added the key to known_hosts but was still getting rejectkey errors [15:54:17] Mhmm. Not sure. I'd have to look at it. But I should leave right now ... will you be around in a few hours? [15:54:28] (03PS1) 10Chad: Revert "Cut off lead replication for a bit. SSH is busted" [puppet] - 10https://gerrit.wikimedia.org/r/283679 [15:54:45] qchris_: Yeah, my day's just getting started. Let me know when you've got some time :) [15:54:52] Cool beans. 
[15:55:04] I'll ping you later. [15:55:47] (03PS1) 10Giuseppe Lavagetto: Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/283680 [15:56:11] 06Operations, 10Analytics-Wikistats, 07Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2010000 (10Nuria) stats.wikimedia.org will nor be fully replaced for quite some many months and even then we will keep old "froze... [15:56:17] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2210516 (10Papaul) [15:57:37] (03CR) 10Giuseppe Lavagetto: [C: 032] Log all write activity to an irc bot [software/conftool] - 10https://gerrit.wikimedia.org/r/280843 (owner: 10Giuseppe Lavagetto) [15:58:12] (03CR) 10Giuseppe Lavagetto: [C: 032] Make output more readable [software/conftool] - 10https://gerrit.wikimedia.org/r/283642 (owner: 10Giuseppe Lavagetto) [15:58:46] (03CR) 10Giuseppe Lavagetto: [C: 032] Version bump [software/conftool] - 10https://gerrit.wikimedia.org/r/283680 (owner: 10Giuseppe Lavagetto) [15:59:59] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2210525 (10Papaul) [16:01:02] (03PS1) 10Andrew Bogott: Remove references to stats.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283681 (https://phabricator.wikimedia.org/T126281) [16:02:49] (03PS2) 10Chad: Invalidate InitialiseSettings cache anytime config changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273004 [16:03:54] (03PS1) 10Urbanecm: Add HD versions of logo for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) [16:05:46] (03CR) 10Chad: [C: 032] "Actually it works exactly like I hoped. Go me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/273004 (owner: 10Chad) [16:06:32] (03Merged) 10jenkins-bot: Invalidate InitialiseSettings cache anytime config changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/273004 (owner: 10Chad) [16:08:24] !log demon@tin Synchronized wmf-config/CommonSettings.php: more liberal initialisesettings invalidation (duration: 00m 42s) [16:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:38] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2198925 (10Southparkfan) I have a setup where I tell Varnish to redirect all /.well-known/acme-challenge traffic to one backend server (practically any serve... [16:16:16] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2210645 (10Gehel) [16:21:35] 06Operations, 10RESTBase-Cassandra, 06Services: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2210653 (10Eevans) As I've mentioned elsewhere, this topic always causes me some culture shock; I've grown accustomed to approaching this from the other end of the perspective, where... [16:25:46] 06Operations, 10RESTBase-Cassandra, 06Services: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2210659 (10Eevans) >>! In T132771#2210075, @mobrovac wrote: > +1 for getting rid of `meta` metrics. #RESTBase reads those tables only once on start-up and can't start without them. A... 
[16:28:57] (03CR) 10Nuria: [C: 031] Remove references to stats.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283681 (https://phabricator.wikimedia.org/T126281) (owner: 10Andrew Bogott) [16:29:48] (03PS1) 10Ottomata: Change stats user git email address [puppet] - 10https://gerrit.wikimedia.org/r/283684 [16:31:23] (03CR) 10Ottomata: [C: 032 V: 032] Change stats user git email address [puppet] - 10https://gerrit.wikimedia.org/r/283684 (owner: 10Ottomata) [16:33:51] (03PS3) 10Gehel: Import elasticsearch 2.x into our APT repository [puppet] - 10https://gerrit.wikimedia.org/r/283466 (https://phabricator.wikimedia.org/T132376) [16:44:25] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2210715 (10Krenair) [16:48:03] paravoid: [16:48:04] https://gerrit.wikimedia.org/r/#/c/283673/ [16:48:05] https://gerrit.wikimedia.org/r/#/c/283675/ [16:48:07] I HOPE YOU'RE HAPPY [16:48:36] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2210736 (10Krenair) >>! In T97593#2115226, @Krenair wrote: > I am also wondering what the best way is to put the Let's Encrypt... [16:48:50] (03PS8) 10Ladsgroup: ores: Add support for running preached as a systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [16:53:19] 06Operations, 10Analytics, 10Analytics-Cluster, 10Traffic: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 (or 0.10?) - https://phabricator.wikimedia.org/T121562#2210775 (10Nuria) [17:17:33] 06Operations, 13Patch-For-Review: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961#2210833 (10ema) This issue should be fixed now: https://gerrit.wikimedia.org/r/#/c/283459/. 
I'd leave the ticket open given that we (successfully) rebooted just a co... [17:20:10] (03PS1) 10Alex Monk: Add nlwiki to deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283689 (https://phabricator.wikimedia.org/T118005) [17:20:38] (03CR) 10Alex Monk: "If anything is missing here it's probably also missing at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Add_a_wiki#Ste" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283689 (https://phabricator.wikimedia.org/T118005) (owner: 10Alex Monk) [17:27:31] PROBLEM - MariaDB Slave SQL: s2 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Error Incorrect key file for table iwlinks: try to repair it on query. Default database: svwiki. Query: DELETE /* LinksUpdate::incrTableUpdate 127.0.0.1 */ FROM iwlinks WHERE iwl_from = 6434916 AND ((iwl_prefix = d AND iwl_title IN (Q1,Q100196,Q1003183,Q1006733,Q100995,Q101017,Q101065,Q101313,Q101487,Q10149 [17:28:59] table corruption, the best kind of corruption [17:29:12] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2210861 (10csteipp) >>! In T50501#2210736, @Krenair wrote: >>>! In T97593#2115226, @Krenair wrote: >> I am also wondering what... [17:29:15] * volans looking [17:29:47] incorrect index key is toku or myisam not wanting to cooperate [17:30:20] 06Operations, 10ops-codfw: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2210876 (10Papaul) [17:30:59] toku it is [17:31:07] can I recreate it? [17:31:25] I think this is the first time this happening on dbstore1002 [17:31:47] but it is still 22, so it could be fixed in 23 [17:32:09] is so different the tokudb version between 22 and 23? 
[17:32:13] I cannot demonstrate it is not fixed [17:32:15] no [17:32:44] but if it is not the latest version, they say they do not take responsibility, because 23 had lots of improvements [17:32:48] [17:33:03] can I recreate it or are you? [17:33:15] go for it [17:33:20] (I can do it, go take a break) [17:33:25] * volans holding hands [17:33:44] (03CR) 10Luke081515: [C: 031] Add HD versions of logo for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) (owner: 10Urbanecm) [17:33:56] I mean, it is a table with no primary key, I do not hold them responsible [17:34:03] lol [17:34:18] are you repairing it or re-importing it? [17:34:25] recreating it [17:34:34] as in alter ENGINE=InnoDB; [17:34:47] that works every time, 99% percent of the time [17:34:49] ah ok [17:34:56] (it is index corruption, not data) [17:35:21] fixed [17:35:37] the "not using a crappy engine" helps, too [17:35:56] aka https://phabricator.wikimedia.org/T109069 [17:35:57] RECOVERY - MariaDB Slave SQL: s2 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [17:37:12] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2210895 (10Krenair) >>! In T50501#2210861, @csteipp wrote: > Ftr, I'd love to see us move in the direction of the let's encrypt... [17:42:18] (03PS2) 10Andrew Bogott: Remove references to stats.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283681 (https://phabricator.wikimedia.org/T126281) [17:42:26] 06Operations, 10RESTBase-Cassandra, 06Services: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2209932 (10GWicke) What is the timeline on T85451? 
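For the dbstore1002 fix above ("as in alter ENGINE=InnoDB;"): a small Python sketch that only builds the statement text. The log does not show the exact command that was run; the database and table names are taken from the alert (svwiki, iwlinks), and the helper names are hypothetical.

```python
def quote_identifier(name: str) -> str:
    """Backtick-quote a MySQL/MariaDB identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def rebuild_statement(database: str, table: str, engine: str = "InnoDB") -> str:
    """Build an ALTER TABLE that changes the storage engine.

    Changing the engine makes the server copy the table and rebuild its
    indexes, which is why it helps when only the index is corrupt.
    """
    if not engine.isalnum():
        raise ValueError("unexpected engine name: %r" % engine)
    return "ALTER TABLE %s.%s ENGINE=%s" % (
        quote_identifier(database),
        quote_identifier(table),
        engine,
    )


print(rebuild_statement("svwiki", "iwlinks"))
```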
[17:46:13] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2210917 (10Smalyshev) 05Open>03Resolved I think this is done? [17:46:14] (03CR) 10Andrew Bogott: [C: 032] Remove references to stats.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/283681 (https://phabricator.wikimedia.org/T126281) (owner: 10Andrew Bogott) [17:49:18] 06Operations, 10Analytics-Wikistats, 13Patch-For-Review, 07Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281#2210923 (10Andrew) 05Open>03declined I've removed the .wikipedia.org puppet code, and I'm closing this... [17:52:58] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:57:07] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:57:37] hey guys, is there any reason why we might block an ip address accessing our site? Just got an email to OTRS about this [17:57:50] and/or is there anyone I can forward this to so they can deal with it [17:58:47] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [17:59:57] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:02:58] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:05:56] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:07:07] Cookies52, last time I was an OTRS user (aka billion years ago) there was a noc queue for that [18:14:00] 06Operations, 10Traffic, 07HTTPS: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2211018 (10BBlack) [18:16:55] 06Operations, 10Traffic, 07HTTPS: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2211018 (10valhallasw) Some notes on my ideas on how to do this for tool labs are at {T122403}, but that's a more complex scenario than a simple single webserver. [18:24:37] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2211054 (10Nuria) @BBlack This would not be a full-fledged service. What we would be deploying either via puppet of fab is just html/js so we only really need an apache install via pup... [18:32:11] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2211069 (10BBlack) @Nuria - thanks for the details! We still need to sort out an actual place for the js/html to live at in production (which, if it's as simple as it sounds, can probabl... 
[18:39:26] 06Operations, 10Beta-Cluster-Infrastructure, 06Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2211077 (10Krenair) It may or may not be a strict blocker but we should probably wait for {T132812} [18:47:48] 06Operations, 10Traffic, 07HTTPS: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2211085 (10BBlack) [18:50:48] 06Operations, 10Traffic, 07HTTPS: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2211103 (10BBlack) Edited description - having to stop the existing service is a problem for renewals, we still have a challenge to do there. Also, we could support nginx/... [18:53:21] (03CR) 10Florianschmidtwelzow: [C: 031] Add HD versions of logo for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) (owner: 10Urbanecm) [19:01:38] (03PS2) 10Urbanecm: Add HD versions of logo for dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) [19:03:44] (03CR) 10Urbanecm: "My "new patch" was rebase. I have no idea why it wasn't noted in the comment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) (owner: 10Urbanecm) [19:04:30] 06Operations, 10Traffic, 07HTTPS: Preload STS for wikimedia.org - https://phabricator.wikimedia.org/T132685#2211138 (10BBlack) Note that T132450 is already resolved in practice. The ticket is just still open because we need to puppetize decent administration of the solution before the certs expire 90 days f... [19:08:14] 06Operations, 10Analytics, 10DNS, 10Traffic: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2211146 (10Nuria) @BBlack : ganeti sounds fine as really the majority of the time requests are going to be served by varnish. 
The fabfile we use to deploy to labs is here: https://github... [19:13:12] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2211153 (10ori) >>! In T96848#2203163, @BBlack wrote: > Notable: there's an ongoing report of 1.9.14 causing an HTTP/2 proto error in Chrome. We may need to be wary and stick with .13... [19:16:36] (03PS1) 10Merlijn van Deen: [puppet] DO NOT SUBMIT -- auth.conf for resource_types search [puppet] - 10https://gerrit.wikimedia.org/r/283696 [19:16:44] andrewbogott: ^ [19:17:02] probably not the right config file either, but at least it's recorded somewhere [19:17:28] thanks! [19:18:01] * andrewbogott tries it, breaks everything [19:18:54] valhallasw`cloud: oh, hm, you tested on a box running puppetmaster [19:19:05] whereas I need it to work with passenger, I wonder if that's why I couldn't make it work [19:19:27] andrewbogott: passenger? [19:19:41] https://docs.puppet.com/guides/passenger.html [19:20:26] valhallasw`cloud: for my reference can you paste the complete url you are using to get the list? 
[19:20:31] 06Operations, 06Labs, 10wikitech.wikimedia.org: Nutcracker having issues on wikitech - https://phabricator.wikimedia.org/T115457#2211172 (10Krenair) 05Open>03Resolved Icinga says this is fine at the moment, duration: 29d 1h 2m 58s [19:20:39] andrewbogott: curl -k "https://toolsbeta-puppetmaster3:8140/puppet/resource_types/role::*" [19:20:47] thanks [19:30:07] 06Operations: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2211186 (10Dzahn) a:03Dzahn [19:30:32] 06Operations: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#2209585 (10Dzahn) ok, i will start with the install2001 upgrade at least [19:39:50] 06Operations: move human users out of UID range for system accounts - https://phabricator.wikimedia.org/T114446#2211192 (10Dzahn) [19:43:15] Did anyone answer my question above? Didn't get a response! [19:44:05] Cookies52: 11:11 < MaxSem> Cookies52, last time I was an OTRS user (aka billion years ago) there was a noc queue for that [19:44:41] do you mean blocked from editing for blocked from even reading [19:45:38] Cookies52, if they're having issues connecting, get them to provide a traceroute [19:49:16] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: puppet fail [19:50:22] 06Operations, 06Performance-Team: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2211207 (10ori) Using [[ https://github.com/memcached/memcached/blob/master/doc/protocol.txt#L809-813 | the formula provided in the memcached documentation ]], I wrote a script to pr... [19:52:58] I can't find a noc queue, and I'll try that Krenair and get back [19:53:00] thanks! 
[19:54:04] Cookies52: if they cant connect at all, you can open a ticket in phabricator [19:54:19] if they are just blocked from editing then OTRS please [19:55:28] mutante, it's a complete block from accessing - I'll get them to send me a traceroute and post that too [19:55:32] what should I file it under> [19:55:34] ? [19:56:11] operations [19:56:27] maybe netops [19:56:29] ok, makes sense :P Thanks a lot [20:11:54] (03PS2) 10Dzahn: Extend Hiera data for yubiauth role [puppet] - 10https://gerrit.wikimedia.org/r/283662 (owner: 10Muehlenhoff) [20:12:00] (03CR) 10Dzahn: [C: 032] Extend Hiera data for yubiauth role [puppet] - 10https://gerrit.wikimedia.org/r/283662 (owner: 10Muehlenhoff) [20:17:01] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [20:19:29] (03PS1) 10Andrew Bogott: Eliminate the old, unlintable us_labs_puppet_master global [puppet] - 10https://gerrit.wikimedia.org/r/283726 [20:19:31] (03PS1) 10Andrew Bogott: Removed the unused is_puppet_master var [puppet] - 10https://gerrit.wikimedia.org/r/283727 [20:19:33] (03PS1) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 [20:20:36] (03PS2) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 [20:24:10] (03CR) 10jenkins-bot: [V: 04-1] Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 (owner: 10Andrew Bogott) [20:24:16] (03CR) 10Merlijn van Deen: [C: 04-1] Allow horizon to query the labs puppetmaster for a list of classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283728 (owner: 10Andrew Bogott) [20:29:34] (03PS2) 10Andrew Bogott: Removed the unused is_puppet_master var [puppet] - 10https://gerrit.wikimedia.org/r/283727 [20:29:36] (03PS2) 10Andrew Bogott: Eliminate the old, unlintable 
us_labs_puppet_master global [puppet] - 10https://gerrit.wikimedia.org/r/283726 [20:29:38] (03PS3) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 [20:31:28] (03CR) 10jenkins-bot: [V: 04-1] Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 (owner: 10Andrew Bogott) [20:32:20] (03CR) 10Alex Monk: "what about per-project puppetmasters?" [puppet] - 10https://gerrit.wikimedia.org/r/283728 (owner: 10Andrew Bogott) [20:33:29] (03CR) 10Merlijn van Deen: "https://gerrit.wikimedia.org/r/283696 provides that, but I'm not sure if we should -- queries can easily DOS a puppetmaster (queries can t" [puppet] - 10https://gerrit.wikimedia.org/r/283728 (owner: 10Andrew Bogott) [20:43:06] (03PS3) 10Andrew Bogott: Removed the unused is_puppet_master var [puppet] - 10https://gerrit.wikimedia.org/r/283727 [20:43:08] (03PS3) 10Andrew Bogott: Eliminate the old, unlintable us_labs_puppet_master global [puppet] - 10https://gerrit.wikimedia.org/r/283726 [20:43:10] (03PS4) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 [20:44:55] (03CR) 10jenkins-bot: [V: 04-1] Eliminate the old, unlintable us_labs_puppet_master global [puppet] - 10https://gerrit.wikimedia.org/r/283726 (owner: 10Andrew Bogott) [20:45:19] (03CR) 10jenkins-bot: [V: 04-1] Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 (owner: 10Andrew Bogott) [20:48:45] (03PS1) 10Volans: Repool with low weight new s3 DB after TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283748 (https://phabricator.wikimedia.org/T111654) [20:49:23] (03CR) 10Merlijn van Deen: [C: 04-1] "Sorry, I think my suggestion was not correct erb syntax -- if I read the docs correctly this one is..." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283728 (owner: 10Andrew Bogott) [20:49:45] (03PS4) 10Andrew Bogott: Removed the unused is_puppet_master var [puppet] - 10https://gerrit.wikimedia.org/r/283727 [20:49:47] (03PS4) 10Andrew Bogott: Eliminate the old, unlintable us_labs_puppet_master global [puppet] - 10https://gerrit.wikimedia.org/r/283726 [20:49:49] (03PS5) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 [20:49:51] (03CR) 10Luke081515: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283682 (https://phabricator.wikimedia.org/T132792) (owner: 10Urbanecm) [20:50:43] (03CR) 10Merlijn van Deen: Allow horizon to query the labs puppetmaster for a list of classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/283728 (owner: 10Andrew Bogott) [20:50:53] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: Puppet has 1 failures [20:50:54] why is puppet so hard :( [21:02:09] (03CR) 10Volans: [C: 032] Repool with low weight new s3 DB after TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283748 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [21:02:35] (03Merged) 10jenkins-bot: Repool with low weight new s3 DB after TLS upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283748 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [21:03:12] (03PS5) 10Andrew Bogott: Removed the unused is_puppet_master var [puppet] - 10https://gerrit.wikimedia.org/r/283727 [21:03:14] (03PS5) 10Andrew Bogott: Eliminate the old, unlintable us_labs_puppet_master global [puppet] - 10https://gerrit.wikimedia.org/r/283726 [21:03:16] (03PS6) 10Andrew Bogott: Allow horizon to query the labs puppetmaster for a list of classes [puppet] - 10https://gerrit.wikimedia.org/r/283728 [21:04:35] !log volans@tin Synchronized wmf-config/db-eqiad.php: Repool new db1075,1077,1078 after TLS upgrade on s3 - T111654 (duration: 00m 
36s) [21:04:36] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [21:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:05:58] !log Completed TLS for shard s3 (finally, monitoring repooled servers) T111654 [21:05:59] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [21:08:29] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2211383 (10Krenair) [21:10:46] (03CR) 10Andrew Bogott: [C: 032] "Finally got the puppet compiler to agree with this." [puppet] - 10https://gerrit.wikimedia.org/r/283726 (owner: 10Andrew Bogott) [21:10:55] (03PS6) 10Andrew Bogott: Eliminate the old, unlintable us_labs_puppet_master global [puppet] - 10https://gerrit.wikimedia.org/r/283726 [21:16:37] 06Operations, 10OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2211403 (10Krenair) At least, I assume it's OTRS. 
[21:16:51] (CR) Andrew Bogott: "confirmed no-op in the puppet compiler" [puppet] - https://gerrit.wikimedia.org/r/283727 (owner: Andrew Bogott)
[21:16:58] (PS6) Andrew Bogott: Removed the unused is_puppet_master var [puppet] - https://gerrit.wikimedia.org/r/283727
[21:17:19] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[21:19:01] (CR) Andrew Bogott: [C: 2] Removed the unused is_puppet_master var [puppet] - https://gerrit.wikimedia.org/r/283727 (owner: Andrew Bogott)
[21:40:53] (PS1) Papaul: DNS: Adding mgmt DNS entries for conf200[1-3] Bug: T131959 [dns] - https://gerrit.wikimedia.org/r/283752 (https://phabricator.wikimedia.org/T131959)
[21:51:50] (Abandoned) Papaul: Decom:Removed production DNS for db200[1-7] Bug:T125827 [dns] - https://gerrit.wikimedia.org/r/278344 (https://phabricator.wikimedia.org/T125827) (owner: Papaul)
[21:53:01] ostriches: Still around? I assume that I do not have sudo, so I cannot double-check the ssh config. But can you connect to lead as gerrit2 user on the commandline?
[21:53:07] Did you restart the gerrit server?
[21:53:14] s/gerrit server/gerrit/
[21:57:54] qchris_: i think you can read the config without sudo
[21:58:06] The ssh config of the gerrit2 user?
[21:58:13] I hope not, but let me try.
[21:58:34] oh, i mean sshd_config because we had issues with jgit connecting to it
[21:58:53] and made changes on lead because of that
[21:59:05] i can look it up for you though
[21:59:24] The host key of lead needs to be in the known_hosts of the gerrit2 user on ytterbium.
[21:59:54] yea, but he already did that
[22:00:01] rsa and ecdsa
[22:00:09] and the error message wouldnt go away
[22:00:09] And did he also restart gerrit?
[22:00:15] yea, multiple times
[22:00:25] Mhmm. Ok.
[22:00:25] before that we had another error, btw
[22:00:52] Then I'll go back to checking the logs again :-)
[22:00:53] first it was "Algorithm negotiation fail".
[22:01:05] and that we fixed with https://gerrit.wikimedia.org/r/#/c/283554/
[22:01:29] * ostriches catches up
[22:01:31] then came the host key issue and demon tried adding them ..
[22:02:14] ostriches: did i miss something in between?
[22:02:21] And gerrit2 can now connect to lead (on the commandline)?
[22:02:35] Nope that was all correct mutante
[22:02:58] qchris_: when I did it with plain `ssh` it worked fine
[22:02:59] it still complains about the host key for some reason
[22:03:20] Operations: Preserve SSH host key when re-imaging hosts - https://phabricator.wikimedia.org/T129180#2097333 (Andrew) What if ssh keys were managed by puppet? Adding new hosts to puppet would be more complicated (you'd have to explicitly create a key in the private repo for each host) but after that the keys...
[22:03:24] mutante: I cannot see the host key of lead in the known_hosts of gerrit2 user .ssh/
[22:03:41] on ytterbium
[22:03:52] if I get what qchris_ was asking
[22:03:56] ostriches: was it a different user maybe?
[22:04:03] eh
[22:04:03] Thanks volans.
[22:04:35] I cleared them out last night after we gave up testing
[22:04:42] ah!
[22:06:21] So IIRC, once the plain ssh works on the commandline as gerrit2, it's adding the replication config, restarting gerrit, and it should work.
[22:07:08] That's where I was
[22:08:02] Mhmm.
[22:09:09] papaul: https://gerrit.wikimedia.org/r/#/c/283752/1/templates/10.in-addr.arpa wmf6408 doesnt have a service name yet or should it be conf2001 = 1
[22:09:49] Hard to say if one cannot look at the hosts. Do you have verbose ssh logs from lead and say antimony when gerrit is trying to connect them to replicate?
[22:10:24] Just to check which algos they pick for host keys etc.
[22:10:25] mutante: it doesn't have a service name yet
[22:10:35] mutante: it is another server
[22:10:36] papaul: nothing to do with conf ?
[22:10:38] ok
[22:10:50] mutante: yes nothing to do with conf
[22:11:05] mutante: that server is part of the spare pool
[22:11:21] qchris_: I do for plain ssh but gerrit doesn't log much
[22:11:27] (CR) Dzahn: [C: 2] DNS: Adding mgmt DNS entries for conf200[1-3] Bug: T131959 [dns] - https://gerrit.wikimedia.org/r/283752 (https://phabricator.wikimedia.org/T131959) (owner: Papaul)
[22:11:41] Right, I meant server-side logs.
[22:11:44] From sshd.
[22:12:03] papaul: thanks, and done
[22:12:10] mutante: thanks
[22:12:55] Operations, OTRS: OTRS has been eating 100% of mendelevium's CPU for the last fortnight - https://phabricator.wikimedia.org/T132822#2211383 (Keegan) If it's not OTRS, it's directly affecting OTRS. I just ran a memory intensive full-text search that I last ran in mid-March. Back then the search took abou...
[22:14:10] ostriches: The thing is that gerrit's ssh is crippled, as it does not support all algos in the same way that the ssh client would do. So seeing Gerrit connect to the sshd on lead and antimony would allow to rule out that gerrit is picking different algos for host keys.
[22:18:20] qchris_: Yeah I know it's massively outdated, part of the reason we're upgrading (which is why I need to start replicating the repos to the replacement box!)
[22:18:27] antimony is precise, lead is jessie.
[22:18:38] I see.
[22:21:50] Let me check the Jsch code, if it can be made to log more ...
[22:28:37] Operations, Traffic, HTTPS: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2211472 (BBlack) Rough notes from thinking about implementation more: ``` # id = uniq id for this cert, e.g. puppet $title # names = foo.wm.o[,bar.wm.o[,baz...]] # mode...
[22:28:49] !log rebooting bast4001 - debugging install issue
[22:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:30:49] ostriches: There's not much logging in those parts of Jsch that one could turn on. Jsch only has a hard-coded System.err.println("finger-print: "+key_fprint); which is commented out. So one would need to recompile Jsch and Gerrit to get better logs from Gerrit (client side).
[22:31:04] So increasing verbosity on the ssh server side would be easier.
[22:31:15] That we *can* do
[22:31:26] The error message stems from the strict host key checking (surprise :-D) so one could
[22:31:26] Ok, so step 1 let's turn replication to lead back on
[22:31:51] turn strict host key checking off temporarily, to see if there are other issues that might be in the way.
[22:31:54] mutante: I've got https://gerrit.wikimedia.org/r/#/c/283679/ ready for that, if you've got a min
[22:32:12] I tried that, but I don't think gerrit reads .ssh/config?
[22:32:42] just a moment, rebooting something right now
[22:32:53] IIRC, Gerrit is supposed to use it. But it caches it upon startup. So changes in there need to be followed by a Gerrit restart.
[22:33:38] (PS2) Dzahn: Revert "Cut off lead replication for a bit. SSH is busted" [puppet] - https://gerrit.wikimedia.org/r/283679 (owner: Chad)
[22:33:58] (CR) Dzahn: [C: 2] Revert "Cut off lead replication for a bit. SSH is busted" [puppet] - https://gerrit.wikimedia.org/r/283679 (owner: Chad)
[22:34:16] PROBLEM - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:34:37] yep, known. it's me. SALed
[22:35:15] ACKNOWLEDGEMENT - Host bast4001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn install
[22:36:55] qchris_: Both lead and antimony already set to VERBOSE
[22:37:26] ostriches: Perfect! Let me try to see if I can log in to them.
[22:37:51] Prolly not lead.
[22:37:56] Actually, you should
[22:37:58] gerrit-roots
[22:39:46] Nay. /me is only gerrit-admins.
[22:40:12] :-)
[22:40:19] Apr 14 20:33:08 lead sshd[39440]: Connection from 208.80.154.80 port 48637 on 208.80.154.82 port 22
[22:40:19] Apr 14 20:33:08 lead sshd[39440]: fatal: Unable to negotiate a key exchange method [preauth]
[22:40:21] On lead ^
[22:40:23] Probably the failures
[22:40:32] Cool.
[22:40:39] well, "key exchange method" is the kex algorithms
[22:40:43] like in what i linked earlier
[22:41:13] when in puppet/hieradata/ try grep -r kex * and you see
[22:41:31] how some hosts have the NIST keys disabled and some dont
[22:42:32] and modules/ssh/templates/sshd_config.erb <%- if @disable_nist_kex -%>
[22:43:06] but we already fixed that the other day? so that part is odd
[22:43:16] mutante: some labs things needed it disabled because older software only supported ssh with it enabled
[22:44:10] YuviPanda: yes, i know, i'm pointing out that because here we have a case of older software
[22:44:24] aaaah, gerrit. nvm me
[22:44:30] it fixes the issue when jgit connects
[22:44:33] yea
[22:44:33] Maybe jessie has the nist keys disabled by default? So not disabling them through config does not enable them?
[22:44:59] that could be it, yes
[22:45:07] i had puppet disabled temp. tried the change
[22:45:09] it fixed it
[22:45:48] so if puppet added the config line back then disabling it in hiera like that doesnt work on jessie
[22:47:22] try it on lead, disable puppet, remove the KexAlgorithms line, restart sshd
[22:47:41] re-enable puppet, see if it comes back
[22:48:05] i would but already have this other console open right now
[22:48:18] it's the same i did the other day
[22:49:11] qchris_: So part of the problem is strict host checking always works for `ssh` because /etc/ssh/known_hosts exists
[22:49:27] Short of stopping puppet I 'spose
[22:50:05] Oh you put the key in the system wide file? Not sure if Gerrit checks that too.
[22:50:37] I guess we can check that. (Still downloading the ssh precise deb to look at the defaults for kex)
[22:53:08] qchris_, mutante: https://phabricator.wikimedia.org/P2911
[22:53:48] Yup. The default Key Exchange algorithms changed between precise and jessie. diffie-hellman-group14-sha1 and diffie-hellman-group1-sha1 are missing.
[22:54:00] system-wide file vs. usefile ... i put some bet on that
[22:54:15] it's in addition to the kex thing of course
[22:54:27] user file.. cant type. much lag
[22:55:31] ostriches: try it without the KexAlgorithms line , so it's just defaults
[22:55:58] that's when we got past that and back to the host key issue
[22:55:59] It picks the elliptic curve host key. IIRC Gerrit wants the RSA keys.
[22:56:28] oh it works!?
[22:56:34] didnt read the paste right
[22:56:40] No. That's from the paste.
[22:56:48] That's ssh on the commandline.
[22:57:22] qchris_: I tried putting the rsa fingerprint in known_hosts too.
[22:57:57] mutante: Jenkins never merged :\ https://gerrit.wikimedia.org/r/#/c/283679/
[22:58:07] Yup. I saw that in the logs. I am just saying that the command uses the ECDSA one. Gerrit would use the RSA one.
[22:58:51] ostriches: merged now
[23:03:38] * ostriches kicks ytterbium a few times
[23:04:14] ostriches: Oh! One other thing ... You did use the unhashed hostname in the hostkey, right?
[23:04:24] Yeah
[23:04:30] Good.
[23:04:33] I can't read it if it's hashed :P
[23:08:54] qchris_: Ok, restarted replication. Failed hostkey. This is at our control state of not having any host keys in ~/.ssh/
[23:09:14] There is still one in /etc/ssh/ tho, but it's the ecdsa one
[23:09:55] And is the hostkey for antimony also only in /etc/ssh and not in ~/.ssh?
[23:10:20] antimony's in both
[23:10:37] rsa in ~/.ssh/, ecdsa in /etc/
[23:10:52] Can you try adding the rsa lead key to ~/.ssh?
[23:11:02] Then restart gerrit.
[23:11:11] And then let us know that it works just fine?
[23:12:53] Restarting gerrit
[23:14:05] Ok back, let's start a replication job again
[23:14:44] Same, org.eclipse.jgit.errors.TransportException: gerritslave@lead.wikimedia.org:/srv/gerrit/git/mediawiki/vendor.git: reject HostKey: lead.wikimedia.org
[23:16:18] :-(
[23:16:56] Operations, Patch-For-Review: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#2211530 (Dzahn) on carbon, /var/log/syslog, we can see how DHCP works: 198.35.26.5 is bast4001 ``` Apr 15 22:33:51 carbon dhcpd: DHCPDISCOVER from 90:b1:1c:4d:42:49 via 198.35.26.2 Apr 15 22:33:51...
[23:18:26] qchris_: I mean I guess I don't have to replicate these repos to the new machine, but it helps keep the downtime shorter :\
[23:18:40] Oh wait. From your paste from before: debug3: order_hostkeyalgs: prefer hostkeyalgs: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521
[23:18:52] There is no "ssh-rsa" in there
[23:19:40] Can you ssh from ytterbium to lead using the rsa host key on the commandline?
[23:20:21] Is lead just the new gerrit server? I guess you can just copy the files over then.
[23:20:29] Yeah
[23:21:43] I guess you'll run into the same issue, when setting up new replication targets that are on Jessie. But at least the problem is kicked down the road.
[23:22:05] Hopefully, the gerrit -> Phab switchover happened by the time a new replication target is needed :-)
[23:23:09] Nerpppp
[23:23:25] Forcing rsa didn't work with `ssh`
[23:24:19] Ha!
[23:24:24] Ah, I see what happens.
[23:24:24] Then that's the culprit.
[23:24:45] Here's where it gets interesting:
[23:24:48] debug1: Host 'lead.wikimedia.org' is known and matches the RSA host key.
[23:24:48] debug1: Found key in /var/lib/gerrit2/.ssh/known_hosts:3
[23:24:48] Warning: the RSA host key for 'lead.wikimedia.org' differs from the key for the IP address '208.80.154.82'
[23:24:48] Offending key for IP in /etc/ssh/ssh_known_hosts:475
[23:24:48] Matching host key in /var/lib/gerrit2/.ssh/known_hosts:3
[23:24:48] Exiting, you have requested strict checking.
[23:24:49] Host key verification failed.
[23:25:25] aww. got reinstalled and the host key has been updated but the one for the IP is still old ?
[23:28:55] Operations, Phabricator, Project-Admins, Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2211545 (mmodell) >>! In T706#2168712, @JeanFred wrote: >>>! In T706#2168682, @Luke081515 wrote: >> You can't convert project to subproj...
[23:30:05] Operations, Phabricator, Project-Admins, Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2211553 (mmodell) >>! In T706#2198525, @Lokal_Profil wrote: > I like to join #Project-Creators to be able to setup/coordinate/clean-up p...
[23:30:45] I dunno. I'm stumped.
[23:31:23] What's the new error message?
[23:31:36] delete that offending key for the IP?
[23:31:37] I mean it's the same, I just dunno how to fix it
[23:31:56] Right, cleaning the known_hosts files should do the trick.
[23:32:04] /etc/ssh/ssh_known_hosts:475
[23:32:10] that line 475
[23:32:42] And it's not outdated.
[23:32:47] That's the correct one from lead afaict
[23:33:12] but it's the ecdsa one.
[23:33:52] But antimony has the same ecdsa one in there too. I'm just not sure why it's not preferring the one from ~/.ssh/known_hosts
[23:34:18] antimony.wikimedia.org,antimony,208.80.154.7,2620:0:861:1:208:80:154:7 ecdsa-sha2-nistp256 ....
[23:34:42] Does the ssh client allow to connect to antimony when forcing rsa?
[23:36:39] From the command line like that? No
[23:36:43] Same failure over host keys
[23:37:50] Can you disable puppet and move /etc/ssh/ssh_known_hosts out of the way?
[23:37:59] Or is there a commandline switch to avoid reading it?
[23:39:04] GlobalKnownHostsFile <- that's the option
[23:41:22] Hmm, and it works when I set that to /dev/null!
[23:41:25] lead & antimony
[23:42:48] Mhmm. So lead allows RSA although it does not have it in hostkeyalgs? Weird.
[23:43:21] Operations, Patch-For-Review: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#2211585 (Dzahn) Faidon made this change https://gerrit.wikimedia.org/r/#/c/283627/1 which meant now install2001 instead of carbon would be used as install server for bast4001 looking on install2001,...
[23:48:02] Should just disable again
[23:48:10] No point in poking this more, weekend is upon us
[23:49:01] Sounds like copying is the easier solution.
[23:49:28] Yeah
[23:49:34] I'm not figuring it out at 5pm on a friday
[23:49:45] mutante: Last time this week, I swear. https://gerrit.wikimedia.org/r/#/c/283760/
[23:49:56] /me hands ostriches a beer
[23:50:13] Operations, Traffic, HTTPS: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2211601 (Dzahn) https://github.com/aloyr/acme-tiny-automator "automates deployment of letsencrypt certs using acme-tiny library This relies on the acme-tiny library, do...
[23:52:24] just restart grrrit-wm one more time before you go?
[23:52:30] the revert is on the master .. now
[23:52:50] I know nothing about the bot.
[23:53:19] ah, ok
[23:55:05] Ok, replication config back to normal.
[23:55:09] So it won't fail all weekend
[23:55:57] mutante: ostriches http://wikitech.wikimedia.org/wiki/Grrrit-wm :)
[23:56:40] mutante: ostriches I can do it if you want
[23:56:59] Connection closed by unknown :(
[23:57:00] Operations, Traffic, HTTPS: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2211602 (BBlack) It's similar in scope to what's going on in my paste above, but it's still missing a few bits we'll need on the webserver config chicken/egg thing even f...
[23:57:05] Unknown doesn't like me
[23:57:12] YuviPanda: thank you, please do
[23:57:14] heh. you need to be tools admin or ops.
[23:57:51] done. I need to modify the health check to check for ssh connectivity somehow instead of just process death
[23:58:00] maybe a heartbeat file or something
[23:58:28] thx Yuvi
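
Note for anyone retracing the host key debugging above: the session kept coming back to which key type (RSA or ECDSA) was recorded for lead in ~/.ssh/known_hosts versus /etc/ssh/ssh_known_hosts, and whether it was recorded for the host name or the IP address. Below is a minimal sketch of a helper that lists the key types a known_hosts file holds for one exact name or address. It was not run during this session. The function name `key_types_for`, the sample entries and the placeholder key blobs are made up for illustration; it assumes unhashed entries (as confirmed at 23:04) and does exact matching only, with no wildcard or [host]:port handling.

```python
def key_types_for(known_hosts_text, target):
    """Return the sorted key types recorded for `target` (exact host name or IP)."""
    found = set()
    for line in known_hosts_text.splitlines():
        line = line.strip()
        # Ignore blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Drop an optional marker such as @cert-authority or @revoked.
        if fields[0].startswith("@"):
            fields = fields[1:]
        # A usable entry needs at least: hosts, key type, key blob.
        if len(fields) < 3:
            continue
        hosts, key_type = fields[0], fields[1]
        # Hashed entries (|1|salt|hash) cannot be matched by name; skip them.
        if hosts.startswith("|"):
            continue
        if target in hosts.split(","):
            found.add(key_type)
    return sorted(found)


# Illustrative data only, loosely modelled on the state described in the log:
# RSA for the host name in the user file, ECDSA for name and IP in the global file.
USER_FILE = "lead.wikimedia.org ssh-rsa AAAAEXAMPLEKEY\n"
GLOBAL_FILE = "lead.wikimedia.org,lead,208.80.154.82 ecdsa-sha2-nistp256 AAAAEXAMPLEKEY\n"

if __name__ == "__main__":
    for label, text in (("user", USER_FILE), ("global", GLOBAL_FILE)):
        for target in ("lead.wikimedia.org", "208.80.154.82"):
            print(label, target, key_types_for(text, target))
```

Comparing the output for the host name and for the IP address across both files shows at a glance where an RSA entry exists and where only ECDSA does. Whether that mismatch is what made Gerrit's client reject the host key was not settled in the discussion above.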