[00:05:01] (03PS4) 10Dzahn: phragile: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285333
[00:07:43] (03PS2) 10Dereckson: Show counts in category pages on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284087 (https://phabricator.wikimedia.org/T132972) (owner: 10Eranroz)
[00:09:01] (03CR) 10Dereckson: "PS2: rebased to use short array syntax" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/284087 (https://phabricator.wikimedia.org/T132972) (owner: 10Eranroz)
[00:16:45] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2241788 (10Krenair)
[00:27:52] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: puppet fail
[00:29:48] icinga-wm: no
[00:30:02] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[00:40:46] nice quit message :p
[00:40:49] (03PS1) 10Dzahn: ircecho: make it start on systemd, add unit file [puppet] - 10https://gerrit.wikimedia.org/r/285561 (https://phabricator.wikimedia.org/T123729)
[00:42:09] (03CR) 10jenkins-bot: [V: 04-1] ircecho: make it start on systemd, add unit file [puppet] - 10https://gerrit.wikimedia.org/r/285561 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn)
[00:54:14] (03PS2) 10Dzahn: ircecho: make it start on systemd, add unit file [puppet] - 10https://gerrit.wikimedia.org/r/285561 (https://phabricator.wikimedia.org/T123729)
[01:06:12] (03CR) 10Dzahn: [C: 032] "no-op on argon, kraz is still unknown to compiler http://puppet-compiler.wmflabs.org/2571/" [puppet] - 10https://gerrit.wikimedia.org/r/285561 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn)
[01:09:30] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 559.33 seconds
[01:18:10] (03PS1) 10Dzahn: ircecho: fix init file dependency for service on systemd [puppet] - 10https://gerrit.wikimedia.org/r/285568 (https://phabricator.wikimedia.org/T123729)
[01:22:02] (03CR) 10Dzahn: [C: 032] "no-op on argon http://puppet-compiler.wmflabs.org/2572/" [puppet] - 10https://gerrit.wikimedia.org/r/285568 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn)
[01:25:10] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2241827 (10Dzahn) now: ``` service ircecho status ● ircecho.service - IRC bot for the MW RC IRCD Loaded: loaded (/etc/systemd/system/ircecho.service; disabl...
[01:30:09] (03PS1) 10Dzahn: ircserver: puppetize install of ircd-ratbox [puppet] - 10https://gerrit.wikimedia.org/r/285569 (https://phabricator.wikimedia.org/T123729)
[01:34:48] (03CR) 10Dzahn: [C: 032] "no-op on argon http://puppet-compiler.wmflabs.org/2573/" [puppet] - 10https://gerrit.wikimedia.org/r/285569 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn)
[01:38:44] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:45:17] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2241833 (10Dzahn) < icinga-wm> RECOVERY - puppet last run on kraz is OK: OK: ● ircd.service - IRCd for Mediawiki RecentChanges feed Loaded: loaded (/etc/sy...
[01:49:28] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.46 seconds
[01:50:37] (03PS1) 10Dzahn: ircserver: add irssi on irc server for testing [puppet] - 10https://gerrit.wikimedia.org/r/285570
[01:50:59] (03PS2) 10Dzahn: ircserver: add irssi on irc server for testing [puppet] - 10https://gerrit.wikimedia.org/r/285570 (https://phabricator.wikimedia.org/T123729)
[01:51:55] (03CR) 10Dzahn: [C: 032] ircserver: add irssi on irc server for testing [puppet] - 10https://gerrit.wikimedia.org/r/285570 (https://phabricator.wikimedia.org/T123729) (owner: 10Dzahn)
[02:01:16] (03CR) 10Alex Monk: [C: 04-1] "This is actively breaking root login in beta" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4)
[02:06:32] (03PS8) 1020after4: Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747)
[02:07:56] twentyafterfour, that was fast. want me to put this commit in place on the puppetmaster?
[02:08:14] ah but you have already
[02:08:15] okay
[02:11:15] Krenair: :) hopefully that fixes it
[02:11:24] running puppet now to test on deployment-sca01
[02:11:53] it looked like it fixed it
[02:12:12] I tested it on... deployment-elastic06 I think it was
[02:12:37] sorry for breaking beta so much today... I keep running into undocumented gotchas and stuff I'm just not familiar with
[02:13:06] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail
[02:13:08] I still can't figure out why keyholder auth isn't working for deploy-service user
[02:14:58] how can i tell a single appserver to send the MW-RC-IRC data to another IRC server
[02:15:22] they all send data to argon , UDP 9390
[02:15:47] (03PS9) 1020after4: Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747)
[02:16:01] 06Operations, 10Wikimedia-IRC-RC-Server, 13Patch-For-Review: Migrate argon (irc.wikimedia.org) to Jessie - https://phabricator.wikimedia.org/T123729#2241841 (10Dzahn) The bot connects to the IRC server but does not join any channels because it does not get input on port 9390 from the appservers. compare to...
[02:16:36] mutante: in operations/mediawiki-config.git:/wmf-config/ProductionServices.php
[02:16:50] ori: thanks!:)
[02:16:51] change line 60: $wmfAllServices['eqiad']['irc'] = '208.80.154.160'; // eqiad: argon
[02:17:19] please don't actually do that in production though
[02:17:53] you can have it send to multiple servers
[02:17:58] you can do it as a live hack on mw1098
[02:18:08] that server is not getting any organic traffic, just x-wikimedia-debug
[02:18:09] is mw1098 not serving any traffic?
[02:18:10] ok
[02:18:37] if it's only for a few minutes, you can just locally hack edit, make an edit to test2.wikipedia.org (or whatever), and then run sync-common to undo the change
[02:19:37] cool, thanks. i did not even have the ProductionServices.php yet, now i do :)
[02:20:00] mutante: do you want me to help you test it?
[02:20:06] and i notice there is prep work for a codfw irc
[02:20:16] ori: sure, if you like, yes
[02:20:25] i just got the ircd and bot up on kraz
[02:20:42] ok, so if i configure mw1098 to use it, you'll see it?
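[Editor's note] The test being set up here can be reproduced by hand without touching mediawiki-config. Below is a minimal Python sketch that sends one fake recent-changes event to the relay, under assumptions: the host and port come from the discussion above, and the tab-separated "#channel<TAB>message" datagram format is an assumption about udpmxircecho's wire protocol, not taken from its source.

```
# Sketch only: emulate the UDP datagram an appserver sends to the
# RC-to-IRC relay, so a single test event can be aimed at kraz.
# The "#channel<TAB>message" format is an assumption; verify against
# udpmxircecho before relying on it.
import socket

RELAY_HOST = 'kraz.wikimedia.org'  # new IRC server under test (was argon)
RELAY_PORT = 9390                  # UDP port the relay listens on

def send_rc_event(channel, text, host=RELAY_HOST, port=RELAY_PORT):
    """Send one recent-changes line as a single UDP datagram."""
    payload = ('%s\t%s' % (channel, text)).encode('utf-8')
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()

send_rc_event('#test2.wikipedia', 'test event, please ignore')
```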
[02:21:06] hmm, i just need to know the right channel
[02:21:07] checking
[02:21:40] yes, there is #test2.wikipedia on argon :)
[02:22:08] yes, it should work
[02:22:47] btw, I said mw1098, I meant mw1099
[02:22:52] making an edit, just a sec
[02:22:53] i have irssi on that box, connected to localhost, sitting in that channel. udpmxircecho is up , tcpdump listening
[02:23:22] 02:22:42.677920 IP mw1098.eqiad.wmnet.59614 > kraz.wikimedia.org.9390: UDP, length 337
[02:23:27] so far ..yes
[02:23:31] in the IRC client. .not yet
[02:24:31] mw1099
[02:24:37] i made an edit on test2
[02:24:43] did you see it?
[02:25:01] i see it in tcpdump, but the bot still did not join the channel
[02:25:31] how are you in the channel if the bot has not created it yet?
[02:25:53] branch prediction
[02:26:22] I made another edit, just in case
[02:26:57] building up my mainspace edit count in anticipation of an enwiki RfA
[02:26:58] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.21) (duration: 11m 46s)
[02:27:01] (just kidding)
[02:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:07] PROBLEM - HHVM rendering on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:27:31] ori, can you try once more please?
[02:28:06] done
[02:28:17] https://test2.wikipedia.org/w/index.php?title=IRC&action=history
[02:28:57] I didn't see anything in the new irc
[02:29:07] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 66711 bytes in 0.086 second response time
[02:29:30] and then i got disconnected from my local network.. back
[02:29:49] I don't see it on the old IRC either
[02:29:58] yeah, it definitely went to kraz
[02:30:26] 02:30 [@rc-pmtpa] [ Krenair] [ root]
[02:30:27] woot
[02:30:38] :root!~root@special.user JOIN :#test2.wikipedia
[02:30:55] hehe, i also smiled at special.user
[02:31:04] so how did the channel get created
[02:31:08] did the bot talk now?
[02:31:16] no
[02:31:27] bah.. but at least i see the bot on it
[02:31:30] that is new
[02:31:36] I gotta run, sorry.
[02:31:38] but I'm also slightly concerned because the bot appears to be in several channels that aren't test2.wikipedia
[02:31:50] although it's not sending anything to the one I checked
[02:31:50] ori: thanks!
[02:31:52] https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug for how to route requests to mw1099
[02:32:47] Krenair: i wonder why it's on channels now and wasnt before
[02:32:56] i did /list -YES a couple times
[02:32:58] i had nothing
[02:33:01] before i asked here
[02:33:18] did it just take a while?
[02:33:37] did what just take a while?
[02:33:46] the bot joining the channels
[02:33:56] I don't know when it joined, I only found out it was on them through whois
[02:34:02] what makes that happen if it doesnt receive traffic
[02:34:28] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 2 failures
[02:34:45] it was not in them right after i started it
[02:35:23] as you point out, first test2.wikipedia would have to work on the old server
[02:35:47] :irc.pmtpa.wikimedia.org 322 Krenair #fr.wiktionary 1 :
[02:35:47] :irc.pmtpa.wikimedia.org 322 Krenair #es.wikipedia 1 :
[02:35:47] :irc.pmtpa.wikimedia.org 322 Krenair #hi.wikipedia 1 :
[02:35:48] :irc.pmtpa.wikimedia.org 322 Krenair #fr.wikipedia 1 :
[02:35:48] :irc.pmtpa.wikimedia.org 322 Krenair #test2.wikipedia 3 :
[02:35:49] :irc.pmtpa.wikimedia.org 322 Krenair #ru.wikipedia 1 :
[02:35:50] :irc.pmtpa.wikimedia.org 322 Krenair #en.wikipedia 2 :
[02:36:03] yes, i see the same list
[02:38:20] so traffic is definitely flowing in to kraz from mw1099?
[02:38:22] well, it's progress either way :) the IRCd is up, the bot is up
[02:38:34] yes, i saw it
[02:38:54] tcpdump port 9390
[02:39:10] I can't log in to that server let alone run tcpdump on it mutante
[02:39:38] i'm just saying what i ran
[02:40:28] 19:28 < mutante> 02:22:42.677920 IP mw1098.eqiad.wmnet.59614 > kraz.wikimedia.org.9390: UDP, length 337
[02:40:44] 1098
[02:40:48] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:41:28] so maybe not, and mw1098 vs mw1099
[02:41:40] well, and once i said that , here it is:
[02:41:43] 02:41:22.666078 IP mw1099.eqiad.wmnet.35276 > kraz.wikimedia.org.9390: UDP, length 22
[02:42:32] I just tried sending something myself
[02:43:03] 19:28 <@rc-pmtpa> [[IRC]] !N https://test2.wikipedia.org/w/index.php?oldid=282815&rcid=451046 * Ori Livneh * (+8) Created page with "Testing!"
[02:43:09] that ? :)
[02:43:21] you received that on the new server?
[02:45:07] no, that was the old one :p
[02:45:17] so we know test2.wp can work
[02:45:35] but it doesnt on new.. gotta continue there later
[02:46:26] gotta go for dinner. thanks so far for helping
[02:46:33] I'd stick some debug logging in udpmxircecho
[02:46:38] it's going in the right direction..
[02:46:45] check it's actually sending to irc
[02:46:51] that sounds good, yes
[02:47:00] we need some more verbosity
[02:47:08] bbl
[02:57:17] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[03:01:25] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.22) (duration: 18m 27s)
[03:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:02:25] irc.pmtpa.wikimedia.org ?
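[Editor's note] Krenair's suggestion above, adding debug logging to udpmxircecho and checking that it actually hands messages to IRC, could look roughly like the sketch below. It logs each datagram on arrival and again at the IRC hand-off, so a gap between the two log lines points at the bot rather than the network. The split('\t') wire format and the irc_send callable are illustrative assumptions, not the bot's real internals.

```
# Sketch of the extra verbosity discussed above; names are hypothetical.
import logging
import socket

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(message)s')

def recv_loop(irc_send, bind_addr='0.0.0.0', port=9390):
    """Receive RC datagrams and relay them to IRC, logging both steps."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    while True:
        data, addr = sock.recvfrom(65535)
        logging.debug('received %d bytes from %s', len(data), addr[0])
        try:
            channel, text = data.decode('utf-8').split('\t', 1)
        except ValueError:
            logging.warning('unparseable datagram: %r', data[:80])
            continue
        logging.debug('relaying to %s: %s', channel, text.strip())
        irc_send(channel, text.strip())  # hand off to the IRC connection
```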
[03:06:00] (03PS1) 10BBlack: rt.wm.o: use LE cert [puppet] - 10https://gerrit.wikimedia.org/r/285572 (https://phabricator.wikimedia.org/T132812)
[03:06:03] (03PS1) 10BBlack: rt.wm.o: remove old cert definition [puppet] - 10https://gerrit.wikimedia.org/r/285573 (https://phabricator.wikimedia.org/T132812)
[03:06:26] (03Abandoned) 10BBlack: ganglia: use LE cert [puppet] - 10https://gerrit.wikimedia.org/r/285441 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack)
[03:06:32] (03Abandoned) 10BBlack: ganglia: remove old cert absent line [puppet] - 10https://gerrit.wikimedia.org/r/285442 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack)
[03:08:00] (03CR) 10BBlack: [C: 032] rt.wm.o: use LE cert [puppet] - 10https://gerrit.wikimedia.org/r/285572 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack)
[03:11:11] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Apr 27 03:11:11 UTC 2016 (duration 9m 46s)
[03:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:12:17] (03PS1) 10BBlack: LE: actually deploy the challenge-apache.conf file [puppet] - 10https://gerrit.wikimedia.org/r/285574
[03:12:34] (03CR) 10BBlack: [C: 032 V: 032] LE: actually deploy the challenge-apache.conf file [puppet] - 10https://gerrit.wikimedia.org/r/285574 (owner: 10BBlack)
[03:15:16] bblack, yep
[03:15:47] IRC network still has the name with pmtpa
[03:15:47] PROBLEM - puppet last run on magnesium is CRITICAL: CRITICAL: Puppet has 1 failures
[03:15:55] even though it's actually in eqiad
[03:26:09] RECOVERY - puppet last run on magnesium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[03:28:48] (03PS1) 10BBlack: LE: comment out most of apache challenge conf [puppet] - 10https://gerrit.wikimedia.org/r/285577
[03:29:02] (03CR) 10BBlack: [C: 032 V: 032] LE: comment out most of apache challenge conf [puppet] - 10https://gerrit.wikimedia.org/r/285577 (owner: 10BBlack)
[03:30:36] 06Operations, 10Traffic, 13Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2241887 (10BBlack)
[03:31:03] 06Operations, 10Traffic, 13Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2240497 (10BBlack) Table at top updated. rt.wikimedia.org is on LE now as our first example with Apache.
[03:31:18] PROBLEM - puppet last run on krypton is CRITICAL: CRITICAL: Puppet has 1 failures
[03:31:44] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2241889 (10BBlack) Converted rt.wm.o, so now we have 1x apache + 1x nginx converted. Next I'm going to switch ubuntu+mirrors (both on carbon) to a si...
[03:43:48] (03PS1) 10BBlack: mirrors: combine ubuntu+mirrors into single SAN instance [puppet] - 10https://gerrit.wikimedia.org/r/285578
[03:44:16] (03CR) 10BBlack: [C: 032 V: 032] mirrors: combine ubuntu+mirrors into single SAN instance [puppet] - 10https://gerrit.wikimedia.org/r/285578 (owner: 10BBlack)
[03:48:01] (03PS2) 10BBlack: rt.wm.o: remove old cert definition [puppet] - 10https://gerrit.wikimedia.org/r/285573 (https://phabricator.wikimedia.org/T132812)
[03:48:08] (03CR) 10BBlack: [C: 032 V: 032] rt.wm.o: remove old cert definition [puppet] - 10https://gerrit.wikimedia.org/r/285573 (https://phabricator.wikimedia.org/T132812) (owner: 10BBlack)
[03:50:46] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2241897 (10BBlack)
[03:50:48] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2241899 (10BBlack)
[03:50:51] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2241893 (10BBlack) 05Open>03Resolved a:03BBlack SAN test worked as well. We'll likely have more refinement and bugfixes to deal with later when...
[05:32:43] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Sort out letsencrypt puppetization for simple public hosts - https://phabricator.wikimedia.org/T132812#2241955 (10Dzahn) {F3935166} :)
[05:46:53] (03PS1) 10Ori.livneh: admin/ori: make 'reqs' continuously update [puppet] - 10https://gerrit.wikimedia.org/r/285581
[05:48:00] (03PS1) 10Yuvipanda: Don't barf on --release for non-lighttpd usage [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285582 (https://phabricator.wikimedia.org/T98440)
[05:48:02] (03PS1) 10Yuvipanda: Implement webservice 'status' command [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285583
[05:52:43] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2242004 (10Chmarkine) @mark As (ubuntu|mirrors).wikimedia.org now supports HTTPS, could we update Wikimedia's Ubuntu mirror link to https://ubuntu.wikimedia....
[06:05:00] (03PS1) 10Dzahn: RT: remove role from krypton, was for jessie test [puppet] - 10https://gerrit.wikimedia.org/r/285586
[06:05:31] (03PS2) 10Dzahn: RT: remove role from krypton, was for jessie test [puppet] - 10https://gerrit.wikimedia.org/r/285586
[06:06:12] (03PS3) 10Dzahn: RT: remove role from krypton, was for jessie test [puppet] - 10https://gerrit.wikimedia.org/r/285586
[06:06:26] (03CR) 10Dzahn: [C: 032] RT: remove role from krypton, was for jessie test [puppet] - 10https://gerrit.wikimedia.org/r/285586 (owner: 10Dzahn)
[06:07:22] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 1 failures
[06:09:41] !log krypton - re-enabled puppet after LE + rt.wm.o puppetization issues fixed with gerrit 285586 @bblack #LEftw
[06:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:09:52] RECOVERY - puppet last run on krypton is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:11:29] !log krypton - for some unrelated reason on every puppet run there is some noise about analytics::burrow stuff
[06:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:13:29] hi
[06:15:32] PROBLEM - HTTPS on krypton is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: SSL connect attempt failed with unknown error error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol
[06:16:31] (03PS1) 10Yuvipanda: Add support for uwsgi-plain webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285588
[06:16:33] (03PS1) 10Yuvipanda: Fix job_running check to match old webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285589
[06:16:35] (03PS1) 10Yuvipanda: Bump debian version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285590
[06:24:19] (03PS2) 10Yuvipanda: Implement webservice 'status' command [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285583
[06:24:21] (03PS2) 10Yuvipanda: Bump debian version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285590
[06:24:23] (03PS2) 10Yuvipanda: Fix job_running check to match old webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285589
[06:24:25] (03PS2) 10Yuvipanda: Add support for uwsgi-plain webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285588
[06:30:21] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:32] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:01] (03PS3) 10Yuvipanda: Implement webservice 'status' command [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285583
[06:31:03] (03PS3) 10Yuvipanda: Bump debian version [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285590
[06:31:05] (03PS3) 10Yuvipanda: Fix job_running check to match old webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285589
[06:31:07] (03PS3) 10Yuvipanda: Add support for uwsgi-plain webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285588
[06:31:12] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:17] 06Operations, 10Traffic, 06WMF-Legal, 10domains: wikipedia.lol - https://phabricator.wikimedia.org/T88861#2242036 (10Dzahn) 05Resolved>03Open >>! In T88861#1890467, @Mschon wrote: > does wmf support https://letsencrypt.org ? Times have changed. The answer is now Yes.
[06:31:32] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:42] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail
[06:32:02] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:03] 06Operations, 10Traffic, 06WMF-Legal, 10domains: wikipedia.lol - https://phabricator.wikimedia.org/T88861#2242041 (10Dzahn) T133548
[06:32:11] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:31] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:51] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:52] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: puppet fail
[06:33:12] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:32] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:55] (03PS2) 10Giuseppe Lavagetto: mediawiki::web: drop the HHVM/Zend conditionals from beta sites [puppet] - 10https://gerrit.wikimedia.org/r/285366 (https://phabricator.wikimedia.org/T126310)
[06:35:24] (03CR) 10Giuseppe Lavagetto: "This is a partial extraction from I21ec5494710a173a23823625aa9bb0bf6ce5c492 that will only affect beta." [puppet] - 10https://gerrit.wikimedia.org/r/285366 (https://phabricator.wikimedia.org/T126310) (owner: 10Giuseppe Lavagetto)
[06:35:52] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:22] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:56:02] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:56:22] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:56:22] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:56:32] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:56:52] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:56:52] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:57:32] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:57:42] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:13] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:22] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:23] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:06:28] (03PS2) 10Alexandros Kosiaris: servermon: Add managed_puppet_modules parameter [puppet] - 10https://gerrit.wikimedia.org/r/285390
[07:06:34] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] servermon: Add managed_puppet_modules parameter [puppet] - 10https://gerrit.wikimedia.org/r/285390 (owner: 10Alexandros Kosiaris)
[07:08:12] (03PS3) 10Giuseppe Lavagetto: mediawiki::web: drop the HHVM/Zend conditionals from beta sites [puppet] - 10https://gerrit.wikimedia.org/r/285366 (https://phabricator.wikimedia.org/T126310)
[07:11:23] (03PS1) 10Alexandros Kosiaris: servermon: Use the WSGI way of spawning gunicorn processes [puppet] - 10https://gerrit.wikimedia.org/r/285594
[07:13:06] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web: drop the HHVM/Zend conditionals from beta sites [puppet] - 10https://gerrit.wikimedia.org/r/285366 (https://phabricator.wikimedia.org/T126310) (owner: 10Giuseppe Lavagetto)
[07:13:26] (03PS2) 10Alexandros Kosiaris: uwsgi: Add python3 support [puppet] - 10https://gerrit.wikimedia.org/r/283492
[07:15:05] (03CR) 10Yuvipanda: [C: 04-1] uwsgi: Add python3 support [puppet] - 10https://gerrit.wikimedia.org/r/283492 (owner: 10Alexandros Kosiaris)
[07:15:31] (03CR) 10Yuvipanda: "Should find spots where the python3 plugin package was explicitly specified and remove them as well, could cause conflicts otherwise (sinc" [puppet] - 10https://gerrit.wikimedia.org/r/283492 (owner: 10Alexandros Kosiaris)
[07:15:34] akosiaris: ^
[07:16:05] YuviPanda: yeah. it's 3. toollabs, ores, ircyall
[07:16:16] what on earth is ircyall ?
[07:16:27] one of my sad, sad mistakes that I should rip out sometime soon
[07:16:44] but it's still used by wikibugs
[07:16:50] maybe in a few weeks
[07:16:55] (03PS2) 10Alexandros Kosiaris: servermon: Use the WSGI way of spawning gunicorn processes [puppet] - 10https://gerrit.wikimedia.org/r/285594
[07:17:01] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] servermon: Use the WSGI way of spawning gunicorn processes [puppet] - 10https://gerrit.wikimedia.org/r/285594 (owner: 10Alexandros Kosiaris)
[07:17:03] (it's an authenticated HTTP -> IRC gateway)
[07:17:31] ??? why ???
[07:17:44] is it REST ?
[07:17:56] if it is it belongs behind a REST proxy
[07:17:57] :P
[07:18:43] _joe_: merging your change as well
[07:18:55] <_joe_> akosiaris: yeah thanks
[07:19:10] (03PS1) 10Yuvipanda: Use IP rather than hostname for registering services [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285599
[07:20:22] akosiaris: at some point a few years ago I decided that we have too many bots and should have one bot that other bots can communicate via so we can keep the IRC related code in one place. Then I spent a weekend writing this code and then the puppet module, and then over the course of the next week realized what a terrible mistake I had made and forgot about it.
[07:20:42] akosiaris: but in the meantime wikibugs started using it for stuff and I couldn't just kill it
[07:21:24] YuviPanda: so now it's in that perfect limbo where the owner does not want it anymore but other ppl do
[07:21:44] but not enough to assume ownership so they can continue bitching to the owner
[07:21:45] right ?
[07:22:07] akosiaris: mostly, yeah. I can just move it to a toollabs tool and kill the puppet stuff at some point.
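[Editor's note] For readers wondering what an "authenticated HTTP -> IRC gateway" amounts to in practice, here is a toy sketch in that spirit. Everything in it (the header name, the body format, the send_to_irc stand-in) is hypothetical and for illustration only; it is not ircyall's actual interface.

```
# Toy sketch: a POST carrying a shared-secret header becomes one IRC line.
# Header name, wire format, and send_to_irc are hypothetical stand-ins.
from http.server import BaseHTTPRequestHandler, HTTPServer

SHARED_SECRET = 'replace-me'

def send_to_irc(channel, message):
    print('IRC %s <- %s' % (channel, message))  # stand-in for a real IRC client

class Gateway(BaseHTTPRequestHandler):
    def do_POST(self):
        # Reject requests that lack the shared secret.
        if self.headers.get('X-Gateway-Key') != SHARED_SECRET:
            self.send_response(403)
            self.end_headers()
            return
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length).decode('utf-8')
        # Assumed body format: "#channel message text"
        channel, _, message = body.partition(' ')
        send_to_irc(channel, message)
        self.send_response(202)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('127.0.0.1', 8080), Gateway).serve_forever()
```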
[07:23:05] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2242099 (10elukey)
[07:39:22] (03CR) 10Hashar: "Compiled again at https://puppet-compiler.wmflabs.org/2574/" [puppet] - 10https://gerrit.wikimedia.org/r/285526 (owner: 10Hashar)
[07:46:43] (03PS1) 10Alexandros Kosiaris: servermon: Specify the directory where the code actually lives in [puppet] - 10https://gerrit.wikimedia.org/r/285600
[07:47:03] (03PS2) 10Alexandros Kosiaris: servermon: Specify the directory where the code actually lives in [puppet] - 10https://gerrit.wikimedia.org/r/285600
[07:47:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] servermon: Specify the directory where the code actually lives in [puppet] - 10https://gerrit.wikimedia.org/r/285600 (owner: 10Alexandros Kosiaris)
[07:48:48] (03CR) 10Elukey: [C: 04-1] "Today I realized that I haven't tested the change on misc caches yesterday, and so I retried today getting erb compilation failures:" [puppet] - 10https://gerrit.wikimedia.org/r/285364 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey)
[07:50:16] (03PS4) 10Elukey: Add a maintenance flag to cache::misc directors. [puppet] - 10https://gerrit.wikimedia.org/r/285364 (https://phabricator.wikimedia.org/T76348)
[07:54:07] (03CR) 10Elukey: "This time looks better:" [puppet] - 10https://gerrit.wikimedia.org/r/285364 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey)
[07:55:03] (03PS1) 10Alexandros Kosiaris: servermon: Amend the make_updates.py path [puppet] - 10https://gerrit.wikimedia.org/r/285602
[07:55:40] (03PS2) 10Alexandros Kosiaris: servermon: Amend the make_updates.py path [puppet] - 10https://gerrit.wikimedia.org/r/285602
[07:55:47] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] servermon: Amend the make_updates.py path [puppet] - 10https://gerrit.wikimedia.org/r/285602 (owner: 10Alexandros Kosiaris)
[07:58:16] (03CR) 10Mobrovac: "This is now waiting the deployment of Iefbe5d7ea925a0cd37e2bbe1940690090df966c7 which fixes the registry path issue." [puppet] - 10https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) (owner: 10KartikMistry)
[07:58:43] (03PS1) 10Muehlenhoff: Add ferm service for rsyncd on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/285603
[08:02:19] (03PS5) 10Elukey: Add a maintenance flag to cache::misc directors. [puppet] - 10https://gerrit.wikimedia.org/r/285364 (https://phabricator.wikimedia.org/T76348)
[08:03:46] Anybody knows what's the procedure to add a package to apt.wikimedia.org? This was asked in https://phabricator.wikimedia.org/T123223#2108462 ...
[08:03:48] (03CR) 10Elukey: [C: 032] Add a maintenance flag to cache::misc directors. [puppet] - 10https://gerrit.wikimedia.org/r/285364 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey)
[08:04:53] akosiaris: o/ should I merge your change too?
[08:05:45] andre__: best to file a phab task for it and add the "operations" project, only ops can add packages there
[08:05:58] we're managing the repository with reprepro
[08:06:11] https://wikitech.wikimedia.org/wiki/Reprepro
[08:07:59] elukey: yup thanks
[08:08:58] ack!
[08:10:19] (03PS1) 10Giuseppe Lavagetto: mediawiki: remove decommissioned appservers [puppet] - 10https://gerrit.wikimedia.org/r/285604 (https://phabricator.wikimedia.org/T126242)
[08:10:21] (03PS1) 10Giuseppe Lavagetto: dhcp: remove entries for decommissioned appservers [puppet] - 10https://gerrit.wikimedia.org/r/285605 (https://phabricator.wikimedia.org/T126242)
[08:13:35] (03CR) 10Jcrespo: "This looks sane to me, compared to production configuration. However, we should check what happens for existing content. Does it continue " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282440 (https://phabricator.wikimedia.org/T95871) (owner: 10Mattflaschen)
[08:13:36] moritzm, thank you. Will do.
[08:13:42] (03CR) 10Mobrovac: [C: 04-1] "In order for this to work for scap deployments, scap::target also needs to be adjusted. A second look should also be given to all of the v" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4)
[08:14:30] 06Operations, 10Wikimedia-SVG-rendering: Install Noto CJK (Source Han Sans) font family for SVG rendering - https://phabricator.wikimedia.org/T123223#2242178 (10Aklapper) >>! In T123223#2108462, @Dereckson wrote: > What's the procedure to add a package to apt.wikimedia.org? https://wikitech.wikimedia.org/wiki/...
[08:16:40] (03CR) 10Elukey: [C: 031] "Puppet compiler looks good https://puppet-compiler.wmflabs.org/2579/stat1002.eqiad.wmnet/, all the settings looks good :)" [puppet] - 10https://gerrit.wikimedia.org/r/285603 (owner: 10Muehlenhoff)
[08:17:38] 06Operations, 10ops-eqiad, 10DBA: db1070, db1071 and db1065 overheating problems - https://phabricator.wikimedia.org/T132515#2242180 (10Volans) 05Open>03Resolved No more evidence of overheating for all three servers, dmesg and syslog clear (except the DIMM issue on `db1065` related to T133250.
[08:20:17] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2242184 (10fgiunchedi) thanks @dzahn ! there's another ~25 labs instances reporting data to misc-eqiad, likely because they are not running puppet or a self-hosted puppet m...
[08:22:47] 06Operations, 10OCG-General, 06Services, 13Patch-For-Review: Implement flag to tell an OCG machine not to take new tasks from the redis task queue - https://phabricator.wikimedia.org/T120077#2242185 (10Joe) @cscott so the way to depool a server will be: 1) Remove it from the pool in the load balancer 2) p...
[08:22:54] (03PS2) 10Muehlenhoff: Add ferm service for rsyncd on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/285603
[08:23:02] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm service for rsyncd on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/285603 (owner: 10Muehlenhoff)
[08:26:25] 06Operations, 10OCG-General, 06Services, 13Patch-For-Review: Implement flag to tell an OCG machine not to take new tasks from the redis task queue - https://phabricator.wikimedia.org/T120077#2242187 (10Joe) Also, I don't really understand your mocking of me saying the WMF should really take care about this...
[08:34:53] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 79 failures
[08:39:22] (03PS2) 10Giuseppe Lavagetto: mediawiki: remove decommissioned appservers [puppet] - 10https://gerrit.wikimedia.org/r/285604 (https://phabricator.wikimedia.org/T126242)
[08:39:56] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/2580/ - changes for mc1009,mc2008,mc2009 looks good" [puppet] - 10https://gerrit.wikimedia.org/r/284907 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey)
[08:40:06] akosiaris: possible to merge https://gerrit.wikimedia.org/r/#/c/284654/ today?
[08:40:42] mobrovac: I'm updating cxserver/deploy and when akosiaris is ready, will deploy cxserver in production.
[08:40:50] <_joe_> !log stopping puppet on mw10[7-8][0-9] and mw112[1-9]/mw1130 for T126242
[08:40:51] T126242: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242
[08:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:41:23] PROBLEM - Host cp3038 is DOWN: PING CRITICAL - Packet loss = 100%
[08:41:28] 06Operations, 10OCG-General, 05codfw-rollout: Document eqiad/codfw transition plan for OCG - https://phabricator.wikimedia.org/T133164#2242204 (10mobrovac) I agree. Having a consistent mapping between requests and cached content and using locks to get rid of double-processing seems like the way to go. Then,...
[08:41:53] mobrovac: can you remove -1 from 284654?
[08:42:12] kart_: when the new code's been deployed, yes
[08:42:32] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:42:50] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2242206 (10fgiunchedi) this doesn't seem to be blocked on ops ATM, let us know when the pieces are in place and if we can help
[08:42:52] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:43:42] mobrovac: I guess it should go along with the patch, else cxserver's language selector will be broken :)
[08:44:00] kart_: the ideal flow should be that you deploy the new code but don't restart the service, then we merge the change which will prompt puppet to restart cxserver
[08:44:48] mobrovac: I see. Wasn't cxserver deployment doing server restart automatically?
[08:44:48] kart_: as soon as akosiaris says he can assist, i'll remove my -1
[08:44:55] OK!
[08:45:00] That's fine.
[08:45:01] kart_: no, git deploy only checks out the code
[08:45:23] mobrovac: got it. I always do restart cxserver after git deploy. So,..
[08:45:33] kart_: speaking of which, you should switch to scap3 for deployment
[08:45:49] mobrovac: yes. There is a ticket for it (TM)
[08:46:13] mobrovac: I should work with release team to get it done next week.
[08:46:24] yup, kart_, guide is @ https://wikitech.wikimedia.org/wiki/Services/Scap_Migration
[08:46:34] OK. Checking.
[08:47:49] kart_: yes, if mobrovac is ok with it we can merge https://gerrit.wikimedia.org/r/#/c/284654/ today
[08:47:54] That helps, will start working on it.
[08:48:05] akosiaris: he says if akosiaris is okay.
[08:48:26] akosiaris: i'm ok with it if the new code will be deployed too
[08:48:45] recursive dependency :)
[08:48:51] ok then, should we do it now ?
[08:48:59] akosiaris: lets do.
[08:49:19] (03PS3) 10Jcrespo: Add mysql to labs dns servers [puppet] - 10https://gerrit.wikimedia.org/r/280863 (https://phabricator.wikimedia.org/T128737)
[08:49:21] ok, I suppose code deployment first ?
[08:49:29] akosiaris: I'm proceding to deploy cxserver, won't restart server
[08:49:34] ok
[08:49:35] yes.
[08:50:12] moritzm: unmerged changes on palladium, ferm for stat1002
[08:52:42] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6
[08:53:04] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp3038_v4, cp3038_v6
[08:53:05] akosiaris: done
[08:53:13] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6
[08:53:23] mobrovac: deployed code. service has not restarted.
[08:53:23] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6
[08:53:32] k, removing -1
[08:53:33] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6
[08:53:33] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6
[08:53:33] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6
[08:53:33] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6
[08:53:33] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp3038_v4, cp3038_v6
[08:53:38] <_joe_> oh god
[08:53:43] (03PS3) 10Giuseppe Lavagetto: mediawiki: remove decommissioned appservers [puppet] - 10https://gerrit.wikimedia.org/r/285604 (https://phabricator.wikimedia.org/T126242)
[08:53:43] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6
[08:53:53] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6
[08:54:02] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6
[08:54:04] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp3038_v4, cp3038_v6
[08:54:06] (03CR) 10Mobrovac: [C: 031] "Code deployed, let's go" [puppet] - 10https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) (owner: 10KartikMistry)
[08:54:11] (03CR) 10Jcrespo: [C: 032] Add mysql to labs dns servers [puppet] - 10https://gerrit.wikimedia.org/r/280863 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo)
[08:54:12] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6
[08:54:12] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6
[08:54:13] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6
[08:54:20] <_joe_> looks like cp3038 is not reachable via ssh
[08:54:22] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6
[08:54:23] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6
[08:54:24] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6
[08:54:24] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6
[08:54:24] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp3038_v4, cp3038_v6
[08:54:43] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp3038_v4, cp3038_v6
[08:54:44] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6
[08:54:52] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6
[08:54:52] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6
[08:54:53] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6
[08:54:53] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 146 connecting: cp3038_v4, cp3038_v6
[08:55:42] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[08:55:58] _joe_: I can't login into management either
[08:56:36] <_joe_> I am into mgmt now
[08:56:48] <_joe_> but maybe let's prepare to depool it
[08:56:48] ah, that might explain it
[08:56:58] yeah, I agree
[08:57:00] <_joe_> actually, we should
[08:57:14] <_joe_> !log depooling cp3038 from all live pools
[08:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:57:23] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[08:59:07] !log starting mysql on holmium
[08:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:59:58] <_joe_> !log hard rebooting cp3038, console unreachable
[09:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:01:52] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 148 ESP OK
[09:01:53] RECOVERY - Host cp3038 is UP: PING OK - Packet loss = 0%, RTA = 83.94 ms
[09:02:02] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK
[09:02:12] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK
[09:02:13] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK
[09:02:14] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK
[09:02:14] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK
[09:02:14] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK
[09:02:14] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 148 ESP OK
[09:02:48] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK
[09:02:49] PROBLEM - confd service on cp3038 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive
[09:02:50] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK
[09:03:10] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK
[09:03:18] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 148 ESP OK
[09:03:29] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK
[09:03:40] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK
[09:03:40] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK
[09:03:50] RECOVERY - confd service on cp3038 is OK: OK - confd is active
[09:03:51] PROBLEM - NTP on cp3038 is CRITICAL: NTP CRITICAL: Offset unknown
[09:03:58] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 148 ESP OK
[09:04:19] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK
[09:04:19] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK
[09:04:39] RECOVERY - IPsec on kafka1018 is OK: Strongswan OK - 148 ESP OK
[09:04:50] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK
[09:04:50] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK
[09:04:57] (03PS1) 10Jcrespo: Correct wrong collation from utf8 to unicode CI [puppet] - 10https://gerrit.wikimedia.org/r/285608 (https://phabricator.wikimedia.org/T128737)
[09:05:09] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK
[09:05:28] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK
[09:05:58] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK
[09:06:09] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK
[09:06:09] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK
[09:06:18] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 148 ESP OK
[09:06:24] (03CR) 10Jcrespo: [C: 032] Correct wrong collation from utf8 to unicode CI [puppet] - 10https://gerrit.wikimedia.org/r/285608 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo)
[09:06:40] akosiaris: ah, thanks
[09:07:02] akosiaris: are we good to merge the patch now?
[09:07:24] kart_: in a few... making sure cp3038 is ok still
[09:07:53] _joe_: should we repool cp3038 ? I think it looks fine
[09:08:25] _joe_: or do you prefer we keep it depooled and investigate a bit more what happened ?
[09:08:44] akosiaris: okay!
[09:08:45] <_joe_> I am investigating, didn't find shit until now
[09:09:34] <_joe_> so yeah repooling it
[09:09:41] <_joe_> !log repooling cp3038
[09:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:09:59] _joe_: last think I see in the logs is the memory compaction
[09:10:05] thing*
[09:10:21] which anyway happens once per minute
[09:11:03] <_joe_> !log repooling cp3038
[09:11:14] (03PS1) 10Jcrespo: Correcting typo on character_set_filesystem [puppet] - 10https://gerrit.wikimedia.org/r/285609 (https://phabricator.wikimedia.org/T128737)
[09:11:31] (03CR) 10Jcrespo: [C: 032] Correcting typo on character_set_filesystem [puppet] - 10https://gerrit.wikimedia.org/r/285609 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo)
[09:11:42] (03CR) 10Jcrespo: [V: 032] Correcting typo on character_set_filesystem [puppet] - 10https://gerrit.wikimedia.org/r/285609 (https://phabricator.wikimedia.org/T128737) (owner: 10Jcrespo)
[09:11:53] (03PS4) 10Giuseppe Lavagetto: mediawiki: remove decommissioned appservers [puppet] - 10https://gerrit.wikimedia.org/r/285604 (https://phabricator.wikimedia.org/T126242)
[09:12:22] and cp3038 is not in ganglia :-(
[09:13:08] <_joe_> akosiaris: wat?
[09:13:21] https://grafana.wikimedia.org/dashboard/db/server-board?var-server=cp3038&var-network=eth0
[09:13:29] that would graphite
[09:13:32] it is in graphite indeed
[09:13:33] yes
[09:13:35] but not ganglia :-(
[09:13:52] I wasn't disputing your fact, only trying to be helpful :-)
[09:14:17] <_joe_> akosiaris: none of the esams upload caches is
[09:14:18] <_joe_> wtf
[09:14:19] cool. thanks :-)
[09:14:27] _joe_: some misconfiguration.. what else ?
[09:14:38] <_joe_> akosiaris: yes I think so
[09:14:40] so, neither graphite has anything that would even hint into what happened
[09:14:42] <_joe_> or the daemon died
[09:15:51] (03PS11) 10Alexandros Kosiaris: Read config from cxserver [puppet] - 10https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) (owner: 10KartikMistry)
[09:16:00] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Read config from cxserver [puppet] - 10https://gerrit.wikimedia.org/r/284654 (https://phabricator.wikimedia.org/T122498) (owner: 10KartikMistry)
[09:16:05] <_joe_> oh man
[09:16:11] <_joe_> you merge-snooped me
[09:16:20] :-D
[09:16:21] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: remove decommissioned appservers [puppet] - 10https://gerrit.wikimedia.org/r/285604 (https://phabricator.wikimedia.org/T126242) (owner: 10Giuseppe Lavagetto)
[09:16:28] (03PS5) 10Giuseppe Lavagetto: mediawiki: remove decommissioned appservers [puppet] - 10https://gerrit.wikimedia.org/r/285604 (https://phabricator.wikimedia.org/T126242)
[09:16:38] (03CR) 10Giuseppe Lavagetto: [V: 032] mediawiki: remove decommissioned appservers [puppet] - 10https://gerrit.wikimedia.org/r/285604 (https://phabricator.wikimedia.org/T126242) (owner: 10Giuseppe Lavagetto)
[09:17:03] akosiaris: it should take 10 minutes, right?
[09:17:22] kart_: depends on the machine. 30 max
[09:17:30] but I am forcing a puppet run anyway
[09:18:13] kart_: ok done. check that everything is working as expected
[09:18:40] (03PS1) 10Muehlenhoff: Add salt grains for debug proxies and wire up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/285610
[09:18:42] (03PS1) 10Muehlenhoff: Drop debdeploy-ipsec-test:standard server group in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/285611
[09:19:26] akosiaris: checking.
[09:19:34] _joe_: ok I got something... kernel stacktrace
[09:19:44] akosiaris: I don't need to restart service, right?
[09:19:52] old relatively... Yesterday at 20:02 UTC. TCP related
[09:19:58] kart_: no you dont
[09:20:26] okay.
[09:20:33] Seems good then!
[09:20:44] took months, but finally.
[09:21:02] _joe_: http://p.defau.lt/?grPKxDfSBbUydkWiDY_vgQ
[09:21:06] kart_: yay!
[09:21:11] happy we are finally done with it [09:22:20] (03PS2) 10Muehlenhoff: Add salt grains for debug proxies and wire up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/285610 [09:22:30] RECOVERY - NTP on cp3038 is OK: NTP OK: Offset -5.888938904e-05 secs [09:22:42] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add salt grains for debug proxies and wire up in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/285610 (owner: 10Muehlenhoff) [09:23:40] <_joe_> !log restarted hhvm on mw1144, usual deadlock [09:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:23:49] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.702 second response time [09:24:32] (03PS2) 10Muehlenhoff: Drop debdeploy-ipsec-test:standard server group in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/285611 [09:24:39] (03CR) 10Muehlenhoff: [C: 032 V: 032] Drop debdeploy-ipsec-test:standard server group in debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/285611 (owner: 10Muehlenhoff) [09:24:48] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [09:25:03] 06Operations, 10Traffic, 13Patch-For-Review: Letsencrypt all the prod things we can - planning - https://phabricator.wikimedia.org/T133717#2242275 (10fgiunchedi) p:05Triage>03Normal [09:25:18] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: reclaim or decom: cp1043 + cp1044 - https://phabricator.wikimedia.org/T133614#2242276 (10fgiunchedi) p:05Triage>03Normal [09:25:38] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 66673 bytes in 0.398 second response time [09:25:44] (03CR) 10Liuxinyu970226: [C: 031] Redirect yue.wikipedia.org to zh-yue.wikipedia.org for now [puppet] - 10https://gerrit.wikimedia.org/r/285086 (https://phabricator.wikimedia.org/T105999) (owner: 10Alex Monk) [09:25:57] (03CR) 10Liuxinyu970226: [C: 031] Set up yue.wikipedia.org DNS record [dns] - 10https://gerrit.wikimedia.org/r/285085 (https://phabricator.wikimedia.org/T105999) (owner: 10Alex Monk) [09:30:13] (03PS1) 10DCausse: Fix unicast hosts for elastic in codfw [puppet] - 10https://gerrit.wikimedia.org/r/285612 [09:30:57] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 07Elasticsearch: Collect metrics on pool counter usage - https://phabricator.wikimedia.org/T130617#2242287 (10fgiunchedi) p:05Triage>03Normal [09:33:22] 06Operations: Grafana: Job Queue Health: Panel is displayed incorrectly - https://phabricator.wikimedia.org/T130512#2242294 (10fgiunchedi) p:05Triage>03Low seems to render fine for me, are you still seeing the same @Luke081515 ? [09:34:44] PROBLEM - mediawiki-installation DSH group on mw1130 is CRITICAL: Host mw1130 is not in mediawiki-installation dsh group [09:35:29] PROBLEM - mysqld processes on labservices1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [09:36:02] new install [09:36:29] ack [09:36:46] I found a problem here: https://phabricator.wikimedia.org/T128737#2242263 [09:38:20] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) - https://phabricator.wikimedia.org/T132751#2242297 (10Gehel) 05Resolved>03Open [09:38:43] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: upgrade HHVM to 3.12.1 on terbium (Elastica: missing curl_init_pooled method.) 
- https://phabricator.wikimedia.org/T132751#2209239 (10Gehel) I can still see the same error message in the logs. Need to investigate... [09:39:31] ACKNOWLEDGEMENT - mysqld processes on labservices1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Volans T128737 [09:40:08] forgot it pages on ack... I should have put disable notification too, sorry [09:40:13] you should not have acked until I disabled it [09:42:14] PROBLEM - ElasticSearch health check for shards on elastic2022 is CRITICAL: CRITICAL - elasticsearch http://10.192.48.34:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.48.34, port=9200): Read timed out. (read timeout=4) [09:43:46] (03CR) 10Gehel: [C: 032] Fix unicast hosts for elastic in codfw [puppet] - 10https://gerrit.wikimedia.org/r/285612 (owner: 10DCausse) [09:46:42] (03PS2) 10Yuvipanda: Use IP rather than hostname for registering services [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285599 [09:46:44] (03PS4) 10Yuvipanda: Rename websevice-new to webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285590 [09:46:47] volans, fixed on the other 2 servers, but holmium will page again when the current mysqld fails [09:47:00] s/fails/stops/ [09:47:08] and I suppose it will be stopped [09:47:27] we can disable the notifications before stopping it? [09:47:35] I already did [09:47:49] oh, no, only on 1001 [09:48:25] (03CR) 10Yuvipanda: [C: 032] Don't barf on --release for non-lighttpd usage [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285582 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [09:48:39] (03CR) 10Yuvipanda: [C: 032] Implement webservice 'status' command [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285583 (owner: 10Yuvipanda) [09:51:30] (03PS5) 10Yuvipanda: Rename websevice-new to webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285590 [09:51:32] (03PS1) 10Yuvipanda: Set version even if only just restarting [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285614 [09:52:25] (03CR) 10Yuvipanda: [C: 032] Fix job_running check to match old webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285589 (owner: 10Yuvipanda) [09:52:45] RECOVERY - ElasticSearch health check for shards on elastic2022 is OK: OK - elasticsearch status production-search-codfw: status: yellow, number_of_nodes: 24, unassigned_shards: 2, number_of_pending_tasks: 6505, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3118, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 9387, initializing_shards: 20, number_of_data_nodes: 24, delayed_unassign [09:53:09] (03CR) 10Yuvipanda: [C: 032] Set version even if only just restarting [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285614 (owner: 10Yuvipanda) [09:53:24] (03PS1) 10Alexandros Kosiaris: service::node: only rotate log files [puppet] - 10https://gerrit.wikimedia.org/r/285615 [09:54:36] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [09:54:44] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [09:56:30] <_joe_> !log clean puppet certs and facts on mw10[7-8][0-9] and mw112[1-9]/mw1130 for T126242 [09:56:31] T126242: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242 [09:56:37] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:58:54] (03CR) 10Yuvipanda: [C: 032] Add support for uwsgi-plain webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285588 (owner: 10Yuvipanda) [09:59:25] PROBLEM - puppet last run on elastic2004 is CRITICAL: CRITICAL: puppet fail [09:59:55] PROBLEM - puppet last run on elastic2005 is CRITICAL: CRITICAL: puppet fail [10:01:34] <_joe_> !log shutting down mw10[7-8][0-9] and mw112[1-9]/mw1130 for T126242 [10:01:35] T126242: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242 [10:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:03:09] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2242346 (10elukey) Removed some zero sized logs on scb1002 causing: ``` root@scb1002.eqiad.wmnet via wikimedia.org to root /etc/cron.daily/logrotate: error: error renaming /srv/log/mobil... [10:09:06] 06Operations, 07Puppet, 06Commons, 10Wikimedia-SVG-rendering, and 2 others: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2242350 (10hashar) 05Resolved>03Open Since `fonts-gujr-extra` got installed via https://gerrit.wikimedia.org/r/#/c/284655/2/modules/mediawiki/m... [10:13:06] (03PS1) 10Giuseppe Lavagetto: mediawiki: re-add mw1120, now as a canary [puppet] - 10https://gerrit.wikimedia.org/r/285619 [10:13:12] (03CR) 10Hashar: [C: 04-1] "Cherry picked on puppet master and it fails with:" [puppet] - 10https://gerrit.wikimedia.org/r/285526 (owner: 10Hashar) [10:13:47] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: re-add mw1120, now as a canary [puppet] - 10https://gerrit.wikimedia.org/r/285619 (owner: 10Giuseppe Lavagetto) [10:13:56] (03CR) 10Giuseppe Lavagetto: [V: 032] mediawiki: re-add mw1120, now as a canary [puppet] - 10https://gerrit.wikimedia.org/r/285619 (owner: 10Giuseppe Lavagetto) [10:16:12] 06Operations: "Unable to connect to redis server" log spam - https://phabricator.wikimedia.org/T130078#2242360 (10fgiunchedi) p:05Triage>03Normal also it seems load-related as there is a daily pattern for "unable to connect to redis server" {F3936027} [10:17:24] RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:18:07] hashar: FYI I merged all of puppet swat but https://gerrit.wikimedia.org/r/#/c/276346/ which IMO should get some +1s [10:18:25] godog: noticed that thanks a lot [10:18:49] yeah the hiera_lookup is not much of a concern. I have hacked it while I was polishing hiera support in Nodepool instances [10:18:51] (03CR) 10JanZerebecki: [C: 031] phragile: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285333 (owner: 10Dzahn) [10:18:51] it can wait [10:19:08] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 07Elasticsearch: Have dedicated master nodes for elasticsearch - https://phabricator.wikimedia.org/T130590#2242365 (10fgiunchedi) p:05Triage>03Normal [10:20:07] 06Operations, 06Commons, 10MediaWiki-Page-deletion: API request failed (internal_api_error_MWException): [408e8b0f] Exception Caught: Could not acquire lock for 'Full_size_20150703094950ನಿಲ್ಲದ_ಬರವಣಿಗೆ.jpg.' 
- https://phabricator.wikimedia.org/T130359#2242368 (10fgiunchedi) [10:20:10] 06Operations, 06Commons, 10MediaWiki-Page-deletion, 10media-storage, 13Patch-For-Review: Unable to delete file pages on commons: "Could not acquire lock" - https://phabricator.wikimedia.org/T132921#2242369 (10fgiunchedi) [10:20:34] PROBLEM - ElasticSearch health check for shards on elastic2023 is CRITICAL: CRITICAL - elasticsearch http://10.192.48.35:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.48.35, port=9200): Read timed out. (read timeout=4) [10:20:34] PROBLEM - ElasticSearch health check for shards on elastic2001 is CRITICAL: CRITICAL - elasticsearch http://10.192.0.130:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.0.130, port=9200): Read timed out. (read timeout=4) [10:20:42] 06Operations, 06Commons, 10MediaWiki-Page-deletion: API request failed (internal_api_error_MWException): [408e8b0f] Exception Caught: Could not acquire lock for 'Full_size_20150703094950ನಿಲ್ಲದ_ಬರವಣಿಗೆ.jpg.' - https://phabricator.wikimedia.org/T130359#2133678 (10fgiunchedi) looks like a dup of T132921 ? [10:21:13] Heads up, the codfw elasticsearch cluster is behaving erratically [10:21:35] We are going to switch the morelike traffic back to eqiad to give us time to understand what's happening [10:22:33] RECOVERY - ElasticSearch health check for shards on elastic2023 is OK: OK - elasticsearch status production-search-codfw: status: red, number_of_nodes: 23, unassigned_shards: 863, number_of_pending_tasks: 3106, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3117, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 8487, initializing_shards: 64, number_of_data_nodes: 23, delayed_unassigne [10:22:34] RECOVERY - ElasticSearch health check for shards on elastic2001 is OK: OK - elasticsearch status production-search-codfw: status: red, number_of_nodes: 23, unassigned_shards: 856, number_of_pending_tasks: 3250, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3118, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 8490, initializing_shards: 69, number_of_data_nodes: 23, delayed_unassigne [10:22:44] RECOVERY - puppet last run on elastic2005 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [10:23:12] (03PS1) 10DCausse: Reroute morelike queries to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285620 [10:23:22] (03PS1) 10Muehlenhoff: Only install font-gujr-extra on jessie [puppet] - 10https://gerrit.wikimedia.org/r/285621 (https://phabricator.wikimedia.org/T129500) [10:23:24] PROBLEM - ElasticSearch health check for shards on elastic2013 is CRITICAL: CRITICAL - elasticsearch http://10.192.32.118:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.32.118, port=9200): Read timed out. (read timeout=4) [10:23:41] (03PS3) 10Hashar: hhvm: log dir creation requires rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/285526 [10:23:53] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2242379 (10JanZerebecki) @Chmarkine Yes, please do so.
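For reference: the icinga checks above are thin wrappers around Elasticsearch's `_cluster/health` API, so the same numbers can be pulled by hand from any node in the cluster. A minimal sketch using one of the codfw addresses from the alerts above; the `jq` field selection is purely illustrative:
```bash
# Ask one node for cluster-wide health; this is what the icinga check fetches.
# The 4-second timeout mirrors the check's own read timeout.
curl -s --max-time 4 'http://10.192.48.35:9200/_cluster/health' \
  | jq '{status, number_of_nodes, active_shards, unassigned_shards, initializing_shards}'
```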
[10:25:42] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2242386 (10MoritzMuehlenhoff) Let's also update the description while we're at it :-) "Wikimedia's Ubuntu Archive mirror in Tampa, Florida" [10:26:41] (03CR) 10Hashar: "To solve the dependency loop I have changed rsyslog::conf['hhvm'] dependency:" [puppet] - 10https://gerrit.wikimedia.org/r/285526 (owner: 10Hashar) [10:27:34] RECOVERY - ElasticSearch health check for shards on elastic2013 is OK: OK - elasticsearch status production-search-codfw: status: yellow, number_of_nodes: 24, unassigned_shards: 536, number_of_pending_tasks: 12383, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 8807, initializing_shards: 72, number_of_data_nodes: 24, delayed_unass [10:27:57] <_joe_> gehel: us traffic switched over to eqiad? [10:28:23] _joe_: ? [10:28:30] <_joe_> *is [10:28:44] (03CR) 10Gehel: [C: 032] Reroute morelike queries to eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285620 (owner: 10DCausse) [10:28:48] <_joe_> you said you'll switch [10:28:50] <_joe_> oh ok [10:28:51] <_joe_> :) [10:28:55] almost... [10:30:07] !log switching elasticsearch morelike traffic from codfw to eqiad [10:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:30:15] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2242401 (10BBlack) According to [[ https://letsencrypt.org/upcoming-features/ | https://letsencrypt.org/upcoming-features/ ]], they don't yet have [[ https://... [10:30:31] (03CR) 10Hashar: "mw1070 got removed in between apparently thus I compiled against mw1090.eqiad.wmnet https://puppet-compiler.wmflabs.org/2585/" [puppet] - 10https://gerrit.wikimedia.org/r/285526 (owner: 10Hashar) [10:31:13] Reedy: what's the script to purge old branches in mediawiki-staging you mentioned in T130317 btw? 
[10:31:14] T130317: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317 [10:31:32] Errm [10:32:30] godog: There's scap-purge-l10n-cache for the l10n cache specifically [10:33:09] And /srv/mediawiki-staging/multiversion/deleteMediaWiki for deleting the whole branch [10:33:17] purging l10n is done fairly quickly after [10:33:24] The rest of the files stay around a month or so [10:34:40] (03CR) 10Yuvipanda: [C: 032] Rename websevice-new to webservice [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285590 (owner: 10Yuvipanda) [10:35:52] Reedy: thanks, that makes sense, looking at PurgeL10nCache in scap [10:36:07] !log gehel@tin Synchronized wmf-config/CirrusSearch-production.php: (no message) (duration: 02m 47s) [10:36:10] That seems the most obvious way of doing it [10:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:36:25] purge from /srv/mediawiki-staging, purge from l10nupdate folder [10:38:05] 06Operations, 10ops-eqiad: mw1070-89 and mw1121-30 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T133770#2242410 (10Joe) [10:38:38] 06Operations, 13Patch-For-Review, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2008460 (10Joe) I think I can close this task as resolved, the subtasks aren't real blockers, more of "related tickets" [10:38:46] 06Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#2242429 (10Joe) [10:38:47] (03PS1) 10Yuvipanda: Use webservice not webservice-new [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/285624 [10:38:48] 06Operations, 13Patch-For-Review, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Reduce the number of appservers we're using in eqiad - https://phabricator.wikimedia.org/T126242#2242427 (10Joe) 05Open>03Resolved [10:39:13] 06Operations, 10ops-eqiad: mw1070-89 and mw1121-30 are shut down and should be physically decommissioned - https://phabricator.wikimedia.org/T133770#2242410 (10Joe) a:05Joe>03None [10:43:22] (03PS2) 10Yuvipanda: Use webservice not webservice-new [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/285624 (https://phabricator.wikimedia.org/T98440) [10:45:14] (03PS9) 10Hashar: contint: Use slave-scripts/bin/php wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [10:46:39] (03CR) 10Hashar: [C: 031] "Simply rebased. That is solely for contint on labs instances." [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [10:49:19] !log restarting elastic on elastic2001.codfw.wmnet [10:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:50:05] expect some alerts from codfw (sorry) but no more traffic is served by this cluster ^ [10:51:11] (03CR) 10Hashar: [C: 031] "Looks fine to me. Be bold?"
[puppet] - 10https://gerrit.wikimedia.org/r/285363 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [10:52:00] (03CR) 10Paladox: [C: 031] contint: Use slave-scripts/bin/php wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [10:52:00] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2242447 (10BBlack) Also, on the SAN list length limits, LE has this to say: https://community.letsencrypt.org/t/sans-per-cert-and-sni-for-hosting-service/5105... [10:53:13] Reedy: sigh, except that scap doesn't seem to know about /var/lib/l10nupdate, I'll update the ticket with the findings tho [10:53:48] Hmm. Not at all? [10:55:16] (03CR) 10Yuvipanda: [C: 032] Use IP rather than hostname for registering services [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285599 (owner: 10Yuvipanda) [10:56:36] (03CR) 10Jakob: [C: 031] phragile: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285333 (owner: 10Dzahn) [10:58:30] Reedy: no, it uses stage_dir/php-/cache/l10n/ [10:58:35] 06Operations, 10MediaWiki-JobQueue, 10Monitoring, 13Patch-For-Review: Redis monitoring needs to be improved - https://phabricator.wikimedia.org/T133179#2242452 (10Joe) a:05Joe>03None [10:58:46] !log reedy@tin LocalisationUpdate failed: git pull of core failed [10:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:58:55] !log reedy@tin LocalisationUpdate failed: git pull of core failed [10:59:25] ugh [10:59:26] fail [10:59:34] (03PS1) 10Muehlenhoff: Update debdeploy config for varnish maps clusters in codfw, esams and ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/285625 [10:59:41] ^ ignore those [11:00:26] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2242454 (10BBlack) We also need to decide on a data model, and especially about what kinds of hostnames we're going to support for the redirect domains. We ca... [11:00:51] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2242455 (10BBlack) [11:01:07] godog: I guess, it could be blocked against the task to convert l10nupdate to scap [11:02:27] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2235376 (10BBlack) [11:03:28] Reedy: is there one already? [11:03:46] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update debdeploy config for varnish maps clusters in codfw, esams and ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/285625 (owner: 10Muehlenhoff) [11:04:17] Hmm. https://phabricator.wikimedia.org/T72443 [11:04:19] But fixed? [11:04:48] sync-l10n was moved to scap.. [11:05:45] !log restarting elastic on elastic2002.codfw.wmnet [11:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:06:18] Reedy: possibly the caches in /var/lib keep being updated but never used/read ? 
[11:06:35] No, they're still being used [11:06:48] See /usr/local/bin/l10nupdate-1 [11:09:22] (03CR) 10KartikMistry: [C: 031] Only install font-gujr-extra on jessie [puppet] - 10https://gerrit.wikimedia.org/r/285621 (https://phabricator.wikimedia.org/T129500) (owner: 10Muehlenhoff) [11:09:55] Reedy: gah, ok thanks [11:10:44] godog: That script *could* be used to do it too [11:10:56] if it pulls from activeMWVersions, but barf [11:12:23] Reedy: it == cleanup old caches? [11:12:31] yeah [11:12:35] it runs daily [11:12:43] As part of it, look for non used MW versions, delete dirs if exist [11:13:08] but seems stupid [11:13:27] yeah the git log of l10nupdate-1 has made my eyes bleed, better not [11:16:10] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2132321 (10fgiunchedi) notes from looking into this with @Reedy on irc: * scap doesn't seem to know about `/var/lib/l10nupdate` but instead it drops cdb files... [11:16:29] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2242482 (10fgiunchedi) p:05Triage>03Normal [11:19:05] 06Operations: conftool-merge should report which node is setting attributes for - https://phabricator.wikimedia.org/T129847#2242500 (10fgiunchedi) p:05Triage>03Low [11:25:51] (03PS1) 10BBlack: LE: bugfix for acme-setup SAN list checks [puppet] - 10https://gerrit.wikimedia.org/r/285628 [11:26:27] (03CR) 10BBlack: [C: 032 V: 032] LE: bugfix for acme-setup SAN list checks [puppet] - 10https://gerrit.wikimedia.org/r/285628 (owner: 10BBlack) [11:27:38] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 1 failures [11:29:43] 06Operations: replace bast3001 with newer hardware - https://phabricator.wikimedia.org/T131562#2242526 (10fgiunchedi) p:05Triage>03Normal [11:29:48] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:30:58] 06Operations, 07Performance: Package and deploy Mcrouter as a replacement for twemproxy - https://phabricator.wikimedia.org/T132317#2242527 (10fgiunchedi) p:05Triage>03Normal [11:31:51] i'm having a lot of trouble with the mailinglists.... [11:32:24] i receive stuff from lists that i'm already subscribed to [11:33:08] (03CR) 10BBlack: [C: 031] Modify the default Varnish error page to increase visibility of error messages. [puppet] - 10https://gerrit.wikimedia.org/r/285363 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [11:33:28] but when it comes to subscribing to new lists, or trying to change my password, or basically any other administrative action, it seems the (confirmation/notification) mails being sent to me disappear in a black hole [11:34:57] like i just requested a password reminder for one of the accounts, and haven't received it, though i receive other mails from that list. [11:35:25] and i've been trying to subscribe incidentally to the video-l list for 2 years and have never succeeded. [11:35:27] (03PS2) 10Elukey: Modify the default Varnish error page to increase visibility of error messages. [puppet] - 10https://gerrit.wikimedia.org/r/285363 (https://phabricator.wikimedia.org/T76348) [11:36:24] thedj: which email address are those being sent to?
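For reference, the cleanup Reedy and godog sketch above (derive the active MediaWiki versions, then drop leftover l10n cache directories for anything else) could look roughly like this. The cache layout under /var/lib/l10nupdate and the exact output of activeMWVersions are assumptions here, not verified details:
```bash
#!/bin/bash
# Hypothetical cleanup of stale l10nupdate caches, per the discussion above.
# Assumes one cache directory per branch and that activeMWVersions prints
# the currently deployed version numbers.
active=$(/srv/mediawiki-staging/multiversion/activeMWVersions)
for dir in /var/lib/l10nupdate/caches/cache-*; do
    version=${dir##*cache-}
    if ! grep -qFw "$version" <<<"$active"; then
        echo "would remove $dir"    # dry run; swap for: rm -rf "$dir"
    fi
done
```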
[11:37:05] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 13Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2242535 (10hashar) deployment-cache-text04 had the issue again. So at f... [11:39:33] (03CR) 10Elukey: [C: 032] Modify the default Varnish error page to increase visibility of error messages. [puppet] - 10https://gerrit.wikimedia.org/r/285363 (https://phabricator.wikimedia.org/T76348) (owner: 10Elukey) [11:41:38] 07Puppet, 10Beta-Cluster-Infrastructure, 06Labs, 13Patch-For-Review: /etc/puppet/puppet.conf keeps getting double content - first for labs-wide puppetmaster, then for the correct puppetmaster - https://phabricator.wikimedia.org/T132689#2242540 (10hashar) p:05Triage>03Normal https://gerrit.wikimedia.org... [11:43:07] !log restarting elastic on elastic2003.codfw.wmnet [11:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:45:08] godog: d.j.hartman+wmf_ml with gmail.com [11:52:43] (03PS1) 10Yuvipanda: Log webservice command invocations to EL [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285631 [11:56:40] (03CR) 10Yuvipanda: [C: 032] Log webservice command invocations to EL [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285631 (owner: 10Yuvipanda) [11:59:36] (03CR) 10Hashar: "Cherry picked on beta cluster puppetmaster. Gave it a try on Trusty instance deployment-mediawiki02. fonts-gujr-extra is no more installe" [puppet] - 10https://gerrit.wikimedia.org/r/285621 (https://phabricator.wikimedia.org/T129500) (owner: 10Muehlenhoff) [12:06:12] (03CR) 10Elukey: "Some comments, but looks good! I tried to run the puppet compiler just to check warnings/errors and only got:" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [12:07:06] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:15:12] !log restarting elastic on elastic2004.codfw.wmnet [12:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:26:32] 06Operations, 10Citoid, 06Services, 10VisualEditor: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#2242619 (10mobrovac) >>! In T133696#2241618, @Mvolz wrote: > As I said in chat, we currently look up a pmid for every citation that has a doi. We could easily stop... [12:30:01] (03PS1) 10Muehlenhoff: Avoid cron spam in aead sync [puppet] - 10https://gerrit.wikimedia.org/r/285634 [12:36:35] (03PS2) 10Muehlenhoff: Avoid cron spam in aead sync [puppet] - 10https://gerrit.wikimedia.org/r/285634 [12:38:03] (03CR) 10Muehlenhoff: [C: 032 V: 032] Avoid cron spam in aead sync [puppet] - 10https://gerrit.wikimedia.org/r/285634 (owner: 10Muehlenhoff) [12:39:22] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2242648 (10Cmjohnson) [12:40:16] 06Operations, 10ops-eqiad, 06DC-Ops: eqiad: Rack/Setup 6 new pool servers - https://phabricator.wikimedia.org/T132684#2242662 (10Cmjohnson) Mgmt DNS wmf4746 1H IN A 10.65.7.75 wmf4747 1H IN A 10.65.7.76 wmf4748 1H IN A 10.65.7.77 wmf474... 
[12:42:16] (03Draft2) 10Hashar: apt: make components parameterizable [puppet] - 10https://gerrit.wikimedia.org/r/270872 (https://phabricator.wikimedia.org/T120963) [12:42:28] (03Abandoned) 10Hashar: apt: make components parameterizable [puppet] - 10https://gerrit.wikimedia.org/r/270872 (https://phabricator.wikimedia.org/T120963) (owner: 10Hashar) [12:44:37] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old. [12:45:45] (03PS1) 10Cmjohnson: Removing old lsearch dns and adding 16 new elastic and 6 misc mgmt entries to 10.in and wmnet file. [dns] - 10https://gerrit.wikimedia.org/r/285636 [12:47:24] (03CR) 10Cmjohnson: [C: 032] Removing old lsearch dns and adding 16 new elastic and 6 misc mgmt entries to 10.in and wmnet file. [dns] - 10https://gerrit.wikimedia.org/r/285636 (owner: 10Cmjohnson) [12:48:11] 06Operations, 10ops-eqiad: Rack and Setup new elastic search - https://phabricator.wikimedia.org/T133772#2242691 (10Cmjohnson) Mgmt DNS +elastic1032 1H IN A 10.65.7.1 +WMF3175 1H IN A 10.65.7.1 +elatic1033 1H IN A 10.65.7.2 +WMF3174 1H IN A 10.65.7.2 +elastic1034 1H IN A 10.65.7.3... [12:49:52] A surge of Notice: Undefined index: recentChangesFlagsRaw in /srv/mediawiki/php-1.27.0-wmf.21/includes/changes/EnhancedChangesList.php on line 268; it's just a Notice, but seems new to me [13:12:00] 06Operations, 10Traffic: restrict upload cache access for private wikis - https://phabricator.wikimedia.org/T129839#2242749 (10fgiunchedi) p:05Triage>03Normal [13:13:17] 06Operations, 10Traffic, 10Wikimedia-Shop, 07HTTPS: https://store.wikimedia.org doesn't set HSTS header - https://phabricator.wikimedia.org/T128559#2078914 (10fgiunchedi) hi @Ppena, did shopify come back to you with an answer? thanks! [13:13:38] 06Operations, 10Traffic, 10Wikimedia-Shop, 07HTTPS: https://store.wikimedia.org doesn't set HSTS header - https://phabricator.wikimedia.org/T128559#2242757 (10fgiunchedi) p:05Triage>03Normal [13:13:52] 06Operations, 10RESTBase-Cassandra: cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590#2242758 (10fgiunchedi) p:05Triage>03Low [13:17:07] 06Operations: the centralauth databases is accessible form the mysql shell on terbium only in some cases - https://phabricator.wikimedia.org/T122475#2242772 (10fgiunchedi) 05Open>03Resolved p:05Triage>03Normal a:03fgiunchedi @Amire80 according T122479 you should have access to stat boxes and therefore... [13:17:18] 06Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2242776 (10BBlack) We've basically never configured any websockets stuff through our #Traffic layer before. Phab isn't the only use-case, either. We also have `str...
[13:19:07] 06Operations, 10Wikimedia-Stream: occasional 502 from rcstream seen by pybal - https://phabricator.wikimedia.org/T126313#2242778 (10fgiunchedi) p:05Triage>03Normal [13:19:32] (03PS1) 10Urbanecm: Enable NewUserMessage on hiwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285639 (https://phabricator.wikimedia.org/T133775) [13:22:53] 06Operations, 10Traffic, 10Wikimedia-Shop, 07HTTPS: https://store.wikimedia.org doesn't set HSTS header - https://phabricator.wikimedia.org/T128559#2242787 (10BBlack) [13:22:55] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2242786 (10BBlack) [13:23:33] <_joe_> !log uploading new conftool packages, T128199 [13:23:33] T128199: confctl: improve/upgrade --tags/--find - https://phabricator.wikimedia.org/T128199 [13:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:24:58] !log hard restart of codfw elasticsearch cluster [13:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:25:19] _joe_: \o/ [13:27:04] expect a few alerts from elasticsearch cluster in codfw [13:28:53] <_joe_> bblack: are you using confctl on the varnishes? [13:29:44] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2012.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2010.codfw.wmnet because of too many down! [13:30:00] <_joe_> gehel: heh looks bad in fact :P [13:30:45] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2012.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2010.codfw.wmnet because of too many down! [13:30:47] Ah, I did not think of pybal... yep, they are all down... [13:31:51] PROBLEM - LVS HTTP IPv4 on search.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 341 bytes in 0.303 second response time [13:32:01] <_joe_> gehel: expected? [13:32:02] <_joe_> ^^ [13:32:10] yes [13:32:14] (03PS1) 10Filippo Giunchedi: mariadb: allow up to two eventlogging_sync processes [puppet] - 10https://gerrit.wikimedia.org/r/285640 (https://phabricator.wikimedia.org/T123509) [13:32:24] there is no traffic (except updates) on codfw elasticsearch at the moment [13:32:49] <_joe_> ok cool [13:32:58] <_joe_> bblack: we need to change pool/depool [13:34:11] should I do something special to lvs/pybal during this restart?
[13:35:13] should be enough to downtime the relevant lvs/pybal alerts for search [13:35:44] 06Operations, 10DBA, 10Monitoring, 07Icinga, 13Patch-For-Review: "db1047/eventlogging_sync processes" icinga alert is flaky since at least early January - https://phabricator.wikimedia.org/T123509#2242816 (10fgiunchedi) p:05Triage>03Low [13:36:31] 06Operations: Migrate puppetmaster/backends to jessie - https://phabricator.wikimedia.org/T123730#2242818 (10fgiunchedi) p:05Triage>03Normal see also {T98128} for similar work [13:37:50] (03PS1) 10Muehlenhoff: Update to 3.0.7 (almost) [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/285641 [13:43:31] ACKNOWLEDGEMENT - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2012.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2010.codfw.wmnet because of too many down! Gehel cluster restart [13:43:41] ACKNOWLEDGEMENT - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2012.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2010.codfw.wmnet because of too many down! Gehel elasticsearch cluster restart [13:43:51] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks fine to me. Any objections in merging this ?" [puppet] - 10https://gerrit.wikimedia.org/r/282405 (owner: 10BryanDavis) [13:44:55] 06Operations: Some labvirt systems use qemu from "cloud archive" - https://phabricator.wikimedia.org/T127113#2242825 (10fgiunchedi) p:05Triage>03Normal @Andrew thoughts/preferences on qemu up/down grade? [13:48:52] 06Operations: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#2242841 (10fgiunchedi) p:05Triage>03Normal [13:49:04] (03CR) 10Rush: [C: 031] "good stuff, we could add bug T67270 to it's message" [puppet] - 10https://gerrit.wikimedia.org/r/282405 (owner: 10BryanDavis) [13:50:40] 06Operations, 10ops-esams, 10hardware-requests: decom cp3011-22 (12 machines) - https://phabricator.wikimedia.org/T130883#2242842 (10fgiunchedi) p:05Triage>03Normal [13:52:18] 06Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2242844 (10fgiunchedi) p:05Triage>03Low [13:53:03] !log restarted kafka on kafka1018.eqiad.wmnet for Java upgrades [13:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:53:31] (03PS1) 10Muehlenhoff: Add reference to Phab task [puppet] - 10https://gerrit.wikimedia.org/r/285644 [13:53:36] RECOVERY - LVS HTTP IPv4 on search.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 522 bytes in 0.187 second response time [13:53:40] (03PS2) 10Muehlenhoff: Add reference to Phab task [puppet] - 10https://gerrit.wikimedia.org/r/285644 [13:53:52] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add reference to Phab task [puppet] - 10https://gerrit.wikimedia.org/r/285644 (owner: 10Muehlenhoff) [13:54:39] (03PS1) 10Rush: toollabs limits.conf as a template [puppet] - 10https://gerrit.wikimedia.org/r/285645 (https://phabricator.wikimedia.org/T131541) [13:55:59] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [13:56:20] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [13:56:39] (03CR) 10jenkins-bot: [V: 04-1] toollabs limits.conf as a template [puppet] - 10https://gerrit.wikimedia.org/r/285645 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [13:57:15] 
(03PS1) 10Giuseppe Lavagetto: conftool: upgrade confctl's pool/depool for 0.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/285646 [13:58:10] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: forwarder/legacy-zmq [13:58:27] ---^ this is me, checking [13:59:51] (03CR) 10Jcrespo: [C: 031] mariadb: allow up to two eventlogging_sync processes [puppet] - 10https://gerrit.wikimedia.org/r/285640 (https://phabricator.wikimedia.org/T123509) (owner: 10Filippo Giunchedi) [13:59:53] 06Operations, 06Release-Engineering-Team: Update gerrit sshkey in role::ci::slave::labs when upgrade to Jessie happens - https://phabricator.wikimedia.org/T131903#2242850 (10fgiunchedi) p:05Triage>03Low [14:03:54] (03PS2) 10Rush: toollabs limits.conf as a template [puppet] - 10https://gerrit.wikimedia.org/r/285645 (https://phabricator.wikimedia.org/T131541) [14:04:05] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [14:06:45] (03PS3) 10Rush: dnsmasq: drop beta cluster mobile public IP [puppet] - 10https://gerrit.wikimedia.org/r/283987 (https://phabricator.wikimedia.org/T130473) (owner: 10Hashar) [14:08:56] (03CR) 10Rush: [C: 032] "seems good" [puppet] - 10https://gerrit.wikimedia.org/r/283987 (https://phabricator.wikimedia.org/T130473) (owner: 10Hashar) [14:13:14] PROBLEM - WDQS SPARQL on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 416 bytes in 0.004 second response time [14:14:06] ^ that's me again. WDQS data import not yet finished. I'll reschedule a longer downtime [14:14:24] !log restart phd on iridium as it keeps complaining it lost procs (seems ok now) [14:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:14:34] PROBLEM - WDQS HTTP on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 416 bytes in 0.002 second response time [14:21:27] (03PS2) 10Filippo Giunchedi: mariadb: allow up to two eventlogging_sync processes [puppet] - 10https://gerrit.wikimedia.org/r/285640 (https://phabricator.wikimedia.org/T123509) [14:21:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] mariadb: allow up to two eventlogging_sync processes [puppet] - 10https://gerrit.wikimedia.org/r/285640 (https://phabricator.wikimedia.org/T123509) (owner: 10Filippo Giunchedi) [14:26:33] (03CR) 10Muehlenhoff: "@hashar: That seems like an outdated/unclean config on that deployment-prep host. 
Here's the PCC output for an image scaler in production:" [puppet] - 10https://gerrit.wikimedia.org/r/285621 (https://phabricator.wikimedia.org/T129500) (owner: 10Muehlenhoff) [14:30:00] 06Operations: rsync module doesnt work on trusty - https://phabricator.wikimedia.org/T132532#2242927 (10fgiunchedi) p:05Triage>03Low [14:30:21] 06Operations, 10DBA: Email spam from some MariaDB's logrotate - https://phabricator.wikimedia.org/T127638#2242928 (10fgiunchedi) p:05Triage>03Normal [14:31:25] RECOVERY - cassandra-b CQL 10.64.48.139:9042 on restbase1015 is OK: TCP OK - 0.001 second response time on port 9042 [14:33:48] !log kafka1001.eqiad.wmnet depooled from eventbus for kafka upgrades (via confctl) [14:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:35:22] 06Operations, 06Mobile-Apps, 10Traffic: WikipediaApp for Android hits loads.php on bits.wikimedia.org - https://phabricator.wikimedia.org/T132969#2242936 (10fgiunchedi) p:05Triage>03Normal [14:38:19] <_joe_> !log upgrading conftool on all cp servers [14:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:11] !log restarted kafka on kafka1001 [14:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:49] <_joe_> !log upgraded conftool on palladium [14:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:46] ah ok that's why I got an error :P [14:42:02] <_joe_> elukey: what error? [14:42:33] <_joe_> in pvt too [14:42:53] sure sure I was about to paste [14:42:57] probably my fault [14:43:19] (03PS1) 10Volans: MariaDB: Load and enable semi-sync replication [puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) [14:44:50] !log repooled kafka1001 after upgrades, will do the same procedure to kafka1002 [14:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:21] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 1 failures [14:47:15] (03PS1) 10Giuseppe Lavagetto: conftool: add config for tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/285650 [14:48:22] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: add config for tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/285650 (owner: 10Giuseppe Lavagetto) [14:51:28] 06Operations, 10DBA, 10Monitoring, 07Icinga, 13Patch-For-Review: "db1047/eventlogging_sync processes" icinga alert is flaky since at least early January - https://phabricator.wikimedia.org/T123509#2242985 (10jcrespo) I think this will not completely fix the issue, as it seems the script may still fail du... [14:52:46] !log uploaded pcre 8.31-2ubuntu2.3+wm1 to carbon for trusty-wikimedia (rebuild of latest trusty update with our patch to enable JIT) [14:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:42] 06Operations: Installer issues for notebook1001 & 1002 - https://phabricator.wikimedia.org/T131964#2242989 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi I've changed the boot order on notebook1001 too and that seems to have fixed it, machine is up/provisioned now! 
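For reference, the kafka1001 depool/repool above was done with confctl; at the time of this log the tool still exposed the --tags interface that T128199 set out to improve. A rough sketch of the cycle, where the tag values for the eventbus service are assumptions:
```bash
# Inspect, depool, upgrade, repool one node. Tag values are illustrative.
confctl --tags "dc=eqiad,cluster=kafka,service=eventbus" --action get kafka1001.eqiad.wmnet
confctl --tags "dc=eqiad,cluster=kafka,service=eventbus" --action set/pooled=no kafka1001.eqiad.wmnet
# ... upgrade java, restart kafka, wait for the broker to catch up ...
confctl --tags "dc=eqiad,cluster=kafka,service=eventbus" --action set/pooled=yes kafka1001.eqiad.wmnet
```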
[14:57:22] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team: Write a test to check for clearly bogus hostnames - https://phabricator.wikimedia.org/T133047#2242998 (10fgiunchedi) p:05Triage>03Low [14:57:45] (03PS2) 10Alex Monk: Point dynamicproxy to IPs instead of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) [14:58:09] 06Operations, 10MediaWiki-General-or-Unknown, 10Traffic, 10Wikimedia-General-or-Unknown, 07HTTPS: securecookies - https://phabricator.wikimedia.org/T119570#2242999 (10fgiunchedi) p:05Triage>03Normal [14:58:52] 06Operations: Migrate hydrogen/chromium to jessie - https://phabricator.wikimedia.org/T123727#2243000 (10fgiunchedi) p:05Triage>03Normal [14:58:54] (03CR) 10jenkins-bot: [V: 04-1] Point dynamicproxy to IPs instead of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) (owner: 10Alex Monk) [14:59:57] (03CR) 10Yuvipanda: [C: 031] toollabs limits.conf as a template [puppet] - 10https://gerrit.wikimedia.org/r/285645 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [15:00:04] anomie ostriches thcipriani marktraceur Krenair aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160427T1500). Please do the needful. [15:00:04] Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:17] (03PS3) 10Yuvipanda: toollabs: Use a template for limits.conf [puppet] - 10https://gerrit.wikimedia.org/r/285645 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [15:00:19] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team: Write a test to check for clearly bogus hostnames - https://phabricator.wikimedia.org/T133047#2218145 (10hashar) We have the `operations-puppet-typos` Jenkins job which reads from `/typos` but rely on `fgrep`: ``` fgrep -r --color=alw... [15:00:26] 06Operations, 10ops-eqiad: db1046.eqiad.wmnet: slot=3 failed - https://phabricator.wikimedia.org/T132917#2243003 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi [15:01:14] (03CR) 10Alex Monk: "wtf?" 
[puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) (owner: 10Alex Monk) [15:01:28] (03PS3) 10Alex Monk: Point dynamicproxy to IPs instead of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) [15:01:51] Urbanecm: ping for SWAT [15:01:56] I'm here [15:03:28] 06Operations: Set jessie as the default os installer on network boot and manually mark other versions (precise, trusty) - https://phabricator.wikimedia.org/T133539#2243008 (10fgiunchedi) p:05Triage>03Normal [15:03:36] (03CR) 10jenkins-bot: [V: 04-1] Point dynamicproxy to IPs instead of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) (owner: 10Alex Monk) [15:03:43] (03PS6) 10Thcipriani: Add Subject namespace to hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) (owner: 10Urbanecm) [15:03:57] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) (owner: 10Urbanecm) [15:03:57] 06Operations, 10Continuous-Integration-Config, 06Release-Engineering-Team: Write a test to check for clearly bogus hostnames - https://phabricator.wikimedia.org/T133047#2243012 (10hashar) There is also https://www.npmjs.com/package/grunt-tyops / https://github.com/jdforrester/grunt-tyops which is a Grunt ta... [15:04:36] (03Merged) 10jenkins-bot: Add Subject namespace to hiwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285008 (https://phabricator.wikimedia.org/T133440) (owner: 10Urbanecm) [15:06:58] 06Operations, 10Traffic, 07HTTPS: Secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2243021 (10fgiunchedi) p:05Triage>03Normal [15:07:00] 06Operations, 06Discovery, 10Maps: Switch Maps to production status - https://phabricator.wikimedia.org/T133744#2243020 (10fgiunchedi) p:05Triage>03Normal [15:07:04] 06Operations, 10netops: HTCP purges flood across CODFW - https://phabricator.wikimedia.org/T133387#2243023 (10fgiunchedi) p:05Triage>03Normal [15:07:06] 06Operations, 10ops-codfw: ms-be2007.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T133517#2243022 (10fgiunchedi) p:05Triage>03Normal [15:07:09] 06Operations, 10Monitoring: High levels of PoolCounter errors should trigger alerts - https://phabricator.wikimedia.org/T133318#2243024 (10fgiunchedi) p:05Triage>03Normal [15:07:11] 06Operations, 10MediaWiki-General-or-Unknown, 10Monitoring: edit.success in graphite never reached zero during codfw switchover - https://phabricator.wikimedia.org/T133177#2243025 (10fgiunchedi) p:05Triage>03Normal [15:07:13] 06Operations, 05codfw-rollout: test2wiki has no recent changes before the 20 april - https://phabricator.wikimedia.org/T133225#2243026 (10fgiunchedi) p:05Triage>03Normal [15:07:15] 06Operations, 10RESTBase-Cassandra: service cassandra-b fails on restbase2004 - https://phabricator.wikimedia.org/T132999#2243027 (10fgiunchedi) p:05Triage>03Normal [15:07:19] 06Operations, 07Documentation: Write documentation on how / when to use custom Diamond metrics collectors - https://phabricator.wikimedia.org/T132856#2243028 (10fgiunchedi) p:05Triage>03Normal [15:07:21] 06Operations, 10VisualEditor experimentation: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2243029 (10fgiunchedi) p:05Triage>03Normal [15:07:22] 06Operations, 10Monitoring: [RFC] Alert about 
*when* partitions will run out of space, not a percentage/absolute number - https://phabricator.wikimedia.org/T126158#2243031 (10fgiunchedi) p:05Triage>03Normal [15:07:24] 06Operations, 10DBA: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T129452#2243030 (10fgiunchedi) p:05Triage>03Normal [15:07:26] 06Operations: Upgrade fluorine to trusty/jessie - https://phabricator.wikimedia.org/T123728#2243032 (10fgiunchedi) p:05Triage>03Normal [15:07:28] 06Operations: Security audit for tftp on Carbon - https://phabricator.wikimedia.org/T122210#2243033 (10fgiunchedi) p:05Triage>03Normal [15:07:45] (03PS1) 10Giuseppe Lavagetto: tcpircbot: allow sending messages from palladium [puppet] - 10https://gerrit.wikimedia.org/r/285653 [15:09:26] hmm is mw1070 having issues? scap is stuck syncing to 1 proxy, it looks like mw1070. [15:09:35] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add Subject namespace to hiwikibooks [[gerrit:285008]] (duration: 02m 41s) [15:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:50] ^ Urbanecm check please [15:10:36] (03PS1) 10Alex Monk: Set up Let's Encrypt certificate for labtestwikitech [puppet] - 10https://gerrit.wikimedia.org/r/285654 (https://phabricator.wikimedia.org/T133167) [15:10:38] thcipriani: _joe_ has decommissioned a bunch of eqiad app servers earlier today [15:10:51] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:11:14] including mw1070 apparently. https://gerrit.wikimedia.org/r/#/c/285604/ https://phabricator.wikimedia.org/T126242 [15:11:24] Thcipriani: It works. Thanks. [15:11:31] so what ever scap is using as a reference seems to be outdated [15:11:44] there is still hieradata/common/scap/dsh.yaml: - "mw1070.eqiad.wmnet" # B6 [15:12:34] blerg. Need to update /etc/dsh/group/scap-proxies yeah, that hieradata file is probably where it needs to happen. [15:12:40] <_joe_> hashar: sigh I forgot to remove it from the proxies [15:12:45] <_joe_> thcipriani: sorry man [15:12:50] <_joe_> I will fix it now [15:13:11] _joe_: thanks [15:13:16] we might need another scap proxy for B6 though seems all B6 app servers got removed [15:13:48] would be nice to have the raw number in hostname like mw1070-b6 :D [15:14:52] (03PS1) 10Giuseppe Lavagetto: scap: remove decommissioned scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/285655 [15:15:00] (03PS2) 10Alex Monk: Set up Let's Encrypt certificate for labtestwikitech [puppet] - 10https://gerrit.wikimedia.org/r/285654 (https://phabricator.wikimedia.org/T133167) [15:15:02] <_joe_> hashar: I think we're ok [15:15:28] ���� [15:15:29] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] scap: remove decommissioned scap proxy [puppet] - 10https://gerrit.wikimedia.org/r/285655 (owner: 10Giuseppe Lavagetto) [15:15:37] <_joe_> akosiaris: that you ^^ [15:15:46] <_joe_> (logmsgbot spitting nonsense) [15:16:13] yup [15:16:30] <_joe_> :P [15:16:43] <_joe_> thcipriani: running puppet on tin [15:17:15] kk [15:18:46] (03CR) 10Alex Monk: "checked with puppet-compiler: https://puppet-compiler.wmflabs.org/2589/" [puppet] - 10https://gerrit.wikimedia.org/r/285654 (https://phabricator.wikimedia.org/T133167) (owner: 10Alex Monk) [15:18:54] Urbanecm: I'm a little wary of the DynamicPageList extension for tewiki. 
There is an open performance ticket about that extension that is blocking it being enabled on some other wikis: https://phabricator.wikimedia.org/T124841 [15:20:19] (03PS5) 10BryanDavis: Add .mailmap to cleanup duplicate authors [puppet] - 10https://gerrit.wikimedia.org/r/282405 (https://phabricator.wikimedia.org/T67270) [15:23:19] Thcipriani: Thanks for the link. This extension is already enabled on 25 wikis. But if you think that it's impossible to have this ext at another wiki, I think that we can decline this patch and wait until DPL is rewritten. By the way, why is T124841 stalled at this moment? I can't find any reason for it. [15:23:19] T124841: Performance review of DynamicPageList - https://phabricator.wikimedia.org/T124841 [15:23:22] (03PS2) 10Thcipriani: Enable NewUserMessage on hiwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285639 (https://phabricator.wikimedia.org/T133775) (owner: 10Urbanecm) [15:25:15] Urbanecm: I can't speak to why it is stalled. I would suggest adding T124841 as a blocker for the tewiki task, then asking for an update on that ticket as to the current status. [15:25:15] T124841: Performance review of DynamicPageList - https://phabricator.wikimedia.org/T124841 [15:25:28] (03CR) 10Alex Monk: "15:02:39 ERROR: unknown environment 'pep8'" [puppet] - 10https://gerrit.wikimedia.org/r/285332 (owner: 10Yuvipanda) [15:25:50] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285639 (https://phabricator.wikimedia.org/T133775) (owner: 10Urbanecm) [15:26:02] Ok, I'll ask. [15:26:09] (03CR) 10Alex Monk: "ah, wrong test. still get the previous error" [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) (owner: 10Alex Monk) [15:26:15] thank you :) [15:26:17] (03Merged) 10jenkins-bot: Enable NewUserMessage on hiwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285639 (https://phabricator.wikimedia.org/T133775) (owner: 10Urbanecm) [15:26:31] (03CR) 10Alexandros Kosiaris: [C: 032] Add .mailmap to cleanup duplicate authors [puppet] - 10https://gerrit.wikimedia.org/r/282405 (https://phabricator.wikimedia.org/T67270) (owner: 10BryanDavis) [15:26:38] (03PS6) 10Alexandros Kosiaris: Add .mailmap to cleanup duplicate authors [puppet] - 10https://gerrit.wikimedia.org/r/282405 (https://phabricator.wikimedia.org/T67270) (owner: 10BryanDavis) [15:26:40] (03PS1) 10Yuvipanda: tools: Switch to newer webservice comand [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) [15:26:42] (03CR) 10Alexandros Kosiaris: [V: 032] Add .mailmap to cleanup duplicate authors [puppet] - 10https://gerrit.wikimedia.org/r/282405 (https://phabricator.wikimedia.org/T67270) (owner: 10BryanDavis) [15:27:20] 06Operations, 07Puppet: Reboot during puppet run causes /var/lib/puppet/state/agent_catalog_run.lock to be left and puppet to not start running again - https://phabricator.wikimedia.org/T127602#2243074 (10fgiunchedi) p:05Triage>03Normal I don't remember experiencing this very often in production, anyways r...
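T127602 above is the classic stale-lockfile failure mode: a reboot mid-run leaves agent_catalog_run.lock behind and the agent refuses to start again. A minimal recovery sketch, only safe when no agent process is actually running:
```bash
# Remove the leftover lock only if no puppet agent is active.
lock=/var/lib/puppet/state/agent_catalog_run.lock
if [ -e "$lock" ] && ! pgrep -f 'puppet agent' >/dev/null; then
    rm "$lock"
fi
```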
[15:27:28] PROBLEM - ElasticSearch health check for shards on elastic2001 is CRITICAL: CRITICAL - elasticsearch inactive shards 3163 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3115, number_of_pending_tasks: 1278, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6252, initializing_shards: 48, number_of_data_ [15:27:38] PROBLEM - ElasticSearch health check for shards on elastic2009 is CRITICAL: CRITICAL - elasticsearch inactive shards 3160 threshold =0.1% breach: status: red, number_of_nodes: 24, unassigned_shards: 3111, number_of_pending_tasks: 163, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3119, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6254, initializing_shards: 49, number_of_data_node [15:27:39] PROBLEM - ElasticSearch health check for shards on elastic2021 is CRITICAL: CRITICAL - elasticsearch inactive shards 3158 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3110, number_of_pending_tasks: 249, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6256, initializing_shards: 48, number_of_data_n [15:27:39] PROBLEM - ElasticSearch health check for shards on elastic2013 is CRITICAL: CRITICAL - elasticsearch inactive shards 3158 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3110, number_of_pending_tasks: 249, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6256, initializing_shards: 48, number_of_data_n [15:27:48] (03CR) 10jenkins-bot: [V: 04-1] tools: Switch to newer webservice comand [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [15:27:57] (03CR) 10Alex Monk: "Now other changes fail with this:" [puppet] - 10https://gerrit.wikimedia.org/r/285332 (owner: 10Yuvipanda) [15:28:09] PROBLEM - ElasticSearch health check for shards on elastic2017 is CRITICAL: CRITICAL - elasticsearch inactive shards 3155 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3107, number_of_pending_tasks: 274, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6260, initializing_shards: 48, number_of_data_n [15:28:09] PROBLEM - ElasticSearch health check for shards on elastic2010 is CRITICAL: CRITICAL - elasticsearch inactive shards 3155 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3107, number_of_pending_tasks: 274, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6260, initializing_shards: 48, number_of_data_n [15:28:09] PROBLEM - ElasticSearch health check for shards on elastic2018 is CRITICAL: CRITICAL - elasticsearch inactive shards 3155 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3107, number_of_pending_tasks: 274, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6260, initializing_shards: 48, number_of_data_n [15:28:19] PROBLEM - ElasticSearch health check for shards on 
elastic2015 is CRITICAL: CRITICAL - elasticsearch inactive shards 3151 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3103, number_of_pending_tasks: 457, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6264, initializing_shards: 48, number_of_data_n [15:28:19] PROBLEM - ElasticSearch health check for shards on elastic2003 is CRITICAL: CRITICAL - elasticsearch inactive shards 3149 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3101, number_of_pending_tasks: 504, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6266, initializing_shards: 48, number_of_data_n [15:28:28] PROBLEM - ElasticSearch health check for shards on elastic2022 is CRITICAL: CRITICAL - elasticsearch inactive shards 3145 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3097, number_of_pending_tasks: 711, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6270, initializing_shards: 48, number_of_data_n [15:28:28] PROBLEM - ElasticSearch health check for shards on elastic2019 is CRITICAL: CRITICAL - elasticsearch inactive shards 3145 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3097, number_of_pending_tasks: 710, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6270, initializing_shards: 48, number_of_data_n [15:28:39] PROBLEM - ElasticSearch health check for shards on elastic2004 is CRITICAL: CRITICAL - elasticsearch inactive shards 3140 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3092, number_of_pending_tasks: 975, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6275, initializing_shards: 48, number_of_data_n [15:28:39] PROBLEM - ElasticSearch health check for shards on elastic2007 is CRITICAL: CRITICAL - elasticsearch inactive shards 3140 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3092, number_of_pending_tasks: 975, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6275, initializing_shards: 48, number_of_data_n [15:28:39] PROBLEM - ElasticSearch health check for shards on elastic2011 is CRITICAL: CRITICAL - elasticsearch inactive shards 3139 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3091, number_of_pending_tasks: 1053, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6276, initializing_shards: 48, number_of_data_ [15:28:39] PROBLEM - ElasticSearch health check for shards on elastic2006 is CRITICAL: CRITICAL - elasticsearch inactive shards 3139 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3091, number_of_pending_tasks: 1053, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6276, initializing_shards: 48, number_of_data_ [15:28:48] 
PROBLEM - ElasticSearch health check for shards on elastic2020 is CRITICAL: CRITICAL - elasticsearch inactive shards 3136 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3088, number_of_pending_tasks: 571, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3119, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6276, initializing_shards: 48, number_of_data_n [15:28:49] PROBLEM - ElasticSearch health check for shards on elastic2005 is CRITICAL: CRITICAL - elasticsearch inactive shards 3136 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3088, number_of_pending_tasks: 571, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3119, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6276, initializing_shards: 48, number_of_data_n [15:28:58] (03CR) 10Filippo Giunchedi: [C: 032] update collector version [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/285445 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [15:28:59] PROBLEM - ElasticSearch health check for shards on elastic2023 is CRITICAL: CRITICAL - elasticsearch inactive shards 3135 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3087, number_of_pending_tasks: 834, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6280, initializing_shards: 48, number_of_data_n [15:29:09] PROBLEM - ElasticSearch health check for shards on elastic2012 is CRITICAL: CRITICAL - elasticsearch inactive shards 3131 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3083, number_of_pending_tasks: 1131, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3119, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6281, initializing_shards: 48, number_of_data_ [15:29:17] sorry about the spam, scheduled downtime was not long enough... 
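Downtimes like the one that just expired are scheduled through icinga's external command interface; extending one is a matter of writing a SCHEDULE_SVC_DOWNTIME line to the command file. A sketch using the standard command syntax, where the command-file path and the exact service description are assumptions:
```bash
# Schedule four hours of fixed downtime for one service check.
# Field order: host;service;start;end;fixed;trigger_id;duration;author;comment
now=$(date +%s)
end=$((now + 4 * 3600))
printf '[%d] SCHEDULE_SVC_DOWNTIME;elastic2001;ElasticSearch health check for shards;%d;%d;1;0;0;gehel;cluster restart\n' \
    "$now" "$now" "$end" > /var/lib/icinga/rw/icinga.cmd
```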
[15:29:18] PROBLEM - ElasticSearch health check for shards on elastic2014 is CRITICAL: CRITICAL - elasticsearch inactive shards 3129 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3081, number_of_pending_tasks: 1189, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3119, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6283, initializing_shards: 48, number_of_data_ [15:29:18] PROBLEM - ElasticSearch health check for shards on elastic2024 is CRITICAL: CRITICAL - elasticsearch inactive shards 3129 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3081, number_of_pending_tasks: 1189, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3119, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6283, initializing_shards: 48, number_of_data_ [15:29:18] PROBLEM - ElasticSearch health check for shards on elastic2002 is CRITICAL: CRITICAL - elasticsearch inactive shards 3129 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3081, number_of_pending_tasks: 1189, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3119, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6283, initializing_shards: 48, number_of_data_ [15:29:19] PROBLEM - ElasticSearch health check for shards on elastic2008 is CRITICAL: CRITICAL - elasticsearch inactive shards 3126 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3078, number_of_pending_tasks: 1309, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6288, initializing_shards: 48, number_of_data_ [15:29:19] PROBLEM - ElasticSearch health check for shards on elastic2016 is CRITICAL: CRITICAL - elasticsearch inactive shards 3126 threshold =0.1% breach: status: yellow, number_of_nodes: 24, unassigned_shards: 3078, number_of_pending_tasks: 1309, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 6288, initializing_shards: 48, number_of_data_ [15:29:57] 06Operations, 07Wikimedia-log-errors: "internal_api_error_MWException: [dbf916b7] Exception Caught: Could not acquire lock for" for some uploads (during upload with Pywikibot OAuth) - https://phabricator.wikimedia.org/T129621#2243088 (10fgiunchedi) p:05Triage>03Normal dup of {T132921} ? [15:30:45] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable NewUserMessage on hiwikiquote [[gerrit:285639]] (duration: 00m 31s) [15:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:54] ^ Urbanecm check please [15:32:03] thcipriani: let me know when you're done, I have a small config fix to HHVM to deploy (https://gerrit.wikimedia.org/r/#/c/285554/) [15:32:15] gehel: will do. 
[15:32:54] (03PS4) 10Alex Monk: Point dynamicproxy to IPs instead of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) [15:32:56] (03PS1) 10Alex Monk: Followup I6b0bbb34: Fix pep8 in modules/diamond/files/collector/powerdns_recursor.py [puppet] - 10https://gerrit.wikimedia.org/r/285659 [15:34:09] (03CR) 10jenkins-bot: [V: 04-1] Point dynamicproxy to IPs instead of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) (owner: 10Alex Monk) [15:34:47] (03CR) 10jenkins-bot: [V: 04-1] Followup I6b0bbb34: Fix pep8 in modules/diamond/files/collector/powerdns_recursor.py [puppet] - 10https://gerrit.wikimedia.org/r/285659 (owner: 10Alex Monk) [15:35:04] It's installed. When I logged in with my test account, no message appeared. It seems that this is because there is no in-wiki conf, so I'll notify hiwikiquote about this. [15:35:08] 06Operations, 10Wikimedia-IRC-RC-Server: IRC RC server still mentions pmtpa on various places - https://phabricator.wikimedia.org/T133328#2243101 (10fgiunchedi) p:05Triage>03Low indeed the `irc.pmtpa.wikimedia.org` hostname should be changed, though the bot name might be harder to do as pointed out in http... [15:35:30] (03CR) 10jenkins-bot: [V: 04-1] Set up Let's Encrypt certificate for labtestwikitech [puppet] - 10https://gerrit.wikimedia.org/r/285654 (https://phabricator.wikimedia.org/T133167) (owner: 10Alex Monk) [15:35:44] Urbanecm: sounds good. Thank you for checking! [15:36:01] Thcipriani: Thanks for the deploys! [15:36:05] (03PS2) 10Alex Monk: Followup I6b0bbb34: Fix pep8 in modules/diamond/files/collector/powerdns_recursor.py [puppet] - 10https://gerrit.wikimedia.org/r/285659 [15:36:07] (03PS5) 10Alex Monk: Point dynamicproxy to IPs instead of hostnames [puppet] - 10https://gerrit.wikimedia.org/r/285288 (https://phabricator.wikimedia.org/T133554) [15:36:12] gehel: I'm done with SWAT [15:36:18] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "there is a mechanism to do this and you should use it. See incoming comments" [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [15:36:24] thcipriani: thanks! [15:36:38] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: puppet fail [15:36:53] (03PS2) 10Gehel: Increase curl pools on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [15:36:58] (03CR) 10Alex Monk: "No idea what this is:" [puppet] - 10https://gerrit.wikimedia.org/r/285654 (https://phabricator.wikimedia.org/T133167) (owner: 10Alex Monk) [15:37:17] <_joe_> gehel: see my comment, do not merge [15:37:54] _joe_: just in time, stopping now... [15:39:35] (03CR) 10Giuseppe Lavagetto: "see comments" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [15:39:40] <_joe_> gehel: I'll fix it [15:40:18] 06Operations, 10OTRS: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476#2243110 (10fgiunchedi) p:05Triage>03Normal [15:41:07] _joe_: now that I know what to do, I can fix it if you have something more important to do [15:41:14] _joe_: which you probably have... [15:42:44] <_joe_> gehel: already on it [15:42:56] _joe_: thanks!
[15:44:11] (03PS3) 10Giuseppe Lavagetto: Increase curl pools on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [15:44:12] Is Elasticsearch ok? [15:44:15] Tons of.... [15:44:25] ElasticaWrite job reported failure on cluster {cluster}. Requeueing job with delay of {delay}. [15:44:25] Search backend error during sending {numBulk} documents to the {indexType} index after {took}: {message} [15:44:25] ostriches: codfw is in bad shade [15:44:34] ostriches: codfw cluster is in a bad state [15:44:35] *shape [15:44:38] <_joe_> gehel: let me test it [15:45:28] Ok, long as it's known. [15:45:42] (03CR) 1020after4: "@Mobrovac: scap::target is updated in https://gerrit.wikimedia.org/r/#/c/284418/ and I have taken a fairly thorough look at ssh::userkey u" [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747) (owner: 1020after4) [15:46:03] !log restarted kafka1013 for java upgrades [15:46:09] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: processor/client-side-04 processor/client-side-01 forwarder/legacy-zmq [15:46:10] ostriches: where did you see those specific logs? [15:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:30] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: ElasticSearch Not enough active copies to meet write consistency - https://phabricator.wikimedia.org/T133784#2243135 (10hashar) [15:46:34] EL is me [15:46:37] gehel: Logstash, mediawiki logs to be specific. [15:46:45] https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki [15:47:27] ostriches: thanks! [15:47:28] 06Operations: Investigate cp1044's strange Ganglia graphs - https://phabricator.wikimedia.org/T132859#2212399 (10fgiunchedi) I'm seeing graphs for cp1044 in ganglia (for misc eqiad though, not maps caches) https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=cp1044.eqiad.wmn... [15:47:43] (03PS4) 10Giuseppe Lavagetto: Increase curl pools on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [15:47:44] 06Operations, 07Diamond, 07Upstream: Upstream our Diamond PowerDNSRecursorCollector - https://phabricator.wikimedia.org/T133643#2243153 (10fgiunchedi) p:05Triage>03Normal [15:48:19] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are running.
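The ElasticaWrite and search-backend failures ostriches quotes above are being read off the Logstash dashboard linked at 15:46:45. A hypothetical programmatic version of that same search, assuming a stock logstash-* index layout on an Elasticsearch backend; the host, index pattern and field names are assumptions, not taken from this log.

```
# Hypothetical equivalent of the dashboard search: pull recent
# "ElasticaWrite job reported failure" messages. Host, index pattern
# and field names are assumptions about a stock Logstash setup.
import json
import requests

query = {
    "query": {"query_string": {
        "query": "message:\"ElasticaWrite job reported failure\""}},
    "sort": [{"@timestamp": {"order": "desc"}}],
    "size": 5,
}
r = requests.post("http://localhost:9200/logstash-*/_search",
                  data=json.dumps(query),
                  headers={"Content-Type": "application/json"}, timeout=10)
for hit in r.json()["hits"]["hits"]:
    src = hit["_source"]
    print(src.get("@timestamp"), src.get("message"))
```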
[15:48:24] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: ElasticSearch Not enough active copies to meet write consistency - https://phabricator.wikimedia.org/T133784#2243156 (10Gehel) [15:48:26] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, and 2 others: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#2243155 (10Gehel) [15:49:14] 06Operations, 07LDAP: Add wmf LDAP group members into nda group, delete wmf group - https://phabricator.wikimedia.org/T129786#2243157 (10fgiunchedi) p:05Triage>03Normal [15:49:29] 06Operations, 07Documentation, 07LDAP: Review list of LDAP groups and document exactly what kind of access they can be allowed to provide - https://phabricator.wikimedia.org/T129788#2243158 (10fgiunchedi) p:05Triage>03Normal [15:50:33] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: ElasticSearch Not enough active copies to meet write consistency - https://phabricator.wikimedia.org/T133784#2243135 (10Gehel) During deployment of T110236 we started to see erratic behaviour of the codfw elasticsearch clus... [15:51:11] (03PS5) 10Giuseppe Lavagetto: Increase curl pools on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [15:52:17] 06Operations: Investigate cp1044's strange Ganglia graphs - https://phabricator.wikimedia.org/T132859#2243166 (10fgiunchedi) p:05Triage>03Normal [15:52:42] _joe_: thanks for the fix! I'll deploy this after our meeting [15:53:54] (03CR) 10Giuseppe Lavagetto: [C: 031] "DTRT https://puppet-compiler.wmflabs.org/2592/mw1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [15:55:36] (03CR) 10Filippo Giunchedi: [V: 032] update collector version [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/285445 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [15:55:58] 06Operations, 10Security-Reviews, 06Security-Team, 10Wikimedia-Site-requests: ACL configuration for url-downloader.wikimedia.org allowing upload.wikimedia.org - https://phabricator.wikimedia.org/T130695#2243169 (10fgiunchedi) p:05Triage>03Normal [15:59:35] (03PS2) 10Giuseppe Lavagetto: conftool: upgrade confctl's pool/depool for 0.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/285646 [15:59:40] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#2243179 (10RobH) [15:59:45] 06Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2243176 (10RobH) 05Open>03Resolved They've arrived onsite, and are in the queue for chris to rack. I'm marking this as resolved by the purchase task T132067.
[16:00:43] (03PS3) 10Giuseppe Lavagetto: conftool: upgrade confctl's pool/depool for 0.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/285646 [16:01:59] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: upgrade confctl's pool/depool for 0.3.0 [puppet] - 10https://gerrit.wikimedia.org/r/285646 (owner: 10Giuseppe Lavagetto) [16:02:29] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:03:13] (03CR) 10Gehel: "Puppet compiler results: https://puppet-compiler.wmflabs.org/2593/" [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [16:04:27] (03PS6) 10Gehel: Increase curl pools on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [16:04:58] (03PS1) 10Volans: Add the semi_sync parameter [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/285664 (https://phabricator.wikimedia.org/T133333) [16:05:03] !log increasing curl pool size for jobrunners (T133755) [16:05:04] T133755: Job runners all report: Timeout reached waiting for an available pooled curl connection! - https://phabricator.wikimedia.org/T133755 [16:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:23] 06Operations, 10ops-eqiad: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2243191 (10RobH) [16:05:38] 06Operations, 10ops-eqiad: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2243208 (10RobH) [16:05:39] (03CR) 10Gehel: [C: 032] Increase curl pools on jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/285554 (https://phabricator.wikimedia.org/T133755) (owner: 10EBernhardson) [16:05:55] 06Operations, 10ops-eqiad: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2243191 (10RobH) [16:07:05] 06Operations, 06Mobile-Apps, 10Traffic: Millions of request per minute to /.well-known/apple-app-site-association producing 404s - https://phabricator.wikimedia.org/T130647#2243217 (10fgiunchedi) p:05Triage>03Low it looks like this has reduced dramatically since fixing {T111829} but [[ https://grafana.wi... [16:07:50] 06Operations: Increase size of root partition on ocg* servers - https://phabricator.wikimedia.org/T130591#2243224 (10fgiunchedi) p:05Triage>03Low [16:08:18] 06Operations, 10ops-eqiad, 10Analytics: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2243229 (10RobH) [16:09:04] 06Operations, 10ops-eqiad, 10Analytics: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2243191 (10RobH) [16:11:42] (03CR) 10Volans: "Tested also the compiler on a submodule, thanks to elukey modifications:" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/285664 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [16:12:39] (03PS1) 10Giuseppe Lavagetto: palladium: add v6 address [puppet] - 10https://gerrit.wikimedia.org/r/285665 [16:13:01] <_joe_> akosiaris: ^^ but I'm going off now, I'll see this tomorrow morning [16:13:18] 06Operations, 10OCG-General, 05codfw-rollout: Document eqiad/codfw transition plan for OCG - https://phabricator.wikimedia.org/T133164#2243255 (10fgiunchedi) agree with @joe that making ocg machines stateless would make (de)pooling easier and operations in general. do we have stats on how big such a cache wo... 
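For background on the jobrunner change merged above: T133755 is literally "Timeout reached waiting for an available pooled curl connection!", i.e. a bounded pool whose every slot stays checked out. A language-agnostic illustration of that failure mode follows (Python 3; explicitly not HHVM's actual curl pool), where raising the pool size, as the patch does, relieves the contention.

```
# Conceptual illustration only, not HHVM's curl pool: a fixed-size pool
# whose acquire() times out when all connections are checked out -- the
# failure mode T133755 reports. Increasing `size` is the analogue of
# the "Increase curl pools on jobrunners" change.
import queue

class BoundedPool(object):
    def __init__(self, size, factory):
        self._slots = queue.Queue(maxsize=size)
        for _ in range(size):
            self._slots.put(factory())

    def acquire(self, timeout=5.0):
        try:
            return self._slots.get(timeout=timeout)
        except queue.Empty:
            raise RuntimeError("Timeout reached waiting for an available "
                               "pooled connection!")

    def release(self, conn):
        self._slots.put(conn)
```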
[16:13:43] ok [16:14:36] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 2 others: ElasticSearch Not enough active copies to meet write consistency - https://phabricator.wikimedia.org/T133784#2243256 (10Gehel) a:03Gehel [16:15:24] 06Operations, 10OCG-General, 05codfw-rollout: Document eqiad/codfw transition plan for OCG - https://phabricator.wikimedia.org/T133164#2243258 (10Joe) I actually have a question for @cscott how does mediawiki learn which backend to contact? I don't see any reference to the OCG redis server in mediawiki-conf... [16:18:14] 06Operations, 10MobileFrontend, 10Reading-Web, 10Traffic: Seeing desktop text cache while browsing mobile sites - https://phabricator.wikimedia.org/T133441#2243266 (10MBinder_WMF) [16:29:41] 06Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2243639 (10Ottomata) @cmjohnson if you could prioritize this one a little, we'd appreciate it. We've been waiting for a while and the current OOW nodes that are... [16:32:36] (03CR) 10Jcrespo: [C: 031] Add the semi_sync parameter [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/285664 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [16:34:23] (03CR) 10Volans: [C: 032] Add the semi_sync parameter [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/285664 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [16:35:50] 06Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2243827 (10Ottomata) We can call these aqs100[456]. If you can just get these to DNS and ready for install, we will handle the actual partman layout and install.... [16:36:03] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2243191 (10Nuria) [16:39:35] (03PS2) 10Volans: MariaDB: Load and enable semi-sync replication [puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) [16:40:20] Hi folks, we have an unbreak now CentralNotice issue on production, temporary patch in review now (underlying issue unknown, however) https://phabricator.wikimedia.org/T133765 https://gerrit.wikimedia.org/r/#/c/285671/ [16:40:32] Ah patch just got +2'd [16:40:42] greg-g: ^ [16:43:38] 06Operations, 10ops-eqiad, 10Analytics-Cluster: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2192798 (10Nuria) [16:45:15] 06Operations, 10hardware-requests: reclaim restbase1001-1006 to spares - https://phabricator.wikimedia.org/T130752#2243939 (10Eevans) [16:45:17] 06Operations, 10RESTBase, 10hardware-requests, 13Patch-For-Review: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#2243940 (10Eevans) [16:45:19] 06Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2243937 (10Eevans) 05Open>03Resolved With the successful bootstrap of restbase1015-b, I believe this task is complete! :) [16:46:54] thcipriani: ^ [16:48:36] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] switch configuration - https://phabricator.wikimedia.org/T133788#2243951 (10Papaul) [16:48:42] ostriches: ^ [16:48:57] ok [16:49:13] I wonder who our resident RL expert is these days...
[16:49:41] probably Krinkle [16:50:02] 06Operations, 10Traffic, 07Varnish: varnishmedia: repeated calls to flush_stats() - https://phabricator.wikimedia.org/T132474#2243981 (10Nuria) [16:50:12] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] switch configuration - https://phabricator.wikimedia.org/T133788#2243951 (10Papaul) p:05High>03Normal a:05Papaul>03None [16:52:03] Krenair: thx! mmm hoping for someone currently online ;p [16:54:32] 06Operations, 10Traffic, 13Patch-For-Review, 07Varnish: varnishstatsd crashes with ValueError in vsl_callback without being restarted by systemd - https://phabricator.wikimedia.org/T132430#2243997 (10Nuria) [16:55:13] (03PS3) 10Volans: MariaDB: Load and enable semi-sync replication [puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) [16:55:34] 06Operations, 13Patch-For-Review: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2244000 (10Nuria) [16:55:37] 06Operations, 10Analytics, 10Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2243999 (10Nuria) 05Open>03Resolved [16:58:19] RECOVERY - ElasticSearch health check for shards on elastic2013 is OK: OK - elasticsearch status production-search-codfw: status: yellow, number_of_nodes: 24, unassigned_shards: 859, number_of_pending_tasks: 22525, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 3120, cluster_name: production-search-codfw, relocating_shards: 0, active_shards: 8478, initializing_shards: 78, number_of_data_nodes: 24, delayed_unass [16:58:22] !log increase throttling limit and concurrency on recoveries for elasticsearch codfw cluster (T133784) [16:58:23] T133784: ElasticSearch Not enough active copies to meet write consistency - https://phabricator.wikimedia.org/T133784 [16:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [matching shard RECOVERYs from icinga-wm for the remaining elastic20xx hosts between 16:58:40 and 17:00:39 trimmed] [16:59:41] hmm, maybe we should only have a couple machines in the cluster test that, it's the same for all of them [17:00:14] (03PS10) 10Dzahn: contint: Use slave-scripts/bin/php wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [17:01:07] (03CR) 10Dzahn: [C: 032] "per hashar and "solely for contint on labs instances"" [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [17:04:07] (03PS5) 10Dzahn: phragile: move role class to autoloader layout [puppet] - 10https://gerrit.wikimedia.org/r/285333 [17:05:37] (03CR) 10Jcrespo: "Add es, and +1." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) (owner: 10Volans) [17:07:19] gehel: are you having fun with elastic search :P ? [17:07:46] elukey: all depends on your definition of fun!
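The 16:58 !log above ("increase throttling limit and concurrency on recoveries") corresponds to two standard Elasticsearch cluster settings. A sketch of the transient update follows; the setting names are the stock recovery knobs of that Elasticsearch generation, while the values and target host are assumptions rather than what was actually applied.

```
# Sketch of the transient settings bump behind the 16:58 !log. Setting
# names are standard Elasticsearch recovery knobs; the values and host
# here are assumptions, not taken from the log.
import json
import requests

settings = {
    "transient": {
        "indices.recovery.max_bytes_per_sec": "200mb",
        "cluster.routing.allocation.node_concurrent_recoveries": 4,
    }
}
resp = requests.put("http://localhost:9200/_cluster/settings",
                    data=json.dumps(settings),
                    headers={"Content-Type": "application/json"}, timeout=10)
resp.raise_for_status()
print(resp.json())
```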
[17:08:00] :) [17:10:28] 06Operations, 10MediaWiki-Parser, 06Parsing-Team, 10Reading-Web, and 6 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2244039 (10MBinder_WMF) [17:11:05] (03PS4) 10Volans: MariaDB: Load and enable semi-sync replication [puppet] - 10https://gerrit.wikimedia.org/r/285649 (https://phabricator.wikimedia.org/T133333) [17:11:55] (03PS4) 10Urbanecm: Enable DynamicPageList extension on tewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285009 (https://phabricator.wikimedia.org/T133032) [17:12:45] (03PS1) 10Papaul: DHCP: Adding MAC address entries for restbase200[7-9] Bug: T132976 [puppet] - 10https://gerrit.wikimedia.org/r/285675 (https://phabricator.wikimedia.org/T132976) [17:14:22] (03PS5) 10Dereckson: Enable DynamicPageList extension on te.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285009 (https://phabricator.wikimedia.org/T104163) (owner: 10Urbanecm) [17:14:49] (03CR) 10Dereckson: "PS5: gsed -i "s/T133032/T104163/g" InitialiseSettings.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285009 (https://phabricator.wikimedia.org/T104163) (owner: 10Urbanecm) [17:14:58] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2244103 (10Papaul) [17:15:11] (03CR) 10Urbanecm: "Thanks @Dereckson." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285009 (https://phabricator.wikimedia.org/T104163) (owner: 10Urbanecm) [17:17:09] (03PS2) 10Dzahn: DHCP: Adding MAC address entries for restbase200[7-9] Bug: T132976 [puppet] - 10https://gerrit.wikimedia.org/r/285675 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [17:17:15] (03CR) 10Dzahn: [C: 032] DHCP: Adding MAC address entries for restbase200[7-9] Bug: T132976 [puppet] - 10https://gerrit.wikimedia.org/r/285675 (https://phabricator.wikimedia.org/T132976) (owner: 10Papaul) [17:20:06] (03CR) 10Ottomata: "Hm, it looks like zookeeper_url was added to hiera by Marko in https://gerrit.wikimedia.org/r/#/c/275772/7/manifests/role/changeprop.pp. " (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) (owner: 10Elukey) [17:21:24] (03CR) 10Ottomata: [C: 031] udp2log: Move ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/285375 (owner: 10Muehlenhoff) [17:22:35] (03CR) 10Dzahn: "i wanted to use the "watroles" tool to check which instances are using this, because i did not want to break things due to renaming the cl" [puppet] - 10https://gerrit.wikimedia.org/r/285333 (owner: 10Dzahn) [17:22:56] (03CR) 10Ottomata: "Ah, marko has already fixed changeprop.pp. Luca, we can remove the 'zookeeper_url' from hiera altogether." [puppet] - 10https://gerrit.wikimedia.org/r/285393 (https://phabricator.wikimedia.org/T131959) (owner: 10Elukey) [17:23:39] (03PS2) 10Dzahn: testsystem: move role class to test::system [puppet] - 10https://gerrit.wikimedia.org/r/285334 [17:23:41] jdlrobson: hi! :) any thoughts on https://phabricator.wikimedia.org/T133765 ? [17:24:05] Krenair: ^ ? [17:24:11] sorry, I meant Krinkle [17:24:23] (though if u have any comments Krenair happy 2 hear 'em!)
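On the MariaDB semi-sync patches threaded through this hour (gerrit 285649 and 285664): whatever the puppet side wires up, the end state on the server is the standard semi-sync replication plugin loaded and enabled. A verification sketch using the stock rpl_semi_sync_* names follows; the host and credentials are placeholders, not real production values.

```
# Verification sketch for semi-sync replication (the end state gerrit
# 285649 arranges via puppet). Variable/status names are MariaDB's
# standard semi-sync ones; host and credentials are placeholders.
import pymysql

conn = pymysql.connect(host="db-test.example", user="check", password="***")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_%_enabled'")
        for name, value in cur.fetchall():
            print(name, value)  # expect ON for the relevant role
        cur.execute("SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_clients'")
        print(cur.fetchall())   # semi-sync replicas currently connected
finally:
    conn.close()
```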
[17:24:54] (03PS2) 10Dzahn: Only install font-gujr-extra on jessie [puppet] - 10https://gerrit.wikimedia.org/r/285621 (https://phabricator.wikimedia.org/T129500) (owner: 10Muehlenhoff) [17:25:26] (03CR) 10Dzahn: [C: 031] "also seems vaguely familiar from a previous attempt we made and then reverted (the issue on labs that hashar mentioned too)" [puppet] - 10https://gerrit.wikimedia.org/r/285621 (https://phabricator.wikimedia.org/T129500) (owner: 10Muehlenhoff) [17:26:45] (03PS1) 10Ppchelko: Set up redlinks processing in change propagation [puppet] - 10https://gerrit.wikimedia.org/r/285678 (https://phabricator.wikimedia.org/T133221) [17:27:04] (03Abandoned) 10Dzahn: mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [17:32:53] (03PS2) 10Dzahn: mediawiki: include font packages on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/231284 (https://phabricator.wikimedia.org/T84777) [17:33:55] 06Operations, 13Patch-For-Review: Install fonts-wqy-zenhei on all mediawiki app servers - https://phabricator.wikimedia.org/T84777#931171 (10Dzahn) a:05Dzahn>03None [17:33:58] (03PS1) 10Volans: Avoid loading my.cnf twice [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/285679 (https://phabricator.wikimedia.org/T133780) [17:34:14] jynus, what's a good time to discuss External Store? [17:34:30] (03PS4) 10Mattflaschen: Beta Cluster: Use ExternalStore on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282440 (https://phabricator.wikimedia.org/T95871) [17:35:30] matt_flaschen, I already reviewed the change to beta [17:36:19] 06Operations, 13Patch-For-Review: Install fonts-wqy-zenhei on all mediawiki app servers - https://phabricator.wikimedia.org/T84777#2244179 (10Dzahn) @Muehlenhoff back in August 2015 i uploaded this one https://gerrit.wikimedia.org/r/#/c/231284 (just amended to fix path conflict) as a possible fix for this tic... [17:36:23] 06Operations, 13Patch-For-Review: Install fonts-wqy-zenhei on all mediawiki app servers - https://phabricator.wikimedia.org/T84777#2244181 (10Dzahn) p:05Low>03Normal [17:36:51] 06Operations, 13Patch-For-Review: install font packages on all appservers, not just imagescalers (was: Install fonts-wqy-zenhei on all mediawiki app servers) - https://phabricator.wikimedia.org/T84777#931171 (10Dzahn) 05stalled>03Open [17:37:10] jynus, oh, I see. Existing content that doesn't use External Store will still be accessible. I don't think it needs to be migrated. [17:37:36] AndyRussG: hey taking a look [17:37:43] jdlrobson: cool thx!
[17:37:58] Looks like the ext.centralNotice.display RL dependency isn't getting loaded when it should [17:38:16] matt_flaschen, yeah, I was thinking too big assuming it was production [17:38:31] and needed 100% perfect migration [17:38:35] not needed really [17:38:38] AndyRussG: modules.exports may help you here [17:38:55] AndyRussG: as you can avoid relying on ext.centralNotice existing [17:39:09] sorry, but last week was datacenter failover time and all of ops was overloaded [17:39:17] jdlrobson: it seems that's just the symptom [17:39:18] but it sounds like whatever first defines ext.centralNotice needs to be declared as a dependency of ext.centralNotice.choiceData [17:39:21] and before that, I was on vacation [17:39:36] jdlrobson: it is declared, that code has been working for like more than a year [17:39:43] it's declared dynamically from PHP [17:39:59] jdlrobson: https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/dac230995248c5345216d23707b42d22d99b9cb7/includes/CNChoiceDataResourceLoaderModule.php#L169-L170 [17:40:48] AndyRussG: I see it returns [] if there are no choices, no dependencies [17:40:57] YuviPanda, yeah, no problem. I just wanted to sync up now that you're back and data center stuff is over. [17:41:10] jdlrobson: exactly... But in this case there is definitely a choice, 'cause stuff in choiceData [17:41:22] (03PS3) 10Dzahn: mediawiki: include font packages on all appservers [puppet] - 10https://gerrit.wikimedia.org/r/231284 (https://phabricator.wikimedia.org/T84777) [17:41:29] it's happening a lot for me on de.wikipedia.org [17:41:29] we are still doing the ES stuff [17:41:34] Krinkle is away on vacation fyi [17:41:39] AndyRussG: how can i replicate? [17:41:54] just go to de.wikipedia.org, open a console, and reload a few times [17:41:58] (03CR) 10Ottomata: [C: 031] "Cool with me then, someone in traffic should give final review" [puppet] - 10https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204) (owner: 10Nuria) [17:42:07] Maybe clear out your localstorage between reloads if you don't see it [17:42:30] I'm trying to figure out if the dependency is getting there correctly in the call to mw.loader.register [17:42:32] it is just that as new servers arrived, that also blocked the deployment and there is no longer a total failure scenario [17:42:45] (03PS1) 10ArielGlenn: Raise connection limit for dumps server, for specific shared IP [puppet] - 10https://gerrit.wikimedia.org/r/285682 (https://phabricator.wikimedia.org/T133790) [17:42:47] but I want to create a new cluster soon on production [17:43:11] (the same way you created it on beta) [17:43:18] and start doing tests [17:43:26] jynus, yep. Do you want to re-review the first Beta patch, or should we just review it on the Collaboration team? [17:43:52] first? do you have a link? [17:45:07] jynus, same one you reviewed before: https://gerrit.wikimedia.org/r/#/c/282440/4 . I'm just asking if you'll want to look at that again or if all your concerns are addressed (just the existing content).
[17:45:21] (03CR) 10Dzahn: [C: 032] testsystem: move role class to test::system [puppet] - 10https://gerrit.wikimedia.org/r/285334 (owner: 10Dzahn) [17:45:32] yes, that had an implied +1, let me make it explicit [17:45:55] (03CR) 10Jcrespo: [C: 031] Beta Cluster: Use ExternalStore on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282440 (https://phabricator.wikimedia.org/T95871) (owner: 10Mattflaschen) [17:46:10] for labs I do not think we need such a strict review [17:46:16] (beta, I mean) [17:46:41] because worst case scenario, we revert and lose just a handful of edits [17:46:58] production is more complex due to the high edit ration [17:47:00] *ratio [17:47:19] (03CR) 10Dzahn: "no-op on test systems (except motd/role name)" [puppet] - 10https://gerrit.wikimedia.org/r/285334 (owner: 10Dzahn) [17:47:28] if you want me to be around on deploy, just tell me and I will in any case [17:47:48] jynus, okay, thanks. I just wanted to make sure to sync up and that we were both ready to go forward. I don't think you necessarily have to be around. I can revert if it fails. [17:47:56] great [17:48:13] jynus, also I wanted you to look just in case it actually looked wrong. [17:48:17] I think beta is technically owned by releng, but of course, feel free to ping me [17:48:28] looks good [17:48:59] we will do some edits there, naturally [17:49:04] and then do a beta test [17:49:13] with the script [17:49:59] jdlrobson: quote from ejegg, who was at scrum of scrums: "AndyRussG: Thiemo says getModifiedHash is outdated and we should use getDefinitionSummary" [17:52:03] papaul: I'm snagging your restbase system switch config task now [17:52:13] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] switch configuration - https://phabricator.wikimedia.org/T133788#2244238 (10RobH) a:03RobH [17:52:39] robh: thanks but not a rush since i am waiting on the SSDs [17:53:02] no worries, i am playing procurement catchup, its a nice change of pace =] [17:53:08] 06Operations, 06Labs: check_dns needs to be rewritten - https://phabricator.wikimedia.org/T133791#2244253 (10chasemp) p:05Triage>03High [17:55:22] jdlrobson: yeah choiceData is not getting registered with the proper dependencies. I'm not deeply familiar with how getDefinitionSummary() should be used [17:56:39] 06Operations, 06Labs: check_dns needs to be rewritten - https://phabricator.wikimedia.org/T133791#2244256 (10chasemp) A small addendum in case someone else runs into it. I was initially confused by the difference in behavior here: ```dig blah @labs-ns0.wikimedia.org ; <<>> DiG 9.8.3-P1 <<>> blah @labs-ns0.w... [17:58:28] robh: yes, i don't know why this HP procurement is taking almost a month now, this is the first time [17:58:56] 06Operations, 06Labs: check_dns needs to be rewritten - https://phabricator.wikimedia.org/T133791#2244262 (10chasemp) [17:59:15] papaul: so none of the systems shipped with the ssds? [17:59:26] akosiaris, most recently the dump task was blocked on https://gerrit.wikimedia.org/r/283117 rolling out (since the train was delayed by data center switch). So things are progressing, and that is also out now. [17:59:30] only the first one [17:59:49] robh: the last 2 that came in on monday had no ssds [17:59:57] ok, i'm emailing dasher/hp now. [18:00:09] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:00:17] (03PS1) 10Aaron Schulz: Set "autoResync" on for local-multiwrite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285687 (https://phabricator.wikimedia.org/T128096) [18:00:39] (03PS1) 10Dzahn: introduce ununpentium.wm.org [dns] - 10https://gerrit.wikimedia.org/r/285688 (https://phabricator.wikimedia.org/T123713) [18:00:46] robh: ^ :) [18:01:32] (03CR) 10jenkins-bot: [V: 04-1] introduce ununpentium.wm.org [dns] - 10https://gerrit.wikimedia.org/r/285688 (https://phabricator.wikimedia.org/T123713) (owner: 10Dzahn) [18:01:41] ha [18:01:52] that's eiximenis level for typos man [18:01:52] (03PS2) 10Dzahn: introduce ununpentium.wm.org [dns] - 10https://gerrit.wikimedia.org/r/285688 (https://phabricator.wikimedia.org/T123713) [18:01:53] well done [18:02:31] "not a pentium" [18:02:31] !log generating new triggers for eventlogging_sync schema T108856 [18:02:32] T108856: Set up bucketization of editCount fields {tick} - https://phabricator.wikimedia.org/T108856 [18:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:02:46] un-un-pentium, so maybe it is one [18:04:22] but it's all about the pentiums! https://www.youtube.com/watch?v=qpMvS1Q1sos [18:04:37] (03CR) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [18:04:44] hahah [18:04:47] nice [18:04:55] (03PS5) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [18:05:07] AndyRussG: can you point me to the code that loads ext.centralNotice.choiceData: ? [18:05:09] bblack: i'm requesting a new hostname in wikimedia.org [18:05:22] bblack: it would only replace an existing one though, magnesium [18:05:55] mw.loader.moduleRegistry['ext.centralNotice.choiceData'] shows that it has no dependencies according to ResourceLoader [18:06:11] jdlrobson: yes and yes [18:06:32] jdlrobson: the dependencies, remember, are dynamic. You have to be in a project/language context that is targeted for a possible CN campaign to get choices [18:06:34] AndyRussG: I think you need to remove the line if ( count( $choices ) === 0 ) { return [] [18:06:41] because this will cache for all modules [18:06:44] mutante: yeah we have no process around it yet. the main thing is that it gets added to the list at https://wikitech.wikimedia.org/wiki/HTTPS/domains [18:06:52] you should always load the same dependencies [18:06:54] jdlrobson: again, this has been working for a while [18:06:59] even a blank entry will do, I can always flesh it out the next time I go back there to audit something [18:07:13] mutante: but, this is a server hostname rather than a service, so it doesn't go there (yet) anyways [18:07:25] jdlrobson: That's the standard way to go, but for performance, it was decided to do otherwise, and that exact part was fully vetted by our RL folks [18:07:43] AndyRussG: how can you be sure though? Maybe you have exposed an edge case that was always broken. [18:08:01] i'm not too familiar with the code, but somewhere somehow dependencies is being defined as an empty list [18:08:05] jdlrobson: that'd be really weird... If so, we should fix the edge case?
[18:08:12] bblack: ok, i'll make sure to add it and comment what it's for, also i plan to apply the RT role on that [18:08:15] Sounds to me like something wrong with the module hash [18:08:23] AndyRussG: the thing that's interesting is mw.loader.moduleRegistry['ext.centralNotice.choiceData'] on enwiki shows an array [18:08:39] so i would suggest trying to work out what differences between dewiki and enwiki could lead to an empty dependency list [18:08:47] 06Operations, 13Patch-For-Review: Mediawiki font packages: switch to Jessie - https://phabricator.wikimedia.org/T102623#1369726 (10hashar) Danke alle! [18:08:47] bblack: (which i removed from the internal VM krypton) [18:08:57] (03PS2) 10Yuvipanda: tools: Switch to newer webservice comand [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) [18:09:01] jdlrobson: but... if you set country= and uselang= params for a place with a campaign happening, you should get data in choiceData [18:09:07] (which I should try inded) [18:09:12] lemme see [18:09:42] mutante: ok [18:09:44] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:09:55] (03CR) 10jenkins-bot: [V: 04-1] tools: Switch to newer webservice comand [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [18:09:58] mutante: maybe we need a self-only flag for these transitional situations? [18:10:02] (03PS3) 10Yuvipanda: tools: Switch to newer webservice comand [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) [18:10:14] (which we could put in hieradata so it can be temporarily set per-node) [18:10:27] let me work that up, and then you can try it for ununpentium [18:10:33] it's pretty painless I think [18:11:09] woot, did we run out of the nice element names? [18:11:26] bblack: cool, i haven't created it yet so no rush in particular [18:11:33] but needs to be in DNS first [18:11:38] (03PS4) 10Yuvipanda: tools: Switch to newer webservice comand [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) [18:12:03] MatmaRex: pretty much, and i dont like re-using [18:12:13] huh. [18:12:28] also, it fits [18:12:36] i guess it's isotopes next? do we have deuterium yet? :D [18:13:01] (03CR) 10jenkins-bot: [V: 04-1] tools: Switch to newer webservice comand [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [18:13:17] MatmaRex: i keep saying isotopes should have been VMs in eqiad.. and planet names VMs in codfw. .but too late [18:14:06] jdlrobson: another indication it's a module hash/caching issue is that when I get an error on dewiki, it's trying to send me data for the moon banner, but that campaign has ended [18:14:12] why the fuck is the powerdns recursor causing trouble for me in this patch?! [18:15:38] (03PS1) 10ArielGlenn: fix up rsync of kiwix openzim files to dataset host [puppet] - 10https://gerrit.wikimedia.org/r/285689 [18:16:44] 06Operations, 06Labs: check_dns needs to be rewritten - https://phabricator.wikimedia.org/T133791#2244402 (10chasemp) In the short term maybe it makes sense just to switch to `/usr/lib/nagios/plugins/check_dig` which seems semi sane, although a check built around http://www.dnspython.org/examples.html would be... 
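Picking up chasemp's dnspython pointer just above, here is a minimal sketch of what a rewritten check_dns could look like, following the Nagios exit-code convention (0 = OK, 2 = CRITICAL). The CLI shape and the optional expected-address argument are assumptions; only the dns.resolver usage is the library's standard API.

```
#!/usr/bin/env python
# Minimal sketch of a dnspython-based check, per the suggestion above.
# Nagios convention: exit 0 = OK, 2 = CRITICAL. The argument handling
# and the optional expected-address comparison are assumptions.
import sys
import dns.resolver

def check(name, server, expect=None):
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    resolver.lifetime = 5.0  # total time budget for the query
    try:
        answers = resolver.query(name, "A")
    except Exception as exc:  # NXDOMAIN, timeout, SERVFAIL, ...
        print("CRITICAL: %s against %s: %s" % (name, server, exc))
        return 2
    addrs = sorted(rr.address for rr in answers)
    if expect is not None and expect not in addrs:
        print("CRITICAL: %s -> %s, expected %s" % (name, addrs, expect))
        return 2
    print("OK: %s -> %s" % (name, ", ".join(addrs)))
    return 0

if __name__ == "__main__":
    sys.exit(check(*sys.argv[1:]))
```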
[18:17:41] AndyRussG: The moon data is still in [18:17:41] https://de.wikipedia.org/w/load.php?debug=false&lang=de&modules=Spinner%7Cext.centralNotice.bannerController%2CchoiceData%2CgeoIP%2CkvStoreMaintenance%2CstartUp%7Cext.centralauth.centralautologin%7Cext.uls.init%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.supportCheck%2Ctrack%2Cve%7Cjquery.byteLength%2C [18:17:41] cookie%2CembedPlayer%2CloadingSpinner%2CmwEmbedUtil%2CtabIndex%2Cthrottle-debounce%2CtriggerQueueCallback%7Cmediawiki.Title%2CUri%2Capi%2Ccldr%2Ccookie%2CjqueryMsg%2Clanguage%2Ctemplate%2Cuser%7Cmediawiki.api.options%2Cuser%7Cmediawiki.language.data%2Cinit%7Cmediawiki.libs.pluralruleparser%7Cmediawiki.page.startup%7Cmediawiki.template.regexp%7Cmmv.head%7Cmw. [18:17:41] EmbedPlayer.loader%7Cmw.MediaWikiPlayer.loader%7Cmw.MwEmbedSupport%2CPopUpMediaTransform%7Cmw.MwEmbedSupport.style%7Cmw.PopUpMediaTransform.styles%7Cmw.TimedText.loader%7Cskins.vector.js%7Cuser.defaults&skin=vector&version=f03c8171a7f3 [18:17:42] (03PS1) 10Halfak: Adds myspell-hu to ores base [puppet] - 10https://gerrit.wikimedia.org/r/285690 [18:17:44] (03PS1) 10BBlack: LE: add do_acme hieradata control for provisioning [puppet] - 10https://gerrit.wikimedia.org/r/285691 [18:17:45] could be related to https://phabricator.wikimedia.org/T99096 [18:18:23] mutante: so with https://gerrit.wikimedia.org/r/285691 , when you're first setting up ununpentium with the requesttracker role in site.pp, also add hieradata/hosts/ununpentium.yaml with contents 'do_acme: false' [18:18:48] otherwise it will constantly fail puppet and spam letsencrypt with illegitimate requests, because the public hostname rt.wm.o doesn't map to that machine [18:19:06] (03PS5) 10Yuvipanda: tools: Switch to newer webservice comand [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) [18:19:13] once it's already to make the public switch, switch DNS first, then remove that flag and let puppet get it a real signed cert [18:19:23] AndyRussG: choiceData in that url doesn't set mw.centralNotice=(mw.centralNotice||{}) [18:19:27] when did that line get added? [18:19:33] (03PS2) 10Yuvipanda: ores: Adds myspell-hu to base [puppet] - 10https://gerrit.wikimedia.org/r/285690 (owner: 10Halfak) [18:19:41] ottomata, potential source of breakage (heads up) https://phabricator.wikimedia.org/T108856#2244410 [18:19:44] (03CR) 10Yuvipanda: [C: 032 V: 032] ores: Adds myspell-hu to base [puppet] - 10https://gerrit.wikimedia.org/r/285690 (owner: 10Halfak) [18:20:01] jdlrobson: just like 1 hr ago [18:20:28] jdlrobson: I was gonna try deploying that, but then I saw that ext.centralNotice.display isn't even getting loaded [18:20:29] AndyRussG: it will still impact old cached pages/js then [18:20:31] (03CR) 10jenkins-bot: [V: 04-1] tools: Switch to newer webservice comand [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [18:20:38] So that patch won't help [18:20:45] (03PS2) 10BBlack: LE: add do_acme hieradata control for provisioning [puppet] - 10https://gerrit.wikimedia.org/r/285691 [18:20:54] (03CR) 10BBlack: [C: 032 V: 032] LE: add do_acme hieradata control for provisioning [puppet] - 10https://gerrit.wikimedia.org/r/285691 (owner: 10BBlack) [18:21:01] Without the display module, we'll get more js errors down the line, when the startUp module tries to check if a banner should be shown [18:21:32] YuviPanda: ok to merge? 
or you can do mine when you do yours, it's not time-critical [18:21:47] bblack: oops, yes, do merge! [18:21:56] done! [18:22:25] Since caching may be involved, pinging bblack.... :) https://phabricator.wikimedia.org/T133765 [18:22:52] bblack: I think this is due to an issue with the wrong version of a RL module being served..... [18:23:42] jdlrobson: I'm gonna prepare a patch that uses getDefinitionSummary() instead of getModifiedHash() [18:24:06] since getModifiedHash() is now deprecated [18:24:07] (03PS1) 10ArielGlenn: fix string comparison in dumpcirrussearch for old dump cleanups [puppet] - 10https://gerrit.wikimedia.org/r/285693 [18:24:07] sounds complicated [18:24:36] step 1 is: do you know how to tell the difference between working + broken outputs for a URL you can test with curl? [18:24:43] (03PS6) 10Yuvipanda: tools: Switch to newer webservice comand [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) [18:24:45] (03PS1) 10Yuvipanda: diamond: add .pep8 file raising line length limits [puppet] - 10https://gerrit.wikimedia.org/r/285694 [18:24:54] step 2 is: do the results from mw* hosts directly look good? (if not, there's no legit fix yet) [18:24:54] bblack: huh? [18:25:15] step 3 is: if mw* hosts output good stuff, and the public hostnames don't, then you have a varnish cache problem [18:25:35] bblack: yeah I don't think it's a varnish cache problem per se, or I hope not [18:26:01] it sounds like still a CN + RL problem, and I really don't have much depth in that area [18:26:02] might be... [18:26:04] Krinkle: ? [18:26:08] He's on vacation [18:26:22] jdlrobson, who has done some recent work on RL, has been looking into it.... [18:26:38] ok [18:26:43] do we know what change broke things? [18:26:53] bblack: also no [18:27:17] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2244454 (10RobH) Please note that of the three new systems, only one of the three systems was ordered with SSDs. Please make the system with the Intel SSDs restbase2009. restbase2007-2008 need to have th... [18:27:17] that's an increasingly common answer for any kind of breakage on the wikis this year :/ [18:27:36] I still blame pace of code deployment getting too fast for our QA abilities [18:28:26] (03CR) 10Yuvipanda: [C: 032] diamond: add .pep8 file raising line length limits [puppet] - 10https://gerrit.wikimedia.org/r/285694 (owner: 10Yuvipanda) [18:29:49] Hmmm [18:29:54] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2244457 (10RobH) [18:29:56] 06Operations, 10ops-codfw: rack/setup/deploy restbase200[7-9] switch configuration - https://phabricator.wikimedia.org/T133788#2244455 (10RobH) 05Open>03Resolved restbase2007 = ge-1/0/2 row B rack B1 restbase2008 = ge-1/0/2 row C rack C1 restbase2009 = ge-1/0/6 row D rack D1 all the switch ports have had... [18:30:02] AndyRussG: in any case, still, I'd break this down to where you have an easy test-case in curl on the commandline. "When I fetch this URL, if the output looks like X, things are still broken / now fixed" [18:30:33] the problem is the curl test URL will likely be an RL URL [18:31:06] bblack: K that makes sense... I don't know enuf about how RL module versions are determined to know how to do that, but I can look into it.... 
[18:31:08] jdlrobson: ^ [18:31:29] well, perhaps dig from current output [18:31:46] start with a fetch of the HTML spit out by https://de.wikipedia.org/wiki/Wikipedia:Hauptseite [18:31:46] yep [18:31:56] and follow the trail of URLs down to the RL fetch that contains the buggy bit [18:32:03] yep [18:32:24] * AndyRussG stares buggy bit in its buggy eyes [18:32:39] (03PS10) 1020after4: Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747) [18:33:47] AndyRussG: the url I gave you above should work [18:37:28] (03PS2) 10ArielGlenn: Raise connection limit for dumps server, for specific shared IP [puppet] - 10https://gerrit.wikimedia.org/r/285682 (https://phabricator.wikimedia.org/T133790) [18:41:22] bblack: got it, that was quick. tyvm ! [18:43:14] 06Operations, 06Labs, 10Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2244507 (10Krenair) ```lang=irc Krenair: It's a hack, but I tend to put those things in sink plugins, since sink is already in charge of cleaning up... [18:44:26] (03CR) 10jenkins-bot: [V: 04-1] tools: Switch to newer webservice comand [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [18:45:37] (03CR) 10Dzahn: [C: 032] introduce ununpentium.wm.org [dns] - 10https://gerrit.wikimedia.org/r/285688 (https://phabricator.wikimedia.org/T123713) (owner: 10Dzahn) [18:46:51] mutante: np [18:48:57] (03PS1) 10EBernhardson: Stop pushing elasticsearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285698 (https://phabricator.wikimedia.org/T133784) [18:49:26] (03CR) 10Mobrovac: "You will also need to set cassandra::target_version in hieradata/labs/host/deployment-restbase02.yaml (which you have to create) to tempor" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [18:49:33] (03CR) 10jenkins-bot: [V: 04-1] Stop pushing elasticsearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285698 (https://phabricator.wikimedia.org/T133784) (owner: 10EBernhardson) [18:49:55] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2244514 (10TerraCodes) >>! In T109331#2207393, @NahidSultan wrote: > Another one: https://upload.wikimedia.org/... 
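To make bblack's curl methodology from earlier repeatable: fetch the ResourceLoader startup module and inspect how ext.centralNotice.choiceData is registered; per the discussion, its dependency list should be non-empty on a wiki with a running campaign. The load.php parameters below are the usual ones; reading an empty dependency list as the "broken output" is the thread's hypothesis, not an established fact.

```
# Scripted version of the curl test case suggested above: pull the RL
# startup module and show the registration entry for
# ext.centralNotice.choiceData so its dependency list can be inspected
# (an empty list is the symptom being chased in T133765).
import requests

url = ("https://de.wikipedia.org/w/load.php"
       "?modules=startup&only=scripts&lang=de&skin=vector&debug=false")
startup = requests.get(url, timeout=10).text
idx = startup.find('"ext.centralNotice.choiceData"')
# Print the surrounding registration entry for manual inspection.
print(startup[idx:idx + 200] if idx >= 0 else "module not registered")
```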
[18:50:45] (03CR) 10Yuvipanda: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [18:51:13] (03PS2) 10EBernhardson: Stop pushing elasticsearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285698 (https://phabricator.wikimedia.org/T133784) [18:51:48] (03CR) 10DCausse: [C: 031] Stop pushing elasticsearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285698 (https://phabricator.wikimedia.org/T133784) (owner: 10EBernhardson) [18:52:35] (03PS3) 10EBernhardson: Stop pushing elasticsearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285698 (https://phabricator.wikimedia.org/T133784) [18:53:37] (03CR) 10EBernhardson: [C: 032] Stop pushing elasticsearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285698 (https://phabricator.wikimedia.org/T133784) (owner: 10EBernhardson) [18:54:03] (03Merged) 10jenkins-bot: Stop pushing elasticsearch writes to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285698 (https://phabricator.wikimedia.org/T133784) (owner: 10EBernhardson) [18:55:09] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-production.php: Drop codfw from elasticsearch config T133784 (duration: 00m 25s) [18:55:10] T133784: ElasticSearch Not enough active copies to meet write consistency - https://phabricator.wikimedia.org/T133784 [18:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:55] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Drop codfw from elasticsearch config T133784 (duration: 00m 36s) [18:55:56] T133784: ElasticSearch Not enough active copies to meet write consistency - https://phabricator.wikimedia.org/T133784 [18:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:56:20] !log creating VM ununpentium on ganeti/eqiad (T123713) [18:56:21] T123713: decom magnesium (was: Reinstall magnesium with jessie) - https://phabricator.wikimedia.org/T123713 [18:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:57:12] (03CR) 10Yuvipanda: [C: 032] tools: Switch to newer webservice comand [puppet] - 10https://gerrit.wikimedia.org/r/285656 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [18:57:29] (03PS3) 10Alex Monk: Followup I6b0bbb34: Fix pep8 in modules/diamond/files/collector/powerdns_recursor.py [puppet] - 10https://gerrit.wikimedia.org/r/285659 [18:58:02] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1216-1307 - https://phabricator.wikimedia.org/T133798#2244548 (10Cmjohnson) [18:58:12] (03PS1) 10Bartosz Dziewoński: Set $wgRateLimits['upload'] for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) [18:58:46] (03PS1) 10Papaul: DNS: Adding mgmg DNS for mw1261 to mw1307 Bug: T133798 [dns] - 10https://gerrit.wikimedia.org/r/285701 (https://phabricator.wikimedia.org/T133798) [18:58:47] 06Operations, 06Labs, 10Traffic: check_dns needs to be rewritten - https://phabricator.wikimedia.org/T133791#2244564 (10BBlack) Sticking the #Traffic tag on because this affects monitoring of the production DNS authservers too, and that check_dns utility is awful to be relying on for monitoring something so... 
[18:59:04] ostriches: so I am around for good :-} [18:59:56] (03PS2) 10Bartosz Dziewoński: Set $wgRateLimits['upload'] for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) [19:00:01] (03PS11) 1020after4: Fix multiple ssh::userkey resources per user [puppet] - 10https://gerrit.wikimedia.org/r/285519 (https://phabricator.wikimedia.org/T132747) [19:00:04] hashar ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160427T1900). [19:00:11] !log restarting elastic on elastic2007.codfw.wmnet (master) [19:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:33] 06Operations, 10Analytics, 10ArchCom-RfC, 06Discovery, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#2244571 (10csteipp) [19:00:59] (03CR) 10Bartosz Dziewoński: "I think we can do this anytime. Steinsplitter, Rillke, please take a look and check that this is right, and I'll schedule a deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [19:01:04] 06Operations, 10ops-eqiad: Rack and Set up new application servers mw1216-1307 - https://phabricator.wikimedia.org/T133798#2244572 (10Papaul) [19:01:05] the train on group 1 might take a bit longer, because I havent prepared anything in advance [19:01:39] It only takes a few minutes [19:02:20] You don't actually need that deploy-promote script, you can just use updateWikiVersions group1 $version [19:02:35] (03CR) 10Dzahn: "watroles works again. it looks this role class is not used on any instances though (https://tools.wmflabs.org/watroles/role/role::phragile" [puppet] - 10https://gerrit.wikimedia.org/r/285333 (owner: 10Dzahn) [19:03:21] (03PS1) 10Hashar: group1 wikis to 1.27.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285703 [19:03:42] ah and here is the php symlink update [19:07:26] (03CR) 10Hashar: [C: 032] group1 wikis to 1.27.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285703 (owner: 10Hashar) [19:07:47] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 628 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5133273 keys - replication_delay is 628 [19:09:24] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285703 (owner: 10Hashar) [19:09:42] !log hashar@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.22 [19:09:50] ostriches: I am disappointed. That is too easy [19:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:10:07] Why does everyone think it's so hard? 
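For reference, the two steps behind that train log line, roughly as run on the deployment host; the command names are the ones mentioned above, but the exact arguments are a best guess and may differ by tooling version:

```
# point the group1 wikis at the new branch in the wikiversions data
updateWikiVersions group1 1.27.0-wmf.22

# rebuild wikiversions.php and sync it to the cluster; this is what produces
# the "rebuilt wikiversions.php and synchronized wikiversions files" !log entry
sync-wikiversions 'group1 wikis to 1.27.0-wmf.22'
```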
:) [19:10:47] the train has been made so easy & fast that maybe it is time to consider deploying more often [19:11:50] !log update restbase to e9fbdfe: staging [19:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:12:18] 06Operations, 10Analytics, 10ArchCom-RfC, 06Discovery, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#2244625 (10mobrovac) [19:12:46] 06Operations, 10Analytics, 06Discovery, 10EventBus, and 6 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#2244628 (10mobrovac) [19:12:52] 06Operations, 10Analytics, 10ArchCom-RfC, 06Discovery, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1711519 (10mobrovac) 05Open>03Resolved And we're done here! [19:13:33] 06Operations: Investigate cp1044's strange Ganglia graphs - https://phabricator.wikimedia.org/T132859#2244641 (10Dzahn) @Southparkfan see the link in the comment above. it seems to be just about which cluster the server is in, your link had the "maps caches eqiad" part but it's in "misc caches eqiad" (maybe that... [19:13:53] 06Operations: Investigate cp1044's strange Ganglia graphs - https://phabricator.wikimedia.org/T132859#2244642 (10Dzahn) p:05Normal>03Low [19:14:04] hashar: My only objection is the current branching process. [19:14:11] That's the biggest pain in doing it more often [19:14:19] Actually deploying and swapping versions is easy [19:15:56] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 03Scap3: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2244644 (10Dzahn) [19:16:12] so unsurprisingly there is a spike of warning messages "Duplicate get(): "{key}" fetched {count} times". That has been set by Krinkle && ori to track duplicate get() calls to BagOStuff [19:16:58] and a bunch of "https-expected" [19:17:11] Long as things are expected it's not a problem.
[19:17:19] (03PS1) 10Dzahn: add ununpentium to site.pp as test::system, DHCP/netboot [puppet] - 10https://gerrit.wikimedia.org/r/285705 (https://phabricator.wikimedia.org/T123713) [19:17:52] (03PS1) 10Yurik: Add yurik to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/285706 [19:18:01] looking at logstash nothing looks concerning for .22 [19:18:10] !log update restbase to e9fbdfe: canary on restbase1007 [19:18:15] (03PS3) 10Dzahn: Only install font-gujr-extra on jessie [puppet] - 10https://gerrit.wikimedia.org/r/285621 (https://phabricator.wikimedia.org/T129500) (owner: 10Muehlenhoff) [19:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:31] (03PS2) 10Yurik: Add yurik to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/285706 [19:18:44] (03CR) 10Dzahn: [C: 032] "to fix issues for beta cluster" [puppet] - 10https://gerrit.wikimedia.org/r/285621 (https://phabricator.wikimedia.org/T129500) (owner: 10Muehlenhoff) [19:20:18] 06Operations, 07Puppet, 06Commons, 10Wikimedia-SVG-rendering, and 2 others: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2244661 (10Dzahn) @hashar should be gone now [19:21:06] (03PS2) 10Dzahn: add ununpentium to site.pp as test::system, DHCP/netboot [puppet] - 10https://gerrit.wikimedia.org/r/285705 (https://phabricator.wikimedia.org/T123713) [19:22:22] (03PS3) 10Dzahn: add ununpentium to site.pp as test::system, DHCP/netboot [puppet] - 10https://gerrit.wikimedia.org/r/285705 (https://phabricator.wikimedia.org/T123713) [19:23:16] (03CR) 10Dzahn: [C: 032] add ununpentium to site.pp as test::system, DHCP/netboot [puppet] - 10https://gerrit.wikimedia.org/r/285705 (https://phabricator.wikimedia.org/T123713) (owner: 10Dzahn) [19:24:20] !log update restbase to e9fbdfe [19:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:26:06] (03PS1) 10ArielGlenn: correct usage message when argument to job option matches no known job [dumps] - 10https://gerrit.wikimedia.org/r/285707 [19:28:43] (03CR) 10ArielGlenn: [C: 032] correct usage message when argument to job option matches no known job [dumps] - 10https://gerrit.wikimedia.org/r/285707 (owner: 10ArielGlenn) [19:29:38] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:33:50] (03PS1) 10Bartosz Dziewoński: Configure 'testwiki' as foreign file repo for 'test2wiki', allow cross-wiki uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) [19:34:17] (03CR) 10Bartosz Dziewoński: "How crazy does this look to you?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285708 (https://phabricator.wikimedia.org/T133305) (owner: 10Bartosz Dziewoński) [19:39:17] (03CR) 10Eevans: [WIP]: Cassandra 2.2.5 config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [19:40:15] (03CR) 10Eevans: "> You will also need to set cassandra::target_version in" [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [19:40:42] ebernhardson: gehel: we have switched group1 to .22 --- we have a bunch of "Received job for unwritable cluster codfw." 
[19:40:56] (03PS5) 10Eevans: [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [19:40:58] hashar: Isn't that known? [19:41:00] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2244694 (10Dzahn) thanks for the fix. definitely fewer IPs in there now. the remaining ones i see currently: 10.68.16.147 (down) 10.68.17.204 (down) 10.68.16.53 (integrat... [19:41:01] And reported earlier? [19:41:03] a good 15k per minute [19:41:05] yeah [19:41:36] I am now wondering whether it is going to fix itself or if we can turn the exception spam off ;-} [19:42:02] (03CR) 10jenkins-bot: [V: 04-1] [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [19:42:52] * gehel reading back to see what hashar means.. [19:43:35] hashar: what do you mean by "group1 to .22" ? [19:43:42] we also had a spike of "No configuration available for cluster: codfw" between 19:09:30 UTC and 19:10:40 UTC, must have been caused by the group1 switch [19:43:47] oh [19:44:12] I have deployed MediaWiki 1.27.0-wmf.22 to a lot of different wikis which together are known as group1 [19:44:30] before that it was just test wiki / mediawiki.org and a few other small ones [19:47:12] anyway that does not seem to be much of a concern [19:47:28] and filtering that out, there are no new exceptions/fatals [19:48:21] (03PS1) 10Yuvipanda: Write new service.manifest all the time when restarting [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285710 [19:48:24] RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 13850 bytes in 0.002 second response time [19:49:04] RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 13850 bytes in 0.003 second response time [19:49:32] (03PS1) 10Eevans: cleanup uneeded jars [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/285712 [19:50:43] hashar: Ok, understood. I thought there were many groups (1, 2, ..., 22) and did not know what to make of it. [19:51:19] (03CR) 10Yuvipanda: [C: 032] Write new service.manifest all the time when restarting [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/285710 (owner: 10Yuvipanda) [19:51:21] hashar: That exception sounds like something related to us disabling writes to codfw elasticsearch cluster.
ebernhardson is probably interested [19:51:56] (03CR) 10Yuvipanda: [C: 032] Use webservice not webservice-new [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/285624 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [19:52:01] (03PS6) 10Eevans: [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [19:52:21] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2244714 (10Dzahn) host 10.68.16.66 is special, look how many names that has: host 10.68.16.66 | wc -l 27 all in contintcloud.eqiad.wmflabs [19:52:48] (03PS1) 10Cmjohnson: Fixing typo on mgmt dns entries for elastics [dns] - 10https://gerrit.wikimedia.org/r/285715 [19:53:14] (03CR) 10jenkins-bot: [V: 04-1] [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) (owner: 10Eevans) [19:53:33] (03PS2) 10Cmjohnson: Fixing typo on mgmt dns entries for elastics [dns] - 10https://gerrit.wikimedia.org/r/285715 [19:54:46] (03CR) 10Cmjohnson: [C: 032] Fixing typo on mgmt dns entries for elastics [dns] - 10https://gerrit.wikimedia.org/r/285715 (owner: 10Cmjohnson) [19:55:42] (03PS7) 10Eevans: [WIP]: Cassandra 2.2.5 config [puppet] - 10https://gerrit.wikimedia.org/r/284078 (https://phabricator.wikimedia.org/T126629) [19:56:20] gehel: yeah sorry my message lacked context :} [19:56:25] 06Operations, 10OCG-General, 06Services, 13Patch-For-Review: Implement flag to tell an OCG machine not to take new tasks from the redis task queue - https://phabricator.wikimedia.org/T120077#2244717 (10cscott) I believe step 1 and step 2 can be performed in a single puppet commit. Step 3 can probably be i... [19:57:03] gehel: ebernhardson: the ElasticSearch exception for codfw being readonly is nicely seen on https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor :-} [19:57:03] hashar: context is probably implied for people who have been around for longer than I have... [19:57:16] yeah my bad sorry :( [19:57:44] that is why everyone keeps asking questions to clarify ! [19:57:46] hashar: and now I can't seem to find those logs... [19:58:00] * gehel is good at asking questions! [19:59:46] hashar: found it, something about UTC vs CEST ... [19:59:52] in logstash ? [20:00:04] gwicke cscott arlolra subbu bearND mdholloway yurik: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160427T2000). Please do the needful. [20:00:28] no mobileapps deployment today. [20:00:33] deploying kartotherian & tilerator [20:00:56] (03CR) 10BBlack: Read values inbound in X-Analytics header (pageview and preview) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285051 (https://phabricator.wikimedia.org/T133204) (owner: 10Nuria) [20:01:13] train is completed [20:01:21] so deployment is open [20:02:32] gehel: I can give you a quick course about logstash whenever you want [20:03:04] hashar: logstash itself is OK. My brain just needs to engage again...
[20:04:31] (03PS1) 10Gehel: Revert "Depooled wdqs1001 during reinstall" [puppet] - 10https://gerrit.wikimedia.org/r/285716 (https://phabricator.wikimedia.org/T133566) [20:06:52] (03CR) 10Luke081515: [C: 031] Set $wgRateLimits['upload'] for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285700 (https://phabricator.wikimedia.org/T132930) (owner: 10Bartosz Dziewoński) [20:08:19] (03CR) 10Gehel: [C: 032] Revert "Depooled wdqs1001 during reinstall" [puppet] - 10https://gerrit.wikimedia.org/r/285716 (https://phabricator.wikimedia.org/T133566) (owner: 10Gehel) [20:08:23] 06Operations, 10netops, 13Patch-For-Review: block labs IPs from sending data to prod ganglia - https://phabricator.wikimedia.org/T115330#2244755 (10Dzahn) >>! In T115330#2244714, @Dzahn wrote: > host 10.68.16.66 is special, look how many names that has: hashar> mutante: chasemp relevant task is T126518 [20:09:09] !log adding back wdqs1001 to varnish configuration after reinstall (T133566) [20:09:11] T133566: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566 [20:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:12:17] 06Operations, 06Labs, 10Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2244768 (10Krenair) [20:13:43] 06Operations, 10Traffic, 07HTTPS, 13Patch-For-Review: enable https for (ubuntu|apt|mirrors).wikimedia.org - https://phabricator.wikimedia.org/T132450#2244769 (10faidon) I'm not convinced https for that is a good idea. apt doesn't support it by default — apt-transport-https isn't installed out of the box ev... [20:15:21] (03CR) 10BBlack: [C: 04-1] "It's a good start, but:" [puppet] - 10https://gerrit.wikimedia.org/r/285654 (https://phabricator.wikimedia.org/T133167) (owner: 10Alex Monk) [20:16:15] 06Operations, 06Labs, 10Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2244776 (10hashar) From T126518 It is back around :( ``` lang=irc [21:50:04] dzahn@bastion-restricted-01:~$ host 10.68.16.66 [21:50:04] ;; Tr... [20:20:32] !log deployed kartotherian & tilerator services [20:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:35] 06Operations, 10ops-eqiad, 10hardware-requests, 13Patch-For-Review: reclaim or decom: cp1043 + cp1044 - https://phabricator.wikimedia.org/T133614#2237371 (10RobH) a:03mark These were part of the original squids in eqiad, and are well out of warranty. I'm requesting @mark's approval on this task to decom... [20:28:42] !log updated OCG to version e39e06570083877d5498da577758cf8d162c1af4 [20:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:32:14] !log switching wdqs1002 to maintenance and reimporting data (T133566) [20:32:14] T133566: Reinstall and data reload of WDQS servers - https://phabricator.wikimedia.org/T133566 [20:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:35:29] (03PS2) 10Dzahn: DNS: Adding mgmt DNS for mw1261 to mw1307 Bug: T133798 [dns] - 10https://gerrit.wikimedia.org/r/285701 (https://phabricator.wikimedia.org/T133798) (owner: 10Papaul) [20:39:13] hashar: ostriches: is mediawiki train still deploying? [20:40:50] is anyone deploying? I think the train is done, but is service deploy done? I'd like to restart jenkins, there's a plugin that's causing some issues. 
[20:41:15] !log Enabled cirrussearch writes to codfw only on mw1165 w/ live hack [20:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:41:37] ebernhardson: done [20:41:41] hashar: thanks [20:41:56] !log 1.27.0-wmf.22 to group1 has been completed without incident. Deployment is open ! [20:42:00] to make it clear [20:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:58] ebernhardson: please hack ;-} [20:43:12] hashar: :) [20:44:54] 06Operations, 06Labs, 10Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2244874 (10Andrew) there are relatively many ldap connection failures in the sink log. That fits with the fact that our designate setup is subject to periodic OOMs... [20:45:19] (03PS3) 10Dzahn: DNS: Adding mgmt DNS for mw1261 to mw1307 Bug: T133798 [dns] - 10https://gerrit.wikimedia.org/r/285701 (https://phabricator.wikimedia.org/T133798) (owner: 10Papaul) [20:45:26] (03CR) 10Dzahn: [C: 032] DNS: Adding mgmt DNS for mw1261 to mw1307 Bug: T133798 [dns] - 10https://gerrit.wikimedia.org/r/285701 (https://phabricator.wikimedia.org/T133798) (owner: 10Papaul) [20:45:47] 06Operations, 06Labs, 10Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2244890 (10Andrew) It would also be useful to know if we are leaking A records that correspond to the leaked PTR records. [20:47:18] cmjohnson1: ^ [20:47:38] thanks mutante [20:48:13] !log restarting jenkins after plugin downgrade [20:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:52:33] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp1065 is CRITICAL: Connection refused by host [20:52:33] PROBLEM - dhclient process on cp1065 is CRITICAL: Connection refused by host [20:52:43] PROBLEM - Varnish traffic logger - varnishstatsd on cp1065 is CRITICAL: Connection refused by host [20:52:44] PROBLEM - Varnish HTCP daemon on cp1065 is CRITICAL: Connection refused by host [20:52:44] PROBLEM - Disk space on cp1065 is CRITICAL: Connection refused by host [20:52:44] PROBLEM - salt-minion processes on cp1065 is CRITICAL: Connection refused by host [20:52:53] PROBLEM - DPKG on cp1065 is CRITICAL: Connection refused by host [20:52:53] PROBLEM - Confd vcl based reload on cp1065 is CRITICAL: Connection refused by host [20:52:53] PROBLEM - puppet last run on cp1065 is CRITICAL: Connection refused by host [20:52:54] PROBLEM - traffic-pool service on cp1065 is CRITICAL: Connection refused by host [20:53:13] PROBLEM - Varnish traffic logger - varnishreqstats on cp1065 is CRITICAL: Connection refused by host [20:53:14] PROBLEM - Freshness of OCSP Stapling files on cp1065 is CRITICAL: Connection refused by host [20:53:24] PROBLEM - Varnish traffic logger - varnishrls on cp1065 is CRITICAL: Connection refused by host [20:53:34] PROBLEM - Varnish traffic logger - varnishxcps on cp1065 is CRITICAL: Connection refused by host [20:53:52] Well that doesn't look pretty... 
[20:54:03] PROBLEM - HTTPS on cp1065 is CRITICAL: Return code of 255 is out of bounds [20:54:03] PROBLEM - Varnishkafka log producer on cp1065 is CRITICAL: Connection refused by host [20:54:04] PROBLEM - RAID on cp1065 is CRITICAL: Connection refused by host [20:54:04] PROBLEM - confd service on cp1065 is CRITICAL: Connection refused by host [20:54:20] indeed, but it's still just a single host that went down [20:54:23] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp1065 is CRITICAL: Connection refused by host [20:54:23] PROBLEM - configured eth on cp1065 is CRITICAL: Connection refused by host [20:54:47] bblack: maybe this _is_ a caching issue? See my last comment: https://phabricator.wikimedia.org/T133765#2244903 [20:54:47] (03PS1) 10EBernhardson: Revert "Stop pushing elasticsearch writes to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285734 [20:54:57] still reachable, i expect recoveries soon [20:55:14] Cache should bifurcate RL content on the version= parameter, right? [20:55:22] (03CR) 10DCausse: [C: 031] Revert "Stop pushing elasticsearch writes to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285734 (owner: 10EBernhardson) [20:56:05] (03CR) 10Gehel: [C: 031] Revert "Stop pushing elasticsearch writes to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285734 (owner: 10EBernhardson) [20:56:12] (03PS2) 10EBernhardson: Revert "Stop pushing elasticsearch writes to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285734 [20:56:25] (03CR) 10EBernhardson: [C: 032] Revert "Stop pushing elasticsearch writes to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285734 (owner: 10EBernhardson) [20:57:06] (03PS1) 10Rush: icinga for labs Auth DNS convert to check_dig [puppet] - 10https://gerrit.wikimedia.org/r/285746 (https://phabricator.wikimedia.org/T124680) [20:57:06] 06Operations, 10Analytics, 10ArchCom-RfC, 06Discovery, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#2244925 (10Ottomata) YEEHAW [20:57:22] (03PS4) 10Rush: toollabs: Use a template for limits.conf [puppet] - 10https://gerrit.wikimedia.org/r/285645 (https://phabricator.wikimedia.org/T131541) [20:58:11] (03CR) 10Rush: [C: 032 V: 032] toollabs: Use a template for limits.conf [puppet] - 10https://gerrit.wikimedia.org/r/285645 (https://phabricator.wikimedia.org/T131541) (owner: 10Rush) [20:58:39] 07Blocked-on-Operations, 06Operations, 10RESTBase, 10hardware-requests: Expand SSD space in Cassandra cluster - https://phabricator.wikimedia.org/T121575#2244929 (10RobH) a:05RobH>03fgiunchedi My understanding is we've now ordered all the items needed for this? I'm going to assign to @fgiunchedi for h... 
[20:58:52] (03PS2) 10Rush: icinga for labs Auth DNS convert to check_dig [puppet] - 10https://gerrit.wikimedia.org/r/285746 (https://phabricator.wikimedia.org/T124680) [20:58:58] (03Merged) 10jenkins-bot: Revert "Stop pushing elasticsearch writes to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285734 (owner: 10EBernhardson) [21:00:15] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-production.php: Restore codfw to elasticsearch config T133784 (duration: 00m 37s) [21:00:16] T133784: ElasticSearch Not enough active copies to meet write consistency - https://phabricator.wikimedia.org/T133784 [21:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:01:04] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: Restore codfw to elasticsearch config T133784 (duration: 00m 31s) [21:01:05] T133784: ElasticSearch Not enough active copies to meet write consistency - https://phabricator.wikimedia.org/T133784 [21:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:02:03] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:02:21] (03PS1) 10Dzahn: installserver: let hydrogen use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/285753 (https://phabricator.wikimedia.org/T123727) [21:02:34] PROBLEM - IPsec on cp1065 is CRITICAL: Connection refused by host [21:02:43] (03PS2) 10Dzahn: installserver: let hydrogen use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/285753 (https://phabricator.wikimedia.org/T123727) [21:02:45] sorry I haven't been looking here, I'm still on cp1065 looking there... [21:03:15] !log rebooting cp1065 [21:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:03:42] (03CR) 10Dzahn: [C: 032] installserver: let hydrogen use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/285753 (https://phabricator.wikimedia.org/T123727) (owner: 10Dzahn) [21:03:48] (03CR) 10Rush: [C: 032] icinga for labs Auth DNS convert to check_dig [puppet] - 10https://gerrit.wikimedia.org/r/285746 (https://phabricator.wikimedia.org/T124680) (owner: 10Rush) [21:04:00] Ah K [21:05:26] (03PS3) 10Dzahn: installserver: let hydrogen use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/285753 (https://phabricator.wikimedia.org/T123727) [21:07:11] AndyRussG: yes, that URL differs from the origin due to caching [21:07:27] when originally fetched, it had CC headers for 30 days, and nothing has purged it recently [21:07:44] PROBLEM - Host cp1065 is DOWN: PING CRITICAL - Packet loss = 100% [21:07:50] (the version I see is about 4 days old) [21:08:07] bblack: hmmm, based on version I guess? [21:08:14] based on the Age: header [21:09:06] bblack: right, I mean, maybe the client is sending the wrong version URL param? [21:09:15] if you mean the final field in the URL: &version=f03c8171a7f3 [21:09:29] yeah [21:09:54] regardless of what the client is doing, I'm sending identical requests (including that version= field) to the caches and directly to MediaWiki, and getting different results for my "grep WikipediaToTheMoon" [21:10:23] which means MediaWiki has lied about something somewhere. It has emitted two different outputs for the supposedly-same "versioned" object [21:10:40] Right! 
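The Age check bblack mentions is a one-liner; here $URL stands for the long load.php URL from the task:

```
# how long varnish has held the object, in seconds; ~345600 would match
# the "about 4 days old" observation above
curl -sI "https://de.wikipedia.org${URL}" | grep -i '^age:'
```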
[21:11:11] I thought RL module content was supposed to be stable based on that param, tho truth is I don't know 99% of the details [21:11:20] probably somewhere in the bowels of RL, there's some magic that bumps the version when the output should change, which involves some deep hooks somewhere, and some other code managed to change the output without doing the right thing to ensure it gets a new version hash [21:11:26] or something like that [21:12:10] hmmm [21:12:19] Whom to ping? [21:12:30] I have no idea [21:12:54] 06Operations, 13Patch-For-Review: Migrate hydrogen/chromium to jessie - https://phabricator.wikimedia.org/T123727#2244980 (10Dzahn) >>! In T123727#2224351, @MoritzMuehlenhoff wrote: > For the dnsrec service the server should be depooled via confctl. get: [palladium:~] $ sudo confctl --tags dc=eqiad,cluster=d... [21:13:05] a better question for the moment might be: what's a reasonable regex pattern to ban out the affected URLs without wiping out all RL URLs, to get past this one update? [21:13:22] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5090150 keys - replication_delay is 0 [21:13:24] RoanKattouw: hey :) how's it going? is your RL-fu good for a question about RL bowls? [21:13:32] (affected by whatever the general-case problem is here, not just WPttM) [21:13:43] AndyRussG: Sure, hit me [21:13:46] bblack: BTW I got the WPttM campaign with a different version number, too [21:13:51] Oh, reading backscroll [21:13:54] RoanKattouw: cool! https://phabricator.wikimedia.org/T133765#2244903 [21:14:01] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:14:01] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:14:11] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:14:22] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 03Discovery-Search-Sprint, and 2 others: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236#2244987 (10EBernhardson) [21:14:24] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: ElasticSearch Not enough active copies to meet write consistency - https://phabricator.wikimedia.org/T133784#2244985 (10EBernhardson) 05Open>03Resolved Cluster master appears to have gotten into a bad state. We ended u... 
[21:14:30] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:14:38] RoanKattouw: it seems possible that CentralNotice is sticking campaign-specific stuff in RL javascript somehow, and when campaigns are removed or added the RL version= parameter doesn't change [21:14:50] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1065_v4, cp1065_v6 [21:14:54] so when they add/remove campaigns and related things, they get inconsistent results on caching, etc [21:14:57] or something like that [21:15:00] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:15:00] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:15:00] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:15:00] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 5 failures [21:15:12] bblack: actually sticking campaign specific stuff into RL modules is what's supposed to happen [21:15:32] It's a bit of magic to get campaign data onto the client quickly, all bundled up with other RL module stuff [21:15:33] yeah but if that's the case, it's probably also supposed to bump the version= [21:15:42] Yeah, it always has [21:15:50] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:15:58] somehow, WPttM was added or removed without the version= changing [21:15:58] It normally turns over every 10 minutes, has been working fine for over a year [21:16:03] The version is supposed to be based on a hash of the content [21:16:49] bblack: Can you tell me in more detail what you are observing? RL itself ignores the &version= param, it's just a cache buster, so if you are seeing different output for the same version param some time apart that's not /necessarily/ a bug [21:17:01] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:17:02] The version param we are directing people to should change when the content does, though [21:17:29] it can't be just a cache buster, right? [21:17:36] but in any case [21:17:41] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1065_v4, cp1065_v6 [21:17:41] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1065_v4, cp1065_v6 [21:17:41] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:17:41] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:17:45] It really is just a cache buster [21:17:45] RoanKattouw: maybe we're directing people to the wrong version # sometimes? [21:17:51] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1065_v4, cp1065_v6 [21:17:56] I mean, it's computed based on knowledge of what we think should be there [21:18:02] Or maybe the cache isn't busting as it should?
[21:18:11] But when the actual request is made, the value of the version param is ignored [21:18:20] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1065_v4, cp1065_v6 [21:18:20] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1065_v4, cp1065_v6 [21:18:28] I got the stale campaign with more than one different version URL param [21:18:32] There's a task about not doing that, and setting a short cache timeout on the response when the version param is incorrect [21:18:42] well anyways [21:18:59] the intent is that, for all version=12345678 requests of the same RL URL, MW should *always* (over all time) emit the same content, right? [21:19:27] However, when I put in only the ext.centralNotice.choiceData module in the request URL, I get the updated data (no stale campaign) [21:19:34] I won't repeat the whole ugly URL, which is from https://phabricator.wikimedia.org/T133765#2244659 [21:19:40] but this is my test: [21:20:08] bblack@palladium:~$ curl -s $URL | grep -c WikipediaToTheMoon [21:20:09] 1 [21:20:16] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:20:37] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:20:38] bblack@palladium:~$ curl -s $URL -H 'X-Forwarded-Proto: https' --resolve de.wikipedia.org:80:10.2.2.1 | grep -c WikipediaToTheMoon [21:20:41] 0 [21:20:46] PROBLEM - IPsec on cp2001 is OK: Strongswan CRITICAL - ok: 54 not-conn: cp1065_v4, cp1065_v6 [21:20:47] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp1065_v4, cp1065_v6 [21:20:54] where $URL is actually that whole ugly URL from the ticket, and all the added-on -H parameters in the ticket too [21:21:16] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp1065_v4, cp1065_v6 [21:21:19] basically when I fetch that from the real dewiki (varnish caches), it has WPttM in the output [21:21:23] OK, I see the URL from the task that contains WPttM [21:21:25] when I fetch it directly from the appservers, it does not [21:21:34] the caches got it from the appservers [21:21:42] Right [21:21:48] so that implies the appservers, at two different points in time, emitted different content for the same version= [21:22:01] Yes, and that can happen [21:22:09] There is a race condition that happens occasionally [21:22:10] if that can happen, the current RL design is broken [21:22:28] Yes, which is why there's a task to set a short cache header when the version param is wrong; right now it's not checked [21:22:30] Let me explain [21:22:42] 06Operations, 10Traffic, 06WMF-Legal, 10domains: wikipedia.lol - https://phabricator.wikimedia.org/T88861#2244998 (10Dzahn) a:05Dzahn>03None [21:23:41] The basic caching strategy is that we compute a hash that describes what the content of each module should be (either a hash of the actual content, or of precursors for perf reasons in some cases), and we stick that hash in the "startup module", which is the ?modules=startup request that's on every page and has a 5-minute max-age [21:24:30] When requesting a module, we then stick that hash in the ?version= parameter (in practice, we request multiple modules at once, and ?version= becomes the hash of the concatenation of the hashes of those modules, or something like that) [21:25:05] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search-Backlog, and 3 others: ElasticSearch Not enough active copies to meet write consistency -
https://phabricator.wikimedia.org/T133784#2245018 (10EBernhardson) p:05Triage>03Unbreak! [21:25:08] So when a module's content changes, its hash will change. Within 5 minutes, the ?modules=startup response will go stale and get refetched from an appserver, so it'll get the new hash [21:25:49] Krinkle: i wonder why planet gets an Error 500 when trying to read the feed from your site [21:25:51] Then clients will start making requests for that module with a ?version= param that the caches have never seen before, so the first request will miss and go to an appserver, which will deliver the new content [21:26:08] But this assumes updates are automic [21:26:11] *atomic [21:26:25] In reality, the way our deployment system is set up, a situation can arise where one appserver has the new version but another one has the old one [21:26:59] And if the timing is "right", the appserver that generates the new startup module response will put in the hash of the new content, but the appserver that responds to the resulting request will still have the old content [21:27:19] So then the old content gets cached for the new hash, and it gets stuck this way until someone manually purges it or the content changes again [21:27:43] (Previously we had the option of touching files because stuff was timestamp based, but with content hashes that trick doesn't work unless you're willing to have unversioned changes where people add random whitespace in places) [21:27:56] This race condition doesn't happen often, but we have seen it [21:27:58] (03PS1) 10Rush: icinga labs auth dns check update description [puppet] - 10https://gerrit.wikimedia.org/r/285755 (https://phabricator.wikimedia.org/T124680) [21:28:05] bblack: Does that all make sense? [21:28:21] (03PS2) 10Rush: icinga labs auth dns check update description [puppet] - 10https://gerrit.wikimedia.org/r/285755 (https://phabricator.wikimedia.org/T124680) [21:28:30] (03CR) 10Hashar: "mutante : thank you! That is a step toward migrating PHP Jenkins jobs to Nodepool instances ;-}" [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [21:29:06] Of course, it's also possible that CN is doing something it's not supposed to do, like using a custom hash that's not computed correctly, or parameterizing on $wgUser in places where that's not supported [21:29:16] (03PS1) 10Dzahn: planet: remove broken feed from fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285757 (https://phabricator.wikimedia.org/T133573) [21:29:43] 06Operations, 05Continuous-Integration-Scaling, 07Nodepool, 07WorkType-NewFunctionality: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#2245031 (10hashar) a:05hashar>03None [21:29:48] (03CR) 10Rush: [C: 032] icinga labs auth dns check update description [puppet] - 10https://gerrit.wikimedia.org/r/285755 (https://phabricator.wikimedia.org/T124680) (owner: 10Rush) [21:29:50] RL modules are generally cookieless so if you try to use information about the current user you'll get interesting cache pollution effects. 
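Given that explanation, one way to check for the race from the outside, assuming the usual mw.loader.register() registry format inside the startup response (not quoted from this incident):

```
# the startup module (5-minute max-age) embeds the hash RL currently expects
# for each module; pull out the CentralNotice entry and compare that hash with
# the &version= value the stuck response was cached under
curl -s 'https://de.wikipedia.org/w/load.php?modules=startup&only=scripts' |
  grep -o '"ext.centralNotice.choiceData"[^]]*'
```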
Same if you depend on anything else that's not explicitly in the request params [21:29:59] (03PS2) 10Dzahn: planet: remove broken feed from fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285757 (https://phabricator.wikimedia.org/T133573) [21:30:32] honestly I can only follow all of that a little [21:30:34] (03PS4) 10Dzahn: installserver: let hydrogen use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/285753 (https://phabricator.wikimedia.org/T123727) [21:30:57] but the bottom line is, we're treating ?version=foo as having some kind of versioned-content meaning for caching purposes, and it clearly doesn't really [21:30:58] (03CR) 10Dzahn: [V: 032] installserver: let hydrogen use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/285753 (https://phabricator.wikimedia.org/T123727) (owner: 10Dzahn) [21:31:37] (03PS3) 10Dzahn: planet: remove broken feed from fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285757 (https://phabricator.wikimedia.org/T133573) [21:31:47] and that's because it's "just a cache-buster", it doesn't actually determine the content variant or anything [21:31:49] RoanKattouw: yeah CN doesn't do that AFAIK... tho it was suggested that we switch from using getModifiedHash() to getDefinitionSummary() here: https://github.com/wikimedia/mediawiki-extensions-CentralNotice/blob/dac230995248c5345216d23707b42d22d99b9cb7/includes/CNChoiceDataResourceLoaderModule.php#L180-L182 [21:32:41] bblack: Yup that's right. And if we only had one appserver, it would work correctly, but with multiple servers race conditions are possible where it doesn't [21:32:51] well :) [21:33:06] downgrading to only running 1x appserver isn't a solution for this design problem either :) [21:33:11] something has to give [21:33:14] hahaha [21:33:14] yes [21:33:17] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2245057 (10Dereckson) I've still a 404 for https://upload.wikimedia.org/wikipedia/commons/7/7f/Sajid-Monkey-Biz... [21:33:24] if it were a really big appserver? [21:33:30] The planned workaround for this is to verify the version param [21:33:33] getModifiedHash() is deprecated so maybe something's gone wonky with it? [21:33:44] this sounds like workarounds piled on workarounds already without that [21:33:45] And if it's not the value we expect to see, serve the data we have but with a low cache timeout to force a refetch later [21:34:00] someone should step back and take in the whole picture of how this works and think of better solutions... [21:34:08] (03PS6) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [21:34:16] but that someone certainly isn't me [21:34:34] Well, this IS that design [21:34:55] (03CR) 10Dzahn: [V: 032] planet: remove broken feed from fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285757 (https://phabricator.wikimedia.org/T133573) (owner: 10Dzahn) [21:35:03] Could someone *not* in Europe check if https://upload.wikimedia.org/wikipedia/commons/7/7f/Sajid-Monkey-Bizness.webm is 404 or not, and tell us if they use upload-lb.codfw.wikimedia.org or upload-lb.eqiad.wikimedia.org?
[21:35:08] (03CR) 10Dzahn: [C: 032] planet: remove broken feed from fr.planet [puppet] - 10https://gerrit.wikimedia.org/r/285757 (https://phabricator.wikimedia.org/T133573) (owner: 10Dzahn) [21:35:13] And this race condition is the only issue with it that I'm aware of, and with that workaround I think it'll be fixed [21:35:19] What about the immediate solution for CN? We're getting JS errors all around on dewiki [21:35:31] Sorry, yes, let me investigate that specific issue too [21:35:40] Likely one of two things happened [21:36:04] Either you fell victim to that race condition and we just need to do a manual purge, or something is systematically wrong with that RL module in CN [21:36:12] Either way, doing a manual purge is a good idea [21:36:15] (or both) [21:36:32] also, a manual purge of what exactly? [21:36:44] Of the giant URL from the task [21:36:55] but not 100 other variants with slightly different parameters? [21:36:57] If you change the version param to something else (e.g. ?version=roan) it returns the correct result [21:37:02] (03PS7) 10Ottomata: Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) [21:37:11] Hmm, yeah there could be a few [21:37:18] we need some kind of reasonable regex [21:37:22] Do we still have wildcard purges/bans for this? [21:37:29] (03CR) 10Dereckson: "Hmmmm, actually it's a 404 for the world, only for Planet." [puppet] - 10https://gerrit.wikimedia.org/r/285757 (https://phabricator.wikimedia.org/T133573) (owner: 10Dzahn) [21:37:35] regex, not wildcard, and yes it's a ban not a purge [21:37:37] Honestly, any URL that contains both load.php and choiceData [21:37:50] and bans don't scale, bans are emergency hacks [21:37:58] (03CR) 10Dereckson: "(it's not 404 for the world)" [puppet] - 10https://gerrit.wikimedia.org/r/285757 (https://phabricator.wikimedia.org/T133573) (owner: 10Dzahn) [21:37:59] our design can't be "this will sometimes go wrong and we need to ask ops to ban something" [21:38:03] I know [21:38:10] (03CR) 10Dzahn: "it does not in my browser from over here" [puppet] - 10https://gerrit.wikimedia.org/r/285757 (https://phabricator.wikimedia.org/T133573) (owner: 10Dzahn) [21:38:12] Hence the planned solution [21:38:19] It doesn't seem to be high on the perf team's radar though [21:38:40] Lemme find the task and bump it, and ask Krinkle if he can work on it soon after he's back from vacation [21:39:18] (03CR) 10Dzahn: "also with curl from my laptop: 404 Not Found" [puppet] - 10https://gerrit.wikimedia.org/r/285757 (https://phabricator.wikimedia.org/T133573) (owner: 10Dzahn) [21:42:28] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2245119 (10Ottomata) They should be in separate racks from each other, but these are replacements for aqs100[123], so it doesn't matter if they are in the same racks as those. If yall can... [21:43:10] RoanKattouw: bblack: so, for the time being, I guess, try to purge the relevant parts of the cache, and hope it isn't some deeper CN issue? [21:43:26] I will look at the CN code to see if I can find anything suspicious [21:43:37] RoanKattouw: oh OK that sounds amazing, thanks so much! [21:43:53] I think that part hasn't changed much since you last checked it out [21:44:11] Well what has changed is I've completely forgotten everything about it :P [21:44:13] Dunno if anything changed in the RL API that we used... 
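The ?version=roan trick mentioned above works because the param is only a cache-key component: any never-before-seen value misses varnish and is answered fresh by an appserver. A sketch, with the module list trimmed to the one module in question:

```
# a made-up version value forces a cache miss, answering "what do the
# appservers emit right now?" without purging anything
curl -s 'https://de.wikipedia.org/w/load.php?modules=ext.centralNotice.choiceData&version=roan' |
  grep -c WikipediaToTheMoon   # expect 0 once the campaign is really gone
```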
[21:45:33] Dereckson: attempt to find wordpress.com ops.. joined #wordpress and asked for sysadmins (of the actual site) [21:45:33] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy conf200[123] - https://phabricator.wikimedia.org/T131959#2245135 (10Ottomata) Naw, this won’t matter. The info in the 2 zookeeper clusters is totally independent. Everything equal will only talk to the eqiad zookeeper cluster. Same goes for... [21:46:18] mutante: k, I'll see with Garfieldairlines what happens and what kind of block it has [21:46:26] RoanKattouw: it's not complicated... We just create a json-encoded object with data on possible campaigns for users, as a RL module. It varies based on project and language, the two things that are available server-side [21:46:50] Dereckson: that one seems like DNS just changed and different regions of the world have different views now [21:46:57] We can definitely update to use getDefinitionSummary [21:48:40] I think that might be automatic actually [21:49:17] mutante: oh I've found what happens: ok in IPv6, not ok in IPv4 [21:49:23] 06Operations, 13Patch-For-Review: Boot time race condition when assembling root raid device on cp1052 - https://phabricator.wikimedia.org/T131961#2245145 (10BBlack) I've just rebooted cp1065 today for unrelated reasons, and the problem does not seem fixed (in fact, it seems worse if anything. neither of the d... [21:50:34] Dereckson: oh :) that fits , yea [21:51:02] Ah hmmm [21:51:11] AndyRussG: One thing you could do is override enableModuleContentVersion() to return true, then remove getModifiedHash() [21:51:27] But you're basically already using content hashing, that would just make it implicit [21:51:35] Dereckson: i guess we should contact him, but dont have time to talk to all the external people .. hrmm [21:51:48] i.e. it would make for cleaner code but wouldn't fundamentally change anything [21:52:09] RECOVERY - Varnish traffic logger - varnishxcps on cp1065 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishxcps, UID = 0 (root) [21:52:09] RECOVERY - confd service on cp1065 is OK: OK - confd is active [21:52:10] RECOVERY - RAID on cp1065 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [21:52:10] RECOVERY - Host cp1065 is UP: PING OK - Packet loss = 0%, RTA = 3.87 ms [21:52:10] RECOVERY - traffic-pool service on cp1065 is OK: OK - traffic-pool is active [21:52:10] RECOVERY - Varnish HTCP daemon on cp1065 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd [21:52:21] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 44 ESP OK [21:52:21] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - 44 ESP OK [21:52:26] RoanKattouw: K interesting... What about getDefinitionSummary? Should we add that? 
[21:52:34] It doesn't look to me like there's anything wrong with CN's RL module, it was probably just the known race condition [21:52:39] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp1065 is OK: No errors detected [21:52:40] RECOVERY - Varnish traffic logger - varnishreqstats on cp1065 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishreqstats, UID = 0 (root) [21:52:40] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 1045 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5093963 keys - replication_delay is 1045 [21:52:40] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [21:52:40] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 44 ESP OK [21:52:40] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 44 ESP OK [21:52:40] RECOVERY - IPsec on cp4016 is OK: Strongswan OK - 44 ESP OK [21:52:51] mutante: I know him from #wikipedia-fr, so I'll get in touch [21:53:03] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp1065 is OK: No errors detected [21:53:03] RECOVERY - Varnish traffic logger - varnishrls on cp1065 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishrls, UID = 0 (root) [21:53:03] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [21:53:03] AndyRussG: So, there are two ways you can do versioning [21:53:03] RECOVERY - Confd vcl based reload on cp1065 is OK: reload-vcl successfully ran 0h, 4 minutes ago. [21:53:04] RECOVERY - Varnish traffic logger - varnishstatsd on cp1065 is OK: PROCS OK: 1 process with args /usr/local/bin/varnishstatsd, UID = 0 (root) [21:53:04] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [21:53:04] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 44 ESP OK [21:53:04] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - 44 ESP OK [21:53:13] PROBLEM - NTP on cp1065 is CRITICAL: NTP CRITICAL: Offset unknown [21:53:13] RECOVERY - DPKG on cp1065 is OK: All packages OK [21:53:13] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [21:53:17] Dereckson: cool, thanks [21:53:32] RECOVERY - Disk space on cp1065 is OK: DISK OK [21:53:33] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 44 ESP OK [21:53:33] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 44 ESP OK [21:53:33] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - 44 ESP OK [21:53:43] If enableContentVersion() returns true (defaults to false), then RL will just compute the content of your module, hash that, and use that as the hash [21:53:44] RECOVERY - Freshness of OCSP Stapling files on cp1065 is OK: OK [21:53:44] RECOVERY - Varnishkafka log producer on cp1065 is OK: PROCS OK: 3 processes with command name varnishkafka [21:53:44] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [21:54:12] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 44 ESP OK [21:54:12] RECOVERY - HTTPS on cp1065 is OK: SSLXNN OK - 36 OK [21:54:22] RECOVERY - configured eth on cp1065 is OK: OK - interfaces up [21:54:22] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [21:54:26] RoanKattouw: hmm that does sound like what we want... 
[21:54:31] Yes, exactly [21:54:32] RECOVERY - dhclient process on cp1065 is OK: PROCS OK: 0 processes with command name dhclient [21:54:32] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [21:54:33] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 44 ESP OK [21:54:42] getDefinitionSummary() is what's recommended for when you don't want to use that [21:54:44] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:54:44] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [21:54:48] e.g. for performance reasons [21:55:03] RECOVERY - salt-minion processes on cp1065 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:55:03] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [21:55:04] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 44 ESP OK [21:55:07] For example, we don't use content versioning for file modules, because LESS compilation is expensive [21:55:12] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - 44 ESP OK [21:55:22] getModifiedTime and getModifiedHash are still supported but deprecated [21:55:43] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5091945 keys - replication_delay is 0 [21:56:27] AndyRussG: I recommend reading the in-code documentation of ResourceLoaderModule::getVersionHash() and the methods below it in ResourceLoaderModule.php, it explains these things fairly clearly [21:56:45] K! [21:57:46] Heh I did read some of it when we wrote this code, but I don't think I ever got really comfortable with it, and have also forgotten stuffff 8p [22:00:06] !log banned req.url ~ "^/load.php.*choiceData" on cache_text [22:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:00:47] /load.php? [22:00:48] bblack: woohoo! [22:01:09] That's gonna ban a lot of load.php stuff I think [22:01:27] yeah that's not gonna work, it missed /w/ [22:01:34] hehe [22:01:36] let's go again [22:01:50] Is that just to create a temporary purge or are you going to leave that on? [22:02:21] the effect is somewhat like a regex-based one-shot purge [22:02:47] the mechanism is something that lingers for a while and impacts performance and doesn't scale well [22:03:12] That is something that should be fixed though.. [22:03:16] no [22:03:30] there's no good reason to be doing ban operations routinely [22:03:43] I know [22:04:02] !log banned req.url ~ "^/w/load.php.*choiceData" on cache_text [22:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:04:39] (03PS3) 10Alex Monk: Set up Let's Encrypt certificate for labtestwikitech [puppet] - 10https://gerrit.wikimedia.org/r/285654 (https://phabricator.wikimedia.org/T133167) [22:04:47] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2245176 (10TerraCodes) It was on a windows mobile phone on LTE data (AT&T), so I don't have a console. When vie... [22:05:05] (03CR) 10Mattflaschen: [C: 032] "Existing content is preserved and accessible as normal. We agreed it does not need to be migrated." 
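The two !log'd bans below, in plain varnishadm form (a sketch; in practice this ran via Wikimedia's own tooling on the cache_text hosts rather than by hand):

```
# first attempt, as logged: anchored at /load.php, but the real request
# paths start with /w/load.php, so this matched nothing
varnishadm ban req.url '~' '^/load.php.*choiceData'

# corrected one-shot regex "purge"
varnishadm ban req.url '~' '^/w/load.php.*choiceData'

# optional: watch outstanding bans drain as varnish works through them
varnishadm ban.list
```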
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/282440 (https://phabricator.wikimedia.org/T95871) (owner: 10Mattflaschen) [22:05:18] (03PS5) 10Mattflaschen: Beta Cluster: Use ExternalStore on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282440 (https://phabricator.wikimedia.org/T95871) [22:05:30] (03CR) 10Mattflaschen: [C: 032] Beta Cluster: Use ExternalStore on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282440 (https://phabricator.wikimedia.org/T95871) (owner: 10Mattflaschen) [22:05:52] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2245183 (10TerraCodes) >>! In T109331#2245176, @TerraCodes wrote: > It was on a windows mobile phone on LTE dat... [22:05:56] But I wonder why ban just doesn't delete objects from the cache matching that regex instead of avoiding use of such objects [22:06:13] RECOVERY - NTP on cp1065 is OK: NTP OK: Offset -0.004890799522 secs [22:06:15] (03Merged) 10jenkins-bot: Beta Cluster: Use ExternalStore on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282440 (https://phabricator.wikimedia.org/T95871) (owner: 10Mattflaschen) [22:06:42] SPF|Cloud: because they're response objects, they don't have URLs [22:07:33] basically bans boil down to this: [22:07:50] you can ban on attributes of the object, or attributes of a request [22:07:52] it would be too complicated to loop over all kinds of hashes and stuff that could change (internal) urls immediately when the ban is executed? [22:08:04] hash_data's* [22:08:18] if you ban on attributes of the object, two things are happening: [22:08:45] 1. From the moment the ban starts, all live requests are checked against the ban, to see if the response object matches, in which case it's tossed out of storage and turned into a cache miss. [22:09:02] 06Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-extensions-GettingStarted: GettingStarted on Beta Cluster periodically loses its Redis index - https://phabricator.wikimedia.org/T100515#2245193 (10Mattflaschen) I reopened it since it did happen again: T94154#2066037
[22:11:49] and it has to stay there until every single object in the cache is older than the ban [22:11:50] jinx, you just answered that [22:12:02] 06Operations, 10Internet-Archive, 10Wikimedia-Planet: wordpress.com seems to have blocked us from fetching feeds - https://phabricator.wikimedia.org/T133818#2245198 (10Dzahn) [22:12:07] so that's typically 30 days that request regex will linger [22:12:12] (on the backends) [22:12:17] (03CR) 10Thcipriani: [C: 031] "The first step down the path to scap3 :)" [puppet] - 10https://gerrit.wikimedia.org/r/285706 (owner: 10Yurik) [22:12:33] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [22:12:38] (or can be up to 30 days anyways, I don't know the typical time) [22:13:18] Technically with some complicated coding by upstream the ban lurker should also be able to work for req.url... [22:13:23] point 1 above about bans on attributes of the object wasn't quite right... [22:13:51] 1. From the moment the ban starts, all live requests are checked against the ban, to see if the response object matches ** and is older than the ban timestamp **, in which case it's tossed out of storage and turned into a cache miss. [22:14:13] ideally you'd only ban against response properties, not request properties [22:14:55] but the response object does not contain the request URL for various good reasons: sometimes multiple URLs can hash to one response object anyways depending on your VCL, and it's extra cache bloat that's pointless other than for bans. [22:15:26] we could add that bloat, if we didn't think aliasing was an issue, by copying req.url to obj.req_url or something [22:15:43] err that's obj.http.req_url I guess, or maybe it has to enter it via resp.http.req_url [22:16:29] but even if you copy all the request parameters you think you'd ever ban on into the response objects, bans still don't scale. lurking through the existing giant disk caches takes too long. if we ban routinely or with some script, they'll stack up and slow everything down. [22:17:22] Guess who served a production site with ban("req.url ~ ^" + req.url + "$") ;) [22:17:39] Eh obviously only when req.method == PURGE.. [22:17:47] yeah [22:17:55] there's a purge method anyways, though [22:18:06] !log mattflaschen@tin Synchronized wmf-config/db-labs.php: Beta Cluster change (duration: 00m 29s) [22:18:13] bblack: seems to have worked! [22:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:18:25] the whole thing comes down to the fact that bans can have regexes and complex expressions [22:18:30] It took me 6 months to find out that when you want to purge using the PURGE method you need to look at all your hash_data instances [22:18:42] if you have one single static literal URL, you can purge that with PURGE without using bans [22:19:05] The whole purge & ban system is just one complicated thing. [22:19:09] SPF|Cloud: we still didn't fix hash_data issues here [22:19:21] bblack: BTW I found the task I was thinking of: https://phabricator.wikimedia.org/T117587 [22:19:35] hash_data is basically hand-encoded Vary, but without the purge magic that comes with Vary :/ [22:19:51] Due to MobileFrontend I have a complicated config with X-Device and X-Use-Mobile [22:20:05] yeah we have several hash_data add-ons too [22:20:13] I'm just saying, we never fixed the problem.
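A toy model of the ban lifecycle bblack lays out, in plain PHP: illustrative only, nothing like Varnish's actual implementation. It shows the two phases (request-time checks plus the background lurker) and why bans testing request attributes can never be retired, since the stored object is all the lurker ever sees.

```php
<?php
// Toy model; objects and bans both carry timestamps, and a ban
// applies only to objects older than itself.
class BanList {
	private $bans = []; // each entry: [ 'test' => callable, 'ts' => float ]

	public function add( callable $test ) {
		$this->bans[] = [ 'test' => $test, 'ts' => microtime( true ) ];
	}

	// Phase 1: consulted on every cache hit from the moment the ban
	// lands. A matching object older than the ban is evicted and the
	// lookup becomes a miss.
	public function objectIsUsable( array $obj ) {
		foreach ( $this->bans as $ban ) {
			if ( $obj['ts'] < $ban['ts'] && call_user_func( $ban['test'], $obj ) ) {
				return false; // toss the object, serve a miss
			}
		}
		return true;
	}

	// Phase 2: the "ban lurker". Walks the whole cache in the
	// background evicting matches; once no stored object predates a
	// ban, the ban can be dropped and stops costing anything per
	// request. Walking a cache on the order of 600GB is why this takes
	// so long, and a ban that tests *request* attributes can never be
	// retired this way, because only the stored object is available here.
	public function lurk( array &$cache ) {
		foreach ( $cache as $key => $obj ) {
			if ( !$this->objectIsUsable( $obj ) ) {
				unset( $cache[$key] );
			}
		}
		$this->bans = []; // every older object has now been processed
	}
}
```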
they still don't purge right:) [22:20:15] I had to patch MediaWiki to execute purges for X-Device: desktop and X-Device: phone-tablet [22:20:42] I'm hoping nobody makes a big issue out of that long-standing issue until we get past deploying XKey vmod and move past all this PURGE stuff [22:21:11] And after I removed the bans (I thought they used a lot of memory) Puppet still runs out of memory :| [22:21:11] 06Operations, 10MobileFrontend, 10Reading-Web-Backlog, 10Traffic: Seeing desktop text cache while browsing mobile sites - https://phabricator.wikimedia.org/T133441#2232043 (10Jdlrobson) Probably related? https://twitter.com/therealprotonk/status/723963502856142848 [22:21:41] bblack: If you like I'll write a comment on that bug saying bans don't scale and this should be fixed soon; but if you wanna comment yourself I don't wanna step on your toes [22:21:56] sure I'll comment [22:22:17] OK cool [22:22:33] 06Operations, 10Internet-Archive, 10Wikimedia-Planet: wordpress.com seems to have blocked us from fetching feeds - https://phabricator.wikimedia.org/T133818#2245239 (10Dzahn) @Haeb Offered to open a ticket with Wordpress (thanks) text: ``` Hello Wordpress.com admins, the Wikimedia Foundation runs an RSS... [22:26:50] 06Operations, 10MobileFrontend, 10Reading-Web-Backlog, 10Traffic: Seeing desktop text cache while browsing mobile sites - https://phabricator.wikimedia.org/T133441#2245296 (10Jdlrobson) Sorry ignore that. Seems like a different problem. [22:27:03] PROBLEM - puppet last run on ms-fe1003 is CRITICAL: CRITICAL: Puppet has 1 failures [22:27:27] 06Operations, 10MobileFrontend, 10Reading-Web-Backlog, 10Traffic: Seeing desktop text cache while browsing mobile sites - https://phabricator.wikimedia.org/T133441#2245299 (10BBlack) It looks like the same problem to me... I put this in our operations outbound for SoS, and raised this ticket the other day... [22:28:30] Dereckson: in zh.planet Error 500 while updating feed http://www.wretch.cc/blog/wikiken&rss20=1 redirects to tw.yahoo.com [22:29:00] 06Operations, 10MobileFrontend, 10Reading-Web-Backlog, 10Traffic: Seeing desktop text cache while browsing mobile sites - https://phabricator.wikimedia.org/T133441#2245306 (10BBlack) Oh you're right, not the same problem. Still, if we don't understand the problem in this ticket, how do we know it's not go... [22:33:41] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#2245320 (10TerraCodes) >>! In T109331#2245057, @Dereckson wrote: > I've still a 404 for https://upload.wikimedi... 
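SPF|Cloud's workaround above amounts to repeating the purge once per hash_data() input. A sketch of that idea follows; the header name and values come from that description, and the target host is hypothetical, not Wikimedia's setup. The PURGE requests would of course have to be sent to the cache frontend, not the origin.

```php
<?php
// When VCL hashes on an extra header (X-Device here), one URL maps to
// several cache objects, so a PURGE must be sent once per variant.
function purgeAllVariants( $url ) {
	$variants = [ 'desktop', 'phone-tablet' ]; // the hash_data values in use

	foreach ( $variants as $device ) {
		$ch = curl_init( $url );
		curl_setopt_array( $ch, [
			CURLOPT_CUSTOMREQUEST => 'PURGE',
			CURLOPT_HTTPHEADER => [ "X-Device: $device" ],
			CURLOPT_RETURNTRANSFER => true,
		] );
		curl_exec( $ch );
		curl_close( $ch );
	}
}

purgeAllVariants( 'http://cache.example.org/wiki/Main_Page' ); // hypothetical host
```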
[22:33:55] (03PS1) 10Mattflaschen: Enable External Store everywhere on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285765 (https://phabricator.wikimedia.org/T95871) [22:35:55] (03CR) 10Catrope: [C: 031] Enable External Store everywhere on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285765 (https://phabricator.wikimedia.org/T95871) (owner: 10Mattflaschen) [22:37:02] (03CR) 10Mattflaschen: [C: 032] Enable External Store everywhere on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285765 (https://phabricator.wikimedia.org/T95871) (owner: 10Mattflaschen) [22:37:27] (03Merged) 10jenkins-bot: Enable External Store everywhere on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/285765 (https://phabricator.wikimedia.org/T95871) (owner: 10Mattflaschen) [22:39:27] bblack: if I recall correctly some header forces the mobile version, right? [22:43:35] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2245349 (10BBlack) So, I had intended to do the quick test and start the 24H test today, but I've run into some issues. Running the kernel object with `staprun` caused cp1065 to get i... [22:44:22] (When visiting the m.* sites) if so, then someone should try looking at why MobileFrontend is not honoring that header. I usually do that using var_dumps in some php functions, but that's completely unacceptable in production. Using a depooled (or mw[12]017/mw[12]090?) mwserver for that is afaik not common practice(?) but still an option I guess.. [22:45:39] SPF|Cloud: we basically split mobile-vs-desktop on hostname (en.wiki vs en.m.wiki) in varnish, but then when talking to the MediaWikis, we translate the mobile name back to the desktop name (en.wiki) and set the header X-Subdomain to tell MW to render it as mobile [22:46:22] Yea, that header X-Subdomain does not work [22:46:42] well, doesn't work for some pages anyways [22:47:08] I wouldn't know 100% sure how to debug that [22:47:19] me either [22:47:30] I also don't know what changed when, or why this isn't affecting other wikis [22:47:55] at the time it was reported, all wikis were on the same software version because we hadn't done a train deploy since the freeze week for the codfw-switchover testing [22:48:29] If you do it the SPF-way then you need a regular appserver, but it's completely unacceptable to use appservers that are serving production traffic. [22:49:01] I think the best is to wait for a MobileFrontend developer to look at it. [22:51:49] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: puppet fail [22:52:30] RECOVERY - puppet last run on ms-fe1003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [22:54:11] checking mira, but i don't believe it [22:54:20] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [22:55:30] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [22:55:59] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160427T2300). Please do the needful.
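Given bblack's description of what the caches send to the appservers (the desktop Host plus an X-Subdomain header), the report could in principle be reproduced against a single backend without resorting to var_dump. A debugging sketch under stated assumptions: mw1017 as the low-traffic test box mentioned above, "M" as the header value, and "mw-mf" as a marker of MobileFrontend output; all three need checking against the VCL and an actual rendered page.

```php
<?php
// Replay what varnish sends to MediaWiki, per the explanation above,
// at one appserver, to see whether the mobile view comes back.
$ch = curl_init( 'http://mw1017.eqiad.wmnet/wiki/Main_Page' ); // assumed test host
curl_setopt_array( $ch, [
	CURLOPT_HTTPHEADER => [
		'Host: en.wikipedia.org', // mobile hostname translated back to desktop
		'X-Subdomain: M',         // assumed value; tells MW to render mobile
	],
	CURLOPT_RETURNTRANSFER => true,
] );
$html = curl_exec( $ch );
curl_close( $ch );

// "mw-mf" as a MobileFrontend marker is an assumption; verify it first.
echo strpos( $html, 'mw-mf' ) !== false
	? "mobile view rendered\n"
	: "desktop view: header not honored?\n";
```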
[23:00:40] SWAT will be delayed today because CI is still broken [23:00:49] bah [23:01:13] hah, and I added my patches for the wrong day [23:07:57] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: upload-lb.ulsfo.wikimedia.org still allow access to some deleted files - https://phabricator.wikimedia.org/T133819#2245416 (10Dereckson) [23:08:52] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331#1546225 (10Dereckson) Thanks, that was useful.. [23:09:39] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: upload-lb.ulsfo.wikimedia.org still allow access to some deleted files - https://phabricator.wikimedia.org/T133819#2245434 (10Dereckson) [23:12:56] 06Operations, 06Commons, 10Traffic, 10media-storage, and 2 others: upload-lb.ulsfo.wikimedia.org still allow access to some deleted files - https://phabricator.wikimedia.org/T133819#2245445 (10Dereckson) [23:38:47] (03PS1) 10Tim Starling: Allow wikidev to upload to carbon:/srv/wikimedia/incoming [puppet] - 10https://gerrit.wikimedia.org/r/285772 [23:41:46] (03CR) 10Tim Starling: [C: 032] Allow wikidev to upload to carbon:/srv/wikimedia/incoming [puppet] - 10https://gerrit.wikimedia.org/r/285772 (owner: 10Tim Starling) [23:44:50] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review, 07WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#2245540 (10mmodell) [23:48:20] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 8 failures [23:50:15] (03CR) 10Tim Starling: [V: 032] Allow wikidev to upload to carbon:/srv/wikimedia/incoming [puppet] - 10https://gerrit.wikimedia.org/r/285772 (owner: 10Tim Starling) [23:58:54] (03CR) 10jenkins-bot: [V: 04-1] Add new confluent module and puppetization for using confluent Kafka [puppet] - 10https://gerrit.wikimedia.org/r/284349 (https://phabricator.wikimedia.org/T132631) (owner: 10Ottomata) [23:59:31] going to resume SWAT now RoanKattouw? CI is back