[00:00:05] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161020T0000). [00:00:56] Krenair: everything looks good, thank you! [00:02:35] that's not going to happen this week, or is it [00:02:45] phab update [00:03:37] mutante, that's a phab update, maintenance window [00:03:45] and yep that's not happening this week [00:07:56] (03CR) 10BBlack: [C: 032] eqiad recdns IP fix: add new address (.254) [puppet] - 10https://gerrit.wikimedia.org/r/315929 (https://phabricator.wikimedia.org/T143915) (owner: 10BBlack) [00:08:03] (03PS3) 10BBlack: eqiad recdns IP fix: add new address (.254) [puppet] - 10https://gerrit.wikimedia.org/r/315929 (https://phabricator.wikimedia.org/T143915) [00:08:05] (03CR) 10BBlack: [V: 032] eqiad recdns IP fix: add new address (.254) [puppet] - 10https://gerrit.wikimedia.org/r/315929 (https://phabricator.wikimedia.org/T143915) (owner: 10BBlack) [00:23:24] !log restarting pybal on lvs1002 for new recdns IP [00:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:29:35] (03PS2) 10BBlack: eqiad recdns IP fix: switch in puppet [puppet] - 10https://gerrit.wikimedia.org/r/315930 (https://phabricator.wikimedia.org/T143915) [00:30:40] 06Operations, 06Commons, 10Traffic, 10media-storage, 07Regression: Some JPGs are being served as text - https://phabricator.wikimedia.org/T148497#2730435 (10matmarex) [00:43:08] 06Operations, 06Commons, 10Traffic, 10media-storage, 07Regression: Some JPGs are being served as text - https://phabricator.wikimedia.org/T148497#2724723 (10BBlack) If I had to venture a guess after-the-fact, I'd guess that some bad mime-type headers slipped into at least some of the caches for these fil... 
[00:53:02] (03PS1) 10Andrew Bogott: Don't ask LDAP about instance puppet roles [puppet] - 10https://gerrit.wikimedia.org/r/316915 [00:54:10] (03CR) 10Andrew Bogott: "This can be merged as soon as https://phabricator.wikimedia.org/T148683 is closed." [puppet] - 10https://gerrit.wikimedia.org/r/316915 (owner: 10Andrew Bogott) [01:01:02] PROBLEM - Host wtp2019 is DOWN: PING CRITICAL - Packet loss = 100% [01:09:40] (03CR) 10Dzahn: [C: 032] installserver: move standard include to role [puppet] - 10https://gerrit.wikimedia.org/r/315882 (owner: 10Dzahn) [01:09:47] (03PS4) 10Dzahn: installserver: move standard include to role [puppet] - 10https://gerrit.wikimedia.org/r/315882 [01:13:29] (03PS1) 10Legoktm: Revert "Enable AbuseFilterCachingParser by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316917 (https://phabricator.wikimedia.org/T148673) [01:13:42] (03CR) 10jenkins-bot: [V: 04-1] Revert "Enable AbuseFilterCachingParser by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316917 (https://phabricator.wikimedia.org/T148673) (owner: 10Legoktm) [01:14:37] (03PS2) 10Legoktm: Revert "Enable AbuseFilterCachingParser by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316917 (https://phabricator.wikimedia.org/T148673) [01:15:09] ori: ^^ [01:15:30] (03CR) 10Legoktm: [C: 032] Revert "Enable AbuseFilterCachingParser by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316917 (https://phabricator.wikimedia.org/T148673) (owner: 10Legoktm) [01:15:54] (03Merged) 10jenkins-bot: Revert "Enable AbuseFilterCachingParser by default" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316917 (https://phabricator.wikimedia.org/T148673) (owner: 10Legoktm) [01:17:36] !log legoktm@mira Synchronized wmf-config/InitialiseSettings.php: Revert Enable AbuseFilterCachingParser by default - T148673 (duration: 00m 51s) [01:17:36] T148673: Order of operations has changed in AbuseFilters - https://phabricator.wikimedia.org/T148673 [01:17:41] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:18:25] wtf BanBot?? [01:18:27] Platonides: ^ [01:20:30] it's always the same line morebots says [01:20:43] maybe it's detecting something about the timing of the reaction and stashbot's response? [01:20:46] I dunno [01:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:21:37] eh who knows [01:28:02] (03PS3) 10Dzahn: pmacct: move firewall, standard include to role [puppet] - 10https://gerrit.wikimedia.org/r/315879 [01:38:28] i'm getting a bunch of unstyled wiki pages on enwiki, and a bunch of errors like this: [01:38:29] Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "/mathoid/local/v1/":): a_{n} [01:41:26] link? [01:42:19] it's intermittent and the page i saw it on isn't doing it now. 
i'll post one if i see it again [01:44:20] Pchelolo, mobrovac, gwicke [01:44:51] morebots still not around [01:45:02] This is shown when MathRestbaseInterface::evaluateRestbaseCheckResponse doesn't get HTTP 200 and doesn't get more error details [01:45:29] (03CR) 10Dzahn: [C: 032] pmacct: move firewall, standard include to role [puppet] - 10https://gerrit.wikimedia.org/r/315879 (owner: 10Dzahn) [01:48:10] (03CR) 10Dzahn: "please re-add me or other reviewers after https://phabricator.wikimedia.org/T133548 is resolved , until then this is blocked" [puppet] - 10https://gerrit.wikimedia.org/r/254305 (owner: 10Odder) [01:48:38] (03CR) 10Dzahn: "please re-add me after https://phabricator.wikimedia.org/T133548 is resolved, until then this is blocked and i'd like to reduce my incomin" [puppet] - 10https://gerrit.wikimedia.org/r/227079 (https://phabricator.wikimedia.org/T62220) (owner: 10Nemo bis) [01:49:15] (03CR) 10Dzahn: "please re-add me after https://phabricator.wikimedia.org/T133548 is resolved. 
until then this is blocked and i'd like to reduce my incomin" [puppet] - 10https://gerrit.wikimedia.org/r/293464 (https://phabricator.wikimedia.org/T137252) (owner: 10Microchip08) [01:54:26] Can we ban banbot from here? It seems more problems come from it than benefit [01:54:28] (03PS1) 10BBlack: LVS: move ocg to low-traffic set [puppet] - 10https://gerrit.wikimedia.org/r/316920 (https://phabricator.wikimedia.org/T143915) [01:54:30] (03PS1) 10BBlack: LVS: move git-ssh to high-traffic2 set [puppet] - 10https://gerrit.wikimedia.org/r/316921 (https://phabricator.wikimedia.org/T143915) [01:57:15] (03CR) 10Dzahn: "so the ones that don't have it yet are puppetmasters (please see https://gerrit.wikimedia.org/r/#/c/316032/) and eventlog (nothing yet)" [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn) [01:58:17] I marked the address as not to be banned [01:58:23] still… [01:58:32] (03PS1) 10Yuvipanda: labs: Don't read roles from LDAP for all non-tools projects [puppet] - 10https://gerrit.wikimedia.org/r/316923 (https://phabricator.wikimedia.org/T148683) [01:58:40] andrewbogott: ^ [01:58:52] andrewbogott: err, https://gerrit.wikimedia.org/r/316923 [01:58:56] not sure why that didn't come here [01:59:00] i disagree with Zppix|mobile, it saved us from having to edit SAL and Twitter 5 more times [01:59:07] well, not sure [01:59:40] mutante: but it makes it so people must manually lov [01:59:42] Log [01:59:48] (03CR) 10jenkins-bot: [V: 04-1] labs: Don't read roles from LDAP for all non-tools projects [puppet] - 10https://gerrit.wikimedia.org/r/316923 (https://phabricator.wikimedia.org/T148683) (owner: 10Yuvipanda) [02:00:12] I wonder if "Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master" matched something [02:00:23] 06Operations, 10Traffic, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2730602 (10BBlack) This will soon be the last cache_misc backend left that doesn't conform to the 
new normal (single service hostname handled by LVS), so it's becoming a blocker for furthe... [02:00:37] (03CR) 10Andrew Bogott: [C: 031] "Once flake8 is happy, I'm happy" [puppet] - 10https://gerrit.wikimedia.org/r/316923 (https://phabricator.wikimedia.org/T148683) (owner: 10Yuvipanda) [02:00:37] Platonides: seems like "pastes the identical URL 100 times" triggers spam rules or so? [02:00:48] Platonides, that single address? [02:00:52] Zppix|mobile: yes, true, i dunno [02:00:54] 06Operations, 10Traffic, 10Wikimedia-Stream, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2730603 (10BBlack) [02:00:56] mutante: no [02:01:09] it could if it was repeating the same line many times [02:01:15] and there was no other irc activity [02:01:16] Platonides, you know that each time it runs it runs on a different machine behind a different IP, right? [02:01:42] why doesn't it have a cloak? [02:01:47] Platonides the fact that morebots was banned just only after stashbot logged [02:02:57] (03PS1) 10Dzahn: repeat hostname for all records: lists,fermium,carbon [dns] - 10https://gerrit.wikimedia.org/r/316924 [02:03:21] the repro is to have stashbot log something when morebots is around and banbot is op [02:03:44] Platonides, it doesn't need a cloak [02:04:28] banbot disagrees (: [02:04:33] BanBot is wrong [02:04:46] BanBot can't hold op anywhere if it's going to ban things for not having a cloak [02:04:59] it must have matched something [02:05:05] but it's now out of the scroll [02:05:25] the cloak would have prevented it even if it matched [02:05:27] One of your patterns must be far too broad to be used [02:05:52] I'll leave it deopped here until I figure it out [02:06:28] Where else does it have access? 
[02:06:53] to a bunch of channels [02:06:59] today's pattern: "if nick in w:List_of_The_Simpsons_characters" [02:07:05] but it didn't give these problems [02:07:29] :/ [02:07:47] night all [02:07:53] night [02:07:54] Platonides: it sure seems like what banbot doesn't like is that morebots only ever says one thing over and over [02:08:01] (03PS2) 10Yuvipanda: labs: Don't read roles from LDAP for all non-tools projects [puppet] - 10https://gerrit.wikimedia.org/r/316923 (https://phabricator.wikimedia.org/T148683) [02:08:51] it does have an anti-repeat rule [02:09:01] but supposedly the buffer is limited [02:09:08] (03PS3) 10Andrew Bogott: labs: Don't read roles from LDAP for all non-tools projects [puppet] - 10https://gerrit.wikimedia.org/r/316923 (https://phabricator.wikimedia.org/T148683) (owner: 10Yuvipanda) [02:09:08] maybe we could mitigate this by making morebots include more entropy? It could replace ", Master" at the end with the name of the !log sender, and/or a random number in brackets or whatever [02:09:25] it does already, for some users who customized it years ago [02:09:38] I need to find out the reason [02:09:48] but now, I'm going to bed :) [02:09:51] nite! [02:10:46] (03CR) 10Andrew Bogott: [C: 032] labs: Don't read roles from LDAP for all non-tools projects [puppet] - 10https://gerrit.wikimedia.org/r/316923 (https://phabricator.wikimedia.org/T148683) (owner: 10Yuvipanda) [02:11:10] (03PS2) 10Dzahn: repeat hostname for all records: lists,fermium,carbon [dns] - 10https://gerrit.wikimedia.org/r/316924 [02:11:18] (03CR) 10Dzahn: [C: 032] repeat hostname for all records: lists,fermium,carbon [dns] - 10https://gerrit.wikimedia.org/r/316924 (owner: 10Dzahn) [02:15:23] (03PS1) 10Yuvipanda: labs: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/316925 [02:15:35] andrewbogott: ^ [02:16:03] !log test [02:16:05] (03CR) 10Andrew Bogott: "Hm, good point!" 
[puppet] - 10https://gerrit.wikimedia.org/r/316925 (owner: 10Yuvipanda) [02:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Alex [02:16:14] tada [02:16:48] (03PS2) 10Andrew Bogott: labs: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/316925 (owner: 10Yuvipanda) [02:18:07] (03CR) 10Andrew Bogott: [C: 032] labs: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/316925 (owner: 10Yuvipanda) [02:20:06] (03CR) 10Dzahn: "@hashar what do you think, let's start wit IPv6 here from the beginning and check if the firewall rules are still fine (while we switch fr" [puppet] - 10https://gerrit.wikimedia.org/r/316040 (owner: 10Dzahn) [02:20:47] (03PS1) 10Dzahn: repeat hostname for AAAA, bast1001/2001,ripe-atlas,silver [dns] - 10https://gerrit.wikimedia.org/r/316926 [02:22:31] (03CR) 10Dzahn: [C: 032] "yea, i'm doing the same thing as https://gerrit.wikimedia.org/r/#/c/304155/ just not the entire zone at once, but more like a few each day" [dns] - 10https://gerrit.wikimedia.org/r/316926 (owner: 10Dzahn) [02:25:56] (03CR) 10Dzahn: "i think kafka::analytics::burrow might be wrong on krypton, the reason was given as this being a "monitoring" host but it never was one" [puppet] - 10https://gerrit.wikimedia.org/r/316041 (owner: 10Dzahn) [02:28:13] (03Abandoned) 10Dzahn: wmnet: repeat host names on each line, fix indentation, misc cleanup [dns] - 10https://gerrit.wikimedia.org/r/304171 (owner: 10Dzahn) [03:04:39] (03PS1) 10Alex Monk: Get rid of mw-deployment-vars.sh [puppet] - 10https://gerrit.wikimedia.org/r/316928 [03:13:43] 06Operations, 06Discovery, 06Maps, 10Maps-data, 03Interactive-Sprint: Configure monitoring / alerting of Postgresql / redis / ... 
cluster for maps - https://phabricator.wikimedia.org/T135647#2730629 (10Yurik) [03:54:25] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3037632 keys - replication_delay is 0 [03:57:29] PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 26 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[kafkacat] [04:04:00] madhuvishy: ^ :) [05:39:54] AlvaroMolina [05:39:56] HI [05:43:43] !log AlvaroMolina loves AlexZ_ [05:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:43:56] goddamnit [05:44:12] ... [05:54:30] 06Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, 06Services (done): secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#2730654 (10Joe) >>! In T94329#2729735, @GWicke wrote: > From a cost / benefit perspective, improving on the Firewall with a separate VLan might be m... [06:01:16] !log AlvaroMolina loves AlexZ_ [06:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:04:54] <_joe_> AlexZ: DON'T [06:05:02] <_joe_> and you've already been told that [06:05:29] _joe_: It's temporary [06:05:54] <_joe_> heh [06:06:11] <_joe_> AlexZ: we don't want this channel to be +r even temporarily [06:06:28] <_joe_> you should just stop giving this troll so much attention, that's what they crave for [06:06:46] I'm not, they're spamming over 20 channels. 
[06:07:13] <_joe_> well, this one, we can manage [06:07:27] and letting logmsgbot continue working without an acl, is encouraging him [06:07:34] <_joe_> I'm asking myself why the hell logmsgbot is still linked to twitter [06:07:41] and twitter [06:07:48] <_joe_> AlexZ: twitter I agree [06:07:52] <_joe_> I said it repeatedly [06:08:19] For some users, like AlvaroMolina, this is harassment [06:08:21] <_joe_> but, as a part of the ops team, I have two requirements for this channel: 1) that unregistered users can freely join [06:08:53] <_joe_> 2) that anyone and not just people on a whitelist can register things to the SAL [06:09:19] <_joe_> I don't see a good reason for linking it to a twitter account though [06:09:53] You can't simply think we should let it continue unabated? If +r is a temp stop gap until they get bored.. that's really the only option we have. [06:10:16] <_joe_> AlexZ: I am aware this is vandalism (harassment is a more serious thing, I wouldn't use that term so lightly), and on the SAL we can just revert [06:10:54] They've been doing this for months unfortunately, -operations is a rather new target though. [06:11:41] Hi, _joe_ [06:11:50] <_joe_> it began to be more of a target as soon as policing started; typical of trolls, they're happy to see you react to their actions. [06:12:13] <_joe_> hi AlvaroMolina [06:12:54] <_joe_> (I think I said this to everyone over and over) [06:19:21] <_joe_> bbiab, sorry [06:19:32] well they're more than happy to spam over 200 users for hours on end. I personally can't keep up with the number of channels they're in. Half the time, ignoring is fine. Though, in this case, they aren't really discouraged by that. [06:20:36] I think there are at least 9 or so channels I've let them go on in this whole night. 
[06:21:42] but yeah if twitter got de-linked logmsgbot might be less of a target for them [06:21:49] <_joe_> yep [06:22:17] <_joe_> I don't even have an idea of where logmsgbot runs exactly - I guess toollabs [06:22:35] neon [06:23:01] morebots is the one that actually logs [06:23:01] I am a logbot running on tools-exec-1408. [06:23:01] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [06:23:01] To log a message, type !log <msg>. [06:25:13] <_joe_> p858snake|L2: yeah, morebots heh [06:25:21] <_joe_> I always confuse the two [06:25:47] <_joe_> logmsgbot is tcpircbot, right [06:26:33] <_joe_> I'll just tell one of the maintainers to unlink twitter [06:29:12] 06Operations, 10ops-codfw, 10DBA: db2037: Disk in predictive failure - https://phabricator.wikimedia.org/T148373#2730656 (10Marostegui) And everything looks good now, thanks a lot @Papaul ``` => ctrl slot=0 physicaldrive all show Smart Array P420i in Slot 0 (Embedded) array A physicaldrive 1I:... [06:29:25] 06Operations, 10ops-codfw, 10DBA: db2037: Disk in predictive failure - https://phabricator.wikimedia.org/T148373#2730657 (10Marostegui) 05Open>03Resolved [06:32:09] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2730658 (10Joe) >>! In T147718#2729234, @BBlack wrote: >>>! In T147718#2727431, @BBlack wrote: >> What are good examples of data that naturally belongs... 
[06:35:04] 06Operations, 05Prometheus-metrics-monitoring, 15User-Joe: Port HHVM metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T147423#2730659 (10Joe) [06:36:32] 06Operations, 06Release-Engineering-Team, 07Beta-Cluster-reproducible, 07HHVM, 15User-Joe: Switch mwscript from Zend PHP5 to default php alternative (egHHVM) - https://phabricator.wikimedia.org/T146285#2730660 (10Joe) [07:17:16] !log J O I N #wikipedia-es [07:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:17:25] !log AlexZ vale mierda [07:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:19:39] Is gerrit broken again? [07:26:32] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2730697 (10Joe) [07:26:42] <_joe_> aharoni: not for me [07:27:48] !log rebooting snapshot1005-1007 for kernel update [07:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:31:19] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2730717 (10jcrespo) I suggest we should avoid providing wildcard grants. Those have created issues in the past, and we want to move away from them. When g... [07:32:19] !log rebooting snapshot1001 for kernel update [07:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:39:58] (03CR) 10Giuseppe Lavagetto: "@yuvipanda I am merging this to start experimenting in production with docker/calico/kubernetes. 
But I will amend this code if we decide t" [puppet] - 10https://gerrit.wikimedia.org/r/315717 (https://phabricator.wikimedia.org/T147181) (owner: 10Giuseppe Lavagetto) [07:40:06] (03PS6) 10Giuseppe Lavagetto: kubernetes: introduce 1st-stage worker role [puppet] - 10https://gerrit.wikimedia.org/r/315717 (https://phabricator.wikimedia.org/T147181) [07:45:00] (03CR) 10Alexandros Kosiaris: [C: 031] maps::server: move base::firewall to role [puppet] - 10https://gerrit.wikimedia.org/r/315889 (owner: 10Dzahn) [07:47:59] (03CR) 10Giuseppe Lavagetto: [C: 032] kubernetes: introduce 1st-stage worker role [puppet] - 10https://gerrit.wikimedia.org/r/315717 (https://phabricator.wikimedia.org/T147181) (owner: 10Giuseppe Lavagetto) [07:50:28] (03CR) 10Giuseppe Lavagetto: [C: 032] kubernetes: install kubernetes1001-4 as worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/315923 (https://phabricator.wikimedia.org/T147933) (owner: 10Giuseppe Lavagetto) [07:50:37] (03PS3) 10Giuseppe Lavagetto: kubernetes: install kubernetes1001-4 as worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/315923 (https://phabricator.wikimedia.org/T147933) [07:51:05] !log start of elasticsearch codfw rolling restart [07:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:52:31] !log rebooting bast3001 for kernel update [07:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:56:01] 06Operations, 10Adminbot: [IDEA] Backup bot for morebots - https://phabricator.wikimedia.org/T148694#2730740 (10Peachey88) That would cause double logging. Just ask one of the ops to restart the bot, on the rare instances this happens. 
(Also, @Stashbot captures and logs them all as well) [08:09:32] !log change-prop deploying 3a11886 [08:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:10:46] !log reboot wtp20{03,05,08,09,12,15,17,18,20} for kernel upgrade [08:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:15:25] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: wtp2019.codfw.wmnet (tags: ['dc=codfw', 'cluster=parsoid', 'service=parsoid']) [08:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:16:43] 06Operations, 10ops-codfw, 06DC-Ops: wtp2019 issues an uncorrectable memory error - https://phabricator.wikimedia.org/T148710#2730746 (10akosiaris) [08:17:36] ACKNOWLEDGEMENT - Check size of conntrack table on wtp2019 is CRITICAL: Timeout while attempting connection alexandros kosiaris T148710 [08:17:36] ACKNOWLEDGEMENT - DPKG on wtp2019 is CRITICAL: Timeout while attempting connection alexandros kosiaris T148710 [08:17:37] ACKNOWLEDGEMENT - Disk space on wtp2019 is CRITICAL: Timeout while attempting connection alexandros kosiaris T148710 [08:17:37] ACKNOWLEDGEMENT - MD RAID on wtp2019 is CRITICAL: Timeout while attempting connection alexandros kosiaris T148710 [08:17:37] ACKNOWLEDGEMENT - NTP on wtp2019 is CRITICAL: NTP CRITICAL: No response from NTP server alexandros kosiaris T148710 [08:17:37] ACKNOWLEDGEMENT - SSH on wtp2019 is CRITICAL: Connection timed out alexandros kosiaris T148710 [08:17:37] ACKNOWLEDGEMENT - configured eth on wtp2019 is CRITICAL: Timeout while attempting connection alexandros kosiaris T148710 [08:17:38] ACKNOWLEDGEMENT - dhclient process on wtp2019 is CRITICAL: Timeout while attempting connection alexandros kosiaris T148710 [08:17:38] ACKNOWLEDGEMENT - parsoid on wtp2019 is CRITICAL: Connection timed out alexandros kosiaris T148710 [08:17:39] ACKNOWLEDGEMENT - puppet last run on wtp2019 is CRITICAL: Timeout while 
attempting connection alexandros kosiaris T148710 [08:17:39] ACKNOWLEDGEMENT - salt-minion processes on wtp2019 is CRITICAL: Timeout while attempting connection alexandros kosiaris T148710 [08:18:59] PROBLEM - puppet last run on elastic1027 is CRITICAL: Connection refused by host [08:19:25] !log reboot the rest of the wtp20XX hosts for kernel upgrade [08:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:24:34] 06Operations, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2730765 (10Shoichi) I also paste [[ http://ids-testing.wmflabs.org/wiki/%E6%B2%99%E7%AE%B1 | the testing wiki ]] as an reference. Most funny , there is also some special em... [08:25:14] !log rebooting wtp10{10,14,15,16,20,21} for kernel upgrade [08:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:27:09] PROBLEM - Host wtp1015 is DOWN: PING CRITICAL - Packet loss = 100% [08:27:38] PROBLEM - Host wtp1011 is DOWN: PING CRITICAL - Packet loss = 100% [08:27:39] PROBLEM - Host wtp1014 is DOWN: PING CRITICAL - Packet loss = 100% [08:27:39] PROBLEM - Host wtp1021 is DOWN: PING CRITICAL - Packet loss = 100% [08:28:37] RECOVERY - Host wtp1021 is UP: PING OK - Packet loss = 0%, RTA = 1.73 ms [08:28:38] RECOVERY - Host wtp1011 is UP: PING OK - Packet loss = 0%, RTA = 2.59 ms [08:28:47] RECOVERY - Host wtp1015 is UP: PING OK - Packet loss = 0%, RTA = 1.28 ms [08:28:52] RECOVERY - Host wtp1014 is UP: PING OK - Packet loss = 0%, RTA = 2.06 ms [08:30:07] (03PS1) 10Giuseppe Lavagetto: Fix lan assignment for kubernetes1003 [dns] - 10https://gerrit.wikimedia.org/r/316936 [08:31:38] !log rebooting aqs100[123] for kernel upgrades (one at the time, de-pool/reboot/pool) [08:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:32:29] !log rebooting aqs100[456] for kernel upgrades (one at the time, de-pool/reboot/pool) 
[08:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:34:57] !log rebooting wtp10{07,08,09,10,19,24} for kernel upgrade [08:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:35:39] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix lan assignment for kubernetes1003 [dns] - 10https://gerrit.wikimedia.org/r/316936 (owner: 10Giuseppe Lavagetto) [08:39:25] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/316546 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [08:43:05] !log rebooting wtp10{01,03,04,05,18,23} for kernel upgrade [08:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:46:12] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [08:49:22] !log rebooting restbase-test* for kernel upgrade [08:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:53:20] !log rebooting bast4001 for kernel update [08:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:54:30] !log rebooting eventlog1001 for kernel upgrades (Eventlogging host) [08:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:57:17] !log rebooting eventlog2001 for kernel upgrades (EL spare host) [08:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:57:45] !log rebooting wtp10{02,06,12,13,17,22} for kernel upgrade [08:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:00:09] PROBLEM - Host eventlog2001 is DOWN: PING CRITICAL - Packet loss = 100% [09:00:46] me --^ [09:02:19] PROBLEM - Disk space on graphite1002 is CRITICAL: DISK CRITICAL - free space: /boot 1 MB (2% inode=98%) [09:08:15] moritzm: ^^^ 88M /boot partition :( [09:08:33] ack, fixing that by removing 
old kernels [09:08:49] those really old trusty boxes have an old partition scheme [09:09:58] RECOVERY - Host eventlog2001 is UP: PING OK - Packet loss = 0%, RTA = 37.47 ms [09:10:07] it's even a jessie, but fixed now [09:10:18] RECOVERY - Disk space on graphite1002 is OK: DISK OK [09:16:46] !log restarts of mw2075,6,7 done, starting rolling restarts shortly of 8,9, 2120-2147 [09:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:21:08] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 4 others: publish lag and response time for wdqs codfw to graphite - https://phabricator.wikimedia.org/T146207#2730826 (10Addshore) [09:29:10] (03PS3) 10Jcrespo: mariadb: move db1053 from s1 to s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315975 [09:29:54] jynus: did you start the alter table by the way? [09:30:54] I was going to do that, wait for the log [09:30:56] (03PS1) 10Ema: tlsproxy: proper indentation of nginx_ssl_conf params [puppet] - 10https://gerrit.wikimedia.org/r/316947 [09:31:04] jynus: :) [09:31:44] needs moar icinga downtimes [09:32:05] !log rolling restart of graphite machines for kernel upgrade [09:32:10] should I compress at the same time? [09:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:32:26] it should take the same time [09:35:42] jynus: Go for it! [09:36:10] (03CR) 10Volans: [C: 031] "Result on https://puppet-compiler.wmflabs.org/4447/cp1008.wikimedia.org/ looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/316947 (owner: 10Ema) [09:37:33] (03CR) 10Ema: [C: 032] tlsproxy: proper indentation of nginx_ssl_conf params [puppet] - 10https://gerrit.wikimedia.org/r/316947 (owner: 10Ema) [09:40:49] (03CR) 10Volans: [C: 04-1] "On precise hosts it fails with:" [puppet] - 10https://gerrit.wikimedia.org/r/310383 (https://phabricator.wikimedia.org/T125205) (owner: 10Dzahn) [09:43:26] maybe we should change innodb_file_format to Barracuda? 
[09:43:54] what do you have in mind? [09:44:03] I would bet that will fix some of the issues you have been experiencing [09:44:15] well, that is needed for compression anyway [09:45:09] is there any reason why we were using Antelope? [09:45:18] or simply: just because [09:45:53] it is what was being used before [09:46:46] Let's use db1053 then to test it too [09:46:56] Although as you said, if we are going to compress anyways [09:47:20] but innodb_large_prefix is certainly an option in the future [09:47:44] and we get better blob handling (of which those tables have a lot) [09:48:14] it is also the default in 5.7 [09:50:36] !log stop sql thread replication for db1053 and applying partitioning as a "special slave" [09:50:37] (03CR) 10Filippo Giunchedi: "> so, this unbreak specifically the cassandra ssl checks on jessie." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316906 (owner: 10Alexandros Kosiaris) [09:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:52:00] Stage: 1 of 2 'copy to tmp table' 0.064% of stage done [09:52:10] \o/ [09:52:38] 2.1T available [09:52:45] I am going to do logging in parallel [09:53:55] Wow, 135G logging table [09:54:00] I bet it is going to take 2-3 days too :) [09:56:18] you can see the downtime alters giving warnings already [09:56:25] hopefully it will be fast [09:56:51] PROBLEM - MegaRAID on dataset1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) [09:56:57] ACKNOWLEDGEMENT - MegaRAID on dataset1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T148715 [09:57:00] 06Operations, 10ops-eqiad: Degraded RAID on dataset1001 - https://phabricator.wikimedia.org/T148715#2730900 (10ops-monitoring-bot) [09:57:33] volans ^ that is great :) [09:57:46] 06Operations, 10ops-eqiad, 10Dumps-Generation: Degraded RAID on dataset1001 - https://phabricator.wikimedia.org/T148715#2730904 (10jcrespo) 
[09:58:01] <_joe_> it is, great work volans [09:58:09] :-D [09:58:42] <_joe_> now can you make a script that detects issues on HHVM, opens a ticket, debugs the memleak and submits the patch upstream, after having built the package and deployed it? [09:58:48] so much excitement for broken disks and degraded RAIDs :-P [09:59:11] _joe_: sure, by EOD it's ok? XD [09:59:24] <_joe_> volans: end of quarter would be fine, thanks [09:59:29] <_joe_> I'm not an impatient man [09:59:32] lol [10:00:08] (03CR) 10Alexandros Kosiaris: "Hm, it does sound a bit weird to enforce cluster separation at the X509 level instead of say, the networking level." [puppet] - 10https://gerrit.wikimedia.org/r/316906 (owner: 10Alexandros Kosiaris) [10:01:39] (03CR) 10Giuseppe Lavagetto: "@Akosiaris that's pretty common for cluster that can do TLS client cert and it's considered a better protection than network segregation. " [puppet] - 10https://gerrit.wikimedia.org/r/316906 (owner: 10Alexandros Kosiaris) [10:05:13] !log rebooting the Analytics Hadoop cluster for kernel upgrades [10:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:09:50] (03CR) 10Alexandros Kosiaris: "I was thinking more of firewall rules than VLans, but I suppose VLans would work as well" [puppet] - 10https://gerrit.wikimedia.org/r/316906 (owner: 10Alexandros Kosiaris) [10:11:52] PROBLEM - NTP on wtp2001 is CRITICAL: NTP CRITICAL: Offset unknown [10:13:24] looking into wtp2001, rarely happens after reboots [10:16:33] akosiaris: would it be complex to do with wmf ca to issue certs for cassandra machines? [10:19:35] moritzm: discarding peer 0: stratum=0 [10:19:36] godog: no I don't think so. 
look at puppetmaster1001:/srv/private/modules/secret/secrets/ssl/wmf_ca_2014_2017 for how it is structured [10:19:51] I think it should be quite easy [10:20:13] !log removing a few older kernels on analytics1036, was short of disk space in /boot partition [10:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:21:29] volans: on which host is that? [10:21:40] running the check from neon against wtp2001 [10:21:42] !log while the first batch of codfw api servers trundle along, starting rolling reboots for appservers in codfw starting with mw2090-2098, 2100-2119 [10:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:21:48] $ /usr/lib/nagios/plugins/check_ntp_time -H wtp2002 -w 1 -c 2 -v [10:21:51] moritzm: ^^ [10:23:10] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1239 is CRITICAL: Host mw1239 is not in mediawiki-installation dsh group Muehlenhoff depooled for hardware maintenance, T148421 [10:24:18] moritzm: looks ok now, did you change anything? [10:24:53] not on 2002, no [10:25:16] akosiaris: indeed doesn't look too hard, that plus figuring out the magic to shove the result keypair into a java keystore [10:25:16] according to http://serverfault.com/questions/269701/nagios-ntp-discarding-peer this might rather be a bug in the icinga check [10:25:35] I was looking at http://serverfault.com/questions/625027/nagios-check-ntp-time-offset-unknown :D [10:28:15] moritzm: sorry still failing I was checking wtp2002 for comparison [10:28:42] (03PS4) 10Filippo Giunchedi: prometheus: expand domain search list [puppet] - 10https://gerrit.wikimedia.org/r/316546 (https://phabricator.wikimedia.org/T140646) [10:29:23] hmmm when I saw NTP and then most hosts recovering I assumed it's the usually NTP taking a while to recover after a reboot [10:29:39] akosiaris: this is up since 2h [10:29:55] yeah I know, I 've rebooted all these hosts [10:30:06] and it's only present on wtp1001 now ... 
[10:30:06] akosiaris: these usually sort themselves out after up to 10 mins, this one is different [10:30:20] it rarely happens after reboots [10:30:23] looking as well [10:30:59] should we just restart ntpd ? [10:31:15] ah already done on 10:14 ? [10:31:23] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: expand domain search list [puppet] - 10https://gerrit.wikimedia.org/r/316546 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [10:31:56] I restarted it at 10:14 yes, we'll see whether that fixes it the next time the Icinga check runs [10:32:19] moritzm: I can tell you the check from icinga still fails [10:32:35] volans: hmm, ok [10:33:21] /usr/lib/nagios/plugins/check_ntp_time -H wtp2001.codfw.wmnet -w 1 -c 2 [10:33:21] NTP OK: Offset -0.0001112222672 secs|offset=-0.000111s;1.000000;2.000000; [10:33:24] seems fine [10:33:52] I think though it just recovered [10:33:58] akosiaris: I ran it at 10:33:20 UTC and was still failing [10:33:59] :D [10:34:09] now it's ok... bleh [10:34:12] lol [10:34:19] I wonder what happened [10:34:31] maybe ntpd started before network was online ? [10:34:45] RECOVERY - NTP on wtp2001 is OK: NTP OK: Offset -0.001834630966 secs [10:34:46] a ntpq -c peers would have given us something [10:34:52] or at least I think so [10:35:01] I've done ntpq -p [10:35:08] if you want it [10:35:11] same thing.. what did it say ? 
[10:35:25] from syslog (re-starting before network) [10:35:26] 08:21:13 wtp2001 ntpd[1068]: Listen normally on 7 eth0 [10:35:33] Oct 20 08:21:13 wtp2001 ntpd[1068]: peers refreshed [10:35:41] ntpq -p looked benign [10:36:28] akosiaris: what moritzm says, the first time I ran it, it had only achernar marked * and the other 3 without +/-, but after a minute or two it looked like all the others [10:36:41] but maybe I don't know exactly what to look for ;) [10:36:52] akosiaris: https://phabricator.wikimedia.org/P4269 [10:37:40] it might be a subtle race condition with the network not fully up due to a slow NIC or so [10:38:16] heh, so chromium out of tolerance, then 2 considered fine and achernar being the reference one [10:38:20] !log rolling restarts on first batch of api servers in eqiad: mw1189-1208 [10:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:38:29] and now chromium is actually the reference one [10:38:38] oh wait, PEBKAC, that's neon [10:38:49] yeah output is the same [10:39:12] akosiaris: refresh the paste, I've added mine [10:39:38] akosiaris: only the first ntpq was different, every later retry looked like the one moritzm pasted [10:39:50] progress on a systemd unit seems stalled unfortunately: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=635752 [10:40:16] it's an incredible mess, their primary VCS is BitKeeper(!) [10:41:05] (ntpd upstream I mean) [10:41:11] yeah [10:42:44] 06Operations, 10ops-eqiad, 10Dumps-Generation: Degraded RAID on dataset1001 - https://phabricator.wikimedia.org/T148715#2730961 (10Volans) @Cmjohnson I'm not sure the `Other Error Count: =====> 1 <=====` should really be considered a "failing" component. Let me know if you think this is too verbose and I... [10:42:45] (03CR) 10BBlack: check_ssl: Unbreak by not verifying server certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316906 (owner: 10Alexandros Kosiaris) [10:43:55] bitkeeper ? 
oh my [10:46:20] hmm it did have 4 succesful polls for every host [10:47:48] ok I admit defeat.. I have no idea why it would not report correctly in icinga [10:48:43] it does say refid=163.87.109.84 which is none of our ntp servers ofc [10:48:53] and an offset=0.00000 which might be that [10:49:20] but that does not correlate with * in front of achernar meaning it had picked it as a good and reliable source of time [10:50:10] akosiaris: I'm a little confused by https://gerrit.wikimedia.org/r/#/c/316906/1/modules/nagios_common/files/check_commands/check_ssl [10:50:33] what do you mean by "the validity of the server certificate"? [10:50:41] !log rebooting kafka200[12] for kernel upgrades (Kafka main-codfw non live cluster) [10:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:50:59] 06Operations, 10ops-eqiad, 10Prod-Kubernetes, 05Kubernetes-production-experiment, and 2 others: Rack/Setup Kubernetes Servers - https://phabricator.wikimedia.org/T147933#2730968 (10Joe) `kubernetes1001` can't reach the PXE/DHCP server with the following error: ``` PXE-E61: Media test failure, check cable... [10:51:55] paravoid: verification of the certificate (in any kind of way) [10:52:13] I'm pretty sure check_ssl actually verifies certificates [10:52:16] IO::Socket::SSL before 1.950 would happily accept any kind of certificate [10:52:25] on jessie perhaps [10:52:28] not on precise [10:52:29] no [10:52:46] strace it if you want on neon.. it will never check /etc/ssl/certs [10:53:01] why would it? [10:53:01] whereas it will on a trusty system [10:53:25] PROBLEM - NTP on bast4001 is CRITICAL: NTP CRITICAL: Offset unknown [10:53:33] I mean /etc/ssl/certs/.crt [10:53:35] so, verify the chain of the cert you mean? 
[10:53:47] yes [10:54:10] yeah, that was never the intention [10:54:22] that's not especially useful for the original use case [10:54:34] although given the default on precise's IO::Socket:SSL I am not sure it check's even the Subject [10:54:37] lemme check [10:54:40] it does, explicitly [10:54:54] see sub ssl_verify [10:55:00] ah [10:55:05] yes not on connection [10:55:07] later on [10:55:08] yes [10:55:18] also verifies expiry, explicitly [10:55:38] and you can pass a --issuer argument (with a subject match) if you want to be sure it's issued by a particular vendor [10:56:19] ok, then I may have misunderstood ssl_connect purpose [10:56:27] it is to always connect, no matter what ? [10:56:45] so no kind of verification at all in that function ? [10:57:08] yeah [10:57:20] although verify_hostname should verify the authority as well I suppose [10:57:42] we do pass the --rootcert [10:57:46] ok, mind commenting about that ? I 'll amend the patchset to display that in a comment and add SSL_VERIFY_NONE as a default then [10:58:17] akosiaris: same problem on bast4001 after reboot, BTW. outout of ntpq -p currently: https://phabricator.wikimedia.org/P4270 [10:58:29] !log rolling reboots for first batch of app servers in eqiad: mw1170-1188 [10:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:59:20] ok that one makes sense [10:59:31] no peer has yet been chosen to sync with [11:00:27] but nothing strange related to ntpd in syslog [11:01:09] (03PS1) 10Giuseppe Lavagetto: partman: fix recipe for docker [puppet] - 10https://gerrit.wikimedia.org/r/316953 [11:01:11] .XFAC. [11:01:16] .XFAC. – association changed (IP address changed or lost); [11:01:18] wat ? [11:01:46] http://lists.ntp.org/pipermail/questions/2010-May/026734.html [11:02:03] akosiaris: hrm, so why not default to SSL_VERIFY_PEER and pass --rootcert (or --noverify) for cassandra hosts? 
[11:02:16] although I suppose this may break with --nosni [11:03:38] so, if you have written functions that basically do most of the work that IO::Socket:SSL now does anyway, defaulting to SSL_VERIFY_PEER would mean that in any of the cases that ssl_connect() breaks on, your functions would no longer be called [11:04:01] I assume these functions are in some way better (more verbose output?) [11:04:21] that just getting openssl's output (which is often cryptic...) [11:04:34] than* [11:05:02] --rootcert on cassandra hosts does require we ship the rootCAs for those btw. Which we currently dont [11:05:12] PROBLEM - check_mysql on frdb1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1215 [11:06:42] for --noverify we have to also update ssl_connect to basically do the SSL_VERIFY_NONE dance [11:07:02] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw on kafka2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw/producer\.properties [11:07:11] otherwise the script exists prematurely at $client = IO::Socket::SSL->new(%sopts) [11:07:16] !log change-prop restarting in codfw after kafka kernel upgrade [11:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:07:22] PROBLEM - NTP on mw2128 is CRITICAL: NTP CRITICAL: Offset unknown [11:07:28] elukey: kafka mirror problem ^^^ [11:07:40] it's not running [11:09:07] ah this is weird [11:10:13] RECOVERY - check_mysql on frdb1001 is OK: Uptime: 1373457 Threads: 1 Questions: 128184221 Slow queries: 10653 Opens: 11153 Flush tables: 1 Open tables: 602 Queries per second avg: 93.329 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [11:11:08] I started it [11:11:10] When started for the first time and a frequency file is not present, the daemon enters a special mode in order to calibrate the frequency. 
This takes 900s during which the time is not disciplined. When calibration is complete, the daemon creates the frequency file and enters normal mode to amortize whatever residual offset remains. [11:11:24] (03PS1) 10Mobrovac: RESTBase: Use the LVS Realserver role [puppet] - 10https://gerrit.wikimedia.org/r/316954 [11:11:34] in case anyone was wondering why we get these ^ after a reboot [11:11:43] but still does not explain the XFAC [11:11:45] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2730980 (10BBlack) I don't disagree with the above, but I do still think we run the risk of namespace confusion and clutter issues on the latter two poi... [11:12:22] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw on kafka2001 is OK: PROCS OK: 1 process with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_main-codfw/producer\.properties [11:12:28] Typically, that XFAC refid is replaced on receipt of the next packet from the server or peer. [11:12:34] and yet... this has not happend [11:13:27] it's definitely related to ntp startup after a reboot, I recently upgraded ntp on all jessie for the security updates and that didn't happen for a plain ntp restart [11:13:49] mobrovac: from the logs it seems that there were produce errors and mirrormaker shut down [11:13:59] so I see no traffic up to now between bast4001 and any of the 4 peers [11:14:11] elukey: as in it could not send messages to eqiad? [11:14:28] so that might explain why it does not get out of the XFAC state [11:15:13] mobrovac: might be related to change prop restart? Error when sending message to topic eqiad.change-prop.transcludes.resource-change [11:15:18] ok, it's in stratum=16 and refid=INIT [11:15:34] this one has failed to start initialization.. 
something changed during the boot process [11:15:48] mobrovac: anyhow, will talk with Andrew about this issue [11:15:53] I've seen it in the past [11:15:58] I wonder if it can be kicked somehow without restarting it [11:15:59] hm ok [11:17:13] (03CR) 10Mobrovac: "Here's the rather cryptic PCC - https://puppet-compiler.wmflabs.org/4449/ . It seems rb1008 fails to compile entirely, regardless of this " [puppet] - 10https://gerrit.wikimedia.org/r/316954 (owner: 10Mobrovac) [11:17:35] !log rebooting all the Analytics Hadoop nodes for kernel upgrades [11:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:17:51] mobrovac: yes Error: Failed to compile catalog for node restbase1008.eqiad.wmnet: Attempt to assign to a reserved variable name: 'trusted' on node [11:18:04] we need to update the facts on the compiler, removing that trusted one [11:18:17] ah ok, so it's PCC not me [11:18:17] I'll upload a patch [11:18:26] always glad to hear that [11:18:27] :D [11:18:36] thnx AK [11:18:42] akosiaris thnx [11:18:46] yw [11:22:48] this also happens on a few other servers being rebooted, I'll bounce ntp on those, we can keep bast4001 in that state a bit for debugging [11:22:50] (03PS1) 10Filippo Giunchedi: site: add ipv6 for prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/316955 [11:23:52] !log rolling restarts of more api servers in codfw: mw2200 - 2220 [11:23:54] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=683061 might be related [11:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:24:42] moritzm: may I pm a sec please ? 
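Editor's sketch for the NTP thread above: the discussion turns on whether `ntpq -p` shows a peer selected for sync (the `*` marker in front of achernar) or no peer chosen yet (stratum 16, refid INIT). A minimal Python sketch of that check follows. The sample output is invented for illustration and is not taken from the pastes P4269 or P4270, the refid values are placeholders, and `selected_peer` is a hypothetical helper name.

```python
from typing import Optional

# Hypothetical `ntpq -p` style output, invented for illustration only.
SAMPLE_SYNCED = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*achernar        .GPS.            1 u   33   64  377    0.412   -0.111   0.052
+chromium        .GPS.            1 u   12   64  377   36.100    0.204   0.081
"""

# Hypothetical output for a host stuck before any peer has been selected.
SAMPLE_UNSYNCED = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 achernar        .INIT.          16 u    -   64    0    0.000    0.000   0.000
 chromium        .INIT.          16 u    -   64    0    0.000    0.000   0.000
"""


def selected_peer(ntpq_output: str) -> Optional[str]:
    """Return the name of the peer marked with the '*' tally code, or None."""
    for line in ntpq_output.splitlines()[2:]:  # skip the two header lines
        if line.startswith("*"):
            return line[1:].split()[0]
    return None


print(selected_peer(SAMPLE_SYNCED))
print(selected_peer(SAMPLE_UNSYNCED))
```

A check like this only reports whether a sync peer is currently marked; it would not explain why a host stays in the XFAC or INIT state.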
[11:24:50] matanya: sure [11:25:07] (03CR) 10Filippo Giunchedi: "More specifically, the issue I ran into is that prometheus hosts have no AAAA and therefore are not added by ferm to ip6tables" [puppet] - 10https://gerrit.wikimedia.org/r/316955 (owner: 10Filippo Giunchedi) [11:26:04] PROBLEM - Host analytics1030 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:04] PROBLEM - Host analytics1031 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:04] PROBLEM - Host analytics1032 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:05] RECOVERY - NTP on mw2128 is OK: NTP OK: Offset 0.0003720521927 secs [11:26:10] 06Operations, 06Labs: Kill the labtest $realm - https://phabricator.wikimedia.org/T148717#2730983 (10faidon) [11:26:15] akosiaris: ^^^ [11:26:19] akosiaris: special gift for you [11:27:23] RECOVERY - Host analytics1031 is UP: PING OK - Packet loss = 0%, RTA = 1.72 ms [11:27:24] RECOVERY - Host analytics1030 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [11:27:28] hehehe [11:27:48] RECOVERY - Host analytics1032 is UP: PING OK - Packet loss = 0%, RTA = 3.02 ms [11:27:56] akosiaris: look at the last paragraph [11:28:12] yeah read it already [11:30:00] oooops sorry [11:30:05] forgot to silence icinga [11:30:40] (03PS3) 10BBlack: eqiad recdns IP fix: switch in puppet [puppet] - 10https://gerrit.wikimedia.org/r/315930 (https://phabricator.wikimedia.org/T143915) [11:32:49] (03CR) 10BBlack: [C: 032] eqiad recdns IP fix: switch in puppet [puppet] - 10https://gerrit.wikimedia.org/r/315930 (https://phabricator.wikimedia.org/T143915) (owner: 10BBlack) [11:36:12] (03PS2) 10BBlack: eqiad recdns IP fix: add new to DNS [dns] - 10https://gerrit.wikimedia.org/r/315927 (https://phabricator.wikimedia.org/T143915) [11:36:33] (03CR) 10BBlack: [C: 032] eqiad recdns IP fix: add new to DNS [dns] - 10https://gerrit.wikimedia.org/r/315927 (https://phabricator.wikimedia.org/T143915) (owner: 10BBlack) [11:37:28] 06Operations, 06Commons, 10Traffic, 10media-storage, 07Regression: Some JPGs 
are being served as text - https://phabricator.wikimedia.org/T148497#2731008 (10Aklapper) p:05High>03Low Lowering priority as this cannot be reproduced anymore. [11:37:47] <_joe_> bblack: uhm interesting, how do you add the IP around? a whole block of config again? [11:38:03] huh? [11:38:46] _joe_: I don't understand any of your question :) [11:39:10] (03PS1) 10Faidon Liambotis: mail: add an empty statement for 4.87+ compatibility [puppet] - 10https://gerrit.wikimedia.org/r/316956 [11:39:28] elukey: ^ partly responsible for the cp1047 cronspam [11:39:29] <_joe_> bblack: I guess you're adding a second LVS IP [11:39:39] <_joe_> right? [11:40:17] _joe_: yes. the IP is changing from .239 to .254. Changes merged yesterday turned on .254 as a functional service IP in parallel with .239 for the transition period. [11:40:31] <_joe_> bblack: ok I didn't see those changes [11:40:38] <_joe_> :) [11:40:57] paravoid: lol [11:41:14] and the transition period will probably end after some lengthy annoying process of tracking down every outdated recdns client via tcpdump :P [11:41:32] <_joe_> sigh [11:41:33] <_joe_> :P [11:42:10] but I'll start with waiting for that to push out and then maybe salting a grep for the old IP in /etc/ in case there's stuff left by the installer in some cases and not puppet-managed, or something [11:42:28] then we get to find out which things take addrs from /etc/resolv.conf but don't watch the file for changes :P [11:42:35] then there's all the hardware/network devices... [11:43:11] (or take addrs from @nameservers in puppet, but don't reconfigure their runtime selves on change) [11:48:08] !log depool cp1047 (cache_maps eqiad) [11:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:48:27] re: adding ipv6 to hosts, ip6_mapped plus AAAA records is all that's needed correct? 
[11:48:43] !log bounced ntp on mw2101 and mw2147 (XFAC state) [11:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:49:56] 06Operations, 10Cassandra, 06Services (blocked): SSL handshake errors - https://phabricator.wikimedia.org/T148654#2729072 (10fgiunchedi) Indeed that's tegmen's address for certs expiration `check_ssl` [11:51:42] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:53:28] !log rebooting an1027 (camus job launcher) for kernel upgrades [11:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:59:21] 06Operations, 10ops-esams, 10Traffic: cp3009 hw issues - https://phabricator.wikimedia.org/T148722#2731071 (10BBlack) [12:00:02] 06Operations, 10Traffic: reimage cp1047 - https://phabricator.wikimedia.org/T148723#2731084 (10BBlack) [12:03:02] 06Operations, 10ops-esams: cp3009: memory scrubbing error - https://phabricator.wikimedia.org/T148422#2731103 (10BBlack) [12:03:04] 06Operations, 10ops-esams, 10Traffic: cp3009 hw issues - https://phabricator.wikimedia.org/T148722#2731105 (10BBlack) [12:03:26] 06Operations, 10ops-esams, 10Traffic: cp3009: memory scrubbing error - https://phabricator.wikimedia.org/T148422#2722360 (10BBlack) It's depooled from service as of yesterday as well (didn't see this ticket!). 
[12:04:52] !log rebooting baham / ns2.wikimedia.org for kernel [12:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:05:08] bleh [12:05:24] the log stuff needs a !logsed s/ns2/n1/ [12:05:30] I typo log entries all the time :P [12:05:46] !log correction: rebooting baham / ns1.wikimedia.org for kernel [12:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:08:21] PROBLEM - Host baham is DOWN: PING CRITICAL - Packet loss = 100% [12:08:25] !log bounced ntp on mw2206 (XFAC state) [12:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:09:13] PROBLEM - Host ns1-v4 is DOWN: PING CRITICAL - Packet loss = 100% [12:10:36] !log more api server rolling restarts for eqiad: mw1209-1216, 128-1220, 1236-38, 1240-1258 [12:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:11:13] !log retaction. those are app servers, not starting them yet [12:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:12:27] !log restarting bast2001 for kernel update [12:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:13:22] RECOVERY - Host baham is UP: PING OK - Packet loss = 0%, RTA = 38.06 ms [12:14:25] RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 36.48 ms [12:16:22] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:21:21] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:25:11] ^ not the first time we've seen an outdated package database post-reboot [12:25:16] (which caused that puppetfail) [12:25:37] apt-get upgrade was reporting 69 outdated packages needing updates, and puppet was failing on various package checks of its own [12:25:56] after running apt-get update once, the state is back to normal (nothing needs upgrading, puppet is fine, just as everything was before the reboot) [12:26:05] is some important state getting wiped on reboot? [12:26:41] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:27:00] !log more APP server rolling restarts for eqiad: mw1209-1216, 128-1220, 1236-38, 1240-1258 [12:27:03] mmh, haven't seen that before [12:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:11] !log more API server rolling restarts for eqiad: mw1221-1235, 1276-1290 [12:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:48] ^ is that because of the API cluster issues a few days ago? [12:31:26] no, unrelated [12:31:30] Ok [12:31:43] So just normal maintenance? [12:31:55] !log more app server rolling restarts for codfw: mw2163-2199 [12:31:56] yep [12:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:06] 06Operations, 06Labs, 10Wikimedia-Video, 07Need-volunteer: Upload the Wikimania 2014 videos to Commons - https://phabricator.wikimedia.org/T106038#2731269 (10Matanya) 05Open>03Resolved This was done by volunteers by uploading lower quality to commons. 
[12:35:17] !log bounced ntp on baham (was stick in INIT phase) [12:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:39:16] !log restarting an1003 for kernel upgrades (oozie/hive master) [12:39:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:42:30] * Nemo_bis read "sick in init" [12:43:12] Lol [12:48:06] !log bounced ntp on mw2116 (XFAC state) [12:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:52:01] !log restarting mx2001 for kernel update [12:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:59:32] !log ferm on baham (failed to start due to failing DNS resolution in early boot) [12:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:59:42] !log force failover for Hadoop Master node (an1002) to its stanby (an1002) and rebooting an1001 for kernel upgrades [12:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161020T1300). [13:03:25] There is nothing to SWAT. So SWAT done. 
[13:03:40] gj [13:05:22] !log correction: force failover for Hadoop Master node (an1001) to its stanby (an1002) and rebooting an1001 for kernel upgrades [13:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:10:19] !log force failover from temporary Hadoop Master node (an1002) to its stanby (an1001) to restore the standard configuration [13:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:14] !log restarting ms1001 for kernel update [13:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:15:53] !log rolling reboot of prometheus machines for kernel update [13:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:22:15] !log restarting francium for kernel update [13:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:22:52] 06Operations, 10OCG-General, 13Patch-For-Review: Tons of OCG jobs caused a massive increase in queue length - https://phabricator.wikimedia.org/T147211#2731359 (10elukey) p:05High>03Normal [13:27:09] 06Operations, 13Patch-For-Review: setup / deploy nobelium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#2731361 (10Gehel) 05Open>03Resolved Resovled as this has been done for month (since then nobelium has actually been decommissioned and replaced by relforge) [13:29:01] 06Operations, 10MediaWiki-General-or-Unknown, 06Release-Engineering-Team, 10Traffic, and 5 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2471564 (10elukey) ping :) [13:29:26] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T148633#2731366 (10elukey) p:05Triage>03High [13:29:48] 06Operations, 10ops-eqiad, 10Dumps-Generation: Degraded RAID on dataset1001 - https://phabricator.wikimedia.org/T148715#2731367 (10elukey) p:05Triage>03High 
[13:30:16] 06Operations, 10Adminbot: [IDEA] Backup bot for morebots - https://phabricator.wikimedia.org/T148694#2731368 (10elukey) p:05Triage>03Low [13:30:32] 06Operations: Reboot dataset1001 - https://phabricator.wikimedia.org/T148737#2731369 (10Aklapper) [13:30:32] 06Operations, 06Release-Engineering-Team, 07HHVM, 07Wikimedia-Incident: 2016-10-17 API cluster overload - https://phabricator.wikimedia.org/T148652#2731370 (10elukey) p:05Triage>03High [13:31:09] 06Operations, 10ops-codfw, 06DC-Ops: wtp2019 issues an uncorrectable memory error - https://phabricator.wikimedia.org/T148710#2731371 (10elukey) p:05Triage>03Normal a:03Papaul [13:31:40] 06Operations, 10media-storage: refresh swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T148647#2731376 (10elukey) p:05Triage>03Normal [13:32:13] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, and 2 others: Re-image sca1001, sca1002, sca2001, sca2002, as scb1003, scb1004 and scb2003, scb2004 respectively - https://phabricator.wikimedia.org/T148380#2731383 (10elukey) p:05Triage>03Normal a:03Cmjohnson [13:32:39] 06Operations, 10Mail, 10OTRS: OTRS spam classification methods and systems - https://phabricator.wikimedia.org/T146968#2731386 (10elukey) p:05Triage>03Normal [13:33:01] 06Operations, 10ArticlePlaceholder, 10Traffic, 10Wikidata: Performance and caching considerations for article placeholders accesses - https://phabricator.wikimedia.org/T142944#2731387 (10elukey) p:05Triage>03Normal [13:33:45] 06Operations, 10ops-eqiad, 06DC-Ops, 13Patch-For-Review, and 2 others: Re-image sca1001, sca1002, sca2001, sca2002, as scb1003, scb1004 and scb2003, scb2004 respectively - https://phabricator.wikimedia.org/T148380#2731388 (10Cmjohnson) update racktables and physical labels Rename sca1001, sca1002 to scb10... [13:35:48] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2731390 (10elukey) >>! 
In T147926#2729547, @Dzahn wrote: > reading "have role spare" and then "return to spare pool", doesn't that mean it's already in the spare pool? Hey Daniel, I did... [13:36:45] (03PS1) 10DCausse: [cirrus] Increase the number of shards to 15 for commonswiki_file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316964 (https://phabricator.wikimedia.org/T148736) [13:37:31] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2708643 (10MoritzMuehlenhoff) >>! In T147926#2729547, @Dzahn wrote: > reading "have role spare" and then "return to spare pool", doesn't that mean it's already in the spare pool? No, it... [13:37:43] Hi, we noticed a strange error at cswiki. Image File:Czech Republic location map.svg from commons won't load when it is inserted as [[File:Czech Republic location map.svg|500px]]. See https://cs.wikipedia.org/w/index.php?title=Wikipedista:Martin_Urbanec/P%C3%ADskovi%C5%A1t%C4%9B/1&oldid=14217744 as example. [13:38:25] !log restarting mx1001 for kernel update [13:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:38:51] I noticed https://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Czech_Republic_location_map.svg/500px-Czech_Republic_location_map.svg.png doesn't give any result although https://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Czech_Republic_location_map.svg/403px-Czech_Republic_location_map.svg.png returns the correct picture. [13:39:14] Does anybody know what's wrong and if this is a server-side issue or not? [13:41:13] Interestingly, https://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Czech_Republic_location_map.svg/500px-Czech_Republic_location_map.svg.png fails only in the browser (Google Chrome, Windows 10); when I run wget https://upload.wikimedia.org/wikipedia/commons/thumb/4/4a/Czech_Republic_location_map.svg/500px-Czech_Republic_location_map.svg.png at toollabs, everything works correctly. 
[13:41:47] Google Chrome reports ERR_CONTENT_DECODING_FAILED as the error description. [13:43:07] I tried Edge and Firefox, neither works... [13:43:28] (03PS2) 10Gehel: kibana - allow access to both /status and /api/status [puppet] - 10https://gerrit.wikimedia.org/r/316771 (https://phabricator.wikimedia.org/T132458) [13:44:40] (03CR) 10Gehel: [C: 032] kibana - allow access to both /status and /api/status [puppet] - 10https://gerrit.wikimedia.org/r/316771 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [13:44:57] It doesn't work at enwiki either. [13:46:22] Any other size of the same image works correctly. [13:46:26] Strange issue... [13:47:43] that's the issue of some images being rendered as text [13:47:53] (03PS2) 10Gehel: kibana - change probe URL to /api/status [puppet] - 10https://gerrit.wikimedia.org/r/316772 (https://phabricator.wikimedia.org/T132458) [13:48:02] there is a task somewhere [13:48:16] arseny92: Was your message meant for me? [13:48:34] Yes [13:48:53] Okay. And is there any workaround? [13:49:11] This breaks some of our templates... [13:49:34] Urbanecm, I think I fixed it now [13:50:01] Thanks! [13:50:13] It seems to work! [13:50:13] Yes, https://cs.wikipedia.org/w/index.php?title=Wikipedista:Martin_Urbanec/P%C3%ADskovi%C5%A1t%C4%9B/1&oldid=14217744 displays for me. [13:50:20] I confirm it too. [13:50:29] thanks! [13:50:33] (03CR) 10Muehlenhoff: "The ferm rules used in maps::server query values from Hiera, we need to double-check whether these are used in labs, otherwise this might br" [puppet] - 10https://gerrit.wikimedia.org/r/315889 (owner: 10Dzahn) [13:50:45] (03CR) 10Gehel: [C: 032] kibana - change probe URL to /api/status [puppet] - 10https://gerrit.wikimedia.org/r/316772 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [13:50:49] jynus: Where was the problem? 
[13:51:11] for the 500px, it said it returned a png gziped [13:51:28] it didn't- it was just a plain png [13:51:50] I purged the cache and now it works as it should [13:52:02] Thanks for explanation! [13:52:54] 06Operations, 10Dumps-Generation: Reboot dataset1001 - https://phabricator.wikimedia.org/T148737#2731449 (10ArielGlenn) p:05Triage>03Normal [14:00:38] (03CR) 10Gehel: [C: 04-1] [cirrus] Increase the number of shards to 15 for commonswiki_file (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316964 (https://phabricator.wikimedia.org/T148736) (owner: 10DCausse) [14:04:07] (03CR) 10Jcrespo: [C: 032] mariadb: move db1053 from s1 to s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315975 (owner: 10Jcrespo) [14:07:46] (03PS2) 10DCausse: [cirrus] Increase the number of shards to 15 for commonswiki_file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316964 (https://phabricator.wikimedia.org/T148736) [14:07:48] (03PS1) 10Ema: varnishrls4: use VSL query and proper tags [puppet] - 10https://gerrit.wikimedia.org/r/316966 (https://phabricator.wikimedia.org/T131353) [14:08:18] (03CR) 10Gehel: [C: 031] [cirrus] Increase the number of shards to 15 for commonswiki_file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316964 (https://phabricator.wikimedia.org/T148736) (owner: 10DCausse) [14:09:00] !log jynus@mira Synchronized wmf-config/db-eqiad.php: mariadb: move db1053 from s1 to s4 (duration: 02m 06s) [14:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:09:52] mw2197.codfw.wmnet & mw2098.codfw.wmnet have to be pulled, they are almost certainly in the middle of a reboot [14:10:15] mw2098 had hardware problems, there's a dc ops ticket for it [14:10:22] oh [14:10:26] I'm pulling it from dsh [14:11:23] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: mw2098.codfw.wmnet [14:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master 
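Editor's note on the 500px thumbnail diagnosis above (the cached response claimed to be gzipped, but the body was a plain PNG): a minimal Python sketch of how such a mismatch could be detected from a response's headers and raw bytes. The function name and the header dictionary are illustrative assumptions, not part of any tooling mentioned in the channel; only the gzip and PNG magic numbers are standard facts.

```python
GZIP_MAGIC = b"\x1f\x8b"            # first two bytes of every gzip stream
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"    # fixed 8-byte PNG signature


def encoding_mismatch(headers, body):
    """Return a short description if the headers claim gzip but the raw
    body is not a gzip stream; return None otherwise.

    `headers` is a plain dict of response headers (hypothetical input),
    `body` is the undecoded response body as bytes.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    declared = lowered.get("content-encoding", "").strip().lower()
    if declared == "gzip" and not body.startswith(GZIP_MAGIC):
        kind = "plain PNG" if body.startswith(PNG_MAGIC) else "non-gzip data"
        return "header says gzip but body is " + kind
    return None
```

A browser that trusts the header tries to gunzip the body and fails, which is consistent with the ERR_CONTENT_DECODING_FAILED error reported earlier; a client that never asked for compressed content would not hit the bad cache object.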
[14:11:47] I have pulled mw2197 [14:17:17] (03PS1) 10Jcrespo: prometheus: Move db1053 from s1 to s4 on mysql group monitoring [puppet] - 10https://gerrit.wikimedia.org/r/316968 [14:17:34] (03PS2) 10Jcrespo: prometheus: Move db1053 from s1 to s4 on mysql group monitoring [puppet] - 10https://gerrit.wikimedia.org/r/316968 [14:17:47] !log rolling reboot of remaining app servers in codfw: mw2221-2245, and in eqiad: mw1261-1275 [14:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:01] 06Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1046 - https://phabricator.wikimedia.org/T148633#2731507 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson The disk was replaced yesterday. All systems go root@db1046:~# megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spun Up Firmw... [14:18:19] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus2002.codfw.wmnet [14:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:33] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=prometheus2001.codfw.wmnet [14:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:19:05] mw2197 is fine now, it finished its reboot and has been repooled [14:19:13] you just got unlucky with the timing [14:19:30] (03PS1) 10Elukey: Fix yarn.w.o's mod_proxy_html configuration [puppet] - 10https://gerrit.wikimedia.org/r/316969 (https://phabricator.wikimedia.org/T147927) [14:19:30] jynus: ^^ [14:20:03] yes, as I mentioned, I pulled it individually [14:20:08] after it failed [14:20:18] with no issues [14:20:22] !log starting rolling restart of analytics-eqiad kafka brokers to apply kernel update [14:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:20:40] oh that pull [14:20:51] (03CR) 10Elukey: [C: 032] Fix yarn.w.o's mod_proxy_html configuration [puppet] - 
10https://gerrit.wikimedia.org/r/316969 (https://phabricator.wikimedia.org/T147927) (owner: 10Elukey) [14:20:52] !log rebooting auth* servers [14:20:56] yes pull, not pool :-) [14:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:11] I read morit zm's "pulling it from dsh" and thought your comment right before that was that kind of pull! [14:21:17] *right after [14:21:18] no no [14:21:26] scap pull kind of pull :-) [14:21:29] english fail :-D [14:21:36] all clear now [14:21:48] I hate when english fails to do its job [14:22:16] we should all use ruby to communicate [14:22:22] much easier [14:22:29] oh I'm suuuuuure [14:24:10] !log bounce ntpd on bast4001 [14:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:27] (03PS3) 10Dereckson: Apply rate limit to edits for normal users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (https://phabricator.wikimedia.org/T56515) (owner: 10Jforrester) [14:25:56] (03CR) 10Dereckson: "PS3: Rebased, added reference to the relevant task" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (https://phabricator.wikimedia.org/T56515) (owner: 10Jforrester) [14:26:01] (03CR) 10jenkins-bot: [V: 04-1] Apply rate limit to edits for normal users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (https://phabricator.wikimedia.org/T56515) (owner: 10Jforrester) [14:26:23] (03CR) 10Dereckson: [C: 04-1] "The limit is too restrictive for small edits like adding categories." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (https://phabricator.wikimedia.org/T56515) (owner: 10Jforrester) [14:26:49] !log filippo@puppetmaster1001 conftool action : set/pooled=yes; selector: name=prometheus1002.eqiad.wmnet [14:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:02] !log filippo@puppetmaster1001 conftool action : set/pooled=no; selector: name=prometheus1001.eqiad.wmnet [14:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:39] (03CR) 10Jcrespo: [C: 032] prometheus: Move db1053 from s1 to s4 on mysql group monitoring [puppet] - 10https://gerrit.wikimedia.org/r/316968 (owner: 10Jcrespo) [14:27:46] (03PS3) 10Jcrespo: prometheus: Move db1053 from s1 to s4 on mysql group monitoring [puppet] - 10https://gerrit.wikimedia.org/r/316968 [14:27:47] dcausse: https://integration.wikimedia.org/ci/job/operations-mw-config-phpunit/9811/console — there are some unit tests failing for Cirrus in mediawiki-config repo [14:29:09] Dereckson: looking [14:29:23] RECOVERY - NTP on bast4001 is OK: NTP OK: Offset 0.04553484917 secs [14:29:38] (03PS2) 10Filippo Giunchedi: site: add ipv6 for prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/316955 [14:30:50] (03CR) 10Filippo Giunchedi: [C: 032] site: add ipv6 for prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/316955 (owner: 10Filippo Giunchedi) [14:31:32] (03PS3) 10Filippo Giunchedi: site: add ipv6 for prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/316955 [14:32:35] wow that's strange: Use of undefined constant 50 - assumed ' 50' in /srv/jenkins-workspace/workspace/operations-mw-config-phpunit/wmf-config/InitialiseSettings.php on line 7226 [14:33:17] jynus: ok to merge your change too? 
[14:33:41] yes [14:34:14] I was checking the dbstore errors [14:34:26] Dereckson: looks like a special white space before 50 introduced in this patch [14:34:53] !log rebooting rutherfordium for kernel update [14:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:40] (03CR) 10DCausse: Apply rate limit to edits for normal users (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (https://phabricator.wikimedia.org/T56515) (owner: 10Jforrester) [14:37:19] (03PS2) 10Madhuvishy: maps nfs: Symlink project and home to mount from labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/316826 [14:38:12] dcausse: ah thanks, so I guess cirrusTests are the only ones to parse IS [14:38:34] Dereckson: I suppose yes [14:38:42] (03PS1) 10Filippo Giunchedi: wmnet: add AAAA for prometheus [dns] - 10https://gerrit.wikimedia.org/r/316972 [14:39:06] (03CR) 10Madhuvishy: [C: 032] maps nfs: Symlink project and home to mount from labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/316826 (owner: 10Madhuvishy) [14:40:49] (03PS4) 10Dereckson: Apply rate limit to edits for normal users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (https://phabricator.wikimedia.org/T56515) (owner: 10Jforrester) [14:44:14] (03PS2) 10Ema: varnishrls4: use VSL query and proper tags [puppet] - 10https://gerrit.wikimedia.org/r/316966 (https://phabricator.wikimedia.org/T131353) [14:46:57] (03PS1) 10Madhuvishy: maps nfs: Match nfs mount options for maps [puppet] - 10https://gerrit.wikimedia.org/r/316974 [14:47:53] Krenair: I'm around now :) [14:49:28] 06Operations, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2731568 (10Aklapper) @awight: Is this a [[ https://www.mediawiki.org/wiki/Review_queue#Checklist | production deployment tracking task ]]? If so please set https://phabric... 
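Editor's note on the "Use of undefined constant 50 - assumed ' 50'" failure above, which dcausse traced to a special white space before `50`: a minimal Python sketch of a pre-commit style scan for such characters. It assumes the culprit is a Unicode space separator other than the ASCII space (a no-break space, for example); the log does not say which character it actually was, and the function name is hypothetical.

```python
import unicodedata


def find_odd_whitespace(source):
    """Return (line_no, column, character_name) for every Unicode space
    separator (category Zs) in `source` that is not the ASCII space.

    Line and column numbers are 1-based. Zero-width characters belong to
    other categories and are deliberately out of scope for this sketch.
    """
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for column, ch in enumerate(line, start=1):
            if ch != " " and unicodedata.category(ch) == "Zs":
                hits.append((line_no, column, unicodedata.name(ch)))
    return hits
```

Run over a config file's text before committing, a non-empty result points at the exact line and column to retype with an ordinary space.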
[14:50:35] (03CR) 10Madhuvishy: [C: 032] maps nfs: Match nfs mount options for maps [puppet] - 10https://gerrit.wikimedia.org/r/316974 (owner: 10Madhuvishy) [14:51:20] (03CR) 10Elukey: [C: 031] varnishrls4: use VSL query and proper tags [puppet] - 10https://gerrit.wikimedia.org/r/316966 (https://phabricator.wikimedia.org/T131353) (owner: 10Ema) [14:55:19] !log bounced ntp on mw2196/mw2197 (XFAC state) [14:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:51] 06Operations, 10Adminbot: [IDEA] Backup bot for morebots - https://phabricator.wikimedia.org/T148694#2731572 (10Zppix) Must of misunderstood my idea as in it wouldnt actually log mags until the backup detects that morebots is offline or not in channel for atleast 3-4 minutes [14:56:43] (03PS2) 10Gehel: kibana - only allow unauthenticated access to /api/status [puppet] - 10https://gerrit.wikimedia.org/r/316773 (https://phabricator.wikimedia.org/T132458) [14:56:46] (03CR) 10Filippo Giunchedi: [C: 032] wmnet: add AAAA for prometheus [dns] - 10https://gerrit.wikimedia.org/r/316972 (owner: 10Filippo Giunchedi) [14:57:11] PROBLEM - NTP on mw2196 is CRITICAL: NTP CRITICAL: Offset unknown [14:58:09] (03CR) 10Gehel: [C: 032] kibana - only allow unauthenticated access to /api/status [puppet] - 10https://gerrit.wikimedia.org/r/316773 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [14:58:32] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 696 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3051486 keys - replication_delay is 696 [15:00:12] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review, 07User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515#2731588 (10JEumerus) Added the #operations tag under the assumption that it's an operation question whether such a change is needed at all. In any c... 
[15:04:54] PROBLEM - NTP on mw2197 is CRITICAL: NTP CRITICAL: Offset unknown [15:07:22] PROBLEM - statsv process on hafnium is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args statsv [15:08:53] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:42] statsv is surely related to the kafka restarts [15:10:01] hmm yeah possibly [15:10:03] i'll look at that [15:10:10] thanks for the bounce, mor itzm [15:10:31] elukey: I'm gonna do the image scalers for codfw and eqiad now [15:10:32] !log restarted statsv on hafnium [15:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:21] apergos: super thanks, maybe I can do the jobrunners tomorrow [15:11:26] sorry for today [15:11:32] or maybe they will already be done by then :-) [15:11:37] no need to be sorry [15:11:40] it's all good [15:12:52] RECOVERY - statsv process on hafnium is OK: PROCS OK: 13 processes with command name python, args statsv [15:14:12] RECOVERY - NTP on mw2196 is OK: NTP OK: Offset 2.562999725e-05 secs [15:14:39] !log rolling reboot of image scalers for codfw, eqiad: mw2086-2089, mw2148-2151, mw1293-1298 [15:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:02] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 29 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[parted-/dev/sdl] [15:15:53] RECOVERY - NTP on mw2197 is OK: NTP OK: Offset -0.0001555681229 secs [15:15:57] cp machines are going to complain about being able to talk to kafka1020 btw [15:16:02] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1020_v4,kafka1020_v6 [15:16:03] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1020_v4,kafka1020_v6 [15:16:03] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1020_v4,kafka1020_v6 [15:16:03] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1020_v4,kafka1020_v6 [15:16:03] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1020_v4,kafka1020_v6 [15:16:11] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1020_v4,kafka1020_v6 [15:16:12] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1020_v4,kafka1020_v6 [15:16:14] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1020_v4,kafka1020_v6 [15:16:23] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1020_v4,kafka1020_v6 [15:16:32] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1020_v4,kafka1020_v6 [15:16:33] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1020_v4,kafka1020_v6 [15:16:33] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1020_v4,kafka1020_v6 [15:16:33] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1020_v4,kafka1020_v6 [15:16:33] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1020_v4,kafka1020_v6 [15:16:34] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1020_v4,kafka1020_v6 [15:16:37] PROBLEM - IPsec on cp2009 is CRITICAL: 
Strongswan CRITICAL - ok: 34 not-conn: kafka1020_v4,kafka1020_v6 [15:16:53] ottomata elukey ^ [15:16:56] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1020_v4,kafka1020_v6 [15:16:58] yeah [15:17:02] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1020_v4,kafka1020_v6 [15:17:02] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1020_v4,kafka1020_v6 [15:17:09] kafka1020 didn't like the reboot [15:17:10] kafka1020 isn't coming back up because /dev sd numbers got swapped around [15:17:12] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:17:12] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1020_v4,kafka1020_v6 [15:17:12] PROBLEM - IPsec on cp4002 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1020_v4,kafka1020_v6 [15:17:12] PROBLEM - IPsec on cp4001 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1020_v4,kafka1020_v6 [15:17:13] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1020_v4,kafka1020_v6 [15:17:13] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1020_v4,kafka1020_v6 [15:17:13] PROBLEM - IPsec on cp3005 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1020_v4,kafka1020_v6 [15:17:14] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1020_v4,kafka1020_v6 [15:17:32] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1020_v4,kafka1020_v6 [15:17:32] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1020_v4,kafka1020_v6 [15:17:32] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1020_v4,kafka1020_v6 [15:17:33] PROBLEM - IPsec on cp4004 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1020_v4,kafka1020_v6 [15:17:33] 
PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1020_v4,kafka1020_v6 [15:17:33] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1020_v4,kafka1020_v6 [15:17:33] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1020_v4,kafka1020_v6 [15:17:33] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1020_v4,kafka1020_v6 [15:17:34] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1020_v4,kafka1020_v6 [15:17:34] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1020_v4,kafka1020_v6 [15:17:35] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1020_v4,kafka1020_v6 [15:17:35] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: kafka1020_v4,kafka1020_v6 [15:17:43] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: kafka1020_v4,kafka1020_v6 [15:17:43] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: kafka1020_v4,kafka1020_v6 [15:17:43] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1020_v4,kafka1020_v6 [15:17:44] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1020_v4,kafka1020_v6 [15:17:44] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: kafka1020_v4,kafka1020_v6 [15:17:47] killing ircecho [15:17:51] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1020_v4,kafka1020_v6 [15:17:52] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: kafka1020_v4,kafka1020_v6 [15:17:52] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: kafka1020_v4,kafka1020_v6 [15:25:52] 06Operations, 10ops-codfw, 10ops-eqiad, 06DC-Ops, and 3 others: Re-image sca1001, sca1002, sca2001, sca2002, as scb1003, scb1004 and 
scb2003, scb2004 respectively - https://phabricator.wikimedia.org/T148380#2731659 (10Cmjohnson) a:05Cmjohnson>03Papaul Eqiad: Labels have been changed, racktables updated... [15:29:18] ok elukey, kafka1020 back up, disk dev # shuffled a bit but ¯\_(ツ)_/¯ [15:29:48] elukey: i should have been doing this during reboots anyway [15:30:17] setting uuids before reboot in fstab [15:30:29] elukey: do you think we should do that for analytics brokers now as you do reboots too? [15:30:32] oorrr, mabye too much? [15:30:38] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [15:31:39] ircecho running [15:33:00] ottomata: yes I think we should fix fstab before rebooting [15:33:53] elukey: i'm going to set uuids in fstab for the remaining two [15:34:04] super [15:34:04] and i will note that 1012 and 1014 need theirs fixed int he audit ticket we made [15:36:40] !log rolling reboots for jobrunners in codfw: mw2080-2085, mw2153-mw2162, mw2247-2250 [15:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:13] (03PS1) 10Gehel: elasticsearch: tuning of zen discovery settings [puppet] - 10https://gerrit.wikimedia.org/r/316976 (https://phabricator.wikimedia.org/T148736) [15:39:36] 06Operations, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2731690 (10awight) [15:40:00] 06Operations, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2730181 (10awight) [15:40:42] 06Operations, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2730181 (10awight) [15:43:44] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2731712 (10bd808) >>! 
In T147718#2730658, @Joe wrote: > - whatever is a general feature of production that can reasonably be referenced by more than one... [15:46:21] 06Operations, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2731725 (10awight) @Aklapper: Thanks for the pointer! I've linked with the tracking task, but balked on removing the Extension:Ids parent because this is a hard dependency... [15:47:03] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:47:20] (03PS2) 10Alexandros Kosiaris: check_ssl: Do not verify server cert chain on connect [puppet] - 10https://gerrit.wikimedia.org/r/316906 [15:53:02] elukey: kafka1013 might start alarming [15:53:10] for strongswan [15:55:19] (03CR) 10Alexandros Kosiaris: "Recapping from IRC (correct me on errors please):" [puppet] - 10https://gerrit.wikimedia.org/r/316906 (owner: 10Alexandros Kosiaris) [15:55:30] (03CR) 10Luke081515: [C: 04-1] "As before" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (https://phabricator.wikimedia.org/T56515) (owner: 10Jforrester) [15:58:47] weird I can see the alarms but no icinga-wm [15:58:51] (03CR) 10DCausse: "just nitpicks for now," (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/316976 (https://phabricator.wikimedia.org/T148736) (owner: 10Gehel) [15:59:07] !log rebooting uranium [15:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:59:21] !log short downtime of ganglia web ui [15:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161020T1600). Please do the needful. 
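Editor's note on the kafka1020 reboot earlier in the log, where the host failed to come back because the /dev/sd* names were shuffled and ottomata decided to set UUIDs in fstab before further reboots: a minimal Python sketch of that rewrite. The device-to-UUID mapping is assumed to have been collected beforehand (for example from `blkid` output); the function name, the sample mount point in the usage note and the UUID values are hypothetical, and this is not the procedure actually used on the brokers.

```python
def pin_fstab_to_uuids(fstab_text, uuid_by_device):
    """Return fstab text in which each device path listed in
    `uuid_by_device` is replaced by the equivalent UUID=... spec.

    Comment lines, blank lines and entries for unknown devices are
    left exactly as they were.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if fields and not is_comment and fields[0] in uuid_by_device:
            device = fields[0]
            # Only the first field (the filesystem spec) is rewritten.
            line = line.replace(device, "UUID=" + uuid_by_device[device], 1)
        out.append(line)
    return "\n".join(out) + "\n"
```

For example, calling it with `{"/dev/sdb1": "1111-2222"}` turns an entry such as `/dev/sdb1 /srv/data ext4 defaults 0 2` into `UUID=1111-2222 /srv/data ext4 defaults 0 2`, so the mount no longer depends on the order in which the kernel enumerates disks.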
[16:00:17] 06Operations, 06Analytics-Kanban, 06Performance-Team, 06Reading-Admin, 10Traffic: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#2731863 (10dr0ptp4kt) [16:00:36] puppet swat looks empty [16:00:50] (03PS2) 10Giuseppe Lavagetto: partman: fix recipe for docker [puppet] - 10https://gerrit.wikimedia.org/r/316953 [16:01:16] elukey: didn't you killed it before? [16:01:44] (03PS1) 10Dereckson: Throttle user edits to 1000 per minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316980 (https://phabricator.wikimedia.org/T56515) [16:01:59] (03CR) 10DCausse: elasticsearch: tuning of zen discovery settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316976 (https://phabricator.wikimedia.org/T148736) (owner: 10Gehel) [16:02:01] (03PS3) 10Giuseppe Lavagetto: partman: fix recipe for docker [puppet] - 10https://gerrit.wikimedia.org/r/316953 [16:02:15] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] partman: fix recipe for docker [puppet] - 10https://gerrit.wikimedia.org/r/316953 (owner: 10Giuseppe Lavagetto) [16:02:27] (03CR) 10Dereckson: "Pending discussions on https://gerrit.wikimedia.org/r/280002, we can have a very high limit, that will be better than no limit at all." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/316980 (https://phabricator.wikimedia.org/T56515) (owner: 10Dereckson) [16:03:03] volans: yes I did but usually it is enought to start ircecho again [16:04:26] (03CR) 10Filippo Giunchedi: Enable simple-json-datasource on prod Grafana (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314029 (https://phabricator.wikimedia.org/T147329) (owner: 10Addshore) [16:04:51] PROBLEM - Varnishkafka Delivery Errors per minute on cp3033 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [20000.0] [16:07:28] (03CR) 10Gehel: elasticsearch: tuning of zen discovery settings (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/316976 (https://phabricator.wikimedia.org/T148736) (owner: 10Gehel) [16:07:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp3033 is OK: OK: Less than 80.00% above the threshold [0.0] [16:08:40] !log bounced ntp on mw2089/mw2241 (XFAC state) [16:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:18] no it seems working [16:09:26] varnishkafka delivery errors?? [16:10:26] ahh maybe ottomata's kafka reboot [16:10:34] (03PS1) 10BryanDavis: wikitech: Fix Undefined variable: wgMWOAuthCentralWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316981 [16:10:44] PROBLEM - NTP on mw2241 is CRITICAL: NTP CRITICAL: Offset unknown [16:10:51] i just rebooted the last one, will do election after it syncs [16:11:03] yes https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&from=now-3h&to=now [16:11:13] ottomata: vk did not like these reboots [16:11:21] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=20&fullscreen&from=now-3h&to=now [16:11:58] hm [16:12:20] (03CR) 10Gehel: elasticsearch: tuning of zen discovery settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/316976 (https://phabricator.wikimedia.org/T148736) (owner: 10Gehel) [16:12:48] ottomata: mmm super weird.. 
did you do preferred replica election after each reboot? [16:13:02] checking logs in the meantime [16:13:10] yes [16:14:55] (03PS1) 10Paladox: Gerrit: Enable Conccurent Garbage Collection [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) [16:15:28] maybe the latest chances to librdkafka's settings are causing this [16:15:29] ? [16:15:32] (03PS2) 10Gehel: kibana - move to an LVS service [puppet] - 10https://gerrit.wikimedia.org/r/316774 (https://phabricator.wikimedia.org/T132458) [16:15:45] FYI I'm about to reboot the graphite machines in eqiad, can hold off if needed though [16:15:52] I can see a lot of Oct 20 16:09:50 cp3033 varnishkafka[132074]: %3|1476979790.289|FAIL|varnishkafka#producer-1| kafka1022.eqiad.wmnet:9092/22: [16:16:02] ottomata: --^ [16:16:27] (03CR) 10Reedy: [C: 031] wikitech: Fix Undefined variable: wgMWOAuthCentralWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316981 (owner: 10BryanDavis) [16:16:34] elukey: fyi i'm totally done with reboots now [16:17:00] yeah I can see that we are recovering [16:17:06] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2731899 (10Paladox) @Chad it's not gerrit causing it, since java 7 it has a built in gc whereas java 6 kind of didn't. So even i... [16:17:26] (03PS2) 10Paladox: Gerrit: Enable Conccurent Garbage Collection [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) [16:18:20] (03CR) 10MGChecker: [C: 04-1] Apply rate limit to edits for normal users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (https://phabricator.wikimedia.org/T56515) (owner: 10Jforrester) [16:21:11] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (10Reedy) >>! 
In T148478#2731899, @Paladox wrote: > @Chad it's not gerrit causing it, jvm since java 7 it has a built in... [16:21:49] ottomata: I added perSecond to https://grafana.wikimedia.org/dashboard/db/varnishkafka?from=now-3h&to=now&editorTab=Axes&panelId=20&fullscreen [16:21:53] now it looks better [16:21:56] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2731919 (10Paladox) >>! In T148478#2731915, @Reedy wrote: >>>! In T148478#2731899, @Paladox wrote: >> @Chad it's not gerrit causi... [16:23:00] oh good ja [16:23:11] jouncebot: refresh [16:23:14] I refreshed my knowledge about deployments. [16:23:17] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2731922 (10Reedy) >>! In T148478#2731919, @Paladox wrote: > "This tutorial covers the basics of how Garbage Collection works with... [16:23:23] PROBLEM - NTP on mw2089 is CRITICAL: NTP CRITICAL: Offset unknown [16:23:45] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2731925 (10Reedy) Haven't we been on Java 7 for a long time, anyway? [16:24:08] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2731928 (10Paladox) Oh, I doint know which one were using, but if everyone is ok with https://gerrit.wikimedia.org/r/#/c/316983/... [16:24:40] ottomata: https://gerrit.wikimedia.org/r/#/c/314336/2/modules/role/manifests/cache/kafka/webrequest.pp - maybe this one has side effects in librdkafka that we didn't consider? 
[16:24:50] 06Operations, 10Gerrit, 06Release-Engineering-Team, 13Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 - https://phabricator.wikimedia.org/T148478#2731931 (10Paladox) Yes you might be right, just i didnt notice gerrit going down with gerrit 2.8. [16:25:02] !log reboot graphite1003 for kernel upgrade [16:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:57] ottomata: also we retry on produce failure by default, so this might have exacerbated the problem [16:26:57] !log deploying new LVS service for kibana - T132458 [16:26:58] T132458: Move logstash.wikimedia.org (kibana) to an LVS service - https://phabricator.wikimedia.org/T132458 [16:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:07] (03CR) 10Gehel: [C: 032] kibana - move to an LVS service [puppet] - 10https://gerrit.wikimedia.org/r/316774 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [16:27:42] RECOVERY - NTP on mw2241 is OK: NTP OK: Offset 1.94311142e-05 secs [16:29:22] RECOVERY - NTP on mw2089 is OK: NTP OK: Offset 0.0003063678741 secs [16:30:02] !log rolling reboots for jobrunners in eqiad: mw1161-1169, mw1299-1306 [16:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:30:32] PROBLEM - Host graphite1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:31:17] RECOVERY - Host graphite1003 is UP: PING OK - Packet loss = 0%, RTA = 1.90 ms [16:31:25] meh expired downtime [16:31:29] heh [16:31:29] PROBLEM - carbon-cache@f service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-cache@f is inactive [16:31:44] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2731999 (10mark) Instead of a one-off, should we get similar spares as eqiad in codfw? 
[16:31:44] PROBLEM - carbon-local-relay service on graphite1003 is CRITICAL: CRITICAL - Expecting active but unit carbon-local-relay is inactive [16:33:12] RECOVERY - carbon-cache@f service on graphite1003 is OK: OK - carbon-cache@f is active [16:33:32] RECOVERY - carbon-local-relay service on graphite1003 is OK: OK - carbon-local-relay is active [16:33:33] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2732002 (10Ottomata) @mark, this ticket is about that too. It just happens that there is already a spare that matches these specs in eqiad, so we can s... [16:34:55] !log reboot graphite1001 for kernel upgrade [16:35:04] that might cause some spurious alerts, FYI [16:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:08] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2732032 (10RobH) >>! In T148065#2731891, @mark wrote: > This looks like a contract, not a one time purchase, as it seems to auto-renew. Therefore it nee... [16:38:32] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3027731 keys - replication_delay is 0 [16:44:18] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=logstash,service=kibana [16:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:45:07] !log rebooting install1001 [16:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:48:52] !log rolling reboot of testservers in codfw/eqiad: mw1017 mw1099 mw2017 mw2099 [16:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:42] Hi, im wondering what do people recommend for jvm garabage colection, g1 or cms? 
[16:50:46] http://blog.takipi.com/garbage-collectors-serial-vs-parallel-vs-cms-vs-the-g1-and-whats-new-in-java-8/ [16:51:12] (03PS3) 10Gehel: service::node: Adding minimal test [puppet] - 10https://gerrit.wikimedia.org/r/316560 [16:51:16] paladox: i heard we use g1 for some services but not for gerrit yet [16:51:25] mutante thanks [16:51:30] I can try g1 then [16:51:33] I found the config [16:51:35] i think so, yes [16:51:42] –XX:+UseG1GC [16:52:24] (03CR) 10Gehel: [C: 032] service::node: Adding minimal test [puppet] - 10https://gerrit.wikimedia.org/r/316560 (owner: 10Gehel) [16:53:21] PROBLEM - NTP on mw2085 is CRITICAL: NTP CRITICAL: Offset unknown [16:53:29] (03PS3) 10Paladox: Gerrit: Enable G1 Collector [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) [16:53:34] (03PS4) 10Paladox: Gerrit: Enable G1 Collector [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) [16:53:42] mutante ^^ done :) [16:53:45] thankyou [16:54:05] paladox: cool, though i thought by "try" you mean on the test instance, right [16:54:16] Yep [16:54:18] Let me try it [16:54:27] to see if it at least works and wont return and fail [16:58:50] ACKNOWLEDGEMENT - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdl] Filippo Giunchedi disk on slot 10 is broken, T83811 [17:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161020T1700). 
[17:00:19] !log rolling reboot of video scalers in codfw/eqiad: mw1259 mw1260 mw2152 mw2246 [17:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:47] (03PS2) 10Gehel: kibana - activate icinga check on new LVS service [puppet] - 10https://gerrit.wikimedia.org/r/316775 (https://phabricator.wikimedia.org/T132458) [17:09:09] (03CR) 10Gehel: [C: 032] kibana - activate icinga check on new LVS service [puppet] - 10https://gerrit.wikimedia.org/r/316775 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [17:13:02] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 663 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3029236 keys - replication_delay is 663 [17:13:57] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2732127 (10AlexMonk-WMF) >>! In T148560#2730717, @jcrespo wrote: > Right now we maintain a blacklist on realm.pp. We should transform that into a white li... [17:14:22] (03PS5) 10Paladox: Gerrit: Enable G1 Collector [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) [17:14:32] mutante ^^ done :) [17:16:11] paladox: thanks, please add releng people [17:16:17] (03CR) 10Paladox: "Tested and works on labs instance." [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [17:16:20] Ok [17:16:23] Yep i have :) [17:16:29] kind of busy with reboots too [17:16:34] ok [17:17:27] RECOVERY - NTP on mw2085 is OK: NTP OK: Offset -0.001341104507 secs [17:19:45] (03CR) 10ArielGlenn: "I wonder what good values for the Xms/Xmx options are for gerrit, or if we need change them from the default." 
[puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [17:20:09] at last, ntpd recovered after I stomped on it a couple of times [17:20:18] tedious [17:20:29] apergos is arielglenn you? [17:20:34] uh huh [17:20:52] Is ArielGlenn you [17:20:57] from the comment above [17:20:58] yes uh huh it is [17:21:00] Just wondering [17:21:14] Ok, it seems that testing more has brought up a js error [17:21:27] and so gerrit starts up through the command but loading the site it breaks [17:21:34] https://gerrit-test.wmflabs.org/ [17:21:42] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2732172 (10AlexMonk-WMF) I suppose we go from one list that needs updating when private wikis are created to one that needs updating when non-private (the... [17:21:52] guess you get to play with those options some :-) [17:22:03] Yep :) [17:22:15] Trying some gc that doesn't pause applications :) [17:22:49] Have the stat machines been rebooted yet? [17:23:36] oh, nvm, I see it's scheduled for tomorrow [17:24:10] it's this -XX:MaxGCPauseMillis=200 [17:24:17] I wonder how we are supposed to do it [17:26:14] Hmm, and it's -XX:+UseG1GC [17:26:22] well even G1 will freeze apps, just for less time [17:29:12] !log rebooting install2001 [17:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:29:22] Oh [17:29:27] (this includes ganglia aggregator codfw) [17:29:28] apergos will cms be better? 
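(Editor's note: the collector flags traded in this exchange can be summarised in a small sketch. It only assembles the HotSpot options named in the channel, -XX:+UseG1GC, -XX:MaxGCPauseMillis and -XX:+UseConcMarkSweepGC. The helper names gc_options and check_ascii_dash are hypothetical, nothing here is taken from the actual Gerrit puppet change, and attaching the pause target only to G1 is a simplification of this sketch.)

```python
# Hypothetical helper: maps a collector name to the HotSpot flag quoted in the log.
GC_FLAGS = {
    "g1": "-XX:+UseG1GC",
    "cms": "-XX:+UseConcMarkSweepGC",
}


def check_ascii_dash(option):
    """Reject options that do not start with a plain ASCII hyphen.

    A typographic dash (for example U+2013) pasted from a web page looks
    similar but is not read as an option prefix by the JVM launcher.
    """
    if not option.startswith("-"):
        raise ValueError("option must start with ASCII '-': %r" % option)
    return option


def gc_options(collector, max_pause_ms=None):
    """Return the list of JVM flags for 'g1' or 'cms'.

    max_pause_ms is a pause-time goal, not a guarantee; in this sketch it is
    only attached when G1 is selected.
    """
    if collector not in GC_FLAGS:
        raise ValueError("unknown collector: %r" % collector)
    options = [GC_FLAGS[collector]]
    if collector == "g1" and max_pause_ms is not None:
        options.append("-XX:MaxGCPauseMillis=%d" % max_pause_ms)
    return [check_ascii_dash(opt) for opt in options]


if __name__ == "__main__":
    print(" ".join(gc_options("g1", 200)))
    print(" ".join(gc_options("cms")))
```

(The dash check is there because a flag copied with a typographic dash is easy to miss by eye.)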
[17:29:44] you should read up on the differences a little [17:30:18] Ok [17:31:06] apergos cms tries not to pause anything [17:31:11] but if it has to it will [17:32:16] my very limited understanding is that g1 is the next gen collector which will eventually be the default [17:32:22] Oh [17:32:33] i think in javaOptions it is having a problem with those options [17:32:36] and that in cases of large heap and desired low latency you want to prefer it [17:32:37] but [17:32:45] you need to talk to the experts [17:32:51] ie those who maintain gerrit :-P [17:33:24] Ok [17:35:32] !log reboot of last few stragglers for mw* hosts in codfw/eqiad: mw2152 mw2079 mw1239 [17:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:38:56] !log rebooting kraz - short downtime of irc.wikimedia.org please prepare to reconnect your clients if they dont automatically do it [17:41:28] aww, come on. on another ganeti host i had to use gnt-instance commands to reboot and here it's the other way around [17:42:56] !log T133395, T113805: Starting a primary-range, incremental repair of local_group_wiktionary_T_parsoid_html.data on restbase2001.codfw.wmnet [17:42:58] T113805: Establish a strategy for regular anti-entropy repairs - https://phabricator.wikimedia.org/T113805 [17:42:58] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [17:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:43:25] Krenair: yay, both ircd and ircbot back after reboot without manual intervention [17:43:34] yay [17:43:35] it's already talking again on #en.wikipedia [17:43:43] yep, so fixing the unit files and stuff worked :) [17:44:52] Krenair: should i still send an email for those clients that dont auto-reconnect? 
(they really should) [17:45:09] not sure which list though [17:45:13] mutante, just wikitech-l [17:45:17] ok, will do [17:46:35] what is the difference between irc.freenode and irc.wikimedia? [17:46:51] they're separate IRC networks [17:46:54] they run separate software [17:47:08] Ah [17:47:16] freenode accepts messages from unauthorised users, wikimedia does not [17:47:19] <|L> and different config + purposes [17:47:30] Zppix: irc.wikimedia.org is a readonly IRC server that has a bot that reports all RecentChanges from wikis [17:47:48] you can look at it by connecting and joining a channel like #en.wikipedia [17:47:51] but you can't talk there [17:47:54] <|L> (only opers can speak and create chans at irc.wikimedia) [17:48:18] so it's more intended as an "internal" irc server in a sense? [17:48:19] some people use it for anti-vandalism tools [17:48:49] some of it has been replaced by stream.wikimedia.org but not all [17:49:04] Zppix: no, it's not internal, it's public [17:49:05] <|L> Zppix: nope, more as a live feed [17:49:13] <|L> it's not for talking, but everybody can watch [17:49:28] i know it's pub i meant like communication wise it's meant for like internal use [17:49:36] <|L> nope [17:49:43] it only reports RecentChanges, no talking between people [17:49:47] <|L> only if you define RC as internal :o [17:50:04] <|L> so the bot talks to you, but you can't talk to the bot [17:50:07] (03PS2) 10Zppix: Added a new commonly typed typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315743 [17:50:15] <|L> and there are no services, and you can't create new chans [17:50:30] the bot creates the chans on join, one for each wiki [17:50:45] it's like the equivalent of a fishbowl [17:50:46] ah [17:50:49] other bots connect to it to get the data [17:50:52] !log warming up elastic@codfw from wasat.codfw.wmnet (take 3) [17:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:51:07] It is cache, LOL (gerrit testing) [17:53:23] (03CR) 
10Zppix: [C: 031] "Complete." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315743 (owner: 10Zppix) [18:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161020T1800). [18:00:05] bd808, James_F, and dcausse: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:12] * James_F waves. [18:00:29] o/ [18:01:08] 06Operations, 10hardware-requests, 06Services (watching): Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2732368 (10Dzahn) gotcha, thanks both [18:01:12] (03CR) 10Jforrester: [C: 031] "(Now ready to go.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315278 (https://phabricator.wikimedia.org/T142589) (owner: 10Jforrester) [18:01:21] Hello, I can SWAT. [18:01:28] Thank you. [18:03:37] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316981 (owner: 10BryanDavis) [18:03:51] o/ [18:03:59] bd808: you want to test this one on silver directly I guess? [18:04:08] (03Merged) 10jenkins-bot: wikitech: Fix Undefined variable: wgMWOAuthCentralWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316981 (owner: 10BryanDavis) [18:05:08] Dereckson: hmm... lets do mw1099 and make sure nothing blows up there first. If the regular wikis are good then we can jsut sync it I think. Worst case is that the OAuth that no one is using yet on wikitech gets busted [18:05:14] bd808: ok, live on mw1099 [18:06:14] enwiki and mediawikiwiki load ok for me there. anything funky in the logs? 
[18:06:29] logs are clean [18:06:49] LGTM then [18:07:44] (03PS3) 10Dereckson: Enable the visual editor for all users on remaining phase 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315278 (https://phabricator.wikimedia.org/T142589) (owner: 10Jforrester) [18:07:59] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315278 (https://phabricator.wikimedia.org/T142589) (owner: 10Jforrester) [18:08:21] bd808: syncing (sync is a little slow) [18:08:29] (03Merged) 10jenkins-bot: Enable the visual editor for all users on remaining phase 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315278 (https://phabricator.wikimedia.org/T142589) (owner: 10Jforrester) [18:08:32] (scap pull on mw1099 took 28 seconds) [18:08:42] !log dereckson@mira Synchronized wmf-config/CommonSettings.php: wikitech: Fix Undefined variable: wgMWOAuthCentralWiki ([[Gerrit:316981]]) (duration: 01m 26s) [18:08:45] !log rebooting fermium (lists.wm.org) [18:08:46] Here you are ^ [18:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:08:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:09:15] James_F: live on mw1099 [18:09:20] * James_F tests. [18:09:35] Yup, LGTM. [18:09:50] Dereckson: warnings are gone on silver. 
thanks for your help [18:11:16] you're welcome [18:11:20] James_F: ok [18:11:22] !log mailing list server back, normal operation [18:11:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:12:12] (03PS3) 10Dereckson: [cirrus] Activate BM25 on top 10 wikis: Step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315298 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [18:12:22] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: es2015 crashed with no os logs (kernel logs or other software ones) - it shuddenly went down - https://phabricator.wikimedia.org/T147769#2732383 (10Papaul) p:05High>03Normal [18:12:46] !log dereckson@mira Synchronized wmf-config/InitialiseSettings.php: Enable Visual Editor for all users remaining phase 6 Wikipedias (T142589) (duration: 00m 50s) [18:12:47] T142589: Enable VisualEditor by default for all users of all remaining non-language variant Wikipedias - https://phabricator.wikimedia.org/T142589 [18:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:42] LGTM. [18:16:07] :o [18:16:45] Nemo_bis: Something broken? [18:16:57] yay [18:17:00] congratulations [18:18:09] Thanks. Still have the eight language variants to support and then switch, and of course Dutch. [18:18:32] Okay if all still looks good, we can go on with new search result algo for Cirrus. 
[18:18:39] (03CR) 10Dereckson: [C: 032] [cirrus] Activate BM25 on top 10 wikis: Step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315298 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [18:19:06] (03Merged) 10jenkins-bot: [cirrus] Activate BM25 on top 10 wikis: Step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315298 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [18:19:14] 06Operations, 06Services (done), 15User-mobrovac: Expand SCB cluster - https://phabricator.wikimedia.org/T147903#2732405 (10Papaul) [18:19:17] 06Operations, 10ops-codfw, 10ops-eqiad, 06DC-Ops, and 3 others: Re-image sca1001, sca1002, sca2001, sca2002, as scb1003, scb1004 and scb2003, scb2004 respectively - https://phabricator.wikimedia.org/T148380#2732403 (10Papaul) 05Open>03Resolved Complete [18:19:53] dcausse: live on mw1099 [18:20:01] Dereckson: ok, testing on few wikis [18:23:06] OK, thank you Dereckson again. :-) I'm popping off. [18:23:43] You're welcome. See you later. [18:24:19] Dereckson: looks good, not that I would not be surprised to see a short spike of pool counter errors [18:24:31] s/not/note/ [18:24:56] 06Operations, 07Puppet, 07Documentation, 03Google-Code-In-2016, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2732418 (10Dzahn) [18:25:08] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review, 07User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515#596370 (10Ash_Crow) "Should users have a limit? " → What problem would it solve, exactly? Wouldn't a limit cause problems to tools like Cat-A-Lot o... [18:26:34] 06Operations, 07Puppet, 07Documentation, 03Google-Code-In-2016, and 2 others: document all puppet classes / defined types!? - https://phabricator.wikimedia.org/T127797#2054254 (10Dzahn) This task is about adding comments/docs to all puppet classes that don't have any. I could imagine mentoring somebody to... 
[18:28:14] dcausse: okay, logs look good too for the moment [18:28:38] nice [18:30:04] !log dereckson@mira Synchronized wmf-config/InitialiseSettings.php: Activate Cirrus BM25 algo on top 10 wikis (step 2, T147508) (duration: 00m 50s) [18:30:05] T147508: BM25: initial limited release into production - https://phabricator.wikimedia.org/T147508 [18:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:32:37] dcausse: 349 in_array() expects parameter 2 to be an array or collection in /srv/mediawiki/php-1.28.0-wmf.22/extensions/CirrusSearch/includes/Search/RescoreBuilder [18:32:41] s.php on line 160 [18:32:49] Dereckson: yes I've seen that :/ [18:33:27] that's queries or your script to reindex? [18:33:50] that must be exotic queries, looking [18:34:44] Do you have a fix or should we revert? [18:34:49] Dereckson: do we know which wiki? [18:35:35] de, en [18:35:40] ok [18:36:37] Dereckson: can you give me few mins or we must revert right now? [18:36:49] Error count seems stable at 311 [18:37:14] so yes you can have a quick look, we don't seem to serve new errors [18:37:43] hey yes we do, there are still :37 errors [18:37:46] okay I revert that [18:38:44] ok :( [18:39:09] !log dereckson@mira Synchronized wmf-config/InitialiseSettings.php: Revert "[cirrus] Activate BM25 on top 10 wikis: Step 2" (duration: 00m 54s) [18:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:22] dcausse: would that help I pull it on mw1017? [18:39:57] ebernhardson: do you think we can debug there ^? 
[18:40:20] I can't find which kind of query is causing this eror :/ [18:41:49] dcausse: we can debug from 1017, but only if we can figure out what queries trigger the problem [18:42:00] indeed :/ [18:42:51] because these messages come from hhvm instead of mediawiki logging, there is nothing to relate it to the original url [18:44:04] ok I'll dig into the code to find a plausible explanation [18:44:17] Dereckson: thanks, and sorry for the revert [18:44:27] no problem [18:45:27] (03PS1) 10Dereckson: Revert "[cirrus] Activate BM25 on top 10 wikis: Step 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317003 [18:45:49] (03CR) 10Dereckson: [C: 032] "SWAT. Already reverted in prod." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317003 (owner: 10Dereckson) [18:46:19] (03Merged) 10jenkins-bot: Revert "[cirrus] Activate BM25 on top 10 wikis: Step 2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317003 (owner: 10Dereckson) [18:48:54] (03PS2) 10Gehel: kibana - configure varnish to use new LVS service as backend [puppet] - 10https://gerrit.wikimedia.org/r/316776 (https://phabricator.wikimedia.org/T132458) [18:50:55] (03CR) 10Gehel: [C: 032] kibana - configure varnish to use new LVS service as backend [puppet] - 10https://gerrit.wikimedia.org/r/316776 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [19:11:29] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3023920 keys - replication_delay is 0 [19:22:41] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [19:28:00] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [19:31:41] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review, 07User-notice: Apply editing rate limits for all users - https://phabricator.wikimedia.org/T56515#2732580 (10Dereckson) I've asked 
feedback on IRC and French Wikipedia / Commons French community village pump on this issue. First, users raised th... [19:35:10] (03PS2) 10Andrew Bogott: shinkengen: get role classes from puppet enc too [puppet] - 10https://gerrit.wikimedia.org/r/312109 (owner: 10Alex Monk) [19:36:44] (03CR) 10Andrew Bogott: [C: 032] shinkengen: get role classes from puppet enc too [puppet] - 10https://gerrit.wikimedia.org/r/312109 (owner: 10Alex Monk) [19:38:50] PROBLEM - MariaDB Slave SQL: s5 on db1070 is CRITICAL: CRITICAL slave_sql_state could not connect [19:40:11] (03CR) 10Paladox: "Actually going to go back to prevous patch and use XX:+UseConcMarkSweepGC just tested on gerrit-test3 which is a replica of production ger" [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) (owner: 10Paladox) [19:40:40] db1070 looks ok at fisrt sight, checking [19:41:01] i was just about to ping folks [19:41:02] heh [19:41:14] dumps should be unrelated this time at least [19:41:19] (03PS6) 10Paladox: Gerrit: Enable concurrent collector [puppet] - 10https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) [19:41:32] at first looks an issue with the check [19:41:34] from icinga [19:41:34] mutante ^^ that works [19:42:09] LOL just the gerrit-test playing up, not sure why but i carn't visit gerrit-test.wmflabs.org website now, it has a js error [19:42:28] but anyways i tested on gerrit-test3 and it works :) [19:42:30] ok too many connections [19:42:41] But will be overwritten when puppet runs again [19:42:56] I am checking db1070 too [19:43:12] marostegui: ack, tons of unauthenticated user | connecting host [19:43:17] at login step [19:43:49] wow, yeah [19:43:53] PROBLEM - Disk space on cp4014 is CRITICAL: DISK CRITICAL - free space: / 346 MB (3% inode=86%) [19:44:05] load ~14 [19:44:50] marostegui: tons of conns in CLOSE_WAIT [19:45:12] ~9k [19:46:17] (03PS1) 10Alex Monk: Follow-up I966f6422: Add missing package dependency, fix 
format call [puppet] - 10https://gerrit.wikimedia.org/r/317007 [19:46:29] Should we stop slave it to get it depooled from the LB? [19:46:57] I can send the CR to depool it [19:47:03] RECOVERY - MariaDB Slave SQL: s5 on db1070 is OK: OK slave_sql_state Slave_SQL_Running: Yes [19:47:06] marostegui: while you check [19:47:09] sure [19:47:16] I am worried if this is going to move to another host of S5 [19:47:18] if we depool this one [19:47:59] Maybe we can just remove it from main traffic but not from API one [19:48:32] marostegui: db1071 is api too but doesn't show any problem [19:48:35] I can also kill those connections in Connect state [19:48:53] ok [19:49:50] (03PS2) 10Alex Monk: Follow-up I966f6422: Add missing package dependency, fix format call [puppet] - 10https://gerrit.wikimedia.org/r/317007 [19:50:49] nice [19:50:52] I am killing them now [19:51:15] Hey jynus welcome to the party [19:51:30] (03PS1) 10Volans: MariaDB: depooled db1070 for investigation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317009 [19:51:34] jynus: marostegui CR sent, at your discretion [19:51:42] I will *NOT* merge it [19:51:46] (03PS3) 10Alex Monk: Follow-up I966f6422: Add missing package dependency, fix format call, de-duplicate roles [puppet] - 10https://gerrit.wikimedia.org/r/317007 [19:51:59] there are several 1-hour queries coming from the api [19:52:04] this is a software problem [19:52:32] yes the ApiQueryAllRevisions [19:52:34] :rin [19:52:36] surprisingly, it is not wikidata, it is dewiki [19:52:36] *run [19:52:53] do you want to kill that dozen? [19:52:57] yes [19:53:08] also whatever deployed it, get it reverted [19:54:35] jynus: why db1071 is not affected? 
[19:54:47] volans: I think depooling it then won't make any difference :( [19:56:07] volans, no idea [19:56:30] I was looking at the last SWAT, config only [19:56:30] maybe it run out of space or got stuck (pilup first) [19:56:35] My one liner has finished killing those zombie connections [19:57:04] this started happening at 14h [19:57:05] UTC [19:57:12] ok looking before then [19:57:20] so look there at that deployment window [19:57:30] no patch for that SWAT [19:58:15] the problem with pileups is that the difference between losing service and a mere annoyance is very slim [19:59:15] maybe it existed for a long time [19:59:24] but it was not used until now [20:00:08] Could be yep [20:00:12] But the connections have not come back [20:00:25] (03Abandoned) 10Volans: MariaDB: depooled db1070 for investigation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/317009 (owner: 10Volans) [20:00:30] You think they will? [20:00:53] the killing watchdog was working at full speed [20:01:04] and despite that, it didn't work [20:01:05] with which timeout? 
[20:01:20] --match-user wikiuser --group-by fingerprint --any-busy-time 50 --query-count 10 [20:01:22] the ~15 queries where there since 3000 seconds [20:01:33] yes, but they were 10 [20:01:33] I ran this by the way: for i in `mysql --skip-ssl -e "nopager;show processlist;" | grep unauthenticated | awk -F " " '{print $1}'`; do mysql --skip-ssl -e "kill $i;";done [20:01:52] which is exactly what was happening [20:01:52] yeah 12 [20:01:55] queries [20:01:57] more or less [20:02:20] or maybe the fingerprint was different [20:02:31] maybe the host has some issue, it could be [20:02:39] but the queries were not too good [20:03:08] (03CR) 10Andrew Bogott: [C: 032] Follow-up I966f6422: Add missing package dependency, fix format call, de-duplicate roles [puppet] - 10https://gerrit.wikimedia.org/r/317007 (owner: 10Alex Monk) [20:04:50] I will create a specific watchdog for this query, just in case [20:05:24] explain says 152445630 rows ;) [20:05:28] volans, there were 2 different queries [20:05:40] hence the watchdog not cathing it [20:05:54] 2 as in 2 different fingerprints [20:06:02] yeah I saw them [20:06:24] As everything looks fine, I am going to go back and finish dinner :-) [20:06:27] Thanks guys! [20:08:21] jynus: should the watchdog kill any query longer than a safe timeout also if there are only a couple of them? [20:08:48] That's a good point [20:09:17] it is not that easy [20:09:37] that is what the events do [20:10:08] apparently they are not enabled here [20:10:21] I have put my own watchdog [20:10:40] probably that was the difference with the other api server [20:10:48] the other killed queries sooner [20:11:08] make sense [20:11:47] yep, mistery solved [20:11:57] the software is still bad [20:12:03] still means that we have bad queries :( [20:12:09] wow [20:12:14] that is such a shock! [20:12:14] lol [20:12:18] what? 
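(Editor's note: the reason given here for the watchdog missing the pile-up, that the long queries fell under two different fingerprints, can be illustrated with a simplified model. The sketch assumes the options quoted above mean "group by fingerprint, then act only on groups with at least 10 queries where any query has been busy for more than 50 seconds". The function name, the fingerprint labels and the 6 + 6 split are hypothetical; the log only says there were roughly a dozen queries of two kinds.)

```python
from collections import defaultdict


def classes_to_kill(queries, busy_time=50, query_count=10):
    """Return the fingerprints whose group would be matched by the watchdog.

    queries is an iterable of (fingerprint, seconds_running) pairs.
    A group matches only if it holds at least query_count queries AND
    at least one of them has been running longer than busy_time seconds.
    """
    groups = defaultdict(list)
    for fingerprint, seconds in queries:
        groups[fingerprint].append(seconds)
    return sorted(
        fingerprint
        for fingerprint, times in groups.items()
        if len(times) >= query_count and max(times) > busy_time
    )


if __name__ == "__main__":
    # Twelve queries running for ~3000 s, split across two fingerprints:
    # neither group reaches the count threshold, so nothing is matched.
    split = [("fp_a", 3000)] * 6 + [("fp_b", 3000)] * 6
    print(classes_to_kill(split))

    # The same twelve queries under one fingerprint would be matched.
    single = [("fp_a", 3000)] * 12
    print(classes_to_kill(single))
```

(Under this model a per-query timeout, like the database events mentioned a little later, catches the case that a count-based rule lets through.)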
[20:12:47] I mean, nothing happened here really, db1071 took over [20:12:56] (03PS2) 10Alex Monk: shinkengen: Remove old fix for ec2id -> fqdn ldap host entry migration [puppet] - 10https://gerrit.wikimedia.org/r/309011 [20:12:58] (03PS2) 10Alex Monk: shinkengen: Remove unused instance attributes [puppet] - 10https://gerrit.wikimedia.org/r/309012 [20:13:03] some slowdown for a few seconds [20:13:14] I will make sure the watchdog is working on all servers [20:13:39] ok, we should probably collect and review the killed queries too on a regular basis [20:13:40] and tomorrow I will file some tickets asking why we allow 50 million row queries :-) [20:13:42] in order to fix them [20:13:49] we do that [20:13:52] it is called tendril [20:14:07] and mediawiki-slow log [20:14:11] mmmh but if they are killed we don't miss them there? [20:14:15] no [20:14:25] first one does it with show processlist [20:14:36] ture [20:14:40] mutante: you around? have a sec for me to pick your brain? [20:14:41] true, forgot that :) [20:14:41] second logs whatever goes, I think, over 10 seconds [20:14:53] plus the kill will generate an error 2 [20:14:59] so it is not for the lack of logging [20:15:18] ok great :) [20:16:16] I'm not sure whether to ask it here but going to http://korma.wmflabs.org/browser/ causes js errors [20:22:55] urandom: what's up [20:23:26] mutante: heya: I have this: https://github.com/eevans/cassandra-tools-wmf, which corresponds to this: https://phabricator.wikimedia.org/T132958 [20:23:40] mutante: and i'm wonder about what makes sense for deployment [20:23:46] s/wonder/wondering/ [20:25:03] mutante: we want this on the restbase cluster, but i imagine analytics will want it on aqs too [20:25:54] 06Operations, 06Discovery, 06Discovery-Analysis (Current work), 13Patch-For-Review, 07Tracking: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2732849 (10mpopov) That looks like the changes have been deployed? Hm... 
Still can't install... [20:26:06] what were the bad queries btw? [20:27:58] mutante: it builds a debian package, and in perfect world i think that's how you'd do it, but i understand that may be frowned upon by ops [20:28:24] urandom: i dont thikn debian packages are frowned up, more the opposite [20:29:12] the concern, as i understood it, is that (for example) post,pre-installs run as root [20:29:21] urandom: i think ideally it would be a Debian package that we put on apt.wm.org and then puppet installs it [20:30:16] urandom: i think that concern is about people installing software manually from unknown sources [20:30:40] urandom: but not for something that has been reviewed and uploaded to our own repo [20:31:11] mutante: where i heard it was in the context of software deployment generally, in place of scap, say [20:31:45] hey is it possible someone could complete my gerrit repo request? If your too busy thats fine im just curious [20:32:06] and, we could use scap for this, but it would be a bit of pain i think; you're going to want these things in path for everyone [20:32:58] urandom: debs are not good for software like mediawiki or nodejs services that change often. That may be what you heard or what got corrupted in transmit by the time it got to you. [20:33:43] bd808: so it just boils down to frequency? [20:33:52] urandom: i'm not sure about scap, but i think probably not in this case. either package or if it's simple enough there is another way, you can put the tools in a separate repo on git/gerrit, like wikimedia/software/cassandra-tools or whatever and then tell puppet to just git clone from that [20:34:00] that group of scripts looks like something that could be deployed via ops/puppet, scap3, or a deb [20:34:12] as long as it's not from github or an external source but from our own servers [20:34:45] k, it already builds a deb (complete w/ man pages!) 
[20:35:05] :) i think we like that and then you should probably stick to that [20:35:32] so, a gerrit-based repo, and a CI built package? [20:35:57] and phab for ops when apt.wm.org needs updating? [20:35:57] i dont know about the "CI built"-part [20:36:06] yes, to that second part [20:36:26] mutante that would be debian-glue [20:36:29] Builds debs [20:36:48] yea, i dont know if you should use that or another build server in labs [20:37:02] others will know better [20:37:56] do we actually have any auto-built debs that are used in prod? [20:38:09] urandom: ask the releng folks when they are around how they build the deb for scap [20:38:42] its trivially easy to setup cowbuilder on a Labs VM if nothing else [20:39:01] bd808: auh, an example to crib, i ❤ examples [20:39:04] you still need to be root to upload [20:39:16] sure, but we have roots :) [20:39:28] which means that they'll want to inspect the source [20:39:30] See https://integration.wikimedia.org/ci/job/debian-glue-non-voting/ [20:39:34] which probably means that auto builds are out [20:39:47] there failing when testing due to a bug, but there building the debs [20:39:58] unless the pipeline is verified to build a particular revision [20:40:22] if it's not auto-build, your first step would be requesting a repo, normally all the packages used in prod are under operations/debs/foo [20:40:43] except the ones that aren't;) [20:40:51] in my experience, the whole deb resistance is about roots not being enthusiastic about using a deploy mechanism that potentially grants root on boxes [20:42:24] which is understandable, given that it's all bare metal boxes [20:42:38] Zppix hi, qchris normally does this and can be found in -devtools [20:42:44] i have not heard about 'deb resistance', it's just about where they come from [20:43:26] ack paladox [20:43:37] yep [20:44:00] i don't expect they'll change often, i could build them myself, sign them, and drop them on people.wm.org [20:44:02] urandom: my advice would be to 
find a root to sponsor & upload your deb [20:44:34] otherwise, it'll be a lot harder [20:44:51] i can sign and upload packages directly to Debian, so technically i already have this ability, even if delayed somewhat :) [20:45:06] that would be easier, yeah [20:46:37] i thought that was already established at "and phab for ops when apt.wm.org needs updating" [20:47:01] also we have several debian devs [20:47:13] mutante: yeah, i know :) [20:47:43] mutante: we've formed a secret cabal [20:47:43] urandom being one of them [20:49:07] mutante: also in https://phabricator.wikimedia.org/T132958 is an idea for (the option of) logging to SAL from some of these commands [20:49:10] bd808: ^^^ [20:49:19] that seems kind of precedent setting [20:49:43] mutante: thoughts? [20:50:02] i think you upload source to a git repo, get it reviewed, somebody, you or others, build it. ops push it up to apt.wm.org, optionally it gets added to Debian [20:50:14] that is the normal way imho [20:50:45] mutante: k [20:50:57] there's a script on tin for logging to sal and/or a process on neon that can be used. 
That's "logmsgbot" here [20:51:08] bd808: yup [20:51:31] but this would be sending from the nodes of Cassandra clusters [20:51:36] what bd808 said about logging [20:51:44] at a minimum it would mean ferm changes [20:51:58] and i'm wondering if anyone would be philisophically opposed [20:52:12] scap talks directly to neon -- https://phabricator.wikimedia.org/diffusion/MSCA/browse/master/scap/log.py;d407eaf640e121ca581de5580ff511d6e69b906c$59 [20:52:17] philosophically even [20:52:19] i have _just_ changed those ferm rules for logmsgbot [20:52:27] we can allow other source if needed [20:53:07] what would be great is if they have IPv6 mapped addresses [20:53:13] yeah, I don't see why it wouldn't be possible to open up [20:53:29] i say that because i want https://gerrit.wikimedia.org/r/#/c/316497/ [20:54:01] (CR) Krinkle: [C: 1] Switch to LoadMonitorMySQL instead of the generic one [mediawiki-config] - https://gerrit.wikimedia.org/r/316732 (owner: Aaron Schulz) [20:54:17] bd808: oh, this sends them to irc? [20:55:20] yeah. the dologmsg script is just a tiny nc wrapper that pushes strings to the listener on neon [20:55:22] yes, one bot says it on IRC and the other bot then logs it [20:55:41] (and sends to Twitter and SAL) [20:55:55] Twitter!? [20:55:57] scap talks directly to neon because it's easier [20:56:05] urandom: yes, each !log entry also is on Twitter [20:56:14] TIL [20:56:19] what user? [20:57:01] https://twitter.com/wikimediatech [20:57:40] chatops before chatops was cool ;) [21:01:47] !log Started sending Tool Labs survey emails from silver [21:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:02:10] ^ whom does that go to? all tool labs users? 
[21:02:34] Zppix: T147336 [21:02:34] T147336: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336 [21:05:37] (Abandoned) Paladox: Gerrit: Fix double escaping [puppet] - https://gerrit.wikimedia.org/r/315853 (owner: Paladox) [21:06:27] bd808: is there a version of logmsgbot you can test against? something in labs? [21:07:24] (CR) Zppix: [C: 1] Add support for searching gerrit using bug:T1 [puppet] - https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: Paladox) [21:07:25] urandom: Not that I know of. We could probably set it up in deployment-prep and send the messages to #wikimedia-releng though [21:08:11] I think I "tested" by running a trivial socket listener [21:08:53] yeah, nc would work for that too [21:09:09] that'll work [21:09:16] (PS2) Niedzielski: Add Xdummy daemon [puppet] - https://gerrit.wikimedia.org/r/264303 (https://phabricator.wikimedia.org/T133183) [21:10:08] (PS3) Zppix: Adds translations to the user's lang in the links within the readme in the ROOT dir. [puppet] - https://gerrit.wikimedia.org/r/315728 [21:10:21] (PS3) Zppix: Added a new commonly typed typo [mediawiki-config] - https://gerrit.wikimedia.org/r/315743 [21:11:20] (CR) Zppix: [C: 1] "Done. 
Awaiting review whenever seems fit" [puppet] - https://gerrit.wikimedia.org/r/315728 (owner: Zppix) [21:30:37] (CR) QChris: [C: -1] "Note that adding trackingIds needs to reindex existing changes" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: Paladox) [21:32:01] (CR) Paladox: Add support for searching gerrit using bug:T1 (3 comments) [puppet] - https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: Paladox) [21:43:53] (PS9) Paladox: Add support for searching gerrit using bug:T1 [puppet] - https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) [21:43:58] (PS10) Paladox: Add support for searching gerrit using bug:T1 [puppet] - https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) [21:44:56] (CR) QChris: Add support for searching gerrit using bug:T1 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: Paladox) [21:45:09] Operations, Discovery, Discovery-Analysis (Current work), Patch-For-Review, Tracking: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2732951 (chelsyx) >>! In T147682#2732849, @mpopov wrote: > That looks like the changes have... [21:46:01] (CR) QChris: [C: -1] "See discussion in Patch Set#8" [puppet] - https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: Paladox) [21:46:08] (PS11) Paladox: Add support for searching gerrit using bug:T1 [puppet] - https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) [21:48:14] (CR) Paladox: "@QChris hi, how can I support T1#1 for example please?" 
[puppet] - https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: Paladox) [21:58:59] bug bugs [22:21:55] Operations, Discovery, Discovery-Search, Elasticsearch, Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#2733110 (Deskana) This task will truly live forever. Yay, scope creep! :-p [22:22:41] how do you make git recreate .gitconfig? [22:32:43] probably set a config option via git config [22:42:57] mutante: Thank you for the irc.wikimedia.org reboot e-mail! :D [22:52:02] !log Finished sending Tool Labs survey emails from silver (T147336) [22:52:03] T147336: 2016 Tool Labs user survey - https://phabricator.wikimedia.org/T147336 [22:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:56:58] Debra: :) welcome [22:59:49] hmm, is something going on with jenkins? seems to be taking a long time to merge my change [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161020T2300). [23:00:04] MatmaRex and AndyRussG: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:13] hi. [23:00:55] hi too [23:01:11] !== too hi [23:05:40] Hello. [23:05:45] I can SWAT this evening. [23:06:39] Dereckson: hey :) [23:06:46] cool thx much! [23:09:05] Dereckson: in addition to the core patch do you think we could push out a CentralNotice submodule update? 
It's all just minor code cleanup and one minor user-facing bugfix [23:10:44] AndyRussG: while the previous/current discussions don't lead to a clear solution about when to deploy CentralNotice, yes, it's acceptable [23:10:57] especially for the user-facing bugfix [23:11:23] I [23:12:19] Dereckson: yeah I was just trying to remember exactly where that settled... I think it was, we need to figure out a better process 8p [23:12:20] I'll state a new time I'm totally happy to deploy CentralNotice in weekly dedicated "misc extensions" windows, and I'm sure both some train deployers and SWAT members would also be happy to help for this windows too [23:12:47] Ah I didn't know about the misc extensions [23:12:54] thx :) [23:14:30] Dereckson: looks like w/ my +2 on the CN wmf_deploy branch, there was an automagic did a core update (?) 93ce62e8c8972e49001d54bc85448cd4755d52d9 [23:14:31] It's a proposal to solve the current impasse: that would allow to have a planned, previsible window in the calendar, so everyone would : your team to put changes, and releng team to have them planned and previsible [23:15:03] I think that would help us be more organized generally too, yeah [23:15:54] yes it has well created a core change [23:16:18] 93ce62e8c89 [23:16:24] Project: mediawiki/extensions/CentralNotice wmf_deploy 40ba87c67acf105db53cd62d06102accd31d4687 [23:17:48] yep [23:19:59] Hmm funny how it gets into Diffusion https://phabricator.wikimedia.org/rMW93ce62e8c8972e49001d54bc85448cd4755d52d9 [23:20:27] but not Gerrit (the Change-id shown in Diffusion is wrong...) [23:21:05] the change-id at the bottom? It's the one for the last change in your branch [23:21:10] (Localisation updates from https://translatewiki.net.) [23:21:43] Yeah but that commit message was autogenerated and just shows the change-ids from all the commits merged in [23:22:41] MatmaRex: live on mw1099 [23:23:44] AndyRussG: any order of preference between CN and core? [23:24:16] Dereckson: sure, FWIW... 
let's do core first? Not that it matters much [23:24:24] ok [23:24:29] thx! [23:24:59] Dereckson: thanks, checking [23:25:19] i'm on a slow connection at the moment and things take a long time to load [23:26:14] Dereckson: ok, looks good [23:26:28] MatmaRex: ok [23:26:39] Gerrit is a little slow. [23:26:52] a git fetch took 1 minute [23:28:52] just a tad! [23:29:19] MatmaRex: syncing to prod [23:29:37] gerrit web interface not responding atm [23:30:16] !log dereckson@mira Synchronized php-1.28.0-wmf.22/extensions/UploadWizard/resources/ui/steps/uw.ui.Upload.js: Fix a weird ghost "or" for non-Flickr users ([[Gerrit:317013]]) (duration: 01m 31s) [23:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:31:01] Nikerabbit: it's working for me, but excruciatingly slowly... [23:31:28] AndyRussG: core change is live on mw1099 [23:32:55] Nikerabbit: Gerrit back at normal speed for me [23:33:16] Dereckson oh [23:33:21] I know why this is [23:33:23] mutante ^^ [23:33:31] gc is causing gerrit to get slow again [23:34:55] Oh hang on gc doesn't look like it was running when they said gerrit was slow [23:35:01] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=cobalt.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS_%7C_network [23:35:54] AndyRussG: CentralNotice live on mw1099 too [23:36:46] Dereckson: core change seems fine :) checking logstash [23:36:57] Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2733454 (Paladox) [23:37:50] !log rolling restarts of citoid on scb* (for recdns update) [23:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:00] Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 
21/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (Paladox) [23:38:10] Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: Investigate why gerrit slowed down on 17/10/2016 / 18/10/2016 / 21/10/2016 - https://phabricator.wikimedia.org/T148478#2724169 (Paladox) [23:38:24] Dereckson: if I just search Kibana for "mw1099" should that show me any goings-on? [23:39:15] AndyRussG: sure [23:39:29] AndyRussG: you've a link to mw1099 at https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#General_Advice [23:39:36] https://logstash.wikimedia.org/app/kibana#/dashboard/mw1099 [23:39:40] a dashboard ready for that [23:41:10] kewly :) [23:42:14] Operations, ops-eqiad: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2733472 (BBlack) I'm guessing probably bad partman recipe for new hardware? In any case, these seem to be sitting in a half-installed state presently. [23:43:51] Dereckson: CN update also looks fine! [23:44:10] ok [23:45:05] !log dereckson@mira Synchronized php-1.28.0-wmf.22/includes/cache/MessageCache.php: Use checkKeys for large messages (T144952) (duration: 00m 50s) [23:45:06] Operations, ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#2733475 (BBlack) `lead.wikimedia.org` is still up and running, with puppet disable and stern warning to not re-enable it. We left it up Just In Case we found some issue that we need to go back to its data... 
[23:45:06] T144952: Banner not showing up on site - https://phabricator.wikimedia.org/T144952 [23:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:52] Operations, media-storage: ms-be1027 borked - https://phabricator.wikimedia.org/T148807#2733479 (BBlack) [23:48:39] !log dereckson@mira Synchronized php-1.28.0-wmf.22/extensions/CentralNotice: Bump CentralNotice version to fix T145738 and T145447 ([[Gerrit:317077]]) (duration: 00m 54s) [23:48:41] T145738: CentralNotice should not call WikiPage::doEdit() - https://phabricator.wikimedia.org/T145738 [23:48:41] T145447: CN Campaign Setting oddities - https://phabricator.wikimedia.org/T145447 [23:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:50:19] AndyRussG: here you are [23:51:38] !log rebooting eeden (ns2) for kernel [23:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:29] Dereckson: the sweet fragrance of apparently functioning software wafts through the smoggy air... i.e., all good! [23:55:36] thx much! [23:55:39] You're welcome.