[00:00:04] RoanKattouw, ^d, jackmcbarn: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150206T0000). Please do the needful. [00:01:31] who's up? [00:01:36] 3Ops-Access-Requests: Requesting access to tin for nuria - https://phabricator.wikimedia.org/T88760#1019773 (10Nuria) 3NEW [00:02:22] hi jackmcbarn [00:02:58] jackmcbarn, you want me to deploy that for you? [00:03:01] yeah [00:03:48] any Hindi speakers? [00:04:52] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [00:05:52] jackmcbarn, ok, doing [00:05:56] ok [00:07:48] 3operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#1019786 (10RobH) 3NEW [00:08:43] 3operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#1019796 (10RobH) 5Open>3stalled p:5Triage>3Normal [00:09:40] 3operations: Introduce Virtualization in our infrastructure - https://phabricator.wikimedia.org/T87258#1019807 (10RobH) [00:09:41] 3operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#1019786 (10RobH) [00:13:18] !log krenair Synchronized php-1.25wmf16/includes/CategoryViewer.php: https://gerrit.wikimedia.org/r/#/c/188945/1 (duration: 00m 06s) [00:13:20] jackmcbarn, check ^ [00:13:24] Logged the message, Master [00:13:27] checking [00:13:40] works on wmf16 [00:14:13] 3operations, Phabricator: Add @emailbot to #operations - https://phabricator.wikimedia.org/T87611#1019810 (10chasemp) >>! In T87611#1019762, @RobH wrote: > Yes, but it should ONLY relay into the ops-datacenter site projects, not #operations itself. > > I realize thats what we talked about, but just calling it o... [00:14:50] !log krenair Synchronized php-1.25wmf15/includes/CategoryViewer.php: https://gerrit.wikimedia.org/r/#/c/188944/1 (duration: 00m 06s) [00:14:52] jackmcbarn, check ^ [00:14:53] Logged the message, Master [00:15:28] 3operations, Phabricator: Add @emailbot to #operations - https://phabricator.wikimedia.org/T87611#1019811 (10RobH) 5Open>3Resolved @emailbot has been added to #operations group [00:15:29] works. thanks! [00:15:37] yw [00:18:37] 3operations: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1019818 (10hashar) @andrew All CI slaves are now using puppet-lint 1.1.0 :-] So unless my patches have a weird side effect on prod they are good to land :-] [00:19:51] (03PS1) 10Ori.livneh: vbench: various fixes [puppet] - 10https://gerrit.wikimedia.org/r/188949 [00:20:24] (03PS2) 10Ori.livneh: vbench: various fixes [puppet] - 10https://gerrit.wikimedia.org/r/188949 [00:21:16] (03CR) 10Hashar: "Yup -lenient only look for puppet-lint 'error' level while -strict also includes 'warning' level." [puppet] - 10https://gerrit.wikimedia.org/r/188805 (https://phabricator.wikimedia.org/T87132) (owner: 10Hashar) [00:22:54] (03CR) 10Hashar: "I came up with that list of ignores ages ago merely because they had a high occurrence. 
We might revisit the list though but skipping tho" [puppet] - 10https://gerrit.wikimedia.org/r/188375 (https://phabricator.wikimedia.org/T87132) (owner: 10Hashar) [00:26:02] (03CR) 10Catrope: [C: 032] vbench: various fixes [puppet] - 10https://gerrit.wikimedia.org/r/188949 (owner: 10Ori.livneh) [00:27:58] 3Ops-Access-Requests: Requesting deployment access for nuria - https://phabricator.wikimedia.org/T88760#1019829 (10Krenair) [00:29:27] 3Ops-Access-Requests: Requesting deployment access for nuria - https://phabricator.wikimedia.org/T88760#1019834 (10Andrew) Nuria -- To proceed with this I'll need a response to this ticket from your manager (Toby, I presume) approving this access. After that there's a three-day waiting period, and then I can e... [00:32:27] 3Ops-Access-Requests: Requesting deployment access for nuria - https://phabricator.wikimedia.org/T88760#1019841 (10greg) I approve from the RelEng side (given this is for deployment/tin rights) assuming @nuria will be training with someone on her team (or maybe ori?, not sure, her pick). If that won't work, I ca... [00:34:19] 3Ops-Access-Requests: Requesting deployment access for nuria - https://phabricator.wikimedia.org/T88760#1019851 (10Andrew) p:5Triage>3Normal [00:35:52] 3Ops-Access-Requests: Requesting deployment access for nuria - https://phabricator.wikimedia.org/T88760#1019853 (10Nuria) I have no ambition to deploy anything else besides EL so I will ask ori to give me a tour. [00:37:08] 3Ops-Access-Requests: Requesting deployment access for nuria - https://phabricator.wikimedia.org/T88760#1019856 (10ori) >>! In T88760#1019853, @Nuria wrote: > I have no ambition to deploy anything else besides EL so I will ask ori to give me a tour. I'd be happy to do that. [00:41:20] 3operations, Beta-Cluster: Make www-data the web-serving user (is currently apache) - https://phabricator.wikimedia.org/T78076#1019863 (10hashar) Really nice! I am very happy to see beta cluster being used for such staging work \O/ [00:51:45] 3operations: Rolling restart for Elasticsearch to pick up new version of wikimedia-extra plugin - https://phabricator.wikimedia.org/T86602#1019902 (10greg) This was already done, right? Or do we still need a time? If it wasn't done, pick a time that works next week and JFDI :) [00:52:30] 3Multimedia, operations, MediaWiki-extensions-GWToolset: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1019904 (10Tgr) The earlier overload issues weren't related to upload batch size, but rather to image size, and in theory should not happen again (al... [00:57:29] (03PS1) 10Andrew Bogott: Reduce ttl to 5M for wikitech [dns] - 10https://gerrit.wikimedia.org/r/188955 [00:59:25] (03PS1) 10Ori.livneh: xenon: links => follow for File['/srv/xenon'] [puppet] - 10https://gerrit.wikimedia.org/r/188956 [01:00:01] (03PS2) 10Ori.livneh: xenon: links => follow for File['/srv/xenon'] [puppet] - 10https://gerrit.wikimedia.org/r/188956 [01:00:08] (03CR) 10Ori.livneh: [C: 032 V: 032] xenon: links => follow for File['/srv/xenon'] [puppet] - 10https://gerrit.wikimedia.org/r/188956 (owner: 10Ori.livneh) [01:00:30] (03CR) 10RobH: [C: 031] "I am not entirely certain the labsconsole.wikimedia.org TTL also needs to be shortened, since it just points to wikitech.wikimedia.org. 
T" [dns] - 10https://gerrit.wikimedia.org/r/188955 (owner: 10Andrew Bogott) [01:01:10] (03CR) 10Andrew Bogott: [C: 032] Reduce ttl to 5M for wikitech [dns] - 10https://gerrit.wikimedia.org/r/188955 (owner: 10Andrew Bogott) [01:01:38] !log restarting xenon on fluorine [01:01:44] Logged the message, Master [01:08:47] (03PS1) 10GWicke: Update the restbase config for v0.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/188959 [01:11:25] Why do threads titled like "I opened LocalSettings.php using Notepad++ and I saved it" remind me of pop songs xD [01:11:43] I verbed thing and I verbed it. [01:12:06] (03PS2) 10GWicke: Update the restbase config for v0.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/188959 [01:19:30] too muchg Katy Perry. [01:19:33] much* [01:19:48] (03CR) 10Mobrovac: [C: 031] Update the restbase config for v0.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/188959 (owner: 10GWicke) [01:24:12] RECOVERY - Kafka Broker Messages In Per Second on tungsten is OK: OK: No anomaly detected [01:24:56] (03PS1) 10Ori.livneh: vbench: add --show-uncached-requests [puppet] - 10https://gerrit.wikimedia.org/r/188960 [01:25:14] 3operations, WMF-Legal, Engineering-Community: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#1019963 (10LuisV_WMF) It appears to be assigned to you, Quim? (Unless I'm misreading Phab?) I need @MBrar.WMF to take the lead on approving this right now - Manprit, please... [01:25:18] (03CR) 10Ori.livneh: [C: 032 V: 032] vbench: add --show-uncached-requests [puppet] - 10https://gerrit.wikimedia.org/r/188960 (owner: 10Ori.livneh) [01:34:31] (03PS2) 10Dzahn: use jessie for redis hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/188932 (https://phabricator.wikimedia.org/T86887) [01:34:52] 3operations: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1019967 (10Dzahn) We are going with jessie. < _joe_> so yes, let's go with jessie < _joe_> we sould also look at changes in redis between the version we have and the one on jessie [01:37:23] (03PS3) 10Dzahn: use jessie for redis hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/188932 (https://phabricator.wikimedia.org/T86887) [01:39:35] (03PS4) 10Dzahn: use jessie for redis hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/188932 (https://phabricator.wikimedia.org/T86887) [01:45:25] hnmm. why cant i merge that [01:45:38] it's not a dependency this time [01:45:43] but greyed out [01:46:19] (03PS5) 10Dzahn: use jessie for redis hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/188932 (https://phabricator.wikimedia.org/T86887) [01:46:30] (03CR) 10Dzahn: [C: 032] use jessie for redis hosts in codfw [puppet] - 10https://gerrit.wikimedia.org/r/188932 (https://phabricator.wikimedia.org/T86887) (owner: 10Dzahn) [01:49:52] 3operations, ops-codfw, hardware-requests: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1019979 (10Dzahn) ``` dzahn@iron:~$ ssh root@rbf2001.mgmt root@rbf2001.mgmt's password: dzahn@iron:~$ ssh root@rbf2002.mgmt ssh: Could not resolve hostname rbf2002.mgmt: Name or service not know... [01:52:08] 3operations: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1019982 (10Dzahn) [01:52:09] 3operations: Check that the redis roles can be applied in codfw, set up puppet. 
- https://phabricator.wikimedia.org/T86898#1019983 (10Dzahn) [01:52:10] 3operations, ops-codfw, hardware-requests: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1019981 (10Dzahn) 5Resolved>3Open [01:52:48] 3operations, ops-codfw, hardware-requests: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#979084 (10Dzahn) i can SSH to the host, but something is wrong about the mgmt entry in DNS it seems [01:56:00] 3Ops-Access-Requests: Requesting deployment access for milimetric - https://phabricator.wikimedia.org/T88769#1019994 (10Milimetric) 3NEW [01:56:52] PROBLEM - Host rbf2001 is DOWN: PING CRITICAL - Packet loss = 100% [01:57:53] ACKNOWLEDGEMENT - Host rbf2001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reinstalling with Debian [02:04:48] !log LocalisationUpdate failed: git pull of extensions failed [02:04:57] Logged the message, Master [02:07:21] 3Incident-20150205-SiteOutage: Stale database connections during outage - https://phabricator.wikimedia.org/T88770#1020014 (10Springle) 3NEW a:3Springle [02:08:26] 3Ops-Access-Requests: Requesting deployment access for milimetric - https://phabricator.wikimedia.org/T88769#1020022 (10Milimetric) [02:08:29] 3Incident-20150205-SiteOutage: sleeper database connection surges during outage - https://phabricator.wikimedia.org/T88770#1020024 (10Springle) [02:10:26] 3operations, ops-codfw, hardware-requests: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1020027 (10Dzahn) i tried to install Debian on rbf2001 and the installer claims: ``` ┌────────────┤ [!!] Download debconf preconfiguration file ├────────────┐ │... [02:15:13] (03CR) 10Dzahn: "something went wrong here, see inline comment" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/188669 (owner: 10RobH) [02:15:55] (03PS1) 10Ori.livneh: vbench: fixes [puppet] - 10https://gerrit.wikimedia.org/r/188966 [02:16:16] 3operations, ops-codfw, hardware-requests: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1020029 (10Dzahn) >>! In T86897#1019979, @Dzahn wrote: > dzahn@iron:~$ ssh root@rbf2002.mgmt > ssh: Could not resolve hostname rbf2002.mgmt: Name or service not known see https://gerrit.wikimedia... [02:17:03] (03CR) 10Dzahn: setting mgmt entries for rbf2002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/188669 (owner: 10RobH) [02:17:14] (03CR) 10Catrope: [C: 031] vbench: fixes [puppet] - 10https://gerrit.wikimedia.org/r/188966 (owner: 10Ori.livneh) [02:17:20] (03PS2) 10Ori.livneh: vbench: fixes [puppet] - 10https://gerrit.wikimedia.org/r/188966 [02:17:27] (03CR) 10Ori.livneh: [C: 032 V: 032] vbench: fixes [puppet] - 10https://gerrit.wikimedia.org/r/188966 (owner: 10Ori.livneh) [02:20:42] RECOVERY - Check status of defined EventLogging jobs on vanadium is OK: OK: All defined EventLogging jobs are runnning. [02:21:21] PROBLEM - puppet last run on vanadium is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [02:22:02] RECOVERY - Disk space on vanadium is OK: DISK OK [02:23:02] 3operations, ops-codfw, hardware-requests: Procure and setup rbf2001-2002 - https://phabricator.wikimedia.org/T86897#1020031 (10Dzahn) >>! In T86897#1020027, @Dzahn wrote: > i tried to install Debian on rbf2001 and the installer claims: > │ The IP address you provided is malformed. https://gerrit.wikimedia.o... 
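The two failure modes above (rbf2002.mgmt not resolving at all, and the installer rejecting a malformed IP) are both catchable before an install starts. A minimal pre-flight sketch, assuming Python 3 and illustrative fully-qualified hostnames (the log only shows the short rbf200x.mgmt form):

```python
import ipaddress
import socket

# Hostnames are illustrative, not confirmed by the log.
HOSTS = ["rbf2001.mgmt.codfw.wmnet", "rbf2002.mgmt.codfw.wmnet"]

for host in HOSTS:
    try:
        addr = socket.gethostbyname(host)  # forward lookup, as ssh would do
    except socket.gaierror as err:
        print(f"{host}: does not resolve ({err})")
        continue
    try:
        ipaddress.ip_address(addr)  # raises ValueError if the record is malformed
        print(f"{host} -> {addr} (ok)")
    except ValueError:
        print(f"{host} -> {addr!r} (malformed)")
```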
[02:23:31] RECOVERY - puppet last run on vanadium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [02:40:31] PROBLEM - Kafka Broker Messages In Per Second on tungsten is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 45 below the confidence bounds [02:41:32] (03PS1) 10Ori.livneh: vbench: Instead of --proxy, use --host-rules to specify mapping of hosts [puppet] - 10https://gerrit.wikimedia.org/r/188967 [02:41:51] (03CR) 10Ori.livneh: [C: 032 V: 032] vbench: Instead of --proxy, use --host-rules to specify mapping of hosts [puppet] - 10https://gerrit.wikimedia.org/r/188967 (owner: 10Ori.livneh) [02:51:07] (03PS1) 10Ori.livneh: [WIP] simulate network conditions [puppet] - 10https://gerrit.wikimedia.org/r/188968 [02:57:08] 3operations: Migrate RT to Phabricator - https://phabricator.wikimedia.org/T38#1020060 (10Dzahn) >>! In T38#939694, @Dzahn wrote: > so it seems we decided to not import the domains queue. should/can i manually move an open ticket i was still working on? meanwhile there is T87465 to request that as a new project [03:02:07] 3Multimedia, operations: Errors when generating thumbnails should result in HTTP 400, not HTTP 500 - https://phabricator.wikimedia.org/T88412#1020067 (10Tgr) a:3Tgr [03:10:51] PROBLEM - Kafka Broker Messages In Per Second on tungsten is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 45 below the confidence bounds [03:48:41] PROBLEM - Kafka Broker Messages In Per Second on tungsten is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 45 below the confidence bounds [04:37:54] Newbie question: what's the best vector for reporting 503s in production? [04:38:33] I just got sent to http://en.wikiversity.org/503.html, though it seems to be working now. [04:43:02] earldouglas: Sent from where? [04:45:33] 3operations, WMF-Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1020104 (10Heather) This looks great! Would you mind sharing where the text is being finalized? [04:46:09] Fiona: just from http://en.wikiversity.org/ [04:46:37] Oh, correction, from http://wikiversity.org/ [04:46:49] It's reproducible, but doesn't seem like an ops issue per se [04:47:19] I'll file it in phabriactor [04:47:48] 3operations, ops-eqiad: Invitation to Special Issue and Publish Papers for Free in It - https://phabricator.wikimedia.org/T88773#1020107 (10emailbot) [04:49:19] 3operations, ops-eqiad: Invitation to Special Issue and Publish Papers for Free in It - https://phabricator.wikimedia.org/T88773#1020113 (10greg) spammy spam spam [04:50:06] 3operations: 503 on http://wikiversity.org/ - https://phabricator.wikimedia.org/T88774#1020114 (10Jdouglas) 3NEW [04:51:34] 3operations: 503 on http://wikiversity.org/ - https://phabricator.wikimedia.org/T88774#1020121 (10greg) This is fine, thanks @Jdouglas. @bblack, could this be a cached 503 from today's outage? [04:51:49] greg-g: we don't cache 503s [04:51:56] hmm, then not i guess :) [04:52:16] 3operations: 503 on http://wikiversity.org/ - https://phabricator.wikimedia.org/T88774#1020123 (10Jdouglas) [05:03:23] 3operations: 503 on http://wikiversity.org/ - https://phabricator.wikimedia.org/T88774#1020131 (10MZMcBride) Weird. I'm currently getting page content (200 OK) at . I believe should be redirecting to (compare with `curl -I "http://... [05:03:40] ? [05:03:51] earldouglas: There's some weirdness there for sure. 
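The curl -I comparison MZMcBride describes can be scripted. A small stdlib-only sketch that fetches both hostnames without following redirects and prints where each Location header points:

```python
import http.client

for host in ("wikiversity.org", "www.wikiversity.org"):
    conn = http.client.HTTPConnection(host, timeout=10)
    conn.request("HEAD", "/")  # like curl -I; redirects are not followed
    resp = conn.getresponse()
    print(host, resp.status, resp.getheader("Location"))
    conn.close()
```

A healthy response from the bare domain is a 301 to the www hostname; a Location pointing at 503.html is the cached error being chased below.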
[05:05:10] it is strange that its redirects are set up different from the other sites [05:05:11] 3operations: 503 on http://wikiversity.org/ - https://phabricator.wikimedia.org/T88774#1020138 (10Jdouglas) Turns out I'm not getting an HTTP 503, but a 301 to the 503 page: ``` $ curl http://wikiversity.org/ -i -s HTTP/1.1 301 Moved Permanently Server: Apache X-Powered-By: HHVM/3.3.0-static Location: http://e... [05:06:01] earldouglas == jdouglas? [05:06:15] if you can hang around for just a sec, let me try something [05:06:48] Yes, I == jdouglas [05:08:33] earldouglas: can you try again now? [05:08:56] bblack: working for me now [05:09:01] A 301 to 503.html... ehhh. If wikipedia.org were doing this, I imagine it would've been noticed already. ;-) [05:09:01] awesome [05:09:17] so yes, we don't cache 503s, but we did cache the 301 that went to the explicit url for the 503 error [05:09:30] Ah, that makes sense. [05:09:37] Thanks for the quick fix! [05:09:44] I cleared it from cp4008 in ulsfo which was where the reported trace came through. I'm going to go ban that everywhere now to clear it [05:10:28] sweet [05:10:33] thanks bblack ! [05:11:53] we should probably either extend this ticket or spawn another one or whatever to look at why that site emitted a cacheable direct to a 503.html and how we can prevent this down the line [05:12:36] Extend seems fine. I think a separate ticket is needed for the www issue. [05:15:32] 3operations: 503 on http://wikiversity.org/ - https://phabricator.wikimedia.org/T88774#1020140 (10BBlack) The 301 redirecting to 503.html was cached in varnish. I've cleared that globally now for this particular case. We should dig a little deeper on why/how we ever emitted a cacheable redirect to 503.html (pr... [05:17:13] 3operations, Incident-20150205-SiteOutage, Wikimedia-Logstash: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1020141 (10bd808) One possible negative of rsyslog as the shipping transport for structured application logs is the hard 4K line length limit.... [05:21:48] (03PS1) 10BBlack: add vm tuning params to jessie cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/188972 [05:25:25] 3operations, ops-eqiad: Invitation to Special Issue and Publish Papers for Free in It - https://phabricator.wikimedia.org/T88773#1020147 (10chasemp) 5Open>3Invalid a:3chasemp [05:26:25] chasemp: but for free!!! why not accept? [05:27:13] It was a hard decision but I stand by it [05:28:43] hmm, I seem to have missed all messages from 1AM to 8AM [05:28:45] (03PS2) 10Ori.livneh: add vm tuning params to jessie cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/188972 (owner: 10BBlack) [05:28:56] so you're the reason my academic publishing based career has not taken off [05:29:14] (03CR) 10Ori.livneh: "(fixed syntax; you had multiple 'values' parameter values)" [puppet] - 10https://gerrit.wikimedia.org/r/188972 (owner: 10BBlack) [05:29:18] (03PS1) 10BBlack: re-disable compact_memory cron for jessie caches [puppet] - 10https://gerrit.wikimedia.org/r/188973 [05:30:07] oops, thanks ori! 
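The global clear bblack mentions amounts to issuing a ban on every frontend. A rough sketch assuming SSH access and varnish's varnishadm(1); the host list and ban expression here are illustrative, not the exact ones used:

```python
import subprocess

CACHE_HOSTS = ["cp4008.ulsfo.wmnet"]  # ...plus the rest of the text caches
BAN = 'ban req.http.host == "wikiversity.org" && req.url == "/"'

for host in CACHE_HOSTS:
    # Each ban marks matching cached objects (the bad 301 here) as invalid,
    # so the next request is fetched fresh from the backend.
    subprocess.check_call(["ssh", host, "varnishadm", BAN])
```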
[05:30:59] (03CR) 10BBlack: [C: 032] add vm tuning params to jessie cache nodes [puppet] - 10https://gerrit.wikimedia.org/r/188972 (owner: 10BBlack) [05:31:09] (03PS2) 10BBlack: re-disable compact_memory cron for jessie caches [puppet] - 10https://gerrit.wikimedia.org/r/188973 [05:31:18] (03CR) 10BBlack: [C: 032] re-disable compact_memory cron for jessie caches [puppet] - 10https://gerrit.wikimedia.org/r/188973 (owner: 10BBlack) [05:31:26] (03CR) 10BBlack: [V: 032] re-disable compact_memory cron for jessie caches [puppet] - 10https://gerrit.wikimedia.org/r/188973 (owner: 10BBlack) [05:45:41] RECOVERY - Kafka Broker Messages In Per Second on tungsten is OK: OK: No anomaly detected [06:20:54] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:22] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:22] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:31] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:31] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:43] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:12] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures [06:35:26] hmm, puppet is failing on random h osts, so _joe_ should be here any moment now… :) [06:39:00] <_joe_> hey [06:39:12] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:39:12] <_joe_> I already groaned on the other channel :) [06:39:21] _joe_: haha :) [06:39:26] I hadn't seen that yet [06:39:40] <_joe_> I am tired [06:39:52] yeah, sounds like an exciting / tiring day yesterday [06:39:58] <_joe_> yes [06:40:03] <_joe_> the whole week was nice [06:40:07] :) [06:40:10] <_joe_> first, WTF day [06:40:20] <_joe_> then, the largest outage since I joined [06:41:58] <_joe_> were you around at the time of the outage? or were you sleeping? [06:44:17] _joe_: I was out having food [06:44:31] _joe_: and then I came back, and was pleasantly surprised to hear that we survived a cold cache start [06:44:42] <_joe_> well, not exactly [06:44:46] <_joe_> the cache was there [06:44:54] didn't all the memcached machines restart? [06:44:57] <_joe_> we were just not fully reaching it [06:44:58] <_joe_> no [06:45:02] <_joe_> just the switch [06:45:14] oh [06:45:18] <_joe_> the one parav.oid just told me about 20 minutes earlier [06:45:32] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:45:43] <_joe_> the switch and I guess one varnish box? [06:45:47] _joe_: so the power went down only to the switch, and not to the machines themselves? 
[06:45:50] <_joe_> I don't exactly remember [06:45:52] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:45:52] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:45:55] <_joe_> YuviPanda: exactly [06:46:03] <_joe_> that made things easier I guess [06:46:03] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:46:09] yeah [06:46:28] <_joe_> but, we were mostly down because sync logging is _bad_ [06:46:37] yeah, I read backscroll [06:46:52] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:47:01] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:54:47] 3operations, WMF-Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1020223 (10Nirzar) @heather It's here https://docs.google.com/a/wikimedia.org/spreadsheets/d/1pm2flhyK7CjwyfBHc5GM0OC8PX_7tuASt_ArEYW5F5M/edit#gid=0 [07:08:15] hello ops people [07:08:18] just a head ups [07:08:29] the parsoid cluster is not looking good [07:08:29] https://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&c=Parsoid+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [07:08:33] and may need a restart [07:17:56] andrewbogott_afk: ^ when you're back [07:35:20] <_joe_> arlolra: mh thanks [07:35:40] <_joe_> but apart from high cpu usage, do you have any other reason to think that? [07:37:51] <_joe_> because I went to one of the high-cpu machines and the parsoid log shows no errors [07:39:18] _joe_: yeah, the high cpu usage is from stuck processes. if you look at the logs, https://logstash.wikimedia.org/#/dashboard/elasticsearch/parsoid, there's a failure scenario "Maximum call stack size exceeded" that's causing it [07:39:52] normally those fatal error would restart the process but we deployed something on wednesday that's messing that up [07:39:57] <_joe_> it's pretty hard to grep parsoid logs for errors tbh, ok [07:40:46] <_joe_> arlolra: my next question would be why we don't monitor/alert on this [07:41:03] <_joe_> but I'll reserve this, and some other, for an email later [07:41:14] that's a good question [07:41:33] sadly i don't have an answer [07:43:09] <_joe_> arlolra: :) no worries [07:43:36] <_joe_> it's just friday of a stressful week, and I'm grumpy. Especially so before the second coffee :) [07:48:21] i can only imagine. today seemed rough [07:49:45] (03CR) 10KartikMistry: cxserver: Use different registry for Beta and Production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/188796 (owner: 10KartikMistry) [07:50:42] <_joe_> !log restarting the parsoid cluster, one node at a time, some processes are stuck. [07:50:47] Logged the message, Master [07:52:24] thanks for restarting. processes should continue to get stuck but that should by us some time to get a fix out in the (my) morning [07:58:37] <_joe_> eheh ok [08:06:37] 3operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#1020282 (10faidon) [08:08:09] 3operations: Miscellaneous servers to track in eqiad for possible inclusion in codfw misc virt cluster - https://phabricator.wikimedia.org/T88761#1019786 (10faidon) Thanks for filing this. 
We already had a couple of items on our TODO (basically: zirconium + core infrastructure such as email), but this is a good... [08:31:24] _joe_: who are you on gerrit? [08:34:46] https://gerrit.wikimedia.org/r/#/c/188982/ [08:37:33] that should prevent the processes from getting stuck [08:46:37] (03PS1) 10Yuvipanda: toollabs: Pass in full parent environment to npm start [puppet] - 10https://gerrit.wikimedia.org/r/188985 (https://phabricator.wikimedia.org/T1102) [08:46:56] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Pass in full parent environment to npm start [puppet] - 10https://gerrit.wikimedia.org/r/188985 (https://phabricator.wikimedia.org/T1102) (owner: 10Yuvipanda) [08:48:19] hi kart_ [08:48:22] Hola [08:48:25] yeah, just make it a mandatory parameter? [08:48:30] $registry, [08:48:36] (03PS2) 10Yuvipanda: toollabs: Pass in full parent environment to npm start [puppet] - 10https://gerrit.wikimedia.org/r/188985 (https://phabricator.wikimedia.org/T1102) [08:50:56] (03CR) 10Yuvipanda: [C: 032] toollabs: Pass in full parent environment to npm start [puppet] - 10https://gerrit.wikimedia.org/r/188985 (https://phabricator.wikimedia.org/T1102) (owner: 10Yuvipanda) [08:52:39] YuviPanda: testing puppet always scares me :P [08:52:51] (03PS7) 10KartikMistry: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 [08:52:56] kart_: :) cherry-pick on deployment-prep, test, and then un-cherrypick :) [08:53:51] YuviPanda: yep. [08:54:44] 3Tool-Labs: Fully puppetize Grid Engine (Tracking) - https://phabricator.wikimedia.org/T88711#1020330 (10faidon) p:5Triage>3Normal [08:58:53] 3Multimedia, operations, MediaWiki-extensions-GWToolset: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1020333 (10fgiunchedi) thanks for the heads up @harej ! space-wise we are in the process of expanding our swift cluster capacity in T1268 and current... [08:59:40] YuviPanda: what was the way to reset to HEAD after cherry-pick on beta? [08:59:46] git reset HEAD? [08:59:52] <_joe_> arlolra: so another log-induced outage, NICE! [09:00:23] kart_: hmm, do a git log, and then just git reset —hard to the hash just before that? [09:01:03] <_joe_> or maybe not, I'm not sure [09:01:20] greetings [09:01:43] thanks for directing harej to phab last night [09:02:15] no problems [09:02:24] i like phabricator! it's very sleek and surprisingly powerful [09:07:56] (03PS8) 10KartikMistry: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 [09:09:50] it really is, better than the bugzilla+rt+etc mixture we had before [09:10:51] YuviPanda: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::labs::instance for i-000006a2.eqiad.wmflabs on node i-000006a2.eqiad.wmflabs - what's this? [09:11:06] updating puppet after cherry-picking patch [09:11:13] kart_: transient failure. just run puppet again [09:11:39] now better error :) [09:11:53] Must pass registry to Class[Role::Cxserver] on node i-000006a2.eqiad.wmflabs [09:12:52] godog, it appears you guys still use RT? [09:12:59] (which apparently is something other than Retweet or Russia Today?) 
[09:13:35] Most stuff goes to phabricator [09:13:36] harej: partly true yeah but it is on its way out [09:16:05] (03PS9) 10KartikMistry: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 [09:18:37] godog: "Starting reconnaissance" [09:18:37] heh [09:19:01] harej: I guess the answer is you can start doing some small batches for testing now [09:20:29] Reedy: they seem to like army words [09:20:47] (03CR) 10Yuvipanda: [C: 04-1] cxserver: Use different registry for Beta and Production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/188796 (owner: 10KartikMistry) [09:21:23] 3Multimedia, operations, MediaWiki-extensions-GWToolset: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1020359 (10Harej) The timeline is whatever you tell me it is. ;) In any case, the plan was to start off small and then increment from there. Nothing... [09:22:28] Presuambly swift does store more than 1 copy of each file? [09:22:37] 3Continuous-Integration, operations: [OPS] Jenkins: puppet master fills /var on labs with yaml reports - https://phabricator.wikimedia.org/T75472#750717 (10hashar) We will get the instance reimaged to have a bigger `/var/` partition ( T87484 ) [09:22:41] yep, it stores 3 [09:23:24] 3Multimedia, operations, MediaWiki-extensions-GWToolset: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1020364 (10Reedy) >>! In T88758#1020359, @Harej wrote: > The timeline is whatever you tell me it is. ;) > > In any case, the plan was to start off s... [09:24:01] well ok tecnically we choose 3, it can be any number >0 [09:24:09] yeah [09:25:22] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [09:27:20] so yeah usable is ~100TB at the moment [09:29:12] that's thumbs and originals too, right? [09:29:19] <_joe_> godog: still rolling-restarting ES? [09:30:29] _joe_: nope, was about to resume yesterday before outage ensued [09:30:46] <_joe_> I was asking because ot the alert [09:31:42] Reedy: yep [09:32:23] 3Multimedia, operations, MediaWiki-extensions-GWToolset: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1020366 (10fgiunchedi) >>! In T88758#1020359, @Harej wrote: > The timeline is whatever you tell me it is. ;) > > In any case, the plan was to start... [09:35:22] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [09:36:53] YuviPanda: going through some more examples for yaml. 
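As a quick sanity check on the figures in the thread above (3 copies per object, ~100 TB usable): usable space is just raw space divided by the replica count. The 300 TB raw figure below is inferred from those numbers, not quoted:

```python
def usable_tb(raw_tb: float, replicas: int = 3) -> float:
    # Swift stores `replicas` full copies, so usable = raw / replicas.
    return raw_tb / replicas

print(usable_tb(300.0))              # ~100 TB usable, matching the thread
print(1.5 / usable_tb(300.0) * 100)  # the 1.5 TB batch is ~1.5% of that
```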
[09:37:34] kart_: cool :) [09:53:17] (03PS10) 10KartikMistry: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 [09:54:45] (03PS11) 10KartikMistry: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 [10:05:51] YuviPanda: still same :/ [10:07:14] 3operations, ops-esams: Invitation to Special Issue and Publish Papers for Free in It - https://phabricator.wikimedia.org/T88788#1020389 (10emailbot) [10:07:58] 3operations, ops-esams: Invitation to Special Issue and Publish Papers for Free in It - https://phabricator.wikimedia.org/T88788#1020394 (10yuvipanda) 5Open>3Invalid a:3yuvipanda [10:08:09] 3operations, ops-esams, Spam-Spam: Invitation to Special Issue and Publish Papers for Free in It - https://phabricator.wikimedia.org/T88788#1020389 (10yuvipanda) [10:08:13] that emailbot is maybe not the best idea [10:09:14] wow already, srsly? [10:10:04] I should start a more firm push towards migrating tools off trusty [10:10:05] err [10:10:06] off precise [10:10:53] /q godog [10:10:56] er :) [10:11:26] paravoid: no /go ? :) [10:11:44] as in scripts.irssi.org/scripts/go.pl [10:11:44] :P [10:12:17] /nodejs dog [10:12:32] haha I see what you did there [10:12:39] kart_: am finishing up some documentation, I’ll take a look after? [10:13:01] I’ve successfully tricked akosiaris into helping out with a much more hairy issue, so I can help out with cxserver for today :) [10:19:00] YuviPanda: hehe. [10:19:07] YuviPanda: take your time. [10:19:41] akosiaris: poke for you too :) [10:19:54] yeah, I am aware [10:19:58] :) [10:20:12] I'll make tea or use sudo make tea. Lets see what will work. [10:27:12] deployment-cxserver (labs?) is showing various kinds of failures, is someone working on it right now? [10:29:03] Nikerabbit: what kinds of failures ? [10:29:22] RECOVERY - NTP on dataset1001 is OK: NTP OK: Offset 0.003860592842 secs [10:29:24] akosiaris: Nikerabbit that’s kart_ testing his patch [10:29:32] ok [10:31:34] 3operations: reimage ms-be2014 - https://phabricator.wikimedia.org/T88790#1020415 (10fgiunchedi) 3NEW a:3fgiunchedi [10:36:44] I've seen that ntp alert cropping up from time to time, any idea why? [10:37:17] I just restarted ntpd on dataset1001 [10:37:26] there's an issue that bblack was tracking down [10:37:35] after reboots, ntp doesn't really sync up until it gets restarted [10:37:49] dataset1001 was rebooted last night [10:38:17] <_joe_> paravoid: when was it? [10:38:22] <_joe_> I didn't reboot it [10:38:47] it was powercycled along with the switch I think [10:39:11] <_joe_> oh ok, that, yes :) [10:39:33] <_joe_> I was thinking later yesterday when I fixed dumps [10:40:03] <_joe_> and yes the problem was it didn't export its filesystems at boot [10:40:33] <_joe_> (which of course is not monitored in any way [10:43:08] there's an alert for rbf2001, is it handled already? [10:43:17] expected, part of the build-up? 
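For the "ntpd never syncs after a reboot" symptom being tracked here, the state is visible in ntpq output: the peer currently selected for synchronization is marked with '*'. A small check along those lines (grace period and alert thresholds left out):

```python
import subprocess

out = subprocess.check_output(["ntpq", "-np"], text=True)
# ntpq prefixes the selected sync peer with '*'; no '*' means not synced.
synced = any(line.startswith("*") for line in out.splitlines())
print("ntp synced" if synced else "ntp NOT synced; a restart of ntpd may be needed")
```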
[10:43:32] <_joe_> I guess so, mutante was reinstalling it as jessie [10:51:04] !log reimage ms-be2014 [10:51:13] Logged the message, Master [11:20:53] kart_: looking at your patch now [11:22:40] PROBLEM - swift-object-server on ms-be2014 is CRITICAL: Connection refused by host [11:22:40] PROBLEM - dhclient process on ms-be2014 is CRITICAL: Connection refused by host [11:22:50] PROBLEM - puppet last run on ms-be2014 is CRITICAL: Connection refused by host [11:22:50] PROBLEM - swift-object-updater on ms-be2014 is CRITICAL: Connection refused by host [11:23:09] PROBLEM - very high load average likely xfs on ms-be2014 is CRITICAL: Connection refused by host [11:23:09] PROBLEM - salt-minion processes on ms-be2014 is CRITICAL: Connection refused by host [11:23:17] aaaand downtimed [11:26:08] <_joe_> I have a script, icinga-schedule-downtime [11:26:21] RECOVERY - very high load average likely xfs on ms-be2014 is OK: OK - load average: 1.75, 1.62, 0.95 [11:26:21] RECOVERY - salt-minion processes on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:26:23] <_joe_> I should maybe share it [11:27:00] RECOVERY - dhclient process on ms-be2014 is OK: PROCS OK: 0 processes with command name dhclient [11:30:10] RECOVERY - swift-object-server on ms-be2014 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [11:30:20] RECOVERY - swift-object-updater on ms-be2014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:33:18] akosiaris: RT email is broken again [11:34:17] I think you fixed the cert issue but perhaps it's still related, can you take a look? [11:34:36] we really do need that to work still [11:36:57] again ? [11:37:01] looking into it [11:37:21] 3operations: reimage ms-be2014 - https://phabricator.wikimedia.org/T88790#1020428 (10fgiunchedi) 5Open>3Resolved done, machine reimaged [11:41:08] (03PS12) 10Yuvipanda: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 (owner: 10KartikMistry) [11:44:03] (03CR) 10KartikMistry: cxserver: Use different registry for Beta and Production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/188796 (owner: 10KartikMistry) [11:44:21] YuviPanda: thanks. See comment. [11:44:30] kart_: it’s still broken, am still fixing [11:44:40] eh. [11:45:18] the yaml structure with empty keys is very confusing [11:45:24] We're discussing really compact format for registry, as current one won't scale. [11:45:38] Right. Even in js, it was confusing. [11:46:20] (03PS13) 10Yuvipanda: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 (owner: 10KartikMistry) [11:48:02] mark____: lwp issues it seems. 'client-warning' => 'Internal response', after saying 500 Can't connect to rt.wikimedia.org:443 [11:48:15] hm [11:48:25] debugging it now [11:57:21] (03PS3) 10Glaisher: Delete vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171219 (https://phabricator.wikimedia.org/T57737) [11:58:24] (03PS4) 10Glaisher: Delete vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171219 (https://phabricator.wikimedia.org/T57737) [12:01:16] (03PS5) 10Glaisher: Redirect ve.wikimedia.org to wikimedia.org.ve [puppet] - 10https://gerrit.wikimedia.org/r/170925 [12:05:40] YuviPanda: should I fix this? https://gerrit.wikimedia.org/r/#/c/188796/13/hieradata/role/common/cxserver/production.yaml [12:05:44] or is it okay? 
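The "empty keys" confusion is a real YAML gotcha: a bare key parses as null, not as an empty string or empty mapping. A tiny demonstration (needs PyYAML; the key names are made up for illustration, not the actual cxserver registry):

```python
import yaml

doc = """
registry:
  source:
    en:       # bare key: parsed as null
    fr: {}    # explicit empty mapping
"""
data = yaml.safe_load(doc)
print(data["registry"]["source"]["en"])  # -> None
print(data["registry"]["source"]["fr"])  # -> {}
```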
[12:24:13] (03PS14) 10KartikMistry: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 [12:34:49] PROBLEM - DPKG on magnesium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:35:50] RECOVERY - DPKG on magnesium is OK: All packages OK [12:45:58] (03Restored) 10Alexandros Kosiaris: Disable LWP SSL hostname verification [puppet] - 10https://gerrit.wikimedia.org/r/188553 (owner: 10Mark Bergsma) [12:46:42] (03CR) 10Alexandros Kosiaris: [C: 032] Disable LWP SSL hostname verification [puppet] - 10https://gerrit.wikimedia.org/r/188553 (owner: 10Mark Bergsma) [12:49:56] mark: Ended up with Perl Library Hell [12:50:06] lol [12:50:08] I gave up. Restored and merged your patch [12:50:12] see [12:50:18] you people who know better ;) [12:50:24] you were right all along :-) [12:50:35] i just imagined that mess and gave up [12:50:43] but good we got the cert chain fixed though ;) [12:50:46] didn't know it was that [12:50:56] I 'll keep that as consolation [12:51:03] goddamn precise [12:51:14] trusty hosts do not have that problem :-( [12:52:35] why didn't we fix the chain? [13:01:47] (03CR) 10Yuvipanda: "Failing with:" [puppet] - 10https://gerrit.wikimedia.org/r/188796 (owner: 10KartikMistry) [13:04:00] kart_: ^ [13:04:33] kart_: I’ve to go now, sadly :( but I fixed the other issues ($registry param was in the wrong place), and also fixed an additional one ($port was really unnecessary in that role) [13:53:10] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [13:55:45] !log upgrading boron to trusty [13:55:52] Logged the message, Master [14:02:37] YuviPanda: Thanks! [14:02:47] I will now poke akosiaris :) [14:05:49] (03PS2) 10Faidon Liambotis: sysctl: make service call init system-agnostic [puppet] - 10https://gerrit.wikimedia.org/r/187429 [14:06:19] bblack: ^ [14:06:24] currently broken on jessie [14:06:35] puppetized sysctl values don't get applied until a reboot right now [14:06:50] (03CR) 10Faidon Liambotis: [C: 032] sysctl: make service call init system-agnostic [puppet] - 10https://gerrit.wikimedia.org/r/187429 (owner: 10Faidon Liambotis) [14:07:04] heh nice [14:07:11] very silent failure :) [14:07:15] what's awesome about that bug is that puppet claims it executed it fine [14:07:24] no, it won't say anything [14:07:25] I should have looked at the other 3 boxes to see if it applied [14:07:27] because it's an onlyif [14:07:33] it just won't mention the Exec at all [14:07:52] Notice: /Stage[main]/Role::Cache::Varnish::Base/Sysctl::Parameters[cache_role_vm_settings]/Sysctl::Conffile[cache_role_vm_settings]/File[/etc/sysctl.d/70-cache_role_vm_settings.conf]/ensure: created [14:07:56] Info: /Stage[main]/Role::Cache::Varnish::Base/Sysctl::Parameters[cache_role_vm_settings]/Sysctl::Conffile[cache_role_vm_settings]/File[/etc/sysctl.d/70-cache_role_vm_settings.conf]: Scheduling refresh of Exec[update_sysctl] [14:08:00] Notice: /Stage[main]/Sysctl/Exec[update_sysctl]: Triggered 'refresh' from 1 events [14:08:03] ^ cp1064 last night [14:08:04] oh hm [14:08:32] let me see if it applied in practice on one of the others I wasn't messing with [14:08:52] no, it didn't [14:09:03] according to /proc/sys/ [14:09:31] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: puppet fail [14:09:36] uh oh [14:09:38] wheee [14:09:53] Feb 6 14:07:19 cp1058 puppet-agent[8639]: 'service procps start' is not qualified and no path was specified. Please qualify the command or specify a path. 
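What made this bug nasty is that puppet reported the Exec refresh as successful while /proc/sys never changed. A stdlib-only sketch of the kind of diff that exposes it, using the conffile named in the log:

```python
from pathlib import Path

conf = Path("/etc/sysctl.d/70-cache_role_vm_settings.conf")
for line in conf.read_text().splitlines():
    line = line.strip()
    if not line or line.startswith("#"):
        continue
    key, want = (part.strip() for part in line.split("=", 1))
    # sysctl dots map to path components under /proc/sys.
    have = Path("/proc/sys", *key.split(".")).read_text().strip()
    print(key, "OK" if have == want else f"MISMATCH (want {want!r}, have {have!r})")
```

Multi-valued settings can differ only in whitespace, so a real check would normalize both sides before comparing.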
[14:09:56] oh for the love [14:09:59] PROBLEM - puppet last run on virt1003 is CRITICAL: CRITICAL: puppet fail [14:10:00] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: puppet fail [14:10:00] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: puppet fail [14:10:00] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: puppet fail [14:10:00] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: puppet fail [14:10:05] of puppet? [14:10:10] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: puppet fail [14:10:10] PROBLEM - puppet last run on mw1111 is CRITICAL: CRITICAL: puppet fail [14:10:11] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: puppet fail [14:10:11] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: puppet fail [14:10:11] PROBLEM - puppet last run on analytics1022 is CRITICAL: CRITICAL: puppet fail [14:10:19] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: puppet fail [14:10:20] PROBLEM - puppet last run on es2004 is CRITICAL: CRITICAL: puppet fail [14:10:20] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: puppet fail [14:10:20] PROBLEM - puppet last run on mw1051 is CRITICAL: CRITICAL: puppet fail [14:10:20] PROBLEM - puppet last run on mw1098 is CRITICAL: CRITICAL: puppet fail [14:10:20] PROBLEM - puppet last run on analytics1016 is CRITICAL: CRITICAL: puppet fail [14:10:20] PROBLEM - puppet last run on mw1049 is CRITICAL: CRITICAL: puppet fail [14:10:21] PROBLEM - puppet last run on analytics1013 is CRITICAL: CRITICAL: puppet fail [14:10:21] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: puppet fail [14:11:54] (03PS1) 10Faidon Liambotis: sysctl: brown-paper bag fix for service's Exec [puppet] - 10https://gerrit.wikimedia.org/r/189010 [14:12:16] (03CR) 10Faidon Liambotis: [C: 032] sysctl: brown-paper bag fix for service's Exec [puppet] - 10https://gerrit.wikimedia.org/r/189010 (owner: 10Faidon Liambotis) [14:12:47] !log bounce diamond on lvs2004/lvs2005 [14:12:52] Logged the message, Master [14:13:32] (03PS2) 10Faidon Liambotis: Add sysfs module, to handle /sys settings [puppet] - 10https://gerrit.wikimedia.org/r/187430 [14:13:42] ok, this we don't need yet, so I won't merge it [14:17:14] (03CR) 10Faidon Liambotis: [C: 04-1] "The only user for this that I had in mind went away and we have no use for this yet. Let's not merge yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/187430 (owner: 10Faidon Liambotis) [14:33:19] !log starting up a fresh round of SSL testing on eqiad upload pooling (cp1064) [14:33:26] Logged the message, Master [14:40:19] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [14:44:59] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [14:45:19] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [14:45:19] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [14:49:59] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [14:50:09] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [14:50:19] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [14:50:19] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [14:52:32] 3operations, hardware-requests, Wikimedia-Logstash: Production hardware for Logstash service - https://phabricator.wikimedia.org/T84958#1020615 (10mark) [14:52:39] 3Continuous-Integration, operations: Jenkins is using php-luasandbox 1.9-1 for zend unit tests; precise should be upgraded to 2.0-7+wmf2.1 or equivalent - https://phabricator.wikimedia.org/T88798#1020616 (10Anomie) 3NEW [14:54:56] (03PS1) 10BBlack: more-aggressive vm tuning for jessie-varnish [puppet] - 10https://gerrit.wikimedia.org/r/189015 [14:54:59] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [14:55:10] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [14:55:19] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [14:55:19] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [14:55:20] PROBLEM - check_puppetrun on payments1002 is CRITICAL: CRITICAL: puppet fail [14:56:33] (03CR) 10BBlack: [C: 032] "This seems likely to help smooth the remaining (tolerable, smaller) sys%/iowait spikes on upload caches." [puppet] - 10https://gerrit.wikimedia.org/r/189015 (owner: 10BBlack) [14:59:38] 3operations, Incident-20150205-SiteOutage, Wikimedia-Logstash: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1020624 (10chasemp) >>! In T88732#1020141, @bd808 wrote: > One possible negative of rsyslog as the shipping transport for structured applicatio... [14:59:42] of course as soon as I say that, I get to see another mini-spike in my graph :p [14:59:59] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [15:00:09] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [15:00:10] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: puppet fail [15:00:19] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [15:00:19] PROBLEM - check_puppetrun on payments1004 is CRITICAL: CRITICAL: puppet fail [15:00:20] RECOVERY - check_puppetrun on payments1002 is OK: OK: Puppet is currently enabled, last run 103 seconds ago with 0 failures [15:01:23] what's with check_puppetrun ? all payments hosts I take it? [15:01:39] godog: that's me, updgraded boron to trusty [15:01:51] puppetmaster [15:02:10] cmjohnson1: ah! thanks I was missing the last bit boron == puppetmaster [15:02:15] is backup4001 on it as well? 
[15:04:43] bblack: yes it is [15:04:59] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 182 seconds ago with 0 failures [15:05:09] PROBLEM - check_puppetrun on indium is CRITICAL: CRITICAL: puppet fail [15:05:10] RECOVERY - check_puppetrun on db1025 is OK: OK: Puppet is currently enabled, last run 292 seconds ago with 0 failures [15:05:19] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: puppet fail [15:05:19] RECOVERY - check_puppetrun on payments1004 is OK: OK: Puppet is currently enabled, last run 233 seconds ago with 0 failures [15:10:09] RECOVERY - check_puppetrun on indium is OK: OK: Puppet is currently enabled, last run 267 seconds ago with 0 failures [15:10:19] RECOVERY - check_puppetrun on payments1001 is OK: OK: Puppet is currently enabled, last run 238 seconds ago with 0 failures [15:18:49] 3operations, ops-eqiad, Incident-20150205-SiteOutage: Restore asw2-a5-eqiad redundant power - https://phabricator.wikimedia.org/T88792#1020654 (10Se4598) [15:34:22] akosiaris: I'd need to bounce ocg to pick up statsd dns change, how can we do that safely? [15:39:01] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 9 below the confidence bounds [15:45:47] <_joe_> godog: bounce as in restart? [15:45:58] <_joe_> restart the three nodes at 10 second distance then [15:46:10] <_joe_> ocg is pretty fast to resume correctly its operations [15:47:44] yeah bounce as in restart [15:48:23] okay I'll try that thanks _joe_ ! [15:48:51] !log restart ocg on ocg1002 to pick up statsd dns changes [15:48:57] Logged the message, Master [15:50:37] !log restart ocg on ocg1003 to pick up statsd dns changes [15:50:40] Logged the message, Master [15:50:50] !log depool -> repool cp1064 varnish-frontend, reduced cache size to 16G, re-enabled compact_memory [15:50:53] Logged the message, Master [15:52:02] mhh on ocg1001 there seem to be two copies of ocg running started one day apart :( [15:53:06] <_joe_> nice [15:53:12] <_joe_> "the upstart way" [15:53:40] nah I don't think it is upstart [15:53:41] root 2601 0.0 0.0 64952 760 ? S Jan28 0:00 sudo -u ocg -g ocg /usr/bin/nodejs-ocg /srv/deployment/ocg/ocg/mw-ocg-service.js -c /etc/ocg/mw-ocg-service.js [15:53:49] <_joe_> oh well [15:54:02] <_joe_> so someone with root? [15:59:02] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [16:05:37] !log bounce ocg on ocg1001 and stop additional ocg instance running [16:05:43] Logged the message, Master [16:13:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [16:15:10] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [16:16:30] PROBLEM - Kafka Broker Messages In Per Second on tungsten is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 45 below the confidence bounds [16:16:44] godog: if you have a moment.. 
https://gerrit.wikimedia.org/r/#/c/188959/ [16:16:51] 3operations, ops-codfw: Rebalance mc locations & update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1020689 (10Papaul) complete mc2004 = port ge-70/0 mc2005 = port ge-7/0/1 mc2006 = port ge-7/0/2 mgtm setup, BIOS configuration and test complete [16:17:30] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [16:18:06] <_joe_> gwicke: wow, that is some config file [16:18:29] <_joe_> gwicke: you could move to namespaced XMLs now that your conf is complex enough :P [16:18:51] most of it is actually a swagger API spec [16:19:17] <_joe_> yeah just making cheap jokes [16:19:46] (03CR) 10Filippo Giunchedi: [C: 031] Update the restbase config for v0.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/188959 (owner: 10GWicke) [16:19:53] gwicke: ah yes, saw that and forgot to +1 [16:20:06] it's pretty cool actually, we now even provide a swagger-ui sandbox [16:21:05] https://github.com/wikimedia/restbase/pull/159 [16:23:24] godog: we need a +2 ;) [16:25:49] can somone merge https://gerrit.wikimedia.org/r/#/c/188374/ so that the State Library of North Carolina can upload the files this weekend? [16:28:38] * Steinsplitter pokes ^d/Reedy [16:29:19] (03PS3) 10Filippo Giunchedi: Update the restbase config for v0.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/188959 (owner: 10GWicke) [16:29:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Update the restbase config for v0.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/188959 (owner: 10GWicke) [16:30:06] gwicke: yep, merged [16:32:25] or godog maybe you can do it? [16:33:14] Steinsplitter: heh, not very comfortable deploying mediawiki-config in general, in particular on a friday sorry :) [16:34:50] the library like to donat 10,000 files, so it would be usefil to have the domain whitelisted soon [16:35:09] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [16:35:31] (03CR) 10Steinsplitter: "https://commons.wikimedia.org/w/index.php?title=User_talk:Steinsplitter&oldid=149169947#GWToolset" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188374 (https://phabricator.wikimedia.org/T76867) (owner: 10Steinsplitter) [16:36:30] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 10, down: 0, shutdown: 0 [16:38:11] 3operations, ops-eqiad, Incident-20150205-SiteOutage: Restore asw2-a5-eqiad redundant power - https://phabricator.wikimedia.org/T88792#1020715 (10Cmjohnson) I checked the power cables today and made sure that they were both fully inserted at the pdu and switch side. PEM 0 is not powering up. Next step is to re-... 
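For the whitelist change just synced: $wgCopyUploadsDomains matches the source hostname exactly unless an entry uses a wildcard. A toy matcher in that spirit, not MediaWiki's implementation (the wildcard entry below is illustrative):

```python
from fnmatch import fnmatch

WHITELIST = ["cdm16062.contentdm.oclc.org", "*.example.org"]  # 2nd entry illustrative

def allowed(host: str) -> bool:
    # True if any whitelist pattern matches the upload source host.
    return any(fnmatch(host, pattern) for pattern in WHITELIST)

print(allowed("cdm16062.contentdm.oclc.org"))  # True
print(allowed("uploads.example.net"))          # False
```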
[16:39:38] 3operations, ops-codfw: Rebalance mc locations & update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1020716 (10Papaul) complete mc2007 = port ge-2/0/0 mc2008 = port ge-2/0/1 mc2009 = port ge-2/0/2 mgtm setup, BIOS configuration and test complete [16:40:21] !log cancel downtime on graphite1001, enable downtime on tungsten pending full decomission [16:40:28] Logged the message, Master [16:41:01] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [16:42:00] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [16:42:24] 3operations, ops-eqiad: please wipe disks of radon - https://phabricator.wikimedia.org/T88740#1020718 (10Cmjohnson) 5Open>3Resolved This task has been completed. [16:43:10] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [16:43:38] 3operations: Place ms1004 server back into the pool - https://phabricator.wikimedia.org/T86933#1020720 (10faidon) Ping? [16:45:44] 3Ops-Access-Requests: Requesting deployment access for nuria - https://phabricator.wikimedia.org/T88760#1020724 (10Tnegrin) manager approved [16:46:30] RECOVERY - Kafka Broker Messages In Per Second on tungsten is OK: OK: No anomaly detected [16:52:15] (03PS4) 10Reedy: Adding cdm16062.contentdm.oclc.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188374 (https://phabricator.wikimedia.org/T76867) (owner: 10Steinsplitter) [16:53:56] (03CR) 10Reedy: [C: 032] Adding cdm16062.contentdm.oclc.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188374 (https://phabricator.wikimedia.org/T76867) (owner: 10Steinsplitter) [16:54:01] (03Merged) 10jenkins-bot: Adding cdm16062.contentdm.oclc.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188374 (https://phabricator.wikimedia.org/T76867) (owner: 10Steinsplitter) [16:54:52] !log reedy Synchronized wmf-config/InitialiseSettings.php: Adding cdm16062.contentdm.oclc.org to wgCopyUploadsDomains (duration: 00m 05s) [16:54:55] Steinsplitter: ^ [16:54:58] Logged the message, Master [16:55:23] 3operations, Phabricator: Delete LikeLifer username - https://phabricator.wikimedia.org/T87092#1020738 (10chasemp) thanks @dzahn [16:56:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:56:09] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [16:58:10] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:58:10] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:58:39] 3Ops-Access-Requests: Requesting deployment access for milimetric - https://phabricator.wikimedia.org/T88769#1020741 (10Andrew) I see two groups, eventlogging-admins and eventlogging-roots. You are already a member of eventlogging-admins, and there are 0 members of eventlogging-roots. Does that mean that you'r... [16:58:39] 3Ops-Access-Requests: Requesting deployment access for milimetric - https://phabricator.wikimedia.org/T88769#1020742 (10Andrew) p:5Triage>3Normal [17:01:51] 3Ops-Access-Requests: Requesting deployment access for milimetric - https://phabricator.wikimedia.org/T88769#1020754 (10Milimetric) Hm, sorry I'm not more familiar with how this is set up. 
[17:02:40] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [17:03:48] 3operations, ops-eqiad: Rack Setup new diskshelf for labstore1001 - https://phabricator.wikimedia.org/T88802#1020757 (10Cmjohnson) 3NEW a:3Cmjohnson [17:04:14] 3Ops-Access-Requests: Requesting sudo access on stat1003 for milimetric - https://phabricator.wikimedia.org/T88803#1020769 (10Milimetric) 3NEW [17:06:04] 3operations, ops-eqiad: Rack Setup new diskshelf for labstore1001 - https://phabricator.wikimedia.org/T88802#1020785 (10coren) That's going to be "fun". I'll have a talk with Andrew (Yuvi is going on vacation) and try to synchronize something for as swiftly as possible, but that needs lead time to get everyone... [17:07:30] 3operations, ops-codfw: Rebalance mc locations & update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1020789 (10Papaul) complete mc2010 = port ge-7/0/0 mc2011 = port ge-7/0/1 mc2012 = port ge-7/0/2 mgmt setup, BIOS configuration and test complete [17:10:14] 3operations, ops-eqiad: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1020797 (10Cmjohnson) 3NEW a:3Cmjohnson [17:16:49] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:22:11] Reedy: thanks a lot :):):) [17:32:19] do we care about those HTTP 5xx req/min alerts flapping? http://grossmeier.net/files/tmp/Selection_017.png <- last day [17:35:26] greg-g: a good portion of that over the past day or two has come from cp1064 testing in production, where we're having issues leading to periodic 503's for upload-lb image urls [17:35:45] (the wide swath of green is when I depooled that to sleep and then work on other things for a while) [17:36:27] bblack: cool, I won't worry then :) [17:36:47] but some of the most recent bits of it, e.g. the wider spike circa 16:40 on https://gdash.wikimedia.org/dashboards/reqerror/ , didn't come from cp1064, so I'm not sure about it all [17:58:19] 3operations, WMF-Design: Better WMF error pages - https://phabricator.wikimedia.org/T76560#1020859 (10Heather) Thanks! Would you mind setting up a meeting with Communications before the text is considered final? -- Whenever you are close. [18:20:02] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [19:13:56] 3operations: re-use server 'radon' as phab failover - https://phabricator.wikimedia.org/T88818#1021045 (10Dzahn) p:5Triage>3Normal [19:14:06] 3operations, Phabricator: re-use server 'radon' as phab failover - https://phabricator.wikimedia.org/T88818#1021035 (10Dzahn) [19:14:24] 3Ops-Access-Requests: Requesting deployment access for milimetric - https://phabricator.wikimedia.org/T88769#1021063 (10Andrew) OK, further research shows that qchris does not actually have those permissions, although he used to. Nuria does have those rights, but by mistake :) So, I'm opening two Ops tickets t...
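(Aside: the flapping "HTTP 5xx req/min" checks above fire when some percentage of recent datapoints exceeds a fixed threshold. One way to reproduce that arithmetic by hand against graphite's render API; the metric name is a placeholder, not the one the check really uses:)

    # What share of the last 30 minutes of a metric sits above 500?
    curl -s 'https://graphite.wikimedia.org/render?target=reqstats.5xx&from=-30min&format=json' | python -c 'import json,sys; pts=[v for v,t in json.load(sys.stdin)[0]["datapoints"] if v is not None]; print("%.2f%% above 500" % (100.0*sum(v>500 for v in pts)/max(len(pts),1)))'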
[19:14:40] PROBLEM - Parsoid on wtp1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:39] 3Ops-Access-Requests: Requesting deployment access for milimetric - https://phabricator.wikimedia.org/T88769#1021080 (10Milimetric) Thanks Andrew, cc @tnegrin [19:18:53] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.423 second response time [19:19:06] !log deployed parsoid hotfix a9dbd4fc (cherry-pick of 76d6658c) [19:19:08] 3Ops-Access-Requests: Requesting deployment access for milimetric - https://phabricator.wikimedia.org/T88769#1021119 (10Andrew) Oh, Toby, please confirm your approval of this right for both Nuria and Milimetric (since Nuria is getting her config updated as a side-effect.) Thx. [19:19:12] Logged the message, Master [19:19:52] 3operations: Upgrade all HTTP frontends to Debian jessie - https://phabricator.wikimedia.org/T86648#1021122 (10GWicke) cp1008 looking SPDY: https://spdycheck.org/#cp1008.wikimedia.org [19:19:57] (03PS1) 10Yuvipanda: toollabs: Fix webservice2 restart when no webservice is running [puppet] - 10https://gerrit.wikimedia.org/r/189043 [19:23:02] 3operations: Puppet should actively purge sudo and access rights not enumerated by the admins module - https://phabricator.wikimedia.org/T88826#1021142 (10Andrew) 3NEW [19:23:20] 3operations: Puppet should actively purge sudo and access rights not enumerated by the admins module - https://phabricator.wikimedia.org/T88826#1021151 (10Andrew) [19:23:31] (03PS2) 10Yuvipanda: toollabs: Fix webservice2 restart when no webservice is running [puppet] - 10https://gerrit.wikimedia.org/r/189043 [19:24:08] (03PS3) 10Yuvipanda: toollabs: Fix webservice2 restart when no webservice is running [puppet] - 10https://gerrit.wikimedia.org/r/189043 [19:24:26] (03PS4) 10Yuvipanda: toollabs: Fix webservice2 restart when no webservice is running [puppet] - 10https://gerrit.wikimedia.org/r/189043 [19:24:47] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Fix webservice2 restart when no webservice is running [puppet] - 10https://gerrit.wikimedia.org/r/189043 (owner: 10Yuvipanda) [19:31:51] 3Ops-Access-Requests: Requesting deployment access for milimetric - https://phabricator.wikimedia.org/T88769#1021175 (10Tnegrin) approved by manager for both Nuria and Milimetric. [19:47:35] 3operations: add direct to task emailing for onsite queues (only for vendors!) - https://phabricator.wikimedia.org/T87454#1021243 (10RobH) [19:47:39] 3operations: add direct to task emailing for onsite queues (only for vendors!) - https://phabricator.wikimedia.org/T87454#991638 (10RobH) [19:47:58] 3operations: add direct to task emailing for onsite queues (only for vendors!) - https://phabricator.wikimedia.org/T87454#1021248 (10RobH) 5Open>3Resolved enabled and tested as working, resolving this task. [19:48:51] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1021252 (10RobH) [19:48:52] 3operations: replace [blog|techblog].wikimedia.org sha1 certificates with sha256 - https://phabricator.wikimedia.org/T88491#1021250 (10RobH) 5stalled>3Resolved confirmed that sha256 is working on the site, so resolving [19:57:11] mutante: how do static sites like {dev,doc}.wikimedia.org get content? I can see operations/puppet has rules specifying content location, but then... ? [20:00:05] spagewmf: at least the puppet part is generated from puppet-doc. the index page is probably somewhere in puppet as a file
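(Aside: the parsoid hotfix logged at 19:19 is a cherry-pick of a master commit onto the deploy branch. In outline it looks like this; the branch name is an assumption, the hash is the one from the !log entry:)

    git checkout deploy                      # deploy branch name is assumed
    git cherry-pick 76d6658c                 # master fix named in the !log above
    git push origin HEAD:refs/for/deploy     # standard Gerrit push for review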
[20:00:20] spagewmf: dev doesn't have content yet, to be determined how [20:01:06] mutante: yes, I'm seeing how it works. Likewise e.g. annual.wikimedia.org, did Heather Walls write puppet .erb files for all the HTML and CSS? 8-) [20:01:09] could be that puppet git::clones or a .deb or a person just doing git pull [20:01:39] spagewmf: no, for annual i just made a request to have a gerrit project created and it was created today [20:02:06] a repo to put the docroot contents in [20:02:12] (03PS1) 10RobH: reclaiming ms1004, removing hostname [dns] - 10https://gerrit.wikimedia.org/r/189052 [20:02:30] (03PS1) 10RobH: reclaiming ms1004 to spares [puppet] - 10https://gerrit.wikimedia.org/r/189053 [20:02:57] mutante: OK, and then puppet (or a cron job?) just updates from git periodically? That's how I assume doc.wikimedia.org's root works, but I don't see the glue to a gerrit repo [20:02:58] spagewmf: there is a puppet role for it, it does the apache config, creates the docroot etc, but it does not deploy all those files [20:03:35] (03PS2) 10RobH: reclaiming ms1004 to spares [puppet] - 10https://gerrit.wikimedia.org/r/189053 [20:03:39] yea, puppet can git clone it, or you can have people do it, we have both [20:04:03] or puppet could install a .deb [20:04:07] we have that too [20:04:10] (03CR) 10RobH: [C: 032] reclaiming ms1004 to spares [puppet] - 10https://gerrit.wikimedia.org/r/189053 (owner: 10RobH) [20:04:38] a .deb containing HTML website? (my mind is blown) [20:05:03] !log ms1004 coming offline, shouldn't page (but disregard if it does) [20:05:07] Logged the message, Master [20:05:47] spagewmf: there are *a lot* of such doc packages in Debian proper [20:05:55] it's also just in gerrit, you would just have to pull and debuild .. and a .deb falls out [20:06:14] if it has the right directory structure once [20:07:04] there is some functionality for automatic deb building in jenkins already [20:07:19] for annual, i'm going to upload the contents of the jorm tarball as an initial commit [20:07:24] https://github.com/wikimedia/operations-debs-jenkins-debian-glue [20:07:48] but i didn't plan to make puppet clone it [20:10:03] spagewmf: if it's just an index.html and one logo or something, i would define them in puppet itself as files, but if it's more, i'd request a separate repo first.. then there are still different options to deploy it. also trebuchet [20:11:05] (03CR) 10RobH: [C: 032] reclaiming ms1004, removing hostname [dns] - 10https://gerrit.wikimedia.org/r/189052 (owner: 10RobH) [20:13:55] 3operations, ops-eqiad: wipe ms1004(WMF3248) / set name to asset tag - https://phabricator.wikimedia.org/T88832#1021318 (10RobH) 3NEW a:3Cmjohnson [20:16:22] 3operations: Place ms1004 server back into the pool - https://phabricator.wikimedia.org/T86933#1021333 (10RobH) p:5Normal>3Low Anyone can decommission based off the notes on https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission I've gone ahead and done so, and it is no longer being mo... [20:17:03] 3operations: reclaim ms1004 back to spares pool (rename to WMF3248) - https://phabricator.wikimedia.org/T86933#1021339 (10RobH) 5Open>3stalled [20:22:20] 3operations, ops-codfw: Rebalance mc locations & update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1021344 (10Papaul) 5Open>3Resolved complete mc2013 = port ge-2/0/0 mc2014 = port ge-2/0/0 mc2015 = port ge-2/0/0 mc2016 = port ge-7/0/0 mc2017 = port...
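(Aside: for the "pull and debuild .. and a .deb falls out" route described above, a minimal sketch, assuming the docroot repo already carries a debian/ packaging directory; the repo name is a placeholder:)

    git clone https://gerrit.wikimedia.org/r/p/example/site-docroot.git  # placeholder repo
    cd site-docroot
    debuild -us -uc   # build an unsigned package; the .deb lands one directory up
    ls ../*.deb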
[20:22:21] 3operations: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1021346 (10Papaul) [20:23:19] 3operations, ops-codfw: rack mw2135 through mw2215 - https://phabricator.wikimedia.org/T86806#1021350 (10Papaul) all 80 mw servers are racked [20:25:50] mutante: I figured out doc.wikimedia.org, modules/contint/manifests/website.pp: # Static files in these docroots are in integration/docroot.git [20:29:08] spagewmf: yea, and you can see some examples for puppet git cloning in that module too, in /manifests/slave-scripts. like git::clone { 'jenkins CI Composer': .. but i also don't see it doing that from the docroot repo specifically, so that might be manual deploy because it doesn't change that often [20:35:12] (03PS1) 10Dzahn: add Cyrillic project domain names [dns] - 10https://gerrit.wikimedia.org/r/189102 (https://phabricator.wikimedia.org/T88722) [20:35:27] 3operations: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1021374 (10RobH) a:5Joe>3RobH claiming this to get the network ports configured and setup [20:37:14] spagewmf: hey :) [20:37:14] 3operations, ops-codfw: Rebalance mc locations & update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1021379 (10RobH) FYI: When it's a 10Gb port on Juniper like these, they are xe-7/0/0 not ge-7/0/0 [20:38:05] spagewmf: doc.wikimedia.org has some very basic files hosted in integration/docroot.git but the automatically generated software docs are built by Jenkins. See https://www.mediawiki.org/wiki/Continuous_integration/Documentation_generation [20:39:03] spagewmf: I am off but feel free to poke the QA mailing list :] [20:41:38] 3operations, ops-codfw: Rebalance mc locations & update mgmt addresses for mc2001-mc2018 memcached servers in codfw - https://phabricator.wikimedia.org/T88693#1021394 (10Papaul) Got it. Thanks [20:43:58] milimetric: I believe that your privs on stats1003 mean you can sudo as user ‘stats’ and that that user has the access you need. [20:44:01] Can you verify? [20:44:16] That seems to be the access pattern that’s used on that box. [20:44:19] (this is re: https://phabricator.wikimedia.org/T88803 ) [20:44:48] thanks, checking [20:45:47] andrewbogott: quite right, I'll close the bug as invalid [20:45:52] awesome, thanks [20:46:42] 3Ops-Access-Requests: Requesting sudo access on stat1003 for milimetric - https://phabricator.wikimedia.org/T88803#1021397 (10Milimetric) 5Open>3Invalid a:3Milimetric My fault, I had the proper access, I just had to use sudo as the "stats" user: sudo -u stats ls -l /a/limn-language-data/datafiles/ [20:59:09] 3operations: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1021444 (10RobH) All network ports have set descriptions and are now enabled in the private vlan for each row. So IP allocation needs to be: mc2001-2006 row a mc2007-2012 row b mc2013-2018 row c I'll set this up shortly a... [21:09:28] (03PS1) 10RobH: setting codfw mc systems dns entries [dns] - 10https://gerrit.wikimedia.org/r/189106 [21:14:53] (03CR) 10RobH: [C: 032] setting codfw mc systems dns entries [dns] - 10https://gerrit.wikimedia.org/r/189106 (owner: 10RobH) [21:15:40] 3operations: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1021450 (10RobH) dns setup.
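(Aside: the puppet git::clone pattern mentioned above for integration/docroot.git boils down to roughly this on the web host; the target path is hypothetical:)

    DOCROOT=/srv/org/wikimedia/doc   # hypothetical docroot path
    if [ -d "$DOCROOT/.git" ]; then
        git -C "$DOCROOT" pull --ff-only   # refresh an existing checkout
    else
        git clone https://gerrit.wikimedia.org/r/p/integration/docroot.git "$DOCROOT"
    fi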
[21:16:31] 3operations: detail hardware requests policy and procedure on wikitech/officewiki - https://phabricator.wikimedia.org/T87626#1021454 (10RobH) 5Open>3Resolved I pushed an email to ops and engineering lists linking to https://wikitech.wikimedia.org/wiki/Operations_requests this outlines serverrequests, as we... [21:17:32] andrewbogott: is there any reason cron wouldn't let me do something like "sudo -u stats cat /some/file/stats/can/see"? It seems to run fine in a script but not through cron [21:18:12] (03Abandoned) 10BryanDavis: Ensure that apache's uid=48 [puppet] - 10https://gerrit.wikimedia.org/r/178690 (https://phabricator.wikimedia.org/T78076) (owner: 10BryanDavis) [21:18:33] milimetric: I’m not sure. But it probably makes the most sense for you to put that in the crontab of user ‘stats’ rather than having your own crontab try to sudo [21:20:02] 3operations, Phabricator: enable email for tickets in domains project - https://phabricator.wikimedia.org/T88842#1021464 (10Dzahn) 3NEW [21:20:27] 3operations: Re: [wikimedia #8856] Resolved: Extension:Translate - https://phabricator.wikimedia.org/T88843#1021471 (10emailbot) [21:20:30] 3operations, Phabricator: enable email for tickets in domains project - https://phabricator.wikimedia.org/T88842#1021475 (10Dzahn) [21:20:46] 3operations, Phabricator: enable email for tickets in domains project? - https://phabricator.wikimedia.org/T88842#1021483 (10Dzahn) [21:52:06] 3operations: Import Cassandra 2.1 packages for Jessie - https://phabricator.wikimedia.org/T88850#1021609 (10GWicke) 3NEW [21:52:51] 3operations: Import Cassandra 2.1 packages for Jessie - https://phabricator.wikimedia.org/T88850#1021627 (10GWicke) [21:52:53] 3operations, Scrum-of-Scrums, RESTBase, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1021626 (10GWicke) [21:54:15] 3operations, RESTBase: Import Cassandra 2.1 packages for Jessie - https://phabricator.wikimedia.org/T88850#1021609 (10GWicke) [21:54:19] mutante: I’m triaging all of these typo-squatter tickets ‘low’ unless you object. [21:54:23] 3operations, RESTBase: Import Cassandra packages for Jessie - https://phabricator.wikimedia.org/T88850#1021631 (10GWicke) [21:54:31] (Mostly just trying to get them out of the clinic queue — I know you’re on top of them.) [21:55:00] andrewbogott: i don't object, what i'm doing is being logged in on RT and checking what is open and doesn't have phab [21:55:28] and .. just got that project for that kind of ticket. so thanks [21:58:34] mutante: I’m really glad you’re worrying about all those typo squatters because I would hate that
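(Aside: on milimetric's cron question at 21:17, the pattern andrewbogott suggests is to install the job in the stats user's own crontab instead of sudo'ing from a personal one. A sketch, with the job line taken from the example in the ticket:)

    sudo -u stats crontab -l   # inspect what the stats user already runs
    sudo -u stats crontab -e   # then add an entry along the lines of:
    # 0 * * * * ls -l /a/limn-language-data/datafiles/ > /tmp/datafiles.list 2>&1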
[22:02:56] 3operations, RESTBase: Import Cassandra packages for Jessie - https://phabricator.wikimedia.org/T88850#1021684 (10Andrew) The link https://wiki.apache.org/cassandra/DebianPackaging is currently down. I'll try to remember to revisit this, but y'all should bug me on IRC any time the site is up and I will do this. [22:03:40] 3operations, RESTBase: Import Cassandra packages for Jessie - https://phabricator.wikimedia.org/T88850#1021694 (10Andrew) p:5Triage>3Normal a:3Andrew [22:05:44] andrewbogott: bug ;) [22:05:52] the wiki is slow for me, but it loads [22:05:55] gwicke: yeah, I see it just came up [22:06:04] Do you happen to have a URL for the actual packages? [22:06:08] If not I can dig [22:06:24] deb http://www.apache.org/dist/cassandra/debian 21x main [22:07:06] do you need a deb line or a link to an individual .deb? [22:07:09] yes, that is a url for the repo [22:07:13] I need to just download the .deb file itself [22:07:17] I see [22:07:33] http://dl.bintray.com/apache/cassandra/pool/main/c/cassandra/ [22:07:42] ah, there they are! Thanks :) [22:07:52] got there by clicking through from the repo url [22:08:01] hm, I tried that, must have aimed badly. [22:08:08] You want 2.1.1 or 2.1.2? [22:08:14] 2.1.2 [22:08:28] ok [22:09:22] ah, dammit, that site is doing something clever, can’t just cut-and-paste the link [22:09:54] yeah [22:10:20] there we go [22:10:38] it works if you remove the $ [22:10:40] # [22:20:19] 3operations, RESTBase: Import Cassandra packages for Jessie - https://phabricator.wikimedia.org/T88850#1021750 (10Andrew) jessie-wikimedia|thirdparty|i386: cassandra 2.1.2 jessie-wikimedia|thirdparty|i386: cassandra-tools 2.1.2 Please confirm that this works, or let me know if you need other packages. [22:31:02] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1021779 (10Chmarkine) [22:32:34] (03CR) 10BBlack: [C: 031] "Looks correct to me. Personally I'd prefer we normalize on lowercase rather than uppercase for the filenames, but either way is technical" [dns] - 10https://gerrit.wikimedia.org/r/189102 (https://phabricator.wikimedia.org/T88722) (owner: 10Dzahn) [22:42:22] Hey party people! [22:42:32] I'm going to do a no-op merge and sync of a beta cluster config change [22:42:38] Just FYI [22:42:48] https://phabricator.wikimedia.org/T78807#1020105 greg-g said it was OK, I promise! [22:43:13] I do. [22:43:52] 3operations, Phabricator: enable email for tickets in domains project? - https://phabricator.wikimedia.org/T88842#1021823 (10Dzahn) p:5Triage>3Normal [22:44:12] (03CR) 10MarkTraceur: [C: 032] "Leerooooooy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181439 (https://phabricator.wikimedia.org/T78807) (owner: 10Gergő Tisza) [22:44:17] (03CR) 10jenkins-bot: [V: 04-1] Deploy Sentry on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181439 (https://phabricator.wikimedia.org/T78807) (owner: 10Gergő Tisza) [22:44:20] Boo. [22:44:21] tgr: ^^ [22:44:25] i am going to restart the parsoid service on the cluster since we've several stuck processes because of tripping on a bug (exposed by a deploy a few weeks back) that throws them into an infinite loop. and our timeout handling and restart is not catching all of them clearly. [22:44:45] i have a fix for the infinite loop in gerrit (https://gerrit.wikimedia.org/r/#/c/189036/) .. that needs to be merged and rt-tested before it can be deployed. [22:44:51] I wonder if halfak is also SSH'd to the cluster from this coffee shop. [22:46:01] marktraceur, halfak and I are all at peace coffee in minneapolis :) [22:46:09] :) [22:46:18] I killed my SSH connection to tin to save the coffee shop some bandwidth. [22:47:27] 3operations, WMF-Legal, Engineering-Community: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#1021838 (10MBrar.WMF) The overall process looks good and is approved. I'm glad you have a director approving all final requests. The only thing we would like to change is the... [22:48:25] WMF Minneapolis office [22:48:41] if andrewbogott were here, it would have been complete. [22:48:47] Pretty much. [22:48:55] Wait, didn't we add someone else recently? [22:48:59] Maybe I dreamt that. [22:49:10] I am here! But not there. [22:49:25] ah right, "here".
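(Aside: the import andrewbogott records in T88850 amounts to downloading the .deb from the bintray directory above and feeding it to a reprepro-managed apt repo. The filename and the thirdparty component are inferred from the log, so treat them as assumptions:)

    wget http://dl.bintray.com/apache/cassandra/pool/main/c/cassandra/cassandra_2.1.2_all.deb
    reprepro -C thirdparty includedeb jessie-wikimedia cassandra_2.1.2_all.deb
    reprepro list jessie-wikimedia | grep cassandra   # confirm both packages landed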
[22:49:26] Boo [22:49:33] andrewbogott: We could set up a VPN through the coffee shop so you could pretend to be here [22:49:42] (03PS4) 10Gergő Tisza: Deploy Sentry on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181439 (https://phabricator.wikimedia.org/T78807) [22:49:46] In all the important ways. [22:49:50] Coffeehouse telepresence robot [22:49:52] And then get the coffee shop banned from editing for being an open proxy [22:49:54] Since we are talking via IRC anyway [22:50:13] andrewbogott: Throws espressos in people's faces for you [22:50:20] (03CR) 10MarkTraceur: [C: 032] Deploy Sentry on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181439 (https://phabricator.wikimedia.org/T78807) (owner: 10Gergő Tisza) [22:50:25] (03Merged) 10jenkins-bot: Deploy Sentry on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181439 (https://phabricator.wikimedia.org/T78807) (owner: 10Gergő Tisza) [22:50:31] Freshly brewed sustainably grown espressos [22:50:59] marktraceur: can I get you a refill? https://www.youtube.com/watch?v=VXrBowsNFis [22:51:46] Good job. [22:52:04] (03CR) 10MaxSem: [C: 031] add Cyrillic project domain names [dns] - 10https://gerrit.wikimedia.org/r/189102 (https://phabricator.wikimedia.org/T88722) (owner: 10Dzahn) [22:52:38] subbu: peace coffee! [22:53:03] !log restarted parsoid service to kill several stuck processes on multiple nodes [22:53:05] I so wanted to be a bike delivery person for them [22:53:09] Logged the message, Master [22:53:14] greg-g, oh, you were here when they opened? [22:53:20] !log marktraceur Synchronized wmf-config/: [friday] beta config change for tgr (duration: 00m 09s) [22:53:23] Logged the message, Master [22:53:27] yeah, I was in MPLS from 2001 - 2007 [22:53:44] ah, i guess i didn't realize peace coffee has been around that long then :) [22:53:46] Seems to have not broken the site [22:53:54] tgr: How's beta lookin'? [22:54:01] marktraceur: heh, nice [friday] [22:54:10] greg-g: I was going to do #yolo [22:54:25] subbu: I think the delivery service has been around a long time, having an actual coffee shop is new. [22:54:38] aaah .. ok. [22:55:36] yeah [22:56:08] except the coffee you could get from them at their roasting facility next to the greenway bridge [22:56:18] oh the greenway.... [22:56:20] #memories [22:57:03] PROBLEM - Host radon is DOWN: PING CRITICAL - Packet loss = 100% [23:01:18] ACKNOWLEDGEMENT - Host radon is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn T88818 T88740 [23:01:57] Hm, that seems familiar [23:02:35] Maybe not. [23:02:36] Shrug [23:14:51] marktraceur: Sentry is loading, nothing seems to be in flames [23:16:34] Yay [23:17:43] PROBLEM - Disk space on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:17:43] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:20:50] 3operations: ca.wikimedia wiki - sidebar in French won't work... - https://phabricator.wikimedia.org/T88843#1021868 (10Dzahn) [23:24:15] 3operations: ca.wikimedia wiki - sidebar in French won't work... - https://phabricator.wikimedia.org/T88843#1021876 (10Dzahn) Hello Benoit, we switched the ticket system we are using to Phabricator and since you replied to the ticket that used to be on RT, this got automatically forwarded and created a ticket...
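(Aside: one plausible shape for the multi-node parsoid restart logged at 22:53, assuming salt targeting of the wtp* hosts; the glob is a guess, and this is not necessarily how it was actually done:)

    salt 'wtp1*' service.restart parsoid   # restart parsoid on every matching node
    salt 'wtp1*' service.status parsoid    # then check that it came back up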
[23:26:24] PROBLEM - manage_nfs_volumes_running on labstore1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/local/sbin/manage-nfs-volumes [23:26:31] 3operations, Scrum-of-Scrums, RESTBase, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1021886 (10GWicke) [23:26:32] 3operations, RESTBase: Import Cassandra packages for Jessie - https://phabricator.wikimedia.org/T88850#1021884 (10GWicke) 5Open>3Resolved Verified working in labs. Thank you, @andrew! [23:27:24] RECOVERY - manage_nfs_volumes_running on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/sbin/manage-nfs-volumes [23:27:54] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [23:35:46] ^ that spike was from me, it was fairly brief [23:36:00] 3operations, RESTBase: Set up cassandra monitoring - https://phabricator.wikimedia.org/T78514#1021908 (10GWicke) @fgiunchedi, we have a few old labs vms running cassandra in the 'services' project. Those are still on trusty though, and the plan is to deploy with Jessie. Same for the prod test hosts xenon, praseo... [23:39:44] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:43:45] greg-g, what is the protocol if we want to do a cherry-pick deploy tomorrow to patch an infinite loop bug? It hadn't been an issue for almost 3 weeks but it triggered on a few pages y'day and today that required us to restart parsoid to kill some stuck processes that didn't get caught by our timeout handling. we may not get any reqs. to problem pages over the weekend and it may be fine to wait till monday ... but you never know. [23:50:21] (03PS1) 10Krinkle: contint: Allow ssh between labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/189132 [23:50:55] (03CR) 10jenkins-bot: [V: 04-1] contint: Allow ssh between labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/189132 (owner: 10Krinkle) [23:51:17] subbu: if something breaks and you need to fix it, fix it. [23:51:18] contint displays self-awareness, vetoes all patches that alter contint [23:51:36] I can't let you do that, Dave. [23:52:31] greg-g, ok, will do. [23:53:38] andrewbogott: tabs ?! it seems [23:53:49] 23:50:48 + HAS_TAB=1 [23:54:03] i guess the puppet-lint update was deployed? [23:54:06] That’s just what it wants you to think, man [23:54:11] heh [23:54:11] Oh, maybe? [23:54:21] https://integration.wikimedia.org/ci/job/operations-puppet-tabs/8979/console [23:54:53] it's the only FAILURE that doesn't have "non-voting" [23:55:00] called puppet-tabs [23:55:28] and a tab is added on line 7 in that [23:55:51] Krinkle: hear that? Your patch was vetoed strictly for containing a tab [23:56:09] (03CR) 10Dzahn: "please remove the literal tab character to make jenkins puppet-lint happy" [puppet] - 10https://gerrit.wikimedia.org/r/189132 (owner: 10Krinkle) [23:59:21] (03PS2) 10Dzahn: contint: Allow ssh between labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/189132 (owner: 10Krinkle) [23:59:58] andrewbogott: Patch Set 2: Verified+2
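(Aside: judging from the "+ HAS_TAB=1" xtrace output quoted above, the operations-puppet-tabs job is essentially a grep for literal tabs in the changed puppet code. An approximation, not the actual job script:)

    # Fail if any line added by the latest commit contains a literal tab.
    if git diff --unified=0 HEAD~1 -- '*.pp' | grep -qP '^\+.*\t'; then
        HAS_TAB=1
        echo 'literal tab character found; puppet-lint will reject this'
        exit 1
    fi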