[00:08:24] <grrrit-wm>	 (CR) Dereckson: [C: 1] Enable signature button at NS:102 for frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/272479 (https://phabricator.wikimedia.org/T127688) (owner: MarcoAurelio)
[00:08:58] <grrrit-wm>	 (CR) Dereckson: "As indicated on the ticket, the namespace is to organize content, not content per se, so no, you don't need to add it there." [mediawiki-config] - https://gerrit.wikimedia.org/r/272479 (https://phabricator.wikimedia.org/T127688) (owner: MarcoAurelio)
[00:14:08] <grrrit-wm>	 (PS2) Dereckson: Enabled ogg opus support for TimedMediaHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:23:30] <grrrit-wm>	 (PS3) Dereckson: Enabled ogg opus support for TimedMediaHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:23:44] <icinga-wm>	 PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: puppet fail
[00:24:06] <urandom>	 ori: you around?
[00:24:44] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.32.205:9042 on restbase1013 is OK: TCP OK - 0.006 second response time on port 9042
[00:25:31] <grrrit-wm>	 (CR) Dereckson: [C: 1] "PS3: rebased" [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:26:21] <grrrit-wm>	 (CR) Paladox: "@Dereckson how do I do that please." [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:27:13] <grrrit-wm>	 (CR) Dereckson: [C: 1] "@Glaisher Could you schedule it for SWAT?" [mediawiki-config] - https://gerrit.wikimedia.org/r/252627 (owner: Glaisher)
[00:28:16] <Dereckson>	 paladox: you have the details on https://wikitech.wikimedia.org/wiki/SWAT_deploys and the checklist of points to verify
[00:28:25] <paladox>	 Ok thanks
[00:29:07] <Dereckson>	 paladox: if you think your patch matches these criteria (it looks so to me), you can add it to the table at https://wikitech.wikimedia.org/wiki/Deployments#Week_of_March_28th
[00:29:32] <paladox>	 Dereckson: thanks do i do it for morning or afternoon
[00:29:35] <Dereckson>	 you need to be on this channel at the deployment hour, and be ready to test if the change works as expected
[00:29:58] <Dereckson>	 there are two windows to allow for several timezones and working hours, so it's up to you
[00:30:33] <grrrit-wm>	 (CR) Jforrester: [C: -2] "We do not test things in production before we put them in master except in very rare circumstances. Unless Brion OKs this, please do not d" [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:30:38] <paladox>	 Dereckson: Oh, i wont be able to test since i only proposed the patch. thedj would you be able to test it please
[00:30:50] <paladox>	 thedj https://gerrit.wikimedia.org/r/#/c/256967/
[00:31:26] <Dereckson>	 James_F: what do you suggest, to test it on beta cluster?
[00:31:58] <James_F>	 Dereckson: If the config setting isn't good enough to enable in the extension by default, local testing would be a start. :-)
[00:34:21] <Dereckson>	 I was under the assumption it had been tested, as it's a part of the extension, just not enabled by default.
[00:35:51] <James_F>	 I'm not sure it's been tested on a multi-wiki site, for instance. Then into Beta Cluster.
[00:35:54] <brion>	 yeah no that patch looks super wrong
[00:36:05] <brion>	 it would disable ogg vorbis output
[00:36:07] <James_F>	 We don't throw things into live production and then ask Commons community members to tell us if it works for them.
[00:36:10] <James_F>	 That's not cool.
[00:36:21] <urandom>	 bd808: are you around?
[00:36:35] <Dereckson>	 Indeed, that's not.
[00:36:48] <James_F>	 Dereckson: That's the justification on that patch, though.
[00:38:14] <grrrit-wm>	 (CR) Brion VIBBER: [C: -1] "Looking quickly at this config patch, appears that it would enable Opus but disable Vorbis output for non-Vorbis sources, which could be b" [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:38:40] <Dereckson>	 As a side question, it could be interesting to revisit the extension code. Leaving toxic code, even commented out, in the master branch isn't helpful.
[00:40:46] <James_F>	 Dereckson: Oh, indeed. TMH is a complete nightmare. thedj and brion are awesome for trying to tackle it.
[00:40:58] <brion>	 it'll be a while yet before it doesn't suck ;)
[00:41:24] <brion>	 i would actually like to enable opus for audio at some point
[00:41:37] <brion>	 ms edge should be getting native opus decoding in future
[00:42:08] <James_F>	 brion: As a transcoding target or just as an alternative format?
[00:42:28] <brion>	 James_F: as transcode target; we already allow uploads i think if you format em right
[00:42:39] <James_F>	 So, three of them? ;-)
[00:42:58] <brion>	 hmm actually that reminds me
[00:43:02] <James_F>	 Do we have to give Ops the heads-up if we add new transcode targets? Can't they use quite a lot of space?
[00:43:06] <brion>	 for edge i don't know if they'll support ogg container :D
[00:43:18] <brion>	 so might need an opus-in-webm audio output ;)
[00:43:23] <James_F>	 We should ship inside mkv. ;-)
[00:43:30] <brion>	 James_F: yeah i would recommend that. for audio only it shouldn't be huge amount of space though
[00:43:35] * James_F mumbles about container fanboyism.
[00:43:38] <brion>	 webm == mkv, almost ;)
[00:43:42] <brion>	 well subset
[00:43:48] * James_F nods.
[00:44:18] <brion>	 edge is testing vp9 again in latest preview build \o/
[00:44:21] <brion>	 still experimental
[00:44:30] <brion>	 and doesn't yet work with our stuff
[00:44:30] <Dereckson>	 I've prepared https://gerrit.wikimedia.org/r/278427 to add a warning about this setting, would that be valuable to merge?
[00:44:56] <James_F>	 brion: Details. :-)
[00:44:59] <brion>	 Dereckson: that looks wrong
[00:45:04] <brion>	 Dereckson: should be fine to enable them both at once
[00:45:16] <brion>	 the config patch was only enabling opus and disabling vorbis
[00:45:28] <brion>	 if it were correct it'd work fine afaik
[00:49:32] <James_F>	 Equivalent to the difference between 'enwiki' => … and '+enwiki' => … :-)
[00:50:16] <brion>	 yeah i think it might work with += instead of =
[00:50:30] <brion>	 or... /me is always suspicious of php arrays
[00:51:09] <Dereckson>	 And config could be easier to read with the two explicitly noted.
[00:52:07] <Dereckson>	 So would it be acceptable to (1) define wgEnabledAudioTranscodeSet to both WebVideoTranscode::ENC_OGG_OPUS and WebVideoTranscode::ENC_OGG_VORBIS, (2) enable this on beta.wmflabs.org, or is further testing needed beforehand?
[00:52:08] <brion>	 ok $foo[] = blah would be better here
[00:52:38] <brion>	 Dereckson: i think that should be ok yeah, either set them both explicitly or add the ENC_OGG_OPUS on top without replacing the whole array
[00:52:58] <brion>	 note it's a straight vector of string constant keys, not an associative array
[00:53:05] <brion>	 so it's a little funky compared to a lot of our settings
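Editor's note on the exchange above: the three options discussed (`=`, `+=`, and `$foo[] = ...`) behave differently on a plain list-style PHP array. The sketch below models that in Python by treating a PHP list as a dict with integer keys. The helper names and the `VORBIS`/`OPUS` strings are hypothetical stand-ins, and the single-entry default set is an assumption for illustration, not the extension's actual default.

```python
# Model a PHP list-style array as a dict with integer keys, mirroring how
# PHP stores ['a', 'b'] as [0 => 'a', 1 => 'b'].

def php_list(*values):
    """Build a PHP-like list array: consecutive integer keys from 0."""
    return dict(enumerate(values))

def php_assign(_current, new):
    """Model `$arr = $new;`: the old contents are discarded."""
    return dict(new)

def php_union(current, new):
    """Model `$arr += $new;`: keys already present on the left win."""
    merged = dict(current)
    for key, value in new.items():
        merged.setdefault(key, value)
    return merged

def php_append(current, value):
    """Model `$arr[] = $value;`: store under the next free integer key."""
    merged = dict(current)
    next_key = max(merged, default=-1) + 1
    merged[next_key] = value
    return merged

# Hypothetical stand-ins for the WebVideoTranscode::ENC_OGG_* constants.
VORBIS, OPUS = "ogg-vorbis", "ogg-opus"
default_set = php_list(VORBIS)  # assumed default, for illustration only

replaced = php_assign(default_set, php_list(OPUS))
unioned = php_union(default_set, php_list(OPUS))
appended = php_append(default_set, OPUS)

print(list(replaced.values()))  # ['ogg-opus']
print(list(unioned.values()))   # ['ogg-vorbis']
print(list(appended.values()))  # ['ogg-vorbis', 'ogg-opus']
```

Under this model, plain assignment drops Vorbis, the union silently drops Opus because both lists use key 0, and only the append form keeps both, which is consistent with brion preferring `$foo[] = blah`.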
[00:54:41] <grrrit-wm>	 (CR) Brion VIBBER: "Best to test on beta cluster first. :) Looks like we should either include the vorbis setting in this array explicitly as well, or only ap" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:54:55] <brion>	 Dereckson: my sample line there should i think work
[00:56:26] <grrrit-wm>	 (PS4) Dereckson: Enabled Ogg Opus support for TimedMediaHandler [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:56:35] <brion>	 might also consider holding off on opus audio until can do audio-only transcode output in webm container as that's more likely to be supported by future devices than opus in ogg, but i'm not against it :)
[00:57:44] <grrrit-wm>	 (CR) Dereckson: "PS4: setting moved to beta, default codec not overwritten" [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[00:58:07] <grrrit-wm>	 (CR) Brion VIBBER: [C: 1] "Looks like it should work correctly, and now is moved over to beta. :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[01:00:10] <grrrit-wm>	 (CR) Brion VIBBER: "Note it may be more worthwhile to actually deploy opus audio in .webm container instead of Ogg/.opus, as Matroska/WebM container is being " [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[01:17:06] <grrrit-wm>	 (PS3) Dereckson: Add initial rescore profiles for wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/257607 (https://phabricator.wikimedia.org/T110648) (owner: DCausse)
[01:17:49] <grrrit-wm>	 (CR) Dereckson: "PS3: rebased, and moved config to *-labs per EBernhardson comment." [mediawiki-config] - https://gerrit.wikimedia.org/r/257607 (https://phabricator.wikimedia.org/T110648) (owner: DCausse)
[01:22:55] <icinga-wm>	 PROBLEM - puppet last run on mw2110 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:23:12] <grrrit-wm>	 (PS2) Dereckson: Reduce sampling rate for language switcher [mediawiki-config] - https://gerrit.wikimedia.org/r/272724 (https://phabricator.wikimedia.org/T127212) (owner: Bmansurov)
[01:24:30] <grrrit-wm>	 (CR) Dereckson: "PS2: rebased, added reference to the task ID" [mediawiki-config] - https://gerrit.wikimedia.org/r/272724 (https://phabricator.wikimedia.org/T127212) (owner: Bmansurov)
[01:29:00] <ori>	 urandom: I wasn't then, but I am now
[01:29:05] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[01:29:13] <urandom>	 ori: hi! :)
[01:29:16] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[01:29:34] <urandom>	 ori: i was going to see if i could convince you to +2 https://gerrit.wikimedia.org/r/#/c/278402/1
[01:30:20] <urandom>	 it will add the config or a second cassandra instance on restbase1013, part of a long-running and totally boring (read: safe) process
[01:30:28] <urandom>	 s/or a second/for a second/
[01:30:35] <ori>	 Heh, I was just about to say -- I can't meaningfully review it, but if you tell me it's safe, that's fine by me.
[01:30:53] <ori>	 What's the worst that could happen?
[01:30:55] <urandom>	 it's so routine at this point it is bordering on tedious
[01:31:10] <urandom>	 the new instance could fail to bootstrap leaving exactly where we are now
[01:31:19] <urandom>	 leaving us, that is
[01:31:27] <grrrit-wm>	 (CR) Ori.livneh: [C: 2] enable instance 'b'; restbase1013-b [puppet] - https://gerrit.wikimedia.org/r/278402 (https://phabricator.wikimedia.org/T125842) (owner: Eevans)
[01:31:34] <urandom>	 ori: awesome, thanks
[01:31:48] <grrrit-wm>	 (PS5) Dereckson: Remove $wgCopyrightIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) (owner: Florianschmidtwelzow)
[01:32:14] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Remove $wgCopyrightIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) (owner: Florianschmidtwelzow)
[01:32:50] <ori>	 puppet-merge is being s    l o    w
[01:32:54] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:32:54] <ori>	 so it is not merged yet
[01:33:01] <ori>	 and look, we fixed the site! ^
[01:33:05] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:33:21] <urandom>	 it's magic!
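Editor's note on the 5xx alerts above: they report the share of recent datapoints above a threshold (1000.0 for CRITICAL, 250.0 in the recovery message). A minimal sketch of that kind of check follows. The function names, the sample data, and the 1% cut-off are assumptions for illustration; this is not the actual monitoring check's implementation.

```python
def percent_above(samples, threshold):
    """Return the percentage of samples strictly above threshold."""
    if not samples:
        return 0.0
    over = sum(1 for value in samples if value > threshold)
    return 100.0 * over / len(samples)

def classify(samples, warning=250.0, critical=1000.0, min_percent=1.0):
    """Map a window of samples to a state; min_percent is an assumed cut-off."""
    if percent_above(samples, critical) >= min_percent:
        return "CRITICAL"
    if percent_above(samples, warning) >= min_percent:
        return "WARNING"
    return "OK"

# Hypothetical per-minute 5xx counts: one of ten samples spikes.
spike = [120, 90, 110, 1500, 100, 95, 130, 105, 99, 101]
calm = [120, 90, 110, 100, 95, 130, 105, 99, 101, 115]

print(percent_above(spike, 1000.0), classify(spike))  # 10.0 CRITICAL
print(percent_above(calm, 250.0), classify(calm))     # 0.0 OK
```

With a ten-sample window, a single sample over the critical threshold is enough to read "10.00% of data above the critical threshold", so a short spike can raise and clear the alert within a few minutes.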
[01:34:07] <ori>	 finally merged
[01:34:19] <ori>	 do you need me to run puppet somewhere? you have sudo on the relevant nodes, right?
[01:34:25] <urandom>	 i do yes
[01:34:34] <urandom>	 just fired it off
[01:37:35] <urandom>	 ori: there will be a cql service failure for the new instance shortly, totally expected, i'll ack it when the time comes
[01:38:29] <urandom>	 it'll clear when the node finishes bootstrapping and goes online
[01:38:47] <grrrit-wm>	 (PS6) Dereckson: Remove $wgCopyrightIcon [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) (owner: Florianschmidtwelzow)
[01:43:32] <grrrit-wm>	 (CR) Dereckson: [C: 1] "PS6: rebased, .gitignore doesn't contain any reference to images anymore by the way" [mediawiki-config] - https://gerrit.wikimedia.org/r/261999 (https://phabricator.wikimedia.org/T122754) (owner: Florianschmidtwelzow)
[01:47:56] <icinga-wm>	 RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[01:49:24] <icinga-wm>	 RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[01:54:05] <urandom>	 !log bootstrapping restbase1013-b.eqiad.wmnet : T125842
[01:54:06] <stashbot>	 T125842: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842
[01:54:09] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:54:16] <urandom>	 better late than never...
[01:54:30] <grrrit-wm>	 (CR) Paladox: "Thanks." [mediawiki-config] - https://gerrit.wikimedia.org/r/256967 (owner: Paladox)
[02:05:33] <wikibugs>	 Operations, Commons, MediaWiki-Uploading, Multimedia, and 3 others: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2135995 (matmarex) There's actually an UploadWizard-specific bit here, see T130437. That only affects Internet Explorer, o...
[02:06:27] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is CRITICAL: Connection refused
[02:06:48] <urandom>	 ^^^ there it is; got this
[02:07:18] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is CRITICAL: Connection refused eevans Bootstrapping - The acknowledgement expires at: 2016-03-20 02:07:03.
[02:13:06] <grrrit-wm>	 (PS2) Dereckson: Flake8 for labstore and wdqs [puppet] - https://gerrit.wikimedia.org/r/278270 (owner: Ladsgroup)
[02:13:17] <grrrit-wm>	 (CR) Dereckson: [C: 1] Flake8 for labstore and wdqs [puppet] - https://gerrit.wikimedia.org/r/278270 (owner: Ladsgroup)
[02:23:15] <logmsgbot>	 !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.17) (duration: 10m 07s)
[02:23:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:23:38] <icinga-wm>	 PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: puppet fail
[02:24:17] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:27:38] <grrrit-wm>	 (CR) Smalyshev: [C: 1] Flake8 for labstore and wdqs [puppet] - https://gerrit.wikimedia.org/r/278270 (owner: Ladsgroup)
[02:27:56] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[02:28:07] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:29:48] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[02:31:46] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Mar 19 02:31:46 UTC 2016 (duration 8m 31s)
[02:31:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:36:52] <grrrit-wm>	 (CR) Tim Landscheidt: [C: 1] Flake8 for labstore and wdqs [puppet] - https://gerrit.wikimedia.org/r/278270 (owner: Ladsgroup)
[03:33:36] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:35:17] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[03:35:24] <grrrit-wm>	 (PS2) Tim Landscheidt: diamond: Remove unnecessary/incorrect include of stdlib [puppet] - https://gerrit.wikimedia.org/r/273483
[03:35:26] <grrrit-wm>	 (PS3) Tim Landscheidt: diamond: Fix comments [puppet] - https://gerrit.wikimedia.org/r/273451
[03:40:36] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:42:16] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[04:11:57] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:15:18] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[04:32:56] <sabya>	 mutante:o/
[04:34:36] <sabya>	 getting this error when I run puppet agent -tv after applying a puppet::self role. It used to work a few days ago. Did anything change? https://gist.github.com/sabyasachi/90439a41fb564a605b6c
[04:35:50] <sabya>	 in this instance: https://wikitech.wikimedia.org/wiki/Nova_Resource:Sabya4.ores-staging.eqiad.wmflabs
[04:37:25] <ori>	 sabya: the Puppet run is failing because Puppet is configured to ensure the Puppetmaster service is running, and the service failed to start. Did you follow the advice in the output?
[04:37:31] <ori>	 > Job for puppetmaster.service failed. See 'systemctl status puppetmaster.service' and 'journalctl -xn' for details. 
[04:39:15] <sabya>	 ok.
[04:42:03] <sabya>	 puppetmaster failed to start because of cert errors
[04:44:43] <sabya>	 ori: Could not request certificate: Connection refused - connect(2) for "localhost" port 8140
[04:44:51] <sabya>	 could this be the reason?
[04:48:50] <ori>	 it probably is, yeah
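Editor's note on the error above: "Connection refused" on port 8140 means nothing was accepting TCP connections there, which fits the puppetmaster service having failed to start. A minimal sketch of a probe for that condition, using only the Python standard library; the function name is hypothetical and the host and port are simply those quoted in the error message.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False

if __name__ == "__main__":
    # 8140 is the port named in the error message above.
    print(port_is_open("localhost", 8140))
```

A False result only says the port is unreachable. The reason the service is down still has to come from `systemctl status puppetmaster.service` and `journalctl -xn`, as the Puppet output suggests.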
[04:52:31] <grrrit-wm>	 (CR) EBernhardson: "actually my labs comment was about something slightly different, the labs here is the beta cluster, which has a tiny index." [mediawiki-config] - https://gerrit.wikimedia.org/r/257607 (https://phabricator.wikimedia.org/T110648) (owner: DCausse)
[04:54:47] <grrrit-wm>	 (CR) EBernhardson: "another thing that might be useful, we are working up a relevance forge project which is about being able to run sets of queries and judge" [mediawiki-config] - https://gerrit.wikimedia.org/r/257607 (https://phabricator.wikimedia.org/T110648) (owner: DCausse)
[04:59:56] <sabya>	 ori: got it working.
[05:00:02] <ori>	 \o/
[05:00:43] * sabya is noob in puppet
[05:05:27] <icinga-wm>	 PROBLEM - puppet last run on mw2038 is CRITICAL: CRITICAL: puppet fail
[05:16:58] <mdholloway>	 ori: i'm wondering if the mobileapps flapping was due to that restbase change a little bit beforehand.  reading the backscroll, urandom seemed confident it wouldn't cause trouble, though, so who knows.  we haven't changed anything lately.
[05:17:04] <mdholloway>	 ori: i'll keep an eye on it.
[05:18:48] <mdholloway>	 ori: thanks again for the heads-up
[05:19:01] <ori>	 mdholloway: np -- I still think something is amiss, tho
[05:19:12] <ori>	 you can see in ganglia that CPU usage is lower than it has been in recent days:
[05:19:26] <ori>	 http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Service+Cluster+B+eqiad&h=scb1001.eqiad.wmnet&jr=&js=&v=4.9&m=cpu_user&vl=%25&ti=CPU+User
[05:19:44] <bearND>	 i see service workers dying left and right
[05:19:48] <bearND>	 https://www.irccloud.com/pastebin/nlhv5oQm/
[05:19:51] <mdholloway>	 ori: you're right, that doesn't look good.
[05:19:55] <ori>	 more worryingly, bytes in/out are flat
[05:19:58] <bearND>	 memory issues
[05:20:06] <ori>	 http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Service+Cluster+B+eqiad&h=scb1001.eqiad.wmnet&jr=&js=&v=82660.59&m=bytes_out&vl=bytes%2Fsec&ti=Bytes+Sent
[05:22:18] <bearND>	 ori: interesting spike before it got so low
[05:24:06] <mdholloway>	 bearND: yeah, i saw that too.  memory use went crazy for some reason on both scb1001 & scb1002 ca. 02:17 UTC
[05:25:02] <ori>	 the spike is neatly symmetrical with a spike on logstash -- http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Logstash+cluster+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
[05:25:15] <ori>	 so probably there was a bust of log messages around that time
[05:25:22] <ori>	 *burst
[05:25:52] <ori>	 the numbers match, ~3M/s at peak
[05:26:03] <mdholloway>	 ori: yep.
[05:27:12] <ori>	 I have to run, sorry :-/ don't hesitate to page ops if there is substantial user impact and you are blocked
[05:28:43] <bearND>	 ori thanks for notifying us
[05:29:29] <mdholloway>	 ori: no worries, thanks again!
[05:30:30] <mdholloway>	 bearND: the app seems to still be working fine, at least
[05:31:28] <icinga-wm>	 RECOVERY - puppet last run on mw2038 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[05:31:49] <bearND>	 mdholloway: i think that's due to pre-generation. Most pages should be stored in RB Cassandra already. The impact most likely would be that latest revisions don't get updated as quickly as they used to
[05:43:03] <bearND>	 hmm, if I read https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase correctly it looks like other RB services are having heap issues, too
[05:44:06] <bearND>	 restbase1003 through 1011
[05:46:19] <mdholloway>	 bearND: hmm, yeah.  
[05:46:56] <mdholloway>	 bearND: if you change the timeline to 6 hours ago through a few seconds ago, looks like there was a huge spike around 00:20
[05:47:07] <ebernhardson>	 i don't know much about cassandra, but i do wonder a bit what limits those are, 460M heap total seems tiny for a java process
[05:48:39] <bearND>	 could be. I'm no cassandra expert.
[05:50:35] <ebernhardson>	 the heap usage graph in grafana looks fairly typical for a java process, it's not really running out of memory because gc looks to be collecting it back down to 20-40% (although i'm mostly just guessing here too): https://grafana.wikimedia.org/dashboard/db/restbase-cassandra-gc?panelId=34&fullscreen
[05:51:16] <bearND>	 I was going to see if VE would be impacted but strangely I only see wikitext editing on enwiki or eswiki. Did I miss anything re: VE lately?
[05:53:34] <bearND>	 mdholloway: does VE work for you?
[05:54:41] <mdholloway>	 bearND: i'm not seeing it on enwiki either
[05:54:45] <grrrit-wm>	 (CR) OliverKeyes: "As mentioned on Phabricator I'm not quite whether this change now adds anything: the underlying pageview definition has been patched for q" [mediawiki-config] - https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T116609) (owner: Nemo bis)
[05:57:04] <mdholloway>	 still working on mediawiki.org
[06:02:25] <bearND>	 ok
[06:11:01] <bearND>	 mdholloway: not sure what we can do here. I'm going to send email to gwicke, mobrovac and the rest of the services team
[06:12:01] <mdholloway>	 bearND: sounds good to me. especially from the more recent errors in https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase this seems likely a restbase/cassandra issue
[06:13:48] <bearND>	 mdholloway: i tend to agree
[06:16:07] <mdholloway>	 bearND: all right, i was just about to go to bed before ori's email showed up, so i think i'll head off.  
[06:16:13] <mdholloway>	 bearND: good night!
[06:16:30] <bearND>	 mdholloway: good night. Me too
[06:29:57] <icinga-wm>	 PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:16] <icinga-wm>	 PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:47] <icinga-wm>	 PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: puppet fail
[06:30:48] <icinga-wm>	 PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:07] <icinga-wm>	 PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:26] <icinga-wm>	 PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:37] <icinga-wm>	 PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:38] <icinga-wm>	 PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:44:17] <icinga-wm>	 PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:28] <icinga-wm>	 RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:56:06] <icinga-wm>	 RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:56:17] <icinga-wm>	 RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:56:38] <icinga-wm>	 RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:56:47] <icinga-wm>	 RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:56:48] <icinga-wm>	 RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:56:57] <icinga-wm>	 RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:57:57] <icinga-wm>	 RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:57] <icinga-wm>	 PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: puppet fail
[07:12:16] <icinga-wm>	 RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:19:46] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]
[07:19:46] <icinga-wm>	 PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[07:26:46] <icinga-wm>	 RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[07:52:46] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[07:52:47] <icinga-wm>	 RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[07:58:21] <grrrit-wm>	 (PS4) Nemo bis: Use full URL in $wgNoticeHideUrls [mediawiki-config] - https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442)
[08:03:31] <grrrit-wm>	 (Abandoned) Kelson: Fix regex to enable upload from ETHZ Library with the GWT [mediawiki-config] - https://gerrit.wikimedia.org/r/273774 (owner: Kelson)
[08:03:51] <grrrit-wm>	 (PS7) Nemo bis: Enable Translate extension on AffCom (chapcomwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/275289 (https://phabricator.wikimedia.org/T66122) (owner: MarcoAurelio)
[08:15:27] <grrrit-wm>	 (PS1) Yuvipanda: tools: Do not have static class inherit from toollabs [puppet] - https://gerrit.wikimedia.org/r/278431 (https://phabricator.wikimedia.org/T128411)
[08:15:54] <grrrit-wm>	 (CR) Yuvipanda: "I think I27a39b3352abb93babc7ed19b642f76524470c2d is the right way to do this." [puppet] - https://gerrit.wikimedia.org/r/277862 (https://phabricator.wikimedia.org/T128411) (owner: Tim Landscheidt)
[08:48:36] <icinga-wm>	 PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[08:49:57] <icinga-wm>	 RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:50:07] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]
[08:51:56] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[08:52:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[08:52:28] <icinga-wm>	 PROBLEM - Labs LDAP on seaborgium is CRITICAL: Could not bind to the LDAP server
[08:54:06] <icinga-wm>	 PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[08:58:47] <icinga-wm>	 PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:59:57] <icinga-wm>	 PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[09:04:27] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]
[09:04:37] <icinga-wm>	 PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]
[09:06:16] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[09:06:26] <icinga-wm>	 RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[10:21:47] <Vito>	 thumbnail generation problems at wikimania 2016 wiki
[10:21:51] <Vito>	 worth opening a bug?
[10:24:07] <icinga-wm>	 PROBLEM - Disk space on labstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:25:46] <icinga-wm>	 RECOVERY - Disk space on labstore2001 is OK: DISK OK
[10:38:47] <icinga-wm>	 PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: puppet fail
[10:41:37] <wikibugs>	 Operations, MediaWiki-Uploading, Multimedia: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136256 (Steinsplitter)
[10:44:08] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.32.206:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on port 9042
[10:51:50] <hashar>	 !log Labs LDAP is probably down.  T130446 Can't log in to tools-login.wmflabs.org / Jenkins interface and Nodepool yields error 500 communicating with OpenStack API
[10:51:51] <stashbot>	 T130446: Unable to SSH onto tools-login.wmflabs.org - https://phabricator.wikimedia.org/T130446
[10:54:57] <wikibugs>	 Operations, Multimedia: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136300 (Peachey88)
[10:55:33] <wikibugs>	 Operations, Multimedia: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136256 (Peachey88) -#mediawiki-uploading nothing to do with MediaWiki's internal uploading system.
[11:06:05] <grrrit-wm>	 (PS2) Nemo bis: Disable upload on ia.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/278411 (https://phabricator.wikimedia.org/T130425) (owner: Dereckson)
[11:07:07] <icinga-wm>	 RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[11:38:06] <icinga-wm>	 RECOVERY - Labs LDAP on seaborgium is OK: LDAP OK - 0.020 seconds response time
[11:38:12] <godog>	 !log restart slapd on seaborgium, oom-killed
[11:38:16] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:46:58] <godog>	 Vito: yes please, a bug would be appreciated
[11:47:07] <Vito>	 godog: already done!
[11:47:14] <Vito>	 https://phabricator.wikimedia.org/T130448
[11:48:37] <icinga-wm>	 RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[11:48:53] <godog>	 Vito: sweet, thanks!
[11:54:57] <icinga-wm>	 RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:56:17] <icinga-wm>	 RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[12:32:34] <grrrit-wm>	 (CR) Mobrovac: [C: 1] make logstash messages separable by cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/278330 (https://phabricator.wikimedia.org/T130393) (owner: Eevans)
[12:51:17] <wikibugs>	 Blocked-on-Operations, Operations, Continuous-Integration-Infrastructure, Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1177814 (Paladox) Hi, how will we still use jsduck when migrating to Jessie npm 4.3?
[13:41:10] <wikibugs>	 Operations, Multimedia, Design: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136461 (Danny_B)
[13:58:37] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 89399.00 seconds
[14:48:07] <grrrit-wm>	 (PS1) Aklapper: Telugu Wikipedia outreach activity [mediawiki-config] - https://gerrit.wikimedia.org/r/278446 (https://phabricator.wikimedia.org/T130447)
[14:50:14] <grrrit-wm>	 (CR) Dereckson: [C: 1] Telugu Wikipedia outreach activity [mediawiki-config] - https://gerrit.wikimedia.org/r/278446 (https://phabricator.wikimedia.org/T130447) (owner: Aklapper)
[14:51:04] <grrrit-wm>	 (CR) Gergő Tisza: [C: 1] Logging: add ApiAction kafka logging [mediawiki-config] - https://gerrit.wikimedia.org/r/278347 (https://phabricator.wikimedia.org/T108618) (owner: BryanDavis)
[14:53:36] <Dereckson>	 Hi. Could someone deploy https://gerrit.wikimedia.org/r/278446? This is a throttle rule for an event this Sunday, filed at the last minute.
[15:00:57] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[15:01:28] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[15:02:46] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[15:03:16] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.039 second response time on port 9042
[15:40:14] <grrrit-wm>	 (CR) Reedy: [C: 2] Telugu Wikipedia outreach activity [mediawiki-config] - https://gerrit.wikimedia.org/r/278446 (https://phabricator.wikimedia.org/T130447) (owner: Aklapper)
[15:40:39] <grrrit-wm>	 (Merged) jenkins-bot: Telugu Wikipedia outreach activity [mediawiki-config] - https://gerrit.wikimedia.org/r/278446 (https://phabricator.wikimedia.org/T130447) (owner: Aklapper)
[15:41:10] <Dereckson>	 Thanks Reedy.
[15:43:07] <grrrit-wm>	 (PS1) Reedy: Remove old throttle rules. Swap array() -> [] [mediawiki-config] - https://gerrit.wikimedia.org/r/278454
[15:43:51] <logmsgbot>	 !log reedy@tin Synchronized wmf-config/throttle.php: Throttle rules for event T130447 (duration: 00m 26s)
[15:43:52] <stashbot>	 T130447: Raise throttling cap on user registration, image upload on commons.wikimedia.org and te.wikipedia.org on 2016-03-20 - https://phabricator.wikimedia.org/T130447
[15:43:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:44:11] <grrrit-wm>	 (CR) Dereckson: [C: 1] Remove old throttle rules. Swap array() -> [] [mediawiki-config] - https://gerrit.wikimedia.org/r/278454 (owner: Reedy)
[16:06:23] <grrrit-wm>	 (PS1) Halfak: Adds compute node to ores [puppet] - https://gerrit.wikimedia.org/r/278455
[16:13:46] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[16:14:07] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[16:14:24] <grrrit-wm>	 (PS2) Halfak: Adds compute node to ores [puppet] - https://gerrit.wikimedia.org/r/278455 (https://phabricator.wikimedia.org/T12345)
[16:16:38] <grrrit-wm>	 (PS3) Halfak: Adds compute node to ores [puppet] - https://gerrit.wikimedia.org/r/278455 (https://phabricator.wikimedia.org/T130461)
[16:31:38] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[16:33:25] <halfak>	 Hey folks.  I'm blocked on getting a puppet change merged.  
[16:33:27] <halfak>	 https://gerrit.wikimedia.org/r/#/c/278413/
[16:33:31] <halfak>	 Can someone have a look?
[16:33:46] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[17:30:14] <grrrit-wm>	 (CR) Gehel: [C: 2] Adding a `$ensure` parameter to nginx::status_site [puppet/nginx] - https://gerrit.wikimedia.org/r/278276 (https://phabricator.wikimedia.org/T130365) (owner: Gehel)
[17:50:51] <grrrit-wm>	 (PS3) Gehel: Adding metric collection to nginx in the context of elasticsearch [puppet] - https://gerrit.wikimedia.org/r/278278 (https://phabricator.wikimedia.org/T130365)
[17:51:46] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[17:52:40] <gehel>	 halfak: are you still there?
[17:52:47] <halfak>	 Yeah!  
[17:53:25] <gehel>	 How urgent is that merge? Can it wait for the next puppet SWAT?
[17:53:46] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[17:54:00] <halfak>	 gehel, hmm... Shouldn't affect anyone but me.
[17:54:15] <halfak>	 And my instances in labs
[17:54:57] <gehel>	 halfak: lemme dig a bit into it...
[17:55:01] <halfak>	 kk  
[17:55:16] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0]
[17:55:18] <halfak>	 FWIW, this is for the `ores` project.  I'm the maintainer.  
[17:56:24] <halfak>	 I guess it isn't that time critical now.  I'm in the middle of a downtime event, so I just manually worked with apt-get to install my dependencies. :/
[17:57:43] <gehel>	 I'm fairly new here, I need some reflection time before playing cowboy during the weekend.
[17:57:52] <gehel>	 but the patch seems trivial enough...
[17:58:27] <gehel>	 which instances are they on labs? Can I have a look at them?
[17:59:41] <halfak>	 gehel, heh.  All the ones in the ores project on labs
[17:59:54] <halfak>	 I wouldn't sweat it if you feel uncomfortable. 
[18:00:19] <gehel>	 do you have the name of one of them?
[18:00:40] <halfak>	 ores-web-05.eqiad.wmflabs
[18:00:49] <gehel>	 thanks!
[18:02:26] <halfak>	 generally, look for ores-(web|worker|staging)-[0-9]{2}.eqiad.wmflabs
[18:03:19] <gehel>	 strange, I don't have access to those.
[18:03:31] * gehel does not fully understand how labs access works
[18:04:19] <gehel>	 halfak: I was wondering which puppetmaster those machines use... and whether you could cherry-pick your change while waiting for an actual merge.
[18:04:41] <gehel>	 halfak: ideally, you should not be blocked waiting for Ops on labs...
[18:05:10] <halfak>	 gehel, honestly, don't worry about it.  I have downtime and an incident report to worry about right now anyway
[18:05:41] <gehel>	 I'll merge it right now... The change is trivial enough that I can feel good about it...
[18:06:04] <halfak>	 :) thanks
[18:06:20] <gehel>	 give me just 5 minutes to rebase it and merge...
[18:06:55] <grrrit-wm>	 (PS2) Gehel: Adds arabic and polish languages files to ores role. [puppet] - https://gerrit.wikimedia.org/r/278413 (owner: Halfak)
[18:08:06] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[18:09:10] <grrrit-wm>	 (CR) Gehel: [C: 2] "Change seems trivial enough, merging as discussed with @halfak" [puppet] - https://gerrit.wikimedia.org/r/278413 (owner: Halfak)
[18:09:28] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[18:10:04] <gehel>	 halfak: Merged. I'll be around, ping me if you see anything suspicious...
[18:10:14] <halfak>	 Thanks gehel 
[18:10:21] <halfak>	 Just running puppet on my last replacement worker 
[18:10:25] <gehel>	 Glad to help
[18:10:31] <halfak>	 So I'll know right away
[18:11:52] <halfak>	 Looks good
[18:21:16] <wikibugs>	 Operations, Wikimedia-log-errors: "internal_api_error_MWException: [dbf916b7] Exception Caught: Could not acquire lock for" for some uploads (during upload with Pywikibot OAuth) - https://phabricator.wikimedia.org/T129621#2136832 (Aklapper)
[19:23:16] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[19:23:27] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[19:32:26] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[19:33:57] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on port 9042
[19:55:11] <wikibugs>	 Operations, Traffic, Design: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136927 (Krenair) I don't think upload.wikimedia.org has anything to do with apache.  I tried `curl -H "Host: upload.wikimedia.org" http://ms-fe.svc.eqiad.wmnet/` on bast1001 and got a...
[19:55:37] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[19:57:37] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:06:43] <wikibugs>	 Operations, Traffic, Design: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136932 (Krenair) No idea where that response file actually comes from either :/
[20:19:06] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:27:16] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0]
[20:37:57] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[20:48:06] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[20:48:27] <icinga-wm>	 PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[20:52:26] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:52:27] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:00:54] <wikibugs>	 Operations, Traffic, Design: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136256 (MZMcBride) Redirecting "/" on upload.wikimedia.org on both HTTP and HTTPS to <https://commons.wikimedia.org> seems reasonable and clean to me.
[21:02:37] <icinga-wm>	 RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[21:03:58] <icinga-wm>	 RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[21:25:25] <wikibugs>	 Operations, Traffic, Design: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136256 (Southparkfan) @Krenair where does that come from (run curl with -v, and then Server: header)?
[21:28:27] <wikibugs>	 Operations, Traffic, Design: Create page for https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136996 (Krenair) There is no Server header in that response @Southparkfan
[21:40:30] <wikibugs>	 Operations, Labs, Labs-Infrastructure: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#2137010 (hashar)
[21:48:32] <wikibugs>	 Operations, Traffic, Design: Do something better than an "Unauthorized" error page at https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2137019 (MZMcBride)
[21:52:24] <wikibugs>	 Blocked-on-Operations, Operations, Continuous-Integration-Infrastructure, Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#2137023 (hashar) @Paladox by using the `rake-jessie` job which relies on bundler to download dependencies from ruby gems....
[22:07:13] <urandom>	 !log clearing snapshots on restbase2004.codfw.wmnet
[22:07:17] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:16:46] <urandom>	 !log removing 22G of heap dumps
[22:16:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:16:56] <urandom>	 !log removing 22G of heap dumps from restbase2004.codfw.wmnet
[22:17:01] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:28:58] <jynus>	 !log powercycling oxygen, looks kernel-dead
[22:29:02] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:30:07] <icinga-wm>	 RECOVERY - Host oxygen is UP: PING OK - Packet loss = 0%, RTA = 4.72 ms
[22:46:58] <icinga-wm>	 PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail
[23:00:57] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[23:13:56] <icinga-wm>	 RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[23:20:48] <icinga-wm>	 PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: puppet fail
[23:43:47] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:47:38] <icinga-wm>	 RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures