[00:06:33] <grrrit-wm>	 (03CR) 10Chad: [C: 031] Add /api/ listing to www.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/252863 (https://phabricator.wikimedia.org/T118519) (owner: 10GWicke)
[01:00:11] <icinga-wm>	 PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures
[01:03:57] <grrrit-wm>	 (03PS1) 10Dzahn: base::firewall: remove exec for nf_conntrack [puppet] - 10https://gerrit.wikimedia.org/r/253056 
[01:05:12] <icinga-wm>	 PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures
[01:10:11] <icinga-wm>	 PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures
[01:12:19] <mutante>	 ah, that's in .frack.eqiad, not .eqiad
[01:13:00] <Krenair>	 can icinga not be made to show the full name for misc hosts?
[01:14:32] <Krenair>	 although then I guess you could still get "db1008" which is actually in frack :/
[01:15:12] <icinga-wm>	 PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures
[01:15:50] <mutante>	 Krenair: it shows me the IP address when i am in web ui
[01:16:03] <mutante>	 yea, fqdn would be nice
[01:16:08] <mutante>	 but also an icinga-wm feature
[01:16:16] <mutante>	 member of "Fundraising Banner Logger"
[01:16:17] <mutante>	 fwiw
[01:17:00] <Krenair>	 interesting, frack has a separate mgmt domain in codfw but not eqiad?
[01:18:08] <mutante>	 hmm.. i didnt know
[01:18:22] <mutante>	 about the fr setup in codfw 
[01:20:11] <icinga-wm>	 PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures
[01:25:11] <icinga-wm>	 RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[01:55:05] <MaxSem>	 err, I'm getting a 503 on enwiki
[01:55:48] <legoktm>	 me too
[01:55:49] <MaxSem>	 ping bblack YuviPanda ori 
[01:55:57] <MaxSem>	 who else can be there
[01:56:02] <ori>	 i'm here
[01:56:20] <ori>	 but i need to take my son to a doctor. what was the last deployment?
[01:56:34] <Krinkle>	 I'm getting 503 on load.php on mediawikiorg and also when diffing things on office wiki
[01:56:35] <legoktm>	 you
[01:56:36] <legoktm>	  23:30 logmsgbot: ori@tin Synchronized rpc/RunJobs.php: I5e15ec9fb: Disable ChronologyProtector for rpc/RunJobs.php (duration: 00m 30s)
[01:57:00] <MaxSem>	 why the fuck is icinga quiet?
[01:57:01] <Krinkle>	 checking logstash
[01:57:36] <MaxSem>	 logstash is elevated but not outage grade
[01:57:42] <Krinkle>	 There's like 10,000 hits for 'Internal error in ApiQueryAllUsers::execute: Saw more duplicate rows than expected'
[01:58:01] <MaxSem>	 not related to normal pv's broken
[01:58:16] <legoktm>	 wfm now...
[01:58:16] <ori>	 !log restarted HHVM on all app servers
[01:58:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:58:41] <MaxSem>	 dafuq happened?
[01:59:14] <ori>	 i gotta run in a sec but i tailed fatal.log, saw memory exhaustion, figured things couldn't be much worse than they are already and went for an hhvm restart across the cluster
[01:59:15] <Krinkle>	 There's also a lot of A connection error occured. Query: SELECT  rev_id,rev_page,rev_text_id,rev_timesta
[01:59:25] <ori>	 $ sudo salt -G 'php:hhvm' cmd.run 'service hhvm status | grep running && service hhvm restart'
[01:59:28] <Krinkle>	 For various regular view urls
[01:59:45] <ori>	 could someone start paging?
[01:59:50] <ori>	 even if we're back up, this needs attention
[01:59:52] <Krinkle>	  Lost connection to MySQL server during query
[02:00:13] <ori>	 call jynus (jaime crespo), bblack and faidon at minimum
[02:00:36] <ori>	 anyone on that?
[02:00:49] <ori>	 ok, i'm calling
[02:00:51] <mutante>	 ori: yes
[02:00:52] <MaxSem>	 calling brandon
[02:00:59] <ori>	 i'll call jaime
[02:01:01] <mutante>	 calling faidon
[02:03:06] <bblack>	 hi
[02:03:12] <bblack>	 what's wrong?
[02:03:20] <ori>	 503s on every page request
[02:03:35] <bblack>	 I just did one fine, not logged in on enwiki...
[02:03:35] <ori>	 i tailed fatal.log, saw memory exhaustion, figured things couldn't be much worse than they are already and went for an hhvm restart across the cluster
[02:03:41] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [500.0]
[02:03:47] <ori>	 $ sudo salt -G 'php:hhvm' cmd.run 'service hhvm status | grep running && service hhvm restart'
[02:03:48] <paravoid>	 hey
[02:03:49] <bblack>	 even with cache busting query arg
[02:03:49] <ori>	 site came back
[02:03:52] <mutante>	 there's the alarm
[02:03:55] <mutante>	 it just showed up now
[02:03:56] <ori>	 but no real idea what the cause was
[02:04:59] * YuviPanda is here
[02:05:01] <paravoid>	 if you didn't call jaime, don't yet
[02:05:08] * YuviPanda reads backlog
[02:05:13] <bblack>	 I see a big 503 spike so far, but it did go back down
[02:05:30] <paravoid>	 it's 3am there, if there isnt't evidence of a database issue so far and one that we cannot solve, let's shield him from that
[02:05:39] <ori>	 k
[02:06:04] <paravoid>	 too late I guess
[02:06:18] <jynus>	 someone called?
[02:07:30] <mutante>	 18:00 < Krinkle> There's like 10,000 hits for 'Internal error in ApiQueryAllUsers::execute: Saw more duplicate rows than expected'
[02:07:38] <mutante>	 18:02 < Krinkle> There's also a lot of A connection error occured. Query: SELECT  rev_id,rev_page,rev_text_id,rev_timesta
[02:08:08] <bblack>	 a lot of the 503s in the logs seem to be disproportionately /w/index.php with some query args e.g. ?title=Special:LinkSearch
[02:08:09] <Krinkle>	 Yeah, there is also a few hundred hits in the last hour for: 'Connection error: Unknown error (208.80.154.136)' and  and 'Error connecting to 208.80.154.136: Can't connect to MySQL server on '208.80.154.136' (4)' in log stash.
[02:08:20] <Krinkle>	 !info 208.80.154.136
[02:08:21] <wm-bot>	 https://www.mediawiki.org/wiki/WMF_Projects/Wikimedia_Labs
[02:08:23] <bblack>	 ?title=MediaWiki:Wikibugs.js&action=raw&ctype=text/javascript
[02:08:24] <wm-bot>	 http://bots.wmflabs.org/dump/%23wikimedia-operations.htm
[02:08:24] <Krinkle>	 @info 208.80.154.136
[02:08:24] <bblack>	 etc
[02:08:24] <dbbot-wm>	 Krinkle: [208.80.154.136: silver] silver
[02:08:57] <Krenair>	 That's wikitech
[02:09:00] <Krinkle>	 enwiki: Could not connect to server "10.192.0.199" - /wiki/Main_Page
[02:09:04] <wm-bot>	 http://bots.wmflabs.org/dump/%23wikimedia-operations.htm
[02:09:04] <Krinkle>	 @info 10.192.0.199
[02:09:04] <dbbot-wm>	 Krinkle: Unknown identifier (10.192.0.199)
[02:09:09] <Krenair>	 Shouldn't be related to the site outage
[02:09:15] <RoanKattouw>	 !info 10.192.0.199
[02:09:15] <wm-bot>	 https://www.mediawiki.org/wiki/WMF_Projects/Wikimedia_Labs
[02:09:21] <Krinkle>	 That's Jobqueu redis
[02:09:24] <Krinkle>	 10.192.0.199
[02:09:28] <RoanKattouw>	 Hmm I guess neither of those things works
[02:09:43] <Krinkle>	 rdb2001
[02:10:00] <MaxSem>	 raw index.php almost always skips caches, hence the abundance
[02:10:34] <Krenair>	 is rdb2001 actually in use?
[02:11:05] <Krinkle>	 Can't connect to MySQL server on 10.64.48.21, 10.64.48.20, 10.64.48.26
[02:11:10] <Krinkle>	 those are also in the logs. Not nearly as high though
[02:11:16] <Krenair>	 rdb2001.codfw.wmnet (10.192.0.119)
[02:11:16] <Krinkle>	 couple dozen in the last hour
[02:11:21] <wm-bot>	 http://bots.wmflabs.org/dump/%23wikimedia-operations.htm
[02:11:21] <Krinkle>	 @info 10.64.48.21
[02:11:21] <dbbot-wm>	 Krinkle: [10.64.48.21: s1] db1066
[02:11:26] <Krenair>	 10.192.0.199 does not exist
[02:11:26] <jynus>	 I do not see mysql issues
[02:11:45] <Krenair>	 no wonder enwiki couldn't connect to it
[02:11:46] <Krinkle>	 Krenair: https://github.com/search?q=%2210.192.0.199%22+@wikimedia&type=Code&ref=searchresults
[02:11:48] <YuviPanda>	 I see segfaults in hhvm log
[02:12:09] <jynus>	 https://logstash.wikimedia.org/#dashboard/temp/AVEDw1tJptxhN1XaOpwm
[02:12:10] <bblack>	 10.192.0.199 isn't legit I don't think
[02:12:16] <Krenair>	 It's 119
[02:12:17] <Krenair>	 not 199
[02:12:28] <Krinkle>	 https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors
[02:12:44] <bblack>	 rdb2001.codfw.wmnet
[02:12:53] <bblack>	 we shouldn't be connecting to anything in codfw for live hits right?
[02:12:58] <mutante>	 wmnet:rdb2001         1H  IN A    10.192.0.119
[02:13:21] <ori>	 analytics was working on an announcement for the page view api, did they turn anything on? the spike of traffic on the app servers coincides with http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Analytics+Query+Service+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
[02:13:56] <jynus>	 yes, 5000 errors, out of 100000/s
[02:14:20] <bblack>	 yeah logstash doesn't have a big spike or anything
[02:14:24] <bblack>	 hmmm
[02:14:43] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:14:47] <paravoid>	 did anyone follow the incident protocol?
[02:14:52] <paravoid>	 (side note, sorry)
[02:15:15] <MaxSem>	 ori, wqs has periodic spikes, not likely related
[02:15:30] <paravoid>	 I guess, not, I'll drop a quick note to outage-notifications
[02:15:45] <ori>	 i was on the phone with pediatric emergency assessing whether i should bring my son in because he has some massively swolen bite, did not follow protocol, sorry
[02:15:49] <grrrit-wm>	 (03PS1) 10Dzahn: fix wrong IP for codfw redis server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253063 
[02:15:59] <paravoid>	 ouch, hope he's ok :/
[02:16:27] <YuviPanda>	 hmm, segfault only on one host so everything else probably ok
[02:16:39] <Krenair>	 mutante, shouldn't there be reverse dns there too?
[02:16:44] <ori>	 i'm actually gonna go now, thanks for responding to the page, and i will review the protocol and do anything i need to when i get back
[02:16:55] <Krinkle>	 mutante: The redis server failures are only for enwiki /wiki/Main_Page. I guess those are from health checks of some kind
[02:17:00] <paravoid>	 well I guess there's little point in it now, site works
[02:17:06] <bblack>	 I've got 1:53 -> 1:59 or so, but honestly the rates in the req logs don't look like a total outage to me
[02:17:07] <paravoid>	 I guess I shouldn't page people for no reason
[02:17:18] <Krinkle>	 very steady interval and ditribution among servers
[02:17:24] <Krenair>	 mutante, ignore message above...my bad
[02:17:37] <bblack>	 5xx running to ~700/sec, out of !33K/sec gets
[02:17:40] <paravoid>	 ori: take care, hope your son gets ok
[02:17:41] <ori>	 apologies if it was an overreaction, every pv for me and everyone around me was 533
[02:17:42] <bblack>	 s/!/~/ heh
[02:17:45] <ori>	 503
[02:17:49] <ori>	 bye!
[02:17:54] <bblack>	 it may have been mostly for logged-in?
[02:17:57] <bblack>	 bye!
[02:18:09] <MaxSem>	 yup, ws like a total outage for me
[02:18:19] <grrrit-wm>	 (03PS2) 10Dzahn: fix wrong IP for codfw redis server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253063 
[02:18:19] <MaxSem>	 and ori's restart of hhvm helped
[02:18:19] <Krinkle>	 See spike on https://grafana.wikimedia.org/dashboard/db/varnish-http-errors
[02:18:22] <Krinkle>	 it seems to have resolved
[02:18:29] <jynus>	 I only see 40 failed connections to mysql per server
[02:18:36] <Krinkle>	 second graph from the top
[02:18:38] <mutante>	 i did not see it myself, but a user reported it on wikimedia-tech 
[02:18:43] <jynus>	 a spike indeed, but not too serious
[02:18:44] <mutante>	 and after the hhvm restarted they said it was back
[02:18:47] <paravoid>	 (third)
[02:19:16] <bblack>	 https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1447445941501&to=1447467541501&var-site=All&var-cache_type=mobile&var-cache_type=text&var-status_type=5
[02:19:33] <bblack>	 ^ that shows the rates I'm talking about, doesn't seem like most gets were affected
[02:20:03] <bblack>	 (it was all confined to mobile+text, and it's on all DCs)
[02:20:23] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.27.0-wmf.6/cache/l10n: l10nupdate for 1.27.0-wmf.6 (duration: 05m 58s)
[02:20:28] <paravoid>	 these grafana dashboard confuse me
[02:20:30] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:20:31] <grrrit-wm>	 (03PS3) 10Dzahn: fix wrong IP for codfw redis server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253063 
[02:20:40] <paravoid>	 very inconsistent reporting of 500 vs. 503
[02:20:45] <paravoid>	 if I read https://grafana.wikimedia.org/dashboard/db/varnish-http-errors correctly
[02:20:46] <bblack>	 ?
[02:20:54] <paravoid>	 there were no 500, just 503s
[02:20:57] <bblack>	 I can't really read it all, try my link
[02:21:14] <paravoid>	 your link doesn't make a distinction between 5xx subcodes I think?
[02:21:16] <paravoid>	 it just says "type 5"
[02:21:19] <bblack>	 yes, it does
[02:21:32] <bblack>	 top graph is all reqs, middle graph is 5xx, bottom splits out the various 5xx codes
[02:21:39] <paravoid>	 oh, right
[02:21:52] <paravoid>	 so no 500s
[02:21:58] <paravoid>	 why are we even looking at mediawiki then?
[02:22:01] <paravoid>	 logstash and all that
[02:22:28] <paravoid>	 hrm I guess the hhvm restart did fix, maybe, possibly
[02:22:34] <paravoid>	 did fix it*
[02:22:47] <mutante>	 that would fit what was reported by users on IRC
[02:22:49] <bblack>	 well 503s could still mean varnish->mw conns failing
[02:23:08] <bblack>	 or timing out or dropping before completion, etc
[02:23:31] <paravoid>	 sure, and appservers memory graph is suspicious, but zero 500s is too
[02:23:52] <bblack>	 ori said he saw memory exhaustion before he restarted them
[02:26:26] <paravoid>	 faidon@oxygen:/srv/log/webrequest$ grep  2015-11-14T01:57 5xx.json  | jq .x_cache | sed 's/,.*//' | sort | uniq -c | sort -nr| head -5
[02:26:29] <paravoid>	   23500 "cp1066 miss (0)
[02:26:32] <paravoid>	     839 "cp1065 miss (0)
[02:26:34] <paravoid>	     671 "cp1055 miss (0)
[02:26:37] <paravoid>	     625 "cp1052 miss (0)
[02:26:39] <paravoid>	     624 "cp1067 miss (0)
[02:27:31] <bblack>	 hmmm
[02:28:06] <bblack>	 due to URL hash?
[02:28:29] <Krinkle>	 hhvm channel in logstash also has a trending 'Lost parent, LightProcess exiting'. I don't know if that's SNAFU though.
[02:28:38] <paravoid>	 http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Text+caches+eqiad&h=cp1066.eqiad.wmnet&jr=&js=&event=hide&ts=0&v=8608136&m=varnish.backend_busy&vl=N%2Fs&ti=Backend+conn.+too+many
[02:28:42] <paravoid>	 http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Text+caches+eqiad&h=cp1066.eqiad.wmnet&jr=&js=&event=hide&ts=0&v=175&m=varnish.n_sess&vl=N&ti=N+struct+sess
[02:28:46] <paravoid>	 http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Text+caches+eqiad&h=cp1066.eqiad.wmnet&jr=&js=&event=hide&ts=0&v=500&m=varnish.n_wrk&vl=N&ti=N+worker+threads
[02:29:30] <YuviPanda>	 Krinkle: that might just be the restart
[02:30:15] <bblack>	 is that just because /w/index.php hashes there? :)
[02:30:26] <bblack>	 surely not, the quewry args count too
[02:31:32] <Krinkle>	 POSt request don't have query params though
[02:31:38] <Krinkle>	 if that's what it is
[02:31:45] <Krinkle>	 but the requests I was trying were GETs though
[02:32:39] <grrrit-wm>	 (03CR) 10Bmansurov: [C: 04-1] "We need to SWAT deploy this on Monday only. -1 until then." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253038 (https://phabricator.wikimedia.org/T118525) (owner: 10Jhobs)
[02:33:00] <bblack>	 well yeah but once things break the damage spread
[02:33:25] <Krinkle>	 https://grafana.wikimedia.org/dashboard/db/save-timing did have a spike as well
[02:34:21] <YuviPanda>	 anything I can do at all?
[02:34:24] <YuviPanda>	 or should I go?
[02:34:37] <paravoid>	 you can go YuviPanda
[02:35:06] <YuviPanda>	 ok. I'll have my phone handy and keep sober in case I'm needed
[02:35:13] <YuviPanda>	 thanks paravoid
[02:35:27] <bblack>	 the cp1066 bit is almost certainly because something hashes there, but nothing stands out as e.g. a huge spike in a specific uri_path in the sampled log
[02:35:40] <jynus>	 Krinkle, look at the scle
[02:35:50] <jynus>	 from 900ms to 1.1s
[02:36:08] <bblack>	 paravoid: this gets back to your whole thing about switching to backend_random when total request rate gets too high
[02:37:29] <Krinkle>	 jynus: Im faimilar with the scale, I look at it times a day. But the p75 doubled as well. The largest graph there is a median, which doesn't always tell the story. And the other lines show that such anomaly is unusual, save timing is suprisingly stable.
[02:38:38] <Krinkle>	 300ms -> 1s. 510ms -> 2.1s
[02:38:44] <Krinkle>	 p50 and p75 respectively
[02:38:51] <Krinkle>	 avg -> max
[02:38:52] <jynus>	 I have s3 recentchanges server depooled
[02:38:54] <Krinkle>	 anyway
[02:42:13] <Krinkle>	 So yeah, I still have a few of the 503s I got open in my browser
[02:42:21] <Krinkle>	 they happen to come from cp1066
[02:42:22] <Krinkle>	 e.g.
[02:42:22] <Krinkle>	 https://www.mediawiki.org/w/load.php?debug=false&lang=en&modules=ext.gadget.Edittools%2CcollapsibleTables%2Cenwp-boxes%2Csite%7Cext.uls.nojs%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.raggett%2CsectionAnchor%7Cmediawiki.skinning.interface%7Cskins.vector.styles%7Cwikibase.client.i
[02:42:22] <Krinkle>	 nit&only=styles&skin=vector
[02:42:23] <jynus>	 don't get me wrong, I think there was a spike, but related to post-reload
[02:42:38] <Krinkle>	 no, definitely not. 
[02:43:12] <Krinkle>	 It was mediawiki.org/load.php javascript/css responses (pages without styling), and index.php viewing of diffs on enwiki and officewiki.
[02:43:37] <Krinkle>	 Request from 10.20.0.103 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 2328639123.  Error: 503, Service Unavailable at Sat, 14 Nov 2015 01:56:21 GMT. Forwarded for: **, 10.20.0.105, 10.20.0.105, 10.20.0.103
[02:44:32] <Krinkle>	 Another one had the same error and IPs and hostnames (except different XID of course) for a GET on https://office.wikimedia.org/w/index.php?title=**&diff=next&oldid=**
[02:45:14] <Krinkle>	 Request from 10.20.0.166 via cp1066
[02:57:43] <grrrit-wm>	 (03PS1) 10BBlack: text/mobile: use backend_random for non-GET/HEAD [puppet] - 10https://gerrit.wikimedia.org/r/253069 
[02:59:39] <grrrit-wm>	 (03PS2) 10BBlack: text/mobile: use backend_random for non-GET/HEAD [puppet] - 10https://gerrit.wikimedia.org/r/253069 
[03:23:21] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] text/mobile: use backend_random for non-GET/HEAD [puppet] - 10https://gerrit.wikimedia.org/r/253069 (owner: 10BBlack)
[03:38:37] <bblack>	 any objections out there to https://gerrit.wikimedia.org/r/#/c/253063/3 ?
[03:42:07] <bblack>	 will let it sit since it's not apparently critical, but we should probably fix that :)
[03:45:50] <jzerebecki>	 is there any backup from where to restore an overwritten grafana dashboard?
[03:46:04] <bblack>	 I don't think so
[03:46:29] <bblack>	 you can do "export" from the setting icon to save the JSON though
[03:46:37] <bblack>	 (to backup one up for yourself locally)
[03:49:02] <jzerebecki>	 there is a cli tool to do that recursively, was thinking of running that in cron with git push
[03:50:08] <jzerebecki>	 unless someone already has something of similar result
[03:50:24] <bblack>	 I don't think so
[04:08:32] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on port 9042
[04:52:08] <Lor_>	 A user wants me to help him get http://is.gd/iaHkmC deleted..but It is reported on en-wiki and commons as not ever existing, despite on the User's end it existing on multiple browsers. Any ideas what's happening?
[04:53:26] <Lor_>	 Nevermind...it is working now...Weird.
[04:55:31] <c>	 cache probably
[04:56:48] <Lor_>	 c, The user tested it on multiple Browsers. I'm guessing there was some sort of server lag. Doesn't matter now as the user wanted it deleted anyway.
[06:15:02] <grrrit-wm>	 (03CR) 10Krinkle: "Can we switch to hostnames here? I heard something about us having HHVM's dns cache enabled. See also I93aa4df." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253063 (owner: 10Dzahn)
[06:16:30] <Krinkle>	 jzerebecki: bblack: There are backups for grafana dashboards.
[06:16:40] <Krinkle>	 I don't know where, but I recall ori setting htem up
[06:17:04] <Krinkle>	 Predating that, I often make exports to here https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org/
[06:30:22] <icinga-wm>	 PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:31] <icinga-wm>	 PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:42] <icinga-wm>	 PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:51] <icinga-wm>	 PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:03] <icinga-wm>	 PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: puppet fail
[06:31:08] <c>	 oh dear
[06:31:22] <icinga-wm>	 PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:52] <icinga-wm>	 PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:23] <icinga-wm>	 PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:32:41] <icinga-wm>	 PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:41] <icinga-wm>	 PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:43] <icinga-wm>	 PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:51] <icinga-wm>	 PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:52] <icinga-wm>	 PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:12] <icinga-wm>	 PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:23] <icinga-wm>	 PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:02] <c>	 robh:
[06:34:12] <icinga-wm>	 PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:43] <icinga-wm>	 PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:51] <icinga-wm>	 PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:44:11] <icinga-wm>	 PROBLEM - puppet last run on mw2112 is CRITICAL: CRITICAL: puppet fail
[06:49:22] <icinga-wm>	 PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:01] <icinga-wm>	 RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:55:23] <icinga-wm>	 RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:56:22] <icinga-wm>	 RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:56:22] <icinga-wm>	 RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:56:23] <icinga-wm>	 RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:32] <icinga-wm>	 RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:56:41] <icinga-wm>	 RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:56:42] <icinga-wm>	 RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:43] <icinga-wm>	 RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:56:43] <icinga-wm>	 RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:51] <icinga-wm>	 RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:57:11] <icinga-wm>	 RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:57:22] <icinga-wm>	 RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:52] <icinga-wm>	 RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] <icinga-wm>	 RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:32] <icinga-wm>	 RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:42] <icinga-wm>	 RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:02] <icinga-wm>	 RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:13:52] <icinga-wm>	 RECOVERY - puppet last run on mw2112 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[07:17:12] <icinga-wm>	 RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:28:01] <icinga-wm>	 PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:54:03] <icinga-wm>	 RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[08:22:52] <icinga-wm>	 PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail
[08:50:42] <icinga-wm>	 RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:01:14] <apergos>	 nothing like checking the channel for the morning puppet failures and thinking I'm late for work.  on a saturday.
[09:02:45] <ori>	 apergos: :)
[09:39:13] <_joe_>	 apergos: ahahaha
[10:34:02] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:39:31] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING
[14:49:03] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:50:52] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING
[15:41:11] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:44:52] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING
[16:37:09] <wikibugs>	 6operations: sitemap.wikimedia.org ? - https://phabricator.wikimedia.org/T101486#1805667 (10Aklapper) > Should http://sitemap.wikimedia.org/ [...] have an actual sitemap on it?  Who to make a call? Is this something to bring up on a Wikimedia mailing list or such?
[16:56:07] <wikibugs>	 6operations, 6Security-Team: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1805678 (10Aklapper) > `# Newer librsvg supports a sane security model by default and doesn't need our security patch`  Still unclear which "newer" version is refered to and no references provided. h...
[17:05:47] <wikibugs>	 6operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-Requests, 5codfw-appserver-setup, 5wikis-in-codfw: Configure mediawiki to operate in the Dallas DC - https://phabricator.wikimedia.org/T91754#1805688 (10Aklapper) @Joe: Could you explain what is missing to close this task as resolved?
[17:13:26] <wikibugs>	 6operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1805695 (10Aklapper) >>! In T99132#1642500, @dr0ptp4kt wrote: > @aklapper, I'm pinging on the thread.  @dr0ptp4kt: Was there any outcome?
[18:36:42] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:42:13] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING
[18:51:08] <legoktm>	 [0e00fe2a] 2015-11-14 18:50:57: Fatal exception of type "MWException"
[18:51:13] <legoktm>	 while trying to move a page on wikitech...
[18:51:39] <legoktm>	 2015-11-14 18:50:57 silver labswiki exception ERROR: [0e00fe2a] /w/index.php?title=Special:MovePage&action=submit   MWException from line 220 of /srv/mediawiki/php-1.27.0-wmf.6/includes/Hooks.php: Detected bug in an extension! Hook SMWParseData::onTitleMoveComplete has invalid call signature; Parameter 3 to SMWParseData::onTitleMoveComplete() expected to be a reference, value given {"exception_id":"0e00fe2a"} 
[18:51:43] <ori>	 what more info do you need? i mean, it says clearly that the exception was of type mwexception, and you have a randomly generated hash to go with it
[18:51:51] <ori>	 should that not be enough for pinpointing the problem?
[18:53:01] <Reedy>	 lol
[18:53:09] <legoktm>	 yeah, I was just pasting here
[18:53:13] <legoktm>	 filed as https://phabricator.wikimedia.org/T118649?workflow=create
[18:53:30] * legoktm goes back to the bug he was actually trying to investigate
[18:53:30] <ori>	 legoktm: i was just sardonically deriding our error reporting, not actually criticizing you :P
[18:53:43] <legoktm>	 oh :P
[18:53:56] <Reedy>	 ori: "Something broke"
[18:54:11] <ori>	 Reedy: no, it's more specific than that
[18:54:17] <ori>	 "Something related to MediaWiki broke"
[18:54:41] <ori>	 i.e., it was not a bicycle gear that broke, or a gas meter
[18:54:52] <ori>	 the breakage was very definitely wiki-related
[18:55:37] <Reedy>	 Do SMW care about 1.8.X?
[18:55:46] <ori>	 it's really a UX issue; there may not be more information available that could be disclosed to the user
[18:56:04] <ori>	 but in that case we should just say "an error occurred; here's a reference number you can use when asking for help about this problem"
[18:56:31] <ori>	 instead of Fatal exception of type "MWException"
[18:56:52] <Reedy>	 !bug 1
[18:56:52] <wm-bot>	 https://bugzilla.wikimedia.org/show_bug.cgi?id=1
[19:04:42] <icinga-wm>	 PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:09:28] <wikibugs>	 6operations, 6Security-Team: can we get rid of rsvg security patch? - https://phabricator.wikimedia.org/T104147#1805813 (10Reedy) >>! In T104147#1805678, @Aklapper wrote: >> `# Newer librsvg supports a sane security model by default and doesn't need our security patch` >  > Still unclear which "newer" version...
[19:10:13] <icinga-wm>	 RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: OK: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: RUNNING
[19:12:51] <Nemo_bis>	 ori: I do have a patch for that https://gerrit.wikimedia.org/r/190379
[19:15:15] <ori>	 Nemo_bis: I'm not sure that's the right solution, without some kind of throttling / deduplication
[19:15:37] <Nemo_bis>	 That's pre-emptive optimisation.
[19:15:53] <ori>	 no, it's capacity planning
[19:16:55] <ori>	 well, anyways, let me back up for a second, and before quibbling with particulars, just say thanks for thinking about this issue and proposing a solution
[19:17:18] <Nemo_bis>	 :)
[19:18:46] <ori>	 why do you think it's pre-emptive optimisation? isn't there a credible risk that a site outage would spill over to a phabricator overload?
[19:19:14] <ori>	 either an actual server overload, bringing phabricator down, or phabricator stays up but there are 5,000 new tasks to clean up
[19:19:39] <Nemo_bis>	 First, if the outage is so severe and widespread, people won't be able to login in phabricator anyway.
[19:19:48] <Nemo_bis>	 Second, most people don't bother even trying to report.
[19:20:02] <Nemo_bis>	 Third, those who try usually don't bother logging in.
[19:20:25] <Nemo_bis>	 Fourth, only a minuscule minority of users manages to file reports even after logging in.
[19:21:43] <Reedy>	 "AaronSchulz committed with atdt 2 days ago"
[19:21:45] <Reedy>	 heh, github
[19:23:00] <ori>	 Nemo_bis: 
[19:23:01] <Nemo_bis>	 Ah. And at some point we had such a link, perhaps to the webchat for #wikimedia-tech, and the amount of people who ever clicked that was neglible.
[19:23:03] <ori>	 The messenger started off at once, a powerful, tireless man. ... [But] how futile are all his efforts. He is still forcing his way through the private rooms of the innermost palace. Never will he win his way through. And if he did manage that, nothing would have been achieved. He would have to fight his way down the steps, and, if he managed to do that, nothing would have been achieved.
[19:23:03] <ori>	 He would have to stride through the courtyards, and after the courtyards through the second palace encircling the first, and, then again, through stairs and courtyards, and then, once again, a palace, and so on for thousands of years.
[19:23:03] <ori>	 And if he finally burst through the outermost door—but that can never, never happen—the royal capital city, the centre of the world, is still there in front of him, piled high and full of sediment. No one pushes his way through here, certainly not someone with a message from a dead man.
[19:23:22] <Nemo_bis>	 (Though maybe I'm wrong and we never replaced the nast irc:// link.)
[19:23:56] <ori>	 (http://www.kafka-online.info/an-imperial-message.html -- second time a technical conversation reminds me of kafka in a week)
[19:24:20] <Nemo_bis>	 Ah yes, that story is nice.
[19:25:23] <jzerebec1i>	 ori: Krinkle said you set up backups of grafana dashboards. where can I find those?
[19:27:02] <ori>	 jzerebec1i: I did, but I'm not sure where they end up. This is configured in https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/grafana.pp#L176-L178
[19:27:20] <ori>	 it gets rsync'd to a backup server i think
[19:27:33] <ori>	 what did you lose?
[19:28:27] <jzerebec1i>	 ori: nothing yet
[19:29:24] <ori>	 if you are worried about that, there is something you could do
[19:30:05] <ori>	 there are two places where dashboards can be defined -- in the database (which does not keep a log of changes) but also in git
[19:30:12] <ori>	 i added a grafana::dashboard resource
[19:31:26] <ori>	 so if you want to combine the convenience of the web ui with the benefits provided by having the configuration in git, what you could do is create the dashboard using the web interface, tweak it to your heart's content, and then once you're reasonably happy with what you have, export the JSON, paste it into a file, and submit it to the puppet repo
[19:32:17] * yuvipanda would love reviews on https://gerrit.wikimedia.org/r/#/c/253048/
[19:32:21] <ori>	 ( Nemo_bis -- incidentally, this also provides a means, though admittedly a clunky one, for volunteers who don't have privileged access to create dashboards )
[19:32:55] <ori>	 yuvipanda: looking
[19:33:47] <ori>	 jzerebec1i: does that solve your problem?
[19:34:14] <ori>	 i would really love for there to be a mediawiki storage backend for grafana
[19:34:56] <ori>	 that way you get the versioning / diffing / attribution / protecting features of mediawiki
[19:35:33] <ori>	 yuvipanda: is etherpad1001 jessie or ?
[19:35:42] <yuvipanda>	 ori: jessie
[19:36:54] <jzerebec1i>	 ori: nice. yes that is probably better than what I had in mind: there is a cli tool to make a recursive export of an grafana installation. thought about adding it with git commit&&push to a cron.
[19:44:02] <icinga-wm>	 PROBLEM - puppet last run on mw1227 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:44:42] <icinga-wm>	 PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: puppet fail
[20:02:48] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 04-1] etherpad: Add an autorestarter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/253048 (owner: 10Yuvipanda)
[20:10:03] <icinga-wm>	 RECOVERY - puppet last run on mw1227 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[20:12:33] <icinga-wm>	 RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[20:18:59] <wikibugs>	 6operations, 10Traffic, 10Wikimedia-DNS: Consider DNSSec - https://phabricator.wikimedia.org/T26413#1805883 (10Vertigre)  I second this issue. DNSSEC is a major improvement, especially for a site like wikipedia which needs to be constantly on guard for censorship and intimidation.   Regarding the DNS server,...
[20:30:33] <grrrit-wm>	 (03CR) 1020after4: [C: 031] scap: Create wrapper script for master-master rsync [puppet] - 10https://gerrit.wikimedia.org/r/253040 (https://phabricator.wikimedia.org/T117016) (owner: 10BryanDavis)
[20:40:23] <logmsgbot>	 !log reedy@tin Synchronized php-1.27.0-wmf.6/extensions/BounceHandler/: Email header logging on unsubscribe (duration: 00m 53s)
[20:40:29] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:42:52] <icinga-wm>	 PROBLEM - DPKG on mw1153 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:44:11] <icinga-wm>	 PROBLEM - Apache HTTP on mw1153 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.003 second response time
[20:44:52] <icinga-wm>	 PROBLEM - HHVM rendering on mw1153 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.005 second response time
[20:46:41] <icinga-wm>	 PROBLEM - HHVM processes on mw1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm
[20:47:09] <Dereckson>	 Luke081515: go for the 3 clear rights, you'll be able to amend to add suppressredirect later when they know what they want
[20:47:32] <Luke081515>	 Dereckson: Ok :)
[20:49:42] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [5000000.0]
[20:51:22] <Reedy>	 !log Deleted php-1.26wmf11, php-1.26wmf12 and php-1.26wmf13 from mira /srv/mediawiki-staging
[20:51:29] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:54:45] <wikibugs>	 6operations: Cleanup root:root owned files on mira:/srv/mediawiki-staging - https://phabricator.wikimedia.org/T118657#1805929 (10Reedy) 3NEW
[20:55:13] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[20:55:49] <wikibugs>	 6operations: Cleanup root:root owned files on mira:/srv/mediawiki-staging - https://phabricator.wikimedia.org/T118657#1805936 (10Reedy)
[20:57:02] <icinga-wm>	 RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 1.349 second response time
[20:57:32] <icinga-wm>	 RECOVERY - DPKG on mw1153 is OK: All packages OK
[20:57:42] <icinga-wm>	 RECOVERY - HHVM processes on mw1153 is OK: PROCS OK: 11 processes with command name hhvm
[20:57:51] <icinga-wm>	 RECOVERY - HHVM rendering on mw1153 is OK: HTTP OK: HTTP/1.1 200 OK - 64246 bytes in 7.074 second response time
[21:00:53] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [5000000.0]
[21:03:33] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [5000000.0]
[21:07:21] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[21:08:23] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[21:09:07] <grrrit-wm>	 (03PS1) 10Luke081515: Add a rollbacker group for wuuwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253084 (https://phabricator.wikimedia.org/T116270) 
[21:09:40] <Luke081515>	 Dereckson: Done :)
[21:11:41] <grrrit-wm>	 (03CR) 10Dereckson: Add a rollbacker group for wuuwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253084 (https://phabricator.wikimedia.org/T116270) (owner: 10Luke081515)
[21:12:52] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 87.50% of data above the critical threshold [5000000.0]
[21:13:55] <grrrit-wm>	 (03PS2) 10Luke081515: Add a rollbacker group for wuuwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253084 (https://phabricator.wikimedia.org/T116270) 
[21:20:12] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[21:46:21] <grrrit-wm>	 (03CR) 10Steinsplitter: [C: 031] Add a rollbacker group for wuuwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253084 (https://phabricator.wikimedia.org/T116270) (owner: 10Luke081515)
[22:02:03] <grrrit-wm>	 (03PS1) 10Ori.livneh: wmflib: add conflicts() function [puppet] - 10https://gerrit.wikimedia.org/r/253145 
[22:02:05] <grrrit-wm>	 (03PS1) 10Ori.livneh: Add redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/253146 (https://phabricator.wikimedia.org/T100714) 
[22:03:54] <wikibugs>	 6operations: Cleanup root:root owned files on mira:/srv/mediawiki-staging - https://phabricator.wikimedia.org/T118657#1805972 (10ArielGlenn) I've reset all of /srv/mediawiki-staging to mwdeploy:wikidev.  Hope that wasn't overkill.
[22:06:18] <wikibugs>	 6operations, 10Deployment-Systems: Cleanup root:root owned files on mira:/srv/mediawiki-staging - https://phabricator.wikimedia.org/T118657#1805974 (10Krenair)
[22:09:59] <logmsgbot>	 !log reedy@tin Synchronized phpunit.xml: noop test (duration: 01m 03s)
[22:10:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:10:15] <wikibugs>	 6operations, 10Deployment-Systems: Cleanup root:root owned files on mira:/srv/mediawiki-staging - https://phabricator.wikimedia.org/T118657#1805976 (10Reedy) 5Open>3Resolved a:3Reedy LGTM. Thanks!
[22:22:09] <logmsgbot>	 !log reedy@tin Synchronized php-1.26wmf23: remove old localisation cache (duration: 01m 40s)
[22:22:17] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:23:16] <Reedy>	 ValueError: /srv/mediawiki-staging/php-1.26wmf24/vendor/oyejorge/less.php/lib/Less/Version.php has content before opening <?php tag
[22:23:16] <Reedy>	 22:22:53 sync-dir failed: <ValueError> /srv/mediawiki-staging/php-1.26wmf24/vendor/oyejorge/less.php/lib/Less/Version.php has content before opening <?php tag
[22:23:17] <Reedy>	 Sigh
[22:23:32] <apergos>	 unicode thingie?
[22:23:39] <apergos>	 (random guess)
[22:23:42] <Reedy>	 Possibly
[22:23:47] <Reedy>	 It's not visible
[22:23:55] <apergos>	 cat -vte is your friend
[22:24:05] <Reedy>	 reedy@tin:/srv/mediawiki-staging$ cat /srv/mediawiki-staging/php-1.26wmf24/vendor/oyejorge/less.php/lib/Less/Version.php
[22:24:05] <Reedy>	  <?php
[22:24:50] * Reedy fixes
[22:25:03] <Reedy>	 Krenair: any idea how many old versions we actually need to be keeping around?
[22:26:23] <logmsgbot>	 !log reedy@tin Synchronized php-1.26wmf24: remove old localisation cache (duration: 01m 38s)
[22:26:29] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:26:59] <Reedy>	 Hmm, that's not doing it
[22:31:13] <wikibugs>	 6operations, 10Traffic, 10Wikimedia-DNS: Consider DNSSec - https://phabricator.wikimedia.org/T26413#1806020 (10BBlack)
[22:36:24] <legoktm>	 wooah, did that check actually catch something?
[22:36:53] <Reedy>	 legoktm: apparently so
[22:37:05] <logmsgbot>	 !log reedy@tin Synchronized phpunit.xml: (no message) (duration: 00m 30s)
[22:37:11] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:39:28] <legoktm>	 :D
[22:41:05] <apergos>	 it must hav ebeen some other file
[22:41:18] <Reedy>	 I fixed it
[22:41:19] <apergos>	 when I run the function in question directly on the file it does not whine
[22:41:20] <apergos>	 ah
[22:41:20] <Reedy>	 dsh -g mediawiki-installation -M -F 40 -- "sudo -u mwdeploy -- rm -rf /srv/mediawiki/php-STALE_VERSION"
[22:41:22] <Reedy>	 :)
[22:41:27] <Reedy>	 That won't work now...
[22:41:28] <apergos>	 heh
[22:41:31] <Reedy>	 Ugh
[22:41:40] <Reedy>	 trying to cleanup a load of crap laying around, and I can't
[22:42:02] <Reedy>	 does sync-dir not propogate deletions?
[22:42:08] <apergos>	 uh
[22:43:28] <Reedy>	 !log deleted more old l10ncaches from tin staging 1.27 wmf1-4
[22:43:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:43:44] <Reedy>	 Can't run salt as I'm not root either, right?
[22:44:56] <hoo|away>	 Reedy: dsh!
[22:44:58] * hoo|away hides
[22:45:03] <Reedy>	 hoo|away: I tried
[22:45:10] <Reedy>	 Can't agent forward
[22:45:25] <hoo|away>	 You could do it locally
[22:45:34] <hoo|away>	 that would be rather awful, though
[22:45:34] <Reedy>	 What, put my private key on the server?
[22:45:39] <Reedy>	 And get shot by opsen?
[22:45:48] <apergos>	 it has --delete as one of the rsync args
[22:45:52] <hoo|away>	 No, from your machine with a local dsh
[22:45:53] <apergos>	 no salt for you
[22:46:05] <Reedy>	 hoo|away: that sounds awful
[22:46:08] <hoo|away>	 (I'm not really serious on this one)
[22:46:10] <hoo|away>	 yeah :D
[22:46:12] <Reedy>	 my ssh connections take an age to begin with
[22:46:27] <apergos>	 I have totally done ssh loops from my laptop on the cluster
[22:46:33] <apergos>	 sucked big time but it did work
[22:46:54] <Reedy>	 hahaha
[22:47:24] <Reedy>	 486 /etc/dsh/group/mediawiki-installation
[22:48:19] <apergos>	 yeah
[22:48:28] <apergos>	 start the loop go drink coffee, watcha movie, come back
[22:48:42] <apergos>	 but be prepared that it might want to prompt you for keys you haven't accepted
[22:48:47] <Reedy>	 hahaha
[22:49:16] <Reedy>	 All these extra files laying around presumably just make scap longer than it needs to be
[22:49:17] <logmsgbot>	 !log reedy@tin Synchronized php-1.26wmf24/: purge l10n cache... test if deletes are propogated (duration: 01m 12s)
[22:49:23] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:49:28] <apergos>	 they really should be
[22:49:45] <Reedy>	 Nope, delete isn't being propogated :(
[22:50:07] <apergos>	 oh groan
[22:50:17] <apergos>	 well why not, the rsync should be doing it given the args
[22:50:58] <logmsgbot>	 !log reedy@mira Synchronized phpunit.xml: (no message) (duration: 00m 29s)
[22:51:00] <apergos>	 is there a verbose option to the script so you can see what it runs?
[22:51:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:51:13] <Reedy>	 22:50:39 sync-master failed: <CalledProcessError> Command '['sudo', '-u', 'mwdeploy', '-g', 'wikidev', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/cache/l10n/*.cdb', '--exclude=*.swp', '--no-perms', '--verbose', 'mira.codfw.wmnet::common', '/srv/mediawiki-staging']' returned non-zero exit status 23
[22:51:18] <Reedy>	 it's passing --delete
[22:51:25] <apergos>	 right
[22:51:31] <legoktm>	 Reedy: you can use the mwdeploy key to use dsh
[22:51:43] <Reedy>	 legoktm: just gotta load it?
[22:51:46] <legoktm>	 https://wikitech.wikimedia.org/wiki/Dsh
[22:52:29] <Reedy>	 aha
[22:52:30] <Reedy>	 <3
[22:54:15] <apergos>	 what were you hoping to see had been deleted?
[22:54:19] <Reedy>	 !log dsh delete php-1.26wmf16
[22:54:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:54:28] <Reedy>	 apergos: I was trying to delete localisation cache files
[22:54:35] <Reedy>	 for ancient versions
[22:55:11] <Reedy>	 !log dsh delete php-1.26wmf17
[22:55:17] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:56:33] <Reedy>	 !log dsh delete php-1.26wmf18
[22:56:39] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:57:14] <apergos>	 well looks like the dsh solution is getting it done
[22:57:14] <Reedy>	 !log dsh delete php-1.26wmf19
[22:57:19] <Reedy>	 yeah
[22:57:20] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:57:38] <Reedy>	 I'm just clearing the old versions which aren't in staging, but are apparently in /srv/mediawiki on tin
[22:57:45] <Reedy>	 !log dsh delete php-1.26wmf20
[22:57:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:58:12] <Reedy>	 !log dsh delete php-1.26wmf21
[22:58:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:58:24] <apergos>	 so it rsyncs from mira to /srv/mediawiki-staging
[22:58:41] <Reedy>	 yeah, so we can deploy from it
[22:58:43] * apergos is probably too sleepy to think about this straight
[22:58:44] <Reedy>	 or that's the plan
[22:58:47] <Reedy>	 not sure how well it works :)
[22:58:56] <apergos>	 so mira is where you would have to delete things first I guess
[22:59:28] <Reedy>	 it should sync tin -> mira first
[22:59:28] <Reedy>	 22:36:34 Started sync-masters
[22:59:29] <Reedy>	 sync-masters: 100% (ok: 1; fail: 0; left: 0)                                    
[22:59:29] <Reedy>	 22:36:45 Finished sync-masters (duration: 00m 10s)
[22:59:29] <Reedy>	 22:36:45 Started sync-proxies
[22:59:29] <Reedy>	 sync-proxies: 100% (ok: 12; fail: 0; left: 0)                                   
[22:59:31] <Reedy>	 22:36:48 Finished sync-proxies (duration: 00m 02s)
[22:59:33] <Reedy>	 22:36:48 Started sync-apaches
[22:59:36] <Reedy>	 sync-common: 100% (ok: 468; fail: 0; left: 0)                                   
[22:59:38] <Reedy>	 22:37:05 Finished sync-apaches (duration: 00m 17s)
[23:00:07] <Reedy>	 !log dsh delete php-1.26wmf22/cache/l10n/*
[23:00:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:00:23] <apergos>	 well looking at the sync-master, it seeeemed it was syncing from mira to /srv/mw-staging 
[23:00:35] <apergos>	 er loking at the rsync command you pasted earlier I mean
[23:00:51] <apergos>	 yeah definitely too brain dead to carry on a coherent conversation about this
[23:00:58] <apergos>	 1 am.  packing it in for the night
[23:00:59] <Reedy>	 --exclude=**/cache/l10n/*.cdb'
[23:01:07] <Reedy>	 that'll be why it's probably ignoring it :D
[23:01:17] <Reedy>	 !log dsh delete php-1.26wmf23/cache/l10n/*
[23:01:23] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:01:38] <apergos>	 shouldn't sync them but shouldn't delete them either
[23:03:06] <Reedy>	 !log dsh delete php-1.26wmf24/cache/l10n/*
[23:03:12] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:04:27] <Reedy>	 !log dsh delete php-1.27.0-wmf.1/cache/l10n/*
[23:04:32] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:04:48] <Reedy>	 !log dsh delete php-1.27.0-wmf.2/cache/l10n/*
[23:04:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:04:56] <b1ackman>	 !log I just took a massive shit
[23:05:02] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:06:13] <Reedy>	 !log dsh delete php-1.27.0-wmf.3/cache/l10n/*
[23:06:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:06:40] <Reedy>	 !log dsh delete php-1.27.0-wmf.4/cache/l10n/*
[23:06:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:07:12] <b1ackman>	 !log <!-- herp derp
[23:07:17] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:07:45] <Reedy>	 b1ackman: please stop t hat
[23:08:23] <b1ackman>	 Reedy: suck it
[23:10:29] <grrrit-wm>	 (03PS1) 10Reedy: Remove execute bit from php files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253149 
[23:11:45] <b1ackman>	 !log <!-- lolz
[23:11:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:17:01] <the-wub>	 who owns the https://twitter.com/wikimediatech feed? can they delete b1ackman's posts?
[23:17:39] <Reedy>	 I guess ops would have the password
[23:18:08] <Katie>	 Yes.
[23:18:21] <_joe_>	 I am not sure about that
[23:19:15] <_joe_>	 and it's too late for me to walk to my work computer
[23:19:20] <_joe_>	 for this
[23:19:36] <Reedy>	 heh
[23:26:03] <Krenair>	 Reedy, not sure how many we need, sorry
[23:26:30] <Reedy>	 I think one or 2 of the 1.26 folders can go
[23:28:36] <Reedy>	 or all 3
[23:31:30] <Reedy>	 !log dsh delete php-1.26wmf22 23 and 24
[23:31:36] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:31:56] <Reedy>	 I think 2 of the 1.27s can go
[23:32:40] <Reedy>	 I'm gonna leave it at that for tongiht
[23:35:14] <Krenair>	 don't we have scap-purge-l10n-cache?
[23:35:36] <Reedy>	 It's broken
[23:35:40] <Reedy>	 https://phabricator.wikimedia.org/T118659
[23:36:03] <Reedy>	 I've already deleted everything less than 1.27 wmf4 via ds
[23:36:04] <Reedy>	 h
[23:37:12] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[23:39:37] <Krenair>	 Reedy, https://phabricator.wikimedia.org/diffusion/MSCA/browse/master/scap/ssh.py;HEAD$99
[23:39:37] <Krenair>	 should probably be len(hosts or [])
[23:39:43] <Krenair>	 like l95
[23:40:04] <Reedy>	 mmm
[23:40:09] <Reedy>	 looks about right
[23:41:33] <Reedy>	 Or use self._hosts
[23:42:30] <Reedy>	 considering we've already done the error handling
[23:43:27] <Krenair>	 yeah
[23:43:49] * Reedy makes a patch
[23:43:49] <Reedy>	 thanks
[23:43:59] <Krenair>	 I would contribute but they moved scap to phabricator
[23:44:13] <Reedy>	 oh, have they?
[23:44:42] <Reedy>	 not got arcanist and stuff setup
[23:47:25] <Reedy>	 Permission denied (publickey,keyboard-interactive).
[23:47:25] <Reedy>	 fatal: Could not read from remote repository.
[23:47:26] <Reedy>	 FFS
[23:47:39] <Reedy>	 oh, need to upload a public ke
[23:47:39] <Reedy>	 y
[23:48:12] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [5000000.0]
[23:50:51] <Reedy>	 https://phabricator.wikimedia.org/differential/diff/create/
[23:50:55] <Reedy>	 Do it via the web interface
[23:51:35] <Reedy>	 sweet
[23:51:37] <Reedy>	 that broke it
[23:53:42] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[23:57:13] <Reedy>	 Some linters failed:
[23:57:13] <Reedy>	     - ArcanistMissingLinterException: Unable to locate binary "pyflakes" to run linter ArcanistPyFlakesLinter. You may need to install the binary, or adjust your linter configuration.
[23:57:13] <Reedy>	       TO INSTALL: Install pyflakes with `pip install pyflakes`.
[23:57:13] <Reedy>	     - ArcanistMissingLinterException: Unable to locate binary "pep8" to run linter ArcanistPEP8Linter. You may need to install the binary, or adjust your linter configuration.
[23:57:13] <Reedy>	       TO INSTALL: Install PEP8 using `easy_install pep8`.
[23:57:18] <Reedy>	 Krenair: This is like a rabbit hole
[23:58:49] <Krenair>	 heh
[23:58:58] <Reedy>	 install them via apt
[23:58:59] <Reedy>	 lint ok
[23:59:08] <Reedy>	 nosetests --with-xunit --xunit-file='/tmp/9zyzpui6myo0g0kg/20502-m5dhmQ' --with-coverage --cover-xml --cover-xml-file='/tmp/331plkkzks8wkc4g/20502-oiBHIQ' '/home/reedy/scap/tests/scap/test_ssh.py'
[23:59:16] <Reedy>	 And we complained about git-review
[23:59:34] <SPF|Cloud>	 git-review is life