[01:04:04] <wikibugs>	 06Operations, 10Wikimedia-Site-requests: Default language of votewiki set to Persian (fa) for anonymous users. Change to English (en) - https://phabricator.wikimedia.org/T148352#2720797 (10Arseny1992)
[01:18:14] <wikibugs>	 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 03Fundraising Sprint Stirring The Pot, and 4 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2720813 (10AndyRussG) Please ignore [[ https://phabricator.wikimedia.org/T144952#2717891 | this earlier c...
[01:27:47] <halfak>	 Is there a way I can review the logs of the requests that hit varnish for "ores.wikimedia.org"?
[01:29:18] <Krenair>	 ores... misc-web
[01:29:33] <Krenair>	 needs ops rights to look at varnishlog there
[01:30:08] <Krenair>	 don't think anything misc goes to analytics
[01:30:34] <halfak>	 Darn.  We're getting hammered by a single IP right now.  And I'd like to shut down requests from that IP
[01:31:02] <Zppix|mobile>	 ??
[01:31:18] <Krenair>	 Yes, Zppix|mobile?
[01:31:41] <Zppix|mobile>	 Im responding to halfak message i dont quite understand
[01:31:54] <Krenair>	 why are you responding to it if you don't understand it?
[01:32:06] <Krenair>	 halfak, maybe ask on the ops list
[01:32:48] <Zppix|mobile>	 Thats why i was responding to learn more
[01:33:31] <halfak>	 Thanks Krenair.  Emailing ops.
[01:34:17] <Krenair>	 Zppix|mobile, I don't think any old response will work for that, you need to ask specific questions
[01:34:29] <arseny92>	 Krenair , sonce when a config change is not ops if only ops/wmf can commit to ops/mw/config?
[01:34:38] <arseny92>	 snce*
[01:34:39] <Krenair>	 and seeing as it was just 9 seconds after you joined, looking at the context in the channel logs is also probably a good idea
[01:34:46] <arseny92>	 since*
[01:34:52] <Krenair>	 arseny92, not only ops/wmf can commit to ops/mw/config?
[01:35:23] <Zppix|mobile>	 I can commit to config and im not an op arseny92
[01:36:10] <Krenair>	 I think he meant merging into the main repo on the gerrit server
[01:36:16] <arseny92>	 yes
[01:36:30] <Krenair>	 If you're used to the old SVN system, a 'commit' meant doing pretty much the equivalent of that
[01:36:43] <Krenair>	 We moved away from SVN 4 and a half years ago
[01:37:31] <Zppix|mobile>	 ^true
[01:39:53] <arseny92>	 so that means now anyone can try to fix the config and propose changes for deploy if commits look good and pass review?
[01:40:15] <Krenair>	 anyone can propose changes for deploy without them looking good
[01:40:28] <Krenair>	 once they pass review, then they can be accepted
[01:40:55] <Krenair>	 the wmf-deployment group can merge changes in gerrit and the deployment group defined in the puppet admin module gets to actually push them out to the servers
[01:41:22] <Krenair>	 (theoretically these groups would be identical, but realistically it's a mess)
[01:42:04] <Krenair>	 the person who approved the temporary change of language for that wiki was neither wmf or ops
[01:44:07] <Krenair>	 wmf do not by default inherit deployment rights, ops do
[01:48:09] <Krenair>	 temporary things have a habit of accidentally becoming permanent
[01:50:08] <grrrit-wm>	 (03PS1) 10Huji: Reverting I75cf5954ac8d68e607975e7b1b77170915ffe4d1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352) 
[01:50:09] <Krenair>	 or at least, there indefinitely until someone notices
[01:55:34] <Krenair>	 halfak, actually, I've been thinking
[01:55:40] <Krenair>	 How do you know it's a single IP?
[02:02:40] <Krenair>	 arseny92, I'm not sure about "otherwise it's considered as rewriting history as if like things didn't happen" but the rest sounds good
[02:03:55] <arseny92>	 i mean Huji's revert removes the mention of the other task from //comment 
[02:04:08] <Krenair>	 we prefer commit messages that describe the change being made without relying on external references
[02:04:25] <Krenair>	 at least, I do
[02:04:33] <Krenair>	 but I think many would agree
[02:05:15] <arseny92>	 and since this task is now around, the comment would be needing to reference both tasks
[02:07:38] <Krenair>	 yes, references are okay
[02:07:46] <Krenair>	 but I wouldn't rely on them in the first line at the top of the commit message
[02:07:56] <Krenair>	 relevant task lists are good even
[02:09:13] <arseny92>	 welp i have git around, are other users able to amend existing patchsets, or a new patchset is needed
[02:09:56] <Krenair>	 other users can amend existing patches
[02:10:11] <Krenair>	 both using git locally and, more recently, by using the gerrit web UI
[02:12:25] <arseny92>	 k, then im going to try and do my first commit 
[02:12:54] <Krenair>	 Okay
[02:13:32] <Krenair>	 I'm going to bed now, but I'll be back in like 12 hours
[02:13:36] <Krenair>	 good luck
[02:21:37] <logmsgbot>	 !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.22) (duration: 07m 34s)
[02:21:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:33] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Oct 17 02:26:33 UTC 2016 (duration 4m 56s)
[02:26:40] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:07:58] <grrrit-wm>	 (03PS2) 10Huji: Revert "Change voteWiki to fa language temporarily" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352) 
[03:59:20] <grrrit-wm>	 (03PS3) 10Arseny1992: Reverting votewiki back to en Change was meant to be temporary. fawiki elections have been since over [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352) (owner: 10Huji)
[06:41:08] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: Revert "ores: Increase capacity" [puppet] - 10https://gerrit.wikimedia.org/r/316294 
[06:41:18] <_joe_>	 Amir1: ^^
[06:41:29] <_joe_>	 I need to ease the load on scb
[06:41:37] <Amir1>	 _joe_: hey, okay
[06:41:46] <_joe_>	 then we can look at blocking whoever is doing those massive requests
[06:41:59] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Revert "ores: Increase capacity" [puppet] - 10https://gerrit.wikimedia.org/r/316294 (owner: 10Giuseppe Lavagetto)
[06:42:31] <Amir1>	 awesome, thanks
[06:43:42] <_joe_>	 Amir1: next time, page opsens via phone
[06:43:49] <_joe_>	 they're on the contact list on officewiki
[06:44:08] <Amir1>	 I don't have access to the officewiki
[06:44:45] <_joe_>	 ha
[06:45:26] <Amir1>	 I don't have account there, Do you want me to have one? I don't know if that's possible 
[06:45:42] <_joe_>	 heh, no idea
[06:46:18] <_joe_>	 so, aaron was talking about the "enwiki wp10 model"
[06:46:26] <_joe_>	 what's the url structure for that
[06:48:14] <Amir1>	 _joe_: there are several url strcutures 
[06:48:19] <Amir1>	 *structures
[06:48:27] <Amir1>	 Is it possible to block based on IP?
[06:49:04] <Amir1>	 Here's an example of wp10 model request on enwiki
[06:49:05] <Amir1>	 https://ores.wikimedia.org/v2/scores/enwiki/?models=wp10&revids=724030089
[06:49:27] <Amir1>	 Here's another structure: https://ores.wikimedia.org/v2/scores/enwiki/wp10/76655
[06:54:29] <_joe_>	 Amir1: so requests to ores come from the jobrunners, mostly
[06:54:29] <grrrit-wm>	 (03PS1) 10Urbanecm: Show changes from last 14 days in watchlist in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) 
[06:54:44] <_joe_>	 who go through the public IP, for some reason
[06:54:56] <grrrit-wm>	 (03PS2) 10Urbanecm: Show changes from last 14 days in watchlist in cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327) 
[06:55:22] <Amir1>	 are you sure? the extension job runner does not seem to do it: https://grafana.wikimedia.org/dashboard/db/ores-extension
[06:55:51] <_joe_>	 what do you mean?
[06:56:19] <_joe_>	 you mean the amount of jobs is limited
[06:56:54] <Amir1>	 yeah and it did not change 
[06:57:08] <Amir1>	 let me check it in detail
[06:57:46] <_joe_>	 where does ores log?
[06:57:50] <_joe_>	 only to logstash?
[06:58:54] <Amir1>	 the ores service or the ores extension (the job runner)?
[06:59:07] <_joe_>	 the ores service
[06:59:28] <Amir1>	 I don't think it goes to logstash 
[06:59:55] <Amir1>	 the extension, yes,
[07:00:00] <_joe_>	  /srv/log
[07:00:01] <_joe_>	 damn
[07:00:18] <Amir1>	   /srv/deployment/logs/ores
[07:02:24] <wikibugs>	 06Operations, 10Traffic, 07Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2720949 (10faidon) Yeah — esams by itself gets enough (and diverse enough) traffic that it should suffice.
[07:14:15] <wikibugs>	 06Operations, 06Labs, 10Striker, 07LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048#2720964 (10MoritzMuehlenhoff) I think we should use the same wmf-user.schema file across the labs and corp servers, but introduce a separate object class for sto...
[07:15:35] <marostegui>	 !log Dropping memory tables hitcounter, _counters from S7 - T132837
[07:15:37] <stashbot>	 T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837
[07:15:41] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:17:53] <_joe_>	 Amir1: it's taking me some time as varnishlog is different in varnish 4
[07:18:44] <Amir1>	 _joe_: one thing, AFAIK varnish does not cache ores requests because they are cached in redis 
[07:19:26] <Amir1>	 akosiaris know better
[07:19:30] <Amir1>	 *knows
[07:23:14] <akosiaris>	 Amir1: that's true however only for the requests that send the proper headers
[07:23:17] <_joe_>	 Amir1: so my hypothesis, from what I see
[07:23:26] <_joe_>	 is that something is doing a massive update of wikidata
[07:23:37] <_joe_>	 and every single edit goes to ores via the jobqueue
[07:23:47] <akosiaris>	 changeprop, not the jobqueue
[07:23:55] <akosiaris>	 ores gets updated via changeprop
[07:24:04] <_joe_>	 akosiaris: well I see requests coming from jobrunners
[07:24:05] <_joe_>	 so...
[07:24:38] <akosiaris>	 hmm, that shouldn't happen AFAIK
[07:24:49] <_joe_>	 akosiaris: well, evidence tells us otherwise
[07:24:57] <_joe_>	 also, it goes through the public endpoint
[07:24:59] <_joe_>	 lol
[07:25:23] <_joe_>	 akosiaris: varnishlog -n frontend  -q 'ReqHeader eq "Host: ores.wikimedia.org"' | fgrep X-Client-IP on cache-misc
[07:25:32] <akosiaris>	 yeah, that's a pattern I hate (using the public endpoint)
[07:26:44] <_joe_>	 wow we had some spikes on the jobqueue in the last few days
[07:26:46] <Amir1>	 akosiaris _joe_: The changeprop is for precaching
[07:26:59] <Amir1>	 the job runner is for the extension
[07:27:07] <Amir1>	 https://grafana.wikimedia.org/dashboard/db/ores-extension
[07:27:12] <_joe_>	 Amir1: what does trigger a job?
[07:27:14] <Amir1>	 It seems it's the jobrunner
[07:27:20] <Amir1>	 a non-bot edit
[07:27:49] <Amir1>	 so it's not related to enwiki wp10 models
[07:28:20] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2720992 (10MoritzMuehlenhoff) The blocking of /sys/fs is currently hard-coded in the fs_proc_sys_dev_boot() function. We could temporarily run...
[07:28:50] <akosiaris>	 there aren't any real requests coming from non-jobrunner IPs from what I see
[07:28:51] <Amir1>	 hmm, so what we can do is to tell that user to slow down
[07:29:11] <akosiaris>	 the wikidata editing bot you mean ?
[07:29:17] <Amir1>	 akosiaris: they are using the public endpoint
[07:29:18] * akosiaris assumes it's a bot
[07:29:22] <_joe_>	 Amir1: I see a lot of /scores/wikidatawiki/?models=damaging&revids=<REV_ID>&precache=true
[07:29:39] <Amir1>	 it's a bot but probably doesn't have the bot flag
[07:30:36] <Amir1>	 _joe_: wikidata is always a huge proportion of precaching edits
[07:33:10] <_joe_>	 Amir1: is ores evaluating edits to talk and user pages too?
[07:33:15] <_joe_>	 that seems absurd.
[07:33:21] <Amir1>	 not in wikidata
[07:33:28] <_joe_>	 in enwiki it does
[07:33:37] <Amir1>	 but for other wikis, yes, because vandalism can happen in them too
[07:35:07] <akosiaris>	 so, the jobqueue size had a spike around the time CPU usage went up on SCB
[07:35:15] <akosiaris>	 and that's all wikidata you say ?
[07:35:47] <akosiaris>	 the jobqueue size is getting back to "normal" size now
[07:36:27] <Amir1>	 I need to check who was it
[07:37:02] <_joe_>	 akosiaris: I guess it's some massive edit induced by some label change
[07:37:32] <Amir1>	 _joe_: can you get the rev ids?
[07:37:47] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Investigate failing TIFFs - https://phabricator.wikimedia.org/T148360#2721010 (10Gilles)
[07:37:53] <_joe_>	 Amir1: yes
[07:38:31] <Amir1>	 awesome, it would be nice to have it
[07:38:58] <_joe_>	 Amir1: I can intercept them now, I mean
[07:39:11] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Investigate failing SVGs - https://phabricator.wikimedia.org/T148361#2721028 (10Gilles)
[07:40:16] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Investigate failing PNGs - https://phabricator.wikimedia.org/T148362#2721043 (10Gilles)
[07:41:22] <_joe_>	 Amir1: do you have a phab task about this?
[07:41:22] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Investigate failing video - https://phabricator.wikimedia.org/T148363#2721059 (10Gilles)
[07:41:33] <Amir1>	 Okay, I get some not yet
[07:41:37] <Amir1>	 *not yet
[07:42:12] <_joe_>	 Amir1: should we block this user? or ask him to stop nicely first?
[07:42:52] <Amir1>	 _joe_: I blocked him
[07:43:00] <Amir1>	 I'll go post a message on his talk page
[07:43:04] <_joe_>	 Amir1: oh ok
[07:43:05] <Amir1>	 and tell nicely that
[07:43:36] <_joe_>	 just "mark yourself as a bot" is enough 
[07:43:57] <grrrit-wm>	 (03PS1) 10Marostegui: db-equiad: Depool db1064 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316296 (https://phabricator.wikimedia.org/T147305) 
[07:43:59] <akosiaris>	 cool
[07:47:25] <elukey>	 nice varnishlog query _joe_ ;)
[07:49:32] <_joe_>	 Amir1: can't we just exclude wikidata from ores and unblock the user?
[07:49:44] <_joe_>	 I think it's a much better option
[07:49:50] <marostegui>	 !log Stopping MySQL in db2057.codfw.wmnet to use it to clone another server
[07:49:51] <_joe_>	 we simply can't handle the load atm
[07:49:56] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:50:33] <_joe_>	 Amir1: I'm preparing the patch
[07:51:05] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Investigate failing video - https://phabricator.wikimedia.org/T148363#2721089 (10Gilles) ``` Oct 17 07:20:53 thumbor1001 thumbor@8830[127650]: http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.1c/1/1c/Accidents_will_happen_William-H.-Watson-Un...
[07:52:44] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: Add interface::add_ip6_mapped to SC clusters [puppet] - 10https://gerrit.wikimedia.org/r/316298 
[07:53:39] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: conftool: Remove the old sca physical boxes from zotero backends [puppet] - 10https://gerrit.wikimedia.org/r/315926 
[07:53:53] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] conftool: Remove the old sca physical boxes from zotero backends [puppet] - 10https://gerrit.wikimedia.org/r/315926 (owner: 10Alexandros Kosiaris)
[07:54:07] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: Add interface::add_ip6_mapped to SC clusters [puppet] - 10https://gerrit.wikimedia.org/r/316298 
[07:54:11] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add interface::add_ip6_mapped to SC clusters [puppet] - 10https://gerrit.wikimedia.org/r/316298 (owner: 10Alexandros Kosiaris)
[07:57:18] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Investigate failing SVGs - https://phabricator.wikimedia.org/T148361#2721100 (10Gilles) ``` Oct 17 07:04:44 thumbor1001 thumbor@8837[74679]: 2016-10-17 07:04:44,645 8837 thumbor:ERROR ERROR: Traceback (most recent call last): Oct 17 07:04:44 thumbor1001 thumbor@88...
[07:59:04] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: Temporarily disable Ores on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/316299 
[07:59:10] <_joe_>	 Amir1: ^^
[07:59:26] <Amir1>	 _joe_: why?
[07:59:40] <Amir1>	 It's back to normal as far as I can tell
[07:59:44] <_joe_>	 Amir1: we can't really block a real user because a service is under capacity
[07:59:57] <_joe_>	 Amir1: yeah but we need to unblock that editor!
[08:00:18] <Amir1>	 I agree but the user is doing more than 100 edits per minute
[08:00:18] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Investigate failing SVGs - https://phabricator.wikimedia.org/T148361#2721115 (10Gilles) This one looks like a different issue:   ``` Oct 17 07:06:47 thumbor1001 thumbor@8825[71257]: 2016-10-17 07:06:47,721 8825 thumbor:ERROR ERROR: Traceback (most recent call last...
[08:00:24] <Amir1>	 it should be marked as a bot
[08:00:31] <Amir1>	 I will unblock him ASAP
[08:00:32] <_joe_>	 Amir1: ok but while we tell him
[08:00:47] <_joe_>	 we can just disable ores on wikidata while we add capacity
[08:00:58] <_joe_>	 a thing akosiaris is basically doing now
[08:01:48] <Amir1>	 I still don't understand, We have enough capacity to handle wikidata
[08:01:50] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Investigate failing SVGs - https://phabricator.wikimedia.org/T148361#2721132 (10Gilles) The other ones are the same missing source_file issue.
[08:02:04] <_joe_>	 Amir1: even while that user is not marked as a bot?
[08:02:06] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Update to 4.4.25 [debs/linux44] - 10https://gerrit.wikimedia.org/r/316300 
[08:02:07] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: SVG engine: IOError: [Errno 2] No such file or directory: '/srv/thumbor/tmp/thumbor@8811/tmp04oo68/source_file' - https://phabricator.wikimedia.org/T148361#2721133 (10Gilles)
[08:02:22] <grrrit-wm>	 (03CR) 10Elukey: [C: 04-1] "Hey Alex," [puppet] - 10https://gerrit.wikimedia.org/r/316217 (owner: 10Elukey)
[08:02:28] <Amir1>	 That kind of behavior that the user was doing can disrupt everything including the wikidata itself
[08:02:36] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Missing original for video 500s instead of 404ing - https://phabricator.wikimedia.org/T148363#2721134 (10Gilles)
[08:02:39] <Amir1>	 that's why we have throttle for bots 
[08:03:07] <_joe_>	 Amir1: well it's not your call to make
[08:03:16] <_joe_>	 in my not so humble opinion
[08:03:29] <Amir1>	 I'm referring the Wikidata policy https://www.wikidata.org/wiki/Wikidata:Bots
[08:03:36] <Amir1>	 not my opinion 
[08:03:58] <_joe_>	 it's your opinion that what he's doing can disrupt wikidata
[08:04:08] <_joe_>	 because AFAICS, wikidata is very fine
[08:04:11] <_joe_>	 ores isn't 
[08:04:40] <_joe_>	 please note he's not using a bot
[08:04:49] <_joe_>	 it's assisted editing, it's not a program
[08:04:59] <_joe_>	 as you can see on his talk page
[08:05:30] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Investigate failing PNGs - https://phabricator.wikimedia.org/T148362#2721151 (10Gilles) Recon_Battalion_hike.png is the same bug as T148361 but with the vips temp file path:  ``` Oct 17 07:03:02 thumbor1001 thumbor@8825[71257]: CommandError: (['/usr/bin/vips', 'sh...
[08:05:33] <jynus>	 es2014 network issues?
[08:05:58] <akosiaris>	 Amir1: both you and _joe_ do have a point
[08:06:06] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Investigate failing PNGs - https://phabricator.wikimedia.org/T148362#2721153 (10Gilles) Same for the other ones
[08:06:07] <akosiaris>	 clearly the user is not harming wikidata
[08:06:12] <Amir1>	 I'm saying that since we had similar incidents before now Wikidata has a policy prohibiting this speed 
[08:06:16] <_joe_>	 I'm not saying he shouldn't get a bot account
[08:06:28] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: SVG engine: IOError: [Errno 2] No such file or directory: '/srv/thumbor/tmp/thumbor@8811/tmp04oo68/source_file' - https://phabricator.wikimedia.org/T148361#2721028 (10Gilles)
[08:06:30] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Investigate failing PNGs - https://phabricator.wikimedia.org/T148362#2721157 (10Gilles)
[08:07:06] <jynus>	 I think it may have crashed completely
[08:08:16] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: SVG and VIPS engines expect temp folder to stay a while when it doesn't - https://phabricator.wikimedia.org/T148361#2721182 (10Gilles)
[08:09:12] <Amir1>	 Also Wikidata was not very fine https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch
[08:09:18] <jynus>	 Only thing I am getting on the console is [8181711.222536]
[08:09:19] <Amir1>	 He also caused dispatch lag
[08:10:18] <wikibugs>	 06Operations, 10Traffic, 10netops, 13Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2721194 (10faidon) I cleaned up the old, stale backup routes and re-added backup routes for all of the subnets mentioned above across core routers in all datacenter...
[08:10:20] <jynus>	 !log disabling notifications of es2014 before it pages
[08:10:26] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:10:28] <akosiaris>	 jynus: already paged
[08:10:34] <akosiaris>	 jynus: need any help btw ?
[08:10:41] <jynus>	 it hard-crashed
[08:10:47] <apergos>	 I guess I can ignore that page
[08:10:52] <jynus>	 no
[08:11:04] <jynus>	 it just crashed live
[08:11:09] <apergos>	 ouch
[08:11:13] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: SVG and VIPS engines expect temp folder to stay a while when it doesn't - https://phabricator.wikimedia.org/T148361#2721196 (10Gilles)
[08:11:14] <Amir1>	 I'm saying the speed he is doing is not acceptable even for bots: 142 edits per minute is too much 
[08:11:16] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor: Investigate failing TIFFs - https://phabricator.wikimedia.org/T148360#2721198 (10Gilles)
[08:11:22] <jynus>	 but still responds to ping?
[08:11:45] <_joe_>	 Amir1: heh, ok, to my eye 142 edits per second should be too much
[08:11:47] <_joe_>	 not per minute
[08:11:54] <_joe_>	 we're doing an abysmal job here
[08:11:56] <apergos>	 Amir1: out of curiosity what is the limit per minute for bots edits? 
[08:11:59] <paravoid>	 where's icinga-wm?
[08:12:00] <jynus>	 is it back?
[08:12:01] <paravoid>	 I'll restart
[08:12:02] <Amir1>	 60
[08:12:07] <apergos>	 gtk
[08:12:09] <paravoid>	 or is it you, akosiaris?
[08:12:11] <jynus>	 I think the controller temporarily failed
[08:12:17] <akosiaris>	 paravoid: no, not me
[08:12:35] <akosiaris>	 actually that's not fully true
[08:12:38] <akosiaris>	 the 60 limit
[08:12:55] <akosiaris>	 even this page https://www.wikidata.org/wiki/Wikidata:Bots
[08:13:05] <akosiaris>	 does not clearly say "60" and that's it
[08:13:05] <jynus>	 and now it is up
[08:13:20] <jynus>	 I am going to check the hardware logs
[08:13:21] <akosiaris>	 it does say: The bot operator should do a test run of between 50 and 250 edits, so that the community can observe that the bot is working correctly. 
[08:13:33] <akosiaris>	 which implies that 120 might very well be acceptable
[08:14:38] <Amir1>	 akosiaris: that's for the trial period and not the speed per minute
[08:14:39] <marostegui>	 Pfff again?
[08:14:46] <Amir1>	 let me find the discussion 
[08:14:47] <marostegui>	 Last time it was es2015 no?
[08:15:22] <jynus>	 I didn't reboot, it froze due to filesystem "08:15:02 up 94 days"
[08:15:25] <akosiaris>	 Amir1: we've never set a limit in our API from what I know. In fact in https://www.mediawiki.org/wiki/API:Etiquette#Request_limit
[08:15:26] <elukey>	 !log upgrading nodejs on aqs100[56]
[08:15:32] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:15:33] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.25 [debs/linux44] - 10https://gerrit.wikimedia.org/r/316300 (owner: 10Muehlenhoff)
[08:15:40] <akosiaris>	 it clearly says "There is no hard and fast limit on read requests, but we ask that you be considerate and try not to take a site down."
[08:15:50] <Amir1>	 https://www.wikidata.org/wiki/Wikidata_talk:Bots/Archive/2013#Bot_speed
[08:15:55] <jynus>	 "megaraid_sas 0000:03:00.0: pending commands remain after waiting, will reset adapter scsi0"
[08:16:11] <_joe_>	 !log restarting hhvm on mw1175, stuck in HPHP::FastCGISession::blockingWriteStdOut after OOM
[08:16:16] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:16:30] <Amir1>	 akosiaris: yeah, it's in the technical perspective and I agree but in the community they have other considerations too
[08:16:48] <Amir1>	 like flooding the recentchanges
[08:17:12] <Amir1>	 also dispatch stat is the bottleneck
[08:17:18] <Amir1>	 https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch
[08:17:28] <Amir1>	 which as you can see it should not be more than zero
[08:17:52] <Amir1>	 do you want me to grab someone from Wikidata team to explain it to you
[08:18:05] <akosiaris>	 no need
[08:18:10] <akosiaris>	 I can understand it just fine
[08:18:25] <Amir1>	 (I'm in that team. My bot has more than 23 M edits there, I'm very well aware of the limits there)
[08:19:20] <apergos>	 that limit should be on the bots page you linked earlier, so everyone knows it
[08:20:02] <_joe_>	 Amir1: from my perspective, we are blocking a big contributor to our projects because a non-critical service is having load issues. So if the block is wikidata-community related, that's ok
[08:20:04] <akosiaris>	 yeah, it's not a clearly communicated limit... then again we have never communicated a clear limit on purpose, AFAIK at least
[08:20:17] <_joe_>	 if that's for ORES, we should just disable ORES on wikidata while we add capacity
[08:20:24] <_joe_>	 that's my line of reasoning
[08:20:24] <jynus>	 mediawiki user level blocks shouldn't be discussed here, please use the wikis for that
[08:20:47] <Amir1>	 We can go to #wikidata
[08:20:51] <_joe_>	 jynus: we're discussing merging my patch for disabling ORES on wikidata
[08:20:52] <jynus>	 we can discuss ores capacity
[08:20:55] <jynus>	 ok
[08:21:03] <jynus>	 just stay on topic
[08:21:10] <_joe_>	 and we're discussing an emergency measure which was blocking a user
[08:21:22] <marostegui>	 jynus: I am going thru my notes of the issues with the megaraid_sas drivers, to see if I have seen that error in my previous jobs
[08:21:22] <Amir1>	 but overall, I'm saying I'm not blocking him because of the ORES, I block because of WD:Bots policy 
[08:21:24] <_joe_>	 so, thanks for the police work, but wasn't needed
[08:21:54] <paravoid>	 AlexZ: don't do that again
[08:21:58] <jynus>	 can I get the times for the problematic workload?
[08:22:31] <jynus>	 database is usually the bottleneck when wikidata goes crazy
[08:23:38] <Amir1>	 jynus: it's clear in https://grafana.wikimedia.org/dashboard/db/ores?from=now-24h&to=now&panelId=15&fullscreen
[08:24:32] <jynus>	 I am probably the highest critic of wikidata, mostly because it has massive consequences
[08:25:02] <jynus>	 but in this case, I have to agree with joe's procedure- I see no problems on the main wikidata infrastructure
[08:25:42] <jynus>	 inserts have not appreciably increased
[08:25:45] <Amir1>	 The dispatch lag was crazy during these edits, so no. It wasn't just ores: https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch
[08:26:00] <AlexZ>	 paravoid: I enabled that after consulting with James Alexander and he left a note in this channel about logmsgbot being abused.
[08:26:42] <paravoid>	 AlexZ: it prevents the icinga bot from joining this channel, which is fairly critical for us
[08:29:00] <AlexZ>	 paravoid: alright, but it should have gotten removed a bit earlier for sure. I wasn't aware that icinga-wm got disconnected
[08:29:18] <paravoid>	 it did on a netsplit
[08:29:24] <paravoid>	 and then couldn't join again
[17:14:41] <icinga-wm>	 RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.400 second response time
[17:16:57] <_joe_>	 mark: I'm merging the pybal change
[17:17:18] <gehel>	 Note: WDQS deployment is on hold pending issues with finding the binaries... It will resume when things are sorted out (nothing has been deployed / changed yet)
[17:20:07] <_joe_>	 !log restarting lvs on lvs1003/1006 for the api change
[17:20:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:20:48] <grrrit-wm>	 (03CR) 10Yurik: [C: 031] "not that we would ever use non-utf8, but sure, looks good :)" [puppet] - 10https://gerrit.wikimedia.org/r/316342 (owner: 10Gehel)
[17:22:38] <grrrit-wm>	 (03PS2) 10Bearloga: Update R and C++-related stats puppet configs [puppet] - 10https://gerrit.wikimedia.org/r/315885 (https://phabricator.wikimedia.org/T147682) 
[17:23:14] <SMalyshev>	 gehel: this is not right, should be from oct 13... not sure why it's not there
[17:23:32] <icinga-wm>	 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:24:22] <icinga-wm>	 PROBLEM - Check size of conntrack table on kafka1018 is CRITICAL: CRITICAL: nf_conntrack is 99 % full
[17:27:02] <icinga-wm>	 RECOVERY - Check size of conntrack table on kafka1018 is OK: OK: nf_conntrack is 53 % full
[17:28:13] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0]
[17:28:35] <icinga-wm>	 PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/midom]
[17:28:38] <grrrit-wm>	 (03CR) 10Dzahn: [C: 031] "with g++ now being 4.8, this looks good" [puppet] - 10https://gerrit.wikimedia.org/r/315885 (https://phabricator.wikimedia.org/T147682) (owner: 10Bearloga)
[17:29:35] <grrrit-wm>	 (03PS1) 10Jgreen: switch payments-listener from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/316391 
[17:30:04] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:31:04] <Dereckson>	 next
[17:31:11] <Dereckson>	 jouncebot: next
[17:31:11] <jouncebot>	 In 0 hour(s) and 28 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161017T1800)
[17:31:34] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 653 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3132103 keys - replication_delay is 653
[17:31:47] <grrrit-wm>	 (03CR) 10Jgreen: [C: 032] switch payments-listener from eqiad to codfw [dns] - 10https://gerrit.wikimedia.org/r/316391 (owner: 10Jgreen)
[17:33:20] <volans>	 paravoid: maybe it's worth skipping/postponing SWAT in 27 minutes to avoid adding additional variables
[17:36:43] <Zppix|mobile>	 Whats up with api errors im around to collect reports
[17:37:14] <apergos>	 opsen still working on it
[17:37:25] <Zppix|mobile>	 Is there a task
[17:37:37] <paladox>	 I haven't seen one.
[17:37:54] <Zppix|mobile>	 Give me some details ill open one
[17:38:15] <paladox>	 Zppix|mobile api errors are affecting saving watchlists
[17:38:18] <paladox>	 or removing them
[17:38:30] <Zppix|mobile>	 Ok
[17:38:33] <paladox>	 I have no idea if it also affects saving pages though
[17:38:40] <_joe_>	 paladox: still going on?
[17:39:07] <paladox>	 _joe_ i think so, but not sure.
[17:39:44] <paladox>	 Saving a watchlist then removing it works for me, but the errors above suggest otherwise?
[17:39:57] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:40:08] <paladox>	 Zppix|mobile it also affects anything that uses the API
[17:40:10] <paladox>	 like bots
[17:41:32] <wikibugs>	 06Operations, 10MediaWiki-API: Api cluster issues - https://phabricator.wikimedia.org/T148448#2723320 (10Zppix)
[17:45:08] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3120724 keys - replication_delay is 0
[17:49:38] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1167 is CRITICAL: Host mw1167 is not in mediawiki-installation dsh group
[17:49:49] <wikibugs>	 06Operations, 10MediaWiki-API: Api cluster issues - https://phabricator.wikimedia.org/T148448#2723336 (10Paladox) Should we triage this as high or unbreak due to it affecting production Wikimedia sites?
[17:50:14] <icinga-wm>	 RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:50:46] <apergos>	 paladox: people are working on it right now
[17:51:01] <apergos>	 have been for the past.. two hours? 
[17:51:06] <paladox>	 Ok
[17:51:15] <paladox>	 Yep, i mean for history.
[17:51:24] <paladox>	 I know that ops have been working on it.
[17:52:19] <wikibugs>	 06Operations, 10MediaWiki-API: Api cluster issues - https://phabricator.wikimedia.org/T148448#2723367 (10Anomie)
[17:52:20] <icinga-wm>	 PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:52:21] <wikibugs>	 06Operations, 06Labs, 10Tool-Labs, 10Traffic, and 2 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#2723365 (10Andrew) a:05Andrew>03None
[17:52:24] <wikibugs>	 06Operations, 10Traffic: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451#2723370 (10Anomie)
[17:54:56] <icinga-wm>	 RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 81590 bytes in 0.539 second response time
[17:55:33] <icinga-wm>	 PROBLEM - HHVM rendering on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:58:13] <icinga-wm>	 RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 81591 bytes in 5.984 second response time
[18:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161017T1800). Please do the needful.
[18:00:04] <jouncebot>	 dcausse and kart_: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[18:02:02] <Dereckson>	 dcausse: kart_: we'll wait a little bit, until volans and paravoid are fine with the current cluster state
[18:02:17] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:02:19] <dcausse>	 Dereckson: sure
[18:04:41] <jynus>	 Dereckson, I do not think waiting a bit will be enough
[18:04:42] <bblack>	 yes, we're still working on the API 503 issue
[18:04:55] <bblack>	 please don't deploy new changes that might cause confusion
[18:04:58] <jynus>	 but I will not cancel it yet
[18:05:01] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[18:06:30] <kart_>	 Dereckson: okay.
[18:07:00] <volans>	 Dereckson: unfortunately it's still WIP and we're searching for the root cause, so it will be better to avoid adding additional variables
[18:07:19] * Dereckson nods.
[18:08:01] <kart_>	 Dereckson: should we cancel the deployment and schedule later or tomorrow?
[18:08:57] <wikibugs>	 06Operations, 10ChangeProp, 06Services (doing), 15User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2723424 (10Pchelolo) a:03Pchelolo Looking at the [[ https://github.com/nodejs/node/blob/9eb61793bf5b5ee9a7fb3ebf683083152756feee/lib/cluster.js#L483-L491 | relate...
[18:09:12] <Dereckson>	 kart_: there is no other window scheduled for the next 2 hours, so we can wait and see, though the deployment apparently has a rather low probability of happening
[18:10:57] <kart_>	 Dereckson: yeah, and I'm in a different timezone, so let's wait till I'm here.
[18:11:43] <icinga-wm>	 PROBLEM - puppet last run on aqs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:12:45] <icinga-wm>	 PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:14:36] <Dereckson>	 kart_: perhaps you could add to the Gerrit change a procedure for testing your change, so that someone else will be able to shepherd it
[18:17:48] <_joe_>	 !log dumping core on mw1194
[18:17:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:21:33] <icinga-wm>	 PROBLEM - HHVM rendering on mw1194 is CRITICAL: Connection timed out
[18:22:22] <icinga-wm>	 RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:24:04] <icinga-wm>	 RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 81562 bytes in 0.112 second response time
[18:24:31] <wikibugs>	 06Operations, 10ChangeProp, 06Services (doing), 15User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2723486 (10mobrovac) >>! In T147849#2723424, @Pchelolo wrote: > Looking at the [[ https://github.com/nodejs/node/blob/9eb61793bf5b5ee9a7fb3ebf683083152756feee/lib/c...
[18:28:00] <kart_>	 Dereckson: Let me see if I can add test article/more info.
[18:29:55] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: naggen2: Only use exported resources [puppet] - 10https://gerrit.wikimedia.org/r/316400 
[18:31:02] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[18:34:32] <grrrit-wm>	 (03PS1) 10Jgreen: switch payments-listener back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/316401 
[18:34:53] <icinga-wm>	 PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:35:03] <gehel>	 SMalyshev: latest wdqs deployed on wdq-beta
[18:35:14] <icinga-wm>	 PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:35:20] <grrrit-wm>	 (03CR) 10Jgreen: [C: 032] switch payments-listener back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/316401 (owner: 10Jgreen)
[18:35:48] <Jeff_Green>	 !log switch payments-listener back to eqiad
[18:35:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:36:22] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[18:37:46] <icinga-wm>	 RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.620 second response time
[18:37:47] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: naggen2: Only use exported resources [puppet] - 10https://gerrit.wikimedia.org/r/316400 
[18:37:50] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] naggen2: Only use exported resources [puppet] - 10https://gerrit.wikimedia.org/r/316400 (owner: 10Alexandros Kosiaris)
[18:37:54] <icinga-wm>	 PROBLEM - HHVM rendering on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:38:09] <icinga-wm>	 RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 81566 bytes in 1.235 second response time
[18:38:33] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "unbreaking the icinga config that's been broken for a few hours. Deemed it important enough and unrelated enough to the current API outage" [puppet] - 10https://gerrit.wikimedia.org/r/316400 (owner: 10Alexandros Kosiaris)
[18:38:53] <gehel>	 !log deploying latest gui and binaries for wdqs
[18:38:59] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:39:06] <icinga-wm>	 PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:39:34] <icinga-wm>	 RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:40:24] <icinga-wm>	 RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 81566 bytes in 8.741 second response time
[18:40:29] <grrrit-wm>	 (03Abandoned) 10Dzahn: add mapped IPv6 address for einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/316038 (owner: 10Dzahn)
[18:41:44] <icinga-wm>	 RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 4.777 second response time
[18:44:05] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: naggen2: Fix typo with multiple lines [puppet] - 10https://gerrit.wikimedia.org/r/316403 
[18:44:25] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] naggen2: Fix typo with multiple lines [puppet] - 10https://gerrit.wikimedia.org/r/316403 (owner: 10Alexandros Kosiaris)
[18:45:19] <gehel>	 SMalyshev: wdqs deployment completed, feel free to test...
[18:48:20] <icinga-wm>	 PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:48:33] <akosiaris>	 that's me ^
[18:48:47] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: naggen2: Third fix for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/316406 
[18:48:55] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] naggen2: Third fix for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/316406 (owner: 10Alexandros Kosiaris)
[18:51:22] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1167 is OK: OK
[18:51:26] <SMalyshev>	 gehel: thank you!
[18:51:47] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 031] "Take my +1 with a grain of salt." [puppet] - 10https://gerrit.wikimedia.org/r/316348 (owner: 10Rush)
[18:54:36] <grrrit-wm>	 (03CR) 10Dzahn: "It seems to be the wrong place to put postgresql into the class for PHP packages." [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox)
[19:00:20] <icinga-wm>	 RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[19:00:24] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: Revert "lvs: raise API depool threshold, disable proxyfetch" [puppet] - 10https://gerrit.wikimedia.org/r/316408 
[19:01:24] <grrrit-wm>	 (03CR) 10Dzahn: [C: 031] Add redirect for toolserver sulinfo tool [puppet] - 10https://gerrit.wikimedia.org/r/316076 (owner: 10Dereckson)
[19:01:47] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[19:06:13] <grrrit-wm>	 (03CR) 10BBlack: [C: 031] Revert "lvs: raise API depool threshold, disable proxyfetch" [puppet] - 10https://gerrit.wikimedia.org/r/316408 (owner: 10Giuseppe Lavagetto)
[19:07:23] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "lvs: raise API depool threshold, disable proxyfetch" [puppet] - 10https://gerrit.wikimedia.org/r/316408 (owner: 10Giuseppe Lavagetto)
[19:07:32] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: Revert "lvs: raise API depool threshold, disable proxyfetch" [puppet] - 10https://gerrit.wikimedia.org/r/316408 
[19:07:48] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [V: 032] Revert "lvs: raise API depool threshold, disable proxyfetch" [puppet] - 10https://gerrit.wikimedia.org/r/316408 (owner: 10Giuseppe Lavagetto)
[19:15:04] <grrrit-wm>	 (03PS2) 10Dzahn: tcpircbot: update IPv6 addresses for terbium and wasat [puppet] - 10https://gerrit.wikimedia.org/r/316030 
[19:15:32] <wikibugs>	 06Operations, 10Cassandra, 06Services, 10hardware-requests: 9x or 15x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2723668 (10Eevans) During Thursday's (2016-10-13) //ops-services-syncup// meeting, a final decision to use Intel SSDs was made.  TTBMK, there are no further...
[19:17:48] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "This is for logmsgbot from maintenance servers." [puppet] - 10https://gerrit.wikimedia.org/r/316030 (owner: 10Dzahn)
[19:18:44] <grrrit-wm>	 (03PS4) 10Rush: labsdb: maintain-views don't _p for hitcounters [puppet] - 10https://gerrit.wikimedia.org/r/316348 
[19:18:59] <grrrit-wm>	 (03PS16) 10Rush: bdsync backup setup for labstore [puppet] - 10https://gerrit.wikimedia.org/r/315595 
[19:19:01] <grrrit-wm>	 (03PS1) 10Rush: labstore: bdsync backup 'test' drbd volume from secondary [puppet] - 10https://gerrit.wikimedia.org/r/316414 
[19:20:45] <grrrit-wm>	 (03CR) 10Rush: [C: 032] labsdb: maintain-views don't _p for hitcounters [puppet] - 10https://gerrit.wikimedia.org/r/316348 (owner: 10Rush)
[19:30:15] <wikibugs>	 06Operations, 10ArchCom-RfC, 06Performance-Team, 06Services, and 4 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#2723710 (10GWicke)
[19:42:09] <icinga-wm>	 PROBLEM - puppet last run on aqs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:52:48] <icinga-wm>	 RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[20:00:04] <jouncebot>	 gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161017T2000).
[20:11:06] <grrrit-wm>	 (03PS6) 10Alexandros Kosiaris: icinga: Remove the hack around facilities, lvs::monitor [puppet] - 10https://gerrit.wikimedia.org/r/315510 
[20:11:10] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: Remove the hack around facilities, lvs::monitor [puppet] - 10https://gerrit.wikimedia.org/r/315510 (owner: 10Alexandros Kosiaris)
[20:11:47] <wikibugs>	 06Operations, 10ChangeProp, 06Services (doing), 15User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2723773 (10Pchelolo) Here's a PR with a workaround that seems to fix the problem: https://github.com/wikimedia/htcp-purge/pull/5  However, we need to report this us...
[20:15:57] <mutante>	 jouncebot: next
[20:15:57] <jouncebot>	 In 0 hour(s) and 44 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161017T2100)
[20:30:46] <bearND>	 !log starting mobileapps deploy
[20:30:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:32:33] <Krenair>	 !log updated status.wm.o apache config on wikitech-static box to correctly serve static assets again (T148438)
[20:32:34] <stashbot>	 T148438: Styling of status.wikimedia.org is broken - https://phabricator.wikimedia.org/T148438
[20:32:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:33:43] <bearND>	 !log deployed mobileapps 13fa4b4
[20:33:48] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:33:58] <wikibugs>	 06Operations, 10Wikimedia-General-or-Unknown: Styling of status.wikimedia.org is broken - https://phabricator.wikimedia.org/T148438#2723842 (10Krenair) 05Open>03Resolved I've added the following to `wikitech-static.wikimedia.org:/etc/apache2/sites-enabled/status.wikimedia.org.conf`, below the existing Loca...
[20:46:22] <mutante>	 paladox: https://phabricator.wikimedia.org/T148438#2723842
[20:46:45] <grrrit-wm>	 (03PS2) 10Dzahn: add IPv6 AAAA and PTR for terbium and wasat [dns] - 10https://gerrit.wikimedia.org/r/316028 
[20:46:51] <paladox>	 mutante thanks
[20:46:51] <paladox>	 :)
[20:47:43] <mutante>	 well and now i see Krenair just logged that
[20:48:28] <Krenair>	 yeah
[20:49:03] <Krenair>	 I knew what the problem was earlier but didn't want to bug ops about it while everyone was dealing with the much bigger issues
[20:49:12] <mutante>	 thank you
[20:50:59] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] add IPv6 AAAA and PTR for terbium and wasat [dns] - 10https://gerrit.wikimedia.org/r/316028 (owner: 10Dzahn)
[20:51:16] <Krenair>	 and didn't want to accidentally break apache on that site during production issues
[20:51:34] <mutante>	 good thinking
[20:51:52] <paladox>	 :)
[20:53:35] <grrrit-wm>	 (03PS1) 10Yuvipanda: puppetmaster: Install self signed CA into system store too [puppet] - 10https://gerrit.wikimedia.org/r/316468 
[20:53:47] <mutante>	 !log maintenance servers, terbium and wasat, now have IPv6 connectivity
[20:53:49] <paladox>	 mutante there seem to be two icinga-wm bots logged into irc
[20:53:50] <paladox>	 icinga-wm
[20:53:51] <paladox>	 and
[20:53:53] <paladox>	 icinga-wm_
[20:53:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:53:57] <icinga-wm>	 PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100%
[20:54:52] <Jeff_Green>	 grr. icinga. ^^^ betelgeuse is not down.
[20:54:59] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: Install self signed CA into system store too [puppet] - 10https://gerrit.wikimedia.org/r/316468 (owner: 10Yuvipanda)
[20:55:22] <icinga-wm>	 RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 37.06 ms
[20:55:32] <grrrit-wm>	 (03PS2) 10Yuvipanda: puppetmaster: Install self signed CA into system store too [puppet] - 10https://gerrit.wikimedia.org/r/316468 
[20:55:41] <grrrit-wm>	 (03CR) 10Dzahn: "from terbium i can now ping6 wasat.codfw.wmnet etc" [dns] - 10https://gerrit.wikimedia.org/r/316028 (owner: 10Dzahn)
[20:57:52] <mutante>	 paladox: fixed
[20:58:01] <paladox>	 mutante thanks :)
[20:58:08] <mutante>	 or not, puppet will start it again
[20:58:14] <paladox>	 oh
[20:58:18] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] puppetmaster: Install self signed CA into system store too [puppet] - 10https://gerrit.wikimedia.org/r/316468 (owner: 10Yuvipanda)
[20:58:27] <mutante>	 it's because we have "tegmen" now using the icinga role
[20:58:35] <mutante>	 it's running one on each server
[20:58:38] <mutante>	 not 2 instances on neon
[20:58:58] <mutante>	 !log tegmen - stopped duplicate icinga-wm (ircecho)
[20:59:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:00:03] <mutante>	 we'll need something in hiera or so to define which is the currently "active" icinga server
[21:00:04] <jouncebot>	 dapatrick and bawolff: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161017T2100). Please do the needful.
[21:00:09] <mutante>	 yep
[21:00:18] <paladox>	 oh
[21:00:31] <Revent>	 Wow, jouncebot sucks up well.
[21:00:33] <mutante>	 that was from puppet run on tegmen
[21:00:41] <paladox>	 oh
[21:01:07] <mutante>	 jouncebot: skip
[21:02:09] <grrrit-wm>	 (03PS3) 10Yuvipanda: puppetmaster: Install self signed CA into system store too [puppet] - 10https://gerrit.wikimedia.org/r/316468 
[21:03:04] <grrrit-wm>	 (03CR) 10Yuvipanda: [V: 032] puppetmaster: Install self signed CA into system store too [puppet] - 10https://gerrit.wikimedia.org/r/316468 (owner: 10Yuvipanda)
[21:13:41] <grrrit-wm>	 (03PS2) 10Paladox: Install postgresql on ci in php.pp [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) 
[21:13:45] <grrrit-wm>	 (03PS3) 10Paladox: Install postgresql on ci in php.pp [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) 
[21:13:51] <grrrit-wm>	 (03CR) 10Paladox: "> It seems to be the wrong place to put postgresql into the class for" [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox)
[21:14:40] <greg-g>	 FYI: no SWAT tonight, no more deploys today, We'll wait until EU Opsens make a choice in the morning EU time regarding resuming
[21:15:31] <grrrit-wm>	 (03PS1) 10Chad: Phab: Use correct location for phd's homedir [puppet] - 10https://gerrit.wikimedia.org/r/316474 
[21:16:44] <Krenair>	 arseny92, so the SWAT window is cancelled
[21:16:47] <Krenair>	 patch will be done a different day
[21:16:59] <Krenair>	 might be tomorrow, might not.
[21:24:01] <grrrit-wm>	 (03CR) 1020after4: [C: 031] Phab: Use correct location for phd's homedir [puppet] - 10https://gerrit.wikimedia.org/r/316474 (owner: 10Chad)
[21:26:01] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Forward nova policy changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/316477 
[21:26:03] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Remove labs puppetmaster certcleaner [puppet] - 10https://gerrit.wikimedia.org/r/316478 (https://phabricator.wikimedia.org/T146303) 
[21:27:35] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Forward nova policy changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/316477 (owner: 10Andrew Bogott)
[21:28:09] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Remove labs puppetmaster certcleaner [puppet] - 10https://gerrit.wikimedia.org/r/316478 (https://phabricator.wikimedia.org/T146303) (owner: 10Andrew Bogott)
[21:54:49] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2723969 (10Gilles) Awesome, thank you!
[21:59:47] <wikibugs>	 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2723975 (10Halfak) I'm looking at this task because I've got a set of datasets that seem to belong on these boxes.  It seems that it woul...
[22:00:06] <grrrit-wm>	 (03PS1) 10Madhuvishy: labstore: Mount maps share simultaneously from labstore1003 and 1001 [puppet] - 10https://gerrit.wikimedia.org/r/316482 (https://phabricator.wikimedia.org/T147657) 
[22:00:19] <Zppix>	 we still having API issues?
[22:02:21] <Reedy>	 Why?
[22:07:15] <grrrit-wm>	 (03CR) 10Rush: labstore: Mount maps share simultaneously from labstore1003 and 1001 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/316482 (https://phabricator.wikimedia.org/T147657) (owner: 10Madhuvishy)
[22:07:31] <Pchelolo>	 !log running restriction import script on restbase1007
[22:07:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:08:26] <icinga-wm>	 PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:14:20] <grrrit-wm>	 (03PS1) 10Legoktm: Add .gitreview [software/service-checker] - 10https://gerrit.wikimedia.org/r/316484 
[22:14:53] <grrrit-wm>	 (03PS2) 10Legoktm: Add x-default-query functionality [software/service-checker] - 10https://gerrit.wikimedia.org/r/308020 
[22:15:30] <grrrit-wm>	 (03CR) 10Legoktm: "This is the last thing needed for MediaWiki support." [software/service-checker] - 10https://gerrit.wikimedia.org/r/308020 (owner: 10Legoktm)
[22:18:02] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:19:10] <cwd>	 gerrit down?
[22:19:13] <paladox>	 gerrit.wikimedia.org is not loading for me
[22:19:22] <paladox>	 mutante ostriches ^^
[22:20:25] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[22:20:40] <mutante>	 i _just_ opened my laptop again, ok, ehm
[22:20:51] <paladox>	 mutante it is very very slow
[22:20:52] <paladox>	 at loading
[22:21:13] <paladox>	 I see no spike https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=cobalt.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS_%7C_network though
[22:22:43] <p858snake|L2_>	 mutante: ERR_TIMED_OUT for me, it seemed to hang at "Establishing secure connection"
[22:22:56] <mutante>	 ok, it's not busy, not like when lead had the problem
[22:23:34] <mutante>	 it did load for me, but slowly
[22:23:34] <p858snake|L2_>	 Code Review - Error / Server Unavailable / 0 
[22:23:41] <p858snake|L2_>	 helpful error is helpful
[22:25:27] * mutante attempts to restart gerrit
[22:25:35] <volans>	 mutante: there are no recent logs in /var/lib/gerrit2/review_site/logs ?
[22:25:40] <mutante>	 i dont know
[22:25:53] <volans>	 last things from the 13th
[22:25:57] <mutante>	 i think it's fast again
[22:26:07] <mutante>	 works for me now
[22:26:13] <mutante>	 !log restarted gerrit on cobalt
[22:26:19] <paladox>	 Yep works now
[22:26:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:26:28] <mutante>	 so yea, just restarted the service
[22:26:37] <mutante>	 with the normal init.d script
[22:27:01] <volans>	 IT Crowd style :)
[22:27:11] <paladox>	 Thanks
[22:27:31] <mutante>	 volans: the change on the 13th is expected kind of because
[22:27:39] <mutante>	 this was merged then https://gerrit.wikimedia.org/r/#/c/315571/
[22:27:51] <paladox>	 I wonder why gerrit became slow, and then started working after a restart?
[22:27:54] <paladox>	 Possibly a bug?
[22:28:36] <volans>	 mutante: ack, between removing spamming logs and no-logs there is some middle way though ;)
[22:29:13] <volans>	 or are they in a different path?
[22:30:00] <arseny92>	 isn't it a burden to use gerrit nowadays, as differential is around?
[22:30:11] <wikibugs>	 06Operations, 10Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2724062 (10pmiazga)
[22:30:32] <paladox>	 arseny92 differential isn't as complete as gerrit
[22:30:39] <Krenair>	 no arseny92 
[22:30:51] <brion>	 mutante: thanks! much improved
[22:31:10] <paladox>	 gerrit includes more developer tools than differential, but differential has a more modern interface than gerrit
[22:31:33] <paladox>	 Plus the tests won't work with nodepool yet
[22:31:39] <Krenair>	 gerrit also has the added benefit of not being controlled by phabricator upstream
[22:32:00] <paladox>	 Yep
[22:32:06] <paladox>	 They're more friendly
[22:32:23] <paladox>	 They accepted one of my changes even though it was a new feature in a 2.12.* release
[22:32:31] <paladox>	 They accept almost all contributions
[22:32:55] <icinga-wm>	 PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_aggregator_projectview_data],Exec[git_pull_statistics_mediawiki]
[22:32:57] <arseny92>	 as far as I read about the gerrit migration, isn't code review for all the repos eventually going to be migrated to differential?
[22:33:25] <paladox>	 arseny92 thats been put on hold
[22:33:26] <paladox>	 for now
[22:33:36] <arseny92>	 seen that too yes
[22:33:39] <paladox>	 until upstream implements most of gerrit's features.
[22:33:51] <paladox>	 diffusion is still going ahead
[22:33:52] <paladox>	 though
[22:34:01] <arseny92>	 thats why i said eventually
[22:34:23] <arseny92>	 diffusion is already on for browsing, not cr
[22:34:26] <icinga-wm>	 RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:34:41] <paladox>	 yep, differential is on hold, so no date for that one; as for diffusion, most repos are there now. All new ones are being created there anyway, so gerrit + diffusion
[22:35:30] <paladox>	 I'm starting to think this is a bug in gerrit now
[22:36:18] <p858snake|L2_>	 and what basis are you using for that, other than random crystal ball gazing?
[22:36:53] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access request: #mediawiki_security and fluorine for Petr - https://phabricator.wikimedia.org/T148473#2724080 (10GWicke)
[22:37:01] <paladox>	 p858snake|L2_ gerrit keeps slowing down every so often
[22:37:13] <paladox>	 I don't remember that happening with the old gerrit server
[22:37:43] <arseny92>	 some service throttles cpu again?
[22:37:48] <icinga-wm>	 PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79520 MB (15% inode=99%)
[22:38:01] <p858snake|L2_>	 gwicke: separate issues, really should be filed as separate tasks
[22:39:18] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access to #mediawiki-security - https://phabricator.wikimedia.org/T148475#2724106 (10GWicke)
[22:39:27] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access to fluorine - https://phabricator.wikimedia.org/T148475#2724106 (10GWicke)
[22:40:08] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access request: #mediawiki_security for Petr. - https://phabricator.wikimedia.org/T148476#2724121 (10GWicke)
[22:40:22] <gwicke>	 p858snake|L2_: there you go
[22:40:44] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access to fluorine for Petr - https://phabricator.wikimedia.org/T148475#2724106 (10GWicke)
[22:45:50] <wikibugs>	 06Operations, 10Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2724142 (10pmiazga)
[22:47:55] <wikibugs>	 06Operations, 10Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2724155 (10pmiazga)
[22:51:04] <icinga-wm>	 RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[22:51:45] <wikibugs>	 06Operations, 10Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2724165 (10dr0ptp4kt) Approved for web request & Hive, prod event logging, nonprod (e.g., beta cluster) event logging, modeled after other Reading engineers with such access.
[22:51:53] <wikibugs>	 06Operations, 10Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2724166 (10dr0ptp4kt) Approved.
[23:00:05] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) - CANCELLED (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161017T2300).
[23:00:05] <jouncebot>	 arseny92: A patch you scheduled for Evening SWAT (Max 8 patches) - CANCELLED is about to be deployed. Please be available during the process.
[23:00:14] <arseny92>	 cancelled :/
[23:00:37] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2724182 (10Paladox)
[23:00:56] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:01:41] <arseny92>	 and jouncebot needs to be tweaked to handle cancelled event properly lol
[23:02:03] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2724184 (10Dzahn) On Oct 13 the log4j.properties were merged in https://gerrit.wikimedia.org/r/#/c/315571/  and there is no FileAppender in it because the plan is p...
[23:06:24] <icinga-wm>	 RECOVERY - Disk space on elastic1024 is OK: DISK OK
[23:14:36] <icinga-wm>	 PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:20:50] <wikibugs>	 07Puppet, 06Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2724213 (10Andrew)
[23:21:25] <wikibugs>	 07Puppet, 06Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2724231 (10Andrew)
[23:21:54] <wikibugs>	 07Puppet, 06Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2724213 (10Andrew)
[23:27:39] <Pchelolo>	 !log running import deletions script on restbase1007
[23:27:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:53] <wikibugs>	 07Puppet, 06Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2724259 (10Andrew) I should add that code blocks that use curly braces don't break anything.  For an extreme example, check out role::puppetmaster::standalone.  The API says that it has no docs.  But if I rem...
[23:28:07] <icinga-wm>	 PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:28:23] <mutante>	 grroan, the grrrit-wm 
[23:28:53] <mutante>	 grrrit-wm: restart yourself
[23:31:26] <grrrit-wm>	 (03CR) 10Dzahn: "@Alex yep, here's the general love for the rule https://gerrit.wikimedia.org/r/#/c/316497/" [puppet] - 10https://gerrit.wikimedia.org/r/316030 (owner: 10Dzahn)
[23:32:42] <grrrit-wm>	 (03CR) 10Dzahn: "Assuming it's not a problem to @resolve, AAAA when some of them don't have IPv6 yet. (eventlog1001 does not for example)" [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn)
[23:40:46] <icinga-wm>	 RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:42:18] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2724267 (10Peachey88)
[23:43:37] <icinga-wm>	 PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:44:07] <icinga-wm>	 PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:44:26] <icinga-wm>	 PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:46:37] <icinga-wm>	 RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.411 second response time
[23:46:54] <icinga-wm>	 RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.193 second response time
[23:48:18] <icinga-wm>	 PROBLEM - HHVM rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:48:38] <icinga-wm>	 RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.084 second response time
[23:50:39] <icinga-wm>	 RECOVERY - HHVM rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 81196 bytes in 1.094 second response time
[23:50:56] <icinga-wm>	 PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:51:58] <icinga-wm>	 PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:52:03] <icinga-wm>	 PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:52:57] <icinga-wm>	 PROBLEM - HHVM rendering on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:53:17] <icinga-wm>	 PROBLEM - HHVM rendering on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:53:29] <icinga-wm>	 RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.356 second response time
[23:53:36] <icinga-wm>	 PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:53:39] <icinga-wm>	 PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:53:56] <icinga-wm>	 PROBLEM - Apache HTTP on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:54:17] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[23:54:17] <icinga-wm>	 RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[23:54:28] <icinga-wm>	 RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.847 second response time
[23:54:32] <icinga-wm>	 RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.912 second response time
[23:54:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:54:36] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:54:58] <icinga-wm>	 PROBLEM - HHVM rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:55:26] <icinga-wm>	 PROBLEM - HHVM rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:55:26] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:55:46] <icinga-wm>	 RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 81194 bytes in 0.704 second response time
[23:56:03] <icinga-wm>	 RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 81194 bytes in 0.925 second response time
[23:56:03] <icinga-wm>	 PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:56:06] <icinga-wm>	 RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 81195 bytes in 0.122 second response time
[23:56:18] <icinga-wm>	 RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.237 second response time
[23:57:16] <icinga-wm>	 PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:57:30] <icinga-wm>	 RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 81196 bytes in 3.560 second response time
[23:58:01] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[23:58:01] <icinga-wm>	 RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 81196 bytes in 3.450 second response time
[23:58:06] <icinga-wm>	 PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:58:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:58:47] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/random/{format} (Random title redirect) is CRITICAL: Could not fetch url http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/page/random/title: Timeout on connection while downloading http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/page/random/title
[23:58:50] <icinga-wm>	 PROBLEM - HHVM rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:59:27] <icinga-wm>	 PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:59:36] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:59:57] <icinga-wm>	 PROBLEM - HHVM rendering on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds