[01:04:04] Operations, Wikimedia-Site-requests: Default language of votewiki set to Persian (fa) for anonymous users. Change to English (en) - https://phabricator.wikimedia.org/T148352#2720797 (Arseny1992)
[01:18:14] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Stirring The Pot, and 4 others: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2720813 (AndyRussG) Please ignore [[ https://phabricator.wikimedia.org/T144952#2717891 | this earlier c...
[01:27:47] Is there a way I can review the logs of the requests that hit varnish for "ores.wikimedia.org"?
[01:29:18] ores... misc-web
[01:29:33] needs ops rights to look at varnishlog there
[01:30:08] don't think anything misc goes to analytics
[01:30:34] Darn. We're getting hammered by a single IP right now. And I'd like to shut down requests from that IP
[01:31:02] ??
[01:31:18] Yes, Zppix|mobile?
[01:31:41] Im responding to halfak message i dont quite understand
[01:31:54] why are you responding to it if you don't understand it?
[01:32:06] halfak, maybe ask on the ops list
[01:32:48] Thats why i was responding to learn more
[01:33:31] Thanks Krenair. Emailing ops.
[01:34:17] Zppix|mobile, I don't think any old response will work for that, you need to ask specific questions
[01:34:29] Krenair , sonce when a config change is not ops if only ops/wmf can commit to ops/mw/config?
[01:34:38] snce*
[01:34:39] and seeing as it was just 9 seconds after you joined, looking at the context in the channel logs is also probably a good idea
[01:34:46] since*
[01:34:52] arseny92, not only ops/wmf can commit to ops/mw/config?
[01:35:23] I can commit to config and im not an op arseny92
[01:36:10] I think he meant merging into the main repo on the gerrit server
[01:36:16] yes
[01:36:30] If you're used to the old SVN system, a 'commit' meant doing pretty much the equivalent of that
[01:36:43] We moved away from SVN 4 and a half years ago
[01:37:31] ^true
[01:39:53] so that means now anyone can try to fix the config and propose changes for deploy if commits look good and pass review?
[01:40:15] anyone can propose changes for deploy without them looking good
[01:40:28] one they pass review, then they can be accepted
[01:40:55] the wmf-deployment group can merge changes in gerrit and the deployment group defined in the puppet admin module gets to actually push them out to the servers
[01:41:22] (theoretically these groups would be identical, but realistically it's a mess)
[01:42:04] the person who approved the temporary change of language for that wiki was neither wmf or ops
[01:44:07] wmf do not by default inherit deployment rights, ops do
[01:48:09] temporary things have a habit of accidentally becoming permanent
[01:50:08] (PS1) Huji: Reverting I75cf5954ac8d68e607975e7b1b77170915ffe4d1 [mediawiki-config] - https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352)
[01:50:09] or at least, there indefinitely until someone notices
[01:55:34] halfak, actually, I've been thinking
[01:55:40] How do you know it's a single IP?
[02:02:40] arseny92, I'm not sure about "otherwise it's considered as rewriting history as if like things didn't happen" but the rest sounds good
[02:03:55] i mean Huji's revert removes the mention of the other task from //comment
[02:04:08] we prefer commit messages that describe the change being made without relying on external references
[02:04:25] at least, I do
[02:04:33] but I think many would agree
[02:05:15] and since this task is now around, the comment would be needing to reference both tasks
[02:07:38] yes, references are okay
[02:07:46] but I wouldn't rely on them in the first line at the top of the commit message
[02:07:56] relevant task lists are good even
[02:09:13] welp i have git around, are other users able to amend existing patchsets, or a new patchset is needed
[02:09:56] other users can amend existing patches
[02:10:11] both using git locally and, more recently, by using the gerrit web UI
[02:12:25] k, then im going to try and do my first commit
[02:12:54] Okay
[02:13:32] I'm going to bed now, but I'll be back in like 12 hours
[02:13:36] good luck
[02:21:37] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.22) (duration: 07m 34s)
[02:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Oct 17 02:26:33 UTC 2016 (duration 4m 56s)
[02:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:07:58] (PS2) Huji: Revert "Change voteWiki to fa language temporarily" [mediawiki-config] - https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352)
[03:59:20] (PS3) Arseny1992: Reverting votewiki back to en Change was meant to be temporary.
fawiki elections have been since over [mediawiki-config] - https://gerrit.wikimedia.org/r/316291 (https://phabricator.wikimedia.org/T148352) (owner: Huji)
[06:41:08] (PS1) Giuseppe Lavagetto: Revert "ores: Increase capacity" [puppet] - https://gerrit.wikimedia.org/r/316294
[06:41:18] <_joe_> Amir1: ^^
[06:41:29] <_joe_> I need to ease the load on scb
[06:41:37] _joe_: hey, okay
[06:41:46] <_joe_> then we can look at blocking whoever is doing those massive requests
[06:41:59] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Revert "ores: Increase capacity" [puppet] - https://gerrit.wikimedia.org/r/316294 (owner: Giuseppe Lavagetto)
[06:42:31] awesome, thanks
[06:43:42] <_joe_> Amir1: next time, page opsens via phone
[06:43:49] <_joe_> they're on the contact list on officewiki
[06:44:08] I don't have access to the officewiki
[06:44:45] <_joe_> ha
[06:45:26] I don't have account there, Do you want me to have one? I don't know if that's possible
[06:45:42] <_joe_> heh, no idea
[06:46:18] <_joe_> so, aaron was talking about the "enwiki wp10 model"
[06:46:26] <_joe_> what's the url structure for that
[06:48:14] _joe_: there are several url strcutures
[06:48:19] *structures
[06:48:27] Is it possible to block based on IP?
[06:49:04] Here's an example of wp10 model request on enwiki
[06:49:05] https://ores.wikimedia.org/v2/scores/enwiki/?models=wp10&revids=724030089
[06:49:27] Here's another structure: https://ores.wikimedia.org/v2/scores/enwiki/wp10/76655
[06:54:29] <_joe_> Amir1: so requests to ores come from the jobrunners, mostly
[06:54:29] (PS1) Urbanecm: Show changes from last 14 days in watchlist in cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327)
[06:54:44] <_joe_> who go through the public IP, for some reason
[06:54:56] (PS2) Urbanecm: Show changes from last 14 days in watchlist in cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/316295 (https://phabricator.wikimedia.org/T148327)
[06:55:22] are you sure? the extension job runner does not seem to do it: https://grafana.wikimedia.org/dashboard/db/ores-extension
[06:55:51] <_joe_> what do you mean?
[06:56:19] <_joe_> you mean the amount of jobs is limited
[06:56:54] yeah and it did not change
[06:57:08] let me check it in details
[06:57:46] <_joe_> where does ores log?
[06:57:50] <_joe_> only to logstash?
[06:58:54] the ores service or the ores extension (the job runner)?
[06:59:07] <_joe_> the ores service
[06:59:28] I don't think it goes to logstash
[06:59:55] the extension, yes,
[07:00:00] <_joe_> /srv/log
[07:00:01] <_joe_> damn
[07:00:18] /srv/deployment/logs/ores
[07:02:24] Operations, Traffic, Wikimedia-Incident: Deploy redundant unified certs - https://phabricator.wikimedia.org/T148131#2720949 (faidon) Yeah - esams by itself gets enough (and diverse enough) traffic that it should suffice.
[07:14:15] Operations, Labs, Striker, LDAP: Store Wikimedia unified account name (SUL) in LDAP directory - https://phabricator.wikimedia.org/T148048#2720964 (MoritzMuehlenhoff) I think we should use the same wmf-user.schema file across the labs and corp servers, but introduce a separate object class for sto...
[07:15:35] !log Dropping memory tables hitcounter, _counters from S7 - T132837
[07:15:37] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? - https://phabricator.wikimedia.org/T132837
[07:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:17:53] <_joe_> Amir1: it's taking me some time as varnishlog is different in varnish 4
[07:18:44] _joe_: one thing, AFAIK varnish does not cache ores requests because they are cahced in redis
[07:19:26] akosiaris know better
[07:19:30] *knows
[07:23:14] Amir1: that's true however only for the requests that send the proper headers
[07:23:17] <_joe_> Amir1: so my hypothesis, from what I see
[07:23:26] <_joe_> is that something is doing a massive update of wikidata
[07:23:37] <_joe_> and every single edit goes to ores via the jobqueue
[07:23:47] changeprop, not the jobqueue
[07:23:55] ores gets updated via changeprop
[07:24:04] <_joe_> akosiaris: well I see requests coming from jobrunners
[07:24:05] <_joe_> so...
[07:24:38] hmm, that shouldn't happen AFAIK
[07:24:49] <_joe_> akosiaris: well, evidence tells us otherwise
[07:24:57] <_joe_> also, it goes through the public endpoint
[07:24:59] <_joe_> lol
[07:25:23] <_joe_> akosiaris: varnishlog -n frontend -q 'ReqHeader eq "Host: ores.wikimedia.org"' | fgrep X-Client-IP on cache-misc
[07:25:32] yeah, that's a pattern I hate (using the public endpoint)
[07:26:44] <_joe_> wow we had some spikes on the jobqueue in the last few days
[07:26:46] akosiaris _joe_: The changeprop is for precaching
[07:26:59] the job runner is for the extension
[07:27:07] https://grafana.wikimedia.org/dashboard/db/ores-extension
[07:27:12] <_joe_> Amir1: what does trigger a job?
[07:27:14] It seems it's the jobrunner
[07:27:20] a non-bot edit
[07:27:49] so it's not related to enwiki wp10 models
[07:28:20] Operations, Performance-Team, Thumbor, Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2720992 (MoritzMuehlenhoff) The blocking of /sys/fs is currently hard-coded in the fs_proc_sys_dev_boot() function. We could temporarily run...
[07:28:50] there aren't any real requests coming from non-jobrunner IPs from what I see
[07:28:51] hmm, so what we can do is to tell that user to slow down
[07:29:11] the wikidata editing bot you mean ?
[07:29:17] akosiaris: they are using the public endpoint
[07:29:18] * akosiaris assumes it's a bot
[07:29:22] <_joe_> Amir1: I see a lot of /scores/wikidatawiki/?models=damaging&revids=&precache=true
[07:29:39] it's bot but probably doesn't have the bot flag
[07:30:36] _joe_: wikidata always is the huge proportion of preaching edits
[07:33:10] <_joe_> Amir1: is ores evaluating edits to talk and user pages too?
[07:33:15] <_joe_> that seems absurd.
[07:33:21] not in wikidata
[07:33:28] <_joe_> in enwiki it does
[07:33:37] but for other wikis, yes, because vandalism can happen in them too
[07:35:07] so, the jobqueue size had a spike around the time CPU usage when up on SCB
[07:35:15] and that's all wikidata you say ?
[07:35:47] the joqueue size is getting back to "normal" size now
[07:36:27] I need to check who was it
[07:37:02] <_joe_> akosiaris: I guess it's some massive edit induced by some label change
[07:37:32] _joe_: can you get the rev ids?
[07:37:47] Operations, Performance-Team, Thumbor: Investigate failing TIFFs - https://phabricator.wikimedia.org/T148360#2721010 (Gilles)
[07:37:53] <_joe_> Amir1: yes
[07:38:31] awesome, it would be nice to have it
[07:38:58] <_joe_> Amir1: I can intercept them now, I mean
[07:39:11] Operations, Performance-Team, Thumbor: Investigate failing SVGs - https://phabricator.wikimedia.org/T148361#2721028 (Gilles)
[07:40:16] Operations, Performance-Team, Thumbor: Investigate failing PNGs - https://phabricator.wikimedia.org/T148362#2721043 (Gilles)
[07:41:22] <_joe_> Amir1: do you have a phab task about this?
[07:41:22] Operations, Performance-Team, Thumbor: Investigate failing video - https://phabricator.wikimedia.org/T148363#2721059 (Gilles)
[07:41:33] Okay, I get some not yet
[07:41:37] *not yet
[07:42:12] <_joe_> Amir1: should we block this user? or ask him to stop nicely first?
[07:42:52] _joe_: I blocked him
[07:43:00] I go post a message in his talk page
[07:43:04] <_joe_> Amir1: oh ok
[07:43:05] and tell nicely that
[07:43:36] <_joe_> just "mark yourself as a bot" is enough
[07:43:57] (PS1) Marostegui: db-equiad: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/316296 (https://phabricator.wikimedia.org/T147305)
[07:43:59] cool
[07:47:25] nice varnishlog query _joe_ ;)
[07:49:32] <_joe_> Amir1: can't we just exclude wikidata from ores and unblock the user?
[07:49:44] <_joe_> I think it's a much better option
[07:49:50] !log Stopping MySQL in db2057.codfw.wmnet to use it to clone another server
[07:49:51] <_joe_> we simply can't handle the load atm
[07:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:50:33] <_joe_> Amir1: I'm preparing the patch
[07:51:05] Operations, Performance-Team, Thumbor: Investigate failing video - https://phabricator.wikimedia.org/T148363#2721089 (Gilles) ``` Oct 17 07:20:53 thumbor1001 thumbor@8830[127650]: http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.1c/1/1c/Accidents_will_happen_William-H.-Watson-Un...
[07:52:44] (PS1) Alexandros Kosiaris: Add interface::add_ip6_mapped to SC clusters [puppet] - https://gerrit.wikimedia.org/r/316298
[07:53:39] (PS2) Alexandros Kosiaris: conftool: Remove the old sca physical boxes from zotero backends [puppet] - https://gerrit.wikimedia.org/r/315926
[07:53:53] (CR) Alexandros Kosiaris: [C: 2 V: 2] conftool: Remove the old sca physical boxes from zotero backends [puppet] - https://gerrit.wikimedia.org/r/315926 (owner: Alexandros Kosiaris)
[07:54:07] (PS2) Alexandros Kosiaris: Add interface::add_ip6_mapped to SC clusters [puppet] - https://gerrit.wikimedia.org/r/316298
[07:54:11] (CR) Alexandros Kosiaris: [C: 2 V: 2] Add interface::add_ip6_mapped to SC clusters [puppet] - https://gerrit.wikimedia.org/r/316298 (owner: Alexandros Kosiaris)
[07:57:18] Operations, Performance-Team, Thumbor: Investigate failing SVGs - https://phabricator.wikimedia.org/T148361#2721100 (Gilles) ``` Oct 17 07:04:44 thumbor1001 thumbor@8837[74679]: 2016-10-17 07:04:44,645 8837 thumbor:ERROR ERROR: Traceback (most recent call last): Oct 17 07:04:44 thumbor1001 thumbor@88...
[07:59:04] (PS1) Giuseppe Lavagetto: Temporarily disable Ores on wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/316299
[07:59:10] <_joe_> Amir1: ^^
[07:59:26] _joe_: why?
[07:59:40] It's back to normal as far as I can tell
[07:59:44] <_joe_> Amir1: we can't really block a real user because a service is under capacity
[07:59:57] <_joe_> Amir1: yeah but we need to unblock that editor!
[08:00:18] I agree but the user is doing more than 100 edits per minute
[08:00:18] Operations, Performance-Team, Thumbor: Investigate failing SVGs - https://phabricator.wikimedia.org/T148361#2721115 (Gilles) This one looks like a different issue: ``` Oct 17 07:06:47 thumbor1001 thumbor@8825[71257]: 2016-10-17 07:06:47,721 8825 thumbor:ERROR ERROR: Traceback (most recent call last...
[08:00:24] it should be marked as a bot
[08:00:31] I will unblock him ASAP
[08:00:32] <_joe_> Amir1: ok but while we tell him
[08:00:47] <_joe_> we can just disable ores on wikidata while we add capacity
[08:00:58] <_joe_> a thing akosiaris is basically doing now
[08:01:48] I still don't understand, We have enough capacity to handle wikidata
[08:01:50] Operations, Performance-Team, Thumbor: Investigate failing SVGs - https://phabricator.wikimedia.org/T148361#2721132 (Gilles) The other ones are the same missing source_file issue.
[08:02:04] <_joe_> Amir1: even while that user is not marked as a bot?
[08:02:06] (PS1) Muehlenhoff: Update to 4.4.25 [debs/linux44] - https://gerrit.wikimedia.org/r/316300
[08:02:07] Operations, Performance-Team, Thumbor: SVG engine: IOError: [Errno 2] No such file or directory: '/srv/thumbor/tmp/thumbor@8811/tmp04oo68/source_file' - https://phabricator.wikimedia.org/T148361#2721133 (Gilles)
[08:02:22] (CR) Elukey: [C: -1] "Hey Alex," [puppet] - https://gerrit.wikimedia.org/r/316217 (owner: Elukey)
[08:02:28] That kind of behavior that the user was doing can disrupt everything including the wikidata itself
[08:02:36] Operations, Performance-Team, Thumbor: Missing original for video 500s instead of 404ing - https://phabricator.wikimedia.org/T148363#2721134 (Gilles)
[08:02:39] that's why we have throttle for bots
[08:03:07] <_joe_> Amir1: well it's not your call to make
[08:03:16] <_joe_> in my not so humble opinion
[08:03:29] I'm referring the Wikidata policy https://www.wikidata.org/wiki/Wikidata:Bots
[08:03:36] not my opinion
[08:03:58] <_joe_> it's your opinion that what he's doing can disrupt wikidata
[08:04:08] <_joe_> because AFAICS, wikidata is very fine
[08:04:11] <_joe_> ores isn't
[08:04:40] <_joe_> please note he's not using a bot
[08:04:49] <_joe_> it's assisted editing, it's not a program
[08:04:59] <_joe_> as you can see on his talk page
[08:05:30] Operations, Performance-Team, Thumbor: Investigate failing PNGs - https://phabricator.wikimedia.org/T148362#2721151 (Gilles) Recon_Battalion_hike.png is the same bug as T148361 but with the vips temp file path: ``` Oct 17 07:03:02 thumbor1001 thumbor@8825[71257]: CommandError: (['/usr/bin/vips', 'sh...
[08:05:33] es2014 network issues?
[08:05:58] Amir1: both you and _joe_ do have a point
[08:06:06] Operations, Performance-Team, Thumbor: Investigate failing PNGs - https://phabricator.wikimedia.org/T148362#2721153 (Gilles) Same for the other ones
[08:06:07] clearly the user is not harming wikidata
[08:06:12] I'm saying that since we had similar incidents before now Wikidata has a policy prohibiting this speed
[08:06:16] <_joe_> I'm not saying he shouldn't get a bot account
[08:06:28] Operations, Performance-Team, Thumbor: SVG engine: IOError: [Errno 2] No such file or directory: '/srv/thumbor/tmp/thumbor@8811/tmp04oo68/source_file' - https://phabricator.wikimedia.org/T148361#2721028 (Gilles)
[08:06:30] Operations, Performance-Team, Thumbor: Investigate failing PNGs - https://phabricator.wikimedia.org/T148362#2721157 (Gilles)
[08:07:06] I think it may have crashed completely
[08:08:16] Operations, Performance-Team, Thumbor: SVG and VIPS engines expect temp folder to stay a while when it doesn't - https://phabricator.wikimedia.org/T148361#2721182 (Gilles)
[08:09:12] Also Wikidata was not very fine https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch
[08:09:18] Only thing I am getting on the console is [8181711.222536]
[08:09:19] He also caused dispatch lag
[08:10:18] Operations, Traffic, netops, Patch-For-Review: Fix static IP fallbacks to Pybal LVS routes - https://phabricator.wikimedia.org/T143915#2721194 (faidon) I cleaned up the old, stale backup routes and re-added backup routes for all of the subnets mentioned above across core routers in all datacenter...
[08:10:20] !log disabling notifications of es2014 before it pages
[08:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:10:28] jynus: already paged
[08:10:34] jynus: need any help btw ?
[08:10:41] it hard-crashed
[08:10:47] I guess I can ignore that page
[08:10:52] no
[08:11:04] it just crashed live
[08:11:09] ouch
[08:11:13] Operations, Performance-Team, Thumbor: SVG and VIPS engines expect temp folder to stay a while when it doesn't - https://phabricator.wikimedia.org/T148361#2721196 (Gilles)
[08:11:14] I'm saying the speed he is doing is not acceptable even for bots: 142 edits per minute is too much
[08:11:16] Operations, Performance-Team, Thumbor: Investigate failing TIFFs - https://phabricator.wikimedia.org/T148360#2721198 (Gilles)
[08:11:22] but still responds to ping?
[08:11:45] <_joe_> Amir1: heh, ok, to my eye 142 edits per second should be too much
[08:11:47] <_joe_> not per minute
[08:11:54] <_joe_> we're doing an abysmal job here
[08:11:56] Amir1: out of curiosity what is the limit per minute for bots edits?
[08:11:59] where's icinga-wm?
[08:12:00] is it back?
[08:12:01] I'll restart
[08:12:02] 60
[08:12:07] gtk
[08:12:09] or is it you, akosiaris?
[08:12:11] I think the controller temporarilly failed
[08:12:17] paravoid: no, not me
[08:12:35] actually that's not fully true
[08:12:38] the 60 limit
[08:12:55] even this page https://www.wikidata.org/wiki/Wikidata:Bots
[08:13:05] does not clearly say "60" and that's it
[08:13:05] and now it is up
[08:13:20] I am going to check the hardware logs
[08:13:21] it does say . The bot operator should do a test run of between 50 and 250 edits, so that the community can observe that the bot is working correctly.
[08:13:33] which implies that 120 might very well be acceptable
[08:14:38] akosiaris: that's for the trial period and not the speed per minute
[08:14:39] Pfff agan?
[08:14:46] let me find the discussion
[08:14:47] Last time it was es2015 no?
[08:15:22] I didn't reboot, it freezed due to filesystem "08:15:02 up 94 days"
[08:15:25] Amir1: we 've never had set a limit in our API from what I know.
In fact in https://www.mediawiki.org/wiki/API:Etiquette#Request_limit
[08:15:26] !log upgrading nodejs on aqs100[56]
[08:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:15:33] (CR) Muehlenhoff: [C: 2] Update to 4.4.25 [debs/linux44] - https://gerrit.wikimedia.org/r/316300 (owner: Muehlenhoff)
[08:15:40] it clearly says "There is no hard and fast limit on read requests, but we ask that you be considerate and try not to take a site down."
[08:15:50] https://www.wikidata.org/wiki/Wikidata_talk:Bots/Archive/2013#Bot_speed
[08:15:55] "megaraid_sas 0000:03:00.0: pending commands remain after waiting, will reset adapter scsi0"
[08:16:11] <_joe_> !log restarting hhvm on mw1175, stuck in HPHP::FastCGISession::blockingWriteStdOut after OOM
[08:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:16:30] akosiaris: yeah, it's in the technical perspective and I agree but in the community they have other considerations too
[08:16:48] like flooding the recentchanges
[08:17:12] also dispatch stat is the bottleneck
[08:17:18] https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch
[08:17:28] which as you can see it should not be more than zero
[08:17:52] do you want me to grab someone from Wikidata team to explain it to you
[08:18:05] no need
[08:18:10] I can understand it just fine
[08:18:25] (I'm in that team. My bot has more than 23 M edits there, I'm very well aware of the limits there)
[08:19:20] that limit should be on the bots page you linked earlier, so everyone knows it
[08:20:02] <_joe_> Amir1: from my prespective, we are blocking a big contributor to our projects because a non-critical service is having load issues. So if the block is wikidata-community related, that's ok
[08:20:04] yeah, it's not a clearly communicated limit...
then again we have never communicated a clear limit on purpose, AFAIK at least
[08:20:17] <_joe_> if that's for ORES, we should just disable ORES on wikidata while we add capacity
[08:20:24] <_joe_> that's my line of reasoning
[08:20:24] mediawiki user level blocks shouldn't be discussed here, please use the wikis for that
[08:20:47] We can go to #wikidata
[08:20:51] <_joe_> jynus: we're discussing merging my patch for disabling ORES on wikidata
[08:20:52] we can discuss ores capacity
[08:20:55] ok
[08:21:03] just stay on topic
[08:21:10] <_joe_> and we're discussing an emergency measure which was blocking a user
[08:21:22] jynus: I am going thru my notes of the issues with the megaraid_sas drivers, to see if I have seen that error in my rpevious jobs
[08:21:22] but overall, I'm saying I'm not blocking him because of the ORES, I block because of WD:Bots policy
[08:21:24] <_joe_> so, thanks for the police work, but wasn't needed
[08:21:54] AlexZ: don't do that again
[08:21:58] can I get the times for the probematic workdload?
[08:22:31] database is usually the bottlneck when wikidata goes crazy
[08:23:38] jynus: it's clear in https://grafana.wikimedia.org/dashboard/db/ores?from=now-24h&to=now&panelId=15&fullscreen
[08:24:32] I am probably the highest critict of wikidata, mostly because it has massive consequences
[08:25:02] but in this case, I have to agree with joe's procedure- I see no problems on the main wikidata infrastructure
[08:25:42] inserts have not appreciately increased
[08:25:45] The dispatch lag was crazy during this edits, so no. It wasn't just ores: https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch
[08:26:00] paravoid: I enabled that after consulting with James Alexander and he left a note in this channel about logmsgbot being abused.
[08:26:42] AlexZ: it prevents the icinga bot from joining this channel, which is fairly critical for us
[08:29:00] paravoid: alright, but it should have gotten removed a bit earlier for sure.
I wasn't aware that icinga-wm got disconnected
[08:29:18] it did on a netsplit
[08:29:24] and then couldn't join again
[17:14:41] RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.400 second response time
[17:16:57] <_joe_> mark: I'm merging the pybal change
[17:17:18] Note: WDQS deployment is on hold pending issues with finding the binaries... It will resume when things are sorted out (nothing has been deployed / changed yet)
[17:20:07] <_joe_> !log restarting lvs on lvs1003/1006 for the api change
[17:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:20:48] (CR) Yurik: [C: 1] "not that we would ever use non-utf8, but sure, looks good :)" [puppet] - https://gerrit.wikimedia.org/r/316342 (owner: Gehel)
[17:22:38] (PS2) Bearloga: Update R and C++-related stats puppet configs [puppet] - https://gerrit.wikimedia.org/r/315885 (https://phabricator.wikimedia.org/T147682)
[17:23:14] gehel: this is not right, should be from oct 13... not sure why it's not there
[17:23:32] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:24:22] PROBLEM - Check size of conntrack table on kafka1018 is CRITICAL: CRITICAL: nf_conntrack is 99 % full
[17:27:02] RECOVERY - Check size of conntrack table on kafka1018 is OK: OK: nf_conntrack is 53 % full
[17:28:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0]
[17:28:35] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/home/midom]
[17:28:38] (CR) Dzahn: [C: 1] "with g++ now being 4.8, this looks good" [puppet] - https://gerrit.wikimedia.org/r/315885 (https://phabricator.wikimedia.org/T147682) (owner: Bearloga)
[17:29:35] (PS1) Jgreen: switch payments-listener from eqiad to codfw [dns] - https://gerrit.wikimedia.org/r/316391
[17:30:04] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:31:04] next
[17:31:11] jouncebot: next
[17:31:11] In 0 hour(s) and 28 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161017T1800)
[17:31:34] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 653 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3132103 keys - replication_delay is 653
[17:31:47] (CR) Jgreen: [C: 2] switch payments-listener from eqiad to codfw [dns] - https://gerrit.wikimedia.org/r/316391 (owner: Jgreen)
[17:33:20] paravoid: maybe is worth skipping/postponing SWAT in 27 minutes to avoid adding additional variables
[17:36:43] Whats up with api errors im around to collect reports
[17:37:14] opsen still working on it
[17:37:25] Is there a task
[17:37:37] I haven't seen one.
[17:37:54] Give me some details ill open one
[17:38:15] Zppix|mobile api errors are affecting saving watchlists
[17:38:18] or removing them
[17:38:30] Ok
[17:38:33] I have no idea if it also affects saving pages though
[17:38:40] <_joe_> paladox: still going on?
[17:39:07] _joe_ i think so, but not sure.
[17:39:44] Saving a watchlist then removing it works for me, but the errors above suggest otherwise?
[17:39:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:40:08] Zppix|mobile it also affects anything that uses api
[17:40:10] like bots
[17:41:32] Operations, MediaWiki-API: Api cluster issues - https://phabricator.wikimedia.org/T148448#2723320 (Zppix)
[17:45:08] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3120724 keys - replication_delay is 0
[17:49:38] PROBLEM - mediawiki-installation DSH group on mw1167 is CRITICAL: Host mw1167 is not in mediawiki-installation dsh group
[17:49:49] Operations, MediaWiki-API: Api cluster issues - https://phabricator.wikimedia.org/T148448#2723336 (Paladox) Should we triage this as high or unbreak due to it affecting production Wikimedia sites?
[17:50:14] RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:50:46] paladox: people are working on it right now
[17:51:01] have been for the past.. two hours?
[17:51:06] Ok
[17:51:15] Yep, i mean for history.
[17:51:24] I know that ops have been working on it.
[17:52:19] Operations, MediaWiki-API: Api cluster issues - https://phabricator.wikimedia.org/T148448#2723367 (Anomie)
[17:52:20] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:52:21] Operations, Labs, Tool-Labs, Traffic, and 2 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#2723365 (Andrew) a:Andrew>None
[17:52:24] Operations, Traffic: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451#2723370 (Anomie)
[17:54:56] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 81590 bytes in 0.539 second response time
[17:55:33] PROBLEM - HHVM rendering on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:58:13] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 81591 bytes in 5.984 second response time
[18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161017T1800). Please do the needful.
[18:00:04] dcausse and kart_: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[18:02:02] dcausse: kart_: we'll wait a little bit volans and paravoid are fine with the current cluster state
[18:02:17] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:02:19] Dereckson: sure
[18:04:41] Dereckson, I do not think waitng a bit will be enough
[18:04:42] yes, we're still working on the API 503 issue
[18:04:55] please don't deploy new changes that might cause confusion
[18:04:58] but I will not cancel it yet
[18:05:01] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy
[18:06:30] Dereckson: okay.
[18:07:00] Dereckson: unfortunately still WIP and searching the root cause, it will be better to avoid adding additional variables
[18:07:19] * Dereckson nods.
[18:08:01] Dereckson: should we cancel the deployment and schedule later or tomorrow?
[18:08:57] Operations, ChangeProp, Services (doing), User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2723424 (Pchelolo) a:Pchelolo Looking at the [[ https://github.com/nodejs/node/blob/9eb61793bf5b5ee9a7fb3ebf683083152756feee/lib/cluster.js#L483-L491 | relate...
[18:09:12] kart_: there is no other window scheduled before 2 hours, so we can wait and see, with apparently a rather low probability of the deployment
[18:10:57] Dereckson: yeah, and I'm in different timezone, so lets wait till I'm here.
[18:11:43] PROBLEM - puppet last run on aqs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:12:45] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:14:36] kart_: perhaps you could add on the Gerrit change a procedure how to test your change, and someone else will be able to sherperd it
[18:17:48] <_joe_> !log dumping core on mw1194
[18:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:21:33] PROBLEM - HHVM rendering on mw1194 is CRITICAL: Connection timed out
[18:22:22] RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:24:04] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 81562 bytes in 0.112 second response time
[18:24:31] Operations, ChangeProp, Services (doing), User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2723486 (mobrovac) >>!
In T147849#2723424, @Pchelolo wrote: > Looking at the [[ https://github.com/nodejs/node/blob/9eb61793bf5b5ee9a7fb3ebf683083152756feee/lib/c... [18:28:00] Dereckson: Let me see if I can add test article/more info. [18:29:55] (03PS1) 10Alexandros Kosiaris: naggen2: Only use exported resources [puppet] - 10https://gerrit.wikimedia.org/r/316400 [18:31:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:34:32] (03PS1) 10Jgreen: switch payments-listener back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/316401 [18:34:53] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:35:03] SMalyshev: latest wdqs deployed on wdq-beta [18:35:14] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:35:20] (03CR) 10Jgreen: [C: 032] switch payments-listener back to eqiad [dns] - 10https://gerrit.wikimedia.org/r/316401 (owner: 10Jgreen) [18:35:48] !log switch payments-listener back to eqiad [18:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:36:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:37:46] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.620 second response time [18:37:47] (03PS2) 10Alexandros Kosiaris: naggen2: Only use exported resources [puppet] - 10https://gerrit.wikimedia.org/r/316400 [18:37:50] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] naggen2: Only use exported resources [puppet] - 10https://gerrit.wikimedia.org/r/316400 (owner: 10Alexandros Kosiaris) [18:37:54] PROBLEM - HHVM rendering on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:09] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 81566 bytes in 1.235 second response time [18:38:33] (03CR) 10Alexandros Kosiaris: "unbreaking the icinga config that's 
been broken for a few hours. Deemed it important enough and unrelated enough to the current API outage" [puppet] - 10https://gerrit.wikimedia.org/r/316400 (owner: 10Alexandros Kosiaris) [18:38:53] !log deploying latest gui and binaries for wdqs [18:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:06] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:39:34] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:40:24] RECOVERY - HHVM rendering on mw1194 is OK: HTTP OK: HTTP/1.1 200 OK - 81566 bytes in 8.741 second response time [18:40:29] (03Abandoned) 10Dzahn: add mapped IPv6 address for einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/316038 (owner: 10Dzahn) [18:41:44] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 4.777 second response time [18:44:05] (03PS1) 10Alexandros Kosiaris: naggen2: Fix typo with multiple lines [puppet] - 10https://gerrit.wikimedia.org/r/316403 [18:44:25] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] naggen2: Fix typo with multiple lines [puppet] - 10https://gerrit.wikimedia.org/r/316403 (owner: 10Alexandros Kosiaris) [18:45:19] SMalyshev: wdqs deployment completed, feel free to test... [18:48:20] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:48:33] that's me ^ [18:48:47] (03PS1) 10Alexandros Kosiaris: naggen2: Third fix for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/316406 [18:48:55] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] naggen2: Third fix for naggen2 [puppet] - 10https://gerrit.wikimedia.org/r/316406 (owner: 10Alexandros Kosiaris) [18:51:22] RECOVERY - mediawiki-installation DSH group on mw1167 is OK: OK [18:51:26] gehel: thank you! [18:51:47] (03CR) 10Jcrespo: [C: 031] "Take my +1 with a grain of salt." 
[puppet] - 10https://gerrit.wikimedia.org/r/316348 (owner: 10Rush) [18:54:36] (03CR) 10Dzahn: "It seems to be the wrong place to put postgresql into the class for PHP packages." [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [19:00:20] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:00:24] (03PS1) 10Giuseppe Lavagetto: Revert "lvs: raise API depool threshold, disable proxyfetch" [puppet] - 10https://gerrit.wikimedia.org/r/316408 [19:01:24] (03CR) 10Dzahn: [C: 031] Add redirect for toolserver sulinfo tool [puppet] - 10https://gerrit.wikimedia.org/r/316076 (owner: 10Dereckson) [19:01:47] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [19:06:13] (03CR) 10BBlack: [C: 031] Revert "lvs: raise API depool threshold, disable proxyfetch" [puppet] - 10https://gerrit.wikimedia.org/r/316408 (owner: 10Giuseppe Lavagetto) [19:07:23] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "lvs: raise API depool threshold, disable proxyfetch" [puppet] - 10https://gerrit.wikimedia.org/r/316408 (owner: 10Giuseppe Lavagetto) [19:07:32] (03PS2) 10Giuseppe Lavagetto: Revert "lvs: raise API depool threshold, disable proxyfetch" [puppet] - 10https://gerrit.wikimedia.org/r/316408 [19:07:48] (03CR) 10Giuseppe Lavagetto: [V: 032] Revert "lvs: raise API depool threshold, disable proxyfetch" [puppet] - 10https://gerrit.wikimedia.org/r/316408 (owner: 10Giuseppe Lavagetto) [19:15:04] (03PS2) 10Dzahn: tcpircbot: update IPv6 addresses for terbium and wasat [puppet] - 10https://gerrit.wikimedia.org/r/316030 [19:15:32] 06Operations, 10Cassandra, 06Services, 10hardware-requests: 9x or 15x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2723668 (10Eevans) During Thursday's (2016-10-13) //ops-services-syncup// meeting, a final decision to use Intel SSDs was made. 
TTBMK, there are no further... [19:17:48] (03CR) 10Dzahn: [C: 032] "This is for logmsgbot from maintenance servers." [puppet] - 10https://gerrit.wikimedia.org/r/316030 (owner: 10Dzahn) [19:18:44] (03PS4) 10Rush: labsdb: maintain-views don't _p for hitcounters [puppet] - 10https://gerrit.wikimedia.org/r/316348 [19:18:59] (03PS16) 10Rush: bdsync backup setup for labstore [puppet] - 10https://gerrit.wikimedia.org/r/315595 [19:19:01] (03PS1) 10Rush: labstore: bdsync backup 'test' drbd volume from secondary [puppet] - 10https://gerrit.wikimedia.org/r/316414 [19:20:45] (03CR) 10Rush: [C: 032] labsdb: maintain-views don't _p for hitcounters [puppet] - 10https://gerrit.wikimedia.org/r/316348 (owner: 10Rush) [19:30:15] 06Operations, 10ArchCom-RfC, 06Performance-Team, 06Services, and 4 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#2723710 (10GWicke) [19:42:09] PROBLEM - puppet last run on aqs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:52:48] RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161017T2000). 
[20:11:06] (03PS6) 10Alexandros Kosiaris: icinga: Remove the hack around facilities, lvs::monitor [puppet] - 10https://gerrit.wikimedia.org/r/315510 [20:11:10] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] icinga: Remove the hack around facilities, lvs::monitor [puppet] - 10https://gerrit.wikimedia.org/r/315510 (owner: 10Alexandros Kosiaris) [20:11:47] 06Operations, 10ChangeProp, 06Services (doing), 15User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2723773 (10Pchelolo) Here's a PR with a workaround that seems to fix the problem: https://github.com/wikimedia/htcp-purge/pull/5 However, we need to report this us... [20:15:57] jouncebot: next [20:15:57] In 0 hour(s) and 44 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161017T2100) [20:30:46] !log starting mobileapps deploy [20:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:32:33] !log updated status.wm.o apache config on wikitech-static box to correctly serve static assets again (T148438) [20:32:34] T148438: Styling of status.wikimedia.org is broken - https://phabricator.wikimedia.org/T148438 [20:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:33:43] !log deployed mobileapps 13fa4b4 [20:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:33:58] 06Operations, 10Wikimedia-General-or-Unknown: Styling of status.wikimedia.org is broken - https://phabricator.wikimedia.org/T148438#2723842 (10Krenair) 05Open>03Resolved I've added the following to `wikitech-static.wikimedia.org:/etc/apache2/sites-enabled/status.wikimedia.org.conf`, below the existing Loca... 
[20:46:22] paladox: https://phabricator.wikimedia.org/T148438#2723842 [20:46:45] (03PS2) 10Dzahn: add IPv6 AAAA and PTR for terbium and wasat [dns] - 10https://gerrit.wikimedia.org/r/316028 [20:46:51] mutante thanks [20:46:51] :) [20:47:43] well and now i see Krenair just logged that [20:48:28] yeah [20:49:03] I knew what the problem was earlier but didn't want to bug ops about it while everyone was dealing with the much bigger issues [20:49:12] thank you [20:50:59] (03CR) 10Dzahn: [C: 032] add IPv6 AAAA and PTR for terbium and wasat [dns] - 10https://gerrit.wikimedia.org/r/316028 (owner: 10Dzahn) [20:51:16] and didn't want to accidentally break apache on that site during production issues [20:51:34] good thinking [20:51:52] :) [20:53:35] (03PS1) 10Yuvipanda: puppetmaster: Install self signed CA into system store too [puppet] - 10https://gerrit.wikimedia.org/r/316468 [20:53:47] !log maintenance servers, terbium and wasat, now have IPv6 connectivity [20:53:49] mutante there seems to be two icinga-wm bots logged into irc [20:53:50] icinga-wm [20:53:51] and [20:53:53] icinga-wm_ [20:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:53:57] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100% [20:54:52] grr. icinga. ^^^ betelgeuse is not down. 
[20:54:59] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: Install self signed CA into system store too [puppet] - 10https://gerrit.wikimedia.org/r/316468 (owner: 10Yuvipanda) [20:55:22] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 37.06 ms [20:55:32] (03PS2) 10Yuvipanda: puppetmaster: Install self signed CA into system store too [puppet] - 10https://gerrit.wikimedia.org/r/316468 [20:55:41] (03CR) 10Dzahn: "from terbium i can now ping6 wasat.codfw.wmnet etc" [dns] - 10https://gerrit.wikimedia.org/r/316028 (owner: 10Dzahn) [20:57:52] paladox: fixed [20:58:01] mutante thanks :) [20:58:08] or not, puppet will start it again [20:58:14] oh [20:58:18] (03CR) 10Yuvipanda: [C: 032] puppetmaster: Install self signed CA into system store too [puppet] - 10https://gerrit.wikimedia.org/r/316468 (owner: 10Yuvipanda) [20:58:27] it's because we have "tegmen" now using the icinga role [20:58:35] it's running one on each server [20:58:38] not 2 instances on neon [20:58:58] !log tegmen - stopped duplicate icinga-wm (ircecho) [20:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:00:03] we'll need something in hiera or so to define which is the currently "active" icinga server [21:00:04] dapatrick and bawolff: Respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161017T2100). Please do the needful. [21:00:09] yep [21:00:18] oh [21:00:31] Wow, jouncebot sucks up well. 
[21:00:33] that was from puppet run on tegmen [21:00:41] oh [21:01:07] jouncebot: skip [21:02:09] (03PS3) 10Yuvipanda: puppetmaster: Install self signed CA into system store too [puppet] - 10https://gerrit.wikimedia.org/r/316468 [21:03:04] (03CR) 10Yuvipanda: [V: 032] puppetmaster: Install self signed CA into system store too [puppet] - 10https://gerrit.wikimedia.org/r/316468 (owner: 10Yuvipanda) [21:13:41] (03PS2) 10Paladox: Install postgresql on ci in php.pp [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) [21:13:45] (03PS3) 10Paladox: Install postgresql on ci in php.pp [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) [21:13:51] (03CR) 10Paladox: "> It seems to be the wrong place to put postgresql into the class for" [puppet] - 10https://gerrit.wikimedia.org/r/316228 (https://phabricator.wikimedia.org/T39602) (owner: 10Paladox) [21:14:40] FYI: no SWAT tonight, no more deploys today, We'll wait until EU Opsens make a choice in the morning EU time regarding resuming [21:15:31] (03PS1) 10Chad: Phab: Use correct location for phd's homedir [puppet] - 10https://gerrit.wikimedia.org/r/316474 [21:16:44] arseny92, so the SWAT window is cancelled [21:16:47] patch will be done a different day [21:16:59] might be tomorrow, might not. 
[21:24:01] (03CR) 1020after4: [C: 031] Phab: Use correct location for phd's homedir [puppet] - 10https://gerrit.wikimedia.org/r/316474 (owner: 10Chad) [21:26:01] (03PS1) 10Andrew Bogott: Forward nova policy changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/316477 [21:26:03] (03PS1) 10Andrew Bogott: Remove labs puppetmaster certcleaner [puppet] - 10https://gerrit.wikimedia.org/r/316478 (https://phabricator.wikimedia.org/T146303) [21:27:35] (03CR) 10Andrew Bogott: [C: 032] Forward nova policy changes to mitaka [puppet] - 10https://gerrit.wikimedia.org/r/316477 (owner: 10Andrew Bogott) [21:28:09] (03CR) 10Andrew Bogott: [C: 032] Remove labs puppetmaster certcleaner [puppet] - 10https://gerrit.wikimedia.org/r/316478 (https://phabricator.wikimedia.org/T146303) (owner: 10Andrew Bogott) [21:54:49] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor memory limits for main process and subprocesses - https://phabricator.wikimedia.org/T145623#2723969 (10Gilles) Awesome, thank you! [21:59:47] 06Operations, 06Labs, 06Research-and-Data-Backlog, 10hardware-requests: eqiad: 2 hardware access request for research labsdbs - https://phabricator.wikimedia.org/T146065#2723975 (10Halfak) I'm looking at this task because I've got a set of datasets that seem to belong on these boxes. It seems that it woul... [22:00:06] (03PS1) 10Madhuvishy: labstore: Mount maps share simultaneously from labstore1003 and 1001 [puppet] - 10https://gerrit.wikimedia.org/r/316482 (https://phabricator.wikimedia.org/T147657) [22:00:19] we still having API issues? [22:02:21] Why? 
[22:07:15] (03CR) 10Rush: labstore: Mount maps share simultaneously from labstore1003 and 1001 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/316482 (https://phabricator.wikimedia.org/T147657) (owner: 10Madhuvishy) [22:07:31] !log running restriction import script on restbase1007 [22:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:08:26] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:14:20] (03PS1) 10Legoktm: Add .gitreview [software/service-checker] - 10https://gerrit.wikimedia.org/r/316484 [22:14:53] (03PS2) 10Legoktm: Add x-default-query functionality [software/service-checker] - 10https://gerrit.wikimedia.org/r/308020 [22:15:30] (03CR) 10Legoktm: "This is the last thing needed for MediaWiki support." [software/service-checker] - 10https://gerrit.wikimedia.org/r/308020 (owner: 10Legoktm) [22:18:02] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:19:10] gerrit down? 
[22:19:13] gerrit.wikimedia.org is not loading for me [22:19:22] mutante ostriches ^^ [22:20:25] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [22:20:40] i _just_ opened my laptop again, ok, ehm [22:20:51] mutante it is very very slow [22:20:52] at loading [22:21:13] I see no spike https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=cobalt.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS_%7C_network though [22:22:43] mutante: ERR_TIMED_OUT for me, seemed to hang at "Establishing secure connection" for me [22:22:56] ok, it's not busy, not like when lead had the problem [22:23:34] it did load for me, but slowly [22:23:34] Code Review - Error / Server Unavailable / 0 [22:23:41] helpful error is helpful [22:25:27] * mutante attempts to restart gerrit [22:25:35] mutante: there are no recent logs in /var/lib/gerrit2/review_site/logs ? [22:25:40] i don't know [22:25:53] last things from the 13th [22:25:57] i think it's fast again [22:26:07] works for me now [22:26:13] !log restarted gerrit on cobalt [22:26:19] Yep works now [22:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:26:28] so yea, just restarted the service [22:26:37] with the normal init.d script [22:27:01] IT Crowd style :) [22:27:11] Thanks [22:27:31] volans: the change on the 13th is expected kind of because [22:27:39] this was merged then https://gerrit.wikimedia.org/r/#/c/315571/ [22:27:51] I wonder why gerrit became slow, and then started working after a restart? [22:27:54] Possibly a bug? [22:28:36] mutante: ack, between removing spamming logs and no-logs there is some middle way though ;) [22:29:13] or are they in a different path? 
[22:30:00] isn't it a burden to use gerrit nowadays as differential is around [22:30:11] 06Operations, 10Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2724062 (10pmiazga) [22:30:32] arseny92 differential isn't as complete as gerrit [22:30:39] no arseny92 [22:30:51] mutante: thanks! much improved [22:31:10] gerrit includes more developer tools than differential, but differential is more modern in interface than gerrit [22:31:33] Plus the tests won't work with nodepool yet [22:31:39] gerrit also has the added benefit of not being controlled by phabricator upstream [22:32:00] Yep [22:32:06] They're more friendly [22:32:23] They accepted one of my changes even though it was a new feature in 2.12.* release [22:32:31] They accept almost all contributions [22:32:55] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_aggregator_projectview_data],Exec[git_pull_statistics_mediawiki] [22:32:57] as far as i did read gerrit migration, aren't all the repos cr going to be eventually migrated to differential? [22:33:25] arseny92 that's been put on hold [22:33:26] for now [22:33:36] seen that too yes [22:33:39] until upstream implement most of gerrit features. [22:33:51] diffusion is still going ahead [22:33:52] though [22:34:01] that's why i said eventually [22:34:23] diffusion is already on for browsing, not cr [22:34:26] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:34:41] yep, differential on hold, so no date for that one, diffusion, most repos are there now. All new ones are being created there anyways, so gerrit + diffusion [22:35:30] I'm starting to think this is a bug in gerrit now [22:36:18] and what basis are you using for that, other than random crystal ball gazing? 
[22:36:53] 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access request: #mediawiki_security and fluorine for Petr - https://phabricator.wikimedia.org/T148473#2724080 (10GWicke) [22:37:01] p858snake|L2_ gerrit keeps slowing down every so often [22:37:13] I don't remember that happening with the old gerrit server [22:37:43] some service throttles cpu again? [22:37:48] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79520 MB (15% inode=99%) [22:38:01] gwicke: separate issues, really should be filed as separate tasks [22:39:18] 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access to #mediawiki-security - https://phabricator.wikimedia.org/T148475#2724106 (10GWicke) [22:39:27] 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access to fluorine - https://phabricator.wikimedia.org/T148475#2724106 (10GWicke) [22:40:08] 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access request: #mediawiki_security for Petr. 
- https://phabricator.wikimedia.org/T148476#2724121 (10GWicke) [22:40:22] p858snake|L2_: there you go [22:40:44] 06Operations, 10Ops-Access-Requests, 06Services (blocked): Access to fluorine for Petr - https://phabricator.wikimedia.org/T148475#2724106 (10GWicke) [22:45:50] 06Operations, 10Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2724142 (10pmiazga) [22:47:55] 06Operations, 10Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2724155 (10pmiazga) [22:51:04] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [22:51:45] 06Operations, 10Ops-Access-Requests: Access to stat1002, stat1003, stat1004) for user pmiazga - https://phabricator.wikimedia.org/T148472#2724165 (10dr0ptp4kt) Approved for web request & Hive, prod event logging, nonprod (e.g., beta cluster) event logging, modeled after other Reading engineers with such access. [22:51:53] 06Operations, 10Ops-Access-Requests: Requesting access to "Production shell" for pmiazga - https://phabricator.wikimedia.org/T148477#2724166 (10dr0ptp4kt) Approved. [23:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) - CANCELLED (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161017T2300). [23:00:05] arseny92: A patch you scheduled for Evening SWAT (Max 8 patches) - CANCELLED is about to be deployed. Please be available during the process. 
[23:00:14] cancelled :/ [23:00:37] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2724182 (10Paladox) [23:00:56] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:01:41] and jouncebot needs to be tweaked to handle cancelled event properly lol [23:02:03] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2724184 (10Dzahn) On Oct 13 the log4j.properties were merged in https://gerrit.wikimedia.org/r/#/c/315571/ and there is no FileAppender in it because the plan is p... [23:06:24] RECOVERY - Disk space on elastic1024 is OK: DISK OK [23:14:36] PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:20:50] 07Puppet, 06Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2724213 (10Andrew) [23:21:25] 07Puppet, 06Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2724231 (10Andrew) [23:21:54] 07Puppet, 06Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2724213 (10Andrew) [23:27:39] !log running import deletions script on restbase1007 [23:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:27:53] 07Puppet, 06Labs: Puppet parser, puppet API, and inline docs - https://phabricator.wikimedia.org/T148479#2724259 (10Andrew) I should add that code blocks that use curly-races don't break anything. For an extreme example, checkout role::puppetmaster::standalone. The API says that it has no docs. But if I rem... [23:28:07] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:28:23] grroan, the grrrit-wm [23:28:53] grrrit-wm: restart yourself [23:31:26] (03CR) 10Dzahn: "@Alex yep, here's the general love for the rule https://gerrit.wikimedia.org/r/#/c/316497/" [puppet] - 10https://gerrit.wikimedia.org/r/316030 (owner: 10Dzahn) [23:32:42] (03CR) 10Dzahn: "Assuming it's not a problem to @resolve, AAAA when some of them don't have IPv6 yet. (eventlog1001 does not for example)" [puppet] - 10https://gerrit.wikimedia.org/r/316497 (owner: 10Dzahn) [23:40:46] RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:42:18] 06Operations, 10Gerrit, 06Release-Engineering-Team: Investigate why gerrit slowed down on 17/10/2016 - https://phabricator.wikimedia.org/T148478#2724267 (10Peachey88) [23:43:37] PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:44:07] PROBLEM - Apache HTTP on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:44:26] PROBLEM - Apache HTTP on mw1286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:46:37] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.411 second response time [23:46:54] RECOVERY - Apache HTTP on mw1286 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.193 second response time [23:48:18] PROBLEM - HHVM rendering on mw1283 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:38] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.084 second response time [23:50:39] RECOVERY - HHVM rendering on mw1283 is OK: HTTP OK: HTTP/1.1 200 OK - 81196 bytes in 1.094 second response time [23:50:56] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:51:58] PROBLEM - Apache HTTP on mw1279 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:52:03] PROBLEM - Apache HTTP on mw1283 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:52:57] PROBLEM - HHVM rendering on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:17] PROBLEM - HHVM rendering on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:29] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 5.356 second response time [23:53:36] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:39] PROBLEM - HHVM rendering on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:56] PROBLEM - Apache HTTP on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:54:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [23:54:17] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [23:54:28] RECOVERY - Apache HTTP on mw1279 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.847 second response time [23:54:32] RECOVERY - Apache HTTP on mw1283 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 8.912 second response time [23:54:36] PROBLEM - Apache HTTP on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:54:36] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:54:58] PROBLEM - HHVM rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:55:26] PROBLEM - HHVM rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:55:26] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:55:46] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 81194 bytes in 0.704 second response time [23:56:03] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 81194 bytes in 0.925 second response time [23:56:03] PROBLEM - HHVM rendering on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:56:06] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: HTTP/1.1 200 OK - 81195 bytes in 0.122 second response time [23:56:18] RECOVERY - Apache HTTP on mw1277 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 1.237 second response time [23:57:16] PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:57:30] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 81196 bytes in 3.560 second response time [23:58:01] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:58:01] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 81196 bytes in 3.450 second response time [23:58:06] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:58:06] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:58:47] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /page/random/{format} (Random title redirect) is CRITICAL: Could not fetch url http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/page/random/title: Timeout on connection while downloading http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/page/random/title [23:58:50] PROBLEM - HHVM rendering on mw1277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:59:27] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:59:36] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:59:57] PROBLEM - HHVM rendering on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds