[00:00:08] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Sun 10 Aug 2014 21:59:54 UTC [00:39:49] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Mon Aug 11 00:39:47 UTC 2014 [01:18:08] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Sun 10 Aug 2014 23:17:34 UTC [01:23:06] gods [01:23:16] could somebody merge https://gerrit.wikimedia.org/r/#/c/151425/ already so the guy stops rebasing it every few hours [01:50:38] PROBLEM - puppetmaster backend https on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:51:29] RECOVERY - puppetmaster backend https on strontium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.031 second response time [02:34:06] !log LocalisationUpdate completed (1.24wmf15) at 2014-08-11 02:33:02+00:00 [02:34:13] Logged the message, Master [02:46:52] (03Abandoned) 10Jackmcbarn: Let sysops edit the GWToolset namespace on Commons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142617 (https://bugzilla.wikimedia.org/67209) (owner: 10Jackmcbarn) [03:02:18] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Puppet has 1 failures [03:05:18] !log LocalisationUpdate completed (1.24wmf16) at 2014-08-11 03:04:15+00:00 [03:05:26] Logged the message, Master [03:19:08] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Sun 10 Aug 2014 23:17:34 UTC [03:20:29] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [03:21:29] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 03:18:17 UTC [03:23:29] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 03:18:17 UTC [03:25:29] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 03:18:17 UTC [03:27:29] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 03:18:17 UTC [03:29:29] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 03:18:17 UTC [03:31:29] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 03:18:17 UTC [03:33:29] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 03:18:17 UTC [03:35:29] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 03:18:17 UTC [03:35:56] i'm getting errors when i use the enwp api: { "servedby": "mw1200", "error": { "code": "internal_api_error_DBQueryError", "info": "Database query error", "*": "" }} [03:37:29] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 03:18:17 UTC [03:37:59] RECOVERY - Puppet freshness on analytics1024 is OK: puppet ran at Mon Aug 11 03:37:54 UTC 2014 [03:39:29] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 03:37:54 UTC [03:40:22] jackmcbarn: frequently? [03:40:33] legoktm: intermittently [03:40:34] what action=? 
[03:40:56] https://en.wikipedia.org/w/api.php?action=purge&format=jsonfm&forcelinkupdate=&generator=categorymembers&gcmtitle=Category%3APages%20with%20script%20errors&gcmlimit=30&gcmsort=timestamp&gcmdir=desc
[03:41:22] hah, big surprise (see #mediawiki)
[03:41:36] jeremyb: i don't see anything relevant there
[03:42:07] lol the page is taking a long time to load
[03:42:17] the fact action=purge takes a generator seems like a bad idea
[03:42:33] is there a better way to do what i'm doing?
[03:43:14] this is the kind of thing to talk to Carmela and Tim-away about
[03:43:21] not that I'm aware of
[03:43:25] {
[03:43:25] "servedby": "mw1142",
[03:43:25] "error": {
[03:43:26] "code": "internal_api_error_DBQueryError",
[03:43:26] "info": "Database query error",
[03:43:27] "*": ""
[03:43:28] }
[03:43:29] }
[03:43:36] jackmcbarn: file a bug?
[03:43:40] jackmcbarn, are you not hitting that URL to begin with because of orain? or it's really unrelated?
[03:43:50] unrelated I think.
[03:43:56] jeremyb: that's on enwiki. has nothing to do with what i was saying in #mediawiki
[03:44:28] jackmcbarn, i understand they are unrelated reasons and wikis
[03:44:38] but it's the same kind of purge
[03:44:45] yes. that's a coincidence
[03:45:05] well i never even heard of it before today
[03:45:18] anyway, nacth
[03:45:23] nacht*
[03:45:27] that's one of my "favorite" api queries, actually. useful in all sorts of situations
[03:57:14] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Mon Aug 11 03:57:01 UTC 2014
[03:57:55] RECOVERY - Puppet freshness on analytics1024 is OK: puppet ran at Mon Aug 11 03:57:52 UTC 2014
[04:00:23] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Aug 11 03:59:17 UTC 2014 (duration 59m 16s)
[04:00:29] Logged the message, Master
[04:03:00] Purge can be used with a list generator?
[04:03:01] Lawl.
[04:03:05] Purge as a concept is a hack.
[04:03:12] We should aim to kill it, not make it more efficient.
[04:04:00] Or rather, it should be programmatically and automagically sufficiently efficient in MediaWiki core. It should be an internal implementation detail, not an exposed interface.
[04:04:08] For realz.
[04:07:06] Carmela, oh, sorry i pinged you by accident i think. I meant dispenser
[04:08:50] I thought you went to bed.
[04:24:31] i did
[04:24:48] well not bed exactly.
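For the action=purge discussion above ([03:40:56] onward), here is a rough Python sketch of that kind of generator-driven batch purge, assuming the `requests` library. The endpoint and parameters are taken from the URL jackmcbarn pasted; the smaller batch size, the POST, and the opt-in "continue" continuation are assumptions meant to keep each purge round short enough to avoid the DBQueryError timeouts, not something that was actually run at the time.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

# Parameters mirror the pasted URL, but with a smaller gcmlimit so each
# purge + forcelinkupdate round does less database work per request.
params = {
    "action": "purge",
    "format": "json",
    "forcelinkupdate": "",
    "generator": "categorymembers",
    "gcmtitle": "Category:Pages with script errors",
    "gcmsort": "timestamp",
    "gcmdir": "desc",
    "gcmlimit": 10,
    "continue": "",          # opt in to the newer continuation style
}

session = requests.Session()
cont = {}
while True:
    data = session.post(API, data={**params, **cont}).json()
    if "error" in data:      # e.g. internal_api_error_DBQueryError, as seen above
        print("API error:", data["error"].get("code"))
        break
    for page in data.get("purge", []):
        print(page.get("title"), "purged" in page)
    if "continue" not in data:
        break
    cont = data["continue"]
```

Sleeping between rounds would be kinder still; per Carmela's point above, the longer-term fix is for MediaWiki core to handle this internally rather than expose purge as an interface at all.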
have to go home first [04:24:52] :) [04:41:25] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 04:38:13 UTC [04:43:25] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 04:38:13 UTC [04:45:25] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 04:38:13 UTC [04:47:25] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 04:38:13 UTC [04:49:25] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 04:38:13 UTC [04:51:25] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 04:38:13 UTC [04:53:25] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 04:38:13 UTC [04:55:25] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 04:38:13 UTC [04:57:15] RECOVERY - Puppet freshness on amssq52 is OK: puppet ran at Mon Aug 11 04:57:10 UTC 2014 [04:59:25] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 04:57:10 UTC [05:17:56] RECOVERY - Puppet freshness on amssq52 is OK: puppet ran at Mon Aug 11 05:17:48 UTC 2014 [05:39:06] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Epic puppet fail [05:41:52] epic? [05:59:07] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:04:23] <_joe_> eheh [06:14:06] PROBLEM - Puppet freshness on db1011 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 04:13:28 UTC [06:15:36] RECOVERY - Puppet freshness on db1011 is OK: puppet ran at Mon Aug 11 06:15:26 UTC 2014 [06:28:06] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Epic puppet fail [06:28:17] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Epic puppet fail [06:28:57] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:07] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:07] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:16] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:17] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:26] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:26] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:46] PROBLEM - puppet last run on mw1069 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:47] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:47] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:47] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:56] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:57] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:17] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: Puppet has 1 
failures [06:30:17] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:36] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:26] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:57] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Epic puppet fail [06:45:16] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:45:26] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:45:26] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:45:26] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:45:46] RECOVERY - puppet last run on mw1069 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:45:47] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:45:47] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:45:56] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:45:56] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:45:57] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:46:06] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:46:06] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:46:06] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:46:07] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:46:07] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:46:26] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:46:26] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:46:26] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:56] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:47:07] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:47:26] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:50:35] (03PS1) 10Jeremyb: account creation limit for CIS (tewiki) event [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153383 (https://bugzilla.wikimedia.org/69385) [06:52:48] (03CR) 10Jeremyb: [C: 04-1] "I can't read the event page. -1 pending confirmation from a local. 
(also reporter is not sysop)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153383 (https://bugzilla.wikimedia.org/69385) (owner: 10Jeremyb) [06:58:57] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [07:42:27] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 1 failures [07:59:26] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [08:01:06] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 06:00:08 UTC [08:19:20] <_joe_> mmm no hashar around [08:34:06] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 06:33:16 UTC [08:36:07] (03CR) 10Jerith: "As far as I am aware, this isn't being used for anything -- all the SMS/USSD systems are being hosted externally." [operations/puppet] - 10https://gerrit.wikimedia.org/r/117673 (owner: 10Matanya) [08:37:06] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 06:36:39 UTC [08:45:42] (03PS4) 10Giuseppe Lavagetto: Separate HHVM app servers backend. [operations/puppet] - 10https://gerrit.wikimedia.org/r/152903 (owner: 10Mark Bergsma) [08:49:17] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:50:07] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.027 second response time [08:52:57] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Mon Aug 11 08:52:52 UTC 2014 [08:56:56] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Mon Aug 11 08:56:51 UTC 2014 [09:00:06] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Mon Aug 11 09:00:03 UTC 2014 [09:27:39] <_joe_> akosiaris: good morning btw :) [09:27:59] _joe_: good morning to you too [09:30:15] <_joe_> using salt to fix ldap auth issues on labs machines makes me feel like I'm using ed again [09:30:32] <_joe_> only no I edit files via perl or sed [09:30:36] <_joe_> *now [09:31:20] (03CR) 10Giuseppe Lavagetto: "@hashar: I fixed the access issues on the labs instances as weel." (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/152903 (owner: 10Mark Bergsma) [09:45:22] (03PS1) 10QChris: Re-align block of attributes [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/153387 [09:45:33] (03PS1) 10QChris: Reschedule backups to not interfer with queue runs so easily [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/153388 (https://bugzilla.wikimedia.org/68731) [10:56:02] (03PS1) 10Nuria: Lowering celery concurrency and removing MAX_PARALLEL_RUN [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/153390 [11:12:42] (03CR) 10Nuria: Reschedule backups to not interfer with queue runs so easily (031 comment) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/153388 (https://bugzilla.wikimedia.org/68731) (owner: 10QChris) [11:20:11] (03CR) 10QChris: Reschedule backups to not interfer with queue runs so easily (031 comment) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/153388 (https://bugzilla.wikimedia.org/68731) (owner: 10QChris) [11:59:38] good morning [11:59:44] hola hashar [12:03:03] mark: any thoughts about using a magic cookie to allow staff and trusted users to opt-in to having their requests served by the HHVM backend in varnish? 
this is patch ; if the approach looks good i'll do it over on top of your varnish patch
[12:06:40] what's the difference between mine besides yours being a bit more convoluted and also spanning bits? :)
[12:07:52] i guess yours uses a secret cookie name, but why does it need to be secret?
[12:08:28] actually maybe it doesn't; i was worried that someone could be a clown and set the cookie in, say, enwiki's common.js
[12:08:41] and thereby cause all requests to get routed to an overloaded cluster of a few machines
[12:09:03] but i guess you deal with that in a more sophisticated way -- i don't remember the details but there was some logic to fall back to the general apache pool right?
[12:09:15] yeah
[12:09:17] not that I tested it :)
[12:09:25] if someone does that, that's a good reason to take their common.js access away
[12:09:50] right
[12:10:50] (03Abandoned) 10Ori.livneh: Varnish: route requests with magic cookie to HHVM backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/153289 (owner: 10Ori.livneh)
[12:10:58] basically, my patch tries to only send to hhvm on the first try
[12:11:02] akosiaris: your packaging skills rocks :]
[12:11:07] if the request restarts for some reason, go to zend
[12:11:25] i forgot something now I think of it
[12:11:34] it should only apply on tier 1, i.e. eqiad right now
[12:11:38] (03Abandoned) 10Ori.livneh: Varnish: add 'HHVM' backend [operations/puppet] - 10https://gerrit.wikimedia.org/r/153288 (owner: 10Ori.livneh)
[12:12:01] * ori doesn't totally grasp what a 'tier' is in this context
[12:12:03] ori: in principle, I see value in supporting a generic mechanism for these kinds of tests
[12:12:10] a cookie which enables/disables some functionality
[12:12:17] but then perhaps we shouldn't call it "hhvm_cookie" :)
[12:12:36] ori: eqiad is tier 1, esams/ulsfo are tier 2
[12:12:39] as they don't talk to backends directly
[12:12:45] and only tier 1 backend caches should get this logic
[12:13:09] ah right, that's why the testwiki checks are also enclosed in <% if tier == 1 %> checks
[12:13:13] yes
[12:13:28] makes sense
[12:14:58] (03CR) 10Mark Bergsma: [C: 04-1] "I forgot to put conditionals to only add the hhvm backend logic on 'tier 1' caches (i.e. eqiad right now). ulsfo/esams (tier 2) shouldn't " [operations/puppet] - 10https://gerrit.wikimedia.org/r/152903 (owner: 10Mark Bergsma)
[12:15:47] ori: we could even do something like where in varnish we set such a cookie on a very small percentage of incoming traffic (random)
[12:15:55] so the nr of users getting that cookie would slowly rampup
[12:16:00] and we can just disable that when we have enough
[12:16:38] but that would be at a later stage here, when we're more confident
[12:16:47] (03CR) 10Revi: [C: 031] Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder)
[12:17:08] (03CR) 10Hashar: "The Analytics team is aware of it via Bug 68997 - Package libcidr + libanon + libdclass for Ubuntu Trusty" [operations/puppet] - 10https://gerrit.wikimedia.org/r/153209 (owner: 10Hashar)
[12:18:22] (03CR) 10Mark Bergsma: Separate HHVM app servers backend.
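The routing rules settled on above (opt-in cookie, first attempt only, tier-1 caches only, restarted requests fall back to the regular Zend pool, plus mark's later idea of a slow random ramp-up) can be summarised in a small Python sketch. This is only an illustration of the decision logic, not the actual VCL in mark's operations/puppet change, and the pool and cookie names here are invented.

```python
import random

HHVM_POOL = "appservers_hhvm"   # invented names; the real backends are defined in the VCL templates
ZEND_POOL = "appservers_zend"

def pick_backend(request, tier, ramp_fraction=0.0):
    """Choose a backend for a cache miss, roughly per the discussion above."""
    if tier != 1:
        # Tier-2 caches (esams/ulsfo) forward to the tier-1 (eqiad) caches,
        # not straight to app servers, so none of this logic applies there.
        return "tier1_caches"
    if request.get("restarts", 0) > 0:
        # Only try HHVM on the first attempt; a restarted request goes to Zend.
        return ZEND_POOL
    if request.get("cookies", {}).get("hhvm_opt_in") == "1":
        # Explicit opt-in by staff/trusted testers via a cookie (hypothetical name).
        return HHVM_POOL
    if random.random() < ramp_fraction:
        # Later stage: randomly tag a small share of traffic and ramp it up slowly.
        return HHVM_POOL
    return ZEND_POOL

# e.g. pick_backend({"cookies": {"hhvm_opt_in": "1"}}, tier=1) -> "appservers_hhvm"
```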
(031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/152903 (owner: 10Mark Bergsma) [12:21:06] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 10:20:01 UTC [12:23:26] (03CR) 10John Vandenberg: [C: 031] Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [12:49:20] (03CR) 10Hashar: [C: 031] "Not sure whether 'latest' is actually needed. Nonetheless +1 for reusing the existing class :]" [operations/puppet] - 10https://gerrit.wikimedia.org/r/125241 (owner: 10Yuvipanda) [12:51:15] (03PS1) 10QChris: Force redis dump before backing up [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/153395 [12:51:53] (03CR) 10MZMcBride: [C: 031] Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [12:51:55] (03PS2) 10QChris: Force redis dump before backing up [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/153395 (https://bugzilla.wikimedia.org/68731) [12:54:26] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:55:17] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [13:14:16] (03PS1) 10Giuseppe Lavagetto: puppetmaster: make reimaging servers easier. [operations/puppet] - 10https://gerrit.wikimedia.org/r/153397 [13:20:07] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Mon Aug 11 13:20:03 UTC 2014 [13:31:56] akosiaris: Hi does the deployment-mathoid.eqiad.wmflabs use ubuntu 12 or 14 [13:48:33] (03Abandoned) 10Ori.livneh: HHVM: fix Apache config for status site [operations/puppet] - 10https://gerrit.wikimedia.org/r/152753 (owner: 10Ori.livneh) [14:32:51] (03PS2) 10Hashar: Add tox.ini and set max line length to 120 [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/150056 (owner: 10Yuvipanda) [14:33:32] physikerwelt: 14.04 [14:34:01] akosiaris: thanks [14:34:02] (03CR) 10Hashar: [C: 032] "Jenkins / Zuul already have triggers and the job pass. 
Lets merge :-]" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/150056 (owner: 10Yuvipanda) [14:34:04] (03PS1) 10Giuseppe Lavagetto: apache: add a 'replaces' parameter to apache::conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/153406 [14:34:07] (03Merged) 10jenkins-bot: Add tox.ini and set max line length to 120 [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/150056 (owner: 10Yuvipanda) [14:34:32] (03PS2) 10Hashar: Implement last command (per greg-g) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/150082 (owner: 10Yuvipanda) [14:34:46] (03CR) 10Hashar: "I have merged https://gerrit.wikimedia.org/r/#/c/150056/ which adds the flake8 env to tox.ini :]" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/150082 (owner: 10Yuvipanda) [14:37:33] (03CR) 10Hashar: Implement last command (per greg-g) (032 comments) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/150082 (owner: 10Yuvipanda) [14:39:51] (03CR) 10Steinsplitter: [C: 031] Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [14:46:14] (03CR) 10Andrew Bogott: [C: 032] Tools: Add some i386 compat packages to exec nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/125241 (owner: 10Yuvipanda) [14:49:46] (03CR) 10Perhelion: [C: 031] Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [14:58:45] (03Abandoned) 10Hashar: Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [15:18:27] (03CR) 10Ricordisamoa: "It is not a revert war. It is a proper reversion of a bad change that was introduced without consensus." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [15:20:28] <_joe_> I guess someone does not see the distinction between code reviews and voting [15:26:33] _joe_: heh [15:27:22] <_joe_> this is _all_ I will say on the issue [15:27:53] <_joe_> gerrit is not the place where you vote on a feature, IMHO [15:32:24] new policy: all gerrit commits mustbe preceded by a community consensus process [15:40:39] bblack: not all. but changes on operations/mediawiki-config.git often needs to be backed up by a community decision [15:40:55] I was just being sarcastic, ignore me :) [15:41:42] (03CR) 10Hashar: "@Ricordisamoa I am sure it is not a revert war. Anyway the discussion would be held on wikitech-l , not on Gerrit =)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [15:41:50] hehe [15:41:59] off for some family time, be back later this evening [15:54:55] (03PS5) 10Giuseppe Lavagetto: Separate HHVM app servers backend. [operations/puppet] - 10https://gerrit.wikimedia.org/r/152903 (owner: 10Mark Bergsma) [15:55:37] (03CR) 10Odder: "And this patch was abandoned because..?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [16:00:08] (03PS1) 10Alexandros Kosiaris: WIP: module/role class for servermon [operations/puppet] - 10https://gerrit.wikimedia.org/r/153412 [16:02:39] (03PS6) 10Giuseppe Lavagetto: Separate HHVM app servers backend. [operations/puppet] - 10https://gerrit.wikimedia.org/r/152903 (owner: 10Mark Bergsma) [16:10:39] (03CR) 10Steinsplitter: "+1'ing this is a way to remind the wmf that we are their devs too." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [16:15:34] (03CR) 10Parent5446: "I agree with Hashar here. Gerrit is not the place to argue this. (Also, if it wasn't abandoned, I would probably -1 this patch for reasons" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [16:20:51] 18:18 entropius: Anybody here responsible for wikimedia infrastructure? I think I found a missconfigured server, enabling me to gain root access on some probably unimportant server in your network. [16:21:05] bblack: ^^ [16:21:14] odder: Thanks! [16:21:57] bblack: May I send you what I have in a query? [16:21:59] odder: contact me privately please [16:22:07] that's entropius, akosiaris [16:22:08] er.. entropius, sorry [16:22:29] entropius: yes, please do [16:23:18] (03CR) 10Florianschmidtwelzow: "@Odder: There is ONE open revert change:" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [16:24:37] (03CR) 10Odder: "@Florian: That's for the feature in MediaWiki core." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [16:26:47] (03CR) 10Florianschmidtwelzow: "@Odder: Damn, sorry, false change :(" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [17:04:06] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Epic puppet fail [17:16:56] (03CR) 10Ricordisamoa: "This change should be restored." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [17:20:06] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 15:19:54 UTC [17:20:36] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Mon Aug 11 17:20:29 UTC 2014 [17:25:06] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:31:33] (03PS4) 10ArielGlenn: data retention audit script for logs, /root and /home dirs [operations/software] - 10https://gerrit.wikimedia.org/r/141473 [17:31:37] (03CR) 10jenkins-bot: [V: 04-1] data retention audit script for logs, /root and /home dirs [operations/software] - 10https://gerrit.wikimedia.org/r/141473 (owner: 10ArielGlenn) [17:44:43] (03PS5) 10ArielGlenn: data retention audit script for logs, /root and /home dirs [operations/software] - 10https://gerrit.wikimedia.org/r/141473 [17:44:46] (03CR) 10jenkins-bot: [V: 04-1] data retention audit script for logs, /root and /home dirs [operations/software] - 10https://gerrit.wikimedia.org/r/141473 (owner: 10ArielGlenn) [17:53:32] (03PS6) 10ArielGlenn: data retention audit script for logs, /root and /home dirs [operations/software] - 10https://gerrit.wikimedia.org/r/141473 [17:57:29] (03CR) 10Rillke: "Sad to see UploadWizard patches pending for more than months without comments and if some community members and developers feel like have " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153302 (owner: 10Tim Starling) [17:57:49] (03CR) 10Nemo bis: data retention audit script for logs, /root and /home dirs (031 comment) [operations/software] - 10https://gerrit.wikimedia.org/r/141473 (owner: 10ArielGlenn) [18:00:16] (03CR) 10Nemo bis: "Would probably use a license statement somewhere?" (031 comment) [operations/software] - 10https://gerrit.wikimedia.org/r/141473 (owner: 10ArielGlenn) [18:07:53] Nemo_bis: it can use: cleanup. being broken into several little files. lots more cleanup. and did I mention cleanup? 
but I'm out of time... [18:10:24] apergos: sure, hence I only mentioned the one possibly-necessary thing which one line should suffice to solve :) [18:10:31] heh [18:11:11] ok that's it for me for the day, so outa here [18:11:24] evening/night :) [18:17:35] (03PS1) 10Ori.livneh: Small lint-fix for hhvm.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/153424 [18:50:51] (03PS1) 10Calak: Change user groups rights on ckb.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153427 (https://bugzilla.wikimedia.org/69394) [18:56:20] (03PS1) 10Steinsplitter: Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153428 [18:58:05] (03CR) 10Odder: [C: 031] Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153428 (owner: 10Steinsplitter) [18:59:07] (03PS2) 10Ori.livneh: apache: set a default 5-second GracefulShutdownTimeout [operations/puppet] - 10https://gerrit.wikimedia.org/r/153128 [19:00:04] (03CR) 10Perhelion: [C: 031] Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153428 (owner: 10Steinsplitter) [19:00:47] (03CR) 10Ori.livneh: [C: 032 V: 032] apache: set a default 5-second GracefulShutdownTimeout [operations/puppet] - 10https://gerrit.wikimedia.org/r/153128 (owner: 10Ori.livneh) [19:14:50] (03PS2) 10Steinsplitter: Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153428 [19:15:55] (03PS3) 10Steinsplitter: Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153428 [19:18:39] akosiaris: I have updated the node-jsdom package to ubutu 14 and tested on mathoid-puppet, could you deploy the current version of the git repository again? [19:18:49] (03CR) 10Mwjames: [C: 031] "I can only support this notion not on technical grounds but on its social/political implication after having witnessed that User:Eloquence" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153428 (owner: 10Steinsplitter) [19:18:51] (03CR) 10Hashar: "Please don't restore this change, and instead bring the discussion on wikitech-l. That is nicer, reach a wider audience and avoid clutteri" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [19:22:36] (03CR) 10Ricordisamoa: "Now in Ia629887a1cec625e56f3c8074ced68d09a6f7572" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder) [19:23:03] (03CR) 10Ricordisamoa: [C: 031] Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153428 (owner: 10Steinsplitter) [19:24:22] (03CR) 10Ricordisamoa: "7c260ef1b4c4774a7f45b78188b544b11c20e610 was not discussed on wikitech-l, nor did it reach an audience." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153321 (owner: 10Odder)
[19:29:59] (03CR) 10Jackmcbarn: "Note that reverting this patch will neither remove superprotection from pages it's already on nor remove the superprotect permission from " [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153302 (owner: 10Tim Starling)
[19:32:34] (03Abandoned) 10Hashar: Revert "Add a new protection level called "superprotect"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153428 (owner: 10Steinsplitter)
[19:32:40] ahh
[19:32:45] super productive evening
[19:33:23] ...
[19:33:50] writing a mail to the austrian press now. enough is enough.
[19:38:16] Steinsplitter: and that's going to have which effect, exactly?
[19:38:36] it's in the press?
[19:38:51] "wikipedia editors are mad at changes in the website"
[19:38:59] haha
[19:39:04] "facebook users are mad at the new timeline"
[19:39:13] :)
[19:39:27] no i don't - enjoying real life now.
[19:39:35] it is not worth spending more time on this :P
[19:41:05] (03CR) 10Ricordisamoa: "Please merge this as "Requested by Ricordisamoa"." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153428 (owner: 10Steinsplitter)
[19:41:08] (03PS1) 10Gage: Logstash filter changes to support messages from Hadoop [operations/puppet] - 10https://gerrit.wikimedia.org/r/153434
[19:41:45] (03CR) 10Gage: "See https://gerrit.wikimedia.org/r/#/c/153434/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140623 (owner: 10Gage)
[19:41:53] (03Abandoned) 10Gage: filter changes to support messages from Hadoop [operations/puppet] - 10https://gerrit.wikimedia.org/r/140623 (owner: 10Gage)
[19:42:47] (03CR) 10Gage: [C: 032] Logstash filter changes to support messages from Hadoop [operations/puppet] - 10https://gerrit.wikimedia.org/r/153434 (owner: 10Gage)
[20:00:54] (03CR) 10Odder: "Do not abandon this patch, ever again. Period." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/153428 (owner: 10Steinsplitter)
[20:12:05] 19:16, 11 August 2014 Koenraad (Talk | contribs) blocked Eloquence (Talk | contribs) with an expiry time of 1 month (account creation disabled) (Setzt sich über das Meinungsbild Wikipedia:Meinungsbilder/Medienbetrachter hinweg) [i.e. overriding the outcome of the Wikipedia:Meinungsbilder/Medienbetrachter RfC] hm.
[20:12:54] lol
[20:13:12] this is not a fake, see dewiki block log :P
[20:13:26] I've been blocked before on a foreign language Wikipedia or wiki and I'm a steward
[20:13:31] I was only reverting also
[20:14:12] o-O
[20:15:23] https://de.wikipedia.org/wiki/Wikipedia:Meinungsbilder/Medienbetrachter
[20:15:25] What is this?
[20:15:42] RFC
[20:26:09] (03PS5) 10Andrew Bogott: Tools: Remove lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/124001 (owner: 10Tim Landscheidt)
[20:31:33] (03CR) 10Andrew Bogott: [C: 032] "Sorry this took forever to merge!
I was somehow confused by the phrase 'Remove lint' thinking that it meant the opposite of 'lint.'" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124001 (owner: 10Tim Landscheidt) [20:50:26] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:51:16] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.021 second response time [20:52:04] (03PS2) 10Andrew Bogott: Tools: Install php5-imagick [operations/puppet] - 10https://gerrit.wikimedia.org/r/151551 (https://bugzilla.wikimedia.org/69078) (owner: 10Tim Landscheidt) [20:52:50] (03CR) 10Andrew Bogott: [C: 032] Tools: Install php5-imagick [operations/puppet] - 10https://gerrit.wikimedia.org/r/151551 (https://bugzilla.wikimedia.org/69078) (owner: 10Tim Landscheidt) [20:56:34] (03PS2) 10Andrew Bogott: Tools: Sort package lists alphabetically [operations/puppet] - 10https://gerrit.wikimedia.org/r/151526 (owner: 10Tim Landscheidt) [20:58:35] (03CR) 10Andrew Bogott: [C: 032] Tools: Sort package lists alphabetically [operations/puppet] - 10https://gerrit.wikimedia.org/r/151526 (owner: 10Tim Landscheidt) [21:04:01] (03PS2) 10Andrew Bogott: Tools: Install libgd-gd2-perl [operations/puppet] - 10https://gerrit.wikimedia.org/r/151416 (https://bugzilla.wikimedia.org/67199) (owner: 10Tim Landscheidt) [21:05:45] Jenkins sure is fast when I'm the only one working [21:06:10] (03CR) 10Andrew Bogott: [C: 032] Tools: Install libgd-gd2-perl [operations/puppet] - 10https://gerrit.wikimedia.org/r/151416 (https://bugzilla.wikimedia.org/67199) (owner: 10Tim Landscheidt) [21:06:21] (03PS1) 10Yurik: Zero: 436-06 added to unified and added https [operations/puppet] - 10https://gerrit.wikimedia.org/r/153481 [21:06:38] bblack, when you have a moment ^ (i'm looking at removing dependency from stats so we can migrate more at once) [21:06:48] ok [21:07:34] (03PS2) 10BBlack: Zero: 436-06 added to unified and added https [operations/puppet] - 10https://gerrit.wikimedia.org/r/153481 (owner: 10Yurik) [21:07:40] (03CR) 10BBlack: [C: 032 V: 032] Zero: 436-06 added to unified and added https [operations/puppet] - 10https://gerrit.wikimedia.org/r/153481 (owner: 10Yurik) [21:13:06] PROBLEM - Puppet freshness on db1011 is CRITICAL: Last successful Puppet run was Mon 11 Aug 2014 19:12:55 UTC [21:13:13] (03PS1) 10Yurik: Zero: updated 436-01 436-04 - both unified, both support https [operations/puppet] - 10https://gerrit.wikimedia.org/r/153505 [21:13:24] bblack, dan just told me that 436-01 should also be included [21:13:25] ^ [21:13:31] sorry to go one by one [21:13:37] np :) [21:14:58] (03CR) 10BBlack: [C: 032] Zero: updated 436-01 436-04 - both unified, both support https [operations/puppet] - 10https://gerrit.wikimedia.org/r/153505 (owner: 10Yurik) [21:16:27] (03PS1) 10Jgreen: remove deprecated config related to fundraising/aluminium [operations/puppet] - 10https://gerrit.wikimedia.org/r/153517 [21:19:01] (03CR) 10Jgreen: [C: 032 V: 031] remove deprecated config related to fundraising/aluminium [operations/puppet] - 10https://gerrit.wikimedia.org/r/153517 (owner: 10Jgreen) [21:27:39] (03PS1) 10Gage: Hadoop role: depend on JARs for GELF, pass param [operations/puppet] - 10https://gerrit.wikimedia.org/r/153526 [21:28:00] (03Abandoned) 10Gage: Hadoop: supply JARs for GELF output, pass parameter [operations/puppet] - 10https://gerrit.wikimedia.org/r/140677 (owner: 10Gage) [21:31:16] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data 
above the critical threshold [500.0] [21:32:21] (03CR) 10Gage: [C: 032] Hadoop role: depend on JARs for GELF, pass param [operations/puppet] - 10https://gerrit.wikimedia.org/r/153526 (owner: 10Gage) [21:33:36] RECOVERY - Puppet freshness on db1011 is OK: puppet ran at Mon Aug 11 21:33:26 UTC 2014 [21:33:48] (03CR) 10Andrew Bogott: [C: 032] "New checks seem to be mostly working now!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142560 (owner: 10Dzahn) [21:34:56] (03PS1) 10Gage: enable gelf logging from hadoop workers [operations/puppet] - 10https://gerrit.wikimedia.org/r/153528 [21:36:00] (03PS2) 10Gage: enable gelf logging from hadoop workers [operations/puppet] - 10https://gerrit.wikimedia.org/r/153528 [21:38:16] (03PS2) 10Andrew Bogott: rm old puppet_disabled check,replaced by new chk [operations/puppet] - 10https://gerrit.wikimedia.org/r/142560 (owner: 10Dzahn) [21:39:32] (03CR) 10Andrew Bogott: [C: 032] rm old puppet_disabled check,replaced by new chk [operations/puppet] - 10https://gerrit.wikimedia.org/r/142560 (owner: 10Dzahn) [21:41:50] (03CR) 10Gage: [C: 032] enable gelf logging from hadoop workers [operations/puppet] - 10https://gerrit.wikimedia.org/r/153528 (owner: 10Gage) [21:42:46] PROBLEM - puppet last run on ytterbium is CRITICAL: CRITICAL: Puppet has 1 failures [21:42:46] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Puppet has 1 failures [21:42:46] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 1 failures [21:42:47] PROBLEM - puppet last run on es1006 is CRITICAL: CRITICAL: Puppet has 1 failures [21:42:56] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet has 1 failures [21:42:57] PROBLEM - puppet last run on db1010 is CRITICAL: CRITICAL: Puppet has 1 failures [21:42:57] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: Puppet has 1 failures [21:42:57] PROBLEM - puppet last run on tarin is CRITICAL: CRITICAL: Puppet has 1 failures [21:43:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:43:17] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: Puppet has 1 failures [21:43:27] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:43:27] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures [21:43:27] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 1 failures [21:43:36] PROBLEM - puppet last run on es1009 is CRITICAL: CRITICAL: Puppet has 1 failures [21:43:36] PROBLEM - puppet last run on db1007 is CRITICAL: CRITICAL: Puppet has 1 failures [21:43:57] PROBLEM - puppet last run on mw1124 is CRITICAL: CRITICAL: Puppet has 1 failures [21:44:05] well, that is definitely my fault. [21:46:16] …maybe :/ [21:46:38] RECOVERY - puppet last run on db1007 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [21:47:34] !log removed the old puppet-freshness check which should have no effect but may instead produce a torrent of alert spam https://gerrit.wikimedia.org/r/#/c/142560/ [21:47:38] Logged the message, Master [21:48:48] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [21:54:19] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [21:55:12] huge reqerror spike there ^ [21:55:13] ? 
[21:55:27] https://gdash.wikimedia.org/dashboards/reqerror/ [21:56:38] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-1hours&from=-1hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [21:56:49] over in 5 minutes, but still curious if anyone has an explanation [21:56:53] Hm, so it's over already? [21:57:01] Ah, as you said. [22:00:27] spike of mediawiki fatals, apparent here: https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal|exception>ype=stack&glegend=show&aggregate=1&embed=1 [22:00:31] (it's in the channel topic) [22:00:40] RECOVERY - puppet last run on es1009 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [22:00:49] RECOVERY - puppet last run on ytterbium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [22:00:56] to dive in you can go to logstash or fluorine:/a/mw-log/fatal.log [22:00:59] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [22:00:59] RECOVERY - puppet last run on db1010 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [22:01:19] RECOVERY - puppet last run on mw1124 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [22:01:20] RECOVERY - puppet last run on tarin is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [22:01:29] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [22:01:29] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [22:01:49] RECOVERY - puppet last run on es1006 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:01:49] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [22:01:50] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [22:01:59] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [22:02:19] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [22:09:20] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [23:03:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [23:11:15] robin williams dies.. possible MJ effects may follow ? [23:11:30] thedj: yeah, already getting " Pool queue is full" [23:12:15] handy with everyone being on planes... [23:12:15] bblack: ^ [23:13:10] <_joe_> NotASpy: looking [23:13:37] <_joe_> NotASpy: just that page? [23:13:54] yeah, as far as we're aware. [23:14:13] <_joe_> that's what I'm seeing [23:14:17] (we being the en.wp admin cabal) [23:14:26] legoktm: You around? [23:14:36] hi [23:14:51] hard refresh got it to load. [23:15:04] yeah our graphs do confirm a big spike up in traffic that looks like that [23:15:08] legoktm: I'm worrying about what to do as the duty greg-g. [23:15:11] legoktm: Thoughts? [23:15:21] I'm not sure what we can really do besides increase the pool counter limit? 
[23:16:05] <_joe_> legoktm: how confident are you that won't end up doing more harm than good? [23:16:12] 0% [23:16:22] <_joe_> I'd like to preserve the rest of wp online if possible [23:16:27] * James_F nods. [23:17:06] we can FP the page so people can read and it'll invalidate cache less often [23:17:08] <_joe_> it seems to load better in the continental US than in Europe? [23:17:15] * bblack grumbles something about how a read-mostly page is geneating so much write-like activity into our backends due to stats and banner checks and such :P [23:17:28] <_joe_> that may work [23:18:00] (and yes, it loads fine for me here in the US, I'm mapped to eqiad caches) [23:18:34] <_joe_> it still doesn't for me [23:18:44] <_joe_> the traffic spike can be seen here https://gdash.wikimedia.org/dashboards/reqsum/ [23:19:12] <_joe_> it did now [23:20:08] the spike looks within the bounds of normal traffic variation, though... [23:21:12] <_joe_> all systems are healty AFAICS [23:21:59] <_joe_> also, the poolcounter stats look healthier now [23:22:29] <_joe_> NotASpy: are people still having difficulties loading the page? [23:23:14] <_joe_> error rate hasn;t been pretty https://gdash.wikimedia.org/dashboards/reqerror/ [23:23:32] _joe_: no reports of difficulties on the various IRC channels yet [23:24:35] we're going to protect the page for 12 hours though, to try and make your lives a bit easier. [23:24:42] It's protected now [23:25:25] _joe_: bblack: It seems Mobile caches eqiad are spiking more by relative comparison than overall bits and text caches. Understandable given the way people get the news. [23:25:26] <_joe_> thanks guys [23:25:39] <_joe_> Krinkle: of course, yes [23:25:58] It may be useful to give mobile caches more juice or to distribute things differently. We didn't have this problem around MJ. [23:26:09] Anyway, just thinking ahead. [23:26:27] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Mobile%2520caches%2520eqiad&tab=m&vn=&hide-hf=false [23:26:30] <_joe_> Krinkle: they are holding up fine right now it seems [23:26:36] Yeah [23:27:00] How's the distributions these days (# of cache servers for mobile vs other traffic) [23:27:01] 20/80? [23:27:22] <_joe_> mh lemme check [23:27:39] <_joe_> that may in fact need to be reconsidered [23:28:19] Looks like there's 4 mobile eqiad cps and 12 text caches. [23:28:32] I might be looking at the wrong numbers. I figured there'd be more varnishes. [23:28:44] I guess ganglia isn't the best way to see that information. We have front-ends as well. [23:29:03] and bits is not fragmented anymore, right? [23:29:40] SSL cluster doesn't seem to show any spike. Odd. [23:29:42] <_joe_> yes you are correct [23:29:51] <_joe_> bits is [23:29:57] <_joe_> at the varnish level [23:30:03] OK [23:30:08] <_joe_> not odd, no one usess https [23:30:12] <_joe_> :( [23:30:22] Ah, SSL did spike but only in traffic throughput, not in cpu. [23:30:23] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=SSL+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [23:30:25] Cool. [23:30:27] It doubled in fact. [23:30:45] yeah [23:30:49] <_joe_> cool to see ECDHE is so cool [23:31:06] we actually have 4x cache machines in eqiad currently unused, that we could fairly rapdily turn into mobile caches right now [23:31:18] Are we actually getting errors or problems from them though? 
[23:31:37] Traffic went way up on mobile caches but according to other Ganglia graphs they don't look unhealthy AFAICT [23:31:42] <_joe_> RoanKattouw: https://gdash.wikimedia.org/dashboards/reqerror/ [23:31:44] yeah they're healthy [23:31:54] <_joe_> we had a few error spikes [23:32:07] Yeah I wonder how many of those are 502/503 (cache) and how many are 500 (appserver) [23:32:15] <_joe_> but I bet it's due to the herd effect after any cache invalidations [23:32:19] he RoanKattouw. [23:32:22] SSL might get into a similar problem area as mobile, generally being a minority traffic wise [23:32:25] I guess separation of SSL and mobile could potentially become a problem (the general minority traffic reaching traffic volume of regular figures). [23:32:28] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [23:32:43] [16:24] ori [2014-08-11 14:58:26] spike of mediawiki fatals, apparent here: https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=fatal|exception>ype=stack&glegend=show&aggregate=1&embed=1 [23:32:51] well the separation of mobile is necessary currently, because mobile's fragmentation is high with Zero [23:33:08] That graphs appears to correlate with the error spikes AFAICT [23:33:09] perhaps after the gif/js defrag work, we could reconsider merging the pools [23:33:29] https://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg%5B%5D=vanadium.eqiad.wmnet&mreg%5B%5D=fatal|exception>ype=stack&glegend=show&aggregate=1&embed=1 [23:34:38] <_joe_> since the alarm seems off, I'll go to sleep [23:35:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [23:35:36] <_joe_> mh [23:35:40] Grmbl [23:35:42] <_joe_> the alarm is _not_ off [23:35:46] heh [23:35:54] They're OOMs in DifferenceEngine [23:36:02] Is someone making very large edits to things? [23:36:23] The URL for the errors is the feedrecentchanges API which includes a diff for every recent edit [23:37:25] <_joe_> https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=MediaWiki+errors&vl=errors+/+sec&x=0.5&n=&hreg%5B%5D=vanadium.eqiad.wmnet&mreg%5B%5D=fatal%7Cexception>ype=stack&glegend=show&aggregate=1&embed=1 [23:37:56] <_joe_> last hour has seen fatals on mw off the charts [23:38:07] OK so some OOMs that I see appear to be related to https://en.wikipedia.org/?title=User:Lefauneblanc/sandbox2&action=history [23:38:20] Those diffs seem to be hard on the diff code [23:38:35] Not too surprising since the page is enormous [23:38:35] wtf [23:38:43] I'll just delete the page [23:39:29] Hmm [23:39:41] Is DifferenceEngine::localiseLineNumbers() new or old? [23:40:20] Doesn't seem to have been touched recently [23:40:36] That's what's OOMing on us: localizing the line numbers in a massive diff using regex-fu [23:40:48] (on enwiki, mind you, so localizing the line numbers is probably a no-op anyway) [23:41:45] this guy has a bunch of other sandboxes with similar things [23:41:53] * legoktm nukes [23:42:50] <_joe_> well I guess the big exceptions spike we saw earlier [23:43:03] RoanKattouw: I do recall Tim saying that the new diff engine ("new" compared to 2009) is less performant, but it wouldn't be a problem. [23:43:04] <_joe_> had nothing to do with that page [23:43:24] Yeah exactly [23:43:30] Were there any other problems? 
[23:43:48] Like, ones we might be able to pin on [[Robin Williams]] rather than multi-megabyte user sandbox pages? [23:43:58] <_joe_> RoanKattouw: the errors I were getting were not from varnish [23:44:16] <_joe_> on the RW page [23:44:28] <_joe_> they were well formatted mw errors [23:44:43] What were the errors? PoolCounter messages or something else? [23:44:47] RoanKattouw: I was getting poolcounter full for reading [23:45:17] <_joe_> mine was just saying "too many people are trying to visit this page, please try again later" [23:45:32] OK, so people are/were having a hard time reading RW but PC seems to be doing its job and sending people away rather than letting the cluster fall over then [23:45:56] <_joe_> yes [23:46:08] If we are still getting PC warnings and nothing else is falling over, we could raise the PC limits like James_F suggested to legoktm earlier [23:46:41] I haven't seen any warnings recently. [23:46:44] Problem being that both lego and I just walked off the same 11-hour flight and need slee [23:46:45] p [23:46:46] <_joe_> I'm not sure we're getting warnings now [23:46:59] <_joe_> RoanKattouw: eh, talk about sleep [23:46:59] OK good [23:47:05] <_joe_> 2 AM here :) [23:47:15] _joe_: 1am in my wake-up timezone, and I woke up at 6:45am :) [23:47:35] <_joe_> yes, funny indeed [23:47:47] <_joe_> :/ [23:47:59] But yeah if the page was protected, then edit volume probably fell, which would also improve things [23:47:59] <_joe_> anyway, traffic is cooling off [23:48:07] To get PC warnings you need a lot of traffic *and* a lot of edits [23:48:09] <_joe_> as well [23:48:32] * James_F waves from actual-01:00 too. [23:48:51] forgive my ignorance, but I assume the whole reason the clsuter is involved at all is due to api hits on readonly pages, can we not have that stuff soft-fail and still show the content? 
[23:49:11] bblack: So the usual thing that happens is this [23:49:22] 1) Someone edits the Robin Williams page [23:49:28] 2) Edit causes cache invalidation [23:49:28] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [23:49:50] 3) Someone visits the Robin Williams page; cache is stale, so we need to render the page from scratch [23:50:00] 4) While #3 happens, a thousand other people also do #3 at the same time [23:50:31] 5) Once #3 has finished and people doing #4 start hitting the cache (or even before that), someone else does #1 [23:50:33] Rinse, repeat [23:50:36] well I get that will cause a small spike after each edit, but once that little spike is past, given the page is protected, we should be fine [23:50:52] So #4 would cause a stampede if it weren't for PoolCounter [23:51:24] PC returns error messages (in theory it should be returning stale content, don't know why that's not working) when too many people try to read the page on an empty parser cache [23:51:41] Also, biography articles like Robin Williams often contain lots of references and are slow to parse, which causes the spike to take longer [23:52:13] And during events like these, the page in question is often edited several times per minute, so the cache is cleared about as fast as we can fill it [23:52:33] if that's really the scenario, perhaps we can fix a lot of that in varnish [23:52:36] When Michael Jackson passed away, we had no PoolCounter and a slower parser, and it took the site down [23:53:02] bblack: When I say cache I don't mean Varnish, I mean MediaWiki's parser cache [23:53:02] <_joe_> good night everyone [23:53:06] Night [23:53:08] nite joe [23:53:29] yeah but why would varnish pass off the request to MW at all in this case? [23:53:35] The wikitext->HTML conversion gets cached there, then separately we add a bunch of UI chrome and that then gets cached in Varnish [23:53:48] Because upon edit, both the parser cache and Varnish are invalidated [23:53:52] yes [23:54:26] but all varnishes, globally, ultimately redirect through our 2-3 layers of varnish to hit a specific varnish text backend in eqiad for a given article (based on hash) [23:54:49] all that one varnish server has to do is make all parallel requestors wait on the first fetch after an invalidation, and it should be one edit -> one re-render, in theory [23:55:13] if it's not, that's something we can try to address at that layer [23:55:58] That sounds sensible enough, except [23:56:19] 1) we didn't have Varnish for text yet when we implemented PC, so we didn't have that option at the time [23:56:32] 2) what about logged-in users who bypass some of this Varnish stuff? [23:56:50] yeah logged-in users are a whole different ball of wax [23:57:07] Yeah so I agree with your reasoning that Varnish should already be deduplicating renders caused by anons [23:57:11] and I'm not saying this affects PC in any way really, just saying that the varnish layer itself could be smarter here [23:57:20] But for logged-in users it won't, and so PoolCounter limits could still be exceeded [23:57:39] I wonder whether anyone that saw failures was not-logged-in [23:58:13] I'm seeing tweets from (presumably logged-out) users getting "crashes" instead of the "wiki page for Williams". [23:58:13] Given that we got no reports in #wikipedia, logged-out users might not have [23:58:20] RoanKattouw: No using PC. It's ambiguous. 
;-) [23:58:32] They could have, in theory, if a sufficient number of logged-in users beat them to it and filled up the PoolCounter quota [23:58:40] true [23:59:09] Then Varnish would have either cached the PoolCounter error or effectively undone its deduplication of anon requests, both of which sound bad [23:59:31] Also presumably when Varnish does do deduplication, the user's experience is simply a spinner that takes a long time [23:59:38] right
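To make the PoolCounter behaviour RoanKattouw describes above concrete, here is a toy, single-process Python sketch of the idea: only a bounded number of clients may re-parse a given page at once, and the overflow either gets the last stale rendering or the "too many people" error that readers of [[Robin Williams]] were seeing. The real PoolCounter is a separate network daemon whose limits are configured via $wgPoolCounterConf; the constants and helper names below are invented for illustration.

```python
import threading
from collections import defaultdict

MAX_WORKERS = 4                 # invented limit; real pools are sized in $wgPoolCounterConf

_pool = defaultdict(lambda: threading.Semaphore(MAX_WORKERS))
_parser_cache = {}              # fresh rendering; cleared when the page is edited
_stale_cache = {}               # last known rendering, kept for the "serve stale" path

def view_page(title, render):
    html = _parser_cache.get(title)
    if html is not None:
        return html                              # parser cache hit, no contention at all
    sem = _pool[title]
    if not sem.acquire(blocking=False):          # too many concurrent re-parses of this page
        if title in _stale_cache:
            return _stale_cache[title]           # in theory PC should hand back stale content
        raise RuntimeError("Too many people are trying to view this page. Try again later.")
    try:
        html = render(title)                     # the expensive wikitext -> HTML parse
        _parser_cache[title] = _stale_cache[title] = html
        return html
    finally:
        sem.release()

def on_edit(title):
    # An edit invalidates the fresh copy; overflow readers can still get the stale one.
    _parser_cache.pop(title, None)
```

bblack's suggestion above amounts to doing the same deduplication one layer earlier, in the tier-1 Varnish text backend, so that for anonymous traffic one edit leads to one re-render before PoolCounter is ever consulted; logged-in users, who bypass that cache, would still rely on PoolCounter.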