[00:00:06] Now *really*, we shouldn't have pages that parse slowly, and with HHVM this should be a lot better
[00:00:17] it wouldn't have mixed them up at the varnish layer afaik, but yeah if PC actually gave an error for the anon fetch due to overload from logged-in, that could have been cached somewhere.
[00:00:18] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.004 second response time
[00:00:46] the only sane way to check that is to check all the frontends really
[00:01:05] Yay now graphite is breaking :(
[00:01:27] bblack: In practice I think that'll wash out quickly enough due to the article's edit rate
[00:02:40] heh can't we get the editors to agree to work it out elsewhere and slow down their edit rate at times like these? :P
[00:02:59] Well that's basically what happened when the page was protected at :24
[00:03:11] https://en.wikipedia.org/w/index.php?title=Robin_Williams&action=history
[00:03:18] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time
[00:03:28] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds
[00:03:28] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds
[00:03:29] but we're still invalidating every 3 minutes or so
[00:03:31] I looked at the last 50 edits, there are 10 edits in the past ~40 mins since the protection, and 40 edits in the ~15 mins before that
[00:03:47] So we went from 1 edit every 20s to 1 edit every ~4 mins
[00:04:17] And if the parse time is something like 10-15s (haven't measured this yet), then that's a big difference
[00:04:31] yeah
[00:04:46] just sayin :)
[00:04:57] Clearly we should protect every page! ;-)
[00:05:10] superly
[00:05:28] * bblack puts a not on Robin_Williams talk page that says "Every time you make a tiny edit, a thousand operations kittens die"
[00:05:29] greg-g: Hah.
[00:05:34] s/not/note/ :)
[00:05:54] think of the ops kitties
[00:06:04] Previewing the RW page takes 6.5s for me
[00:06:56] OK, so gdash is working again and showing a much lower page view rate
[00:07:44] I think we're confident here that lots of people editing RW isn't gonna make anything fall over, if anything it'll just cause PoolCounter warnings to be displayed depending on the mix of edit frequency and page view intensity
[00:08:04] So, if that sounds accurate to people here, maybe we should unprotect the page and see what happens?
[00:08:12] +1
[00:08:29] * RoanKattouw looks at bblack in particular
[00:09:15] +1
[00:09:16] you're welcome to take a try
[00:09:20] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=MediaWiki+errors&vl=errors+/+sec&x=0.5&n=&hreg%5B%5D=vanadium.eqiad.wmnet&mreg%5B%5D=fatal%7Cexception&gtype=stack&glegend=show&aggregate=1&embed=1
[00:09:30] ^ is looking much better, I presume due to those nukings of that user page
[00:09:55] RoanKattouw: Unprotecting speculatively is a bit worrying.
[00:10:35] RoanKattouw: But that's for you and greg-g to consider.
[00:10:38] * James_F offs.
[00:10:56] I investigated the cause of the 5xx errors and they 1) appear to be unrelated to RW and 2) appear to be fixed now
[00:11:05] I'm fine as long as bblack is and we're quick to re-protect if needed
[00:11:10] OK
[00:11:17] I can't access any Wikimedia-site atm.
[00:11:19] Yeah obviously we'll re-protect if we have problems
[00:12:05] The reqsum graph seems to be flatlining BTW, it appears the spike has finished dissipating
[00:12:15] yeah I agree
[00:12:27] OK I'll go ahead and unprotect
[00:12:34] ganglia's been levelling back off the traffic spike for a while now too
[00:12:40] Yourself?
[00:12:57] That sounds like a bad idea. This should be communicated to the enwiki admins to do
[00:13:12] I don't see why
[00:13:51] RoanKattouw: just lower it to semi which it was already at
[00:13:56] Yeah exactly
[00:14:08] It took me a bit to figure out what the original protection status is
[00:14:10] this is probably the best view of the overall network-level situation: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=LVS+loadbalancers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
[00:14:22] we're definitely on the downhill slope of the spike, but not quite back to normal levels
[00:14:26] Krenair: Where would I find Fluffernutter to ask them to unprotect?
[00:14:47] Finnegan on IRC.
[00:14:50] yes, legoktm?
[00:14:53] RoanKattouw: ^
[00:14:58] Oh hey
[00:15:04] hi :)
[00:15:08] Finnegan: It's safe to unprotect for technical reasons anyway.
[00:15:09] Finnegan: We wanna try unprotecting the page because we believe that should be fine now
[00:15:22] okey-dokey
[00:15:28] A bunch of people are keeping a close eye now
[00:15:32] * Finnegan loads up page
[00:17:34] ok, knocked it back to it's regular semiprotection
[00:17:44] Thanks
[00:17:49] hoo: legoktm: this should definitely *not* have been done by Roan or anyone else in staff without asking Fluffernutter or another enwiki admin given that a) there's already a heck of a lot of tension and b) it's no way an emergency
[00:18:12] <-- enwiki admin
[00:18:13] Thehelpfulone: Yeah, good call. I was going to do it but Krenair stopped me
[00:18:17] ;)
[00:18:35] legoktm: also foundation staff/contractor not consulting with the original admin :P
[00:18:37] https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=MediaWiki+errors&vl=errors+%2F+sec&x=0.5&n=&hreg%5B%5D=vanadium.eqiad.wmnet&mreg%5B%5D=fatal|exception&gtype=stack&glegend=show&aggregate=1&embed=1
[00:18:40] spike seems gone
[00:18:43] thanks for doing that Roan :)
[00:18:51] in *this particular case* it would probably have been ok because i explicitly said i was cool with anyone knocking it back, but Thehelpfulone is right that in the current state of tensions, better to not
[00:18:53] were you at Wikimania? I think I must have missed you!
[00:18:57] Yes!
[00:18:59] Yeah, James told me Roan should not do it himself
[00:19:09] And legoktm, greg-g and myself just got home on the same plane
[00:19:09] aw :(
[00:19:11] Thehelpfulone: yes :(
[00:19:15] So we're all equally jetlagged
[00:19:29] yeah, not a fun way to return (both the news and the effect on our cluster)
[00:19:29] Great timing
[00:19:31] I was looking for you legoktm - Ironholds tried and failed to find you to introduce me to you
[00:19:37] Krinkle: yeah, RoanKattouw found some huge userpages that were causing OOMs so I nuked them.
[00:19:51] Thehelpfulone: Oh, that's sad :S
[00:19:53] would you guys like me to stick around in here case you need protection bumped back up, or are you pretty confident we're good to go?
[00:20:08] Finnegan: No, we can do that ourselves
[00:20:21] things look pretty good so far in any case
[00:20:28] bblack: How long are you gonna be awake for?
[00:20:32] a long time
[00:20:33] I'll stick around for a while
[00:20:40] Because I'm starting to feel like I want to plant my face in my bed soon
[00:20:40] * Finnegan tips hat. Pleasure doin' business with y'all :)
[00:21:02] hoo: again - if you can, get a non-staff enwiki admin to do it what with current tensions
[00:21:06] o/
[00:21:32] Thehelpfulone: Well, sure... but if the cluster is at risk... these are purely technical actions
[00:22:28] yeah sure if it's an emergency that's fine - just remember https://meta.wikimedia.org/wiki/Requests_for_comment/Superprotect_rights is blowing up
[00:22:35] bblack: Awesome. I think I'll sign off soon for some much-needed post-conference/jetlag sleep
[00:22:41] go for it
[00:23:23] Thehelpfulone: Yeah, you're right that with the current shitstorms going on it's probably wise to avoid doing anything that involves WMF staff and page protection in the same sentence
[00:23:43] * greg-g 'll still be online for a bit
[00:31:13] bblack: OK I'm gonna go. If people start complaining about PoolCounter limit exceeded errors again while the cluster looks completely healthy, you may be able to get someone to carefully raise the poolcounter limit, if they're not traveling or asleep
[00:31:38] If everything explodes in a hellacious ball of fire, you know where to find phone numbers ;)
[00:32:14] make a local copy of that wikipage now ;)
[00:32:35] the phone numbers list thingy?
[00:33:30] yeah, it's on officewiki, which is on the cluster
[00:34:09] I know, it's linked for wikitech
[00:40:00] :)
[00:40:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0]
[00:41:00] ^ that so far looks like some relatively-benign exception spike
[00:41:20] Yeah
[00:41:32] It's execution time exceeded errors for parsing some very long pages
[00:41:58] go to sleep already! :p
[00:41:59] Like http://commons.wikimedia.org/wiki/Commons:Requests_for_rights/possible_autopatrolled_candidates/sortableTable500 and http://commons.wikimedia.org/wiki/Commons:Requests_for_rights/possible_autopatrolled_candidates/full/3
[00:42:07] I was going to, then I saw icinga :)
[00:42:25] lolwut
[00:42:37] I'm seeing 500s due to malformed Host headers, that's a bug in MWMultiVersion
[00:43:04] I only see a small uptick in the 500 rate vs before, just recently
[00:44:42] Filed https://bugzilla.wikimedia.org/69419 for that
[01:10:28] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[01:46:28] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds
[01:59:02] it did get a spike: http://gdash.wikimedia.org/dashboards/jobq/
[02:08:36] (CR) Hoo man: "Just an idea" (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153306 (owner: Legoktm)
[02:19:39] !log LocalisationUpdate completed (1.24wmf15) at 2014-08-12 02:18:36+00:00
[02:19:49] Logged the message, Master
[02:33:12] !log LocalisationUpdate completed (1.24wmf16) at 2014-08-12 02:32:09+00:00
[02:33:19] Logged the message, Master
[02:54:59] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Epic puppet fail
[03:00:18] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Tue 12 Aug 2014 01:00:07 UTC
[03:05:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[03:15:40] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Aug 12 03:14:34 UTC 2014 (duration 14m 33s)
[03:15:45] Logged the message, Master
[03:15:59] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[03:17:28] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[03:23:27] (PS1) BBlack: Move ulsfo public traffic to eqiad temporarily for net maintenance [operations/dns] - https://gerrit.wikimedia.org/r/153557
[03:24:33] (PS2) BBlack: Move ulsfo public traffic to eqiad temporarily for net maintenance [operations/dns] - https://gerrit.wikimedia.org/r/153557
[03:25:19] PROBLEM - puppet last run on amssq43 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:25:48] (CR) BBlack: [C: -1] "This is prepped and on hold for later (~10:00 UTC, Aug 12)" [operations/dns] - https://gerrit.wikimedia.org/r/153557 (owner: BBlack)
[03:26:58] ^ amssq43 is puppet server load issues again
[03:42:26] (PS1) Springle: Depool db1009, filesystem issue. Repool db1018, take over vslow & dump for s2. Move db1036 back onto normal traffic for s2, with warm-up. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153558
[03:43:11] (CR) Springle: [C: 2] Depool db1009, filesystem issue. Repool db1018, take over vslow & dump for s2. Move db1036 back onto normal traffic for s2, with warm-up. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153558 (owner: Springle)
[03:43:15] (Merged) jenkins-bot: Depool db1009, filesystem issue. Repool db1018, take over vslow & dump for s2. Move db1036 back onto normal traffic for s2, with warm-up. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153558 (owner: Springle)
[03:43:19] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[03:45:12] !log springle Synchronized wmf-config/db-eqiad.php: s2: depool db1009. repool db1018. adjust db1036 load. (duration: 00m 07s)
[03:45:17] Logged the message, Master
[03:57:18] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Tue 12 Aug 2014 01:56:58 UTC
[04:23:31] (PS1) Springle: Repool db1009 and db1035. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153559
[04:24:05] (CR) Springle: [C: 2] Repool db1009 and db1035. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153559 (owner: Springle)
[04:24:08] (Merged) jenkins-bot: Repool db1009 and db1035. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153559 (owner: Springle)
[04:25:00] !log springle Synchronized wmf-config/db-eqiad.php: s2: repool db1009. s3: repool db1035. (duration: 00m 06s)
[04:25:06] Logged the message, Master
[04:34:32] (PS1) Springle: Consistent error handling for methods (percona, ddl, ddlonline). [operations/software] - https://gerrit.wikimedia.org/r/153560
[04:34:59] (CR) Springle: [C: 2] Consistent error handling for methods (percona, ddl, ddlonline). [operations/software] - https://gerrit.wikimedia.org/r/153560 (owner: Springle)
[04:35:56] (PS1) Springle: Add centralauth to information_schema_p on labsdb. [operations/software] - https://gerrit.wikimedia.org/r/153561
[04:36:23] (CR) Springle: [C: 2] Add centralauth to information_schema_p on labsdb. [operations/software] - https://gerrit.wikimedia.org/r/153561 (owner: Springle)
[05:01:18] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Tue 12 Aug 2014 01:00:07 UTC
[05:01:23] !log rsync ~1TB labsdb1001 to labsdb1003, throttled ~25MB/s
[05:01:28] Logged the message, Master
[05:20:18] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Tue Aug 12 05:20:08 UTC 2014
[05:37:18] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Aug 12 05:37:11 UTC 2014
[06:28:59] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 6 failures
[06:29:18] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:29:38] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:38] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:39] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:39] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:39] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:48] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:59] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:28] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:28] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:42:28] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:44:06] I peeked at mw1189 randomly out of the list above, seems to be puppet server overload again, but saw this in syslog too:
[06:44:09] Aug 11 06:35:08 mw1189 kernel: [47396820.607055] CPU10: Package temperature above threshold, cpu clock throttled (total events = 486862330)
[06:44:12] wtf?
[06:44:15] (lots of those)
[06:44:48] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:45:38] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:45:38] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:45:38] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:18] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:46:28] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:46:28] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:46:59] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:47:38] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:00:28] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[08:14:49] (PS1) Alexandros Kosiaris: osm: Enable ganglia diskstat plugin [operations/puppet] - https://gerrit.wikimedia.org/r/153566
[08:18:11] (CR) Alexandros Kosiaris: [C: 2] osm: Enable ganglia diskstat plugin [operations/puppet] - https://gerrit.wikimedia.org/r/153566 (owner: Alexandros Kosiaris)
[08:59:38] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:12:32] (PS3) QChris: Force redis dump before backing up [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/153395 (https://bugzilla.wikimedia.org/68731)
[09:14:08] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 6.3GB (= 5.0GB critical): /srv/deployment/ocg/output 2362626044B: /srv/deployment/ocg/postmortem 3009801B: ocg_job_status 6582 msg: ocg_render_job_queue 0 msg
[09:16:05] (PS1) QChris: Make hourly backup keep around known-good full backups in case of issues [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/153568 (https://bugzilla.wikimedia.org/68731)
[09:17:38] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[09:32:49] (PS2) QChris: Make hourly backup keep around known-good full backups in case of issues [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/153568 (https://bugzilla.wikimedia.org/68731)
[09:50:18] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.002 second response time
[09:52:18] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.335 second response time
[09:52:28] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds
[09:52:28] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds
[10:06:12] (Abandoned) Giuseppe Lavagetto: apache: adding apache::mod_files [operations/puppet] - https://gerrit.wikimedia.org/r/151605 (owner: Giuseppe Lavagetto)
[10:07:09] (CR) Giuseppe Lavagetto: [C: 2] "rolling out approaching maintenace window." [operations/dns] - https://gerrit.wikimedia.org/r/153557 (owner: BBlack)
[10:07:29] <_joe_> ^^ moving traffic from ulsfo to eqiad
[10:17:49] (PS1) Giuseppe Lavagetto: Revert "Move ulsfo public traffic to eqiad temporarily for net maintenance" [operations/dns] - https://gerrit.wikimedia.org/r/153574
[10:18:13] (CR) Giuseppe Lavagetto: [C: -1] "merge after 15:00 UTC" [operations/dns] - https://gerrit.wikimedia.org/r/153574 (owner: Giuseppe Lavagetto)
[10:23:06] (Abandoned) Giuseppe Lavagetto: mediawiki: use apache::mod_files [operations/puppet] - https://gerrit.wikimedia.org/r/151606 (owner: Giuseppe Lavagetto)
[10:30:51] stats.grok.se looks dead to me :P Do we have a similar site that crunches the page view data?
[10:34:04] it was working earlier on, I was using it to think about renaming a page and setting up a disambig on en.wp
[10:34:40] hoo: http://tools.wmflabs.org/wikiviewstats/ ?
[10:35:16] ah, nice
[10:36:22] that might be nicer that stats.grok.se
[10:48:43] (PS1) Giuseppe Lavagetto: mediawiki: HAT appserver should turn off mod_php [operations/puppet] - https://gerrit.wikimedia.org/r/153577
[11:05:23] (PS1) Alexandros Kosiaris: Enable mathoid deployment::target on beta [operations/puppet] - https://gerrit.wikimedia.org/r/153582
[11:07:55] (CR) Ori.livneh: [C: -1] "No, let's not do it this way please. apache::mod_conf was designed to be internal / private to the Apache module (and I documented it as s" [operations/puppet] - https://gerrit.wikimedia.org/r/153577 (owner: Giuseppe Lavagetto)
[11:09:13] (PS2) Giuseppe Lavagetto: mediawiki: HAT appserver should turn off mod_php [operations/puppet] - https://gerrit.wikimedia.org/r/153577
[11:13:33] (CR) Giuseppe Lavagetto: "Since there is no interface providing a way to absent modules without using mod_conf (and we already do that here for other modules...) I " [operations/puppet] - https://gerrit.wikimedia.org/r/153577 (owner: Giuseppe Lavagetto)
[11:19:21] (CR) Alexandros Kosiaris: [C: 2] Enable mathoid deployment::target on beta [operations/puppet] - https://gerrit.wikimedia.org/r/153582 (owner: Alexandros Kosiaris)
[11:20:44] (CR) Alexandros Kosiaris: [C: 2 V: 2] Handle daemon restarts [operations/debs/etherpad-lite] - https://gerrit.wikimedia.org/r/152863 (owner: Alexandros Kosiaris)
[11:21:39] !log Jenkins: clearing up some obsolete symbolic links under gallium.wikimedia.org:/var/lib/jenkins/jobs/*/builds/ Running in a screen as user jenkins
[11:21:47] Logged the message, Master
[11:29:08] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[11:34:17] (CR) Hashar: "recheck" [operations/dns] - https://gerrit.wikimedia.org/r/152269 (owner: Alexandros Kosiaris)
[11:38:02] (Draft1) Filippo Giunchedi: swift-ring: manage swift rings via git [operations/software/swift-ring] - https://gerrit.wikimedia.org/r/153584
[11:42:39] (PS2) Ori.livneh: Small lint-fix for hhvm.pp [operations/puppet] - https://gerrit.wikimedia.org/r/153424
[11:43:10] (CR) Filippo Giunchedi: "see also the item about managing the swift rings at https://wikitech.wikimedia.org/wiki/Swift/TODO" [operations/software/swift-ring] - https://gerrit.wikimedia.org/r/153584 (owner: Filippo Giunchedi)
[11:48:28] (PS2) Hoo man: Set "siteGroup" for testwikidata and wikidata [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153121
[11:49:53] (CR) Hoo man: [C: 2] "Only affects testwikidata atm." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153121 (owner: Hoo man)
[11:50:15] (Merged) jenkins-bot: Set "siteGroup" for testwikidata and wikidata [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153121 (owner: Hoo man)
[11:51:00] (PS1) Springle: Make permanent some labsdb TokuDB settings applied with SET GLOBAL. [operations/puppet] - https://gerrit.wikimedia.org/r/153585
[11:51:10] !log hoo Synchronized wmf-config/InitialiseSettings.php: Set siteGroup for testwikidata (duration: 00m 11s)
[11:51:14] Logged the message, Master
[11:51:30] (PS1) Ori.livneh: wmflib: add validate_ensure() [operations/puppet] - https://gerrit.wikimedia.org/r/153586
[11:51:49] (CR) Springle: [C: 2] Make permanent some labsdb TokuDB settings applied with SET GLOBAL. [operations/puppet] - https://gerrit.wikimedia.org/r/153585 (owner: Springle)
[11:51:56] (PS2) Ori.livneh: wmflib: add validate_ensure() [operations/puppet] - https://gerrit.wikimedia.org/r/153586
[11:53:08] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[12:24:01] (CR) Filippo Giunchedi: [C: 1] apache: add a 'replaces' parameter to apache::conf [operations/puppet] - https://gerrit.wikimedia.org/r/153406 (owner: Giuseppe Lavagetto)
[12:33:33] (CR) Filippo Giunchedi: "LGTM in general, I think adding some use cases would clear things up (e.g. at the beginning of custom functions/resources?)" [operations/puppet] - https://gerrit.wikimedia.org/r/153586 (owner: Ori.livneh)
[12:36:25] (PS1) Ori.livneh: salt-minion service should fail to start if master rejects key [operations/puppet] - https://gerrit.wikimedia.org/r/153589
[12:37:07] godog: thanks; could you look at https://gerrit.wikimedia.org/r/#/c/153589/ if you ahve a chance?
[12:40:36] (CR) Ori.livneh: [C: 2] Small lint-fix for hhvm.pp [operations/puppet] - https://gerrit.wikimedia.org/r/153424 (owner: Ori.livneh)
[12:42:28] ori: no worries :) heading out to lunch now tho, might have some time later on tho
[12:42:34] too many thos tho
[12:52:06] (and thanks for picking up the first bit!)
[12:52:20] ack, that was uparrow in the wrong window :)
[12:58:04] (PS1) BBlack: Revert "Move ulsfo public traffic to eqiad temporarily for net maintenance" [operations/dns] - https://gerrit.wikimedia.org/r/153592
[12:58:41] (PS2) BBlack: Revert "Move ulsfo public traffic to eqiad temporarily for net maintenance" [operations/dns] - https://gerrit.wikimedia.org/r/153592
[12:59:46] (CR) BBlack: [C: -1] "On hold till 15:00 UTC for end of window, but real outage is likely already over (logs show downtime from 11:10 -> 11:27)." [operations/dns] - https://gerrit.wikimedia.org/r/153592 (owner: BBlack)
[13:26:47] (PS1) Hashar: multiversion: test we emit Invalid host name [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153593 (https://bugzilla.wikimedia.org/69419)
[13:32:03] (PS2) Giuseppe Lavagetto: apache: add a 'replaces' parameter to apache::conf [operations/puppet] - https://gerrit.wikimedia.org/r/153406
[13:32:48] (CR) Ottomata: [C: 2 V: 2] Re-align block of attributes [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/153387 (owner: QChris)
[13:46:56] (PS7) Giuseppe Lavagetto: Separate HHVM app servers backend. [operations/puppet] - https://gerrit.wikimedia.org/r/152903 (owner: Mark Bergsma)
[13:50:34] (CR) Giuseppe Lavagetto: "cherry-picked on beta." [operations/puppet] - https://gerrit.wikimedia.org/r/152903 (owner: Mark Bergsma)
[13:52:00] (PS1) Springle: MariaDB 10 in the Sanitarium (pre-labsdb) [operations/puppet] - https://gerrit.wikimedia.org/r/153597
[14:00:23] (CR) Springle: [C: 2] MariaDB 10 in the Sanitarium (pre-labsdb) [operations/puppet] - https://gerrit.wikimedia.org/r/153597 (owner: Springle)
[14:00:43] <_joe_> springle: \o/
[14:02:07] :)
[14:06:12] (PS1) Yuvipanda: androidsdk: Make sure that JDK is present [operations/puppet] - https://gerrit.wikimedia.org/r/153600
[14:14:08] RECOVERY - OCG health on ocg1003 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 2554295215B: /srv/deployment/ocg/postmortem 3106965B: ocg_job_status 6685 msg: ocg_render_job_queue 0 msg
[14:14:50] (PS8) Giuseppe Lavagetto: Separate HHVM app servers backend. [operations/puppet] - https://gerrit.wikimedia.org/r/152903 (owner: Mark Bergsma)
[14:15:34] (CR) Filippo Giunchedi: "for reference, the rationale is summaringly here https://etherpad.wikimedia.org/p/mod-conf" [operations/puppet] - https://gerrit.wikimedia.org/r/153406 (owner: Giuseppe Lavagetto)
[14:21:01] (CR) Ottomata: "Hm, why do we set this as a global here, rather than in the role::analytics::hadoop::config class?" [operations/puppet] - https://gerrit.wikimedia.org/r/153528 (owner: Gage)
[14:29:23] (CR) Filippo Giunchedi: [C: 1] salt-minion service should fail to start if master rejects key [operations/puppet] - https://gerrit.wikimedia.org/r/153589 (owner: Ori.livneh)
[14:30:17] (CR) Mark Bergsma: [C: -1] Separate HHVM app servers backend. (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/152903 (owner: Mark Bergsma)
[14:33:09] greg-g: We wanted to test mathoid on beta... I would guess that https://gerrit.wikimedia.org/r/#/c/135522/ is a required before we can go ahead with testing (at least if there is a "new" wiki that has not been updated with the mathoid table from the updater script. Is that right?
[14:36:13] (PS9) Giuseppe Lavagetto: Separate HHVM app servers backend. [operations/puppet] - https://gerrit.wikimedia.org/r/152903 (owner: Mark Bergsma)
[14:37:29] (CR) Giuseppe Lavagetto: Separate HHVM app servers backend. (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/152903 (owner: Mark Bergsma)
[14:37:44] (PS10) Giuseppe Lavagetto: Separate HHVM app servers backend. [operations/puppet] - https://gerrit.wikimedia.org/r/152903 (owner: Mark Bergsma)
[14:38:26] so that does mean that there's no separation of cache objects on the other tiers of course
[14:39:08] <_joe_> mark: yes, right
[14:39:32] <_joe_> mmh, not sure it's a good idea in fact
[14:39:48] physikerwelt: beta runs update.php from time to time
[14:39:52] <_joe_> mark: should we do the same on the other tiers? just hash things differently?
[14:40:07] <_joe_> I think it's better if we do in fact
[14:40:34] then you need another way to find out if it's coming from HHVM
[14:40:40] does it always set a header?
[14:40:49] <_joe_> yes it does
[14:40:55] <_joe_> X-Powered-By
[14:41:33] <_joe_> so we could use that as a condition for the vcl_hash
[14:42:59] hoo: I'm wondering what the resposibility of the addWiki script is?
[14:43:00] <_joe_> instead of the backends condition
[14:44:15] Hoo: at some point I had to fix the script because of a renamed table template https://git.wikimedia.org/commit/mediawiki%2Fextensions%2FWikimediaMaintenance/afb3c59cf56d8b809725ec122843b0f71b378b96
[14:45:02] <_joe_> mark: if (beresp.http.X-Powered-By =~ "^HHVM") looks like a good option?
[14:45:05] physikerwelt: it is being used if you create a *new* wiki DB
[14:45:15] so the change is indeed go
[14:45:17] od
[14:45:18] <_joe_> mark: or directly on the request cookie
[14:45:27] but not needed for beta, probably
[14:45:40] <_joe_> if req.http.Cookie ~ "hhvm=true" like we do when selecting backends
[14:45:41] did Sean yet create the tables in production?
[14:45:49] <_joe_> that's even better, no?
[14:46:02] ok, he did
[14:46:33] _joe_: no, because a request may have come from zend even with that cookie
[14:46:44] e.g. for test wikipedia, the api, or from a restart in an error
[14:46:52] <_joe_> ok
[14:46:55] <_joe_> correct
[14:47:03] <_joe_> so I'll use X-Powered-By
[14:47:30] <_joe_> that is reasonably sure not to be mixed with Zend.
[14:48:18] hoo: ok. I see.
[14:49:03] physikerwelt: approved
[14:49:23] hoo thanks
[14:50:08] <_joe_> I think this additional hash is temporary anyway
[14:50:27] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds
[14:50:28] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds
[14:50:44] physikerwelt: If you now create a new Wiki in beta, it should be fine for usage (from a DB perspective, no idea about the state of the code)
[14:50:45] (PS11) Giuseppe Lavagetto: Separate HHVM app servers backend. [operations/puppet] - https://gerrit.wikimedia.org/r/152903 (owner: Mark Bergsma)
[14:50:54] also existing ones should be fine
[14:52:14] hoo: Thanks. I think I will not create a new wiki... but it's better to have it merged now...before the problem might occurs.
[14:52:27] Sure
[14:56:46] (PS12) Giuseppe Lavagetto: Separate HHVM app servers backend. [operations/puppet] - https://gerrit.wikimedia.org/r/152903 (owner: Mark Bergsma)
[14:57:29] physikerwelt: 10:45 < hoo> did Sean yet create the tables in production?
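[editor's note] The cache-separation exchange between 14:38 and 14:50 turns on one point: a request carrying the hhvm=true cookie can still end up being served by Zend, so the cookie is a less reliable signal than the X-Powered-By response header. The sketch below only illustrates that distinction in Python. It is not VCL and not the change under review in r/152903; the function names and the sample header values are hypothetical, and the one thing taken from the log is the "^HHVM" pattern.

```python
import re

# Pattern quoted in the log: X-Powered-By starting with "HHVM".
HHVM_PATTERN = re.compile(r"^HHVM")


def cookie_requests_hhvm(request_headers):
    """What the client asked for: True if the cookie contains hhvm=true."""
    return "hhvm=true" in request_headers.get("Cookie", "")


def served_by_hhvm(response_headers):
    """What actually answered: True if X-Powered-By starts with HHVM."""
    return bool(HHVM_PATTERN.match(response_headers.get("X-Powered-By", "")))


def cache_variant(response_headers):
    """Hypothetical cache-variant label derived from the response only."""
    return "hhvm" if served_by_hhvm(response_headers) else "zend"


# Illustrative case: the cookie asks for HHVM, but a Zend server replied.
# The header values here are made up for the example.
request = {"Cookie": "session=abc; hhvm=true"}
response = {"X-Powered-By": "PHP/5.3.10"}

print(cookie_requests_hhvm(request))  # cookie says HHVM
print(cache_variant(response))        # response says otherwise
```

Keying on the response header means the label follows whichever engine produced the object, which is the property the channel settled on; where such a check would sit in a real Varnish configuration is not something this log specifies.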
[14:57:40] oh, nvm [14:57:43] 10:46 < hoo> ok, he did [14:57:48] :) [14:58:32] (CR) Nuria: "Sorry but my comment was "missinterpreted". I thought you were asking me "are you still using vanadium.eqiad.wmnet? (right now)" as i used" [operations/puppet] - https://gerrit.wikimedia.org/r/136130 (owner: Rush) [14:58:43] (PS1) Springle: Sanitarium sysv script + basic monitoring. [operations/puppet] - https://gerrit.wikimedia.org/r/153609 [14:59:56] (CR) Springle: [C: 2] Sanitarium sysv script + basic monitoring. [operations/puppet] - https://gerrit.wikimedia.org/r/153609 (owner: Springle) [15:01:15] akosiaris: hey! any update on the postgres stuff? :) [15:03:04] yuvipanda: yup. it is done [15:03:15] which means you are good to go ? [15:03:18] akosiaris: w000t! details / docs? [15:04:01] akosiaris: yeah, but I'm swamped with a couple of other things. will take a look next week (or later this week), once Coren is back. I'd also need access. [15:05:28] (PS1) Physikerwelt: Re-enable all Math modes on beta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153610 (https://bugzilla.wikimedia.org/66587) [15:05:47] yuvipanda: what kind of access ? [15:06:03] a superuser account to create/delete accounts I assume ? [15:06:07] akosiaris: yeah. [15:06:43] some docs are here https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Configuring_PGAdmin_for_OSM_access [15:07:15] I think I should also create a CNAME and update docs with it [15:07:48] (CR) BBlack: [C: 2] Revert "Move ulsfo public traffic to eqiad temporarily for net maintenance" [operations/dns] - https://gerrit.wikimedia.org/r/153592 (owner: BBlack) [15:08:06] !log flipping ulsfo traffic back to ulsfo [15:08:12] Logged the message, Master [15:08:22] akosiaris: \o/ yeah, CNAME would be great. [15:08:33] akosiaris: and superuser account. I'll rescue my puppet patch and amend it for this [15:10:27] (PS13) Giuseppe Lavagetto: Separate HHVM app servers backend.
[operations/puppet] - https://gerrit.wikimedia.org/r/152903 (owner: Mark Bergsma) [15:12:17] (PS2) Physikerwelt: Re-enable all Math modes on beta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153610 (https://bugzilla.wikimedia.org/66587) [15:13:30] (PS1) Alexandros Kosiaris: Create a CNAME for labs postgresql DBs [operations/dns] - https://gerrit.wikimedia.org/r/153614 [15:15:03] Reedy: I would like to test the additional math rendering modes on betalabs before enabling them in production https://gerrit.wikimedia.org/r/#/c/153610/ [15:16:56] (CR) Milimetric: Reschedule backups to not interfer with queue runs so easily (1 comment) [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/153388 (https://bugzilla.wikimedia.org/68731) (owner: QChris) [15:17:10] legoktm: I finally made the deploy calendar for this week, left the backlog of swats un-assigned, let me know when you want to do yours https://wikitech.wikimedia.org/wiki/Deployments [15:28:04] (CR) Milimetric: [C: 1] "looks good to me, but I did not test" (1 comment) [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/153568 (https://bugzilla.wikimedia.org/68731) (owner: QChris) [15:28:27] (CR) Milimetric: [C: 1] "looks good to me, but I did not test" [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/153395 (https://bugzilla.wikimedia.org/68731) (owner: QChris) [15:33:06] (PS3) QChris: Make hourly backup keep around known-good full backups in case of issues [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/153568 (https://bugzilla.wikimedia.org/68731) [15:33:54] (CR) QChris: Make hourly backup keep around known-good full backups in case of issues (1 comment) [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/153568 (https://bugzilla.wikimedia.org/68731) (owner: QChris) [15:36:54] bblack: how do I reset RT password?
[15:37:10] if I enter my mail at https://rt.wikimedia.org/ I get mail with this text [15:37:28] "Your new password is:" [15:37:33] and there is no password :( [15:40:00] have you tried logging in with no password? :) [15:40:09] I'm not an RT expert really, I'll have to look around a bit [15:40:14] zeljkof: ^ [15:40:25] bblack: let me try that :) [15:41:11] bblack: no, empty password does not work :( [15:41:13] <_joe_> bblack: I was about to make that joke [15:48:06] zeljkof: if it's and html email, maybe check the plain-text version? [15:48:18] valhallasw`cloud: thanks, tried all versions [15:48:25] no password anywhere [15:48:27] zeljkof: when I try the email password reset on my own account, I get: "Only external users can reset their passwords this way." [15:48:47] bblack: what does that mean? :) [15:48:51] am I external user? [15:49:00] I don't suspect that you should be [15:49:20] bblack: can you set any password for me? [15:49:33] yes, eventually when I figure out what I'm doing :) [15:54:25] zeljkof: sent you an email [15:56:27] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: Epic puppet fail [15:56:41] (PS2) Ori.livneh: salt-minion service should fail to start if master rejects key [operations/puppet] - https://gerrit.wikimedia.org/r/153589 [15:56:48] (CR) Ori.livneh: [C: 2 V: 2] salt-minion service should fail to start if master rejects key [operations/puppet] - https://gerrit.wikimedia.org/r/153589 (owner: Ori.livneh) [15:56:59] bblack: thanks :) [15:59:47] bblack: thanks, i was able to log in [15:59:57] but now, I do not see how to change the password :( [16:00:42] the only option I have is logout [16:00:55] zeljkof: from the top menu bar: "Logged in as foo" -> "Settings" -> "About me"? [16:01:54] bblack: no :( there is only logout link there [16:03:11] could someone have a look at labsdb1002 [16:03:20] looks OOM killed [16:03:22] -.- [16:03:24] zeljkof: is this a brand-new RT account?
[16:03:28] (from ganglia) [16:03:30] bblack: yes [16:03:38] memory peaked, then went down... [16:03:39] oh, everything makes more sense now :) [16:03:46] I have sent you my screenshot via mail [16:03:55] bblack: ^ [16:04:52] hoo: are you having current isses with it? [16:05:15] mysqld_safe restarted it on oomkill [16:05:22] bblack: to my own surprise not [16:05:25] oh, it restarted [16:05:27] that's fine then [16:05:43] ah I see, memory usage is growin again [16:05:45] * hoo hides [16:05:56] I love diagnosing stuff via ganglia grapsh [16:06:18] actually there was no oomkill from the kernel pov [16:06:26] but the mysql process was restarted recently [16:06:37] by whom/ what? [16:06:50] springle: ping ^ [16:07:07] by sean by the looks of the machine logins :) [16:08:13] it seems to struggle with the non-prewarmed restart [16:09:27] PROBLEM - puppet last run on analytics1010 is CRITICAL: CRITICAL: Puppet has 1 failures [16:10:48] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures [16:16:27] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:20:07] PROBLEM - puppet last run on analytics1012 is CRITICAL: CRITICAL: Puppet has 1 failures [16:25:32] hoo: bblack: was a sig 6 assertion failure. investigating [16:27:23] uh [16:39:46] (PS1) Jgreen: remove aluminium from sites.pp and dhcp config [operations/puppet] - https://gerrit.wikimedia.org/r/153622 [16:41:05] (CR) Jgreen: [C: 2 V: 1] remove aluminium from sites.pp and dhcp config [operations/puppet] - https://gerrit.wikimedia.org/r/153622 (owner: Jgreen) [16:47:44] (CR) Ori.livneh: "> I could find only two examples" [operations/puppet] - https://gerrit.wikimedia.org/r/153586 (owner: Ori.livneh) [16:49:52] (PS2) Rush: New admin group for eventlogging troubleshooting.
[operations/puppet] - https://gerrit.wikimedia.org/r/150850 (owner: Andrew Bogott) [16:50:24] !log restart mysqld on labsdb1001, upgrade to mariadb 10.0.13 for bugfix [16:50:30] (CR) jenkins-bot: [V: -1] New admin group for eventlogging troubleshooting. [operations/puppet] - https://gerrit.wikimedia.org/r/150850 (owner: Andrew Bogott) [16:50:30] Logged the message, Master [16:50:45] (CR) Rush: "I changed the 'eventlogging' group in data.yaml to eventlogging-admins. There IS an eventlogging group on hafnium already I don't want to" [operations/puppet] - https://gerrit.wikimedia.org/r/150850 (owner: Andrew Bogott) [16:57:48] !log removed aluminium.wikimedia.org from production [16:57:54] Logged the message, Master [16:57:55] (PS3) Rush: New admin group for eventlogging troubleshooting. [operations/puppet] - https://gerrit.wikimedia.org/r/150850 (owner: Andrew Bogott) [17:01:33] (CR) Ottomata: [C: 2] New admin group for eventlogging troubleshooting. [operations/puppet] - https://gerrit.wikimedia.org/r/150850 (owner: Andrew Bogott) [17:02:17] (CR) Andrew Bogott: [C: 1] New admin group for eventlogging troubleshooting. [operations/puppet] - https://gerrit.wikimedia.org/r/150850 (owner: Andrew Bogott) [17:02:40] (PS4) Andrew Bogott: New admin group for eventlogging troubleshooting. [operations/puppet] - https://gerrit.wikimedia.org/r/150850 [17:03:34] (CR) Rush: [C: 2] New admin group for eventlogging troubleshooting.
[operations/puppet] - https://gerrit.wikimedia.org/r/150850 (owner: Andrew Bogott) [17:05:34] apergos: Can you give any updates on https://gerrit.wikimedia.org/r/152724 [17:13:20] !log restart mysqld on labsdb1002, upgrade to mariadb 10.0.13 for bugfix [17:13:25] Logged the message, Master [17:20:43] (PS3) Ottomata: Set up passive icinga for webrequest data imports in HDFS and Hive [operations/puppet] - https://gerrit.wikimedia.org/r/151963 [17:23:19] (CR) Ottomata: Set up passive icinga for webrequest data imports in HDFS and Hive (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/151963 (owner: Ottomata) [17:27:06] godog: I keep thinking this but not asking it… what are the risks of just storing the ring file in the ops repository? [17:27:29] Are there nasty races that crop up as the rings are gradually updated across different hosts? [17:35:52] (CR) Andrew Bogott: "I mostly like this. I have a couple of naive questions though:" [operations/software/swift-ring] - https://gerrit.wikimedia.org/r/153584 (owner: Filippo Giunchedi) [17:55:38] !log populateBacklinkNamespace.php finished on all wikis [17:55:43] Logged the message, Master [18:00:04] Reedy, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140812T1800). [18:15:09] andrewbogott: were you planning on doing work on the salt puppetization? [18:15:51] ori: I don't know what that is, so probably not :) [18:16:10] There's a minor bug in salt that causes occasional puppet failures, was hoping to upgrade to work around that.
[18:16:12] andrewbogott: I simply mean the salt puppet module and related role manifests [18:16:13] ah [18:22:00] Reedy: http://test.wikipedia.org/wiki/Special:RecentChangesLinked/Main_page wah wah [18:24:37] (PS1) Aaron Schulz: Increased the number of parsoid job runners to lower queue size [operations/puppet] - https://gerrit.wikimedia.org/r/153639 [18:25:44] gah 08fee4c [18:31:05] I HATE YOU IRC [18:32:52] (PS1) Reedy: Non wikipedias to 1.24wmf16 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153641 [18:33:03] I guess we better fix FR/08fee4c first though [18:34:13] AaronSchulz: I backported https://gerrit.wikimedia.org/r/#/c/151096/ to wmf16 [19:04:26] (PS1) Ottomata: Use 'hdfs dfs' instead of 'hadoop fs' for cdh::hadoop::directory exec [operations/puppet/cdh] - https://gerrit.wikimedia.org/r/153648 [19:06:39] !log reedy Synchronized php-1.24wmf16/includes/specials/SpecialRecentchangeslinked.php: Fix FR bug (duration: 00m 14s) [19:06:45] Logged the message, Master [19:09:14] (CR) Reedy: [C: 2] Non wikipedias to 1.24wmf16 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153641 (owner: Reedy) [19:09:18] (Merged) jenkins-bot: Non wikipedias to 1.24wmf16 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153641 (owner: Reedy) [19:12:30] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf16 [19:12:36] Logged the message, Master [19:13:31] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Tue 12 Aug 2014 17:12:58 UTC [19:19:57] (CR) Ottomata: [C: 2 V: 2] Use 'hdfs dfs' instead of 'hadoop fs' for cdh::hadoop::directory exec [operations/puppet/cdh] - https://gerrit.wikimedia.org/r/153648 (owner: Ottomata) [19:51:04] (PS2) BBlack: varnish/zero.inc.vcl.erb - retab [operations/puppet] - https://gerrit.wikimedia.org/r/153219 (owner: Dzahn) [19:52:09] (CR) BBlack: [C: 2 V: 2]
varnish/zero.inc.vcl.erb - retab [operations/puppet] - https://gerrit.wikimedia.org/r/153219 (owner: Dzahn) [20:01:21] (PS3) Hashar: Re-enable all Math modes on beta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153610 (https://bugzilla.wikimedia.org/66587) (owner: Physikerwelt) [20:01:28] (PS4) Hashar: Re-enable all Math modes on beta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153610 (https://bugzilla.wikimedia.org/66587) (owner: Physikerwelt) [20:02:05] (CR) Hashar: [C: 2] "Lets do Math!" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153610 (https://bugzilla.wikimedia.org/66587) (owner: Physikerwelt) [20:02:09] (Merged) jenkins-bot: Re-enable all Math modes on beta [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153610 (https://bugzilla.wikimedia.org/66587) (owner: Physikerwelt) [20:02:25] (CR) Hashar: "Note, I have just clarified the commit message." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153610 (https://bugzilla.wikimedia.org/66587) (owner: Physikerwelt) [20:02:51] (PS2) Hashar: Re-enable all Math modes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139421 (https://bugzilla.wikimedia.org/66587) (owner: Reedy) [20:02:54] (CR) jenkins-bot: [V: -1] Re-enable all Math modes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139421 (https://bugzilla.wikimedia.org/66587) (owner: Reedy) [20:08:14] (CR) BBlack: [C: -2] "208.80.154.0/23 is already included in 208.80.152.0/22 above it (and is in both lists)."
[operations/puppet] - https://gerrit.wikimedia.org/r/153062 (owner: Dzahn) [20:09:11] (PS2) BBlack: Fix kafka & udp2log filtering of ZERO [operations/puppet] - https://gerrit.wikimedia.org/r/152836 (owner: Yurik) [20:11:42] (PS3) BBlack: Fix kafka & udp2log filtering of ZERO [operations/puppet] - https://gerrit.wikimedia.org/r/152836 (owner: Yurik) [20:12:31] (CR) BBlack: [C: 2] Fix kafka & udp2log filtering of ZERO [operations/puppet] - https://gerrit.wikimedia.org/r/152836 (owner: Yurik) [20:43:47] (PS1) Ottomata: Add script and class to manage HDFS user directories [operations/puppet/cdh] - https://gerrit.wikimedia.org/r/153706 [20:43:50] (CR) jenkins-bot: [V: -1] Add script and class to manage HDFS user directories [operations/puppet/cdh] - https://gerrit.wikimedia.org/r/153706 (owner: Ottomata) [20:44:20] (PS2) Ottomata: Add script and class to manage HDFS user directories [operations/puppet/cdh] - https://gerrit.wikimedia.org/r/153706 [20:46:48] (PS3) Ottomata: Add script and class to manage HDFS user directories [operations/puppet/cdh] - https://gerrit.wikimedia.org/r/153706 [20:48:24] (PS4) Ottomata: Add script and class to manage HDFS user directories [operations/puppet/cdh] - https://gerrit.wikimedia.org/r/153706 [20:48:56] (CR) Ottomata: "Chase, gimme all your comments. Naming suggestions welcome." [operations/puppet/cdh] - https://gerrit.wikimedia.org/r/153706 (owner: Ottomata) [21:00:05] spagewmf: Dear anthropoid, the time has come. Please deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140812T2100).
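[editor's note] BBlack's 20:08:14 review comment above, that 208.80.154.0/23 is already covered by 208.80.152.0/22, can be checked with Python's standard `ipaddress` module. A small sketch follows; the helper name `is_redundant` is hypothetical and is not part of the patch under review.

```python
import ipaddress

# The two networks named in the review comment.
outer = ipaddress.ip_network("208.80.152.0/22")
inner = ipaddress.ip_network("208.80.154.0/23")


def is_redundant(candidate, existing):
    """Return True if `candidate` is already covered by a network in `existing`."""
    return any(candidate.subnet_of(net) for net in existing)


redundant = is_redundant(inner, [outer])
print(redundant)
```

The /22 spans 208.80.152.0 through 208.80.155.255, so the /23 starting at 208.80.154.0 falls entirely inside it.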
[21:11:32] PROBLEM - puppet last run on db1004 is CRITICAL: CRITICAL: Puppet has 1 failures [21:14:31] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Tue 12 Aug 2014 17:12:58 UTC [21:30:32] RECOVERY - puppet last run on db1004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [21:40:31] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Tue 12 Aug 2014 19:39:59 UTC [21:40:41] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Tue Aug 12 21:40:32 UTC 2014 [22:12:51] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Tue Aug 12 22:12:43 UTC 2014 [22:13:00] Reedy: Seems you forgot to submodule update WikimediaMessages [22:13:07] https://gerrit.wikimedia.org/r/153329 [22:13:22] thanks hoo [22:13:26] btw :) [22:13:44] greg-g: What about WikimediaMessages? [22:17:31] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Tue 12 Aug 2014 20:16:46 UTC [22:17:36] (that will need to be scaped) [22:20:00] * greg-g looks [22:20:09] sorry, I have been mostly ignoring things today [22:20:40] hoo: do you feel comfortable doing it all? 
[22:21:35] greg-g: I encourage hoo to do it if he's comfortable with it, but if not then I can do it [22:21:43] * greg-g nods [22:21:54] (I seem to be the only person returning from Wikimania who is in the office today) [22:21:55] !log hoo Synchronized php-1.24wmf16/extensions/ProofreadPage/: Fix JS error while editing (duration: 00m 10s) [22:22:01] Logged the message, Master [22:22:05] need to look a little more at the circumstance, but I guess I can do it [22:22:20] Speaking of, this place is deserted, there were like 10 people on the entire 3rd floor and not even all at the same time [22:22:46] * AaronSchulz was at Wikimania [22:23:21] [citation needed] [22:26:09] RoanKattouw: That is fishy [22:26:15] he updated to the wmf15 HEAD [22:26:19] not the 16 one [22:26:23] that's why the diff is so huge [22:27:28] uh [22:27:37] in the version currently deploy superprotect doesn't even exist [22:27:44] (the message) [22:28:01] hoo: Where is this happening? [22:28:15] RoanKattouw: Look at the state on tin [22:28:37] I guess updating it to the real wmf16 HEAD and the scapping would be the sanest way to go here [22:28:47] everything else will turn intro trouble [22:29:00] State of what? MW core? [22:29:11] RoanKattouw: php-1.24wmf16 on tin [22:29:17] go there, git diff [22:29:29] Oh, WMMsgs [22:29:33] then you see that it's not in line with what is in gerrit [22:29:43] and what is in gerrit also is wrong (it's the wmf 15 head) [22:30:00] Oh dear I see what you mean [22:30:19] so... bump to wmf16 head and scap? [22:30:23] Diff to that is minimal [22:30:40] hoo@tin:/a/common/php-1.24wmf16/extensions/WikimediaMessages$ git diff HEAD..origin/wmf/1.24wmf16 [22:30:45] (only superprotect changes) [22:30:48] The cherry-picked commits (the ones above "Creating new wmf/1.24wmf15 branch" in the log, are those all in wmf16? [22:31:03] yep, both 15 and 16 [22:31:20] I guess the superprotect stuff isn't in 16 [22:31:25] it is [22:31:25] So maybe that should be cherry-picked to 16? 
[22:31:29] it is [22:31:36] Oh, hold on [22:31:37] I see [22:31:41] The diff is *adding* messages [22:31:47] just core has the wrong submodule version [22:31:53] I thought we would be losing stuff, but no, we'd be gaining stuff [22:31:58] (both what is checked out on tin and what is in gerrit is awry) [22:32:02] yep [22:32:16] shall I update it to HEAD of wmf16 and then scap? [22:32:38] Yeah please [22:32:41] * greg-g nods [22:32:43] thanks hoo [22:32:48] Just make sure that the submodule pointer in MW core's wmf16 branch is also corrected [22:33:13] RoanKattouw: Of course ;) [22:33:15] hoo: I think you finished working on ProofreadPages (it works for now)? [22:33:20] FlorianSW: Yep [22:33:25] did you verify it? [22:33:31] hoo: thanks :) Yes, i have :) [22:33:34] I totally forgot that over the messages troubles [22:33:40] Ah, great :) [22:33:42] np :) [22:35:39] RoanKattouw: that was a fast +2 :) [22:37:21] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Aug 12 22:37:19 UTC 2014 [22:39:41] PROBLEM - Disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 18 MB (3% inode=99%): [22:40:06] !log hoo Started scap: Update WikimediaMessages (superprotect messages for wmf16) [22:40:11] Logged the message, Master [22:42:13] RoanKattouw: Hm.. do you know of any outage or intended bypassing of ulsfo during Robin Williams death? [22:42:26] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Bits+caches+ulsfo&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [22:42:26] the other bits caches show a spike (naturally) [22:42:41] (eqiad and esams) [22:42:41] Krinkle: Maint. window just at that time [22:42:49] traffic went to eqiad [22:42:52] for 5 hours? [22:43:01] until 21 UTC or so [22:43:04] not sure [22:43:16] k [22:43:35] Krinkle: That does not overlap with RW's death, it was much later [22:43:51] Ah, right, the time stamps are what.. UTC? 
[22:43:58] Not BST or GMT+1 [22:44:05] The RW death spike was between 23:00 and 00:00 UTC on Monday [22:44:24] The ulsfo dip was 11:00-15:00 UTC on Tuesday judging from that graph [22:44:39] It says "Tue" in front of it though [22:44:52] and given that it's 23:44 where I am, and it says Tue 12:00 on that graph.. [22:45:00] Yes [22:45:07] That's when the ulsfo dip was [22:45:13] Right, makes sense now [22:45:16] Tue 12:00 UTC = Tue 13:00 BST [22:45:37] Yeah.. [22:45:40] = 05:00 PDT which is not a bad time for a maintenance window here [22:45:55] I got that url yesterday and hadn't thought it would include more recent hours now that I refreshed it [22:46:40] Right [22:47:17] (Although ulsfo traffic doesn't follow the circadian cycle I was expecting, it seems to be more Asia-driven) [22:51:11] (CR) Aaron Schulz: [C: 2] multiversion: test we emit Invalid host name [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153593 (https://bugzilla.wikimedia.org/69419) (owner: Hashar) [22:51:17] (Merged) jenkins-bot: multiversion: test we emit Invalid host name [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153593 (https://bugzilla.wikimedia.org/69419) (owner: Hashar) [22:51:39] AaronSchulz: You realize I'm scaping atm? [22:52:38] it's a change to a test file [22:52:57] just to make sure... I have no idea what would happen if you started a normal sync now [22:53:07] * would start [22:53:56] hoo: I don't plan on syncing that test file anytime soon [22:54:18] though even if I did try it would just whine about /var/run/scap.lock being locked [22:54:18] I'll just sync it out after my scap [22:54:25] oh, nice [22:54:41] they all check the same lock (ever since the python rewrite) [22:55:11] PROBLEM - puppet last run on mw1053 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
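[editor's note] The 22:54 exchange above describes every deploy command checking one shared lock file, so a second sync started during a scap is refused rather than run concurrently. How scap actually takes /var/run/scap.lock is not shown in this log. The sketch below assumes an flock-style advisory lock on a Unix host, and `try_lock` and the temporary path are hypothetical stand-ins.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking lock on `path`.

    Returns the open file descriptor on success, or None if the lock is held.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock: give up instead of waiting.
        os.close(fd)
        return None
    return fd


# Stand-in for the shared lock file; a temp dir keeps the demo unprivileged.
lock_path = os.path.join(tempfile.mkdtemp(), "demo.lock")

first = try_lock(lock_path)   # plays the role of the running scap
second = try_lock(lock_path)  # plays the role of a sync started meanwhile
```

Closing the first descriptor releases the lock, after which a new attempt succeeds.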
[22:59:41] PROBLEM - Disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 20 MB (3% inode=99%): [23:00:04] RoanKattouw, mwalker, ori, MaxSem: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140812T2300). [23:00:23] I'll take it [23:00:31] there's nothing to deploy [23:00:35] Oh haha [23:00:40] Well that was easy then [23:00:42] free karma :) [23:00:48] :D [23:01:21] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /a/common/). [23:02:06] oh [23:02:07] hey [23:02:11] i have something to deploy! [23:02:15] RoanKattouw [23:02:16] :D [23:02:20] * hoo hides [23:02:48] https://gerrit.wikimedia.org/r/#/c/151425/ [23:04:15] MatmaRex_mobile: OK will deploy [23:04:25] thanks [23:05:26] Also going to deploy https://gerrit.wikimedia.org/r/#/c/153593/ which was just merged by AaronSchulz and caused icinga-wm's complaint above [23:06:29] 228 servers to deploy to, 228 servers... wait a bit... 227 server to deploy to, 227 servers ... [23:06:31] *sings* [23:06:53] (I actually only wait for one more atm) [23:08:22] hoo: You mean you're deploying right now? [23:08:36] RoanKattouw: Waiting for scap [23:08:40] OK [23:08:45] What are you scapping? [23:08:47] can push the other stuff after [23:08:51] WMMessages still? 
[23:08:51] the WikimediaMessages stuff [23:08:55] yep [23:08:55] OK yeah that would be cool, please do [23:09:58] ;) [23:16:53] still waiting for one host [23:16:56] and it's not fenari [23:22:06] gnah it's mw1053 [23:22:15] the one with the disk failure AFAIR [23:22:41] RECOVERY - Disk space on lanthanum is OK: DISK OK [23:22:58] (PS3) Ori.livneh: wmflib: add ordered_yaml() [operations/puppet] - https://gerrit.wikimedia.org/r/149775 [23:23:00] (PS1) Ori.livneh: Clean up salt::minion [operations/puppet] - https://gerrit.wikimedia.org/r/153727 [23:24:08] (CR) Ori.livneh: [C: 1] "Same questions as Andrew. This looks nice, though." [operations/software/swift-ring] - https://gerrit.wikimedia.org/r/153584 (owner: Filippo Giunchedi) [23:26:19] (CR) Ori.livneh: [C: 1] puppetmaster: make reimaging servers easier. [operations/puppet] - https://gerrit.wikimedia.org/r/153397 (owner: Giuseppe Lavagetto) [23:26:22] !log hoo Finished scap: Update WikimediaMessages (superprotect messages for wmf16) (duration: 46m 16s) [23:26:28] Logged the message, Master [23:26:53] !log Had to abort scap on mw1053 (which is depooled) manually [23:26:59] Logged the message, Master [23:27:10] grrr [23:27:18] greg-g: It's depooled [23:27:24] but still in dsh? [23:27:28] yeah [23:27:30] seems so [23:27:55] bblack: can you, oh RT duty person, please remove mw1053 from dsh [23:28:26] bblack: not complaining to you, but the ether: this happens a lot: a box is depooled but the person depooling doesn't remove it from dsh, causing all kinds of pain for deployers [23:28:36] yeah [23:28:44] well, I am complaining "to" you, but not "about" you ;) [23:28:45] wait [23:29:15] hoo: ?
[23:29:31] change on the way [23:29:41] (PS1) Hoo man: Remove mw1053 from mediawiki-installation dsh [operations/puppet] - https://gerrit.wikimedia.org/r/153728 [23:29:48] bblack: ^ [23:29:54] my passphrase just is to long :P [23:29:54] btw I'm not sure mw1053 has a disk failure, I think it's out for other reasons? not sure though. [23:30:04] * greg-g shrugs [23:30:05] bblack: Disk looks okish [23:30:15] not sure what's wrong there (didn't look close, just killed stuff there) [23:30:22] sudo -u mwdeploy killall -u mwdeploy python [23:30:26] (CR) BBlack: [C: 2] Remove mw1053 from mediawiki-installation dsh [operations/puppet] - https://gerrit.wikimedia.org/r/153728 (owner: Hoo man) [23:30:30] thanks both [23:31:09] my flaky memory says maybe mw1053 is out because it was being used as an hhvm test host [23:32:23] bblack: Oh, possible [23:32:24] !log hoo Synchronized php-1.24wmf16/skins/Vector/skinStyles/mediawiki.special.preferences.less: Fix missing tab images on Special:Preferences (duration: 00m 10s) [23:32:29] Logged the message, Master [23:32:34] it was a job runner I think [23:32:35] :P [23:32:47] MatmaRex_mobile: Please verify [23:33:50] hoo: mw1017 is testwiki [23:34:00] yes? [23:34:02] aka, the hhvm-enabled one [23:34:15] greg-g: Memory tells me it was a hhvm job runner [23:34:18] so, not 1023 [23:34:23] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [23:34:24] oh... [23:34:24] not sure why I recalled that exactly during the scap [23:34:46] !log hoo Synchronized tests/multiversion/MWMultiVersionTest.php: (no message) (duration: 00m 11s) [23:34:49] it is an hhvm test box [23:34:51] Logged the message, Master [23:35:03] ok, all synced / deployed ;) [23:35:07] it was running jobs though the hhvm-server was turned off before wikimania [23:35:31] AaronSchulz: any reason scap would fail on it?
[23:35:45] the pythons were just stuck on it [23:35:51] the runner was left on though so it kept spawning curl requests that could work...I think there were lots of zombie procs due to them getting killed [23:36:00] no high CPU load (they didn't have any cpu load) and no io load [23:36:18] the may have caused problems after enough time [23:36:29] I could have looked further, but I doubt there's much value in that [23:36:55] AaronSchulz: It might have hit some time out after some time [23:37:09] but I saw no point waiting further for a server that serves nothin [23:37:17] hoo: to confirm, 1023 *isn't* serving traffic, right? (just out of dsh doesn't mean that) [23:37:25] heh, k [23:37:27] bad timing ;) [23:37:52] :) [23:43:16] MatmaRex_mobile: Could you please verify your deploy? [23:43:30] I can tell that nothing (seems) broken, but not that whatever was fixed is fixed [23:43:41] yes [23:44:36] fixed indeed [23:44:45] thanks [23:44:54] Nice :) You're welcome [23:59:32] (PS1) Spage: Grant 'block' to qa_automation group on test2wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/153732 (https://bugzilla.wikimedia.org/61799)