[00:03:26] (CR) Ori.livneh: apache: clean up init.pp & params.pp (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/138668 (owner: Ori.livneh)
[00:03:54] (PS4) Ori.livneh: ::apache: delete everything we're not already using [operations/puppet] - https://gerrit.wikimedia.org/r/137884
[00:04:03] (CR) Ori.livneh: [C: 2 V: 2] ::apache: delete everything we're not already using [operations/puppet] - https://gerrit.wikimedia.org/r/137884 (owner: Ori.livneh)
[00:07:34] andrewbogott: it applied correctly (i.e., no-op)
[00:07:41] win!
[00:09:39] (PS2) Ori.livneh: apache::vhost: removed unused 'configure_firewall' option [operations/puppet] - https://gerrit.wikimedia.org/r/138669
[00:10:02] (CR) Ori.livneh: [C: 2 V: 2] apache::vhost: removed unused 'configure_firewall' option [operations/puppet] - https://gerrit.wikimedia.org/r/138669 (owner: Ori.livneh)
[00:12:07] so did that :)
[00:19:21] andrewbogott: thanks very much, once again
[00:24:26] (CR) Andrew Bogott: [C: 1] apache: clean up init.pp & params.pp (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/138668 (owner: Ori.livneh)
[00:33:12] (PS2) Ori.livneh: apache: clean up init.pp & params.pp [operations/puppet] - https://gerrit.wikimedia.org/r/138668
[00:33:20] (CR) Ori.livneh: [C: 2 V: 2] apache: clean up init.pp & params.pp [operations/puppet] - https://gerrit.wikimedia.org/r/138668 (owner: Ori.livneh)
[00:35:47] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are unmerged changes in puppet (dir /var/lib/git/operations/puppet) waiting since 1424 seconds
[00:36:07] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are unmerged changes in puppet (dir /var/lib/git/operations/puppet) waiting since 1424 seconds
[00:36:44] that was hardly 1424 seconds
[00:36:47] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[00:37:07] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[00:37:26] that was less than three minutes
[00:37:31] that alert is broken
[00:52:50] is there a way to see all changes "merged by" a user?
[00:52:54] in gerrit
[00:53:10] Not really
[00:53:21] You can do reviewer:username but that overselects
[00:53:31] Also note that according to Gerrit, the merges were all performed by jenkins-bot
[00:53:37] So you'd want "+2ed by this user"
[00:54:06] yes, true, that's what i want
[00:54:14] <^d> RoanKattouw: Not necessarily, depends on submit policy and current history.
[00:54:39] but aren't you even a reviewer just for being _added_ to a change
[00:54:44] even if you did not actually review
[00:54:48] Yes
[00:54:53] grr.. hrmm
[00:54:56] Which is why it's so overinclusive
[00:55:17] <^d> label:Code-Review=2,user=jsmith
[00:55:22] <^d> From the search docs.
[00:55:28] <^d> Matches changes with a +2 code review where the reviewer is jsmith.
[00:55:32] What ^d said is also true, but for most repos (including almost all deployed ones) and most recent-ish history, you want +2ers
[00:55:35] ^d: Ooooh that's cool
[00:55:36] well, ok, reviewer: gets me closer at least, thanks
[00:55:38] Is that new?
[00:55:44] ^d: ah, thanks!
[00:55:45] * ^d shrugs
[00:55:52] <^d> Might be, no clue.
[00:56:02] <^d> https://gerrit.wikimedia.org/r/Documentation/user-search.html#labels
[00:57:49] works:)
[00:58:31] now make ohloh give score based on that :)
[00:58:32] cya
[01:06:06] (CR) Chad: "Is this obsolete?" [operations/software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/130969 (owner: Manybubbles)
[01:48:31] (PS1) Ori.livneh: apache: replace apache::mod::* hierarchy with simpler equivalent from MWV [operations/puppet] - https://gerrit.wikimedia.org/r/138769
[01:51:57] (PS2) Ori.livneh: apache: replace apache::mod::* hierarchy with simpler equivalent from MWV [operations/puppet] - https://gerrit.wikimedia.org/r/138769
[02:12:37] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[02:15:46] !log LocalisationUpdate completed (1.24wmf7) at 2014-06-11 02:14:43+00:00
[02:15:55] Logged the message, Master
[02:29:21] !log LocalisationUpdate completed (1.24wmf8) at 2014-06-11 02:28:18+00:00
[02:29:26] Logged the message, Master
[02:42:49] (CR) Faidon Liambotis: [C: -2] "When I read the backlog, I was about to suggest... exactly the solution that AndrewB committed :)" [operations/puppet] - https://gerrit.wikimedia.org/r/138623 (owner: Filippo Giunchedi)
[02:43:01] (Abandoned) Faidon Liambotis: Revert "Replace exim::simple-mail-sender with a role class" [operations/puppet] - https://gerrit.wikimedia.org/r/138623 (owner: Filippo Giunchedi)
[02:45:05] (Abandoned) Faidon Liambotis: Move mail manifests to a module called 'exim' [operations/puppet] - https://gerrit.wikimedia.org/r/68584 (owner: Andrew Bogott)
[02:48:10] (CR) Faidon Liambotis: "I see in the version that we're running that "log_statsd_default_sample_rate" does exist (and defaults to 1). That bug that you mentioned " [operations/puppet] - https://gerrit.wikimedia.org/r/138574 (owner: Filippo Giunchedi)
[02:50:07] (CR) Faidon Liambotis: [C: 2] "+2 for the intention and whole patch series, although I haven't reviewed all the individual patches." [operations/puppet] - https://gerrit.wikimedia.org/r/138011 (owner: Rush)
[02:51:10] (CR) Faidon Liambotis: [C: -1] "I haven't reviewed the code -- and I'm not sure I should, I'm still divided on whether Ruby is a good idea for this :)" [operations/puppet] - https://gerrit.wikimedia.org/r/138292 (owner: Ori.livneh)
[03:11:41] (PS3) Tim Starling: Puppetize /home/tstarling/.bashrc [operations/puppet] - https://gerrit.wikimedia.org/r/76678
[03:14:42] (CR) Tim Starling: [C: 2] "Updated for wikiversions.json and fixed whitespace" [operations/puppet] - https://gerrit.wikimedia.org/r/76678 (owner: Tim Starling)
[03:15:17] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 53 data above and 9 below the confidence bounds
[03:18:47] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds
[03:29:20] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jun 11 03:28:14 UTC 2014 (duration 28m 13s)
[03:29:26] Logged the message, Master
[03:33:07] (PS3) Ori.livneh: apache: replace apache::mod::* hierarchy with simpler equivalent from MWV [operations/puppet] - https://gerrit.wikimedia.org/r/138769
[03:33:39] paravoid: what do you think about that ^ ?
[03:34:34] I don't like it much
[03:34:39] why not having it as a define?
[03:34:56] that's the puppet pattern for multiple includes with arbitrary parameters
[03:35:12] I think encoding every single apache module out there as a separate puppet class is kinda ugly
[03:35:46] that's the pattern for reusing a block of code to configure structurally similar components, but it doesn't dodge the duplicate definition issue
[03:35:57] ?
[03:36:02] apache::mod { 'rewrite': } x2 == dupe
[03:36:48] why would you enable it twice?
[03:37:11] take cases like zirconium, where you have several vhosts that share the same infrastructure but are conceptually unrelated roles
[03:37:47] you can either overreach in the role and have the role "know" that zirconium already has the relevant modules enabled (or that they already are enabled by some other vhost role on that machine)
[03:37:58] which is what people often do
[03:38:20] hrm, yeah I guess
[03:38:22] or you can declare everything your vhost requires in your vhost role
[03:38:29] but then everyone else is screwed
[03:38:43] or they use if (!defined()) which a problem disguised as a solution
[03:39:00] or virtual resources maybe
[03:39:17] we tried that, it doesn't really resolve the duplicate definition issue
[03:39:53] i balked at the stub class approach for a while because it really rubbed me the wrong way, but it solved the problem really well for mwv, and it has stayed solved
[03:47:01] with virtual resources you'd presumably have each vhost realize the relevant modules
[03:47:08] but it means you need to define them centrally somewhere
[03:47:51] so you end up with a file that contains defines for all the modules
[03:48:08] in other words you end up with the same pattern but funkier syntax (@ and realize())
[03:49:49] (CR) Tim Starling: "Ori, I think it's great that you're looking at this -- yes I have thought about this before, but mostly many years ago and the numbers are" [operations/puppet] - https://gerrit.wikimedia.org/r/137947 (owner: Ori.livneh)
[03:52:10] TimStarling: salt --out=txt 'mw*' cmd.run 'test -r /var/run/apache2.pid && ps h -o rss --ppid $(cat /var/run/apache2.pid) 2>/dev/null | grep -v " 0"' | awk '{sum+=$2} END { print "Average = ",sum/NR}'
[03:52:26] (CR) Tim Starling: "Ideally I would like to have protection against swapping that is smarter than just setting MaxClients -- for example, actually measuring s" [operations/puppet] - https://gerrit.wikimedia.org/r/137947 (owner: Ori.livneh)
[03:52:27] granted that's at a particular time of day / day of week
[03:52:56] TimStarling: appservers have no swap nowadays
[03:53:24] that is good
[03:53:40] that makes it less of a disaster if memory is exhausted, and so lets you set a higher MaxClients
[03:53:48] oom killer is better than swapdeath, yes :)
[03:53:55] but oom-killer is not the smartest thing imaginable
[03:54:28] it's not, but heavy swapping traditionally killed the boxes altogether
[03:54:29] ori: I think you'll find that is just libraries
[03:54:52] so it's not really per-worker
[03:54:54] hm
[03:54:55] right
[03:55:11] that oneliner is wrong not just because of libraries
[03:55:21] fork() is copy on write
[03:55:52] the kernel won't copy all of apache's memory for every child
[03:55:56] what's the right way to measure it?
[03:56:32] well, you can use free :)
[03:56:34] stop apache, free
[03:56:42] I don't know if there is any decent tool
[03:56:49] start apache, free + ps, divide memory/apache processes
[03:56:50] salt --out=txt 'mw*' cmd.run 'apache2ctl stop'
[03:56:51] in theory it is possible to parse /proc//maps
[03:56:52] heh
[03:57:07] I have heard of that being done in a random app
[03:57:41] it seems like the sort of thing that should exist
[03:59:41] you could also put apache in a cgroup and look at memory.memsw.usage_in_bytes
[03:59:48] but I think that will include cache/buffers
[04:00:18] http://www.selenic.com/smem/ ?
[04:00:21] says apt-cache search :)
[04:00:54] https://raw.githubusercontent.com/pixelb/ps_mem/master/ps_mem.py
[04:00:59] yeah, looks about right
[04:01:03] I'd just use free tbh
[04:01:50] (CR) Faidon Liambotis: "appservers have no swap nowadays :)" [operations/puppet] - https://gerrit.wikimedia.org/r/137947 (owner: Ori.livneh)
[04:01:51] # V1.7 20 Sep 2007 Use PSS from /proc/$pid/smaps if available, which
[04:01:51] # fixes some over-estimation and allows totalling.
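A minimal sketch of the PSS idea quoted from the ps_mem.py changelog above, assuming a Linux kernel that exposes `Pss:` lines in `/proc/<pid>/smaps`. The function names are hypothetical, and this is neither the code of ps_mem.py nor of smem; it only illustrates why PSS can be totalled where the RSS column from `ps` cannot.

```python
import re

# Matches lines such as "Pss:                  12 kB". Anchoring at the start
# of the line keeps "SwapPss:" lines (present on newer kernels) out of the sum.
PSS_LINE = re.compile(r"^Pss:\s+(\d+)\s+kB$", re.MULTILINE)


def pss_kib(smaps_text):
    """Return the summed Pss (in KiB) found in one process's smaps text."""
    return sum(int(value) for value in PSS_LINE.findall(smaps_text))


def read_smaps(pid):
    """Read /proc/<pid>/smaps; needs permission to inspect that process."""
    with open(f"/proc/{pid}/smaps") as handle:
        return handle.read()


def total_pss_kib(pids, reader=read_smaps):
    """Sum Pss across several processes, e.g. all workers of one parent."""
    return sum(pss_kib(reader(pid)) for pid in pids)
```

Because the kernel divides each shared page among the processes that map it, per-process Pss values can be added together, whereas summing RSS counts copy-on-write and library pages once per child.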
[04:02:01] "divide memory/apache processes"
[04:02:08] there is the question of how you count apache processes
[04:02:19] whether or not you count ones that are idle, listening
[04:02:33] netstat -np |grep :80 | grep -c ESTABLISHED
[04:02:36] :P
[04:02:43] but yeah, you're right
[04:03:15] speaking of apache
[04:03:26] anyway, like I was saying on the change, I don't think average memory usage is even the right metric
[04:03:27] I wonder if there's a reason to run the master process as root
[04:03:44] I think you want memory in some hypothetical overload condition
[04:03:49] port 80 is one obviously, but that we can change
[04:04:19] I don't know, I think using a privileged port is a feature
[04:04:25] you know MySQL uses an unprivileged port
[04:04:57] and I have complained that if you can make it crash (not hard, right?) and if you have unprivileged shell access, you could bring up a fake mysqld before the real one manages to restart
[04:05:04] (CR) Faidon Liambotis: [C: 1] "I'm not thrilled about rotated_log, as it strikes me as something that should be a separate module (logrotate). We can iterate though and " [operations/puppet] - https://gerrit.wikimedia.org/r/135447 (owner: Ori.livneh)
[04:05:17] heh, true
[04:05:37] well, the point for not running apache as root would be to allow non-roots to sudo as apache and ptrace() it
[04:05:46] but with hhvm this might be moot now anyway
[04:06:20] paravoid: nothing is using rotated_log yet, it is only introduced in that patch
[04:06:23] i don't love it either
[04:06:29] ori: yes it is
[04:06:32] you use it for the apache log
[04:06:37] oh
[04:06:38] hm
[04:07:04] nothing uses daemon.conf.erb though, I think
[04:07:07] so that can go
[04:08:04] let's kill rotated_log too, i'll replace it with having the specific logrotate config in mediawiki/manifests/syslog.pp
[04:08:18] i think we're right to want an abstraction there but this isn't the right one
[04:08:46] I don't mind either way, this changeset is definitely progress compared to what we have and I've been holding you off long enough
[04:09:10] meh, it won't get easier to remove in the future
[04:09:13] we should have peer reviews more often
[04:09:27] I feel like I'm very productive with tech work
[04:09:33] heheh
[04:09:35] just as to avoid writing those reviews
[04:09:43] i haven't done any yet
[04:09:53] deadline is in 3 hours, right?
[04:09:55] i am a champion procrastinator
[04:10:18] ori: also the commit message lies I think, it says /var/log/local/
[04:10:20] i will probably end up sending them at some implausible hour after the deadline with a suitably apologetic note
[04:10:22] paravoid: I even made a couple wp article edits today to procrastinate ;)
[04:29:53] (PS12) Ori.livneh: Add rsyslog module and port existing usage [operations/puppet] - https://gerrit.wikimedia.org/r/135447
[04:34:56] gotta run for a bit, i'll merge it later when i can babysit
[04:35:01] thanks for the review
[05:04:08] !log beginning schema changes bug 49193 page_content_model
[05:04:12] Logged the message, Master
[05:11:11] (CR) Ori.livneh: [C: 2] Add rsyslog module and port existing usage [operations/puppet] - https://gerrit.wikimedia.org/r/135447 (owner: Ori.livneh)
[05:13:37] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[05:16:37] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out
[05:16:37] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:16:57] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:17:07] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:17:07] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:17:17] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out
[05:17:17] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out
[05:17:27] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out
[05:17:27] PROBLEM - Swift HTTP backend on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:17:47] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out
[05:18:07] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out
[05:18:57] hi
[05:19:11] i don't know anything about the rendering service.. yet. *consults wikitech*
[05:19:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data exceeded the critical threshold [500.0]
[05:19:19] i would be very upset if this were related to my change
[05:19:27] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time
[05:19:47] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69309 bytes in 0.382 second response time
[05:19:52] hmm
[05:19:57] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.074 second response time
[05:19:57] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time
[05:20:07] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time
[05:20:07] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time
[05:20:17] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time
[05:20:19] i like problems that fix themselves. sort of.
[05:20:27] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time
[05:21:29] hey ori thanks for the call the other night, i've added you to my phone's vip settings so that future calls will always ring through.
[05:21:31] my change doesn't apply to those hosts, so rule that out
[05:21:46] good to know
[05:22:31] fwiw, my schema changes havn't hit anything except labsdbs yet
[05:22:37] so not i either :)
[05:23:07] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:23:12] nngh
[05:23:17] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out
[05:23:17] PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:23:27] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out
[05:23:37] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out
[05:23:37] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out
[05:23:57] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:23:58] hey
[05:24:00] I just saw this
[05:24:01] this is rsyslog
[05:24:07] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out
[05:24:17] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out
[05:24:58] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.008 second response time
[05:25:07] RECOVERY - Swift HTTP backend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.596 second response time
[05:25:17] RECOVERY - Swift HTTP backend on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.005 second response time
[05:25:18] how?
[05:25:35] swift bug
[05:26:00] https://wikitech.wikimedia.org/w/index.php?title=Swift/TODO&diff=115824&oldid=115809
[05:26:24] "* Figure out a way to fix the [[Incident_documentation/20131205-Swift|"restarting syslog kills Swift"]] bug"
[05:26:37] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.067 second response time
[05:26:40] !log restarting all swift daemons across the cluster to fix runaway threads due to rsyslog restart
[05:26:44] Logged the message, Master
[05:27:07] very interesting
[05:27:10] gracias paravoid
[05:27:17] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[05:27:58] PROBLEM - Swift HTTP backend on ms-fe1004 is CRITICAL: HTTP CRITICAL - No data received from host
[05:28:58] RECOVERY - Swift HTTP backend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 343 bytes in 0.009 second response time
[05:29:03] I disabled puppet across the cluster
[05:29:07] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.661 second response time
[05:29:12] to make sure it won't happen uncontrollably
[05:29:17] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.073 second response time
[05:29:27] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.088 second response time
[05:29:27] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.417 second response time
[05:29:47] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69309 bytes in 0.324 second response time
[05:29:57] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.085 second response time
[05:30:17] the upstream bug is a mess
[05:31:27] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.068 second response time
[05:32:07] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.045 second response time
[05:36:13] ori: isn't it awesome that that diff above is from yesterday
[05:36:23] and yet today I gave a +1 on your change without noticing it
[05:37:53] we don't have this patch: http://hg.python.org/cpython/rev/99f0c0207faa
[05:39:40] yes this is linked from one of the swift bugs
[05:41:52] we should just apply it
[05:42:14] run a patched python you mean?
[05:42:16] no thanks :)
[05:42:21] upgrading to python 2.7.4 is too invasive
[05:44:17] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[05:44:21] upgrade to trusty and be done with it :)
[05:44:45] the other workaround for this bug is to point swift to 127.0.0.1:514 instead of /dev/log
[05:44:55] and configure rsyslog to listen on localhost
[05:49:12] anyway
[05:49:27] all swift eqiad+esams upgraded, rsyslog restarted, swift restarted, puppet enabled
[05:53:16] thanks for that
[06:17:36] (PS1) Ori.livneh: Fix-up for I28bcbab76: strip whitespace from clear_magick_tmp cronjob [operations/puppet] - https://gerrit.wikimedia.org/r/138780
[06:19:16] <_joe|away> ori: good morning
[06:21:06] <_joe_> hey I did not get paged for the swift problem
[06:21:20] it was too early
[06:23:42] how nice, I got locked out from wikitech
[06:25:14] <_joe_> oh so somebody called you, ok
[06:26:35] no, I was just around
[06:29:04] (PS1) Faidon Liambotis: Remove amanda from carbon [operations/puppet] - https://gerrit.wikimedia.org/r/138782
[06:29:32] (CR) Faidon Liambotis: [C: 2 V: 2] Remove amanda from carbon [operations/puppet] - https://gerrit.wikimedia.org/r/138782 (owner: Faidon Liambotis)
[06:30:02] <_joe_> btw, reading cajoel's email I discovered the "official" puppet docker module is implemented by on of the most abysmal hipster-brogrammers I ever had the displeasure to collaborate with.
[06:30:18] <_joe_> nice to know
[06:30:37] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Wed 11 Jun 2014 03:30:05 UTC
[06:31:31] <_joe_> ^ is bogus
[06:32:10] root@carbon:~# grep -c ' installed openbsd-inetd' /var/log/dpkg.log.1
[06:32:13] 3012
[06:32:29] <_joe_> ?
[06:34:25] see commit above
[06:36:18] <_joe_> paravoid: do we collect stats from apc somewhere?
[06:36:25] I doubt it
[06:37:00] <_joe_> ok, one more thing in the TODO list
[06:37:19] no point really
[06:37:32] we're switching to hhvm, so it will be pointless
[06:38:01] <_joe_> well, yes
[06:41:44] (CR) Giuseppe Lavagetto: "Tim: I do agree completely with your point about not being too lenient on CPU limits; that is why I proposed quite liberal limit based on " [operations/puppet] - https://gerrit.wikimedia.org/r/137947 (owner: Ori.livneh)
[06:56:17] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[07:00:27] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Wed Jun 11 07:00:26 UTC 2014
[07:14:28] (CR) Giuseppe Lavagetto: "I'll keep the netstat-based check" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/138317 (owner: Giuseppe Lavagetto)
[07:33:19] (PS5) Giuseppe Lavagetto: redis: restart service upon first install [operations/puppet] - https://gerrit.wikimedia.org/r/138317
[07:37:48] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds
[07:43:43] (PS1) QChris: Make dbstore1002 handle s4 analytics queries [operations/dns] - https://gerrit.wikimedia.org/r/138786 (https://bugzilla.wikimedia.org/66068)
[07:44:54] (CR) Springle: [C: 2] Make dbstore1002 handle s4 analytics queries [operations/dns] - https://gerrit.wikimedia.org/r/138786 (https://bugzilla.wikimedia.org/66068) (owner: QChris)
[07:45:47] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds
[07:46:38] (PS6) Giuseppe Lavagetto: redis: restart service upon first install [operations/puppet] - https://gerrit.wikimedia.org/r/138317
[07:48:37] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Wed 11 Jun 2014 04:47:31 UTC
[08:04:37] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Wed 11 Jun 2014 05:03:27 UTC
[08:14:37] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[08:17:07] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Wed Jun 11 08:17:04 UTC 2014
[08:18:10] (PS2) Alexandros Kosiaris: torrus: csw2-esams in accessswitches, not corerouters [operations/puppet] - https://gerrit.wikimedia.org/r/131473
[08:21:11] (CR) Alexandros Kosiaris: [C: 2] torrus: csw2-esams in accessswitches, not corerouters [operations/puppet] - https://gerrit.wikimedia.org/r/131473 (owner: Alexandros Kosiaris)
[08:26:59] (PS7) Alexandros Kosiaris: manutius: remove torrus [operations/puppet] - https://gerrit.wikimedia.org/r/130587 (owner: Matanya)
[08:29:01] (CR) Alexandros Kosiaris: [C: 2] manutius: remove torrus [operations/puppet] - https://gerrit.wikimedia.org/r/130587 (owner: Matanya)
[08:30:45] hello
[08:30:58] akosiaris: are you phasing out Torrus in favor of diamond?
[08:31:10] one more tampa on the way out
[08:31:29] hashar: yes in general, no right no. This is a tampa migration thing
[08:31:35] no right now*
[08:31:59] :-)
[08:33:57] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Wed Jun 11 08:33:49 UTC 2014
[08:40:02] (PS7) Giuseppe Lavagetto: redis: restart service upon first install [operations/puppet] - https://gerrit.wikimedia.org/r/138317
[08:40:31] (CR) Giuseppe Lavagetto: [C: 2] redis: restart service upon first install [operations/puppet] - https://gerrit.wikimedia.org/r/138317 (owner: Giuseppe Lavagetto)
[08:40:58] (CR) Giuseppe Lavagetto: [V: 2] redis: restart service upon first install [operations/puppet] - https://gerrit.wikimedia.org/r/138317 (owner: Giuseppe Lavagetto)
[08:46:42] (CR) Alexandros Kosiaris: [C: 2] Increase bacula pool size [operations/puppet] - https://gerrit.wikimedia.org/r/137564 (owner: Alexandros Kosiaris)
[08:51:00] (CR) Alexandros Kosiaris: [C: 2] Purge vimrc.local [operations/puppet] - https://gerrit.wikimedia.org/r/137340 (owner: Alexandros Kosiaris)
[09:15:20] (PS1) Filippo Giunchedi: keep 5 days worth of diamond.log [operations/puppet] - https://gerrit.wikimedia.org/r/138789
[09:34:33] (CR) Aklapper: [C: 1] "Changing to +1, now that https://bugzilla.wikimedia.org/show_bug.cgi?id=61178 got fixed (consistency)." [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/106761 (https://bugzilla.wikimedia.org/59893) (owner: 01tonythomas)
[09:49:07] PROBLEM - puppetmaster backend https on palladium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error
[09:49:17] PROBLEM - puppetmaster https on palladium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error
[09:51:31] <_joe_> uh
[09:53:17] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.045 second response time
[09:53:57] <_joe_> Unexpected error in mod_passenger: Could not connect to the ApplicationPool server: Broken pipe (32)
[09:54:01] <_joe_> meh
[09:54:07] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.019 second response time
[09:54:23] <_joe_> !log restarted apache on palladium - passenger crashed
[09:54:29] Logged the message, Master
[09:58:17] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 53 data above and 9 below the confidence bounds
[10:14:47] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds
[10:20:17] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 53 data above and 9 below the confidence bounds
[10:20:51] (PS1) Alexandros Kosiaris: Initial import for version 1.3 [operations/debs/etherpad-lite] - https://gerrit.wikimedia.org/r/138792
[10:21:23] (CR) Alexandros Kosiaris: [C: 2 V: 2] Initial import for version 1.3 [operations/debs/etherpad-lite] - https://gerrit.wikimedia.org/r/138792 (owner: Alexandros Kosiaris)
[10:25:17] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 53 data above and 9 below the confidence bounds
[10:28:09] (CR) Alexandros Kosiaris: [C: 1] "That seems like a lot of changes at first but all in all it is only <10 new functions. The rest is docs and rspec/tests. I laughed with th" [operations/puppet] - https://gerrit.wikimedia.org/r/138635 (owner: Ori.livneh)
[10:33:17] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 53 data above and 9 below the confidence bounds
[10:33:47] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds
[10:39:06] looks like it is normal? I was looking at http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=terbium.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2#metric_Global_JobQueue_length
[10:39:14] I think I might have killed graphite
[10:39:37] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.003 second response time
[10:39:48] it does indeed seem down hashar :))
[10:40:05] I was browsing some huge metric, that might have killed it with some oom
[10:41:24] I have no clue how graphite is setup on tungsten though
[10:41:41] for the jobq, you might have better clue looking at http://gdash.wikimedia.org/dashboards/jobq/ (whenever graphite is back)
[10:41:53] hopefully it will have the confident bands graphed
[10:42:35] the usual problem there when asking for many metrics is that uwsgi workers running graphite-web get locked up while waiting for the disk and won't talk to apache anymore
[10:44:57] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds
[10:45:13] !log restarted uwsgi on tungsten, hung on fetching many metrics
[10:45:17] Logged the message, Master
[10:46:14] (PS5) Odder: Move queries for bugs with ASSIGNED status [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/129671
[10:47:17] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 53 data above and 9 below the confidence bounds
[10:47:37] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time
[10:47:47] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 53 data above and 9 below the confidence bounds
[10:48:05] hashar: yup the dashboard looks fine too to me
[10:48:12] <_joe_> I'll silence this until graphite is more stable
[10:48:19] I guess it is back
[10:48:42] which alarm _joe_ ?
[10:48:54] <_joe_> Number of mediawiki jobs running
[10:49:00] the gdash dashboard for queue only has rates :(
[10:49:23] would be rather nice to have the job queues with confidence band and have the nagios check point to that url :D
[10:49:32] I don't think it is related with graphite being unstable, it was alarming even before
[10:50:09] <_joe_> then there's something strange going on there - maybe some strange data
[10:50:12] <_joe_> yeah, right
[10:50:17] <_joe_> we do forecasting over 1 w
[10:50:41] <_joe_> we had a very unusual behavior (due to a bug) for some days in the last seven
[10:51:04] <_joe_> so maybe the prediction over a week is a too long interval, I'll dig into that
[10:51:29] <_joe_> we may have more than enough datapoints with 1 day
[10:52:19] would have it happened the day after the bug was fixed anyway?
[10:52:41] <_joe_> I need to look at the data :) [10:54:34] I have a question for graphite wizards, suggested by the editswiki graph which does something like target=cactiStyle(substr(highestMax(summarize http://ur1.ca/hgf85 [10:55:17] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 53 data above and 9 below the confidence bounds [10:55:20] sorry I killed it again [10:55:21] Is there something like that to show a line for each of the top N wikis with most "observation points"? [10:55:22] :( [10:55:29] I will stop attempting to browse my large hierarchy :D [10:55:47] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds [10:55:49] * Nemo_bis skimming https://graphite.readthedocs.org/en/1.0/functions.html [10:55:50] hashar: you get to fix it this time around :) "service uwsgi restart" on tungsten + waiting some time is the trick [10:56:39] (03PS1) 10Odder: Add Tony Thomas to the English Planet Wikimedia [operations/puppet] - 10https://gerrit.wikimedia.org/r/138796 [10:57:07] (03CR) 10Odder: [C: 04-1] "Commit message is incorrectly formatted." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137804 (owner: 10Awight) [10:57:58] godog: sorry I haven't got access to that host :( [10:58:06] but I will surely stop playing with it fornow [10:58:17] oh, ok didn't realize that hashar [10:58:17] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds [10:58:30] can't find anything in the docs, unless https://graphite.readthedocs.org/en/1.0/functions.html#graphite.render.functions.events can be combined with highestMax and friends [10:58:46] hashar: btw graphite seems fine to me [10:58:47] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 54 data above and 9 below the confidence bounds [10:59:02] godog: yes went back. Might just be my connection so [11:00:10] Nemo_bis: not sure, I usually try on a small sample of data in the graphite ui [11:00:38] well I don't have access to that, can only play with image URLs [11:02:40] quick break then I will look at LXC \O/ [11:09:17] (03PS1) 10Giuseppe Lavagetto: check git merge: use outstanding commits, not time [operations/puppet] - 10https://gerrit.wikimedia.org/r/138798 [11:10:33] (03CR) 10jenkins-bot: [V: 04-1] check git merge: use outstanding commits, not time [operations/puppet] - 10https://gerrit.wikimedia.org/r/138798 (owner: 10Giuseppe Lavagetto) [11:11:57] (03PS2) 10Giuseppe Lavagetto: check git merge: use outstanding commits, not time [operations/puppet] - 10https://gerrit.wikimedia.org/r/138798 [11:15:37] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC [11:19:37] <_joe_> out for lunch. bbl. [11:21:37] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.001 second response time [11:24:34] (03CR) 10Nemo bis: "I wonder if we should store events rather than counts. 
http://obfuscurity.com/2014/01/Graphite-Tip-A-Better-Way-to-Store-Events + http://o" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117021 (https://bugzilla.wikimedia.org/41754) (owner: 10Nemo bis) [11:31:37] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time [11:32:05] sigh, a lot of accesses to reqstats.edits [11:32:05] !log restarted uwsgi on tungsten, a lot of accesses to reqstats.edits.*.submits [11:32:10] Logged the message, Master [11:40:02] !log manually cleaning librenms tables. db1001 is going to have increased load for some time. The approach is automatable, see http://jira.observium.org/browse/OBSERVIUM-757 [11:40:07] Logged the message, Master [12:00:37] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.005 second response time [12:02:37] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time [12:11:47] (03PS23) 10Alexandros Kosiaris: etherpad: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [12:20:35] (03CR) 10Alexandros Kosiaris: "Yes, the previous PS did not work for a number of reasons (see the diff for changes). I have uploaded previously amended PSes that were cl" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [12:21:45] <_joe_> !log set up a secondary remote named 'readonly' in /a/common on tin, to use with the icinga check for unmerged commits [12:21:50] hey mutante, any ideas why Tech News does not appear on the English Planet Wikimedia? [12:21:50] Logged the message, Master [12:21:56] Any errors? [12:27:43] <_joe_> twkozlowski: is there a related change I can look into? 
[12:29:29] (03PS3) 10Giuseppe Lavagetto: check git merge: use outstanding commits, not time [operations/puppet] - 10https://gerrit.wikimedia.org/r/138798 [12:31:27] https://gerrit.wikimedia.org/r/#/c/127222/ _joe_ [12:31:55] <_joe_> ok it will take me a while, I know nothing about planet :) [12:32:18] I see Nemo_bis commented on the problem already back in April, hmm. [12:32:29] I never read Tech News, so :-P [12:32:57] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:33:15] <_joe_> twkozlowski: yeah that seems likely to be the cause of the problem :) [12:33:34] (03CR) 10Giuseppe Lavagetto: [C: 031] leave only one statsd/carbon-relay CNAME [operations/dns] - 10https://gerrit.wikimedia.org/r/138568 (owner: 10Filippo Giunchedi) [12:33:47] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [12:33:52] oh dear :-( [12:37:42] <_joe_> twkozlowski: I did not investigate this, but please speak with Nemo_bis - if you still have doubts I can help dig deeper [12:40:22] (03CR) 10Hashar: contint: localhost.mediawiki vhost on ci labs slave (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/135529 (owner: 10Hashar) [12:40:28] (03PS5) 10Hashar: contint: localhost.mediawiki vhost on ci labs slave [operations/puppet] - 10https://gerrit.wikimedia.org/r/135529 [12:40:36] (03PS6) 10Hashar: contint: localhost.mediawiki vhost on ci labs slave [operations/puppet] - 10https://gerrit.wikimedia.org/r/135529 [12:41:34] (03PS3) 10Giuseppe Lavagetto: Allow searching NS_MODULE with mwgrep [operations/puppet] - 10https://gerrit.wikimedia.org/r/138216 (owner: 10Hoo man) [12:41:43] (03CR) 10Giuseppe Lavagetto: [C: 032] Allow searching NS_MODULE with mwgrep [operations/puppet] - 10https://gerrit.wikimedia.org/r/138216 (owner: 10Hoo man) [12:42:44] (03CR) 10Hashar: "@Dzahn" [operations/puppet] - 10https://gerrit.wikimedia.org/r/135529 (owner: 10Hashar) [12:43:28] godog_: I have addressed a puppet 
file mode concern you had on https://gerrit.wikimedia.org/r/#/c/135529/ :) should be good to go now. [12:50:26] (03PS1) 10Hashar: contint: reduce duplication with mediawiki::packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/138804 [12:50:28] (03CR) 10Hashar: "Will rework it on https://gerrit.wikimedia.org/r/138804" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137921 (owner: 10Hashar) [12:50:36] (03CR) 10Hashar: "Thanks! Will rework it on https://gerrit.wikimedia.org/r/138804" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138097 (owner: 10Ori.livneh) [12:58:49] (03CR) 10Nemo bis: "Odder notes what above is still current; I had forgotten because I'm using the feed directly. Doesn't Venus output some logs somewhere? If" [operations/puppet] - 10https://gerrit.wikimedia.org/r/127222 (owner: 10Dzahn) [12:59:46] (03PS2) 10Hashar: contint: reduce duplication with mediawiki::packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/138804 [13:00:25] <_joe_> Nemo_bis: problem could be better reported on RT [13:00:35] <_joe_> I'll do that for you :) [13:00:40] (03CR) 10Hashar: "Ordering the calls with class dependencies:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138804 (owner: 10Hashar) [13:02:45] (03CR) 10Hashar: [C: 031] "Cherry picked on contint puppetmaster" [operations/puppet] - 10https://gerrit.wikimedia.org/r/135529 (owner: 10Hashar) [13:05:49] _joe_: as you prefer :) we have a "Planet" component in bugzilla fyi [13:06:00] (03CR) 10Hashar: [C: 031] "Cherry picked on contint puppetmaster." [operations/puppet] - 10https://gerrit.wikimedia.org/r/138804 (owner: 10Hashar) [13:06:12] <_joe_> yeah but this looks like something that needs ops investigation first [13:09:29] ah trusty comes with kernel 3.13.x \O/ [13:14:52] hashar: lovely! 
TYVM [13:15:17] (03CR) 10Filippo Giunchedi: [C: 031] contint: localhost.mediawiki vhost on ci labs slave [operations/puppet] - 10https://gerrit.wikimedia.org/r/135529 (owner: 10Hashar) [13:15:19] hmm my catalog differ dies with puppet 3... [13:15:38] I am going to install docker on a labs instance and play with it a bit [13:17:58] (03PS1) 10Mark Bergsma: Migrate mobile-lb.esams to new IP addresses (Zero scheme) [operations/puppet] - 10https://gerrit.wikimedia.org/r/138806 [13:19:35] (03PS1) 10Mark Bergsma: Migrate mobile-lb.esams to new IP addresses (Zero scheme) [operations/dns] - 10https://gerrit.wikimedia.org/r/138807 [13:20:18] akosiaris: any caching servers in tampa ? [13:20:42] nope [13:25:49] (03CR) 10Mark Bergsma: [C: 032] Migrate mobile-lb.esams to new IP addresses (Zero scheme) [operations/dns] - 10https://gerrit.wikimedia.org/r/138807 (owner: 10Mark Bergsma) [13:28:16] (03PS1) 10Matanya: cache: remove pmtpa hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/138811 [13:28:46] (03CR) 10Mark Bergsma: [C: 032] Migrate mobile-lb.esams to new IP addresses (Zero scheme) [operations/puppet] - 10https://gerrit.wikimedia.org/r/138806 (owner: 10Mark Bergsma) [13:32:43] <_joe_> Nemo_bis: RT created and added you as requestor [13:33:40] (03CR) 10Filippo Giunchedi: [C: 031] check git merge: use outstanding commits, not time (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/138798 (owner: 10Giuseppe Lavagetto) [13:34:11] hashar: precise also has 3.13 as an option, fwiw [13:34:13] (03PS4) 10Giuseppe Lavagetto: check git merge: use outstanding commits, not time [operations/puppet] - 10https://gerrit.wikimedia.org/r/138798 [13:34:22] paravoid: yeah noticed that [13:34:30] (03CR) 10Giuseppe Lavagetto: [C: 032] check git merge: use outstanding commits, not time [operations/puppet] - 10https://gerrit.wikimedia.org/r/138798 (owner: 10Giuseppe Lavagetto) [13:34:40] argh.... 
fchown(4, 0, 500) = -1 EPERM (Operation not permitted) [13:34:45] why puppet ? why ? [13:35:12] last famous words [13:35:12] so that no boredom ensues! [13:37:42] <_joe_> "why puppet ? why ?" is the official opsen motto, I see [13:37:45] (03PS1) 10Ottomata: analytics-kafka partman didn't work, it was too complicated [operations/puppet] - 10https://gerrit.wikimedia.org/r/138813 [13:38:11] (03CR) 10Ottomata: [C: 032 V: 032] analytics-kafka partman didn't work, it was too complicated [operations/puppet] - 10https://gerrit.wikimedia.org/r/138813 (owner: 10Ottomata) [13:38:37] <_joe_> ottomata: man I just merged a patch [13:38:43] _joe_: ok to merge your 'check git merge: use outstanding commits, not time' commit? [13:38:46] ah, yup :) [13:39:07] _joe_: its ok for me to puppet-merge that? [13:39:15] <_joe_> ottomata: yeah [13:39:21] thanks joe [13:39:34] <_joe_> we were quasi-synchronous [13:41:42] <_joe_> Nemo_bis: thank me once I had time to look into it :P [13:44:35] (03PS1) 10Alexandros Kosiaris: quality $hostname in network.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/138814 [13:44:47] QUALITY [13:44:50] quality! [13:44:51] haha [13:44:59] it's a very nice hostname [13:45:39] indeed it is [13:45:45] <_joe_> lol [13:46:11] let's go for nouns this time. Infinitely more than chemical elements :-) [13:46:30] (03PS2) 10Alexandros Kosiaris: qualify $hostname in network.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/138814 [13:47:17] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly [13:47:41] akosiaris: but there's no fixed order for nouns ... :/ (which one is nexed ...) [13:47:51] <_joe_> uhm I forgot a >/dev/null [13:47:58] <_joe_> but still, it works [13:49:09] Trminator: of course there is. The one with the smallest Levenshtein distance. [13:49:40] akosiaris: from which dictionary, in which language ... what about swear words? 
;) [13:50:31] but I'll let you guys work that out ;) [13:51:04] Trminator: the Klingon Pocket Dictionary by Klingonska Akademien, obviously the Klingon language and swear words would be a grave mistake [13:51:24] hehe [13:52:42] <_joe_> akosiaris: klingon nouns? +1 [13:53:17] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection refused [13:53:27] (03CR) 10Alexandros Kosiaris: [C: 032] qualify $hostname in network.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/138814 (owner: 10Alexandros Kosiaris) [13:53:57] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection refused [13:54:57] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.191 second response time [13:55:02] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [13:55:14] 5xx spike just started [13:55:17] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.110 second response time [13:55:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data exceeded the critical threshold [500.0] [13:55:37] <_joe_> bblack: faster than icinga [13:56:20] !log upgrading mw1153-mw1160, tmh1001-tmh1002 for USN-2244-1 [13:56:25] Logged the message, Master [13:57:01] <_joe_> bblack: are you looking into it? [13:57:24] it seems to have been short-lived [13:57:35] could have been the upgrade above [13:57:35] <_joe_> ok just a spike, may be related to imagescalers upgrade? [13:57:37] I'm guessing related to the upgrades [13:57:51] <_joe_> ok, there seem to be some consensus :P [13:58:01] guys, I think it may be because of the upgrade! 
[13:58:05] :D [13:59:31] the only thing better than consensus is a management decision :) [13:59:47] i hereby decree [14:08:59] (03PS1) 10Odder: Set $wgCategoryCollation to 'uca-fr' on frwikinews [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138820 (https://bugzilla.wikimedia.org/66165) [14:10:15] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [14:10:44] Reedy: around? [14:11:34] (03CR) 10Odder: "Special:Statistics reports 45k pages on the wiki, but I guess Springle needs to have a say on when this can be merged & deployed." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138820 (https://bugzilla.wikimedia.org/66165) (owner: 10Odder) [14:16:35] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC [14:22:15] heya mutante! [14:22:24] nuria is asking if there is a way she can ack icinga eventlogging alerts [14:22:31] How do I delete a mailing list? [14:22:31] i know that I can now, but only as of recently [14:22:37] so I think there is some special privs she needs [14:22:39] but I don't know what they are [14:23:57] # docker.io run -i -t wikimedia-ci:trusty /bin/bash [14:23:57] root@695674b99022:/# echo "*flexes*" [14:26:42] ACKNOWLEDGEMENT - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 ottomata Need more brokers! [14:27:22] ACKNOWLEDGEMENT - Kafka Broker Replica Lag on analytics1021 is CRITICAL: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value CRITICAL: 1870448616.0 ottomata Need more brokers! [14:27:22] ACKNOWLEDGEMENT - Kafka Broker Under Replicated Partitions on analytics1022 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 40.0 ottomata Need more brokers! 
[14:28:17] ottomata: if I understand the eventlogging.pp puppet file correctly, she'd need to be in either of these nagios groups: contact_group => 'admins,analytics' in order for her to be able to do that :) [14:28:58] she is in analytics [14:29:06] and i believe that just sends her the alerts [14:29:20] she'd like to log into icinga and click a button that acknowledges that she is aware of and working on the problem [14:29:32] (like i just did for the above kafka notices) [14:33:03] ottomata: I get that ;) (am managing quite a few nagios/icinga instances myself, but am not really sure on how your setup is regarding nagios privs :/ just tried to help :) [14:33:25] twkozlowski: Hm? Also I replied to the cswikiversity bug about babel making a comment you might like to read. [14:34:19] (03PS1) 10Ottomata: Fix for $critical param on monitor_ganglia define [operations/puppet] - 10https://gerrit.wikimedia.org/r/138822 [14:34:21] ja, thanks Trminator :) [14:36:07] JohnLewis: Oh, I know what the state of the bug is. Danny is blocking it because he hates Juan (who reported the bug). [14:36:38] twkozlowski: Now that is a bit of cs knowledge I didn't know :p [14:36:51] I know about Danny in general and it sounds 100% like it. [14:37:30] I assigned it to Danny because I didn't want to get involved in this mess. Can submit a patch but it's likely it's going to get -1'd, so whatever. [14:38:05] meh [14:38:31] twkozlowski: could sneak it into a SWAT ;) [14:40:04] <_joe_> twkozlowski, Nemo_bis you should see the technews on planet shortly [14:40:18] !log Jenkins updating plugins [14:40:22] Logged the message, Master [14:40:45] Great :) [14:41:10] <_joe_> someone ran planet as root [14:41:54] <_joe_> !log manually ran 'planet' on en.planet to restore technews [14:41:59] Logged the message, Master [14:44:24] <_joe_> twkozlowski: can you confirm en.planet has Technews entries now? 
[14:46:08] <_joe_> oh, more than one file with bad permissions :/ [14:46:15] <_joe_> wait 5 more mins [14:48:07] !log rebooting lvs3004.esams (inactive uploads LVS) for 3.13 kernel [14:48:12] Logged the message, Master [14:49:37] (03PS3) 10Krinkle: webperf/deprecate: Log jqmigrate to statsd under mw.js.deprecate [operations/puppet] - 10https://gerrit.wikimedia.org/r/137484 [14:50:13] PROBLEM - Disk space on dataset1001 is CRITICAL: DISK CRITICAL - free space: /data 1520944 MB (3% inode=99%): [14:50:21] apergos: ^^ [14:50:33] (it's been in WARNING for weeks now) [14:51:00] (03CR) 10jenkins-bot: [V: 04-1] webperf/deprecate: Log jqmigrate to statsd under mw.js.deprecate [operations/puppet] - 10https://gerrit.wikimedia.org/r/137484 (owner: 10Krinkle) [14:51:48] Is someone restarting Jenkins? [14:51:50] hashar: ? [14:51:59] <_joe_> Krinkle: he's updating things [14:52:04] <_joe_> as per the SAL [14:52:05] Ah, I see. [14:52:22] !log Jenkins restarting (plugin upgrades) [14:52:23] Well, updating plugins doesn't always require a restart and restart can be scheduled later. [14:52:24] I didn't see that though [14:52:26] Logged the message, Master [14:52:29] Krinkle: yes sorry [14:52:30] :) [14:52:33] Happened to hit it the xact second [14:52:49] Krinkle: also I have upgraded Jenkins yesterday by mistake :-( [14:52:59] luckily everything went fine and I amanged to get to bed "early" [14:53:13] hashar: Where do we install Jenkins itself from? [14:53:13] <_joe_> early in the morning [14:53:15] <_joe_> I guess [14:53:15] not debian? [14:53:25] it's alwyas morning somewhere [14:53:37] Krinkle: we use the upstream debian package which is pulled manually on apt.wikimedia.org [14:53:47] Krinkle: then it is just about running apt-get update && apt-get install [14:54:06] hashar: ah, right. 
It was already updated in our repository, just not yet applied yet, and you applied the update [14:54:16] I found it hard to believe one accidentally updates a package :P [14:54:21] that's quite a lot of work [14:54:35] I ran apt-get upgrade and did not notice jenkins was proposed :-/ [14:54:40] Yeah [14:55:01] Krinkle: meanwhile, never ever upgrade the Jenkins git plugins. They are broken beyond repairs [14:55:29] Maybe our puppet stuffies should have a system to detect the difference between manifest clean and current state (primarily ensure=>present vs latest, but there's others as well). So that one can see which machines are dirty. [14:55:51] Those should essentially by on operations' todo list to do manually so that the cluster isn't in a weird state that it'll never get back in if a machine is replaced. [14:55:59] <_joe_> Krinkle: not sure I got what you're searching to accomplish [14:56:06] I mean, what's the point of puppetising it if the machines are constantly in a state that ins't actually in pupet [14:56:28] I'm not, just nagging about things I don't understand. [14:56:36] <_joe_> Krinkle:what is not in puppet? [14:56:54] <_joe_> state of apt packages? see facter -p :) [14:57:47] _joe_: ensure present/latest is just one example, but I mean in general the idea that puppet manifests ensure one thing, but that often does not always match the result on a new node. E.g. ensure=>present doesn't do anything when the package has an update already prepped in our apt repo. So replacing that node and provisioning it exactly the same can completely break something. [14:57:55] I'm not saying ensure=>latest is great either. [14:58:13] anomie: no swat this morning. party time [14:58:27] but while I don't know how this weighs in perspective, I feel it would make sense the difference between the two is detected somewhere and put on a todo list for manual upgrades. [14:58:35] <_joe_> Krinkle: oh are you saying that the 'package' resourse in puppet is shit? 
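The drift Krinkle describes (a package that `ensure => present` leaves alone even though the apt repository already offers a newer version) could be listed with something like the sketch below. It assumes the installed and candidate versions have already been collected into two dictionaries; the function name and the version strings are hypothetical, and it only reports that the two differ rather than doing a real Debian version comparison.

```python
from typing import Dict, List, Tuple


def pending_upgrades(
    installed: Dict[str, str],
    candidate: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """List (package, installed_version, candidate_version) where they differ.

    A package with no known candidate is treated as up to date.
    """
    drift = []
    for name in sorted(installed):
        have = installed[name]
        want = candidate.get(name, have)
        if want != have:
            drift.append((name, have, want))
    return drift


# Made-up versions, for illustration only.
print(pending_upgrades(
    {"jenkins": "1.532", "git": "1.9"},
    {"jenkins": "1.565", "git": "1.9"},
))
```

A freshly provisioned node would install the candidate version directly, so every entry in the returned list is a place where an existing node and a new node would end up different.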
[14:58:45] manybubbles|away: Works for me [14:58:59] <_joe_> I may agree, but it's not particularly shittier than most puppet, and definitely less than a lot of other things [14:59:01] grumble |away grumble [14:59:02] any kind of resource that accepts a state where it does not ensrue the same state as what it would provision a new node with [14:59:23] Krinkle: ops have a web interface to report outdated packages (nicknamed Servmon ) it is not publicly available though [14:59:40] Right, like I said, I don't know how much this weighs in comparison to other types of shit, just sense something wrong and glad I got that right :) [14:59:43] <_joe_> Krinkle: puppet can be described as "good ideas horribly implemented" [15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140611T1500) [15:00:32] hashar: Do we have a way to exclude files from pep8? [15:00:53] Krinkle: more or less, what is the use case / impacted repo ? [15:01:23] Right now I"m a bit pissed at ops for having self-merged a patch that breaks their own lint check, and now any patch that triggers the pep8 pipeline will result in a broken build because of an unrelated file in the repo not passing hte style guide. Basically master (or 'production') is now failing the build consistently. [15:01:35] Specificall this change: https://gerrit.wikimedia.org/r/#/c/137929/ [15:01:42] Causing my patch to break: https://gerrit.wikimedia.org/r/#/c/137484/ [15:01:53] * hashar blame ops [15:02:00] <_joe_> Krinkle: that's me btw [15:02:07] Oh, I did not know that. [15:02:11] <_joe_> Krinkle: ;) [15:02:12] (really) [15:02:32] so yeah we want to ignore modules/puppetmaster/files/puppetsigner.py [15:02:32] Anyhow, I'm not ashamed. If anything glad you're here. 
[15:02:36] Yep [15:02:56] <_joe_> Krinkle: the problem there specifically was I had the puppetmuster in labs not signing certs anymore [15:03:07] since the python files in ops/puppet might come from upstream, andrew boggot wrote a tiny wrapper around pep8 that execute it in the sub directories [15:03:09] <_joe_> and I moved the file, not really created it, from a wrong location [15:03:13] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Wed 11 Jun 2014 12:02:54 UTC [15:03:37] <_joe_> I'm pretty sure someone told me to 'just verify it' [15:04:01] <_joe_> sorry for the inconvenience [15:04:15] tis ok [15:04:20] you should have copied ./modules/ldap/files/scripts/.pep8 as well [15:04:25] that ignores the error about long line [15:04:31] <_joe_> did not notice that [15:04:33] <_joe_> :/ [15:04:52] <_joe_> sorry, newbie errors :) [15:04:57] it is so well hidden that there is little chance you could have figured out [15:05:00] Sure, I understand the urgency. I guess in addition you're not well aquainted with pep8? (Neither am I). I mean, if I were doing an emergency check-in in mediawik-core adding a javasript file from upstream, I'd just add 1 line in jshintignore, dont even think about it, I work on taht every day and know what to do. it wouldn't make sense to omit that, you jsut add the ignore rule. [15:05:22] <_joe_> Krinkle: I am used to pep8 [15:05:35] <_joe_> just did not know how it specifically worked here [15:05:42] but was unable to find some kind of pep8-ignore list, or I would've added a patch for it wihtin a few days [15:05:57] <_joe_> Krinkle: next time ping me even via email [15:06:03] k :) [15:06:12] Hm.. that's odd. [15:06:23] <_joe_> now that you know it's me [15:06:23] The file was moved from a wrong location *within* the repo [15:06:27] so it was ignored before? [15:06:33] <_joe_> yes! 
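Why moving the file broke the lint job can be illustrated with a small lookup: a `.pep8` file only covers Python files at or below its own directory. This is a sketch of that idea, not the actual wrapper or pep8's own config search; the function name is hypothetical, and it assumes the search stops at the repository root.

```python
from pathlib import Path
from typing import Optional


def nearest_pep8_config(py_file: Path, repo_root: Path) -> Optional[Path]:
    """Walk up from the file's directory and return the first .pep8 found.

    Returns None if no .pep8 exists between the file and the repo root,
    in which case the default style rules would apply.
    """
    repo_root = repo_root.resolve()
    directory = py_file.resolve().parent
    while True:
        candidate = directory / ".pep8"
        if candidate.is_file():
            return candidate
        # Stop at the repository root (or the filesystem root as a safeguard).
        if directory == repo_root or directory == directory.parent:
            return None
        directory = directory.parent
```

With a `.pep8` under `modules/ldap/files/scripts/`, a script in that directory is covered, while the same script moved to `modules/puppetmaster/files/` finds no config and is checked against the defaults, including the line-length limit.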
[15:06:38] (03PS1) 10Hashar: Fix Jenkins pep8 job [operations/puppet] - 10https://gerrit.wikimedia.org/r/138824 [15:06:39] ^^^ [15:06:40] <_joe_> via the .pep8 file [15:06:58] Oh, right. Closest-directory discovery [15:07:02] That makes sense [15:07:05] <_joe_> hashar: merging that [15:07:13] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix Jenkins pep8 job [operations/puppet] - 10https://gerrit.wikimedia.org/r/138824 (owner: 10Hashar) [15:07:38] (03CR) 10Krinkle: "Fixed in Ia74ffa5e3fff :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137929 (owner: 10Giuseppe Lavagetto) [15:07:47] (03CR) 10Hashar: "Jenkins runs pep8 via a wrapper in integration/jenkins.git , it basically cd to the directory holding the python file so it can recognize " [operations/puppet] - 10https://gerrit.wikimedia.org/r/137929 (owner: 10Giuseppe Lavagetto) [15:08:06] \O/ [15:08:06] <_joe_> Krinkle: what confused me is, there are already files that are deprecated in the repository [15:08:26] jenkins is happy now [15:08:34] (03PS4) 10Krinkle: webperf/deprecate: Log jqmigrate to statsd under mw.js.deprecate [operations/puppet] - 10https://gerrit.wikimedia.org/r/137484 [15:09:01] the lame script is at https://github.com/wikimedia/integration-jenkins/blob/master/tools/puppet_pep8.py [15:09:26] one day all python files will pass pep8 and we will be able to get rid of it [15:09:42] (03CR) 10Krinkle: [C: 031] "Relevant patches in mediawiki and WikimediaEvents repos have been merged. Waiting for this so that the data stream stars being logged in s" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137484 (owner: 10Krinkle) [15:13:29] To the operator's, keep up the amazing work it's value to some is forever... Thank you... [15:14:18] _joe_: Thanks! 
I can see the latest issue on the Planet now [15:14:35] <_joe_> ok :) [15:15:19] so while I'm bugging you :-) [15:15:44] There are several requests on Bugzilla to have some mailing lists closed [15:15:58] How do I bring it to your attention (opsen) without having to file an RT ticket? [15:16:24] (Because if it's already on Bugzilla, having to copy it onto RT is just silly.) [15:17:53] closed = removed/deleted [15:22:28] <_joe_> twkozlowski: hmmm let me get back to you in 5 [15:23:01] qchris: heya! Small gerrit problem, we've not been able to create new branches in gerrit for the android app. any idea what might be the problem? [15:23:28] YuviPanda: Mhmm. Nope. Let me have a look. [15:23:52] <_joe_> twkozlowski: I'd say "send me an email with the links" but I'm not sure how strict we are about process on these things (again, newbie here) [15:24:59] YuviPanda: ldap/wmf should be able to create references ... do you have a sha1 and branch name that you want to have created? [15:25:07] YuviPanda: (so I could test myself) [15:25:26] qchris: yeah, 2ad087212c0b376cb65eea2bf588d8e2679bb9c3 as branch beta [15:25:56] YuviPanda: in apps/android/wikipedia. right? [15:26:11] qchris: yes [15:26:24] YuviPanda: Done. [15:26:33] qchris: I was trying to do git push gerrit HEAD:refs/for/beta [15:26:44] qchris: that should've worked, right? [15:26:57] If beta had existed, yes. [15:27:15] But to create a branch, you need to push a commit to it before. [15:27:25] qchris: oh? how do I do that? [15:27:28] Like the commit of where you want to branch off from. [15:27:47] qchris: oh, so from a commit that's already in master? [15:28:01] git push gerrit BRANCHING_HASH:refs/heads/NEW_BRANCH_NAME [15:28:15] YuviPanda: Yes, from a commit that's already existing. [15:28:23] qchris: aaah, ok. that makes sense. [15:28:27] qchris: let me try to create some more [15:29:04] If you want to start an unconnected branch, you can do that too. [15:29:05] _joe_: How about I assign the reports to you? 
You'll get an e-mail this way :-) [15:29:37] But you cannot review the root of the on connected branch already in place. [15:29:47] s/on connected/unconnected/ [15:30:03] <_joe_> mmmy wise move :P [15:30:20] qchris: right. I was missing the fact I need to create it separately, which is different from 'regular' git [15:30:28] <_joe_> ok twkozlowski agreed, I may need to close them as invalid if we're strict about process [15:31:40] qchris: thank you very much! [15:31:53] YuviPanda: yw [15:32:23] YuviPanda: Ja, gerrit is a bit more restrictive here in oredr to get it's magic around reviews done :-/ [15:33:23] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Wed Jun 11 15:33:22 UTC 2014 [15:34:24] <_joe_> bbl [15:50:39] I am disappearing see you tomorrow [15:54:03] qchris: heya! another question :) how do we merge a branch back into master? [15:54:10] qchris: just usual git merge and then push for review? [15:54:19] * qchris is not listening :-D [15:54:36] It depends on how you want to structure your repo. [15:55:21] But gerrit does not like submitted merge commits too much. [15:56:10] I only heard complaints about people doing them. Sooner or later, they shot themselves in the foot. [15:56:19] Policywise ... I do not know. [15:56:28] Maybe ^demon|away has opinions? [15:57:05] (03CR) 10Rush: [C: 031] "I didn't actually wait the days required to make sure this is properly handling 5 days worth of data, but yeah we need this. I'm sorry th" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138789 (owner: 10Filippo Giunchedi) [16:00:34] qchris: gah, just back in India and a powercut just happened. 
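The two pushes qchris distinguishes earlier (creating a branch via `refs/heads/`, versus sending a change for review via `refs/for/`) can be summarised in a short sketch. The helper names are hypothetical and the functions only build the `git push` argument lists; they do not run anything or check permissions.

```python
import re
from typing import List

_SHA1 = re.compile(r"^[0-9a-f]{40}$")


def create_branch_command(remote: str, start_sha1: str, branch: str) -> List[str]:
    """Build the push that creates `branch` at an existing commit."""
    if not _SHA1.match(start_sha1):
        raise ValueError("expected a full 40-character sha1")
    return ["git", "push", remote, f"{start_sha1}:refs/heads/{branch}"]


def review_command(remote: str, branch: str, local_ref: str = "HEAD") -> List[str]:
    """Build the push that uploads a change for review on an existing branch."""
    return ["git", "push", remote, f"{local_ref}:refs/for/{branch}"]


# The branch has to exist before a review push to it can succeed.
print(create_branch_command("gerrit", "2ad087212c0b376cb65eea2bf588d8e2679bb9c3", "beta"))
print(review_command("gerrit", "beta"))
```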
[16:00:50] :-/ [16:01:01] no idea when it'll be back [16:01:04] * YuviPanda welcomes self to India [16:13:08] (03PS1) 10Odder: Fix a few English Wikimedia Planet entries [operations/puppet] - 10https://gerrit.wikimedia.org/r/138836 [16:13:49] *click* [16:14:35] (03PS2) 10Odder: Fix a few English Wikimedia Planet entries [operations/puppet] - 10https://gerrit.wikimedia.org/r/138836 [16:14:35] Hi, is someone here capable of reviewing changes for the WikimediaMaintenance extension https://gerrit.wikimedia.org/r/#/c/135522/2 ? [16:17:14] (03CR) 10Nemo bis: Fix a few English Wikimedia Planet entries (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/138836 (owner: 10Odder) [16:20:34] (03PS3) 10Odder: Fix two English Wikimedia Planet entries [operations/puppet] - 10https://gerrit.wikimedia.org/r/138836 [16:20:52] (03CR) 10Odder: Fix two English Wikimedia Planet entries (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/138836 (owner: 10Odder) [16:21:55] (03PS4) 10Odder: Fix two English Wikimedia Planet entries [operations/puppet] - 10https://gerrit.wikimedia.org/r/138836 [16:22:58] (03CR) 10Nemo bis: [C: 031] Fix two English Wikimedia Planet entries [operations/puppet] - 10https://gerrit.wikimedia.org/r/138836 (owner: 10Odder) [16:23:04] Nemo_bis: The Atom feed for Santhosh doesn't work on the Planet [16:23:48] twkozlowski: weird, I'd report it as bug [16:24:00] (against Venus) [16:29:53] quick question - does anybody know when MediaWiki:history copyright was made redundant ? 
[16:30:03] !log rebooting lvs3004 (inactive uploads LVS) for 3.13 again [16:30:08] Logged the message, Master [16:30:51] NotASpy: not the right channel [16:39:36] (CR) Ori.livneh: [C: 2] Update stdlib to latest supported release from PuppetLabs [operations/puppet] - https://gerrit.wikimedia.org/r/138635 (owner: Ori.livneh) [16:43:02] !log shutting off lvs3002.esams pybal to test XPS balancing of live traffic on lvs3004.esams + 3.13 [16:43:07] Logged the message, Master [16:45:41] _joe_: thanks for the redis patch. i think the way you did it is fine; i would have removed the -1 if i had a chance [16:54:16] (PS1) Jgreen: more tweaking OTRS exim_messages_in alert thresholds [operations/puppet] - https://gerrit.wikimedia.org/r/138842 [16:56:33] (CR) Ori.livneh: [C: 1] contint: reduce duplication with mediawiki::packages [operations/puppet] - https://gerrit.wikimedia.org/r/138804 (owner: Hashar) [16:58:49] (CR) Jgreen: [C: 2 V: 1] more tweaking OTRS exim_messages_in alert thresholds [operations/puppet] - https://gerrit.wikimedia.org/r/138842 (owner: Jgreen) [17:00:04] yurik: The time is nigh to deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140611T1700) [17:02:16] (CR) Ori.livneh: "I talked about this change briefly with paravoid. The log might be useful for reviewers: http://p.defau.lt/?ko7mQr9kmoa3TFnm2nv0Ow" [operations/puppet] - https://gerrit.wikimedia.org/r/138769 (owner: Ori.livneh) [17:05:23] (CR) Aklapper: [C: 1] "+1 Looks good to me, can go live." [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/129671 (owner: Odder) [17:12:48] (CR) Dzahn: [C: 2] Add Tony Thomas to the English Planet Wikimedia [operations/puppet] - https://gerrit.wikimedia.org/r/138796 (owner: Odder) [17:12:54] greg-g, i should be ready to depl shortly, massive update to zero infrastructuring coming our way... beware...
:) [17:13:12] and yes, it has been running on betalabs for a bit :) [17:13:59] (PS4) Ori.livneh: apache: replace apache::mod::* hierarchy with simpler equivalent from MWV [operations/puppet] - https://gerrit.wikimedia.org/r/138769 [17:14:01] (PS1) Ori.livneh: apache: include mod_version and mod_access_compat by default [operations/puppet] - https://gerrit.wikimedia.org/r/138846 [17:14:14] yurikR: whew ;) [17:14:16] have fun [17:15:45] (PS5) Dzahn: Fix two English Wikimedia Planet entries [operations/puppet] - https://gerrit.wikimedia.org/r/138836 (owner: Odder) [17:17:35] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC [17:18:37] (CR) Dzahn: [C: 2] Fix two English Wikimedia Planet entries [operations/puppet] - https://gerrit.wikimedia.org/r/138836 (owner: Odder) [17:23:31] (PS1) Ori.livneh: Fix-up for Ibceb48bd8: correct path reference for logrotate config [operations/puppet] - https://gerrit.wikimedia.org/r/138849 [17:24:06] https://gerrit.wikimedia.org/r/#/c/138849/ and https://gerrit.wikimedia.org/r/#/c/138780/ are one-liner fixup commits, cld someone +1?
[17:24:15] (CR) Dzahn: [C: 1] "maybe also remove the 'pmtpa' => [ "mw1017.eqiad.wmnet" ], test server below it right away" [operations/puppet] - https://gerrit.wikimedia.org/r/138811 (owner: Matanya) [17:25:47] (PS2) Matanya: cache: remove pmtpa hosts [operations/puppet] - https://gerrit.wikimedia.org/r/138811 [17:26:09] (CR) Dzahn: [C: 2] contint: localhost.mediawiki vhost on ci labs slave [operations/puppet] - https://gerrit.wikimedia.org/r/135529 (owner: Hashar) [17:26:17] (Abandoned) Manybubbles: WIP: Add some plugins to labs [operations/software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/130969 (owner: Manybubbles) [17:26:41] (Abandoned) Manybubbles: Update highlighter to 0.0.9 [operations/software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/133771 (owner: Manybubbles) [17:30:04] Coren: could I bug you for +1s on https://gerrit.wikimedia.org/r/#/c/138849/ and https://gerrit.wikimedia.org/r/#/c/138780/ ? They're both uncontroversial one-liner fixups [17:30:43] <_joe_> mutante: oh you already fixed the broken planet feeds, thanks :) [17:30:52] ori: In a meeting; I'll take a look after? [17:30:56] (CR) Dzahn: [C: 1] Fix-up for Ibceb48bd8: correct path reference for logrotate config [operations/puppet] - https://gerrit.wikimedia.org/r/138849 (owner: Ori.livneh) [17:31:09] Coren: sure, sorry for the ping.
looks like daniel beat you to it [17:31:11] thanks mutante [17:31:26] (CR) Ori.livneh: [C: 2] Fix-up for Ibceb48bd8: correct path reference for logrotate config [operations/puppet] - https://gerrit.wikimedia.org/r/138849 (owner: Ori.livneh) [17:31:57] (CR) Dzahn: [C: 1] Fix-up for I28bcbab76: strip whitespace from clear_magick_tmp cronjob [operations/puppet] - https://gerrit.wikimedia.org/r/138780 (owner: Ori.livneh) [17:32:37] many thanks [17:32:46] (CR) Ori.livneh: [C: 2] Fix-up for I28bcbab76: strip whitespace from clear_magick_tmp cronjob [operations/puppet] - https://gerrit.wikimedia.org/r/138780 (owner: Ori.livneh) [17:32:55] ori: np, just can't do the https://gerrit.wikimedia.org/r/#/c/138804/2 right now, maybe you can get Coren for that later:) [17:33:11] mutante: did your suggestion [17:33:35] nod nod [17:35:25] matanya: sorry, looking at it again i should probably not have made that comment... [17:35:33] it says pmtpa but then it's an eqiad server [17:35:37] i'm not sure why [17:35:50] <_joe_> mutante: I'm looking at this change right now [17:35:56] right, that is why i didn't touch it in first place [17:36:01] mutante: identity crisis [17:36:05] _joe_: what do you think of it? yea, the one touching cache.pp [17:38:00] (CR) Giuseppe Lavagetto: "LGTM in general, not sure if the solution you found (the class declaration chain) will work." (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/138804 (owner: Hashar) [17:38:13] <_joe_> ok, off again [17:38:39] so mutante what do you suggest ? [17:40:18] greg-g, how busy is the schedule today?
running out of time [17:40:51] (Abandoned) Ori.livneh: deploy apache-config via trebuchet [operations/puppet] - https://gerrit.wikimedia.org/r/136250 (owner: Ori.livneh) [17:41:09] yurikR: multimedia has a window in 50 minutes [17:41:25] will try to make it, if not will go to plan B [17:44:20] matanya: not sure, in doubt back to PS1, sorry for that [17:44:39] fixing, no worries [17:44:45] (CR) Dzahn: [C: 2] Move queries for bugs with ASSIGNED status [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/129671 (owner: Odder) [17:44:55] (Abandoned) Aaron Schulz: Avoid letting warnings into the runJobs.php DB name param [operations/puppet] - https://gerrit.wikimedia.org/r/135133 (owner: Aaron Schulz) [17:45:28] (CR) Dzahn: [V: 2] "no jenkins here" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/129671 (owner: Odder) [17:46:58] (PS3) Matanya: cache: remove pmtpa hosts [operations/puppet] - https://gerrit.wikimedia.org/r/138811 [17:48:05] (CR) Dzahn: "shows up as expected. https://bugzilla.wikimedia.org/buglist.cgi?bug_status=ASSIGNED&f1=assigned_to&list_id=320993&o1=equals&v1=%25user%25" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/129671 (owner: Odder) [17:48:36] (PS1) Ori.livneh: Replace references to /etc/apache2/wmf symlink with its link target [operations/apache-config] - https://gerrit.wikimedia.org/r/138858 [17:49:21] (CR) jenkins-bot: [V: -1] Replace references to /etc/apache2/wmf symlink with its link target [operations/apache-config] - https://gerrit.wikimedia.org/r/138858 (owner: Ori.livneh) [17:49:59] (CR) Ori.livneh: "hashar, ???
^" [operations/apache-config] - https://gerrit.wikimedia.org/r/138858 (owner: Ori.livneh) [17:50:29] yay mutante [17:51:39] ori: that path doesn't exist on gallium it seems [17:51:43] Could not open configuration file /usr/local/apache/conf/nonexistent.conf: No such file or directory [17:51:47] twkozlowski: :) [17:52:03] mutante: in your spare time [17:52:11] mutante: yeah, but that's bogus. that's where it is on all the app servers. so the job config is assuming too much :/ [17:53:57] i'll create the reverse symlink on gallium and email hashar about a proper fix [17:55:03] actually no, that's not proper [17:55:10] * ori looks at the job config [17:56:48] anomie: maybe https://gerrit.wikimedia.org/r/#/c/138117/ could be in a swat deploy [17:57:31] AaronSchulz: What's broken that it's fixing? [17:59:42] (PS1) Jgreen: another tweak of exim/ganglia OTRS monitoring thresholds [operations/puppet] - https://gerrit.wikimedia.org/r/138862 [17:59:52] see exception.log [18:00:25] I think it would follow trusted proxies in the chain even if they point to "unknown" and then die on that invalid IP [18:00:40] (CR) Dzahn: [C: 2] cache: remove pmtpa hosts [operations/puppet] - https://gerrit.wikimedia.org/r/138811 (owner: Matanya) [18:01:03] so.. i really hope nothing explodes when we remove the unexisting pmtpa servers..
but how should it :) [18:01:13] matanya: going for it [18:01:32] thanks [18:01:33] (PS1) Yurik: Deploying new Zero extensions, but not switching to it just yet [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138864 [18:02:17] (CR) Ori.livneh: "hashar/krinkle: fix for job config: https://gerrit.wikimedia.org/r/#/c/138863/" [operations/apache-config] - https://gerrit.wikimedia.org/r/138858 (owner: Ori.livneh) [18:02:50] mutante: i submitted an update to the job config, i guess i'll shelve the apache-config patch until hashar or krinkle have a chance to deploy the updated config [18:03:19] !log Reloading Zuul to deploy I5d154a4002d08 [18:03:19] ori: sounds good to me [18:03:23] Logged the message, Master [18:03:34] oh hey, a Krinkle appears [18:04:04] operations-apache-config.yaml ? uh.. interesting [18:04:10] I wish such miracles would happen to me [18:04:11] Krinkle|detached: https://gerrit.wikimedia.org/r/#/c/138863/1 [18:07:33] (CR) Jgreen: [C: 2 V: 1] another tweak of exim/ganglia OTRS monitoring thresholds [operations/puppet] - https://gerrit.wikimedia.org/r/138862 (owner: Jgreen) [18:09:05] greg-g, better not risk scaping now, i would like to do it between parsoid and swat (14:00 - 16:00 PDT) [18:11:02] twkozlowski (attn: JohnLewis): no, i don't block the bug and not even for any kind of relationship with juan. if there is no consensus, there is no consensus. re-setting is not trivial, so rather set later for ever, than immediately for a month or two and then re-set [18:11:38] Danny_B: Looking at the bug; I see consensus. [18:11:45] AaronSchulz: Yeah, it was choking on the 'unknown' in there. I'd say it could be SWATted, sure. I expect you'll use the afternoon window rather than the morning one, though? [18:11:48] JohnLewis: where? [18:12:05] _joe_: when you commented about planet earlier, did you mean RT #7669 ? because i just read that now [18:12:07] At the link provided in the bug.
[18:12:23] <_joe_> mutante: yeah that is addressed btw [18:12:32] JohnLewis: do you speak czech? [18:12:38] _joe_: permission error seems odd, this worked all the time [18:12:43] (PS1) Yuvipanda: toollabs: Add tcl8.5-dev package [operations/puppet] - https://gerrit.wikimedia.org/r/138870 [18:12:45] _joe_: do you know how? [18:12:52] Coren: ^ for gifti [18:12:59] <_joe_> mutante: someone ran it as root [18:13:02] Danny_B: No, but like most people in here - google translate works a basic treat [18:13:14] <_joe_> mutante: it as in "planet" [18:13:34] YuviPanda: May have to do this for 8.6 rather. [18:13:41] _joe_: ah yea, that explains it, it needs to be run with "sudo -u planet" [18:13:43] Coren: ah, right. or downgrade tcl [18:13:53] <_joe_> so probably whoever merged the original patch ran puppet on the host, then ran planet without sudo [18:14:13] <_joe_> everyone saw the articles, so everything was fine [18:14:16] there should be no need to manually run planet, it's a cron [18:14:25] alright, good [18:14:43] YuviPanda: We have both. [18:14:51] <_joe_> yeah, I do understand that one wants to see feedback sooner than 1-2 hours later [18:14:54] Coren: ah, hmm. ok [18:15:07] JohnLewis: the only consensus there is that it should be czech names, however, the format hasn't reached any consensus yet [18:15:23] YuviPanda: Also, that needs to go into dev_environ not exec_environ. :-) [18:15:27] I'll +2 if you fix. [18:15:51] _joe_: yep, one can copy the command from "crontab -u planet -l" and prepend sudo -u [18:16:37] AFAIK the only reason we sent it back to the community was to make a decision 'Czech or English' [18:16:52] <_joe_> yeah, of course ;) [18:17:00] <_joe_> it was clearly a mistake [18:18:23] Coren: ah, hmm.
true that [18:18:53] (PS2) Ottomata: Fix for $critical param on monitor_ganglia define [operations/puppet] - https://gerrit.wikimedia.org/r/138822 [18:18:58] Coren: updated [18:18:59] (CR) Ottomata: [C: 2 V: 2] Fix for $critical param on monitor_ganglia define [operations/puppet] - https://gerrit.wikimedia.org/r/138822 (owner: Ottomata) [18:19:02] (PS2) Yuvipanda: toollabs: Add tcl8.5-dev package [operations/puppet] - https://gerrit.wikimedia.org/r/138870 [18:19:23] JohnLewis: there is no ordered format of the names, and since we use certain naming conventions on cs wikis, it should be used here as well, however, some people deliberately block it, hence why no consensus on format has been reached yet [18:19:53] move to query instead of the channel [18:22:04] (PS1) Ottomata: Not renaming monitor_ganglia warning and critical parameters [operations/puppet] - https://gerrit.wikimedia.org/r/138875 [18:22:30] (CR) Ori.livneh: "recheck" [operations/apache-config] - https://gerrit.wikimedia.org/r/138858 (owner: Ori.livneh) [18:22:35] (CR) Ottomata: [C: 2 V: 2] Not renaming monitor_ganglia warning and critical parameters [operations/puppet] - https://gerrit.wikimedia.org/r/138875 (owner: Ottomata) [18:22:59] \o/ it worked [18:23:03] thanks again Krinkle [18:23:17] (PS2) Ori.livneh: Replace references to /etc/apache2/wmf symlink with its link target [operations/apache-config] - https://gerrit.wikimedia.org/r/138858 [18:23:21] (CR) coren: [C: 2] toollabs: Add tcl8.5-dev package [operations/puppet] - https://gerrit.wikimedia.org/r/138870 (owner: Yuvipanda) [18:24:12] ori: hehe, you have no idea how perfect that just was [18:24:43] I was staging it temporarily. Your recheck triggered that one (not the live one), then mine (intended for the stage) went to the live one after I saved it for real.
[18:24:52] heh [18:25:15] (CR) QChris: "Yay, for getting more data out to the public +1 \o/" [operations/puppet] - https://gerrit.wikimedia.org/r/49678 (owner: Ottomata) [18:27:20] legoktm: Do you see a chance to do a code review for https://gerrit.wikimedia.org/r/#/c/135522/. [18:27:47] (PS1) Jgreen: hopefully one last tweak otrs exim icinga thresholds [operations/puppet] - https://gerrit.wikimedia.org/r/138876 [18:28:33] (PS1) Ori.livneh: mediawiki/apache: load all.conf from canonical path rather than symlink [operations/puppet] - https://gerrit.wikimedia.org/r/138877 [18:30:05] marktraceur: The time is nigh to deploy Multimedia (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140611T1830) [18:30:22] LO [18:30:26] And I forgot [18:30:29] Thanks jouncebot [18:31:06] isn't he great marktraceur ? [18:31:39] PROBLEM - jmxtrans on analytics1012 is CRITICAL: Connection refused by host [18:31:39] PROBLEM - DPKG on analytics1012 is CRITICAL: Connection refused by host [18:31:49] PROBLEM - Disk space on analytics1012 is CRITICAL: Connection refused by host [18:31:49] PROBLEM - puppet disabled on analytics1012 is CRITICAL: Connection refused by host [18:32:00] PROBLEM - Kafka Broker Server on analytics1012 is CRITICAL: Connection refused by host [18:32:19] PROBLEM - RAID on analytics1012 is CRITICAL: Connection refused by host [18:32:20] more analytics stuff falling over :) [18:32:29] PROBLEM - check configured eth on analytics1012 is CRITICAL: Connection refused by host [18:32:36] cool, new notices! [18:32:39] PROBLEM - check if dhclient is running on analytics1012 is CRITICAL: Connection refused by host [18:32:45] :-D Fun [18:32:51] no this is ok, i'm bringing up a new host, looks like the notices are firing before stuff is running [18:32:59] ah, whew [18:33:28] <_joe_> ottomata: has this host existed before?
[18:33:38] <_joe_> if so, we may have had silenced alarms [18:36:10] not as kafka, it was a hadoop datanode [18:36:13] i decommed it first [18:37:05] (CR) Nemo bis: "Thanks for pointing that out. Someone knowledgeable could clarify what sort of requests would mostly/actually get caught by that and summa" [operations/puppet] - https://gerrit.wikimedia.org/r/49678 (owner: Ottomata) [18:37:19] RECOVERY - RAID on analytics1012 is OK: OK: no disks configured for RAID [18:37:29] RECOVERY - check configured eth on analytics1012 is OK: NRPE: Unable to read output [18:37:39] RECOVERY - check if dhclient is running on analytics1012 is OK: PROCS OK: 0 processes with command name dhclient [18:37:39] RECOVERY - jmxtrans on analytics1012 is OK: PROCS OK: 1 process with command name java, args -jar jmxtrans-all.jar [18:37:39] RECOVERY - DPKG on analytics1012 is OK: All packages OK [18:37:44] (CR) Jgreen: [C: 2 V: 1] hopefully one last tweak otrs exim icinga thresholds [operations/puppet] - https://gerrit.wikimedia.org/r/138876 (owner: Jgreen) [18:37:49] RECOVERY - Disk space on analytics1012 is OK: DISK OK [18:37:49] RECOVERY - puppet disabled on analytics1012 is OK: OK [18:37:49] PROBLEM - Kafka Broker Messages In on analytics1012 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 [18:37:56] ottomata: RECOVERY - check configured eth on analytics1012 is OK: NRPE: Unable to read output ??? [18:38:00] RECOVERY - Kafka Broker Server on analytics1012 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [18:39:20] matanya: dunno?
[18:39:31] <_joe_> ottomata: decomming - if you did not remove its facts well before reinstalling, would mean icinga still saw the host [18:40:15] i did [18:40:26] puppetstoredconfig clean or whatever [18:40:37] i saw that the host was removed from the icinga configs on neon too [18:41:11] <_joe_> oh ok, then it's quite the coincidence [18:41:40] <_joe_> puppet must have stored the configs on the db and puppet on neon ran before the first puppet run on that host completed [18:42:00] <_joe_> as configs are stored at compile time [18:42:17] <_joe_> and not at apply time [18:42:23] <_joe_> another beautiful implementation decision by our beloved puppet devs [18:42:51] heheh [18:43:49] yeah, weird [18:44:05] (PS2) Nemo bis: Rsyncing slow-parse logs from fluorine to dumps.wikimedia.org. [operations/puppet] - https://gerrit.wikimedia.org/r/49678 (owner: Ottomata) [18:44:09] PROBLEM - NTP on analytics1012 is CRITICAL: NTP CRITICAL: Offset unknown [18:44:11] !log disabling puppet on analytics1012 to allow for more replica threads to catch up with current broker replicas...maybe :) [18:44:16] Logged the message, Master [18:44:42] (CR) Nemo bis: "I added a summary of the conversation to the commit message, please improve." [operations/puppet] - https://gerrit.wikimedia.org/r/49678 (owner: Ottomata) [18:44:59] (CR) jenkins-bot: [V: -1] Rsyncing slow-parse logs from fluorine to dumps.wikimedia.org. [operations/puppet] - https://gerrit.wikimedia.org/r/49678 (owner: Ottomata) [18:45:29] RoanKattouw: Having "-dirty" at the end of a commit in a submodule is a problem, I fear. Confirm?
[18:45:43] commit hash* [18:45:44] marktraceur: It won't actually get committed that way [18:45:49] PROBLEM - Disk space on analytics1020 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 99801 MB (5% inode=99%): /var/lib/hadoop/data/g 73836 MB (3% inode=99%): [18:45:57] It's git's way of telling you that you have local uncommitted changes in the submodule dir [18:46:02] Ah, kay [18:46:02] sheesh [18:46:03] Those may also just be untracked files [18:46:15] (PS1) Milimetric: Add logging to limn mobile data job [operations/puppet] - https://gerrit.wikimedia.org/r/138884 [18:46:52] RoanKattouw: They are, they're untracked .swp files [18:48:50] (CR) Ottomata: [C: 1] "Ok with me. How large will this file grow? Does it need a logrotate config?" [operations/puppet] - https://gerrit.wikimedia.org/r/138884 (owner: Milimetric) [18:50:06] (PS1) Ottomata: Refinery keeping 31 days of data, deleting every 4 hours [operations/puppet] - https://gerrit.wikimedia.org/r/138886 [18:53:09] RECOVERY - NTP on analytics1012 is OK: NTP OK: Offset -0.01923823357 secs [18:53:10] (CR) Ottomata: [C: 2 V: 2] Refinery keeping 31 days of data, deleting every 4 hours [operations/puppet] - https://gerrit.wikimedia.org/r/138886 (owner: Ottomata) [18:53:56] (PS1) BBlack: Add XPS support to interface-rps.py [operations/puppet] - https://gerrit.wikimedia.org/r/138887 [18:56:57] greg-g, any objections to https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140611T2100 ? [18:57:55] yurikR: no [18:57:57] yurikR: do it [18:58:01] (CR) BBlack: [C: 2 V: 2] Add XPS support to interface-rps.py [operations/puppet] - https://gerrit.wikimedia.org/r/138887 (owner: BBlack) [18:58:01] OK looks good [18:58:08] I'm going to scap - have at least one message change [18:58:32] cool, thx. gwicke, let me know if you finish depl early. thx!
[18:58:33] (PS1) Jgreen: grr, restore warning= parameter to otrs/exim/nagios [operations/puppet] - https://gerrit.wikimedia.org/r/138888 [18:58:47] (PS1) Ori.livneh: 2.4 compat: load mod_filter for AddOutputFilterByType [operations/apache-config] - https://gerrit.wikimedia.org/r/138889 [18:59:10] !log marktraceur Started scap: MultimediaViewer fixes for cards 630, 429, and 697 [18:59:15] Logged the message, Master [18:59:59] PROBLEM - Kafka Broker Replica Lag on analytics1012 is CRITICAL: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value CRITICAL: 1601585220.0 [19:00:04] ottomata: ^ [19:00:17] (CR) Jgreen: [C: 2 V: 1] grr, restore warning= parameter to otrs/exim/nagios [operations/puppet] - https://gerrit.wikimedia.org/r/138888 (owner: Jgreen) [19:00:26] yup yup yup [19:00:39] i just told it that it was a replica of all the topics [19:00:48] now it's all like "AAHH I'm SO FAR BEHIND" [19:01:01] i do that all the time [19:03:06] (CR) Milimetric: "gr, yea, we probably will need to rotate this as it's running every 30 minutes.
We don't need to keep more than like a day of logs though" [operations/puppet] - https://gerrit.wikimedia.org/r/138884 (owner: Milimetric) [19:03:14] !log rebooting lvs3002 for 3.13 kernel + XPS [19:03:18] Logged the message, Master [19:05:31] (PS2) Ori.livneh: mediawiki/apache: load all.conf from canonical path rather than symlink [operations/puppet] - https://gerrit.wikimedia.org/r/138877 [19:05:33] (PS1) Ori.livneh: mw/apache 2.4 compat: remove DefaultType directive [operations/puppet] - https://gerrit.wikimedia.org/r/138891 [19:08:20] (PS1) Ottomata: Use $brokers_array in kafka monitoring view rather than hardcoding kafka brokers [operations/puppet] - https://gerrit.wikimedia.org/r/138893 [19:09:49] (CR) QChris: "> edits are already public (one could also match the length" [operations/puppet] - https://gerrit.wikimedia.org/r/49678 (owner: Ottomata) [19:11:43] (CR) Ottomata: [C: 2 V: 2] Use $brokers_array in kafka monitoring view rather than hardcoding kafka brokers [operations/puppet] - https://gerrit.wikimedia.org/r/138893 (owner: Ottomata) [19:12:44] * bblack grumbles about unack'd / not-in-downtime problems on the icinga tactical overview :P [19:12:51] 1 apaches had sync errors [19:12:59] Gah bd808|BUFFER what are you doing to me [19:13:04] which apache? [19:13:04] Pluralisation man [19:13:24] mw1151 returned [255]: Permission denied (publickey). [19:13:46] greg-g, I need to get a centralnotice thing out today but it's a bit big to fit in the swat [19:13:56] my ideas are either; I do the swat + my thing [19:14:01] or I do it after the swat [19:14:35] and sadly I cannot cherry pick the urgent change because I already had something lined up in my deploy branch to go out on the train [19:16:30] mwalker: sure, option A sounds fine (swat then you), or if yurikR is done early enough [19:16:52] marktraceur: that's a broken host [19:17:02] greg-g: Expectedly broken?
[19:17:07] :D [19:17:18] !log mw1151 *still* giving permission denied errors (publickey), what's the status, yo? [19:17:22] Logged the message, Master [19:17:26] marktraceur: yeah, unfortunately [19:17:48] 'kay [19:17:49] can I complain about server decom/recom issues again? [19:17:56] !log marktraceur Finished scap: MultimediaViewer fixes for cards 630, 429, and 697 (duration: 18m 45s) [19:18:01] Logged the message, Master [19:18:09] !log rebooting lvs3003 for 3.13 kernel [19:18:15] Logged the message, Master [19:18:27] mwalker, will post when done. Let's hope it doesn't take too long :) [19:18:33] kk; thanks :) [19:19:30] OK, I'm done [19:19:34] greg-g, you can complain, but I don't know if anyone will listen :p [19:19:34] Thanks greg-g, etc. [19:19:55] Wait. [19:20:18] greg-g: i18n cache may not be updating, if it's still borked at SWAT time I may bother the SWATters to run scap. [19:21:33] Or...hm [19:21:46] why not? [19:21:50] why isn't it updating? [19:21:59] mwalker: seems to be the M.O. :) [19:25:05] greg-g: Not sure; I see the message locally on my dev machine but it's not showing up in the UI on commons [19:25:30] multimediaviewer-viewfile-link - shows up in the API too [19:25:37] odd [19:26:20] Exception from line 182 of /usr/local/apache/common-local/php-1.24wmf8/extensions/Flow/includes/Model/Workflow.php: Interwiki to enwiki not implemented [19:26:31] * AaronSchulz wonders if anyone is on that [19:26:43] spagewmf: ^ [19:36:37] (PS1) Ottomata: Use all disks for kafka data on analytics1012 [operations/puppet] - https://gerrit.wikimedia.org/r/138899 [19:44:16] mutante: another thing about Planet; any ideas why http://muddybtz.blog.com/feed/ shows up as a 502?
[19:44:34] the feed opens all right for me [19:46:13] Does anyone use "markerType" => 'nowiki' in his extension (http://www.mediawiki.org/wiki/Manual:Tag_extensions#How_can_I_avoid_modification_of_my_extension.27s_HTML_output.3F) for the Math extension this does not seem to work see http://deployment.wikimedia.beta.wmflabs.org/wiki/User:Anomie/Sandbox [19:46:52] sorry wrong window [19:53:02] (PS1) Odder: Raise account creation limit for eswiki GLAM event [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138902 (https://bugzilla.wikimedia.org/66491) [19:55:00] (CR) Ottomata: [C: 2 V: 2] Use all disks for kafka data on analytics1012 [operations/puppet] - https://gerrit.wikimedia.org/r/138899 (owner: Ottomata) [19:56:35] (PS2) Odder: Raise account creation limit for eswiki outreach event [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138902 (https://bugzilla.wikimedia.org/66491) [19:57:34] (PS1) Matanya: cache: remove pointer to pmtpa for labs [operations/puppet] - https://gerrit.wikimedia.org/r/138903 [20:00:02] (CR) Anomie: [C: 1] Raise account creation limit for eswiki outreach event [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138902 (https://bugzilla.wikimedia.org/66491) (owner: Odder) [20:00:04] gwicke, subbu: The time is nigh to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140611T2000) [20:00:09] ACKNOWLEDGEMENT - Kafka Broker Messages In on analytics1012 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 ottomata This is a new broker and it is currently copying replicas from the existing broker. [20:00:09] ACKNOWLEDGEMENT - Kafka Broker Replica Lag on analytics1012 is CRITICAL: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value CRITICAL: 1568931452.0 ottomata This is a new broker and it is currently copying replicas from the existing broker.
[20:04:03] RECOVERY - Kafka Broker Replica Lag on analytics1012 is OK: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value OKAY: 0.0 [20:04:53] RECOVERY - Disk space on analytics1020 is OK: DISK OK [20:07:35] !log deployed Parsoid 3de0dba15 [20:07:41] Logged the message, Master [20:13:52] yurikR: I'm done [20:15:14] gwicke, greg-g, beginning in 10 [20:16:00] kk [20:18:03] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC [20:21:32] (CR) Hashar: contint: reduce duplication with mediawiki::packages (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/138804 (owner: Hashar) [20:24:13] PROBLEM - Kafka Broker Under Replicated Partitions on analytics1012 is CRITICAL: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value CRITICAL: 20.0 [20:24:21] good! [20:25:07] (CR) Hashar: cache: remove pointer to pmtpa for labs (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/138903 (owner: Matanya) [20:26:13] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1012 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [20:27:25] hashar: i can bring back the parsoid one, other than that, is it +1 for you ?
[20:27:37] matanya: yes :)) [20:27:58] matanya: I'm heading to bed in a minute so I haven't really looked at the beta cluster varnish cache for parsoid [20:28:08] matanya: but it is most probably bugged so I got to fix that up [20:28:26] (CR) Yurik: [C: 2] Deploying new Zero extensions, but not switching to it just yet [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138864 (owner: Yurik) [20:29:00] hashar: thanks, will supply a fix, you can +1 tomorrow :) [20:29:03] PROBLEM - Kafka Broker Replica Lag on analytics1012 is CRITICAL: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value CRITICAL: 1593753193.0 [20:30:23] PROBLEM - Kafka Broker Server on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [20:30:33] matanya: filed the issue as bug https://bugzilla.wikimedia.org/show_bug.cgi?id=66497 [20:30:46] (Merged) jenkins-bot: Deploying new Zero extensions, but not switching to it just yet [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138864 (owner: Yurik) [20:30:48] (CR) Hashar: cache: remove pointer to pmtpa for labs (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/138903 (owner: Matanya) [20:31:01] matanya: thank you for the clean up.
Without it we would probably never have noticed the issue [20:31:03] PROBLEM - Kafka Broker Server on analytics1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [20:31:13] :) [20:33:22] (PS2) Matanya: cache: remove pointer to pmtpa for labs [operations/puppet] - https://gerrit.wikimedia.org/r/138903 [20:33:23] PROBLEM - Check status of defined EventLogging jobs on vanadium is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/kafka [20:33:29] (PS1) Ottomata: All brokers use all disks [operations/puppet] - https://gerrit.wikimedia.org/r/138910 [20:35:23] hehe, [20:35:27] sorry for all the kafka spam :/ [20:38:55] (CR) Ottomata: [C: 2 V: 2] All brokers use all disks [operations/puppet] - https://gerrit.wikimedia.org/r/138910 (owner: Ottomata) [20:40:23] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly [20:44:03] RECOVERY - Kafka Broker Server on analytics1012 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [20:44:03] RECOVERY - Kafka Broker Replica Lag on analytics1012 is OK: kafka.server.ReplicaFetcherManager.Replica-MaxLag.Value OKAY: 0.0 [20:44:23] RECOVERY - Kafka Broker Server on analytics1022 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [20:45:03] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [20:45:53] RECOVERY - Kafka Broker Messages In on analytics1012 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 3228.8101429 [20:59:12] (PS1) coren: Tool Labs: add libfcgi0ldbl for tcl fcgi [operations/puppet] - https://gerrit.wikimedia.org/r/138992 (https://bugzilla.wikimedia.org/56995) [20:59:26] (CR) jenkins-bot: [V: -1] Tool Labs: add libfcgi0ldbl for tcl fcgi [operations/puppet]
- 10https://gerrit.wikimedia.org/r/138992 (https://bugzilla.wikimedia.org/56995) (owner: 10coren) [21:00:04] yurik: The time is nigh to deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140611T2100) [21:00:34] !log yurik Started scap: Deploying 3 new ext (JsonConfig, ZeroBanner, ZeroPortal), but they are not enabled anywhere yet [21:00:39] Logged the message, Master [21:05:37] !log yurik Finished scap: Deploying 3 new ext (JsonConfig, ZeroBanner, ZeroPortal), but they are not enabled anywhere yet (duration: 05m 03s) [21:05:43] Logged the message, Master [21:13:32] Dafu? Can anyone figure out what Jenkins is smoking in re https://gerrit.wikimedia.org/r/#/c/138992/1 ? This changeset is against the head, there is nowhere to rebase to! [21:14:25] Coren: Sometimes jenkins is a bit blazed [21:14:29] Doesn't know where he is [21:14:37] Try a trivial change to the commit message or something [21:15:22] (03PS2) 10coren: Tool Labs: add libfcgi0ldbl for tcl fcgi [operations/puppet] - 10https://gerrit.wikimedia.org/r/138992 (https://bugzilla.wikimedia.org/56995) [21:17:03] PROBLEM - Puppet freshness on palladium is CRITICAL: Last successful Puppet run was Wed 11 Jun 2014 18:16:14 UTC [21:17:03] marktraceur: That did it. Huh. [21:17:17] (03CR) 10coren: [C: 032] "Trivial package addition." [operations/puppet] - 10https://gerrit.wikimedia.org/r/138992 (https://bugzilla.wikimedia.org/56995) (owner: 10coren) [21:17:28] Gotta get jenkins into a treatment program, man [21:18:29] Coren: there's a bug for that [21:21:56] Coren: https://bugzilla.wikimedia.org/show_bug.cgi?id=hash-mismatch [21:22:06] have fun getting someone to fix it [21:22:07] i failed [21:22:32] Eeew. Heisenbug. 
[21:23:07] you could find hundreds of immortalized cases of it in gerrit…
[21:24:43] !log yurik Started scap: (no message)
[21:24:49] Logged the message, Master
[21:25:23] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly
[21:31:03] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Wed 11 Jun 2014 18:30:02 UTC
[21:46:38] !log Disabling Puppet on mw1149. It's a former bits app server that isn't in PyBal so it isn't getting traffic. Going to stage some proposed changes for apache-config and operations/puppet there.
[21:46:42] Logged the message, Master
[21:50:35] !log yurik Finished scap: (no message) (duration: 25m 51s)
[21:50:48] Logged the message, Master
[21:56:22] !log yurik Synchronized php-1.24wmf7/extensions/JsonConfig/: (no message) (duration: 01m 09s)
[21:56:23] Logged the message, Master
[21:58:05] !log yurik Synchronized php-1.24wmf8/extensions/JsonConfig/: (no message) (duration: 01m 11s)
[21:58:10] Logged the message, Master
[22:06:47] !log yurik Synchronized wmf-config/InitialiseSettings.php: Attempting to enable new zero ext on zerowiki & ruwiki (duration: 01m 12s)
[22:06:52] Logged the message, Master
[22:17:33] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136
[22:19:54] <^d> !log restarted elasticsearch on logstash1003, complaining about heap.
[22:20:00] Logged the message, Master
[22:22:09] !log yurik Synchronized wmf-config/InitialiseSettings.php: Attempting to enable new zero ext on zerowiki & ruwiki - take2 (duration: 01m 06s)
[22:22:14] (PS1) Yurik: Safety check for zeroPortal [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139010
[22:22:14] Logged the message, Master
[22:39:23] RECOVERY - ElasticSearch health check on logstash1003 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[22:41:13] PROBLEM - Apache HTTP on mw1149 is CRITICAL: Connection refused
[22:41:17] Hi. a Hive newbie, was trying to run my first query, but I keep getting this execution error. http://pastebin.com/2giWCtGL
[22:44:19] (PS1) Yurik: Switching to ZeroBanner/Portal extension, disabling ZRMA [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139018
[22:44:41] (CR) Yurik: [C: 2] Safety check for zeroPortal [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139010 (owner: Yurik)
[22:44:47] (Merged) jenkins-bot: Safety check for zeroPortal [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139010 (owner: Yurik)
[22:45:35] mwalker: I'm going to go pick up our CSA share real quick, but godspeed on the swat+cn deploy
[22:45:42] (CR) Yurik: [C: 2] Switching to ZeroBanner/Portal extension, disabling ZRMA [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139018 (owner: Yurik)
[22:45:48] (Merged) jenkins-bot: Switching to ZeroBanner/Portal extension, disabling ZRMA [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139018 (owner: Yurik)
[22:45:54] greg-g, tasty :)
[22:51:28] (CR) John F. Lewis: [C: 1] "SOA" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78016 (https://bugzilla.wikimedia.org/52528) (owner: TTO)
[22:54:23] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly
[22:55:28] (PS1) Yurik: Revert "Switching to ZeroBanner/Portal extension, disabling ZRMA" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139023
[22:55:48] (CR) Yurik: [C: 2] Revert "Switching to ZeroBanner/Portal extension, disabling ZRMA" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139023 (owner: Yurik)
[22:55:54] (Merged) jenkins-bot: Revert "Switching to ZeroBanner/Portal extension, disabling ZRMA" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139023 (owner: Yurik)
[22:57:24] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly
[22:58:13] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.130 second response time
[22:58:29] !log yurik Synchronized wmf-config/: Restoring to ZRMA for now (duration: 01m 04s)
[22:58:34] Logged the message, Master
[22:58:57] (PS1) Yurik: Switching to ZeroBanner/Portal extension, disabling ZRMA [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139025
[22:59:34] (CR) Yurik: [C: -2] "Pending ZeroBanner ..." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139025 (owner: Yurik)
[23:00:04] RoanKattouw, mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140611T2300)
[23:00:57] mwalker: were you going to swat? i can, otherwise.
[23:01:21] there's a note on the calendar that says "This is big; Matt should do the swat", so i guess so
[23:02:00] yep; sorry, my irc client dropped
[23:02:02] I'm going to do it
[23:02:57] (CR) Mwalker: [C: 2] Give testwiki some custom namespaces [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78016 (https://bugzilla.wikimedia.org/52528) (owner: TTO)
[23:04:09] (Merged) jenkins-bot: Give testwiki some custom namespaces [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/78016 (https://bugzilla.wikimedia.org/52528) (owner: TTO)
[23:04:33] greg-g, even though I didn't accomplish everything I hoped, it was much less than feared. done for now
[23:09:41] mwalker: might have another patch coming your way for SWAT btw.
[23:10:32] !log Running deleteEqualMessages.php on trwiki (bug 43917)
[23:10:36] Logged the message, Master
[23:14:39] JohnLewis, got anything?
[23:14:42] otherwise I'm going to scap
[23:15:25] mwalker: not yet. Could you deploy what I'm looking at later in the window? If not, I'll shove it into tomorrows.
[23:16:21] put it in tomorrows please :)
[23:16:42] !log mwalker Started scap: SWAT deploy for MultimediaViewer, CentralNotice, and testwiki config
[23:16:43] mwalker: Will do then, I'll patch it somepoint tomorrow then :p
[23:16:47] Logged the message, Master
[23:19:03] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[23:24:18] James_F: Any update with https://gerrit.wikimedia.org/r/#/c/112590/?
[23:24:22] yurikR: "good"
[23:31:05] who has jenkins access?
[23:33:00] greg-g, is ssh public key denied on mw1151 a known thing?
[23:33:09] jackmcbarn, what do you need jenkins to do?
[23:33:28] e.g. just submit a job? log in to the web console? log in to the physical box?
[23:33:29] i need someone to implement this gerrit change, apparently it can't be done automatically
[23:33:30] https://gerrit.wikimedia.org/r/#/c/138612/
[23:33:39] (per Krinkle's comment)
[23:33:57] marktraceur, ^ can you help here?
[23:34:13] <^d> I can, just gotta remember how.
[23:34:59] I linked to the documentation link
[23:35:05] 7~Uh
[23:35:06] page
[23:35:31] mwalker: yeah, known
[23:35:35] mwalker: not an issue
[23:35:43] tgr, qchris -- i18n updates aren't out yet but code changes are
[23:35:48] so test away :)
[23:36:12] mwalker: Thanks ... testing.
[23:36:27] mwalker: longer story: mw1151 is a bits server, bits no longer exists so it isn't serving traffic, when _joe_ migrates it to the general mw app cluster he'll reinstall/rerun puppet
[23:36:57] <^d> jackmcbarn: I'm just waiting on jenkins to merge.
[23:36:59] <^d> Then I'll do it.
[23:37:00] ^d: basically 3 steps, 1) update jjjb (cd jjb; sudo python setup; if this is the first time, also set up symlink for config/), 2) clear outout/* and run the compiler to generate the jobs, 3) push the ones you want to push
[23:37:07] Please push before merging
[23:37:44] jjb-config is essentially a log / representation of what jenkins-accessible people have pushed to it.
[23:37:55] if you push first you can test an rollback if needed without merging in the repo
[23:38:11] and avoids conflicts, i'm deploying ori's change at the same time now
[23:38:11] mwalker: thanks, works for me
[23:38:25] mwalker: EventLogging is not more broken than before, but we still see broken country values.
[23:38:40] probably for people who's sessions are still open
[23:38:54] I was hoping you had a recreation that you could test
[23:38:55] yeah, they'll take a while to disappear completely, presumably
[23:39:02] <^d> All done.
[23:39:03] mwalker: Yup.
[23:39:32] <^d> jackmcbarn: You should be all set.
[23:39:39] ^d: thanks
[23:39:55] <^d> yw
[23:40:59] !log mwalker Finished scap: SWAT deploy for MultimediaViewer, CentralNotice, and testwiki config (duration: 24m 16s)
[23:41:03] Logged the message, Master
[23:41:32] Weird
[23:41:38] tgr: That message still isn't showing up
[23:42:12] <^d> jackmcbarn: 2 skipped tests, otherwise looks good! :)
[23:42:20] works for me on beta, maybe it got added in a different patch?
[23:42:37] mwalker: Thanks for the deploy. I'll keep an eye on the country values.
[23:42:44] *thumbs up*
[23:42:59] marktraceur, I only applied those patches to wmf8
[23:43:11] I'm on Commons testing
[23:43:26] mwalker: And actually the patch was *on* wmf8; it's seen two scaps now too
[23:43:55] The message shows up in the API but not on the client side of MMV
[23:44:14] marktraceur: did we add any other IF messages?
[23:44:41] * marktraceur assumes IF === i18n
[23:44:43] RoanKattouw, I feel like there is a bug where the resourceloader cache does not always properly get new messages?
[23:44:47] No, that was the only one
[23:44:49] (interface)
[23:45:03] some sort of ResourceLoader cache issue then?
[23:45:08] Mayyyybe
[23:45:21] RoanKattouw, and I feel like the last time this happened I had to run some special script
[23:45:37] Yes
[23:45:50] It's clearMessageBlobs.php
[23:46:38] Yay SWAT to the rescue
[23:47:21] ori: ^d: Used this one as example to update documentation a bit https://www.mediawiki.org/wiki/Continuous_integration/Jenkins_job_builder#Example
[23:48:41] <^d> yay more docs :)
[23:48:54] RoanKattouw, where does that script live? because it's not in maintenance apparently
[23:49:46] mwalker: extensions/WikimediaMaintenance
[23:49:49] Sorry, I forgot about that
[23:50:54] !log clearing resourceloader blobs on commonswiki to try and force a multimediaviewer message "mwscript extensions/WikimediaMaintenance/clearMessageBlobs.php --wiki=commonswiki"
[23:50:58] Logged the message, Master
[23:54:16] Hm.. that script should not be needed. Haven't seen that used in over a year. I think the problem is elsewhere.
[23:55:45] marktraceur, when you say "shows up in the api" what do you mean?
[23:55:52] Given it's not working, I agree
[23:55:58] mwalker: https://commons.wikimedia.org/w/api.php?action=query&meta=allmessages&ammessages=multimediaviewer-viewfile-link
[23:56:03] well... the script is still running
[23:56:09] Oh.
[23:56:35] of the many things I forgot about this script is that it runs on all wikis; it doesn't accept a --wiki argument
[23:56:47] and it doesn't tell you what wiki its on :)
[23:56:55] does it run for all wikis, or does it share a central cache?
[23:57:12] it's calling out to wgConf->getLocalDatabases()
[23:57:26] aye
[23:57:46] mwalker: Got a link to a bug?
[23:57:51] which message is it and where is not not appearing?
[23:58:17] marktraceur, ^
[23:58:40] Krinkle: last time we needed this was about a month ago, IIRC
[23:58:45] but it's not showing up in client side JS for multimediaviewer; and the message is "multimediaviewer-viewfile-link"
[23:58:52] if the problem does not resolve itself within 5 minutes it most likely will not change anything anywhere even if you fix it unless an actual "change" is made to the module (a touch is enough) because of the various caching layers.
[23:59:24] Was it a new message during the recent deployment?
[23:59:28] and it was successfully fixed by running a script at the time
[23:59:49] I can't remember if it was the same script though
[23:59:56] I'm almost convinced that was a coincidence if it was this script.
[23:59:59] yes, the message is new