[00:01:15] !log ori Synchronized php-1.25wmf1/extensions/WikimediaEvents: Update WikimediaEvents for If9cdde0f0 (duration: 00m 04s)
[00:01:19] Logged the message, Master
[00:01:23] !log ori Synchronized php-1.25wmf2/extensions/WikimediaEvents: Update WikimediaEvents for If9cdde0f0 (duration: 00m 03s)
[00:01:27] Logged the message, Master
[00:01:41] bd808: getting closer to stacking separately
[00:01:47] http://graphite.wmflabs.org/render/?width=749&height=369&_salt=1412726479.642&from=-2weeks&target=color(integration.integration-slave1006.memory.MemTotal.value%2C%22red%22)&target=stacked(integration.integration-slave1006.memory.%7BActive%2CBuffers%2CDirty%2CInactive%2CShmem%7D.value)&target=alpha(stacked(integration.integration-slave1006.memory.MemFree.value)%2C0.4)
[00:02:03] still missing some data points to account for the 1-2GB gap
[00:02:46] http://i.imgur.com/FecVGiH.png
[00:03:06] (03PS1) 10MaxSem: Remove unused log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165404
[00:03:12] any idea which they might be? I think Swap and Vmalloc aren't ram, so not sure which it is
[00:03:46] when I include memory.Cached it goes over
[00:03:57] * bd808 looks at data points
[00:04:06] Krinkle: I don't know if active/inactive/dirty should be counted
[00:04:26] hmm, I shall defer to bd808 atm :)
[00:06:01] Krinkle: What are you shooting for? A complete picture of ram?
[00:06:31] note I'm not trying to make the sum that will be used for alerts (I won't include MemFree obviously and probably others excluded as well).
[00:06:37] But trying to understand the data first
[00:06:58] which data points add up, which are negative (MemFree) which positive (Active) and which are maybe a subset of another..
[00:07:27] same for cpu. it's all very confusing. I tried digging into ganglia and diamond internals but it seems the separation exists already at the C level.
[00:08:03] I should be able to get a complete picture indeed
[00:08:08] * bd808 looks at https://github.com/BrightcoveOS/Diamond/blob/master/src/collectors/memory/memory.py
[00:08:11] ceeyenndiv: test
[00:10:44] Krinkle: Diamond reads /proc/meminfo and this seems to be a good description of what those numbers represent -- http://superuser.com/a/521552
[00:11:46] (03PS3) 10Springle: mysql_wmf: db1001 is in eqiad not in pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/165090 (owner: 10Matanya)
[00:11:53] (03CR) 10Springle: [C: 032] mysql_wmf: db1001 is in eqiad not in pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/165090 (owner: 10Matanya)
[00:13:03] So I think total = free + buffers + cache + mapped?
[00:14:04] no it would be more than that... + slab at least
[00:14:17] looks like those are not exposed
[00:14:24] mapped and slab
[00:14:51] yeah. free is free and otherwise why would you care :)
[00:15:34] well, at least now I know where to look
[00:15:34] https://gist.githubusercontent.com/Krinkle/5d4c06f6bbf591f65b59/raw
[00:15:34] VMallocChunk can be important depending on what your server is doing
[00:15:39] proc/meminfo from an int slave
[00:17:07] (03PS1) 10Ori.livneh: Word-wrap fix for comment block [puppet] - 10https://gerrit.wikimedia.org/r/165407
[00:17:17] (03CR) 10Ori.livneh: [C: 032 V: 032] Word-wrap fix for comment block [puppet] - 10https://gerrit.wikimedia.org/r/165407 (owner: 10Ori.livneh)
[00:18:07] bd808: Hm.. OK. So which would you say are significant when making a sum() of metrics that together will prompt an alert when it reaches % of total? because I imagine inactive is not entirely free but will be if needed etc.
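The thread above is working out which /proc/meminfo fields should count toward an alerting sum (the question continues just below). A minimal Python sketch of the idea, parsing /proc/meminfo the way Diamond's memory collector does; the choice of fields and the 95% threshold are illustrative assumptions, not what the team settled on:

```python
# Sketch: compute a "used memory" fraction from /proc/meminfo for alerting.
# Which fields to treat as reclaimable is exactly the open question in the
# discussion above; this variant subtracts Buffers and Cached along with
# MemFree, the common approximation on kernels without a MemAvailable field.

def read_meminfo(path='/proc/meminfo'):
    info = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(':', 1)
            info[key] = int(value.strip().split()[0])  # values are in kB
    return info

def used_fraction(info):
    reclaimable = info['MemFree'] + info['Buffers'] + info['Cached']
    return 1.0 - reclaimable / float(info['MemTotal'])

if __name__ == '__main__':
    frac = used_fraction(read_meminfo())
    print('memory used: %.1f%%' % (100 * frac))
    if frac > 0.95:  # hypothetical alert threshold
        raise SystemExit('CRITICAL: memory nearly exhausted')
```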
[00:18:22] e.g. the kind of thing icinga or nagios tends to report
[00:18:26] ganglia*
[00:18:37] (03PS4) 10Springle: mysql_wmf: db1001 is in eqiad not in pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/165090 (owner: 10Matanya)
[00:18:47] ganglia graphs add up: Use + Share + Cache + Buffer + Swap
[00:19:57] I would just alert on %free if I was worried about ram utilization
[00:20:56] either there is space for more stuff or there isn't. Doesn't really matter what's eating the space does it?
[00:22:21] Use probably equates to active+inactive
[00:22:34] (03PS4) 10Dzahn: remove Tampa db and es servers [dns] - 10https://gerrit.wikimedia.org/r/164257
[00:22:46] (03CR) 10jenkins-bot: [V: 04-1] remove Tampa db and es servers [dns] - 10https://gerrit.wikimedia.org/r/164257 (owner: 10Dzahn)
[00:23:44] bd808: comparing mw1039.eqiad.wmnet /proc/meminfo to http://ganglia.wikimedia.org/latest/graph.php?h=mw1039.eqiad.wmnet&g=mem_report&c=Application+servers+eqiad
[00:25:08] hmm. so maybe Use = Active
[00:25:22] huh, jenkins is now in portuguese?
[00:25:54] heh
[00:26:00] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail
[00:26:01] (03PS5) 10Dzahn: remove Tampa db and es servers [dns] - 10https://gerrit.wikimedia.org/r/164257
[00:26:07] https://integration.wikimedia.org/ci/ anyone else ?
[00:26:15] * aude doesn't mind
[00:26:21] aude: Yeah me too
[00:26:22] oh, yes
[00:26:25] confirmed
[00:26:27] Histórico de compilações
[00:26:35] i see
[00:26:36] help for search Entrar
[00:26:50] ori: ok to puppet-merge?
[00:27:01] 5 meses 24 dias
[00:27:13] springle: yes, sorry
[00:27:19] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[00:27:21] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[00:27:28] ori: np done
[00:27:38] Use=? Share=?, Cache=Cached, Buffer=Buffers, Swap=SwapCached
[00:27:39] there is also a "help translating (to English)" link there, heh
[00:27:44] bd808: looks like Use=Inactive, not Active
[00:27:49] Active is like 6G on that node
[00:27:55] use= 3.2G in ganglia
[00:28:19] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[00:28:20] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[00:28:51] aude, mutante: supposedly Jenkins does language detection based on the accept headers? -- https://wiki.jenkins-ci.org/display/JENKINS/How+to+view+Jenkins+in+your+language
[00:29:07] springle: still looks good? https://gerrit.wikimedia.org/r/#/c/164257/5
[00:29:24] bd808: configuration says still lang=en; force=1
[00:29:29] https://integration.wikimedia.org/ci/configure
[00:29:42] explicitly to ignore accept header and such
[00:30:30] (03CR) 10Springle: [C: 031] remove Tampa db and es servers [dns] - 10https://gerrit.wikimedia.org/r/164257 (owner: 10Dzahn)
[00:30:52] weird
[00:31:01] mutante: fine
[00:31:26] bd808: interesting
[00:31:55] re: jenkins, i tried logging in and finding a language setting in profile but i don't think there is one
[00:32:00] springle: :) tx
[00:32:26] but interestingly there is an option to set my IRC nick
[00:34:07] !log long schema changes running from terbium. ok to kill osc_host.sh in emergency
[00:34:07] Logged the message, Master
[00:34:09] Local plugin is installed, latest version and like Krinkle pointed out set to force the UI to english
[00:34:23] So jenkins really is drunk today
[00:34:30] RECOVERY - Disk space on analytics1035 is OK: DISK OK
[00:34:32] (03CR) 10Dzahn: [C: 032] remove Tampa db and es servers [dns] - 10https://gerrit.wikimedia.org/r/164257 (owner: 10Dzahn)
[00:35:57] * bd808 leaves for dinner
[00:38:44] !log cp3016 - why you report failed puppet unlike everyone else but then it works
[00:38:52] Logged the message, Master
[00:39:20] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[00:39:53] RECOVERY - check if salt-minion is running on osmium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:40:14] !log osmium - salt-minion was running twice, stopped both, killed one, restarted properly
[00:40:19] Logged the message, Master
[00:41:40] RECOVERY - check if salt-minion is running on searchidx1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:41:45] !log searchidx1001 - same, fixed duplicate salt-minion
[00:41:50] Logged the message, Master
[00:42:22] what happened to rhenium?
[00:42:33] it's the flow box, cajoel
[00:44:31] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail
[00:45:54] (03PS1) 10Dzahn: remove more es server remnants [dns] - 10https://gerrit.wikimedia.org/r/165411
[00:47:21] (03CR) 10Dzahn: [C: 032] remove more es server remnants [dns] - 10https://gerrit.wikimedia.org/r/165411 (owner: 10Dzahn)
[00:50:29] (03PS1) 10Dzahn: remove Tampa nas servers [dns] - 10https://gerrit.wikimedia.org/r/165412
[00:51:19] PROBLEM - check if salt-minion is running on osmium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:51:40] (03PS1) 10Chad: Config for graphite plugin for Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/165413
[00:52:05] TimStarling: ori greg-g anyone care if we deploy https://gerrit.wikimedia.org/r/#/c/165410/ now?
[00:52:12] try to fix Q183
[00:52:39] * ori takes a look
[00:53:10] ok
[00:53:34] and https://gerrit.wikimedia.org/r/#/c/165409/ for the other core branch
[00:54:11] aude: what's the worst that could happen?
[00:55:06] is it easy to revert if it goes badly?
[00:56:33] (03PS1) 10Dzahn: remove entire $ORIGIN pmtpa. from wmnet [dns] - 10https://gerrit.wikimedia.org/r/165414
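The salt-minion alerts above come from an NRPE process check that counts processes whose argv matches the regex ^/usr/bin/python /usr/bin/salt-minion, where exactly one match is healthy (osmium briefly showed two). A minimal Python sketch of that kind of check, scanning /proc cmdlines; the regex and the OK/CRITICAL wording mirror the check output above, everything else is an illustrative assumption:

```python
import os
import re

# Count processes whose command line matches a pattern, in the spirit of
# check_procs with an argument-array regex.
PATTERN = re.compile(r'^/usr/bin/python /usr/bin/salt-minion')

def matching_processes():
    count = 0
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open('/proc/%s/cmdline' % pid, 'rb') as f:
                # argv entries are NUL-separated in /proc/<pid>/cmdline
                cmdline = f.read().replace(b'\0', b' ').decode('utf-8', 'replace').strip()
        except IOError:  # process exited while we were scanning
            continue
        if PATTERN.match(cmdline):
            count += 1
    return count

if __name__ == '__main__':
    n = matching_processes()
    if n == 1:
        print('PROCS OK: 1 matching process')
    else:
        # 0 means the daemon died; 2+ means duplicates, as seen on osmium above
        print('PROCS CRITICAL: %d matching processes' % n)
        raise SystemExit(2)
```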
[00:56:57] would be easy
[00:57:39] trying with eval.php, it seems to help although there may be more issues than this
[01:01:27] (03PS1) 10Dzahn: remove virt0 - decom [dns] - 10https://gerrit.wikimedia.org/r/165415
[01:03:09] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[01:04:34] (03PS1) 10Dzahn: remove virt0 from site.pp and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/165416
[01:04:49] sorry, was in a meeting
[01:06:48] np
[01:07:09] I will deploy them
[01:07:49] (03PS1) 10Dzahn: remove one last pmtpa remnant in domain_search [puppet] - 10https://gerrit.wikimedia.org/r/165417
[01:09:06] thanks
[01:10:18] (03PS1) 10Dzahn: remove Tampa network gear from netmon [puppet] - 10https://gerrit.wikimedia.org/r/165418
[01:11:06] yeah, sorry, i missed your reply
[01:11:09] thanks tim
[01:16:08] !log tstarling Started scap: update for Wikidata crash bug
[01:16:16] Logged the message, Master
[01:16:34] !log tstarling scap failed: CalledProcessError Command '('/usr/bin/git', 'rev-list', '-1', '@{upstream}')' returned non-zero exit status 128 (duration: 00m 25s)
[01:16:39] Logged the message, Master
[01:17:20] huh
[01:18:49] What's the status of pmtpa? Is stuff still running there?
[01:20:11] Krenair: nope, it's really down
[01:20:18] since like.. the last couple days
[01:20:27] well, still cleaning up remnants
[01:20:41] but no more servers running
[01:20:57] that is literally "@{upstream}"
[01:22:11] where is the scap source?
[01:22:17] /srv/mediawiki-staging
[01:22:19] mutante, sdtpa?
[01:22:32] not /srv/mediawiki
[01:22:40] no, the source of scap
[01:22:52] not the source of a scap
[01:23:01] mediawiki/tools/scap
[01:23:14] Krenair: good point, there might be more. fyi here though https://gerrit.wikimedia.org/r/#/q/status:merged+project:operations/dns+branch:master+topic:decom-Tampa,n,z
[01:23:21] https://git.wikimedia.org/summary/mediawiki%2Ftools%2Fscap.git
[01:24:14] I am right that @{upstream} is not a normal thing to pass to git, right?
[01:24:39] Krenair: actually, no "sdtpa" in puppet or DNS, it looks to me it was all only pmtpa
[01:25:32] TimStarling: no
[01:26:28] http://git-scm.com/book/en/Git-Tools-Revision-Selection#RefLog-Shortnames
[01:26:29] here is the scap output: http://paste.tstarling.com/p/NRCIMH.html
[01:27:42] where did you start it?
[01:27:52] i.e., what cwd?
[01:28:05] it shouldn't be sensitive to that, but it may be
[01:28:10] /srv/mediawiki-staging/php-1.25wmf2
[01:28:28] do you have a ~/.gitconfig ?
[01:28:37] (it should explicitly ignore that, which it isn't)
[01:28:43] [tin:/srv/mediawiki-staging] $ git rev-list -1 @{upstream}
[01:28:43] 71d270b2260e5c2ed83f95533dfd4b1fe8220762
[01:28:46] mutante, looks like Faidon removed sdtpa stuff from the repos back in late April/early May
[01:29:00] yes, just with user.name and user.email
[01:29:13] yes, that works for me also
[01:29:26] Krenair: ah, thanks for checking :)
[01:29:50] but it does it for each branch
[01:29:53] mutante, e.g. https://gerrit.wikimedia.org/r/#/c/130098/ and https://gerrit.wikimedia.org/r/#/c/131348/
[01:31:44] Krenair: the one i still wonder about is just this https://gerrit.wikimedia.org/r/#/c/164233/
[01:31:50] it works for me for all branches and extensions and skins
[01:31:51] Krenair: needs a.pergos
[01:31:53] !log tstarling Started scap: (no message)
[01:31:58] I'll just try running it again
[01:31:59] Logged the message, Master
[01:32:07] !log tstarling scap failed: CalledProcessError Command '('/usr/bin/git', 'rev-list', '-1', '@{upstream}')' returned non-zero exit status 128 (duration: 00m 14s)
[01:34:22] I'm going to edit the source now
[01:35:35] !log tstarling Started scap: (no message)
[01:35:49] !log tstarling scap failed: CalledProcessError Command '('/usr/bin/git', 'rev-list', '-1', '@{upstream}')' returned non-zero exit status 128 (duration: 00m 14s)
[01:36:19] ok, well that was informative
[01:36:28] the cwd it failed on was /srv/mediawiki-staging/php-1.25wmf1/.git/modules/extensions/MobileFrontend
[01:36:36] and that does indeed fail from the command line
[01:38:04] let me try
[01:38:07] TimStarling: ^
[01:38:13] do you mind?
[01:38:22] try what?
[01:38:31] nevermind, i'm lagged
[01:38:35] didn't see the last three messages
[01:42:18] TimStarling: well, it doesn't handle the case where the branch doesn't have a remote upstream branch gracefully
[01:47:54] what happens if I skip it?
[01:48:15] I just want to hack it up to get Wikidata out
[01:48:30] mutante, just updating the meta page about this stuff. Is codfw ready as a fallback in case of issues in eqiad, the same way pmtpa was?
[01:48:54] TimStarling: can you give me a minute to fix it?
[01:48:59] well, 5, really
[01:49:16] ok
[01:49:54] why can't we do sync-dir ?
[01:50:14] would be better to fix the issue, of course
[01:51:41] if ori can fix it in 5 minutes, then we can just wait
[01:51:42] Krenair: it's being built but not quite ready yet, like infra is there but not all the servers
[01:51:52] sure
[01:52:02] then pushing out wikidata will be the test of his fix
[01:52:10] yep
[01:53:17] TimStarling: can you try again?
[01:53:45] !log tstarling Started scap: (no message)
[01:55:43] so far so good
[01:56:13] did you just have it ignore the error? I see error output but it kept going
[01:58:18] i had it call 'git merge-base HEAD origin' if rev-list -1 @{upstream} failed
[01:58:33] right
[01:59:27] i don't like this stuff at all, a deployment process should not be loosey-goosey like that
[02:00:42] i'll submit a patch in a moment
[02:02:46] !log tstarling Finished scap: (no message) (duration: 09m 01s)
[02:02:54] Logged the message, Master
[02:04:25] scap output: http://paste.tstarling.com/p/QCttkJ.html
[02:05:02] now we get a new error on https://www.wikidata.org/w/index.php?title=Q183&oldid=143201634
[02:05:05] other stuff is fine
[02:07:02] doing http://pastie.org/9630016 works now
[02:08:10] https://github.com/wmde/WikibaseDataModel/pull/216 might be the fix but shall investigate tomorrow
[02:08:12] I got a crash on that URL
[02:08:24] http://www.wikidata.org/w/index.php?title=Q183&oldid=143201634
[02:08:45] oh, different backtrace, nice
[02:09:25] progress at least
[02:09:43] it sounds familiar, like what we had on Q72
[02:09:54] just in a different place now
[02:10:41] it is crashing in the garbage collector in Wikibase\DataModel\ByPropertyIdArray::buildIndex()
[02:11:06] ah
[02:14:06] alright, time to sleep
[02:14:21] thanks for your help
[02:19:30] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:20:29] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[02:40:48] !log LocalisationUpdate completed (1.25wmf1) at 2014-10-08 02:40:48+00:00
[02:40:57] Logged the message, Master
[02:49:50] PROBLEM - Disk space on analytics1035 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 146719 MB (3% inode=99%):
[02:51:50] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:52:10] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:02:52] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:08:33] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[03:09:22] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[03:10:02] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:11:22] PROBLEM - puppet last run on mw1030 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:11:51] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[03:18:44] !log LocalisationUpdate completed (1.25wmf2) at 2014-10-08 03:18:44+00:00
[03:18:51] Logged the message, Master
[03:21:11] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[03:29:34] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[04:31:03] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Oct 8 04:31:03 UTC 2014 (duration 31m 2s)
[04:31:10] Logged the message, Master
[04:52:29] (03CR) 10Tim Landscheidt: [C: 04-1] "Need to test how this is affected by bug #71692." [puppet] - 10https://gerrit.wikimedia.org/r/148917 (owner: 10Tim Landscheidt)
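The scap failure above came from running git rev-list -1 @{upstream} inside a submodule checkout whose branch has no configured upstream tracking branch, which makes git exit with status 128. A minimal Python sketch of the fallback ori described (call 'git merge-base HEAD origin' when the @{upstream} lookup fails); the function name is illustrative, not scap's actual code:

```python
import subprocess

def head_upstream_commit(cwd):
    """Return the upstream commit for HEAD, tolerating branches that have
    no configured upstream (where '@{upstream}' fails with status 128)."""
    try:
        return subprocess.check_output(
            ['/usr/bin/git', 'rev-list', '-1', '@{upstream}'],
            cwd=cwd).strip()
    except subprocess.CalledProcessError:
        # No tracking branch: fall back to the merge base with 'origin',
        # as the fix discussed above does.
        return subprocess.check_output(
            ['/usr/bin/git', 'merge-base', 'HEAD', 'origin'],
            cwd=cwd).strip()
```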
[05:18:57] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR
[05:43:29] (03CR) 1020after4: [C: 031] Extract wmf-beta-scap to sudo-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 (owner: 10Reedy)
[06:13:48] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 1, unused: 0
[06:28:08] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: puppet fail
[06:28:31] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: puppet fail
[06:28:33] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: puppet fail
[06:28:33] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail
[06:28:48] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail
[06:29:08] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: puppet fail
[06:29:08] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: puppet fail
[06:30:17] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:37] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:48] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:07] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:07] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:18] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:18] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:28] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:30] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:30] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:38] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:48] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:48] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:48] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:04] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:44:58] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:45:07] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:45:47] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:45:48] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:45:48] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:07] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:07] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:46:08] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:46:08] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:46:08] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:46:27] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:46:27] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:46:27] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:46:28] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:46:28] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:46:37] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[06:46:38] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:58] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:47:08] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:47:12] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:47:12] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:47:28] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:47:57] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:01:18] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR
[07:02:18] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 1, unused: 0
[07:41:08] RECOVERY - check if salt-minion is running on osmium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:42:19] _joe_: mw1017 PROCS WARNING: 2 processes with command name 'hhvm'
[07:42:36] _joe_: also mw1053 WARNING: Puppet is currently disabled, last run 525013 seconds ago with 0 failures
[07:42:55] <_joe_> paravoid: the second one just needs acknowledging
[07:43:15] <_joe_> the first one, I'll take a look
[07:43:32] puppet disabled for a week? how come?
[07:44:38] <_joe_> paravoid: the hhvm jobrunner is in a dire state
[07:45:12] <_joe_> I will poke aaron and or.i this evening
[07:45:50] <_joe_> so puppet is disabled because it should've been a temporary measure while the latest bugs got fixed
[07:45:52] and why does it need puppet disabled for that?
[07:46:02] can't we disable it in puppet?
[07:46:03] <_joe_> because puppet starts the jobrunner instance
[07:46:06] <_joe_> yes
[07:46:18] <_joe_> given it's a week and nobody cared
[07:46:21] <_joe_> I assume so
[07:46:21] (also, no mention of all that in SAL)
[07:46:25] :(
[07:46:44] <_joe_> whenever I disabled it, I logged it
[07:47:02] <_joe_> ok, doing that
[07:48:58] RECOVERY - BGP status on cr2-ulsfo is OK: OK: host 198.35.26.193, sessions up: 41, down: 0, shutdown: 0
[07:49:45] <_joe_> but after I merge https://gerrit.wikimedia.org/r/#/c/159490/ and https://gerrit.wikimedia.org/r/#/c/164358/ ... been working on those for a while
[07:55:33] !log restart db2011
[07:55:38] Logged the message, Master
[07:56:06] include "sites-enabled/..."? isn't this going to load that config twice?
[07:56:29] one because you include it, and another because the default config includes sites-enabled/*?
[07:56:38] <_joe_> no
[07:56:51] <_joe_> it includes sites-enabled/*.conf
[07:57:07] ah!
[07:57:32] <_joe_> which is what debian does as well
[07:58:14] nice
[07:58:26] cleanups, I love that :)
[07:58:36] did you do any tests with nginx btw?
[07:58:43] or is this after the hhvm migration is over?
[07:59:28] <_joe_> after that
[07:59:33] cool
[07:59:45] <_joe_> but I'm preparing a testing framework
[07:59:49] sorry, just trying to catch up :)
[07:59:52] <_joe_> so that we don't get crazy
[07:59:58] <_joe_> no no don't be sorry
[08:00:14] <_joe_> I harassed you for months
[08:02:31] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 91.198.174.245, interfaces up: 98, down: 0, dormant: 0, excluded: 0, unused: 0
[08:02:39] (just fixing random alerts)
[08:11:17] !log powercycling rhenium, unresponsive
[08:11:23] Logged the message, Master
[08:11:33] springle: hi :)
[08:12:09] paravoid: hi :)
[08:13:20] long time no see
[08:14:02] RECOVERY - DPKG on rhenium is OK: All packages OK
[08:14:20] RECOVERY - check configured eth on rhenium is OK: NRPE: Unable to read output
[08:14:32] RECOVERY - SSH on rhenium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[08:14:32] RECOVERY - check if dhclient is running on rhenium is OK: PROCS OK: 0 processes with command name dhclient
[08:14:32] RECOVERY - RAID on rhenium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[08:14:32] RECOVERY - Disk space on rhenium is OK: DISK OK
[08:15:41] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures
[08:16:30] paravoid: thanks for entering the LocalisationUpdate discussion
[08:17:13] well, my reply was a bit offtopic to your original point
[08:17:24] <_joe_> springle: we'll have localizoid anyways in a not-too-distant future :D
[08:17:31] hahaha
[08:17:32] the fix will be relevant though
[08:17:35] heh
[08:17:38] * _joe_ promised himself not to be too sarcastic
[08:18:31] <_joe_> have any of you ever used a powerline lan adapter? do they work?
[08:20:18] man I'm not sure what's more depressing, icinga's unhandled service alerts or the handled ones
[08:20:37] gitblit.wikimedia.org 404 43d11h19m
[08:22:25] dataset2? isn't that box dead?
[08:25:05] nickel (ganglia) has had a broken disk since Sep 2nd
[08:26:21] <_joe_> paravoid: I'm pretty sure someone did some work on nickel to this that
[08:26:32] <_joe_> s/this/fix/
[08:27:10] <_joe_> but I've been pretty much focused on varnish/mediawiki/hhvm and hiera lately
[08:27:23] <_joe_> so I lost some grip on the rest tbh
[08:27:31] there's an RT
[08:27:59] somewhere in limbo between cmjohnson and Coren, I think deadlocked with each other
[08:28:50] <_joe_> ...
[08:29:11] <_joe_> ok, back to webtest, ttyl
[08:29:24] <_joe_> ping me if you need something :)
[08:29:38] does a shoulder to cry on count?
[08:29:49] <_joe_> ahahah
[08:32:52] RECOVERY - NTP on rhenium is OK: NTP OK: Offset -0.0004564523697 secs
[08:35:40] RECOVERY - mysqld processes on db1042 is OK: PROCS OK: 1 process with command name mysqld
[08:35:49] RECOVERY - MySQL InnoDB on db1042 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[08:35:49] RECOVERY - MySQL Processlist on db1042 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[08:38:09] PROBLEM - MySQL Slave Delay on db1042 is CRITICAL: CRIT replication delay 811607 seconds
[08:38:20] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 811558 seconds
[08:38:26] <_joe_> 10 days to recover
[08:38:34] (03PS1) 10Springle: upgrade db1042 mariadb 10 [puppet] - 10https://gerrit.wikimedia.org/r/165449
[08:38:36] nah
[08:38:38] <_joe_> good luck with that
[08:38:41] <_joe_> yeah I know
[08:38:42] it's about to be blatted
[08:38:46] <_joe_> it's just funny
[08:39:49] figured i should get around to it before paravoid cracked the icinga whip at me ;)
[08:40:18] (03PS1) 10Faidon Liambotis: gitblit: fix monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/165450
[08:40:53] (03CR) 10Faidon Liambotis: [C: 032] gitblit: fix monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/165450 (owner: 10Faidon Liambotis)
[08:41:59] (03CR) 10Springle: [C: 032] upgrade db1042 mariadb 10 [puppet] - 10https://gerrit.wikimedia.org/r/165449 (owner: 10Springle)
[08:42:40] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 0 seconds
[08:43:08] huh?
[08:43:28] _joe_: ^ look, 10 days in 5 min
[08:43:30] RECOVERY - MySQL Slave Delay on db1042 is OK: OK replication delay 0 seconds
[08:43:33] lol
[08:45:50] springle: there are also alerts for db2010, db2011, db2029 & dbstore1001
[08:46:09] db2029 has a comment "replication broken" and is acknowledged
[08:46:16] dbstore?
[08:46:19] hmm
[08:46:24] dbstore1001 has a comment "intentional 24h replag"
[08:46:37] oh a warning
[08:46:45] yes, I'm looking at these too :)
[08:46:52] yes. have yet to tweak the check script for lagged replicas
[08:46:57] ok
[08:49:43] godog: can I reenable puppet on lithium? it has a comment "test syslog", I think that was you?
[08:50:30] unless it was related to https://wikitech.wikimedia.org/w/index.php?title=Syslog&diff=130154&oldid=47241
[08:50:39] paravoid: almost yes :) https://gerrit.wikimedia.org/r/#/c/164524/ and just added you to https://gerrit.wikimedia.org/r/#/c/164523/
[08:51:35] Nemo_bis: yep it was that
[08:51:49] I've never used syslog-ng
[08:51:54] I use rsyslog myself
[08:51:58] so I can't help you much on the reviews
[08:52:01] just self-merge :)
[08:53:08] haha okay, curious though why the syslog-ng/rsyslog split?
[08:53:26] rsyslog is the default, and syslog-ng... because someone preferred it at the time?
[08:53:31] dunno, precedes me :)
[08:53:35] (03CR) 10Giuseppe Lavagetto: [C: 031] syslog-ng: filter out swift noise [puppet] - 10https://gerrit.wikimedia.org/r/164524 (owner: 10Filippo Giunchedi)
[08:54:15] _joe_: according to #8243, the memory issue is fixed
[08:54:35] _joe_: thanks!
[08:54:44] PROBLEM - check if salt-minion is running on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:22] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:55:22] PROBLEM - puppet last run on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:31] <_joe_> mmmh
[08:55:33] PROBLEM - Apache HTTP on mw1115 is CRITICAL: Connection timed out
[08:55:33] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:55:39] <_joe_> api outage again?
[08:55:42] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:55:43] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:55:43] PROBLEM - Apache HTTP on mw1118 is CRITICAL: Connection timed out
[08:55:43] seems so
[08:55:52] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:55:52] PROBLEM - Apache HTTP on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:55:52] PROBLEM - puppet last run on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:52] PROBLEM - Apache HTTP on mw1195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:55:53] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:55:53] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:55:53] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:55:54] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:55:57] (03PS2) 10Filippo Giunchedi: syslog-ng: update to trusty [puppet] - 10https://gerrit.wikimedia.org/r/164523
[08:56:01] <_joe_> I'm sure the reduction in parsoid's timeout has nothing to do with this
[08:56:03] PROBLEM - Apache HTTP on mw1148 is CRITICAL: Connection timed out
[08:56:03] PROBLEM - Apache HTTP on mw1127 is CRITICAL: Connection timed out
[08:56:08] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] syslog-ng: update to trusty [puppet] - 10https://gerrit.wikimedia.org/r/164523 (owner: 10Filippo Giunchedi)
[08:56:13] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:18] PROBLEM - DPKG on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:56:18] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:18] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:18] PROBLEM - Apache HTTP on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:18] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:19] PROBLEM - Apache HTTP on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:19] PROBLEM - Apache HTTP on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:20] PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:20] PROBLEM - Apache HTTP on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:22] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:23] PROBLEM - SSH on mw1124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:23] PROBLEM - Apache HTTP on mw1194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:23] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:23] PROBLEM - RAID on mw1124 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:56:23] PROBLEM - LVS HTTP IPv4 on api.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:25] (03PS2) 10Filippo Giunchedi: syslog-ng: filter out swift noise [puppet] - 10https://gerrit.wikimedia.org/r/164524
[08:56:29] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:33] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] syslog-ng: filter out swift noise [puppet] - 10https://gerrit.wikimedia.org/r/164524 (owner: 10Filippo Giunchedi)
[08:56:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[08:56:43] RECOVERY - check if salt-minion is running on mw1128 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:56:43] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:43] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:43] PROBLEM - Apache HTTP on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:43] PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:51] s2
[08:56:52] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:52] PROBLEM - Apache HTTP on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:56:53] PROBLEM - Apache HTTP on mw1124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:57:12] RECOVERY - DPKG on mw1127 is OK: All packages OK
[08:57:23] PROBLEM - check configured eth on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:23] PROBLEM - RAID on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:23] RECOVERY - SSH on mw1124 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[08:57:29] seems isolated to API, are you sure it's s2?
[08:57:31] <_joe_> it looks to me like we're simply getting too many http requests
[08:57:32] RECOVERY - RAID on mw1124 is OK: OK: no RAID installed
[08:57:34] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:57:42] PROBLEM - puppet last run on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:52] PROBLEM - puppet last run on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:52] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:57:58] nah, it's not that simple
[08:58:12] PROBLEM - DPKG on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:58:16] PROBLEM - check configured eth on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:58:16] PROBLEM - RAID on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:58:22] PROBLEM - check configured eth on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:58:22] RECOVERY - RAID on mw1122 is OK: OK: no RAID installed
[08:58:22] RECOVERY - check configured eth on mw1122 is OK: NRPE: Unable to read output
[08:58:25] it spiked, then crashed
[08:58:33] PROBLEM - RAID on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:58:34] https://ganglia.wikimedia.org/latest/stacked.php?m=ap_rps&c=API%20application%20servers%20eqiad&r=hour&st=1412758689&host_regex=
[08:58:37] <_joe_> paravoid: I'm on a server and it seems they're working hard
[08:58:37] PROBLEM - check configured eth on mw1124 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:58:49] https://ganglia.wikimedia.org/latest/stacked.php?m=ap_rps&c=API%20application%20servers%20eqiad&r=hour&st=1412758689&host_regex=
[08:58:52] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 297 seconds ago with 0 failures
[08:58:56] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.228 second response time
[08:58:57] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=API+application+servers+eqiad&m=ap_rps&s=by+name&mc=2&g=load_report
[08:59:02] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 815 seconds ago with 0 failures
[08:59:03] RECOVERY - DPKG on mw1117 is OK: All packages OK
[08:59:03] RECOVERY - RAID on mw1132 is OK: OK: no RAID installed
[08:59:03] RECOVERY - check configured eth on mw1132 is OK: NRPE: Unable to read output
[08:59:13] PROBLEM - Disk space on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:13] PROBLEM - RAID on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:13] PROBLEM - check if dhclient is running on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:27] PROBLEM - puppet last run on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:27] PROBLEM - check configured eth on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
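The "HTTP 5xx req/min on tungsten" alert a few lines up is a Graphite-backed check: it samples a metric over a window and goes critical when more than a given fraction of datapoints exceed a threshold (here, 20% above 500). A minimal Python sketch of that logic, assuming a Graphite render endpoint that returns JSON; the URL, metric name, and thresholds are illustrative, not the deployed check's configuration:

```python
import json
from urllib.request import urlopen

# Illustrative values; the real check's metric and thresholds live elsewhere.
GRAPHITE = 'http://graphite.example.org/render?target=reqstats.5xx&from=-10min&format=json'
THRESHOLD = 500.0    # req/min considered critical
CRIT_FRACTION = 0.2  # alert when >20% of datapoints exceed THRESHOLD

def check():
    series = json.load(urlopen(GRAPHITE))[0]['datapoints']
    values = [v for v, ts in series if v is not None]  # Graphite pads gaps with nulls
    if not values:
        return 'UNKNOWN: no datapoints'
    above = sum(1 for v in values if v > THRESHOLD)
    fraction = above / float(len(values))
    if fraction > CRIT_FRACTION:
        return 'CRITICAL: %.2f%% of data above the critical threshold [%s]' % (
            100 * fraction, THRESHOLD)
    return 'OK: less than %.0f%% above the threshold' % (100 * CRIT_FRACTION)

if __name__ == '__main__':
    print(check())
```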
[08:59:27] RECOVERY - check configured eth on mw1117 is OK: NRPE: Unable to read output
[08:59:30] <_joe_> paravoid: it's just not reporting to ganglia, I'm seeing a 200 load on some servers
[08:59:32] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:59:33] RECOVERY - RAID on mw1117 is OK: OK: no RAID installed
[08:59:45] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 1266 seconds ago with 0 failures
[08:59:52] PROBLEM - SSH on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:59:52] PROBLEM - RAID on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:52] PROBLEM - DPKG on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:52] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 213 seconds ago with 0 failures
[08:59:52] PROBLEM - SSH on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:59:53] PROBLEM - check if dhclient is running on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:53] PROBLEM - RAID on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:14] PROBLEM - RAID on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:14] PROBLEM - check configured eth on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:14] PROBLEM - check if salt-minion is running on mw1118 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:15] PROBLEM - check if salt-minion is running on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:15] PROBLEM - nutcracker process on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:23] PROBLEM - RAID on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:42] RECOVERY - SSH on mw1115 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[09:00:42] RECOVERY - RAID on mw1115 is OK: OK: no RAID installed
[09:00:42] RECOVERY - DPKG on mw1115 is OK: All packages OK
[09:00:42] RECOVERY - check configured eth on mw1124 is OK: NRPE: Unable to read output
[09:00:55] RECOVERY - SSH on mw1126 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[09:00:56] PROBLEM - puppet last run on mw1190 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:00:56] PROBLEM - nutcracker process on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:05] RECOVERY - LVS HTTP IPv4 on api.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 4126 bytes in 0.099 second response time
[09:01:15] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.161 second response time
[09:01:36] PROBLEM - check configured eth on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:41] RECOVERY - Apache HTTP on mw1124 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.000 second response time
[09:01:41] PROBLEM - nutcracker process on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:41] RECOVERY - RAID on mw1131 is OK: OK: no RAID installed
[09:01:41] RECOVERY - Disk space on mw1131 is OK: DISK OK
[09:01:45] RECOVERY - nutcracker process on mw1148 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker
[09:01:45] RECOVERY - check if salt-minion is running on mw1148 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:01:46] RECOVERY - check if dhclient is running on mw1131 is OK: PROCS OK: 0 processes with command name dhclient
[09:01:46] RECOVERY - check configured eth on mw1131 is OK: NRPE: Unable to read output
[09:01:46] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 1140 seconds ago with 0 failures
[09:01:46] PROBLEM - RAID on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:01:56] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.233 second response time
[09:02:05] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.556 second response time
[09:02:05] PROBLEM - RAID on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:06] RECOVERY - puppet last run on mw1190 is OK: OK: Puppet is currently enabled, last run 812 seconds ago with 0 failures
[09:02:06] RECOVERY - nutcracker process on mw1138 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker
[09:02:17] RECOVERY - check if dhclient is running on mw1127 is OK: PROCS OK: 0 processes with command name dhclient
[09:02:25] PROBLEM - check if salt-minion is running on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:02:26] RECOVERY - RAID on mw1127 is OK: OK: no RAID installed
[09:02:26] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.766 second response time
[09:02:26] RECOVERY - RAID on mw1128 is OK: OK: no RAID installed
[09:02:26] RECOVERY - nutcracker process on mw1146 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker
[09:02:35] RECOVERY - check configured eth on mw1146 is OK: NRPE: Unable to read output
[09:02:35] RECOVERY - check configured eth on mw1148 is OK: NRPE: Unable to read output
[09:02:35] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.222 second response time
[09:02:36] RECOVERY - check if salt-minion is running on mw1118 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:02:36] RECOVERY - RAID on mw1137 is OK: OK: no RAID installed
[09:02:36] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.097 second response time
[09:02:46] RECOVERY - RAID on mw1143 is OK: OK: no RAID installed
[09:02:47] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.327 second response time
[09:02:47] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.131 second response time
[09:02:48] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.124 second response time
[09:02:56] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.910 second response time
[09:03:05] !log killed masses of sleeping connections on s2 slaves
[09:03:06] RECOVERY - RAID on mw1146 is OK: OK: no RAID installed
[09:03:06] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.105 second response time
[09:03:07] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.057 second response time
[09:03:08] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.251 second response time
[09:03:08] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.022 second response time
[09:03:08] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time
[09:03:08] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.495 second response time
[09:03:10] Logged the message, Master
[09:03:15] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.692 second response time
[09:03:15] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.067 second response time
[09:03:15] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.137 second response time
[09:03:16] RECOVERY - check if salt-minion is running on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:03:23] <_joe_> springle: that is expected as well
[09:03:25] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.041 second response time
[09:03:30] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.111 second response time
[09:03:30] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.619 second response time
[09:03:31] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.143 second response time
[09:03:31] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.274 second response time
[09:03:35] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.082 second response time
[09:03:35] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.259 second response time
[09:03:35] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.504 second response time
[09:03:35] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.226 second response time
[09:03:36] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.635 second response time
[09:03:45] RECOVERY - Apache HTTP on mw1118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.104 second response time
[09:03:45] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.094 second response time
[09:04:05] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.148 second response time
[09:04:06] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time
[09:04:06] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.124 second response time
[09:04:06] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.207 second response time
[09:04:57] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.097 second response time
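springle's fix above was to kill the piled-up idle connections on the s2 database slaves so the API appservers could get connections again. A minimal sketch of that kind of cleanup, assuming the PyMySQL client library and illustrative host/credential values; the 300-second idle cutoff is an assumption too:

```python
import pymysql

# Kill client connections that have sat in 'Sleep' state too long.
IDLE_CUTOFF = 300  # seconds; illustrative

conn = pymysql.connect(host='db1054.example', user='root', password='...')
with conn.cursor() as cur:
    cur.execute("SELECT id, user, time FROM information_schema.processlist "
                "WHERE command = 'Sleep' AND time > %s", (IDLE_CUTOFF,))
    for conn_id, user, idle_time in cur.fetchall():
        print('killing connection %d (user=%s, idle %ds)' % (conn_id, user, idle_time))
        cur.execute('KILL %d' % conn_id)  # KILL takes a literal thread id
conn.close()
```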
[09:06:17] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[09:07:06] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[09:08:16] PROBLEM - RAID on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:08:16] PROBLEM - SSH on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:08:19] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:08:19] PROBLEM - puppet last run on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:08:19] PROBLEM - RAID on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:08:45] PROBLEM - RAID on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:09:17] RECOVERY - RAID on mw1122 is OK: OK: no RAID installed
[09:09:17] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 461 seconds ago with 0 failures
[09:09:17] RECOVERY - SSH on mw1132 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[09:09:17] PROBLEM - DPKG on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:09:28] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:09:37] PROBLEM - RAID on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:09:43] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:09:44] PROBLEM - puppet last run on mw1129 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:09:45] RECOVERY - RAID on mw1119 is OK: OK: no RAID installed
[09:10:06] PROBLEM - RAID on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:10:06] PROBLEM - puppet last run on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:10:15] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:10:16] RECOVERY - RAID on mw1148 is OK: OK: no RAID installed
[09:10:16] RECOVERY - DPKG on mw1127 is OK: All packages OK
[09:10:16] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.124 second response time
[09:10:27] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.119 second response time
[09:10:29] RECOVERY - RAID on mw1115 is OK: OK: no RAID installed
[09:10:33] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.199 second response time
[09:10:33] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 1382 seconds ago with 0 failures
[09:11:12] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 274 seconds ago with 0 failures
[09:11:12] RECOVERY - RAID on mw1144 is OK: OK: no RAID installed
[09:11:12] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.129 second response time
[09:12:26] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:17:59] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[09:24:26] PROBLEM - puppet last run on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:25:05] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[09:25:06] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[09:25:27] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 389 seconds ago with 0 failures
[09:25:36] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:25:36] PROBLEM - Apache HTTP on mw1142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:25:56] PROBLEM - puppet last run on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:25:56] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:26:25] RECOVERY - Apache HTTP on mw1142 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.123 second response time
[09:26:26] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.894 second response time
[09:26:46] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:26:48] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.184 second response time
[09:26:48] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 99 seconds ago with 0 failures
[09:27:06] !log springle Synchronized wmf-config/db-eqiad.php: isolate api traffic on s2 to db1054 and db1060 (duration: 01m 20s)
[09:27:11] Logged the message, Master
[09:27:13] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM, merging and running a compile on netmon1001" [puppet] - 10https://gerrit.wikimedia.org/r/164492 (owner: 10Dzahn)
[09:28:16] PROBLEM - SSH on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:28:16] PROBLEM - check if dhclient is running on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:28:16] PROBLEM - nutcracker process on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:28:16] PROBLEM - check if salt-minion is running on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:28:17] PROBLEM - puppet last run on lithium is CRITICAL: CRITICAL: Puppet last ran 517272 seconds ago, expected 14400
[09:28:17] PROBLEM - Apache HTTP on mw1143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:29:05] PROBLEM - Disk space on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:29:06] PROBLEM - puppet last run on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:29:06] PROBLEM - RAID on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:29:06] PROBLEM - nutcracker port on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:29:07] PROBLEM - check configured eth on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:29:07] PROBLEM - DPKG on mw1143 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:29:25] RECOVERY - puppet last run on lithium is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[09:30:07] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:30:35] PROBLEM - check if salt-minion is running on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:30:36] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:30:36] PROBLEM - check configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:30:45] PROBLEM - Apache HTTP on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:48] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:48] PROBLEM - puppet last run on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:30:55] PROBLEM - Disk space on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:30:55] PROBLEM - RAID on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:30:55] PROBLEM - Apache HTTP on mw1117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:55] PROBLEM - puppet last run on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:30:55] PROBLEM - check if salt-minion is running on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:30:55] PROBLEM - puppet last run on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:30:56] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:56] PROBLEM - check if dhclient is running on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:30:57] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:15] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 122 seconds ago with 0 failures [09:31:15] PROBLEM - check if salt-minion is running on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:31:15] PROBLEM - Apache HTTP on mw1137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:15] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:15] PROBLEM - RAID on mw1137 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:31:16] PROBLEM - Apache HTTP on mw1138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:16] PROBLEM - RAID on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:31:17] PROBLEM - RAID on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:31:17] PROBLEM - Apache HTTP on mw1121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:18] PROBLEM - puppet last run on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:31:18] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:25] RECOVERY - check if salt-minion is running on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:31:25] PROBLEM - RAID on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:31:25] PROBLEM - Apache HTTP on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:25] PROBLEM - RAID on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:31:25] RECOVERY - check configured eth on mw1140 is OK: NRPE: Unable to read output [09:31:26] PROBLEM - puppet last run on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:31:26] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:27] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:27] PROBLEM - Apache HTTP on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:35] PROBLEM - Apache HTTP on mw1129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:35] PROBLEM - Apache HTTP on mw1200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:35] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:45] RECOVERY - Disk space on mw1126 is OK: DISK OK [09:31:46] PROBLEM - RAID on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:31:47] RECOVERY - check if salt-minion is running on mw1138 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:31:47] PROBLEM - DPKG on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:31:47] RECOVERY - check if dhclient is running on mw1138 is OK: PROCS OK: 0 processes with command name dhclient [09:31:47] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 321 seconds ago with 0 failures [09:31:47] PROBLEM - DPKG on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:31:55] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:56] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 274 seconds ago with 0 failures [09:31:56] RECOVERY - RAID on mw1144 is OK: OK: no RAID installed [09:31:56] PROBLEM - puppet last run on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:05] PROBLEM - Apache HTTP on mw1146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:32:05] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:32:06] RECOVERY - check if salt-minion is running on mw1133 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:32:15] PROBLEM - check if dhclient is running on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:15] PROBLEM - SSH on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:32:15] PROBLEM - nutcracker process on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:15] PROBLEM - check configured eth on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:15] PROBLEM - puppet last run on mw1128 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:16] PROBLEM - Apache HTTP on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:32:16] PROBLEM - DPKG on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:17] PROBLEM - puppet last run on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:17] PROBLEM - check configured eth on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:18] RECOVERY - RAID on mw1136 is OK: OK: no RAID installed [09:32:25] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.613 second response time [09:32:28] PROBLEM - SSH on mw1132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:32:28] PROBLEM - RAID on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:32:37] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:32:37] PROBLEM - puppet last run on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:37] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.674 second response time [09:32:37] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:38] PROBLEM - Apache HTTP on mw1133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:32:38] PROBLEM - SSH on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:32:46] RECOVERY - DPKG on mw1131 is OK: All packages OK [09:32:46] PROBLEM - puppet last run on mw1133 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:32:55] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:33:05] RECOVERY - SSH on mw1128 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [09:33:10] RECOVERY - check if dhclient is running on mw1126 is OK: PROCS OK: 0 processes with command name dhclient [09:33:10] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 946 seconds ago with 0 failures [09:33:10] RECOVERY - check configured eth on mw1128 is OK: NRPE: Unable to read output [09:33:10] RECOVERY - nutcracker process on mw1117 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker [09:33:10] PROBLEM - check configured eth on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:10] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.427 second response time [09:33:10] RECOVERY - DPKG on mw1126 is OK: All packages OK [09:33:11] RECOVERY - Disk space on mw1143 is OK: DISK OK [09:33:15] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 982 seconds ago with 0 failures [09:33:15] RECOVERY - check configured eth on mw1117 is OK: NRPE: Unable to read output [09:33:16] RECOVERY - RAID on mw1126 is OK: OK: no RAID installed [09:33:16] RECOVERY - nutcracker port on mw1143 is OK: TCP OK - 0.000 second response time on port 11212 [09:33:16] RECOVERY - RAID on mw1133 is OK: OK: no RAID installed [09:33:16] RECOVERY - RAID on mw1143 is OK: OK: no RAID installed [09:33:16] RECOVERY - check configured eth on mw1143 is OK: NRPE: Unable to read output [09:33:17] RECOVERY - DPKG on mw1143 is OK: All packages OK [09:33:17] PROBLEM - Apache HTTP on mw1119 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:33:18] PROBLEM - puppet last run on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:18] PROBLEM - SSH on mw1141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:33:19] PROBLEM - check if salt-minion is running on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:19] PROBLEM - RAID on mw1121 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:25] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed [09:33:26] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 344 seconds ago with 0 failures [09:33:26] RECOVERY - Apache HTTP on mw1131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.246 second response time [09:33:26] PROBLEM - DPKG on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:33:26] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.510 second response time [09:33:26] PROBLEM - RAID on mw1138 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:27] PROBLEM - nutcracker process on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:35] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.092 second response time [09:33:35] RECOVERY - Apache HTTP on mw1133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.077 second response time [09:33:36] RECOVERY - SSH on mw1136 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [09:33:36] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.638 second response time [09:33:37] RECOVERY - RAID on mw1119 is OK: OK: no RAID installed [09:33:37] PROBLEM - check if dhclient is running on mw1117 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:37] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [09:33:45] RECOVERY - SSH on mw1143 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [09:33:52] PROBLEM - RAID on mw1141 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:33:52] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [09:33:52] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.149 second response time [09:33:52] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 892 seconds ago with 0 failures [09:33:52] RECOVERY - Apache HTTP on mw1123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.695 second response time [09:33:53] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [09:33:55] RECOVERY - DPKG on mw1122 is OK: All packages OK [09:33:55] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time [09:34:05] RECOVERY - Apache HTTP on mw1117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [09:34:09] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.106 second response time [09:34:09] RECOVERY - Apache HTTP on mw1146 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.084 second response time [09:34:09] RECOVERY - check configured eth on mw1125 is OK: NRPE: Unable to read output [09:34:09] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.163 second response time [09:34:09] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 1224 seconds ago with 0 failures [09:34:18] RECOVERY - RAID on mw1137 is OK: OK: no RAID installed [09:34:18] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.081 second response time [09:34:18] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.555 second response time [09:34:19] RECOVERY - puppet last run on mw1125 is OK: OK: Puppet is currently enabled, last run 1411 seconds ago with 0 failures [09:34:19] RECOVERY - Apache HTTP on mw1137 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.657 second response time [09:34:19] RECOVERY - 
Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.063 second response time [09:34:19] RECOVERY - SSH on mw1141 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [09:34:20] RECOVERY - check if salt-minion is running on mw1141 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:34:20] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 548 seconds ago with 0 failures [09:34:21] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 396 seconds ago with 0 failures [09:34:26] RECOVERY - DPKG on mw1141 is OK: All packages OK [09:34:26] RECOVERY - RAID on mw1120 is OK: OK: no RAID installed [09:34:26] RECOVERY - check if salt-minion is running on mw1143 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:34:26] RECOVERY - nutcracker process on mw1143 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker [09:34:26] RECOVERY - check if dhclient is running on mw1143 is OK: PROCS OK: 0 processes with command name dhclient [09:34:37] RECOVERY - RAID on mw1138 is OK: OK: no RAID installed [09:34:38] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.094 second response time [09:34:38] RECOVERY - RAID on mw1141 is OK: OK: no RAID installed [09:34:38] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.156 second response time [09:34:38] RECOVERY - Apache HTTP on mw1129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.107 second response time [09:35:07] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.140 second response time [09:35:16] RECOVERY - Apache HTTP on mw1138 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.113 second response time [09:35:19] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.067 second response time [09:35:25] RECOVERY - RAID on mw1121 is OK: OK: no RAID installed [09:35:26] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.091 second response time [09:35:26] RECOVERY - SSH on mw1132 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [09:35:26] RECOVERY - nutcracker process on mw1132 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker [09:35:26] RECOVERY - RAID on mw1117 is OK: OK: no RAID installed [09:35:35] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 838 seconds ago with 0 failures [09:35:36] RECOVERY - check if dhclient is running on mw1117 is OK: PROCS OK: 0 processes with command name dhclient [09:35:36] RECOVERY - Apache HTTP on mw1132 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.081 second response time [09:40:52] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:41:01] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:41:02] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:41:02] PROBLEM - puppet last run on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:41:11] PROBLEM - check configured eth on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:41:12] PROBLEM - Apache HTTP on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:41:12] PROBLEM - DPKG on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:41:12] PROBLEM - puppet last run on mw1146 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:41:21] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:42:01] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.117 second response time [09:42:12] RECOVERY - Apache HTTP on mw1134 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.196 second response time [09:42:12] PROBLEM - check if salt-minion is running on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:42:21] RECOVERY - DPKG on mw1146 is OK: All packages OK [09:42:21] PROBLEM - DPKG on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:42:21] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.063 second response time [09:42:22] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 512 seconds ago with 0 failures [09:42:31] PROBLEM - RAID on mw1119 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:43:02] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.102 second response time [09:43:16] (03PS1) 10Springle: Isolate api traffic on s2 to db1054 and db1060. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165458 [09:43:21] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.114 second response time [09:43:22] PROBLEM - RAID on mw1132 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:43:22] RECOVERY - check if salt-minion is running on mw1119 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:43:22] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 660 seconds ago with 0 failures [09:43:22] RECOVERY - DPKG on mw1119 is OK: All packages OK [09:43:23] RECOVERY - check configured eth on mw1130 is OK: NRPE: Unable to read output [09:43:32] RECOVERY - RAID on mw1119 is OK: OK: no RAID installed [09:43:41] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 1 failures [09:43:48] (03CR) 10Springle: [C: 032] Isolate api traffic on s2 to db1054 and db1060. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165458 (owner: 10Springle) [09:43:54] (03Merged) 10jenkins-bot: Isolate api traffic on s2 to db1054 and db1060. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165458 (owner: 10Springle) [09:44:11] RECOVERY - RAID on mw1132 is OK: OK: no RAID installed [09:44:22] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:46:41] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [09:58:29] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:59:22] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [10:00:12] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [10:41:26] (03PS2) 10Faidon Liambotis: Reduce number of parsoid runners even further [puppet] - 10https://gerrit.wikimedia.org/r/165317 (owner: 10GWicke) [10:41:34] (03CR) 10Faidon Liambotis: [C: 032] Reduce number of parsoid runners even further [puppet] - 10https://gerrit.wikimedia.org/r/165317 (owner: 10GWicke) [10:56:32] (03CR) 10QChris: [C: 04-1] "Needs further investigation on cluster bootstrapping," (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/164761 (owner: 10QChris) [10:57:14] (03PS2) 10QChris: Declare datanode's mount directories only once [puppet] - 10https://gerrit.wikimedia.org/r/164763 [10:59:06] _joe_: I want to delete those docroot folders from mediawiki-config today :) [10:59:23] <_joe_> Reedy: fine by me [10:59:28] sweet [11:00:33] (03PS4) 10Reedy: Remove all superfluous docroot folders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/90704 [11:00:42] (03CR) 10Reedy: [C: 032] "Good riddance!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/90704 (owner: 10Reedy) [11:00:49] (03Merged) 10jenkins-bot: Remove all superfluous docroot folders [mediawiki-config] - 10https://gerrit.wikimedia.org/r/90704 (owner: 10Reedy) [11:02:13] !log reedy Synchronized docroot and w: good riddance to bad docroots (duration: 00m 16s) [11:02:21] Logged the message, Master [11:02:53] I think I'll see about removing some more of them/symlinks soon too [11:03:08] * Reedy wonders why he left usability [11:03:28] <_joe_> bb in 5. rebooting the damn router [11:13:51] (03PS3) 10Giuseppe Lavagetto: mediawiki: add HHVM proxy rules in main.conf [puppet] - 10https://gerrit.wikimedia.org/r/159490 [11:14:20] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: add HHVM proxy rules in main.conf [puppet] - 10https://gerrit.wikimedia.org/r/159490 (owner: 10Giuseppe Lavagetto) [11:14:37] <_joe_> c'mon jenkins [11:15:46] I think zuul isn't happy [11:17:32] No hashar [11:22:27] Anyone in ops fancy tackling https://gerrit.wikimedia.org/r/#/c/160960/ and its dependencies? I'm sure springle would appreciate it too :)) [11:25:28] <_joe_> Reedy: not now sorry [11:26:30] <_joe_> webtest is _awesome_ [11:28:45] Reedy: I'll take a look in 10 [11:28:55] Thanks!
[11:28:59] (03PS1) 10Giuseppe Lavagetto: mediawiki: fixup for proxy changes [puppet] - 10https://gerrit.wikimedia.org/r/165463 [11:29:26] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: fixup for proxy changes [puppet] - 10https://gerrit.wikimedia.org/r/165463 (owner: 10Giuseppe Lavagetto) [11:31:45] (03PS3) 10Filippo Giunchedi: swift-synctool: enable/disable/show sync [software] - 10https://gerrit.wikimedia.org/r/160428 [11:31:50] (03CR) 10jenkins-bot: [V: 04-1] swift-synctool: enable/disable/show sync [software] - 10https://gerrit.wikimedia.org/r/160428 (owner: 10Filippo Giunchedi) [11:37:56] (03PS4) 10Filippo Giunchedi: swift-synctool: enable/disable/show sync [software] - 10https://gerrit.wikimedia.org/r/160428 [11:38:21] * godog nudges jenkins [11:43:16] (03CR) 10Filippo Giunchedi: "LGTM, couple of comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/160960 (owner: 10Reedy) [11:43:52] heh, cheers godog [11:45:17] Reedy: I know it isn't your code and just shuffled around, still worth adjusting if we can :) [11:45:36] the comment definitely needs updating [11:50:39] (03CR) 10Reedy: Extract wmf-beta-scap to sudo-withagent wrapper script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/160960 (owner: 10Reedy) [11:51:13] mark: thoughts on https://gerrit.wikimedia.org/r/#/c/160632/3 ? I know it's been "controversial" in the past [11:54:27] it's on my todo list from before i got back [11:54:49] before i left even [11:55:12] the hosts have no certificates that i know of [11:55:46] and we'll need two, as wiki-mail is separate [11:57:20] godog: If you can reply to my couple of questions on that patchset, I'll get it fixed up :) [11:57:58] paravoid: good point! yeah I'm targeting the low-hanging fruits first [12:00:55] (03CR) 10Filippo Giunchedi: Extract wmf-beta-scap to sudo-withagent wrapper script (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/160960 (owner: 10Reedy) [12:00:59] Reedy: yep, done [12:07:37] (03PS9) 10Reedy: Extract wmf-beta-scap to sudo-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 [12:07:39] (03CR) 10Reedy: Extract wmf-beta-scap to sudo-withagent wrapper script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/160960 (owner: 10Reedy) [12:07:50] (03PS10) 10Reedy: Extract wmf-beta-scap to sudo-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 [12:09:11] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [12:09:46] that's everything bar the set -u then [12:10:21] Reedy: yup looks good, I'll merge [12:10:29] (03PS1) 10Giuseppe Lavagetto: mediawiki: repool mw1163 for scap, convert to hhvm. [puppet] - 10https://gerrit.wikimedia.org/r/165466 [12:10:33] thanks [12:10:41] Need to fix up its dependency though [12:10:55] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Extract wmf-beta-scap to sudo-withagent wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/160960 (owner: 10Reedy) [12:11:35] Reedy: which? 
[12:11:41] https://gerrit.wikimedia.org/r/#/c/158623/ [12:11:43] Bryan has -1'd it [12:11:57] it's one below, not one above sudo-withagent [12:12:27] ah okay, so nothing broken yet [12:12:34] nope :) [12:12:47] that gets the prep work in place at least [12:16:57] (03CR) 10Reedy: Use sync-dir to copy out l10n json files, build cdbs on hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) (owner: 10Reedy) [12:24:06] (03CR) 10Mark Bergsma: "No problems with it in principle." [puppet] - 10https://gerrit.wikimedia.org/r/160632 (owner: 10Filippo Giunchedi) [12:31:53] (03CR) 10Manybubbles: [C: 032 V: 032] Install new plugin and upgrade another [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/164633 (owner: 10Manybubbles) [12:32:54] !log deploying new elasticsearch plugins in preparation for minor Elasticsearch version upgrade today [12:33:01] Logged the message, Master [12:44:34] !log upgraded elastic1001 to Elasticsearch 1.3.2 -> 1.3.4, experimental highlighter 0.0.11 -> 0.0.12, and installed trigram accelerated regex search 0.0.1 [12:44:40] Logged the message, Master [12:48:38] Coren: WMFLabs down ? [13:07:28] NotASpy: parts of it [13:07:57] yeah, hearing about it through the grapevine now [13:14:37] (03PS2) 10Alexandros Kosiaris: remove Tampa network gear from netmon [puppet] - 10https://gerrit.wikimedia.org/r/165418 (owner: 10Dzahn) [13:14:57] (03CR) 10Hashar: "Imho there is little point in using ensure => absent, we can just clear the package on the instances. That saves us the change to remove " [puppet] - 10https://gerrit.wikimedia.org/r/165204 (https://bugzilla.wikimedia.org/54393) (owner: 10Zfilipin) [13:16:51] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [13:23:57] (03CR) 10Alexandros Kosiaris: [C: 032] remove Tampa network gear from netmon [puppet] - 10https://gerrit.wikimedia.org/r/165418 (owner: 10Dzahn) [13:24:01] (03PS1) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-en-es] - 10https://gerrit.wikimedia.org/r/165471 [13:28:59] (03PS1) 10Alex Monk: Znuny4OTRS-WikimediaDTL: Unbreak read-only customer ID [software/otrs] - 10https://gerrit.wikimedia.org/r/165472 (https://bugzilla.wikimedia.org/59950) [13:30:41] (03PS3) 10Alexandros Kosiaris: delete sanger SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164692 (owner: 10Dzahn) [13:31:00] (03PS1) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-es-pt] - 10https://gerrit.wikimedia.org/r/165473 [13:34:37] (03PS1) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-pt-ca] - 10https://gerrit.wikimedia.org/r/165475 [13:41:59] NotASpy: Failure of the underlying hardware of one of the servers. Most things have recuperated well but because the webproxy was one of the affected instances it confused a lot of webservices. I'm restarting them all gradually now. [13:55:49] On #wikimedia-tech starting in 5 min: Collection/Book tool/PDF/OCG Bug Day, follow also on https://etherpad.wikimedia.org/BugTriage-Collection [13:57:07] (03PS2) 10Giuseppe Lavagetto: mediawiki: repool mw1163 for scap, convert to hhvm. [puppet] - 10https://gerrit.wikimedia.org/r/165466 [13:57:32] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: repool mw1163 for scap, convert to hhvm. [puppet] - 10https://gerrit.wikimedia.org/r/165466 (owner: 10Giuseppe Lavagetto) [13:57:39] (03CR) 10Giuseppe Lavagetto: [V: 032] mediawiki: repool mw1163 for scap, convert to hhvm. 
[puppet] - 10https://gerrit.wikimedia.org/r/165466 (owner: 10Giuseppe Lavagetto) [14:02:59] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.108 [14:03:25] !log upgrading elastic1001 uncovered a bug in our highlighter that I have yet to diagnose. I removed that server from the rotation so we'll continue to use the old version. [14:03:34] Logged the message, Master [14:05:25] manybubbles: gah, btw apt has 1.3.4 already [14:05:49] godog: it's cool. it's not elasticsearch 1.3.4's fault. It's my highlighter. I'll have to figure out what is up and fix it. [14:06:05] that puts that upgrade on hold for a day or so [14:06:07] ^demon|brb: ^^^ [14:06:29] ^demon|brb: I started the first one this morning because I thought it'd be nice to get a jump on it. I'm unclear why we don't see this problem in dev or beta. [14:06:37] <^demon|brb> Hmm [14:07:02] manybubbles: oh ok, good to know! [14:07:06] and puppet just restarted it.... I thought I disabled puppet [14:07:51] <^demon|brb> manybubbles vs. puppet [14:07:53] <^demon|brb> round 1....fight! [14:08:08] it's `sudo puppet agent --disable` right? [14:09:55] <^demon|brb> Yeah should be? [14:10:03] PROBLEM - ElasticSearch health check for shards on elastic1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.108:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health [14:10:27] ^demon|brb: https://gist.github.com/nik9000/a5f89239a7d215ae61e6 [14:10:30] shut up [14:10:34] icinga [14:10:43] yep that should be it, you can provide a string to --disable to explain why [14:11:06] manybubbles: that might have been a running instance that was sleeping on splay? is that from the logs? [14:11:15] indeed, different pid [14:11:17] ACKNOWLEDGEMENT - ElasticSearch health check for shards on elastic1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.108:9200/_cluster/health error while fetching: Max retries exceeded for url: /_cluster/health manybubbles upgrading elastic1001 uncovered a bug in our highlighter that I have yet to diagnose. I removed that server from the rotation so well continue to use the old version. [14:11:17] <^demon|brb> That was my guess. [14:11:32] or just finishing, anyways [14:11:36] godog: yeah - it's just that --disable didn't catch it in time. cool on the string. next time [14:11:38] <^demon|brb> We should add "shut up icinga" to mutante's bot idea. [14:11:47] <^demon|brb> In addition to "yeah yeah, we know icinga" [14:12:24] not a bad idea! If we could ack the thing from here. Though I don't know about security stuffs [14:12:56] oh really: Caused by: java.lang.ClassNotFoundException: org.wikimedia.highlighter.experimental.lucene.hit.weight.BasicQueryWeigher$TermInfos [14:13:40] git deploy!!!!!! [14:13:41] -rw-r--r-- 1 root root 74 Oct 8 12:33 /srv/deployment/elasticsearch/plugins/experimental-highlighter-elasticsearch-plugin/experimental-highlighter-lucene-0.0.12.jar [14:13:57] <^demon|brb> Yep, looks like a text file again [14:14:24] I totally checked this morning!
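The 74-byte "jar" above is the telltale: git-fat keeps large binaries out of git by checking in a short text stub and pulling the real object at deploy time, so a failed pull leaves a tiny placeholder where the archive should be. A minimal sanity-check sketch, not part of the actual deploy tooling: every jar is a zip archive and begins with the PK\x03\x04 magic bytes, while git-fat stubs (to the best of my knowledge) begin with a '#$# git-fat' header, so a quick scan of the plugins directory would have caught this before the restart.

    #!/usr/bin/env python
    # Sketch: flag deployed .jar files that are git-fat stubs, not real archives.
    import sys

    ZIP_MAGIC = b'PK\x03\x04'    # every jar/zip starts with these bytes
    FAT_MAGIC = b'#$# git-fat'   # assumed git-fat placeholder header

    def check(path):
        with open(path, 'rb') as f:
            head = f.read(12)
        if head.startswith(ZIP_MAGIC):
            return 'ok: real jar'
        if head.startswith(FAT_MAGIC):
            return 'BAD: git-fat stub, real object never pulled'
        return 'BAD: unrecognized content'

    if __name__ == '__main__':
        for path in sys.argv[1:]:
            print('%s: %s' % (path, check(path)))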
[14:14:37] !log hard restarting zuul [14:14:46] Logged the message, Master [14:15:43] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-5/2/1: down - Core: cr1-pmtpa:xe-1/1/0 (GBLX/FPL, CV71028) {#2012} [10Gbps wave]BR [14:16:02] that was the end of tampa [14:16:09] ;_; [14:16:14] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr2-pmtpa:xe-0/0/0 (Level3/FPL, CV71026) {#2008} [10Gbps wave]BR [14:16:14] ಠ_ಠ [14:16:24] ^demon|brb: all the servers got it _but_ elastic1001 [14:16:33] no more worrying about hurricane which might flood a dc [14:16:52] nah, it would only flood the generator room [14:16:58] this fat git deploy is jangly [14:17:02] our servers were 10+ floors up [14:17:05] =] [14:17:15] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [14:17:18] * robh doesn't mind disclosing that in public since they aren't there anymore [14:17:22] heh [14:17:43] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 1, unused: 0 [14:18:02] (03PS1) 10Hashar: zuul: debug gear.Server [puppet] - 10https://gerrit.wikimedia.org/r/165481 [14:18:30] apergos: dataset2 is dead, right? [14:18:36] as in, tampa host that doesn't exist anymore? [14:18:47] may I have the change above merged in please? It tweaks a zuul config file and I already deployed it manually [14:19:00] !log fixed missing elasticsearch extension jar file and brought elastic1001 back up. git fat betrayed us. [14:19:05] RECOVERY - ElasticSearch health check for shards on elastic1001 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 18, unassigned_shards: 0, timed_out: False, active_primary_shards: 2034, cluster_name: production-search-eqiad, relocating_shards: 0, active_shards: 6083, initializing_shards: 14, number_of_data_nodes: 18 [14:19:07] Logged the message, Master [14:20:47] !log disabled puppet on gallium to make sure a zuul config change stick in. {{gerrit|165481}} [14:20:52] Logged the message, Master [14:22:00] that was exciting! I'll brb [14:23:29] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr2-pmtpa:xe-0/0/0 (Level3/FPL, CV71026) {#2008} [10Gbps wave]BR [14:23:53] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-5/2/1: down - Core: cr1-pmtpa:xe-1/1/0 (GBLX/FPL, CV71028) {#2012} [10Gbps wave]BR [14:25:32] (03CR) 10Hashar: [V: 032] "I have deployed the new configuration file on gallium.wikimedia.org (which hosts the Zuul server)." 
[puppet] - 10https://gerrit.wikimedia.org/r/165481 (owner: 10Hashar) [14:25:38] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [14:26:04] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 1, unused: 0 [14:30:00] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 81, down: 0, dormant: 0, excluded: 2, unused: 0 [14:30:29] bblack: thanks a lot for all this network monitoring work [14:30:34] it's going to be extremely helpful [14:32:24] <^demon|brb> manybubbles|brb: Hmmm, elastic1001 didn't get stuck like they did in beta yesterday. We'll see what happens with 1002. [14:32:39] so far [14:36:03] <^demon|brb> yeah, 01 was fine in beta too. [14:36:08] <^demon|brb> it was 02-04 that screwed up [14:40:37] <^demon|brb> godog: You know, I think I am going to go the "one script with subcommands" route. Then it's just one file and I don't have to worry about packaging the shared parts. [14:42:27] ^demon|brb: indeed, self containment is nice! shouldn't be too over-engineered to have a map subcommand -> function too :) [14:42:44] <^demon|brb> Yeah [14:42:54] paravoid: dead as a doornail. shut down and gone. [14:43:08] <^demon|brb> Has it been doused in gasoline and set ablaze yet? [14:43:23] lol [14:43:35] icinga still has alerts about it [14:43:40] so maybe not properly decom'ed yet? [14:43:42] ^demon|brb: Should I try on logstash? ;) [14:43:53] ALL tampa hosts are dead and gone and disconnected [14:44:05] I wanted to say RIP [14:44:12] but the in peace doesn't make sense in some cases [14:44:31] <^demon|brb> Reedy: 1.3.4? [14:44:34] Ja [14:45:06] <^demon|brb> Don't see why not [14:45:23] apergos: ^ [14:45:27] (sorry :) [14:47:50] (03PS1) 10KartikMistry: WIP: apertium service for Beta [puppet] - 10https://gerrit.wikimedia.org/r/165485 [14:48:33] (03CR) 10jenkins-bot: [V: 04-1] WIP: apertium service for Beta [puppet] - 10https://gerrit.wikimedia.org/r/165485 (owner: 10KartikMistry) [14:49:06] I'll be taking care of that later today [14:49:15] host for those services almost ready [14:50:32] if it would make icinga cleanup nicer for me to ack those alerts or something, I can do that [14:50:48] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [14:51:12] marktraceur, manybubbles, ^demon|brb: I'll SWAT today, unless one of you had your heart set on it [14:51:22] anomie: have fun [14:51:26] <^demon|brb> go for it, I'm in the middle of something anyway [14:51:31] anomie: I have a meeting [14:52:00] (03PS3) 10Filippo Giunchedi: swift: remove ganglia stats via ganglia-logtailer [puppet] - 10https://gerrit.wikimedia.org/r/159705 [14:52:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: remove ganglia stats via ganglia-logtailer [puppet] - 10https://gerrit.wikimedia.org/r/159705 (owner: 10Filippo Giunchedi) [14:52:20] (03Draft3) 10Filippo Giunchedi: swift: remove ganglia stats via ganglia-logtailer [puppet] - 10https://gerrit.wikimedia.org/r/159705 [14:53:00] <_joe_> !log repooling mw1163 [14:53:06] Logged the message, Master [14:56:20] RoanKattouw: Ping for SWAT in about 4 minutes [14:56:25] anomie: Pong [15:00:02] (03PS2) 10KartikMistry: WIP: apertium service for Beta [puppet] - 10https://gerrit.wikimedia.org/r/165485 [15:00:04] manybubbles, anomie, ^d, marktraceur: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141008T1500). [15:00:05] RoanKattouw: Doing your change now. [15:01:19] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 38, down: 0, dormant: 0, excluded: 1, unused: 0 [15:04:01] !log upgrading elastic1002 now [15:04:07] Logged the message, Master [15:07:17] (03PS1) 10Reedy: Enable Collection by default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165490 (https://bugzilla.wikimedia.org/71416) [15:07:20] (03CR) 10BryanDavis: Use sync-dir to copy out l10n json files, build cdbs on hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) (owner: 10Reedy) [15:08:12] !log anomie Synchronized php-1.25wmf2/extensions/VisualEditor/lib/ve/src/ce/ve.ce.Surface.js: SWAT: Revert "ve.ce.Surface: Magic workaround for broken Firefox cursoring" [[gerrit:164593]] (duration: 00m 09s) [15:08:12] RoanKattouw: ^ Test please [15:08:16] Logged the message, Master [15:13:27] (03PS1) 10coren: Collect ganglia metrics from NFS servers [puppet] - 10https://gerrit.wikimedia.org/r/165492 [15:13:52] paravoid: when you have a minute mind skimming over https://gerrit.wikimedia.org/r/#/c/160430/ ? I tested in labs and seemed fine change-wise [15:14:05] andrewbogott: Can you give a quick look over ^^? [15:16:02] (03CR) 10Andrew Bogott: [C: 032] Collect ganglia metrics from NFS servers [puppet] - 10https://gerrit.wikimedia.org/r/165492 (owner: 10coren) [15:16:58] Thankee. [15:19:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr2-pmtpa:xe-0/0/0 (Level3/FPL, CV71026) {#2008} [10Gbps wave]BR [15:20:10] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-5/2/1: down - Core: cr1-pmtpa:xe-1/1/0 (GBLX/FPL, CV71028) {#2012} [10Gbps wave]BR [15:22:19] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: /mnt/tmpfs 0 MB (0% inode=99%): [15:22:59] (03CR) 10Reedy: [C: 031] "Ignore my comment above, I fail." 
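For the record, the one-node-at-a-time upgrade implied by the logs above (elastic1001, then elastic1002, ...) hinges on waiting for the cluster to settle between nodes. A minimal sketch of that wait loop, polling the same _cluster/health endpoint the icinga checks quote; the localhost URL is a placeholder, not the production address:

    # Sketch: wait for the Elasticsearch cluster to settle between node upgrades.
    import json
    import time
    import urllib.request

    HEALTH_URL = 'http://localhost:9200/_cluster/health'  # placeholder host

    def cluster_health():
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
            return json.load(resp)

    def wait_for_green(poll_seconds=15):
        while True:
            health = cluster_health()
            print('status=%(status)s unassigned=%(unassigned_shards)s' % health)
            if health['status'] == 'green':
                return health
            time.sleep(poll_seconds)

    if __name__ == '__main__':
        wait_for_green()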
[puppet] - 10https://gerrit.wikimedia.org/r/135544 (owner: 10Filippo Giunchedi) [15:25:14] may I have the change https://gerrit.wikimedia.org/r/#/c/165481/ merged in please? It tweaks a zuul config file. I have deployed it manually and had puppet disabled to prevent it from being overridden [15:25:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [15:25:52] (03PS3) 10Reedy: ship {texvc,texvccheck} via mediawiki-math-texvc [puppet] - 10https://gerrit.wikimedia.org/r/135544 (owner: 10Filippo Giunchedi) [15:25:59] (03CR) 10Reedy: [C: 031] ship {texvc,texvccheck} via mediawiki-math-texvc [puppet] - 10https://gerrit.wikimedia.org/r/135544 (owner: 10Filippo Giunchedi) [15:26:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 1, unused: 0 [15:27:21] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM; perf is very useful and should be installed everywhere." [puppet] - 10https://gerrit.wikimedia.org/r/164883 (owner: 10Ori.livneh) [15:28:49] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr2-pmtpa:xe-0/0/0 (Level3/FPL, CV71026) {#2008} [10Gbps wave]BR [15:29:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-5/2/1: down - Core: cr1-pmtpa:xe-1/1/0 (GBLX/FPL, CV71028) {#2012} [10Gbps wave]BR [15:30:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 1, unused: 0 [15:30:59] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [15:32:32] (03PS3) 10Giuseppe Lavagetto: varnish:qualify vars [puppet] - 10https://gerrit.wikimedia.org/r/161184 (owner: 10Matanya) [15:33:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] ship {texvc,texvccheck} via mediawiki-math-texvc [puppet] - 10https://gerrit.wikimedia.org/r/135544 (owner: 10Filippo Giunchedi) [15:33:43] Reedy: ^ merged [15:33:52] Great [15:33:57] I'll prepare a mw-config patch [15:34:06] But it can wait 24 hours or so to let puppet catch up [15:35:33] // Use the Debian-packaged texvc on Trusty. --Ori, 24-Sept-2014 [15:35:33] $wgTexvc = defined( 'HHVM_VERSION' ) && $wmfRealm === 'production' ? [15:35:33] '/usr/bin/texvc' : '/usr/local/apache/uncommon/bin/texvc'; [15:35:36] ye, hopefully it converges in no longer than two runs so ~40m [15:35:56] Is texvccheck in your new version too? 
$wgMathTexvcCheckExecutable = "/usr/local/apache/uncommon/bin/texvccheck"; [15:36:22] it is [15:36:59] :) [15:37:29] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: puppet fail [15:38:05] (03PS1) 10Reedy: Always use debian packaged texvc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165493 [15:38:59] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: puppet fail [15:39:10] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: puppet fail [15:39:10] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: puppet fail [15:39:30] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: puppet fail [15:39:38] yes yes, fix underway [15:39:46] (03PS1) 10Filippo Giunchedi: mediawiki: mediawiki-math-texvc is in precise-wikimedia too [puppet] - 10https://gerrit.wikimedia.org/r/165496 [15:39:48] (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM http://puppet-compiler.wmflabs.org/413/change/161184/html/" [puppet] - 10https://gerrit.wikimedia.org/r/161184 (owner: 10Matanya) [15:40:34] Reedy: ^ https://gerrit.wikimedia.org/r/#/c/165496 [15:40:55] (03CR) 10Reedy: [C: 031] mediawiki: mediawiki-math-texvc is in precise-wikimedia too [puppet] - 10https://gerrit.wikimedia.org/r/165496 (owner: 10Filippo Giunchedi) [15:40:56] Haha [15:41:34] (03PS2) 10Filippo Giunchedi: mediawiki: mediawiki-math-texvc is in precise-wikimedia too [puppet] - 10https://gerrit.wikimedia.org/r/165496 [15:41:40] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] mediawiki: mediawiki-math-texvc is in precise-wikimedia too [puppet] - 10https://gerrit.wikimedia.org/r/165496 (owner: 10Filippo Giunchedi) [15:42:14] bd808 or ori, is it possible that wikitech's cache has a maximum life of one day that overrides our request for longer lives? [15:42:30] Nova tokens are expiring all over the place. In theory they are timed to last just as long as mediawiki sessions. [15:44:08] * bd808 doesn't know but bets that Reedy can figure it out ^ [15:45:05] andrewbogott: Where are they stored? [15:45:07] memcached? [15:45:17] right [15:45:45] $key = wfMemcKey( 'openstackmanager', 'fulltoken', $username ); [15:45:45] // Expiration time is unneccessary. Token expiration is expected [15:45:45] // to be longer than MediaWiki's token, so a re-auth will occur [15:45:45] // before the generic token expires. [15:45:46] $wgMemc->set( $key, $this->token ); [15:46:09] That's the one. [15:46:21] So my question is -- is memcache just expiring them anyway, due to a default setting? [15:46:42] Else is it full, and expiring "old" data or similar? [15:46:49] The issues (nova auth expiring all over the place) seem to have started roughly when y'all last fiddled with the memcache setup. [15:47:17] How would I tell if it's full? [15:47:24] $wgObjectCacheSessionExpiry = 3600; [15:47:28] 60 minutes? [15:47:40] bah, where is that set? [15:47:49] That's DefaultSettings [15:47:50] (Here's where you say, 'you did that') [15:48:00] just looking if we override it in config [15:48:01] Oh! So default for all wikis everywhere is 60 minutes? [15:48:06] That would totally cause what I'm seeing. 
[15:48:39] We don't seem to override it [15:48:49] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:48:55] So there's 2 ways of fixing it [15:49:10] Update OSM to set it the same as session [15:49:15] Or we just override it in MW config [15:49:34] I'm just wondering if the latter would have unintended consequences (other stuff MW expects to have a "short" TTL) [15:50:19] How long does a session last? [15:51:24] andrewbogott: Did we set it to something else in the old wikitech config? [15:51:40] Reedy: Probably we didn't set it at all. I'll look. [15:52:00] RECOVERY - Disk space on ocg1002 is OK: DISK OK [15:52:26] ObjectCacheSessionExpiry is unset in the old config [15:53:11] * The expiry time to use for session storage when $wgSessionsInObjectCache is [15:53:11] * enabled, in seconds. [15:54:13] Anyway, I'm totally fine with just setting a long expiry time in OSM. Give me a moment... [15:54:13] That doesn't actually set for general memcached [15:55:41] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr2-pmtpa:xe-0/0/0 (Level3/FPL, CV71026) {#2008} [10Gbps wave]BR [15:55:59] Reedy: wait, does that mean that the 60m expiry is moot since expiration isn't enabled at all? [15:56:10] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-5/2/1: down - Core: cr1-pmtpa:xe-1/1/0 (GBLX/FPL, CV71028) {#2012} [10Gbps wave]BR [15:56:53] andrewbogott: I'm wondering if the expiry of 0 (default) is handled differently by the 2 different clients [15:57:39] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:57:40] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [15:57:59] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:58:31] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:58:36] http://php.net/manual/en/memcached.expiration.php [15:58:41] "If the expiration value is 0 (the default), the item never expires (although it may be deleted from the server to make place for other items)." [15:58:47] (03CR) 10Filippo Giunchedi: [C: 04-1] Always use debian packaged texvc (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165493 (owner: 10Reedy) [16:01:22] Reedy: so… https://gerrit.wikimedia.org/r/#/c/165501/ ? [16:01:42] andrewbogott: Nope. That's the same as current behaviour [16:02:30] Ah, I misunderstood your last comment. [16:02:41] So you might just be best setting it to 30 days [16:02:50] So there's no value for 'never expire,' I just have to pick a big number [16:02:50] 60*60*24*30 [16:03:04] "if the expiration value is larger than that, the server will consider it to be real Unix time value rather than an offset from current time." 
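The rule Reedy quotes is part of the memcached protocol itself, not a quirk of the PHP client: an expiry of 0 means "never expires" but the item can still be evicted under memory pressure (the nova-token failure mode seen here), a value up to 30 days is treated as a relative offset, and anything larger is read as an absolute Unix timestamp. A minimal sketch of the distinction; the helper name is illustrative, not taken from OpenStackManager:

    # Sketch: normalize a "keep for N seconds" TTL into what memcached expects.
    import time

    THIRTY_DAYS = 60 * 60 * 24 * 30  # memcached's relative/absolute cutoff

    def memcached_expiry(ttl_seconds):
        if ttl_seconds == 0:
            # "never expires", but still evictable to make room for other data
            return 0
        if ttl_seconds <= THIRTY_DAYS:
            return ttl_seconds                    # treated as a relative offset
        return int(time.time()) + ttl_seconds     # treated as absolute Unix time

    # e.g. cache a token for 30 days instead of relying on "no expiry":
    # mc.set(key, token, memcached_expiry(THIRTY_DAYS))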
[16:03:09] (03PS4) 10ArielGlenn: decom dataset2, replace with dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/164233 (owner: 10Dzahn) [16:03:11] (03PS2) 10Filippo Giunchedi: zuul: debug gear.Server [puppet] - 10https://gerrit.wikimedia.org/r/165481 (owner: 10Hashar) [16:03:15] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] zuul: debug gear.Server [puppet] - 10https://gerrit.wikimedia.org/r/165481 (owner: 10Hashar) [16:04:08] (03CR) 10ArielGlenn: [C: 032] decom dataset2, replace with dataset1001 [puppet] - 10https://gerrit.wikimedia.org/r/164233 (owner: 10Dzahn) [16:05:00] Reedy: https://gerrit.wikimedia.org/r/#/c/165501/2/nova/OpenStackNovaController.php [16:06:00] yup [16:06:11] should stop memcached deciding to evict it whenever it wants to [16:07:04] Hm, I just missed morning SWAT, huh? [16:07:22] heh, yup [16:07:40] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 1, unused: 0 [16:07:42] !log elasticsearch upgraded on logstash[ [16:07:45] fail [16:07:48] Logged the message, Master [16:07:57] !log elasticsearch upgraded on logstash100[23] to 1.3.4 [16:08:02] Logged the message, Master [16:08:19] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 224, down: 0, dormant: 0, excluded: 0, unused: 0 [16:09:15] !log elasticsearch upgraded on logstash1001 to 1.3.4 [16:09:19] Logged the message, Master [16:10:36] Reedy: can I get a +1 so that that patch is visibly blessed come afternoon swat? [16:11:09] +1s are sticky now :) [16:11:20] Oh, you already did, thanks [16:19:30] (03PS1) 10ArielGlenn: ms1001 takes over job of gone dataset2 as secondary for dumps [puppet] - 10https://gerrit.wikimedia.org/r/165502 [16:22:17] (03PS1) 10Reedy: Install logstash-contrib too [puppet] - 10https://gerrit.wikimedia.org/r/165503 [16:23:20] (03CR) 10Reedy: [C: 04-1] "To come after RT #8600" [puppet] - 10https://gerrit.wikimedia.org/r/165503 (owner: 10Reedy) [16:23:23] (03PS2) 10ArielGlenn: ms1001 takes over job of gone dataset2 as secondary for dumps [puppet] - 10https://gerrit.wikimedia.org/r/165502 [16:23:31] (03CR) 10Hashar: "Thank you. 
I have reenabled puppet on gallium" [puppet] - 10https://gerrit.wikimedia.org/r/165481 (owner: 10Hashar) [16:28:04] (03CR) 10ArielGlenn: [C: 032] ms1001 takes over job of gone dataset2 as secondary for dumps [puppet] - 10https://gerrit.wikimedia.org/r/165502 (owner: 10ArielGlenn) [16:37:30] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 222, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/1: down - Core: cr2-pmtpa:xe-0/0/0 (Level3/FPL, CV71026) {#2008} [10Gbps wave]BR [16:38:11] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-5/2/1: down - Core: cr1-pmtpa:xe-1/1/0 (GBLX/FPL, CV71028) {#2012} [10Gbps wave]BR [16:42:17] (03PS1) 10ArielGlenn: datasets: remove obsoleted script rsync-dumps.sh [puppet] - 10https://gerrit.wikimedia.org/r/165509 [16:43:18] (03CR) 10ArielGlenn: [C: 032] datasets: remove obsoleted script rsync-dumps.sh [puppet] - 10https://gerrit.wikimedia.org/r/165509 (owner: 10ArielGlenn) [16:44:10] <_joe_> !log doing some load testing on the hhvm servers [16:44:14] Logged the message, Master [16:45:15] (03PS7) 10Reedy: Use sync-dir to copy out l10n json files, build cdbs on hosts [puppet] - 10https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) [16:45:21] (03CR) 10jenkins-bot: [V: 04-1] Use sync-dir to copy out l10n json files, build cdbs on hosts [puppet] - 10https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) (owner: 10Reedy) [16:45:43] (03PS8) 10Reedy: Use sync-dir to copy out l10n json files, build cdbs on hosts [puppet] - 10https://gerrit.wikimedia.org/r/158623 (https://bugzilla.wikimedia.org/70443) [16:47:12] (03PS4) 10Reedy: Remove sync-l10nupdate(-1)? [puppet] - 10https://gerrit.wikimedia.org/r/158624 [16:48:31] (03PS1) 10ArielGlenn: datasets: re-enable rsync between peers [puppet] - 10https://gerrit.wikimedia.org/r/165510 [16:50:09] (03CR) 10ArielGlenn: [C: 032] datasets: re-enable rsync between peers [puppet] - 10https://gerrit.wikimedia.org/r/165510 (owner: 10ArielGlenn) [17:04:41] <_joe_> !log load testing done [17:04:49] Logged the message, Master [17:09:01] (03CR) 10Ottomata: [C: 031] Include ganglia in standard only for production [puppet] - 10https://gerrit.wikimedia.org/r/165360 (owner: 10Yuvipanda) [17:09:01] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [17:09:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 203, down: 0, dormant: 0, excluded: 1, unused: 0 [17:16:07] (03PS1) 10BBlack: parsoidecache: do not coalesce only-if-cached reqs [puppet] - 10https://gerrit.wikimedia.org/r/165511 [17:20:11] bblack: Ye Olde Parsoide Cache? :P [17:22:10] (03PS1) 10Ottomata: Hadoop fairscheduler queue change - remove 'adhoc' queue, rename 'standard' to 'essential'. [puppet] - 10https://gerrit.wikimedia.org/r/165512 [17:27:05] RoanKattouw: :p Don't mock me, English is my second language! [17:27:09] (the first being C) [17:27:53] RoanKattouw, ESL speaker correcting EFL Wikimedians' English since 2007 [17:28:10] heh [17:28:12] (03CR) 10Gage: [C: 031] Hadoop fairscheduler queue change - remove 'adhoc' queue, rename 'standard' to 'essential'. 
[puppet] - 10https://gerrit.wikimedia.org/r/165512 (owner: 10Ottomata) [17:34:34] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [17:34:47] <_joe_> mmmh [17:39:57] AaronSchulz: We're waiting on you to run a script that should fix some errors following the swift outage from a few weeks back - are you aware of that/what's the status? [17:40:12] (03PS2) 10ArielGlenn: remove dataset2 [dns] - 10https://gerrit.wikimedia.org/r/164255 (owner: 10Dzahn) [17:40:50] (03CR) 10ArielGlenn: "just rebased it." [dns] - 10https://gerrit.wikimedia.org/r/164255 (owner: 10Dzahn) [17:41:26] marktraceur: looks like it finished [17:41:51] AaronSchulz: OK, I'll mark that card done then [17:42:06] missing_files-commons-all.list in my home dir [17:42:10] ~17600 files [17:42:17] some of them are just revdeleted files though [17:42:26] e.g. http://en.wikipedia.org/wiki/File:%E2%80%9CColonel_Gustafsson%E2%80%9D,_former_Gustav_IV_Adolf,_King_of_Sweden_1792-1809_-_Google_Art_Project.jpg [17:44:34] AaronSchulz: Cool. I marked the SoS card as done; the MM card is now "in development" but will probably get marked done this week. [17:46:26] _joe_: https://gerrit.wikimedia.org/r/#/c/164883/ -- ohhhhh, come on [17:50:04] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [17:51:03] ori: he has a point, I run into the perf kernel version thing all the time (we're rarely on the latest kernel) [17:51:19] ori: grab kernel version from facter, and ensure => installed for perf package for that version? [17:51:49] sure, okay [17:51:51] it may not exist [17:51:56] in which case puppet will fail [17:52:01] ? [17:52:12] what may not exist? [17:52:14] not every kernel image in apt has a corresponding perf package [17:52:21] corresponding linux tools package i mean [17:52:47] it should, that seems borky [17:53:08] We have servers not rebooted in 700+ days... those run "ancient" kernels [17:53:25] well sure those are on older distros though [17:53:35] we could do the perf thing in general only on trusty [17:53:43] see PS1 of that patch [17:53:45] (since everything should be going trusty soon anyways) [17:53:52] (but with explicit versions) [17:53:56] heh [17:53:59] ok, will amend [17:53:59] sec [17:54:05] thanks for looking:) [17:54:34] (03CR) 10Dzahn: "Ariel: thanks! i think with that one we can now close the "what's left in Tampa" tracking ticket. woohoo" [puppet] - 10https://gerrit.wikimedia.org/r/164233 (owner: 10Dzahn) [17:55:48] ori: (also, perf gets more useful with various -dbg packages, but not sure how/if we want to handle that part... maybe at least also grab libc-dbg? I donno.) [17:56:12] yeah, i think we should make a point of installing debug symbols for major software packages [17:56:19] in fact i think faidon and tim agreed on it a while ago [17:56:23] but i may be misremembering [17:56:44] the hhvm role installs debug symbols [17:56:44] seems reasonable. it's not like anyone's going to argue "we can't afford the rootfs disk space for debug symbols" or something. [17:57:14] so yeah ignore it for the perf patch. but we should probably look at doing whatever it takes to get most packages +dbg [17:57:24] (is there a way to configure apt to always grab debug for everything?) [17:58:38] i don't think so. 
could be done in puppet [17:59:08] (03CR) 10RobH: "anytime we delete a certificate, please make an independent core-ops ticket to invalidate the certificate on the provider level, and assig" [puppet] - 10https://gerrit.wikimedia.org/r/164001 (owner: 10Dzahn) [17:59:23] there isn't and not all packages provide -dbg packages [17:59:30] it's a manual process to create those, unfortunately [17:59:35] i.e. the maintainer has to do it [17:59:44] there's tooling around it, but it doesn't happen by default [17:59:57] Ubuntu tried to fix this with a concept called ddebs back in 2007 or 2008 or so [18:00:04] yurik: Dear anthropoid, the time has come. Please deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141008T1800). [18:00:11] it sucks that there isn't a flag that means "for each package X, also install X-dbg, if such a package exists" [18:00:30] I don't think it went anywhere [18:00:36] yurikR1: happy? ;) [18:00:44] greg-g, very :) [18:00:50] https://wiki.debian.org/AutomaticDebugPackages [18:03:33] (03CR) 10RobH: [C: 04-2] "We use these with a third party, hence we should keep them in the repo forever. I have the suspicion this is the second time this has com" [puppet] - 10https://gerrit.wikimedia.org/r/164698 (owner: 10Dzahn) [18:06:08] greg-g, btw, funny enough, i think i will skip this week's, because we were mostly onboarding jeff, and his stuff is not ready yet. But next week will be (hopefully) a big depl [18:06:32] (03PS6) 10Ori.livneh: base::standard-packages: install `perf` [puppet] - 10https://gerrit.wikimedia.org/r/164883 [18:09:06] (03PS1) 10GWicke: Reduce Parsoid job processing parallelism further [puppet] - 10https://gerrit.wikimedia.org/r/165523 [18:12:13] (03CR) 10BBlack: [C: 032] Reduce Parsoid job processing parallelism further [puppet] - 10https://gerrit.wikimedia.org/r/165523 (owner: 10GWicke) [18:13:11] (03PS2) 10BBlack: parsoidecache: fix up various only-if-cached infelicities [puppet] - 10https://gerrit.wikimedia.org/r/165511 [18:16:04] (03PS1) 10KartikMistry: WIP: Added initial Debian package [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/165528 [18:16:16] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet last ran 134958 seconds ago, expected 14400 [18:17:16] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:19:59] (03CR) 10GWicke: [C: 031] "Brandon, a huge thanks for tracking down these issues & plugging the holes! I learned a lot about Varnish request processing by following " [puppet] - 10https://gerrit.wikimedia.org/r/165511 (owner: 10BBlack) [18:24:39] (03PS3) 10BBlack: parsoidecache: fix up various only-if-cached infelicities [puppet] - 10https://gerrit.wikimedia.org/r/165511 [18:27:40] <_joe_> bblack: the vcl_pass thing I did get [18:28:39] (03PS1) 10Rush: phab add irc alias field for profiles [puppet] - 10https://gerrit.wikimedia.org/r/165532 [18:29:27] (03CR) 10Dzahn: [C: 032] "confirmed down" [dns] - 10https://gerrit.wikimedia.org/r/164255 (owner: 10Dzahn) [18:32:06] (03CR) 10Dzahn: [C: 031] "nice" [puppet] - 10https://gerrit.wikimedia.org/r/165532 (owner: 10Rush) [18:36:13] (03CR) 10Aklapper: "Nice, yeah! Looks like all other items use capitalization ("User Since", "MediaWiki Userpage") so I'd turn 'IRC alias' into 'IRC Alias'.
(" [puppet] - 10https://gerrit.wikimedia.org/r/165532 (owner: 10Rush) [18:42:35] (03PS2) 10Rush: phab add irc alias field for profiles [puppet] - 10https://gerrit.wikimedia.org/r/165532 [18:43:01] (03CR) 10Rush: [C: 032 V: 032] phab add irc alias field for profiles [puppet] - 10https://gerrit.wikimedia.org/r/165532 (owner: 10Rush) [18:43:03] (03PS4) 10BBlack: parsoidecache: fix up various only-if-cached infelicities [puppet] - 10https://gerrit.wikimedia.org/r/165511 [18:49:35] Reedy: Are you going to need these logstash packages on trusty as well? Will the same packages work there? [18:50:34] andrewbogott: We will need them on trusty at some point... "Upgrading"/reinstalling the boxes to trusty is on my near term TODO list [18:50:44] They should work [18:50:51] ok -- I'll copy these packages over, it'll be up to you to verify that they're useful. [18:51:27] yeah, trusty logstash will be tested in labs [18:51:41] gwicke: do you think the parallelism reduction pushed earlier is probably enough? or is it just a guess? [18:51:43] we've already tried the packages on beta logstash [18:53:42] bblack: it's a rough back-of-envelope calculation based on the ratio of template updates to other jobs & the timout delay so far [18:53:58] so not exact science, but should be in the vicinity [18:55:09] ok I'm gonna give it another shot manually (with the updated patch) and see how it goes before merging the changes for real [18:58:52] yurikR1: oh good (re big deploy) ;) [18:59:38] gwicke: running new patch as of now [19:01:18] peaks up to 40% right now: https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Parsoid+eqiad&m=cpu_report&s=by+name&mc=2&g=cpu_report [19:01:24] yeah turning it back off [19:01:37] steady-state should be between 30 & 40% [19:01:56] well I was just looking at the spiking loadavg [19:02:06] also, the API is at 133% again [19:02:32] yeah that's what I meant, spiking load avg on API cluster. So backing the change out again. [19:03:38] let me reduce the parallelism a bit further [19:03:45] FWIW, though, the newest version of the patch did kill ETIMEDOUT and looks efficient [19:04:27] that's great! [19:05:19] (03PS1) 10GWicke: Reduce Parsoid job parallelism slightly further [puppet] - 10https://gerrit.wikimedia.org/r/165536 [19:05:46] that's great! [19:06:28] (03CR) 10Yuvipanda: [C: 031] "Good idea!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/145174 (owner: 10Reedy) [19:06:38] https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=API+application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=load_report [19:06:54] gwicke: the spikes currently at the start and near the end of that graph are the two test runs, basically [19:07:19] yup [19:07:20] it doesn't seem like even the first reduction of parallelism had a huge impact on the spikiness (the second test was much shorter, it probably would've gone higher if not stopped) [19:08:30] the jobs sped up by this patch are primarily template updates [19:08:55] which are hitting the API more than other requests (where template expansions are reused) [19:09:30] (03PS2) 10Reedy: Enable Flow on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/145174 [19:10:14] bblack: I went from 7 to 5 in https://gerrit.wikimedia.org/r/165536 [19:10:21] ok [19:10:29] that's almost 30% less [19:10:46] (03CR) 10BBlack: [C: 032] Reduce Parsoid job parallelism slightly further [puppet] - 10https://gerrit.wikimedia.org/r/165536 (owner: 10GWicke) [19:11:05] will give that time to take effect, then try again [19:15:01] k [19:23:42] 5 instead of 21 [19:23:44] wow :) [19:23:48] so, what changed then? [19:23:52] besides today's fix [19:24:37] 1) reduced timeout for only-if-cached varnish requests from 60s to 10s on the parsoid end, 2) solved timeout issues on the varnish end [19:25:31] so we unthrottled parsoid by fixing timeouts, but now need to throttle more aggressively in the job queue to compensate [19:25:34] (03PS1) 10Dereckson: Planet update [puppet] - 10https://gerrit.wikimedia.org/r/165558 [19:27:01] fwiw, it looks like there is a good amount of wikidata activity through the API recently [19:27:18] wbsetlabel especially [19:27:24] https://gdash.wikimedia.org/dashboards/apimethods/ [19:27:47] also wbsetsitelink [19:30:06] (03CR) 10Pleclown: [C: 031] Planet update [puppet] - 10https://gerrit.wikimedia.org/r/165558 (owner: 10Dereckson) [19:31:53] paravoid: (2) is https://gerrit.wikimedia.org/r/165511 [19:32:06] I saw [19:32:31] aka "varnish please stop trying to be varnish, we need you to be something else" :) [19:33:01] yeah [19:33:11] it's all an elaborate plan by gwicke [19:33:17] so that we can stop worrying about restbase [19:33:24] and welcome it with open arms :) [19:33:31] :-) [19:33:51] How I learned to stop worrying and love the restbase? [19:33:53] for that I should actually create some drama & show how badly we need to replace Varnish [19:34:08] [19:34:22] (03CR) 10Dzahn: Planet update (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/165558 (owner: 10Dereckson) [19:34:42] gwicke probably has a nodejs http cache service in development [19:34:47] i don't even want to guess its name [19:35:03] 'parsoid' [19:35:04] cachoid [19:35:07] cachely.io [19:35:09] nah [19:35:16] mine was funniest [19:35:26] ori: {{coi}} [19:35:32] once Go becomes faster than C, then maybe ;) [19:35:42] spdoyd [19:35:46] ideally even faster than c [19:36:25] it's probably already faster than C for many common implementation tasks just by the better concurrency patterns it reinforces in the source authors [19:36:49] bblack: you are really pithy sometimes, you should write a blog [19:36:52] (not sarcastic) [19:37:16] :p [19:37:28] bblack: yup; definitely if you compare similar dev effort on both ends [19:38:10] but there is pretty fine software that we just get to use [19:38:37] functionally-fine.
just don't try to read the source code and understand it. [19:38:50] (03CR) 10Alexandros Kosiaris: [C: 032] delete sanger SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164692 (owner: 10Dzahn) [19:39:16] (03CR) 10ArielGlenn: "I see that as of Oct 5 /var/log/apache/other_vhosts_access.log has rather a lot of stuff in it. Example: on mw1191 that file is 11 GB in " [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [19:39:16] bblack: *nod* [19:39:33] if(v_o->x.y[z->obj].a == &v->alpha) { FROB; } [19:39:39] ^ typical varnish source line [19:40:01] it wouldn't include the { } [19:40:09] lol [19:41:03] (03CR) 10Pleclown: Planet update (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/165558 (owner: 10Dereckson) [19:43:22] gwicke: is there any delay beyond puppet's own delays in the concurrency thing for the APIs taking effect I should worry about? [19:44:06] bblack: not that I know of [19:44:15] my last adjustments were all picked up within an hour [19:44:54] keep in mind though that the job queue is not the only thing hitting parsoid [19:45:59] PDF rendering for example is also hitting the Parsoid cluster [19:46:48] !log Created flow tables on officewiki [19:46:54] Logged the message, Master [19:47:42] gwicke: trying again... [19:49:17] !log logstash upgraded to 1.4.2-1 on logstash100[1-3] [19:49:22] Logged the message, Master [19:50:22] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor stuff mostly. The idea and implementation seem sound to me." (033 comments) [software] - 10https://gerrit.wikimedia.org/r/160428 (owner: 10Filippo Giunchedi) [19:50:52] looks much better so far, it spiked almost immediately last time [19:50:59] (the API load) [19:54:57] it's still a net increase on the API cluster, but not very dramatic. I'm not sure at what level we care. [19:55:17] (or if we're trying to aim for offsetting the ETIMEDOUT fix for a net zero change in parsoid's load on the API server?) [20:00:04] gwicke, cscott, subbu: Dear anthropoid, the time has come. Please deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141008T2000). [20:07:22] (03CR) 10Manybubbles: "I didn't realize this was an option!" [puppet] - 10https://gerrit.wikimedia.org/r/165413 (owner: 10Chad) [20:08:08] greg-g, we aren't deploying today. see some things in rt-testing we are still looking at. [20:08:49] bblack: at the current rate we might want to reduce the load even further [20:11:24] (03PS1) 10GWicke: And another decrease in Parsoid job parallelism [puppet] - 10https://gerrit.wikimedia.org/r/165596 [20:13:14] bblack: ^^ [20:13:39] subbu: kk [20:15:14] (03CR) 10BBlack: [C: 032] And another decrease in Parsoid job parallelism [puppet] - 10https://gerrit.wikimedia.org/r/165596 (owner: 10GWicke) [20:16:02] (03PS5) 10BBlack: parsoidecache: fix up various only-if-cached infelicities [puppet] - 10https://gerrit.wikimedia.org/r/165511 [20:16:08] (03CR) 10BBlack: [C: 032 V: 032] parsoidecache: fix up various only-if-cached infelicities [puppet] - 10https://gerrit.wikimedia.org/r/165511 (owner: 10BBlack) [20:17:23] [warning/api][ruwiki/Участник:Arystanbek/Бүгінгі_өсім?oldid=56573014] Failed API request, {"error":{"code":"ETIMEDOUT"},"retries-remaining":0} [20:17:39] ^ that's the only one I've seen make it to zero on wtp1009 since the change (it did count down from 8) [20:19:01] bblack those are not varnish requests. requests to the mw api and they have 8 retries. [20:19:41] so, we should distinguish our logging stmts between them as well. 
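An aside on the logging question just above: until the statements themselves are split, the two timeout classes can be told apart by their retry budget, since the MW API requests count down from 8 retries while the only-if-cached Varnish requests carry none. A minimal sketch, assuming the warnings are aggregated into a flat file (the log path here is hypothetical):

```bash
# Tally "Failed API request" warnings by retries remaining; entries stuck at
# "retries-remaining":0 with no countdown are the Varnish only-if-cached class,
# everything counting down from 8 is a MediaWiki API timeout.
grep 'Failed API request' /var/log/parsoid/parsoid.log \
  | grep -o '"retries-remaining":[0-9]*' \
  | sort | uniq -c | sort -rn
```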
[20:20:49] ah [20:25:46] <^demon|brb> [2014-10-08 20:22:26,573][ERROR][service.graphite ] [deployment-elastic01] Graphite reporting disabled, no graphite host configured [20:25:51] <^demon|brb> YuviPanda: ^ :D [20:26:02] <^demon|brb> Hey, at least we're getting the plugin loaded. Just need my puppet change merged now [20:26:06] ^demon|brb: :D [20:26:07] yay [20:26:11] ^demon|brb: bribe someone to merge it? [20:26:16] * YuviPanda can't merge things for another 3 weeks [20:26:37] <^demon|brb> I put ottomata on the change along with you and Nik [20:27:29] ^demon|brb: andrewbogott is also usually nice to put on the list for changes that go to labs [20:28:29] (03PS1) 10Chad: Gerrit: explicitly whitelist image formats we want to display [puppet] - 10https://gerrit.wikimedia.org/r/165602 [20:29:28] Hi all, which version of udplog are you using with trusty? [20:30:07] sorry udp2log [20:30:20] <^demon|brb> YuviPanda: Well, there's two pings. I'd rather not keep pinging others since it's not urgent :p [20:30:28] ^demon|brb: hehe :) [20:32:39] renoirb: presumably whatever is in our apt repo [20:32:42] (03CR) 10Cscott: "gwicke says: Extension:Parsoid is configured "basically everywhere either flow or VE is enabled; pretty much everywhere these days". Is t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165490 (https://bugzilla.wikimedia.org/71416) (owner: 10Reedy) [20:32:44] :) [20:32:59] it's public, so you can use it too and check [20:32:59] ... I wish I could find that apt repo :/ [20:33:09] ubuntu.wikimedia.org [20:33:11] * renoirb is searching all over the place. [20:33:12] oh! [20:33:14] right :) [20:35:06] (03CR) 10Ottomata: [C: 031] "Cool that works, eh?" [puppet] - 10https://gerrit.wikimedia.org/r/165413 (owner: 10Chad) [20:35:30] (03CR) 10Reedy: "'wmgUseParsoid' => array(" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165490 (https://bugzilla.wikimedia.org/71416) (owner: 10Reedy) [20:37:58] ottomata: wanna +2 as well? https://gerrit.wikimedia.org/r/#/c/165413/ ;) [20:38:13] hmm, ok! [20:38:19] i can't watch it though [20:38:24] ottomata: it's only on betalabs [20:38:25] ^d you around to watch it? [20:38:27] OH! [20:38:28] ok [20:38:34] (03PS2) 10Ottomata: Config for graphite plugin for Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/165413 (owner: 10Chad) [20:38:39] (03CR) 10Ottomata: [C: 032] Config for graphite plugin for Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/165413 (owner: 10Chad) [20:39:47] merged. [20:40:40] ottomata: ty [20:41:22] (03CR) 10Cscott: "And I just pushed https://gerrit.wikimedia.org/r/165603 to allow Parsoid to support oldwikisource. There might be other missing pieces he" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165490 (https://bugzilla.wikimedia.org/71416) (owner: 10Reedy) [20:42:19] <^demon|brb> ottomata: I am :) [20:44:29] (03CR) 10Reedy: "I'm sure we will get people complaining if it doesn't work :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/165490 (https://bugzilla.wikimedia.org/71416) (owner: 10Reedy) [20:45:06] (03CR) 10QChris: [C: 031] Hadoop fairscheduler queue change - remove 'adhoc' queue, rename 'standard' to 'essential'. [puppet] - 10https://gerrit.wikimedia.org/r/165512 (owner: 10Ottomata) [20:48:09] !log currently running /home/legoktm/fixBug71749.php on terbium [20:48:15] Logged the message, Master [20:48:46] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
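For the udp2log hunt that continues below, one way to avoid grepping suite after suite is to ask apt which of the configured repositories publish the package at all. A sketch, assuming the wikimedia apt sources renoirb added are already in sources.list (apt-cache madison and apt-cache policy are stock apt tools):

```bash
# List every version/suite the configured repos publish for these packages;
# empty output on a trusty host means nobody has rebuilt them for trusty yet.
apt-cache madison udplog udp2log
apt-cache policy udplog
```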
[20:48:53] (03PS27) 10Alexandros Kosiaris: Citoid puppetization [puppet] - 10https://gerrit.wikimedia.org/r/163068 (owner: 10Catrope) [20:49:16] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:51:37] (03CR) 10Alexandros Kosiaris: [C: 032] Citoid puppetization [puppet] - 10https://gerrit.wikimedia.org/r/163068 (owner: 10Catrope) [20:52:17] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [20:52:30] Reedy, there’s no udp2log, nor udplog in any trusty-backports, trusty-proposed trusty-security universe multiverse etc. [20:52:31] :( [20:52:47] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [20:52:49] It's probable none of the servers have been upgraded [20:52:52] ottomata: merged your change as well [20:53:29] And hence, no one has rebuilt the packages [20:53:31] <^d> Thanks akosiaris [20:53:42] renoirb: I'd be slightly surprised if the package didn't work anyway [20:53:42] <^d> (I was wondering why it hadn't showed up yet :) [20:54:07] (03CR) 10Jforrester: [C: 04-1] "This would necessarily leak individual page names from a private wiki into public config…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/145174 (owner: 10Reedy) [20:54:27] ^d: :-) [20:54:30] Reedy, I just added all wikimedia.org apt variants for trusty and nothing shows up in relation to udplog or udp2log [20:54:57] try precise? [20:55:02] Ryan_Lane had this installed under precise and lucid.... my new cloud provider doesn’t have any other ubuntu version than trusty. Got to have something :( [20:55:25] srsly [20:55:31] just take the package and install it [20:55:44] akosiaris: ? [20:55:57] oh hm, i hit yes [20:56:02] Reedy, trying it now. [20:56:03] must have had another character in there [20:56:06] thanks [20:56:52] scp-ed udplog_1.8-5~precise_amd64.deb libboost-program-options1.46.1_1.46.1-7ubuntu3_amd64.deb from a precise server from apt cache. worked. [20:56:59] fair enough, Reedy, thx :) [21:01:44] (03PS2) 10Alexandros Kosiaris: Add citoid module to sca1001 and sca1002 [puppet] - 10https://gerrit.wikimedia.org/r/164758 (owner: 10Catrope) [21:03:10] (03CR) 10Alexandros Kosiaris: [C: 032] Add citoid module to sca1001 and sca1002 [puppet] - 10https://gerrit.wikimedia.org/r/164758 (owner: 10Catrope) [21:07:21] PROBLEM - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 3 failures [21:07:23] ^d: do you know if the patch was applied? [21:07:59] <^d> Should've been [21:10:21] hm, mark, yt? i just realized that ldap servers changed, [21:10:26] need a hole poked in the analytics ACL [21:15:10] AaronSchulz: I am seeing some jobs in the queue that were inserted in April (!) [21:16:41] (03PS1) 10Alexandros Kosiaris: Specify citoid/deploy as package name [puppet] - 10https://gerrit.wikimedia.org/r/165614 [21:16:48] ottomata: I can do it. What do you need ? [21:16:54] AaronSchulz: is there a way to clear jobs of a specific type? [21:17:01] akosiaris: https://rt.wikimedia.org/Ticket/Display.html?id=4433 [21:17:06] just updated that ticket at the bottom with what I need [21:17:25] gwicke: old ones or all of them? [21:17:35] old ones if possible [21:18:04] (03CR) 10Alexandros Kosiaris: [C: 032] Specify citoid/deploy as package name [puppet] - 10https://gerrit.wikimedia.org/r/165614 (owner: 10Alexandros Kosiaris) [21:18:53] oh akosiaris i think we use ldaps [21:18:58] but both should be open?
hm [21:19:27] I hope we use ldap + starttls [21:19:33] otherwise I will be sad [21:19:54] i do not know! all i know is this is the url i'm pointing at [21:19:54] ldap_url="ldaps://ldap-eqiad.wikimedia.org ldaps://ldap-codfw.wikimedia.org" [21:20:16] ldaps, not ldap+STARTTLS ... [21:20:21] so port 636, not 389 ... [21:20:27] * akosiaris sad [21:20:35] AaronSchulz: either those jobs aren't processed in fifo order, or they are retried somehow [21:20:48] anyway, looking into it ottomata [21:20:53] akosiaris: can I just change that on my end? or is that something the ldap servers need to support? [21:20:54] ok, thanks [21:21:07] AaronSchulz: do you see any other possibility? [21:22:32] also some jobs from January 2014 [21:23:05] RoanKattouw gwicke: with 1 extra small patch https://gerrit.wikimedia.org/r/165614 and a minor addition to https://gerrit.wikimedia.org/r/164758 citoid is ready to be deployed. The LVS part, I will merge tomorrow morning (my morning) [21:23:22] if the queue is already backlogged a lot, retries can cause there to be jobs with super old rootTimestamps [21:23:44] <^d> YuviPanda: elasticsearch.yml is still old version, force ran puppet a few times :\ [21:23:46] since it ends up at the back of the queue [21:24:00] ^d: deployment-prep has its own puppetmaster [21:24:03] needs to be merged there [21:24:07] AaronSchulz: the queue is relatively short though [21:24:07] <^d> Dur. [21:24:08] it could take maybe even longer to be run than it took the first time [21:24:15] akosiaris: Awesome, thanks [21:24:24] akosiaris: awesome++ [21:24:28] Ugh the lvs::realserver thing is totally obvious and it was right there staring me in the face :| [21:24:38] <^d> Ah yep, see it in prod. [21:24:45] <^d> YuviPanda: I dunno how! [21:24:49] AaronSchulz: is it possible that those jobs were retried many times? [21:24:56] <^d> YuviPanda: Go ahead :) [21:25:00] ^d: ok, doing [21:25:54] PROBLEM - citoid on sca1001 is CRITICAL: Connection refused [21:25:57] ^d: it's on deployment-salt, /var/lib/git/operations/puppet. [21:26:00] ^d: bah, merge conflicts. [21:26:08] bd808: ^ merge conflicts on deployment-salt :( [21:26:44] YuviPanda: {{sofixit}} [21:26:45] akosiaris: That icinga alert is expected, git-deploy needs to be kick-started. If you don't mind I'll deal with that tomorrow. I should be around from about 11am your time [21:26:45] not more than 3 tries [21:26:55] PROBLEM - puppet last run on sca1002 is CRITICAL: CRITICAL: Puppet has 2 failures [21:27:04] akosiaris: also, Moritz & me verified that Mathoid is working as intended; next step will be to double-check things for prod & then enabling it as an optional math render mode [21:27:09] ^d: anyway, I cherry picked... [21:27:23] marxarelli's patch for the servurity testing guy [21:27:27] *security [21:27:33] RoanKattouw: yeah I know. I'll acknowledge them in icinga. Thanks! [21:27:37] gwicke: awesome! [21:28:14] AaronSchulz: only three retries are not much since January [21:28:39] YuviPanda: You can fix by rebase interactive against origin/production and drop the local cherry-pick for marxarelli's patch [21:28:40] with a queue length of 10k or so [21:28:45] and fifo behavior [21:29:43] YuviPanda: And then poke him to fix and reapply the cherry-pick [21:30:01] bd808: hmm, I'm not sure if I should kill them before poking [21:30:37] YuviPanda: meh.
beta and it's only there to allow redirecting a security researcher to his own hhvm server [21:30:51] SHould in #-qa first if you'd like [21:30:55] AaronSchulz: where are the retries configured? [21:30:55] *shout [21:31:32] (03CR) 10Ori.livneh: "@apergos: Yeah, I was kinda tilting that way too. Will amend." [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [21:31:46] ^d: can you force another run anyway? [21:31:52] <^d> I am. [21:31:56] ^d: ok. [21:32:06] bd808: I'll attempt a rebase after ^d's run finishes... [21:33:42] (03PS4) 10Ori.livneh: apache: keep two weeks' worth of logs, rather than 1yr [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [21:34:02] gwicke: CommonSettings [21:34:11] <^d> YuviPanda: Ok, ran puppet on all 4 nodes. [21:34:42] ^d: cool, let me perform some rebase surgery... [21:35:55] AaronSchulz: thx [21:36:45] bd808: done [21:37:06] bd808: do you know his email address? [21:37:40] <_joe_> AaronSchulz: the hhvm jobrunner is off since 1 week; should I repurpose it, or will someone work on the reasons that got it to be disabled? [21:37:46] YuviPanda: dduval@ [21:38:00] ACKNOWLEDGEMENT - citoid on sca1001 is CRITICAL: Connection refused alexandros kosiaris Waiting for initial deployment of citoid code [21:38:00] ACKNOWLEDGEMENT - puppet last run on sca1001 is CRITICAL: CRITICAL: Puppet has 2 failures alexandros kosiaris Waiting for initial deployment of citoid code [21:39:36] bd808: thanks [21:39:54] ^d: do ES logs indicate that the plugin is loaded? [21:39:57] * YuviPanda isn't seeing any metrics yet [21:40:03] <^d> Well, before we configured it to turn on. [21:40:03] _joe_: not sure who turned it off...might want to ping ori [21:40:08] <^d> Now it's not starting. [21:40:27] ^d: couldn't parse the previous sentence... [21:40:40] <^d> Elasticsearch won't start now that the config is live. [21:40:43] <_joe_> AaronSchulz: ori turned it off, but I was under the impression you were leading the work on the jobrunner [21:40:45] oh... [21:40:50] because of buggy plugin? [21:40:51] <^d> I think it's failing because it can't connect to labmon1001.eqiad.wmnet [21:41:03] ^d: it should be able to. [21:41:04] <_joe_> I'll ask him then [21:41:07] <^d> demon@deployment-elastic01:/var/log/elasticsearch$ ping labmon1001.eqiad.wmnet [21:41:07] <^d> PING labmon1001.eqiad.wmnet (10.64.37.13) 56(84) bytes of data. [21:41:07] <^d> From ae2-1118.cr2-eqiad.wikimedia.org (10.64.20.3) icmp_seq=1 Packet filtered [21:41:12] ^d: no icmp allowed [21:41:25] ^d: do a curl. [21:41:54] <^d> Ok yeah, it that worked. [21:41:55] <^d> Hmm [21:41:56] (03PS1) 10BBlack: ns1 is no longer in transition [dns] - 10https://gerrit.wikimedia.org/r/165633 [21:42:23] (03CR) 10BBlack: [C: 032] ns1 is no longer in transition [dns] - 10https://gerrit.wikimedia.org/r/165633 (owner: 10BBlack) [21:43:08] damn, we are butchering icmp ? [21:43:17] we shouldn't... [21:43:21] huh? [21:43:32] oh, labs [21:43:35] probably [21:43:39] (not that we should) [21:43:43] sigh [21:43:54] ^d: making a note to fix it [21:44:05] akosiaris: yeah, this is labs -> prod [21:44:11] <^d> Not even sure if it's what's causing my bug :) [21:44:27] we shouldn't be filtering icmp anyway [21:44:28] <^d> I could be that this plugin is horribly shitty and just won't work [21:44:29] <^d> :) [21:44:37] ^d: that's also possible, yeah :) [21:44:42] for labs, everything's filtered by default and must be explicitly allowed. So we probably just need to go add a rule to allow icmp. 
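Since the labs-to-prod path drops ICMP by default while the TCP service ports stay open, reachability checks like the one above have to probe the actual port rather than ping. A quick sketch of the check that settles it (2003 is the carbon plaintext port the plugin reports further down in the log):

```bash
# ping returns "Packet filtered" here even when the service is fine,
# so test the TCP port that carbon actually listens on instead.
nc -vz -w 5 labmon1001.eqiad.wmnet 2003 && echo "graphite reachable"
```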
[21:44:51] PROBLEM - citoid on sca1002 is CRITICAL: Connection refused [21:45:00] yeah, you're right obviously bblack [21:45:00] (but I can't imagine why we'd not have done that long ago, either) [21:45:13] _joe_: I haven't been on hhvm in a few weeks [21:45:28] AaronSchulz: I didn't find any setting for the number of allowed retries in commonsettings or the job runner [21:45:28] <_joe_> ok sorry :) [21:45:51] AaronSchulz: it seems that https://gerrit.wikimedia.org/r/#/c/165635/1/ParsoidCacheUpdateJob.php could help though [21:45:56] ACKNOWLEDGEMENT - citoid on sca1002 is CRITICAL: Connection refused alexandros kosiaris Awaiting first deployment of citoid [21:45:56] ACKNOWLEDGEMENT - puppet last run on sca1002 is CRITICAL: CRITICAL: Puppet has 2 failures alexandros kosiaris Awaiting first deployment of citoid [21:47:23] gwicke: the 3 is from MW core, but it does nothing without setting claimTTL, which is in CommonSettings [21:47:24] $wgJobTypeConf['default']['claimTTL'] = 3600; [21:47:28] <^d> YuviPanda: It's like the init script is failing. [21:47:52] ^d: hmm, so is that 'shitty plugin'? [21:48:02] since network issues shouldn't cause that to happen, assuming it isn't terribly written... [21:48:05] <^d> easy enough to test. [21:48:11] AaronSchulz: do we need to set that explicitly for the parsoid job types, or will they inherit the default? [21:49:11] <^d> Hmm, deleted plugin and it still won't come up. [21:49:14] <^d> Hmm [21:49:54] ^d: hmm, maybe something unrelated. puppet on betalabs was out of date by a few days. [21:50:33] gwicke: they all use the default if not set otherwise [21:51:17] why do retries matter so much for that job type anyway? [21:51:39] I'm trying to limit or get rid of them [21:51:54] we shouldn't keep retrying jobs from January [21:52:56] do you just not want to retry "old" jobs, or none at all? [21:53:21] run() could always no-op if the rootTimestamp is too far in the past [21:54:13] I'd like to limit the number of retries too [21:54:58] mostly for the case where something really huge is enqueued & then takes up all cpu time across the cluster [21:55:16] by virtue of quick retries [21:55:35] (03PS8) 10Dduvall: beta: varnish backend/director for isolated security audits [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) [21:56:23] (03PS1) 10Dzahn: dsh - remove last Tampa remnants [puppet] - 10https://gerrit.wikimedia.org/r/165643 [21:56:30] AaronSchulz: so I have traced maxTries to a commandline parameter to the job runner; where is that coming from? [21:57:20] ottomata: your LDAP firewall change is done [21:57:30] (03CR) 10Dzahn: [C: 032] dsh - remove last Tampa remnants [puppet] - 10https://gerrit.wikimedia.org/r/165643 (owner: 10Dzahn) [21:57:43] ah, right, some of the conf was moved to the json conf for the service [21:57:49] modules/mediawiki/templates/jobrunner/jobrunner.conf.erb [21:57:52] it's still 3 though [21:58:00] ottomata: I tested it. Care to update the ticket ? I am feeling like going to sleep [21:59:31] AaronSchulz: k, thx [22:00:10] (03CR) 10Dereckson: Planet update (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/165558 (owner: 10Dereckson) [22:01:58] AaronSchulz: I do suspect that something is not quite working with the retry counting [22:02:11] <^d> manybubbles: elastic won't come back up on deployment-elastic01 :( [22:02:24] <^d> Tried a couple of times. Removed graphite plugin in case that was it, still nothing. [22:02:45] <^d> Log is silent so it's like it's failing pre-that.
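Pulling the retry plumbing from the exchange above into one place: a claimed job only becomes retryable once claimTTL lapses, and maxTries caps how often that can happen, with the two knobs living in different repos. A sketch for reading both off a host; the filesystem paths are assumptions, since the log only names CommonSettings and the puppet template jobrunner.conf.erb:

```bash
# claimTTL: seconds a claimed job may sit before it counts as abandoned and
# becomes eligible for retry (set to 3600 in CommonSettings above).
grep -n 'claimTTL' /srv/mediawiki/wmf-config/CommonSettings.php
# maxTries: the retry cap handed to the runner, rendered from puppet's
# modules/mediawiki/templates/jobrunner/jobrunner.conf.erb (still 3 above).
grep -rn 'maxTries' /etc/default/jobrunner /etc/jobrunner 2>/dev/null
```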
[22:02:50] !log tin - there are dozens of dsh groups that have been removed from repo long time ago but never got purged, but it isn't easy to tell what might still be used, so deleting all and letting puppet recreate might be risky? [22:02:57] Logged the message, Master [22:07:45] !log tin - deleted empty pmtpa dsh group files [22:07:51] Logged the message, Master [22:07:54] !log updated OCG to version def24eca [22:07:59] Logged the message, Master [22:09:41] (03CR) 10Dzahn: [C: 032] Planet update [puppet] - 10https://gerrit.wikimedia.org/r/165558 (owner: 10Dereckson) [22:11:16] AaronSchulz: there are 86 ParsoidCache* jobs from January in the queue, and those were executed 4006 times today alone; I think that makes it pretty clear that the retry counts aren't working, at least for ParsoidCache* jobs [22:11:42] (03PS1) 10Reedy: Fix beta docroots to use docroot/wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/165649 [22:12:00] (03CR) 10Dzahn: [C: 032] "and now it's good to go since all pmtpa hosts that were in that group are out of icinga config" [puppet] - 10https://gerrit.wikimedia.org/r/165091 (owner: 10Matanya) [22:17:21] (03CR) 10Dzahn: [C: 032] "to fix current 404 on beta, sure! though it appears to me we should try to not even have separate config files for beta anymore in the fut" [puppet] - 10https://gerrit.wikimedia.org/r/165649 (owner: 10Reedy) [22:19:00] (03CR) 10Dduvall: [C: 031] "Compiles fine against prod, but there seems to be an unrelated problem with test compilation against beta." [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181) (owner: 10Dduvall) [22:21:42] andrewbogott: how about https://gerrit.wikimedia.org/r/#/c/165416/ [22:22:18] (03CR) 10Andrew Bogott: [C: 031] remove virt0 from site.pp and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/165416 (owner: 10Dzahn) [22:22:42] andrewbogott: thx [22:23:06] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR [22:23:16] apergos: Can you restart the last json dump? [22:23:30] or, well, try [22:23:34] not sure why it failed [22:23:35] (03PS2) 10Dzahn: remove virt0 from site.pp and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/165416 [22:23:46] I fear it was Q!83 as well [22:23:50] * Q183 [22:24:25] RECOVERY - Disk space on analytics1035 is OK: DISK OK [22:25:14] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [22:26:04] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR [22:27:14] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [22:28:05] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [22:29:54] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures [22:30:01] <^d> YuviPanda: Well, I'm stuck. 01 still won't come back up. I'm afraid to touch 02-04 since we're in red. [22:30:09] :( [22:30:24] I could offer to take a look, but I know jackshit about ES [22:30:28] <^d> I'll just reboot the instance. [22:30:30] ok [22:30:31] <^d> Can't hurt. 
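On the stale-job numbers above (86 January jobs run 4006 times in a day), whether the queue is draining or merely re-claiming the same jobs can be watched from the outside with MediaWiki's stock showJobs.php maintenance script; its --group flag breaks the counts down into queued, claimed, and abandoned per job type. The wiki and sampling interval here are arbitrary:

```bash
# Sample per-type queue counts every ten minutes; a ParsoidCache* count that
# oscillates instead of shrinking points at re-claims rather than new inserts.
while true; do
    date
    mwscript showJobs.php --wiki=enwiki --group | grep -i parsoid
    sleep 600
done
```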
[22:31:14] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail [22:31:16] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [22:31:16] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0 [22:31:25] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [22:32:24] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [22:34:15] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [22:36:24] (03CR) 10Dzahn: [C: 032] remove virt0 from site.pp and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/165416 (owner: 10Dzahn) [22:37:10] (03PS1) 10Chad: Actually format Elasticsearch config properly or it won't start [puppet] - 10https://gerrit.wikimedia.org/r/165652 [22:37:44] <^d> Stupid me, wrong channel. [22:37:47] ^d: ow [22:37:55] <^d> YuviPanda: See scrollback in #-dev [22:37:57] ^d: yeah, I missed that too [22:37:59] yeah, saw that [22:39:19] !log Removed openjdk-6-* from logstash100[1-3] [22:39:24] Logged the message, Master [22:39:28] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [22:40:07] andrewbogott: can you merge https://gerrit.wikimedia.org/r/#/c/165652/? Trivial typofix... [22:40:08] (03CR) 10Reedy: "Can be merged now" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/165503 (owner: 10Reedy) [22:40:18] !log virt0 - deleted salt key, revoked puppet cert, removed from site.pp [22:40:23] Logged the message, Master [22:40:35] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [22:40:44] (03CR) 10Andrew Bogott: [C: 032] Actually format Elasticsearch config properly or it won't start [puppet] - 10https://gerrit.wikimedia.org/r/165652 (owner: 10Chad) [22:41:17] andrewbogott: ty [22:41:21] ^d: now updating the puppetmaster [22:41:35] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [22:42:07] (03PS2) 10Dzahn: remove one last pmtpa remnant in domain_search [puppet] - 10https://gerrit.wikimedia.org/r/165417 [22:42:21] (03PS1) 10Ottomata: Download the Maxmind Geoip Connection-Type databases [puppet] - 10https://gerrit.wikimedia.org/r/165653 [22:42:44] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0 [22:43:08] ^d: puppet should have the correct file now. can you force a run, and we can see if it is sending stats... [22:43:15] (03PS2) 10Ottomata: Download the Maxmind Geoip Connection-Type databases [puppet] - 10https://gerrit.wikimedia.org/r/165653 [22:43:37] (03CR) 10Dzahn: [C: 032] remove one last pmtpa remnant in domain_search [puppet] - 10https://gerrit.wikimedia.org/r/165417 (owner: 10Dzahn) [22:44:15] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [22:44:25] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR [22:45:25] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [22:46:26] so i'm tailing fluorine:/a/mw-log/fatal.log for the first time and the contents are scary but maybe they're normal?
[22:46:36] they're likely normal [22:46:40] ok :\ [22:46:41] Fatal error: Allowed memory size [22:46:42] those? [22:46:43] What's scaring you? :) [22:46:44] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [22:46:45] yeah [22:46:54] any errors scare me :) [22:47:04] FIX THEM ALL [22:47:21] hee [22:47:25] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:47:57] thaaa, thanks akosiaris, it works now. [22:48:07] The biggest problem is that it's somewhat unmanageable [22:49:30] (03PS3) 10Ottomata: Download the Maxmind Geoip Connection-Type databases [puppet] - 10https://gerrit.wikimedia.org/r/165653 [22:49:35] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [22:49:35] (03CR) 10Ottomata: [C: 032 V: 032] Download the Maxmind Geoip Connection-Type databases [puppet] - 10https://gerrit.wikimedia.org/r/165653 (owner: 10Ottomata) [22:50:39] <^d> [2014-10-08 22:49:21,148][INFO ][service.graphite ] [deployment-elastic04] Graphite reporting triggered every [1m] to host [labmon1001.eqiad.wmnet:2003] with metric prefix [elasticsearch.beta-search] [22:50:43] <^d> YuviPanda: ^ [22:50:52] w00t [22:50:57] let's go and check [22:51:12] hmm, I don't see it there yet [22:54:15] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: Puppet has 1 failures [22:54:59] <^d> Ok, restarted on all 4 nodes. [22:56:04] (03PS2) 10Dzahn: remove 10.4.16.0/24 and host scs-c1-pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/159439 [22:57:15] (03PS3) 10Dzahn: remove scs-c1-pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/159439 [22:57:29] (03CR) 10Dzahn: [C: 031] "rebased, only the console server is left" [puppet] - 10https://gerrit.wikimedia.org/r/159439 (owner: 10Dzahn) [22:58:17] (03PS2) 10Dzahn: dhcp - delete remaining Tampa db's and es's [puppet] - 10https://gerrit.wikimedia.org/r/159440 [22:58:55] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [22:58:57] (03CR) 10Dzahn: "rebased into nothing :) those are the best ones" [puppet] - 10https://gerrit.wikimedia.org/r/159440 (owner: 10Dzahn) [22:59:05] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [22:59:17] (03Abandoned) 10Dzahn: dhcp - delete remaining Tampa db's and es's [puppet] - 10https://gerrit.wikimedia.org/r/159440 (owner: 10Dzahn) [22:59:55] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR [23:00:04] RoanKattouw, ^d, marktraceur, MaxSem: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141008T2300). [23:00:55] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [23:01:17] <^d> I've got it. [23:01:29] (03CR) 10Dzahn: [C: 032] remove virt0 - decom [dns] - 10https://gerrit.wikimedia.org/r/165415 (owner: 10Dzahn) [23:01:38] <^d> Ping andrewbogott, tgr, spagewmf for swat [23:02:02] ^d: present [23:02:33] <^d> Alrighty, we'll do yours first. 
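On the metrics YuviPanda can't see yet above: once the plugin logs that it reports to labmon1001.eqiad.wmnet:2003, injecting a datapoint by hand separates "carbon unreachable" from "metrics hiding under an unexpected prefix". Carbon's plaintext protocol is just metric name, value, and epoch timestamp on one line; the canary metric name below is made up:

```bash
# If this datapoint appears in graphite, ingestion works and the problem is
# on the Elasticsearch plugin side (bash's /dev/tcp opens the TCP connection).
echo "elasticsearch.beta-search.canary 1 $(date +%s)" \
  > /dev/tcp/labmon1001.eqiad.wmnet/2003
```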
[23:03:04] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [23:03:55] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [23:05:16] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0 [23:06:39] (03Abandoned) 10Dzahn: remove blog SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/164698 (owner: 10Dzahn) [23:06:55] ^d: I'm here... [23:07:04] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR [23:07:06] <^d> Sweet, I merged your patch to master and doing the backport(s) now [23:07:21] cool [23:08:06] (03PS2) 10Dzahn: remove boron node from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/165117 [23:09:36] ^d hi Flow backport is ready to go, brb [23:09:52] <^d> Ok sweet, yeah I'll do yours last [23:09:55] <^d> Already working on the first 2. [23:10:00] (03CR) 10Dzahn: [C: 032] "3038 # as of 2014-08-12 these fundraising servers use frack puppet" [puppet] - 10https://gerrit.wikimedia.org/r/165117 (owner: 10Dzahn) [23:10:14] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 [23:12:05] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: No response from remote host 198.35.26.192 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [23:13:35] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [23:14:14] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [23:14:16] !log demon Synchronized php-1.25wmf2/extensions/CommonsMetadata: (no message) (duration: 00m 06s) [23:14:18] <^d> tgr: ^ [23:14:24] Logged the message, Master [23:15:05] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR [23:15:05] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 9, down: 0, shutdown: 0 [23:16:14] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail [23:16:23] ^d: works, thanks [23:16:32] <^d> yw [23:16:45] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail [23:17:45] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail [23:18:17] <^d> Ok, just waiting on jenkins for the last two [23:18:54] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: puppet fail [23:19:01] <^d> YuviPanda: Still nothing in labmon? [23:19:05] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures [23:19:26] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [23:20:20] ^d: tell me again how to force a regeneration of l10n? 
I presume that's why wikitech is busted [23:21:54] sync-common should update the l10n cache [23:21:59] update/build [23:22:08] (03PS2) 10Chad: Gerrit: explicitly whitelist image formats we want to display [puppet] - 10https://gerrit.wikimedia.org/r/165602 (https://bugzilla.wikimedia.org/70892) [23:22:24] Reedy: ah, you're right, now wikitech is back [23:22:29] It just stuttered while sync-common was running [23:23:28] @terbium:~# # /usr/bin/php importDump.php --dry-run --conf ../LocalSettings.php /home/dzahn/1411943812-wmcwiki.xml cawikimedia [23:23:29] !log demon Synchronized php-1.25wmf2/extensions/Flow: (no message) (duration: 00m 05s) [23:23:31] <^d> ebernhardson: ^ [23:23:31] ^ does this make sense? [23:23:34] Logged the message, Master [23:23:39] anyone who was imported to cluster wiki before? [23:23:46] from xmldump file on shell [23:23:56] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [23:24:22] i don't wanna break replication or something :p [23:24:28] <^d> mutante: Run it with --no-updates [23:24:35] <^d> And then run refreshLinks afterwards. [23:24:38] mutante: no [23:24:41] <^d> Much faster. [23:24:49] mwscript and --wiki [23:25:24] mwscript importDump.php --dry-run /home/dzahn/1411943812-wmcwiki.xml --wiki=cawikimedia --no-updates [23:25:41] ^d: thanks [23:25:48] <^d> Reedy: --wiki has to be first parameter [23:25:49] <^d> ebernhardson: yw [23:25:58] damn programs [23:26:05] ^d: Reedy : tyvm! [23:26:15] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail [23:26:34] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [23:26:37] ^d: Reedy : oh, one more thing. do i have to edit in the XML file? [23:26:47] You shouldn't... [23:26:48] http://tsbpap01.wikimedia.ca/wiki/Main_Page [23:26:53] Oh [23:26:54] but that hostname .. [23:27:05] the ca chapter sent that file to us [23:27:14] I think it's just header info [23:27:15] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [23:27:25] ok [23:27:27] I've exported various things from enwiki and just imported them straight into elsewhere [23:27:29] !log demon Synchronized php-1.25wmf2/extensions/OpenStackManager: (no message) (duration: 00m 06s) [23:27:34] Logged the message, Master [23:27:35] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [23:27:40] <^d> andrewbogott: and you're live ^ [23:27:44] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [23:27:45] (03PS1) 10Andrew Bogott: Revert "Change ldap 'master' settings in firstboot.sh" [puppet] - 10https://gerrit.wikimedia.org/r/165657 [23:27:49] <^d> feel free to sync to virt* [23:27:49] when he runs sync-common again [23:27:50] :D [23:27:54] ^d: thanks! [23:28:11] !log running sync-common on virt1000 [23:28:16] Logged the message, Master [23:28:45] (03CR) 10Andrew Bogott: [C: 032] Revert "Change ldap 'master' settings in firstboot.sh" [puppet] - 10https://gerrit.wikimedia.org/r/165657 (owner: 10Andrew Bogott) [23:29:19] (03CR) 10BryanDavis: [C: 031] Install logstash-contrib too (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/165503 (owner: 10Reedy) [23:29:22] the dry run looks like it works. afaict [23:29:39] well, it says "Done!" at the end [23:29:57] haha [23:30:00] <^d> Ok, swat complete.
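The import sequence that finally worked, collected in one place; mwscript wants --wiki as the first argument after the script name, per ^d above, and the dump path and wiki are the ones from the log:

```bash
# 1) Bulk import, skipping inline link updates (much faster for big dumps).
mwscript importDump.php --wiki=cawikimedia --no-updates /home/dzahn/1411943812-wmcwiki.xml
# 2) Rebuild the link tables that --no-updates skipped.
mwscript refreshLinks.php --wiki=cawikimedia
# 3) Regenerate Special:RecentChanges from the imported revisions.
mwscript rebuildrecentchanges.php --wiki=cawikimedia
```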
[23:30:04] <^d> Thanks for playing. [23:30:07] <^d> More prizes tomorrow. [23:30:08] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [23:30:08] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [23:30:44] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [23:31:17] !log importing xml dump to cawikimedia [23:31:23] Logged the message, Master [23:31:37] lol, and it's thaaat much slower [23:32:41] how big is the xml file? [23:33:52] Reedy: 26M unzipped. i learned afterwards i could have also left it gzipped [23:34:00] done now [23:34:08] 800 (7.17 pages/sec 52.40 revs/sec) [23:34:08] heh [23:34:19] I guess for a small file like that it doesn't matter so much [23:34:49] so now i run rebuildrecentchanges.php ? [23:34:55] looks for syntax [23:35:11] mwscript refreshLinks --wiki=cawiki [23:35:14] mwscript refreshLinks --wiki=cawikimedia [23:35:19] mwscript refreshLinks.php --wiki=cawikimedia [23:35:22] third time lucky [23:35:24] fwiw, it says "You might want to run rebuildrecentchanges.php to regenerate RecentChanges [23:35:27] thanks again! [23:35:35] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [23:35:43] yeah, do that too [23:36:54] it's refreshing things.. [23:37:13] (03CR) 10CSteipp: [C: 031] Gerrit: explicitly whitelist image formats we want to display [puppet] - 10https://gerrit.wikimedia.org/r/165602 (https://bugzilla.wikimedia.org/70892) (owner: 10Chad) [23:38:27] "Retrieving illegal entries .." did it mean "undocumented"? [23:38:39] well, all 0, sounds good [23:39:16] Reedy: should i see more in the actual RC now though ?;p [23:39:33] Did you do recentchanges too? [23:39:42] I guess it depends how recent the recent changes were ;) [23:40:05] https://ca.wikimedia.org/w/index.php?title=Special:RecentChanges&days=30&from=&limit=500 [23:40:09] hmmm [23:40:44] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [23:41:02] https://ca.wikimedia.org/wiki/Special:AllPages [23:41:09] Probably wouldn't worry about it too much :) [23:42:05] ah :)) [23:42:36] and yes, now i also ran rebuildrecentchanges, but it was so fast, it's hard to believe it does stuff [23:43:04] real 0m0.266s [23:43:41] > echo $wgRCMaxAge / 24 /3600; [23:43:42] 30 [23:43:46] well, anyways, welcome Canada chapter on WMF servers [23:43:55] so if nothing changed within the last 30 days... [23:44:08] ah, ok [23:44:34] Coren: ^ fyi for your husband [23:44:53] the data has been imported [23:45:14] all your wikimedia ca data r belong to us [23:45:29] I'll tell him to tell 'em. He's not actually on the board. :-) [23:45:40] oh, how about images? [23:45:51] mutante: got a tarball or something? [23:45:59] all i got so far was the xml [23:46:24] https://ca.wikimedia.org/w/index.php?title=Special%3AAllPages&from=&to=&namespace=6 [23:46:41] There's probably quite a few that can just be updated to use stuff from commons [23:46:42] oh nice [23:54:19] nice, tried removing myself from the access list on a google doc, result: [23:54:22] Sorry, an internal error has occurred and your request was not completed.
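For reference, the window check Reedy ran, with the arithmetic spelled out: $wgRCMaxAge is in seconds, and 2592000 / 86400 = 30, so nothing older than 30 days can surface in Special:RecentChanges however often the table is rebuilt. A sketch via eval.php, the same maintenance shell that produced the `>` prompt above (piping a line into it rather than typing interactively is an assumption):

```bash
# Convert the RC retention window from seconds to days; prints 30 here.
echo 'echo $wgRCMaxAge / 86400;' | mwscript eval.php --wiki=cawikimedia
```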