[00:00:02] ttfn
[00:41:31] !log ongoing schema changes: ar_content_model, ar_content_format. on terbium, osc_host.sh processes ok to kill in emergency
[00:41:36] Logged the message, Master
[00:51:29] akosiaris: Hi, are there any ideas to assign the mathoid role to a node on betalabs.
[00:52:25] akosiaris: I think a single instance would be sufficient for now. I converted wikipedia en on a single node within a few hours
[00:57:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[02:11:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[02:17:55] !log LocalisationUpdate completed (1.24wmf15) at 2014-08-05 02:16:52+00:00
[02:18:05] Logged the message, Master
[02:25:34] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[02:29:01] !log LocalisationUpdate completed (1.24wmf16) at 2014-08-05 02:27:58+00:00
[02:29:07] Logged the message, Master
[02:58:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[03:09:36] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Aug 5 03:08:30 UTC 2014 (duration 8m 29s)
[03:09:42] Logged the message, Master
[03:34:03] PROBLEM - Puppet freshness on db1011 is CRITICAL: Last successful Puppet run was Tue 05 Aug 2014 01:33:32 UTC
[04:38:03] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Tue 05 Aug 2014 02:37:16 UTC
[04:59:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[05:35:03] PROBLEM - Puppet freshness on db1011 is CRITICAL: Last successful Puppet run was Tue 05 Aug 2014 01:33:32 UTC
[06:21:03] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Tue 05 Aug 2014 04:20:09 UTC
[06:28:54] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:03] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:33] RECOVERY - Puppet freshness on db1011 is OK: puppet ran at Tue Aug 5 06:33:27 UTC 2014
[06:39:03] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Tue 05 Aug 2014 02:37:16 UTC
[06:46:03] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:46:54] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[07:00:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[07:00:23] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Tue Aug 5 07:00:19 UTC 2014
[07:31:48] (03PS1) 10Giuseppe Lavagetto: puppet_compiler: remove puppet 2, new version [operations/puppet] - 10https://gerrit.wikimedia.org/r/151820
[07:32:38] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet_compiler: remove puppet 2, new version [operations/puppet] - 10https://gerrit.wikimedia.org/r/151820 (owner: 10Giuseppe Lavagetto)
[07:32:54] <_joe_> what the hell is jenkins doing?
[07:33:29] (03CR) 10Giuseppe Lavagetto: [V: 032] puppet_compiler: remove puppet 2, new version [operations/puppet] - 10https://gerrit.wikimedia.org/r/151820 (owner: 10Giuseppe Lavagetto)
[07:36:18] (03CR) 10Hashar: "recheck" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/150056 (owner: 10Yuvipanda)
[07:36:43] <_joe_> hashar: I may need your help
[07:37:07] (03PS1) 10Hashar: Implement last command (per greg-g) [wikimedia/bots/jouncebot] (refs/changes/56/150056/1) - 10https://gerrit.wikimedia.org/r/151821
[07:37:09] (03CR) 10jenkins-bot: [V: 04-1] Implement last command (per greg-g) [wikimedia/bots/jouncebot] (refs/changes/56/150056/1) - 10https://gerrit.wikimedia.org/r/151821 (owner: 10Hashar)
[07:37:58] (03Abandoned) 10Hashar: Implement last command (per greg-g) [wikimedia/bots/jouncebot] (refs/changes/56/150056/1) - 10https://gerrit.wikimedia.org/r/151821 (owner: 10Hashar)
[07:38:09] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/149387 (owner: 10Hashar)
[07:38:52] (03CR) 10Hashar: "Ignore the cherry-pick above. You will want to rebase/cherry-pick your change on top of https://gerrit.wikimedia.org/r/150056 . Also fix" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/150082 (owner: 10Yuvipanda)
[07:57:23] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Aug 5 07:57:17 UTC 2014
[07:58:08] (03PS1) 10Giuseppe Lavagetto: puppet_compiler: use ensure_packages to avoid conflicts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151823
[07:58:53] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppet_compiler: use ensure_packages to avoid conflicts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151823 (owner: 10Giuseppe Lavagetto)
[08:09:04] (03PS7) 10Ricordisamoa: minor changes to InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/129464
[08:09:57] (03CR) 10Ricordisamoa: "PS7 is rebase only" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/129464 (owner: 10Ricordisamoa)
[08:19:44] Eloquence bblack saw your email, taking a look at swift now
[08:21:53] (03PS1) 10Giuseppe Lavagetto: puppet_compiler: remove python-pip declaration. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151824
[08:22:16] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppet_compiler: remove python-pip declaration. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151824 (owner: 10Giuseppe Lavagetto)
[08:36:13] <_joe_> bd808|MOBILE: how come you're up so early? :D
[08:36:30] <_joe_> the pupppet compiler is back online and functioning
[08:37:51] _joe_: I just landed in London. Tired already.
[08:38:32] <_joe_> I figured that
[08:38:48] <_joe_> now try to understand cockney, when jetlagged
[08:45:51] bd808|MOBILE: have fresh air. Take a 1 hour at most nap at 1pm and done :]
[08:46:31] hashar: Sounds like a good plan.
[08:46:47] bd808|MOBILE: come say hello tomorrow morning :)
[08:46:50] bd808|MOBILE: make sure someone knows you are taking a nap and come wake you up
[08:47:01] or you will end up waking up at 9pm ... that hmm sucks :]
[08:47:10] First I will need escape the airport and then I will need a room.
[09:01:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[09:20:03] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Tue 05 Aug 2014 07:19:41 UTC
[10:00:08] (03CR) 10QChris: [C: 031] "While I cannot comment much on the code itself, an additional" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 (owner: 10Dr0ptp4kt)
[10:20:43] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Tue Aug 5 10:20:34 UTC 2014
[10:22:49] hashar: got a sec ?
[10:31:36] akosiaris: a sec yes :]
[10:39:48] (03CR) 10Alexandros Kosiaris: [C: 031] Create analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151657 (owner: 10Ottomata)
[10:40:34] hashar: so, I have two similar requests for beta. One is to get mathoid into beta, the other is to get a varnish in front of cxserver
[10:41:04] hashar: my problems is... not sure which machines to pick for this in deployment-prep. Or should I just create new ones ?
[10:41:27] hashar: how would you prefer we move on with this in general ?
[10:42:47] akosiaris: how will we handle varnish caching in production?
[10:42:59] I think there was some discussions about sharing the varnish caches for the *oid
[10:43:07] i.e. a shared one for cxserver / mathoid / parsoid
[10:43:07] hashar: we are thinking about using the same as for parsoid
[10:43:11] exactly
[10:43:22] so I would try that on beta cluster :]
[10:43:38] if that works fine, it can then be applied to prod
[10:44:24] akosiaris: beta cluster has deployment-parsoidcache01
[10:44:34] maybe create a new one like deployment-oidcache01
[10:44:42] ahaha
[10:44:45] set it up to be able to handle cxserver / parsoid / mathoid etc
[10:44:52] or deployment-soa-cache01
[10:44:52] oid always reminds me of SNMP
[10:44:54] or whatever :]
[10:44:59] hehe
[10:45:10] from there we can keep deployment-parsoidcache01 around
[10:45:13] <_joe_> or deployment-callbacks01
[10:45:24] probably soa is the best
[10:45:26] and use some magic conf switch to move parsoid to use the new spa-cache
[10:45:46] deployment-gwicke-made-us-do-it-cache-js-0001
[10:45:49] so yeah
[10:45:55] build a new shared cache
[10:46:04] then we can switch the beta cluster parsoid to it and see what happens
[10:46:07] <_joe_> akosiaris: we're not doing SOA, in all honesty
[10:47:23] but we will always strive to get there
[10:47:46] hashar: ok, moving forward with this, thanks for the advice :-)
[10:48:54] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:50:29] akosiaris: remember anything learned while doing it on beta is already a step forward for prod :]
[10:51:05] indeed
[10:51:14] foooood time
[10:59:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is a good package. Minor comments, otherwise LGTM" (033 comments) [operations/debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/151615 (owner: 10Gage)
[11:02:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[11:05:23] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 29 data above and 0 below the confidence bounds
[11:05:43] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 29 data above and 0 below the confidence bounds
[11:05:44] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.006 second response time
[11:24:07] pip install hacking
[11:24:19] sigh.... the perfect name for a python package
[11:26:02] lol
[11:35:03] akosiaris: ahh
[11:35:20] akosiaris: I know that package. It is a superset of flake8 which has a bunch of OpenStack conventions
[11:35:32] yeah
[11:35:48] it might even be packaged for Debian
[11:39:57] akosiaris: check your private messages ;)
[11:40:28] ah yes
[12:00:19] (03PS2) 10Yuvipanda: quarry: Use separate worker module for celery [operations/puppet] - 10https://gerrit.wikimedia.org/r/151409
[12:04:05] akosiaris: seeing you are online, can I interest you in +2ing ^? trivial fix for one of my projects in labs...
[12:05:14] YuviPanda: ah celery... I always like that project
[12:05:26] akosiaris: :D
[12:05:30] it moves a bit too fast sometimes but it works well (most of the times)
[12:05:38] akosiaris: this is for quarry.wmflabs.org
[12:05:42] akosiaris: indeed.
[12:05:55] akosiaris: and has upto date packages in trusty!
[12:07:20] (03CR) 10Alexandros Kosiaris: [C: 032] quarry: Use separate worker module for celery [operations/puppet] - 10https://gerrit.wikimedia.org/r/151409 (owner: 10Yuvipanda)
[12:07:30] akosiaris: :D ty!
[12:07:45] akosiaris: also, no movement on the postgres boxen?
[12:08:15] YuviPanda: yes there is movement. I am the blocker now, I have finally installed them, playing around a bit
[12:08:22] aaaah
[12:08:26] cooool :)
[12:08:33] will be ready before wikimania for sure
[12:11:28] akosiaris: wheee
[12:18:38] (03PS1) 10Alexandros Kosiaris: Update raid5-gpt-lvm config [operations/puppet] - 10https://gerrit.wikimedia.org/r/151845
[12:21:05] (03CR) 10Alexandros Kosiaris: [C: 032] Update raid5-gpt-lvm config [operations/puppet] - 10https://gerrit.wikimedia.org/r/151845 (owner: 10Alexandros Kosiaris)
[12:26:23] !log uploaded python-gear_0.5.5-1 on apt.wikimedia.org
[12:26:28] Logged the message, Master
[12:26:28] hasharEat: ^ that is for you
[12:27:05] akosiaris: awesome
[12:30:25] !log Upgrading python-gear on gallium and restarting zuul and zuul-merger
[12:30:30] Logged the message, Master
[12:32:56] akosiaris: on packaging , I have seen your message for php-parsekit not compiling on Trusty
[12:33:00] https://gerrit.wikimedia.org/r/#/c/151042/3/debian/changelog ..
[12:33:05] seems it is some abandon ware and we need a better strategy
[12:33:17] what is that auto-merge there? I 've never seen that before
[12:33:23] hasharEat: my point exactly...
[12:33:24] akosiaris: yeah that one is a merge commit. I have explained it to andrew on the RT ticket
[12:33:30] aahh looking
[12:33:42] the master branch has a few patches for us (such as .gitreview )
[12:33:50] so I just merge upstream tag to our master
[12:33:54] which craft a merge commit
[12:34:03] if you git-review -d 151042 you will have a nice overview
[12:35:11] yeah, just did
[12:35:16] so merging that one
[12:35:25] and building the package
[12:35:38] definitely confusing though :(
[12:35:40] (03CR) 10Alexandros Kosiaris: [C: 032] Merge tag 'v0.10.0' into gerrit-master [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/151042 (https://bugzilla.wikimedia.org/68995) (owner: 10Hashar)
[12:36:38] upstream-branch=f618f4d35a88efd1d3529217c49df5892899aecd
[12:36:40] ahahaha
[12:36:45] doh
[12:36:59] if it wasn't for the comment above I would be like... huh ?
[12:37:02] puff maybe we should just dish out that git repo and use the upstream packages instead
[12:37:34] (03CR) 10QChris: "Looks good to me." [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095 (owner: 10Ottomata)
[12:37:55] akosiaris: that might be needed any more
[12:38:31] that commit f618f4d35a88efd1d3529217c49df5892899aecd got merged in v0.8.0
[12:39:58] which is not tagged in our repo
[12:40:11] neither v0.7.0 (if such thing ever existed)
[12:40:16] ah
[12:40:21] I have to push the tags ba
[12:40:27] yup
[12:40:28] sorry :-(
[12:40:38] no worries
[12:40:54] bbl, gonna eat
[12:44:06] tag sent
[12:45:21] (03PS1) 10Hashar: remove custom debian/gbp.conf [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/151850
[13:03:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[14:00:36] (03CR) 10Alexandros Kosiaris: [C: 032] Add CNAME for osmdb service [operations/dns] - 10https://gerrit.wikimedia.org/r/150180 (owner: 10Alexandros Kosiaris)
[14:04:30] (03CR) 10Alexandros Kosiaris: [C: 032] remove custom debian/gbp.conf [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/151850 (owner: 10Hashar)
[14:15:01] (03PS1) 10Alexandros Kosiaris: akosiaris .dotfiles REPREPRO_BASE_DIR [operations/puppet] - 10https://gerrit.wikimedia.org/r/151853
[14:16:38] (03CR) 10Alexandros Kosiaris: [C: 032] akosiaris .dotfiles REPREPRO_BASE_DIR [operations/puppet] - 10https://gerrit.wikimedia.org/r/151853 (owner: 10Alexandros Kosiaris)
[14:23:41] (03PS6) 10Ottomata: Create analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151657
[14:23:49] (03CR) 10Ottomata: [C: 032 V: 032] Create analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151657 (owner: 10Ottomata)
[14:33:42] (03PS1) 10Ottomata: Alignment fix in data.yaml [operations/puppet] - 10https://gerrit.wikimedia.org/r/151855
[14:33:44] (03PS1) 10Ottomata: Add Juliusz Gonera to analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151856
[14:34:06] (03PS1) 10Alexandros Kosiaris: Change redirect for wikimedia.org/research [operations/puppet] - 10https://gerrit.wikimedia.org/r/151857
[14:34:44] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:35:03] _joe_: a quick review on this ^ please?
[14:35:12] (03CR) 10Ottomata: [C: 032 V: 032] Alignment fix in data.yaml [operations/puppet] - 10https://gerrit.wikimedia.org/r/151855 (owner: 10Ottomata)
[14:35:33] (03CR) 10Ottomata: [C: 032 V: 032] Add Juliusz Gonera to analytics-privatedata-users group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151856 (owner: 10Ottomata)
[14:39:34] (03PS1) 10Alexandros Kosiaris: Update refreshDomainRedirects with port number [operations/puppet] - 10https://gerrit.wikimedia.org/r/151858
[14:45:04] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[14:45:36] AH
[14:45:37] that is my problem
[14:45:39] strontium!
[14:47:28] ottomata: did you notice it when you merged ?
[14:47:38] or was it silent during puppet-merge ?
[14:48:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[14:48:28] i didn't notice...
[14:48:39] ah still have output up
[14:48:43] PROBLEM - puppet last run on analytics1010 is CRITICAL: CRITICAL: Epic puppet fail
[14:48:48] error: Ref refs/remotes/origin/production is at 49e7677b70181fc3e522903d56cd3c75eb7fef9e but expected 7144d9842db5399672e945aaf6248c15b3119da4
[14:48:52] EPIC!?
[14:48:53] nuh uh
[14:49:01] there.
[14:49:25] ottomata: can you paste the entire output somewhere?
[14:49:43] RECOVERY - puppet last run on analytics1010 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[14:50:20] https://gist.github.com/ottomata/f344d37478309edc1bdb
[14:50:33] thanks
[14:52:36] <_joe_> akosiaris: sorry I was knee-deep in recursion
[14:53:30] (03PS2) 10Giuseppe Lavagetto: Change redirect for wikimedia.org/research [operations/puppet] - 10https://gerrit.wikimedia.org/r/151857 (owner: 10Alexandros Kosiaris)
[14:53:36] (03CR) 10Giuseppe Lavagetto: [C: 031] Change redirect for wikimedia.org/research [operations/puppet] - 10https://gerrit.wikimedia.org/r/151857 (owner: 10Alexandros Kosiaris)
[14:54:04] (03CR) 10Alexandros Kosiaris: [C: 032] Change redirect for wikimedia.org/research [operations/puppet] - 10https://gerrit.wikimedia.org/r/151857 (owner: 10Alexandros Kosiaris)
[14:54:38] (03PS3) 10Ottomata: Create analytics-admins group with qchris as member [operations/puppet] - 10https://gerrit.wikimedia.org/r/150560
[14:54:49] how deep is knee deep in recursion?
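The "Unmerged changes on repository puppet" alert and the `error: Ref ... is at ... but expected ...` output above both come down to comparing the commit a ref actually points at in a secondary working copy with the commit puppet-merge expected. A rough sketch of that comparison (the function names and the repo path are illustrative, not the real check):

```python
import subprocess

def sha_of(repo_dir, ref="refs/remotes/origin/production"):
    # What `ref` currently points at in a working copy, e.g. strontium's
    # /var/lib/git/operations/puppet mirror of the puppetmaster repo.
    out = subprocess.check_output(["git", "-C", repo_dir, "rev-parse", ref])
    return out.decode().strip()

def unmerged(actual_sha, expected_sha):
    """True when the secondary working copy lags what puppet-merge pushed."""
    return actual_sha != expected_sha
```

With the two shas from the error message above, `unmerged()` is true, which is exactly the condition the Icinga check alerts on.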
[14:55:04] (03PS4) 10Ottomata: Create analytics-admins group with qchris as member [operations/puppet] - 10https://gerrit.wikimedia.org/r/150560
[14:55:08] it is knees all the way down
[14:55:42] (03CR) 10jenkins-bot: [V: 04-1] Create analytics-admins group with qchris as member [operations/puppet] - 10https://gerrit.wikimedia.org/r/150560 (owner: 10Ottomata)
[14:55:50] (03PS5) 10Ottomata: Create analytics-admins group with qchris as member [operations/puppet] - 10https://gerrit.wikimedia.org/r/150560
[14:55:54] no base case, eh?
[14:56:23] (03PS6) 10Ottomata: Create analytics-admins group with qchris as member [operations/puppet] - 10https://gerrit.wikimedia.org/r/150560
[14:56:30] haha no I don't think so
[14:57:40] (03CR) 10Ottomata: [C: 032 V: 032] Create analytics-admins group with qchris as member [operations/puppet] - 10https://gerrit.wikimedia.org/r/150560 (owner: 10Ottomata)
[15:00:54] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:01:40] chasemp: can I put a system user that is not managed by the admin module into a group?
[15:01:47] i want the hdfs user to be in the analytics-admins group
[15:01:53] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 3.062 second response time
[15:02:39] root is doing a graceful restart of all apaches
[15:03:45] !log root gracefulled all apaches
[15:03:51] Logged the message, Master
[15:04:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[15:06:00] root is doing a graceful restart of all apaches
[15:07:24] !log root gracefulled all apaches
[15:12:01] Otto: not as of now, but you are the second person to have the need so may be a good week to figure it out.
[15:13:31] Ottomata^
[15:13:47] aye
[15:13:49] k
[15:13:57] chasemp: I previously was using an exec to do this
[15:14:03] going to continue this for now, and just put a TODO on it
[15:15:29] (03PS1) 10Ottomata: Group own refinery logs by analytics-admins, add hdfs to analytics-admins group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151863
[15:15:56] Could you make a ticket and toss it my way? If you manage that user like that it will create perpetual churn with puppet I think and be weird...at least I think. I will try to look at this soon
[15:17:16] naw no churn, it works ok...as long as I set up the exec properly
[15:18:32] chasemp: https://rt.wikimedia.org/Ticket/Display.html?id=8080
[15:18:41] Good enough for now then :)
[15:18:49] thanks
[15:19:07] (03CR) 10Ottomata: [C: 032 V: 032] Group own refinery logs by analytics-admins, add hdfs to analytics-admins group [operations/puppet] - 10https://gerrit.wikimedia.org/r/151863 (owner: 10Ottomata)
[15:21:43] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[15:29:22] (03PS1) 10Alexandros Kosiaris: Allow overriding postgres datadir [operations/puppet] - 10https://gerrit.wikimedia.org/r/151864
[15:30:00] (03CR) 10jenkins-bot: [V: 04-1] Allow overriding postgres datadir [operations/puppet] - 10https://gerrit.wikimedia.org/r/151864 (owner: 10Alexandros Kosiaris)
[15:33:36] (03PS2) 10Alexandros Kosiaris: Allow overriding postgres datadir [operations/puppet] - 10https://gerrit.wikimedia.org/r/151864
[15:44:15] <_joe_> oh how many wasted commits we did
[15:55:19] (03CR) 10Ottomata: "LGTM too!" (031 comment) [operations/debs/logstash-gelf] - 10https://gerrit.wikimedia.org/r/151615 (owner: 10Gage)
[16:07:15] off to london, ttyl!
[16:12:33] (03PS1) 10Giuseppe Lavagetto: puppet: hiera backend for the WMF [operations/puppet] - 10https://gerrit.wikimedia.org/r/151869
[16:12:41] <_joe_> ciao godog
[16:15:26] _joe_: are you coming to Wikimania?
[16:15:36] <_joe_> YuviPanda: sadly no
[16:15:40] :(
[16:16:03] <_joe_> I'll be here working on hhvm and ruby while you guys sip martini at the lounge :P
[16:16:18] heh
[16:16:27] I'm sitting on a friend's couch atm
[16:17:23] <_joe_> and I'm off right now :)
[16:18:54] (03PS6) 10Ottomata: RT 7858: datasets Apache and Puppet edits. [operations/puppet] - 10https://gerrit.wikimedia.org/r/147226 (owner: 10Scottlee)
[16:19:14] _joe_: cya!
[16:19:24] (03CR) 10Ottomata: [C: 032 V: 032] RT 7858: datasets Apache and Puppet edits. [operations/puppet] - 10https://gerrit.wikimedia.org/r/147226 (owner: 10Scottlee)
[16:23:48] (03PS1) 10Ottomata: Redirect stat1001.wikimedia.org to http, not https [operations/puppet] - 10https://gerrit.wikimedia.org/r/151871
[16:25:00] (03CR) 10Ottomata: [C: 032 V: 032] Redirect stat1001.wikimedia.org to http, not https [operations/puppet] - 10https://gerrit.wikimedia.org/r/151871 (owner: 10Ottomata)
[17:05:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[17:16:02] (03CR) 10Dr0ptp4kt: WIP: Log Internet.org via header in X-Analytics when appropriate (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 (owner: 10Dr0ptp4kt)
[17:20:03] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Tue 05 Aug 2014 15:19:51 UTC
[17:20:23] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Tue Aug 5 17:20:16 UTC 2014
[17:58:03] marxarelli: can i borrow you for some ruby code-review?
[17:58:14] specifically: https://gerrit.wikimedia.org/r/#/c/151869/ (_joe_'s hiera patch)
[17:59:05] ori: for testwiki HHVM, maybe put a sitenotice on testwiki?
[18:00:31] jeremyb: that's a really good idea. how would i do that?
[18:01:25] * ori RTFMs.
[18:01:58] https://test.wikipedia.org/wiki/MediaWiki:Sitenotice https://test.wikipedia.org/wiki/MediaWiki:Sitenotice
[18:02:01] edit those 2 pages
[18:02:06] ori
[18:02:29] jeremyb: thanks!
[18:03:02] jeremyb: two pages? isn't that the same url twice?
[18:03:13] it's the same :)
[18:03:21] just edit one of them ;)
[18:03:42] BUT WHICH???
[18:03:43] * ori kids
[18:04:01] hah
[18:04:04] yes, two
[18:04:06] ori: no matter, both liniks are the same :)
[18:04:21] i don't know the structure of testwiki, but normally it's enough to edit MediaWiki:Sitenotice :)
[18:04:21] I'm going to replace the Wikidata notice; it's been around since April
[18:04:33] ("From March 6 14:30 (UTC) on this wiki will use [[testwikidata:|testwikidata]] as its Wikibase repository, instead of the production [[wikidata:|Wikidata]]. ")
[18:04:35] one should be https://test.wikipedia.org/wiki/MediaWiki:anonnotice
[18:05:39] * jeremyb runs away
[18:10:38] jeremyb, FlorianSW: {{Done}}. thanks!
[18:11:11] ori np :) But why you removed the version notice? :()
[18:11:14] *:)
[18:11:24] FlorianSW: are you florianschmidt?
[18:11:32] FlorianSW: because I'm dumb! I'll revert that bit
[18:11:37] RD: Yes :/
[18:11:46] ori: Just a suggestion :P
[18:11:47] I have an OTRS question for ya...will PM :)
[18:11:53] RD OK :)
[18:13:08] (03PS2) 10Ori.livneh: HHVM: add ::hhvm::status [operations/puppet] - 10https://gerrit.wikimedia.org/r/151772
[18:13:16] (03CR) 10Ori.livneh: [C: 032 V: 032] HHVM: add ::hhvm::status [operations/puppet] - 10https://gerrit.wikimedia.org/r/151772 (owner: 10Ori.livneh)
[18:19:37] ori: sure thing
[18:24:21] akosiaris: yt? ready for qs about passive check?
[18:26:16] ori: tweaked the notice
[18:28:49] ottomata: can we get something useful on datasets.wm.o root (like a link to /public-datasets). also, SSL is really messed up there. sending wrong cert (CN mismatch) even when using SNI and also wrong chain. (all domains I tried including metrics.wm.o send the wrong chain. a mafia CA cert with a WMF CA chain)
[18:29:58] jeremyb: I am feeling sorry I switched at all atm! :/
[18:30:15] this is how it has been forever as far as I can tell, all I wanted to do was merge a volunteers patchset that had been sitting for a while
[18:30:28] i don't think https is supported here!
[18:33:04] jeremyb: i am aware that it is totally weird over there. i will be on RT duty in 2 weeks, and will attempt to fix then
[18:33:05] ottomata: i'm not saying fix it immediately... I can file it if you like. IMHO, we should not provide HTTPS service on domains like that unless it works. (so either turn off 443 entirely and add an exception to the HTTPS everywhere rules OR fix the certs/chains)
[18:33:47] sure, gimme RT ticket for reminder!
[18:33:54] ok, danke
[18:36:06] akosiaris: can you poke #8068 ? i'll make the patch if you want
[18:44:41] (03Abandoned) 10CSteipp: Use Type 'B' Passwords on CentralAuth Wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/117469 (owner: 10CSteipp)
[18:53:27] Jeff_Green: yt?
[18:55:34] hey is cmjohnson1 on vacation?
[18:56:09] yes
[18:57:01] qchris: yt?
[18:57:03] oh!
[18:57:08] sorry, saw Jeff_Green's response
[18:57:15] ok so, i'm looking into this passive check thing again
[18:57:21] i'm also understanding how the puppet freshness check works
[18:57:30] * qchris is reading along.
[18:57:33] it seems like it fits my use case pretty well...but i don't want to use snmp
[18:57:55] looks like somehow the puppet freshness check is a passive-check_freshness type of service check
[18:57:59] ok
[18:58:08] akosiaris: ping
[18:58:10] that will trigger an error if: the command fails OR if it is stale
[18:58:16] seems good
[18:58:43] i could use oozie (which runs jobs in hadoop) to submit_check_result
[18:58:47] as a passive check
[18:58:54] sure
[18:58:57] and then a passive -freshness-check to make sure it works
[18:59:01] only problem is the remote thing
[18:59:07] oozie runs on analytics1027
[18:59:11] and the submit_check_result command runs on neon
[18:59:21] the nagios docs say they use a thing called send_ncsa
[18:59:26] http://nagios.sourceforge.net/docs/3_0/passivechecks.html
[18:59:27] right
[18:59:37] but it doesn't look like we run that?
[18:59:53] in frack that's what I do, I'm not sure about the puppet freshness
[19:00:08] puppet freshness looks like the remote stuff is handled via snmp traps :/
[19:00:12] puppet freshness was snmp trap i though
[19:00:14] thought*
[19:00:17] yeah, it is
[19:00:23] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0]
[19:00:28] oh so we have yet another mechanism involved there
[19:00:43] shocking that it's spotty :-P
[19:00:55] https://github.com/wikimedia/operations-puppet/blob/production/files/snmp/snmptt.conf
[19:01:20] turtles all the way down!
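The "passive-check_freshness type of service check" being described follows the pattern in the Nagios documentation: a service that only accepts passive results, plus a freshness threshold after which Nagios runs a fallback command that forces an alert. A minimal sketch of such a service definition (the template, host, service description, and threshold here are made up for illustration, not the real Wikimedia config):

```cfg
define service{
        use                     generic-service   ; assumed local template
        host_name               analytics1027
        service_description     webrequest data in HDFS
        active_checks_enabled   0                 ; accept passive results only
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     7200              ; seconds of silence before the fallback fires
        ; check_dummy (from nagios-plugins) simply returns the given state,
        ; so a stale service goes CRITICAL with this message
        check_command           check_dummy!2!"no passive check result received"
        }
```

This gives both failure modes mentioned above: a submitted non-OK result alerts immediately, and a missing result alerts once the freshness threshold expires.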
[19:01:32] ha, ja
[19:01:33] so
[19:01:36] i don't htink i want to do that
[19:01:39] too complicated
[19:01:48] 'alternatively, i could just forget the passive stuff altogether
[19:02:04] i'd need to check for the existence of a file in hdfs every hour
[19:02:22] ok so
[19:02:22] so i'd have to make a check command plugin that formatted the file path for the current hour to check
[19:02:27] but, dunno
[19:02:33] my vague understanding of the value of passive checks
[19:02:33] this example here
[19:02:34] http://nagios.sourceforge.net/docs/3_0/freshness.html
[19:02:39] is very simliar to our use case
[19:02:42] it isn't a backup
[19:02:49] but it is a dataset that should be generated every hour
[19:03:04] I think you would use a passive check in the case where you don't want or can't have the master polling and remote-executing code on the client
[19:03:27] other than that, I don't know that it adds any value
[19:03:30] so that and in general events where you are truly looking for a lack
[19:03:36] like cron backups are a good passive check
[19:03:46] well, this case seems to fit pretty well
[19:03:49] if you want to check that dataset is generated every hour
[19:03:51] every hour, if all goes well
[19:03:58] you could have the job that enerates datasets send the check in
[19:04:01] chasemp: can't you schedule nrpe checks too though?
[19:04:05] there is a _SUCCESS file created in hdfs for that hour's directory
[19:04:17] yes, chasemp, could do that, but how to send the check in?
[19:04:57] i think that would be the ideal situation
[19:05:11] so I guess that's another use case--where the check is something you don't want to remote-execute for another reason (like it takes 20 minutes to run) ?
[19:05:12] oozie could actively send OK or FAIL status
[19:05:23] and the passive freshness check would trigger if oozie didn't send anything at all
[19:05:25] there is a python nsca library I've used
[19:05:36] Jeff_Green: yeah agreed you could
[19:05:37] yeah, chasemp, that is mentioned in the docs
[19:05:43] but i don't see us using it anywhere :/
[19:05:54] I tend to think passive checks are better when it's literally checking for the absence of an operation
[19:06:03] checking mod times on files to see if things are fresh
[19:06:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[19:06:10] becomes very unweidly and weird as it scales
[19:06:20] versus having scheduled tasks checkin with good or bad status
[19:06:44] so the passive check saves you having to have one process write state files for an active check to poll
[19:06:46] and then as long as it is scheduled in teh interval you can say it's every 5 or 2 hours or whatever for when you care it has gone missing
[19:07:03] IMO that way (which i've done :) is more hacky
[19:07:09] than counting on checking in if everything is ok
[19:07:15] and anything besides an all clear is considered errant
[19:07:18] ha, which way is more hackY?
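The active-check alternative ottomata describes above — a plugin that "formatted the file path for the current hour" and then tests for that hour's `_SUCCESS` flag in HDFS — could be sketched like this (the base path and the one-hour lag are assumptions for illustration):

```python
from datetime import datetime, timedelta

def success_path(base, when, lag_hours=1):
    """Path of the _SUCCESS flag for the hour `lag_hours` before `when`.

    Data for an hour only lands after that hour closes, so the check
    looks back one full hour rather than at the still-open current hour.
    """
    hour = when.replace(minute=0, second=0, microsecond=0) - timedelta(hours=lag_hours)
    return "%s/%04d/%02d/%02d/%02d/_SUCCESS" % (
        base, hour.year, hour.month, hour.day, hour.hour)

# An active plugin would then run something like
#   hdfs dfs -test -e <path>
# and map the exit status to an OK/CRITICAL nagios result.
```

For a run at 02:15 on 2014-07-31 this yields a `.../2014/07/31/01/_SUCCESS` path of the shape quoted later in the discussion.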
[19:07:30] imo nrpe is the most hacky of all :-) [19:07:31] writing out a file that you then actively check for modtime [19:07:36] ah ah [19:07:36] yeah [19:07:44] so, that is sort of equivalent to this [19:07:46] it isn't a singel file [19:07:53] if this was an active check [19:08:01] i would ahve to infer the current hour to check (this could only run hourly) [19:08:13] and then format a path with the hour [19:08:19] and then check for a _SUCCESS file in that path [19:08:20] so [19:08:21] yeah so you do your op and bail if error and only submit if you have verified everything is cool [19:08:32] .../2014/07/31/01/_SUCCESS [19:08:32] that way any unknown we didn't submit it worked state is considered back by default, eh? [19:08:44] it's partially just preference [19:08:54] ? [19:08:56] I've tracked down so many, "well it touched my file but didn't really complete" [19:09:03] cases I tend to dislike doing it that way [19:09:18] so is this a cron? [19:09:20] we are monitoring [19:09:28] its not cron, no [19:09:32] its oozie! [19:09:34] haha [19:09:40] its a job scheduler in hadoop [19:09:50] that does some stuff to make the data queryable every hour [19:09:56] it also checks the hourly data for missing or duplicate data [19:10:05] if everything is cool, it generates a _SUCCESS file as its last step [19:10:13] ah so that's all default behavior [19:10:29] in that case, sure check that file could be most sane [19:10:44] um, the _SUCCESS file we told it to make, but that is a standard thing to do [19:10:58] if oozie can do an action at the end of its run, calling send_nsca is not a big whoop [19:11:07] ah, then I would vote for that I guess [19:11:22] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/partition/add/workflow.xml#L130 [19:11:28] but just trying to be helpful not step into everyones bidnezz [19:11:29] yeah, it can totally do that [19:11:32] but [19:11:35] we don't run the ncsa daemon! [19:11:37] i'd have to set that up! 
[19:11:39] wouldn't I?! [19:11:46] I thought we did [19:11:51] or last time I checked we did both [19:12:00] oh! [19:12:00] ottomata: of course we do [19:12:02] traps + nsca [19:12:06] it runs on the master [19:12:12] the client is not a service [19:12:16] service name? [19:12:20] i know, i'm looking on neon [19:12:33] send_nsca connects to the master and posts its report sorta like HTTP [19:12:34] maybe it didn't come back up post-neon issues? [19:13:01] and then the master has to have a preconfigured check to pull that report from the nsca daemon into icinga [19:13:23] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:13:26] NSCA! [19:13:30] i have been writing NCSA all this time [19:13:42] that drives me nuts [19:13:42] hahaha [19:13:48] * jgage used to work for NCSA [19:14:01] fwiw it's running :) [19:14:05] just remember nagios is not supercomputing, and it all comes clear [19:14:07] haha, someone even typed it out in a comment wrong [19:14:08] nsca that is [19:14:11] icinga.pp line 638 [19:14:12] haha [19:14:14] # ncsa on port 5667 [19:14:16] haha [19:14:27] ok cool [19:14:33] I noticed it but didn't want to be a pedantic tool on you :) [19:14:43] (03PS3) 10Ori.livneh: HHVM: add hhvm-debug-dump script [operations/puppet] - 10https://gerrit.wikimedia.org/r/150593 (owner: 10BryanDavis) [19:15:34] ottomata: see neon:/etc/icinga/nsca_frack.cfg for all the checks that pull nsca data into icinga [19:16:17] if you start spewing data to nsca before the master-side checks are configured, it will just spew to the log and drop the report [19:17:37] hm ok so, how do you actually send the data? just write to the port? or is there this 'send_nsca' script somewhere (still looking...)
[19:17:51] send_nsca is a shell command [19:18:00] it takes the report as a series of command line arguments [19:18:05] one call per metric [19:18:24] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 310 seconds [19:18:33] nsca-client package! [19:18:36] http://manpages.ubuntu.com/manpages/hardy/man1/send_nsca.1.html [19:18:42] yeah, has to be installed of course :-) [19:19:04] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 323 seconds [19:20:00] awesome, ok going to do this then [19:20:03] qchris, ok with you? [19:20:10] Sure. [19:20:12] add one more action to the end of the chain? [19:20:18] call send_nsca from that? [19:20:26] akosiaris: are you still about? [19:20:28] Yes. Totally. [19:20:36] and then i'll set up passive freshness checks on icinga [19:20:37] cool [19:20:48] is anyone up for reviewing a bit of noob python code for me? [19:20:50] welp, we wrote a nice little script, qchris :) [19:20:54] guess we'll keep it around [19:20:55] Jeff_Green: I can sure [19:20:59] (03PS1) 10Ori.livneh: Add transparency.wikimedia.org misc-varnish CNAME [operations/dns] - 10https://gerrit.wikimedia.org/r/151902 [19:21:02] beta labs is returning immediate HTTP 503s for api.php, load.php, and wiki pages, e.g. http://en.wikipedia.beta.wmflabs.org/w/index.php [19:21:41] that's me, i'll fix [19:21:45] I think the helpful folk in the #wikimedia-labs channel are at Wikimania. In the past people have restarted HHVM... [19:22:32] ori: is that transparency stuff all ok? [19:22:41] I was looking into it too, seems it will be a big problem if not solved [19:22:48] chasemp: cool. it's ocg1001:/root/check_ocg_health -- It's a nagios check for the OCG server, polls it on localhost and generates a nagios-ish report.
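Going by the send_nsca manpage linked above, the report is actually piped to the command on stdin as one tab-separated line per check result (host, service, return code, plugin output), rather than as command-line arguments. A minimal sketch; the target host neon and the service name here are assumptions taken from this conversation, not the deployed configuration:

```shell
#!/bin/sh
# Submitting a passive check result with send_nsca. Per the manpage,
# the report goes on stdin as:
#   <host>\t<service>\t<return code>\t<plugin output>\n
# The master host "neon" and the service name are assumed examples.
format_report() {
    # $1 = host, $2 = service, $3 = return code, $4 = plugin output
    printf '%s\t%s\t%d\t%s\n' "$1" "$2" "$3" "$4"
}

report=$(format_report "$(hostname)" webrequest_partition_add 0 "OK: _SUCCESS written")
echo "$report"

# The real call (needs the nsca-client package, plus a matching
# master-side passive service definition, or the report gets dropped):
# echo "$report" | send_nsca -H neon -c /etc/send_nsca.cfg
```

Return code follows the usual nagios convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.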
you can run it as any user [19:22:52] afaict nsca-client is not installed anywhere (via puppet at least) [19:23:01] ottomata: it's on aluminium [19:23:24] (03PS1) 10Ori.livneh: HHVM: update dynamic_extension_path for new package [operations/puppet] - 10https://gerrit.wikimedia.org/r/151903 [19:23:42] Jeff_Green: is it bad to request a changeset? we could do it in phabricator in labs if you wanted :) [19:23:50] chasemp: i've done that before (set up a misc non-wiki web site behind misc-varnish-lb) and i know the best practices faidon / mark recommend for it. if you can review, i can submit patches [19:23:56] Jeff_Green: installed via puppet? [19:24:14] (03CR) 10Ori.livneh: [C: 032 V: 032] HHVM: update dynamic_extension_path for new package [operations/puppet] - 10https://gerrit.wikimedia.org/r/151903 (owner: 10Ori.livneh) [19:24:15] ottomata: once upon a time it was, but I've been ripping puppet off this box because people keep breaking it [19:24:21] :-P [19:24:27] ori: yeah I've done it once so I think I can? Where are the files, etc tho? [19:24:29] and I desperately want to kill the box [19:24:31] aye [19:24:32] ha [19:24:33] k [19:25:00] chasemp: whaddya mean. [19:25:35] i have zero experience with phabricator, partly because I can't get over the name [19:25:40] I was thinking changeset +1'ing and that stuff, do you want me to run it as it stands to verify function or poke it with a stick in code review? [19:25:52] oh [19:26:15] i guess ultimately it goes in puppet like the other local-built nagios plugins [19:26:33] phool, it's the phuture :) [19:27:04] I'm willing to bet a very small amount of money in a year we'll think it is the misguided way of the past [19:27:11] :-) [19:27:19] chasemp: not 100% sure, reconstructing from the ticket [19:27:28] (just based on the name alone) [19:27:38] ori: yup thanks!
[19:27:53] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.001 second response time [19:28:02] chasemp: first thing I guess I'm looking for is style stuff--I have no sense of what is pythonic yet [19:28:13] chasemp: git clone https://gerrit.wikimedia.org/r/wikimedia/TransparencyReport [19:28:22] i gotta run and get food, dying. biab. [19:29:09] (03CR) 10Rush: [C: 031] "seems right" [operations/dns] - 10https://gerrit.wikimedia.org/r/151902 (owner: 10Ori.livneh) [19:29:15] ori: do I need to merge that? [19:29:33] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 317 seconds [19:29:33] akosiaris: Hi! [19:29:45] chasemp: if you could, then i can bang out the other stuff. it looks like it's using a static site generator. we won't want to generate it in production, so i'll create a branch with the generated static files [19:29:53] (03CR) 10Bsimmers: [C: 04-1] HHVM: add hhvm-debug-dump script (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/150593 (owner: 10BryanDavis) [19:29:59] https://bugzilla.wikimedia.org/show_bug.cgi?id=68995 popped up on my list of bugs with merged patches [19:30:04] (03CR) 10Rush: [C: 032 V: 032] "seems right" [operations/dns] - 10https://gerrit.wikimedia.org/r/151902 (owner: 10Ori.livneh) [19:30:05] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 332 seconds [19:30:06] Guess it should be closed? 
[19:30:34] ori: ok I updated dns [19:30:40] I would think a few minutes and you are gtg [19:30:50] odder: I think akosiaris is afk due to the time there now [19:31:01] may want to comment in the issue and/or circle back tomorrow :) [19:31:23] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:31:31] ori, they said something about a cron job to pull in from translatewiki [19:31:52] Oh right, they have a crazy timezone back in Greece [19:33:01] odder, the RT linked from there is resolved [19:33:28] Can't tell [19:33:47] jeremyb: Feel free to close it, then? [19:35:01] odder, > Built the package using tag v0.10.0 as pushed to the gerrit repo by Antoine. Package is ready and uploaded on apt.wikimedia.org. Resolving this, feel free to reopen if necessary [19:35:34] * jeremyb runs away [19:37:13] jeremyb: closed it myself then [19:39:39] qchris: do you think I should try to send a passive failure if _SUCCESS is not created for some reason? [19:39:44] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds [19:39:53] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.018 second response time [19:40:03] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 30 data above and 0 below the confidence bounds [19:40:07] or just let the passive check_freshness check trigger a failure? [19:40:10] hm [19:40:19] ottomata: No clue. [19:40:21] also, should this be at the end of the workflow? [19:40:26] or hm, should it be its own workflow? [19:40:33] that is triggered by the _SUCCESS done flag [19:40:33] hm [19:40:42] ottomata: when in the process is _SUCCESS triggered now? [19:40:47] at the end yes? [19:40:48] yes [19:40:51] ottomata: But if we know that something failed, we should push that signal forward. 
So sending a "passive failure" (if that's possible) sounds better to me. [19:40:55] but, this makes it harder to test in labs [19:40:58] if I add this stuff to this workflow [19:41:14] so, qchris: [19:41:15] [19:41:18] [19:41:18] [19:41:44] Let me look at the workflow files again ... [19:41:45] basically the same exec, but with different output and retcodes passed to the send_nsca command [19:42:12] yep that seems good says the guy who has no idea :) [19:42:25] but then if even the fail ...fails to send it will expire and alert at the freshness interval [19:42:38] but failing explicitly is always best practice even with passive I think [19:44:03] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 91 seconds [19:44:11] ottomata: I'd just add it as an action after "mark_dataset_done" within workflow.xml [19:44:18] I would not make it a separate workflow. [19:44:32] ok yeah [19:44:33] so [19:44:33] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay -0 seconds [19:44:34] something like this? [19:44:34] https://gist.github.com/ottomata/5f7694d374acb3b10cd9 [19:44:48] * qchris looks [19:45:44] the ppl who said, let's make a human readable format and then came up with xml [19:45:52] they were definitely interesting folk [19:45:57] I am not sure the "" in send_nsca_icinga_check_fail is good. [19:46:13] Because we only collect the error of the last (not first) failing action. [19:46:28] So we'd bubble up the error of failing nsca, not the real error. [19:47:08] Also ... this code only comes into play after mark_dataset_done, so if [19:47:19] there are duplicates or missing files (and hence [19:47:35] mark_dataset_done is not reached), Icinga does not get a passive failure. [19:48:43] So maybe the "error to" part of "check_sequence_statistics" should also fire the passive failure?
[19:48:43] (03PS1) 10Ori.livneh: provision transparency.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/151905 [19:48:46] chasemp: ^ [19:49:14] actually let me amend a small thing real quick [19:49:20] k [19:51:30] (03PS2) 10Ori.livneh: provision transparency.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/151905 [19:51:43] chasemp: amended. note that misc-varnish gives it https for free, too [19:57:09] (03CR) 10Rush: provision transparency.wikimedia.org (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151905 (owner: 10Ori.livneh) [19:57:15] one comment not sure if http.host has to be [19:57:21] http.Host [19:57:22] as in other examples [19:57:48] (03CR) 10Ori.livneh: provision transparency.wikimedia.org (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151905 (owner: 10Ori.livneh) [19:58:05] qchris: good point, fail should come as the error to for check_sequence_statistics [19:58:32] ottomata: Maybe also to the "add_partition" action? [19:59:02] well, for all of them, really, right? [19:59:07] Right :-) [19:59:48] (03PS3) 10Ori.livneh: provision transparency.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/151905 [20:01:35] this is going to be so hard to test! [20:03:40] Meh. We just fake the send_nsca binary on the labs instances. ... with something like 'echo "$@" >>well_known_file' [20:04:31] qchris: do you have time to help me test? i betcha not, since this is not in your sprint, eh? [20:04:44] Hahaha. Do not tell my team :-P [20:05:05] But today, I won't find time. [20:05:14] I can test tomorrow in the morning. [20:05:19] (03CR) 10Rush: [C: 031] "I'm not crazy about putting the template under teh generic apache module, but no better ideas for now. 
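qchris's `echo "$@" >>well_known_file` trick for labs testing can be fleshed out into a stub like the following (all paths illustrative). Since the real send_nsca reads the report from stdin, the stub captures that too, so the workflow under test can run unchanged:

```shell
#!/bin/sh
# Fake send_nsca for labs testing, per the 'echo "$@" >>well_known_file'
# idea above, extended to also capture the stdin report. Paths are
# illustrative, not anything deployed.
mkdir -p /tmp/fake_bin
cat > /tmp/fake_bin/send_nsca <<'EOF'
#!/bin/sh
# stub send_nsca: record arguments and the stdin report, always succeed
{ echo "args: $*"; cat; } >> /tmp/send_nsca_calls.log
EOF
chmod +x /tmp/fake_bin/send_nsca

# put the stub first on PATH, then exercise it like the real thing
PATH="/tmp/fake_bin:$PATH"
printf 'host1\tmy_check\t0\tOK\n' | send_nsca -H neon -c /etc/send_nsca.cfg
cat /tmp/send_nsca_calls.log
```

Inspecting the log file then verifies which passive results the workflow would have submitted, without a reachable nsca daemon.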
should work" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151905 (owner: 10Ori.livneh) [20:05:29] ottomata: Just leave something for me in gerrit :-) [20:05:34] k [20:06:01] (03CR) 10Ori.livneh: "@chasemp: It's not in the Apache module; it's in the apache subfolder of the top-level templates directory, where we have other site confi" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151905 (owner: 10Ori.livneh) [20:06:29] chasemp: i can merge/deploy if you like [20:06:43] (03CR) 10Rush: "ah yes you are right :) don't like that either heh. but I think is SOP for now" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151905 (owner: 10Ori.livneh) [20:06:49] go for it [20:07:07] @seen manybubbles [20:07:07] (03PS4) 10Ori.livneh: provision transparency.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/151905 [20:07:15] (03CR) 10Ori.livneh: [C: 032 V: 032] provision transparency.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/151905 (owner: 10Ori.livneh) [20:07:59] ^d: https://bugzilla.wikimedia.org/show_bug.cgi?id=68558 got the patch merged, care to mark as fixed? [20:11:47] (03PS1) 10Ori.livneh: Fix typo in I904f5fc81 [operations/puppet] - 10https://gerrit.wikimedia.org/r/151958 [20:12:26] (03CR) 10Ori.livneh: [C: 032 V: 032] "argh blargh" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151958 (owner: 10Ori.livneh) [20:20:27] ori: are you ok with that stuff, any help from me needed? [20:21:07] chasemp: it applied correctly on zirconium and misc-varnish, but i'll need to follow up with another patch, give me a minute [20:21:20] k [20:22:42] Who is Chasemp? [20:22:44] oO [20:23:06] some jerk [20:23:48] Oh, you're Chase Pettet [20:23:56] "Operations Engineer" [20:23:59] Cool [20:24:04] chasemp: I agree; he's a annoying jerk too. 
[20:24:11] Heh [20:38:10] (03PS1) 10Ottomata: Set up passive icinga for webrequest data imports in HDFS and Hive [operations/puppet] - 10https://gerrit.wikimedia.org/r/151963 [20:39:22] (03CR) 10Ottomata: "I can't say I fully understand this yet, but I think this is how it works..." [operations/puppet] - 10https://gerrit.wikimedia.org/r/151963 (owner: 10Ottomata) [20:40:42] (03PS1) 10Ori.livneh: transparency.wikimedia.org: 404 until 21pm UTC [operations/puppet] - 10https://gerrit.wikimedia.org/r/151964 [20:44:09] (03CR) 10Rush: [C: 031] transparency.wikimedia.org: 404 until 21pm UTC [operations/puppet] - 10https://gerrit.wikimedia.org/r/151964 (owner: 10Ori.livneh) [20:44:25] (03CR) 10Ori.livneh: [C: 032] transparency.wikimedia.org: 404 until 21pm UTC [operations/puppet] - 10https://gerrit.wikimedia.org/r/151964 (owner: 10Ori.livneh) [20:55:39] (03PS1) 10Ori.livneh: transparency.wikimedia.org: set actual launch time [operations/puppet] - 10https://gerrit.wikimedia.org/r/151965 [20:59:04] "Failed to load RSS feed from http://blog.wikimedia.org/feed/: HTTP request timed out." 
[20:59:32] Bsadowski1: wfm [21:00:35] https://wikimediafoundation.org/wiki/Template:Blogbox [21:00:37] o_o [21:07:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [21:08:56] (03PS4) 10Ori.livneh: HHVM: add hhvm-debug-dump script [operations/puppet] - 10https://gerrit.wikimedia.org/r/150593 (owner: 10BryanDavis) [21:11:52] (03PS2) 10Ori.livneh: transparency.wikimedia.org: set actual launch time [operations/puppet] - 10https://gerrit.wikimedia.org/r/151965 [21:12:22] (03PS3) 10Ori.livneh: transparency.wikimedia.org: set actual launch time [operations/puppet] - 10https://gerrit.wikimedia.org/r/151965 [21:12:31] (03CR) 10Ori.livneh: [C: 032 V: 032] transparency.wikimedia.org: set actual launch time [operations/puppet] - 10https://gerrit.wikimedia.org/r/151965 (owner: 10Ori.livneh) [21:35:46] (03CR) 10Ori.livneh: [C: 032] HHVM: add hhvm-debug-dump script [operations/puppet] - 10https://gerrit.wikimedia.org/r/150593 (owner: 10BryanDavis) [21:43:07] (03PS6) 10Dr0ptp4kt: WIP: Log Internet.org via header in X-Analytics when appropriate [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 [21:43:56] apergos: Kicked of the wikidata dumps yet? 
[21:44:24] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [21:45:25] (03PS7) 10Dr0ptp4kt: Log when Internet.org in X-Analytics iorg when appropriate [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 [21:46:27] (03PS8) 10Dr0ptp4kt: Log when Internet.org in X-Analytics iorg when appropriate [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 [21:56:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [22:32:34] (03CR) 10Yurik: [C: 031] Log when Internet.org in X-Analytics iorg when appropriate [operations/puppet] - 10https://gerrit.wikimedia.org/r/151687 (owner: 10Dr0ptp4kt) [22:52:20] (03PS1) 10Ori.livneh: HHVM: keep DSO path up-to-date; rename script [operations/puppet] - 10https://gerrit.wikimedia.org/r/151980 [23:02:17] (03CR) 10Bsimmers: [C: 031] HHVM: keep DSO path up-to-date; rename script [operations/puppet] - 10https://gerrit.wikimedia.org/r/151980 (owner: 10Ori.livneh) [23:04:03] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 28 data above and 0 below the confidence bounds [23:04:05] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 29 data above and 0 below the confidence bounds [23:05:28] (03CR) 10Ori.livneh: [C: 032] HHVM: keep DSO path up-to-date; rename script [operations/puppet] - 10https://gerrit.wikimedia.org/r/151980 (owner: 10Ori.livneh) [23:08:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [23:18:03] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Tue 05 Aug 2014 21:17:16 UTC [23:29:33] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [23:37:23] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Tue Aug 5 23:37:19 UTC 2014 [23:45:04] PROBLEM - graphite.wikimedia.org on tungsten is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:45:13] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 28 data above and 0 below the confidence bounds [23:45:14] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 28 data above and 0 below the confidence bounds [23:45:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [23:45:53] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.015 second response time [23:47:33] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [23:59:35] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]