[00:17:31] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:52:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [01:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [01:31:27] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: No successful Puppet run in the last 10 hours [01:32:27] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: No successful Puppet run in the last 10 hours [01:36:27] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [01:36:27] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: No successful Puppet run in the last 10 hours [01:38:38] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: No successful Puppet run in the last 10 hours [01:39:38] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: No successful Puppet run in the last 10 hours [01:39:38] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: No successful Puppet run in the last 10 hours [01:40:38] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: No successful Puppet run in the last 10 hours [01:40:38] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: No successful Puppet run in the last 10 hours [01:43:38] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: No successful Puppet run in the last 10 hours [01:45:38] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: No successful Puppet run in the last 10 hours [01:46:38] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: No successful Puppet run in the last 10 hours [01:48:38] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: No successful Puppet run in the last 10 hours [01:51:38] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: No successful Puppet run in the last 10 hours [01:56:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:56:38] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: No successful Puppet run in the last 10 hours [01:57:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [01:57:38] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: No successful Puppet run in the last 10 hours [02:08:04] !log LocalisationUpdate completed (1.22wmf14) at Tue Sep 3 02:08:03 UTC 2013 [02:08:11] Logged the message, Master [02:09:55] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [02:14:16] !log LocalisationUpdate completed (1.22wmf15) at Tue Sep 3 02:14:15 UTC 2013 [02:14:22] Logged the message, Master [02:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [02:25:20] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Sep 3 02:25:20 UTC 2013 [02:25:26] Logged the message, Master [02:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second 
response time [03:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [03:26:02] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 10 hours [03:29:02] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [03:31:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [03:43:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:44:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [03:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [03:57:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:58:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [04:07:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:11:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [04:31:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [04:40:45] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [04:40:45] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [04:40:45] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [04:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [04:57:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:59:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [05:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [05:31:51] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [05:51:01] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [05:52:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [06:22:28] PROBLEM 
- Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [06:26:18] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [06:27:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:28:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.165 second response time [06:32:18] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [06:39:18] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [06:42:21] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [06:51:21] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [06:52:21] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [06:55:21] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [06:55:21] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [06:55:21] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [06:58:21] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [06:58:21] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [07:01:21] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [07:01:21] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [07:01:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:02:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.146 second response time [07:04:22] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [07:07:08] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [07:07:08] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [07:08:08] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [07:08:08] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [07:12:08] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [07:21:28] PROBLEM - Disk space on wtp1022 is CRITICAL: DISK CRITICAL - free space: / 339 MB (3% inode=77%): [07:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:22:38] PROBLEM - Disk space on wtp1023 is CRITICAL: DISK CRITICAL - free space: / 335 MB (3% inode=77%): [07:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.149 second response time [07:27:08] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [07:27:58] PROBLEM - Parsoid on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:30:08] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [07:30:28] RECOVERY - Disk space 
on wtp1022 is OK: DISK OK [07:30:38] RECOVERY - Disk space on wtp1023 is OK: DISK OK [07:48:35] akosiaris: good morning :) [07:48:53] hashar: good morning to you too. [07:49:28] I found out an issue with the git version provided by Ubuntu :] [07:49:41] yet another ticket huh [07:50:18] your php packages are fresh out of the oven. I have not yet uploaded them to apt. Would you like to test them first ? [07:50:37] the one with git-http-backend and --references ? [07:50:37] I have no clue how to test them :( [07:50:50] though I could manually deploy them on beta cluster and on the contint server [07:51:05] yup git-http-backend and --references [07:52:11] maybe recompiling with symbols to get a stacktrace would help but that is beyond my knowledge [07:52:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:55:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [07:57:02] hashar: good morning [07:57:10] paravoid: :-] [07:57:17] it is good to see you guys awake in the morning [07:58:07] akosiaris: I have no clue how we handle PHP upgrades usually [07:58:31] but having them manually deployed on the beta cluster and contint server might catch most of the potential issues [07:58:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:58:36] hashar: https://gerrit.wikimedia.org/r/82264 [07:58:49] hashar: could you help with integrating that into jenkins? [07:58:49] as for the rest of the cluster, it might need to be scheduled properly and the whole of engineering notified [07:59:03] the way this works is one of two ways (after you include authdns::lint) [07:59:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [07:59:25] you either call "authdns-lint <checkout dir>" or "authdns-lint <checkout dir> <output dir>" [07:59:38] in the first case it'll mktemp a temp directory for you and clean it up at the end [07:59:41] paravoid: iirc the job is already ready and triggered [08:00:12] ah you provide a shell script \O/ [08:00:20] yeah [08:00:24] if it exits 0, all is fine [08:00:36] if not, the stdout/err output should be plenty to debug further [08:01:11] also, where should we include the authdns::lint class? [08:01:53] in the ugly modules/contint/manifests/packages.pp [08:01:59] I still haven't split it [08:02:06] it is applied on all Jenkins slaves [08:02:27] it's not a package :) [08:02:42] oh [08:03:01] it's a class that I'll keep up to date internally within authdns [08:04:07] but I guess it kinda fits there [08:04:16] or a new class [08:06:44] (PS3) Faidon Liambotis: authdns: introduce an authdns::lint class [operations/puppet] - https://gerrit.wikimedia.org/r/82264 [08:06:51] paravoid: maybe create a modules/contint/manifests/authdnslint.pp and include that in role::ci::slave [08:06:52] have a look :) [08:06:56] nonono [08:07:17] note that there is already an include for geoip: manifests/packages.pp: include geoip [08:07:32] grumble grumble [08:07:42] that is true [08:07:47] packages.pp is good to me as well [08:09:21] (PS4) Faidon Liambotis: authdns: introduce an authdns::lint class [operations/puppet] - https://gerrit.wikimedia.org/r/82264 [08:09:29] include geoip will do for now... [08:09:44] can you have a look?
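(For reference, the two invocation forms paravoid describes a few lines up, as a minimal sketch; the directory names here are placeholders, not paths from the actual setup:

    # one-argument form: authdns-lint mktemps its own output directory
    # under /tmp and cleans it up at the end
    authdns-lint /path/to/dns-checkout

    # two-argument form: the caller provides, and later cleans up, the
    # output directory
    mkdir -p /path/to/output
    authdns-lint /path/to/dns-checkout /path/to/output

Either way, exit status 0 means the zones lint cleanly.)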
[08:10:09] reading the shell script :) [08:12:14] I deliberately left your wikidata change unmerged so we can dry run this :) [08:12:31] we can always retrigger a job if needed :-] [08:12:43] yeah it was also a reminder to self :) [08:12:57] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [08:14:02] so basically, clone the dns.git in workingdir [08:14:08] yes [08:14:12] in some place [08:14:20] then mkdir -p output; /usr/local/bin/authdns-lint workingdir output [08:14:29] that, or just authdns-lint workingdir [08:14:37] and it'll mktemp something in /tmp for you [08:14:41] and clean it up afterwards [08:14:47] that would clutter the temp directory :-] [08:14:57] it'll clean it up in the end, no huge deal [08:15:35] but I wasn't sure if you want everything to be in SSD or whatever [08:15:46] so I added a third-party provided output dir as well [08:16:51] mkdir -p "$WORKSPACE"/build [08:16:52] /usr/local/bin/authdns-lint "$WORKSPACE" "$WORKSPACE"/build [08:17:06] workspace being the git tree? [08:17:11] yup [08:17:12] git checkout even [08:17:16] that is where Jenkins fetch everything [08:17:20] I guess that could work [08:17:27] PROBLEM - Disk space on wtp1006 is CRITICAL: DISK CRITICAL - free space: / 160 MB (1% inode=77%): [08:17:46] kinda ugly but sure :) [08:18:39] https://gerrit.wikimedia.org/r/82372 :-] [08:18:45] will refresh the job [08:18:51] I guess you can merge in authdns::lint [08:19:23] I refreshed the operations-dns-lint jenkins job, so whenever authdns::lint is applied on gallium we can retrigger the job and see what is happening [08:19:23] wait [08:19:31] will jenkins clean up /build afterwards? [08:19:34] it's important to do so [08:19:38] yup [08:19:46] jenkins delete the $WORKSPACE directory before running the build [08:19:55] the git checkout too? [08:19:58] and fetch again the git repository [08:20:03] yup it is completely wiped [08:20:04] oh god, that's inefficient [08:20:10] fetching the git again?! 
[08:20:15] that is a bit CPU and I/O intensive but that ensures nothing is left around [08:20:15] (CR) Faidon Liambotis: [C: 2] authdns: introduce an authdns::lint class [operations/puppet] - https://gerrit.wikimedia.org/r/82264 (owner: Faidon Liambotis) [08:20:17] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [08:20:28] the git checkouts are fast since the repository and the workspace are on the same device (SSD) [08:20:33] ah [08:20:34] so git does some hardlinks [08:20:38] it doesn't matter if it's ssd [08:20:40] yes, exactly [08:20:45] git does hardlinks for local checkouts [08:20:46] I should make them shallow clone (aka just fetch the current HEAD) [08:21:32] running puppetd on gallium [08:21:40] I tried keeping the workspace and just doing some git clean -whatever but that sometimes left stuff behind :( [08:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.128 second response time [08:23:31] one day I will have to look at stafford :D [08:24:07] PROBLEM - Disk space on wtp1021 is CRITICAL: DISK CRITICAL - free space: / 338 MB (3% inode=77%): [08:24:17] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [08:24:27] RECOVERY - Disk space on wtp1006 is OK: DISK OK [08:25:58] so [08:26:02] paravoid: if you get to https://integration.wikimedia.org/ci/job/operations-dns-lint/ [08:26:06] and log in with your labs account [08:26:13] you should be able to rebuild the last job [08:26:17] (or another one) [08:26:29] once logged in, there is a link on the left 'Rebuild Last' [08:26:50] which takes the parameters passed to the last build and lets you press the button to trigger a run of the job [08:27:04] that does not report back to gerrit but it is a good way to rerun a job and debug what is going on [08:27:04] you didn't merge 82372 yet though :) [08:27:10] ah [08:27:16] I have deployed it nonetheless :-] [08:27:24] will further tweak it if need be [08:27:34] and merge it once it is working :-] [08:28:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:29:20] somewhere is a link to the job console [08:29:22] https://integration.wikimedia.org/ci/job/operations-dns-lint/41/console [08:29:28] SUCCESS \O/ [08:29:40] ;) [08:29:59] and rebuilding again https://integration.wikimedia.org/ci/job/operations-dns-lint/42/console [08:30:07] PROBLEM - Parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:30:11] at the top there is a line saying: [08:30:12] 08:29:48 Wiping out workspace first.
[08:30:17] that means the full $WORKSPACE is deleted [08:30:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [08:30:22] yep [08:30:23] perfect [08:30:24] and the operations/dns.git is recloned [08:30:44] you can have multiple authdns-lint runs at the same time [08:31:00] not sure if you can tell jenkins it doesn't need to serialize them [08:31:02] nop [08:31:02] though it could be configured [08:31:05] the jobs are serialized [08:31:19] right, there's no reason for this test, that's all I'm saying [08:31:20] but we can make them run in parallel [08:31:25] I don't run a DNS server for example [08:31:30] I'm just running checkconf in a chroot [08:31:32] so it's safe [08:31:39] I don't use a shared location anywhere either [08:31:43] merging jenkins job builder configuration change [08:31:55] thanks! [08:32:12] the next step is to make operations-dns-lint block the patchset whenever the job fails [08:32:17] that is done in Zuul configuration [08:32:19] I will do it :-] [08:32:43] please do but feel free to share the patchsets, it's interesting to see how this all works [08:34:43] https://gerrit.wikimedia.org/r/82375 [08:34:45] :-] [08:34:55] each patchset triggers a bunch of jobs [08:35:07] whenever one job fails, that will cause Zuul to report verified -1 [08:35:23] "voting: yes" being the default I'm guessing? [08:35:30] but we can hack some job to not be an issue, that is known in Zuul as voting [08:35:33] yup [08:35:45] so I had added an exception to make the operations-dns-lint result be ignored [08:36:01] yep [08:36:11] on the zuul configuration change https://gerrit.wikimedia.org/r/#/c/82375/ there is a jenkins job result that is ignored [08:36:17] https://integration.wikimedia.org/ci/job/integration-zuul-layoutdiff/411/console : FAILURE in 2s (non-voting) [08:36:32] that does a diff of the Zuul configuration at HEAD^ with HEAD [08:36:36] https://integration.wikimedia.org/ci/job/integration-zuul-layoutdiff/411/console [08:36:54] that is how I manually validate the zuul configuration changes [08:37:03] 08:34:45 -INFO:zuul.IndependentPipelineManager: [nonvoting] [08:37:03] 08:34:45 +INFO:zuul.IndependentPipelineManager: [08:37:20] it is no longer 'nonvoting', hence whenever operations-dns-lint fails, Zuul will vote -1 [08:39:28] (PS1) Hashar: Jenkins validation (please ignore) [operations/dns] - https://gerrit.wikimedia.org/r/82376 [08:39:35] (CR) jenkins-bot: [V: -1] Jenkins validation (please ignore) [operations/dns] - https://gerrit.wikimedia.org/r/82376 (owner: Hashar) [08:39:41] :-] [08:40:51] here is the failure log https://integration.wikimedia.org/ci/job/operations-dns-lint/43/console [08:40:51] (PS2) Hashar: Jenkins validation (please ignore) [operations/dns] - https://gerrit.wikimedia.org/r/82376 [08:41:07] 08:39:34 rfc1035: Zone wikimedia.org.: Zonefile parse error at line 17: unparseable [08:41:08] 08:39:34 rfc1035: Cannot load zonefile 'wikimedia.org', failing [08:41:41] (Abandoned) Hashar: Jenkins validation (please ignore) [operations/dns] - https://gerrit.wikimedia.org/r/82376 (owner: Hashar) [08:41:58] the next step would be to have a job that starts a DNS server [08:42:03] and then run dig commands to validate the records :-] [08:43:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:44:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
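(Putting the pieces of this exchange together, the Jenkins build step amounts to something like the sketch below. The mkdir and authdns-lint lines are the ones hashar pasted earlier; the shallow clone is the optimization he mentions, and in the real job Jenkins performs the checkout itself, so the clone URL is only illustrative:

    # $WORKSPACE is set by Jenkins and wiped before every build;
    # a shallow clone cuts the cost of re-fetching each time
    git clone --depth 1 https://gerrit.wikimedia.org/r/operations/dns "$WORKSPACE"
    mkdir -p "$WORKSPACE"/build
    /usr/local/bin/authdns-lint "$WORKSPACE" "$WORKSPACE"/build
    # non-zero exit fails the build; stdout/stderr is kept for debugging

)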
[08:44:33] paravoid: I guess you can close https://rt.wikimedia.org/Ticket/Display.html?id=5688 "write a puppet class to install gdnsd on Jenkins slaves" [08:46:54] thank you faidon :-] [08:48:09] thank you :) [08:48:14] I think the authdns-lint script is neat [08:48:20] the way to lint this I mean [08:48:32] I can do changes without going outside the authdns module [08:50:05] RECOVERY - Disk space on wtp1021 is OK: DISK OK [08:50:14] hashar: how can I tell jenkins to retry https://gerrit.wikimedia.org/r/#/c/80993/ ? [08:50:17] and vote I mean [08:50:22] is that zuul? [08:50:43] comment 'recheck'? [08:51:42] maybe recheck yeah [08:51:46] that would retrigger the linting job [08:51:56] what's recheck? [08:51:56] fill a comment in the change with the content: "recheck" [08:51:59] oh [08:52:22] (CR) Faidon Liambotis: "recheck" [operations/dns] - https://gerrit.wikimedia.org/r/80993 (owner: Hashar) [08:52:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:52:53] perfect [08:53:06] :-] [08:53:18] another solution is to submit a new patchset [08:53:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [08:53:32] I usually edit the commit summary and insert a new line just above the Change-Id: xxx line [08:53:52] the question was more whether I can hit something under integration.wm.org [08:55:41] (CR) Faidon Liambotis: [C: 2] points wikidata.org to pmtpa wikidata lb [operations/dns] - https://gerrit.wikimedia.org/r/80993 (owner: Hashar) [08:56:26] (PS3) Faidon Liambotis: points wikidata.org to pmtpa wikidata lb [operations/dns] - https://gerrit.wikimedia.org/r/80993 (owner: Hashar) [08:58:23] paravoid: and potentially we could load the DNS zone in a varnish instance then run some dig commands / dns checker :D [08:59:30] (CR) Faidon Liambotis: [C: 2] points wikidata.org to pmtpa wikidata lb [operations/dns] - https://gerrit.wikimedia.org/r/80993 (owner: Hashar) [08:59:51] varnish?! [09:00:49] grrr [09:00:51] vagrant [09:01:01] see, both contain the letters V A and R [09:01:10] var vagrant = "varnish"; [09:01:24] A friend yesterday kept saying 'valgrint' instead of vagrant :P [09:02:56] huh... like valgrind ? he seeks memory leaks in VMs ? :P [09:03:29] :P [09:03:55] * YuviPanda runs vagrant under valgrind [09:04:04] hmm, I wonder how long vbox would last [09:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [09:33:35] akosiaris: so php ? :-] [09:33:48] akosiaris: should we just install it on contint server and the beta cluster to try it out ? [09:34:01] hashar: yes. [09:34:20] I am not sure whether we could provide the package via apt.wm.o without having the full cluster self-upgrade [09:34:31] And then i upload them to apt [09:34:42] i think we could if we used some other component [09:34:43] so I guess we need to scp the packages and manually install them with dpkg -i or something [09:35:11] I am wondering though how i can get the packages to brewster from labs (i built them there) [09:35:27] without going through my PC [09:35:58] if they are small enough, I usually scp them to fenari:/home/hashar/public_html [09:36:07] then wget from http://noc.wikimedia.org/~hashar/ [09:36:13] seems like services can not be hosted on labs ?
This has something to do with the DNAT you we asking about ? [09:36:21] you were* [09:36:50] the nat issue happens whenever an instance attempt to access the public IP of an other instance [09:37:18] not sure how it is related to copying packages from brewster to labs :-D [09:37:28] not related then [09:48:52] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [09:48:52] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [09:52:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:53:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [10:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [10:31:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.145 second response time [11:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [11:27:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:28:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [11:31:51] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: No successful Puppet run in the last 10 hours [11:32:51] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: No successful Puppet run in the last 10 hours [11:36:51] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [11:36:51] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: No successful Puppet run in the last 10 hours [11:38:57] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: No successful Puppet run in the last 10 hours [11:39:57] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: No successful Puppet run in the last 10 hours [11:39:57] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: No successful Puppet run in the last 10 hours [11:40:57] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: No successful Puppet run in the last 10 hours [11:40:57] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: No successful Puppet run in the last 10 hours [11:43:57] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: No successful Puppet run in the last 10 hours [11:45:57] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: No successful Puppet run in the last 10 hours [11:46:57] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: No successful Puppet run in the last 10 hours [11:48:57] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: No successful Puppet run in the last 10 hours [11:51:57] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: No successful Puppet run in the last 10 hours [11:56:57] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: No successful Puppet run in the last 10 hours [11:57:57] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: No successful Puppet run in the last 10 hours [12:04:45] apergos: About? 
[12:07:04] yes [12:07:11] Reedy: what's up? [12:07:36] back [12:09:03] apergos: Could you have a look why http://dumps.wikimedia.org/other/incr/wikidatawiki/ is serving 403s for Denny and a few other people please? [12:09:11] WFM and other people in the WMDE office too... [12:09:29] [13:08:25] i am not 403ed anymore [12:10:08] it serves 403 for people who try to open more than two connections from the same ip at once [12:10:11] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [12:10:25] aha [12:10:30] nice one [12:10:30] it should say so on the page [12:10:44] http://dumps.wikimedia.org/ [12:10:51] indeed, at the top it says so [12:11:05] Does it show the error on the 403? [12:11:10] error/message [12:11:17] no idea [12:11:22] heh [12:12:04] I told Denny to stop wasting resources ;) [12:12:12] do they need the latest dumps right this second? if not they can get better bandwidth and more simultaneous downloads from your.org (but there is a few hour delay) [12:12:15] hah [12:12:57] I should probably go mirror hunting again sometime soon, see if we can't find a couple more sites willing to do all dumps [12:13:01] I think for some/most of the requests they'd be ok hitting the mirrors [12:13:10] and bittorrent!! :-] [12:13:18] yep [12:13:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:13:48] do serial requets and have no problem :-P [12:13:52] *requests [12:14:08] Could we do something with ULSFO? [12:14:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [12:14:23] Get a large storage box and wire it up? [12:14:46] akosiaris: are you still alive ? :-] [12:14:54] I guess doing the same thing in ESAMS might not be possible [12:18:30] well we already do mirror internally, it's just not as good as having a few external mirrors for downloaders [12:18:48] of course the one internal mirror will go away soon (tampa) [12:20:11] Hmmm [12:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:24:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [12:24:46] * Reedy tries poking a local company [12:25:34] They gave Debian 57TB... http://blog.bytemark.co.uk/2013/04/04/a-major-infrastructure-donation-to-the-debian-project [12:25:45] why wouldn't it be possible in esams? ;) [12:26:10] Reedy: yep, they're awesome [12:26:22] it's funny, a few months before that [12:26:27] should give our ceph cluster to debian [12:26:28] hahaha [12:26:38] we sat and discussed and made decisions on focusing on a much smaller amount of PoPs [12:27:05] and then bytemark came and said "hey, do you want all this hardware?" which was more than half of our sites had [12:27:10] and messed our plan [12:27:25] What a shame :p [12:27:58] :P [12:28:04] I feel your pain [12:28:45] Hell, we only ideally need 30TB for "all the dumps" currently :D [12:29:31] you know we have something like 140T assigned for dumps but currently unused, right? :P [12:30:54] That's a lot of 0s to mirror [12:31:29] why haven't we filled the 140T yet? 
[12:32:09] because someone's good at buying/claiming hardware, but not good at actually doing stuff with it [12:32:38] at least it is not a full datacenter idling [12:33:01] full datacenters idling is a good thing [12:33:13] where it provides redundancy / standby [12:33:14] we could create some copies of the 30TB of xml files [12:33:21] that'd soon fill 140T [12:33:37] What about media from Commons? [12:33:52] What about it? [12:36:00] are you also saying we should make a big tarball from them and put them on a file server? [12:36:01] because that would be efficient use of those resources? ;) [12:36:12] I'm doing that btw [12:38:12] (PS1) coren: Tool Labs: package python-scipy [operations/puppet] - https://gerrit.wikimedia.org/r/82398 [12:40:23] (CR) coren: [C: 2] "Trivial package addition" [operations/puppet] - https://gerrit.wikimedia.org/r/82398 (owner: coren) [12:43:30] (PS1) coren: Tool Labs: package python-rsvg [operations/puppet] - https://gerrit.wikimedia.org/r/82399 [12:51:38] (CR) coren: [C: 2] "+package" [operations/puppet] - https://gerrit.wikimedia.org/r/82399 (owner: coren) [12:53:14] (CR) Siebrand: [C: 1] "Can this just be merged?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80717 (owner: Amire80) [12:55:05] (PS1) coren: Tool Labs: Boost for python [operations/puppet] - https://gerrit.wikimedia.org/r/82401 [12:58:23] (PS2) coren: Don't show the IME in the CodeEditor textarea [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80717 (owner: Amire80) [12:59:14] (CR) coren: [C: 2] "Slightly less trivial." [operations/puppet] - https://gerrit.wikimedia.org/r/82401 (owner: coren) [12:59:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:00:08] (CR) coren: [C: 2] "LGM" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80717 (owner: Amire80) [13:00:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time [13:03:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:05:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [13:16:14] (PS1) coren: Tool Labs: User-requested perl packages [operations/puppet] - https://gerrit.wikimedia.org/r/82403 [13:17:08] (CR) coren: [C: 2] "+packages" [operations/puppet] - https://gerrit.wikimedia.org/r/82403 (owner: coren) [13:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:24:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [13:26:40] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 10 hours [13:29:40] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [13:40:25] (CR) Anomie: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/82403 (owner: coren) [13:43:48] (PS1) coren: Tool Labs: fix doubled up package inclusion [operations/puppet] - https://gerrit.wikimedia.org/r/82406 [13:51:45] (CR) coren: [C: 2] "Small fix." [operations/puppet] - https://gerrit.wikimedia.org/r/82406 (owner: coren) [13:52:41] anomie: Thankfully, puppet doesn't throw a fit for duplicates in the same declaration.
:-) [13:53:57] (PS2) coren: Add role to toollabs for generic web proxy [operations/puppet] - https://gerrit.wikimedia.org/r/82047 (owner: Yuvipanda) [13:56:59] (CR) coren: [C: 2] "LGM (new class)" [operations/puppet] - https://gerrit.wikimedia.org/r/82047 (owner: Yuvipanda) [14:08:20] hashar: just letting you know the PHP packages are ready for testing. I updated #5209 [14:13:40] (PS1) QChris: Turn on automatic pulling for geowiki repository [operations/puppet] - https://gerrit.wikimedia.org/r/82409 [14:13:41] (PS1) QChris: Split off geowiki cron job into separate class [operations/puppet] - https://gerrit.wikimedia.org/r/82410 [14:13:42] (PS1) QChris: Extract geowiki's research MySQL config into separate class [operations/puppet] - https://gerrit.wikimedia.org/r/82411 [14:13:43] (PS1) QChris: Add cronjob to generate and push geowiki's limn files [operations/puppet] - https://gerrit.wikimedia.org/r/82412 [14:14:24] akosiaris: just seen it [14:14:44] akosiaris: when checking the PHP5 packages I have installed, I got two extensions which are not provided there: php5-parsekit and php5-xdebug [14:15:01] akosiaris: they both depend on phpapi-20090626 which is still provided by the new package. [14:15:08] maybe there is no need to rebuild them [14:16:20] probably not... not generated from php5 package [14:16:34] plus completely different versioning/maintainers/upstreams [14:16:51] maybe they build depend on some phpapi version [14:16:55] hopefully that is still working [14:17:31] it should. It's only 3 minor patches between the previous version and this one [14:19:41] !git mediawiki/core.git [14:19:42] For more information about git on labs see https://labsconsole.wikimedia.org/wiki/Help:Git [14:19:46] .. [14:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.160 second response time [14:26:12] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [14:26:42] !log upgrading PHP5 on gallium (contint server) [14:26:48] Logged the message, Master [14:28:15] !log gallium: dpkg -i libapache2-mod-php5_5.3.10-1ubuntu3.7+wmf1_amd64.deb php5-cli_5.3.10-1ubuntu3.7+wmf1_amd64.deb php5-common_5.3.10-1ubuntu3.7+wmf1_amd64.deb php5-curl_5.3.10-1ubuntu3.7+wmf1_amd64.deb php5-dbg_5.3.10-1ubuntu3.7+wmf1_amd64.deb php5-dev_5.3.10-1ubuntu3.7+wmf1_amd64.deb php5-gd_5.3.10-1ubuntu3.7+wmf1_amd64.deb php5-intl_5.3.10-1ubuntu3.7+wmf1_amd64.deb php5-mysql_5.3.10-1ubuntu3.7+wmf1_amd64.deb [14:28:15] php5-pgsql_5.3.10-1ubuntu3.7+wmf1_amd64.deb php5-sqlite_5.3.10-1ubuntu3.7+wmf1_amd64.deb php5-tidy_5.3.10-1ubuntu3.7+wmf1_amd64.deb [14:28:21] Logged the message, Master [14:28:22] spammer! [14:28:38] mark said spam is allowed there :] [14:29:20] hashar: It won't fit in a tweet though :( [14:29:22] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:33:07] akosiaris: I have upgraded the PHP package on the contint server (gallium). We will see if code coverage still segfaults ( https://integration.wikimedia.org/ci/job/mediawiki-core-code-coverage/189/console in progress). Thx! [14:33:51] hashar: crossing my fingers :-D [14:37:54] I need to get git upgraded as well [14:38:11] but I have no idea how to handle it.
I could simply copy the package from quantal to precise [14:38:27] and if that works fine, get it updated in apt.wm.o for the whole cluster to upgrade [14:39:22] I will next upgrade the beta cluster instances. [14:40:55] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [14:40:55] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [14:40:55] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [14:46:23] ah mw/core tests are still passing https://gerrit.wikimedia.org/r/#/c/67418/ :-] [14:46:23] drdee: hey [14:46:30] drdee: why do you keep saying cdh 4.5? [14:46:49] drdee: I've pointed out that 4.3.1 has the patch for the bug you referred to at least three times in that thread [14:47:07] yes and i said we could test it [14:47:27] i was just laying out the different options [14:47:39] you layed out that the fixes are in 4.5 as the fact [14:47:54] but the patch for MAPREDUCE-2264 is on 4.3.1 and that's a fact too [14:48:01] in even [14:48:12] "We can install openJDK right now and see how CDH 4.3 behaves. It might work, it might not work, it might be hard to tell whether it's working." [14:48:14] and there are no other known bugs [14:48:29] what's the diff between 4.3.1 and 4.5? do you have other bugs in mind? [14:48:53] if it's just MAPREDUCE-2264 then we know that it's fixed in 4.3.1 as much as we know it's fixed in 4.5 [14:49:03] no, not that i am aware off [14:49:29] so why mention 4.5 at all [14:49:53] because that would be the first version to officially support openjdk 7 [14:50:36] did you do a git bisect to determine that MAPREDUCE-2264 was fixed in cdh 4.3? [14:50:57] no, there's patches in cloudera/patches and explicit mention in debian/changelog about that [14:51:02] I pasted it in my previous mail [14:51:03] (because i just checked the documentation ) [14:51:34] hadoop (2.0.0+1367-1.cdh4.3.1.p0.69) cloudera; urgency=low [14:51:34] Commit 5cce98177a36ceb2f3030fc524d1d1f09b802645: [14:51:34] MR1: MAPREDUCE-2264. Job status exceeds 100% in some cases. (devaraj.k and sandyr via tucu) [14:51:40] aight [14:51:43] git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1@1438286 13f79535-47bb-0310-9956-ffa450edef68 [14:51:46] (cherry picked from commit 142e1b8b722c53be5264310f64c81e8afcda5457) [14:51:49] [14:51:51] yeah so as i said it might work [14:51:51] Reason: Customer/support request [14:51:54] Ref: CDH-7179 [14:51:56] Author: Sandy Ryza [14:51:57] also [14:51:57] Commit 2c0428f73eddda137e8aa6ef4cb0112b64f0e2d3: [14:51:57] MR1: MAPREDUCE-5008. Merger progress miscounts with respect to EOF_MARKER. Contributed by Sandy Ryza. [14:52:00] Author: Sandy Ryza [14:52:01] Reason: Fix progress after MAPREDUCE-2264 [14:52:03] Ref: CDH-11368 [14:52:26] but that still does not answer my question :) [14:52:28] honestly, I think this is misrepresenting things [14:52:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:37] not my intention [14:52:53] if there are chances that it doesn't work, then it won't work with cdh-4.5 either [14:53:03] i.e. 
both have the same code [14:53:22] well cdh 4.5 is not released yet so we can't say that [14:53:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [14:53:52] paravoid, akosiaris, I'd like to merge https://gerrit.wikimedia.org/r/#/c/77332/4, does either of you have time to review this evening? [14:54:07] where is it stated that it'll officially support openjdk 7? [14:54:10] that's interesting [14:54:24] andrewbogott: i have started looking at it ... [14:54:40] nowhere yet but looking at the changes you can see that a lot of effort is made to make it openjdk 7 compatible [14:55:06] i think the question is one of prioritization, as in my email, what do we gain by speeding up phasing out oracle jdk 6 for hadoop by a couple of months [14:55:24] I think we should actually go for jdk 7 [14:55:27] right now [14:55:43] oracle jdk 7? [14:55:54] either open or oracle, preferably open [14:56:03] akosiaris, ok, thank you [14:56:07] sure, that's the end goal and we all agree [14:56:16] cdh 4.2 release notes mentioned 2264 as the only catch and we know that's fixed [14:56:27] so unless we have evidence to the contrary, why wouldn't it work? [14:56:45] because cdh 4.3 says that it has the same restrictions as 4.2 [14:56:56] which is still just 2264 [14:57:03] which the changelog says was fixed in 4.3.1 [14:57:18] so, that leaves us with no known bugs [14:57:34] were the hive bugs fixed in 4.3? [14:57:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:57:40] these were build time bugs [14:57:45] well, bug [14:57:51] the one you mentioned, I'm not aware of any others [14:58:23] but what do we gain by speeding it up a couple of months? [14:58:38] I don't understand [14:58:49] we have no facts that anything will break [14:58:56] why would we not do it now instead of postponing it? [14:59:20] you're doing conformance and performance testing now, it doesn't have any production clients yet (and remember, you can't mix and match jdk 6/7, so that would mean a complete downtime in two months) [14:59:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [14:59:44] it looks like the perfect time to do it and we basically have no reason /not/ to do it now [14:59:46] hi guyyyyys! [15:00:02] heya [15:00:19] hi [15:00:25] hola [15:00:57] I think drdee is worried about the unknown risk of using the not-recommended system [15:01:11] but, but, i tend to agree with faidon here I think [15:01:13] now is a really good time to do this [15:01:28] if there are no currently known issues with doing so [15:01:56] k give it a spin in labs? [15:02:47] why isn't it recommended? they said it works with 4.2.0, which is months ago [15:04:14] oracle jdk not openjdk [15:04:22] http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Requirements-and-Supported-Versions/cdhrsv_topic_3.html [15:04:22] andrewbogott: why not 4 spaces instead of tabs? [15:04:35] andrewbogott: also if $::lsbdistid == "Ubuntu" { isn't really needed any more [15:04:42] paravoid: If it has tabs that's by mistake. [15:04:52] andrewbogott: it does as far as I can see [15:04:59] ok, I will fix [15:05:28] hashar: CI meeting ? [15:05:43] oooops [15:06:02] akosiaris: andrewbogott: do we have anything to report for CI ? [15:06:09] I sure don't.
[15:06:14] neither do I :-D [15:06:25] drdee: openjdk & oracle jdk 7 are very very close, but falling back to oracle jdk 7 if openjdk 7 doesn't do it is still an improvement [15:06:25] Other than "I'm back and hoping to start working on this again" [15:06:35] not really. I have 1-2 things but nothing important [15:06:41] I guess we can skip [15:06:48] I can always email you [15:06:53] aren't we all in SF next week ? [15:07:23] yes we are [15:07:26] we can get a quick informal meetup to discuss ci [15:07:30] and a CI round of beer hehe [15:07:43] that sounds like fun :-D [15:07:59] paravoid, ottomata: i am apparently not able to convince you guys so i will withdraw my objections, and i just hope it works [15:08:33] drdee: I'm open to being convinced but I'd like something more solid than fear it'll break :) [15:09:10] bugs we can work on, or track them to see their progress [15:09:23] I'd also hate to see a painful transition a couple of months from now [15:09:38] the "do not mix versions" requirement won't go away [15:09:59] plus, it'd be harder to do performance comparisons and benchmarks when we'll have users in production [15:10:09] so I think this is the ideal time to at least test it and get some numbers [15:10:44] well compiling using jdk 7 won't work for sure: check https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/eGeLA7XVXAk [15:10:47] but let's try it! [15:10:56] at the end of the day, let the data speak [15:11:13] getting coffee [15:11:22] we don't care about compiling with jdk 7 though, aiui [15:11:36] we can build with 6 (preferably also openjdk) [15:11:45] and run on openjdk 7 and go forward with that [15:11:59] aight, let's do it! [15:12:12] heh [15:12:41] 15:11:53 /tmp/hudson9015395653267143336.sh: line 4: 23338 Segmentation fault php tests/phpunit/phpunit.php --exclude-group Dump,Broken,ParserFuzz,Stub --coverage-html /srv/org/wikimedia/integration/cover/mediawiki-core/master/php [15:12:42] !!!!!!!! [15:12:46] yeah segfaulting still :( [15:12:51] :-( [15:13:05] 14:31:47 PHP 5.3.10-1ubuntu3.7+wmf1 is installed. [15:13:34] I guess I am not good at figuring out PHP upstream issues [15:13:56] ok cool, (sorry this mobile internet is flaky sometimes). That's a yes to trying openjdk 7? [15:14:02] drdee? [15:14:18] akosiaris: maybe I should run the job harnessed in gdb and have it generate a backtrace for us :D [15:14:49] yeah... and hopefully chase it down from there [15:15:29] do we have debugging symbols in our php package? [15:17:02] php5-dbg_5.3.10-1ubuntu3.7+wmf1_amd64.deb [15:17:10] make sure this is installed. [15:17:30] and you should be able to get a bt [15:17:48] with at least PHP's symbols [15:18:14] I got it :-] [15:19:55] warning: the debug information found in "/usr/lib/debug/usr/lib/php5/20090626/pdo_mysql.so" does not match "/usr/lib/php5/20090626/pdo_mysql.so" (CRC mismatch). [15:19:56] I guess I will survive :-] [15:19:58] (PS1) Ottomata: Syncing debian/bin/kafka script with recent 0.8 branch bin/*.sh scripts. [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/82419 [15:20:45] (CR) Ottomata: [C: 2 V: 2] Syncing debian/bin/kafka script with recent 0.8 branch bin/*.sh scripts. [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/82419 (owner: Ottomata) [15:21:45] !log gallium running mw phpunit code coverage under gdb in a tmux.
[15:21:51] Logged the message, Master [15:22:22] there's something you need to include in gdb to get meaningful zend backtraces [15:23:56] there's a .gdbinit [15:24:02] then you get a zbacktrace [15:24:15] i just googled it again, I knew it was something but I didn't remember what :0 [15:24:21] huh ? [15:24:26] zbacktrace ? [15:24:31] yeah [15:24:33] * akosiaris sighs [15:25:01] there's a .gdbinit somewhere in the php source [15:25:19] okay, bbl [15:25:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:45] cat ./php5-5.3.10/.gdbinit |wc -l [15:25:46] 564 [15:26:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.144 second response time [15:27:36] PROBLEM - RAID on analytics1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:27:41] akosiaris: https://wikitech.wikimedia.org/wiki/GDB_with_PHP [15:28:05] the doc says: source ~tstarling/php.gdb :] [15:31:46] PROBLEM - DPKG on analytics1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:31:56] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [15:34:35] akosiaris: I get my backtrace http://paste.openstack.org/show/45679/ not sure what else is needed though :D [15:34:36] PROBLEM - RAID on analytics1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:37:01] someone knowing zend is needed i fear... cause line 889 of zend_alloc_canary.c is ZEND_MM_CHECK_TREE(mm_block); [15:37:26] i am running gdb with gdbinit loaded [15:37:39] and I must say I have no idea what that function does [15:37:59] (PS1) Ottomata: Adapting debian/bin/kafka's server-stop command to change from bin/kafka-server-stop.sh introduced in kafka-1031. [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/82421 [15:38:51] (CR) Ottomata: [C: 2 V: 2] Adapting debian/bin/kafka's server-stop command to change from bin/kafka-server-stop.sh introduced in kafka-1031. [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/82421 (owner: Ottomata) [15:41:04] hashar: #define ZEND_MM_CHECK_TREE(block) \ [15:41:04] if (UNEXPECTED(*((block)->parent) != (block))) { \ [15:41:04] zend_mm_panic("zend_mm_heap corrupted"); \ [15:41:04] } [15:41:13] sooo something unexpected happens ?? :P [15:41:23] that looks like chinese to me :( [15:41:51] I am waiting for the stacktrace then will write down the core file [15:42:07] i get the feeling that the if clause has the problem though [15:42:34] cause zend_mm_panic looks like a function and it would show up at the bt [15:49:30] (gdb) phpbt [15:49:31] No symbol "execute_data" in current context.
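(For the record, the harness being described — gdb plus the helper macros shipped with PHP, per the GDB_with_PHP wikitech page linked above — looks roughly like this. A sketch only: it assumes the php5-dbg package is installed and a PHP source tree is available, and the paths are illustrative:

    # run the segfaulting phpunit invocation under gdb
    gdb --args php tests/phpunit/phpunit.php --exclude-group Dump,Broken,ParserFuzz,Stub
    # inside gdb, load PHP's macros before running:
    (gdb) source /usr/src/php5-5.3.10/.gdbinit
    (gdb) run
    # after the crash, grab both the C-level and the Zend-level traces
    (gdb) bt
    (gdb) zbacktrace

The No symbol "execute_data" error just above is gdb's standard complaint when a macro references a variable that is not in scope in the current frame — consistent with a crash inside the Zend memory manager rather than in a PHP execution frame.)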
:( [15:49:55] nice :-) [15:50:07] i think we should call a rescue party [15:51:05] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [15:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [15:53:57] (PS1) Cmjohnson: adding rdb1 and 2 to decom list [operations/puppet] - https://gerrit.wikimedia.org/r/82424 [15:55:58] (PS2) QChris: Turn on automatic pulling for geowiki repository [operations/puppet] - https://gerrit.wikimedia.org/r/82409 [15:55:59] (PS2) QChris: Split off geowiki cron job into separate class [operations/puppet] - https://gerrit.wikimedia.org/r/82410 [15:56:00] (PS2) QChris: Extract geowiki's research MySQL config into separate class [operations/puppet] - https://gerrit.wikimedia.org/r/82411 [15:56:01] (PS2) QChris: Add cronjob to generate and push geowiki's limn files [operations/puppet] - https://gerrit.wikimedia.org/r/82412 [15:56:18] akosiaris: it seems to happen just before PHP ends its execution [15:56:49] I have pasted the traces in the bug report https://bugzilla.wikimedia.org/show_bug.cgi?id=43972 :-] [15:56:53] now is conf call again [15:57:45] PROBLEM - DPKG on analytics1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:59:15] (CR) QChris: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/82410 (owner: QChris) [16:09:58] (CR) Cmjohnson: [C: 2 V: 2] adding rdb1 and 2 to decom list [operations/puppet] - https://gerrit.wikimedia.org/r/82424 (owner: Cmjohnson) [16:10:32] (CR) Chad: [C: 1] Specify push configurations for gerrit's replication in lists [operations/puppet] - https://gerrit.wikimedia.org/r/82231 (owner: QChris) [16:10:43] <^d> Someone mind poking that ^? [16:12:12] gotcha... [16:14:26] qchris: just checking [16:14:44] by making $replication_basic_push_refs an Array [16:14:53] ottomata: cool. Thanks. [16:14:58] did you change replication.config.erb to make sure it is rendered properly? [16:15:05] Array(params).sort.each { |value| [16:15:06] %> <%= option %> = <%= value %> [16:15:12] i guess params will be that array [16:15:16] It should already be in there. ... [16:15:18] Yes. [16:15:20] does that make an array of an array? [16:15:30] Look at the url parameters. That is already a list for some entries. [16:16:07] ottomata: I have no clue how that works :-) However, it works for the 'url' settings. [16:16:27] don't see any lists for url, which remote? [16:16:28] I tried to verify locally by expanderb, but that failed for me. [16:17:18] ottomata: jenkins-slaves [16:18:16] ottomata: line 45 in the old file. line 50 in the new file. [16:19:29] huh! i just checked Array(params) in ruby [16:19:34] it does not wrap the array! [16:19:38] just returns the array you give it [16:19:49] >> Array(1) [16:19:49] ottomata: (object) [undefined] [16:19:49] => [1] [16:19:50] or [16:20:03] >> Array([1,2,3]) [16:20:03] => [1, 2, 3] [16:20:03] ottomata: (object) [[1, 2, 3]] [16:20:15] what! who is exmabot! [16:20:43] that's the brother of skynet [16:20:47] what?! [16:20:49] apergos: ms-be1 is still reporting errors on sdh1. the disk was swapped last week [16:20:55] >> val nonya = "hithere"; [16:20:55] ottomata: SyntaxError: Unexpected identifier [16:21:00] an ecmascript(javascript) bot ...
I have no idea what it is used for though [16:21:03] >> nonya = "hithere"; [16:21:03] ottomata: (string) 'hithere' [16:21:06] bah! [16:21:07] haha [16:21:22] hmm I'll look at that [16:21:24] cmjohnson1: [16:21:52] (PS2) Ottomata: Specify push configurations for gerrit's replication in lists [operations/puppet] - https://gerrit.wikimedia.org/r/82231 (owner: QChris) [16:22:00] (CR) Ottomata: [C: 2 V: 2] Specify push configurations for gerrit's replication in lists [operations/puppet] - https://gerrit.wikimedia.org/r/82231 (owner: QChris) [16:22:20] Thanks, ottomata :-D [16:22:24] qchris: , ^d: merged [16:22:24] yup! [16:22:25] yw [16:23:09] ottomata: while we are at gerrit ... can I bribe you into looking at https://gerrit.wikimedia.org/r/#/c/82044/ as well? [16:23:10] <^d> Thanks! [16:24:01] can do, I'll assume you know that works, here we go! [16:24:05] (PS2) Ottomata: Fix double encoded characters in gitweb -> gitblit forwards [operations/puppet] - https://gerrit.wikimedia.org/r/82044 (owner: QChris) [16:24:10] (CR) Ottomata: [C: 2 V: 2] Fix double encoded characters in gitweb -> gitblit forwards [operations/puppet] - https://gerrit.wikimedia.org/r/82044 (owner: QChris) [16:24:23] ottomata: Awesome! Thanks! [16:26:38] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [16:27:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:16] (CR) Akosiaris: "So I just went through compiling half the catalogs locally and the change does not seem to break anything. That being said..." [operations/puppet] - https://gerrit.wikimedia.org/r/77332 (owner: Andrew Bogott) [16:29:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.810 second response time [16:29:38] PROBLEM - DPKG on analytics1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:30:37] Do we have any page with recommendations for people running random search engines? Background is http://lists.wikimedia.org/pipermail/mediawiki-api/2013-September/003111.html [16:31:50] I guess they're doing it anon and hitting the squid cached versions? [16:31:54] Which should be up to date anyway... [16:32:31] Though [16:32:32] http://en.wikipedia.org/robots.txt [16:32:38] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [16:32:47] I see no API restrictions [16:32:51] place your bets on when we'll block them for overloading us :P [16:35:34] Reedy: API is disallowed in robots.txt by the "Disallow: /w/" rule for "*". [16:36:45] duh [16:37:31] it seems to me that crawling us via API is dangerous. first, because while in theory some APIs are cached, hit ratio would be very low. second, because API encourages the "let me request 500 pages at once" approach [16:39:25] (PS1) Mark Bergsma: Fix PROXY bug [operations/debs/varnish] (patches/proxy-support) - https://gerrit.wikimedia.org/r/82426 [16:39:51] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [16:42:51] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [16:42:53] greg-g: did I just screw up about the checking? [16:42:55] *checkin [16:43:32] I thought for sure we set it for wednesday [16:46:40] (PS5) Andrew Bogott: Move base class and subclasses into a 'base' module.
[16:39:25] (03PS1) 10Mark Bergsma: Fix PROXY bug [operations/debs/varnish] (patches/proxy-support) - 10https://gerrit.wikimedia.org/r/82426 [16:39:51] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [16:42:51] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [16:42:53] greg-g: did I just screw up about the checking? [16:42:55] *checkin [16:43:32] I thought for sure we set it for wednesday [16:46:40] (03PS5) 10Andrew Bogott: Move base class and subclasses into a 'base' module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77332 [16:47:17] ^ don't review that one yet, another version is upcoming [16:47:25] <^d> qchris: Replication plugin reloaded. [16:48:43] (03PS1) 10Mark Bergsma: Add --write-proxy2 configuration option [operations/debs/stud] (wikimedia) - 10https://gerrit.wikimedia.org/r/82427 [16:48:44] (03PS1) 10Mark Bergsma: Add PROXY protocol v2 send code [operations/debs/stud] (wikimedia) - 10https://gerrit.wikimedia.org/r/82428 [16:48:54] ^d: queue is still short. So I assume we'll have to wait for a minute to see the replication restart. [16:49:16] <^d> I didn't start a replication of everything. [16:49:28] ^d: Ok :-) [16:49:44] <^d> I'll do that now [16:50:23] apergos: there's two this week, sorry. Today was the RelEng/QA team meeting (which I would love if you could make in the future, weekly to start out with, going down to biweekly after we get a rhythm), tomorrow is a one-off with you and hashar and I so I can get a better understanding of what you two think we should do/can learn from/etc re Beta Cluster [16:50:37] ok the tomorrow one is the one then [16:51:00] I got that one in my calendar :-] [16:51:01] I was looking at today's and had no idea what the deal was [16:51:14] was more general [16:51:22] related to browser tests / jenkins / git deploy and so on [16:51:31] while tomorrow would be dedicated to beta I guess [16:51:41] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:51:45] one hot topic is bringing up better monitoring to beta [16:51:51] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [16:52:51] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [16:53:30] what hashar said :) [16:53:41] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [16:53:56] ok sounds interesting, maybe we can chat about that tomorrow a little [16:54:14] ok gotta run, back in a while [16:55:09] greg-g: sorry if I was not that much talking during the meeting [16:55:13] the internet connection there is crap [16:55:27] hashar: no worries [16:55:32] I should do the meeting at home but the time does not play nice hehe [16:55:37] :D [16:55:51] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [16:55:51] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [16:55:51] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [16:55:54] was for sure nice to see the whole qa team in the same meeting [16:56:11] still have to heavily use the qa mailing list [16:56:41] PROBLEM - RAID on analytics1012 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:56:59] ^d: https://github.com/wikimedia/mediawiki-core/releases [16:57:06] ^d: Tags show up there again \o/ [16:58:11] \O/ [16:58:51] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [16:58:51] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [16:59:41] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:00:09] I am off see you later [17:00:31] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [17:01:19] anomie: do you have OTRS? [17:01:28] jeremyb: No [17:01:40] i just saw you above now.
idk why he mailed that list :( [17:01:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:51] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [17:01:51] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [17:02:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [17:02:50] anomie: i pinged you elsewhere [17:04:21] PROBLEM - Disk space on analytics1003 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 11250 MB (3% inode=99%): [17:04:40] mark paravoid we urgently need https://gerrit.wikimedia.org/r/#/c/81892/ [17:04:51] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [17:05:18] first minimize that ACL [17:05:31] PROBLEM - Disk space on analytics1004 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 11232 MB (3% inode=99%): [17:05:31] PROBLEM - DPKG on ms-be1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:05:37] mark, that code went through the optimizer [17:05:58] let me double check just in case, but my script already does that [17:06:03] yurik: i see a few places you can combine, 41.63.128.0/24 for example [17:06:28] actually looks like you can collapse that into a larger chunk as well in the 41.63 block [17:06:59] it's not completely equivalent [17:07:08] LeslieCarr, it looks that way, but it's not collapsible :( [17:07:16] i use python's lib for that [17:07:32] it's basically the /20 , minus 41.63.128.0 [17:07:33] looks like it only excludes .0? [17:07:36] yeah that's stupid [17:07:44] and .143.255 [17:07:50] mark: what happened with netmapper then? [17:07:59] er [17:08:01] yurik: I meant [17:08:02] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [17:08:02] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [17:08:24] and the homepage stuff [17:08:40] these have been dragging on for quite a while now, despite being almost ready [17:09:02] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [17:09:02] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [17:09:16] yurik: need any help figuring out the collapsed bits or do you have it ? [17:09:21] and you made ops write that vmod, so we expect to be paid off in not having to do these reviews anymore :) [17:10:09] LeslieCarr, mark, i will consolidate them (as i doubt anyone else will use "0") but could we merge it right thereafter? Apparently we are doing a live test with them tomorrow [17:10:30] dan said he will contact the carrier to sort it out too [17:10:48] just fix it now? [17:12:27] mark, yes, fixing them right now [17:12:30] (03CR) 10Mark Bergsma: [C: 04-1] "That ACL can be minimized further; it seems to leave out just .0 and .255 on some /24 prefixes for no apparent reason other than the fact " [operations/puppet] - 10https://gerrit.wikimedia.org/r/81892 (owner: 10Yurik) [17:13:02] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [17:18:19] is it just me or gitblit search is totally broken? try "admins" @ https://git.wikimedia.org/summary/operations%2Fpuppet.git [17:18:53] also, fwiw, I think it's never worked as well for me as gitweb's search did.
maybe i gave up too early [17:19:24] <^d> So it pretty much sucks :\ [17:22:02] (03PS1) 10Ottomata: Installing java openjdk 7 on analytics nodes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/82430 [17:22:02] at least for browsing it is very pretty [17:22:35] (03PS2) 10Ottomata: Installing java openjdk 7 on analytics nodes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/82430 [17:22:38] LeslieCarr++ [17:25:42] (03PS1) 10RobH: removing netmon1001 for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/82431 [17:27:21] (03CR) 10RobH: [C: 032] removing netmon1001 for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/82431 (owner: 10RobH) [17:30:42] PROBLEM - DPKG on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:32:58] (03PS2) 10Yurik: Added Orange Madagascar carrier 646-02 [operations/puppet] - 10https://gerrit.wikimedia.org/r/81892 [17:33:02] PROBLEM - Host netmon1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:33:52] (03PS3) 10Chad: Switch Gerrit from manganese to ytterbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/81374 [17:34:24] (03CR) 10Lcarr: [C: 04-1] "you still have some more collapsing to do" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81892 (owner: 10Yurik) [17:34:37] LeslieCarr, which one? [17:35:03] at first glance, 41.63.143.0/24 [17:35:17] which can then be combined with .142.0/24 to make a /23 [17:35:36] and i think the .136.0 through 143.255 can probably make a /21 [17:35:38] (03PS4) 10Chad: Switch Gerrit from manganese to ytterbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/81374 [17:35:40] have to double check that [17:36:36] yeah, it does make a /21 [17:36:39] will be funny if the carrier comes back to us and tells us that those IPs are actually someone else and they are just forwarding through them or some other crazy stuff :) [17:37:23] yurik: i can double check the ip blocks if you'd like to make sure they're all originating from the same AS … a quick whois on the ip should verify that the addresses are all allocated to orange [17:37:35] yurik, dunno if this is helpful for you, but there is a nice little utility installed on stat1 called cidrcalc [17:37:43] it won't collapse things automatically for you [17:38:16] (03PS5) 10Chad: Switch Gerrit from manganese to ytterbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/81374 [17:38:20] but it will show ip ranges of different cidrs and more [17:38:22] ottomata, sure, what's the link? Although all those IPs are technically already optimized - they are just missing one or two, making the block incomplete [17:38:24] i've used that for collapsing some when I was reviewing dan and amit's changes to that partner IP ranges page [17:38:31] uh, it's a CLI [17:38:47] <^d> https://gerrit.wikimedia.org/r/#/c/81374/5/templates/gerrit/gerrit.config.erb that makes me so happy.
[17:38:48] (03PS1) 10Jgreen: fix 'already-filtered' test in OTRS spam export script [operations/puppet] - 10https://gerrit.wikimedia.org/r/82433 [17:39:10] otto@stat1:~$ cidrcalc 41.63.143.0/24 [17:39:10] Address: 41.63.143.0 [17:39:10] Netmask: 255.255.255.0 (/24) [17:39:10] Wildcard: 0.0.0.255 [17:39:10] Network: 41.63.143.0/24 [17:39:10] Broadcast: 41.63.143.255 [17:39:11] Hosts: 41.63.143.1 - 41.63.143.254 [17:39:11] NumHosts: 254 [17:39:50] ottomata, thanks, i think i saw something similar online [17:41:21] (03CR) 10Jgreen: [C: 032 V: 031] fix 'already-filtered' test in OTRS spam export script [operations/puppet] - 10https://gerrit.wikimedia.org/r/82433 (owner: 10Jgreen) [17:44:07] (03PS3) 10Yurik: Added Orange Madagascar carrier 646-02 [operations/puppet] - 10https://gerrit.wikimedia.org/r/81892 [17:44:08] LeslieCarr, collapsed to a /20 [17:45:28] i really want to know what they are using 197.159.159.128/25 for [17:45:39] because that stupid block breaks up a beautiful /20 [17:45:59] you're sure that 197.159.159.128-254 weren't in the list ? [17:50:07] LeslieCarr, all ranges were received by Dan, who simply copied them into https://meta.wikimedia.org/wiki/Zero:646-02 [17:50:17] we will send them a confirmation email after the test [17:50:30] but we need to get this stuff live before tomorrow's test [17:50:36] ok [17:50:43] though oi /32 [17:50:59] LeslieCarr, ?? [17:51:26] just strange ranges [17:51:45] i think so too :) No idea what they are smoking :) [17:52:09] (03CR) 10Lcarr: [C: 032] "collapsed as much as possible and yurik is going to double check that the 197.159.159.128/25 is really not being used and therefore can't " [operations/puppet] - 10https://gerrit.wikimedia.org/r/81892 (owner: 10Yurik) [17:52:34] merged [17:59:53] !updated Parsoid to 859a701 [17:59:59] !log updated Parsoid to 859a701 [18:00:04] ;) [18:00:05] Logged the message, Master [18:00:30] syncing.. [18:01:18] (03PS1) 10Ryan Lane: Prepare for gerrit migration [operations/dns] - 10https://gerrit.wikimedia.org/r/82435 [18:01:48] Ryan_Lane, git-deploy just crashed: http://paste.debian.net/34480/ [18:02:14] o.O [18:02:23] gwicke: did you guys add any new hosts? [18:02:33] I didn't [18:02:56] the pending minions were all old wtp10** hosts [18:03:19] let me sync the modules and such to all of them [18:03:26] maybe there's a new one [18:03:39] oh, wait [18:03:43] this is on tin [18:03:47] yup [18:04:40] ugh [18:04:45] that should be self.fetch() [18:04:55] and self.checkout(force) [18:05:16] haven't deployed in a while? :) [18:05:44] last on Thursday I think [18:05:50] huh [18:05:50] weird [18:06:05] oh. did you tell it to re-fetch? [18:06:17] it's weird, because this should have broken ages ago [18:06:52] I told it to retry fetching after getting the list of pending minions [18:07:04] first 'd' for detail, then 'r' for retry [18:07:13] ah [18:07:15] yeah [18:07:17] that's why [18:07:22] ok. I'm pushing in a fix [18:07:36] k [18:10:48] (03PS1) 10Ryan Lane: Call fetch and deploy via the object [operations/puppet] - 10https://gerrit.wikimedia.org/r/82436 [18:11:14] (03CR) 10Ryan Lane: [C: 032] Call fetch and deploy via the object [operations/puppet] - 10https://gerrit.wikimedia.org/r/82436 (owner: 10Ryan Lane) [18:11:47] LeslieCarr, thanks for all your help! I just sent a confirmation email to dan so he can contact the carriers. Also, I will work later today on adding two more carriers to the list. 
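yurik says "i use python's lib for that" without naming the library; the collapsing LeslieCarr walks through above can be reproduced with the stdlib ipaddress module (Python 3.3+). A sketch of the two merges from the discussion, not the script actually used:

```python
import ipaddress

# .142.0/24 + .143.0/24 -> a /23, as LeslieCarr notes:
pair = [ipaddress.ip_network(n) for n in ("41.63.142.0/24", "41.63.143.0/24")]
print(list(ipaddress.collapse_addresses(pair)))
# [IPv4Network('41.63.142.0/23')]

# .136.0 through .143.255, i.e. eight consecutive /24s -> a /21:
run = [ipaddress.ip_network("41.63.%d.0/24" % i) for i in range(136, 144)]
print(list(ipaddress.collapse_addresses(run)))
# [IPv4Network('41.63.136.0/21')]
```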
[18:12:03] cool [18:13:28] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:43] so google hangout is completely not working for me [18:18:54] mark: boo [18:19:02] :( [18:19:19] gwicke: should be fixed [18:20:00] Ryan_Lane: the git-deploy state is now hosed, what is the best way to clean that up? [18:20:28] gwicke: git deploy abort [18:20:54] ah, ok [18:21:13] easy enough [18:28:48] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.009 second response time [18:28:48] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [18:28:48] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [18:29:08] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.011 second response time [18:29:08] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [18:29:18] RECOVERY - Parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.002 second response time [18:29:18] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [18:29:18] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [18:29:18] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [18:29:28] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [18:29:29] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [18:30:11] a lot of parsoids were out of disk space [18:30:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:52] the current logging and packaging is rather crappy - no log rotation etc [18:31:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.615 second response time [18:40:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [18:53:17] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: special, wikimedia, closed and private wikis to 1.22wmf15 [18:53:23] Logged the message, Master [19:00:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:02:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.275 second response time [19:04:06] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything non 'pedia to 1.22wmf15 [19:04:12] Logged the message, Master [19:04:23] (03PS1) 10Reedy: Everything non \'pedia to 1.22wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82446 [19:04:39] (03CR) 10Reedy: [C: 032] Everything non \'pedia to 1.22wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82446 (owner: 10Reedy) [19:04:49] (03Merged) 10jenkins-bot: Everything non \'pedia to 1.22wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82446 (owner: 10Reedy) [19:04:52] marktraceur: that's uploadwizard deployed to commons ^ [19:04:58] (in case shit breaks) [19:06:18] Wooo [19:06:25] * marktraceur waits [19:06:44] Someone was
complaining copy upload was broken in UW or something [19:07:37] Copying metadata that's defined by campaigns is broken [19:24:46] (03PS1) 10Jgreen: OTRS spam exporter can optionally close tickets after export [operations/puppet] - 10https://gerrit.wikimedia.org/r/82448 [19:25:59] (03CR) 10Jgreen: [C: 032 V: 031] OTRS spam exporter can optionally close tickets after export [operations/puppet] - 10https://gerrit.wikimedia.org/r/82448 (owner: 10Jgreen) [19:37:00] !log reedy synchronized php-1.22wmf15/extensions/UploadWizard/resources 'touch' [19:37:06] Logged the message, Master [19:45:50] !log reedy synchronized wmf-config/ 'touch' [19:45:56] Logged the message, Master [19:46:08] ty Reedy [19:47:10] PROBLEM - DPKG on ms-be1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:49:50] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [19:49:50] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [20:02:50] chase [20:03:12] meant as ctrl+f sorry [20:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:17] hey all; dumps.wm.o is giving 403 forbidden for everything [20:24:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [20:26:02] fyi: http://dumps.wikimedia.org/other/pagecounts-raw/2013 -> 403 [20:26:48] mwalker, several threads? [20:27:40] two connections from same ip max [20:28:01] and it's not giving 403 for everything, as I just retrieved something from there [20:28:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.155 second response time [20:29:31] (03PS1) 10RobH: RT#5704 bastion4001 ip [operations/dns] - 10https://gerrit.wikimedia.org/r/82456 [20:29:37] office needs to not have one ip someday, that's a bit of a problem... (for blocks too) [20:29:50] apergos, and since everyone in the office is so much more likely to download something.... [20:29:58] yup [20:30:38] (03CR) 10RobH: [C: 032] RT#5704 bastion4001 ip [operations/dns] - 10https://gerrit.wikimedia.org/r/82456 (owner: 10RobH) [20:32:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:36] mw tarballs will eventually be on another host I think [20:32:49] (03PS1) 10Yurik: Added 470-03 (Banglalink) and 416-03 (Umniah Jordan) zero carriers [operations/puppet] - 10https://gerrit.wikimedia.org/r/82457 [20:32:52] some of this stuff can go in swift, no? [20:33:05] LeslieCarr, if you have a sec ^^ - this one should be easier [20:33:18] we're getting a lot of zero carriers [20:33:19] no CIDR for multiples [20:33:20] which is awesome :) [20:33:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [20:33:40] LeslieCarr, yes, but fragmentation is increasing - really need to get the non-fragmented ESI working [20:33:45] yeah [20:33:59] (03CR) 10Lcarr: [C: 032] Added 470-03 (Banglalink) and 416-03 (Umniah Jordan) zero carriers [operations/puppet] - 10https://gerrit.wikimedia.org/r/82457 (owner: 10Yurik) [20:34:14] LeslieCarr, thanks! [20:51:22] are we having issues on meta?
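apergos's "two connections from same ip max" explains the dumps.wikimedia.org 403s above: a parallel fetch from the office's single IP trips the per-IP connection cap. A sketch of staying under that ceiling with a two-worker pool — the directory is the one from the report, but the file names are made up:

```python
import concurrent.futures
import shutil
import urllib.request

BASE = "http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-09/"
FILES = ["example-a.gz", "example-b.gz", "example-c.gz"]  # hypothetical names

def fetch(name):
    """Download one file, streaming it to disk."""
    with urllib.request.urlopen(BASE + name) as resp, open(name, "wb") as out:
        shutil.copyfileobj(resp, out)
    return name

# max_workers=2 keeps concurrent connections at the stated per-IP limit.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    for done in pool.map(fetch, FILES):
        print("fetched", done)
```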
[20:51:49] (03PS1) 10RobH: RT#5704 adding ulsfo subnet to netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/82465 [20:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:53:02] Getting a time out every time I do a translation mark up :-/ (it gets done it seems … but the time out is a bit scary) [20:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [20:53:37] Jamesofur: is this related to CentralNotice? [20:53:42] nope [20:53:45] this is translation extension [20:53:53] though you haven't responded to my email :) [20:56:55] Jamesofur: more info? is it your browser's timeout page or some other? [20:57:17] squid, nginx, apache? [20:57:21] nope it's the mediawiki time out page, I'm going to do another markup in a minute I'll see if it happens again and post it in here [20:57:23] squid iirc [20:57:40] at first i wanted to include LVS in that list but i resisted the urge :) [20:58:23] Jamesofur: try to get headers too maybe [20:58:32] also... what languages do you speak?? [20:59:10] fr badly [20:59:23] and romance languages even worse but can spot things when needed sometimes [20:59:29] *other romance [21:00:15] (03CR) 10RobH: [C: 04-1] "i need to add the dhcpd changes to this patchset" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82465 (owner: 10RobH) [21:01:57] just for spite purposes that one took forever but actually did complete [21:02:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:03:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [21:04:47] (03PS2) 10RobH: RT#5704 adding ulsfo subnet to netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/82465 [21:06:41] * Nemo_bis for a moment thought the channel had been filled with romance [21:07:19] lol [21:08:32] Request: POST http://meta.wikimedia.org/wiki/Special:PageTranslation, from 208.80.154.77 via cp1018.eqiad.wmnet (squid/2.7.STABLE9) to 10.64.0.138 (10.64.0.138) [21:08:33] Error: ERR_READ_TIMEOUT, errno [No Error] at Tue, 03 Sep 2013 21:07:31 GMT [21:09:56] it appears to be having me mark it twice ... [21:10:18] * Jamesofur will try from firefox next time [21:10:49] (03CR) 10Ryan Lane: [C: 032] Prepare for gerrit migration [operations/dns] - 10https://gerrit.wikimedia.org/r/82435 (owner: 10Ryan Lane) [21:15:02] heh, midaired greg-g [21:15:08] hah :) [21:15:11] comment 8, comment 8! [21:16:00] fiiine :P [21:16:09] whatever, just pinging the people who should fix it (Ops) :) [21:16:56] jeremyb: want headers etc? Had same issue on FF [21:17:05] * Jamesofur grumbles about how it's always at the worst time [21:17:30] Jamesofur: well you had via in paste above. is there a pattern? is it always cp1018? [21:17:57] nope, latest is 1011 [21:18:06] and I think it was another before that [21:18:24] and 10.64.0.138 ? [21:18:33] meta itself was taking a long time to load too. both of these are .138 yes
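The 10.64.0.138 backend named in those squid error pages can be identified the way the channel does just below, via reverse DNS. A sketch; it assumes you are resolving against the cluster's internal resolver, where (per the log) the answer is cp1016:

```python
import socket

# Reverse-resolve the backend IP from the ERR_READ_TIMEOUT error page.
hostname, _aliases, _addresses = socket.gethostbyaddr("10.64.0.138")
print(hostname)  # cp1016 plus the internal domain, per the discussion below
```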
[21:18:43] can't remember last one [21:18:58] 10.64.0.138 is cp1016 [21:19:11] https://meta.wikimedia.org/wiki/Subpoena_FAQ [21:19:13] oops [21:19:18] Request: POST http://meta.wikimedia.org/wiki/Special:PageTranslation, from 208.80.154.77 via cp1011.eqiad.wmnet (squid/2.7.STABLE9) to 10.64.0.138 (10.64.0.138) [21:19:28] we should have more complaints if this were a serious/widespread thing [21:19:30] Error: ERR_READ_TIMEOUT, errno [No Error] at Tue, 03 Sep 2013 21:15:54 GMT [21:19:53] well if it's most affected by things like translation markup it isn't a huge customer base :) [21:20:03] though meta itself was not the fastest thing in the world to load [21:20:09] the other sites are fine though [21:20:21] i think there's no meta in watchmouse?? [21:20:43] don't think so, didn't see it when I checked [21:21:10] hmm, is ruWiki on the same cluster? [21:21:32] on enWiki the script I pull from ru is taking ages to import [21:23:26] hmmm no it's not [21:32:25] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: No successful Puppet run in the last 10 hours [21:32:35] (03PS3) 10RobH: RT#5704 adding ulsfo subnet to netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/82465 [21:33:05] (03PS6) 10Andrew Bogott: Move base class and subclasses into a 'base' module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77332 [21:33:25] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: No successful Puppet run in the last 10 hours [21:37:06] (03CR) 10RobH: [C: 032] RT#5704 adding ulsfo subnet to netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/82465 (owner: 10RobH) [21:37:25] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [21:37:25] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: No successful Puppet run in the last 10 hours [21:38:01] urgh [21:38:08] Who made varnish changes and didn't merge on sockpuppet? [21:38:47] oh RobH that was me [21:38:47] sorry [21:38:50] they are ok to merge [21:38:55] i know i just tracked it down ;] [21:39:00] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: No successful Puppet run in the last 10 hours [21:39:36] we could use an IRC bot that has these message features where you leave a message for some nick who is offline and when that person joins later the bot tells them [21:39:53] e.g.
TCL scripts for eggdrop [21:40:00] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: No successful Puppet run in the last 10 hours [21:40:00] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: No successful Puppet run in the last 10 hours [21:41:00] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: No successful Puppet run in the last 10 hours [21:41:00] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: No successful Puppet run in the last 10 hours [21:41:30] memoserv, mutante [21:42:15] apergos: exactly that, yea [21:42:23] want [21:42:30] ok but [21:42:37] if we have memoserv why do we need a bot for it [21:42:48] Ryan_Lane: https://bugzilla.wikimedia.org/show_bug.cgi?id=53723 [21:43:06] apergos: eh, that's just cause i thought Efnet where we didn't have it as service [21:43:15] ah [21:43:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:43:28] yeah freenode does, very handy [21:43:35] apergos: i see /query memoserv ..help etc. gotcha [21:43:41] yup [21:44:00] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: No successful Puppet run in the last 10 hours [21:46:00] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: No successful Puppet run in the last 10 hours [21:46:58] (03PS1) 10Cmjohnson: removing caesium from site.pp netboot and adding to decom list [operations/puppet] - 10https://gerrit.wikimedia.org/r/82531 [21:47:00] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: No successful Puppet run in the last 10 hours [21:48:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [21:49:00] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: No successful Puppet run in the last 10 hours [21:52:00] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: No successful Puppet run in the last 10 hours [21:52:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:52:22] paravoid, around? [21:52:28] kinda [21:52:38] just wanted to get your opinion on something [21:52:41] I just got back, checking email and irc before I hit the bed [21:53:03] i'll ask anyway, if you want to punt and talk about it some other day that's fine [21:53:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [21:53:37] okay :) [21:53:42] robh: plz check https://gerrit.wikimedia.org/r/82531 [21:53:43] so the ganglia backend that's available for etsy's statsd sucks, and i rewrote it, it's about 200-300 lines of js [21:53:52] where should that go? i'm tempted to put it in puppet [21:54:14] heh [21:54:14] i think it could properly go to our git repo and be built into the package itself, but i'd want to test it in prod for a while first [21:54:18] is that v8? [21:54:41] it's node, which uses the v8 js engine, yes [21:54:51] not e.g. rhino [21:54:57] nor spidermonkey [21:55:03] so, I think statsd upstream should probably include this but I'm not sure if that will fly with them [21:55:16] they explicitly said no when the dude who wrote the previous backend submitted it [21:55:21] if it doesn't, I think a separate repo would be best, mirrored under a nice github name [21:55:29] /wikimedia/statsd-ganglia or whatever [21:55:36] + git-deploy? [21:55:43] I wouldn't mind puppet at all, I'm just saying it might be generally useful [21:55:50] yeah [21:55:53] puppet is easier, sure [21:55:59] i'll do both [21:56:28] puppet for just copying the file into place, repo for the benefit of others [21:56:48] alright, thanks! [21:57:00] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: No successful Puppet run in the last 10 hours [21:57:14] ugh, idk about 2 places! [21:58:00] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: No successful Puppet run in the last 10 hours [21:58:16] would have to be very clear and strict about not directly changing the puppet copy [21:58:23] everything going through the other repo [21:59:10] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000 [22:00:26] sure, i'll put a big scary comment header to that effect [22:02:20] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:04:00] ori-l: maybe we can enforce it with hooks too. if it commits to both the derived and canonical parts of puppet repo in one commit then reject. if it commits to derived parts without explicitly saying in the commit msg that it's just copying from the other repo then reject [22:04:11] * jeremyb still prefers having only one copy though
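jeremyb's two reject rules could be enforced server-side. A rough update-hook sketch under heavy assumptions: the path prefixes and the "[sync]" commit-message marker are invented, and a real hook would also special-case branch creation (oldrev all zeros); only the two rules themselves come from the discussion above:

```python
#!/usr/bin/env python
"""Sketch of jeremyb's policy as a git update hook (hypothetical paths)."""
import subprocess
import sys

CANONICAL = "modules/statsd/canonical/"  # assumed location of the canonical copy
DERIVED = "files/statsd/"                # assumed location of the derived copy

def git(*args):
    return subprocess.check_output(("git",) + args).decode()

refname, oldrev, newrev = sys.argv[1:4]
for sha in git("rev-list", "%s..%s" % (oldrev, newrev)).split():
    changed = git("diff-tree", "--no-commit-id", "--name-only", "-r", sha).splitlines()
    message = git("log", "-1", "--format=%B", sha)
    touches_canonical = any(f.startswith(CANONICAL) for f in changed)
    touches_derived = any(f.startswith(DERIVED) for f in changed)
    # Rule 1: one commit must not touch both copies.
    if touches_canonical and touches_derived:
        sys.exit("%s: touches both the canonical and the derived copy" % sha)
    # Rule 2: derived-copy commits must declare themselves syncs.
    if touches_derived and "[sync]" not in message:
        sys.exit("%s: derived copy changed without a [sync] marker" % sha)
```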
[22:07:54] <^d> awight: Ping. [22:09:12] (03PS1) 10Andrew Bogott: Create an initial, empty Packages.gz. [operations/puppet] - 10https://gerrit.wikimedia.org/r/82532 [22:10:59] mutante: can you check this for me https://gerrit.wikimedia.org/r/82531 [22:11:06] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [22:15:42] (03PS1) 10Mattflaschen: Add GuidedTour to cawiki, hewiki, mswiki, and ukwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82533 [22:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:35] Ryan_Lane: was there discussion somewhere about apache vs. varnish vs. mediawiki for protorel redirects? [22:22:46] (03CR) 10Mattflaschen: "Should be merged and deployed during E3's Thursday deploy." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82533 (owner: 10Mattflaschen) [22:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [22:23:27] jeremyb: "in my head" [22:28:44] ^d: hey, thanks for looking at the git issues [22:28:58] ^d: blasting the deployment branch is fine [22:29:21] <^d> Sweet. So what I'm actually gonna do is just blast the history. That way you'll keep what's there, minus the bit we want out. [22:29:26] <^d> With less rebasing hell for me :) [22:29:35] (03CR) 10Dzahn: "you're also making changes to loudon,mc1,zinc. is that intended?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/82531 (owner: 10Cmjohnson) [22:29:41] are we losing gerrit history, though? [22:30:08] cmjohnson1: looks good, +2 [22:30:11] ^d: whaddyu mean, "blast the history"? [22:30:12] (03PS2) 10Mattflaschen: Add GuidedTour to additional languages [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/82533 [22:30:24] ahh, needs rebase [22:30:29] (03PS2) 10RobH: removing caesium from site.pp netboot and adding to decom list [operations/puppet] - 10https://gerrit.wikimedia.org/r/82531 (owner: 10Cmjohnson) [22:30:31] ^d: blasting gerrit history is sort of OK [22:30:36] <^d> awight: I mean remove the history from the deployment branch. Luckily none of that's ever gone through gerrit.
[22:30:39] ^d: losing git history not ok [22:30:42] <^d> Master is ok. [22:30:43] ^d: ah, fine [22:30:49] <^d> Git history for master will be there. [22:31:01] <^d> Gerrit history will not. I've not found a way to deal with it safely :\ [22:31:09] yeah i was thinking that would be the case [22:33:04] siebrand Nikerabbit if either of you are around. I know this may be ops related but haven't figured out how yet. I seem unable to mark translations on meta today without a time out in the end (the translation DOES get marked, the query just times out). Any thoughts? [22:34:39] jeremyb: in ops meetings [22:40:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:41:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [22:43:55] (03CR) 10Cmjohnson: [C: 032 V: 032] removing caesium from site.pp netboot and adding to decom list [operations/puppet] - 10https://gerrit.wikimedia.org/r/82531 (owner: 10Cmjohnson) [22:45:44] !log upgrading wikitech to 1.22wmf15 [22:45:50] Logged the message, Master [22:52:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [22:55:58] <^d> awight: Repo in place in gerrit, will need to re-clone. I'll be sending an e-mail to everyone on that list. [22:56:07] <^d> Also created you a new "deployment" branch based off master. [23:07:15] ^d: rad, thanks for taking the time for this [23:07:44] Ryan_Lane: hrmmmm, well i can't read it out of greg-g's head. and i guess there's no transcript of the meeting :/ [23:08:04] it's basically that we should point those locations at mediawiki [23:08:09] and mediawiki should return redirects [23:08:20] rather than apache or varnish doing it [23:13:25] andrewbogott: hey, i totally missed the sysctl merge/deploy from yesterday. thanks for that! [23:14:24] Ryan_Lane: right but i'm missing the reasoning behind that. also, that's not an existing feature in mediawiki right? so it takes time to add it? [23:14:37] it'll need to be added, yes [23:14:40] steal someone from robla or something [23:14:43] jeremyb: ask mark [23:14:49] ok [23:14:54] I don't have any strong opinion on how this is done [23:15:27] * robla growls at jeremyb and gets instinctively possessive :-P [23:15:47] * jeremyb holds up a mirror @ robla [23:16:45] jeremyb: mirrors aren't effective against rabid managers [23:17:13] never had a rabies vaccine [23:17:14] <^d> Ryan_Lane: We should be all set for tomorrow on my end. ytterbium is all ready, changes in puppet pushed to gerrit for merging. [23:17:27] <^d> We'll get hyperthreading, openjdk7 and a 20g heap out of the box. [23:17:31] has robla bitten anyone recently? [23:17:33] <^d> Can possibly tweak other things later. 
[23:18:59] ^d: sounds good [23:22:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:23:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [23:27:10] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 10 hours [23:30:10] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [23:40:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:41:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [23:51:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time