[00:17:31] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:52:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [01:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [01:31:27] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: No successful Puppet run in the last 10 hours [01:32:27] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: No successful Puppet run in the last 10 hours [01:36:27] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [01:36:27] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: No successful Puppet run in the last 10 hours [01:38:38] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: No successful Puppet run in the last 10 hours [01:39:38] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: No successful Puppet run in the last 10 hours [01:39:38] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: No successful Puppet run in the last 10 hours [01:40:38] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: No successful Puppet run in the last 10 hours [01:40:38] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: No successful Puppet run in the last 10 hours [01:43:38] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: No successful Puppet run in the last 10 hours [01:45:38] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: No successful Puppet run in the last 10 hours [01:46:38] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: No successful Puppet run in the last 10 hours [01:48:38] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: No successful Puppet run in the last 10 hours [01:51:38] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: No successful Puppet run in the last 10 hours [01:56:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:56:38] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: No successful Puppet run in the last 10 hours [01:57:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [01:57:38] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: No successful Puppet run in the last 10 hours [02:08:04] !log LocalisationUpdate completed (1.22wmf14) at Tue Sep 3 02:08:03 UTC 2013 [02:08:11] Logged the message, Master [02:09:55] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [02:14:16] !log LocalisationUpdate completed (1.22wmf15) at Tue Sep 3 02:14:15 UTC 2013 [02:14:22] Logged the message, Master [02:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [02:25:20] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Sep 3 02:25:20 UTC 2013 [02:25:26] Logged the message, Master [02:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [03:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [03:26:02] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 10 hours [03:29:02] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [03:31:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [03:43:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:44:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [03:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [03:57:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:58:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [04:07:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:11:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [04:31:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [04:40:45] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [04:40:45] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [04:40:45] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [04:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [04:57:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:59:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [05:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [05:31:51] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [05:51:01] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [05:52:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [06:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [06:26:18] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [06:27:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:28:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.165 second response time [06:32:18] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [06:39:18] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [06:42:21] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [06:51:21] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [06:52:21] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [06:55:21] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [06:55:21] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [06:55:21] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [06:58:21] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [06:58:21] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [07:01:21] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [07:01:21] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [07:01:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:02:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.146 second response time [07:04:22] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [07:07:08] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [07:07:08] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [07:08:08] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [07:08:08] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [07:12:08] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [07:21:28] PROBLEM - Disk space on wtp1022 is CRITICAL: DISK CRITICAL - free space: / 339 MB (3% inode=77%): [07:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:22:38] PROBLEM - Disk space on wtp1023 is CRITICAL: DISK CRITICAL - free space: / 335 MB (3% inode=77%): [07:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.149 second response time [07:27:08] PROBLEM - Parsoid on wtp1022 is CRITICAL: Connection refused [07:27:58] PROBLEM - Parsoid on wtp1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:30:08] RECOVERY - Parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [07:30:28] RECOVERY - Disk space on wtp1022 is OK: DISK OK [07:30:38] RECOVERY - Disk space on wtp1023 is OK: DISK OK [07:48:35] akosiaris: good morning :) [07:48:53] hashar: good morning to you too. [07:49:28] I found out an issue with the git version provided by Ubuntu :] [07:49:41] yet another ticket hu [07:50:18] your php packages are fresh out of the oven. I have not yet uploaded them to apt. Would you like to test them first ? [07:50:37] the one with git-http-backend and --references ? [07:50:37] I have no clue how to tests them :( [07:50:50] though I could manually deploy them on beta cluster and on the contint server [07:51:05] yup git-http-backend and --references [07:52:11] maybe recompiling with symbols to get a stacktrace would help but that is beyond my knowledge [07:52:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:55:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [07:57:02] hashar: good morning [07:57:10] paravoid: :-] [07:57:17] it is good to see you guys awake in the morning [07:58:07] akosiaris: I have no clue how we handle PHP upgrades usually [07:58:31] but having them manually deployed on the beta cluster and contint server might catch most of the potential issues [07:58:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:58:36] hashar: https://gerrit.wikimedia.org/r/82264 [07:58:49] hashar: could you help with integrating that to jenkins? [07:58:49] as for the rest of the cluster, might need to be scheduled properly and notify the whole engineering [07:59:03] the way this works is one of those two ways (after you include authdns::lint) [07:59:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [07:59:25] you either call "authdns-lint " or "authdns-lint " [07:59:38] in the first case it'll mktemp a temp directory for you and clean it up at the end [07:59:41] paravoid: iirc the job is already ready and triggered [08:00:12] ah you provide a shell script \O/ [08:00:20] yeah [08:00:24] if it exits 0, all is fine [08:00:36] if not, the stdout/err output should be plenty to debug further [08:01:11] also, where should we include the authdns::lint class? [08:01:53] in the ugly modules/contint/manifests/packages.pp [08:01:59] I still haven't split it [08:02:06] it is applied on all Jenkins slaves [08:02:27] it's not a package :) [08:02:42] oh [08:03:01] it's a class that I'll keep up to date internally within authdns [08:04:07] but I guess it kinda fits there [08:04:16] or a new class [08:06:44] (03PS3) 10Faidon Liambotis: authdns: introduce an authdns::lint class [operations/puppet] - 10https://gerrit.wikimedia.org/r/82264 [08:06:51] paravoid: maybe create a modules/contint/manifests/authdnslint.pp and include that in role::ci::slave [08:06:52] have a look :) [08:06:56] nonono [08:07:17] note that there is already an include for geoip: manifests/packages.pp: include geoip [08:07:32] grumble grumble [08:07:42] that is true [08:07:47] packages.pp is good to me as well [08:09:21] (03PS4) 10Faidon Liambotis: authdns: introduce an authdns::lint class [operations/puppet] - 10https://gerrit.wikimedia.org/r/82264 [08:09:29] include geoip will do for now... [08:09:44] can you have a look? [08:10:09] reading the shell script :) [08:12:14] I deliberately left your wikidata change unmerged so we can dry run this :) [08:12:31] we can always retrigger a job if needed :-] [08:12:43] yeah it was also a reminder to self :) [08:12:57] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [08:14:02] so basically, clone the dns.git in workingdir [08:14:08] yes [08:14:12] in some place [08:14:20] then mkdir -p output; /usr/local/bin/authdns-lint workingdir output [08:14:29] that, or just authdns-lint workingdir [08:14:37] and it'll mktemp something in /tmp for you [08:14:41] and clean it up afterwards [08:14:47] that would clutter the temp directory :-] [08:14:57] it'll clean it up in the end, no huge deal [08:15:35] but I wasn't sure if you want everything to be in SSD or whatever [08:15:46] so I added a third-party provided output dir as well [08:16:51] mkdir -p "$WORKSPACE"/build [08:16:52] /usr/local/bin/authdns-lint "$WORKSPACE" "$WORKSPACE"/build [08:17:06] workspace being the git tree? [08:17:11] yup [08:17:12] git checkout even [08:17:16] that is where Jenkins fetch everything [08:17:20] I guess that could work [08:17:27] PROBLEM - Disk space on wtp1006 is CRITICAL: DISK CRITICAL - free space: / 160 MB (1% inode=77%): [08:17:46] kinda ugly but sure :) [08:18:39] https://gerrit.wikimedia.org/r/82372 :-] [08:18:45] will refresh the job [08:18:51] I guess you can merge in authdns::lint [08:19:23] I refreshed the operations-dns-lint jenkins job, so whenever authdns::lint is applied on gallium we can retrigger the job and see what is happening [08:19:23] wait [08:19:31] will jenkins clean up /build afterwards? [08:19:34] it's important to do so [08:19:38] yup [08:19:46] jenkins delete the $WORKSPACE directory before running the build [08:19:55] the git checkout too? [08:19:58] and fetch again the git repository [08:20:03] yup it is completely wiped [08:20:04] oh god, that's inefficient [08:20:10] fetching the git again?! [08:20:15] that is a bit CPU and I/O intensive but that ensure nothing is left around [08:20:15] (03CR) 10Faidon Liambotis: [C: 032] authdns: introduce an authdns::lint class [operations/puppet] - 10https://gerrit.wikimedia.org/r/82264 (owner: 10Faidon Liambotis) [08:20:17] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused [08:20:28] the git checkout are fast since the repository and the workspace are on the same device ( SSD) [08:20:33] ah [08:20:34] so git does some hardlinks [08:20:38] it doesn't matter if it's sssd [08:20:40] yes, exactly [08:20:45] git does hardlinks for local checkouts [08:20:46] I should make them shallow clone (aka just fetch the current HEAD) [08:21:32] running puppetd on gallium [08:21:40] I tried keeping the workspace and just doing some git clean -whatever but that sometime left stuff behind :( [08:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.128 second response time [08:23:31] one day I will have to look at stafford :D [08:24:07] PROBLEM - Disk space on wtp1021 is CRITICAL: DISK CRITICAL - free space: / 338 MB (3% inode=77%): [08:24:17] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [08:24:27] RECOVERY - Disk space on wtp1006 is OK: DISK OK [08:25:58] so [08:26:02] paravoid: if you get to https://integration.wikimedia.org/ci/job/operations-dns-lint/ [08:26:06] and log in with your labs account [08:26:13] you should be able to rebuild the last job [08:26:17] (or another one) [08:26:29] once logged in, there is a link on the left ' Rebuild Last ) [08:26:50] which takes the parameters passed to the last build and let you press the button to trigger a run of the job [08:27:04] you didn't merge 82372 yet though :) [08:27:04] that does not report back to gerrit but that is a good way to rerun a job, debug what is going on [08:27:10] ah [08:27:16] I have deployed it nonetheless :-] [08:27:24] will further tweak it if need be [08:27:34] and merge it once it is working :-] [08:28:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:29:20] somewhere is a link to the job console [08:29:22] https://integration.wikimedia.org/ci/job/operations-dns-lint/41/console [08:29:28] SUCCESS \O/ [08:29:40] ;) [08:29:59] and rebuilding again https://integration.wikimedia.org/ci/job/operations-dns-lint/42/console [08:30:07] PROBLEM - Parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:30:11] at the top there is a line saying: [08:30:12] 08:29:48 Wiping out workspace first. [08:30:17] that means the full $WORKSPACE is deleted [08:30:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [08:30:22] yep [08:30:23] perfect [08:30:24] and the operations/dns.git is recloned [08:30:44] you can have multiple authdns-lint at the same time [08:31:00] not sure if you can tell jenkins it doesn't need to serialize them [08:31:02] nop [08:31:02] though it could be configured [08:31:05] the jobs are serialized [08:31:19] right, there's no reason for this test, that's all I'm saying [08:31:20] but we can make them run in parralel [08:31:25] I don't run a DNS server for example [08:31:30] I'm just running checkconf in a chroot [08:31:32] so it's safe [08:31:39] I don't use a shared location anywhere either [08:31:43] merging jenkins job builder configuration change [08:31:55] thanks! [08:32:12] the next step is to make operations-dns-lint to block the patchset whenever the job fail [08:32:17] that is done in Zuul configuration [08:32:19] I will do it :-] [08:32:43] please do but feel free to share the patchsets, it's interesting to see how this all works [08:34:43] https://gerrit.wikimedia.org/r/82375 [08:34:45] :-] [08:34:55] each patchset triggers a bunch of jobs [08:35:07] whenever one job fail, that will cause Zuul to report verified -1 [08:35:23] "voting: yes" being the default I'm guessing? [08:35:30] but we can hack some job to not be an issue, that is known in Zuul as voting [08:35:33] yup [08:35:45] so I had an exception to make operations-dns-lint result to be ignored [08:36:01] yep [08:36:11] on the zuul configuration change https://gerrit.wikimedia.org/r/#/c/82375/ there is a jenkins job result that is ignored [08:36:17] https://integration.wikimedia.org/ci/job/integration-zuul-layoutdiff/411/console : FAILURE in 2s (non-voting) [08:36:32] that does a diff of Zuul configuration at HEAD^ with HEAD [08:36:36] https://integration.wikimedia.org/ci/job/integration-zuul-layoutdiff/411/console [08:36:54] that is how I manually validate the zuul configuration changes [08:37:03] 08:34:45 -INFO:zuul.IndependentPipelineManager: [nonvoting] [08:37:03] 08:34:45 +INFO:zuul.IndependentPipelineManager: [08:37:20] it is no more 'nonvoting', hence whenever operations-dns-lint fails, Zuul will vote -1 [08:39:28] (03PS1) 10Hashar: Jenkins validation (please ignore) [operations/dns] - 10https://gerrit.wikimedia.org/r/82376 [08:39:35] (03CR) 10jenkins-bot: [V: 04-1] Jenkins validation (please ignore) [operations/dns] - 10https://gerrit.wikimedia.org/r/82376 (owner: 10Hashar) [08:39:41] :-] [08:40:51] here is the failure log https://integration.wikimedia.org/ci/job/operations-dns-lint/43/console [08:40:51] (03PS2) 10Hashar: Jenkins validation (please ignore) [operations/dns] - 10https://gerrit.wikimedia.org/r/82376 [08:41:07] 08:39:34 rfc1035: Zone wikimedia.org.: Zonefile parse error at line 17: unparseable [08:41:08] 08:39:34 rfc1035: Cannot load zonefile 'wikimedia.org', failing [08:41:41] (03Abandoned) 10Hashar: Jenkins validation (please ignore) [operations/dns] - 10https://gerrit.wikimedia.org/r/82376 (owner: 10Hashar) [08:41:58] the next step would be to have a job that start a DNS server [08:42:03] and then run dig commands to validate the records :-] [08:43:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:44:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [08:44:33] paravoid: I guess you can close https://rt.wikimedia.org/Ticket/Display.html?id=5688 "write a puppet class to install gdnsd on Jenkins slaves" [08:46:54] thank you faidon :-] [08:48:09] thank you :) [08:48:14] I think the authdns-lint script is neat [08:48:20] the way to lint this I mean [08:48:32] I can do changes without going outside the authdns module [08:50:05] RECOVERY - Disk space on wtp1021 is OK: DISK OK [08:50:14] hashar: how can I tell jenkins to retry https://gerrit.wikimedia.org/r/#/c/80993/ ? [08:50:17] and vote I mean [08:50:22] is that zuul? [08:50:43] comment 'recheck'? [08:51:42] maybe recheck yeah [08:51:46] that would retrigger linting job [08:51:56] what's recheck? [08:51:56] fill a comment in the change with the content: "recheck" [08:51:59] oh [08:52:22] (03CR) 10Faidon Liambotis: "recheck" [operations/dns] - 10https://gerrit.wikimedia.org/r/80993 (owner: 10Hashar) [08:52:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:52:53] perfect [08:53:06] :-] [08:53:18] another solution is to submit a new patchset [08:53:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [08:53:32] I usually edit the commit summary and insert a new line just above the Change-Id: xxx line [08:53:52] the question was more of if I can hit something under integration.wm.org [08:55:41] (03CR) 10Faidon Liambotis: [C: 032] points wikidata.org to pmtpa wikidata lb [operations/dns] - 10https://gerrit.wikimedia.org/r/80993 (owner: 10Hashar) [08:56:26] (03PS3) 10Faidon Liambotis: points wikidata.org to pmtpa wikidata lb [operations/dns] - 10https://gerrit.wikimedia.org/r/80993 (owner: 10Hashar) [08:58:23] paravoid: and potentially we could load the DNS zone in a varnish instance then run some dig commands / dns checker :D [08:59:30] (03CR) 10Faidon Liambotis: [C: 032] points wikidata.org to pmtpa wikidata lb [operations/dns] - 10https://gerrit.wikimedia.org/r/80993 (owner: 10Hashar) [08:59:51] varnish?! [09:00:49] grrr [09:00:51] vagrant [09:01:01] see, both contains the letters V A and R [09:01:10] var vagrant = "varnish"; [09:01:24] A friend yesterday kept saying 'valgrint' instead of vagrant :P [09:02:56] huh... like valgrind ? he seeks memory leaks in VMs ? :P [09:03:29] :P [09:03:55] * YuviPanda runs vagrant under valgrind [09:04:04] hmm, I wonder how long vbox would last [09:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [09:33:35] akosiaris: so php ? :-] [09:33:48] akosiaris: should we just install it on contint server and the beta cluster to try it out ? [09:34:01] hashar: yes. [09:34:20] I am not sure whether we could provide the package via apt.wm.o without having the full cluster to self upgrade [09:34:31] And I then i upload them in apt [09:34:42] i think we could if we used some other component [09:34:43] so I guess we need to scp the packages and manually install them with dpkg -i or something [09:35:11] I am wondering though how i can get the packages to brewster from labs (i built them there) [09:35:27] without going through my PC [09:35:58] if they are small enough, I usually scp them to fenari:/home/hashar/public_html [09:36:07] then wget from http://noc.wikimedia.org/~hashar/ [09:36:13] seems like services can not be hosted on labs ? This has something to do with the DNAT you we asking about ? [09:36:21] you were* [09:36:50] the nat issue happens whenever an instance attempt to access the public IP of an other instance [09:37:18] not sure how it is related to copying packages from brewster to labs :-D [09:37:28] not related then [09:48:52] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [09:48:52] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [09:52:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:53:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [10:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [10:31:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.145 second response time [11:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [11:27:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:28:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [11:31:51] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: No successful Puppet run in the last 10 hours [11:32:51] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: No successful Puppet run in the last 10 hours [11:36:51] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: No successful Puppet run in the last 10 hours [11:36:51] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: No successful Puppet run in the last 10 hours [11:38:57] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: No successful Puppet run in the last 10 hours [11:39:57] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: No successful Puppet run in the last 10 hours [11:39:57] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: No successful Puppet run in the last 10 hours [11:40:57] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: No successful Puppet run in the last 10 hours [11:40:57] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: No successful Puppet run in the last 10 hours [11:43:57] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: No successful Puppet run in the last 10 hours [11:45:57] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: No successful Puppet run in the last 10 hours [11:46:57] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: No successful Puppet run in the last 10 hours [11:48:57] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: No successful Puppet run in the last 10 hours [11:51:57] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: No successful Puppet run in the last 10 hours [11:56:57] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: No successful Puppet run in the last 10 hours [11:57:57] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: No successful Puppet run in the last 10 hours [12:04:45] apergos: About? [12:07:04] yes [12:07:11] Reedy: what's up? [12:07:36] back [12:09:03] apergos: Could you have a look why http://dumps.wikimedia.org/other/incr/wikidatawiki/ is serving 403s for Denny and a few other people please? [12:09:11] WFM and other people in the WMDE office too... [12:09:29] [13:08:25] i am not 403ed anymore [12:10:08] it serves 403 for people who try to open more than two connections from the same ip at once [12:10:11] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [12:10:25] aha [12:10:30] nice one [12:10:30] it should say so on the page [12:10:44] http://dumps.wikimedia.org/ [12:10:51] indeed, at the top it says so [12:11:05] Does it show the error on the 403? [12:11:10] error/message [12:11:17] no idea [12:11:22] heh [12:12:04] I told Denny to stop wasting resources ;) [12:12:12] do they need the latest dumps right this second? if not they can get better bandwidth and more simultaneous downloads from your.org (but there is a few hour delay) [12:12:15] hah [12:12:57] I should probably go mirror hunting again sometime soon, see if we can't find a couple more sites willing to do all dumps [12:13:01] I think for some/most of the requests they'd be ok hitting the mirrors [12:13:10] and bittorrent!! :-] [12:13:18] yep [12:13:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:13:48] do serial requets and have no problem :-P [12:13:52] *requests [12:14:08] Could we do something with ULSFO? [12:14:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [12:14:23] Get a large storage box and wire it up? [12:14:46] akosiaris: are you still alive ? :-] [12:14:54] I guess doing the same thing in ESAMS might not be possible [12:18:30] well we already do mirror internally, it's just not as good as having a few external mirrors for downloaders [12:18:48]