[00:29:14] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [00:33:17] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours [00:36:02] TimStarling: are you around and do you know enough about bits servers and central notice to help debug? [00:36:33] yes and maybe [00:36:36] help debug what? [00:36:42] there might have been a caching or some problem earlier that caused multiple banners in some cases [00:36:46] http://i.imgur.com/wrkfS.png [00:37:13] jeremyb saw this , like we remember some time ago there being multiple jimbos [00:37:20] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [00:37:25] but i didn't see it and such [00:37:59] we pulled the program banner (for logged-in users) until we know it won't happen again [00:41:51] it was served by srv 263 [00:42:17] PROBLEM - Puppet freshness on virt1 is CRITICAL: Puppet has not run in the last 10 hours [00:42:43] the HTML you mean [00:42:59] yeah [00:43:16] it might be totally different issue but i clicked on http://en.wikipedia.org/w/index.php?title=Special:BannerController&cache=/cn.js&303-4 [00:43:24] which is in my source on enwiki [00:43:52] its an error [00:44:32] well, the bug might be there, or it might be in Special:BannerListLoader [00:44:40] e.g. http://en.wikipedia.org/w/index.php?title=Special%3ABannerListLoader&cache=/cn.js&language=en&project=wikipedia&country=US [00:44:52] that worked for me [00:45:08] the URL you just gave me is copied out of HTML [00:45:16] yeah [00:45:18] you have to replace the &amp; with & before it will work [00:45:25] ok [00:45:35] yeah, it works :) [00:46:20] do you know why the multiple jimbos happened some months (or year ago)? [00:46:51] no [00:46:59] we're supposed to have one banner with the call for participation for logged in users, and then this other banner for logged out people [00:47:03] geotargeted [00:47:26] maybe $.centralNotice.fn.chooseBanner() runs multiple times [00:47:33] hmm [00:47:46] there should only be a maximum of 1, right? [00:47:54] yeah, just one [00:48:35] we set the weight and it's supposed to pick one [00:48:43] yeah, so neither chooseBanner() nor loadBanner() checks if it has been run before [00:49:17] so it just relies on details of jQuery and the browser to not load multiple banners [00:49:26] oh [00:49:51] so I think a good fix would be to set a variable when it is run the first time, and then to not run it again [00:50:04] how urgently do you need this? [00:50:22] we could make do with one banner [00:51:16] I can fix it now if it's urgent, or I can pass off my recommendations to a CentralNotice developer to do over the next 24 hours if it's not [00:51:19] maybe we can switch them in a couple days [00:51:29] 24 hours is ok [00:51:50] we have a week for the call for participation [00:52:04] and just want to make sure people don't miss it [00:52:07] ok [00:52:08] there's always a banner [00:52:20] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [00:52:27] and we can always extend the deadline [00:52:50] thanks [01:00:27] TimStarling: aude: fwiw, the one instance where i looked a little closer at the resulting DOM had a div#sitenotice with 2 children that were div#centralNotice.cn-default and each of those had 2 children.
all 4 grandchildren were div#mw-wikimaniadc.mw-wikimaniadc [01:00:40] * jeremyb continues catching up [01:02:59] right [01:03:08] probably an issue with the campaign script then [01:04:01] so, easy fix is check if any $('div#sitenotice div#mw-wikimaniadc') exists. and bail [01:05:34] assuming that it's being run multiple times. but idk why it would have that nesting. (i.e. it should be 4 siblings that are identical to each other instead of 2 in one parent and 2 in another?). anyway, worth trying at least [01:06:12] the major issue is that we can never no it's actually gone unless we can repro reliably... [01:06:31] aude: do you want to try implementing that? i don't has meta sysop... [01:08:36] I don't think you can implement it with meta sysop rights [01:08:44] where would you put it? [01:09:19] TimStarling: in the banner [01:10:20] you mean this? http://meta.wikimedia.org/w/index.php?title=MediaWiki:Centralnotice-template-Wikimania2012Program&action=edit [01:10:36] because that's HTML, not a script [01:10:53] TimStarling: https://meta.wikimedia.org/w/index.php?title=Special:NoticeTemplate/view&template=WikimaniaInDC [01:13:18] * jeremyb is getting on phone now so will be less responsive [01:15:01] * aude back [01:33:03] so, is there a verdict? meta sysop is (at least as a quick fix) sufficient to make a change? [01:33:23] (still on phone) [01:51:44] New patchset: Dzahn; "put swift production servers into a nagios hostgroup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3067 [01:51:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3067 [01:53:22] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3067 [01:53:26] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3067 [02:00:58] New review: Dzahn; "looks like this has been done manually meanwhile" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2927 [02:01:01] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2927 [02:13:21] New patchset: Dzahn; "oops, what did i do there, fix the exec for nagios to exim group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3068 [02:13:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3068 [02:15:27] New patchset: Dzahn; "oops, what did i do there, fix the exec for nagios to exim group, meh, get rid of whitespace too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3068 [02:15:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3068 [02:16:08] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3068 [02:16:12] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3068 [02:22:41] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/2989 [02:23:45] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/3024 [02:25:22] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/3036 [02:34:54] jeremyb: we can ask kaldari or whoever tomorrow [02:35:33] aude: can you just do it yourself? or i can give you a diff? 
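A minimal sketch of the two fixes discussed above (Tim's run-once flag and jeremyb's existence check), assuming a hypothetical insertBanner() helper standing in for whatever code actually injects the chosen banner's HTML; this is not the real CentralNotice implementation, only an illustration of the guard logic:

```javascript
// Sketch only; not the actual CentralNotice code. "insertBanner" is a
// hypothetical name, and the selectors are the ones quoted in the log above.
( function ( $ ) {
	var bannerLoaded = false;

	function insertBanner( bannerHtml ) {
		// Tim's suggestion: remember that a banner has already been loaded
		// and refuse to load another one.
		if ( bannerLoaded ) {
			return;
		}
		// jeremyb's quick check: bail out if this campaign's banner is
		// already somewhere in the DOM, however it got there.
		if ( $( 'div#sitenotice div#mw-wikimaniadc' ).length ) {
			return;
		}
		bannerLoaded = true;
		$( 'div#sitenotice' ).html( bannerHtml );
	}

	// $.centralNotice.fn.chooseBanner() would pick one banner by weight and
	// then call insertBanner() exactly once.
}( jQuery ) );
```

The DOM check is closer to the banner-side hack jeremyb wanted to try, while the flag is closer to the fix Tim proposed making inside CentralNotice itself; either one would stop a second banner from being injected.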
[02:36:42] jeremyb: it's not urgent enough for a quick hack [02:37:04] we'll do a new banner in 1-2 days for logged in people, with the cfp deadline on it [02:37:51] aude: okey... [02:38:03] * jeremyb is sleepy anyway! ;) [02:38:27] * aude sleepy [02:38:37] all the more reason not to do a hack now [02:39:04] and this is why i waited until yesterday to do any banners, so we could manage any issues [02:39:07] or problems [02:49:06] New patchset: Dzahn; "swift prod servers to monitoring group, that didnt appear to work because they dont include base" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3069 [02:49:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3069 [02:50:06] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3069 [02:50:09] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3069 [03:01:33] New patchset: Dzahn; "meh, duplicate def from including standard, add nagios hostgroup in base definition then and do this in role::swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3070 [03:01:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3070 [03:03:03] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3070 [03:03:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3070 [03:33:56] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [03:33:56] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [03:35:34] !log upgrading libc6 and related packages on spence [03:35:45] Logged the message, Master [03:40:27] !log same (and nscd) on fenari [03:40:31] Logged the message, Master [04:19:16] RECOVERY - DPKG on spence is OK: All packages OK [04:19:34] RECOVERY - Disk space on spence is OK: DISK OK [04:19:45] !log wanted to restart nagios-nrpe-server on spence with debug=1 to investigate permission issue. arr! "Address already in use" "cant write to pidfile", killed the one started on Feb18, and reordered allowed_hosts, spence talks to itself again now :p [04:19:48] Logged the message, Master [04:20:01] RECOVERY - profiler-to-carbon on spence is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon [04:20:19] RECOVERY - profiling collector on spence is OK: PROCS OK: 1 process with command name collector [04:20:19] RECOVERY - RAID on spence is OK: OK: no RAID installed [04:56:47] New patchset: Dzahn; "change Swift HTTP check to user port 8080" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3071 [04:56:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3071 [04:59:21] New review: Dzahn; "should be able to check them all with the same command, when using port 8080 " [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3071 [04:59:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3071 [05:04:13] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [05:14:16] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [05:15:10] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours [05:15:10] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours [05:16:13] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours [05:16:13] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [05:16:13] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours [05:19:13] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours [05:19:13] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours [05:19:13] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours [05:19:13] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [05:20:16] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours [05:23:33] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [05:23:33] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [05:24:07] New patchset: Dzahn; "..or not. make check_http_swift flexible for different ports and add an if-statement on hostname" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3072 [05:24:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3072 [05:26:05] New patchset: Dzahn; "make check_http_swift flexible for different ports and add an if-statement on hostname" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3072 [05:26:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3072 [05:27:31] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3072 [05:27:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3072 [05:32:51] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [05:37:24] !log copper - installing (security) updates (apt,grub,openssl,ruby,libc6..) 
[05:37:28] Logged the message, Master [05:43:44] !log rebooting copper to make sure grub update didnt break it and asked for restart anyways [05:43:48] Logged the message, Master [05:48:00] PROBLEM - SSH on copper is CRITICAL: Connection refused [05:49:21] PROBLEM - Memcached on copper is CRITICAL: Connection refused [05:51:35] RECOVERY - Swift HTTP on magnesium is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 1.064 seconds [05:51:53] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [05:51:53] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours [05:52:29] RECOVERY - SSH on copper is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [05:53:12] !log dunno, copper was stuck (no mgmt output after reboot) but powercycling it and back [05:53:15] Logged the message, Master [05:53:50] RECOVERY - Memcached on copper is OK: TCP OK - 0.027 second response time on port 11211 [05:58:11] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours [05:58:11] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours [05:58:11] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Puppet has not run in the last 10 hours [05:58:11] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [05:58:11] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [05:58:11] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours [05:58:11] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Puppet has not run in the last 10 hours [05:58:12] PROBLEM - Puppet freshness on amssq58 is CRITICAL: Puppet has not run in the last 10 hours [05:58:12] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours [05:58:13] PROBLEM - Puppet freshness on knsq25 is CRITICAL: Puppet has not run in the last 10 hours [05:58:13] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours [05:58:35] TimStarling: Is there an open bug about the CentralNotice issue? [05:58:58] I doubt it [05:59:23] An interesting aspect of the screenshot showing the bug is that it wasn't the exact same banner every time [05:59:40] it looks like it's repetitively requesting and as such getting a new one from the server. [06:00:18] TimStarling: aude: fwiw, the one instance where i looked a little closer at the resulting DOM had a div#sitenotice with 2 children that were div#centralNotice.cn-default and each of those had 2 children. 
all 4 grandchildren were div#mw-wikimaniadc.mw-wikimaniadc [06:00:43] not sure how that is possible [06:01:06] looking at the templates on meta didn't enlighten me [06:02:57] right [06:03:16] iirc, we've seen this bug before during a fundraiser [06:03:26] likely browser and/or skin specific [06:03:31] I've never seen that ever [06:03:47] or race condition related [06:04:13] I can look at it in about 7-8 hours when I get back from school [06:04:30] going offline for now [06:07:02] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [06:07:02] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [06:07:02] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours [06:07:02] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [06:07:02] PROBLEM - Puppet freshness on amssq54 is CRITICAL: Puppet has not run in the last 10 hours [06:07:02] PROBLEM - Puppet freshness on amssq59 is CRITICAL: Puppet has not run in the last 10 hours [06:07:02] PROBLEM - Puppet freshness on amssq39 is CRITICAL: Puppet has not run in the last 10 hours [06:07:03] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [06:07:03] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Puppet has not run in the last 10 hours [06:07:04] PROBLEM - Puppet freshness on amssq55 is CRITICAL: Puppet has not run in the last 10 hours [06:07:04] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Puppet has not run in the last 10 hours [06:07:05] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [06:07:05] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [06:07:06] PROBLEM - Puppet freshness on ssl3004 is CRITICAL: Puppet has not run in the last 10 hours [06:08:05] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [06:08:05] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours [06:08:05] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [06:08:05] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Puppet has not run in the last 10 hours [06:08:05] PROBLEM - Puppet freshness on amssq45 is CRITICAL: Puppet has not run in the last 10 hours [06:08:05] PROBLEM - Puppet freshness on amssq48 is CRITICAL: Puppet has not run in the last 10 hours [06:10:31] !log turning off debug mode in nagios-nrpe, again had to kill it , restart fails [06:10:34] Logged the message, Master [06:11:05] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours [06:11:05] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [06:11:05] PROBLEM - Puppet freshness on amssq42 is CRITICAL: Puppet has not run in the last 10 hours [06:11:05] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [06:11:05] PROBLEM - Puppet freshness on knsq16 is CRITICAL: Puppet has not run in the last 10 hours [06:11:05] PROBLEM - Puppet freshness on knsq18 is CRITICAL: Puppet has not run in the last 10 hours [06:11:05] PROBLEM - Puppet freshness on amssq57 is CRITICAL: Puppet has not run in the last 10 hours [06:11:06] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: Puppet has not run in the last 10 hours [06:11:06] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 
hours [06:11:07] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours [06:12:08] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours [06:13:11] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [06:13:11] PROBLEM - Puppet freshness on knsq22 is CRITICAL: Puppet has not run in the last 10 hours [06:14:05] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [06:14:05] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Puppet has not run in the last 10 hours [06:14:05] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [06:16:02] PROBLEM - Puppet freshness on knsq20 is CRITICAL: Puppet has not run in the last 10 hours [06:27:27] RECOVERY - Swift HTTP on copper is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.127 seconds [06:28:48] RECOVERY - Swift HTTP on zinc is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.065 seconds [08:05:20] New review: Dzahn; "i changed this command in change 3072 to fix monitoring on zinc/magnesium/copper who need port 8080 ..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/3036 [10:30:29] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [10:34:32] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours [10:38:35] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [10:43:32] PROBLEM - Puppet freshness on virt1 is CRITICAL: Puppet has not run in the last 10 hours [10:53:35] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [11:14:44] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 181 seconds [11:16:05] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 184 seconds [11:24:20] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [11:24:47] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [11:28:56] New patchset: Mark Bergsma; "Support multiple varnish instances in the ganglia metrics module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3081 [11:29:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3081 [11:30:35] New patchset: Mark Bergsma; "Support multiple varnish instances in the ganglia metrics module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3081 [11:30:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3081 [11:34:52] New patchset: Mark Bergsma; "Support multiple varnish instances in the ganglia metrics module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3081 [11:35:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3081 [11:36:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3081 [11:36:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3081 [11:37:27] PROBLEM - Puppet freshness on mw53 is CRITICAL: Puppet has not run in the last 10 hours [12:03:05] New patchset: Mark Bergsma; "Small fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3082 [12:03:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3082 [12:03:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3082 [12:03:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3082 [12:17:15] New patchset: Mark Bergsma; "Monitor both frontend and backend varnish instances for mobile as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3083 [12:17:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3083 [12:17:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3083 [12:17:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3083 [12:19:35] New patchset: Mark Bergsma; "Add eqiad upload varnish cluster to torrus collection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3084 [12:19:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3084 [12:19:56] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3084 [12:19:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3084 [12:21:58] New review: Mark Bergsma; "Please do per-data center groups, following the same name style as has been done for squids/varnish ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3067 [13:35:06] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [13:35:06] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [14:07:26] typo [14:08:27] its that kind of day. [14:33:08] New patchset: Mark Bergsma; "Support dual-layer varnish clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3086 [14:33:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3086 [14:33:54] god damn it [14:34:00] equinix installed the grounding wron gin all the new cabinets [14:34:02] what the fuck. [14:34:13] can they do nothing right the first time. [14:34:28] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3086 [14:34:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3086 [14:35:19] i take that back, the existing cabinets are wrong too [14:35:30] mark: so none of the grounding is right in any cabinet, which may be why we are having issues [14:35:43] they didnt bother to use star washers, so the grounding is attached to powder coat metal [14:35:46] rather than metal on metal [14:36:17] haha [14:38:22] i hate eq. 
[14:38:41] i cc'd you on the email. [14:39:07] going to have to run to lowes shortly, had to review on site what hardware i have and what i need to buy. [14:47:21] i am just going to buy all the shit i need to regound everything from scratch. [14:48:58] no it's their responsibility [14:49:49] New patchset: Mark Bergsma; "Support dual layer varnish setups, add upload-eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3087 [14:50:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3087 [14:51:37] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3087 [14:51:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3087 [14:53:32] they should probably be grounding the doors too [14:54:45] New patchset: Mark Bergsma; "Missing %" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3088 [14:54:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3088 [14:55:00] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3088 [14:55:02] mark: they wont ground the doors and such, so its ust as easy to buy extra washers [14:55:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3088 [14:55:09] i dont trust them to go in racks with working equipment [14:55:14] as they have fucked it up twice now. [14:55:43] the racks came with the wires ,but some have gone missing, some werent there to begin with, etc... so i just need to buy a spool of groundin gwire and the clamping tools to make them [14:57:29] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.8116703571 (gt 8.0) [14:59:53] ok, leaving laptop in eqiad and goin gto hardware store for stuff [15:00:58] I will be back shortly. i am not done today until grounding is all fixed, tired of all the odd erros. [15:01:38] New patchset: Mark Bergsma; "Set prefix to empty for single layer servers, add a decommission option for dual layer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3089 [15:01:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3089 [15:02:07] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3089 [15:02:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3089 [15:05:35] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [15:09:10] New patchset: Mark Bergsma; "Merge squidlayer and varnishlayer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3090 [15:09:21] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3090 [15:09:56] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3090 [15:09:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3090 [15:15:38] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [15:16:39] !log Rebooted manutius, stuck in a similar state as streber always did [15:16:41] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours [15:16:41] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours [15:16:43] Logged the message, Master [15:17:35] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours [15:17:35] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [15:17:35] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours [15:20:35] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours [15:20:35] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [15:20:35] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours [15:20:35] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours [15:21:38] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours [15:22:41] PROBLEM - Host manutius is DOWN: PING CRITICAL - Packet loss = 100% [15:24:38] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [15:24:38] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [15:24:38] RECOVERY - Host manutius is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [15:34:41] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [15:42:02] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 14.75302 (gt 8.0) [15:45:43] New patchset: Mark Bergsma; "Create a separate vcl_config hash, as retry5x/cache4xx are not backend options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3091 [15:45:54] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3091 [15:47:11] New patchset: Mark Bergsma; "Create a separate vcl_config hash, as retry5x/cache4xx are not backend options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3091 [15:47:23] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3091 [15:48:37] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3091 [15:48:39] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3091 [15:49:05] PROBLEM - DPKG on snapshot3 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:51:11] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 185 MB (2% inode=61%): /var/lib/ureadahead/debugfs 185 MB (2% inode=61%): [15:53:35] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [15:53:35] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours [15:57:02] New patchset: Mark Bergsma; "Require GET or HEAD for upload, no POST" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3092 [15:57:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3092 [15:57:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3092 [15:57:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3092 [15:59:35] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours [15:59:35] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours [15:59:35] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours [15:59:35] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours [15:59:35] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Puppet has not run in the last 10 hours [15:59:35] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Puppet has not run in the last 10 hours [15:59:35] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [15:59:36] PROBLEM - Puppet freshness on amssq58 is CRITICAL: Puppet has not run in the last 10 hours [15:59:36] PROBLEM - Puppet freshness on knsq25 is CRITICAL: Puppet has not run in the last 10 hours [15:59:37] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [15:59:37] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours [16:05:48] RECOVERY - Disk space on srv221 is OK: DISK OK [16:08:30] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [16:08:30] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [16:08:30] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours [16:08:30] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [16:08:30] PROBLEM - Puppet freshness on amssq39 is CRITICAL: Puppet has not run in the last 10 hours [16:08:30] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Puppet has not run in the last 10 hours [16:08:30] PROBLEM - Puppet freshness on amssq55 is CRITICAL: Puppet has not run in the last 10 hours [16:08:31] PROBLEM - Puppet freshness on amssq54 is CRITICAL: Puppet has not run in the last 10 hours [16:08:31] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [16:08:32] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Puppet has not run in the last 10 hours [16:08:32] PROBLEM - Puppet freshness on amssq59 
is CRITICAL: Puppet has not run in the last 10 hours [16:08:33] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [16:08:33] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [16:08:34] PROBLEM - Puppet freshness on ssl3004 is CRITICAL: Puppet has not run in the last 10 hours [16:09:24] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [16:09:24] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Puppet has not run in the last 10 hours [16:09:24] PROBLEM - Puppet freshness on amssq48 is CRITICAL: Puppet has not run in the last 10 hours [16:09:24] PROBLEM - Puppet freshness on amssq45 is CRITICAL: Puppet has not run in the last 10 hours [16:09:24] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours [16:09:24] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [16:12:24] PROBLEM - Puppet freshness on amssq42 is CRITICAL: Puppet has not run in the last 10 hours [16:12:24] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours [16:12:24] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [16:12:24] PROBLEM - Puppet freshness on amssq57 is CRITICAL: Puppet has not run in the last 10 hours [16:12:24] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [16:12:24] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours [16:12:25] PROBLEM - Puppet freshness on knsq16 is CRITICAL: Puppet has not run in the last 10 hours [16:12:25] PROBLEM - Puppet freshness on knsq18 is CRITICAL: Puppet has not run in the last 10 hours [16:12:26] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [16:12:26] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: Puppet has not run in the last 10 hours [16:13:27] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours [16:14:30] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [16:14:30] PROBLEM - Puppet freshness on knsq22 is CRITICAL: Puppet has not run in the last 10 hours [16:15:24] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [16:15:24] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Puppet has not run in the last 10 hours [16:15:24] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [16:17:30] PROBLEM - Puppet freshness on knsq20 is CRITICAL: Puppet has not run in the last 10 hours [16:18:24] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.3391611607 (gt 8.0) [16:45:15] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2989 [16:45:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2989 [16:48:15] PROBLEM - Host cp1021 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:46] New patchset: Lcarr; "Fixing fixme's from previous change https://gerrit.wikimedia.org/r/#change,2936" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3093 [16:54:58] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3093 [16:56:39] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.56872919643 (gt 8.0) [16:56:57] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/3093 [16:57:15] RECOVERY - Host cp1021 is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms [17:00:33] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [17:02:03] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3093 [17:02:06] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3093 [17:02:57] PROBLEM - Varnish HTTP upload-frontend on cp1021 is CRITICAL: Connection refused [17:03:06] woosters: https://rt.wikimedia.org/Ticket/Display.html?id=2607 [17:03:15] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [17:03:20] whose varnish config diff ? [17:03:25] and is it ok to push ? [17:04:23] got it hexmode [17:06:42] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [17:08:51] LeslieCarr: did I forget to merge smt? [17:08:59] upload-frontend.inc.vcl.erb ? [17:09:00] woosters: :) [17:09:03] guessing is yours ? [17:09:06] yeah [17:09:08] go ahead [17:09:47] merged [17:11:30] PROBLEM - Varnish HTTP upload-frontend on cp1022 is CRITICAL: Connection refused [17:12:06] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [17:13:20] !log PXE booting cp1025-cp1028 [17:13:24] Logged the message, Master [17:14:57] RECOVERY - Varnish HTTP upload-frontend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [17:16:45] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 14.5513432432 (gt 8.0) [17:25:53] !rt 2446 [17:25:53] https://rt.wikimedia.org/Ticket/Display.html?id=2446 [17:26:00] robh: ^ [17:26:14] New patchset: Mark Bergsma; "Setup dependency for varnish::logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3094 [17:26:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3094 [17:26:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3094 [17:26:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3094 [17:28:58] robh: did you see the rt? [17:29:03] rt! 2446 [17:29:08] cmjohnson1: yea row c [17:29:10] !rt 2446 [17:29:10] https://rt.wikimedia.org/Ticket/Display.html?id=2446 [17:29:12] okay [17:29:32] came a month early ;) [17:30:45] good, we need the racks [17:31:02] i just realized though you dont have cable rings for row c right? [17:31:03] =/ [17:32:26] RobH: any chance you can fix cp1025 today by swapping memory with another unused one (cp1040 or so) [17:33:01] robh: I have cable rings [17:33:12] ordered 2 boxes way back in September [17:33:20] RECOVERY - Varnish HTTP upload-frontend on cp1022 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.054 seconds [17:33:22] yea but you need 5 per rack minimum [17:33:25] preferrred 10 per rack [17:33:29] so you need more right? [17:33:36] 5 on each side. 
[17:33:47] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 2 processes with command name varnishncsa [17:33:48] probably...i don't think I have 30 [17:33:53] 10 per box [17:34:09] the ones we ordered before should fill up row D [17:34:16] and then you should have been out [17:34:28] so they match how they are setup in sdtpa [17:34:43] i am on the last box....but d1 is a little different, only have rings on 1 side [17:34:52] hrmm, thats not ideal. [17:35:01] i wanted all the racks to be identical. [17:35:18] I did it that way because matt and i want to do something different because of the fiber [17:35:28] run a tube [17:35:46] ok, but the tuvbe can go inside the cable rings down to the mx80 [17:35:51] like in sdtpa. [17:36:06] ok...np..ez fix [17:39:27] pls drop a procurement ticket witht he part # of the cable rings [17:39:31] and i will get more ordered for you [17:39:36] so you can balance them out in future ot match [17:39:42] (its not a big deal, easy fix, so on) [17:39:54] im just in a bad mood today so my tone is off. [17:40:22] mark: checking on that [17:40:28] k..i will get that ticket out to you shortly [17:40:33] np [17:40:35] ugh, someone sent OTRS a screenshot of a tracert (windows). but there's nothing obviously wrong with the tracert and the english in the message body is not really understandable [17:41:05] it's another thai ISP fwiw [17:42:30] mark: cp1017 and cp1019 have orange error leds illuminated, did you push them online? (i am seeing the errors on the front now) [17:42:40] looking at cp1025 now. [17:42:56] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.77458491071 (gt 8.0) [17:42:58] those are squids [17:43:03] and they are live yeah [17:45:12] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 2 processes with command name varnishncsa [17:45:40] here's the image from the ticket: http://i.imgur.com/eL5vY.png (2012031210002867) [17:46:05] !log restarting indexer on searchidx2 [17:46:08] Logged the message, and now dispaching a T1000 to your position to terminate you. [17:47:12] on rereading again it looks like there's no images loaded on the enwikip main page (in the background of the screenshot) [17:47:37] well, fixing he grounding in live racks is going to be an issue. [17:47:47] since self tapping screws cause metal flakes to fall on servers. [17:47:49] today sucks. [17:48:28] but he hasn't convinced me that it's not just a local firefox preference to not load pics ;( [17:50:18] RobH: do you know what oxygen.wikimedia.org is being used for? (and why it feels I should be getting nagios alerts?) [17:51:18] its a locke replacement [17:51:27] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/site.pp;h=7b7b693d2d610fe36266cedfe3223ff14b0df6e3;hb=HEAD#l1496 ? [17:51:31] nimish_g: ^ [17:51:41] oh are yoy getting all those emails? :-D [17:52:56] yeah, but I figure as long as the nagios alerts resolve to an even number of messages, it's fine [17:53:46] :-D [17:54:24] New patchset: Mark Bergsma; "Throw cp1022-1028 into the upload eqiad pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3095 [17:54:35] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3095 [17:55:56] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3095 [17:55:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3095 [18:03:43] New patchset: Ryan Lane; "This seems silly, but let's try it." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3096 [18:03:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3096 [18:04:12] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3096 [18:04:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3096 [18:15:14] New patchset: Lcarr; "Declaring nagios monitor machines specifically" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3097 [18:15:22] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3097 [18:16:05] New patchset: Lcarr; "Declaring nagios monitor machines specifically" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3097 [18:16:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3097 [18:18:16] New patchset: Ryan Lane; "You suck puppet. I hate you so much." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3098 [18:18:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3098 [18:18:35] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3098 [18:18:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3098 [18:21:20] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.24173776786 (gt 8.0) [18:58:56] Ryan_Lane: presumably using HTTPS would be slower than HTTP, but it shouldn't be significantly slower should it? [18:59:17] should not be much slower [18:59:20] what issue are you seeing? [19:00:03] AWB using HTTP could do upto 15-20+ edits per minute. Using HTTPS it's been reported only 4-5 epm [19:00:29] Not tested it myself yet, and it might be AWB/.NET config at fault [19:01:36] are you reusing connections? [19:01:43] it'll be *way* slower if you don't [19:01:54] Mmm [19:01:56] connection speeds are the slow part of https [19:01:58] I think we aren't [19:02:02] you shoudl [19:02:08] at least for https [19:02:29] Yeah, that overheads of connection is what I was meaning originally [19:03:38] That looks like it will probably be it... [19:04:57] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [19:22:12] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 14.1719694643 (gt 8.0) [19:36:18] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.18729848214 (gt 8.0) [19:43:21] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms [19:48:32] !rt 2595 [19:48:32] https://rt.wikimedia.org/Ticket/Display.html?id=2595 [19:48:37] robh ^ [19:49:14] thats just a loss of power error [19:49:28] doesnt give a timestamp though [19:49:29] hrmm [19:49:40] it did... 
[19:49:53] 0207 22:36:23 2012 [19:50:01] Feb 07 [19:50:52] updated ticket [19:52:26] bah, was gonna ping ben [19:52:29] but he just parted [19:52:32] bah!! [19:53:37] lunch time there [20:08:02] RobH: hey there [20:08:12] hiya [20:08:13] whats the status on the pediapress labs work? [20:08:24] their asking me pretty regularly about tit [20:08:46] join security and we can chat [20:09:12] RobH: can't [20:09:14] invite only [20:09:28] and i know that i've registered under this nick [20:09:33] lemme fix yer access =] [20:10:12] i'd prefer it under tfinc [20:10:42] tfinc: now its on both [20:10:45] try now [20:10:47] RobH: thanks [20:11:09] Cannot join channel (+r) - you need to be identified with services [20:11:10] fun [20:11:27] tfinc: since I finally see you on. what about that list admin password? [20:11:28] yea you need to ID [20:11:33] tfinc: also turn on enfocement man [20:11:35] there's some mail waiting for moderation I can't get to [20:11:40] or else anyone can pretend to be you [20:11:40] i really don't have the time to set this up right now RobH . just pm me [20:11:44] ok [20:11:56] apergos: sure [20:12:00] apergos: let me pull it up [20:12:42] apergos: how would you like me to send it to you? [20:12:54] hmm [20:12:59] do you have access to any cluster host? [20:15:28] i'm on fenari now [20:15:36] want me to mesg it you you? [20:15:54] apergos: pts/56 ? [20:16:36] um [20:17:37] pts/45 I think [20:20:03] Ryan_Lane: setting keepalive has made some improvement. It seems though the api apaches might just be on a go slow, as reverting back to the http code is only slightly faster [20:20:23] apergos: get my message ? [20:20:24] yes [20:20:33] send away [20:20:38] done [20:20:48] thanks [20:21:01] I will use it immediately, [20:21:04] . [20:26:29] !log cp1040 coming down for hardware stuffs [20:26:32] Logged the message, RobH [20:32:10] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [20:36:04] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours [20:40:07] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [20:45:04] PROBLEM - Puppet freshness on virt1 is CRITICAL: Puppet has not run in the last 10 hours [20:46:43] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.70933339286 [20:46:53] Reedy: is it still dramatically different between http and https with the keepalive option? [20:47:27] Ryan_Lane: testing HTTP now was only doing 1-2 more edits per minute [20:47:36] sounds about right [20:47:38] good to hear [20:47:40] Nooo [20:47:45] not 4 times like it used to (reverted back to the old code check) [20:47:48] HTTP is slow also [20:47:52] * Ryan_Lane nods [20:47:55] at the moment at least [20:50:30] well, lemme know if there's a problem [20:50:36] I can't imagine it being much slower [20:50:53] yeah, I'm not sure why HTTP is really slow at the moment [20:51:10] it does funnel more connections through a smaller number of servers, but those servers are pretty bored [20:51:36] The main slowness was apparently doing the saves [20:51:49] I'll keep an eye on it and see if it improves [20:51:52] * Ryan_Lane nods [20:52:02] stick it in another thread, then queue locally? 
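The slowdown described here is what you would expect if every API request opens a fresh connection: over HTTPS each request then pays a full TCP plus TLS handshake. AWB is a .NET application, so the following is only an illustration of the general keep-alive pattern in Node.js, against a hypothetical query URL, not how AWB itself talks to the API:

```javascript
// Illustration only: reuse one HTTPS connection across requests instead of
// paying a TCP + TLS handshake per edit. The endpoint here is a stand-in.
const https = require( 'https' );

// keepAlive keeps the socket (and its TLS session) open between requests.
const agent = new https.Agent( { keepAlive: true, maxSockets: 1 } );

function apiGet( path ) {
	return new Promise( ( resolve, reject ) => {
		https.get( { host: 'en.wikipedia.org', path: path, agent: agent }, ( res ) => {
			let body = '';
			res.on( 'data', ( chunk ) => { body += chunk; } );
			res.on( 'end', () => resolve( body ) );
		} ).on( 'error', reject );
	} );
}

// With the keep-alive agent, the second request skips the handshake entirely.
apiGet( '/w/api.php?action=query&meta=siteinfo&format=json' )
	.then( () => apiGet( '/w/api.php?action=query&meta=siteinfo&format=json' ) )
	.then( () => agent.destroy() );
```

The same idea applies in any HTTP client: hold the connection open across edits rather than reconnecting per request, which appears to be what the keepalive change above bought back.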
[20:52:03] Could be other network esk factors affecting the guys speed [20:55:07] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [21:38:49] PROBLEM - Puppet freshness on mw53 is CRITICAL: Puppet has not run in the last 10 hours [21:39:26] !log search1014 repaired per rt 2483 [21:39:30] Logged the message, RobH [21:39:45] notpeter: search1014 is fixed, rebooting it to disable the onboard sata ports as its got a perc controller [21:39:52] but will be done in a few and its all yours whenever you want it [21:40:05] sorry it took so long, first dead mainboard, which killed some memory, then had to get more meory [21:40:09] memory even [21:43:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.751 seconds [21:51:32] Ahm, apergos ? [21:51:34] snapshot1003: @ERROR: access denied to common from snapshot1003.eqiad.wmnet (10.64.16.141) [21:51:36] snapshot1003: rsync error: error starting client-server protocol (code 5) at main.c(1524) [Receiver=3.0.7] [21:51:37] snapshot1004: @ERROR: access denied to common from snapshot1004.eqiad.wmnet (10.64.16.142) [21:51:39] snapshot1004: rsync error: error starting client-server protocol (code 5) at main.c(1524) [Receiver=3.0.7] [21:52:04] baaahhhh [21:52:09] dang it [21:52:22] what am I supposed to add them to? [21:52:33] I have no idea what that's about or how to fix it [21:52:34] oh. they probably didn't have some sync apache thing run on them [21:52:36] for crhissakes [21:52:44] I just ran sync-file and this error came up [21:52:57] this is what I get for finally adding them to the dsh groups [21:53:10] It might just be denying stuff from eqiad [21:53:15] We've had problems like that before [21:53:18] can you ignore it for now and I'll look at it tomorrow? [21:53:21] Sure [21:53:34] searchidx1001 was broken because the pmtpa es servers only allowed connections from pmtpa [21:53:43] ok, I'll add it to my todos [21:54:01] I'll see [21:54:01] Oh, no, it's not anymore [21:54:04] Tim and Peter fixed it [21:54:08] that was fast [21:54:16] Just using that as an illustration that this might be Just Another Eqiad Bug [21:55:06] oh. yeah no I meant I will look at snaps tomorrow [21:55:16] you did not mean that they just now magically fixed them [21:55:22] Oh [21:55:30] No I meant Tim&Peter fixed the searchidx1001 issue [21:55:30] see what I mean, it's too late for productive work :-D [21:55:34] heh [21:56:17] what sync command did you run btw? [21:57:58] RobH: a winrar is you [21:58:00] thank you! [21:59:12] welcome [21:59:26] is it booting up at the moment? [22:11:58] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [22:14:02] RobH: can you take a look at search1014 again? 
[22:14:07] it's not booting up via console [22:14:29] sure, middle of creating some shipments, gimme a couple minutes and its my next task [22:14:48] yup, no prob [22:15:52] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 192 MB (2% inode=61%): /var/lib/ureadahead/debugfs 192 MB (2% inode=61%): [22:15:52] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 8 MB (0% inode=61%): /var/lib/ureadahead/debugfs 8 MB (0% inode=61%): [22:15:52] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 187 MB (2% inode=61%): /var/lib/ureadahead/debugfs 187 MB (2% inode=61%): [22:15:52] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 85 MB (1% inode=61%): /var/lib/ureadahead/debugfs 85 MB (1% inode=61%): [22:15:52] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 78 MB (1% inode=61%): /var/lib/ureadahead/debugfs 78 MB (1% inode=61%): [22:17:12] RoanKattouw: what server am I proxying to? do you have an rt open for this? [22:17:22] that way I can go back and reference it when I forget the server name again :) [22:19:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:03] anyone else having issues with gerrit and pushing to it ? [22:20:18] got a missing Change-Id even though my commit has a change id right now .. grrr [22:20:47] ok, mx80s going out to esams from eqiad tomorrow. [22:21:02] where did my day go ;_; [22:21:03] Some volunteers were complaining about something with gerrit earlier [22:21:14] I got here at 9:30, and it seems like I got so very little done. [22:21:18] LeslieCarr: I get that on rare occasionally, and usually it has to do with something to do with a screwed up local repo [22:21:24] here's how I fix it: [22:21:41] git checkout -b gerritisapain origin/production [22:21:56] git cherry-pick [22:22:00] then push it in [22:22:09] then, git checkout [22:22:17] git reset --hard origin/production [22:22:24] git branch -D gerritisapain [22:23:16] * RobH copies that down for when gerrit screws him in a similar fashion [22:23:26] and it will. [22:23:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.751 seconds [22:25:55] RECOVERY - Disk space on srv222 is OK: DISK OK [22:26:04] RECOVERY - Disk space on srv219 is OK: DISK OK [22:26:56] heh [22:27:20] I just went through the reset mixed dance today [22:27:23] not too exciting [22:29:58] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 248 MB (3% inode=61%): /var/lib/ureadahead/debugfs 248 MB (3% inode=61%): [22:30:16] RECOVERY - Disk space on srv223 is OK: DISK OK [22:30:16] RECOVERY - Disk space on srv220 is OK: DISK OK [22:31:37] Ryan_Lane: Cadmium [22:31:44] cadmium.eqiad.wmnet that is [22:31:51] No RT ticket currently [22:31:55] RECOVERY - Disk space on srv221 is OK: DISK OK [22:31:55] RECOVERY - Disk space on srv224 is OK: DISK OK [22:32:43] ok [22:36:07] gah. I was going to go sleep what, an hour ago? [22:38:49] * apergos gets [22:40:28] Ok, going to start cleaning up. It will be 7pm before I leave. [22:43:14] New patchset: Lcarr; "Adding in customized init file, preparing icinga for initial install Removing contact groups so that we don't get deluged by email in initial install" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3099 [22:43:26] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3099 [22:55:20] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3099 [22:55:23] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3099 [22:57:31] New patchset: Lcarr; "including nagios configuration group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3100 [22:57:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3100 [22:58:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:02:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.453 seconds [23:11:07] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3097 [23:11:40] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3100 [23:11:42] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3100 [23:16:11] New patchset: Lcarr; "fixing network::checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3101 [23:16:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3101 [23:16:35] Change abandoned: Lcarr; "abandoning due to conflicts, redoing change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3097 [23:17:14] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3101 [23:17:17] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3101 [23:17:37] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2956 [23:29:59] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/3036 [23:30:46] maplebed: does ?action=purge send an HTCP message to the Squids right now? [23:31:28] yes. [23:31:40] (but only for files that exist on ms5) [23:36:42] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [23:36:42] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [23:38:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:42:23] maplebed: is ms-fe* public or internally ip'ed ? [23:42:39] LeslieCarr: internal. [23:42:44] thx [23:44:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.812 seconds