[00:29:14] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [00:33:17] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours [00:36:02] TimStarling: are you around and do you know enough about bits servers and central notice to help debug? [00:36:33] yes and maybe [00:36:36] help debug what? [00:36:42] there might have been a caching or some problem earlier that caused multiple banners in some cases [00:36:46] http://i.imgur.com/wrkfS.png [00:37:13] jeremyb saw this , like we remember some time ago there being multiple jimbos [00:37:20] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [00:37:25] but i didn't see it and such [00:37:59] we pulled the program banner (for logged-in users) until we know it won't happen again [00:41:51] it was served by srv 263 [00:42:17] PROBLEM - Puppet freshness on virt1 is CRITICAL: Puppet has not run in the last 10 hours [00:42:43] the HTML you mean [00:42:59] yeah [00:43:16] it might be totally different issue but i clicked on http://en.wikipedia.org/w/index.php?title=Special:BannerController&cache=/cn.js&303-4 [00:43:24] which is in my source on enwiki [00:43:52] its an error [00:44:32] well, the bug might be there, or it might be in Special:BannerListLoader [00:44:40] e.g. http://en.wikipedia.org/w/index.php?title=Special%3ABannerListLoader&cache=/cn.js&language=en&project=wikipedia&country=US [00:44:52] that worked for me [00:45:08] the URL you just gave me is copied out of HTML [00:45:16] yeah [00:45:18] you have to replace the &amp; with & before it will work [00:45:25] ok [00:45:35] yeah, it works :) [00:46:20] do you know why the multiple jimbos happened some months (or year ago)? [00:46:51] no [00:46:59] we're supposed to have one banner with the call for participation for logged in users, and then this other banner for logged out people [00:47:03] geotargeted [00:47:26] maybe $.centralNotice.fn.chooseBanner() runs multiple times [00:47:33] hmm [00:47:46] there should only be a maximum of 1, right? [00:47:54] yeah, just one [00:48:35] we set the weight and it's supposed to pick one [00:48:43] yeah, so neither chooseBanner() nor loadBanner() checks if it has been run before [00:49:17] so it just relies on details of jQuery and the browser to not load multiple banners [00:49:26] oh [00:49:51] so I think a good fix would be to set a variable when it is run the first time, and then to not run it again [00:50:04] how urgently do you need this? [00:50:22] we could make do with one banner [00:51:16] I can fix it now if it's urgent, or I can pass off my recommendations to a CentralNotice developer to do over the next 24 hours if it's not [00:51:19] maybe we can switch them in a couple days [00:51:29] 24 hours is ok [00:51:50] we have a week for the call for participation [00:52:04] and just want to make sure people don't miss it [00:52:07] ok [00:52:08] there's always a banner [00:52:20] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [00:52:27] and we can always extend the deadline [00:52:50] thanks [01:00:27] TimStarling: aude: fwiw, the one instance where i looked a little closer at the resulting DOM had a div#sitenotice with 2 children that were div#centralNotice.cn-default and each of those had 2 children.
all 4 grandchildren were div#mw-wikimaniadc.mw-wikimaniadc [01:00:40] * jeremyb continues catching up [01:02:59] right [01:03:08] probably an issue with the campaign script then [01:04:01] so, easy fix is check if any $('div#sitenotice div#mw-wikimaniadc') exists. and bail [01:05:34] assuming that it's being run multiple times. but idk why it would have that nesting. (i.e. it should be 4 siblings that are identical to each other instead of 2 in one parent and 2 in another?). anyway, worth trying at least [01:06:12] the major issue is that we can never no it's actually gone unless we can repro reliably... [01:06:31] aude: do you want to try implementing that? i don't has meta sysop... [01:08:36] I don't think you can implement it with meta sysop rights [01:08:44] where would you put it? [01:09:19] TimStarling: in the banner [01:10:20] you mean this? http://meta.wikimedia.org/w/index.php?title=MediaWiki:Centralnotice-template-Wikimania2012Program&action=edit [01:10:36] because that's HTML, not a script [01:10:53] TimStarling: https://meta.wikimedia.org/w/index.php?title=Special:NoticeTemplate/view&template=WikimaniaInDC [01:13:18] * jeremyb is getting on phone now so will be less responsive [01:15:01] * aude back [01:33:03] so, is there a verdict? meta sysop is (at least as a quick fix) sufficient to make a change? [01:33:23] (still on phone) [01:51:44] New patchset: Dzahn; "put swift production servers into a nagios hostgroup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3067 [01:51:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3067 [01:53:22] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3067 [01:53:26] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3067 [02:00:58] New review: Dzahn; "looks like this has been done manually meanwhile" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2927 [02:01:01] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2927 [02:13:21] New patchset: Dzahn; "oops, what did i do there, fix the exec for nagios to exim group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3068 [02:13:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3068 [02:15:27] New patchset: Dzahn; "oops, what did i do there, fix the exec for nagios to exim group, meh, get rid of whitespace too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3068 [02:15:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3068 [02:16:08] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3068 [02:16:12] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3068 [02:22:41] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/2989 [02:23:45] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/3024 [02:25:22] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/3036 [02:34:54] jeremyb: we can ask kaldari or whoever tomorrow [02:35:33] aude: can you just do it yourself? or i can give you a diff? 
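A minimal sketch of the two fixes discussed above (Tim's run-once flag and jeremyb's existence check), assuming a hypothetical insertBanner() helper standing in for whatever code actually injects the chosen banner's HTML; this is not the real CentralNotice implementation, only an illustration of the guard logic:

```javascript
// Sketch only; not the actual CentralNotice code. "insertBanner" is a
// hypothetical name, and the selectors are the ones quoted in the log above.
( function ( $ ) {
	var bannerLoaded = false;

	function insertBanner( bannerHtml ) {
		// Tim's suggestion: remember that a banner has already been loaded
		// and refuse to load another one.
		if ( bannerLoaded ) {
			return;
		}
		// jeremyb's quick check: bail out if this campaign's banner is
		// already somewhere in the DOM, however it got there.
		if ( $( 'div#sitenotice div#mw-wikimaniadc' ).length ) {
			return;
		}
		bannerLoaded = true;
		$( 'div#sitenotice' ).html( bannerHtml );
	}

	// $.centralNotice.fn.chooseBanner() would pick one banner by weight and
	// then call insertBanner() exactly once.
}( jQuery ) );
```

The DOM check is closer to the banner-side hack jeremyb wanted to try, while the flag is closer to the fix Tim proposed making inside CentralNotice itself; either one would stop a second banner from being injected.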
[02:36:42] jeremyb: it's not urgent enough for a quick hack [02:37:04] we'll do a new banner in 1-2 days for logged in people, with the cfp deadline on it [02:37:51] aude: okey... [02:38:03] * jeremyb is sleepy anyway! ;) [02:38:27] * aude sleepy [02:38:37] all the more reason not to do a hack now [02:39:04] and this is why i waited until yesterday to do any banners, so we could manage any issues [02:39:07] or problems [02:49:06] New patchset: Dzahn; "swift prod servers to monitoring group, that didnt appear to work because they dont include base" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3069 [02:49:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3069 [02:50:06] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3069 [02:50:09] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3069 [03:01:33] New patchset: Dzahn; "meh, duplicate def from including standard, add nagios hostgroup in base definition then and do this in role::swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3070 [03:01:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3070 [03:03:03] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3070 [03:03:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3070 [03:33:56] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [03:33:56] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [03:35:34] !log upgrading libc6 and related packages on spence [03:35:45] Logged the message, Master [03:40:27] !log same (and nscd) on fenari [03:40:31] Logged the message, Master [04:19:16] RECOVERY - DPKG on spence is OK: All packages OK [04:19:34] RECOVERY - Disk space on spence is OK: DISK OK [04:19:45] !log wanted to restart nagios-nrpe-server on spence with debug=1 to investigate permission issue. arr! "Address already in use" "cant write to pidfile", killed the one started on Feb18, and reordered allowed_hosts, spence talks to itself again now :p [04:19:48] Logged the message, Master [04:20:01] RECOVERY - profiler-to-carbon on spence is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon [04:20:19] RECOVERY - profiling collector on spence is OK: PROCS OK: 1 process with command name collector [04:20:19] RECOVERY - RAID on spence is OK: OK: no RAID installed [04:56:47] New patchset: Dzahn; "change Swift HTTP check to user port 8080" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3071 [04:56:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3071 [04:59:21] New review: Dzahn; "should be able to check them all with the same command, when using port 8080 " [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3071 [04:59:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3071 [05:04:13] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [05:14:16] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [05:15:10] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours [05:15:10] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours [05:16:13] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours [05:16:13] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [05:16:13] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours [05:19:13] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours [05:19:13] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours [05:19:13] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours [05:19:13] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [05:20:16] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours [05:23:33] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [05:23:33] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [05:24:07] New patchset: Dzahn; "..or not. make check_http_swift flexible for different ports and add an if-statement on hostname" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3072 [05:24:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3072 [05:26:05] New patchset: Dzahn; "make check_http_swift flexible for different ports and add an if-statement on hostname" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3072 [05:26:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3072 [05:27:31] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3072 [05:27:34] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3072 [05:32:51] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [05:37:24] !log copper - installing (security) updates (apt,grub,openssl,ruby,libc6..) 
[05:37:28] Logged the message, Master [05:43:44] !log rebooting copper to make sure grub update didnt break it and asked for restart anyways [05:43:48] Logged the message, Master [05:48:00] PROBLEM - SSH on copper is CRITICAL: Connection refused [05:49:21] PROBLEM - Memcached on copper is CRITICAL: Connection refused [05:51:35] RECOVERY - Swift HTTP on magnesium is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 1.064 seconds [05:51:53] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [05:51:53] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours [05:52:29] RECOVERY - SSH on copper is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [05:53:12] !log dunno, copper was stuck (no mgmt output after reboot) but powercycling it and back [05:53:15] Logged the message, Master [05:53:50] RECOVERY - Memcached on copper is OK: TCP OK - 0.027 second response time on port 11211 [05:58:11] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours [05:58:11] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours [05:58:11] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Puppet has not run in the last 10 hours [05:58:11] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [05:58:11] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [05:58:11] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours [05:58:11] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Puppet has not run in the last 10 hours [05:58:12] PROBLEM - Puppet freshness on amssq58 is CRITICAL: Puppet has not run in the last 10 hours [05:58:12] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours [05:58:13] PROBLEM - Puppet freshness on knsq25 is CRITICAL: Puppet has not run in the last 10 hours [05:58:13] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours [05:58:35] TimStarling: Is there an open bug about the CentralNotice issue? [05:58:58] I doubt it [05:59:23] An interesting aspect of the screenshot showing the bug is that it wasn't the exact same banner every time [05:59:40] it looks like it's repetitively requesting and as such getting a new one from the server. [06:00:18] TimStarling: aude: fwiw, the one instance where i looked a little closer at the resulting DOM had a div#sitenotice with 2 children that were div#centralNotice.cn-default and each of those had 2 children. 
all 4 grandchildren were div#mw-wikimaniadc.mw-wikimaniadc [06:00:43] not sure how that is possible [06:01:06] looking at the templates on meta didn't enlighten me [06:02:57] right [06:03:16] iirc, we've seen this bug before during a fundraiser [06:03:26] likely browser and/or skin specific [06:03:31] I've never seen that ever [06:03:47] or race condition related [06:04:13] I can look at it in about 7-8 hours when I get back from school [06:04:30] going offline for now [06:07:02] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [06:07:02] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [06:07:02] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours [06:07:02] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [06:07:02] PROBLEM - Puppet freshness on amssq54 is CRITICAL: Puppet has not run in the last 10 hours [06:07:02] PROBLEM - Puppet freshness on amssq59 is CRITICAL: Puppet has not run in the last 10 hours [06:07:02] PROBLEM - Puppet freshness on amssq39 is CRITICAL: Puppet has not run in the last 10 hours [06:07:03] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [06:07:03] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Puppet has not run in the last 10 hours [06:07:04] PROBLEM - Puppet freshness on amssq55 is CRITICAL: Puppet has not run in the last 10 hours [06:07:04] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Puppet has not run in the last 10 hours [06:07:05] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [06:07:05] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [06:07:06] PROBLEM - Puppet freshness on ssl3004 is CRITICAL: Puppet has not run in the last 10 hours [06:08:05] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [06:08:05] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours [06:08:05] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [06:08:05] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Puppet has not run in the last 10 hours [06:08:05] PROBLEM - Puppet freshness on amssq45 is CRITICAL: Puppet has not run in the last 10 hours [06:08:05] PROBLEM - Puppet freshness on amssq48 is CRITICAL: Puppet has not run in the last 10 hours [06:10:31] !log turning off debug mode in nagios-nrpe, again had to kill it , restart fails [06:10:34] Logged the message, Master [06:11:05] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours [06:11:05] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [06:11:05] PROBLEM - Puppet freshness on amssq42 is CRITICAL: Puppet has not run in the last 10 hours [06:11:05] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [06:11:05] PROBLEM - Puppet freshness on knsq16 is CRITICAL: Puppet has not run in the last 10 hours [06:11:05] PROBLEM - Puppet freshness on knsq18 is CRITICAL: Puppet has not run in the last 10 hours [06:11:05] PROBLEM - Puppet freshness on amssq57 is CRITICAL: Puppet has not run in the last 10 hours [06:11:06] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: Puppet has not run in the last 10 hours [06:11:06] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 
hours [06:11:07] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours [06:12:08] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours [06:13:11] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [06:13:11] PROBLEM - Puppet freshness on knsq22 is CRITICAL: Puppet has not run in the last 10 hours [06:14:05] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [06:14:05] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Puppet has not run in the last 10 hours [06:14:05] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [06:16:02] PROBLEM - Puppet freshness on knsq20 is CRITICAL: Puppet has not run in the last 10 hours [06:27:27] RECOVERY - Swift HTTP on copper is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.127 seconds [06:28:48] RECOVERY - Swift HTTP on zinc is OK: HTTP OK HTTP/1.1 200 OK - 2359 bytes in 0.065 seconds [08:05:20] New review: Dzahn; "i changed this command in change 3072 to fix monitoring on zinc/magnesium/copper who need port 8080 ..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/3036 [10:30:29] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [10:34:32] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours [10:38:35] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [10:43:32] PROBLEM - Puppet freshness on virt1 is CRITICAL: Puppet has not run in the last 10 hours [10:53:35] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [11:14:44] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 181 seconds [11:16:05] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 184 seconds [11:24:20] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [11:24:47] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [11:28:56] New patchset: Mark Bergsma; "Support multiple varnish instances in the ganglia metrics module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3081 [11:29:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3081 [11:30:35] New patchset: Mark Bergsma; "Support multiple varnish instances in the ganglia metrics module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3081 [11:30:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3081 [11:34:52] New patchset: Mark Bergsma; "Support multiple varnish instances in the ganglia metrics module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3081 [11:35:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3081 [11:36:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3081 [11:36:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3081 [11:37:27] PROBLEM - Puppet freshness on mw53 is CRITICAL: Puppet has not run in the last 10 hours [12:03:05] New patchset: Mark Bergsma; "Small fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3082 [12:03:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3082 [12:03:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3082 [12:03:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3082 [12:17:15] New patchset: Mark Bergsma; "Monitor both frontend and backend varnish instances for mobile as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3083 [12:17:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3083 [12:17:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3083 [12:17:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3083 [12:19:35] New patchset: Mark Bergsma; "Add eqiad upload varnish cluster to torrus collection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3084 [12:19:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3084 [12:19:56] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3084 [12:19:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3084 [12:21:58] New review: Mark Bergsma; "Please do per-data center groups, following the same name style as has been done for squids/varnish ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3067 [13:35:06] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [13:35:06] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [14:07:26] typo [14:08:27] its that kind of day. [14:33:08] New patchset: Mark Bergsma; "Support dual-layer varnish clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3086 [14:33:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3086 [14:33:54] god damn it [14:34:00] equinix installed the grounding wron gin all the new cabinets [14:34:02] what the fuck. [14:34:13] can they do nothing right the first time. [14:34:28] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3086 [14:34:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3086 [14:35:19] i take that back, the existing cabinets are wrong too [14:35:30] mark: so none of the grounding is right in any cabinet, which may be why we are having issues [14:35:43] they didnt bother to use star washers, so the grounding is attached to powder coat metal [14:35:46] rather than metal on metal [14:36:17] haha [14:38:22] i hate eq. 
[14:38:41] i cc'd you on the email. [14:39:07] going to have to run to lowes shortly, had to review on site what hardware i have and what i need to buy. [14:47:21] i am just going to buy all the shit i need to regound everything from scratch. [14:48:58] no it's their responsibility [14:49:49] New patchset: Mark Bergsma; "Support dual layer varnish setups, add upload-eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3087 [14:50:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3087 [14:51:37] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3087 [14:51:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3087 [14:53:32] they should probably be grounding the doors too [14:54:45] New patchset: Mark Bergsma; "Missing %" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3088 [14:54:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3088 [14:55:00] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3088 [14:55:02] mark: they wont ground the doors and such, so its ust as easy to buy extra washers [14:55:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3088 [14:55:09] i dont trust them to go in racks with working equipment [14:55:14] as they have fucked it up twice now. [14:55:43] the racks came with the wires ,but some have gone missing, some werent there to begin with, etc... so i just need to buy a spool of groundin gwire and the clamping tools to make them [14:57:29] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.8116703571 (gt 8.0) [14:59:53] ok, leaving laptop in eqiad and goin gto hardware store for stuff [15:00:58] I will be back shortly. i am not done today until grounding is all fixed, tired of all the odd erros. [15:01:38] New patchset: Mark Bergsma; "Set prefix to empty for single layer servers, add a decommission option for dual layer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3089 [15:01:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3089 [15:02:07] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3089 [15:02:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3089 [15:05:35] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [15:09:10] New patchset: Mark Bergsma; "Merge squidlayer and varnishlayer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3090 [15:09:21] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3090 [15:09:56] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3090 [15:09:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3090 [15:15:38] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [15:16:39] !log Rebooted manutius, stuck in a similar state as streber always did [15:16:41] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours [15:16:41] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours [15:16:43] Logged the message, Master [15:17:35] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours [15:17:35] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [15:17:35] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours [15:20:35] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours [15:20:35] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [15:20:35] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours [15:20:35] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours [15:21:38] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours [15:22:41] PROBLEM - Host manutius is DOWN: PING CRITICAL - Packet loss = 100% [15:24:38] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [15:24:38] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [15:24:38] RECOVERY - Host manutius is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [15:34:41] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [15:42:02] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 14.75302 (gt 8.0) [15:45:43] New patchset: Mark Bergsma; "Create a separate vcl_config hash, as retry5x/cache4xx are not backend options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3091 [15:45:54] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3091 [15:47:11] New patchset: Mark Bergsma; "Create a separate vcl_config hash, as retry5x/cache4xx are not backend options" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3091 [15:47:23] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3091 [15:48:37] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3091 [15:48:39] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3091 [15:49:05] PROBLEM - DPKG on snapshot3 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:51:11] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 185 MB (2% inode=61%): /var/lib/ureadahead/debugfs 185 MB (2% inode=61%): [15:53:35] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [15:53:35] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours [15:57:02] New patchset: Mark Bergsma; "Require GET or HEAD for upload, no POST" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3092 [15:57:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3092 [15:57:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3092 [15:57:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3092 [15:59:35] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours [15:59:35] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours [15:59:35] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours [15:59:35] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours [15:59:35] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Puppet has not run in the last 10 hours [15:59:35] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Puppet has not run in the last 10 hours [15:59:35] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [15:59:36] PROBLEM - Puppet freshness on amssq58 is CRITICAL: Puppet has not run in the last 10 hours [15:59:36] PROBLEM - Puppet freshness on knsq25 is CRITICAL: Puppet has not run in the last 10 hours [15:59:37] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [15:59:37] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours [16:05:48] RECOVERY - Disk space on srv221 is OK: DISK OK [16:08:30] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [16:08:30] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [16:08:30] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours [16:08:30] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [16:08:30] PROBLEM - Puppet freshness on amssq39 is CRITICAL: Puppet has not run in the last 10 hours [16:08:30] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Puppet has not run in the last 10 hours [16:08:30] PROBLEM - Puppet freshness on amssq55 is CRITICAL: Puppet has not run in the last 10 hours [16:08:31] PROBLEM - Puppet freshness on amssq54 is CRITICAL: Puppet has not run in the last 10 hours [16:08:31] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [16:08:32] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Puppet has not run in the last 10 hours [16:08:32] PROBLEM - Puppet freshness on amssq59 
is CRITICAL: Puppet has not run in the last 10 hours [16:08:33] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [16:08:33] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [16:08:34] PROBLEM - Puppet freshness on ssl3004 is CRITICAL: Puppet has not run in the last 10 hours [16:09:24] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [16:09:24] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Puppet has not run in the last 10 hours [16:09:24] PROBLEM - Puppet freshness on amssq48 is CRITICAL: Puppet has not run in the last 10 hours [16:09:24] PROBLEM - Puppet freshness on amssq45 is CRITICAL: Puppet has not run in the last 10 hours [16:09:24] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the last 10 hours [16:09:24] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [16:12:24] PROBLEM - Puppet freshness on amssq42 is CRITICAL: Puppet has not run in the last 10 hours [16:12:24] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours [16:12:24] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [16:12:24] PROBLEM - Puppet freshness on amssq57 is CRITICAL: Puppet has not run in the last 10 hours [16:12:24] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [16:12:24] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours [16:12:25] PROBLEM - Puppet freshness on knsq16 is CRITICAL: Puppet has not run in the last 10 hours [16:12:25] PROBLEM - Puppet freshness on knsq18 is CRITICAL: Puppet has not run in the last 10 hours [16:12:26] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [16:12:26] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: Puppet has not run in the last 10 hours [16:13:27] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours [16:14:30] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [16:14:30] PROBLEM - Puppet freshness on knsq22 is CRITICAL: Puppet has not run in the last 10 hours [16:15:24] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [16:15:24] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Puppet has not run in the last 10 hours [16:15:24] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [16:17:30] PROBLEM - Puppet freshness on knsq20 is CRITICAL: Puppet has not run in the last 10 hours [16:18:24] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.3391611607 (gt 8.0) [16:45:15] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2989 [16:45:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2989 [16:48:15] PROBLEM - Host cp1021 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:46] New patchset: Lcarr; "Fixing fixme's from previous change https://gerrit.wikimedia.org/r/#change,2936" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3093 [16:54:58] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3093 [16:56:39] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.56872919643 (gt 8.0) [16:56:57] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/3093 [16:57:15] RECOVERY - Host cp1021 is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms [17:00:33] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [17:02:03] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3093 [17:02:06] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3093 [17:02:57] PROBLEM - Varnish HTTP upload-frontend on cp1021 is CRITICAL: Connection refused [17:03:06] woosters: https://rt.wikimedia.org/Ticket/Display.html?id=2607 [17:03:15] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [17:03:20] whose varnish config diff ? [17:03:25] and is it ok to push ? [17:04:23] got it hexmode [17:06:42] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [17:08:51] LeslieCarr: did I forget to merge smt? [17:08:59] upload-frontend.inc.vcl.erb ? [17:09:00] woosters: :) [17:09:03] guessing is yours ? [17:09:06] yeah [17:09:08] go ahead [17:09:47] merged [17:11:30] PROBLEM - Varnish HTTP upload-frontend on cp1022 is CRITICAL: Connection refused [17:12:06] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [17:13:20] !log PXE booting cp1025-cp1028 [17:13:24] Logged the message, Master [17:14:57] RECOVERY - Varnish HTTP upload-frontend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds [17:16:45] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 14.5513432432 (gt 8.0) [17:25:53] !rt 2446 [17:25:53] https://rt.wikimedia.org/Ticket/Display.html?id=2446 [17:26:00] robh: ^ [17:26:14] New patchset: Mark Bergsma; "Setup dependency for varnish::logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3094 [17:26:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3094 [17:26:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3094 [17:26:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3094 [17:28:58] robh: did you see the rt? [17:29:03] rt! 2446 [17:29:08] cmjohnson1: yea row c [17:29:10] !rt 2446 [17:29:10] https://rt.wikimedia.org/Ticket/Display.html?id=2446 [17:29:12] okay [17:29:32] came a month early ;) [17:30:45] good, we need the racks [17:31:02] i just realized though you dont have cable rings for row c right? [17:31:03] =/ [17:32:26] RobH: any chance you can fix cp1025 today by swapping memory with another unused one (cp1040 or so) [17:33:01] robh: I have cable rings [17:33:12] ordered 2 boxes way back in September [17:33:20] RECOVERY - Varnish HTTP upload-frontend on cp1022 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.054 seconds [17:33:22] yea but you need 5 per rack minimum [17:33:25] preferrred 10 per rack [17:33:29] so you need more right? [17:33:36] 5 on each side. 
[17:33:47] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 2 processes with command name varnishncsa [17:33:48] probably...i don't think I have 30 [17:33:53] 10 per box [17:34:09] the ones we ordered before should fill up row D [17:34:16] and then you should have been out [17:34:28] so they match how they are setup in sdtpa [17:34:43] i am on the last box....but d1 is a little different, only have rings on 1 side [17:34:52] hrmm, thats not ideal. [17:35:01] i wanted all the racks to be identical. [17:35:18] I did it that way because matt and i want to do something different because of the fiber [17:35:28] run a tube [17:35:46] ok, but the tuvbe can go inside the cable rings down to the mx80 [17:35:51] like in sdtpa. [17:36:06] ok...np..ez fix [17:39:27] pls drop a procurement ticket witht he part # of the cable rings [17:39:31] and i will get more ordered for you [17:39:36] so you can balance them out in future ot match [17:39:42] (its not a big deal, easy fix, so on) [17:39:54] im just in a bad mood today so my tone is off. [17:40:22] mark: checking on that [17:40:28] k..i will get that ticket out to you shortly [17:40:33] np [17:40:35] ugh, someone sent OTRS a screenshot of a tracert (windows). but there's nothing obviously wrong with the tracert and the english in the message body is not really understandable [17:41:05] it's another thai ISP fwiw [17:42:30] mark: cp1017 and cp1019 have orange error leds illuminated, did you push them online? (i am seeing the errors on the front now) [17:42:40] looking at cp1025 now. [17:42:56] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.77458491071 (gt 8.0) [17:42:58] those are squids [17:43:03] and they are live yeah [17:45:12] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 2 processes with command name varnishncsa [17:45:40] here's the image from the ticket: http://i.imgur.com/eL5vY.png (2012031210002867) [17:46:05] !log restarting indexer on searchidx2 [17:46:08] Logged the message, and now dispaching a T1000 to your position to terminate you. [17:47:12] on rereading again it looks like there's no images loaded on the enwikip main page (in the background of the screenshot) [17:47:37] well, fixing he grounding in live racks is going to be an issue. [17:47:47] since self tapping screws cause metal flakes to fall on servers. [17:47:49] today sucks. [17:48:28] but he hasn't convinced me that it's not just a local firefox preference to not load pics ;( [17:50:18] RobH: do you know what oxygen.wikimedia.org is being used for? (and why it feels I should be getting nagios alerts?) [17:51:18] its a locke replacement [17:51:27] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/site.pp;h=7b7b693d2d610fe36266cedfe3223ff14b0df6e3;hb=HEAD#l1496 ? [17:51:31] nimish_g: ^ [17:51:41] oh are yoy getting all those emails? :-D [17:52:56] yeah, but I figure as long as the nagios alerts resolve to an even number of messages, it's fine [17:53:46] :-D [17:54:24] New patchset: Mark Bergsma; "Throw cp1022-1028 into the upload eqiad pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3095 [17:54:35] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3095 [17:55:56] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3095 [17:55:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3095 [18:03:43] New patchset: Ryan Lane; "This seems silly, but let's try it." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3096 [18:03:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3096 [18:04:12] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3096 [18:04:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3096 [18:15:14] New patchset: Lcarr; "Declaring nagios monitor machines specifically" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3097 [18:15:22] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3097 [18:16:05] New patchset: Lcarr; "Declaring nagios monitor machines specifically" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3097 [18:16:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3097 [18:18:16] New patchset: Ryan Lane; "You suck puppet. I hate you so much." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3098 [18:18:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3098 [18:18:35] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3098 [18:18:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3098 [18:21:20] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.24173776786 (gt 8.0) [18:58:56] Ryan_Lane: presumably using HTTPS would be slower than HTTP, but it shouldn't be significantly slower should it? [18:59:17] should not be much slower [18:59:20] what issue are you seeing? [19:00:03] AWB using HTTP could do upto 15-20+ edits per minute. Using HTTPS it's been reported only 4-5 epm [19:00:29] Not tested it myself yet, and it might be AWB/.NET config at fault [19:01:36] are you reusing connections? [19:01:43] it'll be *way* slower if you don't [19:01:54] Mmm [19:01:56] connection speeds are the slow part of https [19:01:58] I think we aren't [19:02:02] you shoudl [19:02:08] at least for https [19:02:29] Yeah, that overheads of connection is what I was meaning originally [19:03:38] That looks like it will probably be it... [19:04:57] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [19:22:12] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 14.1719694643 (gt 8.0) [19:36:18] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.18729848214 (gt 8.0) [19:43:21] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms [19:48:32] !rt 2595 [19:48:32] https://rt.wikimedia.org/Ticket/Display.html?id=2595 [19:48:37] robh ^ [19:49:14] thats just a loss of power error [19:49:28] doesnt give a timestamp though [19:49:29] hrmm [19:49:40] it did... 
[19:49:53] 0207 22:36:23 2012 [19:50:01] Feb 07 [19:50:52] updated ticket [19:52:26] bah, was gonna ping ben [19:52:29] but he just parted [19:52:32] bah!! [19:53:37] lunch time there [20:08:02] RobH: hey there [20:08:12] hiya [20:08:13] whats the status on the pediapress labs work? [20:08:24] their asking me pretty regularly about tit [20:08:46] join security and we can chat [20:09:12] RobH: can't [20:09:14] invite only [20:09:28] and i know that i've registered under this nick [20:09:33] lemme fix yer access =] [20:10:12] i'd prefer it under tfinc [20:10:42] tfinc: now its on both [20:10:45] try now [20:10:47] RobH: thanks [20:11:09] Cannot join channel (+r) - you need to be identified with services [20:11:10] fun [20:11:27] tfinc: since I finally see you on. what about that list admin password? [20:11:28] yea you need to ID [20:11:33] tfinc: also turn on enfocement man [20:11:35] there's some mail waiting for moderation I can't get to [20:11:40] or else anyone can pretend to be you [20:11:40] i really don't have the time to set this up right now RobH . just pm me [20:11:44] ok [20:11:56] apergos: sure [20:12:00] apergos: let me pull it up [20:12:42] apergos: how would you like me to send it to you? [20:12:54] hmm [20:12:59] do you have access to any cluster host? [20:15:28] i'm on fenari now [20:15:36] want me to mesg it you you? [20:15:54] apergos: pts/56 ? [20:16:36] um [20:17:37] pts/45 I think [20:20:03] Ryan_Lane: setting keepalive has made some improvement. It seems though the api apaches might just be on a go slow, as reverting back to the http code is only slightly faster [20:20:23] apergos: get my message ? [20:20:24] yes [20:20:33] send away [20:20:38] done [20:20:48] thanks [20:21:01] I will use it immediately, [20:21:04] . [20:26:29] !log cp1040 coming down for hardware stuffs [20:26:32] Logged the message, RobH [20:32:10] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [20:36:04] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours [20:40:07] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [20:45:04] PROBLEM - Puppet freshness on virt1 is CRITICAL: Puppet has not run in the last 10 hours [20:46:43] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.70933339286 [20:46:53] Reedy: is it still dramatically different between http and https with the keepalive option? [20:47:27] Ryan_Lane: testing HTTP now was only doing 1-2 more edits per minute [20:47:36] sounds about right [20:47:38] good to hear [20:47:40] Nooo [20:47:45] not 4 times like it used to (reverted back to the old code check) [20:47:48] HTTP is slow also [20:47:52] * Ryan_Lane nods [20:47:55] at the moment at least [20:50:30] well, lemme know if there's a problem [20:50:36] I can't imagine it being much slower [20:50:53] yeah, I'm not sure why HTTP is really slow at the moment [20:51:10] it does funnel more connections through a smaller number of servers, but those servers are pretty bored [20:51:36] The main slowness was apparently doing the saves [20:51:49] I'll keep an eye on it and see if it improves [20:51:52] * Ryan_Lane nods [20:52:02] stick it in another thread, then queue locally? 
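The slowdown described here is what you would expect if every API request opens a fresh connection: over HTTPS each request then pays a full TCP plus TLS handshake. AWB is a .NET application, so the following is only an illustration of the general keep-alive pattern in Node.js, against a hypothetical query URL, not how AWB itself talks to the API:

```javascript
// Illustration only: reuse one HTTPS connection across requests instead of
// paying a TCP + TLS handshake per edit. The endpoint here is a stand-in.
const https = require( 'https' );

// keepAlive keeps the socket (and its TLS session) open between requests.
const agent = new https.Agent( { keepAlive: true, maxSockets: 1 } );

function apiGet( path ) {
	return new Promise( ( resolve, reject ) => {
		https.get( { host: 'en.wikipedia.org', path: path, agent: agent }, ( res ) => {
			let body = '';
			res.on( 'data', ( chunk ) => { body += chunk; } );
			res.on( 'end', () => resolve( body ) );
		} ).on( 'error', reject );
	} );
}

// With the keep-alive agent, the second request skips the handshake entirely.
apiGet( '/w/api.php?action=query&meta=siteinfo&format=json' )
	.then( () => apiGet( '/w/api.php?action=query&meta=siteinfo&format=json' ) )
	.then( () => agent.destroy() );
```

The same idea applies in any HTTP client: hold the connection open across edits rather than reconnecting per request, which appears to be what the keepalive change above bought back.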
[20:52:03] Could be other network esk factors affecting the guys speed [20:55:07] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [21:38:49] PROBLEM - Puppet freshness on mw53 is CRITICAL: Puppet has not run in the last 10 hours [21:39:26] !log search1014 repaired per rt 2483 [21:39:30] Logged the message, RobH [21:39:45] notpeter: search1014 is fixed, rebooting it to disable the onboard sata ports as its got a perc controller [21:39:52] but will be done in a few and its all yours whenever you want it [21:40:05] sorry it took so long, first dead mainboard, which killed some memory, then had to get more meory [21:40:09] memory even [21:43:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:45:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.751 seconds [21:51:32] Ahm, apergos ? [21:51:34] snapshot1003: @ERROR: access denied to common from snapshot1003.eqiad.wmnet (10.64.16.141) [21:51:36] snapshot1003: rsync error: error starting client-server protocol (code 5) at main.c(1524) [Receiver=3.0.7] [21:51:37] snapshot1004: @ERROR: access denied to common from snapshot1004.eqiad.wmnet (10.64.16.142) [21:51:39] snapshot1004: rsync error: error starting client-server protocol (code 5) at main.c(1524) [Receiver=3.0.7] [21:52:04] baaahhhh [21:52:09] dang it [21:52:22] what am I supposed to add them to? [21:52:33] I have no idea what that's about or how to fix it [21:52:34] oh. they probably didn't have some sync apache thing run on them [21:52:36] for crhissakes [21:52:44] I just ran sync-file and this error came up [21:52:57] this is what I get for finally adding them to the dsh groups [21:53:10] It might just be denying stuff from eqiad [21:53:15] We've had problems like that before [21:53:18] can you ignore it for now and I'll look at it tomorrow? [21:53:21] Sure [21:53:34] searchidx1001 was broken because the pmtpa es servers only allowed connections from pmtpa [21:53:43] ok, I'll add it to my todos [21:54:01] I'll see [21:54:01] Oh, no, it's not anymore [21:54:04] Tim and Peter fixed it [21:54:08] that was fast [21:54:16] Just using that as an illustration that this might be Just Another Eqiad Bug [21:55:06] oh. yeah no I meant I will look at snaps tomorrow [21:55:16] you did not mean that they just now magically fixed them [21:55:22] Oh [21:55:30] No I meant Tim&Peter fixed the searchidx1001 issue [21:55:30] see what I mean, it's too late for productive work :-D [21:55:34] heh [21:56:17] what sync command did you run btw? [21:57:58] RobH: a winrar is you [21:58:00] thank you! [21:59:12] welcome [21:59:26] is it booting up at the moment? [22:11:58] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%): [22:14:02] RobH: can you take a look at search1014 again? 
[22:14:07] it's not booting up via console [22:14:29] sure, middle of creating some shipments, gimme a couple minutes and its my next task [22:14:48] yup, no prob [22:15:52] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 192 MB (2% inode=61%): /var/lib/ureadahead/debugfs 192 MB (2% inode=61%): [22:15:52] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 8 MB (0% inode=61%): /var/lib/ureadahead/debugfs 8 MB (0% inode=61%): [22:15:52] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 187 MB (2% inode=61%): /var/lib/ureadahead/debugfs 187 MB (2% inode=61%): [22:15:52] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 85 MB (1% inode=61%): /var/lib/ureadahead/debugfs 85 MB (1% inode=61%): [22:15:52] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 78 MB (1% inode=61%): /var/lib/ureadahead/debugfs 78 MB (1% inode=61%): [22:17:12] RoanKattouw: what server am I proxying to? do you have an rt open for this? [22:17:22] that way I can go back and reference it when I forget the server name again :) [22:19:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:03] anyone else having issues with gerrit and pushing to it ? [22:20:18] got a missing Change-Id even though my commit has a change id right now .. grrr [22:20:47] ok, mx80s going out to esams from eqiad tomorrow. [22:21:02] where did my day go ;_; [22:21:03] Some volunteers were complaining about something with gerrit earlier [22:21:14] I got here at 9:30, and it seems like I got so very little done. [22:21:18] LeslieCarr: I get that on rare occasionally, and usually it has to do with something to do with a screwed up local repo [22:21:24] here's how I fix it: [22:21:41] git checkout -b gerritisapain origin/production [22:21:56] git cherry-pick [22:22:00] then push it in [22:22:09] then, git checkout [22:22:17] git reset --hard origin/production [22:22:24] git branch -D gerritisapain [22:23:16] * RobH copies that down for when gerrit screws him in a similar fashion [22:23:26] and it will. [22:23:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.751 seconds [22:25:55] RECOVERY - Disk space on srv222 is OK: DISK OK [22:26:04] RECOVERY - Disk space on srv219 is OK: DISK OK [22:26:56] heh [22:27:20] I just went through the reset mixed dance today [22:27:23] not too exciting [22:29:58] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 248 MB (3% inode=61%): /var/lib/ureadahead/debugfs 248 MB (3% inode=61%): [22:30:16] RECOVERY - Disk space on srv223 is OK: DISK OK [22:30:16] RECOVERY - Disk space on srv220 is OK: DISK OK [22:31:37] Ryan_Lane: Cadmium [22:31:44] cadmium.eqiad.wmnet that is [22:31:51] No RT ticket currently [22:31:55] RECOVERY - Disk space on srv221 is OK: DISK OK [22:31:55] RECOVERY - Disk space on srv224 is OK: DISK OK [22:32:43] ok [22:36:07] gah. I was going to go sleep what, an hour ago? [22:38:49] * apergos gets [22:40:28] Ok, going to start cleaning up. It will be 7pm before I leave. [22:43:14] New patchset: Lcarr; "Adding in customized init file, preparing icinga for initial install Removing contact groups so that we don't get deluged by email in initial install" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3099 [22:43:26] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3099 [22:55:20] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3099 [22:55:23] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3099 [22:57:31] New patchset: Lcarr; "including nagios configuration group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3100 [22:57:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3100 [22:58:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:02:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.453 seconds [23:11:07] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3097 [23:11:40] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3100 [23:11:42] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3100 [23:16:11] New patchset: Lcarr; "fixing network::checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3101 [23:16:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3101 [23:16:35] Change abandoned: Lcarr; "abandoning due to conflicts, redoing change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3097 [23:17:14] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3101 [23:17:17] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3101 [23:17:37] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2956 [23:29:59] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/3036 [23:30:46] maplebed: does ?action=purge send an HTCP message to the Squids right now? [23:31:28] yes. [23:31:40] (but only for files that exist on ms5) [23:36:42] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [23:36:42] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [23:38:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:42:23] maplebed: is ms-fe* public or internally ip'ed ? [23:42:39] LeslieCarr: internal. [23:42:44] thx [23:44:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.812 seconds