[00:01:38] !log oxygen install done, booting successfully after multiple tests, now running puppet for initial config
[00:01:42] Logged the message, RobH
[00:01:47] then im done.
[00:07:26] New patchset: Asher; "pretty graphs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3002
[00:07:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3002
[00:08:17] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3002
[00:08:20] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3002
[00:11:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:19:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.963 seconds
[00:20:56] maplebed: http://www.meetup.com/openstack/events/55240842/
[00:21:55] Ryan_Lane: already on my calendar!
[00:21:56] :)
[00:22:01] heh
[00:22:21] I'm hoping some swift people will come
[00:23:57] Ryan_Lane: can you get http://www.wmflabs.org/ to redirect to labconsole.wikimedia.org ?
[00:24:05] labsconsole*
[00:24:16] I don't think we want to do that.
[00:24:20] or would that cause problems with all the current instances
[00:24:22] we really need some kind of portal page there
[00:24:31] hmm yes
[00:25:07] <^demon> Ryan_Lane: Just slap a unicorn picture on an index.html :)
[00:25:26] heh
[00:25:40] it would be really great to have something that links to all active labs sites
[00:25:59] Ryan_Lane: i actually noticed labs is not on wikimediafoundation.org's list of sites
[00:26:15] like, oh hey, here's beta, and here's huggle, and here's education, etc. etc.
[00:38:33] binasher: gdash is pretty awesome
[00:38:56] thanks!
[00:39:19] it's super nyan catish
[00:39:27] https://graphite.wikimedia.org/render/?title=edit%20submits/min%20Top%2010%20Non-Wikipedia%20Wikis%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=cactiStyle(substr(highestMax(exclude(reqstats.edits.*.submits,%22wikiped%22),10),2,3))
[00:39:36] o.O
[00:39:49] heh. the graphics come from graphite, eh?
[00:40:14] uh oh.. was that supposed to link to "no data"
[00:41:28] The parentheses are part of the URL
[00:42:00] nyan cat went vector!
[00:45:14] New patchset: Dzahn; "add nagios::monitor::snmp to spence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3003
[00:45:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3003
[00:46:22] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3003
[00:46:25] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3003
[00:51:28] * Ryan_Lane has a seizure
[00:51:28] https://graphite.wikimedia.org/render/?title=Top%2010%20API%20Methods%20by%20Max%20Average%20Time%20(ms)%20log(2)%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle(substr(highestMax(maximumAbove(API.*.tavg,1),10),0,2))
[00:54:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:55:07] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours
[00:57:00] !log updated pyfribidi to 0.11.0 fixing https://github.com/pediapress/pyfribidi/issues/2
[00:57:01] !log updated pyfribidi to 0.11.0 fixing https://github.com/pediapress/pyfribidi/issues/2
[00:57:03] Logged the message, Master
[00:57:06] Logged the message, Master
[00:57:27] Ryan_Lane: i need to work on adding an &strobe=true option
[00:57:36] !log updated pyfribidi to 0.11.0 fixing https://github.com/pediapress/pyfribidi/issues/2
[00:57:39] Logged the message, Master
[00:57:40] what would strobe do?
[00:57:51] ah.heh
[00:58:12] binasher: about to lock down spence snmp traps , you wrote "block should include labs host", which made perfect sense to me, just a bit surprised to see stuff like "spence snmptrapd[1366]: 2012-03-09 00:56:36 venus.pmtpa.wmflabs"
[00:58:28] so labs talking to spence,, but it has its own nagios.. hmm
[00:58:50] yes
[00:58:58] that's because we haven't fixed puppet yet :)
[00:59:15] it should talk to an snmptrap enabled on nagios.pmtpa.wmflabs
[00:59:15] Ryan_Lane: it would make you want to stare at it all night
[00:59:21] :D
[00:59:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.306 seconds
[01:00:11] mutante: yeah, i was surprised to see a bunch of unnamed labs instances generating "unknown host" errors in the nagios log on spence. thanks for grabbing that ticket!
[01:00:27] I've been meaning to get to it
[01:00:40] it would be awesome for the labs nagios to report broken puppet too
[01:01:09] alright, so still ok if i drop those from labs
[01:01:12] Ryan_Lane: so it is definitely the labs-in4 filter
[01:01:17] the big question is why
[01:01:18] grrr
[01:01:20] mutante: yes
[01:03:16] heh, and unrelated but "spence nrpe[14209]: Host 208.80.152.161 is not allowed to talk to us!", and that IP is ... spence ;p
[01:03:31] hah nice
[01:08:15] binasher: hmm, why are the LockManager and StreamFile graphs combined...I don't think the later is as general as it may seem ;)
[01:08:20] http://gdash.wikimedia.org/dashboards/filebackend/
[01:11:41] AaronSchulz: should all three (FileBackend, LockManager, and StreamFile) be split out?
[01:12:03] LockManager can be combined with FileBackend provided the graphs don't look useless
[01:12:15] StreamFile should be on its own
[01:12:33] but FB + LM = good?
[01:12:53] yes, as long as it's not too cluttered
[01:13:05] that correspond well, yes
[01:14:53] s/that/they
[01:15:50] root@i-0000013c:~# telnet 10.0.0.43 111
[01:15:51] Trying 10.0.0.43...
[01:15:51] Connected to 10.0.0.43.
[01:15:51] Escape character is '^]'.
[01:18:46] AaronSchulz: is even having streamfile in gdash useful at this point? it appears rarely called
[01:19:06] thumb.php should be calling it a fair amount
[01:19:18] LeslieCarr: fyi, i made changes in the old nagios::monitor , moved snmp related stuff into nagios::monitor::snmp, and applied to spence. i did not touch the new class though, where this stuff was currently duplicated, so they differ now...hmm
[01:21:15] ok
[01:23:00] oh, and what is the subnet inside 10.0.0.0/8 that is used by labs, if you happen to know
[01:23:15] 10.4.0.0/24 and 10.4.16.0/24 are
[01:23:19] thx
[01:23:20] right now
[01:24:59] AaronSchulz: how does wfStreamThumb relate to StreamFile?
[01:27:39] binasher: when the thumbnail is created, $thumb->streamFile( $headers ) is called, which uses StreamFile::stream()
[01:29:25] New review: Lcarr; "I actually did this because nagios/icinga was breaking since no routers group was showing up/being a..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2956
[01:31:03] AaronSchulz: wfStreamThumb appears to be getting called far more than StreamFile::stream -- i wonder if thumb.php profiling data isn't always making it to the collector's "all" db
[01:33:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:34:27] actually http://noc.wikimedia.org/cgi-bin/report.py?db=thumb-1.19 matches whats in graphite, SF::stream gets called < 10% of the time wfStreamThumb does. ok..
[01:38:42] binasher: so there must be many requests for thumbs that are not on ms5 but who's name normalizes to that of a file on ms5....or something
[01:38:45] maplebed: re, swift monitoring,,all those different processes. first i thought about writing a custom check script that checks them all and just turns crit if the number of procs is too low, but combined into a single "service" in nagios, then "but you'll want it to also report which process is the missing one right away", then back to "then you can just as well have a separate check for each service"...
[01:39:19] I don't think the number of each process is important
[01:39:23] or wonk profiling ;)
[01:39:23] just that each one exists.
[01:39:40] AaronSchulz: that makes sense.. i also see a lot of requests for thumbs in resolutions larger than the orig which might hit that
[01:39:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.685 seconds
[01:40:26] maplebed: i meant the overall number of procs starting with swift-
[01:40:59] binasher: yeah that is a subcase
[01:41:12] oh, just that there are a lot of different services (-account-sync, -account-maint, -account-server, etc.)
[01:41:14] yeah.
[01:41:18] I don't have a good answer...
[01:41:32] binasher: wait, I bet swift is copying those in to cache then ;)
[01:41:50] one could keep requesting 1px larger and keeping taking up space
[01:42:03] i think those requests result in a 500 response now
[01:42:16] if you request a thumb that's larger than the original it refuses.
[01:42:20] so they should always fall back to thumb.php
[01:42:20] (the back end scalers refuse)
[01:42:27] the 500s don't call stream() though
[01:43:47] maplebed: i tend to just create a separate check for each process then, that way its more obvious what exactly stopped running and the only drawback is more services for nagis to handle
[01:43:57] binasher: so there must be other subcases at work
[01:44:39] * AaronSchulz still has a similar worry with those (e.g. different forms that normalize to the same file)
[01:45:00] of course once swift uses thumb.php and lets it do the swift write directly this won't matter
[01:45:15] (uses thumb.php directly I mean)
[01:46:52] binasher: it still is interesting what these requests are that normalize to actual thumbnails
[01:47:17] I can imagine possibilities, but empirically it might be nice to see specific what we getting
[01:47:21] *specifically
[01:47:24] * AaronSchulz sighhhs
[01:48:04] no thanks, Sarah Munroe,i don't think we want to sell all our domains to your private venture capital broker. even if you keep sending that template for every domain now :p
[01:48:14] yeah, it would be
[01:48:54] maybe the base name suffix is not actually the source file name but something a bit different, or what not
[01:50:36] at some point i'm hoping we'll have xhprof on an apache or two and record full request traces. so we'd be able to find one that resulted in a call to wfStreamThumb but not StreamFile and see what the actual request was
[01:53:27] PROBLEM - udp2log log age on oxygen is CRITICAL: NRPE: Command check_udp2log_log not defined
[01:53:45] PROBLEM - udp2log processes on oxygen is CRITICAL: NRPE: Command check_udp2log_procs not defined
[02:06:51] binasher: I wonder why deleteInternal has nothing on the graphs
[02:07:01] * AaronSchulz is suspicious
[02:10:11] deleteinternal is on there
[02:11:23] ok, I see few dots over the last few days
[02:11:24] meh
[02:11:41] compare the count to other filebackendstore functions on http://noc.wikimedia.org/cgi-bin/report.py?db=all&sort=count&limit=5000
[02:12:54] 2% sampling starts to suck when looking at things that happen infrequently
[02:13:11] time to head home
[02:16:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:18:48] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours
[02:20:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1
400 Bad Request - 335 bytes in 8.429 seconds [02:42:39] RECOVERY - Puppet freshness on mw1010 is OK: puppet ran at Fri Mar 9 02:42:28 UTC 2012 [03:25:42] PROBLEM - check_minfraud_primary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:30:12] RECOVERY - check_minfraud_primary on payments4 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 118 bytes in 0.198 second response time [03:46:25] New patchset: Dzahn; "nagios,snmp,iptables: allow private IP ranges, but not labs 10.4.0.0/24 and 10.4.16.0/24" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3004 [03:46:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3004 [04:46:05] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3004 [04:46:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3004 [04:58:34] New patchset: Dzahn; "can't use Iptables_add_service[lo_all] etc. twice, duplicate definition, so merging nsca and nrpe iptables rules into one nagios::monitoring::firewall class instead of repeating everything all over with different names, also disallow nsca from labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3005 [04:58:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3005 [05:01:16] New patchset: Dzahn; "can't use Iptables_add_service[lo_all] etc. twice, duplicate definition, so merging nsca and nrpe iptables rules into one nagios::monitoring::firewall class instead of repeating everything all over with different names, also disallow nsca from labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3005 [05:01:28] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3005 [05:02:53] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3005 [05:02:55] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3005 [05:07:50] New patchset: Dzahn; "iptables_add_exec{ "${hostname} would also be a duplicate .. we need 2 of them though, right?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3006 [05:08:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3006 [05:08:36] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3006 [05:08:39] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3006 [05:22:42] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [05:22:42] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [06:07:22] New patchset: Ryan Lane; "Up the version of user-management scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3007 [06:07:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3007 [06:07:55] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3007 [06:07:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3007 [06:09:26] New patchset: Ryan Lane; "Adding manage-volumes script link" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3008 [06:09:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3008 [06:09:50] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3008 [06:09:52] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3008 [06:35:49] New patchset: Dzahn; "typo: snmaptrap :p we aren't portscanning here .. and let's remove all "-" just in case" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3009 [06:36:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3009 [06:38:29] !log reloading autofs on all labs instances [06:38:33] Logged the message, Master [06:38:38] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3009 [06:38:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3009 [06:53:43] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [06:54:07] come on spence.. come on [07:02:52] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [07:02:52] New patchset: Ryan Lane; "Add manage-volume cron on labstore2, and add an ircecho bot for the logfile" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3010 [07:03:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3010 [07:03:16] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3010 [07:03:18] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3010 [07:04:58] New patchset: Ryan Lane; "Placing the script properly on labstore2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3011 [07:05:09] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3011 [07:05:14] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3011 [07:05:17] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3011 [07:11:52] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [07:11:52] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [07:28:12] New patchset: Dzahn; "nagios monitoring for mw profiling collector and graphite (RT-2367)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3012 [07:28:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3012 [07:30:20] New patchset: Dzahn; "nagios monitoring for mw profiling collector and graphite (RT-2367)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3012 [07:30:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3012 [07:32:17] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3012 [07:32:20] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3012 [07:36:45] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%): /var/lib/ureadahead/debugfs 0 MB (0% inode=95%): [07:37:48] PROBLEM - DPKG on snapshot1 is CRITICAL: Connection refused by host [07:38:15] PROBLEM - Disk space on snapshot1 is CRITICAL: Connection refused by host [07:38:42] PROBLEM - DPKG on virt1 is CRITICAL: Connection refused by host [07:38:42] PROBLEM - RAID on ms5 is CRITICAL: Connection refused by host [07:39:00] PROBLEM - RAID on snapshot1 is CRITICAL: Connection refused by host [07:39:09] PROBLEM - DPKG on cp1010 is CRITICAL: Connection refused by host [07:39:09] PROBLEM - RAID on mw8 is CRITICAL: Connection refused by host [07:39:18] PROBLEM - MySQL disk space on db25 is CRITICAL: Connection refused by host [07:39:18] PROBLEM - DPKG on db25 is CRITICAL: Connection refused by host [07:39:27] PROBLEM - Full LVS Snapshot on db25 is CRITICAL: Connection refused by host [07:39:27] PROBLEM - RAID on mw1051 is CRITICAL: Connection refused by host [07:39:36] PROBLEM - Disk space on ms5 is CRITICAL: Connection refused by host [07:39:45] PROBLEM - Disk space on cp1010 is CRITICAL: Connection refused by host [07:39:45] PROBLEM - MySQL Idle Transactions on db12 is CRITICAL: Connection refused by host [07:39:45] PROBLEM - RAID on mw12 is CRITICAL: Connection refused by host [07:39:45] PROBLEM - MySQL Idle Transactions on db25 is CRITICAL: Connection refused by host [07:39:45] PROBLEM - DPKG on ms5 is CRITICAL: Connection refused by host [07:39:54] PROBLEM - DPKG on mw8 is CRITICAL: Connection refused by host [07:39:54] PROBLEM - MySQL Recent Restart on db12 is CRITICAL: Connection refused by host [07:39:54] that 
was me :p fixing [07:39:57] dan [07:40:03] PROBLEM - MySQL Recent Restart on db25 is CRITICAL: Connection refused by host [07:40:12] PROBLEM - DPKG on mw1051 is CRITICAL: Connection refused by host [07:40:21] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: Connection refused by host [07:40:21] PROBLEM - RAID on srv255 is CRITICAL: Connection refused by host [07:40:21] PROBLEM - Disk space on virt1 is CRITICAL: Connection refused by host [07:40:21] PROBLEM - MySQL Replication Heartbeat on db25 is CRITICAL: Connection refused by host [07:40:30] PROBLEM - DPKG on db1018 is CRITICAL: Connection refused by host [07:40:30] PROBLEM - Disk space on mw8 is CRITICAL: Connection refused by host [07:40:30] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: Connection refused by host [07:40:39] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: Connection refused by host [07:40:39] PROBLEM - Disk space on db1018 is CRITICAL: Connection refused by host [07:40:39] PROBLEM - MySQL Slave Running on db1018 is CRITICAL: Connection refused by host [07:40:39] PROBLEM - Disk space on mw1051 is CRITICAL: Connection refused by host [07:40:39] PROBLEM - RAID on cp1010 is CRITICAL: Connection refused by host [07:40:48] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: Connection refused by host [07:40:48] PROBLEM - MySQL Slave Delay on db25 is CRITICAL: Connection refused by host [07:40:48] PROBLEM - mysqld processes on db25 is CRITICAL: Connection refused by host [07:40:48] PROBLEM - mysqld processes on db1018 is CRITICAL: Connection refused by host [07:40:48] PROBLEM - DPKG on mw12 is CRITICAL: Connection refused by host [07:40:57] PROBLEM - Disk space on mw12 is CRITICAL: Connection refused by host [07:40:57] PROBLEM - Disk space on db25 is CRITICAL: Connection refused by host [07:40:57] PROBLEM - RAID on db25 is CRITICAL: Connection refused by host [07:40:57] PROBLEM - RAID on mw1127 is CRITICAL: Connection refused by host [07:40:57] PROBLEM - MySQL Slave Running on 
db25 is CRITICAL: Connection refused by host [07:40:58] PROBLEM - Full LVS Snapshot on db1018 is CRITICAL: Connection refused by host [07:40:58] PROBLEM - RAID on db12 is CRITICAL: Connection refused by host [07:41:06] PROBLEM - DPKG on srv255 is CRITICAL: Connection refused by host [07:41:06] PROBLEM - Disk space on db12 is CRITICAL: Connection refused by host [07:41:06] PROBLEM - MySQL Slave Running on db12 is CRITICAL: Connection refused by host [07:41:06] PROBLEM - Disk space on srv221 is CRITICAL: Connection refused by host [07:41:15] PROBLEM - mysqld processes on db12 is CRITICAL: Connection refused by host [07:41:15] PROBLEM - DPKG on db12 is CRITICAL: Connection refused by host [07:41:15] PROBLEM - Full LVS Snapshot on db12 is CRITICAL: Connection refused by host [07:41:24] PROBLEM - DPKG on mw1083 is CRITICAL: Connection refused by host [07:41:24] PROBLEM - DPKG on srv221 is CRITICAL: Connection refused by host [07:41:24] PROBLEM - MySQL Idle Transactions on db1018 is CRITICAL: Connection refused by host [07:41:24] PROBLEM - Disk space on srv255 is CRITICAL: Connection refused by host [07:41:24] PROBLEM - MySQL disk space on db12 is CRITICAL: Connection refused by host [07:41:25] PROBLEM - MySQL disk space on db1018 is CRITICAL: Connection refused by host [07:41:25] PROBLEM - DPKG on mw1127 is CRITICAL: Connection refused by host [07:42:01] sorry, mistake with iptables... no worries [07:46:10] heh [07:46:37] arr, yeah, i wanted to manually get rid of 2 rules, because puppet doesnt remove them [07:46:55] so i used line numbers to delete them [07:47:06] line numbers didn't match up? 
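(Editorial aside on the iptables slip being discussed at this point in the log: two rules were deleted by line number, and removing the first one renumbered the second. The snippet below is only a minimal sketch of that pitfall in plain Python, not the commands that were actually run. The rule strings and both function names are hypothetical, and the list merely stands in for a numbered firewall chain.)

```python
def naive_delete(rules, line_numbers):
    """Delete 1-based line numbers in the order given.

    Each deletion shifts every later rule up by one, so any later
    line number now points at a different rule than intended.
    """
    remaining = list(rules)
    for n in line_numbers:
        del remaining[n - 1]
    return remaining


def safe_delete(rules, line_numbers):
    """Delete 1-based line numbers from highest to lowest.

    Removing from the bottom up never renumbers a rule that is
    still waiting to be deleted.
    """
    remaining = list(rules)
    for n in sorted(set(line_numbers), reverse=True):
        del remaining[n - 1]
    return remaining


# Hypothetical chain: lines 2 and 3 are the leftovers to remove.
rules = [
    "allow loopback",
    "stale rule A",
    "stale rule B",
    "allow nrpe from monitoring host",
    "drop everything else",
]

broken = naive_delete(rules, [2, 3])  # second delete hits the wrong rule
fixed = safe_delete(rules, [2, 3])    # both stale rules removed
print(broken)
print(fixed)
```

(In this made-up chain the naive order keeps "stale rule B" and drops the monitoring allow rule instead. Deleting from the highest line number down, or re-listing the chain between deletions, avoids the shift.)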
[07:47:13] but of course after you delete the first one [07:47:20] the line number changes for the second one :o [07:47:30] sweet [07:48:10] PROBLEM - MySQL Recent Restart on db1018 is CRITICAL: Connection refused by host [07:48:10] PROBLEM - DPKG on mw43 is CRITICAL: Connection refused by host [07:48:10] PROBLEM - Disk space on mw1127 is CRITICAL: Connection refused by host [07:48:19] PROBLEM - Full LVS Snapshot on db54 is CRITICAL: Connection refused by host [07:48:28] PROBLEM - Full LVS Snapshot on db22 is CRITICAL: Connection refused by host [07:48:28] PROBLEM - MySQL disk space on db22 is CRITICAL: Connection refused by host [07:48:28] PROBLEM - MySQL disk space on db32 is CRITICAL: Connection refused by host [07:48:28] PROBLEM - Full LVS Snapshot on db32 is CRITICAL: Connection refused by host [07:48:37] PROBLEM - RAID on mw43 is CRITICAL: Connection refused by host [07:48:45] as long as none of those page :-P [07:48:58] uhm, stopped nagios to prevent paging [07:49:17] but now that i start it again..it still remembers [07:51:32] stopping gammu [07:55:54] no pages sent, stopping bot until it clears up [08:01:47] cool [08:01:52] PROBLEM - RAID on cp1016 is CRITICAL: Connection refused by host [08:01:52] PROBLEM - Disk space on cp1007 is CRITICAL: Connection refused by host [08:01:52] PROBLEM - DPKG on cp1008 is CRITICAL: Connection refused by host [08:01:52] PROBLEM - Disk space on mw1106 is CRITICAL: Connection refused by host [08:01:52] PROBLEM - Disk space on srv259 is CRITICAL: Connection refused by host [08:01:53] PROBLEM - DPKG on cp1041 is CRITICAL: Connection refused by host [08:01:53] PROBLEM - Disk space on srv203 is CRITICAL: Connection refused by host [08:01:57] heh [08:02:01] PROBLEM - Disk space on mw1084 is CRITICAL: Connection refused by host [08:02:10] PROBLEM - RAID on srv233 is CRITICAL: Connection refused by host [08:02:30] PROBLEM - RAID on cp1001 is CRITICAL: Connection refused by host [08:02:31] PROBLEM - MySQL Slave Running on db43 is 
CRITICAL: Connection refused by host [08:02:31] PROBLEM - MySQL Recent Restart on db52 is CRITICAL: Connection refused by host [08:02:31] PROBLEM - Disk space on db43 is CRITICAL: Connection refused by host [08:02:31] PROBLEM - RAID on db52 is CRITICAL: Connection refused by host [08:02:37] PROBLEM - DPKG on nfs1 is CRITICAL: Connection refused by host [08:02:37] PROBLEM - MySQL Idle Transactions on db24 is CRITICAL: Connection refused by host [08:02:37] PROBLEM - mailman on sodium is CRITICAL: Connection refused by host [08:02:37] PROBLEM - Disk space on sodium is CRITICAL: Connection refused by host [08:02:37] PROBLEM - DPKG on srv233 is CRITICAL: Connection refused by host [08:02:46] PROBLEM - Disk space on srv275 is CRITICAL: Connection refused by host [08:02:46] PROBLEM - RAID on nfs1 is CRITICAL: Connection refused by host [08:02:46] PROBLEM - RAID on mw1084 is CRITICAL: Connection refused by host [08:02:46] PROBLEM - MySQL Recent Restart on db24 is CRITICAL: Connection refused by host [08:02:46] PROBLEM - RAID on db24 is CRITICAL: Connection refused by host [08:02:47] PROBLEM - MySQL Replication Heartbeat on db52 is CRITICAL: Connection refused by host [08:03:01] puppet started it.but need to run puppet .. [08:04:46] running a loop to make sure gammu stays down :p [09:02:38] RECOVERY - RAID on cp1015 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [09:03:32] RECOVERY - DPKG on cp1015 is OK: All packages OK [09:03:41] RECOVERY - Disk space on cp1015 is OK: DISK OK [10:56:23] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours [11:19:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:19:51] New patchset: Mark Bergsma; "Create a Varnish VCL consisting of all Wikimedia networks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3013 [11:20:02] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3013 [11:20:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3013 [11:20:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3013 [11:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.425 seconds [11:25:04] New patchset: Mark Bergsma; "Add missing ;" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3014 [11:25:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3014 [11:25:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3014 [11:25:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3014 [11:29:05] PROBLEM - Varnish HTTP bits on sq67 is CRITICAL: Connection refused [11:29:53] New patchset: Mark Bergsma; "Hate Varnish persisting usage of ACLs and backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3015 [11:30:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3015 [11:30:33] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3015 [11:30:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3015 [11:33:36] New patchset: Mark Bergsma; "Hate Puppet's assignment rules too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3016 [11:33:47] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3016 [11:34:04] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3016 [11:34:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3016 [11:35:29] New patchset: Mark Bergsma; "Syntax" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3017 [11:35:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3017 [11:35:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3017 [11:35:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3017 [11:38:14] New patchset: Mark Bergsma; "Change parameter defaults" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3018 [11:38:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3018 [11:38:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3018 [11:38:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3018 [11:40:16] RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 627 bytes in 0.004 seconds [11:43:11] New patchset: Mark Bergsma; "Make varnish work for upload thumbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3019 [11:43:22] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3019 [11:43:50] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3019 [11:43:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3019 [11:47:07] New patchset: Mark Bergsma; "Syntax, missing include" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3020 [11:47:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3020 [11:47:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3020 [11:47:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3020 [11:54:31] New patchset: Mark Bergsma; "Make xff_sources parameter a normal array without hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3021 [11:54:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3021 [11:55:12] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3021 [11:55:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3021 [11:56:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.046 seconds [12:07:45] New patchset: Mark Bergsma; "Fix template error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3022 [12:07:57] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3022 [12:08:37] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3022 [12:08:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3022 [12:20:37] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [12:37:11] New patchset: Mark Bergsma; "Put the probe in the upload cluster specific include file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3023 [12:37:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3023 [12:37:30] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3023 [12:37:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3023 [12:38:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:42:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.585 seconds [13:17:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.614 seconds [13:59:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:03:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.356 seconds [14:09:45] PROBLEM - SSH on ssl1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:13:39] RECOVERY - SSH on ssl1003 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:23:42] PROBLEM - SSH on ssl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:25:48] RECOVERY - SSH on ssl1001 is OK: SSH OK - 
OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:27:09] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [14:29:06] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:34:21] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [14:39:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:43:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.986 seconds [14:46:21] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [14:47:25] New patchset: Demon; "Adding myself to new gerrit box" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3024 [14:47:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3024 [14:59:33] PROBLEM - SSH on ssl1003 is CRITICAL: Server answer: [15:00:54] PROBLEM - SSH on ssl1001 is CRITICAL: Server answer: [15:18:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:18:47] New patchset: Jgreen; "adding wmf/fb mysql to hume for civicrm/drupal upgrade testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3026 [15:18:58] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3026 [15:19:59] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3026 [15:20:02] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3026 [15:24:09] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [15:24:09] PROBLEM - Puppet freshness on mw1110 is CRITICAL: Puppet has not run in the last 10 hours [15:24:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.403 seconds [15:25:15] New patchset: Pyoungmeister; "some loggin'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3027 [15:25:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3027 [15:28:14] New patchset: Pyoungmeister; "some loggin'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3027 [15:28:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3027 [15:28:38] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3027 [15:28:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3027 [15:32:22] New patchset: Mark Bergsma; "Make Varnish not complain about unused VCL resources" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3028 [15:32:33] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3028 [15:32:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3028 [15:32:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3028 [15:34:12] New patchset: Mark Bergsma; "Add -p param" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3029 [15:34:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3029 [15:35:15] PROBLEM - DPKG on hume is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:35:20] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3029 [15:35:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3029 [15:40:17] New patchset: Mark Bergsma; "Varnish doesn't accept definition of probes below functions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3031 [15:40:28] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3031 [15:41:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3031 [15:41:14] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3031 [15:43:21] RECOVERY - Varnish HTTP upload-backend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 625 bytes in 0.053 seconds [15:44:08] !log hume apt upgrades, puppetd --test, switch to mysql 5.1.53-fb3753-wm1 [15:44:12] Logged the message, Master [15:51:00] RECOVERY - Disk space on snapshot2 is OK: DISK OK [15:52:05] !log Turned off vcc_err_unref on all varnish servers, so varnish doesn't complain when ACLs/probes/backends are unused [15:52:07] Logged the message, Master [15:58:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:02:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.365 seconds [16:07:50] New patchset: Mark Bergsma; "Just test on cp1021 for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3032 [16:08:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3032 [16:08:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3032 [16:08:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3032 [16:38:30] New patchset: Mark Bergsma; "Add upload service IP to lvs1002/lvs1005" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3033 [16:38:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3033 [16:38:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3033 [16:39:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3033 [16:44:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.035 seconds [16:49:01] New patchset: Mark Bergsma; "Pass realserver IPs as a hash, not an array" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3034 [16:49:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3034 [16:49:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3034 [16:49:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3034 [16:55:43] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours [17:04:43] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [17:05:46] PROBLEM - Puppet freshness on knsq23 is CRITICAL: Puppet has not run in the last 10 hours [17:05:46] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours [17:06:40] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [17:06:40] PROBLEM - Puppet freshness on amssq49 is CRITICAL: Puppet has not run in the last 10 hours [17:06:40] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Puppet has not run in the last 10 hours [17:08:46] PROBLEM - Puppet freshness on knsq21 is CRITICAL: Puppet has not run in the last 10 hours [17:08:46] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours 
[17:08:46] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours [17:08:46] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [17:11:46] PROBLEM - Puppet freshness on amssq62 is CRITICAL: Puppet has not run in the last 10 hours [17:13:43] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [17:13:43] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [17:13:53] New patchset: Mark Bergsma; "Fix URI match, and require upload.wikimedia.org host header" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3035 [17:14:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3035 [17:15:00] New patchset: Mark Bergsma; "Fix URI match, and require upload.wikimedia.org host header" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3035 [17:15:11] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3035 [17:15:27] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3035 [17:15:30] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3035 [17:19:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.353 seconds [17:32:28] !log set swift storage device weight on ms2 to 0 and pushed out rings [17:32:32] Logged the message, Master [17:42:55] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [17:42:55] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours [17:49:58] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours [17:49:58] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Puppet has not run in the last 10 hours [17:49:58] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours [17:49:58] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours [17:49:58] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Puppet has not run in the last 10 hours [17:49:58] PROBLEM - Puppet freshness on amssq58 is CRITICAL: Puppet has not run in the last 10 hours [17:49:58] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [17:49:59] PROBLEM - Puppet freshness on amssq52 is CRITICAL: Puppet has not run in the last 10 hours [17:49:59] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [17:50:00] PROBLEM - Puppet freshness on knsq29 is CRITICAL: Puppet has not run in the last 10 hours [17:50:00] PROBLEM - Puppet freshness on knsq25 is CRITICAL: Puppet has not run in the last 10 hours 
[17:56:52] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [17:56:52] PROBLEM - Puppet freshness on amssq39 is CRITICAL: Puppet has not run in the last 10 hours [17:56:52] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [17:56:52] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Puppet has not run in the last 10 hours [17:56:52] PROBLEM - Puppet freshness on amssq55 is CRITICAL: Puppet has not run in the last 10 hours [17:56:52] PROBLEM - Puppet freshness on amssq54 is CRITICAL: Puppet has not run in the last 10 hours [17:56:52] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Puppet has not run in the last 10 hours [17:56:53] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [17:57:46] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [17:57:55] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [17:57:55] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Puppet has not run in the last 10 hours [17:57:55] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [17:57:55] PROBLEM - Puppet freshness on amssq59 is CRITICAL: Puppet has not run in the last 10 hours [17:57:55] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [17:57:55] PROBLEM - Puppet freshness on ssl3004 is CRITICAL: Puppet has not run in the last 10 hours [17:58:58] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [17:58:58] PROBLEM - Puppet freshness on amssq45 is CRITICAL: Puppet has not run in the last 10 hours [17:58:58] PROBLEM - Puppet freshness on amssq48 is CRITICAL: Puppet has not run in the last 10 hours [17:58:58] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Puppet has not run in the last 10 hours [17:58:58] PROBLEM - Puppet freshness on knsq27 is CRITICAL: Puppet has not run in the 
last 10 hours [17:58:58] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [17:59:52] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours [17:59:52] PROBLEM - Puppet freshness on amssq42 is CRITICAL: Puppet has not run in the last 10 hours [17:59:52] PROBLEM - Puppet freshness on cp3002 is CRITICAL: Puppet has not run in the last 10 hours [18:00:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:55] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [18:00:55] PROBLEM - Puppet freshness on amssq57 is CRITICAL: Puppet has not run in the last 10 hours [18:00:55] PROBLEM - Puppet freshness on knsq18 is CRITICAL: Puppet has not run in the last 10 hours [18:00:55] PROBLEM - Puppet freshness on knsq16 is CRITICAL: Puppet has not run in the last 10 hours [18:00:55] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours [18:00:55] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: Puppet has not run in the last 10 hours [18:00:55] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [18:01:58] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours [18:02:52] PROBLEM - Puppet freshness on knsq22 is CRITICAL: Puppet has not run in the last 10 hours [18:03:55] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [18:04:58] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [18:04:58] PROBLEM - Puppet freshness on amssq61 is CRITICAL: Puppet has not run in the last 10 hours [18:04:58] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [18:05:04] * schoolcraftT dusts off a kitchen towel and slaps it at Thehelpfulone [18:05:52] PROBLEM - Puppet freshness on knsq20 is CRITICAL: Puppet 
has not run in the last 10 hours [18:06:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.045 seconds [18:14:16] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [18:18:28] PROBLEM - DPKG on ms-be5 is CRITICAL: Connection refused by host [18:19:04] PROBLEM - Disk space on ms-be5 is CRITICAL: Connection refused by host [18:19:31] PROBLEM - RAID on ms-be5 is CRITICAL: Connection refused by host [18:28:54] New patchset: Bhartshorne; "changing lvs and nagios to check for a file in swift directly rather than going through the swift rewrite stuff for thumbnails to protect against the thumbnail getting deleted" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3036 [18:29:06] mark: would you review ^^^^ [18:29:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3036 [18:40:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:16] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/3036 [18:43:49] PROBLEM - NTP on ms-be5 is CRITICAL: NTP CRITICAL: No response from NTP server [18:44:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.416 seconds [18:45:27] Ryan_Lane: can you add a DNS entry for test.m.wikipedia.org? 
[18:46:15] lemme see [18:46:51] u r my hero [18:50:16] done [18:50:41] \o/ [18:50:43] thanks dude [18:54:25] yw [19:20:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.023 seconds [19:45:33] New patchset: Ryan Lane; "Ensure glustermanager can run mkdir -p, not just mkdir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3037 [19:45:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3037 [19:46:09] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3037 [19:46:12] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3037 [19:58:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.398 seconds [20:34:18] !log stopping search indexer on searchidx2 for fresh rsync to searchidx1001 [20:34:21] Logged the message, and now dispatching a T1000 to your position to terminate you. 
[20:41:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:44:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.677 seconds [20:58:03] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours [21:20:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:26:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.863 seconds [22:00:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:06:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.798 seconds [22:21:46] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [22:27:01] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 308 seconds [22:27:55] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 361 seconds [22:35:52] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [22:36:55] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [22:42:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:46:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.787 seconds [22:51:18] New patchset: Ryan Lane; "1-2 not 12" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3038 [22:51:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3038 [22:51:35] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3038 [22:51:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3038 [22:54:20] New patchset: Ryan Lane; "I hate you puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3039 [22:54:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3039 [22:55:41] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3039 [22:55:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3039 [23:20:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.164 seconds