[00:04:25] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours [00:04:25] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours [00:04:25] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours [00:04:25] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours [00:04:25] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours [00:04:26] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours [00:04:26] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [00:04:27] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours [00:04:27] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours [00:04:28] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours [00:04:28] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours [00:04:29] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours [00:04:29] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours [00:04:30] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours [00:04:30] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours [00:04:31] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours [00:04:35] Ahm? [00:04:48] * RoanKattouw wonders if someone broke the puppet manifest for apaches [00:05:51] probably [00:14:31] hm, those are all precise apaches, not in production due to the sorting issue [00:15:00] I'm trying a manual run on one of them [00:15:08] puppet is running on the couple i checked, and a manual run on one was fine [00:15:39] interesting, wonder if something is borked on them sending the trap to know that puppet ran ? [00:16:07] are there logs on spence of passive check requests? [00:16:20] you can see it in syslog [00:16:24] when it receives them [00:16:52] you can also try a tcpdump on the machine to see if it's swending them when it is supposed to [00:25:43] PROBLEM - mysqld processes on db25 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:25:52] PROBLEM - MySQL disk space on db25 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:26:01] PROBLEM - MySQL Idle Transactions on db25 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:26:10] PROBLEM - MySQL Recent Restart on db25 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:26:10] PROBLEM - MySQL Replication Heartbeat on db25 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:26:10] PROBLEM - MySQL Slave Running on db25 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:26:28] PROBLEM - SSH on db25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:37] PROBLEM - Full LVS Snapshot on db25 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:27:13] PROBLEM - MySQL Slave Delay on db25 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:47:01] uhoh [00:47:43] pulling.. [00:48:26] PROBLEM - NTP on db25 is CRITICAL: NTP CRITICAL: No response from NTP server [01:00:57] can someone please flush the mobile varnish cache? binasher, mutante [01:01:34] i am dealing with a down db, so someone else if possible please [01:02:40] notpeter, apergos ^^ [01:04:47] nm, just ran the command [01:05:29] thanks binasher sorry for the distraction [01:06:05] it was still in my shell history :) so no biggy [01:08:10] New patchset: Asher; "building db64 to replace db25 (dead sun) in s3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25979 [01:09:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25979 [01:09:53] RECOVERY - MySQL disk space on db25 is OK: DISK OK [01:09:53] RECOVERY - mysqld processes on db25 is OK: PROCS OK: 1 process with command name mysqld [01:10:38] RECOVERY - NTP on db25 is OK: NTP OK: Offset -0.001613140106 secs [01:10:38] RECOVERY - SSH on db25 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [01:10:38] RECOVERY - Full LVS Snapshot on db25 is OK: OK no full LVM snapshot volumes [01:11:41] RECOVERY - MySQL Replication Heartbeat on db25 is OK: OK replication delay seconds [01:12:08] RECOVERY - MySQL Recent Restart on db25 is OK: OK seconds since restart [01:12:08] RECOVERY - MySQL Slave Delay on db25 is OK: OK replication delay seconds [01:12:09] !log stopping mysql on db25 and powering down - logging lots of ecc chipkill errors and periodically unresponsive. building db64 as a perm replacement [01:12:25] Logged the message, Master [01:12:26] RECOVERY - MySQL Slave Running on db25 is OK: OK replication [01:16:31] !log streaming hotbackup of db1019 to db64 (s3) [01:16:41] Logged the message, Master [01:30:22] New patchset: Tim Starling; "Send sessions to Redis as well as memcached" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25982 [01:32:21] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25982 [01:33:40] [6d3b473f] 2012-10-02 01:33:30: Fatal exception of type MWException [01:33:50] PHP fatal error in Unknown line 0: [01:33:50] Exception thrown without a stack frame [01:34:43] did someone do a deploy? [01:35:05] 10/01/12 [21:32:56] !log tstarling synchronized wmf-config/CommonSettings.php [01:35:48] looks like its back [01:36:27] Tim reverted the change [01:38:06] New patchset: Krinkle; "Revert "Send sessions to Redis as well as memcached"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25983 [01:38:32] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25983 [01:38:59] New review: Krinkle; "Brought down the site. " [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25982 [01:39:35] PROBLEM - Puppet freshness on manganese is CRITICAL: Puppet has not run in the last 10 hours [01:40:18] * RoanKattouw wonders why puppet hasn't been running on manganese [01:40:56] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 224 seconds [01:41:26] manganese [01:41:35] err ignore my client fail [01:41:41] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 267 seconds [01:44:50] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 28 seconds [01:45:44] PROBLEM - Apache HTTP on srv215 is CRITICAL: Connection refused [01:46:08] New patchset: Catrope; "Revert "Revert "Send sessions to Redis as well as memcached""" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25984 [01:49:38] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 684s [01:50:19] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25984 [01:51:44] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 23 seconds [01:52:47] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 24s [01:59:23] RECOVERY - Apache HTTP on srv215 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.071 second response time [02:40:51] PROBLEM - mysqld processes on db62 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [02:41:00] PROBLEM - mysqld processes on db64 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [02:51:03] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 225 seconds [02:51:12] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 233 seconds [02:51:21] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 244 seconds [02:51:21] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 244 seconds [02:51:39] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 260 seconds [02:51:57] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 279 seconds [02:52:24] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 307 seconds [02:54:57] RECOVERY - MySQL Slave Delay on db1003 is OK: OK replication delay 0 seconds [02:55:42] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [02:55:51] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay 0 seconds [02:56:00] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 0 seconds [02:56:00] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 0 seconds [02:56:18] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [02:56:18] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [02:57:12] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Tue Oct 2 02:57:11 UTC 2012 [03:41:57] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [03:41:57] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [03:41:57] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [03:41:57] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [03:41:57] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [04:14:56] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [04:59:56] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [05:21:41] RECOVERY - mysqld processes on db64 is OK: PROCS OK: 1 process with command name mysqld [05:23:18] !log started replicating db64 from the s3 master [05:23:28] Logged the message, Master [05:25:26] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: CRIT replication delay 9309 seconds [05:26:20] PROBLEM - MySQL Slave Delay on db64 is CRITICAL: CRIT replication delay 9230 seconds [05:42:23] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay 0 seconds [05:42:41] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [05:43:17] RECOVERY - MySQL Slave Delay on db64 is OK: OK replication delay 0 seconds [06:03:01] morning... [06:19:52] yes it is! [06:20:31] paravoid: morning! are you off today work today for velocity? [06:21:03] minus one today [06:21:13] * apergos goes to check ont he rsyncs [06:21:15] no [06:21:24] today is the trainings day and I didn't opt in going there [06:22:09] ah they are all on 'c', so they'll be done in a couple hours tops and I can start new ones [06:22:15] so I intend to work, as long as I find WiFi that does not suck [06:22:23] good luck with that [06:22:50] ooh, in that case! can you work on building the precise package that supposedly fixes the sorting issue? i don't actually know what it is, but hopefully peter provided enough details [06:22:51] yeah, wasn't too easy yesterday. the Mozilla Space kindly hosted me for a few hours but they're all booked for today [06:23:04] yeah, I should... [06:23:34] i was also wondering if one of you could talk to mark when he gets on about setting up async snapmirroring for the originals vol on the netapp [06:24:05] we didn't want to start that right now, best to get the rsyncs completed (in case there's an impact) [06:24:26] if there's an impact, can't it be turned off immediately? [06:24:27] I would expect tomorrow or wed that could be turned on though [06:25:16] yes but why not just wait the one or two days [06:27:26] hmm 8.2T copied, that's much better [06:27:37] waiting til the very last minute in all things is indeed the wmf way. i guess it's more exciting that way, eh? [06:27:42] anyways, night! [06:27:54] hahahah :) [06:28:02] you can't blame him, can you? :) [06:28:05] now that is what I call taking a cheap potshot [06:28:19] fwiw, I don't think it matters either way [06:28:52] I will point out I had been trying to get the swiftmigration to happen with a deadline of this march (the one that past) so we wouldn't have this space issue [06:28:54] but nooooo [06:28:58] [06:32:27] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [06:35:27] PROBLEM - Squid on brewster is CRITICAL: Connection refused [06:47:54] RECOVERY - Squid on brewster is OK: TCP OK - 0.013 second response time on port 8080 [07:38:55] New patchset: ArielGlenn; "Fix 'find most recent deployment dir date' to not be broken; cleanup README" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/26000 [07:39:48] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/26000 [07:53:49] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [08:31:57] [ 66.997126] Adding 10485752k swap on /thank-you-for-partitioning-without-swap-space. Priority:-1 extents:2678 across:10622388k [08:32:01] that smells like Tim [08:32:13] (fenari) [09:06:17] !log squid deploy all: switching math & timeline to be served from swift [09:06:29] Logged the message, Master [09:15:52] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [09:15:52] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [09:16:03] mark: around? [09:26:42] New patchset: Matthias Mullie; "Make abusefilter emergency disable more sensible" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25855 [09:29:49] New patchset: Faidon; "upload varnish: switch math/timeline to swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26011 [09:30:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26011 [09:35:40] New review: Matthias Mullie; "CSteipp: You're right, I had completely misinterpreted the logic behind it." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/25855 [10:05:16] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours [10:05:16] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours [10:05:16] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [10:05:16] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours [10:05:16] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours [10:05:17] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours [10:05:17] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours [10:05:18] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours [10:05:18] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours [10:05:19] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours [10:05:19] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours [10:05:20] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours [10:05:20] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours [10:05:21] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours [10:05:21] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours [10:05:22] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours [10:13:03] anyone know why puppet does not run on the mw1-15 boxes ? [10:13:32] peter would know [10:13:40] but they've been depooled because of the icu thing [10:13:44] reedy@fenari:~$ ssh mw1 [10:13:44] Welcome to Ubuntu 12.04 LTS (GNU/Linux 3.2.0-29-generic x86_64) [10:13:44] The last Puppet run was at Tue Oct 2 10:01:30 UTC 2012 (12 minutes ago). [10:13:54] oh hah, interesting [10:13:56] paravoid: yes? [10:14:22] Reedy: well done :) [10:14:29] ;) [10:14:35] morning mark [10:15:05] https://gerrit.wikimedia.org/r/26011 ack for deploy? [10:15:46] maybe the SNMP traps are never sent / received. Could be a fw filter issue [10:25:51] +1 [10:26:29] to what? r26011 or the fw filter? :) [11:01:07] 26011 [11:01:51] okay [11:02:33] what are the "severe problems" that we faced with the GRE? [11:02:35] just wondering :) [11:04:54] well [11:05:04] routing private ips to public ips needs policy routing and shit [11:05:14] it got messy really quickly [11:05:51] we didn't want to route everything across the tunnels [11:05:52] hm? wouldn't be a tunnel? [11:05:56] ah [11:06:10] especially since right now the tunnels are not redundant yet [11:06:11] don't get me wrong, I never liked the idea [11:06:19] we need to migrate to another MX80 first etc [11:06:25] so we looked if we could do a single tunnel just for ori's stuff [11:06:30] but since it got messy, called it off [11:06:32] mostly since I generally hate tunnels :) [11:06:35] yeah [11:06:40] so we're looking into an mpls link now [11:06:46] yeah, that would be great [11:06:50] finally [11:06:58] well, really, any IP feed [11:07:03] don't care what the transport is [11:07:21] as long as it's a normal 1500-mtu ethernet or IP feed [11:07:48] yeah [11:10:40] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26011 [11:22:09] New patchset: Faidon; "upload varnish: fix syntax error (duh)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26034 [11:23:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26034 [11:23:55] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26034 [11:46:04] !log Moving US upload traffic to upload-lb.eqiad (Varnish) [11:46:16] Logged the message, Master [11:52:12] !log apt: include new php5 built with libicu42 (precise-wikimedia/main), include libicu42 [11:52:23] Logged the message, Master [11:53:00] paravoid: you are awesome!!! [11:53:18] especially as last night someone restarted all of the jobrunners that were sorting things wrong.... [11:53:38] someone needs to upgrade all of these boxes now :P [11:54:15] I bet someone named puppet will :) [11:55:00] hrhr [11:56:07] I'm not so sure :) [11:56:32] meh, even if I have to reimage, I have that down to an art :) [11:56:44] an art, or engineering? [11:57:29] little bit of both [11:58:20] okay, I'm now 4h in this cafe [11:58:23] I should probably leave :) [11:58:46] I'll try to get online later [11:59:12] You the sort of guy that spends 4hours drinking one coffee and is really just there for the wifi? heh [12:00:06] good luck again [12:00:17] oh! yay. puppet class was commented out. hurray! no faulty jobrunners [12:02:11] hashar: parsekit is next in line btw :) [12:03:14] paravoid: hurrah :-) [12:04:31] paravoid: ah, these all have the same version numbers as the ones that were already in the repo [12:04:39] yes, this will require some re-installing [12:04:54] or some dpkg brutality [12:05:14] same version nrs? that's not a debian thing to do ;p [12:06:03] notpeter: no they don't [12:06:12] it's bumped [12:06:52] PROBLEM - Puppet freshness on knsq19 is CRITICAL: Puppet has not run in the last 10 hours [12:08:15] ok, then I'm probably am just reading this email from reprepro wrong [12:08:29] it's early and I have had no coffee yet [12:08:35] so that is most likely the case :) [12:11:15] New patchset: Pyoungmeister; "re-adding lucid jobrunners on mw* hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26035 [12:12:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26035 [12:12:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26035 [12:13:12] the mails are very confusing [12:13:16] that should help the jobqueue :) [12:13:22] mark: yeah [12:13:25] that's because the mail script is "echo $@ | mail" [12:13:40] I'm 100% confident that faidon did things correctly :) [12:13:44] ah, gotcha [12:54:28] New patchset: Pyoungmeister; "re-enabling precise jobrunners" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26038 [12:55:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26038 [12:55:33] \o/ [12:55:52] where are we with eqiad apache installs btw? [12:56:24] I hav edone 3 for testing. the rest are imaged, but not puppetized [12:56:30] good [12:56:34] so, I can finish up at any point [12:56:38] excellent even [12:56:45] I like imaged but not puppetized [12:56:59] indeed. even got the one hardware issue fixed [12:57:22] although puppet is not gonna like the hundreds of extra clients [12:58:07] truuuueeeee [13:03:55] New patchset: Hashar; "Gerrit hook tests now creates hookconfig.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26040 [13:04:51] New patchset: Hashar; "Gerrit hook tests extended coverage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26041 [13:05:23] !log enabling srv190 in imagescaler pool at half-weight to test precise image-scaling [13:05:34] Logged the message, notpeter [13:05:44] New patchset: Hashar; "Gerrit notifications for Wikidata to their channel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26042 [13:06:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26040 [13:06:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26041 [13:06:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26042 [13:10:39] !log rebooting mw1-16 to pick up kernel upgrades before re-enabling them as (precise) jobrunners [13:10:50] Logged the message, notpeter [13:14:12] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26038 [13:15:21] New patchset: Hashar; "Gerrit notifications for Wikidata to their channel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26042 [13:16:14] New review: Hashar; "I have forgot to remove the wikidata.log -> #mediawiki binding." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26042 [13:16:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26042 [13:17:24] RECOVERY - Puppet freshness on mw9 is OK: puppet ran at Tue Oct 2 13:17:00 UTC 2012 [13:17:24] RECOVERY - Puppet freshness on mw1 is OK: puppet ran at Tue Oct 2 13:17:13 UTC 2012 [13:20:33] RECOVERY - Puppet freshness on mw10 is OK: puppet ran at Tue Oct 2 13:20:22 UTC 2012 [13:20:42] RECOVERY - Puppet freshness on mw2 is OK: puppet ran at Tue Oct 2 13:20:27 UTC 2012 [13:21:54] RECOVERY - Puppet freshness on mw11 is OK: puppet ran at Tue Oct 2 13:21:37 UTC 2012 [13:21:54] RECOVERY - Puppet freshness on mw3 is OK: puppet ran at Tue Oct 2 13:21:44 UTC 2012 [13:22:57] RECOVERY - Puppet freshness on mw13 is OK: puppet ran at Tue Oct 2 13:22:22 UTC 2012 [13:22:57] RECOVERY - Puppet freshness on mw12 is OK: puppet ran at Tue Oct 2 13:22:48 UTC 2012 [13:23:06] RECOVERY - Puppet freshness on mw4 is OK: puppet ran at Tue Oct 2 13:22:55 UTC 2012 [13:23:51] RECOVERY - Puppet freshness on mw7 is OK: puppet ran at Tue Oct 2 13:23:39 UTC 2012 [13:24:27] RECOVERY - Puppet freshness on mw5 is OK: puppet ran at Tue Oct 2 13:24:14 UTC 2012 [13:24:46] New patchset: Pyoungmeister; "bump php version number for precise jobrunners" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26044 [13:25:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26044 [13:26:46] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26044 [13:30:27] RECOVERY - Puppet freshness on mw6 is OK: puppet ran at Tue Oct 2 13:30:15 UTC 2012 [13:32:51] RECOVERY - Puppet freshness on mw14 is OK: puppet ran at Tue Oct 2 13:32:50 UTC 2012 [13:34:12] RECOVERY - Puppet freshness on mw15 is OK: puppet ran at Tue Oct 2 13:33:51 UTC 2012 [13:34:57] RECOVERY - Puppet freshness on mw8 is OK: puppet ran at Tue Oct 2 13:34:46 UTC 2012 [13:35:24] RECOVERY - Puppet freshness on mw16 is OK: puppet ran at Tue Oct 2 13:35:02 UTC 2012 [13:35:31] New review: Aude; "I think this change is reasonable, although don't necessarily speak for the entire team." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/26042 [13:42:36] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [13:42:36] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [13:42:36] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [13:42:36] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [13:42:36] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [13:45:45] well, I can happily say that the new precise build of the imagescalers spews errors into thumbnail.log at roughly half the rate of the lucid ones. [13:46:05] so that's progress! [13:47:09] heh [13:47:12] good! [13:47:18] :-] [13:47:39] assuming that's not just creating blank images and not hitting any errors in the process! :) [13:48:19] apergos: hi [13:48:32] cmjohnson1: lo [13:49:53] did some writing to the disk last night...can you check sdf1 plz. b4 i left i put the old disk back [13:50:20] it shows as mounted [13:52:55] ok I will have a look in just a minute [13:53:23] mark: also possible..... [13:53:28] k..nearly positive that all is correct...like to have 2nd set of eyes [13:54:18] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100% [13:54:44] looks to be the same general classes of errors as the lucid ones [13:56:44] argh [13:56:46] /dev/sdf1 on /srv/swift-storage/sdg1 type xfs (rw,noatime,nodiratime,nobarrier,logbufs=8) [13:57:06] !log rebooting srv194-199 and mw55-59 to pick up kernel upgrades, and then re-adding them to apaches pool [13:57:16] Logged the message, notpeter [13:57:50] and we have 21 mounts instead of 22 so something's up with that [13:57:54] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 26.66 ms [13:58:21] PROBLEM - Host srv197 is DOWN: PING CRITICAL - Packet loss = 100% [13:58:35] hmmm [13:58:57] mount: special device LABEL=swift-sdf1 does not exist [13:59:09] how should I go about troubleshooting a machine that will not PXE boot even though it should [13:59:12] it is booting from HDD every time [13:59:17] ? [13:59:46] ottomata: you can hit f12 during post and it should pxe boot [14:00:14] apergos: I removed a disk and put it back last night [14:00:23] so cmjohnson1 it's reassigned all theids and the disk you reinserted didn't show up [14:00:34] lemme look a bit at the logs [14:00:54] RECOVERY - Host srv197 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [14:01:41] apergos: yep...i noticed that yesterday that the labels moved up one [14:02:18] did you reboot after that at some point? [14:02:20] oh, wait, srv190 is at half the weight as the other imagescalers.... so it spews things into the logs just as quickly... damn. [14:03:12] apergos: the only reboot was when I swapped the disk last night [14:03:12] !log Sending Japanese upload traffic to upload-lb.eqiad (Varnish) [14:03:15] right, ok thanks, will see if I can send f12... [14:03:23] Logged the message, Master [14:03:35] ottomata: right at the beginning...top left corner [14:03:36] PROBLEM - Apache HTTP on mw57 is CRITICAL: Connection refused [14:03:36] PROBLEM - Apache HTTP on mw56 is CRITICAL: Connection refused [14:03:45] PROBLEM - Apache HTTP on srv196 is CRITICAL: Connection refused [14:03:47] sorry top right corner [14:04:16] those apaches being crit is fine, btw [14:04:45] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:20] PROBLEM - Apache HTTP on srv195 is CRITICAL: Connection refused [14:05:38] PROBLEM - Apache HTTP on mw59 is CRITICAL: Connection refused [14:05:38] PROBLEM - Apache HTTP on srv194 is CRITICAL: Connection refused [14:06:05] PROBLEM - Apache HTTP on srv197 is CRITICAL: Connection refused [14:06:23] PROBLEM - Apache HTTP on mw58 is CRITICAL: Connection refused [14:06:59] RECOVERY - Apache HTTP on mw56 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [14:07:31] New review: Hashar; "I have talked about this change with Aude and other Wikidata folks in their #wikimedia-wikidata chan..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26042 [14:08:56] PROBLEM - Auth DNS on ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:09:23] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.062 second response time [14:09:58] !log Restarted pdns on linne [14:10:11] Logged the message, Master [14:10:26] RECOVERY - Auth DNS on ns1.wikimedia.org is OK: DNS OK: 0.023 seconds response time. www.wikipedia.org returns 208.80.154.225 [14:10:26] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [14:10:53] RECOVERY - Apache HTTP on srv197 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [14:12:09] !log Sending Brazil upload traffic to upload-lb.eqiad (Varnish) [14:12:20] Logged the message, Master [14:12:32] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [14:12:50] RECOVERY - Apache HTTP on srv196 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [14:13:04] cmjohnson1, i got it, thanks! it is resinstalling [14:13:35] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.062 second response time [14:13:35] cool [14:14:38] PROBLEM - SSH on analytics1001 is CRITICAL: Connection refused [14:15:50] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [14:15:59] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [14:16:17] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [14:16:17] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:16:44] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [14:17:38] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.063 seconds response time. www.wikipedia.org returns 208.80.154.225 [14:23:35] all precise apache builds now re-enabled! [14:23:47] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:24:06] time to hide [14:24:18] hehehehe [14:26:18] !log Sending Mexican upload traffic to upload-lb.eqiad (Varnish) [14:26:29] Logged the message, Master [14:27:05] RECOVERY - SSH on analytics1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:27:14] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 26.83 ms [14:33:44] mark: while it shouldn't be substantially differnt than the regular apaches, do you think it's worth upgrading one of the bits boxes in pmtpa to lucid for testing purposes? [14:33:57] s/lucid/precise/ [14:34:19] yeah I think that's worth it [14:34:34] I don't expect any problems, but if there are any, i'd prefer to know now rather than during eqiad switchover [14:35:32] yep, my same thoughts. ok, cool [14:57:25] !log Sending traffic from India, Korea, China, Singapore, Thailand, Vietnam to upload-lb.eqiad (Varnish) [14:57:37]