[00:24:12] New patchset: Bhartshorne; "adding in country filters for mobile using new udp_filter framework" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2185 [00:24:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2185 [00:24:46] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2185 [00:24:47] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2185 [00:27:28] New patchset: Bhartshorne; "typo correction" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2186 [00:27:54] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2186 [00:27:55] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2186 [00:30:18] robla: re RT-2370, you can already see source through gerrit's gitweb stuff, [00:30:26] it's just that the links are small. [00:30:50] oh. [00:30:50] nm. [00:30:57] that's what you say in bugzilla. I misread the RT. [00:33:44] maplebed: let me know when you are ok with the thumbnail purging or if I find anything [00:48:40] New patchset: Asher; "fix innodb statistic collection (at long last), update more frequently, add query rate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2187 [00:57:41] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:05:41] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [01:05:41] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [01:06:45] AaronSchulz: I think you should start the deletion job; it might take a while. 
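[Editor's note: the patchset above adds country filters for mobile traffic using the udp_filter framework. As a rough illustration only — the field layout and the GeoIP lookup below are invented assumptions, not the real udp_filter code — filtering a udp2log-style request stream by country could be sketched as:]

```python
# Hypothetical sketch of a country filter over a udp2log-style request stream.
# NOT the real udp_filter implementation: the field positions, the target
# country set, and ip_to_country() are invented for illustration.

COUNTRIES = {"KE", "IN"}  # example target countries (assumption)

# Stand-in for a GeoIP database lookup (the real tool uses a GeoIP library).
FAKE_GEOIP = {"192.0.2.10": "KE", "198.51.100.7": "US"}

def ip_to_country(ip):
    """Map a client IP to a two-letter country code, '--' if unknown."""
    return FAKE_GEOIP.get(ip, "--")

def filter_lines(lines, countries=COUNTRIES, ip_field=4):
    """Yield only log lines whose client IP resolves to a target country.

    Assumes whitespace-delimited log lines with the client IP in
    column `ip_field` (an assumption, not the real log format).
    """
    for line in lines:
        fields = line.split()
        if len(fields) > ip_field and ip_to_country(fields[ip_field]) in countries:
            yield line
```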
[01:09:11] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.440 seconds [01:18:56] maplebed: ok [01:20:22] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2187 [01:20:22] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2187 [01:43:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:53:11] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 98 MB (1% inode=60%): /var/lib/ureadahead/debugfs 98 MB (1% inode=60%): [01:58:08] New patchset: Bhartshorne; "disabling 'check-apache' command for swift since it actually checks ssh, not apache." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2188 [01:58:26] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2188 [01:58:26] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2188 [01:58:26] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2188 [02:07:11] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.412 seconds [02:15:51] RECOVERY - Disk space on srv222 is OK: DISK OK [02:18:31] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1452s [02:26:31] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1932s [02:38:01] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 9s [02:41:11] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 19s [02:52:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:03:51] RECOVERY - Puppet freshness on spence is OK: puppet ran at Thu Feb 2 03:03:27 UTC 2012 [03:04:41] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours [03:22:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.630 seconds [03:32:53] PROBLEM - mysqld processes on db56 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [03:37:33] RECOVERY - Puppet freshness on mw8 is OK: puppet ran at Thu Feb 2 03:37:21 UTC 2012 [03:52:25] hi maplebed [04:16:10] RECOVERY - Disk space on es1004 is OK: DISK OK [04:18:40] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [04:44:20] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [08:51:51] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 682 seconds [09:12:00] the s2 active hosts are all very unhappy (dbs 13, 24, 54) [09:12:01] PROBLEM - MySQL Slave Delay on db13 is CRITICAL: CRIT replication delay 1339 
seconds [09:12:11] PROBLEM - MySQL Slave Delay on db54 is CRITICAL: CRIT replication delay 317 seconds [09:12:20] thanks nag. [09:14:15] I am getting some squid errors [09:14:16] Request: POST http://www.mediawiki.org/wiki/Special:Code/MediaWiki/110486, from 208.80.152.76 via sq61.wikimedia.org (squid/2.7.STABLE9) to () [09:14:16] Error: ERR_CANNOT_FORWARD, errno [No Error] at Thu, 02 Feb 2012 09:14:00 GMT [09:14:19] on www.mediawiki.org [09:14:23] that is the third time [09:14:53] yes, they are backed up because of the dbs [09:15:16] hashar, see #wikimedia-tech too [09:15:20] k thx [09:16:57] PROBLEM - LVS HTTP on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:22:47] RECOVERY - ps1-d2-pmtpa-infeed-load-tower-A-phase-Z on ps1-d2-pmtpa is OK: ps1-d2-pmtpa-infeed-load-tower-A-phase-Z OK - 1200 [09:22:47] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:47] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:25:17] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:26:27] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:27] PROBLEM - MySQL Replication Heartbeat on db54 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
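[Editor's note: the Seconds_Behind_Master alerts above come from a MySQL replication lag check. A minimal sketch of such a check follows — the thresholds and message format are assumptions, not the production Nagios plugin:]

```python
# Hedged sketch of a Nagios-style MySQL replication lag check.
# Thresholds and message wording are assumptions, not the production plugin.

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios exit states

def check_lag(seconds_behind_master, warn=30, crit=300):
    """Map replication lag to a Nagios state and message.

    `seconds_behind_master` is the value a slave reports (None when the
    slave is stopped), mirroring log lines like
    'CRITICAL - Seconds_Behind_Master : 1452s' above.
    """
    if seconds_behind_master is None:
        return CRITICAL, "CRITICAL - slave not running"
    if seconds_behind_master >= crit:
        return CRITICAL, "CRITICAL - Seconds_Behind_Master : %ds" % seconds_behind_master
    if seconds_behind_master >= warn:
        return WARNING, "WARNING - Seconds_Behind_Master : %ds" % seconds_behind_master
    return OK, "OK - Seconds_Behind_Master : %ds" % seconds_behind_master
```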
[09:27:57] RECOVERY - LVS HTTP on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 57953 bytes in 0.155 seconds [09:29:17] PROBLEM - Apache HTTP on srv247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:17] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:17] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:27] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:33:57] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [09:36:27] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [09:37:37] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [09:38:37] RECOVERY - MySQL Replication Heartbeat on db54 is OK: OK replication delay 0 seconds [09:40:27] RECOVERY - Apache HTTP on srv247 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [09:41:37] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [09:41:47] RECOVERY - Apache HTTP on mw14 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [09:42:07] RECOVERY - Apache HTTP on mw13 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [09:43:37] RECOVERY - MySQL Slave Delay on db54 is OK: OK replication delay 0 seconds [09:45:27] RECOVERY - MySQL Slave Delay on db13 is OK: OK replication delay 0 seconds [09:46:17] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds [09:48:07] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds [09:58:37] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 158 MB (2% inode=60%): /var/lib/ureadahead/debugfs 158 MB (2% inode=60%): [10:11:27] PROBLEM - Disk space on es1004 
is CRITICAL: DISK CRITICAL - free space: /a 446525 MB (3% inode=99%): [10:14:15] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 432545 MB (3% inode=99%): [10:22:55] RECOVERY - Disk space on srv223 is OK: DISK OK [10:27:41] New patchset: Mark Bergsma; "Discard mailman bounces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2189 [10:28:09] New patchset: Mark Bergsma; "Add removal support for system roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2175 [10:28:25] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2175 [10:28:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2175 [10:28:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2189 [10:28:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2189 [10:31:31] New patchset: Mark Bergsma; "It's :blackhole: instead of :discard:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2190 [10:31:59] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2190 [10:31:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2190 [11:17:18] New patchset: ArielGlenn; "rsyncd stanza for folks mirroring all public content from dumps.wm.o" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2191 [11:18:38] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2191 [11:18:39] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2191 [11:23:02] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [11:23:02] PROBLEM - Puppet freshness on lvs1003 is 
CRITICAL: Puppet has not run in the last 10 hours [11:27:20] New patchset: Mark Bergsma; "Add new role::cache::squid::upload class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2192 [11:27:37] New patchset: Mark Bergsma; "Rename service IPs to be consistent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2193 [11:27:53] New patchset: Mark Bergsma; "Add TODO" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2194 [11:28:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2193 [11:28:27] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2192 [11:28:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2192 [11:28:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2193 [11:28:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2193 [11:29:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2194 [11:29:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2194 [11:40:42] RECOVERY - MySQL slave status on es1004 is OK: OK: [11:48:16] New patchset: Hashar; "GraphViz needed for Jenkins installation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2196 [12:23:09] PROBLEM - Puppet freshness on mw65 is CRITICAL: Puppet has not run in the last 10 hours [12:59:49] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%): [13:07:34] New patchset: Mark Bergsma; "Make lvs::realserver support hashes as well as arrays for service IPs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2204 [13:07:51] New 
review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2204 [13:08:40] New patchset: Mark Bergsma; "Merge LVS changes from test into production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2204 [13:08:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2204 [13:09:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2204 [13:09:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2204 [13:11:09] RECOVERY - Disk space on srv221 is OK: DISK OK [13:17:58] New patchset: Mark Bergsma; "Use a dynamic lookup for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2205 [13:18:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2205 [13:18:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2205 [13:18:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2205 [13:24:03] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours [13:28:02] New patchset: Mark Bergsma; "Refactor, remove duplication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2206 [13:28:20] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2206 [13:29:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2206 [13:29:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2206 [13:33:37] New patchset: Mark Bergsma; "A hash is unordered, so sort values to avoid constant Puppet changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2207 [13:33:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2207 [13:34:32] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2207 [13:34:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2207 [13:43:20] New patchset: Mark Bergsma; "Remove old system roles, preparing for migration of role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2208 [13:43:23] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2208 [13:43:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2208 [13:43:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2208 [13:44:13] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:46] New patchset: Mark Bergsma; "Cleanup Squid monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2209 [13:55:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2209 [13:55:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2209 [13:59:02] did anyone push something recently? [13:59:13] I got [notice] child pid 7721 exit signal Segmentation fault (11) [13:59:14] messages [13:59:38] <^demon> RoanKattouw says that's known. [13:59:45] Tim knows about it [14:00:10] IIRC he said that PHP is segfaulting during shutdown, and I think it was related to wmerrors [14:00:35] Although I'm seeing a lot of new segfaults now, I thought they were supposed to have gone away [14:00:39] <^demon> wmerrors was disabled iirc. [14:00:54] I am just warning because it suddenly started to happen [14:00:56] (full disclosure, I'm running scap right now) [14:01:01] a segfault every second or so [14:01:04] ahh [14:01:08] so that must be scap :D [14:01:22] would explain why it happens from different apaches [14:06:19] OK, scap is done, no new segfaults [14:07:38] New patchset: Mark Bergsma; "Rename squid role caches and nagios groups for consistency with varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2210 [14:07:54] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2210 [14:08:08] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2210 [14:08:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2210 [14:18:25] New patchset: Mark Bergsma; "Migrate varnish cache::bits and cache::mobile classes to role::cache in role/cache.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2211 [14:18:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2211 [14:20:09] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2211 [14:20:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2211 [14:28:26] New review: Dzahn; "moved puppetmaster monitoring out of site.pp" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2166 [14:28:26] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2166 [14:30:27] New review: Dzahn; "yep, graphviz needed" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2196 [14:30:28] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2196 [14:36:32] this looks like an apache problem: https://bugzilla.wikimedia.org/34160 [14:37:15] for simple example, try: http://mingle.corp.wikimedia.org/projects/ [14:39:26] <^demon> Yeah, looks like a busted redirect. I think office IT manages that box though. [14:42:54] looks like somebody just copy/pasted an internal link into the post [14:43:54] <^demon> No, it works, there's just a broken redirect. [14:44:04] <^demon> Figured this out yesterday. [14:47:31] New review: Dzahn; "you are deleting nightly.css but still using it?" 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2174 [14:49:02] New review: Demon; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2174 [15:01:05] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=64%): /var/lib/ureadahead/debugfs 1 MB (0% inode=64%): [15:11:55] RECOVERY - Disk space on srv220 is OK: DISK OK [15:14:31] New patchset: Mark Bergsma; "Split ridiculously long lines in lvs.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2212 [15:14:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2212 [15:15:05] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2212 [15:15:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2212 [15:16:37] hi mark [15:16:45] hi [15:17:09] is there any news on the 'locke-replacement-department'? [15:17:16] no idea [15:17:19] what does the ticket say? :) [15:17:24] no idea [15:17:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.383 seconds [15:17:33] :D [15:18:14] no activity since last week when we opened it [15:18:20] time to prod ;) [15:18:25] prod who? [15:18:29] the ticket [15:18:36] rob was in eqiad yesterday [15:18:40] if nothing happened there, he likely forgot [15:18:58] can you remind him? [15:19:12] i'll ask, but best reply to the ticket too [15:20:25] in fact... 
[15:20:29] RobH: ^^ [15:20:39] there is no owner of the ticket yet [15:22:42] that doesn't really mean anything [15:23:05] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=64%): /var/lib/ureadahead/debugfs 1 MB (0% inode=64%): [15:25:23] New review: Hashar; "Please note nightly.css was *renamed* not deleted :" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2174 [15:27:13] New patchset: Mark Bergsma; "Remove old LVS services and old service IPs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2213 [15:28:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2213 [15:28:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2213 [15:31:09] New patchset: Mark Bergsma; "Remove old payments LVS service as well, rename the new one" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2214 [15:31:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2214 [15:31:47] New patchset: Mark Bergsma; "Qualify all $site references in lvs.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2215 [15:33:45] RECOVERY - Disk space on srv223 is OK: DISK OK [15:35:58] New patchset: Dzahn; "redirect http://irc.wikimedia.org to IRC meta page (currently "it works" page)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2216 [15:36:15] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2216 [15:37:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2214 [15:37:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2214 [15:37:33] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2215 [15:37:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2215 [15:38:43] robh: hi [15:39:00] heya [15:39:09] diederik: im lookin at your locke replacement ticket now [15:39:18] ok [15:39:19] trying to figure out if I have hardware already we can use [15:39:47] anyone know anything about OTRS upgrades? [15:40:00] i received the drives for db13..thx do u have time to help me w/it today? [15:40:18] !log Running apt-get update && apt-get dist-upgrade && reboot on lvs2 [15:40:20] Logged the message, Master [15:40:27] robh: i guess it's a middle of the road server [15:40:37] maybe a bit more harddrive space is useful [15:40:43] but memory is not a super big thing [15:41:05] but mark might have clearer ideas on this [15:42:52] Well, I think our stock misc server would work, except for the lack of hard disk [15:43:20] the stock misc servers are 500 GB, which are not large enough [15:43:31] cmjohnson1: lemme take a look at ticket [15:46:09] hexmode, in what sense? [15:46:34] Reedy: in the sense of being able to upgrade it [15:47:57] Reedy: There are a couple of RT tickets open for it [15:48:00] Jeff_Green was doing some work on it for something... I don't know if it's to the extent of trying to upgrade it... [15:48:07] RobH: what are the ocg boxes doing in the squid vlan? [15:48:27] * hexmode goes all hound-dog on Jeff_Green  [15:48:31] whut [15:48:38] otrs [15:48:41] Jeff_Green: OTRS! [15:48:46] double whut. [15:48:49] can you upgrade it? 
[15:49:20] i can try, but I warn you their official estimate according to philippe is that an upgrade for us will take a year [15:49:38] i lol about that, but at the same time I wonder why they think it will not be straightforward [15:51:00] Jeff_Green: maybe b/c we don't have anyone in OPs working on it and the RT ticket has been open a year? [15:51:12] * hexmode guesses at the time from the ticket # [15:51:30] very funny, but no. [15:51:38] heh [15:51:42] It's perl and otrs [15:51:51] We've no sadists on staff [15:52:02] perl! I love perl! [15:52:04] philippe consulted with the OTRS folks about upgrading--i.e. contracting them, and they said to anticipate a year-long process [15:52:11] cmjohnson1: checking db13 now [15:52:12] I have no problem with perl, you can't blame perl [15:52:18] heh [15:52:19] it is my favorite practical/functional language [15:52:37] ?? [15:52:54] OTRS is looking for weird ways to make money [15:52:55] I'm going to donate to the perl foundation just to annoy all you perl haters [15:53:08] hexmode: maybe. [15:53:11] cmjohnson1: So I show: Group 6, Segment 0 : Inconsistent (0,23) [15:53:31] okay [15:53:31] that is drive 23 in the controller, it should have an orange LED lit [15:53:40] Jeff_Green: I'd chip in but the Perl 6 saga is frustrating after 10 years [15:53:45] anyway, I'm finally at a place where the idea of being involved with an otrs upgrade doesn't make me want to kill myself, but yeah it's not trivial [15:54:02] Jeff_Green: if you dont wanna kill yourself and others, you missed an important step ;] [15:54:04] hexmode: yeah, fair enough. but all the other languages will get there too I figure [15:54:12] robh: no led on a specific drive...i will have to look at diagram [15:54:40] python nextgen and PHP 6 have similar stories [15:54:46] languages mature to the point where you're better off starting a new one rather than improving the old one [15:54:58] except lisp! 
[15:55:05] lisp is always awesome [15:55:30] that's actually one of the things I like about perl pre 6. it is what it is, it works, it doesn't have a security crisis every other day [15:55:39] !log Running apt-get update && apt-get dist-upgrade && reboot on lvs1 [15:55:41] Logged the message, Master [15:55:52] cmjohnson1: hrmm, odd that no drive is lit [15:56:41] cmjohnson1: argh, the 23 means nothing, its in slot 13 i think, still confirming [15:57:31] cmjohnson1: Ok, reading this, it appears that drive 15 is the dead drive [15:57:39] which should be in the bottom corner, but please confirm [15:58:59] ^demon|away: i see it breaks the redirect, but even if i fix it, it's still an internal link with a login on a public blog [15:59:28] robh: the drive location is top right corner [15:59:33] by "fix it" i just meant changing the URL in my browser :p [16:00:47] hexmode: bz 34160 should be an Office IT thing, afaik yep [16:00:53] cmjohnson1: is it blinking now? [16:00:59] cmjohnson1: i told it to blink the dead drive [16:01:44] robh: strangely enough all the drives but drive 15 are blinking ....as though they are in use except 15 [16:02:35] ok, well, command failed, oh well [16:02:37] heh [16:02:45] mutante: k [16:02:45] since the rest are blinking, i think we know the dead one then [16:03:03] okay....fyi http://docs.oracle.com/cd/E19121-01/sf.x4140/820-2394-13/overview.html#0_65739 [16:03:06] cmjohnson1: so log before you pull it, and then go ahead and swap it out and lets see if it fixes it. 
real fast though, lemme check to ensure its not a master [16:03:15] okay....let me know [16:03:36] http://noc.wikimedia.org/dbtree/ is new tool asher made [16:03:52] so we can see db13 is a slave for s2 [16:04:00] which is fine, i just get nervous when we work on the masters [16:04:10] so go ahead and log and swap with the system staying online [16:04:18] (all good on this end) [16:04:32] k [16:04:56] * hexmode lols @ [[User:Mutante]] [16:05:03] "I love User Boxes" [16:05:12] yes, we can all see that [16:05:12] !log replacing disk 15 on db11 [16:05:15] Logged the message, Master [16:05:50] New review: Dzahn; "re: asterisk in : ah,ok so Apache will expand that? re: renamed file: yes, i see. nevermi..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/2174 [16:05:52] yeah I used that tool today to see which cluster was affected by the three dbs being whacked [16:06:09] New patchset: Mark Bergsma; "Migrate esams text squids to new role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2217 [16:06:18] !log disk 15 swap complete on db11 [16:06:19] Logged the message, Master [16:06:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2217 [16:07:55] hexmode: "This user is ambivalent about self-referential userboxes." 
:) [16:07:56] New patchset: Mark Bergsma; "Migrate esams squids to new role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2218 [16:08:05] robh: please verify the new disk..thx [16:08:18] cmjohnson1: [16:08:19] Device #15 [16:08:19] Device is a Hard drive [16:08:20] State : Rebuilding [16:08:24] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2217 [16:08:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2217 [16:08:31] Group 6, Segment 1 : Rebuilding (0,23) 30V1T30W 3PD1T30W [16:08:32] thx [16:08:36] looks good =] [16:13:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2218 [16:13:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2218 [16:21:00] New patchset: Mark Bergsma; "Migrate remaining squids to new role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2219 [16:21:57] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2219 [16:21:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2219 [16:24:21] New patchset: Mark Bergsma; "Revert "Migrate remaining squids to new role classes"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2220 [16:24:39] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2220 [16:24:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2220 [16:24:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2220 [16:24:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2220 [16:26:35] New patchset: Mark Bergsma; "Revert "Revert "Migrate remaining squids to new role classes""" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2221 [16:26:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2221 [16:26:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2221 [16:26:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2221 [16:28:11] New patchset: Mark Bergsma; "Let's not convert live text squids into upload squids" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2222 [16:28:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2222 [16:28:33] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2222 [16:28:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2222 [16:29:29] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2174 [16:29:30] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2174 [16:32:47] mark, apergos: Would it perhaps be a good idea to give the scalers a larger root partition? It's 7GB currently and they're out of space all the time because GhostScript creates 100MB files in /tmp [16:33:00] (I checked, they're not even old. 
The oldest ones I found were ~20 mins old) [16:33:06] that means they shouldn't be using the root partition at all [16:33:21] I know they aren't old [16:33:27] I updated that cron script recently [16:33:32] that should be a dedicated partition [16:38:06] New patchset: Hashar; "fix nightly.css relative path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2223 [16:42:35] New review: Dzahn; "done:)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2223 [16:42:35] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2223 [16:47:43] RECOVERY - RAID on db13 is OK: OK: 1 logical device(s) checked [16:48:32] quick question for ops folks: do we detect mobile devices in varnish? (and redirect them to the mobile site) [16:48:48] in squid, yes [16:49:15] New patchset: Hashar; "fix up nightly.css" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2224 [16:49:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2224 [16:51:32] New review: Dzahn; "looks good, not using IndexStyleSheet" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2224 [16:51:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2224 [16:56:03] PROBLEM - Host text.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:56:11] New patchset: Mark Bergsma; "Cleanup: remove obsolete old role classes text-squid and upload-squid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2225 [16:56:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2225 [16:57:44] ok, i just got paged and emailed about text [16:57:54] old service ip, not a problem [16:57:58] me too! [16:58:04] heh, was gonna say, it looks ok... [16:58:07] its working ;] [16:58:09] hey look the paging system works! 
[16:58:13] :-P [16:58:13] no it's not working [16:58:19] ? [16:58:21] but that doesn't matter since it's been unused for 6 months ;) [16:58:29] the paging system? [16:58:29] heh [16:58:32] I disabled 208.80.152.2 today [16:58:33] it just paged us [16:58:34] WHAT'S GOING ON [16:58:39] we don't have an email alias that will page all the same people regular pages go to, do we? [16:58:42] so I guess it is working :-P [16:58:43] and removed it from nagios [16:58:55] mark: how is squid doing mobile detection? using wurfl? [16:58:55] eg to send a note to everybody that got paged but isn't here to say "don't worry, it's cool." [16:59:03] hi maplebed! [16:59:06] maplebed: ops, I guess [16:59:09] i'll do that now [16:59:15] mark: but that doesn't page. [16:59:26] can we enable the final country filter ? [17:00:33] diederik: emery looks ok, so yes. [17:00:49] awesome! [17:00:58] i also checked it and it's still running [17:01:17] * binasher deletes site_stats from enwiki to further along the fire drill [17:01:51] mark: but hey! people really are getting much better about responding to pages :) [17:01:56] maplebed: now I think of it [17:01:59] I think you can do that using nagios [17:02:55] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2225 [17:02:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2225 [17:03:24] diederik: squid mobile detection is based on useragent regexes that were generated by inspecting wurfl xml. i think they provided around 90% coverage of wurfl at the time of implementation [17:03:52] maplebed: yea, could use Nagios acknowledgement and configure it to page [17:04:11] binasher: cool, i just stumbled upon this: http://www.enrise.com/2011/02/mobile-device-detection-with-wurfl-and-varnish/ and it seemed relevant [17:06:02] diederik: yep, definitely. 
preilly has done some work on a c api to wurfl as well, that we may integrate with varnish one day [17:06:13] sweet! [17:06:51] the implementation in that blog post doesn't look great [17:07:39] no, i don't think that is a very scalable implementaton either but i didn't know it was possible in the first place (which says a lot about my knowledge) [17:07:56] maplebed: or actually, it might be to put a host into scheduled downtime as a reaction, and configuring "scheduled downtime started" as notification_options: Valid options are a combination of one or more of the following: d = send notifications on a DOWN state, u = send notifications on an UNREACHABLE state, r = send notifications on recoveries (OK state), f = send notifications when the host starts and stops flapping, and s = send notifi [17:08:14] New patchset: Mark Bergsma; "Migrate swift cluster role classes into role/swift.pp with role::prefix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2226 [17:08:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2226 [17:08:43] PROBLEM - Backend Squid HTTP on knsq25 is CRITICAL: Connection refused [17:09:04] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/2226 [17:09:18] mutante, mark: what I mean is that it's useful to be able to send information (as a human) in the situation when pages are going out and people are on their way to getting net access. [17:09:29] maplebed: yeah [17:09:34] poking nagios with downtimes or sending ack pages and all is still useful, but not the same. 
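[editor's note] The user-agent approach binasher describes above (regexes generated from WURFL data, ~90% device coverage at implementation time) can be sketched minimally. The pattern list below is purely illustrative, not the actual squid ACL, and the function name is hypothetical:

```python
import re

# Illustrative subset of mobile user-agent tokens; the production squid
# rules were generated by inspecting WURFL XML and were far more complete.
MOBILE_UA = re.compile(
    r"(iPhone|iPod|Android|BlackBerry|SymbianOS|Windows CE|PalmOS|Opera Mini)",
    re.IGNORECASE,
)

def is_mobile(user_agent: str) -> bool:
    """Return True if the User-Agent looks like a mobile device."""
    return bool(MOBILE_UA.search(user_agent))
```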
[17:09:39] I believe you can send out arbitrary comments in the nagios web interface [17:09:48] maplebed: putting it into scheduled downtime would be a human action via web ui [17:09:53] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 403 Forbidden [17:10:38] maplebed: whenever you're ready, change 2226 is awaiting your review [17:11:24] mutante: it still conflates two equally possible messages. (1) shit's broken and I need nagios to shut up or (2) it's a false alarm and the rest of you that got a page should go back to sleep. Though if we can send arbitrary pages with the nagios UI as mark suggests, we can add a note to clarify... [17:11:28] comments should also work, but not sure if you can page on them easily, because its not in notification_options [17:12:01] I know I would prefer to just have an email address "page-ops@" that I can say what is necessary and not have to spend precious could-be-fixing time fighting with nagios and hoping it does what I want. [17:12:44] well that would have issues too [17:12:47] we would need to guard it against spam [17:12:55] and of course only the first 140 chars in your mail would get sent, or something [17:13:02] and imagine that being a mime header or whatever ;) [17:13:08] it's possible, but not trivial [17:13:22] mark: re: change 2226, you're basically just moving that stuff to a new file, right? 
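[editor's note] The notification_options flags mutante quotes above map onto a Nagios object definition roughly like this (illustrative fragment; the contact name and notification command are placeholders, and `s` is the flag that would fire when a host enters or leaves scheduled downtime):

```
define contact {
    contact_name               page-ops
    host_notification_period   24x7
    # d = DOWN, u = UNREACHABLE, r = recovery (OK), f = flapping,
    # s = scheduled downtime starts/stops
    host_notification_options  d,u,r,s
    host_notification_commands host-notify-by-sms
}
```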
[17:13:29] yeah and a slight rename of classes [17:13:31] i would say both cases are kind of "Acknowledgement", which has a comment field [17:13:36] dunno if this is helpful but at CL we ended up hacking a web-based [MUTE] button into our paging system, which would email a reminder ever so often while it was active [17:13:40] mark: in my experience those email addresses don't get a lot of spam (actually not any) [17:13:42] for consistency [17:14:02] actually [17:14:06] I see a problem with it already [17:14:11] inherits swift-cluster::base [17:14:16] I renamed swift-cluster::base [17:14:17] let me amend [17:14:21] k. [17:14:47] heh [17:14:50] this was broken before already [17:14:54] it's calling swift::base [17:14:59] that was always called swift-cluster::base [17:15:05] perhaps that's why inheritance wasn't working? [17:15:22] ah, no [17:15:26] that's calling into swift.pp... [17:15:30] at Linden each opsen had an email address "page-" and there was a "page-ops". They worked exceedingly well and were very often useful and I honestly can't remember ever getting spam to the addresses. [17:15:48] obviously we have more exposure here, but with minor obfuscation of the addresses, I'd bet it would be fine. [17:16:25] yeah we did that too, totally useful [17:16:47] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.909 seconds [17:16:48] you could write a wrapper for the pager script nagios uses as well [17:17:10] we did that too, when we used big brother :-P [17:17:17] New patchset: Mark Bergsma; "Migrate swift cluster role classes into role/swift.pp with role::prefix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2226 [17:17:29] I probably have that script somewhere, it did time- and perl regex filtering [17:17:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2226 [17:17:39] tea time. brb. [17:18:41] oh tea is a very good idea. 
i'm going to copy you. [17:18:44] pager script wrapper ++ [17:19:43] we could also do some reply processing [17:19:47] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=64%): /var/lib/ureadahead/debugfs 0 MB (0% inode=64%): [17:20:23] reply ack to acknowledge in nagios, or "msg …" to send a message to everyone else on the page [17:20:44] (we also had an IRC bot that could both page people and accept return pages and echo them back to the channel in which the person was paged... it was awesome.) [17:20:47] echo "go back to sleep" | /usr/local/bin/gammu-smsd-inject TEXT $CONTACTPAGER$ [17:26:00] does nagios have a reasonable API that would allow poking it from email/sms/ircbot or would we need to hack that too? [17:26:20] it's written in perl, isn't it [17:26:24] >;-) [17:26:27] no clue! [17:26:37] or was it C [17:26:42] its hacking that status.dat file afaik [17:26:45] i was guessing something evil like ruby [17:26:52] definitely not ruby [17:26:57] or it would have died a long time ago [17:27:00] ha [17:27:14] http://mathias-kettner.de/checkmk_livestatus.html [17:27:53] Jeff_Green: email-to-nagios scripts i've seen wrote directly to the same nagios socket that the web page uses [17:28:37] ah, that's not bad actually [17:29:13] this actually sounds like a fun project, sorta [17:29:34] notpeter: ok, cp1002 can be reimaged [17:29:39] and puppet can be run on all [17:29:40] mark: ok [17:29:46] yay! [17:29:48] we can put it in production tomorrow or next week [17:29:52] use role::cache::text [17:29:56] awesome [17:30:16] mark is on steroids today ;-) [17:30:32] looking at the amt of check-ins [17:30:40] someone woke me up too early this morning [17:30:56] but I guess that means... i'm already done for the day [17:30:57] RECOVERY - Disk space on srv219 is OK: DISK OK [17:31:04] maplebed: shall we configure the final country filter? 
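[editor's note] Jeff_Green's point above about writing to "the same nagios socket that the web page uses": Nagios accepts external commands as timestamped lines written to its command file (a named pipe). A minimal sketch, assuming the stock Debian nagios3 FIFO path; host and author values are examples:

```python
import time

# Default Debian nagios3 command FIFO; adjust for the local install.
CMD_FILE = "/var/lib/nagios3/rw/nagios.cmd"

def ack_host_problem(host, author, comment, now=None):
    """Build an ACKNOWLEDGE_HOST_PROBLEM external command line.
    sticky=2 (ack until recovery), notify=1, persistent=0."""
    ts = int(time.time() if now is None else now)
    return "[%d] ACKNOWLEDGE_HOST_PROBLEM;%s;2;1;0;%s;%s" % (
        ts, host, author, comment)

def submit(command, path=CMD_FILE):
    # Each write to the FIFO is one command line terminated by newline.
    with open(path, "w") as fifo:
        fifo.write(command + "\n")
```

An email-to-nagios or IRC-bot gateway would just parse the incoming message and call `submit(ack_host_problem(...))`.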
[17:31:07] ya inoticed [17:31:25] nah it wasn't the outage [17:31:26] diederik: yeah, sorry, got distracted by my need to brew tea. I'll throw it in in just a sec. [17:31:35] awesome! [17:31:36] initially i thought it was a false alarm cos i could get to the wikipedias [17:31:44] speaking of tea, i need my coffee :D [17:31:48] I was already well awake and working way before that [17:31:51] so that wasn't so bad ;) [17:32:22] <^demon|away> mutante: The blog post suggested using "guest/guest" as the login creds :p [17:33:27] mark - next week, maplebed is putting swift thumbnails into production (1/256 of traffic) [17:33:27] ^demon|away: heh,ok:) [17:33:37] !log reimaging cp1002 and imaging cp1001 and cp1003-1020 [17:33:39] Logged the message, and now dispaching a T1000 to your position to terminate you. [17:33:50] so maybe squids goes in tomorrow? [17:33:57] yeah I know [17:34:04] squid is unrelated to thumbnails [17:34:06] or middle next wek [17:34:09] we're just doing that for text [17:34:13] we wanna use varnish in front of swift [17:34:19] but swift is in pmtpa [17:34:20] (in eqiad) [17:34:22] and squid is there [17:34:27] and can be used for it [17:34:31] ok [17:34:33] so, unrelated [17:34:47] but yeah, I might deploy squid in eqiad tomorrow [17:34:59] or not, depending on how much I wanna change to it still and considering it's friday ;) [17:35:00] okay dokay [17:35:05] not that it's very risky [17:35:22] if not, early next week [17:37:25] sfo folks will be on standby if tomorrow then [17:37:35] notpeter: obviously, if you see any weird puppet issues on any of the new squids, don't run puppet on the rest [17:37:37] New patchset: Bhartshorne; "adding in the final country filter for mobile for diederik" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2228 [17:37:44] as they might need reimaging otherwise ;) [17:37:55] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2228 [17:38:03] mark: yea, I'm going to redo just cp1002 and see what happens [17:38:10] oki [17:38:12] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2228 [17:38:13] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2228 [17:38:14] see you tomorrow then [17:38:36] mark: cool. have a good evening! [17:40:52] diederik: going live now. [17:41:05] cool [17:43:57] PROBLEM - Host cp1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:25] diederik: what's your plan for sending udplog data to additional hosts? i want to make sure one of my hosts gets in on the action [17:46:35] diederik: 2TB total storage is enough? [17:46:51] we are going to replace locke soon and that replacement will become a proxy that will do multicast [17:46:57] This will be the logging host for the next couple years I imagine, I am fine with trying for that or higher [17:47:06] so will order 4 1TB disks at minimum [17:47:06] more tb is better :D [17:47:15] and if I can get 1.5 or 2 I will [17:47:22] its a logging host, so i imagine raid10 is overkill [17:47:27] we can do a raid5 for a lot more space [17:47:37] i think, not sure what those controllers can really do [17:47:53] raid5 is often (though not always) much slower than raid10. [17:48:35] slower on writes for sure, but should be comparable on reads [17:48:47] maplebed: yea, hugely slower on writes [17:48:57] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:49:00] but logs are not really huge, just constant, so i dunno if its an issue [17:49:03] can we do RAID6 though? it's nice to have the extra parity [17:49:05] sweet, 2tb disks are cdheap now [17:49:09] 141 for non green version [17:49:14] (I say not always because Dell used to have some raid cards that, for whatever reason, were actually faster with raid5 than 10. 
don't ask me why.) [17:49:20] the green version disks are crappy in my experience in raids [17:49:44] maplebed: it can be faster for reads because you get to involve more spindles for reads [17:49:45] yea there are always outliers but i concur that raid5 is usually a lot slower on writes [17:50:05] i mean, implementation aside [17:50:19] diederik: so 2tb disks are cheap, I am going to order 4 of them, so you will have at minimum 4TB [17:50:24] if we raid10 [17:50:36] sounds sweet and awesome! [17:51:03] RobH: do our cards support raid6? [17:51:24] not sure, goign to check in a moment, but i doubt it on the misc servers [17:51:29] the r510 h700 cards do [17:51:34] ic [17:51:40] but this is a misc server, since logging is low power [17:51:44] just needs lots of storage [17:51:56] right, makes sense [17:52:11] fwiw, emery and locke are currently using about 500G each. [17:52:39] can I throw some data there? i've got fundraising logs galore :-P [17:53:11] emery and locke are currently bound by CPU, not disk. [17:54:48] Jeff_Green: you need a new logging host? [17:54:57] RobH: yeah eventually [17:55:06] whats the current one, emery? [17:55:19] it's sorta complicated, sec [17:55:22] i imagine we want the fundraising data to go to a different server than other logs right? [17:55:38] or does that not matter? [17:56:44] ok the situation is this . . . we pull a bunch of banner impression via udp2log [17:57:29] currently they're stored long-term on storage3 because that was the only fr host with adequate space [17:57:29] New patchset: ArielGlenn; "up bw limit for rsyncers on downloads.wm.o" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2229 [17:57:45] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2229 [17:58:02] there they're parsed and processed into faulkner's fundraising-analytics database [17:58:16] but we still want to keep the raw logs long term [17:58:27] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2229 [17:58:28] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2229 [17:58:37] well, if they are huge, we may want to allocate something larger than just a misc server like we are for diederik's logs [17:58:39] RT1917 has details [17:58:41] which are not huge [17:58:43] write throughput goes down the drain on raid5 and 6 during a rebuild, which can take a long time on 2TB drives [17:59:15] write throughput might be important if a much greater % of log lines are written to disk, which i think is going to happen [17:59:22] RobH: yeah, agreed--we're collecting ~2TB/year [17:59:40] binasher: so i guess we should go for raid10 on this particular locke replacement [17:59:48] since locke itself is raid10 as well iirc [18:00:01] i am checking the controller to see if can do anything but that anyhow [18:00:33] there's been some analysis discussed that wouldn't work with random sampling and needs more log context than could likely be computed in memory by a filter [18:00:41] hrmm, crappy controller does r1 or r0, but not nested automatically, i assume it can do the raid1 then 0 the two together [18:00:51] so raid5/6 are off table anyhow [18:01:02] for this server, unless we get a new controller which is overkill for needs [18:01:12] so 4 2TB disks = 4TB overall [18:03:41] ottomata, [18:03:54] hiya [18:03:57] sorry [18:04:04] my little sister was pressing keys [18:04:06] haha [18:04:32] so o, tab, enter... 
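[editor's note] The capacity side of the RAID debate above reduces to simple arithmetic (write performance and rebuild behavior, discussed above, are separate concerns):

```python
def usable_tb(n_disks, disk_tb, level):
    """Rough usable capacity for common RAID levels.
    Capacity only; ignores controller overhead and hot spares."""
    if level == "raid0":
        return n_disks * disk_tb            # no redundancy
    if level == "raid10":
        assert n_disks % 2 == 0
        return n_disks // 2 * disk_tb       # mirrored pairs, striped
    if level == "raid5":
        return (n_disks - 1) * disk_tb      # one disk of parity
    if level == "raid6":
        return (n_disks - 2) * disk_tb      # two disks of parity
    raise ValueError(level)
```

For the 4 x 2TB order discussed above: RAID10 gives 4TB usable, RAID5 would give 6TB, RAID6 4TB.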
[18:05:19] i could see us wanting to go to ES class servers for log storage at some point [18:05:49] mark: I know you're gone, but cp1002 came up beautifully, so I'm going to run puppet on the rest [18:06:08] !log doing initial run of puppet on cp1001-1020 [18:06:10] Logged the message, and now dispaching a T1000 to your position to terminate you. [18:10:58] mark: can you tell me something about what bits should go in role/swift.pp vs. what goes in manifests/swift.pp? [18:11:12] I'm adding ganglia metrics (for object count, get rate, put rate, 404 count, etc.) [18:12:13] diederik: Ok, i put in the ticket info and escalated it for approvals on purchase. once thats done (today or tomorrow) then it will ship [18:12:24] i imagine we will have your replacement locke online by end of next week [18:12:44] thanks so much! [18:13:00] maplebed: basically, anything that either ties multiple services into one "host role", or anything that's very wikimedia specific should go into the role classes [18:13:02] as a guideline [18:13:13] or role specific... [18:13:24] so service manifests should be fairly generic and/or configurable [18:13:36] (like puppet modules, if we're ever gonna use those, this is in line with that) [18:17:18] food.. [18:26:31] New patchset: Dzahn; "simple SMS pager script as a starter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2230 [18:27:21] !change 2230 | maplebed [18:27:21] maplebed: https://gerrit.wikimedia.org/r/2230 [18:28:26] cute! [18:28:30] I like that bot response. [18:29:07] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/2230 [18:29:42] New patchset: Dzahn; "simple SMS pager script as a starter, missed a /" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2230 [18:29:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2230 [18:36:45] has anyone looked at the php remote code execution vulnerability? [18:41:43] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:53:33] RECOVERY - Host cp1002 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [19:16:48] New patchset: Asher; "fix string formatting of mysql version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2231 [19:17:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2231 [19:37:41] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 5.761 seconds [19:43:03] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2231 [19:43:04] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2231 [19:52:31] PROBLEM - Backend Squid HTTP on cp1003 is CRITICAL: Connection refused [19:52:31] PROBLEM - Frontend Squid HTTP on cp1006 is CRITICAL: Connection refused [19:52:31] PROBLEM - Backend Squid HTTP on cp1009 is CRITICAL: Connection refused [19:52:31] PROBLEM - Backend Squid HTTP on cp1015 is CRITICAL: Connection refused [19:52:31] PROBLEM - Frontend Squid HTTP on cp1012 is CRITICAL: Connection refused [19:52:32] PROBLEM - Frontend Squid HTTP on cp1018 is CRITICAL: Connection refused [19:56:21] PROBLEM - Backend Squid HTTP on cp1010 is CRITICAL: Connection refused [19:56:31] PROBLEM - Frontend Squid HTTP on cp1007 is CRITICAL: Connection refused [19:56:41] PROBLEM - Frontend Squid HTTP on cp1019 is CRITICAL: Connection refused [19:56:41] PROBLEM - Backend Squid HTTP on cp1004 is CRITICAL: Connection refused [19:57:11] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: Connection refused [19:57:21] PROBLEM - Frontend Squid HTTP on cp1013 is CRITICAL: Connection refused [19:58:31] PROBLEM - Backend Squid HTTP on cp1016 is 
CRITICAL: Connection refused [19:58:31] PROBLEM - Frontend Squid HTTP on cp1009 is CRITICAL: Connection refused [19:58:31] PROBLEM - Frontend Squid HTTP on cp1020 is CRITICAL: Connection refused [19:58:31] PROBLEM - Backend Squid HTTP on cp1011 is CRITICAL: Connection refused [19:58:41] PROBLEM - Frontend Squid HTTP on cp1003 is CRITICAL: Connection refused [19:58:41] PROBLEM - Backend Squid HTTP on cp1012 is CRITICAL: Connection refused [19:58:51] PROBLEM - Backend Squid HTTP on cp1006 is CRITICAL: Connection refused [19:58:51] PROBLEM - Frontend Squid HTTP on cp1002 is CRITICAL: Connection refused [19:59:01] PROBLEM - Backend Squid HTTP on cp1017 is CRITICAL: Connection refused [19:59:12] PROBLEM - Frontend Squid HTTP on cp1014 is CRITICAL: Connection refused [19:59:31] PROBLEM - Frontend Squid HTTP on cp1008 is CRITICAL: Connection refused [20:00:11] PROBLEM - Backend Squid HTTP on cp1005 is CRITICAL: Connection refused [20:00:11] PROBLEM - Frontend Squid HTTP on cp1004 is CRITICAL: Connection refused [20:00:11] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [20:00:11] PROBLEM - Frontend Squid HTTP on cp1010 is CRITICAL: Connection refused [20:00:11] PROBLEM - Backend Squid HTTP on cp1013 is CRITICAL: Connection refused [20:00:21] PROBLEM - Backend Squid HTTP on cp1007 is CRITICAL: Connection refused [20:00:31] PROBLEM - Backend Squid HTTP on cp1019 is CRITICAL: Connection refused [20:00:41] PROBLEM - Frontend Squid HTTP on cp1016 is CRITICAL: Connection refused [20:02:11] PROBLEM - Frontend Squid HTTP on cp1015 is CRITICAL: Connection refused [20:02:31] PROBLEM - Backend Squid HTTP on cp1018 is CRITICAL: Connection refused [20:03:11] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [20:03:21] PROBLEM - Backend Squid HTTP on cp1014 is CRITICAL: Connection refused [20:03:42] PROBLEM - Backend Squid HTTP on cp1020 is CRITICAL: Connection refused [20:03:51] PROBLEM - Frontend Squid HTTP on cp1017 is CRITICAL: Connection 
refused [20:04:01] PROBLEM - Frontend Squid HTTP on cp1005 is CRITICAL: Connection refused [20:04:01] PROBLEM - Frontend Squid HTTP on cp1011 is CRITICAL: Connection refused [20:05:02] PROBLEM - Backend Squid HTTP on cp1008 is CRITICAL: Connection refused [20:06:07] notpeter: the sky is falling! [20:06:11] not really. [20:06:42] :-D [20:09:20] {98c0-4d-11-9a66: you might want to change that password ;) [20:09:40] RobH: the new caching config doesn't get the squid class... [20:19:12] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.128 seconds [20:22:04] New patchset: Diederik; "Added support for not having to define a filter, (the -f option)." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2232 [20:23:34] New patchset: Diederik; "Updated the documentation with the new -f or --force option." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2233 [20:23:36] New patchset: Diederik; "Updated control file to work on emery server." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2234 [20:23:38] New patchset: Diederik; "Fixed link." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2235 [20:57:00] New patchset: Pyoungmeister; "squid class not getting included for some reason. maybe this is a workaround?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2236 [20:57:18] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2236 [20:58:34] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2236 [20:58:35] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2236 [21:06:45] !log dataset1001 is alive, mostly [21:06:47] Logged the message, RobH [21:16:39] New patchset: ArielGlenn; "add dataset1001 to the dataset2 stanza in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2237 [21:16:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2237 [21:17:37] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2237 [21:17:37] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2237 [21:25:46] New patchset: ArielGlenn; "and fix the expression for dataset2/1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2238 [21:26:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2238 [21:26:26] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2238 [21:26:26] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2238 [21:36:07] should I log large file uploads via importImages.php to SAL? [21:36:56] ask Reedy ;) [21:37:03] We don't usually bother [21:37:06] k [21:37:41] Eloquence: i think last i saw your mail sig has a cleartext donate link. may want to SSL it now that we have the nginx boxen [21:37:51] jeremyb, heh fair enough [21:58:49] Eloquence: if it's just the 5gb no [21:59:27] yeah, just that. thanks. 
[21:59:37] as long as it has been chatted about in the channel, either here or wikitech-l, that's about it [21:59:41] er [21:59:44] wikimedia-tech [22:00:07] ok. uploaded the first one, scping the rest of them now and will import shortly. [22:13:35] TimStarling: hello [22:13:42] hi [22:14:08] have you given the worms further consideration? [22:14:27] not really [22:17:18] TimStarling, have you seen this issue before? http://commons.wikimedia.org/wiki/File:The_MediaWiki_Web_API_and_How_to_use_it_-_San_Francisco_Wikipedia_Hackathon_2012.ogv "Invalid ogg file: Cannot decode Ogg file: Invalid page at offset 652021853" [22:17:27] it's a 2.6G ogg theora video [22:17:32] seems to play fine locally [22:18:41] actually, hmgrl, looks like importImages is lying and it's not fully imported for some reason [22:20:01] we haven't done much testing on ogg files of that size [22:20:24] http://en.wikipedia.org/wiki/File:Floater_-_Burning_Sosobra_-_Exiled_-_sample.ogg [22:20:33] 348 KB with the same error [22:20:44] but there's a good chance that it's telling the truth and there really is an invalid page at that offset [22:20:55] yeah, the file is cut off [22:20:57] and maybe your local video player resyncs [22:21:20] should I just delete it and run importImages.php again, or is there some hard size limit I'm hitting here? [22:22:11] I don't think there's a hard limit [22:22:17] debug/verbose flag next time? [22:22:23] just keep an eye on the memory usage of the maintenance script [22:22:44] a previous version of OggHandler used quite a lot of memory for large ogg files, but that should be fixed now [22:26:24] !log pulled db24 from s2, preparing to upgrade to lucid [22:26:26] Logged the message, Master [22:29:43] mem usage is going pretty high for that process .. 24766 erik 20 0 2721m 1.4g 1.3g D 5 35.4 0:03.24 php and growing [22:33:33] got it. second time's the charm, apparently. 
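[editor's note] A cut-off upload like the one above can be caught before import with a cheap check: every Ogg page starts with the capture pattern "OggS", and the final page of a well-formed stream sets the end-of-stream flag (0x04) in the header_type byte at offset 5 within the page. A rough sketch (a heuristic, not a full parser; multiplexed files carry one EOS page per logical stream, so this only checks the physically last page):

```python
def looks_truncated(data: bytes) -> bool:
    """Heuristic truncation check for an Ogg file: the last page's
    header_type byte (offset 5 within the page) should have the
    end-of-stream flag (0x04) set."""
    off = data.rfind(b"OggS")            # last page capture pattern
    if off < 0 or off + 6 > len(data):   # no page, or header itself cut off
        return True
    return not (data[off + 5] & 0x04)
```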
[22:35:49] there are some cases where being able to fully delete files would be nice
[22:36:25] so, it's in the archive now?
[22:36:32] Reedy: or just dedupe in this case
[22:36:39] that's what I mean
[22:36:53] especially for "larger" uploads that aren't correct
[22:42:03] New patchset: Pyoungmeister; "attempting to force absolute lookup of class per mr. feldman's suggestion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2239
[22:44:36] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2239
[22:44:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2239
[22:47:13] Eloquence: what is TMH?
[22:47:36] jeremyb, http://www.mediawiki.org/wiki/Extension:TimedMediaHandler
[22:48:12] Eloquence: low cost screen capture is what kind of price range?
[22:49:17] jeremyb, it'd be nice to have a portable solution for less than $1K that'll allow us to easily capture video output and then later manually composite it with the speaker video
[22:49:26] there are some black box solutions but the ones I saw are pretty pricey
[22:49:42] Eloquence: that seems to be within range of http://www.amazon.com/Canopus-TWINPACT-Digital-Video-Converter/dp/B000K3HT1K
[22:50:05] Eloquence: you can then do pic-in-pic with video of speaker in a live stream
[22:50:14] (or recording or both)
[22:51:09] jeremyb, it's got a 1 whole star rating!
[22:51:15] * jeremyb is double checking that i got the right model
[22:51:40] Reedy: well maybe i got the wrong model. but the one i'm thinking of is quite well proven
[22:51:58] heh
[22:52:34] !log rebooted db24
[22:52:36] Logged the message, Master
[22:53:11] this is the "budget schmudget" option: http://www.epiphan.com/products/recording/vga-recorder-pro/
[22:53:31] New patchset: Asher; "upgrading db24" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2240
[22:53:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2240
[22:55:03] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2240
[22:55:03] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2240
[22:59:55] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[22:59:55] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[22:59:55] PROBLEM - Puppet freshness on mw65 is CRITICAL: Puppet has not run in the last 10 hours
[22:59:59] !log db24 upgraded to lucid and current mysql build
[23:00:00] Logged the message, Master
[23:04:41] sorry, internet died
[23:05:56] Reedy: so, i had the right model. it's cheaper at newegg though: http://www.newegg.com/Product/Product.aspx?Item=N82E16815299005
[23:06:36] With a 1 egg rating
[23:07:02] ignore the ratings, just watch it in action yourself. it will surely be used at FOSDEM this weekend ;)
[23:07:16] and is used every year at DebConf
[23:07:41] 2 separate sites showing crappy ratings don't give much hope
[23:08:12] (well, only ~75% sure about FOSDEM. but DebConf i'm certain. i can get reviews from hackers if you want them)
[23:09:43] Reedy: also both sites have only 1 review total and they're from the same handle
[23:09:59] heh
[23:10:03] competitors
[23:11:18] New patchset: Ottomata; "user_agent1.py - need to import Observation" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2241
[23:11:20] New patchset: Ottomata; "Reworking Observation so that it observes every combination of properties." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2242
[23:11:40] agh whyyy master
[23:11:49] we need to figure out git branches + gerrit workflow
[23:13:25] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2242
[23:13:59] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2241
[23:13:59] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2242
[23:14:00] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2241
[23:14:25] otto: indeed!
[23:22:55] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 1219 seconds
[23:28:53] New patchset: Asher; "upgrading db12" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2243
[23:29:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2243
[23:30:52] New patchset: Bhartshorne; "adding ganglia logtailer and a log tailing module to swift proxy servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2244
[23:31:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2244
[23:31:25] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2244
[23:31:26] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2244
[23:32:00] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2243
[23:32:01] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2243
[23:32:13] !log rebooting db12
[23:32:15] Logged the message, Master
[23:33:58] ottomata: when you do figure out git branches, http://opinionatedprogrammer.com/2011/01/colorful-bash-prompt-reflecting-git-status/ might help maintain sanity (if you don't have something like it already)
[23:34:35] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours
[23:41:42] New patchset: Bhartshorne; "whoops, copy/paste error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2245
[23:42:01] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2245
[23:42:01] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2245
[23:43:05] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 13 MB (0% inode=64%): /var/lib/ureadahead/debugfs 13 MB (0% inode=64%):
[23:44:45] !log db12 back up with lucid + current mysql
[23:44:47] Logged the message, Master
[23:45:55] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds
[23:46:26] maplebed: SwiftProxyLogtailer stuff is you, yes?
[23:46:32] yes.
[23:46:42] it should only be affecting the swift hosts.
[23:46:48] is it getting in your hair elsewhere?
[23:47:06] I'm getting cron emails from owa3 saying /bin/sh: ganglia-logtailer: not found
[23:47:20] bah.
[23:47:24] thanks; I'll fix it.
[23:47:35] np
[23:48:59] New patchset: Bhartshorne; "suppressing ganglia-logtailer messages until they're less spammy and specfiying full path because /usr/sbin is not in cron's search path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2246
[23:49:17] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2246
[23:49:17] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2246
[23:49:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2246
[23:50:35] nimish_g: the cronspam should stop now.
[23:50:56] (or there might be one more, then they'll stop, since I'm riding the 5m border and that's when cron runs)
[23:51:42] Reedy: 41:25 @ http://meetings-archive.debian.net/pub/debian-meetings/2010/debconf10/high/1345_1345_Conference_Video.ogv
[23:53:25] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 92 MB (1% inode=64%): /var/lib/ureadahead/debugfs 92 MB (1% inode=64%):
[23:57:51] excellent, thanks maplebed