[00:24:12] New patchset: Bhartshorne; "adding in country filters for mobile using new udp_filter framework" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2185 [00:24:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2185 [00:24:46] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2185 [00:24:47] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2185 [00:27:28] New patchset: Bhartshorne; "typo correction" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2186 [00:27:54] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2186 [00:27:55] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2186 [00:30:18] robla: re RT-2370, you can already see source through gerrit's gitweb stuff, [00:30:26] it's just that the links are small. [00:30:50] oh. [00:30:50] nm. [00:30:57] that's what you say in bugzilla. I misread the RT. [00:33:44] maplebed: let me know when you are ok with the thumbnail purging or if I find anything [00:48:40] New patchset: Asher; "fix innodb statistic collection (at long last), update more frequently, add query rate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2187 [00:57:41] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:05:41] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [01:05:41] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [01:06:45] AaronSchulz: I think you should start the deletion job; it might take a while. 
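[Editor's note: the patchset above adds country filters for mobile traffic using the udp_filter framework. As a rough illustration only — the field layout and the GeoIP lookup below are invented assumptions, not the real udp_filter code — filtering a udp2log-style request stream by country could be sketched as:]

```python
# Hypothetical sketch of a country filter over a udp2log-style request stream.
# NOT the real udp_filter implementation: the field positions, the target
# country set, and ip_to_country() are invented for illustration.

COUNTRIES = {"KE", "IN"}  # example target countries (assumption)

# Stand-in for a GeoIP database lookup (the real tool uses a GeoIP library).
FAKE_GEOIP = {"192.0.2.10": "KE", "198.51.100.7": "US"}

def ip_to_country(ip):
    """Map a client IP to a two-letter country code, '--' if unknown."""
    return FAKE_GEOIP.get(ip, "--")

def filter_lines(lines, countries=COUNTRIES, ip_field=4):
    """Yield only log lines whose client IP resolves to a target country.

    Assumes whitespace-delimited log lines with the client IP in
    column `ip_field` (an assumption, not the real log format).
    """
    for line in lines:
        fields = line.split()
        if len(fields) > ip_field and ip_to_country(fields[ip_field]) in countries:
            yield line
```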
[01:09:11] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.440 seconds [01:18:56] maplebed: ok [01:20:22] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2187 [01:20:22] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2187 [01:43:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:53:11] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 98 MB (1% inode=60%): /var/lib/ureadahead/debugfs 98 MB (1% inode=60%): [01:58:08] New patchset: Bhartshorne; "disabling 'check-apache' command for swift since it actually checks ssh, not apache." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2188 [01:58:26] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2188 [01:58:26] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2188 [01:58:26] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2188 [02:07:11] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.412 seconds [02:15:51] RECOVERY - Disk space on srv222 is OK: DISK OK [02:18:31] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1452s [02:26:31] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1932s [02:38:01] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 9s [02:41:11] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 19s [02:52:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:03:51] RECOVERY - Puppet freshness on spence is OK: puppet ran at Thu Feb 2 03:03:27 UTC 2012 [03:04:41] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours [03:22:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.630 seconds [03:32:53] PROBLEM - mysqld processes on db56 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [03:37:33] RECOVERY - Puppet freshness on mw8 is OK: puppet ran at Thu Feb 2 03:37:21 UTC 2012 [03:52:25] hi maplebed [04:16:10] RECOVERY - Disk space on es1004 is OK: DISK OK [04:18:40] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [04:44:20] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [08:51:51] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 682 seconds [09:12:00] the s2 active hosts are all very unhappy (dbs 13, 24, 54) [09:12:01] PROBLEM - MySQL Slave Delay on db13 is CRITICAL: CRIT replication delay 1339 
seconds [09:12:11] PROBLEM - MySQL Slave Delay on db54 is CRITICAL: CRIT replication delay 317 seconds [09:12:20] thanks nag. [09:14:15] I am getting some squid errors [09:14:16] Request: POST http://www.mediawiki.org/wiki/Special:Code/MediaWiki/110486, from 208.80.152.76 via sq61.wikimedia.org (squid/2.7.STABLE9) to () [09:14:16] Error: ERR_CANNOT_FORWARD, errno [No Error] at Thu, 02 Feb 2012 09:14:00 GMT [09:14:19] on www.mediawiki.org [09:14:23] that is the third time [09:14:53] yes, they are backed up because of the dbs [09:15:16] hashar, see #wikimedia-tech too [09:15:20] k thx [09:16:57] PROBLEM - LVS HTTP on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:22:47] RECOVERY - ps1-d2-pmtpa-infeed-load-tower-A-phase-Z on ps1-d2-pmtpa is OK: ps1-d2-pmtpa-infeed-load-tower-A-phase-Z OK - 1200 [09:22:47] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:47] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:25:17] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:26:27] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:27] PROBLEM - MySQL Replication Heartbeat on db54 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
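[Editor's note: the Seconds_Behind_Master alerts above come from a MySQL replication lag check. A minimal sketch of such a check follows — the thresholds and message format are assumptions, not the production Nagios plugin:]

```python
# Hedged sketch of a Nagios-style MySQL replication lag check.
# Thresholds and message wording are assumptions, not the production plugin.

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios exit states

def check_lag(seconds_behind_master, warn=30, crit=300):
    """Map replication lag to a Nagios state and message.

    `seconds_behind_master` is the value a slave reports (None when the
    slave is stopped), mirroring log lines like
    'CRITICAL - Seconds_Behind_Master : 1452s' above.
    """
    if seconds_behind_master is None:
        return CRITICAL, "CRITICAL - slave not running"
    if seconds_behind_master >= crit:
        return CRITICAL, "CRITICAL - Seconds_Behind_Master : %ds" % seconds_behind_master
    if seconds_behind_master >= warn:
        return WARNING, "WARNING - Seconds_Behind_Master : %ds" % seconds_behind_master
    return OK, "OK - Seconds_Behind_Master : %ds" % seconds_behind_master
```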
[09:27:57] RECOVERY - LVS HTTP on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 57953 bytes in 0.155 seconds [09:29:17] PROBLEM - Apache HTTP on srv247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:17] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:17] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:30:27] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:33:57] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [09:36:27] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.021 second response time [09:37:37] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [09:38:37] RECOVERY - MySQL Replication Heartbeat on db54 is OK: OK replication delay 0 seconds [09:40:27] RECOVERY - Apache HTTP on srv247 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [09:41:37] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [09:41:47] RECOVERY - Apache HTTP on mw14 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [09:42:07] RECOVERY - Apache HTTP on mw13 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [09:43:37] RECOVERY - MySQL Slave Delay on db54 is OK: OK replication delay 0 seconds [09:45:27] RECOVERY - MySQL Slave Delay on db13 is OK: OK replication delay 0 seconds [09:46:17] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds [09:48:07] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds [09:58:37] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 158 MB (2% inode=60%): /var/lib/ureadahead/debugfs 158 MB (2% inode=60%): [10:11:27] PROBLEM - Disk space on es1004 
is CRITICAL: DISK CRITICAL - free space: /a 446525 MB (3% inode=99%): [10:14:15] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 432545 MB (3% inode=99%): [10:22:55] RECOVERY - Disk space on srv223 is OK: DISK OK [10:27:41] New patchset: Mark Bergsma; "Discard mailman bounces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2189 [10:28:09] New patchset: Mark Bergsma; "Add removal support for system roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2175 [10:28:25] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2175 [10:28:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2175 [10:28:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2189 [10:28:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2189 [10:31:31] New patchset: Mark Bergsma; "It's :blackhole: instead of :discard:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2190 [10:31:59] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2190 [10:31:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2190 [11:17:18] New patchset: ArielGlenn; "rsyncd stanza for folks mirroring all public content from dumps.wm.o" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2191 [11:18:38] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2191 [11:18:39] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2191 [11:23:02] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [11:23:02] PROBLEM - Puppet freshness on lvs1003 is 
CRITICAL: Puppet has not run in the last 10 hours [11:27:20] New patchset: Mark Bergsma; "Add new role::cache::squid::upload class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2192 [11:27:37] New patchset: Mark Bergsma; "Rename service IPs to be consistent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2193 [11:27:53] New patchset: Mark Bergsma; "Add TODO" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2194 [11:28:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2193 [11:28:27] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2192 [11:28:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2192 [11:28:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2193 [11:28:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2193 [11:29:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2194 [11:29:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2194 [11:40:42] RECOVERY - MySQL slave status on es1004 is OK: OK: [11:48:16] New patchset: Hashar; "GraphViz needed for Jenkins installation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2196 [12:23:09] PROBLEM - Puppet freshness on mw65 is CRITICAL: Puppet has not run in the last 10 hours [12:59:49] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%): [13:07:34] New patchset: Mark Bergsma; "Make lvs::realserver support hashes as well as arrays for service IPs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2204 [13:07:51] New 
review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2204 [13:08:40] New patchset: Mark Bergsma; "Merge LVS changes from test into production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2204 [13:08:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2204 [13:09:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2204 [13:09:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2204 [13:11:09] RECOVERY - Disk space on srv221 is OK: DISK OK [13:17:58] New patchset: Mark Bergsma; "Use a dynamic lookup for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2205 [13:18:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2205 [13:18:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2205 [13:18:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2205 [13:24:03] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours [13:28:02] New patchset: Mark Bergsma; "Refactor, remove duplication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2206 [13:28:20] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2206 [13:29:22] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2206 [13:29:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2206 [13:33:37] New patchset: Mark Bergsma; "A hash is unordered, so sort values to avoid constant Puppet changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2207 [13:33:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2207 [13:34:32] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2207 [13:34:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2207 [13:43:20] New patchset: Mark Bergsma; "Remove old system roles, preparing for migration of role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2208 [13:43:23] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:43:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2208 [13:43:40] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2208 [13:43:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2208 [13:44:13] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:46] New patchset: Mark Bergsma; "Cleanup Squid monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2209 [13:55:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2209 [13:55:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2209 [13:59:02] did anyone push something recently? [13:59:13] I got [notice] child pid 7721 exit signal Segmentation fault (11) [13:59:14] messages [13:59:38] <^demon> RoanKattouw says that's known. [13:59:45] Tim knows about it [14:00:10] IIRC he said that PHP is segfaulting during shutdown, and I think it was related to wmerrors [14:00:35] Although I'm seeing a lot of new segfaults now, I thought they were supposed to have gone away [14:00:39] <^demon> wmerrors was disabled iirc. [14:00:54] I am just warning because it suddenly started to happen [14:00:56] (full disclosure, I'm running scap right now) [14:01:01] a segfault every second or so [14:01:04] ahh [14:01:08] so that must be scap :D [14:01:22] would explain why it happens from different apaches [14:06:19] OK, scap is done, no new segfaults [14:07:38] New patchset: Mark Bergsma; "Rename squid role caches and nagios groups for consistency with varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2210 [14:07:54] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2210 [14:08:08] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2210 [14:08:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2210 [14:18:25] New patchset: Mark Bergsma; "Migrate varnish cache::bits and cache::mobile classes to role::cache in role/cache.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2211 [14:18:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2211 [14:20:09] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2211 [14:20:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2211 [14:28:26] New review: Dzahn; "moved puppetmaster monitoring out of site.pp" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2166 [14:28:26] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2166 [14:30:27] New review: Dzahn; "yep, graphviz needed" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2196 [14:30:28] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2196 [14:36:32] this looks like an apache problem: https://bugzilla.wikimedia.org/34160 [14:37:15] for simple example, try: http://mingle.corp.wikimedia.org/projects/ [14:39:26] <^demon> Yeah, looks like a busted redirect. I think office IT manages that box though. [14:42:54] looks like somebody just copy/pasted an internal link into the post [14:43:54] <^demon> No, it works, there's just a broken redirect. [14:44:04] <^demon> Figured this out yesterday. [14:47:31] New review: Dzahn; "you are deleting nightly.css but still using it?" 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2174 [14:49:02] New review: Demon; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2174 [15:01:05] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=64%): /var/lib/ureadahead/debugfs 1 MB (0% inode=64%): [15:11:55] RECOVERY - Disk space on srv220 is OK: DISK OK [15:14:31] New patchset: Mark Bergsma; "Split ridiculously long lines in lvs.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2212 [15:14:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2212 [15:15:05] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2212 [15:15:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2212 [15:16:37] hi mark [15:16:45] hi [15:17:09] is there any news on the 'locke-replacement-department'? [15:17:16] no idea [15:17:19] what does the ticket say? :) [15:17:24] no idea [15:17:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.383 seconds [15:17:33] :D [15:18:14] no activity since last week when we opened it [15:18:20] time to prod ;) [15:18:25] prod who? [15:18:29] the ticket [15:18:36] rob was in eqiad yesterday [15:18:40] if nothing happened there, he likely forgot [15:18:58] can you remind him? [15:19:12] i'll ask, but best reply to the ticket too [15:20:25] in fact... 
[15:20:29] RobH: ^^ [15:20:39] there is no owner of the ticket yet [15:22:42] that doesn't really mean anything [15:23:05] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=64%): /var/lib/ureadahead/debugfs 1 MB (0% inode=64%): [15:25:23] New review: Hashar; "Please note nightly.css was *renamed* not deleted :" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2174 [15:27:13] New patchset: Mark Bergsma; "Remove old LVS services and old service IPs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2213 [15:28:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2213 [15:28:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2213 [15:31:09] New patchset: Mark Bergsma; "Remove old payments LVS service as well, rename the new one" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2214 [15:31:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2214 [15:31:47] New patchset: Mark Bergsma; "Qualify all $site references in lvs.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2215 [15:33:45] RECOVERY - Disk space on srv223 is OK: DISK OK [15:35:58] New patchset: Dzahn; "redirect http://irc.wikimedia.org to IRC meta page (currently "it works" page)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2216 [15:36:15] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2216 [15:37:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2214 [15:37:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2214 [15:37:33] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2215 [15:37:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2215 [15:38:43] robh: hi [15:39:00] heya [15:39:09] diederik: im lookin at your locke replacement ticket now [15:39:18] ok [15:39:19] trying to figure out if I have hardware already we can use [15:39:47] anyone know anything about OTRS upgrades? [15:40:00] i received the drives for db13..thx do u have time to help me w/it today? [15:40:18] !log Running apt-get update && apt-get dist-upgrade && reboot on lvs2 [15:40:20] Logged the message, Master [15:40:27] robh: i guess it's a middle of the road server [15:40:37] maybe a bit more harddrive space is useful [15:40:43] but memory is not a super big thing [15:41:05] but mark might have clearer ideas on this [15:42:52] Well, I think our stock misc server would work, except for the lack of hard disk [15:43:20] the stock misc servers are 500 GB, which are not large enough [15:43:31] cmjohnson1: lemme take a look at ticket [15:46:09] hexmode, in what sense? [15:46:34] Reedy: in the sense of being able to upgrade it [15:47:57] Reedy: There are a couple of RT tickets open for it [15:48:00] Jeff_Green was doing some work on it for something... I don't know if it's to the extent of trying to upgrade it... [15:48:07] RobH: what are the ocg boxes doing in the squid vlan? [15:48:27] * hexmode goes all hound-dog on Jeff_Green  [15:48:31] whut [15:48:38] otrs [15:48:41] Jeff_Green: OTRS! [15:48:46] double whut. [15:48:49] can you upgrade it? 
[15:49:20] i can try, but I warn you their official estimate according to philippe is that an upgrade for us will take a year [15:49:38] i lol about that, but at the same time I wonder why they think it will not be straightforward [15:51:00] Jeff_Green: maybe b/c we don't have anyone in OPs working on it and the RT ticket has been open a year? [15:51:12] * hexmode guesses at the time from the ticket # [15:51:30] very funny, but no. [15:51:38] heh [15:51:42] It's perl and otrs [15:51:51] We've no sadists on staff [15:52:02] perl! I love perl! [15:52:04] philippe consulted with the OTRS folks about upgrading--i.e. contracting them, and they said to anticipate a year-long process [15:52:11] cmjohnson1: checking db13 now [15:52:12] I have no problem with perl, you can't blame perl [15:52:18] heh [15:52:19] it is my favorite practical/functional language [15:52:37] ?? [15:52:54] OTRS is looking for weird ways to make money [15:52:55] I'm going to donate to the perl foundation just to annoy all you perl haters [15:53:08] hexmode: maybe. [15:53:11] cmjohnson1: So I show: Group 6, Segment 0 : Inconsistent (0,23) [15:53:31] okay [15:53:31] that is drive 23 in the controller, it should have an orange LED lit [15:53:40] Jeff_Green: I'd chip in but the Perl 6 saga is frustrating after 10 years [15:53:45] anyway, I'm finally at a place where the idea of being involved with an otrs upgrade doesn't make me want to kill myself, but yeah it's not trivial [15:54:02] Jeff_Green: if you dont wanna kill yourself and others, you missed an important step ;] [15:54:04] hexmode: yeah, fair enough. but all the other languages will get there too I figure [15:54:12] robh: no led on a specific drive...i will have to look at diagram [15:54:40] python nextgen and PHP 6 have similar stories [15:54:46] languages mature to the point where you're better off starting a new one rather than improving the old one [15:54:58] except lisp! 
[15:55:05] lisp is always awesome [15:55:30] that's actually one of the things I like about perl pre 6. it is what it is, it works, it doesn't have a security crisis every other day [15:55:39] !log Running apt-get update && apt-get dist-upgrade && reboot on lvs1 [15:55:41] Logged the message, Master [15:55:52] cmjohnson1: hrmm, odd that no drive is lit [15:56:41] cmjohnson1: argh, the 23 means nothing, its in slot 13 i think, still confirming [15:57:31] cmjohnson1: Ok, reading this, it appears that drive 15 is the dead drive [15:57:39] which should be in the bottom corner, but please confirm [15:58:59] ^demon|away: i see it breaks the redirect, but even if i fix it, it's still an internal link with a login on a public blog [15:59:28] robh: the drive location is top right corner [15:59:33] by "fix it" i just meant changing the URL in my browser :p [16:00:47] hexmode: bz 34160 should be an Office IT thing, afaik yep [16:00:53] cmjohnson1: is it blinking now? [16:00:59] cmjohnson1: i told it to blink the dead drive [16:01:44] robh: strangely enough all the drives but drive 15 are blinking ....as though they are in use except 15 [16:02:35] ok, well, command failed, oh well [16:02:37] heh [16:02:45] mutante: k [16:02:45] since the rest are blinking, i think we know the dead one then [16:03:03] okay....fyi http://docs.oracle.com/cd/E19121-01/sf.x4140/820-2394-13/overview.html#0_65739 [16:03:06] cmjohnson1: so log before you pull it, and then go ahead and swap it out and lets see if it fixes it. 
real fast though, lemme check to ensure its not a master [16:03:15] okay....let me know [16:03:36] http://noc.wikimedia.org/dbtree/ is new tool asher made [16:03:52] so we can see db13 is a slave for s2 [16:04:00] which is fine, i just get nervous when we work on the masters [16:04:10] so go ahead and log and swap with the system staying online [16:04:18] (all good on this end) [16:04:32] k [16:04:56] * hexmode lols @ [[User:Mutante]] [16:05:03] "I love User Boxes" [16:05:12] yes, we can all see that [16:05:12] !log replacing disk 15 on db11 [16:05:15] Logged the message, Master [16:05:50] New review: Dzahn; "re: asterisk in : ah,ok so Apache will expand that? re: renamed file: yes, i see. nevermi..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/2174 [16:05:52] yeah I used that tool today to see which cluster was affected by the three dbs being whacked [16:06:09] New patchset: Mark Bergsma; "Migrate esams text squids to new role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2217 [16:06:18] !log disk 15 swap complete on db11 [16:06:19] Logged the message, Master [16:06:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2217 [16:07:55] hexmode: "This user is ambivalent about self-referential userboxes." 
:) [16:07:56] New patchset: Mark Bergsma; "Migrate esams squids to new role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2218 [16:08:05] robh: please verify the new disk..thx [16:08:18] cmjohnson1: [16:08:19] Device #15 [16:08:19] Device is a Hard drive [16:08:20] State : Rebuilding [16:08:24] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2217 [16:08:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2217 [16:08:31] Group 6, Segment 1 : Rebuilding (0,23) 30V1T30W 3PD1T30W [16:08:32] thx [16:08:36] looks good =] [16:13:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2218 [16:13:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2218 [16:21:00] New patchset: Mark Bergsma; "Migrate remaining squids to new role classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2219 [16:21:57] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2219 [16:21:57] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2219 [16:24:21] New patchset: Mark Bergsma; "Revert "Migrate remaining squids to new role classes"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2220 [16:24:39] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2220 [16:24:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2220 [16:24:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2220 [16:24:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2220 [16:26:35] New patchset: Mark Bergsma; "Revert "Revert "Migrate remaining squids to new role classes""" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2221 [16:26:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2221 [16:26:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2221 [16:26:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2221 [16:28:11] New patchset: Mark Bergsma; "Let's not convert live text squids into upload squids" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2222 [16:28:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2222 [16:28:33] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2222 [16:28:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2222 [16:29:29] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2174 [16:29:30] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2174 [16:32:47] mark, apergos: Would it perhaps be a good idea to give the scalers a larger root partition? It's 7GB currently and they're out of space all the time because GhostScript creates 100MB files in /tmp [16:33:00] (I checked, they're not even old. 
The oldest ones I found were ~20 mins old) [16:33:06] that means they shouldn't be using the root partition at all [16:33:21] I know they aren't old [16:33:27] I updated that cron script recently [16:33:32] that should be a dedicated partition [16:38:06] New patchset: Hashar; "fix nightly.css relative path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2223 [16:42:35] New review: Dzahn; "done:)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2223 [16:42:35] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2223 [16:47:43] RECOVERY - RAID on db13 is OK: OK: 1 logical device(s) checked [16:48:32] quick question for ops folks: do we detect mobile devices in varnish? (and redirect them to the mobile site) [16:48:48] in squid, yes [16:49:15] New patchset: Hashar; "fix up nightly.css" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2224 [16:49:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2224 [16:51:32] New review: Dzahn; "looks good, not using IndexStyleSheet" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2224 [16:51:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2224 [16:56:03] PROBLEM - Host text.pmtpa.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:56:11] New patchset: Mark Bergsma; "Cleanup: remove obsolete old role classes text-squid and upload-squid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2225 [16:56:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2225 [16:57:44] ok, i just got paged and emailed about text [16:57:54] old service ip, not a problem [16:57:58] me too! [16:58:04] heh, was gonna say, it looks ok... [16:58:07] its working ;] [16:58:09] hey look the paging system works! 
[16:58:13] :-P [16:58:13] no it's not working [16:58:19] ? [16:58:21] but that doesn't matter since it's been unused for 6 months ;) [16:58:29] the paging system? [16:58:29] heh [16:58:32] I disabled 208.80.152.2 today [16:58:33] it just paged us [16:58:34] WHAT'S GOING ON [16:58:39] we don't have an email alias that will page all the same people regular pages go to, do we? [16:58:42] so I guess it is working :-P [16:58:43] and removed it from nagios [16:58:55] mark: how is squid doing mobile detection? using wurfl? [16:58:55] eg to send a note to everybody that got paged but isn't here to say "don't worry, it's cool." [16:59:03] hi maplebed! [16:59:06] maplebed: ops, I guess [16:59:09] i'll do that now [16:59:15] mark: but that doesn't page. [16:59:26] can we enable the final country filter ? [17:00:33] diederik: emery looks ok, so yes. [17:00:49] awesome! [17:00:58] i also checked it and it's still running [17:01:17] * binasher deletes site_stats from enwiki to further along the fire drill [17:01:51] mark: but hey! people really are getting much better about responding to pages :) [17:01:56] maplebed: now I think of it [17:01:59] I think you can do that using nagios [17:02:55] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2225 [17:02:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2225 [17:03:24] diederik: squid mobile detection is based on useragent regexes that were generated by inspecting wurfl xml. i think they provided around 90% coverage of wurfl at the time of implementation [17:03:52] maplebed: yea, could use Nagios acknowledgement and configure it to page [17:04:11] binasher: cool, i just stumbled upon this: http://www.enrise.com/2011/02/mobile-device-detection-with-wurfl-and-varnish/ and it seemed relevant [17:06:02] diederik: yep, definitely. 
preilly has done some work on a c api to wurfl as well, that we may integrate with varnish one day [17:06:13] sweet! [17:06:51] the implementation in that blog post doesn't look great [17:07:39] no, i don't think that is a very scalable implementaton either but i didn't know it was possible in the first place (which says a lot about my knowledge) [17:07:56] maplebed: or actually, it might be to put a host into scheduled downtime as a reaction, and configuring "scheduled downtime started" as notification_options: Valid options are a combination of one or more of the following: d = send notifications on a DOWN state, u = send notifications on an UNREACHABLE state, r = send notifications on recoveries (OK state), f = send notifications when the host starts and stops flapping, and s = send notifi [17:08:14] New patchset: Mark Bergsma; "Migrate swift cluster role classes into role/swift.pp with role::prefix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2226 [17:08:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2226 [17:08:43] PROBLEM - Backend Squid HTTP on knsq25 is CRITICAL: Connection refused [17:09:04] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/2226 [17:09:18] mutante, mark: what I mean is that it's useful to be able to send information (as a human) in the situation when pages are going out and people are on their way to getting net access. [17:09:29] maplebed: yeah [17:09:34] poking nagios with downtimes or sending ack pages and all is still useful, but not the same. 
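[editor's note] The user-agent approach binasher describes above (regexes generated from WURFL data, ~90% device coverage at implementation time) can be sketched minimally. The pattern list below is purely illustrative, not the actual squid ACL, and the function name is hypothetical:

```python
import re

# Illustrative subset of mobile user-agent tokens; the production squid
# rules were generated by inspecting WURFL XML and were far more complete.
MOBILE_UA = re.compile(
    r"(iPhone|iPod|Android|BlackBerry|SymbianOS|Windows CE|PalmOS|Opera Mini)",
    re.IGNORECASE,
)

def is_mobile(user_agent: str) -> bool:
    """Return True if the User-Agent looks like a mobile device."""
    return bool(MOBILE_UA.search(user_agent))
```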
[17:09:39] I believe you can send out arbitrary comments in the nagios web interface [17:09:48] maplebed: putting it into scheduled downtime would be a human action via web ui [17:09:53] PROBLEM - Puppetmaster HTTPS on virt0 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 403 Forbidden [17:10:38] maplebed: whenever you're ready, change 2226 is awaiting your review [17:11:24] mutante: it still conflates two equally possible messages. (1) shit's broken and I need nagios to shut up or (2) it's a false alarm and the rest of you that got a page should go back to sleep. Though if we can send arbitrary pages with the nagios UI as mark suggests, we can add a note to clarify... [17:11:28] comments should also work, but not sure if you can page on them easily, because its not in notification_options [17:12:01] I know I would prefer to just have an email address "page-ops@" that I can say what is necessary and not have to spend precious could-be-fixing time fighting with nagios and hoping it does what I want. [17:12:44] well that would have issues too [17:12:47] we would need to guard it against spam [17:12:55] and of course only the first 140 chars in your mail would get sent, or something [17:13:02] and imagine that being a mime header or whatever ;) [17:13:08] it's possible, but not trivial [17:13:22] mark: re: change 2226, you're basically just moving that stuff to a new file, right? 
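[editor's note] The notification_options flags mutante quotes above map onto a Nagios object definition roughly like this (illustrative fragment; the contact name and notification command are placeholders, and `s` is the flag that would fire when a host enters or leaves scheduled downtime):

```
define contact {
    contact_name               page-ops
    host_notification_period   24x7
    # d = DOWN, u = UNREACHABLE, r = recovery (OK), f = flapping,
    # s = scheduled downtime starts/stops
    host_notification_options  d,u,r,s
    host_notification_commands host-notify-by-sms
}
```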
[17:13:29] yeah and a slight rename of classes [17:13:31] i would say both cases are kind of "Acknowledgement", which has a comment field [17:13:36] dunno if this is helpful but at CL we ended up hacking a web-based [MUTE] button into our paging system, which would email a reminder ever so often while it was active [17:13:40] mark: in my experience those email addresses don't get a lot of spam (actually not any) [17:13:42] for consistency [17:14:02] actually [17:14:06] I see a problem with it already [17:14:11] inherits swift-cluster::base [17:14:16] I renamed swift-cluster::base [17:14:17] let me amend [17:14:21] k. [17:14:47] heh [17:14:50] this was broken before already [17:14:54] it's calling swift::base [17:14:59] that was always called swift-cluster::base [17:15:05] perhaps that's why inheritance wasn't working? [17:15:22] ah, no [17:15:26] that's calling into swift.pp... [17:15:30] at Linden each opsen had an email address "page-" and there was a "page-ops". They worked exceedingly well and were very often useful and I honestly can't remember ever getting spam to the addresses. [17:15:48] obviously we have more exposure here, but with minor obfuscation of the addresses, I'd bet it would be fine. [17:16:25] yeah we did that too, totally useful [17:16:47] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.909 seconds [17:16:48] you could write a wrapper for the pager script nagios uses as well [17:17:10] we did that too, when we used big brother :-P [17:17:17] New patchset: Mark Bergsma; "Migrate swift cluster role classes into role/swift.pp with role::prefix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2226 [17:17:29] I probably have that script somewhere, it did time- and perl regex filtering [17:17:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2226 [17:17:39] tea time. brb. [17:18:41] oh tea is a very good idea. 
i'm going to copy you. [17:18:44] pager script wrapper ++ [17:19:43] we could also do some reply processing [17:19:47] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=64%): /var/lib/ureadahead/debugfs 0 MB (0% inode=64%): [17:20:23] reply ack to acknowledge in nagios, or "msg …" to send a message to everyone else on the page [17:20:44] (we also had an IRC bot that could both page people and accept return pages and echo them back to the channel in which the person was paged... it was awesome.) [17:20:47] echo "go back to sleep" | /usr/local/bin/gammu-smsd-inject TEXT $CONTACTPAGER$ [17:26:00] does nagios have a reasonable API that would allow poking it from email/sms/ircbot or would we need to hack that too? [17:26:20] it's written in perl, isn't it [17:26:24] >;-) [17:26:27] no clue! [17:26:37] or was it C [17:26:42] its hacking that status.dat file afaik [17:26:45] i was guessing something evil like ruby [17:26:52] definitely not ruby [17:26:57] or it would have died a long time ago [17:27:00] ha [17:27:14] http://mathias-kettner.de/checkmk_livestatus.html [17:27:53] Jeff_Green: email-to-nagios scripts i've seen wrote directly to the same nagios socket that the web page uses [17:28:37] ah, that's not bad actually [17:29:13] this actually sounds like a fun project, sorta [17:29:34] notpeter: ok, cp1002 can be reimaged [17:29:39] and puppet can be run on all [17:29:40] mark: ok [17:29:46] yay! [17:29:48] we can put it in production tomorrow or next week [17:29:52] use role::cache::text [17:29:56] awesome [17:30:16] mark is on steroids today ;-) [17:30:32] looking at the amt of check-ins [17:30:40] someone woke me up too early this morning [17:30:56] but I guess that means... i'm already done for the day [17:30:57] RECOVERY - Disk space on srv219 is OK: DISK OK [17:31:04] maplebed: shall we configure the final country filter? 
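[editor's note] Jeff_Green's point above about writing to "the same nagios socket that the web page uses": Nagios accepts external commands as timestamped lines written to its command file (a named pipe). A minimal sketch, assuming the stock Debian nagios3 FIFO path; host and author values are examples:

```python
import time

# Default Debian nagios3 command FIFO; adjust for the local install.
CMD_FILE = "/var/lib/nagios3/rw/nagios.cmd"

def ack_host_problem(host, author, comment, now=None):
    """Build an ACKNOWLEDGE_HOST_PROBLEM external command line.
    sticky=2 (ack until recovery), notify=1, persistent=0."""
    ts = int(time.time() if now is None else now)
    return "[%d] ACKNOWLEDGE_HOST_PROBLEM;%s;2;1;0;%s;%s" % (
        ts, host, author, comment)

def submit(command, path=CMD_FILE):
    # Each write to the FIFO is one command line terminated by newline.
    with open(path, "w") as fifo:
        fifo.write(command + "\n")
```

An email-to-nagios or IRC-bot gateway would just parse the incoming message and call `submit(ack_host_problem(...))`.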
[17:31:07] ya inoticed [17:31:25] nah it wasn't the outage [17:31:26] diederik: yeah, sorry, got distracted by my need to brew tea. I'll throw it in in just a sec. [17:31:35] awesome! [17:31:36] initially i thought it was a false alarm cos i could get to the wikipedias [17:31:44] speaking of tea, i need my coffee :D [17:31:48] I was already well awake and working way before that [17:31:51] so that wasn't so bad ;) [17:32:22] <^demon|away> mutante: The blog post suggested using "guest/guest" as the login creds :p [17:33:27] mark - next week, maplebed is putting swift thumbnails into production (1/256 of traffic) [17:33:27] ^demon|away: heh,ok:) [17:33:37] !log reimaging cp1002 and imaging cp1001 and cp1003-1020 [17:33:39] Logged the message, and now dispaching a T1000 to your position to terminate you. [17:33:50] so maybe squids goes in tomorrow? [17:33:57] yeah I know [17:34:04] squid is unrelated to thumbnails [17:34:06] or middle next wek [17:34:09] we're just doing that for text [17:34:13] we wanna use varnish in front of swift [17:34:19] but swift is in pmtpa [17:34:20] (in eqiad) [17:34:22] and squid is there [17:34:27] and can be used for it [17:34:31] ok [17:34:33] so, unrelated [17:34:47] but yeah, I might deploy squid in eqiad tomorrow [17:34:59] or not, depending on how much I wanna change to it still and considering it's friday ;) [17:35:00] okay dokay [17:35:05] not that it's very risky [17:35:22] if not, early next week [17:37:25] sfo folks will be on standby if tomorrow then [17:37:35] notpeter: obviously, if you see any weird puppet issues on any of the new squids, don't run puppet on the rest [17:37:37] New patchset: Bhartshorne; "adding in the final country filter for mobile for diederik" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2228 [17:37:44] as they might need reimaging otherwise ;) [17:37:55] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2228 [17:38:03] mark: yea, I'm going to redo just cp1002 and see what happens [17:38:10] oki [17:38:12] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2228 [17:38:13] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2228 [17:38:14] see you tomorrow then [17:38:36] mark: cool. have a good evening! [17:40:52] diederik: going live now. [17:41:05] cool [17:43:57] PROBLEM - Host cp1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:45:25] diederik: what's your plan for sending udplog data to additional hosts? i want to make sure one of my hosts gets in on the action [17:46:35] diederik: 2TB total storage is enough? [17:46:51] we are going to replace locke soon and that replacement will become a proxy that will do multicast [17:46:57] This will be the logging host for the next couple years I imagine, I am fine with trying for that or higher [17:47:06] so will order 4 1TB disks at minimum [17:47:06] more tb is better :D [17:47:15] and if I can get 1.5 or 2 I will [17:47:22] its a logging host, so i imagine raid10 is overkill [17:47:27] we can do a raid5 for a lot more space [17:47:37] i think, not sure what those controllers can really do [17:47:53] raid5 is often (though not always) much slower than raid10. [17:48:35] slower on writes for sure, but should be comparable on reads [17:48:47] maplebed: yea, hugely slower on writes [17:48:57] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:49:00] but logs are not really huge, just constant, so i dunno if its an issue [17:49:03] can we do RAID6 though? it's nice to have the extra parity [17:49:05] sweet, 2tb disks are cdheap now [17:49:09] 141 for non green version [17:49:14] (I say not always because Dell used to have some raid cards that, for whatever reason, were actually faster with raid5 than 10. 
don't ask me why.) [17:49:20] the green version disks are crappy in my experience in raids [17:49:44] maplebed: it can be faster for reads because you get to involve more spindles for reads [17:49:45] yea there are always outliers but i concur that raid5 is usually a lot slower on writes [17:50:05] i mean, implementation aside [17:50:19] diederik: so 2tb disks are cheap, I am going to order 4 of them, so you will have at minimum 4TB [17:50:24] if we raid10 [17:50:36] sounds sweet and awesome! [17:51:03] RobH: do our cards support raid6? [17:51:24] not sure, goign to check in a moment, but i doubt it on the misc servers [17:51:29] the r510 h700 cards do [17:51:34] ic [17:51:40] but this is a misc server, since logging is low power [17:51:44] just needs lots of storage [17:51:56] right, makes sense [17:52:11] fwiw, emery and locke are currently using about 500G each. [17:52:39] can I throw some data there? i've got fundraising logs galore :-P [17:53:11] emery and locke are currently bound by CPU, not disk. [17:54:48] Jeff_Green: you need a new logging host? [17:54:57] RobH: yeah eventually [17:55:06] whats the current one, emery? [17:55:19] it's sorta complicated, sec [17:55:22] i imagine we want the fundraising data to go to a different server than other logs right? [17:55:38] or does that not matter? [17:56:44] ok the situation is this . . . we pull a bunch of banner impression via udp2log [17:57:29] currently they're stored long-term on storage3 because that was the only fr host with adequate space [17:57:29] New patchset: ArielGlenn; "up bw limit for rsyncers on downloads.wm.o" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2229 [17:57:45] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2229 [17:58:02] there they're parsed and processed into faulkner's fundraising-analytics database [17:58:16] but we still want to keep the raw logs long term [17:58:27] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2229 [17:58:28] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2229 [17:58:37] well, if they are huge, we may want to allocate something larger than just a misc server like we are for diederik's logs [17:58:39] RT1917 has details [17:58:41] which are not huge [17:58:43] write throughput goes down the drain on raid5 and 6 during a rebuild, which can take a long time on 2TB drives [17:59:15] write throughput might be important if a much greater % of log lines are written to disk, which i think is going to happen [17:59:22] RobH: yeah, agreed--we're collecting ~2TB/year [17:59:40] binasher: so i guess we should go for raid10 on this particular locke replacement [17:59:48] since locke itself is raid10 as well iirc [18:00:01] i am checking the controller to see if can do anything but that anyhow [18:00:33] there's been some analysis discussed that wouldn't work with random sampling and needs more log context than could likely be computed in memory by a filter [18:00:41] hrmm, crappy controller does r1 or r0, but not nested automatically, i assume it can do the raid1 then 0 the two together [18:00:51] so raid5/6 are off table anyhow [18:01:02] for this server, unless we get a new controller which is overkill for needs [18:01:12] so 4 2TB disks = 4TB overall [18:03:41] ottomata, [18:03:54] hiya [18:03:57] sorry [18:04:04] my little sister was pressing keys [18:04:06] haha [18:04:32] so o, tab, enter... 
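[editor's note] The capacity side of the RAID debate above reduces to simple arithmetic (write performance and rebuild behavior, discussed above, are separate concerns):

```python
def usable_tb(n_disks, disk_tb, level):
    """Rough usable capacity for common RAID levels.
    Capacity only; ignores controller overhead and hot spares."""
    if level == "raid0":
        return n_disks * disk_tb            # no redundancy
    if level == "raid10":
        assert n_disks % 2 == 0
        return n_disks // 2 * disk_tb       # mirrored pairs, striped
    if level == "raid5":
        return (n_disks - 1) * disk_tb      # one disk of parity
    if level == "raid6":
        return (n_disks - 2) * disk_tb      # two disks of parity
    raise ValueError(level)
```

For the 4 x 2TB order discussed above: RAID10 gives 4TB usable, RAID5 would give 6TB, RAID6 4TB.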
[18:05:19] i could see us wanting to go to ES class servers for log storage at some point [18:05:49] mark: I know you're gone, but cp1002 came up beautifully, so I'm going to run puppet on the rest [18:06:08] !log doing initial run of puppet on cp1001-1020 [18:06:10] Logged the message, and now dispaching a T1000 to your position to terminate you. [18:10:58] mark: can you tell me something about what bits should go in role/swift.pp vs. what goes in manifests/swift.pp? [18:11:12] I'm adding ganglia metrics (for object count, get rate, put rate, 404 count, etc.) [18:12:13] diederik: Ok, i put in the ticket info and escalated it for approvals on purchase. once thats done (today or tomorrow) then it will ship [18:12:24] i imagine we will have your replacement locke online by end of next week [18:12:44] thanks so much! [18:13:00] maplebed: basically, anything that either ties multiple services into one "host role", or anything that's very wikimedia specific should go into the role classes [18:13:02] as a guideline [18:13:13] or role specific... [18:13:24] so service manifests should be fairly generic and/or configurable [18:13:36] (like puppet modules, if we're ever gonna use those, this is in line with that) [18:17:18] food.. [18:26:31] New patchset: Dzahn; "simple SMS pager script as a starter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2230 [18:27:21] !change 2230 | maplebed [18:27:21] maplebed: https://gerrit.wikimedia.org/r/2230 [18:28:26] cute! [18:28:30] I like that bot response. [18:29:07] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/2230 [18:29:42] New patchset: Dzahn; "simple SMS pager script as a starter, missed a /" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2230 [18:29:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2230 [18:36:45] has anyone looked at the php remote code execution vulnerability? [18:41:43] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:53:33] RECOVERY - Host cp1002 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [19:16:48] New patchset: Asher; "fix string formatting of mysql version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2231 [19:17:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2231 [19:37:41] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 5.761 seconds [19:43:03] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2231 [19:43:04] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2231 [19:52:31] PROBLEM - Backend Squid HTTP on cp1003 is CRITICAL: Connection refused [19:52:31] PROBLEM - Frontend Squid HTTP on cp1006 is CRITICAL: Connection refused [19:52:31] PROBLEM - Backend Squid HTTP on cp1009 is CRITICAL: Connection refused [19:52:31] PROBLEM - Backend Squid HTTP on cp1015 is CRITICAL: Connection refused [19:52:31] PROBLEM - Frontend Squid HTTP on cp1012 is CRITICAL: Connection refused [19:52:32] PROBLEM - Frontend Squid HTTP on cp1018 is CRITICAL: Connection refused [19:56:21] PROBLEM - Backend Squid HTTP on cp1010 is CRITICAL: Connection refused [19:56:31] PROBLEM - Frontend Squid HTTP on cp1007 is CRITICAL: Connection refused [19:56:41] PROBLEM - Frontend Squid HTTP on cp1019 is CRITICAL: Connection refused [19:56:41] PROBLEM - Backend Squid HTTP on cp1004 is CRITICAL: Connection refused [19:57:11] PROBLEM - Frontend Squid HTTP on cp1001 is CRITICAL: Connection refused [19:57:21] PROBLEM - Frontend Squid HTTP on cp1013 is CRITICAL: Connection refused [19:58:31] PROBLEM - Backend Squid HTTP on cp1016 is 
CRITICAL: Connection refused [19:58:31] PROBLEM - Frontend Squid HTTP on cp1009 is CRITICAL: Connection refused [19:58:31] PROBLEM - Frontend Squid HTTP on cp1020 is CRITICAL: Connection refused [19:58:31] PROBLEM - Backend Squid HTTP on cp1011 is CRITICAL: Connection refused [19:58:41] PROBLEM - Frontend Squid HTTP on cp1003 is CRITICAL: Connection refused [19:58:41] PROBLEM - Backend Squid HTTP on cp1012 is CRITICAL: Connection refused [19:58:51] PROBLEM - Backend Squid HTTP on cp1006 is CRITICAL: Connection refused [19:58:51] PROBLEM - Frontend Squid HTTP on cp1002 is CRITICAL: Connection refused [19:59:01] PROBLEM - Backend Squid HTTP on cp1017 is CRITICAL: Connection refused [19:59:12] PROBLEM - Frontend Squid HTTP on cp1014 is CRITICAL: Connection refused [19:59:31] PROBLEM - Frontend Squid HTTP on cp1008 is CRITICAL: Connection refused [20:00:11] PROBLEM - Backend Squid HTTP on cp1005 is CRITICAL: Connection refused [20:00:11] PROBLEM - Frontend Squid HTTP on cp1004 is CRITICAL: Connection refused [20:00:11] PROBLEM - Backend Squid HTTP on cp1001 is CRITICAL: Connection refused [20:00:11] PROBLEM - Frontend Squid HTTP on cp1010 is CRITICAL: Connection refused [20:00:11] PROBLEM - Backend Squid HTTP on cp1013 is CRITICAL: Connection refused [20:00:21] PROBLEM - Backend Squid HTTP on cp1007 is CRITICAL: Connection refused [20:00:31] PROBLEM - Backend Squid HTTP on cp1019 is CRITICAL: Connection refused [20:00:41] PROBLEM - Frontend Squid HTTP on cp1016 is CRITICAL: Connection refused [20:02:11] PROBLEM - Frontend Squid HTTP on cp1015 is CRITICAL: Connection refused [20:02:31] PROBLEM - Backend Squid HTTP on cp1018 is CRITICAL: Connection refused [20:03:11] PROBLEM - Backend Squid HTTP on cp1002 is CRITICAL: Connection refused [20:03:21] PROBLEM - Backend Squid HTTP on cp1014 is CRITICAL: Connection refused [20:03:42] PROBLEM - Backend Squid HTTP on cp1020 is CRITICAL: Connection refused [20:03:51] PROBLEM - Frontend Squid HTTP on cp1017 is CRITICAL: Connection 
refused [20:04:01] PROBLEM - Frontend Squid HTTP on cp1005 is CRITICAL: Connection refused [20:04:01] PROBLEM - Frontend Squid HTTP on cp1011 is CRITICAL: Connection refused [20:05:02] PROBLEM - Backend Squid HTTP on cp1008 is CRITICAL: Connection refused [20:06:07] notpeter: the sky is falling! [20:06:11] not really. [20:06:42] :-D [20:09:20] {98c0-4d-11-9a66: you might want to change that password ;) [20:09:40] RobH: the new caching config doesn't get the squid class... [20:19:12] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.128 seconds [20:22:04] New patchset: Diederik; "Added support for not having to define a filter, (the -f option)." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2232 [20:23:34] New patchset: Diederik; "Updated the documentation with the new -f or --force option." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2233 [20:23:36] New patchset: Diederik; "Updated control file to work on emery server." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2234 [20:23:38] New patchset: Diederik; "Fixed link." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2235 [20:57:00] New patchset: Pyoungmeister; "squid class not getting included for some reason. maybe this is a workaround?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2236 [20:57:18] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2236 [20:58:34] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2236 [20:58:35] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2236 [21:06:45] !log dataset1001 is alive, mostly [21:06:47] Logged the message, RobH [21:16:39] New patchset: ArielGlenn; "add dataset1001 to the dataset2 stanza in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2237 [21:16:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2237 [21:17:37] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2237 [21:17:37] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2237 [21:25:46] New patchset: ArielGlenn; "and fix the expression for dataset2/1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2238 [21:26:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2238 [21:26:26] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2238 [21:26:26] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2238 [21:36:07] should I log large file uploads via importImages.php to SAL? [21:36:56] ask Reedy ;) [21:37:03] We don't usually bother [21:37:06] k [21:37:41] Eloquence: i think last i saw your mail sig has a cleartext donate link. may want to SSL it now that we have the nginx boxen [21:37:51] jeremyb, heh fair enough [21:58:49] Eloquence: if it's just the 5gb no [21:59:27] yeah, just that. thanks. 
[21:59:37] as long as it has been chatted about in the channel, either here or wikitech-l, that's about it [21:59:41] er [21:59:44] wikimedia-tech [22:00:07] ok. uploaded the first one, scping the rest of them now and will import shortly. [22:13:35] TimStarling: hello [22:13:42] hi [22:14:08] have you given the worms further consideration? [22:14:27] not really [22:17:18] TimStarling, have you seen this issue before? http://commons.wikimedia.org/wiki/File:The_MediaWiki_Web_API_and_How_to_use_it_-_San_Francisco_Wikipedia_Hackathon_2012.ogv "Invalid ogg file: Cannot decode Ogg file: Invalid page at offset 652021853" [22:17:27] it's a 2.6G ogg theora video [22:17:32] seems to play fine locally [22:18:41] actually, hmgrl, looks like importImages is lying and it's not fully imported for some reason [22:20:01] we haven't done much testing on ogg files of that size [22:20:24] http://en.wikipedia.org/wiki/File:Floater_-_Burning_Sosobra_-_Exiled_-_sample.ogg [22:20:33] 348 KB with the same error [22:20:44] but there's a good chance that it's telling the truth and there really is an invalid page at that offset [22:20:55] yeah, the file is cut off [22:20:57] and maybe your local video player resyncs [22:21:20] should I just delete it and run importImages.php again, or is there some hard size limit I'm hitting here? [22:22:11] I don't think there's a hard limit [22:22:17] debug/verbose flag next time? [22:22:23] just keep an eye on the memory usage of the maintenance script [22:22:44] a previous version of OggHandler used quite a lot of memory for large ogg files, but that should be fixed now [22:26:24] !log pulled db24 from s2, preparing to upgrade to lucid [22:26:26] Logged the message, Master [22:29:43] mem usage is going pretty high for that process .. 24766 erik 20 0 2721m 1.4g 1.3g D 5 35.4 0:03.24 php and growing [22:33:33] got it. second time's the charm, apparently. 
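[editor's note] A cut-off upload like the one above can be caught before import with a cheap check: every Ogg page starts with the capture pattern "OggS", and the final page of a well-formed stream sets the end-of-stream flag (0x04) in the header_type byte at offset 5 within the page. A rough sketch (a heuristic, not a full parser; multiplexed files carry one EOS page per logical stream, so this only checks the physically last page):

```python
def looks_truncated(data: bytes) -> bool:
    """Heuristic truncation check for an Ogg file: the last page's
    header_type byte (offset 5 within the page) should have the
    end-of-stream flag (0x04) set."""
    off = data.rfind(b"OggS")            # last page capture pattern
    if off < 0 or off + 6 > len(data):   # no page, or header itself cut off
        return True
    return not (data[off + 5] & 0x04)
```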
[22:35:49] there are some cases where being able to fully delete files would be nice
[22:36:25] so, it's in the archive now?
[22:36:32] Reedy: or just dedupe in this case
[22:36:39] that's what I mean
[22:36:53] especially for "larger" uploads that aren't correct
[22:42:03] New patchset: Pyoungmeister; "attempting to force absolute lookup of class per mr. feldman's suggestion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2239
[22:44:36] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2239
[22:44:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2239
[22:47:13] Eloquence: what is TMH?
[22:47:36] jeremyb, http://www.mediawiki.org/wiki/Extension:TimedMediaHandler
[22:48:12] Eloquence: low cost screen capture is what kind of price range?
[22:49:17] jeremyb, it'd be nice to have a portable solution for less than $1K that'll allow us to easily capture video output and then later manually composite it with the speaker video
[22:49:26] there are some black box solutions but the ones I saw are pretty pricey
[22:49:42] Eloquence: that seems to be within range of http://www.amazon.com/Canopus-TWINPACT-Digital-Video-Converter/dp/B000K3HT1K
[22:50:05] Eloquence: you can then do pic-in-pic with video of speaker in a live stream
[22:50:14] (or recording or both)
[22:51:09] jeremyb, it's got a 1 whole star rating!
[22:51:15] * jeremyb is double checking that i got the right model
[22:51:40] Reedy: well maybe i got the wrong model. but the one i'm thinking of is quite well proven
[22:51:58] heh
[22:52:34] !log rebooted db24
[22:52:36] Logged the message, Master
[22:53:11] this is the "budget schmudget" option: http://www.epiphan.com/products/recording/vga-recorder-pro/
[22:53:31] New patchset: Asher; "upgrading db24" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2240
[22:53:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2240
[22:55:03] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2240
[22:55:03] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2240
[22:59:55] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[22:59:55] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[22:59:55] PROBLEM - Puppet freshness on mw65 is CRITICAL: Puppet has not run in the last 10 hours
[22:59:59] !log db24 upgraded to lucid and current mysql build
[23:00:00] Logged the message, Master
[23:04:41] sorry, internet died
[23:05:56] Reedy: so, i had the right model. it's cheaper at newegg though: http://www.newegg.com/Product/Product.aspx?Item=N82E16815299005
[23:06:36] With a 1 egg rating
[23:07:02] ignore the ratings, just watch it in action yourself. it will surely be used at FOSDEM this weekend ;)
[23:07:16] and is used every year at DebConf
[23:07:41] 2 separate sites showing crappy ratings don't give much hope
[23:08:12] (well, only ~75% sure about FOSDEM. but DebConf i'm certain. i can get reviews from hackers if you want them)
[23:09:43] Reedy: also both sites have only 1 review total and they're from the same handle
[23:09:59] heh
[23:10:03] competitors
[23:11:18] New patchset: Ottomata; "user_agent1.py - need to import Observation" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2241
[23:11:20] New patchset: Ottomata; "Reworking Observation so that it observes every combination of properties." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2242
[23:11:40] agh whyyy master
[23:11:49] we need to figure out git branches + gerrit workflow
[23:13:25] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2242
[23:13:59] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2241
[23:13:59] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2242
[23:14:00] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2241
[23:14:25] otto: indeed!
[23:22:55] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 1219 seconds
[23:28:53] New patchset: Asher; "upgrading db12" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2243
[23:29:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2243
[23:30:52] New patchset: Bhartshorne; "adding ganglia logtailer and a log tailing module to swift proxy servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2244
[23:31:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2244
[23:31:25] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2244
[23:31:26] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2244
[23:32:00] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2243
[23:32:01] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2243
[23:32:13] !log rebooting db12
[23:32:15] Logged the message, Master
[23:33:58] ottomata: when you do figure out git branches, http://opinionatedprogrammer.com/2011/01/colorful-bash-prompt-reflecting-git-status/ might help maintain sanity (if you don't have something like it already)
[23:34:35] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours
[23:41:42] New patchset: Bhartshorne; "whoops, copy/paste error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2245
[23:42:01] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2245
[23:42:01] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2245
[23:43:05] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 13 MB (0% inode=64%): /var/lib/ureadahead/debugfs 13 MB (0% inode=64%):
[23:44:45] !log db12 back up with lucid + current mysql
[23:44:47] Logged the message, Master
[23:45:55] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds
[23:46:26] maplebed: SwiftProxyLogtailer stuff is you, yes?
[23:46:32] yes.
[23:46:42] it should only be affecting the swift hosts.
[23:46:48] is it getting in your hair elsewhere?
[23:47:06] I'm getting cron emails from owa3 saying /bin/sh: ganglia-logtailer: not found
[23:47:20] bah.
[23:47:24] thanks; I'll fix it.
[23:47:35] np
[23:48:59] New patchset: Bhartshorne; "suppressing ganglia-logtailer messages until they're less spammy and specfiying full path because /usr/sbin is not in cron's search path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2246
[23:49:17] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2246
[23:49:17] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2246
[23:49:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2246
[23:50:35] nimish_g: the cronspam should stop now.
[23:50:56] (or there might be one more, then they'll stop, since I'm riding the 5m border and that's when cron runs)
[23:51:42] Reedy: 41:25 @ http://meetings-archive.debian.net/pub/debian-meetings/2010/debconf10/high/1345_1345_Conference_Video.ogv
[23:53:25] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 92 MB (1% inode=64%): /var/lib/ureadahead/debugfs 92 MB (1% inode=64%):
[23:57:51] excellent, thanks maplebed