[00:07:00] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.670 seconds [00:20:40] !log deployed squid config to upload squids rolling thumbnails back to 75% handled by swift to test the & bug [00:20:42] Logged the message, Master [00:21:13] confirmed. [00:21:14] :( [00:22:41] !log increased nagios max concurrent checks on spence and lowered the interval between processing them [00:22:43] Logged the message, Mistress of the network gear. [00:27:51] hrm, i did the math and we need to allow 561 checks concurrently... [00:30:21] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:30:49] at the rate they run? [00:30:57] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:31:51] yeah [00:32:08] i'm going to try 512 and see how that goes :) [00:32:18] honestly the biggest issue has been when pushing a puppet change, spence freaks out [00:33:43] New patchset: Lcarr; "upping nagios checks to 512 concurrent checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2484 [00:34:38] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2484 [00:34:39] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2484 [00:45:26] New patchset: Asher; "http setup for gdash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2485 [00:46:25] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2485 [00:46:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2485 [00:49:15] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.889 seconds [00:52:06] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.459 seconds [01:05:50] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:11:14] 
RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.968 seconds [01:11:34] New patchset: Asher; "deleted a character too many" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2486 [01:11:55] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2486 [01:12:01] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2486 [01:12:02] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2486 [01:13:47] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:15:17] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:14] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.533 seconds [01:26:23] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [01:27:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:27:17] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [01:29:59] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.961 seconds [01:31:02] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [01:34:02] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:12] did anyone reboot srv278 [01:41:05] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.841 seconds [01:44:50] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.761 seconds [01:49:11] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 193 seconds [01:49:11] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:20] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 202 seconds [01:49:29] PROBLEM - MySQL Slave Delay on db1038 is 
CRITICAL: CRIT replication delay 200 seconds [01:50:05] PROBLEM - MySQL Replication Heartbeat on db1038 is CRITICAL: CRIT replication delay 235 seconds [01:50:23] PROBLEM - MySQL Replication Heartbeat on db1004 is CRITICAL: CRIT replication delay 253 seconds [01:50:41] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [01:52:02] PROBLEM - DPKG on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:52:20] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:53:14] RECOVERY - DPKG on db42 is OK: All packages OK [01:53:32] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [01:53:41] PROBLEM - mysqld processes on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:53:41] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:54:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:54:35] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.069 second response time [01:54:53] RECOVERY - Disk space on db42 is OK: DISK OK [01:55:02] RECOVERY - mysqld processes on db42 is OK: PROCS OK: 1 process with command name mysqld [01:55:56] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.535 seconds [01:56:23] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 602s [01:57:26] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 666s [01:58:47] RECOVERY - MySQL Replication Heartbeat on db1034 is OK: OK replication delay 30 seconds [01:59:05] RECOVERY - MySQL Slave Delay on db1034 is OK: OK replication delay 0 seconds [01:59:59] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:13:11] RECOVERY - HTTP 
on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.696 seconds [02:17:14] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:18:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.164 seconds [02:28:02] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.168 seconds [02:37:29] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:23] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 59s [02:38:50] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.560 seconds [02:38:50] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:40:20] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.684 seconds [02:42:53] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:53] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:49:10] New patchset: Bhartshorne; "fixing bug where images with an ampersand \& in the name fail to load through swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2487 [02:49:47] !log deploying fix for & bug with swift (files with an & in the name wouldn't load properly) [02:49:49] Logged the message, Master [02:50:27] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2487 [02:50:28] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2487 [02:52:21] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.014 seconds [02:52:21] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
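The 02:49 deploy above fixes the "&" bug: thumbnails whose filenames contain an ampersand failed to load through Swift. The actual patch lives in the puppet-managed rewrite layer (gerrit r/2487, not shown here); the sketch below only illustrates the general class of fix, namely percent-encoding the object name before it is placed in the backend URL so "&" cannot be misparsed as a parameter separator. The account and container names are hypothetical.

```python
from urllib.parse import quote

def swift_object_path(container: str, filename: str) -> str:
    """Build a Swift object URL path, percent-encoding every reserved
    character in the object name (including '&') so the backend request
    is unambiguous. AUTH_test is a placeholder account."""
    return "/v1/AUTH_test/%s/%s" % (container, quote(filename, safe=""))

# An unencoded '&' in the path is the failure mode being avoided here.
path = swift_object_path("thumbs", "Foo_&_Bar.jpg")
print(path)  # /v1/AUTH_test/thumbs/Foo_%26_Bar.jpg
```

`quote(..., safe="")` leaves unreserved characters (letters, digits, `_`, `.`, `-`, `~`) alone and encodes everything else, which is why only the ampersand changes in this example.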
[02:52:47] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:54:53] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [02:55:20] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [02:56:23] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:04:20] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.291 seconds [03:07:29] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.565 seconds [03:11:57] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:03] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds [03:14:30] RECOVERY - MySQL Slave Delay on db1002 is OK: OK replication delay 0 seconds [03:16:09] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:03] RECOVERY - MySQL Slave Delay on db1038 is OK: OK replication delay 0 seconds [03:17:30] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.167 seconds [03:18:15] RECOVERY - MySQL Replication Heartbeat on db1038 is OK: OK replication delay 0 seconds [03:19:18] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 211 seconds [03:19:54] PROBLEM - MySQL Slave Delay on db1004 is CRITICAL: CRIT replication delay 247 seconds [03:21:15] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [03:21:15] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [03:21:33] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:48] PROBLEM - MySQL Replication Heartbeat on db1005 is CRITICAL: CRIT replication delay 212 seconds [03:24:15] PROBLEM - MySQL Slave Delay on db1005 is CRITICAL: CRIT replication delay 239 seconds [03:24:33] PROBLEM - MySQL Replication Heartbeat 
on db1021 is CRITICAL: CRIT replication delay 257 seconds [03:24:33] PROBLEM - MySQL Slave Delay on db1021 is CRITICAL: CRIT replication delay 257 seconds [03:25:54] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay 0 seconds [03:25:54] RECOVERY - MySQL Slave Delay on db1021 is OK: OK replication delay 0 seconds [03:26:39] RECOVERY - MySQL Replication Heartbeat on db1005 is OK: OK replication delay 0 seconds [03:27:06] RECOVERY - MySQL Slave Delay on db1005 is OK: OK replication delay 0 seconds [03:28:09] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.602 seconds [03:33:33] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:40:27] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 182 seconds [03:40:27] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 223 seconds [03:43:00] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 217 seconds [03:43:09] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 204 seconds [03:44:12] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.341 seconds [03:46:54] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [03:48:24] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:48:33] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.855 seconds [03:49:54] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 210 seconds [03:49:54] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 210 seconds [03:51:51] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:51:51] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:52:09] PROBLEM - mysqld processes on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:09] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:09] PROBLEM - DPKG on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:18] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:18] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:27] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 210 seconds [03:52:27] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.904 seconds [03:52:27] PROBLEM - RAID on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:27] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 211 seconds [03:52:37] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:03] RECOVERY - Disk space on db42 is OK: DISK OK [03:53:21] RECOVERY - mysqld processes on db42 is OK: PROCS OK: 1 process with command name mysqld [03:53:21] RECOVERY - MySQL disk space on db42 is OK: DISK OK [03:53:21] RECOVERY - DPKG on db42 is OK: All packages OK [03:53:30] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [03:53:30] RECOVERY - MySQL Idle Transactions on db42 is OK: OK longest blocking idle transaction sleeps for 0 seconds [03:53:39] RECOVERY - RAID on db42 is OK: OK: State is Optimal, checked 2 logical device(s) [03:54:15] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [03:56:30] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:56:39] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.879 seconds [04:00:41] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [04:01:44] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.892 seconds [04:04:44] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.506 seconds [04:05:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:12:50] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:16:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.941 seconds [04:16:53] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.832 seconds [04:20:47] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:20:56] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:23] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.761 seconds [04:27:32] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.293 seconds [04:31:26] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:35:20] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.519 seconds [04:35:47] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:38:29] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.118 seconds [04:42:32] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:43:53] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.459 seconds [04:46:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:47:39] RECOVERY - MySQL Replication Heartbeat on db1004 is OK: OK replication delay 0 seconds [04:47:39] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.106 seconds [04:47:47] RECOVERY - MySQL Slave Delay on db1004 is OK: OK replication delay 0 seconds [04:47:56] 
PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:50:29] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.039 seconds [04:51:50] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:52:08] PROBLEM - MySQL Recent Restart on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:52:26] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:52:26] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:52:53] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:53:50] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [04:54:17] RECOVERY - MySQL Idle Transactions on db42 is OK: OK longest blocking idle transaction sleeps for 0 seconds [04:54:26] RECOVERY - MySQL Recent Restart on db42 is OK: OK 4250487 seconds since restart [04:54:44] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [04:55:02] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [04:58:38] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [04:58:47] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [05:08:32] !log deployed squid config to uploads to send 100% of thumbnail traffic to swift [05:08:34] Logged the message, Master [05:43:11] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:44:32] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:47:05] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 0.027 seconds [05:47:05] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.008 seconds [05:51:40] PROBLEM - 
MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:51:40] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:51:58] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:52:34] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:52:52] PROBLEM - DPKG on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:52:52] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:52:52] PROBLEM - mysqld processes on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:53:01] RECOVERY - Disk space on db42 is OK: DISK OK [05:53:01] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [05:53:01] RECOVERY - MySQL disk space on db42 is OK: DISK OK [05:53:46] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 1 seconds [05:54:04] RECOVERY - mysqld processes on db42 is OK: PROCS OK: 1 process with command name mysqld [05:54:04] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [05:54:04] RECOVERY - DPKG on db42 is OK: All packages OK [06:06:49] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 198 seconds [06:07:52] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 262 seconds [06:08:10] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay 0 seconds [06:09:13] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 11 seconds [06:19:52] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 236 seconds [06:20:01] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 243 seconds [06:22:34] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 0 seconds [06:22:34] RECOVERY - 
MySQL Replication Heartbeat on db1019 is OK: OK replication delay 1 seconds [06:51:50] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:51:50] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:51:50] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:51:52] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:51:52] PROBLEM - Full LVS Snapshot on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:52:01] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:50:52] !log Making cp1001-1005 API squids [11:50:54] Logged the message, Master [12:02:54] !log Decommissioning sq38, sq46 and sq47 in squid configurator [12:02:56] Logged the message, Master [12:06:00] New patchset: Mark Bergsma; "Decommission sq38, sq46, sq47" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2488 [12:06:20] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/2488 [12:07:09] New patchset: Mark Bergsma; "Decommission sq38, sq46, sq47" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2488 [12:07:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2488 [12:07:39] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2488 [12:07:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2488 [12:17:54] New patchset: Mark Bergsma; "Decommission sq31" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2489 [12:18:17] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2489 [12:18:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2489 [12:18:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2489 [12:18:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2489 [12:25:35] New patchset: Mark Bergsma; "Decommission sq35" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2490 [12:25:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2490 [12:26:02] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2490 [12:26:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2490 [13:32:09] New patchset: Mark Bergsma; "Remove old nagios host/service groups for squid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2491 [13:32:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2491 [13:32:41] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2491 [13:32:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2491 [13:42:54] !log Configured cp1001 and cp1020 to contact backend servers directly instead of via pmtpa squids [13:42:56] Logged the message, Master [14:13:44] New patchset: Mark Bergsma; "Remove decommissioned servers from api list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2492 [14:14:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2492 [14:23:21] New patchset: Mark Bergsma; "Update more graphs for eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2493 [14:23:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2493 [14:24:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2492 [14:24:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2492 [14:25:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2493 [14:25:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2493 [14:36:20] New patchset: Mark Bergsma; "Corrections" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2494 [14:36:43] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2494 [14:36:49] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2494 [14:36:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2494 [14:40:04] New patchset: Hashar; "(bug 34141) notify jenkins on ANY merge" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2495 [14:40:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2495 [14:47:45] New patchset: Mark Bergsma; "Add mobile graphs, corrections" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2496 [14:48:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2496 [14:48:24] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2496 [14:48:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2496 [14:50:53] New patchset: Mark Bergsma; "Fix color spec" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2497 [14:51:19] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2497 [14:51:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2497 [14:51:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2497 [14:52:53] New review: Demon; "Why are we doing this on merge? Shouldn't this go on push (before the merge happens)?" 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2495 [14:55:15] New patchset: Mark Bergsma; "Automatically clear the cache on Torrus config changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2498 [14:55:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2498 [14:55:45] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/2498 [14:56:37] New patchset: Mark Bergsma; "Automatically clear the cache on Torrus config changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2498 [14:57:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2498 [14:57:21] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2498 [14:57:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2498 [14:59:27] New patchset: Mark Bergsma; "Fix api colors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2499 [14:59:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2499 [14:59:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2499 [15:01:15] New patchset: Mark Bergsma; "Use LINE1 for service times" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2500 [15:01:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2500 [15:01:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2500 [15:01:42] Change merged: Mark Bergsma; [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/2500 [15:01:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2500 [15:03:15] New patchset: Mark Bergsma; "Really fix API colors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2501 [15:03:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2501 [15:03:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2501 [15:03:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2501 [15:07:16] New patchset: Mark Bergsma; "More color fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2502 [15:07:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2502 [15:07:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2502 [15:07:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2502 [15:14:51] New patchset: Mark Bergsma; "Turn on KeepAlive on application servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2504 [15:15:19] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2504 [15:16:07] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2504 [15:16:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2504 [15:17:54] !log Turned on KeepAlive on apaches for better miss service times from eqiad [15:17:56] Logged the message, Master [15:40:27] New patchset: Mark Bergsma; "/etc/init.d/torrus-common restart has proven unreliable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2505 [15:40:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2505 [15:41:04] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2505 [15:41:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2505 [16:07:26] !log Rebalanced appserver load balancing by giving the new mw* pmtpa app servers weight 150 in the pybal server list [16:07:27] Logged the message, Master [16:37:36] hi notpeter, are you around? [16:45:22] hi robh [16:45:31] heyas [16:45:39] can i ask you for a quick favor? [16:45:51] you can ask =] [16:45:54] :) [16:45:58] you need 5 mins of C code review? ;-) [16:46:04] :) :) [16:46:29] i need some files transferred from emery to the analytics virtual labs instance [16:46:34] i filed an rt ticket [16:46:44] but without those files, i cannot make progress [16:46:58] you dont have emery access? [16:46:58] http://rt.wikimedia.org/Ticket/Display.html?id=2421 [16:47:15] i can add you to the virtual instance [16:47:17] New review: Hashar; "That basically copy the way subversion works for now." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2495 [16:50:08] !log running sync-apache, trying to redirect office.wm to https [16:50:10] updated ticket, afk a moment, receiving in items in eqiad [16:50:10] Logged the message, Master [16:51:09] robh: files are in /a/squid/archive/sampled [16:51:16] and they contain ip addresses [16:52:31] is that ok to move into labs? [16:52:48] i hope so, else there is a real catch 22 going on here [16:53:30] i cannot install mysql-server on stat1 (for the moment) and i cannot move files to labs (that are on stat1) which has a mysql-server [16:53:38] i need a machine with both the files and mysql server [16:54:57] heh [16:55:11] mark: is there any issue with putting sampled log files for squid into a labs instance? [16:55:29] drdee: go ahead and give me access to the instance if you would, we can assume its ok and prepare for moving the files now [16:55:30] the ideal situation is fixing ticket: [16:55:44] and if its ok once mark replies great [16:55:44] if [16:55:45] not [16:55:48] http://rt.wikimedia.org/Ticket/Display.html?id=2411 [16:55:50] damned colloquy bug [16:55:52] =P [16:56:04] stat1 is public facing [16:56:10] i know :) :) [16:56:10] running mysql on it is bad [16:56:21] I recall there is a conversation about this going on via email right? [16:56:21] running mysql anywhere is bad [16:56:22] that's why it is going to become private [16:56:24] yes [16:56:42] mark: so can sampled squid logs be pushed to a labs instance safely? 
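The 16:07 log entry above rebalanced the application servers by giving the new mw* pmtpa hosts weight 150 in the pybal server list. PyBal's server lists are, roughly, one Python dict literal per line; the fragment below is a hypothetical sketch (hostnames and the exact key set are assumptions, not the production file) showing how such a list parses and what a weight of 150 means relative to weight-100 peers under weighted scheduling.

```python
import ast

# Hypothetical pybal-style server list: one Python dict literal per line.
SERVER_LIST = """\
{'host': 'mw1.example.wmnet', 'weight': 150, 'enabled': True}
{'host': 'srv200.example.wmnet', 'weight': 100, 'enabled': True}
{'host': 'srv201.example.wmnet', 'weight': 100, 'enabled': False}
"""

def expected_share(servers, host):
    """Fraction of traffic a host receives under weighted scheduling,
    counting only enabled servers."""
    enabled = [s for s in servers if s['enabled']]
    total = sum(s['weight'] for s in enabled)
    return next(s['weight'] for s in enabled if s['host'] == host) / total

servers = [ast.literal_eval(line) for line in SERVER_LIST.splitlines()]
share = expected_share(servers, 'mw1.example.wmnet')  # 150 / (150 + 100)
```

So a weight-150 host among weight-100 peers takes proportionally more of the pool, which is the point of the rebalance.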
[16:56:47] i'm not sure [16:56:57] in theory, yes [16:57:00] but i'd like ryan to confirm [16:57:04] since I'm not fully up to date on labs [16:57:10] http://rt.wikimedia.org/Ticket/Display.html?id=2412 [16:57:20] is the ticket to make stat1 private [16:57:41] but it depends on: [16:57:41] http://rt.wikimedia.org/Ticket/Display.html?id=2165 [16:57:46] there is discussion whether you should be using a managed mysql cluster instead of a mysqld on stat1 [16:57:54] (technically it does not) [16:58:19] for the moment, i just need a single mysql server [16:58:24] drdee: So I am going to update the ticket and assign it to Ryan to confirm that it is ok to move the data there. In anticipation, please go ahead and grant me access to your instance. [16:58:31] ok [16:58:33] thanks! [16:58:39] once he confirms its secure (labs) then I will go ahead and snag it [16:58:48] you want just the sampled logs for a week or so? [16:58:53] (how much data you want?) [16:58:55] for 1 month [16:58:57] 30 days [16:59:13] if that fits, else 1 or 2 weeks is fine as well [16:59:17] just anything :) [16:59:46] can we remove the dependency 2165 from 2412? [17:00:16] its not linked right now [17:00:33] 2412 isnt tied to any other ticket [17:01:15] done, i added you to labs instance [17:01:31] but 2412 says it depends on 2165, or am I reading it wrong? [17:01:43] "Dependency on ticket #2165 added" [17:02:42] the intention was just to link all the stat1 tickets in the same place, and the others were dependencies [17:02:56] oh i am transposing numbers =P [17:03:26] i think moving stat1 is not the solution, the solution is adding this data to a misc db cluster [17:03:30] maybe "children" is more appropriate [17:03:38] running stat1 as a mysql server isnt good. [17:04:02] sorry for asking, but why not? [17:04:03] plus its old as sin, and has no warranty [17:04:04] using [17:04:08] wow... 
brb [17:04:08] that's bayes [17:04:10] not stat1 [17:04:14] oh, thats right. [17:04:55] drdee: the folks in email are covering it better than me ;] [17:14:07] !log oxygen offline for hard disk upgrade to replace locke [17:14:08] Logged the message, RobH [17:14:34] <^demon> Is http://svn.wikimedia.org/viewvc/mediawiki/wikimedia-web/ still used for anything? [17:15:08] mark or apergos: any ideas on why the day view is so spikey? [17:15:11] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-fe2.pmtpa.wmnet&c=Swift%20pmtpa&m=swift_GET_200_hits&r=day [17:16:15] as in, why the request pattern is so spikey? [17:16:18] not really no [17:16:30] that's really quite odd [17:17:33] hmm but [17:17:40] the spikes aren't such wide variants right? [17:17:57] I mean between 8 and 14... eh [17:20:31] I was thinking individual squids blowing their cache might account for it [17:20:44] but have any of them done so? [17:21:41] the 404 graph is totally smooth, fwiw. [17:25:16] yeah, dunno [17:33:29] about to apache-graceful-all for a small addition to redirects.conf [17:36:13] maplebed: no upload squids, no [17:36:34] yeah, didn't think so. [17:36:35] the only thing I can think of is my decommissioning of a few upload squids today [17:36:38] which were down already [17:36:39] oh well... [17:36:40] that /should/ not matter [17:36:50] do you have timestamps? [17:36:52] but perhaps due to CARP hashing, it did matter a little bit [17:36:56] yeah, in the SAL [17:37:00] * maplebed goes to look [17:37:18] other than that, I only worked on text squids today [17:38:14] doesn't look like the timestamps match up [17:38:30] the closest is cp1001 and cp1020 [17:38:40] but they're not part of upload, right? [17:38:43] no [17:38:54] oh well. [17:41:10] I'm starting a read load test against the swift cluster to look for qps ratings while under real world load. [17:41:14] it won't last more than a few minutes. 
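Mark's point at 17:36 is that removing a member from a CARP array only remaps the requests that hashed to the removed member, while the rest stay put. A toy sketch of that property using highest-random-weight hashing — this is the general idea behind CARP, not Squid's actual implementation (Squid uses its own hash function and member weights; the md5 scoring and server names here are illustrative assumptions):

```python
import hashlib

def carp_pick(url, members):
    # CARP-style highest-random-weight: score every (member, url) pair
    # and route the request to the member with the highest score.
    return max(members,
               key=lambda m: hashlib.md5((m + url).encode()).hexdigest())

members = ["sq41", "sq42", "sq43", "sq44"]          # hypothetical squids
urls = ["/thumb/%d.jpg" % i for i in range(1000)]

before = {u: carp_pick(u, members) for u in urls}
after = {u: carp_pick(u, members[:-1]) for u in urls}  # sq44 decommissioned

# Only URLs that hashed to the removed member move; everything else is
# unaffected, so the cache-hit impact is limited to that member's share.
moved = sum(1 for u in urls if before[u] != after[u])
print(moved == sum(1 for u in urls if before[u] == "sq44"))
```

This is why decommissioning a couple of upload squids "should not matter" much: each removal invalidates only roughly 1/N of the cached working set.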
[17:42:07] ok [17:42:45] New review: Demon; "As long as the create patchset hook has the changeset, you could build a cherry-pick url like the on..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2495 [17:47:53] mediawiki question--where do I find documentation on a top block like this: __NOEDITSECTION__\n{{ServiceOperations ? [17:48:27] http://www.mediawiki.org/wiki/Help:Magic_words [17:48:50] thanks! [17:48:56] yw [17:55:20] there are two issues with the current SSL squid logging [17:55:20] 1) The mimetype should be urlencoded in the logfile, there are some instances where it returns "application/x-www-form-urlencoded; charset=UTF-8" the space is throwing us off [17:55:20] 2) The path part for API calls is missing: GET https://fr.wikipedia.org/w/api.php [17:55:21] I'll file an RT ticket. [17:57:24] mutante: are you familiar with mw templates? [17:58:10] Jeff_Green: as long as they are basic..i guess [17:58:37] where are you editing? [17:58:39] oh actually, I just found what I was looking for! [17:58:47] !log updating dns for oxygen internal ip [17:58:48] Logged the message, RobH [17:59:03] i'm down a rathole working on office-wiki payments infrastructure documentation [17:59:34] i was trying to create a table, and I've finally figured out that it fails to render because there's a template applied to the page I'm working on [18:00:06] http://office.wikimedia.org/wiki/Payments_cluster [18:00:59] template looks not terribly useful, perhaps I'll just un-use it. [18:02:01] RobH: can we prioritize http://rt.wikimedia.org/Ticket/Display.html?id=2412 and http://rt.wikimedia.org/Ticket/Display.html?id=2411 [18:02:56] Jeff_Green: i see, that would be table inside a table , created by the template. yea, probably un-use [18:03:11] diederik: only if folks are in agreement, i dont think folks will agree to run mysql on it. [18:03:19] we tend to NOT run mysql on hosts. [18:03:22] misc hosts i mean [18:03:38] so what is the solution? 
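The first SSL-logging issue above is that a mimetype like "application/x-www-form-urlencoded; charset=UTF-8" contains a space, which breaks a space-delimited log line. Percent-encoding the field before writing it, as drdee requests, avoids that. A minimal sketch in Python of the idea (the actual logger is not Python; this only illustrates the encoding):

```python
from urllib.parse import quote, unquote

def encode_log_field(value):
    # Percent-encode everything (safe="") so spaces and other delimiter
    # characters cannot split a space-delimited log record.
    return quote(value, safe="")

mime = "application/x-www-form-urlencoded; charset=UTF-8"
field = encode_log_field(mime)
print(" " not in field)        # the encoded field has no spaces
print(unquote(field) == mime)  # and decodes back to the original value
```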
[18:03:39] Jeff_Green: yes, it's like suggesting a page structure for you, but you can follow the same standard without the template.. [18:03:40] we have them all talk to their own database on the misc db cluster (db9/10 which are being replaced) [18:03:54] mutante: ya exactly [18:03:56] so i would say the solution is to do the sql data on db9 or its replacement [18:04:23] okay, can you give me and Andrew Otto access to db9? [18:04:24] i previewed w/o the template definition and it didn't change much [18:04:30] Jeff_Green: unless that [18:04:39] Jeff_Green: unless that would actually be semantic mediawiki [18:04:55] diederik: The best thing is drop a ticket with what you would like the db to be called and we can create a user and password with it restricted to stat1 as the user host [18:05:04] also how much space you expect the db will be using [18:05:05] db9 [18:05:06] is [18:05:11] old and going away with a replacement [18:05:25] or append to the one ticket actually [18:05:32] about mysql. [18:05:36] ok, another ticket :D [18:05:48] mutante: as in it's tying this into other pages? [18:05:48] well not new [18:05:49] http://rt.wikimedia.org/Ticket/Display.html?id=2411 [18:05:59] i'll append [18:06:30] i commented [18:07:21] heh, db9 has plenty of space with otrs gone off it. [18:07:36] New patchset: Hashar; "note about going to 1.20" [test/mediawiki/core] (master) - https://gerrit.wikimedia.org/r/2506 [18:07:44] 236gb [18:08:25] Jeff_Green: as in "lets you query data within the wiki pages" http://semantic-mediawiki.org/ , used on labs wiki, but not on office [18:09:09] New patchset: Hashar; "adding gitreview config file" [test/mediawiki/core] (master) - https://gerrit.wikimedia.org/r/2507 [18:09:34] mutante: how are you able to tell whether or not it's used on office? [18:09:49] Robh: ticket updated [18:10:38] diederik: you expect to take a TB for a database? 
[18:10:46] Jeff_Green: it's listed on the page Special:Version and adds a little icon [18:10:50] over time, yes [18:11:02] ok....then db9 wont work [18:11:06] mutante: spiffy. thank you! killing template . . . [18:11:16] okay, then 250GB? [18:11:17] https://labsconsole.wikimedia.org/wiki/Special:Version [18:11:32] yup i see now [18:11:44] diederik: the issue is db9 is a shared db host. so i guess you guys have enough data that you need a dedicated host. [18:11:55] ideally we do not have hosts running databases doing other things. [18:12:07] yes, that will happen [18:13:08] hrmm. [18:13:28] what are the services that stat1 presently provides? [18:13:51] nothing so far, nobody is using it yet except for me [18:14:04] stat1 is replacing bayes [18:14:04] bayes is still the production one [18:14:31] so stat1 if its replacing bayes will need to stay on public IP. [18:14:37] correct? [18:15:07] i honestly don't know, ask erik z, he uses bayes and i don't think he needs a public ip [18:15:10] but ask him [18:15:13] bayes is going away [18:15:17] its being replaced by stat1 [18:15:18] it has a public Ip . erik z ssh to it [18:15:30] right [18:15:38] so stat1 cannot be private ip [18:15:42] ok [18:15:45] not if we want the public to hit it directly [18:15:59] It also has a ton more storage, due to Erik Z saying he could use it. [18:16:14] So, we cannot make db9 your database on the project if you need that much space. [18:16:21] The alternative is ordering a new DB host for this. [18:16:27] how much can db9 give [18:16:28] stat1 needs to have some NFS mounts, described in puppet/manifests/misc/statistics.pp; meanwhile, it needs to talk to 10.0.5.8 and 208.80.152.185 [18:16:30] ? [18:16:44] (and currently there are problems with that) [18:16:44] db9 has 235 GB free. [18:17:08] can you give me 150? [18:17:10] on db9 [18:17:39] its not dedicated space, just shared. 
we can set ya up there and see how it goes [18:17:52] i just dont wanna misrepresent what db9 is (also we are moving it on saturday so it will have downtime then) [18:17:58] ok, thanks [18:18:03] i understand [18:18:07] its also going to be replaced in the next month or so with a new server that has more space. [18:18:14] so if that works, then we can do that. [18:18:49] i am fine with that [18:19:48] so let's kill ticket 2412 before someone takes that ticket [18:20:55] cool, done [18:21:16] note how maplebed asked if this can be in labs [18:21:31] ? [18:22:49] via email, he asked erik z. if the stat1 config could be a labs project [18:22:51] I'm pretty sure stat1 does want to move to an internal IP address [18:22:58] despite the fact that it is replacing bayes. [18:23:01] ..... [18:23:13] bayes serves data to public http [18:23:18] it's not a straight up service-for-service replacement, it's a replacement in duty. [18:23:21] how is that going to occur on stat1 [18:23:30] the data will be served to the public via an API running in a mediawiki extension [18:23:38] so will go through the cluster rather than being hit directly. [18:23:45] at least so says ottomatic, who's building the thing. [18:23:47] does this change that db9 should run the mysql? [18:24:07] I'm pretty sure db9 shouldn't be running the mysql instance supporting this, [18:24:12] or at least that stat1 should also. [18:24:51] it sounds like a conversation is in order. [18:24:55] so you think stat1 should run mysql? [18:25:05] yea, diederik this isnt gonna get settled now. [18:25:08] between diederik erikz, otto, one of us, etc. [18:25:11] ok, restarting [18:27:04] RobH: but this needs to be settled soon :D me, andrew, fabian and andre cannot make progress with mobile analytics and the reportcard projects [18:27:18] it doesnt sound like im the one to decide ;] [18:27:24] maplebed seems to know more whats goin on. 
[18:27:28] and mutante [18:27:44] basically someone needs to let diederik know where we can run a database instance of mysql for his project [18:28:11] so if db9 isnt suitable, maplebed, where should we run this, on stat1?? [18:28:23] traditionally we do not serve mysql on shared services hosts. [18:28:31] diederik: do you have written down anywhere how this thing is going to work? I've heard now both that it needs to be publicly accessible and that it doesn't, that it needs lots of space and that db9 will be accessible, that it will also be crunching data on hadoop and that it's only for the report card, etc. [18:28:44] it's hard to help you set up a service with so many seemingly-conflicting requirements. [18:29:28] maplebed: it is not that complicated: i need a mysql server with a decent amount of storage [18:29:49] that's all [18:30:02] so why are people talking about it needing a public IP? [18:30:23] because stat1 has been described as a 'bayes replacement'. [18:30:27] is that the same thing as what you're doing? [18:30:29] well, me and erik z need to be able to ssh into it [18:30:40] (and bayes currently has a public IP) [18:31:07] and so erik z will start running his scripts on stat1 that are now running on bayes [18:31:09] do you see my trouble here? [18:32:16] so why can't we have stat1 access on db9 the mysql server? [18:32:26] and then stat1 can stay as it is [18:32:33] Does bayes not serve any data publicly anymore? [18:32:43] you could start a labs instance, select an instance type with larger storage, apply mariadb classes and have a local "mysql" ..to be able to develop until it goes live.. 
[18:32:52] mutante: [18:32:53] no [18:32:55] no [18:32:56] i can't [18:32:58] no secret data in labs yet [18:33:06] this is really a catch 22 [18:33:07] oh, secret data.ok [18:33:37] robh: ask erik z if it serves public data [18:33:40] i don't think so [18:33:49] i only think he ssh into that box and runs his scripts [18:33:52] but i am not 100% sure [18:33:54] then you can do that via fenari [18:34:03] +1 mark [18:34:06] if it doesn't need to be publicly accessible, we should move it to internal anyway [18:34:19] again: confirm with erik z [18:34:23] but I'd still not like to see a mysql instance on there, if it needs to be anywhere near production grade [18:35:37] my only hesitation about using db9 is that I'm used to analytics jobs abusing their database, and db9 also serves production services such as the blog, bugzilla, etc. [18:35:54] (mark - is my understanding about what db9 does correct?) [18:36:01] yeah I'm also not sure db9 is the right box/cluster [18:36:05] yes [18:36:20] if necessary we need to setup something new [18:36:39] so I hear the hesitation about stat1 serving mysql (huh? no replication? what'll happen when a disk goes boom and all your data goes byebye?) [18:36:46] yeah [18:37:04] also stat1 is a bit of a black box for ops [18:37:10] if its not dedicated hardware for db its also a bit harder to get asher to do review, it has a bunch of shit we dunno on it [18:37:13] we provide access and don't do much more (although that could change) [18:37:15] my best guess for how to make this work: [18:37:32] put the db on db9 and use stat1 as a dev box. [18:37:40] before it goes production, give us better estimates on load [18:37:51] and if necessary we'll create a new db cluster for the project [18:37:54] does that sound reasonable? [18:37:55] yeah [18:37:58] and as long as it's on db9 [18:37:58] definitely! [18:38:04] thanks guys! 
[18:38:05] we will be able to shut it off if it causes problems [18:38:13] db9 also doesn't have a lot of space [18:38:15] need to check that too [18:38:20] db9 is being migrated to a new box [18:38:23] but that will be another few weeks [18:38:24] db9 has 235G free [18:38:32] ah because otrs went away [18:38:32] yeah [18:38:48] maplebed: can you confirm with asher [18:39:03] it still sounds to me like stat1 can move to an internal IP, but I'm less concerned since it won't have a DB on it. [18:39:14] diederik: how about you write all this up? [18:39:22] (then asher and others can review) [18:39:24] +1 [18:39:35] on the RT ticket please [18:39:42] sure [18:39:48] (other == erikz too, since stat1 is still the "bayes replacement"...) [18:39:54] which one, the 2411? [18:40:03] I suggest creating a new one. [18:40:15] and actually, I'd prefer a wiki page that the RT links to [18:40:27] so that we can actually see the plan as a whole instead of the sum of a bunch of comments [18:40:31] but that's just me. [18:40:31] ok [18:41:04] yeah, as long as it's somehow reachable from the rt ticket(s) ;) [18:41:30] ok, i'll mock up a wiki page, an rt ticket and an email alerting everybody [18:43:44] there's also stat1001 btw [18:43:46] in eqiad [18:43:55] ideally you make sure that's your backup for everything you do on stat1 [18:58:02] ok [19:09:22] okay, done: http://rt.wikimedia.org/Ticket/Display.html?id=2431 [19:09:25] and [19:09:25] http://www.mediawiki.org/wiki/Analytics/Infrastructure/Stat1#Relevant_tickets [19:09:30] http://www.mediawiki.org/wiki/Analytics/Infrastructure/Stat1 [19:10:39] drdee: it's not going to run a web server? [19:10:55] (doesn't the reportcard have a frontend as part of it? [19:10:57] ) [19:12:31] !log oxygen setup and installed per rt2343, still needs puppet runs and full deployment per rt 2430 [19:12:33] Logged the message, RobH [19:13:06] diederik: So I just finished up the OS install on oxygen, the locke replacement. 
[19:13:30] so as far as you know, it does everything locke does, and I should be able to just give it the same puppet info [19:13:32] ? [19:13:59] whoever is trying to nfs mount /data from stat1 [19:14:13] you're going to have to change the exports file on dataset2 and re-export over there [19:14:33] (exports is puppetized but the remount does not work via puppet) [19:14:33] RobH: cool! [19:15:06] RobH: and in addition, Locke should be configured as a proxy server to enable multicast logging [19:15:09] * maplebed added NFS to the list of requested services [19:15:32] s/remount/re-export/ [19:18:18] maplebed: yes, you are right, sorry i missed that [19:19:24] i added mediawiki as a requested incoming service [19:20:15] more detail around that'd be great - does it have to be running or do you just need the libraries? [19:20:22] i.e. will it be listening on port 80? [19:20:31] does it need apache or just the mediawiki source? [19:20:50] it needs a running mediawiki installation [19:21:51] for people to hit or scripts? (if people, only those that also have ssh access or a wider audience?) [19:22:04] (this is influencing the public / private IP discussion) [19:23:11] for people to access the reportcard using their browsers [19:23:33] that sounds to me like a public IP. :( [19:26:25] the NFS bit is interesting for stat1001 because NFS cross colo sucks ass. [19:26:39] but it should be fine for stat1 [20:02:36] AaronSchulz: does your comment to https://bugzilla.wikimedia.org/show_bug.cgi?id=34231 imply that you still have work to do on the bug and I should assign it back to you? 
[20:03:00] I'm finishing up that work now [20:04:31] ok, you'll probably finish long before I do anything so I'll keep ownership [20:04:51] I'll get tim to review first as well [20:05:30] I think there are actually a lot of files that deserve purging that got into swift as a byproduct of my thumbnail filler [20:05:48] when I did a container listing grepping for & I found many paths that were a filename + query parameters. [20:06:24] eg 5/5d/Weather_Report_19810611_shinjuku_fn23.jpg/720px-Weather_Report_19810611_shinjuku_fn23.jpg&crop&fallback=hub_music&prefix=q [20:07:08] heh [20:10:15] '(mid|seek(?:=|%3D|%3d)[0-9.]+)-([^/]*)$!' [20:10:17] * AaronSchulz giggles [20:17:35] New patchset: Lcarr; "Adding in default gateway fact for puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2512 [20:17:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2512 [20:25:12] drdee: stat1 mediawiki instance == just for dev, right? [20:26:44] what are the other options? :) [20:27:02] eventually, the reportcard will go into production [20:27:16] but for the time being, it's only dev [20:38:37] New patchset: Hashar; "jenkins: git preparaton script for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2513 [20:38:59] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/2513 [20:40:54] oh no [20:45:26] bleh, i had to rewire b1/b2 eqiad, they werent right. [20:45:34] binasher: racking your two new en dbs in eqiad today. [20:46:11] thanks, looked like the pmtpa ones came in too? 
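The pattern AaronSchulz quotes at 20:10 looks like a fragment of a PHP preg expression (the trailing "!" is most likely the preg delimiter, not part of the pattern), apparently aimed at thumbnail names carrying mid/seek parameters like the query-parameter paths shown above. A rough Python rendering of what it matches — the example paths are hypothetical, and this is only an illustration, not the actual MediaWiki code:

```python
import re

# The trailing "!" from the log is treated as a PHP preg delimiter and
# dropped; everything else is the pattern as quoted.
pat = re.compile(r'(mid|seek(?:=|%3D|%3d)[0-9.]+)-([^/]*)$')

# A timed-media-style thumbnail name with a seek offset matches:
m = pat.search('thumb/0/00/Foo.ogv/seek=12.5-Foo.ogv.jpg')
print(m.group(1), m.group(2))  # seek offset, trailing filename

# A plain static thumbnail does not:
print(pat.search('thumb/5/5d/Bar.jpg/720px-Bar.jpg'))  # None
```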
[20:46:28] yep, not sure on status on those yet [20:46:38] they should be onsite and such [20:46:52] New patchset: Hashar; "jenkins: git preparation script for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2513 [20:47:04] binasher: looks like they are racked, https://rt.wikimedia.org/Ticket/Display.html?id=2392 [20:47:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2513 [20:53:58] !log updating dns for new db hosts [20:54:00] Logged the message, RobH [20:56:57] New patchset: Hashar; "pyc files are now ignored" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2514 [20:57:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2514 [21:07:00] binasher: ok, the eqiad ones are ready for install, just need the networking setup, i dropped a ticket for that [21:09:19] New patchset: Diederik; "Ignore more build-specific files." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2515 [21:09:22] New patchset: Diederik; "Simple script to test code quality." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2516 [21:09:23] New patchset: Diederik; "Improving support for regular expressions." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2517 [21:20:07] New review: Diederik; "Ok." [analytics/udp-filters] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2517 [21:20:26] New review: Diederik; "Ok." [analytics/udp-filters] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2516 [21:20:48] New review: Diederik; "Ok." 
[analytics/udp-filters] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2515 [21:20:48] Change merged: Diederik; [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2517 [21:20:49] Change merged: Diederik; [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2516 [21:20:49] Change merged: Diederik; [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2515 [21:21:22] New patchset: Diederik; "Rename udp.c to udp-filter.c so now the binary file name and the source filename are consistent." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2518 [21:21:25] New patchset: Diederik; "Originally, this was udp.c, renamed to increase consistency." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2519 [21:21:54] New review: Diederik; "Ok." [analytics/udp-filters] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2519 [21:22:08] New review: Diederik; "Ok." [analytics/udp-filters] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2518 [21:22:09] Change merged: Diederik; [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2519 [21:22:09] Change merged: Diederik; [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2518 [21:29:05] !log db1001 rebooting, locked up [21:29:07] Logged the message, RobH [21:34:03] !log powercycling msw-a1-eqiad. [21:34:05] Logged the message, RobH [21:35:26] huh [21:36:00] !log powercycling msw-a2-eqiad resolves all mgmt issues in rack [21:36:02] Logged the message, RobH [21:37:09] binasher: ok, db1001 is back up, it was locked up and had mgmt switch issue [21:37:14] but its rebooted now and mgmt works [21:38:50] well, its in reboot now, my bad [21:41:21] !log cp1017 being tested for bad memory [21:41:23] Logged the message, RobH [21:45:16] RobH: what do you think about the lockup, should i reslave it or want to do a ram test? 
[21:45:33] if it happens once, it didnt really happen ;] [21:45:48] if it locks up a second time its worth pulling offline and doing hardware diagnostics [21:45:54] so i would push back to service, hyea [21:47:38] when i go to work on bad hardware, i do an RT search for all old tickets on a host [21:47:49] so if db1001 messes up in a month, i will see my old ticket on its lockup [21:48:49] anyone want to check out https://gerrit.wikimedia.org/r/#change,2512 ? it works in labs - want to make sure everything is in the right place [21:48:56] ok, sounds good [21:48:58] !log memory in cp1017 wasnt properly seated as far as i can tell, if it doesnt mess up again it should be ok. [21:49:00] Logged the message, RobH [22:02:50] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2512 [22:02:50] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2512 [22:17:04] !log fixing the labs apache2 puppet groups [22:17:05] Logged the message, Mistress of the network gear. [22:25:45] anyone know of a site that has webpages in various sizes for testing purposes ? Or a site that has a single html file that is about 1.5kbyte [22:26:28] trouble with MTU? :-( [22:27:22] testing new mtu stuff [22:27:26] :) [22:27:27] commons? [22:27:42] oh good idea, i can just scale the image [22:27:59] or you write some PHP script that generates whatever number of bytes you want [22:28:16] (still need to take care of HTTP headers though) [22:35:40] i decided to rescale a kitteh pic in commons [22:35:43] http://208.80.153.196/40px-Anteh-vandalism_kitteh_lolcat.jpg [22:35:50] world's tiniest lolcat [22:39:06] OH NO !! Leslie is pixellizing tiny cats! [22:41:44] LeslieCarr: here for you http://f.images.memegenerator.net/instances/500x/14411613.jpg [22:42:09] eeeek [22:42:20] that image is too scary [22:42:28] yeah sorry :-( [22:44:18] oh god that's horrible. [22:45:31] gosh .. 
looks like hasher ;-p [22:51:00] no cats were hurted during the photoshop session! [22:52:50] meow. [23:14:17] Ok, cleaning up and I am out of eqiad, online later [23:32:36] wow so iran is blocking 443 now [23:32:40] and 22 [23:32:51] and 993 [23:34:11] is something special going on? [23:34:37] 33rd anniversary coming up ? [23:34:51] or perhaps they looked at arab spring and decided to crack down harder
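For the MTU testing discussed around 22:25, mark's suggestion was a script that returns a body of an arbitrary byte count (LeslieCarr ended up rescaling a Commons image instead). A minimal, hypothetical sketch of such a generator — note, as the log says, the HTTP headers still add to the on-wire size, so an exact-MTU test has to account for them:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ExactSizeHandler(BaseHTTPRequestHandler):
    """Serve a body of exactly N bytes for GET /N (default 1500)."""
    def do_GET(self):
        try:
            size = int(self.path.lstrip("/"))
        except ValueError:
            size = 1500
        body = b"x" * size
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # silence per-request logging
        pass

# Demo: bind an ephemeral port and fetch a 1500-byte body.
server = HTTPServer(("127.0.0.1", 0), ExactSizeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/1500" % server.server_address[1]
data = urllib.request.urlopen(url).read()
print(len(data))  # 1500 bytes of body; headers are extra on the wire
server.shutdown()
```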