[00:07:00] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.670 seconds [00:20:40] !log deployed squid config to upload squids rolling thumbnails back to 75% handled by swift to test the & bug [00:20:42] Logged the message, Master [00:21:13] confirmed. [00:21:14] :( [00:22:41] !log increased nagios max concurrent checks on spence and lowered the interval between processing them [00:22:43] Logged the message, Mistress of the network gear. [00:27:51] hrm, i did the math and we need to allow 561 checks concurrently... [00:30:21] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:30:49] at the rate they run? [00:30:57] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:31:51] yeah [00:32:08] i'm going to try 512 and see how that goes :) [00:32:18] honestly the biggest issue has been when pushing a puppet change, spence freaks out [00:33:43] New patchset: Lcarr; "upping nagios checks to 512 concurrent checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2484 [00:34:38] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2484 [00:34:39] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2484 [00:45:26] New patchset: Asher; "http setup for gdash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2485 [00:46:25] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2485 [00:46:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2485 [00:49:15] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.889 seconds [00:52:06] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.459 seconds [01:05:50] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:11:14] 
RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.968 seconds [01:11:34] New patchset: Asher; "deleted a character too many" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2486 [01:11:55] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2486 [01:12:01] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2486 [01:12:02] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2486 [01:13:47] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:15:17] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:14] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.533 seconds [01:26:23] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [01:27:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:27:17] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [01:29:59] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.961 seconds [01:31:02] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [01:34:02] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:12] did anyone reboot srv278 [01:41:05] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.841 seconds [01:44:50] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.761 seconds [01:49:11] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 193 seconds [01:49:11] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:20] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 202 seconds [01:49:29] PROBLEM - MySQL Slave Delay on db1038 is 
CRITICAL: CRIT replication delay 200 seconds [01:50:05] PROBLEM - MySQL Replication Heartbeat on db1038 is CRITICAL: CRIT replication delay 235 seconds [01:50:23] PROBLEM - MySQL Replication Heartbeat on db1004 is CRITICAL: CRIT replication delay 253 seconds [01:50:41] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [01:52:02] PROBLEM - DPKG on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:52:20] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:53:14] RECOVERY - DPKG on db42 is OK: All packages OK [01:53:32] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [01:53:41] PROBLEM - mysqld processes on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:53:41] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:54:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:54:35] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.069 second response time [01:54:53] RECOVERY - Disk space on db42 is OK: DISK OK [01:55:02] RECOVERY - mysqld processes on db42 is OK: PROCS OK: 1 process with command name mysqld [01:55:56] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.535 seconds [01:56:23] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 602s [01:57:26] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 666s [01:58:47] RECOVERY - MySQL Replication Heartbeat on db1034 is OK: OK replication delay 30 seconds [01:59:05] RECOVERY - MySQL Slave Delay on db1034 is OK: OK replication delay 0 seconds [01:59:59] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:13:11] RECOVERY - HTTP 
on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.696 seconds [02:17:14] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:18:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.164 seconds [02:28:02] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.168 seconds [02:37:29] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:23] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 59s [02:38:50] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.560 seconds [02:38:50] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:40:20] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.684 seconds [02:42:53] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:53] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:49:10] New patchset: Bhartshorne; "fixing bug where images with an ampersand \& in the name fail to load through swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2487 [02:49:47] !log deploying fix for & bug with swift (files with an & in the name wouldn't load properly) [02:49:49] Logged the message, Master [02:50:27] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2487 [02:50:28] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2487 [02:52:21] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.014 seconds [02:52:21] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
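The 02:49 deploy above fixes the "&" bug: thumbnails whose filenames contain an ampersand failed to load through Swift. The actual patch lives in the puppet-managed rewrite layer (gerrit r/2487, not shown here); the sketch below only illustrates the general class of fix, namely percent-encoding the object name before it is placed in the backend URL so "&" cannot be misparsed as a parameter separator. The account and container names are hypothetical.

```python
from urllib.parse import quote

def swift_object_path(container: str, filename: str) -> str:
    """Build a Swift object URL path, percent-encoding every reserved
    character in the object name (including '&') so the backend request
    is unambiguous. AUTH_test is a placeholder account."""
    return "/v1/AUTH_test/%s/%s" % (container, quote(filename, safe=""))

# An unencoded '&' in the path is the failure mode being avoided here.
path = swift_object_path("thumbs", "Foo_&_Bar.jpg")
print(path)  # /v1/AUTH_test/thumbs/Foo_%26_Bar.jpg
```

`quote(..., safe="")` leaves unreserved characters (letters, digits, `_`, `.`, `-`, `~`) alone and encodes everything else, which is why only the ampersand changes in this example.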
[02:52:47] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:54:53] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [02:55:20] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [02:56:23] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:04:20] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.291 seconds [03:07:29] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.565 seconds [03:11:57] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:03] RECOVERY - MySQL Replication Heartbeat on db1002 is OK: OK replication delay 0 seconds [03:14:30] RECOVERY - MySQL Slave Delay on db1002 is OK: OK replication delay 0 seconds [03:16:09] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:17:03] RECOVERY - MySQL Slave Delay on db1038 is OK: OK replication delay 0 seconds [03:17:30] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.167 seconds [03:18:15] RECOVERY - MySQL Replication Heartbeat on db1038 is OK: OK replication delay 0 seconds [03:19:18] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 211 seconds [03:19:54] PROBLEM - MySQL Slave Delay on db1004 is CRITICAL: CRIT replication delay 247 seconds [03:21:15] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [03:21:15] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [03:21:33] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:48] PROBLEM - MySQL Replication Heartbeat on db1005 is CRITICAL: CRIT replication delay 212 seconds [03:24:15] PROBLEM - MySQL Slave Delay on db1005 is CRITICAL: CRIT replication delay 239 seconds [03:24:33] PROBLEM - MySQL Replication Heartbeat 
on db1021 is CRITICAL: CRIT replication delay 257 seconds [03:24:33] PROBLEM - MySQL Slave Delay on db1021 is CRITICAL: CRIT replication delay 257 seconds [03:25:54] RECOVERY - MySQL Replication Heartbeat on db1021 is OK: OK replication delay 0 seconds [03:25:54] RECOVERY - MySQL Slave Delay on db1021 is OK: OK replication delay 0 seconds [03:26:39] RECOVERY - MySQL Replication Heartbeat on db1005 is OK: OK replication delay 0 seconds [03:27:06] RECOVERY - MySQL Slave Delay on db1005 is OK: OK replication delay 0 seconds [03:28:09] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.602 seconds [03:33:33] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:40:27] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 182 seconds [03:40:27] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 223 seconds [03:43:00] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 217 seconds [03:43:09] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 204 seconds [03:44:12] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.341 seconds [03:46:54] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [03:48:24] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:48:33] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.855 seconds [03:49:54] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 210 seconds [03:49:54] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 210 seconds [03:51:51] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:51:51] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:52:09] PROBLEM - mysqld processes on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:09] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:09] PROBLEM - DPKG on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:18] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:18] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:27] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 210 seconds [03:52:27] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.904 seconds [03:52:27] PROBLEM - RAID on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:27] PROBLEM - MySQL Slave Delay on db1003 is CRITICAL: CRIT replication delay 211 seconds [03:52:37] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:03] RECOVERY - Disk space on db42 is OK: DISK OK [03:53:21] RECOVERY - mysqld processes on db42 is OK: PROCS OK: 1 process with command name mysqld [03:53:21] RECOVERY - MySQL disk space on db42 is OK: DISK OK [03:53:21] RECOVERY - DPKG on db42 is OK: All packages OK [03:53:30] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [03:53:30] RECOVERY - MySQL Idle Transactions on db42 is OK: OK longest blocking idle transaction sleeps for 0 seconds [03:53:39] RECOVERY - RAID on db42 is OK: OK: State is Optimal, checked 2 logical device(s) [03:54:15] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [03:56:30] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:56:39] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.879 seconds [04:00:41] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [04:01:44] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.892 seconds [04:04:44] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.506 seconds [04:05:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:12:50] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:16:35] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.941 seconds [04:16:53] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.832 seconds [04:20:47] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:20:56] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:23] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.761 seconds [04:27:32] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.293 seconds [04:31:26] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:35:20] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.519 seconds [04:35:47] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:38:29] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.118 seconds [04:42:32] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:43:53] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.459 seconds [04:46:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:47:39] RECOVERY - MySQL Replication Heartbeat on db1004 is OK: OK replication delay 0 seconds [04:47:39] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.106 seconds [04:47:47] RECOVERY - MySQL Slave Delay on db1004 is OK: OK replication delay 0 seconds [04:47:56] 
PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:50:29] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.039 seconds [04:51:50] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:52:08] PROBLEM - MySQL Recent Restart on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:52:26] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:52:26] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:52:53] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:53:50] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [04:54:17] RECOVERY - MySQL Idle Transactions on db42 is OK: OK longest blocking idle transaction sleeps for 0 seconds [04:54:26] RECOVERY - MySQL Recent Restart on db42 is OK: OK 4250487 seconds since restart [04:54:44] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [04:55:02] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [04:58:38] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [04:58:47] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [05:08:32] !log deployed squid config to uploads to send 100% of thumbnail traffic to swift [05:08:34] Logged the message, Master [05:43:11] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:44:32] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:47:05] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 0.027 seconds [05:47:05] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.008 seconds [05:51:40] PROBLEM - 
MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:51:40] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:51:58] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:52:34] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:52:52] PROBLEM - DPKG on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:52:52] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:52:52] PROBLEM - mysqld processes on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:53:01] RECOVERY - Disk space on db42 is OK: DISK OK [05:53:01] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [05:53:01] RECOVERY - MySQL disk space on db42 is OK: DISK OK [05:53:46] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 1 seconds [05:54:04] RECOVERY - mysqld processes on db42 is OK: PROCS OK: 1 process with command name mysqld [05:54:04] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [05:54:04] RECOVERY - DPKG on db42 is OK: All packages OK [06:06:49] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 198 seconds [06:07:52] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 262 seconds [06:08:10] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay 0 seconds [06:09:13] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 11 seconds [06:19:52] PROBLEM - MySQL Slave Delay on db1019 is CRITICAL: CRIT replication delay 236 seconds [06:20:01] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: CRIT replication delay 243 seconds [06:22:34] RECOVERY - MySQL Slave Delay on db1019 is OK: OK replication delay 0 seconds [06:22:34] RECOVERY - 
MySQL Replication Heartbeat on db1019 is OK: OK replication delay 1 seconds [06:51:50] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:51:50] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:51:50] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:51:52] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:51:52] PROBLEM - Full LVS Snapshot on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:52:01] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:50:52] !log Making cp1001-1005 API squids [11:50:54] Logged the message, Master [12:02:54] !log Decommissioning sq38, sq46 and sq47 in squid configurator [12:02:56] Logged the message, Master [12:06:00] New patchset: Mark Bergsma; "Decommission sq38, sq46, sq47" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2488 [12:06:20] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/2488 [12:07:09] New patchset: Mark Bergsma; "Decommission sq38, sq46, sq47" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2488 [12:07:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2488 [12:07:39] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2488 [12:07:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2488 [12:17:54] New patchset: Mark Bergsma; "Decommission sq31" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2489 [12:18:17] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2489 [12:18:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2489 [12:18:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2489 [12:18:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2489 [12:25:35] New patchset: Mark Bergsma; "Decommission sq35" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2490 [12:25:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2490 [12:26:02] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2490 [12:26:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2490 [13:32:09] New patchset: Mark Bergsma; "Remove old nagios host/service groups for squid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2491 [13:32:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2491 [13:32:41] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2491 [13:32:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2491 [13:42:54] !log Configured cp1001 and cp1020 to contact backend servers directly instead of via pmtpa squids [13:42:56] Logged the message, Master [14:13:44] New patchset: Mark Bergsma; "Remove decommissioned servers from api list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2492 [14:14:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2492 [14:23:21] New patchset: Mark Bergsma; "Update more graphs for eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2493 [14:23:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2493 [14:24:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2492 [14:24:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2492 [14:25:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2493 [14:25:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2493 [14:36:20] New patchset: Mark Bergsma; "Corrections" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2494 [14:36:43] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2494 [14:36:49] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2494 [14:36:49] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2494 [14:40:04] New patchset: Hashar; "(bug 34141) notify jenkins on ANY merge" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2495 [14:40:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2495 [14:47:45] New patchset: Mark Bergsma; "Add mobile graphs, corrections" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2496 [14:48:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2496 [14:48:24] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2496 [14:48:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2496 [14:50:53] New patchset: Mark Bergsma; "Fix color spec" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2497 [14:51:19] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2497 [14:51:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2497 [14:51:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2497 [14:52:53] New review: Demon; "Why are we doing this on merge? Shouldn't this go on push (before the merge happens)?" 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2495 [14:55:15] New patchset: Mark Bergsma; "Automatically clear the cache on Torrus config changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2498 [14:55:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2498 [14:55:45] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/2498 [14:56:37] New patchset: Mark Bergsma; "Automatically clear the cache on Torrus config changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2498 [14:57:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2498 [14:57:21] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2498 [14:57:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2498 [14:59:27] New patchset: Mark Bergsma; "Fix api colors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2499 [14:59:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2499 [14:59:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2499 [15:01:15] New patchset: Mark Bergsma; "Use LINE1 for service times" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2500 [15:01:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2500 [15:01:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2500 [15:01:42] Change merged: Mark Bergsma; [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/2500 [15:01:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2500 [15:03:15] New patchset: Mark Bergsma; "Really fix API colors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2501 [15:03:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2501 [15:03:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2501 [15:03:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2501 [15:07:16] New patchset: Mark Bergsma; "More color fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2502 [15:07:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2502 [15:07:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2502 [15:07:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2502 [15:14:51] New patchset: Mark Bergsma; "Turn on KeepAlive on application servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2504 [15:15:19] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2504 [15:16:07] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2504 [15:16:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2504 [15:17:54] !log Turned on KeepAlive on apaches for better miss service times from eqiad [15:17:56] Logged the message, Master [15:40:27] New patchset: Mark Bergsma; "/etc/init.d/torrus-common restart has proven unreliable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2505 [15:40:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2505 [15:41:04] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2505 [15:41:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2505 [16:07:26] !log Rebalanced appserver load balancing by giving the new mw* pmtpa app servers weight 150 in the pybal server list [16:07:27] Logged the message, Master [16:37:36] hi notpeter, are you around? [16:45:22] hi robh [16:45:31] heyas [16:45:39] can i ask you for a quick favor? [16:45:51] you can ask =] [16:45:54] :) [16:45:58] you need 5 mins of C code review? ;-) [16:46:04] :) :) [16:46:29] i need some files transferred from emery to the analytics virtual labs instance [16:46:34] i filed an rt ticket [16:46:44] but without those files, i cannot make progress [16:46:58] you dont have emery access? [16:46:58] http://rt.wikimedia.org/Ticket/Display.html?id=2421 [16:47:15] i can add you to the virtual instance [16:47:17] New review: Hashar; "That basically copy the way subversion works for now." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2495 [16:50:08] !log running sync-apache, trying to redirect office.wm to https [16:50:10] updated ticket, afk a moment, receiving in items in eqiad [16:50:10] Logged the message, Master [16:51:09] robh: files are in /a/squid/archive/sampled [16:51:16] and they contain ip addresses [16:52:31] is that ok to move into labs? [16:52:48] i hope so, else there is a real catch 22 going on here [16:53:30] i cannot install mysql-server on stat1 (for the moment) and i cannot move files to labs (that are on stat1) which has a mysql-server [16:53:38] i need a machine with both the files and mysql server [16:54:57] heh [16:55:11] mark: is there any issue with putting sampled log files for squid into a labs instance? [16:55:29] drdee: go ahead and give me access to the instance if you would, we can assume its ok and prepare for moving the files now [16:55:30] the ideal situation is fixing ticket: [16:55:44] and if its ok once mark replies great [16:55:44] if [16:55:45] not [16:55:48] http://rt.wikimedia.org/Ticket/Display.html?id=2411 [16:55:50] damned colloquy bug [16:55:52] =P [16:56:04] stat1 is public facing [16:56:10] i know :) :) [16:56:10] running mysql on it is bad [16:56:21] I recall there is a conversation about this going on via email right? [16:56:21] running mysql anywhere is bad [16:56:22] that's why it is going to become private [16:56:24] yes [16:56:42] mark: so can sampled squid logs be pushed to a labs instance safely? 
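The 16:07 log entry above rebalanced the application servers by giving the new mw* pmtpa hosts weight 150 in the pybal server list. PyBal's server lists are, roughly, one Python dict literal per line; the fragment below is a hypothetical sketch (hostnames and the exact key set are assumptions, not the production file) showing how such a list parses and what a weight of 150 means relative to weight-100 peers under weighted scheduling.

```python
import ast

# Hypothetical pybal-style server list: one Python dict literal per line.
SERVER_LIST = """\
{'host': 'mw1.example.wmnet', 'weight': 150, 'enabled': True}
{'host': 'srv200.example.wmnet', 'weight': 100, 'enabled': True}
{'host': 'srv201.example.wmnet', 'weight': 100, 'enabled': False}
"""

def expected_share(servers, host):
    """Fraction of traffic a host receives under weighted scheduling,
    counting only enabled servers."""
    enabled = [s for s in servers if s['enabled']]
    total = sum(s['weight'] for s in enabled)
    return next(s['weight'] for s in enabled if s['host'] == host) / total

servers = [ast.literal_eval(line) for line in SERVER_LIST.splitlines()]
share = expected_share(servers, 'mw1.example.wmnet')  # 150 / (150 + 100)
```

So a weight-150 host among weight-100 peers takes proportionally more of the pool, which is the point of the rebalance.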
[16:56:47] i'm not sure [16:56:57] in theory, yes [16:57:00] but i'd like ryan to confirm [16:57:04] since I'm not fully up to date on labs [16:57:10] http://rt.wikimedia.org/Ticket/Display.html?id=2412 [16:57:20] is the ticket to make stat1 private [16:57:41] but it depends on: [16:57:41] http://rt.wikimedia.org/Ticket/Display.html?id=2165 [16:57:46] there is discussion whether you should be using a managed mysql cluster instead of a mysqld on stat1 [16:57:54] (technically it does not) [16:58:19] for the moment, i just need a single mysql server [16:58:24] drdee: So I am going to update the ticket and assign it to Ryan to confirm that it is ok to move the data there. In anticipation, please go ahead and grant me access to your instance. [16:58:31] ok [16:58:33] thanks! [16:58:39] once he confirms its secure (labs) then I will go ahead and snag it [16:58:48] you want just the sampled logs for a week or so? [16:58:53] (how much data you want?) [16:58:55] for 1 month [16:58:57] 30 days [16:59:13] if that fits, else 1 or 2 weeks is fine as well [16:59:17] just anything :) [16:59:46] can we remove the dependency 2165 from 2412? [17:00:16] its not linked right now [17:00:33] 2412 isnt tied to any other ticket [17:01:15] done, i added you to labs instance [17:01:31] but 2412 says it depends on 2165, or am I reading it wrong? [17:01:43] "Dependency on ticket #2165 added" [17:02:42] the intention was just to link all the stat1 tickets in the same place, and the others were dependencies [17:02:56] oh i am transposing numbers =P [17:03:26] i think moving stat1 is not the solution, the solution is adding this data to a misc db cluster [17:03:30] maybe "children" is more appropriate [17:03:38] running stat1 as a mysql server isnt good. [17:04:02] sorry for asking, but why not? [17:04:03] plus its old as sin, and has no warranty [17:04:04] using [17:04:08] wow... 
brb [17:04:08] that's bayes [17:04:10] not stat1 [17:04:14] oh, thats right. [17:04:55] drdee: the folks in email are covering it better than me ;] [17:14:07] !log oxygen offline for hard disk upgrade to replace locke [17:14:08] Logged the message, RobH [17:14:34] <^demon> Is http://svn.wikimedia.org/viewvc/mediawiki/wikimedia-web/ still used for anything? [17:15:08] mark or apergos: any ideas on why the day view is so spikey? [17:15:11] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-fe2.pmtpa.wmnet&c=Swift%20pmtpa&m=swift_GET_200_hits&r=day [17:16:15] as in, why the request pattern is so spikey? [17:16:18] not really no [17:16:30] that's really quite odd [17:17:33] hmm but [17:17:40] the spikes aren't such wide variants right? [17:17:57] I mean between 8 and 14... eh [17:20:31] I was thinking individual squids blowing their cache might account for it [17:20:44] but have any of them done so? [17:21:41] the 404 graph is totally smooth, fwiw. [17:25:16] yeah, dunno [17:33:29] about to apache-graceful-all for a small addition to redirects.conf [17:36:13] maplebed: no upload squids, no [17:36:34] yeah, didn't think so. [17:36:35] the only thing I can think of is my decommissioning of a few upload squids today [17:36:38] which were down already [17:36:39] oh well... [17:36:40] that /should/ not matter [17:36:50] do you have timestamps? [17:36:52] but perhaps due to CARP hashing, it did matter a little bit [17:36:56] yeah, in the SAL [17:37:00] * maplebed goes to look [17:37:18] other than that, I only worked on text squids today [17:38:14] doesn't look like the timestamps match up [17:38:30] the closest is cp1001 and cp1020 [17:38:40] but they're not part of upload, right? [17:38:43] no [17:38:54] oh well. [17:41:10] I'm starting a read load test against the swift cluster to look for qps ratings while under real world load. [17:41:14] it won't last more than a few minutes. 
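Mark's point at 17:36 is that removing a member from a CARP array only remaps the requests that hashed to the removed member, while the rest stay put. A toy sketch of that property using highest-random-weight hashing — this is the general idea behind CARP, not Squid's actual implementation (Squid uses its own hash function and member weights; the md5 scoring and server names here are illustrative assumptions):

```python
import hashlib

def carp_pick(url, members):
    # CARP-style highest-random-weight: score every (member, url) pair
    # and route the request to the member with the highest score.
    return max(members,
               key=lambda m: hashlib.md5((m + url).encode()).hexdigest())

members = ["sq41", "sq42", "sq43", "sq44"]          # hypothetical squids
urls = ["/thumb/%d.jpg" % i for i in range(1000)]

before = {u: carp_pick(u, members) for u in urls}
after = {u: carp_pick(u, members[:-1]) for u in urls}  # sq44 decommissioned

# Only URLs that hashed to the removed member move; everything else is
# unaffected, so the cache-hit impact is limited to that member's share.
moved = sum(1 for u in urls if before[u] != after[u])
print(moved == sum(1 for u in urls if before[u] == "sq44"))
```

This is why decommissioning a couple of upload squids "should not matter" much: each removal invalidates only roughly 1/N of the cached working set.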
[17:42:07] ok [17:42:45] New review: Demon; "As long as the create patchset hook has the changeset, you could build a cherry-pick url like the on..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2495 [17:47:53] mediawiki question--where do I find documentation on a top block like this: __NOEDITSECTION__\n{{ServiceOperations ? [17:48:27] http://www.mediawiki.org/wiki/Help:Magic_words [17:48:50] thanks! [17:48:56] yw [17:55:20] there are two issues with the current SSL squid logging [17:55:20] 1) The mimetype should be urlencoded in the logfile, there are some instances where it returns "application/x-www-form-urlencoded; charset=UTF-8" the space is throwing us off [17:55:20] 2) The path part for API calls is missing: GET https://fr.wikipedia.org/w/api.php [17:55:21] I'll file an RT ticket. [17:57:24] mutante: are you familiar with mw templates? [17:58:10] Jeff_Green: as long as they are basic..i guess [17:58:37] where are you editing? [17:58:39] oh actually, I just found what I was looking for! [17:58:47] !log updating dns for oxygen internal ip [17:58:48] Logged the message, RobH [17:59:03] i'm down a rathole working on office-wiki payments infrastructure documentation [17:59:34] i was trying to create a table, and I've finally figured out that it fails to render because there's a template applied to the page I'm working on [18:00:06] http://office.wikimedia.org/wiki/Payments_cluster [18:00:59] template looks not terribly useful, perhaps I'll just un-use it. [18:02:01] RobH: can we prioritize http://rt.wikimedia.org/Ticket/Display.html?id=2412 and http://rt.wikimedia.org/Ticket/Display.html?id=2411 [18:02:56] Jeff_Green: i see, that would be table inside a table , created by the template. yea, probably un-use [18:03:11] diederik: only if folks are in agreement, i dont think folks will agree to run mysql on it. [18:03:19] we tend to NOT run mysql on hosts. [18:03:22] misc hosts i mean [18:03:38] so what is the solution? 
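The first SSL-logging issue above is that a mimetype like "application/x-www-form-urlencoded; charset=UTF-8" contains a space, which breaks a space-delimited log line. Percent-encoding the field before writing it, as drdee requests, avoids that. A minimal sketch in Python of the idea (the actual logger is not Python; this only illustrates the encoding):

```python
from urllib.parse import quote, unquote

def encode_log_field(value):
    # Percent-encode everything (safe="") so spaces and other delimiter
    # characters cannot split a space-delimited log record.
    return quote(value, safe="")

mime = "application/x-www-form-urlencoded; charset=UTF-8"
field = encode_log_field(mime)
print(" " not in field)        # the encoded field has no spaces
print(unquote(field) == mime)  # and decodes back to the original value
```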
[18:03:39] Jeff_Green: yes, it's like suggesting a page structure for you, but you can follow the same standard without the template.. [18:03:40] we have them all talk to their own database on the misc db cluster (db9/10 which are being replaced) [18:03:54] mutante: ya exactly [18:03:56] so i would say the solution is to do the sql data on db9 or its replacement [18:04:23] okay, can you give me and Andrew Otto access to db9? [18:04:24] i previewed w/o the template definition and it didn't change much [18:04:30] Jeff_Green: unless that [18:04:39] Jeff_Green: unless that would actually be semantic mediawiki [18:04:55] diederik: The best thing is drop a ticket with what you would like the db to be called and we can create a user and password with it restricted to stat1 as the user host [18:05:04] also how much space you expect the db will be using [18:05:05] db9 [18:05:06] is [18:05:11] old and going away with a replacement [18:05:25] or append to the one ticket actually [18:05:32] about mysql. [18:05:36] ok, another ticket :D [18:05:48] mutante: as in it's tying this into other pages? [18:05:48] well not new [18:05:49] http://rt.wikimedia.org/Ticket/Display.html?id=2411 [18:05:59] i'll append [18:06:30] i commented [18:07:21] heh, db9 has plenty of space with otrs gone off it. [18:07:36] New patchset: Hashar; "note about going to 1.20" [test/mediawiki/core] (master) - https://gerrit.wikimedia.org/r/2506 [18:07:44] 236gb [18:08:25] Jeff_Green: as in "lets you query data within the wiki pages" http://semantic-mediawiki.org/ , used on labs wiki, but not on office [18:09:09] New patchset: Hashar; "adding gitreview config file" [test/mediawiki/core] (master) - https://gerrit.wikimedia.org/r/2507 [18:09:34] mutante: how are you able to tell whether or not it's used on office? [18:09:49] Robh: ticket updated [18:10:38] diederik: you expect to take a TB for a database? 
[18:10:46] Jeff_Green: it's listed on the page Special:Version and adds a little icon [18:10:50] over time, yes [18:11:02] ok....then db9 wont work [18:11:06] mutante: spiffy. thank you! killing template . . . [18:11:16] okay, then 250GB? [18:11:17] https://labsconsole.wikimedia.org/wiki/Special:Version [18:11:32] yup i see now [18:11:44] diederik: the issue is db9 is a shared db host. so i guess you guys have enough data that you need a dedicated host. [18:11:55] ideally we do not have hosts running databases doing other things. [18:12:07] yes, that will happen [18:13:08] hrmm. [18:13:28] what are the services that stat1 presently provides? [18:13:51] nothing so far, nobody is using it yet except for me [18:14:04] stat1 is replacing bayes [18:14:04] bayes is still the production one [18:14:31] so stat1 if its replacing bayes will need to stay on public IP. [18:14:37] correct? [18:15:07] i honestly don't know, ask erik z, he uses bayes and i don't think he needs a public ip [18:15:10] but ask him [18:15:13] bayes is going away [18:15:17] its being replaced by stat1 [18:15:18] it has a public Ip . erik z ssh to it [18:15:30] right [18:15:38] so stat1 cannot be private ip [18:15:42] ok [18:15:45] not if we want the public to hit it directly [18:15:59] It also has a ton more storage, due to Erik Z saying he could use it. [18:16:14] So, we cannot make db9 your database on the project if you need that much space. [18:16:21] The alternative is ordering a new DB host for this. [18:16:27] how much can db9 give [18:16:28] stat1 needs to have some NFS mounts, described in puppet/manifests/misc/statistics.pp; meanwhile, it needs to talk to 10.0.5.8 and 208.80.152.185 [18:16:30] ? [18:16:44] (and currently there are problems with that) [18:16:44] db9 has 235 GB free. [18:17:08] can you give me 150? [18:17:10] on db9 [18:17:39] its not dedicated space, just shared. 
we can set ya up there and see how it goes [18:17:52] i just dont wanna misrepresent what db9 is (also we are moving it on saturday so it will have downtime then) [18:17:58] ok, thanks [18:18:03] i understand [18:18:07] its also going to be replaced in the next month or so with a new server that has more space. [18:18:14] so if that works, then we can do that. [18:18:49] i am fine with that [18:19:48] so let's kill ticket 2412 before someone takes that ticket [18:20:55] cool, done [18:21:16] note how maplebed asked if this can be in labs [18:21:31] ? [18:22:49] via email, he asked erik z. if the stat1 config could be a labs project [18:22:51] I'm pretty sure stat1 does want to move to an internal IP address [18:22:58] despite the fact that it is replacing bayes. [18:23:01] ..... [18:23:13] bayes serves data to public http [18:23:18] it's not a straight up service-for-service replacement, it's a replacement in duty. [18:23:21] how is that going to occur on stat1 [18:23:30] the data will be served to the public via an API running in a mediawiki extension [18:23:38] so will go through the cluster rather than being hit directly. [18:23:45] at least so says ottomatic, who's building the thing. [18:23:47] does this change that db9 should run the mysql? [18:24:07] I'm pretty sure db9 shouldn't be running the mysql instance supporting this, [18:24:12] or at least that stat1 should also. [18:24:51] it sounds like a conversation is in order. [18:24:55] so you think stat1 should run mysql? [18:25:05] yea, diederik this isnt gonna get settled now. [18:25:08] between diederik erikz, otto, one of us, etc. [18:25:11] ok, restarting [18:27:04] RobH: but this needs to be settled soon :D me, andrew, fabian and andre cannot make progress with mobile analytics and the reportcard projects [18:27:18] it doesnt sound like im the one to decide ;] [18:27:24] maplebed seems to know more whats goin on. 
[18:27:28] and mutante [18:27:44] basically someone needs to let diederik know where we can run a database instance of mysql for his project [18:28:11] so if db9 isnt suitable, maplebed, where should we run this, on stat1?? [18:28:23] traditionally we do not serve mysql on shared services hosts. [18:28:31] diederik: do you have written down anywhere how this thing is going to work? I've heard now both that it needs to be publicly accessible and that it doesn't, that it needs lots of space and that db9 will be accessible, that it will also be crunching data on hadoop and that it's only for the report card, etc. [18:28:44] it's hard to help you set up a service with so many seemingly-conflicting requirements. [18:29:28] maplebed: it is not that complicated: i need a mysql server with a decent amount of storage [18:29:49] that's all [18:30:02] so why are people talking about it needing a public IP? [18:30:23] because stat1 has been described as a 'bayes replacement'. [18:30:27] is that the same thing as what you're doing? [18:30:29] well, me and erik z need to be able to ssh into it [18:30:40] (and bayes currently has a public IP) [18:31:07] and so erik z will start running his scripts on stat1 that are now running on bayes [18:31:09] do you see my trouble here? [18:32:16] so why can't we have stat1 access on db9 the mysql server? [18:32:26] and then stat1 can stay as it is [18:32:33] Does bayes not serve any data publicly anymore? [18:32:43] you could start a labs instance, select an instance type with larger storage, apply mariadb classes and have a local "mysql" ..to be able to develop until it goes live.. 
[18:32:52] mutante: [18:32:53] no [18:32:55] no [18:32:56] i can't [18:32:58] no secret data in labs yet [18:33:06] this is really a catch 22 [18:33:07] oh, secret data.ok [18:33:37] robh: ask erik z if it serves public data [18:33:40] i don't think so [18:33:49] i only think he ssh into that box and runs his scripts [18:33:52] but i am not 100% sure [18:33:54] then you can do that via fenari [18:34:03] +1 mark [18:34:06] if it doesn't need to be publicly accessible, we should move it to internal anyway [18:34:19] again: confirm with erik z [18:34:23] but I'd still not like to see a mysql instance on there, if it needs to be anywhere near production grade [18:35:37] my only hesitation about using db9 is that I'm used to analytics jobs abusing their database, and db9 also serves production services such as the blog, bugzilla, etc. [18:35:54] (mark - is my understanding about what db9 does correct?) [18:36:01] yeah I'm also not sure db9 is the right box/cluster [18:36:05] yes [18:36:20] if necessary we need to setup something new [18:36:39] so I hear the hesitation about stat1 serving mysql (huh? no replication? what'll happen when a disk goes boom and all your data goes byebye?) [18:36:46] yeah [18:37:04] also stat1 is a bit of a black box for ops [18:37:10] if its not dedicated hardware for db its also a bit harder to get asher to do review, it has a bunch of shit we dunno on it [18:37:13] we provide access and don't do much more (although that could change) [18:37:15] my best guess for how to make this work: [18:37:32] put the db on db9 and use stat1 as a dev box. [18:37:40] before it goes production, give us better estimates on load [18:37:51] and if necessary we'll create a new db cluster for the project [18:37:54] does that sound reasonable? [18:37:55] yeah [18:37:58] and as long as it's on db9 [18:37:58] definitely! [18:38:04] thanks guys! 
[18:38:05] we will be able to shut it off if it causes problems [18:38:13] db9 also doesn't have a lot of space [18:38:15] need to check that too [18:38:20] db9 is being migrated to a new box [18:38:23] but that will be another few weeks [18:38:24] db9 has 235G free [18:38:32] ah because otrs went away [18:38:32] yeah [18:38:48] maplebed: can you confirm with asher [18:39:03] it still sounds to me like stat1 can move to an internal IP, but I'm less concerned since it won't have a DB on it. [18:39:14] diederik: how about you write all this up? [18:39:22] (then asher and others can review) [18:39:24] +1 [18:39:35] on the RT ticket please [18:39:42] sure [18:39:48] (other == erikz too, since stat1 is still the "bayes replacement"...) [18:39:54] which one, the 2411? [18:40:03] I suggest creating a new one. [18:40:15] and actually, I'd prefer a wiki page that the RT links to [18:40:27] so that we can actually see the plan as a whole instead of the sum of a bunch of comments [18:40:31] but that's just me. [18:40:31] ok [18:41:04] yeah, as long as it's somehow reachable from the rt ticket(s) ;) [18:41:30] ok, i'll mock up a wiki page, an rt ticket and an email alerting everybody [18:43:44] there's also stat1001 btw [18:43:46] in eqiad [18:43:55] ideally you make sure that's your backup for everything you do on stat1 [18:58:02] ok [19:09:22] okay, done: http://rt.wikimedia.org/Ticket/Display.html?id=2431 [19:09:25] and [19:09:25] http://www.mediawiki.org/wiki/Analytics/Infrastructure/Stat1#Relevant_tickets [19:09:30] http://www.mediawiki.org/wiki/Analytics/Infrastructure/Stat1 [19:10:39] drdee: it's not going to run a web server? [19:10:55] (doesn't the reportcard have a frontend as part of it? [19:10:57] ) [19:12:31] !log oxygen setup and installed per rt2343, still needs puppet runs and full deployment per rt 2430 [19:12:33] Logged the message, RobH [19:13:06] diederik: So I just finished up the OS install on oxygen, the locke replacement. 
[19:13:30] so as far as you know, it does everything locke does, and I should be able to just give it the same puppet info [19:13:32] ? [19:13:59] whoever is trying to nfs mount /data from stat1 [19:14:13] you're going to have to change the exports file on dataset2 and re-export over there [19:14:33] (exports is puppetized but the remount does not work via puppet) [19:14:33] RobH: cool! [19:15:06] RobH: and in addition, Locke should be configured as a proxy server to enable multicast logging [19:15:09] * maplebed added NFS to the list of requested services [19:15:32] s/remount/re-export/ [19:18:18] maplebed: yes, you are right, sorry i missed that [19:19:24] i added mediawiki as a requested incoming service [19:20:15] more detail around that'd be great - does it have to be running or do you just need the libraries? [19:20:22] i.e. will it be listening on port 80? [19:20:31] does it need apache or just the mediawiki source? [19:20:50] it needs a running mediawiki installation [19:21:51] for people to hit or scripts? (if people, only those that also have ssh access or a wider audience?) [19:22:04] (this is influencing the public / private IP discussion) [19:23:11] for people to access the reportcard using their browsers [19:23:33] that sounds to me like a public IP. :( [19:26:25] the NFS bit is interesting for stat1001 because NFS cross colo sucks ass. [19:26:39] but it should be fine for stat1 [20:02:36] AaronSchulz: does your comment to https://bugzilla.wikimedia.org/show_bug.cgi?id=34231 imply that you still have work to do on the bug and I should assign it back to you? 
[20:03:00] I'm finishing up that work now [20:04:31] ok, you'll probably finish long before I do anything so I'll keep ownership [20:04:51] I'll get tim to review first as well [20:05:30] I think there are actually a lot of files that deserve purging that got into swift as a byproduct of my thumbnail filler [20:05:48] when I did a container listing grepping for & I found many paths that were a filename + query parameters. [20:06:24] eg 5/5d/Weather_Report_19810611_shinjuku_fn23.jpg/720px-Weather_Report_19810611_shinjuku_fn23.jpg&crop&fallback=hub_music&prefix=q [20:07:08] heh [20:10:15] '(mid|seek(?:=|%3D|%3d)[0-9.]+)-([^/]*)$!' [20:10:17] * AaronSchulz giggles [20:17:35] New patchset: Lcarr; "Adding in default gateway fact for puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2512 [20:17:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2512 [20:25:12] drdee: stat1 mediawiki instance == just for dev, right? [20:26:44] what are the other options? :) [20:27:02] eventually, the reportcard will go into production [20:27:16] but for the time being, it's only dev [20:38:37] New patchset: Hashar; "jenkins: git preparaton script for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2513 [20:38:59] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/2513 [20:40:54] oh no [20:45:26] bleh, i had to rewire b1/b2 eqiad, they werent right. [20:45:34] binasher: racking your two new en dbs in eqiad today. [20:46:11] thanks, looked like the pmtpa ones came in too? 
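The pattern AaronSchulz quotes at 20:10 looks like a fragment of a PHP preg expression (the trailing "!" is most likely the preg delimiter, not part of the pattern), apparently aimed at thumbnail names carrying mid/seek parameters like the query-parameter paths shown above. A rough Python rendering of what it matches — the example paths are hypothetical, and this is only an illustration, not the actual MediaWiki code:

```python
import re

# The trailing "!" from the log is treated as a PHP preg delimiter and
# dropped; everything else is the pattern as quoted.
pat = re.compile(r'(mid|seek(?:=|%3D|%3d)[0-9.]+)-([^/]*)$')

# A timed-media-style thumbnail name with a seek offset matches:
m = pat.search('thumb/0/00/Foo.ogv/seek=12.5-Foo.ogv.jpg')
print(m.group(1), m.group(2))  # seek offset, trailing filename

# A plain static thumbnail does not:
print(pat.search('thumb/5/5d/Bar.jpg/720px-Bar.jpg'))  # None
```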
[20:46:28] yep, not sure on status on those yet [20:46:38] they should be onsite and such [20:46:52] New patchset: Hashar; "jenkins: git preparation script for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2513 [20:47:04] binasher: looks like they are racked, https://rt.wikimedia.org/Ticket/Display.html?id=2392 [20:47:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2513 [20:53:58] !log updating dns for new db hosts [20:54:00] Logged the message, RobH [20:56:57] New patchset: Hashar; "pyc files are now ignored" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2514 [20:57:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2514 [21:07:00] binasher: ok, the eqiad ones are ready for install, just need the networking setup, i dropped a ticket for that [21:09:19] New patchset: Diederik; "Ignore more build-specific files." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2515 [21:09:22] New patchset: Diederik; "Simple script to test code quality." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2516 [21:09:23] New patchset: Diederik; "Improving support for regular expressions." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2517 [21:20:07] New review: Diederik; "Ok." [analytics/udp-filters] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2517 [21:20:26] New review: Diederik; "Ok." [analytics/udp-filters] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2516 [21:20:48] New review: Diederik; "Ok." 
[analytics/udp-filters] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2515 [21:20:48] Change merged: Diederik; [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2517 [21:20:49] Change merged: Diederik; [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2516 [21:20:49] Change merged: Diederik; [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2515 [21:21:22] New patchset: Diederik; "Rename udp.c to udp-filter.c so now the binary file name and the source filename are consistent." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2518 [21:21:25] New patchset: Diederik; "Originally, this was udp.c, renamed to increase consistency." [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2519 [21:21:54] New review: Diederik; "Ok." [analytics/udp-filters] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2519 [21:22:08] New review: Diederik; "Ok." [analytics/udp-filters] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2518 [21:22:09] Change merged: Diederik; [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2519 [21:22:09] Change merged: Diederik; [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2518 [21:29:05] !log db1001 rebooting, locked up [21:29:07] Logged the message, RobH [21:34:03] !log powercycling msw-a1-eqiad. [21:34:05] Logged the message, RobH [21:35:26] huh [21:36:00] !log powercycling msw-a2-eqiad resolves all mgmt issues in rack [21:36:02] Logged the message, RobH [21:37:09] binasher: ok, db1001 is back up, it was locked up and had mgmt switch issue [21:37:14] but its rebooted now and mgmt works [21:38:50] well, its in reboot now, my bad [21:41:21] !log cp1017 being tested for bad memory [21:41:23] Logged the message, RobH [21:45:16] RobH: what do you think about the lockup, should i reslave it or want to do a ram test? 
[21:45:33] if it happens once, it didnt really happen ;] [21:45:48] if it locks up a second time its worth pulling offline and doing hardware diagnostics [21:45:54] so i would push back to service, hyea [21:47:38] when i go to work on bad hardware, i do an RT search for all old tickets on a host [21:47:49] so if db1001 messes up in a month, i will see my old ticket on its lockup [21:48:49] anyone want to check out https://gerrit.wikimedia.org/r/#change,2512 ? it works in labs - want to make sure everything is in the right place [21:48:56] ok, sounds good [21:48:58] !log memory in cp1017 wasnt properly seated as far as i can tell, if it doesnt mess up again it should be ok. [21:49:00] Logged the message, RobH [22:02:50] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2512 [22:02:50] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2512 [22:17:04] !log fixing the labs apache2 puppet groups [22:17:05] Logged the message, Mistress of the network gear. [22:25:45] anyone know of a site that has webpages in various sizes for testing purposes ? Or a site that has a single html file that is about 1.5kbyte [22:26:28] trouble with MTU? :-( [22:27:22] testing new mtu stuff [22:27:26] :) [22:27:27] commons? [22:27:42] oh good idea, i can just scale the image [22:27:59] or you write some PHP script that generates whatever number of bytes you want [22:28:16] (still need to take care of HTTP headers though) [22:35:40] i decided to rescale a kitteh pic in commons [22:35:43] http://208.80.153.196/40px-Anteh-vandalism_kitteh_lolcat.jpg [22:35:50] world's tiniest lolcat [22:39:06] OH NO !! Leslie is pixellizing tiny cats! [22:41:44] LeslieCarr: here for you http://f.images.memegenerator.net/instances/500x/14411613.jpg [22:42:09] eeeek [22:42:20] that image is too scary [22:42:28] yeah sorry :-( [22:44:18] oh god that's horrible. [22:45:31] gosh .. 
looks like hasher ;-p [22:51:00] no cats were hurted during the photoshop session! [22:52:50] meow. [23:14:17] Ok, cleaning up and I am out of eqiad, online later [23:32:36] wow so iran is blocking 443 now [23:32:40] and 22 [23:32:51] and 993 [23:34:11] is something special going on? [23:34:37] 33rd anniversary coming up ? [23:34:51] or perhaps they looked at arab spring and decided to crack down harder
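For the MTU testing discussed around 22:25, mark's suggestion was a script that returns a body of an arbitrary byte count (LeslieCarr ended up rescaling a Commons image instead). A minimal, hypothetical sketch of such a generator — note, as the log says, the HTTP headers still add to the on-wire size, so an exact-MTU test has to account for them:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ExactSizeHandler(BaseHTTPRequestHandler):
    """Serve a body of exactly N bytes for GET /N (default 1500)."""
    def do_GET(self):
        try:
            size = int(self.path.lstrip("/"))
        except ValueError:
            size = 1500
        body = b"x" * size
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # silence per-request logging
        pass

# Demo: bind an ephemeral port and fetch a 1500-byte body.
server = HTTPServer(("127.0.0.1", 0), ExactSizeHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/1500" % server.server_address[1]
data = urllib.request.urlopen(url).read()
print(len(data))  # 1500 bytes of body; headers are extra on the wire
server.shutdown()
```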