[00:06:02] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [00:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [00:28:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:29:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.345 second response time [00:31:22] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 302 seconds [00:32:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:33:23] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 305 seconds [00:37:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [00:52:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [00:54:44] anyone in ops about? db1027 not responding to ssh, and i can't get onto the mgmt w/o password [01:01:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:02:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [01:02:50] PROBLEM - RAID on analytics1010 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:05:50] RECOVERY - RAID on analytics1010 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [01:26:08] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [01:36:37] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:39:36] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [01:48:37] PROBLEM - DPKG on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:54:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [02:01:46] !log LocalisationUpdate completed (1.22wmf14) at Sun Sep 1 02:01:46 UTC 2013 [02:01:53] Logged the message, Master [02:02:30] !log LocalisationUpdate completed (1.22wmf15) at Sun Sep 1 02:02:30 UTC 2013 [02:02:36] Logged the message, Master [02:07:18] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Sep 1 02:07:18 UTC 2013 [02:07:24] Logged the message, Master [02:13:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:14:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [02:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:23:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [02:27:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [02:38:51] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 
10 hours [02:38:51] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [02:38:51] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [02:51:31] RECOVERY - MySQL Replication Heartbeat on db1046 is OK: OK replication delay 0 seconds [02:52:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:54:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [03:13:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [03:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time [03:28:43] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [03:44:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:46:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [03:48:07] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [03:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:54:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [03:57:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:58:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output 
matched 400 - 336 bytes in 4.820 second response time [04:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:23:13] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [04:23:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [04:29:13] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [04:32:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:33:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [04:36:13] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [04:39:27] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [04:48:27] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [04:49:27] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [04:52:27] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [04:52:27] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [04:52:27] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [04:55:27] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [04:55:27] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [04:58:27] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [04:58:27] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [05:01:27] PROBLEM - Puppet freshness on ssl1004 
is CRITICAL: No successful Puppet run in the last 10 hours [05:03:27] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [05:04:27] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [05:05:28] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [05:05:28] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [05:10:34] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [05:31:54] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:34:45] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [05:51:51] PROBLEM - DPKG on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:10:14] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [06:10:44] PROBLEM - RAID on snapshot3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:11:34] RECOVERY - RAID on snapshot3 is OK: OK: no RAID installed [06:28:14] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [07:46:10] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:46:20] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [07:46:20] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [07:59:40] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:02:30] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [08:08:34] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:09:33] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [08:34:34] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:35:33] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [08:38:35] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:41:35] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [08:56:34] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:00:04] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:02:34] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [09:08:36] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:09:25] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [09:17:05] PROBLEM - Puppet freshness on db1027 is CRITICAL: No successful Puppet run in the last 10 hours [10:06:14] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [10:22:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time [11:09:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:10:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [11:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [11:27:02] PROBLEM - Puppet 
freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [11:31:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [11:35:32] PROBLEM - Disk space on wtp1017 is CRITICAL: DISK CRITICAL - free space: / 337 MB (3% inode=77%): [11:40:38] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:43:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:44:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [11:51:38] PROBLEM - Disk space on wtp1019 is CRITICAL: DISK CRITICAL - free space: / 277 MB (3% inode=77%): [11:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:54:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [11:56:48] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused [12:01:28] RECOVERY - Disk space on wtp1017 is OK: DISK OK [12:01:38] RECOVERY - Disk space on wtp1019 is OK: DISK OK [12:01:48] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [12:21:39] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:22:29] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [12:31:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:32:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [12:39:04] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [12:39:04] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [12:39:04] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [12:48:14] PROBLEM - Disk space on wtp1009 is CRITICAL: DISK CRITICAL - free space: / 276 MB (3% inode=77%): [12:48:25] PROBLEM - Disk space on wtp1020 is CRITICAL: DISK CRITICAL - free space: / 253 MB (2% inode=77%): [12:52:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:52:35] PROBLEM - Parsoid on wtp1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [12:53:25] RECOVERY - Disk space on wtp1020 is OK: DISK OK [12:54:04] PROBLEM - Disk space on wtp1010 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=77%): [12:56:04] RECOVERY - Disk space on wtp1010 is OK: DISK OK [12:56:34] PROBLEM - Parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:14] RECOVERY - Disk space on wtp1009 is OK: DISK OK [13:05:34] PROBLEM - Disk space on wtp1024 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=77%): [13:07:36] PROBLEM - Parsoid on wtp1024 is CRITICAL: Connection refused [13:08:35] RECOVERY - Parsoid on wtp1024 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time [13:08:35] RECOVERY - Disk space on wtp1024 is OK: DISK OK 
[13:11:23] (PS1) Jalexander: Change Wikivoyage Logo and favicon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82119 [13:13:26] (PS2) Jalexander: Change Wikivoyage Logo and favicon [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/82119 [13:22:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [13:29:35] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [13:33:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.356 second response time [13:39:36] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 263 MB (2% inode=77%): [13:44:33] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused [13:46:33] PROBLEM - Disk space on wtp1019 is CRITICAL: DISK CRITICAL - free space: / 243 MB (2% inode=77%): [13:49:03] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [13:50:43] PROBLEM - Parsoid on wtp1019 is CRITICAL: Connection refused [13:52:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [14:03:43] RECOVERY - Parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [14:04:34] RECOVERY - Disk space on wtp1019 is OK: DISK OK [14:09:31] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [14:09:31] RECOVERY - Disk space on wtp1018 is OK: DISK OK [14:22:31] PROBLEM - Puppetmaster HTTPS on 
stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [14:24:01] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [14:30:01] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [14:30:11] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [14:33:21] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:37:12] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [14:40:12] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [14:49:12] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [14:50:12] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [14:52:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:53:12] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [14:53:12] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [14:53:12] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [14:54:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [14:56:12] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [14:56:12] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [14:59:12] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [14:59:12] PROBLEM - Puppet freshness on ssl3001 is 
CRITICAL: No successful Puppet run in the last 10 hours [15:02:12] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [15:04:12] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [15:05:12] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [15:06:12] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [15:06:12] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [15:07:02] PROBLEM - Disk space on wtp1012 is CRITICAL: DISK CRITICAL - free space: / 76 MB (0% inode=77%): [15:10:27] PROBLEM - Parsoid on wtp1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:57] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [15:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [15:27:07] RECOVERY - Disk space on wtp1012 is OK: DISK OK [15:28:38] Ata_Zh: hello here. I'm trying to add a page into a group on https://meta.wikimedia.org/wiki/Special:AggregateGroups and see an error message " Database query error" [15:28:38] can you please tell me what's the reason? 
[15:31:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [15:52:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [16:02:33] PROBLEM - Apache HTTP on mw1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:03:33] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.184 second response time [16:05:13] PROBLEM - Apache HTTP on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:05:13] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:05:23] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:06:02] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.924 second response time [16:06:02] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.001 second response time [16:06:12] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:06:22] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:06:22] PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:06:22] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.511 second response time [16:06:22] PROBLEM - Apache HTTP on mw1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:07:12] PROBLEM - Apache HTTP on mw1064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:07:12] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [16:07:12] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:07:12] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:07:12] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:07:13] PROBLEM - Apache HTTP on mw1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:07:13] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.462 second response time [16:07:14] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [16:07:42] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:02] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.483 second response time [16:08:02] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.470 second response time [16:08:02] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.743 second response time [16:08:12] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.076 second response time [16:08:12] PROBLEM - Apache HTTP on mw1212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:42] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61829 bytes in 9.290 second response time [16:08:52] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:08:58] mark___, paravoid ^^^ [16:09:02] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.531 second response time [16:09:02] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.358 second response time [16:09:02] RECOVERY 
- Apache HTTP on mw1212 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [16:09:02] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.005 second response time [16:09:03] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.238 second response time [16:09:22] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:09:32] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:09:35] apaches are overloaded [16:09:42] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.091 second response time [16:09:59] Reedy, what's the procedure? [16:10:00] yo, ops! [16:10:10] An error has occurred while searching: HTTP request timed out. [16:10:14] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.750 second response time [16:10:22] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time [16:10:31] Most of them look to have returned... 
[16:10:52] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [16:11:12] an no one cares about pdf1 [16:11:12] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:11:12] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:11:45] screw pdf, we've got melting apaches [16:12:00] Reedy: i'm still getting those errors [16:12:02] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.057 second response time [16:12:12] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:42] PROBLEM - Apache HTTP on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:02] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time [16:13:32] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.074 second response time [16:14:02] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.065 second response time [16:14:03] MaxSem: just text our Europeans [16:14:22] PROBLEM - Apache HTTP on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:32] PROBLEM - Apache HTTP on mw1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:42] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:48] Oh noes [16:14:53] arrrg [16:15:04] can i help in any way? 
[16:15:13] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:15:13] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time [16:15:22] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.090 second response time [16:16:22] PROBLEM - Apache HTTP on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:16:31] And that's a handful of Americans done too [16:16:32] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61829 bytes in 0.290 second response time [16:17:12] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.052 second response time [16:18:02] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [16:18:21] texted Mark and Faidon [16:18:22] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:22] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:42] MaxSem: So had I when I said I'd texted our Europeans ;) [16:19:12] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.830 second response time [16:19:13] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.252 second response time [16:19:22] PROBLEM - Apache HTTP on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:40] Reedy, spam them to death until they pop up;) [16:19:46] OR DIE [16:20:13] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:20:13] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:20:13] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:20:41] I need to grab mobile 
numbers for a few more opsen from officewiki at somepoint [16:21:12] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:13] and it is always on sunday :) [16:21:22] PROBLEM - Apache HTTP on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:22] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:22] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:22] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.125 second response time [16:21:32] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:36] It's also a US Holiday weekend [16:21:39] Which really isn't helpful [16:21:42] PROBLEM - Apache HTTP on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:21:49] yeah, that too... [16:21:52] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:02] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.447 second response time [16:22:12] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.412 second response time [16:22:12] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:22] PROBLEM - Apache HTTP on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:22] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:22] PROBLEM - Apache HTTP on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:22] PROBLEM - Apache HTTP on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:24] ok, what's people doing [16:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:41] Hey Leslie [16:22:42] PROBLEM - Apache HTTP on 
mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:50] hey [16:22:52] PROBLEM - Apache HTTP on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:22:52] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:03] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [16:23:03] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.049 second response time [16:23:12] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [16:23:13] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.457 second response time [16:23:13] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.051 second response time [16:23:13] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.069 second response time [16:23:13] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [16:23:13] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.052 second response time [16:23:13] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.774 second response time [16:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [16:23:42] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.109 second response time [16:23:42] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.818 second response time [16:23:52] ... 
[16:24:02] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.949 second response time [16:24:07] tech syndrome [16:24:21] so, any other info other than the fact that the (i'm guessing api) apaches are flapping? [16:24:22] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.043 second response time [16:24:23] when the doctor arrives everything is fine [16:24:32] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [16:24:32] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.096 second response time [16:24:42] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [16:25:14] They're not actually API apaches [16:25:22] PROBLEM - Apache HTTP on mw1069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:22] PROBLEM - Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:34] my question's buried... [16:25:47] LVS HTTP IPv4 on appservers.svc.eqiad.wmnet went down for a couple of minutes according to icinga [16:26:01] matanya: Still broken? 
[16:26:11] yes Reedy [16:26:12] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.040 second response time [16:26:16] and see this: https://ganglia.wikimedia.org/latest/graph_all_periods.php?me=Wikimedia&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&g=cpu_report&z=large [16:26:22] PROBLEM - Apache HTTP on mw1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:12] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [16:27:22] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:22] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:22] PROBLEM - Apache HTTP on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:22] PROBLEM - Apache HTTP on mw1113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:52] PROBLEM - Apache HTTP on mw1100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:13] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.236 second response time [16:28:23] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.827 second response time [16:28:23] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.299 second response time [16:28:23] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.029 second response time [16:28:23] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:23] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.582 second response time [16:28:24] PROBLEM - Apache HTTP on mw1032 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [16:28:52] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [16:29:22] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.707 second response time [16:29:22] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.531 second response time [16:29:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [16:29:22] PROBLEM - Apache HTTP on mw1166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:42] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.711 second response time [16:30:13] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.298 second response time [16:30:21] does anyone remember where the udp2log files get aggregated ? [16:30:22] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:30:22] RECOVERY - Puppet freshness on cp1063 is OK: puppet ran at Sun Sep 1 16:30:20 UTC 2013 [16:30:31] LeslieCarr, fluorine [16:30:32] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:30:38] thanks MaxSem [16:31:12] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time [16:31:12] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:31:32] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.474 second response time [16:32:02] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.932 second response time [16:32:12] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:32:52] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [16:33:02] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [16:33:12] ok, so i have no idea what's going on [16:33:42] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.189 second response time [16:33:52] PROBLEM - Apache HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:52] PROBLEM - Apache HTTP on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:34:38] LeslieCarr: looking at ganglia i see a jump in many network related stuff at around 16:00 [16:34:42] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.060 second response time [16:34:42] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:01] yeah, though that makes sense if there's anything weird going on [16:35:23] PROBLEM - Apache HTTP on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:32] PROBLEM - Apache HTTP on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:32] PROBLEM - Apache HTTP on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:42] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.072 second response time [16:35:43] do the request get to the apaches normally? [16:36:05] i.e. is the problem is the frontend or the backend? 
[16:36:12] *in the [16:36:25] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.867 second response time [16:36:25] PROBLEM - Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:25] RECOVERY - Apache HTTP on mw1175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.264 second response time [16:36:25] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.734 second response time [16:36:25] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:45] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61829 bytes in 7.333 second response time [16:37:00] LeslieCarr: The LVS seems down to me [16:37:20] requests go via lvs to the varnish layer, which then will query the apaches [16:37:25] PROBLEM - Apache HTTP on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:37:25] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.468 second response time [16:37:25] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.958 second response time [16:37:45] PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:38:10] hmm, I see a few Allowed memory size of 183500800 bytes exhausted (tried to allocate 11 bytes) in /usr/local/apache/common-local/php-1.22wmf14/includes/api/ApiQueryImageInfo.php on line 463 [16:38:15] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:38:16] not very many though [16:38:34] hey [16:38:35] RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.166 second response time [16:38:37] just came on [16:38:40] hey paravoid , thanks [16:39:05] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 
0.117 second response time [16:39:06] to me, this looks like a "classic" case of the eqiad apache's being overloaded [16:39:10] but i am not sure from what [16:39:16] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:30] It looks to me like the LVS isn't working correctly [16:40:06] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [16:40:15] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:40:16] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.393 second response time [16:40:25] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:40:25] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:41:05] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.605 second response time [16:41:15] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [16:41:16] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [16:41:40] matanya: lvs looks like it is working fine, it's depooling overloaded apache's and repooling them when they are no longer overloaded [16:42:15] LeslieCarr: i can't see that from https://ganglia.wikimedia.org/latest/?c=LVS%20loadbalancers%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [16:42:25] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:42:33] matanya: look at eqiad load balancers, we're not on pmtpa for much any more [16:43:10] thanks [16:43:15] LeslieCarr: any luck? Or do you want me to call someone? 
[16:43:25] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.219 second response time [16:43:25] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:43:27] ksnider: faidon just jumped on [16:43:39] Ok, awesome. Thanks, paravoid! [16:44:16] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:44:25] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.350 second response time [16:45:15] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.681 second response time [16:46:05] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:46:15] Reedy, did we have a lot of OOMs with SVGs before? [16:46:25] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:46:32] that's irrelevant, ignore that [16:46:58] MaxSem: I don't believe we did [16:47:16] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:26] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.521 second response time [16:47:29] because ATM I see only SVG OOMs, not even parser ones [16:47:41] that's imagescalers, not the main apache pool [16:47:45] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:47:55] OOM is also normal, as we use cgroups [16:47:55] paravoid, that's API servers [16:47:55] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:01] which one? [16:48:05] PROBLEM - Apache HTTP on mw1101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:07] Hmm. A lot in ApiQueryImageInfo.php... 
[16:48:15] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.323 second response time [16:48:18] all of them from the looks of it [16:48:45] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [16:48:55] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.208 second response time [16:49:15] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:15] I'll try to disable it [16:49:35] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61829 bytes in 0.244 second response time [16:49:39] PROBLEM - Apache HTTP on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:05] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.114 second response time [16:50:25] PROBLEM - Apache HTTP on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:25] PROBLEM - Apache HTTP on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:26] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:51:05] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.014 second response time [16:51:15] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.069 second response time [16:51:15] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.065 second response time [16:51:25] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.706 second response time [16:52:00] $wgAPIPropModules['imageinfo'] = 'ApiQueryDisabled'; [16:52:31] too late [16:52:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:52:55] PROBLEM - Apache 
HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:52:55] db1040 isn't very happy [16:53:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [16:53:34] !log maxsem synchronized /php-1.22wmf14/includes/api/ApiQueryImageInfo.php 'Trying to disable this...' [16:53:35] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.133 second response time [16:53:40] Logged the message, Master [16:53:46] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.313 second response time [16:56:55] PROBLEM - Apache HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:57:05] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:57:16] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:57:25] PROBLEM - Apache HTTP on mw1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:57:35] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:57:45] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.718 second response time [16:57:55] doesn't seem to have helped [16:58:05] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.797 second response time [16:58:15] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.005 second response time [16:58:16] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.148 second response time [16:58:25] PROBLEM - Apache HTTP on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:58:25] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.097 second response time [16:59:25] PROBLEM - Apache HTTP on 
mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:59:25] PROBLEM - Apache HTTP on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:59:25] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:59:25] PROBLEM - Apache HTTP on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:59:35] PROBLEM - Apache HTTP on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:00:15] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.576 second response time [17:00:16] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.598 second response time [17:00:25] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.471 second response time [17:00:25] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.701 second response time [17:00:25] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [17:01:15] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.132 second response time [17:02:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:02:55] PROBLEM - Apache HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [17:03:55] PROBLEM - Apache HTTP on mw1100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:25] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:26] !log maxsem synchronized /php-1.22wmf14/includes/api/ApiQueryImageInfo.php 'Not the cause, restoring back' [17:04:32] Logged the message, Master [17:04:35] PROBLEM - Apache HTTP on 
mw1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:35] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:05:25] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.026 second response time [17:05:25] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.431 second response time [17:05:45] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [17:05:45] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.133 second response time [17:06:25] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.066 second response time [17:07:25] PROBLEM - Apache HTTP on mw1069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:08:15] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.088 second response time [17:08:25] PROBLEM - Apache HTTP on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:09:15] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.035 second response time [17:10:25] PROBLEM - Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:10:35] PROBLEM - Apache HTTP on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:15] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:25] PROBLEM - Apache HTTP on mw1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:11:25] PROBLEM - Apache HTTP on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:06] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:06] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.078 
second response time [17:12:15] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.093 second response time [17:12:15] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.154 second response time [17:12:25] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.358 second response time [17:12:25] PROBLEM - Apache HTTP on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:33] a lot of WikiExporter::dumpFrom slow queries [17:12:35] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.003 second response time [17:12:49] but I've killed those I think [17:12:55] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.047 second response time [17:13:25] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.872 second response time [17:13:53] PROBLEM - Apache HTTP on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:13:53] PROBLEM - Apache HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:23] PROBLEM - Apache HTTP on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:43] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.062 second response time [17:14:43] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.273 second response time [17:15:03] PROBLEM - Apache HTTP on mw1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:53] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.166 second response time [17:16:23] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.695 second response time [17:17:14] PROBLEM - Apache HTTP on mw1093 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [17:17:14] PROBLEM - Apache HTTP on mw1101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:03] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.060 second response time [17:18:03] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.713 second response time [17:18:13] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:23] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:23] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:24] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:30] why is there so many expandtemplates requests? [17:19:14] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.309 second response time [17:19:23] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:13] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:13] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.526 second response time [17:20:23] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:20:33] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:03] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.660 second response time [17:21:08] MaxSem: ? 
[17:21:13] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [17:21:13] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.094 second response time [17:21:13] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:23] PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:23] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.224 second response time [17:21:23] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.578 second response time [17:22:13] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:48] paravoid, nevermind, trying to find something suspicious in API logs [17:23:03] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.427 second response time [17:23:03] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.948 second response time [17:23:13] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.844 second response time [17:23:13] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [17:23:33] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:03] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.297 second response time [17:24:23] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:03] 
RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.501 second response time [17:26:23] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.852 second response time [17:26:23] PROBLEM - Apache HTTP on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:24] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.871 second response time [17:26:33] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:26:43] PROBLEM - Apache HTTP on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:13] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:13] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:14] PROBLEM - Apache HTTP on mw1217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:23] PROBLEM - Apache HTTP on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:23] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.002 second response time [17:27:24] PROBLEM - Apache HTTP on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:03] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.064 second response time [17:28:03] PROBLEM - Apache HTTP on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:03] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.101 second response time [17:28:03] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.804 second response time [17:28:13] PROBLEM - Apache HTTP on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:14] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 808 bytes in 0.053 second response time [17:28:23] PROBLEM - Apache HTTP on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:28:23] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.707 second response time [17:28:33] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.263 second response time [17:28:53] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 4.189 second response time [17:29:13] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.360 second response time [17:30:03] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.462 second response time [17:30:13] PROBLEM - Apache HTTP on mw1101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:13] PROBLEM - Apache HTTP on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:13] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.921 second response time [17:30:23] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:53] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:23] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.232 second response time [17:31:33] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:53] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.887 second response time [17:32:03] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.723 second response time [17:32:33] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.177 second response time [17:33:13] PROBLEM - 
Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:03] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [17:34:03] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.527 second response time [17:34:13] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:53] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:53] PROBLEM - Apache HTTP on mw1100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:34:53] PROBLEM - Apache HTTP on mw1210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:04] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.917 second response time [17:35:23] PROBLEM - Apache HTTP on mw1069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:35:43] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [17:35:43] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [17:35:53] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.936 second response time [17:37:03] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:23] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:14] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.753 second response time [17:38:53] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.063 second response time [17:40:14] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [17:40:23] PROBLEM - Apache 
HTTP on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:24] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:21] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.220 second response time [17:41:21] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.534 second response time [17:41:41] PROBLEM - Apache HTTP on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [17:42:21] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:28] ... [17:42:41] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.539 second response time [17:44:21] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.750 second response time [17:45:31] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:21] PROBLEM - Apache HTTP on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:46:31] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 9.974 second response time [17:46:31] PROBLEM - Apache HTTP on mw1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:01] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [17:47:01] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [17:47:11] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:11] PROBLEM - Apache HTTP on mw1051 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [17:47:12] PROBLEM - Apache HTTP on mw1217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:21] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.788 second response time [17:47:21] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.061 second response time [17:48:01] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 2.315 second response time [17:48:02] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.870 second response time [17:48:02] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.431 second response time [17:52:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:33] !log faidon cleared profiling data [17:52:39] Logged the message, Master [17:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [17:56:46] !log killed snapshot1004 workers, flooding dberror with wikiadmin auth failures + WikiExporter::dumpFrom slow queries (10' ago) [17:56:52] Logged the message, Master [17:57:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:54] that may be the thing that fixed this [17:58:14] https://graphite.wikimedia.org/render/?title=99%25%20latency%20en.wikipedia.org%20edits%20%28ms%29%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.edits.en_wikipedia_org.tp99,%2299%25%20edit%20latency%22%29%29,%22blue%22%29 [17:58:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [18:00:51] db1024's 56.99 % of time [18:22:37] PROBLEM - 
Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [18:40:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [18:48:22] did you freeze "melting apaches" back? ) [18:52:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:44] Looks like it has been [18:53:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [19:08:56] looks quieter. what changed?.. [19:09:48] Ata_Zh: looks like "killed snapshot1004 workers, flooding dberror with wikiadmin auth failures + WikiExporter::dumpFrom slow queries (10' ago)" [19:10:22] ok [19:18:02] PROBLEM - Puppet freshness on db1027 is CRITICAL: No successful Puppet run in the last 10 hours [19:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:39:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [19:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [19:58:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[19:59:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [20:07:09] PROBLEM - Puppet freshness on sq36 is CRITICAL: No successful Puppet run in the last 10 hours [20:14:30] Anyone around? Gerrit web interface is down [20:22:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [20:29:14] !log restarting gerrit [20:29:29] Logged the message, Master [20:30:01] Reedy: it's back [20:38:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:40:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.521 second response time [20:44:50] <^d> qchris: You around? [20:45:54] * ^d whacks gerrit with a 2x4. [20:46:26] RECOVERY - Full LVS Snapshot on db1027 is OK: OK no full LVM snapshot volumes [20:46:26] RECOVERY - MySQL Recent Restart on db1027 is OK: OK seconds since restart [20:46:26] RECOVERY - MySQL Replication Heartbeat on db1027 is OK: OK replication delay 0 seconds [20:46:26] RECOVERY - RAID on db1027 is OK: OK: State is Optimal, checked 2 logical device(s) [20:46:28] <^d> publickey denied my ass. 
[20:46:35] RECOVERY - MySQL Slave Delay on db1027 is OK: OK replication delay 0 seconds [20:46:35] RECOVERY - SSH on db1027 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:46:45] RECOVERY - Disk space on db1027 is OK: DISK OK [20:47:15] RECOVERY - Puppet freshness on db1027 is OK: puppet ran at Sun Sep 1 20:47:05 UTC 2013 [20:47:15] RECOVERY - MySQL disk space on db1027 is OK: DISK OK [20:47:15] RECOVERY - MySQL Slave Running on db1027 is OK: OK replication [20:47:16] RECOVERY - MySQL Idle Transactions on db1027 is OK: OK longest blocking idle transaction sleeps for seconds [20:50:15] RECOVERY - NTP on db1027 is OK: NTP OK: Offset -0.001009583473 secs [20:51:27] !log powercycled db1027, BUG: soft lockup - CPU#14 stuck for 22s! [xfsaild/dm-0:933] [20:51:34] Logged the message, Master [20:58:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.774 second response time [21:08:04] ^d yes. Happily munging away some cake. nom nom. [21:08:24] ughhhhhh 240 Segmentation fault (11) [21:08:26] <^d> Heh, enjoy your cake. [21:08:34] <^d> Turns out it was operator error (mine). [21:08:39] like all our apaches suddenly crashed?:( [21:08:43] 240 Segmentation fault (11) [21:08:51] <^d> The heck? [21:08:55] Ok. [21:09:08] MaxSem: what? [21:09:11] which? when? [21:09:18] in fatalmonitor [21:09:48] that means that 240 of 1000 last lines in error log were about segmentation fault [21:10:20] <^d> I can't find that in fatal.log. [21:10:45] it's in /home/wikipedia/syslog/apache.log on fenari [21:11:36] that's back from 20:47 [21:11:38] the number decreases now, apparently no new errors [21:11:45] which coincides with all the mysql connect errors to db1027 [21:12:09] which wasn't in db-eqiad.php since yesterday [21:12:10] wasn't it taken out of rotation yesterday? 
[21:12:16] yep [21:12:37] pheww [21:12:47] and still [21:12:47] Sun Sep 1 20:47:27 UTC 2013 mw1025 frwiki Error connecting to 10.64.16.16: Can't connect to MySQL server on '10.64.16.16' (111) [21:12:50] Sun Sep 1 20:47:29 UTC 2013 mw1058 ruwiki Error connecting to 10.64.16.16: Can't connect to MySQL server on '10.64.16.16' (111) [21:12:53] Sun Sep 1 20:47:32 UTC 2013 mw1002 frwiki Error connecting to 10.64.16.16: Can't connect to MySQL server on '10.64.16.16' (111) [21:12:58] (etc.) [21:14:13] btw, why memcached-serious.log is so full of entries from pmtpa? these servers should be idle now, no? [21:14:46] yes and dunno, was wondering the same myself [21:15:18] <^d> pmtpa is serious business [21:17:03] and the key names from pmtpa are rather similar to live site... [21:17:19] load balancer balances too much?:P [21:18:46] hah. enwiki:newtalk:ip:10.0.0.13 [21:18:55] it's from monitoring, apparently [21:19:35] but then it means that memcached is totally broken for pmtpa [21:22:15] RECOVERY - mysqld processes on db1027 is OK: PROCS OK: 1 process with command name mysqld [21:25:25] PROBLEM - MySQL Replication Heartbeat on db1027 is CRITICAL: CRIT replication delay 103448 seconds [21:25:35] PROBLEM - MySQL Slave Delay on db1027 is CRITICAL: CRIT replication delay 103366 seconds [21:27:55] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [21:29:20] !log restarted db1027 mysqld. 
leaving port 3306 blocked while replication recovers [21:29:25] Logged the message, Master [21:57:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:58:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.826 second response time [22:22:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:24:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [22:39:48] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [22:39:48] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [22:39:48] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [23:12:31] RECOVERY - MySQL Replication Heartbeat on db1027 is OK: OK replication delay -1 seconds [23:12:32] RECOVERY - MySQL Slave Delay on db1027 is OK: OK replication delay 0 seconds [23:13:56] springle, ^^ [23:16:03] :) [23:30:01] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [23:31:13] (PS1) Ori.livneh: New module: 'statsd' [operations/puppet] - https://gerrit.wikimedia.org/r/82201 [23:36:27] not urgent at all, but if anyone happens to be around, i'd love to get that patch in. [23:37:08] did you get the package in? :) [23:37:27] yep [23:37:47] oh? [23:37:53] there was a good (to my untrained eye) debian/ tree upstream already [23:38:10] i used it to build a package, it worked well, coren added it to reprepro [23:38:13] * YuviPanda +2s [23:38:16] (or not) [23:38:42] operations/debs/StatsD [23:38:50] hi yuvi [23:38:53] uhm, in gerrit? why? [23:39:02] hey ori-l.
[23:39:17] paravoid: because i expect we'll modify it [23:39:18] switched to Emacs from Vim earlier today, so a productivity drop for a few days. [23:39:29] (hence no updated redis vagrant patch) [23:39:32] (yet!) [23:39:44] YuviPanda, no problem [23:40:32] the package is quite good indeed [23:40:35] debian/scripts/start /usr/share/statsd/scripts [23:40:39] exec sudo -u _statsd /usr/share/statsd/debian/scripts/start [23:40:40] hah [23:40:46] that's awful though :) [23:41:14] also no logrotate [23:41:23] should have used upstart's setuid/setgid, you mean? [23:41:44] that, plus /usr/share/statsd/debian/* is awful in general [23:41:52] should have inlined the contents of start, then not ship that at all [23:42:21] yeah, makes sense. hm, if i fix that and submit a patch upstream, do you think it could be added to debian? [23:42:38] it only depends on 'nodejs', no npm nonsense [23:42:50] it'll need a maintainer [23:42:59] I'm not sure I'm up for it [23:43:03] (nodejs on precise is 0.6, eugh) [23:43:14] but we could ping those folks and offer them sponsorship [23:43:39] there's a couple more minor issues as well, d/copyright needs new format, d/compat needs to be bumped to 9 [23:43:41] yes, I have no experience with this so I just conveniently forgot "and then support that package forever" part of adding something to Debian [23:44:01] it Depends on nodejs >= 0.6, does it work with 0.10 [23:44:02] etc. [23:44:23] debian/ should be in a separate branch too [23:44:59] logrotate too [23:45:13] does it log outside of /var/log/upstart/? [23:45:22] because that gets rotated by upstart itself iirc [23:45:50] $NODE_BIN /usr/share/statsd/stats.js /etc/statsd/localConfig.js 2>&1 >> /var/log/statsd/statsd.log [23:45:53] there's at least that [23:46:11] oh, hrm. good catch, then. [23:46:43] ordered json? [23:46:44] what for?
[23:47:43] This causes problems whenever a JSON-serialized hash is included in a file template, because the variations in key order are picked up as file updates by Puppet, causing Puppet to replace the file and refresh dependent resources on every run. [23:47:46] the config format can contain nested objects, so you can't really map individual puppet parameters all that well [23:47:50] what we usually do is just sort [23:48:04] won't work for nested hashes [23:48:10] which this has [23:48:57] heh [23:49:33] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [23:49:45] so, wait [23:50:05] why aren't we running statsd on the graphite host itself? [23:50:08] i.e. professor? [23:51:19] initially because i didn't have access to it and didn't want to dump the responsibility of debugging possible issues on $random_ops_person [23:51:39] but then i figured: it needs to get decom'd anyway, since it's in tampa [23:51:46] might as well build up hafnium as a replacement [23:52:41] hrm [23:52:42] though we could run it there, either in addition to or instead of hafnium [23:53:29] okay, I'll merge this now [23:53:31] but if it works [23:53:38] and you're done with your testing [23:53:42] I'd like to move it to the graphite host [23:53:59] i have access to professor now, so wait [23:54:03] oh [23:54:05] let me just update the patch [23:54:26] there's other things that can use statsd [23:54:36] swift is one, some of the CI stuff is two [23:54:39] plus we can create more [23:54:50] yeah, i saw the swift stuff, looks like the integration is pretty nice [23:55:11] * YuviPanda googles statsd [23:55:33] ah, nice [23:55:43] oh, i know why [23:55:59] erm, nevermind, updating the patch [23:56:33] the nice thing with some of the alternative implementations was that they had the ability to push to ganglia though.. [23:57:43] !log running pt-table-sync over db1027 wikis [23:58:00] Logged the message, Master
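Note on the "ordered json" exchange (23:46 to 23:48): the concern is that a hash serialized into a file template can come out with its keys in a different order on each run, so the file looks changed even when the settings are not. Sorting only the top-level keys does not help once the config nests objects. The following is a minimal sketch of the idea in Python, not the code from the patch under review (which is a Puppet module and would be Ruby); the function name `stable_json` and the sample settings are hypothetical.

```python
import json


def stable_json(config):
    """Serialize a (possibly nested) mapping with keys sorted at every level.

    json.dumps(sort_keys=True) sorts dictionary keys recursively, so two
    mappings with equal content always produce byte-identical output.
    List order is left untouched, since it is usually meaningful.
    """
    return json.dumps(config, sort_keys=True, indent=2)


# Hypothetical statsd-style settings, written in two different key orders.
first = {"port": 8125, "graphite": {"host": "localhost", "port": 2003}}
second = {"graphite": {"port": 2003, "host": "localhost"}, "port": 8125}

# Same content gives the same rendered text, so a config-management
# tool comparing file contents would see no change between runs.
assert stable_json(first) == stable_json(second)
print(stable_json(first))
```

The design point is only that the ordering has to be applied recursively; any serializer that guarantees this would serve the same purpose.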