[00:32:45] !log on cp3002: set tcp_tw_recycle back to zero [00:32:53] Logged the message, Master [00:38:04] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [00:53:04] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [01:06:09] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [01:22:11] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [01:40:48] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 206 seconds [01:41:59] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 242 seconds [01:49:02] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 665s [01:53:41] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 14 seconds [01:54:53] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 31s [01:56:23] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 4 seconds [02:37:34] !log LocalisationUpdate completed (1.20wmf8) at Tue Aug 7 02:37:34 UTC 2012 [02:37:45] Logged the message, Master [02:51:08] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [02:51:09] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [03:08:42] RECOVERY - Puppet freshness on bayes is OK: puppet ran at Tue Aug 7 03:08:15 UTC 2012 [03:09:29] !log LocalisationUpdate completed (1.20wmf9) at Tue Aug 7 03:09:28 UTC 2012 [03:09:37] Logged the message, Master [03:12:00] RECOVERY - Puppet freshness on niobium is OK: puppet ran at Tue Aug 7 03:11:47 UTC 2012 [03:14:15] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:24] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:14:42] PROBLEM 
- Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:09] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:17] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:15:26] RECOVERY - Puppet freshness on srv190 is OK: puppet ran at Tue Aug 7 03:15:09 UTC 2012 [03:15:35] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.279 second response time [03:15:35] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.852 second response time [03:15:54] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.954 second response time [03:16:30] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.314 second response time [03:18:18] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.181 second response time [03:22:03] RECOVERY - Puppet freshness on srv238 is OK: puppet ran at Tue Aug 7 03:21:45 UTC 2012 [03:23:32] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:24:54] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.241 second response time [03:25:29] RECOVERY - Puppet freshness on mw27 is OK: puppet ran at Tue Aug 7 03:25:20 UTC 2012 [03:27:44] RECOVERY - Puppet freshness on srv242 is OK: puppet ran at Tue Aug 7 03:27:28 UTC 2012 [04:03:35] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 3577.24 ms [04:05:32] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 128.81 ms [06:34:50] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 4211.40 ms [06:36:20] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 133.67 ms [07:05:17] PROBLEM - Apache HTTP on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:05:17] PROBLEM - check_all_memcacheds on spence 
is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.22:11000 (Connection timed out) [07:07:41] PROBLEM - Memcached on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:09:38] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.546 second response time [07:10:32] RECOVERY - Memcached on srv272 is OK: TCP OK - 0.008 second response time on port 11000 [07:14:12] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [07:18:06] PROBLEM - Apache HTTP on srv272 is CRITICAL: Connection refused [07:19:36] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [08:51:50] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours [09:22:52] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [09:35:55] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [09:42:13] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 28%, RTA = 3638.43 ms [09:43:16] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 117.41 ms [09:58:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:38:56] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [10:53:56] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [11:08:18] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [11:23:37] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [12:10:38] notpeter_: hey wassup [12:35:20] just read the blogpost by CT Woo, are there any news to that? ;) [12:35:35] T3rminat0r: like what? 
[12:36:23] like news on the cut wire, reasons, or news as to the switch back to using both datacenters again [12:36:34] nvm, on the latter [12:37:16] just saw that bits is pointing back to ashburn already [12:38:14] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [12:40:21] still waiting for report from the dc, no new news. [12:40:29] when there is some I guess it will get sent around [12:40:51] not from the dc [12:40:57] dc's have nothing to do with it [12:41:07] sorry [12:41:20] from the vendor responsible for the fiber I shoudl say [12:41:22] *should [12:52:11] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [12:52:11] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [13:15:08] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:16:14] is the jobqueue always so spiky: http://gdash.wikimedia.org/dashboards/jobq/ ? [13:16:29] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.831 second response time [13:16:56] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:18:17] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [13:49:29] RECOVERY - MySQL disk space on storage3 is OK: DISK OK [15:07:26] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:43] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:43] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:43] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:43] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:02] PROBLEM - Apache HTTP on mw4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:02] PROBLEM - 
Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:02] PROBLEM - Apache HTTP on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:02] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:02] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:20] PROBLEM - Apache HTTP on mw7 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:28] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:28] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:38] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:38] PROBLEM - Apache HTTP on mw12 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:46] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:46] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:46] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:46] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:47] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:47] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:47] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:48] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:48] PROBLEM - Apache HTTP on mw56 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:49] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:56] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:56] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [15:08:56] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:56] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:56] PROBLEM - Apache HTTP on mw9 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:57] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:57] PROBLEM - Apache HTTP on mw1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:58] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:58] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:59] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:04] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:05] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:05] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:05] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:14] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:14] PROBLEM - Apache HTTP on mw11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:14] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:14] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:14] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:15] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:15] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:23] PROBLEM - Apache HTTP on mw33 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:23] PROBLEM - Apache HTTP on mw16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:23] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:40] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:40] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:46] that doesn't look good [15:09:50] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:50] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:09:58] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:00] * MatmaRex is waiting for a flood kick [15:10:08] PROBLEM - Apache HTTP on mw15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:10:26] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 6.851 seconds [15:10:27] not good [15:10:35] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49605 bytes in 2.667 seconds [15:10:44] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:02] It still seems up but sluggish as hell [15:11:46] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:46] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:56] PROBLEM - Apache HTTP on mw13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:56] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39682 bytes in 3.699 seconds [15:12:05] PROBLEM - Apache HTTP on mw3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:05] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [15:12:05] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60841 bytes in 3.068 seconds [15:12:31] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:41] PROBLEM - Apache HTTP on srv288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:12:50] PROBLEM - Apache HTTP on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:13:25] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable [15:14:01] RECOVERY - Apache HTTP on srv288 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.860 second response time [15:14:28] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable [15:14:37] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable [15:14:46] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:13] PROBLEM - Apache HTTP on srv229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:30] hi what is with servers? 
[15:15:31] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:40] RECOVERY - Apache HTTP on srv264 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.221 second response time [15:15:59] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:59] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:16] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:17] PROBLEM - Apache HTTP on srv289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:34] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:34] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:34] RECOVERY - Apache HTTP on srv229 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.187 second response time [15:16:44] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:52] PROBLEM - Apache HTTP on srv239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:52] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:38] PROBLEM - Apache HTTP on srv210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:38] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.108 second response time [15:17:46] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 61850 bytes in 9.300 seconds [15:18:43] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.662 second response time [15:18:44] PROBLEM - Apache HTTP on srv228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:19:01] PROBLEM - Apache HTTP on srv209 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:19:37] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:19:46] RECOVERY - Apache HTTP on srv289 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.058 second response time [15:19:49] what is with servers? [15:20:04] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.809 second response time [15:20:04] RECOVERY - Apache HTTP on srv239 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.247 second response time [15:20:13] PROBLEM - Apache HTTP on srv246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:13] PROBLEM - Apache HTTP on srv276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:13] PROBLEM - Apache HTTP on srv259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:13] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 60649 bytes in 5.571 seconds [15:20:23] RECOVERY - Apache HTTP on srv209 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.179 second response time [15:20:31] PROBLEM - Apache HTTP on srv205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:20:31] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49412 bytes in 2.713 seconds [15:20:32] see topic [15:21:34] RECOVERY - Apache HTTP on srv246 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.650 second response time [15:21:44] RECOVERY - Apache HTTP on srv259 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.685 second response time [15:21:44] PROBLEM - Apache HTTP on srv225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:44] PROBLEM - Apache HTTP on srv244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:52] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:01] RECOVERY - Apache HTTP on srv205 is OK: 
HTTP OK - HTTP/1.1 301 Moved Permanently - 9.501 second response time [15:23:04] RECOVERY - Apache HTTP on srv228 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.298 second response time [15:23:13] PROBLEM - Apache HTTP on srv269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:13] PROBLEM - Apache HTTP on srv274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:49] RECOVERY - Apache HTTP on srv210 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.879 second response time [15:24:01] are we still experiencing trouble derived from yesterday's issues? [15:24:25] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39689 bytes in 6.464 seconds [15:24:34] RECOVERY - Apache HTTP on srv244 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.148 second response time [15:24:44] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:44] PROBLEM - Apache HTTP on srv233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:50] cable cut and server overload are probably not related [15:24:52] PROBLEM - check_job_queue on neon is CRITICAL: (Service Check Timed Out) [15:24:53] malafaya: no [15:24:58] ok [15:25:01] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49602 bytes in 3.666 seconds [15:25:10] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:28] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39489 bytes in 5.197 seconds [15:25:55] servers aren't overloaded [15:25:55] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 2.224 seconds [15:26:04] RECOVERY - Apache HTTP on srv276 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.183 second response time [15:26:04] RECOVERY - Apache HTTP on srv233 is OK: HTTP OK - HTTP/1.1 
301 Moved Permanently - 5.113 second response time [15:26:04] RECOVERY - Apache HTTP on srv269 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.167 second response time [15:26:14] PROBLEM - Apache HTTP on srv260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:26:22] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:26:32] domas, connectivity then? [15:26:44] internal connectivity, probably [15:27:07] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39682 bytes in 8.533 seconds [15:27:14] worst of all is that ganglia is as dead [15:27:19] is whatever's going on also the reason I can't get OTRS to load? [15:27:24] Yes [15:27:33] Ganglia is fine [15:27:34] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.245 second response time [15:27:34] RECOVERY - Apache HTTP on srv274 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.806 second response time [15:27:43] RECOVERY - Apache HTTP on srv260 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.472 second response time [15:28:15] Reedy, ganglia is crawling for me [15:29:04] RECOVERY - Apache HTTP on srv225 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.265 second response time [15:29:13] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:13] PROBLEM - Apache HTTP on srv240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:22] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 61850 bytes in 9.756 seconds [15:29:40] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49412 bytes in 8.579 seconds [15:29:40] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:34] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [15:30:43] PROBLEM - Apache HTTP on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:43] PROBLEM - Apache HTTP on srv231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:44] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49598 bytes in 6.207 seconds [15:32:04] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.468 second response time [15:32:04] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49405 bytes in 2.684 seconds [15:32:04] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.426 second response time [15:32:32] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49602 bytes in 6.290 seconds [15:32:52] datacenter network toasted [15:32:58] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:58] PROBLEM - Apache HTTP on srv212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:33:34] RECOVERY - Apache HTTP on srv258 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.278 second response time [15:33:34] RECOVERY - Apache HTTP on srv231 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.686 second response time [15:33:43] PROBLEM - Apache HTTP on srv273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:33:43] PROBLEM - Apache HTTP on srv264 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:33:43] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:33:57] * Damianz brings out http://images.cheezburger.com/completestore/2011/6/13/5525cbe1-7161-4ac2-b86e-0dcac0a00eb9.jpg again [15:34:19] RECOVERY - Apache HTTP on srv212 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.723 second response time [15:34:20] "Request: GET 
http://en.wikipedia.org/w/index.php?title=Special:Log/newusers&offset=20120807151013&type=newusers&user=, from 208.80.152.81 via sq66.wikimedia.org (squid/2.7.STABLE9) to () [15:34:21] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Tue, 07 Aug 2012 15:33:39 GMT " [15:34:25] D: [15:34:28] PROBLEM - Apache HTTP on srv204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:34:55] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 2.308 seconds [15:35:05] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:35:05] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.162 second response time [15:35:13] RECOVERY - Apache HTTP on srv273 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.699 second response time [15:35:13] PROBLEM - Apache HTTP on srv280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:35:22] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:35:49] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.526 second response time [15:35:49] RECOVERY - Apache HTTP on srv204 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.433 second response time [15:35:58] PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:36:13] HTTPS services downgraded from Having Issues to DOWN [15:36:19] Hello. [15:36:25] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39686 bytes in 6.371 seconds [15:36:25] I can't log in to https://en.wikipedia.org. [15:36:30] We know [15:36:32] Is this a known issue? [15:36:34] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.537 second response time [15:36:35] Well! 
[15:36:41] Marybelle: Yes [15:36:47] http://status.wikimedia.org/ [15:36:56] That is the status of the sites [15:37:10] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:37:20] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.886 second response time [15:37:40] Yes, Marybelle - listen to Generalcamo :v [15:37:58] I like how the status link isn't in the channel topic. [15:37:59] Hey MZM. [15:38:09] And the status in the topic is "very slow". [15:38:13] RECOVERY - Apache HTTP on srv264 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.266 second response time [15:38:13] PROBLEM - Apache HTTP on srv230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:38:16] I also like how status.wm.o lists https as "unsupported." [15:38:32] There you go, Mary. [15:38:40] I frequently get errors [15:38:40] Bless you. [15:38:57] but my fellow nlwiki colleagues can get through [15:39:06] WP is slow though [15:39:07] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:39:12] Trijnstel: yes, problem is known.. [15:39:17] Trijnstel, apparently, they donated more [15:39:23] Trijnstel: You don't think everyone knows here? [15:39:32] Anyone know the bug about changing the error message to not list #wikipedia? [15:39:33] I don [15:39:36] Yes, they do know it's working funky [15:39:38] I don't know, Bsadowski1 :P [15:39:38] yeah, it's another (unrelated to yesterday) network issue, i'm on it [15:39:46] is bugzilla monitored in status.wm.o? [15:39:47] but yesterday they had a huge crash as well [15:39:52] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 49595 bytes in 8.823 seconds [15:40:03] I thought it wouldn't happen today [15:40:03] liangent: Doesn't appear so. [15:40:07] * Jaqen has a rb to do [15:40:12] Work: lol [15:40:13] Trijnstel: Shit happens. 
[15:40:17] :( [15:40:28] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39489 bytes in 3.448 seconds [15:41:04] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.479 second response time [15:41:13] PROBLEM - Apache HTTP on srv243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:58] PROBLEM - Apache HTTP on srv211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:58] PROBLEM - Apache HTTP on srv195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:05] ., [15:42:25] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:34] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:34] RECOVERY - Apache HTTP on srv243 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.538 second response time [15:42:40] lol [15:43:19] RECOVERY - Apache HTTP on srv195 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.834 second response time [15:43:24] shit happens - but every day? [15:43:43] LeastCommonAnces: Not every day. But some days. [15:43:46] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60834 bytes in 2.592 seconds [15:43:57] If you're unhappy with the service, ask for your money back. [15:44:04] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:08] After spending most of the day debugging network st00f and figuring out a /25 was set up as a /26 on the routers I can confirm "shit happens" :P [15:44:09] e.g. 
yesterday (see http://status.wikimedia.org/) [15:44:13] PROBLEM - Apache HTTP on srv232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:13] PROBLEM - Apache HTTP on srv235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:38] LeastCommonAnces: yes, yes it does [15:44:40] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:49] RECOVERY - Apache HTTP on srv211 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.294 second response time [15:45:05] I'm surprised Wikiversity has its own load balancer. [15:45:07] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:07] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:25] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39689 bytes in 3.896 seconds [15:45:34] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.801 second response time [15:46:10] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49605 bytes in 9.900 seconds [15:46:17] ciao Jaqen [15:46:28] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39682 bytes in 4.852 seconds [15:47:13] PROBLEM - Apache HTTP on srv277 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:13] PROBLEM - Apache HTTP on srv234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:14] PROBLEM - Apache HTTP on srv240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:14] PROBLEM - Apache HTTP on srv272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:22] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:23] come on! where is WMF money for servers... 
[15:47:31] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 49410 bytes in 4.140 seconds [15:47:48] mmhhhh, obviously vandals are able to edit [15:47:52] :D [15:48:16] PROBLEM - Apache HTTP on srv282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:48:25] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 7.274 seconds [15:48:34] RECOVERY - Apache HTTP on srv272 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.504 second response time [15:48:43] PROBLEM - Apache HTTP on srv275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:48:43] PROBLEM - Apache HTTP on srv238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:48:44] PROBLEM - Apache HTTP on srv270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:48:52] fyi, caused by an overloaded link in our legacy datacenter. And yes, I'd love to move datacenters... [15:48:53] the WMF is getting a new datacenter this year [15:48:59] no, we're not :( [15:49:06] we're getting a 2 rack caching center [15:49:17] Oooh, whereabouts? 
[15:49:18] * Jasper_Deng would call that a datacenter [15:49:28] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:37] RECOVERY - Apache HTTP on srv282 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.339 second response time [15:49:40] Leslie just needs some more 10gig crosslinks ;) [15:49:55] PROBLEM - Apache HTTP on srv289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:55] west coast [15:49:55] PROBLEM - BGP status on csw1-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.247, [15:50:04] RECOVERY - Apache HTTP on srv275 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.187 second response time [15:50:04] RECOVERY - Apache HTTP on srv238 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.192 second response time [15:50:04] RECOVERY - Apache HTTP on srv277 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.186 second response time [15:50:05] RECOVERY - Apache HTTP on srv270 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.188 second response time [15:50:05] RECOVERY - Apache HTTP on srv240 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.197 second response time [15:50:05] RECOVERY - Apache HTTP on srv232 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.203 second response time [15:50:05] RECOVERY - Apache HTTP on srv234 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.564 second response time [15:50:07] Coolio [15:50:08] haha, yeah but we need a 30 rack datacenter for all our infrastructure :) [15:50:24] WC is just going to be varnish [15:50:28] oh, West Coast, USA [15:50:49] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.142 second response time [15:50:56] lesliecarr: you could use virtualization [15:50:58] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39489 bytes in 7.788 seconds [15:51:00] lesliecarr: less racks needed then! 
[15:51:05] LeslieCarr: All the WMF might need would be like 15 256-core (512-logical-processor) servers from SeaMicro [15:51:07] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:51:18] 20 of them could handle Amazon's entire website [15:51:22] oh dude, let's totally go to the cloud ! [15:51:37] SeaMicro is not a webhost [15:51:40] it's a server maker [15:51:45] Cloud's never break right, that self healing "unlimited" storage coolness :) [15:52:01] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 15426 bytes in 0.469 seconds [15:52:02] Can we host wikipedia as a facebook app? [15:52:36] Though tbf, if we exported every page to plain html you could probably serve the traffic off 2 servers... damn editors. [15:52:41] why does nagios-wm consider it necessary to spam this channel every time something happens? D: [15:52:44] Brooke: did you find the bug? [15:52:46] RECOVERY - Apache HTTP on srv289 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.148 second response time [15:52:51] MatmaRex: So people know? [15:52:55] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 15487 bytes in 0.532 seconds [15:53:00] couldnt he maybe group these messages? [15:53:04] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 15480 bytes in 0.296 seconds [15:53:04] PROBLEM - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:53:10] so we get one big message instead of 10 small ones? 
[15:53:13] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 61850 bytes in 1.298 seconds [15:53:18] matmarex: feel free to improve nagios ;-) [15:53:31] RECOVERY - Apache HTTP on mw7 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.669 second response time [15:53:32] point me to the source code :P [15:53:34] Most the time they don't all "ping out" at once due to the way the check scheduling works [15:53:37] html ? plaintext! [15:53:43] They actually come in as seperate alerts :( [15:53:43] nagios-wm: is such an ass [15:53:46] Damianz: except that now they do [15:53:53] yes, but [15:53:58] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39682 bytes in 1.439 seconds [15:54:03] he could wait maybe two second after getting an alert [15:54:07] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.707 second response time [15:54:07] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.901 second response time [15:54:07] RECOVERY - Apache HTTP on mw15 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.339 second response time [15:54:07] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.110 second response time [15:54:07] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.530 second response time [15:54:08] RECOVERY - Apache HTTP on mw5 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.032 second response time [15:54:08] RECOVERY - Apache HTTP on mw3 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.973 second response time [15:54:08] LeslieCarr: I need to convince Ryan to make it more of an ass and spam labs ;) [15:54:09] RECOVERY - Apache HTTP on mw11 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.087 second response time [15:54:09] RECOVERY - Apache HTTP on mw4 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.971 second response time [15:54:09] and 
see if there appear more [15:54:10] RECOVERY - Apache HTTP on mw16 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.072 second response time [15:54:15] MatmaRex: http://wikitech.wikimedia.org/view/Ircecho [15:54:16] RECOVERY - Apache HTTP on mw12 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.725 second response time [15:54:16] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.795 second response time [15:54:16] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.719 second response time [15:54:16] RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.302 second response time [15:54:17] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.615 second response time [15:54:21] source code available there [15:54:25] RECOVERY - LVS HTTPS IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39496 bytes in 6.507 seconds [15:54:34] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.163 second response time [15:54:52] RECOVERY - Apache HTTP on mw1 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.091 second response time [15:54:52] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.098 second response time [15:54:52] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.104 second response time [15:54:52] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.622 second response time [15:54:52] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.986 second response time [15:54:53] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.120 second response time [15:54:53] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.929 second response time [15:54:54] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 
Moved Permanently - 6.440 second response time [15:54:54] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.475 second response time [15:54:57] aaaarg [15:55:01] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [15:55:09] LeslieCarr: to be more precise, this is all i need to look at? https://svn.wikimedia.org/viewvc/mediawiki/trunk/debs/ircecho/ircecho?view=markup [15:55:13] i wish there was a channel-ignore command [15:55:22] /ignore ;) [15:55:34] MatmaRex: That's the bot that reads the log and spams here [15:55:36] or /part [15:55:37] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [15:55:37] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [15:55:37] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time [15:55:37] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time [15:55:37] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [15:55:57] MatmaRex: yep [15:56:03] okay, my python's rusty, but it cant be that hard to fix [15:56:18] Want to add sending to channel based on regex too? (need it to spam labs :D) [15:56:59] hey, hey, i'm not even a python programmer [15:57:05] dont expect me to add features ;) [15:57:16] oh look [15:57:16] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 60837 bytes in 1.017 seconds [15:57:19] he just got flood kicked [15:58:10] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.990 second response time [16:15:44] okay. i hereby declare that ircecho script an ungodly abomination. 
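The grouping idea floated above (wait about two seconds after an alert, see whether more arrive, then send one big message instead of ten small ones) could look roughly like the sketch below. This is not ircecho's code; `AlertBatcher`, `add`, and `flush_if_quiet` are made-up names, and the caller passes in the current time so the logic can be followed without a real clock or IRC connection.

```python
from typing import List, Optional


class AlertBatcher:
    """Hypothetical helper: holds alert lines until the feed has been quiet for a while."""

    def __init__(self, quiet_seconds: float = 2.0) -> None:
        self.quiet_seconds = quiet_seconds
        self.pending: List[str] = []
        self.last_seen: Optional[float] = None

    def add(self, line: str, now: float) -> None:
        # Every new alert restarts the quiet timer.
        self.pending.append(line)
        self.last_seen = now

    def flush_if_quiet(self, now: float) -> Optional[str]:
        # Nothing buffered, or alerts are still arriving: send nothing yet.
        if not self.pending or self.last_seen is None:
            return None
        if now - self.last_seen < self.quiet_seconds:
            return None
        # Quiet period elapsed: join everything into one message and reset.
        message = " | ".join(self.pending)
        self.pending = []
        self.last_seen = None
        return message


batcher = AlertBatcher(quiet_seconds=2.0)
batcher.add("RECOVERY - Apache HTTP on mw1", now=0.0)
batcher.add("RECOVERY - Apache HTTP on mw57", now=0.5)
early = batcher.flush_if_quiet(now=1.0)    # still inside the quiet window
grouped = batcher.flush_if_quiet(now=3.0)  # 2.5 seconds after the last alert
print(early)
print(grouped)
```

A real bot would also have to split the joined text to stay under the IRC line-length limit, and a steady trickle of alerts would postpone the flush indefinitely unless a maximum hold time were added; both are left out here to keep the sketch short.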
[16:17:22] it can message to channel in one of two places in code, depending on whether it was given an infile [16:17:31] and these places are in two different classes [16:17:47] ;_; [17:14:29] Reedy: you around? [17:34:27] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 190 seconds [17:34:44] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 190 seconds [17:45:59] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 17 seconds [17:46:27] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 9 seconds [17:54:19] How would one go about making/getting acces to/constructing a bot on wikimedia commons? [17:56:37] zerodamage: there is a bot flag requests page somewhere [17:56:54] that one: https://commons.wikimedia.org/wiki/Commons:Bots/Requests ? [17:57:05] except that is has a yaer's worth of backlog [17:57:14] yeah, that [17:57:22] so you're gonna have to poke some crat for it, probably [17:58:32] anyone in particular? (perhaps you know someone who has a powerlevel of 9000) [17:59:54] i dont know, i rarely go on commons, sorry [17:59:55] https://commons.wikimedia.org/wiki/Commons:Bureaucrats [18:00:07] i guess anyone from this list who is active [18:01:01] robla: am now [18:01:16] thanls :) [18:01:21] thanks* [18:02:24] also, i need someone who cares and knows at least a little about nagios-wm ot. [18:02:26] bot* [18:02:36] volunteers? :) [18:19:16] zerodamage: hrm, try in #wikimedia-labs perhaps ? 
[18:19:28] if there's a textfile, irchecho should be able to do it [18:52:35] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours [19:23:30] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [19:36:33] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [19:59:30] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:15:54] !log reedy synchronized php-1.20wmf9/includes/api/ApiQuery.php [20:16:03] Logged the message, Master [20:39:32] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [20:51:05] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: host 91.198.174.244, sessions up: 2, down: 2, shutdown: 0 - Peering with AS64600 not established - Peering with AS64600 not established [20:51:14] RECOVERY - BGP status on csw1-esams is OK: OK: host 91.198.174.247, sessions up: 5, down: 0, shutdown: 0 [20:52:35] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0 [20:54:32] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [20:56:21] Damianz: maybe they haven't been shined yet this week? [20:58:45] * Damianz hands jeremyb the polish and points at the rack [20:59:13] * Nemo_bis takes it a runs away [20:59:17] *and [20:59:48] Nemo makes unshiny [21:00:16] yeah I don't want to be put in shadow [21:01:03] * Damianz finds some degreaser [21:09:32] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [21:24:32] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [21:27:30] mutante: sooo, testwiki is broken :) [21:27:37] lol [21:28:35] I mean wikitech [21:29:50] oh ? [21:30:02] are you getting errors ? 
[21:30:33] LeslieCarr: can I come over and humor you [21:30:54] sure for a minute [21:30:55] then i get lunch [21:32:18] AaronSchulz: it was distupgraded and also MW upgraded and update.php'd all recently i think [21:38:41] jeremyb, wasn't it rolled back? [21:39:03] yes [21:39:08] but schema updates weren't [21:39:17] which is the user_options problem [21:39:55] MaxSem: maybe i wasn't paying attention to that part [21:42:22] Nemo_bis: I didn't really search. Do you have the bug number handy? [21:43:52] Brooke: it's easy [21:44:06] What's easy? [21:44:09] Brooke: https://bugzilla.wikimedia.org/show_bug.cgi?id=16043 [21:44:28] Oh, I reported it. [21:44:30] does the office have connectivity issues? :) [21:44:30] Silly me. [21:44:33] hehe [21:44:40] maybe it's beacause I changed the topic? [21:44:47] Do I need to file a file about nagios-wm in here? [21:44:48] now it's too clear :P [21:45:02] Brooke: do you know who caused the issue? [21:45:13] I don't understand what changed, is it something in puppet? [21:45:15] The issue --> nagios-wm's return to this channel? [21:45:18] !log Updated EducationProgram schema on both enwiki and test2wiki [21:45:28] Logged the message, Master [21:45:30] Hi Reedy. [21:45:30] Brooke: yes [21:45:36] Hi [21:45:42] Brooke: Reedy says the bot should stay [21:45:45] I don't know why though [21:45:52] Bots ftw [21:46:10] To be clear, I don't mind the bots absolutely, for myself. [21:46:11] No. [21:46:49] Reedy: what made the bot come back here? [21:46:59] It's not been here for a while before August 3. [21:47:03] nfi [21:47:41] lol @Wiktionary "Expresses lack of knowledge on the subject matter" [21:48:30] Reedy: don't you know where the config resides? isn't it in puppet? [21:48:38] !log reedy synchronized php-1.20wmf9/extensions/EducationProgram/ [21:48:45] Maybe [21:48:46] does nagios-wm just write to every channel it's in? 
[21:48:47] Logged the message, Master [21:48:57] Reedy: I didn't say any change in the operations repo [21:49:20] but I'm not very good at looking [21:49:22] seemingly not in puppet [21:49:39] Though, there is commented out: [21:49:39] # $ircecho_nick = "nagios-wm" [21:49:39] # $ircecho_chans = "#wikimedia-operations,#wikimedia-tech" [21:50:22] hui [21:50:31] hmm, on spence: [21:50:32] $ircecho_nick = "nagios-wm" [21:50:32] $ircecho_chans = "#wikimedia-operations" [21:50:42] I've no idea why said bot is in here [21:51:55] sigh [21:52:05] then it should just be +q shouldn't it [21:52:08] !g I5715c5e953d084d7e5a723e097abd19a685b3804 [21:52:08] https://gerrit.wikimedia.org/r/#q,I5715c5e953d084d7e5a723e097abd19a685b3804,n,z [21:52:13] i also don't mind it [21:52:29] i just don't like when there's 2 of them running saying the same thing twice in the same channel [21:52:32] ;) [21:52:48] I prefer triplicate [21:52:52] https://bugzilla.wikimedia.org/show_bug.cgi?id=39114 [21:52:53] and maybe during an outage we turn off one of them [21:54:45] gj Brooke [21:54:49] jeremyb, Nemo_bis: Perhaps if the bot ever had anything useful to say, it could stay. [21:54:59] As it is, its messages are useless for the audience. [21:55:02] And generally useless, I've found. [21:55:09] But if someone wants it to report elsewhere, I don't care. [21:56:13] Brooke: well i'm happy enough with it here but also wonder why it's back [21:57:06] Brooke: the point is that it's already on -ops [21:57:09] IMHO [21:57:18] I think you mean -opeartions. [21:57:21] -operations, too. [21:57:42] sure [21:57:54] jeremyb: Channel forward, second script, or someone added it directly. Those were my guesses. [21:57:57] it's the shortcut for lazy people in this channel [21:57:58] In no particular order. [21:58:05] (nobody cares about #wikimedia-ops here :p) [21:58:10] Brooke: from where though? [21:58:24] Dunno! [21:58:26] Brooke: it's only one script cause it's only one nick [21:58:52] what's in -opeartions? 
[22:00:59] btw jeremyb is deemed guilty of commenting this way: 23:44 < jeremyb> hi nagios-wm! [22:01:07] 23:44 < jeremyb> we missed you [22:01:47] what day? [22:01:58] 3 min after he rejoined [22:02:04] what's the charge exactly? [22:02:11] dunno [22:02:22] wiki is not a court etc. [22:02:54] I mean, seriously? http://ta.wikibooks.org/s/p [22:03:12] We're showing such horrible URLs below page titles for everyone? [22:03:16] Or is it an allucination [22:03:42] I thought this was something only for testwiki to look different and uglier as it ust [22:03:45] *must [22:07:39] Nemo_bis: ShortUrl extension, right? [22:08:38] Brooke: yes [22:09:25] *hallucination [22:12:25] !log reedy synchronized php-1.20wmf8/extensions/E3Experiments/ [22:12:34] Logged the message, Master [22:21:18] hrm, my best is when the bot was restarted it wound up joining due to our scripts [22:30:17] Reedy: thanks! [22:33:21] I got another error [22:33:23] PHP fatal error in /usr/local/apache/common-local/php-1.20wmf9/extensions/Echo/includes/DiscussionParser.php line 127: [22:33:24] Call to a member function getText() on a non-object [22:33:34] When creating a talk page on mediawikiwiki [22:33:48] The edit went through it seems [22:34:57] yeah [22:34:59] I BZ'd that [22:35:10] I think I'm gonna disable it [22:35:42] https://bugzilla.wikimedia.org/show_bug.cgi?id=39085 [22:37:17] It's probably trying to get the parent revision ID for a page which is just being created lol [22:37:24] (and then getText on it) [22:37:47] Yes [22:37:58] We knew this last week ;) [22:38:10] Andrew tried to fix it, but had no luck [22:39:32] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [22:40:22] !log reedy synchronized wmf-config/InitialiseSettings.php 'Disabled echo due to bug 39085' [22:40:31] Logged the message, Master [22:47:21] !log kaldari synchronized wmf-config/InitialiseSettings.php 'syncing InitialiseSettings.php for Curation Toolbar' [22:47:30] Logged the message, 
Master [22:48:01] !log kaldari synchronized wmf-config/CommonSettings.php 'syncing CommonSettings.php for Curation Toolbar' [22:48:10] Logged the message, Master [22:53:29] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [22:53:29] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [23:19:29] !log catrope synchronized php-1.20wmf9/extensions/VisualEditor/ 'Update VisualEditor' [23:19:37] Logged the message, Master [23:19:55] RoanKattouw: what's new? [23:20:27] Jasper_Deng: Just a bug fix [23:20:36] RoanKattouw: which ones? [23:20:40] Links being broken [23:20:50] that one was rather annoying [23:20:55] We renamed the link types in Parsoid, so we had to update VE for that, and we missed a few spots [23:20:56] thanks [23:21:18] While also moving Parsoid out to a separate repo, and thinking we were gonna move Parsoid back to a separate box (which then turned out to be broken) [23:21:23] So lots of fun yesterday and today [23:22:40] !log catrope synchronized php-1.20wmf8/resources/mediawiki/mediawiki.js 'Deploying 6b6466f948d29520e1e3ab2592b940ce52415300' [23:22:49] Logged the message, Master