[00:02:10] PROBLEM - Puppet freshness on sq73 is CRITICAL: Puppet has not run in the last 10 hours [00:03:04] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [00:06:49] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:08:46] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.131 seconds [00:24:58] PROBLEM - Puppet freshness on cp3003 is CRITICAL: Puppet has not run in the last 10 hours [00:25:25] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:27:04] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.055 seconds [00:30:58] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours [00:32:37] nn [00:39:42] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [00:41:30] PROBLEM - Puppet freshness on lardner is CRITICAL: Puppet has not run in the last 10 hours [00:46:00] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:47:30] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:50:30] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours [00:50:30] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours [00:58:00] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.23 ms [00:58:20] getting bits.wm.o issues (504s), anyone looking at it already? [00:58:26] yes [00:58:29] looking [00:58:30] thanks [00:58:40] mark too [01:02:30] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:06:06] PROBLEM - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [01:06:15] !log powercycling arsenic [01:06:19] Logged the message, Master [01:07:09] PROBLEM - Host arsenic is DOWN: PING CRITICAL - Packet loss = 100% [01:07:51] !log powercycling niobium (both unresponsive from the load) [01:07:52] Logged the message, Master [01:08:03] RECOVERY - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3853 bytes in 0.627 seconds [01:09:24] PROBLEM - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: Connection refused [01:09:25] PROBLEM - Host niobium is DOWN: PING CRITICAL - Packet loss = 100% [01:10:18] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:10:27] RECOVERY - Host arsenic is UP: PING OK - Packet loss = 0%, RTA = 26.99 ms [01:11:12] RECOVERY - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3836 bytes in 0.055 seconds [01:11:39] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.054 seconds [01:11:48] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.053 seconds [01:11:49] RECOVERY - Host niobium is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [01:13:19] oh hey [01:13:27] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours [01:13:31] just jumped online because of the issues, looks like you have it paravoid ? [01:13:34] hey [01:13:40] were they completely unresponsive ? [01:13:47] oh, it's not search for once [01:13:52] died from the load [01:14:27] new and improved varnish bug perhaps ? 
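The recurring "Varnish HTTP bits ... CRITICAL - Socket timeout after 10 seconds" and "HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.131 seconds" lines above come from an HTTP service check that fetches the bits frontend with a hard socket timeout. A minimal Python sketch of that kind of probe is below; the hostname is a placeholder, and this only illustrates the check semantics, it is not the actual plugin behind the monitoring bots.

    #!/usr/bin/env python3
    # Minimal sketch of an HTTP check with a hard socket timeout, in the spirit
    # of the "Varnish HTTP bits" alerts above. Not the real monitoring plugin.
    import socket
    import sys
    import time
    import urllib.request

    URL = "http://niobium.example/"   # placeholder; the real checks hit the bits caches
    TIMEOUT = 10                      # the alerts above say "Socket timeout after 10 seconds"

    def check_http(url, timeout=TIMEOUT):
        start = time.time()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                body = resp.read()
            elapsed = time.time() - start
            print("HTTP OK %d - %d bytes in %.3f seconds" % (resp.status, len(body), elapsed))
            return 0   # Nagios/icinga OK
        except socket.timeout:
            print("CRITICAL - Socket timeout after %d seconds" % timeout)
            return 2   # CRITICAL
        except Exception as exc:
            print("CRITICAL - %s" % exc)
            return 2

    if __name__ == "__main__":
        sys.exit(check_http(URL))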
[01:14:36] maybe DoS [01:15:36] !log Powercycled strontium [01:15:42] Logged the message, Master [01:17:34] hrm, looks like memory usage until a few hours ago was inching up - http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Bits+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 then it goes insane [01:18:14] varnishncsa was using lots of memory too [01:20:48] RECOVERY - Varnish HTTP bits on strontium is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 0.054 seconds [01:21:24] RECOVERY - SSH on strontium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:23:56] ok, jumping back offline [01:30:25] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [01:30:52] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [01:35:14] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [01:35:49] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [01:58:37] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 187 seconds [01:59:04] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 196 seconds [02:01:28] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [02:01:28] RECOVERY - MySQL disk space on neon is OK: DISK OK [02:28:56] !log LocalisationUpdate completed (1.21wmf10) at Mon Feb 25 02:28:55 UTC 2013 [02:29:00] Logged the message, Master [02:30:17] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [02:32:32] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:36:17] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:40:47] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [02:41:32] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [02:42:17] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.24 ms [02:52:33] !log LocalisationUpdate completed (1.21wmf9) at Mon Feb 25 02:52:32 UTC 2013 [02:52:35] Logged the message, Master [02:56:05] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [02:56:14] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [03:09:35] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:11:06] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [03:11:24] RECOVERY - MySQL disk space on neon is OK: DISK OK [03:29:41] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [03:32:59] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [03:37:26] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [03:38:11] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 5 seconds [03:38:20] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [03:57:29] New patchset: Jeremyb; "annotate disabled root accounts on include" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50724 [04:03:41] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [04:03:41] PROBLEM - Puppet freshness on mw1070 is CRITICAL: Puppet has not run in the last 
10 hours [04:05:38] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (597842), Total (600662) [04:06:05] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (599265), Total (607648) [04:31:35] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [04:33:41] PROBLEM - Puppet freshness on mw1059 is CRITICAL: Puppet has not run in the last 10 hours [04:34:35] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [04:34:35] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [04:34:35] PROBLEM - Puppet freshness on mc1001 is CRITICAL: Puppet has not run in the last 10 hours [04:34:35] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [04:34:36] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [04:34:36] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [04:35:47] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [04:36:41] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [04:36:41] PROBLEM - Puppet freshness on mw1157 is CRITICAL: Puppet has not run in the last 10 hours [04:36:41] PROBLEM - Puppet freshness on srv235 is CRITICAL: Puppet has not run in the last 10 hours [04:38:38] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [04:40:47] tola is parsoid? RoanKattouw_away ^ [04:43:53] PROBLEM - mailman on sodium is CRITICAL: PROCS CRITICAL: 42 processes with args mailman [04:45:41] RECOVERY - mailman on sodium is OK: PROCS OK: 11 processes with args mailman [04:59:16] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50611 [05:03:32] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [05:06:41] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.27 ms [05:18:32] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:19:45] *now* it is search [05:23:47] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.019 second response time on port 8123 [05:38:39] (search is still broken on commons even though it says pool4 recovered) [05:38:47] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:38:52] hah [05:44:56] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [05:45:14] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.021 second response time on port 8123 [05:48:32] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.019 second response time on port 8123 [05:56:47] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:04:59] apergos: ^^ [06:06:59] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.024 second response time on port 8123 [06:14:29] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:21:59] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [06:31:26] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.022 second response time on port 8123 [06:37:08] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [06:38:01] RECOVERY - LVS Lucene 
on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [06:38:52] !log restarted lucene search on search1016 [06:38:55] Logged the message, Master [06:39:04] PROBLEM - Puppet freshness on srv253 is CRITICAL: Puppet has not run in the last 10 hours [06:51:04] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 186 seconds [06:51:25] thanks apergos [06:52:27] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 207 seconds [06:53:48] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [06:55:09] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [07:14:12] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [07:20:12] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [07:21:33] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [07:24:42] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [07:25:45] RECOVERY - MySQL disk space on neon is OK: DISK OK [07:46:45] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:47:57] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:01:24] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [08:01:33] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [08:29:18] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [08:31:51] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:32:36] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:39:12] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [08:41:36] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [08:47:59] robots.php is where? [08:48:09] I'm looking in operations/mediawiki-config.git. [08:49:23] live-1.5 is the answer. 
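The "LVS Lucene on search-pool4.svc.eqiad.wmnet ... TCP OK - 3.019 second response time on port 8123" lines in this stretch are a plain TCP connect check against the Lucene search port, and the remedy logged here was simply restarting the search service on search1016. A rough Python equivalent of that TCP probe follows; the timeout value is illustrative, not the production setting, and this is not the actual plugin configuration.

    #!/usr/bin/env python3
    # Rough sketch of a TCP connect check like the "LVS Lucene" probe above:
    # connect to the search port, time the connection, map the result to
    # Nagios-style states.
    import socket
    import sys
    import time

    HOST = "search-pool4.svc.eqiad.wmnet"   # taken from the alert text above
    PORT = 8123                             # Lucene search port seen in the alerts
    TIMEOUT = 10                            # seconds; illustrative

    def check_tcp(host, port, timeout=TIMEOUT):
        start = time.time()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                elapsed = time.time() - start
            print("TCP OK - %.3f second response time on port %d" % (elapsed, port))
            return 0   # OK
        except OSError:
            print("Connection timed out on port %d" % port)
            return 2   # CRITICAL

    if __name__ == "__main__":
        sys.exit(check_tcp(HOST, PORT))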
[09:01:06] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 217 seconds [09:01:06] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 217 seconds [09:10:06] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:11:00] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:11:54] RECOVERY - Puppet freshness on amssq37 is OK: puppet ran at Mon Feb 25 09:11:33 UTC 2013 [09:13:33] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [09:14:45] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.054 seconds [09:22:42] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [09:22:42] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [09:25:17] RECOVERY - Puppet freshness on sq73 is OK: puppet ran at Mon Feb 25 09:25:02 UTC 2013 [09:36:14] RECOVERY - Puppet freshness on lardner is OK: puppet ran at Mon Feb 25 09:36:06 UTC 2013 [10:12:05] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:12:50] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:19:53] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.054 seconds [10:20:11] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:26:11] PROBLEM - Puppet freshness on cp3003 is CRITICAL: Puppet has not run in the last 10 hours [10:33:26] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours [10:45:30] New patchset: Raimond Spekking; "Add a comment/cross reference to WikimediaMessages.i18n.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50740 [10:51:44] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours [10:51:44] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours [11:13:38] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 185 seconds [11:14:41] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours [11:25:17] New patchset: Dereckson; "(bug 45333) Namespace configuration for uk.wikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50742 [11:25:56] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:14] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 187 seconds [11:26:50] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 196 seconds [11:31:22] New patchset: Dereckson; "(bug 45079) Add P: as alias for Property namespace on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50709 [11:33:18] New review: Dereckson; "Next time, please follow the correct comment case ("Bug" and not "bug") to standardize a little the ..." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/50709 [11:45:33] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.33 ms [12:22:55] New patchset: Matthias Mullie; "cleanup AFTv5 config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50744 [12:23:21] New review: Matthias Mullie; "Do not merge before https://gerrit.wikimedia.org/r/#/c/50372/ is merged" [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/50744 [12:25:22] New patchset: ArielGlenn; "minor fixes:" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50745 [12:29:13] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50745 [12:54:15] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:42] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:33] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [12:58:09] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:01:27] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [13:07:09] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [13:07:54] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [13:11:30] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 184 seconds [13:13:01] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 196 seconds [13:16:36] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [13:16:54] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [13:26:48] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:06] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:51] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:31:00] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:12] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:57] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [13:34:54] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [13:37:27] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:37:45] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [13:38:12] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.057 seconds [13:38:30] RECOVERY - MySQL disk space on neon is OK: DISK OK [14:05:19] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [14:05:20] PROBLEM - Puppet freshness on mw1070 is CRITICAL: Puppet has not run in the last 10 hours [14:19:21] Susan: whatchya doing with robots.php? 
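The paired "MySQL Replication Heartbeat" / "MySQL Slave Delay" alerts above both reduce to the same question: how many seconds behind the master is this slave. The sketch below shows the heartbeat-style calculation and the threshold logic, assuming a pt-heartbeat-like table whose newest row carries a timestamp written on the master; the actual Wikimedia checks may compute the delay differently (for example from SHOW SLAVE STATUS), so treat this purely as an illustration.

    #!/usr/bin/env python3
    # Illustration of the delay/threshold logic behind alerts like
    # "CRIT replication delay 187 seconds". Assumes the newest heartbeat
    # timestamp written by the master has already been fetched; the fetch is
    # omitted to keep the sketch dependency-free.
    import datetime
    import sys

    WARN = 30    # illustrative thresholds, not the production ones
    CRIT = 180   # the alerts above fire at roughly this delay

    def replication_delay(heartbeat_ts_utc, now=None):
        """Seconds between now and the last heartbeat written on the master."""
        now = now or datetime.datetime.utcnow()
        return max(0, int((now - heartbeat_ts_utc).total_seconds()))

    def nagios_state(delay):
        if delay >= CRIT:
            return 2, "CRIT replication delay %d seconds" % delay
        if delay >= WARN:
            return 1, "WARN replication delay %d seconds" % delay
        return 0, "OK replication delay %d seconds" % delay

    if __name__ == "__main__":
        # Pretend the newest heartbeat row is 187 seconds old, as in the alert above.
        ts = datetime.datetime.utcnow() - datetime.timedelta(seconds=187)
        code, message = nagios_state(replication_delay(ts))
        print(message)
        sys.exit(code)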
[14:28:00] jeremyb_, humans.txt [14:33:22] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [14:35:19] PROBLEM - Puppet freshness on mw1059 is CRITICAL: Puppet has not run in the last 10 hours [14:36:22] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [14:36:23] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [14:36:23] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [14:36:23] PROBLEM - Puppet freshness on mc1001 is CRITICAL: Puppet has not run in the last 10 hours [14:36:23] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [14:36:23] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [14:38:19] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [14:38:19] PROBLEM - Puppet freshness on mw1157 is CRITICAL: Puppet has not run in the last 10 hours [14:38:19] PROBLEM - Puppet freshness on srv235 is CRITICAL: Puppet has not run in the last 10 hours [14:40:25] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [14:42:49] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , arwiki (67030), nlwiki (108506), Total (181046) [15:00:52] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [15:01:01] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:01:20] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [15:05:58] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:14:49] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:16:10] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [15:29:40] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [15:32:13] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [15:59:05] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (25215), Total (35832) [16:02:27] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (12603), Total (21625) [16:18:21] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 189 seconds [16:18:30] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 194 seconds [16:18:48] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 209 seconds [16:19:06] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 221 seconds [16:32:45] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 3 seconds [16:33:12] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 2 seconds [16:36:39] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:38:00] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [16:38:09] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [16:38:18] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 185 seconds [16:38:54] 
PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 194 seconds [16:39:57] PROBLEM - Puppet freshness on srv253 is CRITICAL: Puppet has not run in the last 10 hours [16:59:54] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 204 seconds [16:59:54] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 204 seconds [17:12:30] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [17:12:31] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [17:15:51] New patchset: Matthias Mullie; "cleanup AFTv5 config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50744 [17:15:57] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [18:05:13] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:05:13] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:06:07] PROBLEM - Puppet freshness on mw1029 is CRITICAL: Puppet has not run in the last 10 hours [18:11:07] reedy@mw110's password: [18:11:10] RobH: ^^ [18:11:12] !log reedy synchronized php-1.21wmf10/extensions/Wikibase [18:11:14] Logged the message, Master [18:11:17] fixitfixitfixitfixitfixitfixitfixitfixitfixitfixitfixitfixitfixitfixit [18:11:26] Reedy: did that transcode script ever finish? [18:11:36] * jeremyb_ wonders if Reedy was assisted by a clipboard [18:11:40] Reedy: yep, i need to pull out of node lists since i didnt fix it [18:11:43] will take care of it [18:11:44] AaronSchulz: I was wondering the same thing [18:11:58] AaronSchulz: No, still on commonswiki via foreachwiki [18:12:10] And it's still going [18:13:16] commonswiki: mwstore://local-multiwrite/local-thumb/d/d5/Vladimir_Putin's_press_conference_on_2012-12-20.ogv/Vladimir_Putin's_press_conference_on_2012-12-20.ogv.360p.webm => mwstore://local-multiwrite/local-transcoded/d/d5/Vladimir_Putin's_press_conference_on_2012-12-20.ogv/Vladimir_Putin's_press_conference_on_2012-12-20.ogv.360p.webm [18:14:36] New patchset: Krinkle; "Notifications for TemplateData to #mediawiki-visualeditor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50757 [18:17:58] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:21:49] Change abandoned: Ottomata; "Abandoning this, hope to soon use:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46618 [18:22:40] j^: About? [18:22:53] Does that script iterate over ALL transcodes? 
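The "commonswiki: mwstore://local-multiwrite/local-thumb/... => .../local-transcoded/..." line above shows what the long-running foreachwiki job is doing for each derivative: the same hashed directory and file name, moved from the local-thumb zone into the local-transcoded zone. Below is a small sketch of just that path mapping; the file name is hypothetical and this is not the actual maintenance script.

    #!/usr/bin/env python3
    # Sketch of the zone rewrite visible in the log line above:
    #   mwstore://local-multiwrite/local-thumb/<hash>/<file>/<derivative>
    #     => mwstore://local-multiwrite/local-transcoded/<hash>/<file>/<derivative>
    OLD_ZONE = "/local-thumb/"
    NEW_ZONE = "/local-transcoded/"

    def transcoded_path(src):
        """Map a local-thumb transcode path to its local-transcoded location."""
        if OLD_ZONE not in src:
            raise ValueError("not a local-thumb path: %r" % src)
        return src.replace(OLD_ZONE, NEW_ZONE, 1)

    if __name__ == "__main__":
        # Hypothetical example in the same shape as the Commons entry above.
        src = ("mwstore://local-multiwrite/local-thumb/a/ab/Example_video.ogv/"
               "Example_video.ogv.360p.webm")
        print(src, "=>", transcoded_path(src))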
[18:24:17] Reedy: it should [18:25:28] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.469 seconds [18:25:31] AaronSchulz: was wondering if it might be a good transition workaround to check if a /transcoded/ url fails with 404 if the url with /thumb/ would work and return that [18:26:25] I'm wondering how far through the queue it is [18:29:48] if swift would support some kind of 'find' or 'ls' might be possible to find out [18:31:01] !log Copying all captchas into ceph [18:31:01] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [18:31:03] Logged the message, Master [18:32:48] New patchset: Ryan Lane; "Move xml and image backups into public location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50759 [18:37:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50759 [18:42:30] New patchset: Ryan Lane; "Fix variable inclusion in puppet reactor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50761 [18:43:47] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50740 [18:44:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50761 [18:45:53] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Mon Feb 25 18:45:41 UTC 2013 [18:45:57] I take it the meeting that moved from 11 to 10:30 has been moved back to 11? [18:46:15] I think it has, yeah [18:49:25] !log reedy synchronized php-1.21wmf10/extensions/WikimediaMessages [18:49:27] Logged the message, Master [18:49:37] New patchset: RobH; "formatting fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50763 [18:50:33] !log deleting foundation-news-l (deprecated, replaced by WikimediaAnnounce-l) [18:50:34] Logged the message, Master [18:51:38] Change abandoned: RobH; "bleh, typo in my typo fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50763 [18:54:04] !log wikivoyager.de and wikivoyager.org transferred and both redirect to wikivoyage.org [18:54:05] Logged the message, Master [19:07:01] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf10 [19:07:02] Logged the message, Master [19:07:21] jeremyb_: https://bugzilla.wikimedia.org/show_bug.cgi?id=45347 [19:08:31] Can someone fix the replication of the SAL to twitter/identi.ca? Seems to have broken at the end of January https://twitter.com/wikimediatech [19:09:18] Susan: yeah, MaxSem answered :) [19:09:23] New patchset: Reedy; "enwiki to 1.21wmf10" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50765 [19:10:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50765 [19:11:18] Reedy: Does anyone care? [19:11:28] Also, file a bug. [19:11:29] Reedy: not sure we have those logins [19:11:29] I do, that's why I said it ;) [19:11:49] <^demon> Wow, I just read bug 45347. [19:11:55] <^demon> humans.txt seems totally pointless. [19:11:59] mutante: /h/w/c/docs? [19:12:10] Surely the script has the login. [19:12:19] There's probably an OAuth file somewhere. [19:12:21] Reedy: just checked.. dont see it [19:12:30] wikitech vm? [19:12:57] New patchset: Ryan Lane; "Make images and files labsconsole backup chdir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50766 [19:13:14] ^demon: It's obviously silly. [19:13:51] <^demon> Obviously. 
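The transition workaround floated above — if a /transcoded/ URL answers 404, check whether the same path under /thumb/ still works and serve that instead — is easy to sketch. The Python below uses placeholder URLs and only illustrates the idea as stated in the channel; it is not how the production 404 handler or the Swift/Ceph rewrite rules are actually implemented.

    #!/usr/bin/env python3
    # Sketch of the fallback discussed above: prefer the new /transcoded/
    # location, fall back to the old /thumb/ location when it 404s.
    import urllib.error
    import urllib.request

    def url_exists(url):
        """HEAD the URL and report whether it answers with a 2xx."""
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return 200 <= resp.status < 300
        except urllib.error.URLError:
            return False   # 404, other HTTP errors, or the host being unreachable

    def resolve(transcoded_url):
        if url_exists(transcoded_url):
            return transcoded_url
        fallback = transcoded_url.replace("/transcoded/", "/thumb/", 1)
        return fallback if url_exists(fallback) else None

    if __name__ == "__main__":
        # Placeholder URL shape; real upload URLs differ.
        print(resolve("https://upload.example.org/commons/transcoded/a/ab/"
                      "Example.ogv/Example.ogv.360p.webm"))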
[19:14:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50766 [19:14:10] Reedy: checking later. in meeting [19:14:12] Re-wontfixed. Oh well. [19:14:16] heh, I'll file an RT ticket [19:14:25] There's no reason to use RT. [19:15:47] Reedy: thank you [19:16:15] Susan: Ops will pick it up and actually deal with it ;) [19:19:03] Rage. [19:32:47] !log reedy Started syncing Wikimedia installation... : Rebuilding message cache to update WikimediaMessages [19:32:49] Logged the message, Master [19:36:53] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:11] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:02] New patchset: Lcarr; "fixing ganglios on icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50772 [19:40:24] !log powercycling frozen niobium [19:40:26] Logged the message, Master [19:42:55] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50772 [19:44:23] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.054 seconds [19:46:39] PROBLEM - NTP on niobium is CRITICAL: NTP CRITICAL: Offset unknown [19:47:06] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:50:13] !log reedy Finished syncing Wikimedia installation... : Rebuilding message cache to update WikimediaMessages [19:50:15] RECOVERY - NTP on niobium is OK: NTP OK: Offset 0.004889369011 secs [19:50:15] Logged the message, Master [19:50:27] holy scap [19:50:30] 18 minutes, nice [19:55:58] New patchset: Ottomata; "Adding puppet-merge for sockpuppet puppet merges." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50452 [19:56:09] New patchset: Reedy; "(bug 45083) Enable AbuseFilter IRC notifications on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49704 [19:56:14] New review: Reedy; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49704 [19:56:21] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49704 [19:56:53] New patchset: Reedy; "checkoutMediaWiki: Remove redundant argument to git-submodule." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50608 [19:57:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50608 [19:57:45] New patchset: Reedy; "(bug 44604) Enable PostEdit on ur.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49378 [19:57:51] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49378 [19:58:17] New patchset: Reedy; "(bug 44796) Updating logo for Telugu Wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49180 [19:58:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49180 [19:59:38] New patchset: Reedy; "(bug 45233) Groups permissions on pt.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50181 [19:59:43] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50181 [20:00:24] New patchset: Reedy; "Remove document roots for deleted wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50344 [20:00:32] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50344 [20:01:32] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [20:01:34] New patchset: Reedy; "(bug 45113) Set cswiktionary favicon to the same as enwiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49681 [20:01:41] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49681 [20:02:13] New patchset: Reedy; "Clean up common/README." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50612 [20:02:20] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50612 [20:03:20] New patchset: Reedy; "(bug 44587) Trwiki FlaggedRevs autopromotion config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49685 [20:03:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49685 [20:03:52] New patchset: Reedy; "(bug 45079) Add P: as alias for Property namespace on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50709 [20:03:57] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50709 [20:04:21] New patchset: Reedy; "(bug 45333) Namespace configuration for uk.wikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50742 [20:04:38] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50742 [20:04:55] New patchset: Reedy; "(bug 45205) Namespace configuration for hu.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50159 [20:06:42] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 92.27 ms [20:08:28] PROBLEM - Puppet freshness on srv253 is CRITICAL: Puppet has not run in the last 10 hours [20:09:48] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Mon Feb 25 20:09:39 UTC 2013 [20:14:42] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 181 seconds [20:15:18] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 186 seconds [20:15:38] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 193 seconds [20:15:45] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 193 
seconds [20:20:28] PROBLEM - Puppet freshness on cp3003 is CRITICAL: Puppet has not run in the last 10 hours [20:21:21] Reedy: you didn't sync the config, did you? [20:22:11] New patchset: Pyoungmeister; "improving tone of nagios alerts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50778 [20:25:28] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours [20:25:37] Nemo_bis: No [20:25:39] Lunch time [20:27:26] !log reedy synchronized wmf-config/ [20:27:27] PROBLEM - Puppet freshness on cp3003 is CRITICAL: Puppet has not run in the last 10 hours [20:27:27] Logged the message, Master [20:28:26] !log reedy synchronized docroot [20:28:27] Logged the message, Master [20:31:40] PROBLEM - Host cp3003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [20:34:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50159 [20:34:30] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours [20:36:48] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 92.50 ms [20:45:28] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours [20:45:28] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours [20:52:30] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours [20:52:30] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours [20:54:08] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [20:59:18] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 92.28 ms [21:08:08] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours [21:09:36] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [21:15:27] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours [21:18:57] !log reedy synchronized wmf-config/InitialiseSettings.php [21:18:59] Logged the message, Master [21:24:08] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [21:29:16] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 92.16 ms [21:29:35] RECOVERY - Host cp3003 is UP: PING WARNING - Packet loss = 80%, RTA = 118.32 ms [21:31:23] PROBLEM - Puppet freshness on snapshot4 is CRITICAL: Puppet has not run in the last 10 hours [21:35:43] Reedy: is there any clue on how s3 is bearing the querypages updates? [21:35:51] In what sense? [21:36:03] The pmtpa mysql servers are idle, so we don't care a great deal [21:36:34] Reedy: no way to measure the effect then? [21:36:41] Look at their load I guess [21:36:42] and why not merge the other cronjob then [21:37:02] well but it was mixed with the eqiad migration, let's check [21:38:08] RECOVERY - Puppet freshness on snapshot4 is OK: puppet ran at Mon Feb 25 21:37:39 UTC 2013 [21:39:20] load seems to be consistently non-existent [21:43:39] RobH: Hey. Have we any "spare" misc servers in eqiad? gallium is getting a bit overloaded running jobs at time [21:44:18] I know it needs an RT ticket trail etc for actually getting one [21:48:26] Reedy: i have other misc servers [21:48:37] do you mean you need an additional one or a faster replacement? [21:53:10] <^demon> RobH: Either? If we had an additional one, we could set it up as a slave to gallium to offload the jobs. 
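For the "is there any clue on how s3 is bearing the querypages updates — look at their load I guess" exchange above, the quick measurement being eyeballed in Ganglia is just load average, ideally normalised by core count. A trivial local sketch, run on the host itself; the threshold is illustrative only.

    #!/usr/bin/env python3
    # Tiny sketch of the "just look at their load" measurement discussed above:
    # 1/5/15-minute load averages normalised by core count, so 1.0 ~= fully busy.
    import os

    def normalised_load():
        cores = os.cpu_count() or 1
        return tuple(avg / cores for avg in os.getloadavg())

    if __name__ == "__main__":
        one, five, fifteen = normalised_load()
        print("load/core 1m=%.2f 5m=%.2f 15m=%.2f" % (one, five, fifteen))
        if fifteen < 0.1:
            print("consistently non-existent")   # the observation made above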
[21:53:16] <^demon> But that's a hashar question :) [21:53:53] before involving ops, I would like to find out the culprit in our current setup [21:54:05] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [21:54:13] I looked earlier, it seems most changes are waiting for parser tests which are a bit long [21:54:46] ^demon: That's sort of what I was thinking [21:54:57] It's a quad core cpu, 8 threads [21:55:29] <^demon> hashar: Well, parser tests are gonna be slow on most setups :\ [21:56:02] Theirs a tmpfs in ram now, right? [21:56:13] the sqlite are in tmpfs yeah [21:56:15] but still [21:56:16] SSD might help somewhat, but we'll just hit more bottlenecks [21:56:26] the whole parser tests suite is a mess [21:56:37] mosty caused by all the overhead in our PHPUnit integration [21:56:53] I will eventually drop it in favor of the good old parserTests.php [21:57:01] just need to have it output some JUnit XML [21:57:09] the puppet parser check takes forever [21:57:12] (just saying) [21:57:16] ;] [21:57:25] hashar, maybe, disable the ParserTest suite and run them using parserTests.php? [21:57:34] should be helluva faster [21:58:09] MaxSem: yeah what I said :-] [21:58:15] still need to have it report properly in Jenkins [21:58:19] which need junit output [21:58:26] I got a draft somewhere [21:58:30] New review: Dzahn; "it already redirects." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [21:58:34] though reimplementing JUnit support in PHP is overkill [21:59:44] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds [22:00:38] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [22:01:32] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 5 seconds [22:02:27] !log Ran namespaceDupes.php on huwikt [22:02:46] Logged the message, Master [22:09:40] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 199 seconds [22:09:48] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds [22:09:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 204 seconds [22:10:10] PROBLEM - Puppet freshness on snapshot4 is CRITICAL: Puppet has not run in the last 10 hours [22:10:33] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 215 seconds [22:11:40] cmjohnson1 & sbernardin1 [22:11:49] In racktables, please do dates in format of yyyy-mm-dd [22:12:04] or else it sorts funky in some outputs (plus should be standardized) [22:12:08] OK [22:12:15] I am having to audit and add invoice data for every single item [22:12:20] and i am noticing everyone doing it differently, heh. [22:12:29] racktables is too stupid to force a format [22:12:40] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:12:50] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:13:23] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:14:08] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:14:21] robh: k...as we audit racktables for missing rt#'s etc...we'll just fix [22:15:26] cmjohnson1: Im doing that now. [22:15:38] I have a stack of invoices for the past 6 months it seems of orders [22:15:44] that i have to match up in racktables iwth the asset tag info [22:15:57] that sounds like real fun! 
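RobH's request above — enter racktables dates as yyyy-mm-dd "or else it sorts funky in some outputs" — works because ISO-style date strings sort chronologically even under plain string sorting, which is all a free-text field gets. A short Python illustration with made-up dates:

    #!/usr/bin/env python3
    # Why yyyy-mm-dd is asked for above: ISO-style date strings sort correctly
    # as plain strings, while mm/dd/yy-style entries do not. Dates are made up.
    iso = ["2012-11-30", "2013-01-05", "2013-02-25"]
    sloppy = ["11/30/12", "01/05/13", "2/25/13"]

    print(sorted(sloppy))   # ['01/05/13', '11/30/12', '2/25/13']  -- not chronological
    print(sorted(iso))      # ['2012-11-30', '2013-01-05', '2013-02-25']  -- chronological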
[22:16:06] * RobH is also correcting wmf to WMF to match asset tags, cuz thats what OCD requires since he is editing every server anyhow ;] [22:16:14] cmjohnson1: Im thinking aobut printing them out and handing half to you [22:16:42] ok...well you know where to find me [22:16:47] !log restarting opendj/pdns on virt0 [22:16:49] Logged the message, Master [22:17:08] RECOVERY - MySQL Slave Running on db36 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:17:20] RECOVERY - MySQL Slave Running on db36 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:18:11] RECOVERY - MySQL Slave Running on db64 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:18:11] RECOVERY - MySQL Slave Running on db64 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:18:50] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 176865 seconds [22:20:20] PROBLEM - MySQL Slave Delay on db64 is CRITICAL: CRIT replication delay 228013 seconds [22:21:35] !seen werdna [22:22:05] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 176783 seconds [22:22:59] PROBLEM - MySQL Slave Delay on db64 is CRITICAL: CRIT replication delay 226661 seconds [22:24:31] is there a way to figure out what process just wrote to a file ? [22:24:37] fuser only has what file is open [22:25:36] ah lsof did that [22:25:37] woot [22:27:08] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 181 seconds [22:27:35] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 191 seconds [22:27:40] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 197 seconds [22:27:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 201 seconds [22:30:11] except that's just something parsing it, not writing to it [22:34:41] must be something running as user nagios [22:36:10] RECOVERY - Frontend Squid HTTP on sq41 is OK: HTTP OK: HTTP/1.0 200 OK - 533 bytes in 0.056 second response time [22:36:19] RECOVERY - Frontend Squid HTTP on sq41 is OK: HTTP OK HTTP/1.0 200 OK - 632 bytes in 0.011 seconds [22:37:33] LeslieCarr: i guess its both ganglia_parse. also the writing. because it says in the script that it also "Separates metrics into one file per host for easier processing by nagios " [22:37:40] ganglia_parser [22:38:25] dataDir = '/var/lib/ganglia/xmlcache' [22:38:25] logDir = '/var/log/ganglia' [22:38:31] New patchset: Pyoungmeister; "dev/nulling some stdout from a couple of noisy crons" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50825 [22:39:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50825 [22:41:39] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:41:40] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:42:55] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:42:55] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:44:54] New patchset: Ottomata; "Not running decom_servers.sh cron on virt0 or virt1000" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50826 [22:45:10] Ryan_Lane^ checky? 
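As the fuser/lsof exchange above notes, an open file handle only tells you who is reading or holding the file, not who last wrote it; the writer here is guessed to be ganglia_parser, which per its own docstring "separates metrics into one file per host for easier processing by nagios", using the dataDir/logDir settings pasted above. A stripped-down sketch of that split-per-host idea follows, assuming the usual gmetad XML layout of HOST elements containing METRIC elements; it is not the real ganglios code.

    #!/usr/bin/env python3
    # Sketch of what the ganglia_parser docstring above describes: read a gmetad
    # XML dump and write one small metrics file per host, so a Nagios-side check
    # only has to open its own host's file. Assumes <HOST NAME=...> elements with
    # <METRIC NAME=... VAL=...> children, the typical gmetad output shape.
    import os
    import xml.etree.ElementTree as ET

    DATA_DIR = "/var/lib/ganglia/xmlcache"           # dataDir from the snippet above
    XML_DUMP = os.path.join(DATA_DIR, "gmetad.xml")  # hypothetical dump file name

    def split_per_host(xml_path, out_dir=DATA_DIR):
        tree = ET.parse(xml_path)
        for host in tree.iter("HOST"):
            hostname = host.get("NAME")
            lines = ["%s %s" % (m.get("NAME"), m.get("VAL")) for m in host.iter("METRIC")]
            with open(os.path.join(out_dir, hostname), "w") as fh:
                fh.write("\n".join(lines) + "\n")

    if __name__ == "__main__":
        split_per_host(XML_DUMP)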
[22:45:59] RECOVERY - Backend Squid HTTP on sq41 is OK: HTTP OK: Status line output matched 200 - 487 bytes in 0.061 second response time [22:46:40] RECOVERY - Backend Squid HTTP on sq41 is OK: HTTP OK HTTP/1.0 200 OK - 494 bytes in 0.009 seconds [22:47:22] MaxSem: check_solr -r is still buggy. [22:47:46] paravoid, what's the error message? [22:47:50] New patchset: Lcarr; "trying to pin via puppet ganglios to 1.2 (known working on spence)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50827 [22:48:02] same as before [22:48:18] paravoid, and is it getting run with -r for vanadium [22:48:19] ? [22:48:24] yes. [22:48:38] can't unpack 0 values or whatever [22:48:43] on the (e, ) = line [22:49:29] mutante: reprepro ls ganglios [22:49:58] AaronSchulz: ping? [22:50:10] AaronSchulz: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (1338353), metawiki (562173), Total (1901646) [22:50:24] I'm guessing this isn't normal? :) [22:50:36] you guys fixed the check the other day, right? [22:52:51] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50826 [22:55:58] paravoid, I'm confused: it shouldn't be run with -r on vanadium at all: if ($replication_master) { [22:55:59] $check_command = "check_solr" [22:56:08] fuck [22:57:12] don't don't care I just want the (null) alert gone [22:57:19] don't know* [22:57:32] * MaxSem faceplams [22:57:41] New patchset: MaxSem; "Fix replication condition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50829 [22:58:06] the check needs fixing too [22:58:13] it backtraces now [22:58:22] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [22:59:16] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 2 seconds [22:59:25] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [22:59:29] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds [22:59:39] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 4 seconds [23:00:32] if this stupid packetloss alert doesn't go away on icinga now …. i burn down neon [23:00:55] LeslieCarr: neon nrpe has been flapping a lot [23:01:25] grrr [23:01:36] notpeter, when did the cronspam appear? [23:01:42] yay [23:01:44] alert cleared [23:01:55] hurdle overcome! [23:02:02] now we can kill spence :) [23:02:30] see above [23:02:36] except nrpe [23:02:59] you don't happen to have an idea why nrpe is flapping do you ? [23:02:59] also, clientbucket keeps getting filled up [23:03:09] because of naggen replacing the config all the time [23:03:14] that probably needs a backup => false [23:03:24] no idea about nrpe, I've just been seeing the alerts [23:04:44] New review: Faidon; "Use a conditional, e.g. 10.1 http://docs.puppetlabs.com/guides/style_guide.html" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/50829 [23:05:18] paravoid: where would i put the backup => false ? [23:05:43] under the file level of each of the generated files ? [23:05:49] Jeff_Green, I'm okay moving access request tickets from ops-requests to access-requests right? [23:06:36] Change abandoned: Lcarr; "no longer needed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50827 [23:07:27] Thehelpfulone: /me consults other ops folks... 
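The check_solr backtrace described above — "can't unpack 0 values ... on the (e, ) = line" — is the classic single-element tuple-unpacking pitfall: (e,) = seq raises ValueError as soon as seq is empty, which is what you would expect when -r is passed to a node that has no replication status to report. A tiny illustration of the failure and the usual defensive rewrite; this is not the actual plugin code.

    #!/usr/bin/env python3
    # The failure mode described above, plus a tolerant alternative.

    def fragile(values):
        (e,) = values   # ValueError if values is empty (or has more than one element)
        return e

    def tolerant(values):
        return values[0] if values else None   # degrade gracefully instead of backtracing

    if __name__ == "__main__":
        print(tolerant(["replication ok"]))   # -> replication ok
        print(tolerant([]))                   # -> None
        try:
            fragile([])
        except ValueError as exc:
            # Python 2 words this "need more than 0 values to unpack", matching
            # the "can't unpack 0 values" paraphrase above.
            print("ValueError:", exc)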
[23:07:29] MaxSem: february 3rd [23:07:38] LeslieCarr: yes [23:07:43] or maybe later, tbh, if it was baleeted out of my trash [23:07:47] s/later/earlier/ [23:07:52] heh sure [23:08:04] I think we're still deprecating the access request tag so it should be okay [23:08:06] notpeter, was it regular? [23:08:25] New patchset: Lcarr; "setting backup=> false on these large often changing configurations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50832 [23:08:28] paravoid: ^^ mind a quick review ? [23:08:41] MaxSem: no, every couple of days [23:08:53] mhm [23:09:01] 2/3, 2/14, 2/22, 2/24 [23:09:31] Thehelpfulone: afaik, that's fine [23:09:37] New review: Faidon; "That was fast, thanks :)" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50832 [23:09:40] i did the two tickets I had already taken [23:09:52] ah ok [23:10:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50832 [23:13:07] notpeter, can I get solr1001 error logs for one of the days there was this error? [23:13:41] MaxSem: sure! [23:15:55] RECOVERY - MySQL Slave Running on db35 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:16:14] RECOVERY - MySQL Slave Running on db35 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:17:57] Jeff_Green: http://wikitech.wikimedia.org/view/Setting_up_a_MySQL_replica [23:18:24] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 97876 seconds [23:19:23] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 96963 seconds [23:20:35] PROBLEM - Host ssl3002 is DOWN: CRITICAL - Time to live exceeded (91.198.174.103) [23:21:10] RECOVERY - MySQL Slave Running on db39 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:21:14] RECOVERY - MySQL Slave Running on db39 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:21:28] RECOVERY - Host ssl3002 is UP: PING OK - Packet loss = 0%, RTA = 121.97 ms [23:23:16] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: CRIT replication delay 173694 seconds [23:23:43] PROBLEM - Host foundation-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:23:44] PROBLEM - Host amssq48 is DOWN: CRITICAL - Time to live exceeded (91.198.174.58) [23:23:44] PROBLEM - Host amssq55 is DOWN: CRITICAL - Time to live exceeded (91.198.174.65) [23:23:44] PROBLEM - Host amssq52 is DOWN: CRITICAL - Time to live exceeded (91.198.174.62) [23:23:44] PROBLEM - Host amssq57 is DOWN: CRITICAL - Time to live exceeded (91.198.174.67) [23:23:44] PROBLEM - Host amssq60 is DOWN: CRITICAL - Time to live exceeded (91.198.174.70) [23:23:45] PROBLEM - Host amssq58 is DOWN: CRITICAL - Time to live exceeded (91.198.174.68) [23:23:45] PROBLEM - Host amssq54 is DOWN: CRITICAL - Time to live exceeded (91.198.174.64) [23:23:46] PROBLEM - Host amssq44 is DOWN: CRITICAL - Time to live exceeded (91.198.174.54) [23:23:46] PROBLEM - Host amssq47 is DOWN: CRITICAL - Time to live exceeded (91.198.174.57) [23:23:47] PROBLEM - Host amssq49 is DOWN: CRITICAL - Time to live exceeded (91.198.174.59) [23:23:47] PROBLEM - Host amssq56 is DOWN: CRITICAL - Time to live exceeded (91.198.174.66) [23:23:48] PROBLEM - Host amssq61 is DOWN: CRITICAL - Time to live exceeded (91.198.174.71) [23:23:52] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100% [23:23:53] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [23:23:53] PROBLEM - Host knsq18 is 
DOWN: PING CRITICAL - Packet loss = 100% [23:24:10] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:11] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:11] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:11] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:18] so i can totally ignore that page right? [23:24:24] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:24] PROBLEM - Host amssq31 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:24] PROBLEM - Host amslvs3 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:24] PROBLEM - Host knsq22 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:24] PROBLEM - Host foundation-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:24:25] PROBLEM - Host wikimedia-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:24:29] we're looking at it [23:24:35] PROBLEM - Host ssl3002 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:35] PROBLEM - Host nescio is DOWN: PING CRITICAL - Packet loss = 100% [23:24:35] oh, it wasnt intended [23:24:37] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:40] i assumed it was planned. [23:24:45] RobH_busy: it was not [23:24:45] PROBLEM - Host amssq40 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:46] PROBLEM - Host amssq35 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:46] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:46] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:46] PROBLEM - Host amssq38 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:46] PROBLEM - Host knsq20 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:46] PROBLEM - Host amssq43 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:47] PROBLEM - Host mediawiki-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:24:47] PROBLEM - Host cp3019 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:48] PROBLEM - Host knsq17 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:48] PROBLEM - Host amssq32 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:49] PROBLEM - Host amssq41 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:49] PROBLEM - Host amssq34 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:50] PROBLEM - Host knsq16 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:50] PROBLEM - Host knsq29 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:51] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:54] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:54] PROBLEM - Host amssq39 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:54] PROBLEM - Host cp3022 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:54] PROBLEM - Host amssq33 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:54] PROBLEM - Host wikiversity-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:24:54] PROBLEM - Host amssq42 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:55] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:55] PROBLEM - Host amssq36 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:56] PROBLEM - Host cp3021 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:56] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:57] PROBLEM - Host knsq21 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:57] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:58] PROBLEM - Host bits.esams.wikimedia.org 
is DOWN: PING CRITICAL - Packet loss = 100% [23:24:58] PROBLEM - Host ssl3001 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:59] PROBLEM - Host wikinews-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:24:59] PROBLEM - Host wikisource-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:00] PROBLEM - Host ssl3003 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:00] PROBLEM - Host wikipedia-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:04] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: CRIT replication delay 173664 seconds [23:25:04] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:05] PROBLEM - Host amssq45 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:05] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:05] PROBLEM - Host knsq18 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:05] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:05] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:05] PROBLEM - Host amslvs2 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:06] PROBLEM - Host amslvs1 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:06] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:07] PROBLEM - Host amssq37 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:07] PROBLEM - Host wikibooks-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:08] PROBLEM - Host hooft is DOWN: PING CRITICAL - Packet loss = 100% [23:25:08] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:09] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:09] PROBLEM - Host 91.198.174.6 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:10] PROBLEM - Host amssq46 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:14] PROBLEM - Host upload.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:15] PROBLEM - Host amslvs4 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:15] PROBLEM - Host amssq53 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:15] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [23:25:15] PROBLEM - Host wiktionary-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:15] PROBLEM - Host wikiquote-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:15] PROBLEM - Host ns2.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:24] RECOVERY - Host amssq58 is UP: PING OK - Packet loss = 0%, RTA = 95.14 ms [23:25:24] RECOVERY - Host amssq36 is UP: PING OK - Packet loss = 0%, RTA = 96.36 ms [23:25:24] RECOVERY - Host amssq44 is UP: PING OK - Packet loss = 0%, RTA = 96.39 ms [23:25:24] RECOVERY - Host knsq21 is UP: PING OK - Packet loss = 0%, RTA = 94.96 ms [23:25:24] RECOVERY - Host amssq60 is UP: PING OK - Packet loss = 0%, RTA = 96.20 ms [23:25:25] RECOVERY - Host knsq23 is UP: PING OK - Packet loss = 0%, RTA = 96.21 ms [23:25:25] RECOVERY - Host amssq49 is UP: PING OK - Packet loss = 0%, RTA = 96.29 ms [23:25:26] RECOVERY - Host knsq26 is UP: PING OK - Packet loss = 0%, RTA = 96.21 ms [23:25:26] RECOVERY - Host amssq55 is UP: PING OK - Packet loss = 0%, RTA = 96.19 ms [23:25:31] RECOVERY - Host knsq18 is UP: PING OK - Packet loss = 0%, RTA = 122.78 ms [23:25:32] RECOVERY - Host knsq28 is UP: PING OK - Packet loss = 0%, RTA = 121.53 ms [23:25:32] RECOVERY - Host knsq23 is UP: PING OK - Packet loss = 0%, RTA = 122.47 ms [23:25:32] RECOVERY - Host 
knsq26 is UP: PING OK - Packet loss = 0%, RTA = 122.50 ms [23:25:32] RECOVERY - Host knsq24 is UP: PING OK - Packet loss = 0%, RTA = 121.04 ms [23:25:32] RECOVERY - Host knsq27 is UP: PING OK - Packet loss = 0%, RTA = 122.60 ms [23:25:40] RECOVERY - Host ms6 is UP: PING OK - Packet loss = 0%, RTA = 122.43 ms [23:25:40] RECOVERY - Host knsq19 is UP: PING OK - Packet loss = 0%, RTA = 122.33 ms [23:26:34] RECOVERY - Host foundation-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 121.32 ms [23:34:50] !log re-enabling disabled puppet on srv253,srv235 [23:34:52] Logged the message, Master [23:35:18] RECOVERY - Puppet freshness on srv253 is OK: puppet ran at Mon Feb 25 23:35:00 UTC 2013 [23:36:21] RECOVERY - Puppet freshness on srv235 is OK: puppet ran at Mon Feb 25 23:35:49 UTC 2013 [23:36:42] hashar: jenkins service on gallium, critical for quite a while [23:36:48] and zuul-server too [23:36:57] :( [23:37:07] PROCS CRITICAL: 0 processes with args 'jenkins' [23:37:12] but thats like 32 days [23:37:14] looking [23:37:14] New patchset: Lcarr; "Changing redirect to correctly direct to icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50838 [23:37:23] New patchset: Reedy; "Bug 43329 - Enable SearchExtraNs extension on Commons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47302 [23:37:45] mutante: zuul is running https://integration.mediawiki.org/zuul/status [23:37:49] mutante: jenkins too. [23:37:58] mutante: so that must be the nagios check :/ [23:38:06] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50838 [23:38:21] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47302 [23:39:01] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable SearchExtraNS on commonswiki' [23:39:03] Logged the message, Master [23:40:57] New patchset: Reedy; "Configure SearchExtraNS" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50839 [23:44:19] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50839 [23:44:38] New patchset: Lcarr; "warn is invalid - shoudl be warning" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50840 [23:44:46] paravoid: if someone edited a bunch of massive templates, then it will be high for a while [23:45:28] !log reedy synchronized wmf-config/ [23:45:30] Logged the message, Master [23:45:58] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50840 [23:47:09] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay NULL seconds [23:49:06] it sometimes jumps to 80k or so and goes back down so the warning could be premature is some cases, though 1M means some crazy editing is going on [23:51:13] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Cant find record in page_restrictions on query. Default da [23:53:08] anyone wants to fix the memcached/marmontel check? 
:) [23:53:17] RobH was going to handle it but apparently he's busy [23:53:54] RECOVERY - Puppet freshness on knsq26 is OK: puppet ran at Mon Feb 25 23:53:25 UTC 2013 [23:54:13] !log knsq26,knsq28 - deleting puppet lock file, fix puppet runs [23:54:14] Logged the message, Master [23:54:48] RECOVERY - Puppet freshness on knsq28 is OK: puppet ran at Mon Feb 25 23:54:35 UTC 2013 [23:54:56] New patchset: Pyoungmeister; "this too shall be quiet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50842 [23:55:54] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50842
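Several alerts in this log — "PROCS OK: 2 processes with args ircecho", and the long-standing "PROCS CRITICAL: 0 processes with args 'jenkins'" even though Jenkins and Zuul were answering — are process-count checks that match on the command line. The Python below is a rough equivalent, not the real check_procs plugin, and shows why such a check can disagree with the service actually being up: it only counts argument-string matches.

    #!/usr/bin/env python3
    # Rough equivalent of the "N processes with args X" checks quoted in this
    # log: count processes whose command line contains a substring and map the
    # count to Nagios exit codes. If the substring never appears in the visible
    # arguments, the check reports 0 even while the service itself is healthy,
    # which is what the jenkins/zuul discussion above suggests happened.
    import os
    import sys
    import psutil   # third-party: pip install psutil

    def count_with_args(needle):
        count = 0
        for proc in psutil.process_iter():
            if proc.pid == os.getpid():
                continue   # don't count this check itself
            try:
                if needle in " ".join(proc.cmdline()):
                    count += 1
            except (psutil.AccessDenied, psutil.NoSuchProcess):
                continue
        return count

    if __name__ == "__main__":
        needle = sys.argv[1] if len(sys.argv) > 1 else "jenkins"
        n = count_with_args(needle)
        if n > 0:
            print("PROCS OK: %d processes with args %s" % (n, needle))
            sys.exit(0)
        print("PROCS CRITICAL: %d processes with args %s" % (n, needle))
        sys.exit(2)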