[00:02:10] PROBLEM - Puppet freshness on sq73 is CRITICAL: Puppet has not run in the last 10 hours [00:03:04] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [00:06:49] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:08:46] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.131 seconds [00:24:58] PROBLEM - Puppet freshness on cp3003 is CRITICAL: Puppet has not run in the last 10 hours [00:25:25] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:27:04] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.055 seconds [00:30:58] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours [00:32:37] nn [00:39:42] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [00:41:30] PROBLEM - Puppet freshness on lardner is CRITICAL: Puppet has not run in the last 10 hours [00:46:00] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:47:30] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:50:30] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours [00:50:30] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours [00:58:00] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.23 ms [00:58:20] getting bits.wm.o issues (504s), anyone looking at it already? [00:58:26] yes [00:58:29] looking [00:58:30] thanks [00:58:40] mark too [01:02:30] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:06:06] PROBLEM - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [01:06:15] !log powercycling arsenic [01:06:19] Logged the message, Master [01:07:09] PROBLEM - Host arsenic is DOWN: PING CRITICAL - Packet loss = 100% [01:07:51] !log powercycling niobium (both unresponsive from the load) [01:07:52] Logged the message, Master [01:08:03] RECOVERY - LVS HTTPS IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3853 bytes in 0.627 seconds [01:09:24] PROBLEM - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is CRITICAL: Connection refused [01:09:25] PROBLEM - Host niobium is DOWN: PING CRITICAL - Packet loss = 100% [01:10:18] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:10:27] RECOVERY - Host arsenic is UP: PING OK - Packet loss = 0%, RTA = 26.99 ms [01:11:12] RECOVERY - LVS HTTP IPv4 on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3836 bytes in 0.055 seconds [01:11:39] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.054 seconds [01:11:48] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.053 seconds [01:11:49] RECOVERY - Host niobium is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [01:13:19] oh hey [01:13:27] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours [01:13:31] just jumped online because of the issues, looks like you have it paravoid ? [01:13:34] hey [01:13:40] were they completely unresponsive ? [01:13:47] oh, it's not search for once [01:13:52] died from the load [01:14:27] new and improved varnish bug perhaps ? 
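The recurring "Varnish HTTP bits ... CRITICAL - Socket timeout after 10 seconds" and "HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.131 seconds" lines above come from an HTTP service check that fetches the bits frontend with a hard socket timeout. A minimal Python sketch of that kind of probe is below; the hostname is a placeholder, and this only illustrates the check semantics, it is not the actual plugin behind the monitoring bots.

    #!/usr/bin/env python3
    # Minimal sketch of an HTTP check with a hard socket timeout, in the spirit
    # of the "Varnish HTTP bits" alerts above. Not the real monitoring plugin.
    import socket
    import sys
    import time
    import urllib.request

    URL = "http://niobium.example/"   # placeholder; the real checks hit the bits caches
    TIMEOUT = 10                      # the alerts above say "Socket timeout after 10 seconds"

    def check_http(url, timeout=TIMEOUT):
        start = time.time()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                body = resp.read()
            elapsed = time.time() - start
            print("HTTP OK %d - %d bytes in %.3f seconds" % (resp.status, len(body), elapsed))
            return 0   # Nagios/icinga OK
        except socket.timeout:
            print("CRITICAL - Socket timeout after %d seconds" % timeout)
            return 2   # CRITICAL
        except Exception as exc:
            print("CRITICAL - %s" % exc)
            return 2

    if __name__ == "__main__":
        sys.exit(check_http(URL))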
[01:14:36] maybe DoS [01:15:36] !log Powercycled strontium [01:15:42] Logged the message, Master [01:17:34] hrm, looks like memory usage until a few hours ago was inching up - http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=Bits+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 then it goes insane [01:18:14] varnishncsa was using lots of memory too [01:20:48] RECOVERY - Varnish HTTP bits on strontium is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 0.054 seconds [01:21:24] RECOVERY - SSH on strontium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:23:56] ok, jumping back offline [01:30:25] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [01:30:52] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [01:35:14] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [01:35:49] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [01:58:37] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 187 seconds [01:59:04] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 196 seconds [02:01:28] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [02:01:28] RECOVERY - MySQL disk space on neon is OK: DISK OK [02:28:56] !log LocalisationUpdate completed (1.21wmf10) at Mon Feb 25 02:28:55 UTC 2013 [02:29:00] Logged the message, Master [02:30:17] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [02:32:32] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:36:17] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:40:47] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [02:41:32] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [02:42:17] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.24 ms [02:52:33] !log LocalisationUpdate completed (1.21wmf9) at Mon Feb 25 02:52:32 UTC 2013 [02:52:35] Logged the message, Master [02:56:05] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [02:56:14] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [03:09:35] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:11:06] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [03:11:24] RECOVERY - MySQL disk space on neon is OK: DISK OK [03:29:41] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [03:32:59] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [03:37:26] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [03:38:11] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 5 seconds [03:38:20] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [03:57:29] New patchset: Jeremyb; "annotate disabled root accounts on include" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50724 [04:03:41] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [04:03:41] PROBLEM - Puppet freshness on mw1070 is CRITICAL: Puppet has not run in the last 
10 hours [04:05:38] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (597842), Total (600662) [04:06:05] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (599265), Total (607648) [04:31:35] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [04:33:41] PROBLEM - Puppet freshness on mw1059 is CRITICAL: Puppet has not run in the last 10 hours [04:34:35] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [04:34:35] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [04:34:35] PROBLEM - Puppet freshness on mc1001 is CRITICAL: Puppet has not run in the last 10 hours [04:34:35] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [04:34:36] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [04:34:36] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [04:35:47] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [04:36:41] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [04:36:41] PROBLEM - Puppet freshness on mw1157 is CRITICAL: Puppet has not run in the last 10 hours [04:36:41] PROBLEM - Puppet freshness on srv235 is CRITICAL: Puppet has not run in the last 10 hours [04:38:38] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [04:40:47] tola is parsoid? RoanKattouw_away ^ [04:43:53] PROBLEM - mailman on sodium is CRITICAL: PROCS CRITICAL: 42 processes with args mailman [04:45:41] RECOVERY - mailman on sodium is OK: PROCS OK: 11 processes with args mailman [04:59:16] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50611 [05:03:32] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [05:06:41] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.27 ms [05:18:32] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:19:45] *now* it is search [05:23:47] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.019 second response time on port 8123 [05:38:39] (search is still broken on commons even though it says pool4 recovered) [05:38:47] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:38:52] hah [05:44:56] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [05:45:14] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.021 second response time on port 8123 [05:48:32] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.019 second response time on port 8123 [05:56:47] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:04:59] apergos: ^^ [06:06:59] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.024 second response time on port 8123 [06:14:29] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [06:21:59] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out [06:31:26] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.022 second response time on port 8123 [06:37:08] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [06:38:01] RECOVERY - LVS Lucene 
on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [06:38:52] !log restarted lucene search on search1016 [06:38:55] Logged the message, Master [06:39:04] PROBLEM - Puppet freshness on srv253 is CRITICAL: Puppet has not run in the last 10 hours [06:51:04] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 186 seconds [06:51:25] thanks apergos [06:52:27] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 207 seconds [06:53:48] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [06:55:09] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [07:14:12] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [07:20:12] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [07:21:33] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [07:24:42] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [07:25:45] RECOVERY - MySQL disk space on neon is OK: DISK OK [07:46:45] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:47:57] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:01:24] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [08:01:33] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [08:29:18] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [08:31:51] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:32:36] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:39:12] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [08:41:36] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [08:47:59] robots.php is where? [08:48:09] I'm looking in operations/mediawiki-config.git. [08:49:23] live-1.5 is the answer. 
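The "LVS Lucene on search-pool4.svc.eqiad.wmnet ... TCP OK - 3.019 second response time on port 8123" lines in this stretch are a plain TCP connect check against the Lucene search port, and the remedy logged here was simply restarting the search service on search1016. A rough Python equivalent of that TCP probe follows; the timeout value is illustrative, not the production setting, and this is not the actual plugin configuration.

    #!/usr/bin/env python3
    # Rough sketch of a TCP connect check like the "LVS Lucene" probe above:
    # connect to the search port, time the connection, map the result to
    # Nagios-style states.
    import socket
    import sys
    import time

    HOST = "search-pool4.svc.eqiad.wmnet"   # taken from the alert text above
    PORT = 8123                             # Lucene search port seen in the alerts
    TIMEOUT = 10                            # seconds; illustrative

    def check_tcp(host, port, timeout=TIMEOUT):
        start = time.time()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                elapsed = time.time() - start
            print("TCP OK - %.3f second response time on port %d" % (elapsed, port))
            return 0   # OK
        except OSError:
            print("Connection timed out on port %d" % port)
            return 2   # CRITICAL

    if __name__ == "__main__":
        sys.exit(check_tcp(HOST, PORT))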
[09:01:06] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 217 seconds [09:01:06] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 217 seconds [09:10:06] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:11:00] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:11:54] RECOVERY - Puppet freshness on amssq37 is OK: puppet ran at Mon Feb 25 09:11:33 UTC 2013 [09:13:33] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [09:14:45] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.054 seconds [09:22:42] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [09:22:42] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [09:25:17] RECOVERY - Puppet freshness on sq73 is OK: puppet ran at Mon Feb 25 09:25:02 UTC 2013 [09:36:14] RECOVERY - Puppet freshness on lardner is OK: puppet ran at Mon Feb 25 09:36:06 UTC 2013 [10:12:05] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:12:50] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:19:53] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.054 seconds [10:20:11] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:26:11] PROBLEM - Puppet freshness on cp3003 is CRITICAL: Puppet has not run in the last 10 hours [10:33:26] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours [10:45:30] New patchset: Raimond Spekking; "Add a comment/cross reference to WikimediaMessages.i18n.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50740 [10:51:44] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours [10:51:44] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours [11:13:38] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 185 seconds [11:14:41] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours [11:25:17] New patchset: Dereckson; "(bug 45333) Namespace configuration for uk.wikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50742 [11:25:56] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [11:26:14] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 187 seconds [11:26:50] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 196 seconds [11:31:22] New patchset: Dereckson; "(bug 45079) Add P: as alias for Property namespace on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50709 [11:33:18] New review: Dereckson; "Next time, please follow the correct comment case ("Bug" and not "bug") to standardize a little the ..." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/50709 [11:45:33] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 118.33 ms [12:22:55] New patchset: Matthias Mullie; "cleanup AFTv5 config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50744 [12:23:21] New review: Matthias Mullie; "Do not merge before https://gerrit.wikimedia.org/r/#/c/50372/ is merged" [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/50744 [12:25:22] New patchset: ArielGlenn; "minor fixes:" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50745 [12:29:13] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/50745 [12:54:15] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:54:42] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:33] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [12:58:09] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:01:27] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [13:07:09] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [13:07:54] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [13:11:30] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 184 seconds [13:13:01] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 196 seconds [13:16:36] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [13:16:54] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [13:26:48] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:27:06] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:51] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:31:00] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:12] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:32:57] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [13:34:54] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [13:37:27] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:37:45] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 2 processes with args ircecho [13:38:12] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.057 seconds [13:38:30] RECOVERY - MySQL disk space on neon is OK: DISK OK [14:05:19] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [14:05:20] PROBLEM - Puppet freshness on mw1070 is CRITICAL: Puppet has not run in the last 10 hours [14:19:21] Susan: whatchya doing with robots.php? 
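The paired "MySQL Replication Heartbeat" / "MySQL Slave Delay" alerts above both reduce to the same question: how many seconds behind the master is this slave. The sketch below shows the heartbeat-style calculation and the threshold logic, assuming a pt-heartbeat-like table whose newest row carries a timestamp written on the master; the actual Wikimedia checks may compute the delay differently (for example from SHOW SLAVE STATUS), so treat this purely as an illustration.

    #!/usr/bin/env python3
    # Illustration of the delay/threshold logic behind alerts like
    # "CRIT replication delay 187 seconds". Assumes the newest heartbeat
    # timestamp written by the master has already been fetched; the fetch is
    # omitted to keep the sketch dependency-free.
    import datetime
    import sys

    WARN = 30    # illustrative thresholds, not the production ones
    CRIT = 180   # the alerts above fire at roughly this delay

    def replication_delay(heartbeat_ts_utc, now=None):
        """Seconds between now and the last heartbeat written on the master."""
        now = now or datetime.datetime.utcnow()
        return max(0, int((now - heartbeat_ts_utc).total_seconds()))

    def nagios_state(delay):
        if delay >= CRIT:
            return 2, "CRIT replication delay %d seconds" % delay
        if delay >= WARN:
            return 1, "WARN replication delay %d seconds" % delay
        return 0, "OK replication delay %d seconds" % delay

    if __name__ == "__main__":
        # Pretend the newest heartbeat row is 187 seconds old, as in the alert above.
        ts = datetime.datetime.utcnow() - datetime.timedelta(seconds=187)
        code, message = nagios_state(replication_delay(ts))
        print(message)
        sys.exit(code)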
[14:28:00] jeremyb_, humans.txt [14:33:22] PROBLEM - Puppet freshness on amssq33 is CRITICAL: Puppet has not run in the last 10 hours [14:35:19] PROBLEM - Puppet freshness on mw1059 is CRITICAL: Puppet has not run in the last 10 hours [14:36:22] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [14:36:23] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [14:36:23] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [14:36:23] PROBLEM - Puppet freshness on mc1001 is CRITICAL: Puppet has not run in the last 10 hours [14:36:23] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [14:36:23] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours [14:38:19] PROBLEM - Puppet freshness on knsq26 is CRITICAL: Puppet has not run in the last 10 hours [14:38:19] PROBLEM - Puppet freshness on mw1157 is CRITICAL: Puppet has not run in the last 10 hours [14:38:19] PROBLEM - Puppet freshness on srv235 is CRITICAL: Puppet has not run in the last 10 hours [14:40:25] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [14:42:49] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , arwiki (67030), nlwiki (108506), Total (181046) [15:00:52] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [15:01:01] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:01:20] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [15:05:58] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:14:49] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:16:10] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [15:29:40] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [15:32:13] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [15:59:05] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (25215), Total (35832) [16:02:27] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , commonswiki (12603), Total (21625) [16:18:21] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 189 seconds [16:18:30] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 194 seconds [16:18:48] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 209 seconds [16:19:06] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 221 seconds [16:32:45] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 3 seconds [16:33:12] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 2 seconds [16:36:39] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:38:00] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [16:38:09] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [16:38:18] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 185 seconds [16:38:54] 
PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 194 seconds [16:39:57] PROBLEM - Puppet freshness on srv253 is CRITICAL: Puppet has not run in the last 10 hours [16:59:54] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 204 seconds [16:59:54] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 204 seconds [17:12:30] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [17:12:31] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [17:15:51] New patchset: Matthias Mullie; "cleanup AFTv5 config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50744 [17:15:57] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [18:05:13] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:05:13] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:06:07] PROBLEM - Puppet freshness on mw1029 is CRITICAL: Puppet has not run in the last 10 hours [18:11:07] reedy@mw110's password: [18:11:10] RobH: ^^ [18:11:12] !log reedy synchronized php-1.21wmf10/extensions/Wikibase [18:11:14] Logged the message, Master [18:11:17] fixitfixitfixitfixitfixitfixitfixitfixitfixitfixitfixitfixitfixitfixit [18:11:26] Reedy: did that transcode script ever finish? [18:11:36] * jeremyb_ wonders if Reedy was assisted by a clipboard [18:11:40] Reedy: yep, i need to pull out of node lists since i didnt fix it [18:11:43] will take care of it [18:11:44] AaronSchulz: I was wondering the same thing [18:11:58] AaronSchulz: No, still on commonswiki via foreachwiki [18:12:10] And it's still going [18:13:16] commonswiki: mwstore://local-multiwrite/local-thumb/d/d5/Vladimir_Putin's_press_conference_on_2012-12-20.ogv/Vladimir_Putin's_press_conference_on_2012-12-20.ogv.360p.webm => mwstore://local-multiwrite/local-transcoded/d/d5/Vladimir_Putin's_press_conference_on_2012-12-20.ogv/Vladimir_Putin's_press_conference_on_2012-12-20.ogv.360p.webm [18:14:36] New patchset: Krinkle; "Notifications for TemplateData to #mediawiki-visualeditor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50757 [18:17:58] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:21:49] Change abandoned: Ottomata; "Abandoning this, hope to soon use:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46618 [18:22:40] j^: About? [18:22:53] Does that script iterate over ALL transcodes? 
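The "commonswiki: mwstore://local-multiwrite/local-thumb/... => .../local-transcoded/..." line above shows what the long-running foreachwiki job is doing for each derivative: the same hashed directory and file name, moved from the local-thumb zone into the local-transcoded zone. Below is a small sketch of just that path mapping; the file name is hypothetical and this is not the actual maintenance script.

    #!/usr/bin/env python3
    # Sketch of the zone rewrite visible in the log line above:
    #   mwstore://local-multiwrite/local-thumb/<hash>/<file>/<derivative>
    #     => mwstore://local-multiwrite/local-transcoded/<hash>/<file>/<derivative>
    OLD_ZONE = "/local-thumb/"
    NEW_ZONE = "/local-transcoded/"

    def transcoded_path(src):
        """Map a local-thumb transcode path to its local-transcoded location."""
        if OLD_ZONE not in src:
            raise ValueError("not a local-thumb path: %r" % src)
        return src.replace(OLD_ZONE, NEW_ZONE, 1)

    if __name__ == "__main__":
        # Hypothetical example in the same shape as the Commons entry above.
        src = ("mwstore://local-multiwrite/local-thumb/a/ab/Example_video.ogv/"
               "Example_video.ogv.360p.webm")
        print(src, "=>", transcoded_path(src))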
[18:24:17] Reedy: it should [18:25:28] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.469 seconds [18:25:31] AaronSchulz: was wondering if it might be a good transition workaround to check if a /transcoded/ url fails with 404 if the url with /thumb/ would work and return that [18:26:25] I'm wondering how far through the queue it is [18:29:48] if swift would support some kind of 'find' or 'ls' might be possible to find out [18:31:01] !log Copying all captchas into ceph [18:31:01] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [18:31:03] Logged the message, Master [18:32:48] New patchset: Ryan Lane; "Move xml and image backups into public location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50759 [18:37:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50759 [18:42:30] New patchset: Ryan Lane; "Fix variable inclusion in puppet reactor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50761 [18:43:47] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50740 [18:44:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50761 [18:45:53] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Mon Feb 25 18:45:41 UTC 2013 [18:45:57] I take it the meeting that moved from 11 to 10:30 has been moved back to 11? [18:46:15] I think it has, yeah [18:49:25] !log reedy synchronized php-1.21wmf10/extensions/WikimediaMessages [18:49:27] Logged the message, Master [18:49:37] New patchset: RobH; "formatting fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50763 [18:50:33] !log deleting foundation-news-l (deprecated, replaced by WikimediaAnnounce-l) [18:50:34] Logged the message, Master [18:51:38] Change abandoned: RobH; "bleh, typo in my typo fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50763 [18:54:04] !log wikivoyager.de and wikivoyager.org transferred and both redirect to wikivoyage.org [18:54:05] Logged the message, Master [19:07:01] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf10 [19:07:02] Logged the message, Master [19:07:21] jeremyb_: https://bugzilla.wikimedia.org/show_bug.cgi?id=45347 [19:08:31] Can someone fix the replication of the SAL to twitter/identi.ca? Seems to have broken at the end of January https://twitter.com/wikimediatech [19:09:18] Susan: yeah, MaxSem answered :) [19:09:23] New patchset: Reedy; "enwiki to 1.21wmf10" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50765 [19:10:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50765 [19:11:18] Reedy: Does anyone care? [19:11:28] Also, file a bug. [19:11:29] Reedy: not sure we have those logins [19:11:29] I do, that's why I said it ;) [19:11:49] <^demon> Wow, I just read bug 45347. [19:11:55] <^demon> humans.txt seems totally pointless. [19:11:59] mutante: /h/w/c/docs? [19:12:10] Surely the script has the login. [19:12:19] There's probably an OAuth file somewhere. [19:12:21] Reedy: just checked.. dont see it [19:12:30] wikitech vm? [19:12:57] New patchset: Ryan Lane; "Make images and files labsconsole backup chdir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50766 [19:13:14] ^demon: It's obviously silly. [19:13:51] <^demon> Obviously. 
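The transition workaround floated above — if a /transcoded/ URL answers 404, check whether the same path under /thumb/ still works and serve that instead — is easy to sketch. The Python below uses placeholder URLs and only illustrates the idea as stated in the channel; it is not how the production 404 handler or the Swift/Ceph rewrite rules are actually implemented.

    #!/usr/bin/env python3
    # Sketch of the fallback discussed above: prefer the new /transcoded/
    # location, fall back to the old /thumb/ location when it 404s.
    import urllib.error
    import urllib.request

    def url_exists(url):
        """HEAD the URL and report whether it answers with a 2xx."""
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return 200 <= resp.status < 300
        except urllib.error.URLError:
            return False   # 404, other HTTP errors, or the host being unreachable

    def resolve(transcoded_url):
        if url_exists(transcoded_url):
            return transcoded_url
        fallback = transcoded_url.replace("/transcoded/", "/thumb/", 1)
        return fallback if url_exists(fallback) else None

    if __name__ == "__main__":
        # Placeholder URL shape; real upload URLs differ.
        print(resolve("https://upload.example.org/commons/transcoded/a/ab/"
                      "Example.ogv/Example.ogv.360p.webm"))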
[19:14:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50766 [19:14:10] Reedy: checking later. in meeting [19:14:12] Re-wontfixed. Oh well. [19:14:16] heh, I'll file an RT ticket [19:14:25] There's no reason to use RT. [19:15:47] Reedy: thank you [19:16:15] Susan: Ops will pick it up and actually deal with it ;) [19:19:03] Rage. [19:32:47] !log reedy Started syncing Wikimedia installation... : Rebuilding message cache to update WikimediaMessages [19:32:49] Logged the message, Master [19:36:53] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:11] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:02] New patchset: Lcarr; "fixing ganglios on icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50772 [19:40:24] !log powercycling frozen niobium [19:40:26] Logged the message, Master [19:42:55] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50772 [19:44:23] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.054 seconds [19:46:39] PROBLEM - NTP on niobium is CRITICAL: NTP CRITICAL: Offset unknown [19:47:06] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:50:13] !log reedy Finished syncing Wikimedia installation... : Rebuilding message cache to update WikimediaMessages [19:50:15] RECOVERY - NTP on niobium is OK: NTP OK: Offset 0.004889369011 secs [19:50:15] Logged the message, Master [19:50:27] holy scap [19:50:30] 18 minutes, nice [19:55:58] New patchset: Ottomata; "Adding puppet-merge for sockpuppet puppet merges." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50452 [19:56:09] New patchset: Reedy; "(bug 45083) Enable AbuseFilter IRC notifications on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49704 [19:56:14] New review: Reedy; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49704 [19:56:21] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49704 [19:56:53] New patchset: Reedy; "checkoutMediaWiki: Remove redundant argument to git-submodule." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50608 [19:57:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50608 [19:57:45] New patchset: Reedy; "(bug 44604) Enable PostEdit on ur.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49378 [19:57:51] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49378 [19:58:17] New patchset: Reedy; "(bug 44796) Updating logo for Telugu Wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49180 [19:58:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49180 [19:59:38] New patchset: Reedy; "(bug 45233) Groups permissions on pt.wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50181 [19:59:43] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50181 [20:00:24] New patchset: Reedy; "Remove document roots for deleted wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50344 [20:00:32] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50344 [20:01:32] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [20:01:34] New patchset: Reedy; "(bug 45113) Set cswiktionary favicon to the same as enwiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49681 [20:01:41] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49681 [20:02:13] New patchset: Reedy; "Clean up common/README." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50612 [20:02:20] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50612 [20:03:20] New patchset: Reedy; "(bug 44587) Trwiki FlaggedRevs autopromotion config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49685 [20:03:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49685 [20:03:52] New patchset: Reedy; "(bug 45079) Add P: as alias for Property namespace on Wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50709 [20:03:57] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50709 [20:04:21] New patchset: Reedy; "(bug 45333) Namespace configuration for uk.wikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50742 [20:04:38] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50742 [20:04:55] New patchset: Reedy; "(bug 45205) Namespace configuration for hu.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50159 [20:06:42] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 92.27 ms [20:08:28] PROBLEM - Puppet freshness on srv253 is CRITICAL: Puppet has not run in the last 10 hours [20:09:48] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Mon Feb 25 20:09:39 UTC 2013 [20:14:42] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 181 seconds [20:15:18] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 186 seconds [20:15:38] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 193 seconds [20:15:45] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 193 
seconds [20:20:28] PROBLEM - Puppet freshness on cp3003 is CRITICAL: Puppet has not run in the last 10 hours [20:21:21] Reedy: you didn't sync the config, did you? [20:22:11] New patchset: Pyoungmeister; "improving tone of nagios alerts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50778 [20:25:28] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours [20:25:37] Nemo_bis: No [20:25:39] Lunch time [20:27:26] !log reedy synchronized wmf-config/ [20:27:27] PROBLEM - Puppet freshness on cp3003 is CRITICAL: Puppet has not run in the last 10 hours [20:27:27] Logged the message, Master [20:28:26] !log reedy synchronized docroot [20:28:27] Logged the message, Master [20:31:40] PROBLEM - Host cp3003 is DOWN: CRITICAL - Plugin timed out after 15 seconds [20:34:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50159 [20:34:30] PROBLEM - Puppet freshness on cp3004 is CRITICAL: Puppet has not run in the last 10 hours [20:36:48] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 92.50 ms [20:45:28] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours [20:45:28] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours [20:52:30] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours [20:52:30] PROBLEM - Puppet freshness on ms-be3003 is CRITICAL: Puppet has not run in the last 10 hours [20:54:08] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [20:59:18] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 92.28 ms [21:08:08] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours [21:09:36] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [21:15:27] PROBLEM - Puppet freshness on ms-be3001 is CRITICAL: Puppet has not run in the last 10 hours [21:18:57] !log reedy synchronized wmf-config/InitialiseSettings.php [21:18:59] Logged the message, Master [21:24:08] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [21:29:16] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 92.16 ms [21:29:35] RECOVERY - Host cp3003 is UP: PING WARNING - Packet loss = 80%, RTA = 118.32 ms [21:31:23] PROBLEM - Puppet freshness on snapshot4 is CRITICAL: Puppet has not run in the last 10 hours [21:35:43] Reedy: is there any clue on how s3 is bearing the querypages updates? [21:35:51] In what sense? [21:36:03] The pmtpa mysql servers are idle, so we don't care a great deal [21:36:34] Reedy: no way to measure the effect then? [21:36:41] Look at their load I guess [21:36:42] and why not merge the other cronjob then [21:37:02] well but it was mixed with the eqiad migration, let's check [21:38:08] RECOVERY - Puppet freshness on snapshot4 is OK: puppet ran at Mon Feb 25 21:37:39 UTC 2013 [21:39:20] load seems to be consistently non-existent [21:43:39] RobH: Hey. Have we any "spare" misc servers in eqiad? gallium is getting a bit overloaded running jobs at time [21:44:18] I know it needs an RT ticket trail etc for actually getting one [21:48:26] Reedy: i have other misc servers [21:48:37] do you mean you need an additional one or a faster replacement? [21:53:10] <^demon> RobH: Either? If we had an additional one, we could set it up as a slave to gallium to offload the jobs. 
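For the "is there any clue on how s3 is bearing the querypages updates — look at their load I guess" exchange above, the quick measurement being eyeballed in Ganglia is just load average, ideally normalised by core count. A trivial local sketch, run on the host itself; the threshold is illustrative only.

    #!/usr/bin/env python3
    # Tiny sketch of the "just look at their load" measurement discussed above:
    # 1/5/15-minute load averages normalised by core count, so 1.0 ~= fully busy.
    import os

    def normalised_load():
        cores = os.cpu_count() or 1
        return tuple(avg / cores for avg in os.getloadavg())

    if __name__ == "__main__":
        one, five, fifteen = normalised_load()
        print("load/core 1m=%.2f 5m=%.2f 15m=%.2f" % (one, five, fifteen))
        if fifteen < 0.1:
            print("consistently non-existent")   # the observation made above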
[21:53:16] <^demon> But that's a hashar question :) [21:53:53] before involving ops, I would like to find out the culprit in our current setup [21:54:05] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [21:54:13] I looked earlier, it seems most changes are waiting for parser tests which are a bit long [21:54:46] ^demon: That's sort of what I was thinking [21:54:57] It's a quad core cpu, 8 threads [21:55:29] <^demon> hashar: Well, parser tests are gonna be slow on most setups :\ [21:56:02] Theirs a tmpfs in ram now, right? [21:56:13] the sqlite are in tmpfs yeah [21:56:15] but still [21:56:16] SSD might help somewhat, but we'll just hit more bottlenecks [21:56:26] the whole parser tests suite is a mess [21:56:37] mosty caused by all the overhead in our PHPUnit integration [21:56:53] I will eventually drop it in favor of the good old parserTests.php [21:57:01] just need to have it output some JUnit XML [21:57:09] the puppet parser check takes forever [21:57:12] (just saying) [21:57:16] ;] [21:57:25] hashar, maybe, disable the ParserTest suite and run them using parserTests.php? [21:57:34] should be helluva faster [21:58:09] MaxSem: yeah what I said :-] [21:58:15] still need to have it report properly in Jenkins [21:58:19] which need junit output [21:58:26] I got a draft somewhere [21:58:30] New review: Dzahn; "it already redirects." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [21:58:34] though reimplementing JUnit support in PHP is overkill [21:59:44] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds [22:00:38] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [22:01:32] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 5 seconds [22:02:27] !log Ran namespaceDupes.php on huwikt [22:02:46] Logged the message, Master [22:09:40] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 199 seconds [22:09:48] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds [22:09:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 204 seconds [22:10:10] PROBLEM - Puppet freshness on snapshot4 is CRITICAL: Puppet has not run in the last 10 hours [22:10:33] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 215 seconds [22:11:40] cmjohnson1 & sbernardin1 [22:11:49] In racktables, please do dates in format of yyyy-mm-dd [22:12:04] or else it sorts funky in some outputs (plus should be standardized) [22:12:08] OK [22:12:15] I am having to audit and add invoice data for every single item [22:12:20] and i am noticing everyone doing it differently, heh. [22:12:29] racktables is too stupid to force a format [22:12:40] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:12:50] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:13:23] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:14:08] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:14:21] robh: k...as we audit racktables for missing rt#'s etc...we'll just fix [22:15:26] cmjohnson1: Im doing that now. [22:15:38] I have a stack of invoices for the past 6 months it seems of orders [22:15:44] that i have to match up in racktables iwth the asset tag info [22:15:57] that sounds like real fun! 
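RobH's request above — enter racktables dates as yyyy-mm-dd "or else it sorts funky in some outputs" — works because ISO-style date strings sort chronologically even under plain string sorting, which is all a free-text field gets. A short Python illustration with made-up dates:

    #!/usr/bin/env python3
    # Why yyyy-mm-dd is asked for above: ISO-style date strings sort correctly
    # as plain strings, while mm/dd/yy-style entries do not. Dates are made up.
    iso = ["2012-11-30", "2013-01-05", "2013-02-25"]
    sloppy = ["11/30/12", "01/05/13", "2/25/13"]

    print(sorted(sloppy))   # ['01/05/13', '11/30/12', '2/25/13']  -- not chronological
    print(sorted(iso))      # ['2012-11-30', '2013-01-05', '2013-02-25']  -- chronological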
[22:16:06] * RobH is also correcting wmf to WMF to match asset tags, cuz thats what OCD requires since he is editing every server anyhow ;] [22:16:14] cmjohnson1: Im thinking aobut printing them out and handing half to you [22:16:42] ok...well you know where to find me [22:16:47] !log restarting opendj/pdns on virt0 [22:16:49] Logged the message, Master [22:17:08] RECOVERY - MySQL Slave Running on db36 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:17:20] RECOVERY - MySQL Slave Running on db36 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:18:11] RECOVERY - MySQL Slave Running on db64 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:18:11] RECOVERY - MySQL Slave Running on db64 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:18:50] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 176865 seconds [22:20:20] PROBLEM - MySQL Slave Delay on db64 is CRITICAL: CRIT replication delay 228013 seconds [22:21:35] !seen werdna [22:22:05] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 176783 seconds [22:22:59] PROBLEM - MySQL Slave Delay on db64 is CRITICAL: CRIT replication delay 226661 seconds [22:24:31] is there a way to figure out what process just wrote to a file ? [22:24:37] fuser only has what file is open [22:25:36] ah lsof did that [22:25:37] woot [22:27:08] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 181 seconds [22:27:35] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 191 seconds [22:27:40] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 197 seconds [22:27:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 201 seconds [22:30:11] except that's just something parsing it, not writing to it [22:34:41] must be something running as user nagios [22:36:10] RECOVERY - Frontend Squid HTTP on sq41 is OK: HTTP OK: HTTP/1.0 200 OK - 533 bytes in 0.056 second response time [22:36:19] RECOVERY - Frontend Squid HTTP on sq41 is OK: HTTP OK HTTP/1.0 200 OK - 632 bytes in 0.011 seconds [22:37:33] LeslieCarr: i guess its both ganglia_parse. also the writing. because it says in the script that it also "Separates metrics into one file per host for easier processing by nagios " [22:37:40] ganglia_parser [22:38:25] dataDir = '/var/lib/ganglia/xmlcache' [22:38:25] logDir = '/var/log/ganglia' [22:38:31] New patchset: Pyoungmeister; "dev/nulling some stdout from a couple of noisy crons" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50825 [22:39:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50825 [22:41:39] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:41:40] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:42:55] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [22:42:55] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [22:44:54] New patchset: Ottomata; "Not running decom_servers.sh cron on virt0 or virt1000" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50826 [22:45:10] Ryan_Lane^ checky? 
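As the fuser/lsof exchange above notes, an open file handle only tells you who is reading or holding the file, not who last wrote it; the writer here is guessed to be ganglia_parser, which per its own docstring "separates metrics into one file per host for easier processing by nagios", using the dataDir/logDir settings pasted above. A stripped-down sketch of that split-per-host idea follows, assuming the usual gmetad XML layout of HOST elements containing METRIC elements; it is not the real ganglios code.

    #!/usr/bin/env python3
    # Sketch of what the ganglia_parser docstring above describes: read a gmetad
    # XML dump and write one small metrics file per host, so a Nagios-side check
    # only has to open its own host's file. Assumes <HOST NAME=...> elements with
    # <METRIC NAME=... VAL=...> children, the typical gmetad output shape.
    import os
    import xml.etree.ElementTree as ET

    DATA_DIR = "/var/lib/ganglia/xmlcache"           # dataDir from the snippet above
    XML_DUMP = os.path.join(DATA_DIR, "gmetad.xml")  # hypothetical dump file name

    def split_per_host(xml_path, out_dir=DATA_DIR):
        tree = ET.parse(xml_path)
        for host in tree.iter("HOST"):
            hostname = host.get("NAME")
            lines = ["%s %s" % (m.get("NAME"), m.get("VAL")) for m in host.iter("METRIC")]
            with open(os.path.join(out_dir, hostname), "w") as fh:
                fh.write("\n".join(lines) + "\n")

    if __name__ == "__main__":
        split_per_host(XML_DUMP)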
[22:45:59] RECOVERY - Backend Squid HTTP on sq41 is OK: HTTP OK: Status line output matched 200 - 487 bytes in 0.061 second response time [22:46:40] RECOVERY - Backend Squid HTTP on sq41 is OK: HTTP OK HTTP/1.0 200 OK - 494 bytes in 0.009 seconds [22:47:22] MaxSem: check_solr -r is still buggy. [22:47:46] paravoid, what's the error message? [22:47:50] New patchset: Lcarr; "trying to pin via puppet ganglios to 1.2 (known working on spence)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50827 [22:48:02] same as before [22:48:18] paravoid, and is it getting run with -r for vanadium [22:48:19] ? [22:48:24] yes. [22:48:38] can't unpack 0 values or whatever [22:48:43] on the (e, ) = line [22:49:29] mutante: reprepro ls ganglios [22:49:58] AaronSchulz: ping? [22:50:10] AaronSchulz: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (1338353), metawiki (562173), Total (1901646) [22:50:24] I'm guessing this isn't normal? :) [22:50:36] you guys fixed the check the other day, right? [22:52:51] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50826 [22:55:58] paravoid, I'm confused: it shouldn't be run with -r on vanadium at all: if ($replication_master) { [22:55:59] $check_command = "check_solr" [22:56:08] fuck [22:57:12] don't don't care I just want the (null) alert gone [22:57:19] don't know* [22:57:32] * MaxSem faceplams [22:57:41] New patchset: MaxSem; "Fix replication condition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50829 [22:58:06] the check needs fixing too [22:58:13] it backtraces now [22:58:22] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [22:59:16] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 2 seconds [22:59:25] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [22:59:29] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 1 seconds [22:59:39] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 4 seconds [23:00:32] if this stupid packetloss alert doesn't go away on icinga now …. i burn down neon [23:00:55] LeslieCarr: neon nrpe has been flapping a lot [23:01:25] grrr [23:01:36] notpeter, when did the cronspam appear? [23:01:42] yay [23:01:44] alert cleared [23:01:55] hurdle overcome! [23:02:02] now we can kill spence :) [23:02:30] see above [23:02:36] except nrpe [23:02:59] you don't happen to have an idea why nrpe is flapping do you ? [23:02:59] also, clientbucket keeps getting filled up [23:03:09] because of naggen replacing the config all the time [23:03:14] that probably needs a backup => false [23:03:24] no idea about nrpe, I've just been seeing the alerts [23:04:44] New review: Faidon; "Use a conditional, e.g. 10.1 http://docs.puppetlabs.com/guides/style_guide.html" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/50829 [23:05:18] paravoid: where would i put the backup => false ? [23:05:43] under the file level of each of the generated files ? [23:05:49] Jeff_Green, I'm okay moving access request tickets from ops-requests to access-requests right? [23:06:36] Change abandoned: Lcarr; "no longer needed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50827 [23:07:27] Thehelpfulone: /me consults other ops folks... 
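The check_solr backtrace described above — "can't unpack 0 values ... on the (e, ) = line" — is the classic single-element tuple-unpacking pitfall: (e,) = seq raises ValueError as soon as seq is empty, which is what you would expect when -r is passed to a node that has no replication status to report. A tiny illustration of the failure and the usual defensive rewrite; this is not the actual plugin code.

    #!/usr/bin/env python3
    # The failure mode described above, plus a tolerant alternative.

    def fragile(values):
        (e,) = values   # ValueError if values is empty (or has more than one element)
        return e

    def tolerant(values):
        return values[0] if values else None   # degrade gracefully instead of backtracing

    if __name__ == "__main__":
        print(tolerant(["replication ok"]))   # -> replication ok
        print(tolerant([]))                   # -> None
        try:
            fragile([])
        except ValueError as exc:
            # Python 2 words this "need more than 0 values to unpack", matching
            # the "can't unpack 0 values" paraphrase above.
            print("ValueError:", exc)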
[23:07:29] MaxSem: february 3rd [23:07:38] LeslieCarr: yes [23:07:43] or maybe later, tbh, if it was baleeted out of my trash [23:07:47] s/later/earlier/ [23:07:52] heh sure [23:08:04] I think we're still deprecating the access request tag so it should be okay [23:08:06] notpeter, was it regular? [23:08:25] New patchset: Lcarr; "setting backup=> false on these large often changing configurations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50832 [23:08:28] paravoid: ^^ mind a quick review ? [23:08:41] MaxSem: no, every couple of days [23:08:53] mhm [23:09:01] 2/3, 2/14, 2/22, 2/24 [23:09:31] Thehelpfulone: afaik, that's fine [23:09:37] New review: Faidon; "That was fast, thanks :)" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/50832 [23:09:40] i did the two tickets I had already taken [23:09:52] ah ok [23:10:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50832 [23:13:07] notpeter, can I get solr1001 error logs for one of the days there was this error? [23:13:41] MaxSem: sure! [23:15:55] RECOVERY - MySQL Slave Running on db35 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:16:14] RECOVERY - MySQL Slave Running on db35 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:17:57] Jeff_Green: http://wikitech.wikimedia.org/view/Setting_up_a_MySQL_replica [23:18:24] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 97876 seconds [23:19:23] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 96963 seconds [23:20:35] PROBLEM - Host ssl3002 is DOWN: CRITICAL - Time to live exceeded (91.198.174.103) [23:21:10] RECOVERY - MySQL Slave Running on db39 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:21:14] RECOVERY - MySQL Slave Running on db39 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [23:21:28] RECOVERY - Host ssl3002 is UP: PING OK - Packet loss = 0%, RTA = 121.97 ms [23:23:16] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: CRIT replication delay 173694 seconds [23:23:43] PROBLEM - Host foundation-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:23:44] PROBLEM - Host amssq48 is DOWN: CRITICAL - Time to live exceeded (91.198.174.58) [23:23:44] PROBLEM - Host amssq55 is DOWN: CRITICAL - Time to live exceeded (91.198.174.65) [23:23:44] PROBLEM - Host amssq52 is DOWN: CRITICAL - Time to live exceeded (91.198.174.62) [23:23:44] PROBLEM - Host amssq57 is DOWN: CRITICAL - Time to live exceeded (91.198.174.67) [23:23:44] PROBLEM - Host amssq60 is DOWN: CRITICAL - Time to live exceeded (91.198.174.70) [23:23:45] PROBLEM - Host amssq58 is DOWN: CRITICAL - Time to live exceeded (91.198.174.68) [23:23:45] PROBLEM - Host amssq54 is DOWN: CRITICAL - Time to live exceeded (91.198.174.64) [23:23:46] PROBLEM - Host amssq44 is DOWN: CRITICAL - Time to live exceeded (91.198.174.54) [23:23:46] PROBLEM - Host amssq47 is DOWN: CRITICAL - Time to live exceeded (91.198.174.57) [23:23:47] PROBLEM - Host amssq49 is DOWN: CRITICAL - Time to live exceeded (91.198.174.59) [23:23:47] PROBLEM - Host amssq56 is DOWN: CRITICAL - Time to live exceeded (91.198.174.66) [23:23:48] PROBLEM - Host amssq61 is DOWN: CRITICAL - Time to live exceeded (91.198.174.71) [23:23:52] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100% [23:23:53] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [23:23:53] PROBLEM - Host knsq18 is 
DOWN: PING CRITICAL - Packet loss = 100% [23:24:10] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:11] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:11] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:11] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:18] so i can totally ignore that page right? [23:24:24] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:24] PROBLEM - Host amssq31 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:24] PROBLEM - Host amslvs3 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:24] PROBLEM - Host knsq22 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:24] PROBLEM - Host foundation-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:24:25] PROBLEM - Host wikimedia-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:24:29] we're looking at it [23:24:35] PROBLEM - Host ssl3002 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:35] PROBLEM - Host nescio is DOWN: PING CRITICAL - Packet loss = 100% [23:24:35] oh, it wasnt intended [23:24:37] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:40] i assumed it was planned. [23:24:45] RobH_busy: it was not [23:24:45] PROBLEM - Host amssq40 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:46] PROBLEM - Host amssq35 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:46] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:46] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:46] PROBLEM - Host amssq38 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:46] PROBLEM - Host knsq20 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:46] PROBLEM - Host amssq43 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:47] PROBLEM - Host mediawiki-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:24:47] PROBLEM - Host cp3019 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:48] PROBLEM - Host knsq17 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:48] PROBLEM - Host amssq32 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:49] PROBLEM - Host amssq41 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:49] PROBLEM - Host amssq34 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:50] PROBLEM - Host knsq16 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:50] PROBLEM - Host knsq29 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:51] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:54] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:54] PROBLEM - Host amssq39 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:54] PROBLEM - Host cp3022 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:54] PROBLEM - Host amssq33 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:54] PROBLEM - Host wikiversity-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:24:54] PROBLEM - Host amssq42 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:55] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:55] PROBLEM - Host amssq36 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:56] PROBLEM - Host cp3021 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:56] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:57] PROBLEM - Host knsq21 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:57] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:58] PROBLEM - Host bits.esams.wikimedia.org 
is DOWN: PING CRITICAL - Packet loss = 100% [23:24:58] PROBLEM - Host ssl3001 is DOWN: PING CRITICAL - Packet loss = 100% [23:24:59] PROBLEM - Host wikinews-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:24:59] PROBLEM - Host wikisource-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:00] PROBLEM - Host ssl3003 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:00] PROBLEM - Host wikipedia-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:04] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: CRIT replication delay 173664 seconds [23:25:04] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:05] PROBLEM - Host amssq45 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:05] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:05] PROBLEM - Host knsq18 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:05] PROBLEM - Host knsq28 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:05] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:05] PROBLEM - Host amslvs2 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:06] PROBLEM - Host amslvs1 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:06] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:07] PROBLEM - Host amssq37 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:07] PROBLEM - Host wikibooks-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:08] PROBLEM - Host hooft is DOWN: PING CRITICAL - Packet loss = 100% [23:25:08] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:09] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:09] PROBLEM - Host 91.198.174.6 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:10] PROBLEM - Host amssq46 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:14] PROBLEM - Host upload.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:15] PROBLEM - Host amslvs4 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:15] PROBLEM - Host amssq53 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:15] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [23:25:15] PROBLEM - Host wiktionary-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:15] PROBLEM - Host wikiquote-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:15] PROBLEM - Host ns2.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [23:25:24] RECOVERY - Host amssq58 is UP: PING OK - Packet loss = 0%, RTA = 95.14 ms [23:25:24] RECOVERY - Host amssq36 is UP: PING OK - Packet loss = 0%, RTA = 96.36 ms [23:25:24] RECOVERY - Host amssq44 is UP: PING OK - Packet loss = 0%, RTA = 96.39 ms [23:25:24] RECOVERY - Host knsq21 is UP: PING OK - Packet loss = 0%, RTA = 94.96 ms [23:25:24] RECOVERY - Host amssq60 is UP: PING OK - Packet loss = 0%, RTA = 96.20 ms [23:25:25] RECOVERY - Host knsq23 is UP: PING OK - Packet loss = 0%, RTA = 96.21 ms [23:25:25] RECOVERY - Host amssq49 is UP: PING OK - Packet loss = 0%, RTA = 96.29 ms [23:25:26] RECOVERY - Host knsq26 is UP: PING OK - Packet loss = 0%, RTA = 96.21 ms [23:25:26] RECOVERY - Host amssq55 is UP: PING OK - Packet loss = 0%, RTA = 96.19 ms [23:25:31] RECOVERY - Host knsq18 is UP: PING OK - Packet loss = 0%, RTA = 122.78 ms [23:25:32] RECOVERY - Host knsq28 is UP: PING OK - Packet loss = 0%, RTA = 121.53 ms [23:25:32] RECOVERY - Host knsq23 is UP: PING OK - Packet loss = 0%, RTA = 122.47 ms [23:25:32] RECOVERY - Host 
knsq26 is UP: PING OK - Packet loss = 0%, RTA = 122.50 ms [23:25:32] RECOVERY - Host knsq24 is UP: PING OK - Packet loss = 0%, RTA = 121.04 ms [23:25:32] RECOVERY - Host knsq27 is UP: PING OK - Packet loss = 0%, RTA = 122.60 ms [23:25:40] RECOVERY - Host ms6 is UP: PING OK - Packet loss = 0%, RTA = 122.43 ms [23:25:40] RECOVERY - Host knsq19 is UP: PING OK - Packet loss = 0%, RTA = 122.33 ms [23:26:34] RECOVERY - Host foundation-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 121.32 ms [23:34:50] !log re-enabling disabled puppet on srv253,srv235 [23:34:52] Logged the message, Master [23:35:18] RECOVERY - Puppet freshness on srv253 is OK: puppet ran at Mon Feb 25 23:35:00 UTC 2013 [23:36:21] RECOVERY - Puppet freshness on srv235 is OK: puppet ran at Mon Feb 25 23:35:49 UTC 2013 [23:36:42] hashar: jenkins service on gallium, critical for quite a while [23:36:48] and zuul-server too [23:36:57] :( [23:37:07] PROCS CRITICAL: 0 processes with args 'jenkins' [23:37:12] but thats like 32 days [23:37:14] looking [23:37:14] New patchset: Lcarr; "Changing redirect to correctly direct to icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50838 [23:37:23] New patchset: Reedy; "Bug 43329 - Enable SearchExtraNs extension on Commons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47302 [23:37:45] mutante: zuul is running https://integration.mediawiki.org/zuul/status [23:37:49] mutante: jenkins too. [23:37:58] mutante: so that must be the nagios check :/ [23:38:06] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50838 [23:38:21] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47302 [23:39:01] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable SearchExtraNS on commonswiki' [23:39:03] Logged the message, Master [23:40:57] New patchset: Reedy; "Configure SearchExtraNS" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50839 [23:44:19] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/50839 [23:44:38] New patchset: Lcarr; "warn is invalid - shoudl be warning" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50840 [23:44:46] paravoid: if someone edited a bunch of massive templates, then it will be high for a while [23:45:28] !log reedy synchronized wmf-config/ [23:45:30] Logged the message, Master [23:45:58] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50840 [23:47:09] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay NULL seconds [23:49:06] it sometimes jumps to 80k or so and goes back down so the warning could be premature is some cases, though 1M means some crazy editing is going on [23:51:13] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Cant find record in page_restrictions on query. Default da [23:53:08] anyone wants to fix the memcached/marmontel check? 
:) [23:53:17] RobH was going to handle it but apparently he's busy [23:53:54] RECOVERY - Puppet freshness on knsq26 is OK: puppet ran at Mon Feb 25 23:53:25 UTC 2013 [23:54:13] !log knsq26,knsq28 - deleting puppet lock file, fix puppet runs [23:54:14] Logged the message, Master [23:54:48] RECOVERY - Puppet freshness on knsq28 is OK: puppet ran at Mon Feb 25 23:54:35 UTC 2013 [23:54:56] New patchset: Pyoungmeister; "this too shall be quiet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50842 [23:55:54] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50842
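Several alerts in this log — "PROCS OK: 2 processes with args ircecho", and the long-standing "PROCS CRITICAL: 0 processes with args 'jenkins'" even though Jenkins and Zuul were answering — are process-count checks that match on the command line. The Python below is a rough equivalent, not the real check_procs plugin, and shows why such a check can disagree with the service actually being up: it only counts argument-string matches.

    #!/usr/bin/env python3
    # Rough equivalent of the "N processes with args X" checks quoted in this
    # log: count processes whose command line contains a substring and map the
    # count to Nagios exit codes. If the substring never appears in the visible
    # arguments, the check reports 0 even while the service itself is healthy,
    # which is what the jenkins/zuul discussion above suggests happened.
    import os
    import sys
    import psutil   # third-party: pip install psutil

    def count_with_args(needle):
        count = 0
        for proc in psutil.process_iter():
            if proc.pid == os.getpid():
                continue   # don't count this check itself
            try:
                if needle in " ".join(proc.cmdline()):
                    count += 1
            except (psutil.AccessDenied, psutil.NoSuchProcess):
                continue
        return count

    if __name__ == "__main__":
        needle = sys.argv[1] if len(sys.argv) > 1 else "jenkins"
        n = count_with_args(needle)
        if n > 0:
            print("PROCS OK: %d processes with args %s" % (n, needle))
            sys.exit(0)
        print("PROCS CRITICAL: %d processes with args %s" % (n, needle))
        sys.exit(2)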