[00:20:03] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 184 seconds
[00:20:11] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 188 seconds
[00:31:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:35:56] PROBLEM - MySQL Replication Heartbeat on db1028 is CRITICAL: CRIT replication delay 183 seconds
[00:36:32] PROBLEM - MySQL Replication Heartbeat on db1041 is CRITICAL: CRIT replication delay 190 seconds
[00:37:00] PROBLEM - MySQL Slave Delay on db1024 is CRITICAL: CRIT replication delay 190 seconds
[00:37:26] PROBLEM - MySQL Slave Delay on db1041 is CRITICAL: CRIT replication delay 191 seconds
[00:37:27] PROBLEM - MySQL Replication Heartbeat on db1024 is CRITICAL: CRIT replication delay 190 seconds
[00:38:38] RECOVERY - MySQL Slave Delay on db1024 is OK: OK replication delay 0 seconds
[00:39:05] RECOVERY - MySQL Slave Delay on db1041 is OK: OK replication delay 0 seconds
[00:39:06] RECOVERY - MySQL Replication Heartbeat on db1024 is OK: OK replication delay 0 seconds
[00:39:23] RECOVERY - MySQL Replication Heartbeat on db1028 is OK: OK replication delay 0 seconds
[00:40:00] RECOVERY - MySQL Replication Heartbeat on db1041 is OK: OK replication delay 0 seconds
[00:44:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.383 seconds
[00:48:00] * Damianz wonders how much sex there is at wikipedia o.0
[00:50:02] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[00:50:39] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[00:59:47] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 215 seconds
[01:00:05] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 225 seconds
[01:12:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 271 seconds
[01:13:53] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds
[01:18:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:20:56] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[01:21:05] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[01:25:26] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours
[01:25:27] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[01:30:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds
[01:47:20] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: Puppet has not run in the last 10 hours
[02:04:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:14:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.454 seconds
[02:26:24] !log LocalisationUpdate completed (1.21wmf6) at Sun Jan 6 02:26:24 UTC 2013
[02:26:35] Logged the message, Master
[02:46:35] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 194 seconds
[02:46:53] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 196 seconds
[02:48:46] !log LocalisationUpdate completed (1.21wmf7) at Sun Jan 6 02:48:45 UTC 2013
[02:48:56] Logged the message, Master
[02:57:05] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[02:57:32] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[03:14:20] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[03:24:41] > In order to set downtime / ack alerts you need to login which is done over https
[03:26:00] http://wikitech.wikimedia.org/view/Nagios#Authentication
[03:26:12] (the quote is from elsewhere on the same page)
[03:26:17] pgehres: ^^
[04:18:41] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 188 seconds
[04:20:29] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds
[04:43:51] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[04:43:51] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[06:06:58] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[06:06:59] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[06:06:59] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[06:06:59] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[06:06:59] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[06:06:59] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[07:56:23] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours
[09:53:21] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:21] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours
[11:26:22] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[11:43:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:48:24] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: Puppet has not run in the last 10 hours
[11:59:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds
[12:22:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:28:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds
[12:33:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:42:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds
[13:01:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:06:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[13:14:48] New patchset: Umherirrender; "Remove old aftv5 permissions from mediawiki-config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42401
[13:15:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:15:16] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[13:18:33] New patchset: Umherirrender; "AFTv5: skip rollbacker group on wikis without that" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42402
[13:23:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds
[13:39:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:45:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds
[13:57:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:06:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.042 seconds
[14:19:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:22:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.544 seconds
[14:40:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:45:03] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[14:45:04] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[14:49:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds
[14:57:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:02:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds
[15:03:36] Change merged: Matthias Mullie; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42401
[15:23:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:39:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds
[16:07:51] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[16:07:52] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[16:07:52] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[16:07:52] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[16:07:52] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[16:07:52] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[16:11:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:23:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds
[16:29:54] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours
[16:51:48] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours
[16:56:45] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours
[16:56:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:12:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[17:23:42] !log LocalisationUpdate completed (1.21wmf6) at Sun Jan 6 17:23:41 UTC 2013
[17:23:54] Logged the message, Master
[17:28:45] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 195 seconds
[17:29:13] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 205 seconds
[17:44:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:46:43] !log LocalisationUpdate completed (1.21wmf7) at Sun Jan 6 17:46:42 UTC 2013
[17:46:52] Logged the message, Master
[17:57:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds
[17:57:25] PROBLEM - Puppet freshness on search30 is CRITICAL: Puppet has not run in the last 10 hours
[18:00:24] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[18:00:33] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[18:30:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:40:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.602 seconds
[19:11:26] jesus christ is ganglia a pile of awful
[19:12:21] yes
[19:13:01] quote from the official docs:
[19:13:21] "time_max: maximum time in seconds between metric collection calls; The exact nature of this element is unclear, as is its relationship to the 'collect_every' configuration directive in your pyconf for the module. For all intents and purposes, this element seems... useless."
[19:13:31] http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_gmond_python_modules
[19:15:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:16:16] Good that the docs make it clear that it's unclear
[19:29:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds
[19:54:53] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[19:58:23] eek, so we use SQLite for server-side Puppet?
[19:59:44] You are going to die
[20:03:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:15:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.913 seconds
[20:47:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:00:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.142 seconds
[21:01:17] MaxSem: no
[21:01:22] MaxSem: mysql
[21:01:46] ah, so https://wikitech.wikimedia.org/view/Puppet is wrong...
[21:02:35] yes
[21:02:53] the puppetmaster config is all in the puppet repo I believe
[21:27:35] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours
[21:27:36] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[21:35:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:46:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.248 seconds
[21:49:38] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: Puppet has not run in the last 10 hours
[22:07:09] New patchset: Dereckson; "Throttle rule for UCSF event." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42495
[22:07:51] New patchset: Dereckson; "Throttle rule for UCSF event." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42495
[22:09:18] New patchset: Dereckson; "Throttle rule for UCSF event." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42495
[22:09:57] New review: Dereckson; "PS2: Removing previous bug reference" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42495
[22:20:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:31:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.570 seconds
[22:33:46] New patchset: Dereckson; "Throttle rule for UCSF event." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42495
[22:43:06] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42495
[22:46:32] !log reedy synchronized wmf-config/throttle.php
[22:46:44] Logged the message, Master
[22:49:59] reedy@fenari:/home/wikipedia/common$ mwscript mcc.php commonswiki
[22:49:59] > delete commonswiki:acctcreate:ip:205.154.255.252
[22:49:59] MemCached error
[22:50:02] Isn't that helpful
[22:54:07] New patchset: Dereckson; "(bug 43687) Namespace configuration for meta." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42497
[22:54:27] Ah, of course. It's broken by the memcached config changes
[22:55:18] TimStarling: Isn't most of Antoines comment at the start now wrong? https://noc.wikimedia.org/conf/highlight.php?file=mc.php
[22:55:20] New patchset: Dereckson; "(bug 43687) Namespace configuration for meta." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42497
[22:57:33] yes
[22:59:26] New patchset: Reedy; "Remove old initial config relating to old memcached config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42498
[23:00:52] New patchset: Reedy; "Remove old initial comments relating to old memcached config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42498
[23:01:06] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42498
[23:06:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:08:54] Reedy: maybe you could merge Tim's https://gerrit.wikimedia.org/r/#/c/41562 ? would be nice to be on next deploy anyway
[23:10:34] There's quite a bit of time till the next branch ;)
[23:16:38] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[23:19:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.290 seconds
[23:54:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds