[00:07:37] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 00:07:34 UTC 2013 [00:08:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [00:08:07] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 00:08:04 UTC 2013 [00:09:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [00:34:07] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 00:33:57 UTC 2013 [00:34:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [00:52:40] springle: hey, around? [00:53:00] ori-l, hi [00:54:19] can i poll you for advice? i'm wondering how you'd implement something like a capped collection regimen in mysql/mariadb. a capped collection is simply a table that you prune to max N rows (or you specify a historical reach in terms of some timestamp column) [00:55:31] we store some analytics data in mysql / mariadb because analysts' familiarity with SQL and the ability to intersect the data with data from the production databases outweighs the poor fit of mysql for volatile time-series log data [00:56:04] ori-l, triggers i guess. or a nicely indexed cron job [00:56:30] how often would you execute the triggers? on every insert? [00:56:43] does N have to be accurate? or can it vary a bit [00:56:50] it can vary [00:57:03] "shouldn't grow *much* larger than .." [00:57:31] trigger after insert [00:57:48] if it can vary i'd probably prefer a cron job [00:58:14] if it's a cron job, how do you determine an optimal batch size? [00:58:22] just to keep it simple -- we don't appear to use much in the way of triggers, and things like online schema change like not having triggers around [00:58:55] optimal batch size for pruning deletes? [00:58:56] i'm currently pruning a table that bloated to 40 million rows without me realizing it by having a bash script that deletes from ... limit 1000; sleep 5; in a loop. I'm guessing there are more sophisticated approaches [00:59:21] er, limit 100000 [00:59:41] ori-l, nibbling rather than chunking may help. check out pt-archiver [00:59:54] pt-archiver --purge .... iirc [01:00:19] oh, this looks like a good fit for me [01:00:21] it will be nice about throttling batch size to allow slaves to keep up [01:01:07] we have v2.2 installed most places already [01:01:48] sweet, thanks for pointing me in that direction [01:01:54] no worries [01:04:57] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 01:04:55 UTC 2013 [01:05:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [01:35:57] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 01:35:52 UTC 2013 [01:36:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [01:58:13] (03PS1) 10Jalexander: Replace public key for jamesofur [operations/puppet] - 10https://gerrit.wikimedia.org/r/79304 [02:05:07] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 02:05:05 UTC 2013 [02:06:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [02:15:11] !log LocalisationUpdate completed (1.22wmf12) at Fri Aug 16 02:15:11 UTC 2013 [02:15:23] Logged the message, Master [02:33:37] PROBLEM - twemproxy process on mw1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
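A minimal sketch of the pt-archiver approach springle suggests above, run from cron instead of the delete/sleep loop. The host, schema, table name and retention window are hypothetical placeholders, and the exact flags should be checked against the pt-archiver version actually installed (v2.2 per the discussion):

    # nibble old rows out of an indexed time-series table in small transactions,
    # pausing between batches and backing off if a replica falls behind
    pt-archiver \
        --source h=db1234.example.org,D=analytics,t=SomeEventLog \
        --purge \
        --where "timestamp < NOW() - INTERVAL 90 DAY" \
        --limit 1000 --txn-size 1000 \
        --sleep 1 \
        --max-lag 5s --check-slave-lag h=db1235.example.org

The --where clause assumes an index on the timestamp column, in line with springle's "nicely indexed cron job" remark.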
[02:34:27] RECOVERY - twemproxy process on mw1046 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [02:34:37] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 02:34:31 UTC 2013 [02:35:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [02:35:15] !log LocalisationUpdate completed (1.22wmf13) at Fri Aug 16 02:35:14 UTC 2013 [02:35:26] Logged the message, Master [02:41:57] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [02:48:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.094 second response time [02:57:56] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Aug 16 02:57:56 UTC 2013 [02:58:07] Logged the message, Master [03:02:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:03:27] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 03:03:24 UTC 2013 [03:04:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [03:04:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.959 second response time [03:07:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:35:37] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 03:35:36 UTC 2013 [03:36:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [03:50:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.905 second response time [03:58:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:04:17] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 04:04:11 UTC 2013 [04:04:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.073 second response time [04:05:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [04:07:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:08:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.953 second response time [04:11:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:15:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.816 second response time [04:18:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:35:37] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 04:35:30 UTC 2013 [04:36:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [04:44:57] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [04:50:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.380 second response time [04:52:28] (03CR) 10Nemo bis: "The fix in docs is at https://gerrit.wikimedia.org/r/#/c/79281/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79279 (owner: 10Nemo bis) [04:56:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [04:58:48] morning apergos :) [04:58:55] morning [05:04:57] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 05:04:55 UTC 2013 [05:05:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [05:07:01] moin [05:08:57] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [05:18:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.231 second response time [05:19:37] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:20:37] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [05:27:57] (03PS1) 10ArielGlenn: make sure varnish.pyconf gets created for holmium (was breaking puppet) [operations/puppet] - 10https://gerrit.wikimedia.org/r/79307 [05:29:04] (03CR) 10ArielGlenn: [C: 032] make sure varnish.pyconf gets created for holmium (was breaking puppet) [operations/puppet] - 10https://gerrit.wikimedia.org/r/79307 (owner: 10ArielGlenn) [05:32:27] RECOVERY - Puppet freshness on holmium is OK: puppet ran at Fri Aug 16 05:32:18 UTC 2013 [05:32:47] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 05:32:43 UTC 2013 [05:33:07] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [05:39:38] (03PS1) 10ArielGlenn: varnish::monitoring::ganglia wants empty params for holmium [operations/puppet] - 10https://gerrit.wikimedia.org/r/79308 [05:40:45] (03CR) 10ArielGlenn: [C: 032] varnish::monitoring::ganglia wants empty params for holmium [operations/puppet] - 10https://gerrit.wikimedia.org/r/79308 (owner: 10ArielGlenn) [05:43:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:43:59] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [05:43:59] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [05:43:59] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [05:43:59] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [05:43:59] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours [05:44:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [05:44:45] !log downgraded puppetmaster to previous release, new minor update had significantly increased CPU requirements and DoSed stafford [05:44:58] Logged the message, Master [05:48:37] * Nemo_bis dances the cute lonely easy typofix dance to attract attention https://gerrit.wikimedia.org/r/#/c/79279/ [06:03:58] (03PS1) 10ArielGlenn: a couple of hardy hosts out there need timeout package (e.g. erzurumi) [operations/puppet] - 10https://gerrit.wikimedia.org/r/79311 [06:05:31] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [06:06:44] (03CR) 10ArielGlenn: [C: 032] a couple of hardy hosts out there need timeout package (e.g. 
erzurumi) [operations/puppet] - 10https://gerrit.wikimedia.org/r/79311 (owner: 10ArielGlenn) [06:10:01] RECOVERY - Puppet freshness on erzurumi is OK: puppet ran at Fri Aug 16 06:09:57 UTC 2013 [06:15:01] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:01] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:01] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:01] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:01] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:02] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [06:22:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:23:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [06:32:51] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 06:32:42 UTC 2013 [06:33:31] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [06:52:10] (03PS1) 10ArielGlenn: vars in ganglia_new::configuration need to be set before aggregator uses them [operations/puppet] - 10https://gerrit.wikimedia.org/r/79314 [06:54:16] (03CR) 10ArielGlenn: [C: 032] vars in ganglia_new::configuration need to be set before aggregator uses them [operations/puppet] - 10https://gerrit.wikimedia.org/r/79314 (owner: 10ArielGlenn) [06:57:23] apergos: are you still replacing h310 with h710s on swift? [06:57:32] yes [06:59:26] (03PS1) 10ArielGlenn: Revert "vars in ganglia_new::configuration need to be set before aggregator uses them" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79315 [07:01:11] (03PS1) 10Faidon: ceph: make check_ceph_health failures CRITICAL [operations/puppet] - 10https://gerrit.wikimedia.org/r/79317 [07:01:27] (03CR) 10ArielGlenn: [C: 032] Revert "vars in ganglia_new::configuration need to be set before aggregator uses them" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79315 (owner: 10ArielGlenn) [07:01:32] (03CR) 10Faidon: [C: 032 V: 032] ceph: make check_ceph_health failures CRITICAL [operations/puppet] - 10https://gerrit.wikimedia.org/r/79317 (owner: 10Faidon) [07:08:08] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [07:31:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:31:48] PROBLEM - Puppet freshness on hooft is CRITICAL: No successful Puppet run in the last 10 hours [07:32:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [07:32:48] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 07:32:39 UTC 2013 [07:33:08] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [07:38:41] (03PS1) 10ArielGlenn: correct parameter name in ganglia_new::configuration for cache_misc [operations/puppet] - 10https://gerrit.wikimedia.org/r/79319 [07:41:05] (03CR) 10ArielGlenn: [C: 032] correct parameter name in ganglia_new::configuration for cache_misc [operations/puppet] - 10https://gerrit.wikimedia.org/r/79319 (owner: 10ArielGlenn) [07:43:02] RECOVERY - Puppet freshness on hooft is OK: puppet ran at Fri Aug 16 
07:42:56 UTC 2013 [07:47:32] RECOVERY - Puppet freshness on pdf2 is OK: puppet ran at Fri Aug 16 07:47:23 UTC 2013 [07:48:42] RECOVERY - Puppet freshness on pdf3 is OK: puppet ran at Fri Aug 16 07:48:35 UTC 2013 [07:51:44] (03PS1) 10Amire80: Add tyv.wikipedia to projects with visualeditor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79321 [07:53:02] RECOVERY - Puppet freshness on ssl1004 is OK: puppet ran at Fri Aug 16 07:52:57 UTC 2013 [07:53:32] RECOVERY - NTP on pdf3 is OK: NTP OK: Offset -0.07942116261 secs [08:00:42] PROBLEM - Host sq41 is DOWN: PING CRITICAL - Packet loss = 100% [08:05:41] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [08:09:34] !log sq41 hardware issues, see rt 5615 [08:09:46] Logged the message, Master [08:21:03] http://en.wikipedia.beta.wmflabs.org/ is down :( [08:21:06] can anybody take a look? [08:21:31] OK [08:22:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [08:27:31] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: Offset unknown [08:34:51] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 08:34:41 UTC 2013 [08:36:10] zeljkof: should be back up now. there was some maintenance work done on the network storage setup in labs that necessitated restarting instances, and not all of them were restarted [08:37:34] RECOVERY - NTP on pdf3 is OK: NTP OK: Offset 0.1925952435 secs [08:38:38] zeljkof: it just so happens that it wouldn't have helped in this particular case, but in general the way to investigate fatals from the beta cluster is by logging in to one of the instances in the project and running tail -100 /home/wikipedia/logs/fatal.log [08:38:57] could you document that somewhere? 
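For reference, a minimal sketch of the beta-cluster fatal-log check ori-l describes above; the instance name is a hypothetical placeholder, while the log path is the one quoted in the conversation:

    # log in to any instance in the beta (deployment-prep) project -- hostname is an assumption
    ssh deployment-bastion.pmtpa.wmflabs
    # show the last 100 fatals, as described above
    tail -100 /home/wikipedia/logs/fatal.log
    # or follow new fatals live while reproducing a problem
    tail -f /home/wikipedia/logs/fatal.log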
[08:40:18] ori-l: thanks [08:40:35] I am on a conference right now, running a workshop :) [08:41:05] good luck [08:43:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:44:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [08:51:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:52:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [08:56:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:57:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [08:57:22] (03PS1) 10Mark Bergsma: Set req.hash_ignore_busy = true for PURGE requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/79322 [09:04:35] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: Offset unknown [09:05:15] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [09:21:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:22:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [09:32:45] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 09:32:37 UTC 2013 [09:33:15] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [09:35:39] (03CR) 10TTO: [C: 04-1] "Redundant to Icbdc2117a02f2fedf4d0ff7839f458d10c6bad50" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/77269 (owner: 10Catrope) [09:36:48] !log rolling reboot of ceph ms-be nodes [09:36:59] Logged the message, Master [09:37:16] stupid nagios [09:37:50] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:38:10] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN 8291 pgs degraded: 5432 pgs stuck unclean: recovery 149736186/902017146 degraded (16.600%): 24/143 in osds are down [09:38:10] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN 8291 pgs degraded: 5432 pgs stuck unclean: recovery 149736186/902017146 degraded (16.600%): 24/143 in osds are down [09:38:20] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN 8291 pgs degraded: 5440 pgs stuck unclean: recovery 149736186/902017146 degraded (16.600%): 24/143 in osds are down [09:38:30] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [09:39:20] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [09:41:10] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 4.31 ms [09:45:07] (03PS1) 10Faidon: sysctlfile: add newline for value invocations [operations/puppet] - 10https://gerrit.wikimedia.org/r/79323 [09:45:28] (03CR) 10Faidon: [C: 032 V: 032] sysctlfile: add newline for value invocations [operations/puppet] - 10https://gerrit.wikimedia.org/r/79323 (owner: 10Faidon) [09:48:40] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [09:49:11] (03PS1) 10TTO: Set up flood flag on shwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79324 [09:49:50] PROBLEM - Host ms-be1004 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:10] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 7.21 ms [09:51:20] PROBLEM - Puppetmaster HTTPS on stafford 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:51:50] RECOVERY - Host ms-be1004 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [09:52:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [09:53:21] PROBLEM - Host ms-be1012 is DOWN: PING CRITICAL - Packet loss = 100% [09:54:50] RECOVERY - Host ms-be1012 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [09:57:50] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [09:58:00] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [09:58:50] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [09:59:50] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 1.92 ms [10:00:20] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [10:05:15] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [10:09:05] PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 10 hours [10:14:05] PROBLEM - NTP on ms-be1006 is CRITICAL: NTP CRITICAL: Offset unknown [10:14:25] PROBLEM - NTP on ms-be1007 is CRITICAL: NTP CRITICAL: Offset unknown [10:19:05] RECOVERY - NTP on ms-be1006 is OK: NTP OK: Offset 0.000519990921 secs [10:19:25] RECOVERY - NTP on ms-be1007 is OK: NTP OK: Offset 0.0009069442749 secs [10:20:25] RECOVERY - Host ms-be1005 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [10:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [10:24:25] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [10:24:25] PROBLEM - Host ms-be1009 is DOWN: PING CRITICAL - Packet loss = 100% [10:24:51] PROBLEM - Host ms-be1011 is DOWN: PING CRITICAL - Packet loss = 100% [10:25:45] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [10:26:15] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 3.25 ms [10:26:15] RECOVERY - Host ms-be1010 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [10:32:45] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 10:32:36 UTC 2013 [10:33:05] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK [10:33:05] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK [10:33:15] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [10:33:15] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK [10:41:04] PROBLEM - NTP on ms-be1011 is CRITICAL: NTP CRITICAL: Offset unknown [10:45:04] RECOVERY - NTP on ms-be1011 is OK: NTP OK: Offset -0.0002146959305 secs [11:06:10] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [11:15:14] (03CR) 10Reedy: [C: 031] Make SpecialPages Titlecase in misc::maintenance::updatequerypages [operations/puppet] - 10https://gerrit.wikimedia.org/r/79279 (owner: 10Nemo bis) [11:15:33] apergos: About? Any chance you could merge https://gerrit.wikimedia.org/r/#/c/79279/ please? 
[11:16:49] looking [11:17:15] Fixing the casing of some parameters passed to the updateSpecialPages maintenance script [11:17:30] RECOVERY - NTP on pdf3 is OK: NTP OK: Offset 0.1517374516 secs [11:17:34] I saw a little of that discussion yesterday, yep [11:18:07] (03CR) 10ArielGlenn: [C: 032] Make SpecialPages Titlecase in misc::maintenance::updatequerypages [operations/puppet] - 10https://gerrit.wikimedia.org/r/79279 (owner: 10Nemo bis) [11:18:40] what host does that run on? [11:18:58] terbium [11:19:44] someone has disabled puppet runs on there temporarily [11:20:20] within the last 24 hours so it would be best (since I don't know who it was) to wait for them to re-enable [11:21:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:22:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [11:26:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:27:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [11:29:16] Reedy: can you look at the shell request in #tech from a little bit ago, it was for an event starting soon [11:29:24] (if no one already did) [11:32:41] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 11:32:37 UTC 2013 [11:33:10] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [11:43:31] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: Offset unknown [11:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:52:43] (03PS2) 10Amire80: Add tyv.wikipedia to projects with visualeditor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79321 [11:53:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [12:04:24] (03PS1) 10ArielGlenn: fix up torrus conf file generation for squid/varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/79327 [12:05:25] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [12:05:35] RECOVERY - NTP on pdf3 is OK: NTP OK: Offset 0.1613060236 secs [12:06:59] (03CR) 10ArielGlenn: [C: 032] fix up torrus conf file generation for squid/varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/79327 (owner: 10ArielGlenn) [12:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [12:25:46] (03PS1) 10ArielGlenn: fix up torrus aggregation for varnish/squid caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/79328 [12:28:03] (03CR) 10ArielGlenn: [C: 032] fix up torrus aggregation for varnish/squid caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/79328 (owner: 10ArielGlenn) [12:32:03] (03PS1) 10Akosiaris: Refactoring nrpe module (round 2/??)
[operations/puppet] - 10https://gerrit.wikimedia.org/r/79329 [12:32:45] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 12:32:41 UTC 2013 [12:33:25] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [12:36:39] (03PS1) 10ArielGlenn: wrong fix for torrus cache aggregation, trying again [operations/puppet] - 10https://gerrit.wikimedia.org/r/79330 [12:38:01] (03CR) 10ArielGlenn: [C: 032] wrong fix for torrus cache aggregation, trying again [operations/puppet] - 10https://gerrit.wikimedia.org/r/79330 (owner: 10ArielGlenn) [12:39:34] RECOVERY - Puppet freshness on manutius is OK: puppet ran at Fri Aug 16 12:39:31 UTC 2013 [12:42:54] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [12:51:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:52:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [12:56:34] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: Offset unknown [12:57:53] (03PS2) 10Akosiaris: Refactoring nrpe module (round 2/??) [operations/puppet] - 10https://gerrit.wikimedia.org/r/79329 [12:58:21] \o/ [13:04:03] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [13:06:33] RECOVERY - NTP on pdf3 is OK: NTP OK: Offset 0.1598188877 secs [13:21:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [13:32:43] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 13:32:36 UTC 2013 [13:33:03] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [13:34:33] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: Offset unknown [13:58:32] RECOVERY - NTP on pdf3 is OK: NTP OK: Offset 0.1853570938 secs [14:05:43] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [14:15:15] (03PS1) 10Akosiaris: helium backup director/storage server [operations/puppet] - 10https://gerrit.wikimedia.org/r/79334 [14:21:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [14:24:41] (03CR) 10Mark Bergsma: "This is directly to port 8080 as well, like gitblit. I realized this may be circumventing some apache redirect rules etc though?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/79204 (owner: 10Mark Bergsma) [14:32:43] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 14:32:38 UTC 2013 [14:33:43] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [14:35:52] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:32] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.91 ms [14:41:42] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: Offset unknown [14:56:59] (03CR) 10Reedy: [C: 032] Add tyv.wikipedia to projects with visualeditor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79321 (owner: 10Amire80) [14:57:12] (03Merged) 10jenkins-bot: Add tyv.wikipedia to projects with visualeditor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79321 (owner: 10Amire80) [15:05:35] !log reedy synchronized database lists files: Ad tyvwiki to visualeditor.dblist [15:05:47] Logged the message, Master [15:05:52] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [15:06:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:54] (03PS2) 10Akosiaris: helium backup director/storage server [operations/puppet] - 10https://gerrit.wikimedia.org/r/79334 [15:07:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.143 second response time [15:09:02] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:19] !log reedy synchronized wmf-config/ 'touch' [15:09:30] Logged the message, Master [15:09:49] (03PS3) 10Akosiaris: helium backup director/storage server [operations/puppet] - 10https://gerrit.wikimedia.org/r/79334 [15:11:19] (03CR) 10Akosiaris: [C: 032] helium backup director/storage server [operations/puppet] - 10https://gerrit.wikimedia.org/r/79334 (owner: 10Akosiaris) [15:16:41] (03PS1) 10Akosiaris: Removing duplicate definition in backup.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/79337 [15:17:35] (03CR) 10Akosiaris: [C: 032] Removing duplicate definition in backup.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/79337 (owner: 10Akosiaris) [15:22:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.187 second response time [15:31:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [15:32:42] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 15:32:37 UTC 2013 [15:32:52] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [15:41:20] (03PS1) 10Akosiaris: bacula::console is a class not a definition [operations/puppet] - 10https://gerrit.wikimedia.org/r/79338 [15:46:03] (03CR) 10Akosiaris: [C: 032] bacula::console is a class not a definition [operations/puppet] - 10https://gerrit.wikimedia.org/r/79338 (owner: 10Akosiaris) [15:51:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:52:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.984 second response time [15:55:24] PROBLEM 
- Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:56:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [16:02:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:02:46] (03PS1) 10RobH: initial deploy of ytterbium for eventual gerrit replacement [operations/puppet] - 10https://gerrit.wikimedia.org/r/79339 [16:03:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [16:03:25] sbernardin: can you look at r300 and let me know if the disk are internal? I don't remember (A4 U3) [16:03:26] (03PS1) 10RobH: Revert "ytterbium install troubleshooting" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79340 [16:03:39] (03CR) 10RobH: [C: 032] initial deploy of ytterbium for eventual gerrit replacement [operations/puppet] - 10https://gerrit.wikimedia.org/r/79339 (owner: 10RobH) [16:04:13] (03CR) 10RobH: [C: 032 V: 032] Revert "ytterbium install troubleshooting" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79340 (owner: 10RobH) [16:06:09] cmjohnson1: yes ...disks are internal [16:06:31] ok..well shit we are going to need to schedule this [16:07:14] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [16:07:44] RECOVERY - NTP on pdf3 is OK: NTP OK: Offset 0.1675481796 secs [16:11:04] (03CR) 10BBlack: [C: 032 V: 032] "Looks like an improvement to me!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79322 (owner: 10Mark Bergsma) [16:11:28] anyone using brewster at the moment? (robh) [16:12:08] cmjohnson1: all of labs uses brewster [16:12:16] so if it has to come down, you need to notify labs users [16:12:21] its our apt repo for everything [16:12:35] is my understanding [16:12:46] Ryan_Lane: would know, but he is also traveling and may not be awake and about [16:12:47] what for [16:12:59] the one time i had to take down brewster, i was told we had to notify. [16:13:00] just do it [16:13:04] just do it [16:13:53] (03CR) 10BBlack: [C: 032 V: 032] "Let's get this out there so we can see things better" [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 (owner: 10BryanDavis) [16:14:09] cmjohnson1: but im not using it right now, which was your intial question, so yea i guess go for it. if folks get mad you can point them to mark and faidon ;] [16:14:19] feel free [16:14:27] k [16:14:35] need to notify someone every time I need to wipe my ass, can't be arsed to ;) [16:15:24] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [16:15:24] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [16:15:24] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [16:15:24] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [16:15:24] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [16:15:25] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [16:18:06] (03PS6) 10BBlack: Add ganglia monitoring for vhtcpd. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 (owner: 10BryanDavis) [16:18:24] (03CR) 10BBlack: [C: 032 V: 032] "rebased and submitting...." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 (owner: 10BryanDavis) [16:19:17] sbernardin: are you there? [16:19:56] cmjohnson1: yes [16:20:26] RobH: no worry about notifying people about brewster [16:20:27] okay..i am going to be shutting brewster down now...the disk you need to swap has the S/N 9TE02FT8 ..should be the first but plz double check [16:20:43] Ryan_Lane: awesome, thx [16:20:49] it doesn't cause instance creation failure anymore, it just makes the first puppet run fail [16:20:51] !log merged vhtcpd ganglia stuff and the VCL for PURGE hash_ignore_busy into puppet [16:20:54] ahhh [16:20:58] much better =] [16:21:00] !log shutting down brewster to replace bad disk [16:21:03] Logged the message, Master [16:21:13] Logged the message, Master [16:21:14] (03PS1) 10Mark Bergsma: Route list outbound mail to a separate SMTP transport [operations/puppet] - 10https://gerrit.wikimedia.org/r/79342 [16:21:55] (03CR) 10jenkins-bot: [V: 04-1] Route list outbound mail to a separate SMTP transport [operations/puppet] - 10https://gerrit.wikimedia.org/r/79342 (owner: 10Mark Bergsma) [16:21:56] bblack: no need to !log on puppet merges [16:22:20] well, it felt a little scary :) [16:22:26] either one might break stuff [16:22:43] (03PS2) 10Mark Bergsma: Route list outbound mail to a separate SMTP transport [operations/puppet] - 10https://gerrit.wikimedia.org/r/79342 [16:23:02] yeah [16:23:08] that VCL change I was thinking of merging monday ;) [16:23:21] i mean, i won't be around for very much longer [16:23:54] PROBLEM - Host brewster is DOWN: PING CRITICAL - Packet loss = 100% [16:24:09] WHO KILLED BREWSTER [16:24:30] hahahaha [16:24:33] :D [16:24:56] ahahaha [16:24:58] mark: I'll keep an eye on things and revert it if necc, I'll be around a lot today/tomorrow [16:25:06] ok ;) [16:25:12] i did test on one box, seemed ok there [16:25:15] <^d> cmjohnson1: Were you going to tweak testsearch1002 further? [16:26:03] ^d wasn't planning on it...but I can change the settings if you like [16:26:18] <^d> Hrm, well it's still freaking according to ganglia :\ [16:26:48] cmjohnson1: the suggestion in that thread was to turn off c1e & cstates [16:26:57] (03PS3) 10Mark Bergsma: Route list outbound mail to a separate SMTP transport [operations/puppet] - 10https://gerrit.wikimedia.org/r/79342 [16:26:59] okay..i will look at it in a few [16:27:26] paravoid: didn't see that [16:27:32] * cmjohnson1 goes back to look [16:27:52] http://comments.gmane.org/gmane.linux.hardware.dell.poweredge/43926 [16:28:34] others said that this didn't work for them though [16:29:14] cmjohnson1: swapped correct drive on brewster...powered back up [16:29:19] so... needs some tinkering [16:29:58] also a bios/firmware upgrade wouldn't hurt... 
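An aside on the c1e/C-state suggestion above: the linked thread talks about BIOS settings, but a rough software-side sketch (an assumption on my part, not something from the thread) for inspecting and capping C-states on a Linux host looks like this:

    # list the C-states the idle driver currently exposes, and the driver's cap
    cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
    cat /sys/module/intel_idle/parameters/max_cstate
    # to cap C-states without a BIOS change, add kernel parameters and reboot,
    # e.g. in /etc/default/grub (then run update-grub):
    #   GRUB_CMDLINE_LINUX_DEFAULT="... intel_idle.max_cstate=1 processor.max_cstate=1"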
[16:30:28] agreed..i have a bios f/w udate for the 420 but not the 320..will need to get that [16:32:25] (03PS4) 10Mark Bergsma: Route list outbound mail to a separate SMTP transport [operations/puppet] - 10https://gerrit.wikimedia.org/r/79342 [16:33:04] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 16:32:55 UTC 2013 [16:33:14] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [16:46:31] (03PS5) 10Mark Bergsma: Route list outbound mail to a separate SMTP transport [operations/puppet] - 10https://gerrit.wikimedia.org/r/79342 [16:48:16] RECOVERY - Host brewster is UP: PING OK - Packet loss = 0%, RTA = 27.19 ms [16:48:37] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: Offset unknown [16:50:04] (03PS1) 10Akosiaris: include role::backup::config [operations/puppet] - 10https://gerrit.wikimedia.org/r/79344 [16:51:27] !log brewster going down again [16:51:38] Logged the message, Master [16:52:56] PROBLEM - Host brewster is DOWN: PING CRITICAL - Packet loss = 100% [16:55:30] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [16:57:35] (03PS6) 10Mark Bergsma: Route list outbound mail to a separate SMTP transport [operations/puppet] - 10https://gerrit.wikimedia.org/r/79342 [16:57:43] (03CR) 10Akosiaris: [C: 032] include role::backup::config [operations/puppet] - 10https://gerrit.wikimedia.org/r/79344 (owner: 10Akosiaris) [17:02:06] sberanrdin: i am going to shut it down again and remove the new disk and power on [17:02:30] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 17:02:24 UTC 2013 [17:03:30] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [17:03:46] sbernardin: ^ misspelled [17:04:33] cmjohnson1: OK...so we're booting with just sdb [17:04:40] yes [17:04:44] (03PS7) 10Mark Bergsma: Route list outbound mail to a separate SMTP transport [operations/puppet] - 10https://gerrit.wikimedia.org/r/79342 [17:05:45] (03CR) 10Mark Bergsma: [C: 032] Route list outbound mail to a separate SMTP transport [operations/puppet] - 10https://gerrit.wikimedia.org/r/79342 (owner: 10Mark Bergsma) [17:07:57] cmjohnson1: OK...done [17:09:46] sbernardin can you plug the console in and tell me what's on the screen [17:12:56] cmjonhson1: Says GRUB [17:13:19] cmjohnson1: with blinking cursor [17:15:47] k..will have to reconfig grub to dev/sdb [17:16:34] to both [17:17:22] right but i will do sda after the new disk [17:22:12] RECOVERY - Host brewster is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [17:29:55] !log demon synchronized php-1.22wmf13/extensions/CirrusSearch/CirrusSearch.body.php [17:30:06] Logged the message, Master [17:32:42] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 17:32:37 UTC 2013 [17:33:22] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [17:35:38] (03PS1) 10Asher: remove peter from icinga [operations/puppet] - 10https://gerrit.wikimedia.org/r/79350 [17:35:59] (03CR) 10Asher: [C: 032 V: 032] remove peter from icinga [operations/puppet] - 10https://gerrit.wikimedia.org/r/79350 (owner: 10Asher) [17:38:36] awww [17:42:20] RECOVERY - NTP on pdf3 is OK: NTP OK: Offset 0.1768844128 secs [18:04:37] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [18:11:17] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: Offset unknown [18:22:17] RECOVERY - NTP on pdf3 is OK: NTP OK: Offset 0.2645692825 secs [18:23:17] paravoid: 
hi [18:32:57] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 18:32:48 UTC 2013 [18:33:37] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [18:44:05] paravoid: can you please --git-tag the libdclass ? [18:44:49] paravoid: your gpg key is required [18:45:50] paravoid: if you push that to the gerrit repo it would be very awesome. I'm about to push some new changes to add a backward compatibility dtree so there is a new version coming up [18:58:11] PROBLEM - Puppetmaster HTTPS on sockpuppet is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [19:05:49] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [19:08:19] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: Offset unknown [19:15:19] RECOVERY - NTP on pdf3 is OK: NTP OK: Offset 0.188898325 secs [19:32:39] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 19:32:37 UTC 2013 [19:32:49] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [19:36:39] !log shutting down brewster yet again to replace the disk [19:36:50] Logged the message, Master [19:37:34] sbernardin: once it turns off do your thing [19:39:06] PROBLEM - Host brewster is DOWN: PING CRITICAL - Packet loss = 100% [19:45:06] RECOVERY - Puppetmaster HTTPS on sockpuppet is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.738 second response time [19:45:44] cmjohnson1: drive has been swapped [19:46:40] are you consoled in? [19:47:05] Now I am [19:47:20] what do you see? I see a problem [19:47:45] sbernardin: reboot to bios please [19:48:19] OK [19:48:52] cmjohnson1: change boot device? [19:49:09] i want to check it first [19:49:26] Ok [19:49:33] it is already set to sata b [19:49:56] Yup [19:50:17] well at least i know changing the mapper worked [20:05:34] ^d: haz new gerrit host. [20:05:41] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [20:06:00] <^d> RobH: You rock, thanks! 
[20:06:11] it was more than just me [20:06:24] puppet issues were abound and both apergos and LeslieCarr helped a lot [20:06:35] and then asher spotted something too [20:09:01] <^d> Oh wow, well thanks everyone :) [20:09:51] PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 10 hours [20:09:51] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [20:19:41] PROBLEM - NTP on ytterbium is CRITICAL: NTP CRITICAL: No response from NTP server [20:24:21] RECOVERY - Host brewster is UP: PING OK - Packet loss = 0%, RTA = 28.36 ms [20:30:40] (03PS1) 10Jalexander: verify wikivoyage.org for google [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79426 [20:31:41] RECOVERY - NTP on ytterbium is OK: NTP OK: Offset 0.04874968529 secs [20:32:21] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: Offset unknown [20:32:41] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 20:32:40 UTC 2013 [20:33:41] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [20:43:11] (03PS1) 10Reedy: Remove symlinkis from 1.22wmf6 through 1.22wmf8 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79428 [20:43:41] (03CR) 10Reedy: [C: 032] Remove symlinkis from 1.22wmf6 through 1.22wmf8 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79428 (owner: 10Reedy) [20:43:50] (03Merged) 10jenkins-bot: Remove symlinkis from 1.22wmf6 through 1.22wmf8 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79428 (owner: 10Reedy) [20:45:35] !log reedy synchronized docroot and w [20:45:46] Logged the message, Master [21:05:37] (03Abandoned) 10Reedy: verify wikivoyage.org for google [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79426 (owner: 10Jalexander) [21:05:48] !log reedy synchronized docroot and w [21:06:21] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [21:30:48] is there something like a live 'show processlist;' for mysql/mariadb? i know i can do watch -n1 'mysql ... "show processlist;"' but even that is not capturing super quick inserts [21:31:40] basically a human readable tail -f of the general query log, except not requiring shell access to the database host [21:33:11] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 21:33:07 UTC 2013 [21:33:21] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [21:42:05] ^d, yt? [21:45:49] !log olivneh synchronized php-1.22wmf12/extensions/CoreEvents 'Updating CoreEvents to master for I08f66861b (1/2)' [21:46:01] Logged the message, Master [21:46:14] !log olivneh synchronized php-1.22wmf13/extensions/CoreEvents 'Updating CoreEvents to master for I08f66861b (2/2)' [21:46:26] Logged the message, Master [21:46:43] !log Timeouts on last two sync-dirs: srv281, mw1089, mw1173 [21:46:57] Logged the message, Master [21:49:29] greg-g: things look good; got 3 datapoints already; just trying out different browsers for extra safety [21:50:25] (03PS2) 10MaxSem: Rebuild localisation cache in several threads [operations/puppet] - 10https://gerrit.wikimedia.org/r/79231 [21:51:08] MaxSem: that's very cool [21:51:26] ori-l, reviews welcome:) [21:51:45] now I'm gonna break beta testing it;) [21:53:06] <^d> ori-l: sup? 
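One hedged answer to ori-l's processlist question above: MySQL/MariaDB can write the general query log to a table instead of a file, and any client with the right grants can then poll that table, so no shell on the database host is needed. The hostname below is a placeholder, SET GLOBAL needs the SUPER privilege, and the general log is expensive, so assume it is only switched on briefly on a test or analytics box:

    # route the general log to the mysql.general_log table and enable it
    mysql -h dbhost.example.org -e "SET GLOBAL log_output = 'TABLE'; SET GLOBAL general_log = 'ON';"
    # poll the most recent statements -- roughly a human-readable tail -f
    watch -n1 'mysql -h dbhost.example.org -e "SELECT event_time, LEFT(argument, 120) AS stmt FROM mysql.general_log ORDER BY event_time DESC LIMIT 20"'
    # turn it off again afterwards
    mysql -h dbhost.example.org -e "SET GLOBAL general_log = 'OFF';"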
[21:54:19] ^d: you had unmerged changes for Cirrus on wmf12 [21:54:26] but i looked closer and realized it was a change and then a revert [21:54:30] so assumed it was fine [21:54:40] <^d> Yeah, no 12 wikis will be getting cirrus. [21:54:47] <^d> they'll all be on 13 before they start getting it. [21:54:51] it might be good to sync that anyway [21:54:54] <^d> fwiw, I can't even ping those 3 boxes, so they're likely down or something. [21:54:55] even though it's a no-op [21:55:32] ^d: i know, but i'm trying to habituate ops into taking them out of the deployment dsh group when they pull them from rotation :P [21:55:41] by not treating it as par the course [21:56:11] ori-l: awesome (sorry for delay, wifi dropped for me for a second) [21:56:25] ^d: want me to sync it? [21:56:30] <^d> Doing it now. [21:56:35] thanks [21:56:41] i merged it, just didn't sync [21:57:22] !log demon synchronized php-1.22wmf12/extensions/CirrusSearch 'No actual changes, was just a change + revert' [21:57:34] Logged the message, Master [21:57:48] danke [21:57:56] <^d> yw [22:04:04] so are puppet updates still disabled (on terbium at least)? [22:06:01] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [22:07:37] (03CR) 10MaxSem: "Just tried this on beta - it worked and indeed improved performance." [operations/puppet] - 10https://gerrit.wikimedia.org/r/79231 (owner: 10MaxSem) [22:14:41] RECOVERY - NTP on pdf3 is OK: NTP OK: Offset 0.1374323368 secs [22:18:00] so, cp3004 in esams has a bad disk, which (among other minor issues) killed its backend varnish process's ability to run [22:18:27] not having dealt with this before, it looks like I can pull it from the list in manifests/role/cache.pp to get traffic off of it for now [22:18:32] any better advice on that? :) [22:20:40] (03PS1) 10BBlack: Remove cp3004 from esams upload pool (bad disk) [operations/puppet] - 10https://gerrit.wikimedia.org/r/79439 [22:32:41] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 22:32:35 UTC 2013 [22:32:53] (03CR) 10BBlack: [C: 032 V: 031] "Looks good based on historical depool commits, let's try this out..." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/79439 (owner: 10BBlack) [22:33:01] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [22:43:18] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [22:53:08] PROBLEM - check google safe browsing for wikibooks.org on google is CRITICAL: Connection timed out [22:53:08] PROBLEM - check google safe browsing for wikipedia.org on google is CRITICAL: Connection timed out [22:53:18] PROBLEM - check google safe browsing for wikisource.org on google is CRITICAL: Connection refused [22:53:38] PROBLEM - check google safe browsing for wikiversity.org on google is CRITICAL: Connection refused [22:53:38] PROBLEM - check google safe browsing for wikinews.org on google is CRITICAL: Connection timed out [22:53:48] PROBLEM - check google safe browsing for wiktionary.org on google is CRITICAL: Connection timed out [22:53:48] PROBLEM - check google safe browsing for wikiquotes.org on google is CRITICAL: Connection timed out [22:53:48] PROBLEM - check google safe browsing for wikimedia.org on google is CRITICAL: Connection timed out [22:53:58] PROBLEM - check google safe browsing for mediawiki.org on google is CRITICAL: Connection timed out [22:55:58] RECOVERY - check google safe browsing for wikibooks.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3847 bytes in 4.744 second response time [22:56:08] RECOVERY - check google safe browsing for wikipedia.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3997 bytes in 5.620 second response time [22:56:18] RECOVERY - check google safe browsing for wikisource.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3851 bytes in 4.198 second response time [22:56:28] RECOVERY - check google safe browsing for wikinews.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3843 bytes in 0.223 second response time [22:56:38] RECOVERY - check google safe browsing for wiktionary.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3851 bytes in 0.165 second response time [22:56:38] RECOVERY - check google safe browsing for wikiversity.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3854 bytes in 5.925 second response time [22:56:39] RECOVERY - check google safe browsing for wikiquotes.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3844 bytes in 4.287 second response time [22:56:48] RECOVERY - check google safe browsing for wikimedia.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3996 bytes in 0.089 second response time [22:56:48] RECOVERY - check google safe browsing for mediawiki.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3846 bytes in 3.274 second response time [23:05:30] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [23:08:40] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: Offset unknown [23:16:40] RECOVERY - NTP on pdf3 is OK: NTP OK: Offset 0.1890879869 secs [23:16:57] (03PS1) 10RobH: disabling absent volunteers access (if either of these guys comes back to do stuff, give them their access back.) [operations/puppet] - 10https://gerrit.wikimedia.org/r/79443 [23:17:49] (03CR) 10RobH: [C: 032] disabling absent volunteers access (if either of these guys comes back to do stuff, give them their access back.) 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/79443 (owner: 10RobH) [23:32:50] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Fri Aug 16 23:32:46 UTC 2013 [23:33:30] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [23:58:38] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: Offset unknown