[00:07:45] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [00:35:25] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 00:35:16 UTC 2013 [00:46:40] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [01:06:52] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [01:32:42] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 01:32:37 UTC 2013 [01:32:52] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [01:55:01] (PS1) Faidon: exim: force Gmail over IPv4 [operations/puppet] - https://gerrit.wikimedia.org/r/79753 [01:55:35] (CR) Faidon: [C: 2] exim: force Gmail over IPv4 [operations/puppet] - https://gerrit.wikimedia.org/r/79753 (owner: Faidon) [01:58:38] IPv4ForEver! [01:58:49] amazing isn't it [02:06:19] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [02:15:46] !log LocalisationUpdate completed (1.22wmf12) at Mon Aug 19 02:15:45 UTC 2013 [02:15:51] Logged the message, Master [02:29:49] PROBLEM - DPKG on mw1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:30:49] RECOVERY - DPKG on mw1046 is OK: All packages OK [02:31:09] PROBLEM - twemproxy process on mw1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:32:39] PROBLEM - Disk space on mw1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:32:49] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 02:32:42 UTC 2013 [02:33:19] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [02:33:29] RECOVERY - Disk space on mw1046 is OK: DISK OK [02:33:59] RECOVERY - twemproxy process on mw1046 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [02:37:09] PROBLEM - twemproxy process on mw1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:49] PROBLEM - DPKG on mw1046 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:39] RECOVERY - DPKG on mw1046 is OK: All packages OK [02:39:59] RECOVERY - twemproxy process on mw1046 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [02:41:19] !log LocalisationUpdate completed (1.22wmf13) at Mon Aug 19 02:41:19 UTC 2013 [02:41:25] Logged the message, Master [02:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [03:00:26] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Aug 19 03:00:26 UTC 2013 [03:00:32] Logged the message, Master [03:02:09] RECOVERY - Puppet freshness on terbium is OK: puppet ran at Mon Aug 19 03:02:06 UTC 2013 [03:06:15] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [03:10:55] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [03:18:35] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:21:25] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [03:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output 
matched 400 - 336 bytes in 0.137 second response time [03:29:35] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:31:26] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [03:32:45] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 03:32:36 UTC 2013 [03:33:15] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [03:35:34] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:38:24] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [04:07:00] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [04:18:10] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [04:18:10] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [04:18:10] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [04:18:10] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [04:18:10] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [04:18:11] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [04:18:47] !log authdns-update: DKIM & DMARC (both no-op, for now) [04:18:52] Logged the message, Master [04:22:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:23:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [04:27:22] (PS1) Faidon: exim: add DKIM for wikimedia.org domains [operations/puppet] - https://gerrit.wikimedia.org/r/79754 [04:32:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:32:50] RECOVERY - Puppet
freshness on mexia is OK: puppet ran at Mon Aug 19 04:32:40 UTC 2013 [04:33:00] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [04:33:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [04:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [05:06:25] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [05:32:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:33:05] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 05:33:03 UTC 2013 [05:33:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [05:33:25] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [05:34:53] Elsie: oh heh, thanks :) [05:35:54] :-) [05:36:42] (PS2) Faidon: exim: add DKIM for wikimedia.org domains [operations/puppet] - https://gerrit.wikimedia.org/r/79754 [05:37:25] (added bz/rt to commit message) [05:43:03] spf, dkim, adsp, dmarc [05:43:16] let's pollute our txt records with all kinds of crap [05:45:09] RECOVERY - search indices - check lucene status page on search27 is OK: HTTP OK: HTTP/1.1 200 OK - 351 bytes in 0.055 second response time [05:48:59] !log authdns-update: add ADSP (dkim=unknown), DKIM policy record for lists (o=~) [05:49:05] Logged the message, Master [05:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [06:06:32] PROBLEM
- Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [06:22:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:23:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [06:32:42] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 06:32:37 UTC 2013 [06:33:32] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [06:43:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:44:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [06:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [06:56:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:57:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.139 second response time [07:05:30] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [07:08:40] RECOVERY - search indices - check lucene status page on search1009 is OK: HTTP OK: HTTP/1.1 200 OK - 369 bytes in 0.004 second response time [07:20:53] !log destroyed leaked session with ID 796c2b... 
[07:20:59] Logged the message, Master [07:21:04] !log switching text-varnish pybal group back to squids [07:21:09] Logged the message, Master [07:26:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:27:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [07:32:50] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 07:32:42 UTC 2013 [07:33:30] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [07:43:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:44:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [07:52:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [07:55:27] !log tstarling synchronized php-1.22wmf12/includes/GlobalFunctions.php 'hack for session bug' [07:55:32] Logged the message, Master [08:01:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:02:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [08:03:41] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:05:31] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [08:09:08] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [08:11:37] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [08:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [08:32:47] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 08:32:40 UTC 2013 [08:33:37] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [08:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [09:03:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:04:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [09:08:57] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [09:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.146 second response time [09:32:47] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 09:32:39 UTC 2013 [09:32:57] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [09:50:12] (PS2) Mark Bergsma: Make sure Set-Cookie responses are not cacheable, and log violations [operations/puppet] - https://gerrit.wikimedia.org/r/79762
[09:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:54:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.049 second response time [10:00:41] (PS3) Mark Bergsma: Make sure Set-Cookie responses are not cacheable, and log violations [operations/puppet] - https://gerrit.wikimedia.org/r/79762 [10:08:04] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [10:08:58] (PS4) Mark Bergsma: Make sure Set-Cookie responses are not cacheable, and log violations [operations/puppet] - https://gerrit.wikimedia.org/r/79762 [10:15:07] I want to add a libav package to apt.wikimedia.org, whats the workflow for that again? its package from 12.04 + adding one patch [10:21:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [10:32:54] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 10:32:53 UTC 2013 [10:33:04] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [10:47:11] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [10:49:42] (PS1) Akosiaris: Fix ERB typo in bacula fileset template [operations/puppet] - https://gerrit.wikimedia.org/r/79769 [10:51:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:53:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [11:07:14] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [11:18:00] (CR) Akosiaris: [C: 2] Fix ERB typo in bacula fileset template [operations/puppet] -
https://gerrit.wikimedia.org/r/79769 (owner: Akosiaris) [11:18:21] (PS1) Andrey Kiselev: (bug 52997) $wgCategoryCollation to 'uca-ru' on all Russian-language [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79770 [11:19:55] (PS1) Akosiaris: Adding Default pool to backup [operations/puppet] - https://gerrit.wikimedia.org/r/79771 [11:21:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:22:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [11:22:36] (CR) Akosiaris: [C: 2] Adding Default pool to backup [operations/puppet] - https://gerrit.wikimedia.org/r/79771 (owner: Akosiaris) [11:24:58] (PS1) Akosiaris: Adding helium.eqiad.wmnet as backed-up host [operations/puppet] - https://gerrit.wikimedia.org/r/79772 [11:25:50] (PS3) Nemo bis: exim: add DKIM for wikimedia.org domains [operations/puppet] - https://gerrit.wikimedia.org/r/79754 (owner: Faidon) [11:25:51] (CR) Akosiaris: [C: 2] Adding helium.eqiad.wmnet as backed-up host [operations/puppet] - https://gerrit.wikimedia.org/r/79772 (owner: Akosiaris) [11:26:39] diff?
[11:27:00] ah [11:27:01] found it [11:28:48] (PS1) Tim Starling: Bug 53032: change the name of the session cookie [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79773 [11:30:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:32:54] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 11:32:48 UTC 2013 [11:33:09] (PS1) Akosiaris: Tabs vs spaces in site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/79774 [11:33:14] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [11:33:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [11:33:34] (CR) Akosiaris: [C: 2] Tabs vs spaces in site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/79774 (owner: Akosiaris) [11:37:28] (CR) Ori.livneh: [C: 1] "Can't say I am able to follow through the logical consequences of this entirely, but it looks all right, and there don't seem to be too ma" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79773 (owner: Tim Starling) [11:38:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:40:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [11:46:43] (CR) Mark Bergsma: [C: 1] Bug 53032: change the name of the session cookie [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79773 (owner: Tim Starling) [11:49:29] (CR) Mark Bergsma: [C: 2] Make sure Set-Cookie responses are not cacheable, and log violations [operations/puppet] - https://gerrit.wikimedia.org/r/79762 (owner: Mark Bergsma) [11:51:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:52:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK:
Status line output matched 400 - 336 bytes in 3.000 second response time [11:53:51] (CR) Tim Starling: [C: 2] Bug 53032: change the name of the session cookie [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79773 (owner: Tim Starling) [11:54:00] (Merged) jenkins-bot: Bug 53032: change the name of the session cookie [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79773 (owner: Tim Starling) [11:58:28] !log tstarling synchronized wmf-config/CommonSettings.php 'changing session cookie name due to bug 53032' [11:58:33] Logged the message, Master [12:05:20] (PS2) Faidon: Re-enable multiwrites for Ceph [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79198 [12:07:51] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [12:12:34] (CR) Faidon: [C: 2] Re-enable multiwrites for Ceph [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79198 (owner: Faidon) [12:17:06] !log faidon synchronized wmf-config/filebackend.php 'ceph as a filebackend' [12:17:10] Logged the message, Master [12:18:14] !log tstarling synchronized php-1.22wmf12/includes/GlobalFunctions.php 'revert temp hack, superseded' [12:18:19] Logged the message, Master [12:21:02] works [12:21:04] quiet [12:21:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:22:01] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:22:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [12:22:42] :) [12:22:51] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [12:29:49] I wonder if I should switch masters [12:29:55] or do that tomorrow maybe [12:30:40] masters?
[12:30:54] the filebackend master is where the (MW) reads come from [12:31:05] now it's swift, ceph is just getting writes [12:31:22] oh, right [12:33:31] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 12:33:28 UTC 2013 [12:33:51] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [12:43:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:44:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [12:45:37] (PS1) Akosiaris: Fix various typos in bacula module [operations/puppet] - https://gerrit.wikimedia.org/r/79777 [12:47:41] (CR) Akosiaris: [C: 2] Fix various typos in bacula module [operations/puppet] - https://gerrit.wikimedia.org/r/79777 (owner: Akosiaris) [12:49:50] !log hopefully final swiftrepl (swift->ceph thumbs) run on ms-fe1002 [12:49:55] Logged the message, Master [12:50:11] that script has turned out useful hasn't it [12:50:17] very [12:50:18] kudos [12:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:53:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [12:58:50] akosiaris: two filesystems? [13:00:30] akosiaris: yeah. One per storage device. [13:00:53] I started with something relatively small (some 10Ts) for both and will grow them as necessary [13:01:32] so NFS across the inter-dc link? :) [13:01:56] nope [13:02:06] gods forbid that [13:02:16] which storage devices then? [13:03:00] The bacula-sd can have multiple storage devices (file-based, tape, fifo, dvd, etc) [13:03:48] so I have created 2 storage devices, that can be used interchangeably (or not).
It allows for more simultaneous jobs plus some flexibility [13:03:59] (PS3) Aude: Add DataTypes extension [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76481 [13:04:02] 2 storage devices on nas1001-a? [13:04:10] (PS4) Aude: Add DataTypes extension [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76481 [13:04:29] i can send for example "Archive" jobs just to one for example and have different retention policies [13:05:07] Yeah. Two netapp flexible volumes (baculasd1, baculasd2) [13:05:14] ok [13:06:25] (CR) Aude: "added comments in patchset 3 and rebased in patchset 4" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76481 (owner: Aude) [13:06:40] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [13:12:16] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [13:16:45] !log temporarily running pt-kill on s5 slaves for slow wikidata queries [13:16:50] Logged the message, Master [13:22:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [13:30:24] (PS1) Mark Bergsma: Make sure that responses with any of private, no-cache, no-store are never cached [operations/puppet] - https://gerrit.wikimedia.org/r/79782 [13:33:20] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 13:33:10 UTC 2013 [13:33:20] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:33:40] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [13:34:10] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [13:38:49] (CR) Mark Bergsma: [C: 2] Make sure that responses with any of private, no-cache, no-store are never cached [operations/puppet] - https://gerrit.wikimedia.org/r/79782 (owner: Mark Bergsma) [13:51:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:52:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.147 second response time [14:06:24] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [14:13:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:14:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [14:18:44] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [14:18:44] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [14:18:44] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [14:18:44] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [14:18:44] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [14:18:45] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [14:21:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [14:32:44] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19
14:32:41 UTC 2013 [14:33:24] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [14:44:52] !log authdns-update: remove trailing dot from ulsfo reverse zone names [14:44:57] Logged the message, Master [14:52:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:53:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [15:07:37] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [15:13:18] (CR) Akosiaris: [C: 2] Restore/Migrate Job templates [operations/puppet] - https://gerrit.wikimedia.org/r/79795 (owner: Akosiaris) [15:21:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [15:32:47] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Aug 19 15:32:43 UTC 2013 [15:33:37] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [15:39:21] (PS1) Faidon: Undecom mexia [operations/puppet] - https://gerrit.wikimedia.org/r/79798 [15:39:53] (CR) Faidon: [C: 2 V: 2] Undecom mexia [operations/puppet] - https://gerrit.wikimedia.org/r/79798 (owner: Faidon) [15:45:04] (PS3) Faidon: Add an authdns module & associated role classes [operations/puppet] - https://gerrit.wikimedia.org/r/74119 [15:59:40] (PS1) Akosiaris: Present our cert to clients, autolabel volumes [operations/puppet] - https://gerrit.wikimedia.org/r/79800 [16:00:39] (CR) Akosiaris: [C: 2] Present our cert to clients, autolabel volumes [operations/puppet] - https://gerrit.wikimedia.org/r/79800 (owner: Akosiaris) [16:03:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10
seconds [16:07:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [16:10:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:30] Reedy: around? [16:11:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [16:13:00] csteipp: is oauth being enabled? [16:13:49] aude: his window just started, he seems to be working in a terminal right now, and looking at the "how to deploy code" wiki page, so I think yes ;) [16:14:02] cool :) [16:14:25] haha [16:14:30] greg-g the wmf spy [16:14:34] paravoid: :) [16:14:42] if anyone wants to look at https://gerrit.wikimedia.org/r/#/c/76481/ some time, that would be helpful [16:14:46] I sit behind him now, and didn't want to disturb, so I pulled an NSA [16:14:59] hahaha [16:15:07] i looked at it and tested it again and it seems fine [16:15:27] oauth to test wikis :) [16:15:36] aude: Yep [16:15:42] (PS1) Faidon: Temporary setup of new authdns servers [operations/puppet] - https://gerrit.wikimedia.org/r/79805 [16:18:51] alright, back later or tomorrow...
[16:22:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.375 second response time [16:26:22] (PS1) Petr Onderka: Implemented diff dumps [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/79808 [16:30:51] (CR) Mark Bergsma: [C: 1] Add an authdns module & associated role classes [operations/puppet] - https://gerrit.wikimedia.org/r/74119 (owner: Faidon) [16:30:57] woo [16:31:49] (CR) Faidon: [C: 2] Add an authdns module & associated role classes [operations/puppet] - https://gerrit.wikimedia.org/r/74119 (owner: Faidon) [16:32:15] (CR) Mark Bergsma: [C: 1] "SO UGLY" [operations/puppet] - https://gerrit.wikimedia.org/r/79805 (owner: Faidon) [16:32:58] !log csteipp synchronized php-1.22wmf13/extensions/CentralAuth 'Update CentralAuth to master for OAuth hooks' [16:33:03] Logged the message, Master [16:33:46] Hmm... anyone know if mw1046 more broken than normal? mw1046: rsync: mkstemp "/usr/local/apache/common-local/php-1.22wmf13/extensions/CentralAuth/.CentralAuthHooks.php.nkO77o" failed: Read-only file system (30) [16:33:59] huh [16:34:23] (that was just a new error... since the last time I synched stuff) [16:36:14] I think this one and the next one are relevant ;) https://wikitech.wikimedia.org/wiki/Projects#Basic_monitoring_.26_alerting [16:38:51] csteipp, I can't write to /home on it - definitely a problem as there's nothing on it in SAL or icinga [16:54:15] (PS1) CSteipp: Add OAuth to test wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79816 [16:55:15] AaronSchulz: https://gerrit.wikimedia.org/r/#/c/79816/ [16:58:31] ^d: so, I guess with gerrit we do need to keep apache in between eh, behind varnish [16:59:45] <^d> What rewriting causes the problem?
[17:00:00] i mean [17:00:04] we COULD do that rewriting in varnish [17:00:23] but generally it's better to let the webserver do that [17:00:27] it's kinda hacky in varnish [17:00:41] <^d> Mmk, no problem then. [17:00:53] although, is it the only reason to keep apache? [17:01:00] those gitweb redirects? [17:03:10] (PS2) CSteipp: Add OAuth to test wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79816 [17:04:23] AaronSchulz: ^ [17:07:52] (CR) Faidon: [C: 2] Temporary setup of new authdns servers [operations/puppet] - https://gerrit.wikimedia.org/r/79805 (owner: Faidon) [17:08:14] (CR) Aaron Schulz: [C: 1] Add OAuth to test wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79816 (owner: CSteipp) [17:08:25] AaronSchulz: morning [17:09:19] AaronSchulz: ceph is back in prod since a few hours ago [17:09:40] AaronSchulz: filebackend-ops has 47 "failed sync check", I guess that's normal? [17:12:49] !log removing sq41 from dsh node groups and pybal [17:12:54] Logged the message, Master [17:13:04] probably [17:14:04] (PS2) Petr Onderka: Implemented diff dumps [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/79808 [17:15:25] (CR) CSteipp: [C: 2] Add OAuth to test wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79816 (owner: CSteipp) [17:15:34] (Merged) jenkins-bot: Add OAuth to test wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79816 (owner: CSteipp) [17:15:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [17:21:38] (PS1) Faidon: authdns: fix private file path [operations/puppet] - https://gerrit.wikimedia.org/r/79821 [17:21:50] (CR) Faidon: [C: 2] authdns: fix private file path [operations/puppet] -
https://gerrit.wikimedia.org/r/79821 (owner: Faidon) [17:22:10] (CR) Faidon: [V: 2] authdns: fix private file path [operations/puppet] - https://gerrit.wikimedia.org/r/79821 (owner: Faidon) [17:22:36] is anyone working on https://rt.wikimedia.org/Ticket/Display.html?id=5614 ? [17:23:50] robla: I think mark already fixed it [17:24:07] robla: lemme find the patchset [17:24:47] paravoid: wonderful, thanks! bd808 ^^ [17:24:53] <^d> mark: Probably. [17:25:17] bd808, robla: afaik mark's theory was that it was https://gerrit.wikimedia.org/r/#/c/79322/ [17:25:44] bd808, robla: bblack merged it on friday and was watching it, he may be able to provide more insight [17:34:16] AaronSchulz: ping? [17:34:24] (see above) [17:35:31] !log csteipp synchronized php-1.22wmf12/extensions/CentralAuth 'Updating wmf12 CentralAuth to master for OAuth' [17:35:35] Logged the message, Master [17:35:57] (PS1) Faidon: git::clone: fix clones with unprivileged users [operations/puppet] - https://gerrit.wikimedia.org/r/79827 [17:36:13] paravoid: I said it was probably fine [17:36:13] that said, I can't look at the logs since I forgot to bring my key again [17:36:32] oh I didn't see the "probably", apologies [17:37:23] swiftrepl has finished, I'm running copy-missing.sh now [17:37:41] it copies a few objects, surprisingly [17:37:45] like mwstore://local-swift/local-public/2/29/Dunbar_United_colour.png [17:37:59] * AaronSchulz brought the drive and realized there was no key on it ;) [17:38:07] anyway, when that's done I'd like to switch masters [17:38:21] (PS1) Cmjohnson: decommissioning sq41 [operations/puppet] - https://gerrit.wikimedia.org/r/79829 [17:38:36] (CR) Faidon: [C: 2 V: 2] git::clone: fix clones with unprivileged users [operations/puppet] - https://gerrit.wikimedia.org/r/79827 (owner: Faidon) [17:39:39] any objections to that? [17:39:46] or considerations?
[17:40:47] I suppose its fine...wish I could see those logs though but it's probably not hugely important [17:40:55] * AaronSchulz will check when he gets home [17:42:26] !log csteipp synchronized php-1.22wmf12/extensions/CentralAuth 'revert ca update' [17:42:32] Logged the message, Master [17:46:32] (CR) Cmjohnson: [C: 2 V: 2] decommissioning sq41 [operations/puppet] - https://gerrit.wikimedia.org/r/79829 (owner: Cmjohnson) [17:51:03] !log apt: new geoip-database package version, from 2 years later [17:51:08] Logged the message, Master [17:51:36] Reedy: I'm a little late getting OAuth out. I'll ping you when I'm done. [17:51:47] paravoid: yeah, I cant thing of a big reason not to switch now [17:51:51] *think [17:52:08] I can, it's near the end of my day :) [17:52:12] :P [17:52:16] but nod, this week [17:52:29] well, not literally now, but soonish [17:53:01] (PS1) Nemo bis: Dereference unused category from ArticleFeedbackToolv5 en.wiki config [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79832 [17:56:13] !log rebooting rubidium, mexia, eeden -- kernel upgrade [17:56:18] Logged the message, Master [17:57:49] PROBLEM - Host rubidium is DOWN: CRITICAL - Host Unreachable (208.80.154.40) [17:57:59] PROBLEM - Host mexia is DOWN: PING CRITICAL - Packet loss = 100% [17:58:19] RECOVERY - Host mexia is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [17:59:39] RECOVERY - Host rubidium is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:03:54] (CR) Asher: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/78414 (owner: Manybubbles) [18:05:18] !log csteipp Started syncing Wikimedia installation... : [18:05:23] Logged the message, Master [18:11:53] <^d> manybubbles: All the changes merged through to Cirrus minus the one we were waiting on upstream for. When csteipp is done we'll push the updates to test2.
[18:12:01] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [18:12:05] <^d> We'll need an index rebuild right? [18:16:49] ^d: Out of curiosity, will CirrusSearch have the same daily index update as Lucene? [18:16:54] Or will it be faster/slower? [18:17:43] <^d> Way faster. Near realtime. [18:17:49] Sweeeeet. [18:18:07] <^d> At edit/delete/move time for many things, at linksupdate time for included content. [18:18:27] I have a pipe-dream of regex support at a per-character level. [18:18:27] . [18:21:24] (PS1) awjrichards: Update regex for login cookies per session cookie name change [operations/puppet] - https://gerrit.wikimedia.org/r/79837 [18:22:13] ops, mobile login is busted due to the recent change in session cookie names [18:22:16] i think https://gerrit.wikimedia.org/r/#/c/79837/ should fix the issue [18:22:22] (CR) MaxSem: [C: -1] "Please use [Ss]ession - not sure if it's going to stay current way in the future." [operations/puppet] - https://gerrit.wikimedia.org/r/79837 (owner: awjrichards) [18:23:55] Hmm [18:24:36] regex + session cookies seems ... less than ideal.
[18:27:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:24] (PS2) awjrichards: Update regex for login cookies per session cookie name change [operations/puppet] - https://gerrit.wikimedia.org/r/79837 [18:28:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [18:30:29] (CR) Mark Bergsma: [C: 2] Update regex for login cookies per session cookie name change [operations/puppet] - https://gerrit.wikimedia.org/r/79837 (owner: awjrichards) [18:31:31] thanks mark :) [18:33:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.997 second response time [18:44:58] bblack: So when a system has a bad disk, you want to check and see if a ticket is in that machine's datacenter specific queue in RT. If it is not, you can add it into those specific queues.
The name of server usually denotes location: https://wikitech.wikimedia.org/wiki/Server_naming_conventions [18:45:18] (you asked about it and i wanted to give you info before i forgot) [18:45:40] Also can check racktables for system location [18:45:50] not sure if we got you setup for access, if not we should [19:00:07] for dc, just use puppet/facter, $::site [19:01:02] RobH: turns out there's already a 6 week old ticket on that disk failure: https://rt.wikimedia.org/Ticket/Display.html?id=5443 [19:01:37] damn lazy dc engineer there [19:01:44] ^d: cool [19:01:44] hahaha [19:01:46] I just ran into it because the backend varnish was stuck/dead (:3128) due to the disk [19:01:52] hasn't shown up to work in 2 months [19:03:15] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: special, closed, wikimedia, wikinews, wikivoyage to 1.22wmf13 [19:03:20] Logged the message, Master [19:03:57] ^d: now that you've merged that I believe I'll run a reindex in labs [19:09:55] !log mw1046 reporting readonly filesystem [19:10:00] Logged the message, Master [19:10:21] I was wondering how to best report that... [19:11:02] sloooow [19:11:13] scap? [19:11:48] it's been running over 80 mins for me [19:11:48] Nope, doing anything on that host [19:12:30] disk is dying [19:13:07] * Reedy takes that as "ops know" [19:13:08] ;) [19:13:20] now they do :-) [19:13:46] yuck [19:14:58] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikisource, wikiversity and wikibooks to 1.22wmf13 [19:15:03] Logged the message, Master [19:15:14] akosiaris: Could you depool mw1046 if it hasn't been done already? [19:16:37] csteipp: I'd probably kill it [19:16:49] Presumably localisation cache updates have worked? 
[19:16:56] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wiktionary and wikiquote to 1.22wmf13 [19:17:00] worked/have been done [19:17:01] Logged the message, Master [19:17:12] (PS1) Reedy: Non 'pedia to 1.22wmf13 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79848 [19:17:37] (CR) Reedy: [C: 2] Non 'pedia to 1.22wmf13 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79848 (owner: Reedy) [19:17:47] (Merged) jenkins-bot: Non 'pedia to 1.22wmf13 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/79848 (owner: Reedy) [19:18:05] 1 Fatal error: Using $this when not in object context in /usr/local/apache/common-local/php-1.22wmf13/includes/specials/SpecialUpload.php on line 686 [19:18:07] Rarrgh [19:19:47] * Reedy suspects bawolff [19:20:18] Reedy: done [19:20:24] akosiaris: Thanks [19:23:15] https://gerrit.wikimedia.org/r/79850 [19:28:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [19:30:58] Reedy: Just killed it [19:31:24] Are you going to re scap as part of the normal deploy today? [19:31:36] Nope [19:31:41] Need to sync one file [19:31:59] I'd suggest you just run sync-dir php-1.22wmf12 and then do the same for 13 [19:32:23] I'm just about done [19:32:37] Cool. Let me know when you're done and I'll do that [19:35:04] !log depooled mw1046 [19:35:09] Logged the message, Master [19:38:55] !log reedy synchronized php-1.22wmf13/includes/specials/SpecialUpload.php [19:39:00] Logged the message, Master [19:39:49] i think that's me done now.. [19:42:19] Alright, I'm syncing..
[19:44:31] !log csteipp synchronized php-1.22wmf12 [19:44:35] Logged the message, Master [20:02:27] !log csteipp synchronized php-1.22wmf13 'sync again in case mw1046 errors messed up scap' [20:02:31] Logged the message, Master [20:07:00] <^d> manybubbles: I'll do the same for test2. [20:07:35] ^d: cool. it'll take a rebuild of the searchIndexConfig as well [20:07:46] <^d> mmk [20:31:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [20:38:48] Coren: I meant to mention this but completely forgot in our meeting, can you send an update to Labs-l summarizing the state of NFS? [20:39:49] manybubbles: ^d There looks to have just been a spew of Cirrus related fatals/warnings on the cluster [20:40:32] ^d: Can you point me to the documentation to read them? [20:40:42] Reedy, rather, can you point me to how to look at them? [20:40:57] I know I shell into something, oxygen, maybe, but don't know beyond that. 
[20:41:10] fluorine [20:41:23] Though, these you can view in the apache syslogs on fenari [20:41:41] tail -n 1000 /home/wikipedia/syslog/apache.log | grep Cirrus [20:41:50] PHP Fatal error: require() [function.require]: Failed opening required '/usr/local/apache/common-local/php-1.22wmf13/extensions/CirrusSearch/Elastica/lib/Elastica/Document.php' [20:41:51] etc [20:42:44] reedy@mw1207:~$ ls -al /usr/local/apache/common-local/php-1.22wmf13/extensions/CirrusSearch/Elastica/lib/Elastica/Document.php [20:42:45] ls: cannot access /usr/local/apache/common-local/php-1.22wmf13/extensions/CirrusSearch/Elastica/lib/Elastica/Document.php: No such file or directory [20:44:14] Screwy submodules almost [20:44:40] I wouldn't be surprised [20:44:42] !log reedy synchronized php-1.22wmf13/extensions/CirrusSearch/ [20:44:46] Logged the message, Master [20:47:29] Reedy: looks like shelling into fenari hangs for me. I'm sure I've got something misconfigured. [20:47:54] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [20:48:44] manybubbles, no agent? [20:50:05] MaxSem: looks like I hang trying to get to bast1001 - which is new, I think [20:52:44] just config issues.... [21:08:48] (PS1) Tim Landscheidt: Tools: Add python-beautifulsoup to exec_environ. [operations/puppet] - https://gerrit.wikimedia.org/r/79921 [21:11:22] (PS2) Tim Landscheidt: Tools: Add python-beautifulsoup to exec_environ [operations/puppet] - https://gerrit.wikimedia.org/r/79921 [21:35:40] (PS1) Ottomata: Adding kafka-mirror package for kafka-mirror init.d scripts. [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/79927 [21:38:15] (PS2) Ottomata: Adding kafka-mirror package for kafka-mirror init.d scripts.
[operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/79927 [22:01:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.136 second response time [22:03:34] (CR) coren: [C: 2] "Package addition." [operations/puppet] - https://gerrit.wikimedia.org/r/79921 (owner: Tim Landscheidt) [22:06:56] ... something is wrong with puppet-merge? [22:07:09] error: unable to unlink old 'modules/toollabs/manifests/exec_environ.pp' (Permission denied) [22:07:19] Which, when one is root, is really odd. [22:08:29] <^d> csteipp: It's just that one schema file for mw.org only, right? backend/schema/mysql/OAuth.sql? [22:09:08] <^d> Or AaronSchulz can answer :) [22:09:28] Hm. How odd. Shouldn't all of that be owned by gitpuppet? [22:10:18] !log chown -R gitpuppet /var/lib/git/operations/puppet on sockpuppet: some of those files were (erroneously) owned by root [22:10:23] Logged the message, Master [22:12:28] <^d> !log created oauth tables for mediawikiwiki [22:12:33] Logged the message, Master [22:16:47] ^d: looks fine [22:24:19] Ryan_Lane: do you want to weigh in on https://rt.wikimedia.org/Ticket/Display.html?id=4824 , if you have a preference? [22:29:11] well, it would be good if the networking issues with NAT for floating IPs was solved [22:29:22] by I haven't been able to track it down and mark didn't when he looked [22:30:11] !log demon synchronized php-1.22wmf13/extensions/CirrusSearch 'Cirrus to master' [22:30:15] Logged the message, Master [22:42:34] !log demon synchronized php-1.22wmf13/extensions/CirrusSearch 'Cirrus to master' [22:47:04] <^d> !log ES: reindexing test2wiki [22:47:08] Logged the message, Master [22:52:16] With Jeff out, anyone around who can handle https://rt.wikimedia.org/Ticket/Display.html?id=5580 for Fundraising? It's a quick grant request. [22:54:58] Faidon or Asher maybe?
[23:09:24] update of special pages is off now? [23:09:36] or the periods have been prolonged? [23:10:16] Isn't it done monthly? [23:10:23] That, and stuff was broken [23:10:33] it was done like in 3-4 days [23:10:50] s/in/every/ [23:11:17] monthday => "*/3" [23:11:23] hour => 5 [23:11:36] what is */3 ? [23:11:59] every 3rd? [23:12:15] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [23:12:25] http://en.wikipedia.org/wiki/Cron#Examples [23:12:36] mutante: is it in any public config file? [23:12:43] yes, every 3 days [23:12:49] It's in puppet [23:12:50] yes, in puppet [23:12:59] ./manifests/misc/maintenance.pp [23:13:11] class misc::maintenance::update_special_pages [23:13:15] so it doesn't run obviously [23:13:28] last update: 13. 8. 2013, 14:15 [23:13:29] command => "flock -n /var/lock/update-special-pages /usr/local/bin/update-special-pages > /home/wikipedia/logs/norotate/updateSpecialPages.log 2>&1", [23:13:37] uhm, yeah, i don't know about the commandline [23:14:13] Never happy [23:22:49] (PS7) Yuvipanda: Read routing tables from Redis [operations/puppet] - https://gerrit.wikimedia.org/r/78025 [23:24:42] anyway, in case it would be helpful to track down the issue - cs wikis lack the update [23:27:58] 1178 # Wrong log file location [23:27:58] 1179 class { misc::maintenance::update_special_pages: enabled => true } [23:28:06] 2762 # Broken cron jobs moved back to hume: [23:28:14] 2765 class { misc::maintenance::update_special_pages: enabled => false } [23:29:31] so, the enabled one is on hume in site.pp [23:29:52] not on the new host terbium [23:30:27] !createbug [23:31:23] cat: /home/wikipedia/logs/norotate/updateSpecialPages.log: No such file or directory [23:31:46] (PS1) Dr0ptp4kt: Add IP addresses for Smart Cambodia.
[operations/puppet] - https://gerrit.wikimedia.org/r/79953 [23:33:51] (CR) Dzahn: [C: 2] "content legit and posterous is dead" [operations/puppet] - https://gerrit.wikimedia.org/r/79466 (owner: Raimond Spekking) [23:33:58] (CR) Dr0ptp4kt: "Mark, Faidon, Asher: this is ready for implementation now, provided your approval." [operations/puppet] - https://gerrit.wikimedia.org/r/79953 (owner: Dr0ptp4kt) [23:35:38] (PS2) Dr0ptp4kt: Add IP addresses for Smart Cambodia. [operations/puppet] - https://gerrit.wikimedia.org/r/79953 [23:36:25] (CR) Dr0ptp4kt: "(Replaced spaces with tabs.)" [operations/puppet] - https://gerrit.wikimedia.org/r/79953 (owner: Dr0ptp4kt) [23:42:39] (PS3) Dr0ptp4kt: Add IP addresses for Smart Cambodia. [operations/puppet] - https://gerrit.wikimedia.org/r/79953 [23:43:01] (CR) Dr0ptp4kt: "(And updating comment to reflect new carrier name.)" [operations/puppet] - https://gerrit.wikimedia.org/r/79953 (owner: Dr0ptp4kt)