[00:01:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64005 [00:01:35] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:01:44] PROBLEM - DPKG on searchidx2 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:02:44] RECOVERY - DPKG on searchidx2 is OK: All packages OK [00:03:18] New patchset: Reedy; "Remove nomcom entries" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64002 [00:03:33] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64002 [00:03:34] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:04:44] !log reedy synchronized database lists files: [00:04:52] Logged the message, Master [00:05:57] !log reedy synchronized database lists files: [00:06:05] Logged the message, Master [00:06:26] New patchset: Catrope; "[WIP DO NOT MERGE] New Parsoid Varnish puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [00:07:54] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Thu May 16 00:07:46 UTC 2013 [00:07:54] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 00:07:52 UTC 2013 [00:08:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:09:04] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 00:09:01 UTC 2013 [00:09:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:10:04] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 00:10:02 UTC 2013 [00:10:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:04] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 00:10:59 UTC 2013 [00:11:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:11:54] RECOVERY - Puppet 
freshness on ms2 is OK: puppet ran at Thu May 16 00:11:48 UTC 2013 [00:12:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:12:34] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 00:12:31 UTC 2013 [00:13:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:13:40] !log puppetstoredconfigclean.rb ms2.pmtpa.wmnet [00:13:48] Logged the message, Master [00:15:04] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 00:14:54 UTC 2013 [00:15:34] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [00:16:08] New review: Ryan Lane; "Puppet config looks good. Someone else should likely check the vcl." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/63890 [00:17:19] New patchset: Catrope; "New Parsoid Varnish puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [00:17:59] New review: Faidon; "The comment on backend says "upload backends". I also doubt you need the If-Cached mechanism, this w..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [00:23:20] New review: GWicke; "Re If-cache: I'll drop it in a follow-up VCL changeset I am currently working on." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63890 [00:25:54] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [00:26:01] green [00:26:04] err [00:26:19] that was supposed to be a search. today is not my day for IRC skill [00:26:54] PROBLEM - Puppet freshness on colby is CRITICAL: No successful Puppet run in the last 10 hours [00:31:44] PROBLEM - Host ocg3 is DOWN: PING CRITICAL - Packet loss = 100% [00:32:35] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:33:17] New patchset: GWicke; "WIP: Parsoid VCL refinements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64008 [00:33:45] New patchset: GWicke; "WIP: Parsoid VCL refinements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64008 [00:34:34] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:35:16] New patchset: Dzahn; "decom barium, it moved to frack, per talk with Jeff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64010 [00:36:37] !log puppetstoredconfigclean.rb ocg3.pmtpa.wmnet [00:36:44] Logged the message, Master [00:37:28] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64010 [00:40:34] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:41:29] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:43:33] <-- backup user bzip2'ing things [00:44:29] RECOVERY - DPKG on snapshot2 is OK: All packages OK [00:44:54] New patchset: GWicke; "WIP: Parsoid VCL refinements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64008 [00:49:39] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:53:17] New patchset: GWicke; "WIP: Parsoid VCL refinements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64008 [00:53:39] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:55:39] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:58:39] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:59:41] New patchset: MarkTraceur; "Add fundraising components to #wm-fundraising" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64012 [01:00:13] New review: MarkTraceur; "Don't merge this until the FR team has had a chance to discuss it, but it's here and ready AFAICT." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/64012 [01:00:35] New patchset: MarkTraceur; "**awaiting discussion** Add fundraising components to #wm-fundraising" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64012 [01:00:39] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:36] does mediawiki have an official minimum memory requirement, with a concomitant commitment to, say, the unit tests passing? [01:03:07] i should probably ask this on #mediawiki, not channel-appropriate [01:03:15] * ori-l retracts [01:08:03] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [01:10:43] PROBLEM - Disk space on ms2 is CRITICAL: NRPE: Command check_disk_space not defined [01:22:14] New patchset: Cmjohnson; "Changing cfg for stat1002" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64013 [01:23:16] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64013 [01:32:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:34:33] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:35:33] RECOVERY - DPKG on snapshot2 is OK: All packages OK [01:36:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:43:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:45:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:49:25] New patchset: Akosiaris; "Puppetizing Hadoop for CDH4." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [01:50:34] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:51:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:51:50] New review: Adamw; "(1 comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/64012 [01:53:34] RECOVERY - DPKG on snapshot2 is OK: All packages OK [01:54:12] New review: Akosiaris; "So i added a couple of unit tests for the classes in the module. Most run just fine except for:" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [01:55:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:56:34] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:57:25] RECOVERY - DPKG on snapshot2 is OK: All packages OK [01:58:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:01:04] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [02:01:04] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [02:01:04] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [02:03:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:04:35] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:06:34] !log LocalisationUpdate completed (1.22wmf4) at Thu May 16 02:06:34 UTC 2013 [02:06:42] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:06:42] Logged the message, Master [02:09:42] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:11:32] RECOVERY - DPKG on snapshot2 is OK: All packages OK [02:11:55] !log LocalisationUpdate completed (1.22wmf3) at Thu May 16 02:11:55 UTC 2013 [02:12:02] Logged the message, Master [02:13:22] PROBLEM - Puppet freshness on gallium is CRITICAL: No successful Puppet run in the last 10 hours [02:14:22] PROBLEM - Puppet freshness on db1017 is CRITICAL: No successful Puppet run in the last 10 hours [02:14:32] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:18:42] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:25:32] RECOVERY - DPKG on snapshot2 is OK: All packages OK [02:26:42] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:28:32] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:29:42] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:30:28] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu May 16 02:30:28 UTC 2013 [02:30:36] Logged the message, Master [02:31:32] RECOVERY - DPKG on snapshot2 is OK: All packages OK [02:33:42] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:35:42] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:37:32] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:39:28] RECOVERY - DPKG on snapshot2 is OK: All packages OK [02:40:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:40:38] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:41:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [02:41:38] PROBLEM - DPKG on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:42:28] RECOVERY - DPKG on mc15 is OK: All packages OK [02:42:38] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:42:38] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:28] RECOVERY - DPKG on snapshot2 is OK: All packages OK [02:48:38] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:56:38] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:56:38] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:59:29] RECOVERY - DPKG on snapshot2 is OK: All packages OK [03:00:38] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:04:38] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:06:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:07:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [03:08:35] RECOVERY - DPKG on snapshot2 is OK: All packages OK [03:09:45] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:16:38] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:16:45] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [03:24:35] RECOVERY - DPKG on snapshot2 is OK: All packages OK [03:24:45] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:27:45] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[03:32:45] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:35:36] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:38:26] RECOVERY - DPKG on snapshot2 is OK: All packages OK [03:38:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:42:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:44:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:47:36] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:49:36] RECOVERY - DPKG on snapshot2 is OK: All packages OK [03:50:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.612 second response time [03:55:36] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:55:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:00:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:04:27] RECOVERY - DPKG on snapshot2 is OK: All packages OK [04:06:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:07:53] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 04:07:45 UTC 2013 [04:08:23] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [04:09:03] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 04:08:53 UTC 2013 [04:09:23] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [04:09:33] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:10:03] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 04:09:55 UTC 2013 [04:10:23] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [04:10:45] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:10:53] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 04:10:51 UTC 2013 [04:11:23] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [04:11:43] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 04:11:40 UTC 2013 [04:12:23] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [04:12:24] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 04:12:22 UTC 2013 [04:13:23] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [04:13:33] RECOVERY - DPKG on snapshot2 is OK: All packages OK [04:16:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:16:43] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 04:16:42 UTC 2013 [04:17:23] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [04:21:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:22:43] PROBLEM - Disk space on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:23:33] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:25:43] RECOVERY - Disk space on snapshot2 is OK: DISK OK [04:26:33] RECOVERY - DPKG on snapshot2 is OK: All packages OK [04:28:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:29:33] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:30:33] RECOVERY - DPKG on snapshot2 is OK: All packages OK [04:30:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:31:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.238 second response time [04:33:33] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:33:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:34:33] RECOVERY - DPKG on snapshot2 is OK: All packages OK [04:34:46] New review: MZMcBride; "This seems reasonable to me!" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63877 [04:36:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:37:33] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:41:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:43:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:47:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:48:33] RECOVERY - DPKG on snapshot2 is OK: All packages OK [04:51:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:01:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:05:33] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:05:43] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:07:34] RECOVERY - DPKG on snapshot2 is OK: All packages OK [05:08:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:11:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:13:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:15:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:18:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:23:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:24:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [05:31:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:31:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:32:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [05:35:44] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:38:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:42:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:46:32] TimStarling, around? [05:46:36] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:46:39] yes [05:47:17] hi, i'm having a weird issue, can't figure out what's causing it. Can i file-sync an extension file that would log an error condition? 
[05:47:35] TimStarling, problem is, i need to log IP and the full request (GET only) [05:48:05] it currently causes warnings in the fatalmonitor [05:48:18] yes you can [05:48:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:49:22] TimStarling, thx, i will ping you in a bit with the link to my patch, just in case. [05:50:27] RECOVERY - DPKG on snapshot2 is OK: All packages OK [05:50:37] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:51:26] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [05:51:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:53:36] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:54:36] RECOVERY - DPKG on snapshot2 is OK: All packages OK [05:57:36] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:59:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:02:36] RECOVERY - DPKG on snapshot2 is OK: All packages OK [06:03:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:05:36] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:05:46] PROBLEM - Disk space on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:06:46] RECOVERY - Disk space on snapshot2 is OK: DISK OK [06:06:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:08:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:10:06] PROBLEM - Puppet freshness on ocg1 is CRITICAL: No successful Puppet run in the last 10 hours [06:10:06] PROBLEM - Puppet freshness on ocg2 is CRITICAL: No successful Puppet run in the last 10 hours [06:12:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[06:13:06] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [06:13:06] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [06:14:36] RECOVERY - DPKG on snapshot2 is OK: All packages OK [06:15:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:16:17] TimStarling, https://gerrit.wikimedia.org/r/#/c/64020/1/includes/PageRenderingHooks.php [06:16:36] please +2 [06:16:48] will push it out now, and will revert right thereafter [06:17:08] should get enough hits to figure out who is triggering it [06:17:44] what bug number is it? [06:17:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:18:14] TimStarling, there is no bug - it was deployed yesterday, and immediately we saw it in fatalmonitor [06:18:28] php warning [06:18:36] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:18:59] probably something silly in logic, just that i don't see it in my tests [06:19:01] I think you should file a bug and reference it from the commit message or the patch comment or both [06:19:22] it lets other people know what you are doing without them having to ask you [06:19:34] ok [06:19:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:21:16] $dbg .= "\nURL: " . $_SERVER["SERVER_NAME"] . $_SERVER["REQUEST_URI"]; [06:21:28] this is not actually the URL, in a way that's particularly relevant for mobile clients [06:21:38] but it probably doesn't matter for you [06:22:01] TimStarling, ideally i would want to know the full HTTP request + headers [06:22:15] TimStarling, i will also need the source IP [06:22:21] you don't use $_SERVER['REQUEST_URI'] anywhere else, do you? [06:22:30] no, of course not [06:22:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:22:53] i need to figure out which X-CS based on IP [06:23:37] you can log wfGetIP() if that's all you need [06:24:04] not all - i also need the query, and which server they requested [06:24:57] if I were you, I'd either use a new log channel named after the bug number, or temp-debug, I wouldn't use a vague name like "mobile" that's already used for something else [06:25:23] sure, but that's there already [06:25:38] well, kind of [06:25:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:26:04] TimStarling, wouldn't i need to set up a new config setting, and do lots of other things? i don't want to accidentally break too many things with this :) [06:26:35] just add it to wgDebugLogGroups in InitialiseSettings.php [06:26:36] RECOVERY - DPKG on snapshot2 is OK: All packages OK [06:26:39] (can't wait for the hadoop with sql interface :)) [06:26:48] that's what I usually do, but like I say, there's temp-debug if you think that's too hard [06:27:07] I guess someone else didn't like changing InitialiseSettings.php for every production debugging job [06:27:19] TimStarling, you mean i can use "temp-debug" instead of mobile? [06:27:23] yes [06:27:46] 'temp-debug' => "udp://$wmfUdp2logDest/temp-debug", // generic admin debug log [06:28:09] and the directory on fluorine is writable by udp2log now, so the file will be created automatically [06:28:40] it'll appear at /a/mw-log/temp-debug.log [06:28:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:29:31] TimStarling, thanks!!! i just git reviewed the change [06:29:35] pls +2 [06:29:36] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:16] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 06:30:10 UTC 2013 [06:30:19] what's the command to revert "git rm"?
I'm sure I've done this once before but I can't remember it now [06:30:26] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [06:30:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:30:52] no idea - i usually use tortoise git ;) [06:30:59] or gerrit's [06:31:06] "revert" is very nice there [06:31:06] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 06:30:56 UTC 2013 [06:31:06] git reset file ; git checkout -- file [06:31:23] ori-l, are you always lurking??!? amazing :) [06:31:26] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [06:31:46] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 06:31:36 UTC 2013 [06:32:26] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [06:32:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:33:10] thanks ori-l [06:33:49] yurik: got to head to the airport for a flight in four hours or so, not much point in sleeping [06:34:03] europe? [06:34:28] going to israel first to spend a bit of time with my family [06:34:35] so: pseudo-europe [06:34:46] PROBLEM - Disk space on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:34:46] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:36:14] in the same sense that australia is pseudo-europe? ;) [06:36:34] exactly, with the disturbing colonial implications to boot [06:37:58] I wonder how israel will end up in the long term [06:38:06] like south africa or like liberia? [06:38:07] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [06:38:37] i.e. to what extent will it become like its surroundings? 
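The `git reset file ; git checkout -- file` answer given at 06:31:06 can be tried safely in a scratch repository. What follows is only a minimal sketch, assuming a POSIX shell and a reasonably recent git; the temporary repository, the file name `file.txt`, and the commit identity are made up for illustration, and `--` is added to `git reset` so that the deleted path is not mistaken for a revision.

```shell
#!/bin/sh
# Sketch: undo a staged "git rm" in a throwaway repository.
set -eu

repo=$(mktemp -d)                 # scratch directory (hypothetical example repo)
cd "$repo"
git init -q
git config user.email "example@example.org"   # placeholder identity for the demo commit
git config user.name "Example"

echo "hello" > file.txt
git add file.txt
git commit -q -m "add file.txt"

git rm -q file.txt                # stages the deletion and removes the working copy
git reset -q -- file.txt          # unstage the deletion: index entry is restored from HEAD
git checkout -- file.txt          # restore the working-tree copy from the index

restored=$(cat file.txt)          # the file is back with its committed content
echo "$restored"
git status --porcelain            # prints nothing once the tree is clean again
```

The two steps are needed because `git rm` touches both the index and the working tree: `git reset` repairs the index, and `git checkout -- file` then copies the file from the index back to disk. `git checkout HEAD -- file.txt` should achieve the same result in a single command.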
[06:38:37] RECOVERY - Disk space on snapshot2 is OK: DISK OK [06:38:43] about to sync file in ext [06:39:07] PROBLEM - Puppet freshness on db26 is CRITICAL: No successful Puppet run in the last 10 hours [06:40:47] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:41:21] !log yurik synchronized php-1.22wmf3/extensions/ZeroRatedMobileAccess/includes/PageRenderingHooks.php [06:41:28] Logged the message, Master [06:42:02] I think Liberia is a really interesting analogue for Israel, in terms of their compassionate rationales for foundation [06:42:41] south africa is an interesting point of comparison; i don't know much about liberian politics but reading about it now [06:43:22] both Israel and Liberia were founded by western states as homelands for oppressed people in those sponsoring states [06:43:44] but Liberia is almost twice as old so maybe it gives you insight into a later stage of the process [06:44:47] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:45:28] RECOVERY - DPKG on snapshot2 is OK: All packages OK [06:46:07] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 06:45:59 UTC 2013 [06:46:27] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [06:46:32] New patchset: Tim Starling; "Remove three more scap scripts which were moved to puppet" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/64023 [06:47:13] sorry, more than twice as old, my memory was failing [06:48:07] 3 times as old, in fact: 191 years versus 65 years [06:48:37] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:49:37] RECOVERY - DPKG on snapshot2 is OK: All packages OK [06:50:35] palestinians are not capable of instigating civil war at the moment, i don't think. 
the degree to which israel's "security" apparatuses control every aspect of palestinian life is staggering [06:50:47] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:51:22] a unified nonviolent movement with major international backing is a possibility, but i don't know what it would accomplish. it's an incredibly depressing situation. [06:51:47] PROBLEM - Disk space on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:52:37] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:52:48] RECOVERY - Disk space on snapshot2 is OK: DISK OK [06:53:47] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:53:49] no, the palestinians are not capable of creating a civil war [06:54:14] you could compare the current period for israel with this period for liberia: https://en.wikipedia.org/wiki/History_of_Liberia#Americo-Liberian_domination_and_suppression [06:56:16] the parallels are a bit uncanny, right down to the reproduction of patterns of oppression [06:56:46] 'mobile' is used for bug logging only, so there's no problem in reusing it [06:57:47] PROBLEM - Disk space on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:58:47] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:59:47] but americo-liberians constituted 5% of the population, whereas jews currently slightly outnumber arabs in israel and palestine taken as one unit. 
they are a minority in modern palestine proper (settlers, that is), but that minority is increasingly geographically contiguous with the centers of jewish population in israel, so maybe the comparison with liberia breaks down there [07:01:47] RECOVERY - Disk space on snapshot2 is OK: DISK OK [07:01:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:02:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.264 second response time [07:02:47] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:03:21] aha yurik - you have logged hits now [07:05:47] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:08:02] MaxSem, yep, btw, tim suggested an excellent way to log :) [07:08:07] separate file, much easier [07:08:44] MaxSem, and the hits are weird :( it seems that our opera recognition is not working :( [07:09:06] that's a side benefit of this log [07:09:14] Anus_m.jpg - why I'm not surprised? [07:09:22] yeah, i was amused too [07:10:06] opera as in opera ips [07:10:19] yes [07:10:25] their forwarding cluster [07:10:58] there are also android hits [07:11:01] although the anus request is not [07:12:10] debating if i should revert now, or collect a few more hits [07:13:24] ooh, the last one is interesting - probably coming from cache [07:13:44] as we no longer have zeropartner=NNN [07:14:41] ok, i think this is good enough, time to stop this [07:17:35] syncing [07:25:16] MaxSem, i ran into an unexpected problem - what happens if my computer dies during the sync operation? 
[07:26:10] servers go out of sync because sync requires your auth agent [07:27:27] MaxSem, could you do me a favour and sync-file php-1.22wmf3/extensions/ZeroRatedMobileAccess/includes/PageRenderingHooks.php [07:27:36] everything is already in place [07:28:01] i think my desktop just died :( [07:28:22] or is rebooting due to urgent microsoft updates [07:28:32] probably the latter [07:28:39] Tuesday was Patch Tuesday [07:28:50] for me there are ~65.8 MB worth of updates [07:29:04] wow. well, its been down for the past 10 min... [07:29:20] i can just see it trying to reboot and asking me to confirm something... [07:29:21] !log maxsem synchronized php-1.22wmf3/extensions/ZeroRatedMobileAccess/includes/PageRenderingHooks.php [07:29:27] MaxSem, thanks! [07:29:28] Logged the message, Master [07:29:58] yei, it just came back up!!!! [07:30:11] ~12 min down!!! [07:37:14] PROBLEM - Disk space on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:38:04] RECOVERY - DPKG on snapshot2 is OK: All packages OK [07:38:14] RECOVERY - Disk space on snapshot2 is OK: DISK OK [07:39:34] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:44:04] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:45:54] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 07:45:45 UTC 2013 [07:46:14] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [07:46:38] !log maxsem synchronized php-1.22wmf4/extensions/GeoData/ 'https://gerrit.wikimedia.org/r/#/c/63972/' [07:46:46] Logged the message, Master [07:48:34] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:53:18] !log maxsem synchronized php-1.22wmf3/extensions/GeoData/ 'https://gerrit.wikimedia.org/r/#/c/63972/' [07:53:26] Logged the message, Master [07:53:34] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:55:39] Tim-away, around? I notice in all the log entries each request had 3 XFF header values. Is that normal? [07:56:14] PROBLEM - Disk space on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:56:44] PROBLEM - SSH on snapshot2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:56:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:57:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.198 second response time [07:58:44] RECOVERY - SSH on snapshot2 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [08:00:54] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [08:02:04] RECOVERY - DPKG on snapshot2 is OK: All packages OK [08:03:14] RECOVERY - Disk space on snapshot2 is OK: DISK OK [08:03:34] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:05:04] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:10:36] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:12:16] PROBLEM - Disk space on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:15:06] RECOVERY - Disk space on snapshot2 is OK: DISK OK [08:17:56] RECOVERY - DPKG on snapshot2 is OK: All packages OK [08:19:01] New patchset: Hashar; "** WIP ** role class for puppet agents ** WIP **" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64031 [08:20:36] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:21:06] PROBLEM - DPKG on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:25:16] PROBLEM - Disk space on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:27:36] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:28:06] RECOVERY - Disk space on snapshot2 is OK: DISK OK [08:28:46] PROBLEM - SSH on snapshot2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:29:51] New patchset: Hashar; "** WIP ** role class for puppet agents ** WIP **" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64031 [08:31:36] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:31:46] RECOVERY - SSH on snapshot2 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [08:33:16] PROBLEM - Disk space on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:34:06] RECOVERY - Disk space on snapshot2 is OK: DISK OK [08:34:36] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:34:46] PROBLEM - SSH on snapshot2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:36:37] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:36:37] RECOVERY - SSH on snapshot2 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [08:39:37] PROBLEM - RAID on snapshot2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:40:57] RECOVERY - DPKG on snapshot2 is OK: All packages OK [08:52:54] New patchset: Hashar; "** WIP ** role class for puppet agents ** WIP **" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64031 [08:54:57] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [09:08:59] New patchset: Hashar; "** WIP ** role class for puppet agents ** WIP **" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64031 [09:18:28] New patchset: Hashar; "** WIP ** role class for puppet agents ** WIP **" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64031 [10:07:09] re [10:26:04] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [10:27:04] PROBLEM - Puppet freshness on colby is CRITICAL: No successful Puppet run in the last 10 hours [12:01:21] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [12:01:21] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [12:01:21] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [12:07:59] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 12:07:54 UTC 2013 [12:08:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:19] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 12:09:10 UTC 2013 [12:10:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:10:29] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 12:10:22 UTC 2013 [12:11:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:11:39] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 12:11:29 UTC 2013 [12:12:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:12:29] 
RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 12:12:27 UTC 2013 [12:13:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:13:29] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 12:13:19 UTC 2013 [12:13:59] PROBLEM - Puppet freshness on gallium is CRITICAL: No successful Puppet run in the last 10 hours [12:14:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:14:09] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 12:14:03 UTC 2013 [12:14:59] PROBLEM - Puppet freshness on db1017 is CRITICAL: No successful Puppet run in the last 10 hours [12:15:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:15:19] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 12:15:11 UTC 2013 [12:16:09] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [12:34:38] New patchset: ArielGlenn; "redis job queue for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64044 [12:43:10] New patchset: Hashar; "redis job queue for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64044 [12:44:14] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64044 [13:14:47] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 13:14:46 UTC 2013 [13:15:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [13:48:41] New patchset: Hashar; "beta: appserver should uses /data/project/apache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64057 [13:49:58] !log Graceful reload of Zuul deploying I0ea73ca1439c8aa6 [13:50:06] Logged the message, Master [13:56:38] New patchset: Hashar; "beta: appserver should uses /data/project/apache" [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/64057 [13:57:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:59:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.182 second response time [13:59:42] New patchset: Hashar; "beta: appserver should uses /data/project/apache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64057 [14:02:55] ^demon: poke [14:03:05] <^demon> Hi [14:03:18] what do we need to do to get a gerrit repo setup for the easyrdf library? [14:03:46] that is needed for deploying new code on wikidata [14:05:29] <^demon> Oh, I figured Sam created it yesterday... [14:05:38] not that i know of [14:05:48] <^demon> mediawiki/extensions/Wikibase/easyrdf [14:05:52] <^demon> ^ Sound ok? [14:05:56] yes [14:06:35] <^demon> Empty repo created, permissions inherit from Wikibase. [14:06:42] thanks! :) [14:06:49] * aude can import the github repo there [14:07:02] <^demon> Should be able to, yeah [14:08:59] ok, i remember being able to "impersonate" other committers when doing this before [14:09:06] it was a special permission [14:09:30] otherwise i get committer address does not match my account [14:11:56] ^demon: ? [14:12:07] PROBLEM - DPKG on tin is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:12:09] <^demon> Ah, I'll grant that. [14:12:10] <^demon> Whoops [14:12:13] i don't think i can set that permission myself [14:12:19] only temporary [14:12:37] <^demon> Forge Committer set on easyrdf. [14:12:42] or actually it might be an issue if want to sync stuff later [14:12:45] k [14:13:06] ok, it's there now :) [14:13:08] thanks! [14:13:30] now to update the submodule from wikibase..... 
[14:14:27] PROBLEM - Host tin is DOWN: PING CRITICAL - Packet loss = 100% [14:14:47] RECOVERY - Host tin is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [14:15:07] RECOVERY - DPKG on tin is OK: All packages OK [14:16:27] PROBLEM - Host hooft is DOWN: PING CRITICAL - Packet loss = 100% [14:17:27] RECOVERY - Host hooft is UP: PING OK - Packet loss = 0%, RTA = 90.16 ms [14:21:38] did someone break gerrit? [14:21:59] I did [14:22:03] error: Failed connect to gerrit.wikimedia.org:443; Connection refused while accessing https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Wikibase.git/info/refs [14:22:06] ah, ok [14:22:08] * aude waits [14:22:51] <^demon> F'ing gerrit. Need to fix it so it comes up better on reboot. [14:23:27] <^demon> Anyway, all should be back now. [14:23:32] k [14:25:30] New patchset: Hashar; "beta: appserver should uses /data/project/apache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64057 [14:25:34] indeed [14:31:11] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64057 [14:36:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:37:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [14:42:42] New patchset: Hashar; "beta: fix mediawiki dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64065 [14:44:35] New review: Faidon; "Why not use the Java module/definitions instead of Package?" 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/63866 [14:45:15] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64065 [14:48:10] PROBLEM - Host fenari is DOWN: PING CRITICAL - Packet loss = 100% [14:48:50] PROBLEM - Host sockpuppet is DOWN: PING CRITICAL - Packet loss = 100% [14:49:20] RECOVERY - Host sockpuppet is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [14:52:30] RECOVERY - Host fenari is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [14:55:30] PROBLEM - SSH on fenari is CRITICAL: Connection refused [14:55:30] PROBLEM - HTTP on fenari is CRITICAL: Connection refused [14:57:30] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:57:30] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.071 second response time [14:58:09] New review: Hashar; "Applying both jenkins and jenkins::slave will cause a duplicate conflict for the headless java packa..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63866 [15:27:17] !log Graceful reload of Zuul deploying I7565eff83a7a128e [15:27:24] Logged the message, Master [15:27:35] hashar: $ git log -p ..origin [15:27:41] short for HEAD..origin/master :) [15:28:12] hashar: $ git log -p ..origin ; short for $ git log -p HEAD..origin/master [15:28:48] hashar: though it does rely on it expanding origin to origin/{tracking branch}, this is good and more secure because we use "git rebase" to merge, which makes the same expansion [15:28:54] context ? 
:D [15:28:58] and saves a few characters [15:29:02] hashar: deploying zuul [15:29:09] screw the few characters, I prefer to be explicit :-D [15:29:20] hashar: That's what I said [15:29:26] zuul is almos packaged btw :-] [15:29:27] hashar: Being explicit in this case hurts [15:29:43] hashar: If you're explicit in git-log but not in git-rebase, you can be looking at the wrong thing [15:29:58] by allowing git to compare against the branch that master is tracking, you're right both ways [15:30:01] yeah got your point [15:30:16] might want to update the MediaWiki deployment guide as well [15:30:21] I generally prefer explicit as well [15:30:39] note that the Zuul master branch in wikimedia repo has some hacks which are not upstream yet [15:30:58] zuul-config [15:31:01] What?! unvetted patches!? [15:31:02] this is about zuul-config [15:31:09] not zuul [15:47:20] paravoid / AaronSchulz: Could one of you answer https://bugzilla.wikimedia.org/show_bug.cgi?id=40514#c8 please? [15:56:41] !log Graceful reload of Zuul deploying Ic9194c079fa7d46 [15:56:49] Logged the message, Master [16:08:01] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 16:07:53 UTC 2013 [16:08:31] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:08:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 16:08:48 UTC 2013 [16:09:31] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:09:41] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 16:09:38 UTC 2013 [16:10:31] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:11:01] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 16:10:53 UTC 2013 [16:11:31] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:13:11] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours 
[16:13:11] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [16:15:11] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 16:15:01 UTC 2013 [16:15:31] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:15:38] !log Parsoid update to current master: start [16:15:46] Logged the message, Master [16:17:58] !log Parsoid update to current master: done [16:18:06] Logged the message, Master [16:19:10] New patchset: Asher; "repooling db1043 for enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64082 [16:20:07] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64082 [16:23:25] !log asher synchronized wmf-config/db-eqiad.php 'pooling db1043' [16:23:33] Logged the message, Master [16:26:31] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:30:01] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:32:21] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [16:37:51] PROBLEM - Host analytics1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:38:11] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [16:39:01] PROBLEM - Host analytics1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:39:11] PROBLEM - Puppet freshness on db26 is CRITICAL: No successful Puppet run in the last 10 hours [16:40:21] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [16:41:31] RECOVERY - Host analytics1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:43:21] RECOVERY - Host analytics1003 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:44:31] PROBLEM - NTP on analytics1001 is CRITICAL: NTP CRITICAL: Offset unknown [16:45:11] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 16:45:06 UTC 2013 [16:45:28] hi, I am looking at the custom 
production log, and see 3 IPs in the XFF header. Is that normal? [16:45:31] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [16:48:31] RECOVERY - NTP on analytics1001 is OK: NTP OK: Offset -0.003993868828 secs [16:52:17] !log rebooting stat1 for kernel upgrade [16:52:26] Logged the message, Master [16:53:31] PROBLEM - Host analytics1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:51] PROBLEM - Host stat1 is DOWN: PING CRITICAL - Packet loss = 100% [16:58:41] RECOVERY - Host analytics1004 is UP: PING OK - Packet loss = 0%, RTA = 1.01 ms [16:59:21] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [16:59:21] RECOVERY - Host stat1 is UP: PING OK - Packet loss = 0%, RTA = 26.93 ms [17:00:09] ^demon: Is gerrit still upset with itself? [17:00:09] error: The requested URL returned error: 403 while accessing https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Wikibase.git/info/refs [17:00:09] fatal: HTTP request failed [17:00:10] Unable to fetch in submodule path 'extensions/Wikibase' [17:00:34] 406 not acceptible [17:07:29] I wonder if manganese and tin just don't mix well [17:08:39] aww WMF needs more chemistry [17:10:22] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [17:13:00] !log reedy synchronized php-1.22wmf4/extensions/ [17:13:08] Logged the message, Master [17:15:12] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 17:15:01 UTC 2013 [17:15:32] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [17:20:42] greg-g, need to re-deploy the tiny patch to allow logging in Zero. The issue seems weirder and more involved. Need to only sync-file in zero extension [17:21:14] would i step on anyones toes if i push it now? 
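[editor's note] On the question above about seeing three IPs in the XFF header: by convention, each proxy that forwards a request appends the address it received the request from to X-Forwarded-For, so several comma-separated values usually mean the request passed through several hops. Whether three is "normal" for this setup is not answered in the log. The sketch below only shows how such a value can be split and read. The function names, the sample addresses (reserved documentation ranges) and the idea of a trusted-proxy set are illustrative assumptions, not taken from the production config.

```python
def parse_xff(header_value):
    """Split an X-Forwarded-For value into its hops, leftmost (oldest) first."""
    return [hop.strip() for hop in header_value.split(",") if hop.strip()]


def client_ip(header_value, trusted_proxies):
    """Return the rightmost hop that is not one of our own (trusted) proxies.

    Walking from the right avoids trusting values a client may have
    injected at the left end of the header. If every hop is trusted,
    fall back to the leftmost one; return None for an empty header.
    """
    hops = parse_xff(header_value)
    for hop in reversed(hops):
        if hop not in trusted_proxies:
            return hop
    return hops[0] if hops else None


# Hypothetical example: one end user behind two forwarding proxies.
SAMPLE_XFF = "203.0.113.7, 198.51.100.20, 192.0.2.1"
SAMPLE_TRUSTED = {"198.51.100.20", "192.0.2.1"}

if __name__ == "__main__":
    print(parse_xff(SAMPLE_XFF))
    print(client_ip(SAMPLE_XFF, SAMPLE_TRUSTED))
```

If the set of trusted proxies is incomplete (for example, a partner's forwarding cluster is missing from it), the function returns a proxy address rather than the end user's, which is one plausible way carrier or proxy recognition could appear to fail.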
[17:21:23] yurik: ok, if Reedy is done, you can go now until 11am Pacific [17:21:45] I "need" to scap [17:21:58] But it's only for a few new messags, so isn't urgent at all [17:22:18] Reedy, mine is just one file - should be ~4 min, right? [17:22:28] but i don't want to step on your toes :) [17:22:29] One file? I could do it in under a minute ;) [17:22:39] hehe :) [17:22:40] Feel free to go ahead [17:22:45] Reedy, thx [17:23:22] PROBLEM - Host analytics1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:26:18] syncing... [17:26:42] mutante: so i tried scrubbing a mailman list archive for the first time a few mins ago (using wikitech instructions). was not fun. i gave up and reverted to backups! [17:27:14] !log yurik synchronized php-1.22wmf3/extensions/ZeroRatedMobileAccess/includes/PageRenderingHooks.php [17:27:22] RECOVERY - Host analytics1005 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [17:27:23] Logged the message, Master [17:27:26] mw1173 is still down [17:27:42] Reedy, done [17:27:45] nothing extraordinary, just some spam. so i figured i'd put in placeholders. no such luck. somehow old messages got out of order [17:27:56] jeremyb: eh? where did you do that [17:28:10] mutante: lists.freeculture.org [17:28:36] mutante: did i scare you? :-) [17:28:38] jeremyb: ahh, so you mean deleting stuff inside the archive and then recreating it? and then all your links broke? [17:28:39] not ops@wikimedia problem, don't worry! [17:28:45] jeremyb: hah, just a little bit [17:29:13] jeremyb: so all i can say is do not delete anything, just replace things with XXX's [17:29:20] mutante: i mean some messages in the new files ended up in a different *order* than in the original [17:29:29] and so their IDs swapped [17:29:43] New patchset: Sanja pavlovic; "Changed code to give summary message about conf errors." 
[operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/64095 [17:29:53] edit the .mbox file, but replace only, use ./arch to recreate HTML from mbox [17:30:16] mutante: did i mention i followed the wikitech docs? ;-) [17:30:29] jeremyb: yea, and then archive links are leading you to a different message.. it still happened.. sometimes ... [17:30:31] i'm pretty sure i did it correctly [17:31:21] i was just picking random months to vimdiff /date.html vs. the backup and i saw some with reordering (but not missing messages) [17:33:04] mutante: anyway, i was just coming to say i understand how crazy mailman is :-) [17:33:04] <^demon> Reedy: No problems on my end... [17:33:25] ^demon: Indeed. Fine for me locally. fine on fenari. Not on tin. And seemingly onlt that repo [17:33:39] <^demon> Hrmmm.... [17:33:47] jeremyb: ok:) [17:34:42] New patchset: Sanja pavlovic; "Per bug #48012. Changed code to give summary message about conf errors." [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/64095 [17:35:23] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [17:36:53] New review: Sanja pavlovic; "I didn't know how to amend old patches, so I made another one, sorry." [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/64095 [17:40:23] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.011 second response time [17:45:03] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 17:45:00 UTC 2013 [17:45:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [17:50:35] andre__: I don't think so [17:50:50] !log reedy Started syncing Wikimedia installation... : Rebuilding l10n cache for wikidata deploy [17:50:59] Logged the message, Master [17:51:48] <^demon> Reedy: If I had to guess, I'd say that tin can't hit gerrit over https. ssh on 29418 is fine. [17:52:24] Oh, submodule added via https not ssh? [17:52:31] <^demon> Prolly. 
[17:52:37] paravoid, did that refer to bug 40514, or to being able to answer that? :) [17:52:38] <^demon> Check .gitmodules [17:52:58] andre__: I don't think anything has happened [17:53:07] ah :) [17:53:14] Everything is HTTPS in the root .gitmodules [17:53:34] So is the sub module of the sub module [18:01:03] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [18:01:23] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [18:01:47] !log reedy Finished syncing Wikimedia installation... : Rebuilding l10n cache for wikidata deploy [18:01:56] Logged the message, Master [18:02:01] 10 minutes [18:02:02] Nice [18:02:38] eek:) [18:04:15] submodule of the submodule ... wow :p [18:06:01] mutante: VE were first there and done it [18:06:26] mutante, https://i.chzbgr.com/maxW500/4477970432/h26A6F2A4/ [18:06:31] we have an issue with our new special entity data page on wikidata [18:06:33] http://www.wikidata.org/wiki/Special:EntityData/Q60 [18:06:37] Error 330 (net::ERR_CONTENT_DECODING_FAILED): Unknown error. [18:06:43] MaxSem: :) [18:06:50] any ideas if this is related to cache config or what? [18:07:01] works perfect on my test wiki but i don't have squid in front of it [18:07:04] Reedy: tell me how you like graceful-all next time you need it:) [18:10:22] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [18:10:35] Reedy: and did i setup the submodule correctly? [18:10:50] aude: It works fine on fenari and locally, so I guess so [18:10:55] k [18:11:09] did it same way the extensions are (https) [18:11:36] and think that has nothing to do with the error, as i'm just requesting json representation of the item [18:11:57] i think rdf is with ".rdf" at the end and otherwise not hitting the rdf code [18:12:15] Jeff_Green: you about? 
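[editor's note] The "Check .gitmodules" suggestion above can be sketched as a small script that lists which submodules are configured with an HTTPS URL, which matters if a host can reach Gerrit over SSH on 29418 but not over HTTPS, as ^demon guessed for tin. This is a minimal sketch, not the tooling used in the channel. The Wikibase URL is the one from the error message in the log; the second entry and all function names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Matches section headers such as: [submodule "extensions/Wikibase"]
SECTION = re.compile(r'^\[submodule "(?P<name>[^"]+)"\]$')
# Matches simple "key = value" lines inside a section.
KEYVAL = re.compile(r'^(?P<key>\w+)\s*=\s*(?P<value>.+)$')


def submodule_urls(text):
    """Map each submodule name in a .gitmodules file to its url value."""
    urls = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        section = SECTION.match(line)
        if section:
            current = section.group("name")
            continue
        keyval = KEYVAL.match(line)
        if keyval and current and keyval.group("key") == "url":
            urls[current] = keyval.group("value").strip()
    return urls


def https_submodules(text):
    """Return the sorted names of submodules whose URL scheme is https."""
    return sorted(
        name
        for name, url in submodule_urls(text).items()
        if urlparse(url).scheme == "https"
    )


# Sample input; the second submodule is a made-up entry for contrast.
SAMPLE = """\
[submodule "extensions/Wikibase"]
\tpath = extensions/Wikibase
\turl = https://gerrit.wikimedia.org/r/p/mediawiki/extensions/Wikibase.git
[submodule "extensions/Example"]
\tpath = extensions/Example
\turl = ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/Example.git
"""

if __name__ == "__main__":
    print(https_submodules(SAMPLE))
```

A nested submodule (the "submodule of the submodule" mentioned above) has its own .gitmodules inside the parent submodule's checkout, so the same check would need to be run there as well.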
[18:12:22] yes sir [18:12:25] ok [18:12:25] so [18:12:30] I was going to merge a patchset [18:12:33] that, uhhh [18:12:47] https://gerrit.wikimedia.org/r/#/c/64004/ [18:12:48] that one [18:12:57] spin up db77 the same as db78 [18:13:25] then switch it over to use the coredb shiz that I've defined for the fundraisingdb cluster (don't worry, it has no affect on anything yet) [18:13:30] ok [18:13:31] and look at what puppet does [18:13:32] PROBLEM - RAID on analytics1011 is CRITICAL: Timeout while attempting connection [18:13:37] I'm pretty sure that it should be a noop [18:13:41] as I've done this a time or two before [18:13:47] but, I'll want you to look over the output [18:13:51] and the code, if you hav ea chance [18:14:02] then, once that's good, we cna switch over db78 [18:14:06] and watch the noop :) [18:14:08] k [18:14:16] well, and switch it to mariadb [18:14:19] that'll be an op. [18:14:33] i already did that for db1025, so that should be nonawful [18:14:37] New patchset: Pyoungmeister; "lsetting up db77 as clone of db78 for coredb testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64004 [18:14:44] yeah [18:14:52] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 18:14:45 UTC 2013 [18:15:01] has been just time on all the other db boxes [18:15:02] PROBLEM - Host analytics1006 is DOWN: PING CRITICAL - Packet loss = 100% [18:15:02] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [18:15:21] Meh, let's just all switch to postgres. It's a few hours' work at worse, right? 
[18:15:32] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [18:15:50] Coren: I'm still gunning for mangodb https://github.com/dcramer/mangodb [18:16:22] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [18:16:35] Jeff_Green: oh yay, and db78 is precise, even [18:16:46] we do really need webscale stuff [18:17:11] !log restarting a ceph-osd osd.0 for testing [18:17:15] Reedy: mangodb is auto-sharting [18:17:24] doens't get much better than that [18:17:24] Logged the message, Master [18:17:34] * Reedy wonders how one sharts [18:17:40] google it ;) [18:17:53] heh [18:18:04] I like how it works regardless of wether the block storage devices are accessible. [18:18:12] notpeter: yep, it's a pretty recent build [18:18:17] Coren: that's part of the auto-sharting [18:18:22] Jeff_Green: cool! [18:19:22] RECOVERY - Host analytics1006 is UP: PING OK - Packet loss = 0%, RTA = 1.06 ms [18:21:00] !log zirconium (planet): upgrading kernel and rebooting [18:21:10] Logged the message, Master [18:21:52] PROBLEM - Host zirconium is DOWN: CRITICAL - Host Unreachable (208.80.154.41) [18:22:17] Reedy: don't eat anything all day, then drink a bunch of crappy beer? [18:24:42] RECOVERY - Host zirconium is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [18:25:22] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [18:28:22] !log magnesium (racktables): upgrading kernel and rebooting [18:28:31] Logged the message, Master [18:31:02] PROBLEM - Host analytics1008 is DOWN: PING CRITICAL - Packet loss = 100% [18:34:06] !log updated Parsoid to b0229051eb08c8 [18:34:13] Logged the message, Master [18:34:22] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.006 second response time [18:35:38] RECOVERY - Host analytics1008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:39:09] mutante, time to break RT? [18:39:31] andrewbogott: hi. 
yes [18:39:51] i just upgraded magnesium so it's done before [18:40:08] and did you see the gerrit change and comment? [18:40:30] I saw it but didn't read closely. [18:41:06] andrewbogott: short version: it fixed Apache config so that RT and the other RT work together on magnesium and both enforce HTTPS [18:41:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64004 [18:41:17] "other RT" = racktables :p [18:41:32] yep, that's what we needed. [18:44:50] PROBLEM - NTP on analytics1023 is CRITICAL: NTP CRITICAL: Offset unknown [18:45:24] andrewbogott, mutante: RT #714 indicates that rt-mailgate breaks with https redirects, but I think that's what you've staged in puppet [18:46:07] was it not using https before? [18:46:08] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 18:45:59 UTC 2013 [18:46:28] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [18:48:04] mutante: do you have any suggestions on what might be wrong with http://www.wikidata.org/wiki/Special:EntityData/Q60.json [18:48:11] or without the .json [18:48:34] i suspect it's related to gzip but not sure where/how [18:48:40] and really don't know [18:48:41] aude: sorry, no, and in the middle of an upgrade [18:48:46] ok [18:49:00] New patchset: Reedy; "Education Program on dewikiversity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64103 [18:49:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64103 [18:49:48] RECOVERY - NTP on analytics1023 is OK: NTP OK: Offset -0.002912044525 secs [18:50:05] !log reedy synchronized wmf-config/InitialiseSettings.php 'EP on dewikiversity' [18:50:18] Logged the message, Master [18:50:21] paravoid: arg, thaaat ticket. thanks for the reminder... [18:51:28] but.. 
our current setup does redirect to https [18:52:43] New patchset: Andrew Bogott; "Switch webport to 443 for happier RT+SSL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64105 [18:52:54] New patchset: Reedy; "Refactor out session code" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63839 [18:53:02] mutante: ^ We can test and then merge that patch if it helps [18:53:30] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63839 [18:54:10] New review: Andrew Bogott; "Do not merge, pending test" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/64105 [18:54:17] ""It's just the appropriate apache config and editing your [18:54:19] etc/RT_SiteConfig.pm so $WebPort is set to 443. "" [18:54:31] !log reedy synchronized wmf-config/session.php [18:54:31] andrewbogott: <-- is this the answer to that ticket comment? [18:54:39] Logged the message, Master [18:54:43] yeah [18:55:01] sounds cool, i just wonder why we abandoned hashars change back then..looking [18:55:14] New patchset: Pyoungmeister; "db77 -> coredb for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64106 [18:55:14] Anyway, theat ticket is not directly related to the upgrade, is it? I mean, it's broken now, it may remain broken... 
[18:55:18] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [18:55:20] !log reedy synchronized wmf-config/ [18:55:28] Logged the message, Master [18:55:39] New patchset: Reedy; "(bug 48237) Remove redundant namespaces from testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63756 [18:55:48] New patchset: Pyoungmeister; "db77 -> coredb for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64106 [18:55:52] mark commented that the mail gateway doesn't support https [18:55:55] on that old change [18:56:17] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63756 [18:56:20] but that comment above was from Best Practical [18:56:25] New patchset: Reedy; "(bug 48479) Change default favicon in InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63877 [18:56:46] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63877 [18:57:03] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64106 [18:57:10] New patchset: Reedy; "Update $wgTranslateBlacklist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63678 [18:57:18] New patchset: Reedy; "Update $wgTranslateBlacklist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63678 [18:57:36] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63678 [18:57:52] New patchset: Reedy; "(bug 48457) Set abusefilter-modify-restricted for cawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63705 [18:58:05] Reedy, I got 7 emails for that one change - is there any way to make it send fewer emails (re: [Gerrit] (bug 48237) Remove redundant namespaces from testwiki - change (operations/mediawiki-config) ) - it's not a problem for me because Gmail puts 
it in one conversation, but I imagine that could be annoying for others [18:58:15] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63705 [18:58:38] Not sure [18:59:00] But in most cases people will be getting emails as they've some interest in it in one way or another [18:59:57] true, but if jenkins could send those 4 things in one email (if it does them all together), that'd be nice [19:00:02] ^demon, ^^ is that possible? [19:00:17] They're seperate events [19:00:18] Rebase [19:00:20] Submit [19:00:24] err, review even [19:00:26] Checking [19:00:26] <^demon> No. [19:00:27] submit [19:00:36] !log reedy synchronized wmf-config/ [19:00:44] Logged the message, Master [19:02:20] andrewbogott: paravoid: [19:02:23] If your RT server uses SSL, you will need to install additional Perl libraries. RT will detect and install these dependencies if you pass the --enable-ssl-mailgate flag to configure as documented in RT's README. [19:02:33] sounds like in 4.x it could support it [19:02:40] with additional libs [19:04:55] notpeter: Jeff_Green: so what are the steps for mysql -> maria? is that on wikitech? also was wondering about how percona-xtrabackup is used [19:05:24] the steps i used are probably not what we would do for prod hosts [19:05:46] dpkg -P old mysql [19:05:49] install mariadb [19:06:01] it also requires an addition to apt [19:06:05] but that's handled by puppet [19:06:22] so it's not dump and reimport? just use the existing ibdata? [19:06:28] yes [19:06:34] cool [19:06:36] percona-xtrabackup isn't used for this [19:06:37] I mean [19:06:47] we use it for making new slaves [19:06:47] etc [19:06:50] but this doesn't require any such heavy lifting [19:07:04] ok, and when you do use it what params do you give? [19:07:27] innobackupex-1.5.1 --stream=tar /a/sqldata --user=root --slave-info | nc NEW-SERVER 9210 [19:07:33] from https://wikitech.wikimedia.org/wiki/Setting_up_a_MySQL_replica [19:07:41] ahh, yay, wikitech! 
[19:07:43] danke [19:07:45] yup [19:07:58] !log reedy synchronized wmf-config/InitialiseSettings.php [19:08:06] Logged the message, Master [19:08:18] New patchset: Reedy; "wiki not wikipedia for favico" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64109 [19:08:53] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64109 [19:08:54] New patchset: Hoo man; "Revert "(bug 48479) Change default favicon in InitialiseSettings.php"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64110 [19:09:09] PROBLEM - mysqld processes on db77 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:09:13] Change abandoned: Reedy; "Already fixed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64110 [19:10:03] Reedy: and why does that particular definition use wiki instead of wikipedia? [19:10:24] https://noc.wikimedia.org/conf/highlight.php?file=wgConf.php [19:10:38] geez, true. [19:11:28] ugly me. [19:15:00] Jeff_Green: ok [19:15:05] want to hop onto db77 [19:15:19] there are two files in /root/ [19:15:27] one is the output of the puppet run [19:15:47] which looks good (aside from lots of stuff being horrible broken because puppet can't handle apt well) [19:15:51] so what should I be looking for? you've just changed the mysql manifest? [19:15:59] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 19:15:53 UTC 2013 [19:15:59] yeah [19:16:09] this is a replica of what it would look like on db78 [19:16:21] if we switched it over [19:16:29] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [19:16:33] there's also just a file that's changes to the my.cnf [19:16:41] which looks reasonable to me [19:16:51] there are some changes to it [19:17:03] but they are asher's recommendations [19:17:16] i should have you guys look at what I did in frack [19:17:17] i.e. 
what we're running on all other mariadb boxxies [19:17:28] b/c the fundraising db's are going to be puppetized off of that puppet instance, not production [19:17:42] oh. [19:17:42] yes [19:18:12] have they deviated from what's running on db78? [19:18:12] max connections dropped in half, but i can't imagine 2500 being hit ever [19:18:35] yeah [19:18:44] that changed, but I was like "yeah, whatevs" [19:20:26] paravoid, ryan_Lane, mutante and I are looking at https://rt.wikimedia.org/Ticket/Display.html?id=714 and we don't see any evidence that the issue with mail is actually present. [19:20:27] see ~/my.cnf-db1025 [19:20:54] Which is sort of concerning, we're worried that we'll resurrect that bug when we migrate since we don't understand why the old system is working. [19:20:56] Any thoughts? [19:21:30] I'm not sure about that. I had reverted my change before and someone else worked on it after [19:21:37] 'someone else' :( [19:21:50] Jeff_Green: ah, ok [19:22:12] so, then I'm going to write some shit based on the mysql module [19:22:17] as it will work much better for this [19:22:31] (the official puppet labs mysql module, not the coredb stuffs that I wrote) [19:22:45] ok cool [19:22:59] i'd like to import that eventually to the frack puppet instance [19:23:30] cool [19:23:30] what I have now is sort of a stripped down version of our old manifests [19:23:30] then yeah [19:23:36] what I'll do is make some kinda mysql_fundraising module [19:23:42] and throw it into a roll class [19:23:53] and then you should be able to import that pretty easily [19:32:12] andrewbogott: heh [19:32:23] andrewbogott: yeah. no clue who [19:33:23] New patchset: Andrew Bogott; "Include exim::rt on magnesium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64119 [19:34:48] Mark, it looks like. [19:34:57] mark, still up? [19:35:12] Can you advise about how to make exim::rt function on a host in eqiad? 
[19:35:15] New review: Dzahn; "hostlist wikimedia_nets = does not include eqiad networks / ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64119 [19:36:10] New patchset: Yurik; "Added "zero" debug log" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64123 [19:38:07] PROBLEM - MySQL Replication Heartbeat on db77 is CRITICAL: NRPE: Unable to read output [19:38:45] yea, so our current RT setup redirects you to https as well (using lighttpd rules), and mail works for us .. so how does it avoid the bug [19:43:32] greg-g, i figured out the problem, could push out the patch at the earliest convinience [19:45:01] New patchset: Reedy; "vecwiktionary initial config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64124 [19:45:29] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64124 [19:46:01] greg-g, to make sure that we catch all such events, i also added cleaned up logging to a separate 'zero' log. when would be a good time to push these 2 files to prod? [19:46:12] !log reedy synchronized wmf-config/InitialiseSettings.php [19:46:21] Logged the message, Master [19:46:59] New patchset: Jgreen; "rename nsca_payments.cfg to nsca_frack.cfg since it covers all frack hosts now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64125 [19:47:36] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64125 [19:49:02] !log reedy synchronized database lists files: [19:49:10] Logged the message, Master [19:50:16] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [19:50:25] Logged the message, Master [19:50:43] yurik: E3 starts in 10 minutes [19:51:10] New patchset: Reedy; "vecwiktionary config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64126 [19:51:53] greg-g, E3? 
[19:52:19] * yurik is not up to date on all the lingo [19:52:25] New patchset: Andrew Bogott; "Include exim::rt on magnesium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64119 [19:53:11] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64126 [19:53:16] yurik: editor engagement experiments, see the calendar: https://wikitech.wikimedia.org/wiki/Deployments [19:53:53] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64123 [19:53:55] New patchset: Andrew Bogott; "Include exim::rt on magnesium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64119 [19:54:03] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64119 [19:54:44] greg-g, oh yes, sorry. Ok, I could try to push it out once they are done or very late at night [19:56:14] !log reedy synchronized wmf-config/InitialiseSettings.php [19:56:21] Logged the message, Master [19:57:53] !log reedy synchronized database lists files: [19:58:02] Logged the message, Master [19:58:30] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [19:58:39] Logged the message, Master [19:59:57] New patchset: Reedy; "elwikivoyage config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64127 [20:00:11] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64127 [20:01:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:52] New patchset: Reedy; "Update interwiki.cdb" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64129 [20:02:06] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64129 [20:02:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [20:02:47] New patchset: 
Andrew Bogott; "Use 'newstandard' instead of 'standard' on magnesium." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64130 [20:02:47] !log reedy synchronized wmf-config/ 'Updating interwiki cache' [20:02:56] Logged the message, Master [20:03:12] robh, any concerns with https://gerrit.wikimedia.org/r/#/c/64130/ ? [20:03:32] andrewbogott: eww, dont use new [20:03:34] yes [20:03:38] its not for use, ever [20:03:38] ? [20:03:44] Which is why it's called 'new'? [20:03:56] OK, so, instead can I just enumerate the things from 'standard' and leave out exim? [20:03:58] Your account is active on 848 project sites. [20:04:48] New patchset: Andrew Bogott; "Use 'newstandard' instead of 'standard' on magnesium." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64130 [20:04:49] RobH, is this better? [20:04:58] um… other than having the wrong title? [20:05:55] hrmm, if standard isnst included [20:05:59] what else is no longer being maintained? [20:06:05] or is standard just that exim package? [20:06:35] Standard is ganglia, ntp, exim, base. Newstandard is ganglia, ntp, base. [20:10:59] New patchset: Andrew Bogott; "Don't use 'standard' on magnesium." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64130 [20:13:46] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64130 [20:19:03] New patchset: Andrew Bogott; "Move standard-noexim back out of site.pp and into a role." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/64175 [20:19:19] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu May 16 20:19:11 UTC 2013 [20:19:29] PROBLEM - Puppet freshness on ms2 is CRITICAL: No successful Puppet run in the last 10 hours [20:20:25] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64175 [20:26:39] New patchset: MarkTraceur; "Add fundraising components to #wm-fundraising" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64012 [20:27:09] PROBLEM - Puppet freshness on colby is CRITICAL: No successful Puppet run in the last 10 hours [20:27:09] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [20:29:41] the Parsoid load balancing does not seem to work quite right, all requests currently seem to end up on wtp1004 [20:30:15] could a root check if the other backends are properly listed in LVS and then depool wtp1004 to see if the load shifts over? [20:31:14] Roan doesn't have time to look into it currently and suggested depooling wtp1004 [20:31:43] New review: Katie Horn; "Exciting times." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/64012 [20:34:40] !log restarting a ceph-osd osd.3 for testing [20:34:49] Logged the message, Master [20:37:28] New patchset: Akosiaris; "Puppetizing Hadoop for CDH4." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [20:37:28] New review: Hashar; "Honestly I have Zero clue. After ten years on the cluster I still do not understand our file hierarc..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/55059 [20:37:29] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100% [20:37:29] PROBLEM - Host analytics1009 is DOWN: PING CRITICAL - Packet loss = 100% [20:37:44] * Aaron|home meows quietly [20:38:22] Aaron|home: trying to debug with sage why peering takes so long [20:38:59] RECOVERY - Host analytics1012 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms [20:39:22] !log reedy synchronized php-1.22wmf4/extensions/Wikibase/ [20:39:30] Logged the message, Master [20:41:09] RECOVERY - Host analytics1009 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [20:42:08] terbium commonswiki: DescribeFileOp failed (batch #74vtxlcllqop197j0nrqxeji0ge28vv): {"op":"describe","src":"mwstore://local-ceph/local-public/3/38/En-uk-the_bill.ogg","headers":{"X-Content-Duration":1.3},"failedAction":"attempt"} [20:42:18] hmm, just one in today's log [20:42:25] New review: Akosiaris; "Fixed both errors. One was due to the test not passing datanode_mounts which defaults to undef, the ..." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [20:43:24] non yesterday, meh [20:50:00] New patchset: Reedy; "Complete the list of altUploadForm in $wgUploadWizardConfig" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64194 [20:50:27] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64194 [20:50:57] Reedy: [20:51:11] !log reedy synchronized wmf-config/ [20:51:27] Logged the message, Master [20:53:32] New review: Odder; "It. Does. Not. Work." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64194 [20:54:32] New patchset: Andrew Bogott; "Suppress RT's insistent 'cross-site request forgery' warning" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64197 [20:54:45] andrewbogott++!!!! 
[20:55:00] No promises that it works consistently [20:55:43] i was wondering [20:55:55] !log running dist-upgrade on virt cluster [20:56:03] Logged the message, Master [20:56:27] * jeremyb runs away again [20:56:57] New review: Dzahn; "http on 443" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/64197 [20:58:01] New review: Andrew Bogott; "Oh, good point" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64197 [20:59:36] New patchset: Andrew Bogott; "Suppress RT's insistent 'cross-site request forgery' warning" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64197 [21:00:10] andrewbogott: i think it's the port in a different place though (maybe both) [21:00:37] The new patch seems to actually prevent those warnings. It might be overkill. [21:00:56] where's the template for the site config again.. looking [21:01:41] Anyway… are you now fed and ready to work on email again? [21:03:33] PROBLEM - DPKG on virt2 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:04:10] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [21:04:15] andrewbogott: re.. ok.. yeah, the warning is gone [21:04:33] i see you also changed the WebPort, gotcha [21:05:30] RECOVERY - Host analytics1013 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [21:05:46] andrewbogott: why is neon trying to talk SMTP to it ? [21:06:10] no IP address found for host neon (during SMTP connection from neon.wikimedia.org (neon) [21:06:53] New patchset: Odder; "(bug 33513) Define altUploadForm in $wgUploadWizardConfig" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64200 [21:07:16] New patchset: GWicke; "Parsoid VCL refinements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64008 [21:07:46] New patchset: GWicke; "Parsoid VCL refinements" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64008 [21:08:19] yeah, I see that error about neon but don't understand it. 
[21:08:40] RECOVERY - DPKG on virt2 is OK: All packages OK [21:12:20] PROBLEM - DPKG on virt10 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:12:30] PROBLEM - DPKG on virt9 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:12:40] PROBLEM - DPKG on virt8 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:12:50] PROBLEM - DPKG on virt1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:12:50] PROBLEM - DPKG on virt11 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:13:10] PROBLEM - DPKG on virt1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:13:10] PROBLEM - DPKG on virt5 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:13:50] RECOVERY - DPKG on virt11 is OK: All packages OK [21:14:20] RECOVERY - DPKG on virt10 is OK: All packages OK [21:14:30] RECOVERY - DPKG on virt9 is OK: All packages OK [21:15:40] PROBLEM - RAID on analytics1014 is CRITICAL: Connection refused by host [21:15:50] RECOVERY - DPKG on virt1007 is OK: All packages OK [21:16:10] RECOVERY - DPKG on virt1005 is OK: All packages OK [21:16:10] RECOVERY - DPKG on virt5 is OK: All packages OK [21:16:35] binasher: now that you sat down next to me, I wanted to make sure you saw these two edits, and ask if those are A) on your radar and B) was that the right thing to do? 
https://wikitech.wikimedia.org/w/index.php?title=Schema_changes&diff=70746&oldid=69154 [21:16:50] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [21:17:13] New patchset: Andrew Bogott; "Suppress RT's insistent 'cross-site request forgery' warning" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64197 [21:17:40] RECOVERY - DPKG on virt8 is OK: All packages OK [21:19:20] RECOVERY - Host analytics1014 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [21:19:39] !log rebooting virt0 [21:19:48] Logged the message, Master [21:22:00] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [21:22:40] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:26:30] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [21:27:00] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [21:28:22] !log running accountaudit_login aa_method schema migrations on all wikis [21:28:24] pgehres: ^^ [21:28:31] Logged the message, Master [21:28:32] binasher: ty [21:28:57] greg-g: can I get a LD window to deploy the code for ^^ [21:29:29] yessir, please double check with yurik as he has some things to get out as well [21:29:34] er, pgehres ^ [21:29:39] :-) [21:31:50] binasher: https://gerrit.wikimedia.org/r/#/c/57536/3 :) [21:32:33] Aaron|home: noooooo [21:32:37] but ok [21:33:12] New patchset: Reedy; "Set wqy-zenhei.ttc for timeline on ZH projects" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64205 [21:33:35] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64205 [21:34:14] !log reedy synchronized wmf-config/CommonSettings.php [21:34:23] Logged the message, Master [21:34:50] wmf4 got messed up somehow: "fatal: Not a git repository: /home/wikipedia/common/php-1.22wmf4/.git/modules/extensions/Wikibase" [21:35:09] Oh [21:35:10] Should I blow it away and reinitialize from whatever wmf4 is 
pointing to? [21:35:13] No [21:35:17] Because you can't [21:35:36] Reedy, can you fix it? [21:35:45] i'd just ignore it tbh [21:35:59] Reedy, it won't let me rebase. [21:35:59] git submodule update --init --recursive extensions/Foo [21:36:08] mflaschen@tin:/a/common/php-1.22wmf4$ git rebase origin/wmf/1.22wmf4 [21:36:10] fatal: Not a git repository: /home/wikipedia/common/php-1.22wmf4/.git/modules/extensions/Wikibase [21:36:11] Cannot rebase: You have unstaged changes. [21:36:13] Please commit or stash them. [21:37:02] Fixed [21:37:15] damn tin not working right [21:38:59] Thanks [21:40:14] !log Started scap for E3 deployment of GettingStarted and GuidedTour [21:40:22] Logged the message, Master [21:46:07] PROBLEM - LDAP on virt1000 is CRITICAL: Connection refused [21:46:27] PROBLEM - LDAPS on virt1000 is CRITICAL: Connection refused [21:46:59] !log rebooting virt1000 [21:47:07] Logged the message, Master [21:47:31] !log mflaschen Started syncing Wikimedia installation... : E3 deployment [21:47:40] Logged the message, Master [21:48:16] !log mflaschen Finished syncing Wikimedia installation... : E3 deployment [21:48:24] Logged the message, Master [21:48:27] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.154.19) [21:48:37] !log rebooting virt1005 and virt1007 [21:48:45] Logged the message, Master [21:48:57] PROBLEM - Host virt1000 is DOWN: CRITICAL - Host Unreachable (208.80.154.18) [21:49:16] !log Scap failed due to wrong ssh configuration [21:49:25] Logged the message, Master [21:49:27] Darn it, I must have done the forwarding wrong. [21:49:32] I have a bunch of permission denied. [21:49:56] I don't know what I did wrong. I have ssh -A. [21:49:57] PROBLEM - Host virt1005 is DOWN: PING CRITICAL - Packet loss = 100% [21:50:17] PROBLEM - Host virt1007 is DOWN: PING CRITICAL - Packet loss = 100% [21:50:24] superm401: check that ssh-agent -L lists the key (e.g. 
that your keys are loaded into the ssh agent) [21:50:42] Thanks, that might have been it. [21:50:58] Sorry, we're probably going to exceed the window a little. [21:51:27] RECOVERY - LDAPS on virt1000 is OK: TCP OK - 0.000 second response time on port 636 [21:51:37] RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [21:52:07] RECOVERY - LDAP on virt1000 is OK: TCP OK - 0.000 second response time on port 389 [21:52:07] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [21:52:43] Scapping again. [21:52:51] BTW, it is ssh-add -l, not ssh-agent [21:53:06] !log mflaschen Started syncing Wikimedia installation... : E3 scap for deployment of GettingStarted and GuidedTour [21:53:16] Logged the message, Master [21:53:27] RECOVERY - Disk space on virt1005 is OK: DISK OK [21:53:37] RECOVERY - Host virt1005 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [21:53:37] RECOVERY - Host virt1007 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [21:59:35] New patchset: Reedy; "FlaggedRevs for RU Wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64209 [22:00:02] !log mflaschen Finished syncing Wikimedia installation... : E3 scap for deployment of GettingStarted and GuidedTour [22:00:11] Logged the message, Master [22:00:14] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64209 [22:01:19] !log reedy synchronized wmf-config/flaggedrevs.php [22:01:27] Logged the message, Master [22:01:27] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [22:01:27] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [22:01:27] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [22:04:03] pgehres, greg-g - i won't be able to push anything out until much later today, so please go ahead. 
Also, please let me know if initialSettings file gets pushed out - this way i won't have to push it out again :) [22:08:25] PROBLEM - NTP on virt1005 is CRITICAL: NTP CRITICAL: Offset unknown [22:10:07] New patchset: Reedy; "Set $wgRedirectSources on donatewiki and foundationwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64211 [22:10:32] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64211 [22:11:12] !log reedy synchronized wmf-config/InitialiseSettings.php [22:11:21] Logged the message, Master [22:12:25] RECOVERY - NTP on virt1005 is OK: NTP OK: Offset 0.001159310341 secs [22:14:55] PROBLEM - Puppet freshness on gallium is CRITICAL: No successful Puppet run in the last 10 hours [22:15:55] PROBLEM - Puppet freshness on db1017 is CRITICAL: No successful Puppet run in the last 10 hours [22:18:42] andrewbogott, mutante: what's up with rt4? need any help? [22:19:28] paravoid, we were trying to demonstrate that the new install can receive email before we shut off the old one... [22:19:32] And, so far no dice. [22:20:23] Our best effort is installed on magnesium, but it's unclear how to proceed. [22:21:50] So, yeah, we need help, but I don't know what kind of help :) [22:23:02] New patchset: Reedy; "Weekly feeds for frwikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23112 [22:23:15] PROBLEM - RAID on virt8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:23:16] Reedy, someone broke the favicon for wmfwiki https://wikimediafoundation.org/wiki/Home it's now the wikipedia one [22:23:50] oh fuck [22:24:05] RECOVERY - RAID on virt8 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0 [22:25:12] New patchset: Reedy; "Weekly feeds for frwikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23112 [22:26:18] New patchset: Reedy; "Weekly feeds for frwikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23112 [22:26:37] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23112 [22:26:48] odder: ? [22:27:03] Thehelpfulone: I broke that. [22:27:10] :D [22:27:18] heh when you were fixing the other ones? [22:27:21] !log reedy synchronized wmf-config/ [22:27:29] Logged the message, Master [22:28:07] I guess foundationwiki is underwikipedia somewhere [22:28:10] binasher: status on the sql updates for accountaudit? [22:28:50] !log mwalker synchronized php-1.22wmf4/extensions/CentralNotice/ 'Updating CentralNotice to master' [22:28:53] * Reedy fixes [22:28:58] Logged the message, Master [22:29:06] I'm on it too :) [22:29:21] !log reedy synchronized wmf-config/InitialiseSettings.php [22:29:22] i win [22:29:22] ah... help? using tin to sync-dir and I just got a "mw1173: ssh: connect to host mw1173 port 22: Connection timed out" and then the script died... [22:29:26] Logged the message, Master [22:29:32] mwalker: Likely fine [22:29:41] Very likely the machine was at the end of the list [22:29:52] TimStarling: so the squid conf in /home/wikipedia/conf ... is that on tin? 
[22:29:55] https://gerrit.wikimedia.org/r/#/c/63877/2/wmf-config/InitialiseSettings.php was that bad idea [22:29:57] New patchset: Reedy; "Fix foundationwiki favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64218 [22:30:07] no, it's in the same place it always was [22:30:09] Reedy: not only foundationwiki [22:30:11] * Ryan_Lane sighs [22:30:19] well, virt0 is rebooting [22:30:20] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64218 [22:30:26] same as all the other configuration in that directory [22:30:31] pybal etc. [22:30:40] OS_TENANT_NAME=ganglia reboot aggregator2 != OS_TENANT_NAME=ganglia nova reboot aggregator2 [22:31:04] Reedy: guess so... looks like my change was applied... [22:31:15] mw1173 has been down for a while, that's a normal message [22:31:55] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [22:32:35] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [22:33:15] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [22:33:55] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [22:35:46] New patchset: Reedy; "Change in sqwiki for FlaggedRevs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64220 [22:37:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64220 [22:38:15] !log reedy synchronized wmf-config/flaggedrevs.php [22:38:24] Logged the message, Master [22:54:52] New patchset: Reedy; "Add all apache config files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64223 [22:55:14] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64223 [22:56:18] New patchset: Reedy; "List apache repo first to match file order" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64224 [22:56:35] Change merged: Reedy; 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64224 [22:59:54] RECOVERY - mysqld processes on db1059 is OK: PROCS OK: 1 process with command name mysqld [22:59:54] RECOVERY - DPKG on db1059 is OK: All packages OK [22:59:54] RECOVERY - MySQL Slave Running on db1059 is OK: OK replication [23:00:04] RECOVERY - Disk space on db1059 is OK: DISK OK [23:00:14] RECOVERY - Full LVS Snapshot on db1059 is OK: OK no full LVM snapshot volumes [23:00:14] RECOVERY - RAID on db1059 is OK: OK: State is Optimal, checked 2 logical device(s) [23:00:24] RECOVERY - MySQL disk space on db1059 is OK: DISK OK [23:00:24] RECOVERY - MySQL Recent Restart on db1059 is OK: OK seconds since restart [23:00:34] RECOVERY - MySQL Replication Heartbeat on db1059 is OK: OK replication delay seconds [23:00:34] RECOVERY - MySQL Slave Delay on db1059 is OK: OK replication delay seconds [23:00:34] RECOVERY - MySQL Idle Transactions on db1059 is OK: OK longest blocking idle transaction sleeps for seconds [23:01:47] New patchset: Reedy; "Make be-x-old.wikisource.org point to be.wikisource.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/64225 [23:02:16] New patchset: Reedy; "Make be-x-old.wikisource.org point to be.wikisource.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/64225 [23:06:25] PROBLEM - RAID on analytics1015 is CRITICAL: Timeout while attempting connection [23:06:43] Anyone know where the redirect from be-x-old.wikisource -> wikisource.org actually is? [23:06:58] reedy@fenari:/home/wikipedia/common/docroot/noc$ curl -I http://be-x-old.wikisource.org [23:06:58] HTTP/1.1 302 Moved Temporarily [23:06:59] ... [23:07:05] Location: http://wikisource.org/wiki/ [23:08:05] PROBLEM - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100% [23:08:14] didn't you just patch it on apache-config? 
[23:08:40] No, that's changing where it goes [23:08:56] I've no idea what controls the current behavior (noting 64225 isn't merged/live) [23:09:35] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [23:12:25] ServerAlias *.wikisource.org [23:12:27] 404 handler? [23:15:12] 302s are only obviously used on the upload rewrites.. [23:16:32] it's mediawiki [23:16:39] no idea what in mediawiki specifically [23:20:15] Is my fix sufficient as the redirect should've already happened before it actually hits mw? [23:21:15] should be, yes [23:23:34] Change abandoned: Reedy; "Bug WONTFIXed" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48444 [23:31:06] New patchset: Reedy; "Display session-labs too" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64227 [23:31:21] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64227 [23:33:09] New patchset: Reedy; "(bug 33513) Define altUploadForm in $wgUploadWizardConfig" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64200 [23:33:31] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64200 [23:34:23] !log reedy synchronized wmf-config/InitialiseSettings.php [23:34:32] Logged the message, Master [23:37:16] thanks Reedy, I'm sure Commonsers will be happy now :) [23:38:26] New patchset: Reedy; "Move VHosts config from wgConf to seperate files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64229 [23:41:48] New patchset: Reedy; "Update size dblists" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64231 [23:42:13] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/64231 [23:42:55] !log reedy synchronized database lists files: [23:43:04] Logged the message, Master [23:45:08] !log pgehres synchronized php-1.22wmf4/extensions/AccountAudit/ 'Updating AccountAudit to 
master. Adding aa_method' [23:45:17] Logged the message, Master [23:47:22] !log pgehres synchronized php-1.22wmf3/extensions/AccountAudit/ 'Updating AccountAudit to master. Adding aa_method' [23:47:30] Logged the message, Master [23:54:48] New patchset: Asher; "adding accountaudit_login to list of private tables" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64232 [23:55:36] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64232