[00:02:45] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 00:02:37 UTC 2013 [00:03:45] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [00:17:54] Ryan_Lane: see the logo on this page: http://en.wikipedia.org/wiki/Girl_Guides_Association_of_the_United_Arab_Emirates [00:18:03] Ryan_Lane: if you click it, the logo claims it is not in use on any pages [00:18:20] ok? [00:18:37] this seems like something that should be a bugzilla bug, rather than an ops issue [00:18:48] I'm lazy, now you know [00:19:02] and I'm lazy too, so now no one else knows [00:19:23] I honestly don't know what it's supposed to do, so I'm not going to enter a bug [00:19:37] well I'm going to fix it with a null edit now [00:19:41] but there may be some issue there [00:19:45] enter a bug [00:21:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [00:24:56] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 00:24:50 UTC 2013 [00:25:55] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [00:27:47] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 00:27:41 UTC 2013 [00:28:35] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [00:29:05] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 00:28:56 UTC 2013 [00:29:45] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [00:32:45] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 00:32:40 UTC 2013 [00:33:45] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [00:36:57] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [00:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.145 second response time [00:54:45] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 00:54:41 UTC 2013 [00:54:55] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [00:57:45] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 00:57:38 UTC 2013 [00:58:35] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [00:58:55] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 00:58:45 UTC 2013 [00:59:45] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [01:02:45] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 01:02:43 UTC 2013 [01:03:45] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [01:04:14] (PS2) Ryan Lane: Use grains for deployment targets [operations/puppet] - https://gerrit.wikimedia.org/r/74108 [01:04:51] (CR) jenkins-bot: [V: -1] Use grains for deployment targets [operations/puppet] - https://gerrit.wikimedia.org/r/74108 (owner: Ryan Lane) [01:07:14] (PS3) Ryan Lane: Use grains for deployment targets [operations/puppet] - https://gerrit.wikimedia.org/r/74108 [01:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL 
- Socket timeout after 10 seconds [01:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [01:24:55] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 01:24:45 UTC 2013 [01:24:55] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [01:28:15] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 01:28:05 UTC 2013 [01:28:35] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [01:28:55] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 01:28:45 UTC 2013 [01:29:45] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [01:32:45] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 01:32:44 UTC 2013 [01:33:45] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [01:54:55] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 01:54:46 UTC 2013 [01:55:55] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [01:58:15] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 01:58:02 UTC 2013 [01:58:35] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [01:58:46] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 01:58:43 UTC 2013 [01:59:45] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [02:02:22] (PS2) Ottomata: Fixing automated hue SSL generation and permissions [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 [02:02:36] (PS4) Ryan Lane: Use grains for deployment targets [operations/puppet] - https://gerrit.wikimedia.org/r/74108 [02:02:45] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 02:02:42 UTC 2013 [02:03:45] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [02:04:30] (CR) Ottomata: "Ergh, had to do some hacky puppet things to make that happen. Check it out." 
[operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 (owner: Ottomata) [02:06:01] !log LocalisationUpdate completed (1.22wmf10) at Mon Jul 22 02:06:01 UTC 2013 [02:06:12] Logged the message, Master [02:10:17] !log LocalisationUpdate completed (1.22wmf11) at Mon Jul 22 02:10:16 UTC 2013 [02:10:27] Logged the message, Master [02:17:56] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jul 22 02:17:55 UTC 2013 [02:18:06] Logged the message, Master [02:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [02:24:55] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 02:24:51 UTC 2013 [02:25:58] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [02:29:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 02:28:56 UTC 2013 [02:29:12] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 02:29:02 UTC 2013 [02:29:25] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [02:29:25] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [02:29:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:33:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.155 second response time [02:34:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 02:34:43 UTC 2013 [02:35:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [02:55:02] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 02:55:00 UTC 2013 [02:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [02:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 02:57:37 UTC 2013 [02:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [02:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 02:58:49 UTC 2013 [02:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [03:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 03:02:39 UTC 2013 [03:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [03:25:02] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 03:24:53 UTC 2013 [03:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [03:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 03:28:46 UTC 2013 [03:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [03:30:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 03:30:34 UTC 2013 [03:31:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [03:32:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 03:32:40 UTC 2013 [03:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [03:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 03:54:45 UTC 2013 
[03:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [03:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 03:58:49 UTC 2013 [03:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [03:59:52] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 03:59:50 UTC 2013 [04:00:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [04:03:02] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 04:02:59 UTC 2013 [04:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [04:25:02] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 04:24:53 UTC 2013 [04:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [04:27:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 04:27:36 UTC 2013 [04:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [04:29:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 04:28:52 UTC 2013 [04:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [04:32:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 04:32:46 UTC 2013 [04:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [04:39:22] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [04:46:42] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:47:42] RECOVERY - Disk space on labstore3 is OK: DISK OK [04:49:42] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:50:33] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [04:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 04:54:48 UTC 2013 [04:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [04:58:22] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 04:58:16 UTC 2013 [04:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [04:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 04:58:46 UTC 2013 [04:59:23] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [05:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 05:02:41 UTC 2013 [05:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [05:05:22] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: No successful Puppet run in the last 10 hours [05:09:22] PROBLEM - Puppet freshness on analytics1018 is CRITICAL: No successful Puppet run in the last 10 hours [05:10:22] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: No successful Puppet run in the last 10 hours [05:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 05:24:49 UTC 2013 [05:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [05:28:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 05:28:11 UTC 2013 [05:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [05:29:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 05:28:57 UTC 2013 [05:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [05:32:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 05:32:47 UTC 2013 [05:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [05:39:42] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:40:32] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [05:45:22] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [05:45:22] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [05:45:22] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [05:45:23] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [05:45:23] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [05:45:23] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [05:45:23] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [05:50:22] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours [05:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 05:54:48 UTC 2013 [05:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [05:56:22] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours [05:58:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 05:58:07 UTC 2013 [05:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [05:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 05:58:48 UTC 2013 [05:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [06:01:22] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours [06:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 06:02:39 UTC 2013 [06:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [06:17:22] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours [06:18:22] PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 10 hours [06:19:39] (PS4) Faidon: (power)dns: support multiple listen addresses [operations/puppet] - https://gerrit.wikimedia.org/r/74615 [06:20:23] (CR) Faidon: [C: 2] (power)dns: support multiple listen addresses [operations/puppet] - https://gerrit.wikimedia.org/r/74615 (owner: Faidon) [06:20:24] (Merged) Faidon: (power)dns: support multiple listen addresses [operations/puppet] - https://gerrit.wikimedia.org/r/74615 (owner: Faidon) [06:23:05] grr puppet broken [06:26:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 06:26:45 UTC 2013 [06:27:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [06:27:52] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 06:27:42 UTC 2013 [06:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [06:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 06:28:44 UTC 2013 [06:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [06:31:19] ffs [06:32:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 06:32:39 UTC 2013 [06:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [06:53:16] (PS1) Faidon: 
Workaround fallout from sysctlfile [operations/puppet] - https://gerrit.wikimedia.org/r/75065 [06:54:45] (CR) Faidon: [C: 2] "Ihatemyself" [operations/puppet] - https://gerrit.wikimedia.org/r/75065 (owner: Faidon) [06:54:46] (Merged) Faidon: Workaround fallout from sysctlfile [operations/puppet] - https://gerrit.wikimedia.org/r/75065 (owner: Faidon) [06:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 06:54:46 UTC 2013 [06:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [06:58:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:58:52] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 06:58:45 UTC 2013 [06:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 06:58:50 UTC 2013 [06:59:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [06:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [06:59:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.139 second response time [07:01:02] RECOVERY - Puppet freshness on dobson is OK: puppet ran at Mon Jul 22 07:00:58 UTC 2013 [07:02:52] (PS1) Faidon: Undecom cp104[1234] [operations/puppet] - https://gerrit.wikimedia.org/r/75067 [07:02:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 07:02:44 UTC 2013 [07:03:13] (CR) Faidon: [C: 2] Undecom cp104[1234] [operations/puppet] - https://gerrit.wikimedia.org/r/75067 (owner: Faidon) [07:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [07:03:30] (Merged) Faidon: Undecom cp104[1234] [operations/puppet] - https://gerrit.wikimedia.org/r/75067 (owner: Faidon) [07:17:54] !log restarting pybal and manually ipvsadm removing dns_auth services from lvs1/lvs5/lvs1002/lvs1005 [07:18:05] Logged the message, Master [07:19:51] (PS1) Jalexander: Make FlaggedRev rights available to global groups [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75070 [07:22:02] RECOVERY - Puppet freshness on mchenry is OK: puppet ran at Mon Jul 22 07:21:55 UTC 2013 [07:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 07:24:47 UTC 2013 [07:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [07:28:58] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 07:28:50 UTC 2013 [07:28:58] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 07:28:50 UTC 2013 [07:29:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [07:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [07:31:18] (PS1) Faidon: Add new ns0/ns1 service IPs to dobson & linne [operations/puppet] - https://gerrit.wikimedia.org/r/75071 [07:32:40] morning [07:32:42] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:32:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 07:32:41 UTC 2013 [07:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [07:33:33] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [07:34:05] good morning [07:38:37] (CR) Faidon: [C: 2] Add new ns0/ns1 service IPs to dobson & linne [operations/puppet] - https://gerrit.wikimedia.org/r/75071 (owner: Faidon) [07:38:38] (Merged) Faidon: Add new ns0/ns1 service IPs to dobson & linne [operations/puppet] - https://gerrit.wikimedia.org/r/75071 (owner: Faidon) [07:42:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:43:52] RECOVERY - Disk space on labstore3 is OK: DISK OK [07:44:59] (PS2) Hashar: set some paths to use $wmfHostnames['bits'] [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/71774 [07:46:17] paravoid: i should have caught that in the sysctlfile module [07:46:38] the fact that init.pp is a resource and not a class is also a bit wtf. [07:48:55] (CR) Addshore: [C: 1] Move property-create for * to after loading of Wikibase [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74620 (owner: Aude) [07:49:10] the whole thing is pretty crazy [07:49:27] base.pp defining a file with source => module/sysctlfile/... [07:49:31] etc. [07:49:38] for some definition of module [07:53:52] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:54:42] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [07:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 07:54:44 UTC 2013 [07:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [07:56:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:57:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [07:57:42] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
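A rough sketch of the layout criticized above (base.pp declaring a file whose source reaches into the sysctlfile module) next to the usual module-encapsulated form; the class name, file name and mode here are illustrative guesses, not the actual contents of base.pp or the sysctlfile module:

    # Criticized pattern: a site-wide manifest (e.g. base.pp) owns the resource
    # but pulls its payload out of another module's files/ directory.
    file { '/etc/sysctl.d/60-wikimedia-base.conf':
        ensure => present,
        owner  => 'root',
        group  => 'root',
        mode   => '0444',
        source => 'puppet:///modules/sysctlfile/60-wikimedia-base.conf',
    }

    # Conventional alternative: the module declares its own resources and
    # callers simply include the class.
    class sysctlfile::defaults {
        file { '/etc/sysctl.d/60-wikimedia-base.conf':
            ensure => present,
            owner  => 'root',
            group  => 'root',
            mode   => '0444',
            source => 'puppet:///modules/sysctlfile/60-wikimedia-base.conf',
        }
    }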
[07:57:52] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 07:57:42 UTC 2013 [07:57:52] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [07:58:42] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [07:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 07:58:48 UTC 2013 [07:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [08:00:32] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [08:01:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:02:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [08:03:22] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 08:03:12 UTC 2013 [08:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [08:08:32] PROBLEM - search indices - check lucene status page on search1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 189 bytes in 0.002 second response time [08:08:33] I have not received any bugmail from bugzilla.wikimedia.org since 22:49UTC. Is some mail infrastructure down? [08:09:48] (as I do see changes in Bugzilla after 22:49 when I query it) [08:10:59] (PS1) Hashar: fix system_role for role::protoproxy::ssl::beta [operations/puppet] - https://gerrit.wikimedia.org/r/75074 [08:11:45] (PS1) Eloquence: Disable "Mark as helpful" extension on English Wikipedia. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75075 [08:12:10] andre__: I have no clue [08:13:21] I cannot trigger a new bugmail either by commenting right now. Checked on gmail.com so it's not my local MUA. So I expect something is broken :-/ [08:13:44] I can't lookup the mail queue on bugzilla server :( [08:13:49] pity :) [08:14:08] I wonder if apergos could. Or anybody else in European timezones [08:14:21] any root could :) [08:14:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:14:45] could ? [08:14:47] yeah, but they need to be awake :P [08:14:53] bugzilla is not sending emails [08:15:01] apergos, see backlog here [08:15:01] oh [08:15:03] apergos: so maybe kaulen.wikimedia.org has some troubles to send emails [08:15:54] ah I managed to send myself an email :-] [08:15:57] using 'mail' command [08:16:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.139 second response time [08:17:10] I see an aklapper message having been processed in the log [08:17:43] 2013-07-22 08:16:09 (utc) [08:17:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:18:22] Uhm. Maybe the problem is with GMail then [08:18:52] RECOVERY - Disk space on labstore3 is OK: DISK OK [08:20:05] ...which is also unlikely. 
Guess I should reboot my machine, though that still wouldn't explain why gmail.com in my browser does not show any bugmail either [08:20:26] well it would be easy to check if it's gmail or not [08:20:29] (I don't have gmail) [08:20:55] plus I can receive other "normal" email perfectly via my work account (which is gmail) [08:21:05] like mailing lists or private mail [08:23:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:24:27] * apergos looks irritatedly at the labstore3 alert [08:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 08:24:46 UTC 2013 [08:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [08:26:22] 2013-07-22 08:21:38 1V1BNC-0001aR-1f => aklapper@wikimedia.org R=smart_route T=remote_smtp S=2744 H=mchenry.wikimedia.org [2620:0:860:2:219:b9ff:fedd:c027] C="250 OK id=1V1BNC-0008Mh-5z" DT=0s [08:26:27] 2013-07-22 08:21:38 1V1BNC-0008Mh-5z => aklapper@wikimedia.org R=ldap_account T=remote_smtp S=3017 H=aspmx.l.google.com [2607:f8b0:400d:c02::1a] C="250 2.0.0 OK 1374481298 q4si10939151qag.112 - gsmtp" DT=0s [08:26:46] delivered to google [08:27:03] check your spam folder [08:27:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 08:27:38 UTC 2013 [08:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [08:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 08:28:45 UTC 2013 [08:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [08:29:58] RECOVERY - Disk space on labstore3 is OK: DISK OK [08:32:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 08:32:44 UTC 2013 [08:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [08:33:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:35:52] RECOVERY - Disk space on labstore3 is OK: DISK OK [08:37:16] damn, it's really just my work account. I do receive bugmail for my testing account. [08:37:24] * andre__ totally puzzled [08:37:55] my work account is still set as globalwatcher in Bugzilla, and my email preferences are as usual [08:38:18] oh fuck. GMail spam folder. [08:38:42] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:38:43] that's the solution. apergos: sorry for the noise, got all bugmail in my gmail spam folder, but no idea yet why [08:39:00] * andre__ grumbles [08:39:00] okey dokey [08:39:27] 11:27 < paravoid> check your spam folder [08:39:29] :) [08:39:42] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [08:41:02] still wondering what happened. Gmail, the usual mystery. [08:42:42] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:43:32] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [08:51:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:54:32] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:54:42] PROBLEM - DPKG on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 08:54:45 UTC 2013 [08:54:52] PROBLEM - SSH on labstore3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:55:23] RECOVERY - RAID on labstore3 is OK: OK: State is Optimal, checked 1 logical device(s) [08:55:32] RECOVERY - DPKG on labstore3 is OK: All packages OK [08:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [08:55:42] RECOVERY - SSH on labstore3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [08:56:52] RECOVERY - Disk space on labstore3 is OK: DISK OK [08:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 08:57:37 UTC 2013 [08:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [08:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 08:58:47 UTC 2013 [08:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [09:00:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:01:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time [09:04:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 09:04:51 UTC 2013 [09:05:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [09:14:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:15:52] RECOVERY - Disk space on labstore3 is OK: DISK OK [09:22:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 09:24:46 UTC 2013 [09:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [09:25:47] !log adding new ns0/ns1 service ip static routes to dobson/linne on cr1-sdtpa/cr2-pmtpa [09:25:57] Logged the message, Master [09:27:43] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 09:27:38 UTC 2013 [09:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [09:28:52] RECOVERY - Disk space on labstore3 is OK: DISK OK [09:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 09:28:50 UTC 2013 [09:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [09:32:12] (PS1) Hashar: beta: set $wg.*Server for loginwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75080 [09:32:40] (CR) Hashar: [C: 2] beta: set $wg.*Server for loginwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75080 (owner: Hashar) [09:32:43] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 09:32:38 UTC 2013 [09:32:48] (Merged) jenkins-bot: beta: set $wg.*Server for loginwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75080 (owner: Hashar) [09:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [09:49:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:51:52] RECOVERY - Disk space on labstore3 is OK: DISK OK [09:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 09:54:49 UTC 2013 [09:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [09:56:32] PROBLEM - Varnish HTTP mobile-frontend on cp1046 is CRITICAL: HTTP CRITICAL - No data received from host [09:57:32] RECOVERY - Varnish HTTP mobile-frontend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.002 second response time [09:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 09:57:41 UTC 2013 [09:57:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [09:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 09:58:46 UTC 2013 [09:58:52] RECOVERY - Disk space on labstore3 is OK: DISK OK [09:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [10:02:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 10:02:45 UTC 2013 [10:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [10:07:42] PROBLEM - Varnish HTTP mobile-frontend on cp1046 is CRITICAL: HTTP CRITICAL - No data received from host [10:07:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:08:32] RECOVERY - Varnish HTTP mobile-frontend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.005 second response time [10:20:45] PROBLEM - Varnish HTTP mobile-frontend on cp1046 is CRITICAL: HTTP CRITICAL - No data received from host [10:20:52] RECOVERY - Disk space on labstore3 is OK: DISK OK [10:21:33] RECOVERY - Varnish HTTP mobile-frontend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 2.134 second response time [10:24:37] (PS1) Hashar: varnish: backends trust 127.0.0.1 for XFF [operations/puppet] - https://gerrit.wikimedia.org/r/75085 [10:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 10:24:44 UTC 2013 [10:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [10:27:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:28:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 10:27:55 UTC 2013 [10:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [10:28:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [10:28:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 10:28:51 UTC 2013 [10:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [10:29:42] PROBLEM - Varnish HTTP mobile-frontend on cp1046 is CRITICAL: HTTP CRITICAL - No data received from host [10:30:32] RECOVERY - Varnish HTTP mobile-frontend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 3.708 second response time [10:30:52] RECOVERY - Disk space on labstore3 is OK: DISK OK [10:31:55] i am tired [10:32:16] the nginx/varnish/mediawiki X-Forwarded-Proto stuff is giving me headaches [10:32:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 10:32:35 UTC 2013 [10:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [10:33:27] (CR) Hashar: "I did it manually on the instance that does not fix the issue :-(" [operations/puppet] - https://gerrit.wikimedia.org/r/75085 (owner: Hashar) [10:35:42] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:36:32] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [10:36:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:37:23] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [10:39:52] RECOVERY - Disk space on labstore3 is OK: DISK OK [10:44:05] (PS2) Hashar: varnish: backends trust 127.0.0.1 for XFF [operations/puppet] - https://gerrit.wikimedia.org/r/75085 [10:45:04] (CR) Hashar: "I have edited the accesslist on deployment-cache-text1.pmtpa.wmflabs to include 127.0.0.0/8, that let us access https://login.wikimedia.be" [operations/puppet] - https://gerrit.wikimedia.org/r/75085 (owner: Hashar) [10:45:33] PROBLEM - Varnish HTTP mobile-frontend on cp1046 is CRITICAL: HTTP CRITICAL - No data received from host [10:46:33] RECOVERY - Varnish HTTP mobile-frontend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.003 second response time [10:50:42] PROBLEM - Varnish HTTP mobile-frontend on cp1046 is CRITICAL: HTTP CRITICAL - No data received from host [10:50:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:51:36] paravoid: https://gerrit.wikimedia.org/r/#/c/75087/ [10:51:42] RECOVERY - Varnish HTTP mobile-frontend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 8.762 second response time [10:51:49] i may have gotten slightly carried away [10:52:22] (PS1) Ori.livneh: Refactor sysctl [operations/puppet] - https://gerrit.wikimedia.org/r/75087 [10:52:37] even grrrit-wm couldn't handle it [10:52:58] ori-l: I blame toollabs [10:53:12] which seems down atm [10:54:03] i'd like to believe it was overeager to review my changeset [10:54:10] but to each his own theory, you know. [10:54:20] hmm, grrrit-wm should be made to review changes. 
[10:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 10:54:48 UTC 2013
[10:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours
[10:56:36] ori-l: oh wow
[10:56:42] PROBLEM - Varnish HTTP mobile-frontend on cp1046 is CRITICAL: HTTP CRITICAL - No data received from host
[10:56:49] actually I have a completely different take tbh
[10:57:00] first of all role::sysctl sounds wrong
[10:57:19] setting a sysctl value is not a role
[10:57:32] yeah, i thought about that, probably true
[10:57:38] I would actually put sysctl calls inside role classes
[10:57:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:57:52] and get rid of all those "advanced-routing" files or whatever
[10:57:52] RECOVERY - Disk space on labstore3 is OK: DISK OK
[10:58:52] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 10:58:46 UTC 2013
[10:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 10:58:51 UTC 2013
[10:58:56] just inline them, i.e. sysctl { 'net.ipv6.conf.all.accept_ra': value => '0' }
[10:59:10] well, in the interest of not making a complicated change more complicated, i just reproduced the pattern that already existed in each manifest
[10:59:17] i did that for ceph, for example, since that's what you had in place
[10:59:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[10:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours
[10:59:42] RECOVERY - Varnish HTTP mobile-frontend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 8.464 second response time
[10:59:51] right
[11:00:04] but I think the issues are more fundamental
[11:00:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[11:00:40] andrew put those options in the sysctlfile module, which isn't pretty; you put them in role classes, which is wrong
[11:00:58] !log restarting Jenkins, some threads are deadlocked ( see {{bug|51802}} )
[11:01:01] so maybe, *maybe* the answer is that these belong in the individual role classes
[11:01:07] Logged the message, Master
[11:01:21] i think i agree, but if you split it into two change sets the job of reviewing it is easier, since the diff now more or less maintains a 1:1 line mapping
[11:01:45] btw, sysctlfile was merged on friday or so
[11:02:00] funny how it's being reworked twice in two business days :)
[11:02:26] maybe that's why I'm reluctant on another incremental change
[11:02:34] but maybe you're right too
[11:03:12] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 11:03:05 UTC 2013
[11:03:16] i don't mind going all the way, it just seems more likely that a mistake will slip through and cause headaches
[11:03:21] hm
[11:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[11:03:33] what about https://gerrit.wikimedia.org/r/#/c/75087/1/modules/sysctl/files/procps-puppet.conf & the recursive dir management?
[11:03:34] I also don't like the puppet-managed thingy too much
[11:03:37] heh
[11:03:38] heh
[11:04:03] let's just recurse/purge => true /etc/sysctl.d?
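A minimal sketch of the inline pattern suggested above ("just inline them"); the define name sysctl::parameter, the 60- filename prefix and the reload exec are assumptions for illustration, not the contents of the change under review:

    # Hypothetical sysctl definition, usable directly from role classes.
    # Each parameter gets its own file under /etc/sysctl.d, so procps re-applies
    # the value at boot, and a refreshonly exec applies it immediately on change.
    define sysctl::parameter($value, $ensure = 'present') {
        $sanitized = regsubst($title, '[^0-9A-Za-z._-]', '_', 'G')

        file { "/etc/sysctl.d/60-${sanitized}.conf":
            ensure  => $ensure,
            owner   => 'root',
            group   => 'root',
            mode    => '0444',
            content => "${title} = ${value}\n",
            notify  => Exec["apply-sysctl-${title}"],
        }

        exec { "apply-sysctl-${title}":
            command     => "/sbin/sysctl -w ${title}=${value}",
            refreshonly => true,
        }
    }

    # Inline use from a role class, per the suggestion:
    # sysctl::parameter { 'net.ipv6.conf.all.accept_ra': value => '0' }

One file per parameter keeps puppet's diffs small and lets an individual value be removed with ensure => absent without touching the rest of /etc/sysctl.d.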
[11:04:22] I mean we don't really do non-puppetized configs for such things
[11:04:26] but ubuntu ships with some conf files there
[11:05:09] and iirc sysctl values aren't sticky, so it's relying on the files being continuously there and the procps job re-setting them on boot
[11:05:11] that's correct
[11:05:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:06:15] This directory contains settings similar to those found in /etc/sysctl.conf.
[11:06:15] what if an ubuntu / debian package update add a file to that directory? the next puppet run would purge it and trigger a refresh
[11:06:18] In general, files in the 10-*.conf range come from the procps package and
[11:06:21] serve as system defaults. Other packages install their files in the
[11:06:24] 30-*.conf range, to override system defaults. End-users can use 60-*.conf
[11:06:27] and above, or use /etc/sysctl.conf directly, which overrides anything in
[11:06:30] this directory.
[11:06:32] that's an ubuntu-ism
[11:06:36] I don't think Debian does that
[11:07:02] nope, it doesn't
[11:07:27] (CR) jenkins-bot: [V: -1] varnish: backends trust 127.0.0.1 for XFF [operations/puppet] - https://gerrit.wikimedia.org/r/75085 (owner: Hashar)
[11:08:07] I wonder how can we purge => true only 60-* :)
[11:09:02] with execs, but that's a bit gross
[11:09:50] with execs how?
[11:09:56] and i don't think you'd be able to reproduce the desired effect of triggering a single refresh for multiple updates after all of them have completed
[11:10:21] (CR) jenkins-bot: [V: -1] Refactor sysctl [operations/puppet] - https://gerrit.wikimedia.org/r/75087 (owner: Ori.livneh)
[11:10:23] yeah, scratch that, wouldn't work.
[11:10:42] PROBLEM - Varnish HTTP mobile-frontend on cp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:11:47] oh puppet...
[11:11:51] even the simplest things... :)
[11:11:53] i liked it because it allows you to have the ancillary service that runs on stopping procps
[11:12:09] which tolerates there not being a procps service
[11:12:39] and means that our puppet sysctl settings always run after any defaults
[11:12:43] not having a procps service is really <= 8.04
[11:12:51] which is EOLed now
[11:13:18] but this could be mitigated even with puppet
[11:13:44] make the resources virtual, have a class sysctl that realizes them and include that conditional todistribution in base.pp
[11:14:31] i went with https://dpaste.de/1Jk85/raw/
[11:14:37] that was my first attempt at the problem
[11:15:02] i.e. define sysctl::param(...) { @file { ..., tag => 'sysctl' } }; class sysctl { File <| tag == 'sysctl' |> }
[11:15:42] RECOVERY - Varnish HTTP mobile-frontend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 6.261 second response time
[11:15:51] that's nicer and cleaner than what i pasted above, but it isn't worth adding these kinds of abstractions to work around an edge case introduced by a platform that is moribund
[11:15:52] RECOVERY - Disk space on labstore3 is OK: DISK OK
[11:16:15] true
[11:16:58] note that it breaks with 8.04 anyway
[11:17:02] sysctl.d doesn't exist
[11:17:12] i ensure => directory'd it
[11:17:14] even if you shortcut procps to /bin/true (as I did)
[11:17:16] ah
[11:18:16] hm, importing ubuntu's sysctl defaults into our module and exclusively managing /etc/sysctl.d sounds wrong, doesn't it?
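Expanded a little, the virtual-resource variant quoted above at 11:15:02 could look roughly like this; everything beyond that one-liner (file names, mode, the directory resource) is an assumption added for illustration:

    # Each parameter declares a virtual, tagged file; nothing is managed until
    # a single collection point realizes it.
    define sysctl::param($value) {
        @file { "/etc/sysctl.d/60-${name}.conf":
            ensure  => present,
            owner   => 'root',
            group   => 'root',
            mode    => '0444',
            content => "${name} = ${value}\n",
            tag     => 'sysctl',
        }
    }

    # The collector lives in one place (e.g. included from base.pp), so any
    # distribution conditional (such as skipping releases without /etc/sysctl.d,
    # like 8.04) is applied once rather than by every caller.
    class sysctl {
        file { '/etc/sysctl.d':
            ensure => directory,
        }
        File <| tag == 'sysctl' |>
    }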
[11:18:20] just thinking loud here :) [11:18:32] there's also random sysctl that packages may ship too [11:18:52] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:19:01] !log jenkins restarted [11:19:05] which is, in Debian: [11:19:06] corekeeper: /etc/sysctl.d/corekeeper.conf [11:19:06] postgresql-common: /etc/sysctl.d/30-postgresql-shm.conf [11:19:06] procps: /etc/sysctl.d/README.sysctl [11:19:06] tracker-miner-fs: /etc/sysctl.d/30-tracker.conf [11:19:08] uhd-host: /etc/sysctl.d/uhd-usrp2.conf [11:19:11] Logged the message, Master [11:19:18] i was just going to search for that [11:19:21] how did you generate that list? [11:20:04] apt-get install apt-file; apt-file update; apt-file search /etc/sysctl.d [11:21:26] !log Jenkins: deleting -merge jobs ({{bug|51395}} sequentially to avoid deadlocking jenkins {{bug|51802}} [11:21:28] probably OK then [11:21:36] Logged the message, Master [11:22:23] i can't imagine a package absolutely *depends* on the sysctl value being set to some non-default value; i figure those are optimizations for the workload the software expects to impose. [11:22:58] and they're very rare anyhow [11:23:32] i need to be up and in a meeting in four hours [11:24:06] * ori-l waves [11:24:20] ouch [11:24:21] bye! [11:24:50] (PS3) Hashar: varnish: backends trust 127.0.0.1 for XFF [operations/puppet] - https://gerrit.wikimedia.org/r/75085 [11:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 11:24:42 UTC 2013 [11:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [11:28:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 11:28:10 UTC 2013 [11:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [11:28:42] PROBLEM - Varnish HTTP mobile-frontend on cp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 11:28:51 UTC 2013 [11:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [11:30:42] RECOVERY - Varnish HTTP mobile-frontend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 8.927 second response time [11:32:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 11:32:34 UTC 2013 [11:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [11:40:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:41:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time [11:46:42] PROBLEM - Varnish HTTP mobile-frontend on cp1046 is CRITICAL: HTTP CRITICAL - No data received from host [11:47:42] RECOVERY - Varnish HTTP mobile-frontend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 5.625 second response time [11:52:42] PROBLEM - Varnish HTTP mobile-frontend on cp1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 11:54:47 UTC 2013 [11:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [11:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 11:57:40 UTC 2013 [11:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [11:58:52] RECOVERY - Puppet 
freshness on cp1041 is OK: puppet ran at Mon Jul 22 11:58:48 UTC 2013 [11:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [12:00:44] PROBLEM - Varnish HTTP mobile-frontend on cp1059 is CRITICAL: HTTP CRITICAL - No data received from host [12:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 12:02:38 UTC 2013 [12:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [12:03:32] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL - No data received from host [12:03:42] RECOVERY - Varnish HTTP mobile-frontend on cp1059 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 7.857 second response time [12:03:52] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:04:33] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22068 bytes in 0.014 second response time [12:04:42] PROBLEM - Varnish HTTP mobile-frontend on cp1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:04:52] PROBLEM - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:05:52] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22110 bytes in 9.006 second response time [12:06:32] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: Connection timed out [12:06:42] PROBLEM - Varnish HTTP mobile-frontend on cp1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:06:43] PROBLEM - Varnish HTTP mobile-frontend on cp1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:06:52] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:07:01] hmm [12:07:32] RECOVERY - LVS HTTP IPv4 on m.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22068 bytes in 9.012 second response time [12:07:42] RECOVERY - Varnish HTTP mobile-frontend on cp1060 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 9.136 second response time [12:07:52] RECOVERY - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22110 bytes in 9.263 second response time [12:08:42] RECOVERY - Varnish HTTP mobile-frontend on cp1059 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 9.390 second response time [12:09:52] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:18] hashar: did you do the zuul thingi last week? 
[12:10:32] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: Connection timed out [12:10:36] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL - No data received from host [12:10:42] PROBLEM - Varnish HTTP mobile-frontend on cp1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:10:43] hai btw [12:10:52] PROBLEM - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 7.045 second response time [12:11:32] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22068 bytes in 0.009 second response time [12:12:33] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: Connection timed out [12:14:42] PROBLEM - Varnish HTTP mobile-frontend on cp1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:16:05] ori-l: there could be cases where, although the sysctl parameter is "just" an optimization, the software depends on it in the sense that (a) With the software's current configuration, it will not start up without those sysctl settings, or (b) It will start, but will fail to do its job well enough to matter if the sysctl setting wasn't in effect before daemon start [12:16:58] wth is going on [12:17:25] I don't know. my phone is mad at me and going beep beep beep beep, that's why I'm awake :P [12:18:23] AzaToth: nop [12:18:51] AzaToth: unlikely to be achieved before mid september :D [12:19:08] varnish frontends aren't happy [12:19:51] Jul 22 12:19:06 cp1041 vhtcpd[1958]: TCP conn to 127.0.0.1:80: response too large, dropping request [12:20:09] I thought we had fixed this before, bodies in purges? [12:23:27] k [12:23:28] hashar: why are you happy? [12:23:28] about that ヾ [12:23:28] /wiki/Special:BannerRandom? [12:23:28] that would have cascaded though [12:23:32] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: Connection timed out [12:23:42] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:23:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:24:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [12:24:34] 1548 SYN_RECV, 172367 CLOSE_WAIT [12:24:42] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [12:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 12:24:45 UTC 2013 [12:25:32] PROBLEM - LVS HTTP IPv4 on m.wikimedia.org is CRITICAL: Connection timed out [12:25:35] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [12:26:32] RECOVERY - Varnish HTTP mobile-frontend on cp1046 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.002 second response time [12:26:36] ori-l: arth you there? [12:26:41] okay, I restart cp1046 mobile-frontend [12:26:45] AzaToth: he left a while ago, to get some sleep. [12:26:52] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22060 bytes in 0.008 second response time [12:26:55] k [12:26:59] at least that's what he said. Unsure if he's gotten his sleep <-> IRC interface working yet [12:27:06] hehe [12:27:23] RECOVERY - LVS HTTP IPv4 on m.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22060 bytes in 0.002 second response time [12:27:51] LVS is load balancers right? 
[12:28:03] !log restarting cp1046/cp1060's varnish-frontend [12:28:07] AzaToth: yes [12:28:12] Logged the message, Master [12:28:28] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 12:28:13 UTC 2013 [12:28:28] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [12:28:32] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL - No data received from host [12:28:36] RECOVERY - Varnish HTTP mobile-frontend on cp1060 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.003 second response time [12:28:40] [2331559.040628] Out of socket memory [12:28:41] [2331559.280384] Out of socket memory [12:28:42] perfect [12:28:52] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22103 bytes in 0.025 second response time [12:28:55] heh [12:28:55] RECOVERY - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22100 bytes in 0.014 second response time [12:28:59] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 12:28:48 UTC 2013 [12:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [12:30:32] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22051 bytes in 0.025 second response time [12:32:10] paravoid: not so ironically, the first google result for Out of socket memory is a blog entry from a guy running Varnish :) [12:32:13] http://blog.tsunanet.net/2011/03/out-of-socket-memory.html [12:32:22] hey [12:32:28] hey mark [12:32:35] sorry was on the road [12:32:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 12:32:36 UTC 2013 [12:32:54] so, something's strange happening [12:33:19] no load at all, nothing except varnish frontend is noticing anything, graphs show no spikes or anything [12:33:23] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [12:33:40] I'm looking for evidence of a syn flood [12:33:43] but couldn't find any [12:34:10] (syncookies are on but no warnings have been printed, SYN_RECV amount isn't that large) [12:34:51] out of socket memory happened after I restarted varnish and I'm guessing that's the kernel cleaning up after all those sockets the dying varnish left orphan [12:35:29] mark: and I restarted cp1046/cp1060 after a while, hoping that maybe that mkaes a difference [12:35:36] ok [12:35:55] and it did it seems [12:36:08] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Mobile+caches+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [12:36:14] traffic back to normal levels [12:36:34] a socket fd leak on varnish's side maybe? 
[12:37:00] progressively exhausting all sockets on the system [12:37:24] I restarted two frontends because of depool threshold but I left the rest for investigation [12:37:52] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [12:38:09] i'm not seeing out of socket memory on all mobile servers [12:38:13] no [12:38:17] 15:34 < paravoid> out of socket memory happened after I restarted varnish and I'm guessing that's the kernel cleaning up after all those sockets the dying varnish left orphan [12:38:33] the only occurence is immediately after varnish-frontend restart [12:38:48] root@cp1047:~# cat /proc/net/sockstat [12:38:49] sockets: used 200270 [12:38:49] TCP: inuse 198147 orphan 1121 tw 7049 alloc 205357 mem 409158 [12:39:02] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:19] root@cp1060:~# cat /proc/net/sockstat [12:39:19] sockets: used 4874 [12:39:19] TCP: inuse 5752 orphan 1164 tw 31072 alloc 5909 mem 9752 [12:39:29] right [12:39:52] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22103 bytes in 0.043 second response time [12:40:07] n_sess 199990 . N struct sess [12:40:14] on cp1047 [12:40:36] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Mobile%20caches%20eqiad&h=cp1047.eqiad.wmnet&r=day&z=default&jr=&js=&st=1374496691&v=199997&m=frontend.n_sess&vl=N&ti=N%20struct%20sess&z=large [12:40:58] i think we've seen this before [12:41:34] we have? [12:41:39] yes [12:41:45] but on bits mostly... [12:41:51] maybe it's different [12:41:52] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22062 bytes in 0.005 second response time [12:42:26] 2013-07-22 11:50:57.571899 [mobilelb6] Could not depool server cp1059.eqiad.wmnet because of too many down! [12:42:29] blergh [12:42:33] oh that's 40' ago though [12:43:02] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:43:10] so I think pybal is pooling/depooling cp1059/cp1047 [12:43:15] flapping [12:43:20] I'm going to disable them [12:43:29] ack? 
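A side note on the sockstat figures pasted above: the contrast between cp1047 (roughly 198k TCP sockets in use) and the freshly restarted cp1060 (roughly 5k) is what points at progressive socket exhaustion rather than a traffic spike. A minimal sketch of that kind of check, assuming a frontend varnishd instance is the process under suspicion; the pgrep match pattern is illustrative, not the exact command line used on these hosts:

    # Kernel-wide TCP socket usage versus the tcp_mem thresholds (pages: low / pressure / max).
    cat /proc/net/sockstat
    sysctl net.ipv4.tcp_mem

    # Count file descriptors held by the suspect varnishd, to see whether
    # sockets are accumulating inside one process (a possible fd/session leak).
    pid=$(pgrep -o -f 'varnishd.*frontend')   # match pattern is a guess
    if [ -n "$pid" ]; then
        echo "open fds: $(ls /proc/$pid/fd | wc -l)"
        grep 'Max open files' /proc/$pid/limits
    fi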
[12:43:52] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 22104 bytes in 0.037 second response time [12:44:46] just disable one [12:44:48] restart the other [12:45:01] so we have one for debugging [12:45:08] it'll be interesting to see if those sessions go away [12:45:13] we have four, I restarted two [12:45:20] I was thinking disable in pybal conf the other two [12:45:21] restart another one [12:45:27] that works too, two should be enough [12:46:10] !log depooling cp1047/cp1059 for further investigation [12:46:19] Logged the message, Master [12:46:32] RECOVERY - Varnish HTTP mobile-frontend on cp1059 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 1.694 second response time [12:46:33] RECOVERY - Varnish HTTP mobile-frontend on cp1047 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 2.194 second response time [12:46:35] bblack: feel free to ask questions if you lack info on some of these steps btw :) [12:47:25] inuse on cp1060 jumped to 10k (from 6k) [12:47:37] but seems to be holding off there [12:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 12:54:44 UTC 2013 [12:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [12:56:52] PROBLEM - SSH on sq41 is CRITICAL: Server answer: [12:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 12:57:36 UTC 2013 [12:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [12:59:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 12:58:52 UTC 2013 [12:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [13:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 13:02:36 UTC 2013 [13:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [13:11:52] RECOVERY - SSH on sq41 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [13:14:42] PROBLEM - Backend Squid HTTP on sq41 is CRITICAL: Connection refused [13:15:15] guys, can someone please, please merge https://gerrit.wikimedia.org/r/#/c/73565/ before VE is enabled on all those non-en-WP wikis? [13:18:12] pff [13:18:14] the triple negations there make it hard to understand what this patch is about :D [13:19:30] twkozlowski: there isno urgency really [13:19:38] twkozlowski: we can deploy that whenever needed [13:19:58] twkozlowski: that patch essentially would reopen bug https://bugzilla.wikimedia.org/show_bug.cgi?id=48666 [13:20:04] hashar: it's needed since Jul 13 [13:20:24] twkozlowski: so you would want to bring this on the wikitech-l mailing list and make sure James Forrester (bug 48666 solver + VE product manager) is in CC [13:20:34] hashar: and close 50929 [13:20:44] twkozlowski: but reopens 48666 :D [13:21:02] so you want to bring the discussion to the tech community so we don't enable/disable that option every week [13:21:09] "For the record, I asked him and James said that he doesn't have a plan to respond to this patch." [13:21:35] James_F|Away: ^^ [13:22:17] also perhaps Elsie [13:23:12] https://bugzilla.wikimedia.org/show_bug.cgi?id=48666#c6 offers some background information [13:24:43] (CR) Hashar: [C: -1] "That would reopen bug 48666 that explicitly made VisualEditor a hidden preference." 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/73565 (owner: Odder) [13:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 13:24:42 UTC 2013 [13:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [13:25:40] twkozlowski: I recommend discussing it on wikitech-l and bug 50929. Meanwhile you can abandon the gerrit change https://gerrit.wikimedia.org/r/#/c/73565/ since it is not going to be deployed :-] [13:25:42] :( [13:25:43] ror [13:25:54] well not deployed until we figure out whether this is actually wanted. [13:26:13] There is a thread about that on en.wp somewhere. [13:26:17] hashar: you know, this has already been discussed at length on multiple venues [13:26:25] including that bug and [[WP:VPT]] [13:27:16] hashar: and james has been blocking any progress by refusing to acknowledge that this is wanted :/ [13:27:27] so bring the topic to wikitech-l [13:27:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 13:27:39 UTC 2013 [13:28:13] the whole point of that feature was to opt in to VE [13:28:16] (PS4) Helder.wiki: (bug 50929) Remove 'visualeditor-enable' from $wgHiddenPrefs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/73565 (owner: Odder) [13:28:17] not to opt out later on :-D [13:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [13:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 13:28:46 UTC 2013 [13:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [13:29:49] hashar: i'm pretty sure that a "position has been reached" on that bug [13:30:21] but, as i said, james is refusing to acknowledge it or even reply [13:30:33] frankly i have no fucking idea why he's doing that [13:30:37] MatmaRex, so start a public thread [13:30:57] wikitech-l is where such issues should be resolved [13:31:05] this isn't a code issue [13:31:05] Why. [13:31:14] this is a wikimedia-specific configuration issue [13:31:17] There is community consensus not to hide it. [13:33:12] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 13:33:05 UTC 2013 [13:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [13:34:11] MatmaRex: if james does not ack/reply, then bring the issue to more people by using wikitech-l :-] [13:34:47] wikitech is not an appropriate list for this [13:34:47] and you can talk about it with the visual editor people on IRC (maybe #wikimedia-visualeditor ) (all of them in SF though) [13:35:00] but if this is going to make this get merged, okay, fuck appropriate lists [13:35:11] visualeditor people ignore me when i talk to them about this [13:35:14] i told you already [13:35:57] ok let me put this in another way: [13:36:01] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#.22Opt_out.22_of_VE_needed_under_preferences [13:36:05] 1) Gerrit / code review is not a place to argue [13:36:07] 'complaining on IRC won't change anything' [13:36:10] 2) it is none of our business [13:36:13] 3) get James to reply [13:36:15] period [13:36:23] None of our business?
[13:36:41] that should be handled by the VisualEditor team [13:36:56] by "our" I was referring to the general mediawiki team sorry [13:37:16] that's why I'm not discussing this in #mediawiki [13:37:25] hashar: mail sent [13:37:29] MatmaRex: thanks :) [13:38:00] hope you like it [13:38:27] and really IRC is a horrible place to talk about such issues [13:38:49] since that is only a handful of people interacting when you want many more people to be involved (and at least the proper people such as James :D ) [13:39:21] hashar: it's a last resort. it's harder to outright ignore people on irc [13:39:39] and enough people were involved in this already, see the link above [13:39:40] MatmaRex: You sent this to wikitech-l? Not seeing anything. [13:39:41] also stuff said here, while yes it's publicly logged, in practice it's not part of the public record, almost never do we point to something said here as part of, say, an on-wiki discussion or an email thread [13:40:02] twkozlowski: ah fuck, send from wrong e-mail address. fixing [13:40:05] sent* [13:40:32] done [13:42:05] nice. [13:48:42] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:50:42] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [13:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 13:54:47 UTC 2013 [13:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [13:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 13:57:39 UTC 2013 [13:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [13:58:42] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 13:58:41 UTC 2013 [13:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [14:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 14:02:39 UTC 2013 [14:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [14:09:20] hey paraoid [14:09:24] paravoid [14:09:31] why "never source a file from a module?" [14:23:03] Anybody around who could restart git? See https://bugzilla.wikimedia.org/show_bug.cgi?id=51769 [14:23:15] as discussed in #mediawiki right now [14:24:42] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 14:24:41 UTC 2013 [14:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:42] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:25:54] as usual wondering who's alive on ops at this time [14:26:01] apergos ? Sorry to bother again ^ git down [14:26:26] ah it is? grrr [14:26:50] apergos: Chad not around :( [14:26:50] (PS1) Hashar: contint: deny webspider from accessing Jenkins [operations/puppet] - https://gerrit.wikimedia.org/r/75105 [14:27:02] andre__: Chad is in SF timezone nowadays :D [14:27:14] gimme a sec, looking at it [14:27:18] hashar, I know! That's not good!!!!
;) [14:27:27] andre__: git.wikimedia.org has been dying a few times per day :( [14:27:27] this is what happens when you don't meet the striking server kittehs' demands for fresher tuna [14:27:34] I know [14:27:41] https://bugzilla.wikimedia.org/show_bug.cgi?id=51769 [14:27:43] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 14:27:38 UTC 2013 [14:28:08] great [14:28:15] java at 100% over there, looking to see how to restart it [14:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [14:28:35] apergos: Could you try taking a stack trace? [14:28:36] andre__: I am not sure whether that bug should be a blocker. That is not really preventing any work from being done :] I would set it to annoying [14:28:40] apergos: Maybe that would help chad? [14:29:00] jstack !! [14:29:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 14:28:59 UTC 2013 [14:29:21] well two things [14:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [14:29:32] first, how do I get a java stacktrace? [14:29:42] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [14:30:06] apergos: jstack $PID should give you a stack trace. [14:30:14] hashar, yeah I might overreact. However it's at least for some folks a major entry point to quickly look up some code or changes (if they don't have a complete checkout ready) [14:30:33] well, this *is* part of the reason we replicate to GitHub. [14:30:41] apergos: or jstack $EXECUTABLE [14:30:56] apergos: (if you cannot see the java process) [14:32:06] oh I was able to see it ok, I can email this to chad I guess [14:32:11] now however [14:32:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 14:32:40 UTC 2013 [14:32:47] apergos: probably easier just to stick it on the bug and private it [14:32:57] since chad will be seeing the bug report anyway [14:33:00] where does he run this from? [14:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [14:33:27] I see no init.d script with gitblit in it [14:33:55] no upstart either [14:34:01] apergos: Do you see something like tomcat, jetty, jboss? [14:34:43] (PS1) Ottomata: Adding hasrestart and hasstatus to hadoop services [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/75107 [14:35:13] (CR) Ottomata: [C: 2 V: 2] Adding hasrestart and hasstatus to hadoop services [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/75107 (owner: Ottomata) [14:35:14] (Merged) Ottomata: Adding hasrestart and hasstatus to hadoop services [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/75107 (owner: Ottomata) [14:35:25] (PS3) Ottomata: Fixing automated hue SSL generation and permissions [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 [14:36:11] nope [14:36:27] it runs as java -jar gitblit.jar and it's in its own little process tree [14:36:51] That looks fine as well. [14:36:59] Do you get errors when you run that command? [14:37:04] /var/lib/gitblit I guess [14:37:16] I needed to see where to run it out of [14:37:25] sec, I'm gonna shoot it now and see [14:37:33] Do you see a gitblit.properties file somewhere? [14:38:09] wonder if I should have backgrounded that or if it will [14:39:10] Jars typically do not background on their own. Then again, gitblit might be different...
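For reference, the stack-trace step being discussed is just a thread dump of the running JVM; a minimal sketch, assuming the JDK's jstack tool is installed on the host (the output file name is only an example):

    # Find the gitblit JVM and dump all of its thread stacks to a file
    # that can be attached (privately) to the bug report.
    pid=$(pgrep -f 'java -jar gitblit.jar')
    jstack "$pid" > /tmp/gitblit-threads-$(date +%Y%m%dT%H%M%S).txt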
[14:39:13] well I backgrounded it and I see some stuff instead of a proxy server whine [14:39:49] cpu is at 99% again [14:39:49] But git.wikimedia.org is up again. [14:39:55] so maybe that's its typical useage over here [14:40:22] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours [14:41:16] !log shot and restarted gitblit: on antinomy, cd /var/lib/gitblit, java -jar gitblit.jar & (see bug 51769) [14:41:24] Logged the message, Master [14:41:37] !log (btw docs would be nice, is that really the right way to kick it?) [14:41:47] Logged the message, Master [14:42:02] yeah, I was hoping for https://wikitech.wikimedia.org/view/git.wikimedia.org to exist, but it doesn't [14:42:14] (as we have https://wikitech.wikimedia.org/view/bugzilla.wikimedia.org with some nuggets of wisdom) [14:42:51] I searched for the string gitblit and got an abysmally small number of search results [14:42:54] (total: 3) [14:43:16] we shouldn't use the urls like that for page tables >.> [14:43:22] *page titles [14:43:53] well bugzilla is the exception, because the whole product name in url thing [14:46:27] somebody came up with it before my time [14:47:11] just because it has been done in the past, doesn't mean it should repeat it self [14:47:57] (and bugzilla, one of the very few that do it iirc, is one of the exceptions that it does work for) [14:49:14] (PS1) Ottomata: Fixing sqoop path based on sqoop or sqoop2 [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/75112 [14:49:50] (CR) Ottomata: [C: 2 V: 2] Fixing sqoop path based on sqoop or sqoop2 [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/75112 (owner: Ottomata) [14:49:51] (Merged) Ottomata: Fixing sqoop path based on sqoop or sqoop2 [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/75112 (owner: Ottomata) [14:50:22] well I dunno if this stack trace is any use, since cpu usage was "normal" [14:50:23] (PS4) Ottomata: Fixing automated hue SSL generation and permissions [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/74686 [14:51:39] (PS1) Hashar: jenkins: logrotate access.log [operations/puppet] - https://gerrit.wikimedia.org/r/75113 [14:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 14:54:42 UTC 2013 [14:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [14:57:29] apergos: Errm, thanks for quickly looking at the git problem, by the way [14:57:34] yw [14:58:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 14:57:56 UTC 2013 [14:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [14:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 14:58:42 UTC 2013 [14:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [15:00:42] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:01:10] https://wikitech.wikimedia.org/wiki/Git.wikimedia.org exists now [15:01:24] do we have documentation on wikitech about our actual git setup? 
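Since the only documented restart procedure at this point is the !log line above, here is roughly what a runbook entry might contain, assuming gitblit keeps running as a bare java -jar out of /var/lib/gitblit; the log path and the sleep are assumptions, not the actual setup:

    # Sketch of a gitblit restart, mirroring what was done by hand for bug 51769.
    pid=$(pgrep -f 'java -jar gitblit.jar')
    [ -n "$pid" ] && kill "$pid" && sleep 5
    cd /var/lib/gitblit
    nohup java -jar gitblit.jar >> /var/log/gitblit.log 2>&1 &   # log path is a guess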
[15:01:42] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [15:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 15:02:36 UTC 2013 [15:03:02] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:02] PROBLEM - LVS HTTPS IPv4 on foundation-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:25] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [15:04:02] RECOVERY - LVS HTTPS IPv4 on foundation-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 67148 bytes in 7.319 second response time [15:04:02] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 67150 bytes in 9.154 second response time [15:05:02] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:40] (PS2) Hashar: jenkins: logrotate access.log [operations/puppet] - https://gerrit.wikimedia.org/r/75113 [15:05:52] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 98250 bytes in 4.773 second response time [15:06:22] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: No successful Puppet run in the last 10 hours [15:06:34] peachey|laptop__: git.wikimedia.org == gitblit :] [15:06:47] peachey|laptop__: the rest is in Gerrit :) [15:09:30] hashar: git.* is undescriptive and could easily change to another platform (eg: gitorious *runs*) [15:09:52] so a gitblit page is better for documentation that is directly related to it [15:10:13] (PS3) Hashar: jenkins: logrotate access.log [operations/puppet] - https://gerrit.wikimedia.org/r/75113 [15:10:22] PROBLEM - Puppet freshness on analytics1018 is CRITICAL: No successful Puppet run in the last 10 hours [15:11:17] peachey|laptop__: yup. 
The current layout seems ok to me :-) [15:11:22] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: No successful Puppet run in the last 10 hours [15:11:27] that being a bad example >.> [15:12:05] (CR) ArielGlenn: [C: 2] jenkins: logrotate access.log [operations/puppet] - https://gerrit.wikimedia.org/r/75113 (owner: Hashar) [15:12:06] (Merged) ArielGlenn: jenkins: logrotate access.log [operations/puppet] - https://gerrit.wikimedia.org/r/75113 (owner: Hashar) [15:15:32] (CR) ArielGlenn: [C: 2] contint: deny webspider from accessing Jenkins [operations/puppet] - https://gerrit.wikimedia.org/r/75105 (owner: Hashar) [15:15:33] (Merged) ArielGlenn: contint: deny webspider from accessing Jenkins [operations/puppet] - https://gerrit.wikimedia.org/r/75105 (owner: Hashar) [15:16:58] (PS1) Hashar: jenkins: fix indent in logrotate file [operations/puppet] - https://gerrit.wikimedia.org/r/75116 [15:17:39] !log gallium / jenkins : blacklisting a bunch of user agent {{gerrit|75105}} [15:17:50] Logged the message, Master [15:17:51] (CR) ArielGlenn: [C: 2] jenkins: fix indent in logrotate file [operations/puppet] - https://gerrit.wikimedia.org/r/75116 (owner: Hashar) [15:17:52] (Merged) ArielGlenn: jenkins: fix indent in logrotate file [operations/puppet] - https://gerrit.wikimedia.org/r/75116 (owner: Hashar) [15:21:05] (PS4) Hashar: contint: explicitly require php5-dev [operations/puppet] - https://gerrit.wikimedia.org/r/70182 [15:23:42] (CR) ArielGlenn: [C: 2] contint: explicitly require php5-dev [operations/puppet] - https://gerrit.wikimedia.org/r/70182 (owner: Hashar) [15:23:43] (Merged) ArielGlenn: contint: explicitly require php5-dev [operations/puppet] - https://gerrit.wikimedia.org/r/70182 (owner: Hashar) [15:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 15:24:46 UTC 2013 [15:25:37] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [15:26:33] (PS1) Hashar: jenkins: fix pid path in logrotate script [operations/puppet] - https://gerrit.wikimedia.org/r/75117 [15:28:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 15:28:06 UTC 2013 [15:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [15:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 15:28:47 UTC 2013 [15:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [15:30:01] (CR) ArielGlenn: [C: 2] jenkins: fix pid path in logrotate script [operations/puppet] - https://gerrit.wikimedia.org/r/75117 (owner: Hashar) [15:30:02] (Merged) ArielGlenn: jenkins: fix pid path in logrotate script [operations/puppet] - https://gerrit.wikimedia.org/r/75117 (owner: Hashar) [15:32:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 15:32:47 UTC 2013 [15:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [15:39:20] (CR) Demon: [C: 1 V: 1] "This seems like a completely reasonable change to me, but someone from the VE team should weigh in (or at least acknowledge they've seen t" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/73565 (owner: Odder) [15:44:02] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:02] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:19] (PS1) Mark Bergsma: Revert "Don't run the 
default vcl_fetch function on mobile caches" [operations/puppet] - https://gerrit.wikimedia.org/r/75119 [15:44:59] mark: hey [15:45:01] right on time :) [15:45:02] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 98248 bytes in 7.042 second response time [15:45:02] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 98250 bytes in 7.054 second response time [15:45:12] not you I presume? [15:45:21] hm? [15:45:24] https, no [15:45:32] right on time for what? [15:45:42] (CR) Mark Bergsma: [C: 2] Revert "Don't run the default vcl_fetch function on mobile caches" [operations/puppet] - https://gerrit.wikimedia.org/r/75119 (owner: Mark Bergsma) [15:46:02] a critical LVS and the same minute a commit from you :) [15:46:02] (CR) jenkins-bot: [V: -1] Revert "Don't run the default vcl_fetch function on mobile caches" [operations/puppet] - https://gerrit.wikimedia.org/r/75119 (owner: Mark Bergsma) [15:46:22] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [15:46:22] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [15:46:22] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [15:46:22] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [15:46:22] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [15:46:23] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [15:46:23] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [15:46:26] (CR) Mark Bergsma: [V: 2] Revert "Don't run the default vcl_fetch function on mobile caches" [operations/puppet] - https://gerrit.wikimedia.org/r/75119 (owner: Mark Bergsma) [15:46:38] that's the leak you think? 
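The Ganglia graph linked above tracks Varnish's n_sess counter; while testing the revert, the same counters can be watched directly on a suspect host. A small sketch, assuming Varnish 3 counter names and that the frontend instance is named "frontend" (as the ganglia metric prefix suggests):

    # Poll session-related counters on the frontend varnish instance every 5 seconds.
    watch -n 5 "varnishstat -1 -n frontend | egrep 'n_sess|n_sess_mem'"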
[15:47:03] (Merged) Mark Bergsma: Revert "Don't run the default vcl_fetch function on mobile caches" [operations/puppet] - https://gerrit.wikimedia.org/r/75119 (owner: Mark Bergsma) [15:47:48] i have no idea [15:47:55] but it started around the time that VCL change was applied [15:48:04] given that it's not a critical change, I'm gonna try it ;) [15:49:27] (PS1) Hashar: jenkins: revert access log rotation [operations/puppet] - https://gerrit.wikimedia.org/r/75121 [15:50:38] (CR) jenkins-bot: [V: -1] jenkins: revert access log rotation [operations/puppet] - https://gerrit.wikimedia.org/r/75121 (owner: Hashar) [15:51:22] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours [15:52:09] !log jenkins accidentally killed Jenkins :( [15:52:19] Logged the message, Master [15:54:42] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 15:54:41 UTC 2013 [15:55:18] btw, mobile has quite a lot of requests to Special:BannerRandom [15:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [15:55:37] I found out by accident of course :) [15:56:20] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Mobile+caches+eqiad&h=cp1046.eqiad.wmnet&jr=&js=&v=50188&m=frontend.n_sess_mem&vl=N&ti=N+struct+sess_mem [15:56:24] it does look like it's leveling off [15:57:22] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours [15:57:38] nod [15:57:49] but why [15:58:33] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 15:58:23 UTC 2013 [15:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 15:58:48 UTC 2013 [15:59:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [15:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [16:02:17] (PS2) Hashar: jenkins: revert access log rotation [operations/puppet] - https://gerrit.wikimedia.org/r/75121 [16:02:18] perhaps thats what happens when you return (deliver) on a ttl <= 0 object [16:02:22] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours [16:02:22] or something [16:02:39] I guess I can test with parts of the default vcl_fetch function [16:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 16:02:34 UTC 2013 [16:02:43] and see what's doing it [16:02:45] !log Jenkins back up [16:02:55] Logged the message, Master [16:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [16:08:15] (CR) ArielGlenn: [C: 2] jenkins: revert access log rotation [operations/puppet] - https://gerrit.wikimedia.org/r/75121 (owner: Hashar) [16:08:16] (Merged) ArielGlenn: jenkins: revert access log rotation [operations/puppet] - https://gerrit.wikimedia.org/r/75121 (owner: Hashar) [16:08:30] (CR) Parent5446: [C: 1] "Agreed on this as well." 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/73565 (owner: Odder) [16:12:45] (PS1) Petr Onderka: LZMA-compressed revision text [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/75126 [16:13:12] (PS1) Mark Bergsma: Use hit_for_pass if object's TTL is <= 0 [operations/puppet] - https://gerrit.wikimedia.org/r/75127 [16:13:13] (CR) Petr Onderka: [C: 2 V: 2] LZMA-compressed revision text [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/75126 (owner: Petr Onderka) [16:13:14] (Merged) Petr Onderka: LZMA-compressed revision text [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/75126 (owner: Petr Onderka) [16:14:03] (CR) Mark Bergsma: [C: 2] Use hit_for_pass if object's TTL is <= 0 [operations/puppet] - https://gerrit.wikimedia.org/r/75127 (owner: Mark Bergsma) [16:14:04] (Merged) Mark Bergsma: Use hit_for_pass if object's TTL is <= 0 [operations/puppet] - https://gerrit.wikimedia.org/r/75127 (owner: Mark Bergsma) [16:18:22] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours [16:18:48] (PS1) BBlack: add per-connection purging limits for sanity [operations/software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/75128 [16:19:22] PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 10 hours [16:24:33] (PS2) BBlack: add per-connection purging limits for sanity [operations/software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/75128 [16:25:12] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 16:25:05 UTC 2013 [16:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [16:27:44] so that seems to be it [16:27:52] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 16:27:41 UTC 2013 [16:27:59] objects with TTL <= 0 and return(deliver) in fetch causes the problem [16:28:07] perhaps due to the small memory caches on the frontend or something [16:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [16:28:33] (CR) Andrew Bogott: [C: -1] "I've never seen Jenkins report 'Lost' before!" [operations/puppet] - https://gerrit.wikimedia.org/r/75087 (owner: Ori.livneh) [16:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 16:28:47 UTC 2013 [16:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [16:32:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 16:32:40 UTC 2013 [16:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [16:46:52] ori-l, can you explain what problem /etc/sysctl.d/puppet-managed is solving? 
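For readers wondering what /etc/sysctl.d/puppet-managed refers to: a sysctl.d drop-in is just a key = value file whose settings are loaded into the kernel at boot, and the usual gotcha is that a daemon started before the file is applied never sees the values, which is ori-l's earlier point about software depending on sysctls being in effect at start-up. A minimal illustration; the filename and the keys below are made up, not the actual contents of that file:

    # Hypothetical puppet-managed drop-in; keys are examples only.
    printf '%s\n' \
        'net.core.somaxconn = 4096' \
        'net.ipv4.tcp_max_orphans = 262144' > /etc/sysctl.d/60-puppet-managed.conf

    # Apply it immediately rather than waiting for the next boot,
    # so a subsequently (re)started daemon sees the intended values.
    sysctl -p /etc/sysctl.d/60-puppet-managed.conf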
[16:48:34] (PS1) Mark Bergsma: Don't run default vcl_fetch on mobile backend caches [operations/puppet] - https://gerrit.wikimedia.org/r/75130 [16:49:39] (CR) Mark Bergsma: [C: 2] Don't run default vcl_fetch on mobile backend caches [operations/puppet] - https://gerrit.wikimedia.org/r/75130 (owner: Mark Bergsma) [16:49:40] (Merged) Mark Bergsma: Don't run default vcl_fetch on mobile backend caches [operations/puppet] - https://gerrit.wikimedia.org/r/75130 (owner: Mark Bergsma) [16:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 16:54:43 UTC 2013 [16:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [16:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 16:57:36 UTC 2013 [16:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [16:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 16:58:47 UTC 2013 [16:58:55] (CR) Demon: "I would note that using $wgHiddenPrefs is fundamentally flawed here. If the VE team truly wants to remove the option, they should remove i" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/73565 (owner: Odder) [16:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [17:00:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [17:02:48] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 17:02:38 UTC 2013 [17:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [17:05:21] (PS1) ArielGlenn: turn off rsync cron for dumps temporarily [operations/puppet] - https://gerrit.wikimedia.org/r/75132 [17:07:05] (CR) ArielGlenn: [C: 2] turn off rsync cron for dumps temporarily [operations/puppet] - https://gerrit.wikimedia.org/r/75132 (owner: ArielGlenn) [17:07:06] (Merged) ArielGlenn: turn off rsync cron for dumps temporarily [operations/puppet] - https://gerrit.wikimedia.org/r/75132 (owner: ArielGlenn) [17:18:57] (PS2) Andrew Bogott: Replace uses of generic::sysctl with sysctlfile module [operations/puppet] - https://gerrit.wikimedia.org/r/74852 [17:19:46] (CR) Andrew Bogott: [C: 2] Replace uses of generic::sysctl with sysctlfile module [operations/puppet] - https://gerrit.wikimedia.org/r/74852 (owner: Andrew Bogott) [17:19:47] (Merged) Andrew Bogott: Replace uses of generic::sysctl with sysctlfile module [operations/puppet] - https://gerrit.wikimedia.org/r/74852 (owner: Andrew Bogott) [17:21:24] (CR) Andrew Bogott: "ok, that dependency is merged now." [operations/puppet] - https://gerrit.wikimedia.org/r/75087 (owner: Ori.livneh) [17:21:29] (CR) Andrew Bogott: "recheck" [operations/puppet] - https://gerrit.wikimedia.org/r/75087 (owner: Ori.livneh) [17:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 17:24:47 UTC 2013 [17:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [17:27:06] <^demon> manybubbles: Sooo, I talked it over with AaronSchulz. Jobqueue is most likely the right approach for bulk invalidations, but hooking into HTMLCacheUpdateJob feels kind of icky :) [17:28:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 17:28:11 UTC 2013 [17:28:13] ^demon: it does. 
but it provides all the right partitioning logic :) it was too tempting for me not to try [17:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [17:28:52] <^demon> manybubbles: So, the ideal world would be for me to write an abstract link-invalidation-job that handles this sort of thing (other people need it too it seems). [17:29:00] <^demon> Then we'd have all the titles and so forth on hand already. [17:29:22] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 17:29:13 UTC 2013 [17:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [17:29:59] ^demon: essentially rename the job we have and move everything but the partitioning to a hook? [17:30:17] ^demon: I could probably do that if others think it makes sense. [17:30:50] AaronSchulz: what do you think of ^^^^ [17:32:26] (PS1) ArielGlenn: mwbzutils package for precise snapshots, don't use 'latest' for packages [operations/puppet] - https://gerrit.wikimedia.org/r/75136 [17:33:02] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 17:32:54 UTC 2013 [17:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [17:34:41] (CR) ArielGlenn: [C: 2] mwbzutils package for precise snapshots, don't use 'latest' for packages [operations/puppet] - https://gerrit.wikimedia.org/r/75136 (owner: ArielGlenn) [17:34:42] (Merged) ArielGlenn: mwbzutils package for precise snapshots, don't use 'latest' for packages [operations/puppet] - https://gerrit.wikimedia.org/r/75136 (owner: ArielGlenn) [17:35:58] ^demon, manybubbles, is LinksUpdate not enough? [17:36:33] (PS1) ArielGlenn: what was I thinking. dump servers don't need mwbzutils. [operations/puppet] - https://gerrit.wikimedia.org/r/75137 [17:37:33] (CR) ArielGlenn: [C: 2] what was I thinking. dump servers don't need mwbzutils. [operations/puppet] - https://gerrit.wikimedia.org/r/75137 (owner: ArielGlenn) [17:37:34] (Merged) ArielGlenn: what was I thinking. dump servers don't need mwbzutils. [operations/puppet] - https://gerrit.wikimedia.org/r/75137 (owner: ArielGlenn) [17:39:49] (PS1) ArielGlenn: snapshots with precise get mwbzutils package [operations/puppet] - https://gerrit.wikimedia.org/r/75138 [17:41:15] (CR) ArielGlenn: [C: 2] snapshots with precise get mwbzutils package [operations/puppet] - https://gerrit.wikimedia.org/r/75138 (owner: ArielGlenn) [17:41:16] (Merged) ArielGlenn: snapshots with precise get mwbzutils package [operations/puppet] - https://gerrit.wikimedia.org/r/75138 (owner: ArielGlenn) [17:42:08] MaxSem: looking but it doesn't seem the same [17:46:13] MaxSem: I see it now. [17:47:07] (PS1) Cmcmahon: enable VisualEditor for all users on test2wiki, experimental also [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 [17:47:11] basically, this hook gives you access to parser output and is run when pages are refreshed after template editing [17:48:57] MaxSem: very close to exactly what I added to the html one. 
I think I'd prefer to get it in bulk but this is very much what I needed [17:54:42] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 17:54:41 UTC 2013 [17:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [17:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 17:57:38 UTC 2013 [17:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [17:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 17:58:50 UTC 2013 [17:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [18:01:02] (CR) Cmcmahon: "Not sure I did that right, but would like VE for anons on test2wiki" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75140 (owner: Cmcmahon) [18:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 18:02:38 UTC 2013 [18:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [18:26:42] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 18:26:41 UTC 2013 [18:27:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [18:27:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 18:27:33 UTC 2013 [18:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [18:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 18:28:44 UTC 2013 [18:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [18:31:24] (CR) Anomie: [C: -1] "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/71774 (owner: Hashar) [18:33:32] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 18:33:22 UTC 2013 [18:34:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [18:37:08] (PS1) Asher: only randomly profile http requests if $wmfDatacenter == 'eqiad' [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75146 [18:37:31] AaronSchulz: ^^ [18:39:17] (PS1) Ottomata: Installing libdclass-java on analytics nodes [operations/puppet] - https://gerrit.wikimedia.org/r/75148 [18:40:30] (PS2) Ottomata: Installing libdclass-java on analytics nodes [operations/puppet] - https://gerrit.wikimedia.org/r/75148 [18:40:54] (CR) Ottomata: [C: 2 V: 2] Installing libdclass-java on analytics nodes [operations/puppet] - https://gerrit.wikimedia.org/r/75148 (owner: Ottomata) [18:40:55] (Merged) Ottomata: Installing libdclass-java on analytics nodes [operations/puppet] - https://gerrit.wikimedia.org/r/75148 (owner: Ottomata) [18:41:38] (CR) Asher: [C: -1] "StartProfiler.php is required in WebStart.php before "require_once MW_CONFIG_FILE" occurs, so I don't think $wmfDatacenter is actually def" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75146 (owner: Asher) [18:42:52] RECOVERY - Puppet freshness on analytics1018 is OK: puppet ran at Mon Jul 22 18:42:51 UTC 2013 [18:49:02] RECOVERY - Puppet freshness on analytics1019 is OK: puppet ran at Mon Jul 22 18:49:00 UTC 2013 [18:52:12] RECOVERY - Puppet freshness on analytics1020 is OK: puppet ran at Mon Jul 22 18:52:06 UTC 2013 [18:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 18:54:44 UTC 2013 [18:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run 
in the last 10 hours [18:56:33] (CR) Ottomata: "I generally agree with Ryan. I haven't looked, but I'd assume since so much effort went in to this it is more complete and robust than gi" [operations/puppet] - https://gerrit.wikimedia.org/r/74099 (owner: Andrew Bogott) [18:58:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 18:57:52 UTC 2013 [18:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [18:59:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 18:59:43 UTC 2013 [19:00:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [19:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 19:02:36 UTC 2013 [19:03:27] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [19:17:48] (CR) Ottomata: "Andrew, would you be willing to use a git submodule for this module, if you import it? It would allow us to maintain this module separate" [operations/puppet] - https://gerrit.wikimedia.org/r/74099 (owner: Andrew Bogott) [19:20:32] Thanks folks :) [19:20:44] ottomata: re: libdclass-java [19:20:50] we had a long discussion with average on saturday [19:21:01] this won't work as-is with oracle java, which I think you use there [19:21:08] (speaking of using open source ;)) [19:21:14] it needs a symlink [19:21:19] that couldn't really be placed in the package [19:21:27] so I suggested adding it in puppet instead [19:21:33] oh [19:21:34] ok [19:21:36] it's a compatibility symlink so that oracle java will look for it [19:21:45] ? -> ? [19:21:46] s/look for/find/ [19:22:19] /usr/lib/jni/libdclass.so -> /usr/lib/x86_64-linux-gnu/jni/libdclass.so (iirc) [19:22:45] oracle java doesn't support multiarch [19:22:49] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Move 3 wikis with UW to 1.22wmf11 [19:22:59] Logged the message, Master [19:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 19:24:45 UTC 2013 [19:25:14] !log restarting & repooling cp1047/cp1059 [19:25:25] Logged the message, Master [19:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [19:25:54] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: special, wikimedia, private, fishbowl and closed to 1.22wmf11 [19:26:03] Logged the message, Master [19:26:09] paravoid: should I link both libdclass.so.0 and /usr/lib/x86_64-linux-gnu/jni/libdclassjni.so ? [19:26:37] not sure [19:27:56] oh probably just the jni one [19:27:58] i see [19:28:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 19:28:06 UTC 2013 [19:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [19:28:27] there are these two: [19:28:27] /usr/lib/x86_64-linux-gnu/libdclass.so.0 [19:28:27] /usr/lib/x86_64-linux-gnu/jni/libdclassjni.so.0 [19:28:33] java probably just needs the jni/libdclassjni [19:28:34] one [19:28:43] dunno though [19:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 19:28:46 UTC 2013 [19:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [19:29:30] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiversity, wikivoyage and wiktionary to 1.22wmf11 [19:29:40] Logged the message, Master [19:30:38] (CR) Andrew Bogott: "Yep, I'd prefer it to be a submodule...
do we have any existing examples in the puppet repo?" [operations/puppet] - https://gerrit.wikimedia.org/r/74099 (owner: Andrew Bogott) [19:31:23] (PS1) Ottomata: Symlinking libdclassjni.so into /usr/lib/jni [operations/puppet] - https://gerrit.wikimedia.org/r/75153 [19:31:30] * average is reading backlog [19:31:45] (CR) Ottomata: "Yup!" [operations/puppet] - https://gerrit.wikimedia.org/r/74099 (owner: Andrew Bogott) [19:32:01] average, basically, i'm doing the symlinking into /usr/lib/jni [19:32:05] i guess I just need to do [19:32:32] /usr/lib/jni/libdclassjni.so -> /usr/lib/x86_64-linux-gnu/jni/libdclassjni.so.0 [19:32:33] is that right? [19:32:35] ottomata: what you did was also written in patchset https://gerrit.wikimedia.org/r/#/c/74651 [19:32:45] from saturday [19:33:16] Faidon just told me [19:33:16] ok cool [19:33:16] great, didn't see this link [19:33:16] ottomata: sorry I haven't mentioned it [19:33:16] this coment [19:33:16] np [19:33:25] great ok [19:36:39] i just saw that the x86_64 one was in a jni subdir [19:38:11] average, paravoid ^ [19:38:29] look ok? [19:38:29] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiquote, wikinews and wikibooks to 1.22wmf11 [19:38:29] https://gerrit.wikimedia.org/r/75153 [19:38:34] Logged the message, Master [19:38:35] looking [19:38:56] ottomata: looks good [19:39:05] ottomata: should I V+2 ? [19:39:10] (CR) Stefan.petrea: [C: 1] "looks good" [operations/puppet] - https://gerrit.wikimedia.org/r/75153 (owner: Ottomata) [19:39:32] (CR) Ottomata: [C: 2 V: 2] Symlinking dclass shared object files into /usr/lib [operations/puppet] - https://gerrit.wikimedia.org/r/75153 (owner: Ottomata) [19:39:33] k danke [19:39:33] (Merged) Ottomata: Symlinking dclass shared object files into /usr/lib [operations/puppet] - https://gerrit.wikimedia.org/r/75153 (owner: Ottomata) [19:39:47] ottomata, do our various puppetmasters automatically do submodule init/submodule update somehow? [19:39:56] ^demon: ready to merge some of these changes? :) [19:40:11] yes/no [19:40:18] andrewbogott [19:40:18] so [19:40:21] ^demon: I'm going to merge https://gerrit.wikimedia.org/r/#/c/74687 [19:40:24] (CR) Ryan Lane: [C: 2] Remove gitweb, we don't use it anymore [operations/puppet] - https://gerrit.wikimedia.org/r/74687 (owner: Demon) [19:40:24] sockpuppet, yes, if you use puppet-merge [19:40:25] (Merged) Ryan Lane: Remove gitweb, we don't use it anymore [operations/puppet] - https://gerrit.wikimedia.org/r/74687 (owner: Demon) [19:40:29] but, stafford isn't smart enough to do this yet [19:40:44] <^demon> Ryan_Lane: Sounds good to me. [19:40:47] ottomata, if I use puppet merge on sockpuppet does it cause the corresponding update to happen on stafford? [19:40:48] so, when you make as submodule change, you need to cd to /var/lib/git/operations/puppet and to git-submodule update --init [19:41:00] i'd like to turn that into a git hook [19:41:19] i mentioned it to mark/paravoid once, but since i've been the only one using submodules so far, i think it hasn't been a priority [19:41:23] Does 'puppet merge' already use sockpuppet's merge hook? 
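Back to the libdclassjni change merged a few lines up: stripped of the puppet wrapping, the whole fix is one compatibility symlink so that Oracle's JVM, which does not search the multiarch jni directory, can still load the library. Roughly the equivalent shell step, following the paths pasted above (whether the .so or .so.0 name should be linked is exactly the detail being discussed):

    # Compatibility symlink for Oracle Java, which ignores /usr/lib/x86_64-linux-gnu/jni.
    mkdir -p /usr/lib/jni
    ln -sf /usr/lib/x86_64-linux-gnu/jni/libdclassjni.so.0 /usr/lib/jni/libdclassjni.so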
[19:41:46] yeah, it's just a review wrapper around git merge origin/production [19:41:59] it will show you actual submodule diffs, if you change the sha the submodule points at [19:42:02] (CR) Ryan Lane: [C: -1] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/74688 (owner: Demon) [19:42:02] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else non wikipedia to 1.22wmf11 [19:42:08] and if you say 'yes', it will merge and then run submodule update --init [19:42:12] Logged the message, Master [19:42:27] (PS1) Reedy: Everything non wikipedia to 1.22wmf11 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75156 [19:42:33] ^demon: gave you a −1 on the other [19:42:45] but, no, it currently does not cause submodules to update on stafford [19:42:45] it should [19:42:59] i think the merge hook on stafford should know to run git submodule update --init [19:43:03] ottomata, ok, lemme fix. [19:43:04] (CR) Reedy: [C: 2] Everything non wikipedia to 1.22wmf11 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75156 (owner: Reedy) [19:43:06] <^demon> Ryan_Lane: Because a couple other things use that file as well. [19:43:12] (Merged) jenkins-bot: Everything non wikipedia to 1.22wmf11 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75156 (owner: Reedy) [19:43:13] <^demon> It's designed as a generic "Deny all bots" file :) [19:43:22] k andrewbogott, danke [19:43:22] ^demon: make a new file ;) [19:43:27] hm [19:43:38] <^demon> We don't need a file at all anymore though. [19:43:44] I guess that's true [19:44:01] I guess we only have gerrit running on that system [19:44:50] (CR) Ryan Lane: [C: 2] "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/74688 (owner: Demon) [19:44:51] (Merged) Ryan Lane: Allow spiders to index gerrit again [operations/puppet] - https://gerrit.wikimedia.org/r/74688 (owner: Demon) [19:45:09] ^demon: ok. both merged in [19:45:23] <^demon> ty [19:45:26] my review queue for once is actually manageable. it's nice :) [19:45:46] lo :) [19:46:16] manybubbles: no, the hook would go in the code that I mentioned to chad (HtmlCacheUpdate.php) and would add its own job [19:46:29] that job would subclass the backlink job class and would do its own stuff [19:46:47] there would not be hooks in a Job class [19:48:17] bah, I'm confused! ottomata, are the files in .git/ not actually checked into git? [19:48:30] AaronSchulz: another option that MaxSem mentioned was with LinksUpdate - which works for me as well but doesn't offer me the ability to do bulk changes [19:48:55] I'm sure that in a previous project I have altered a .git/config file and checked it in. But right now git is telling me that .git/config is untracked [19:49:08] <^demon> Um, what? [19:49:10] no, .git is not in git [19:49:12] .git is local [19:49:46] hm… where are submodules tracked then? [19:49:51] .gitmodules [19:49:53] i think [19:50:04] <^demon> Yes [19:50:11] …I don't have one of those [19:50:12] (Abandoned) Ryan Lane: Add fundraising components to #wm-fundraising [operations/puppet] - https://gerrit.wikimedia.org/r/64012 (owner: MarkTraceur) [19:50:14] oh but no sha in there [19:50:18] andrewbogott [19:50:21] run [19:50:25] oh, wait, yes I do.
[19:50:27] git submodule update —init [19:50:28] locally [19:50:45] (CR) Ryan Lane: [C: 2] rv tab in the middle of a line [operations/puppet] - https://gerrit.wikimedia.org/r/74363 (owner: Jeremyb) [19:50:46] (Merged) Ryan Lane: rv tab in the middle of a line [operations/puppet] - https://gerrit.wikimedia.org/r/74363 (owner: Jeremyb) [19:51:05] Hmph. I really want the post-merge hook to be tracked in git, but I guess that will require a hack [19:51:23] yeah, that should really be tracked by the installer of the repo though [19:51:24] so [19:51:36] if the /var/lib/git/operations/puppet clone is puppetized [19:51:39] that's where you'd add it [19:51:40] i think [19:51:47] not as part of the actual puppet repo [19:51:59] since everyone's .git dir could be different [19:52:45] hm [19:53:27] if it isn't puppetized, i'd just add it manually on stafford [19:54:28] or [19:54:28] actually [19:54:28] andrewbogott [19:54:29] maybe just ad dit to the post-merge hook on sockpuppet [19:54:32] currently it does: [19:54:34] ssh root@stafford.pmtpa.wmnet 'cd /var/lib/git/operations/puppet && git pull' [19:54:37] you could jsut do [19:54:42] That's what I'm doing, but I want that file tracked someplace first [19:54:43] ssh root@stafford.pmtpa.wmnet 'cd /var/lib/git/operations/puppet && git pull && git submodule update —init' [19:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 19:54:49 UTC 2013 [19:55:29] it is! [19:55:31] hmm [19:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [19:55:32] i think? [19:55:48] puppetmaster.pp [19:55:52] puppetmaster::gitclone [19:56:18] looks like maybe that class isn't used anywhere though [19:56:30] I think it is… I will try. [19:56:55] i'm grepping codebase, i don't see it included anywhere [19:57:32] buh, this file is utterly different from the one on sockpuppet [19:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 19:57:32 UTC 2013 [19:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [19:58:32] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [19:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 19:58:48 UTC 2013 [19:58:52] right [19:58:59] which is why I think it isn't being used at all [19:59:05] i think mark would know what's up with .git on sockpuppet [19:59:07] probably no one else [19:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [19:59:32] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:00:50] (CR) Yuvipanda: "Merged and done! https://github.com/yuvipanda/lolrrit-wm/commit/7c3b2345b05882199a493ca4b2e502022c21bc0e" [operations/puppet] - https://gerrit.wikimedia.org/r/64012 (owner: MarkTraceur) [20:01:46] (PS1) Ottomata: Supporting both short hostnames and fqdn for labs role::puppet::self [operations/puppet] - https://gerrit.wikimedia.org/r/75158 [20:02:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 20:02:42 UTC 2013 [20:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [20:07:40] (PS1) Andrew Bogott: Half-assed attempt to track the post-merge hook on sockpuppet. [operations/puppet] - https://gerrit.wikimedia.org/r/75159 [20:08:07] (PS1) Andrew Bogott: Submodule update --init on stafford post merge. 
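To round off the submodule discussion: the change being talked through is essentially one extra command in sockpuppet's post-merge hook, so that the checkout stafford's puppetmaster serves gets its submodules updated along with the pull. A sketch of what the hook could look like, assuming it currently does nothing beyond the quoted ssh pull (the real hook may well do more):

    #!/bin/bash
    # .git/hooks/post-merge on sockpuppet (sketch): keep submodules in sync
    # locally and on the puppetmaster after every merge.
    git submodule update --init
    ssh root@stafford.pmtpa.wmnet \
        'cd /var/lib/git/operations/puppet && git pull && git submodule update --init'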
[operations/puppet] - https://gerrit.wikimedia.org/r/75160 [20:08:12] ottomata: ^^ and ^ [20:11:34] (CR) Ottomata: "Some of these (sorta) already exist in puppet/files/git/puppet/, perhaps these should live there?" [operations/puppet] - https://gerrit.wikimedia.org/r/75159 (owner: Andrew Bogott) [20:12:29] andrewbogott: ^ [20:13:22] PROBLEM - Disk space on analytics1010 is CRITICAL: DISK CRITICAL - free space: / 710 MB (3% inode=85%): [20:18:20] ^demon: Hi! It turned out I created a repo in the wrong spot. Can I just use the "delete-project" plugin to remove the repo at the wrong spot, or do we need to take other actions as well? [20:18:53] <^demon> It'll delete on the gerrit box, but I need to delete manually from github as well. What's the repo? [20:18:54] ^demon: (Like you had some fancy github stuff) [20:19:11] ^demon: you should delete lolrrit too from github then [20:19:15] <^demon> I did. [20:19:32] ^demon: The repo to delete is mediawiki/extensions/LaTeXML [20:19:47] !log deploying bmc-config fix across all pmtpa/eqiad mgmt [20:19:57] Logged the message, Master [20:20:22] RECOVERY - Disk space on analytics1010 is OK: DISK OK [20:20:27] ^demon: I nuked the repo on gerrit. [20:21:25] <^demon> Deleted on github [20:21:33] ^demon: Thanks :-) [20:21:38] ^demon: want to look at https://gerrit.wikimedia.org/r/#/c/71966/ ? :) [20:21:50] ^demon: Btw... anything I can do to help you on the gitblit side? [20:22:02] ^demon: Looks like it's causing some harm :-( [20:22:11] <^demon> Not really. Logs are completely unhelpful. [20:22:23] <^demon> I'm going to send the logs to upstream and see if he can figure it out [20:22:36] Ok. [20:22:42] <^demon> AaronSchulz: Define "want" ;-) [20:24:05] Is gitblit supoosed to be running right now, or are we leaving it off for now until upstream comments on the problem? [20:25:02] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 20:24:52 UTC 2013 [20:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [20:25:36] ^demon: any reason grrrit-wm isn't gerrit-vm atm? [20:25:36] ^demon: as in "will this not kill me" ;) [20:25:42] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:26:09] <^demon> AzaToth: Maybe YuviPanda can answer...he might be still having nick problems [20:26:17] YuviPanda: ↑ [20:26:26] nice arrow there, AzaToth [20:26:36] do you know the unicode code points by heart? [20:26:38] ←↓→↑ [20:26:52] ottomata, there's a bit of a chicken-and-egg issue. I want to fix puppetmaster::gitclone but first I want to switch to vcsrepo (or something) and before I do that I want submodules to be working, etc. etc. [20:26:56] or do you have a clipboard of some sort? [20:26:59] YuviPanda: Alt-Gr + Shift + U is ↑ [20:27:29] YuviPanda: on debian this is [20:27:41] LeslieCarr, what can you tell me about puppetmaster::gitclone? Is it meant to describe sockpuppet, or was it for something else? [20:27:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 20:27:39 UTC 2013 [20:28:10] AzaToth: ah, right. I'm sadly stuck on a stupid OS though [20:28:13] <^demon> AzaToth: ↑↑↓↓←→←→BA [20:28:14] should find another solution [20:28:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [20:28:25] ^demon: cheater [20:28:42] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [20:28:50] ^demon: you know, I don't know where that is *actually* from. 
Only know that as a secret reference to some game [20:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 20:28:46 UTC 2013 [20:29:04] YuviPanda: anyway... [20:29:08] YuviPanda: grrrit-wm [20:29:15] the name of the game is? [20:29:15] <^demon> YuviPanda: [[w:Konami Code]] [20:29:17] my distractions do not seem to work :P [20:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [20:29:33] hehe [20:29:38] AzaToth: it's on +q here, and also I love the new name. can we please keep it that way? [20:29:42] [20:30:11] YuviPanda, try it here: http://www.vogue.co.uk/ [20:30:37] YuviPanda: what are you not trying to say? [20:30:44] andrewbogott: cute [20:30:50] AzaToth: i'm trying to say let us keep it as grrrit-wm [20:31:27] YuviPanda: that's not up to me to decide [20:31:42] so let's forget the nick problem, then :) [20:31:48] YuviPanda: remember though the risk of st [20:31:55] YuviPanda: remember though the risk of stick + ass + stuck [20:32:05] andrewbogott: it is to make sure it pulls all of the repositories that we want to call from puppet [20:32:07] that... makes no sense to me? [20:32:32] PROBLEM - Packetloss_Average on gadolinium is CRITICAL: CRITICAL: packet_loss_average is 8.30644534091 (gt 8.0) [20:32:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 20:32:45 UTC 2013 [20:33:07] LeslieCarr: it looks to not be applied anywhere, though… can you give me an example of where it might be applied if it were? [20:33:17] haha [20:33:18] oh [20:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [20:33:31] LeslieCarr: Background is, I want to change the post-merge hook on sockpuppet, and wondering if that is puppetized anyplace, or if it should be [20:34:03] andrewbogott: ok… honestly i haven't looked at this in forever … i didn't even remember making it :) [20:34:28] ok then :) [20:35:00] ah it's included on line 388 [20:35:10] (i think in the dashboard class [20:35:32] (CR) Ottomata: [C: 2 V: 2] Supporting both short hostnames and fqdn for labs role::puppet::self [operations/puppet] - https://gerrit.wikimedia.org/r/75158 (owner: Ottomata) [20:35:33] or maybe the main puppetmaster class (god i hate when people make confusing sets of {} ) [20:35:33] (Merged) Ottomata: Supporting both short hostnames and fqdn for labs role::puppet::self [operations/puppet] - https://gerrit.wikimedia.org/r/75158 (owner: Ottomata) [20:35:49] (CR) Aaron Schulz: "I think it's fine." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75146 (owner: Asher) [20:36:14] ah, so /every/ puppetmaster does it [20:36:20] hm [20:37:54] eh? [20:37:58] my irc cut out for a sec [20:38:00] what did I miss [20:38:11] I don't see the puppetmaster::gitclone included anywhere...right? [20:38:22] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [20:38:36] ottomata, it is included in the puppetmaster class [20:38:52] ottomata: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20130722.txt [20:38:54] which means that sockpuppet has /two/ puppet repos, one in /var/lib/git/operations/puppet/ and one in /root/puppet. 
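For anyone retracing the search above, whether a class such as puppetmaster::gitclone is actually included anywhere can be checked with a recursive grep over a checkout of operations/puppet. A minimal sketch; the clone path is illustrative, not a path from the log:

    cd /path/to/operations-puppet                      # any local clone of operations/puppet
    # list every place the class is referenced, with file name and line number
    grep -rn --include='*.pp' 'puppetmaster::gitclone' .
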
[20:39:03] * andrewbogott is a little upset by this [20:39:19] !log hot swapping disk 13 on db78 storage array [20:39:30] Logged the message, Master [20:41:11] ottomata, lesliecarr, is it possible that when a patch is merged it is merged into /root/puppet and then pushed onto stafford by the post-merge hook and then copied back into /var/lib/git/operations/puppet/ on sockpuppet? This definitely needs a chart with boxes and arrows [20:41:39] it is totally possible [20:41:46] though sockpuppet is only looking at root/puppet [20:42:04] the puppetmaster on sockpuppet derives from root/puppet and not from /var/lib/git/operations/puppet/? Are you sure? [20:43:09] stafford seems to only have /var/lib/git/operations/puppet/, so that's a relief. [20:43:25] andrewbogott: the puppetmaster on sockpuppet is actually in /etc/puppet [20:43:26] i think [20:43:35] the post-merge there rsyncs stuff from /root/puppet to /etc/puppet [20:43:36] AaronSchulz: thanks for the comment on 75146. do you think a global $wmfDatacenter is required there then? I don't think it's been defined as global in the function that requires StartProfiler.php [20:44:14] damn [20:44:17] check /etc/puppet/puppet.conf [20:44:25] to see what the puppetmaster daemon actually uses [20:44:32] looks like yeah, stafford uses /var/lib/git/operations/puppet [20:44:39] manifestdir = /var/lib/git/operations/puppet/manifests [20:44:46] (/etc/puppet/manifests is the default) [20:45:34] ok, so to fix this mess, sockpuppet should also use /var/lib/git/operations/puppet/ and /etc/puppet should be abolished. [20:45:35] y'think? [20:45:45] binasher: which function? I'm looking at 'require "$IP/StartProfiler.php";' in WebStart.php [20:46:18] you mean getMediaWiki/getMediaWikiCli? [20:46:21] AaronSchulz: is it all in global land still? [20:48:11] errrrrrrrrr [20:48:15] i dunno andrewbogott [20:48:21] if i was doing this from scratch [20:48:41] binasher: yeah [20:48:41] i would make /etc/puppet be the place puppetmasters run from [20:48:47] and maybe some other location the git clone [20:48:48] wherever [20:48:55] /var/lib… is fine [20:48:58] or /root/... is fine [20:49:01] doesn't matter to me [20:49:08] looks like that's where those variables come from right now as used [20:49:17] wait, why would the clone be a different place from where the puppetmasters run? [20:49:17] but the /etc/puppet would be cloned from wherever the main repo is [20:49:32] well, IF i was doing this on my own servers, that's what I'd do [20:49:40] but I think mark has it this way for sanity and security reasons [20:49:55] the main clone (where merges are done) is kinda like a last hold staging area for review [20:49:57] reviews [20:50:17] Ah, sure. [20:50:24] binasher: did you see the 'global $wmfDatacenter, $wmfRealm;' in MWRealm.php ? [20:50:29] OK, it still seems like there's a third repo on sockpuppet for no reason [20:50:34] Which is presumably the one in /var/lib/git/operations/puppet/ [20:51:18] AaronSchulz: ah! nope, i was only looking in core [20:52:00] * andrewbogott draws some pictures [20:52:01] AaronSchulz: do you +1 the change aside from the "will this work" question? 
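Putting the pieces of this exchange together, a rough sketch of what sockpuppet's post-merge hook appears to do, plus the submodule step proposed earlier; the ssh line is quoted from this log, while the hook shape, set -e, and the rsync flags are assumptions rather than the deployed file:

    #!/bin/bash
    # sketch of sockpuppet's post-merge hook as described above -- illustrative only
    set -e
    # sync the freshly merged tree into the directory the local puppetmaster serves from
    rsync -a /root/puppet/ /etc/puppet/
    # tell the puppetmaster on stafford to update its own clone, now also initialising submodules
    ssh root@stafford.pmtpa.wmnet \
      'cd /var/lib/git/operations/puppet && git pull && git submodule update --init'

The manifestdir line quoted above is what ties stafford's running master to that clone; if the setting were absent, Puppet would fall back to the /etc/puppet/manifests default mentioned in the log.
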
[20:53:02] PROBLEM - Auth DNS on ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:54:05] (CR) Aaron Schulz: [C: 1] only randomly profile http requests if $wmfDatacenter == 'eqiad' [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75146 (owner: Asher) [20:54:10] the idea is fine, yes [20:54:29] (CR) Asher: [C: 2 V: 2] only randomly profile http requests if $wmfDatacenter == 'eqiad' [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75146 (owner: Asher) [20:54:30] (Merged) Asher: only randomly profile http requests if $wmfDatacenter == 'eqiad' [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75146 (owner: Asher) [20:54:33] well then [20:54:52] graphs gonna be pretty again [20:55:06] (PS1) Ryan Lane: Add ssl1005/6 config [operations/puppet] - https://gerrit.wikimedia.org/r/75245 [20:57:22] binasher: heh, so apache.log is still in tampa? heh [20:57:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 20:57:50 UTC 2013 [20:57:52] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 20:57:50 UTC 2013 [20:58:20] (PS1) Ottomata: Fixing multi instance hostname issue [operations/puppet] - https://gerrit.wikimedia.org/r/75247 [20:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [20:58:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [20:58:40] (CR) Ottomata: [C: 2 V: 2] Fixing multi instance hostname issue [operations/puppet] - https://gerrit.wikimedia.org/r/75247 (owner: Ottomata) [20:58:41] (Merged) Ottomata: Fixing multi instance hostname issue [operations/puppet] - https://gerrit.wikimedia.org/r/75247 (owner: Ottomata) [20:58:57] AaronSchulz: "syslog.eqiad.wmnet is an alias for nfs-home.pmtpa.wmnet." heh :/ [20:59:22] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 20:59:19 UTC 2013 [20:59:37] (CR) Ryan Lane: [C: 2] Add ssl1005/6 config [operations/puppet] - https://gerrit.wikimedia.org/r/75245 (owner: Ryan Lane) [20:59:37] (Merged) Ryan Lane: Add ssl1005/6 config [operations/puppet] - https://gerrit.wikimedia.org/r/75245 (owner: Ryan Lane) [20:59:37] I assume that could be on fluorine ideally? [20:59:37] (PS7) Ottomata: Adding role::analytics::hue [operations/puppet] - https://gerrit.wikimedia.org/r/74388 [20:59:42] so much random stuff in tampa :( [21:00:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [21:00:30] AaronSchulz: i would think so. need to verify if any code is consuming apache.log [21:02:17] (PS1) Brian Wolff: Change fa wikis to use uca-fa sort order [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75248 [21:02:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 21:02:50 UTC 2013 [21:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [21:04:41] !log asher synchronized wmf-config/StartProfiler.php 'only randomly profile web reqs in eqiad' [21:04:50] Logged the message, Master [21:05:50] and "tcpdump port 3811 and udp" shows that change had the desired effect [21:06:28] andrewbogott: also, we don't keep puppet.conf and fileserver.conf files in operations/puppet [21:06:49] i think another reason to keep the main clone separate from the puppetmaster's conf dir is so that the clone can be clean [21:06:55] hrm, or maybe it didn't [21:09:18] ah, pmtpa hosts are still logging wfIncrStats calls, of course. 
but not profiling any more. [21:12:32] RECOVERY - Packetloss_Average on gadolinium is OK: OK: packet_loss_average is 2.65400275 [21:17:50] k i gotta run, laters [21:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 21:24:45 UTC 2013 [21:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [21:26:03] (PS1) Cmjohnson: adding ssl1005/6 to dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/75250 [21:27:37] (CR) Cmjohnson: [C: 2 V: 2] adding ssl1005/6 to dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/75250 (owner: Cmjohnson) [21:27:38] (Merged) Cmjohnson: adding ssl1005/6 to dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/75250 (owner: Cmjohnson) [21:29:02] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 21:28:57 UTC 2013 [21:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [21:32:02] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 21:31:58 UTC 2013 [21:32:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [21:33:22] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 21:33:15 UTC 2013 [21:34:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [21:47:02] RECOVERY - Puppet freshness on bast1001 is OK: puppet ran at Mon Jul 22 21:46:59 UTC 2013 [21:47:29] oh noes, dns froze up on ns1 [21:47:35] !log restarting powerdns on ns1 [21:47:46] Logged the message, Mistress of the network gear. [21:47:53] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:48:20] Ryan_Lane, what systems (if any) use sockpuppet as their puppet master? [21:48:32] all systems use sockpuppet for certs [21:48:52] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [21:49:21] Ryan_Lane, why use a different box for certs vs. everything else? [21:49:42] historical -- when we moved from sockpuppet to stafford didn't want to redo all the certs [21:49:42] because no one has finished moving stuff [21:49:50] Ah, ok. [21:49:52] salt is also on sockpuppet [21:49:52] RECOVERY - Auth DNS on ns1.wikimedia.org is OK: DNS OK: 0.035 seconds response time. www.wikipedia.org returns 208.80.154.225 [21:50:15] I need to enable multi-master salt and put one in eqiad [21:52:01] So, as far as the public repo is concerned, is this diagram accurate? https://wikitech.wikimedia.org/wiki/File:Lifeofpuppetpatch.png [21:52:56] needs more tigers and jackals [21:53:28] it's partially wrong [21:53:35] virt0 pulls directly from gerrit [21:53:45] on a cron [21:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 21:54:51 UTC 2013 [21:55:21] ok, fixing that part... [21:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [21:57:27] Ryan_Lane: OK, updated. My question is: why do we have both /etc/puppet and /root/puppet on sockpuppet? [21:57:42] historical reasons [21:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 21:57:38 UTC 2013 [21:58:12] any *good* reason? :P [21:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [21:58:27] not really. 
I set it up that way and no one has ever changed it [21:58:38] (sorry, I can only say that since I'm a non-interested/affected party) [21:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 21:58:49 UTC 2013 [21:58:53] Ryan_Lane, actually, I just changed the diagram properly to reflect my understanding, which is... [21:58:53] it was changed on new systems [21:59:21] puppet merge updates /root/puppet, then copies everything into /etc/puppet then tells stafford to pull, which pulls from the repo in /etc/puppet [21:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [21:59:26] correct? [21:59:45] !log dns update [21:59:55] Logged the message, Master [22:01:35] andrewbogott: yep [22:01:50] it's kind of silly :) [22:02:10] puppet-merge should probably just update /var/lib/git/operations [22:02:32] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:02:37] Nah, that part is right, since /var/lib/git/operations is the waiting room for patches that aren't officially 'merged' yet [22:02:48] I mean, it adds an additional, possibly silly, security layer. [22:02:52] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 22:02:50 UTC 2013 [22:03:02] not really, because it's doing a fetch and then a compare [22:03:14] it only applies if you say yes [22:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [22:03:43] (PS1) Cmjohnson: changing ssl1006 dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/75257 [22:03:50] !log restarting ns2 [22:03:53] um… but right now doesn't puppet merge merge from /var/lib/git/operations into /root/puppet? [22:04:00] Logged the message, Master [22:04:08] I think the original reason it was in /root was that /etc/puppet wasn't a git repo [22:04:12] So it doesn't make sense to have it merge into /var/lib/git/operations 'cause it would just be merging it with itself [22:04:25] and that I had a hook to rsync everything on merge [22:04:44] (CR) Cmjohnson: [C: 2 V: 2] changing ssl1006 dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/75257 (owner: Cmjohnson) [22:04:45] (Merged) Cmjohnson: changing ssl1006 dhcpd [operations/puppet] - https://gerrit.wikimedia.org/r/75257 (owner: Cmjohnson) [22:04:51] I mean just get rid of /root/puppet [22:05:09] /root/puppet isn't really used for anything as far as I can tell [22:05:22] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.096 seconds response time. www.wikipedia.org returns 208.80.154.225 [22:05:49] it really didn't like the large amount of changes [22:05:50] Ah, I see what you're saying. Yeah, I think that's right. [22:06:52] So, I started looking at this because I wanted to change one of the hooks in /root/puppet. But now I think I should just replace everything in that hook with stuff in puppet-merge. [22:07:13] yep [22:07:33] down with my 3 year old cruft! :D [22:09:44] So, we would end up with this: https://wikitech.wikimedia.org/wiki/File:Lifeofpuppetpatchfuture.png [22:11:08] How does sockpuppet's /var/lib/git/operations/puppet repo get updated when gerrit merges a patch? 
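To make the fetch/compare/confirm behaviour described above concrete, here is a heavily simplified sketch of the kind of flow puppet-merge implements; it is not the real script, and the branch name, prompt wording, and everything beyond "fetch, show the diff, apply only on yes" are assumptions:

    #!/bin/bash
    # simplified sketch of the puppet-merge flow described in this conversation
    set -e
    cd /root/puppet                            # clone where merges are applied
    git fetch origin                           # fetch only; nothing is applied yet
    git log --stat HEAD..origin/production     # show what would be merged ('production' branch name assumed)
    read -r -p 'Merge these changes? [y/N] ' answer
    [ "$answer" = "y" ] || exit 1              # it only applies if you say yes
    git merge --ff-only origin/production
    git submodule update --init                # plus submodule handling
    # the hand-off (copy into /etc/puppet, tell stafford to pull) then happens in the
    # post-merge hook, per the sketch earlier in this log
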
[22:11:21] andrewbogott: /etc/puppet should have links to /var/lib/git/operations/puppet [22:12:00] Ryan_Lane, Uhoh, I'm confused again [22:12:16] puppet is configured to use /etc/puppet [22:12:27] but the repo doesn't match up well with that filesystem layout [22:12:40] does 'puppet merge' merge from sockpuppet (in one place) to sockpuppet (in another place)? Or does it merge from gerrit? [22:12:44] so, we link directories in /etc/puppet to directories in /var/lib/git/operations/puppet [22:12:56] just merges from gerrit, I believe [22:13:03] it does a fetch, then a compare [22:13:09] then when you say yes, a merge [22:13:19] + logic for submodules [22:14:02] Ryan_Lane, it doesn't look to me, right now, like /etc/puppet is linked to /var/lib/git/ on sockpuppet. They seem totally disjoint. [22:14:05] Am I missing a link? [22:14:16] yo MaxSem, do you know who is working on getting solr off of vanadium? [22:14:20] hm. maybe that's never been fixed? [22:15:12] that's why I'm so confused… I can't tell what each of those dirs do [22:15:20] hence my weird diagram, which is still not right :( [22:15:22] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [22:15:40] puppet is served from /etc/puppet [22:15:42] for sure [22:15:53] maybe /var/lib/... updates /etc? [22:16:09] I bet that /var/lib/... is updated but just sits there untouched by anything. 
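A minimal sketch of the linking scheme described just above, with directories inside /etc/puppet pointing at the checkout in /var/lib/git; which subdirectories actually get linked is not stated in the log, so the list here is illustrative:

    # illustrative only -- run on a puppetmaster that serves from /etc/puppet
    ln -sfn /var/lib/git/operations/puppet/manifests /etc/puppet/manifests
    ln -sfn /var/lib/git/operations/puppet/modules   /etc/puppet/modules
    ln -sfn /var/lib/git/operations/puppet/templates /etc/puppet/templates
    ln -sfn /var/lib/git/operations/puppet/files     /etc/puppet/files
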
[22:35:38] notpeter, just apply the same role as vanadium and poke nikerabbit to index it and make it used by MW config [22:36:06] MaxSem: cool, can do [22:37:36] (PS1) Aaron Schulz: Added DB performance log [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75260 [22:38:10] (CR) jenkins-bot: [V: -1] Added DB performance log [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75260 (owner: Aaron Schulz) [22:39:14] (PS2) Aaron Schulz: Added DB performance log [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75260 [22:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 22:54:44 UTC 2013 [22:55:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [22:58:12] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 22:58:11 UTC 2013 [22:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [22:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 22:58:47 UTC 2013 [22:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [23:02:12] Yay, lightning deploy! [23:02:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 23:02:41 UTC 2013 [23:03:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [23:07:23] !log anomie synchronized php-1.22wmf10/extensions/CentralAuth 'Update CentralAuth to fix bug 51644' [23:07:52] !log anomie synchronized php-1.22wmf11/extensions/CentralAuth 'Update CentralAuth to fix bug 51644' [23:09:51] (PS1) Andrew Bogott: Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 [23:10:07] (CR) jenkins-bot: [V: -1] Simplify our puppet master setup. [operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [23:10:21] (CR) Andrew Bogott: [C: -1] "Do not merge yet, this will need some hand-holding when it applies." 
[operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [23:11:14] Ryan_Lane, regarding ^, I could use a hand sorting out how to handle the private repo [23:16:02] (PS2) Ori.livneh: Refactor sysctl [operations/puppet] - https://gerrit.wikimedia.org/r/75087 [23:16:06] (CR) Lcarr: "(2 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/75263 (owner: Andrew Bogott) [23:16:17] (CR) jenkins-bot: [V: -1] Refactor sysctl [operations/puppet] - https://gerrit.wikimedia.org/r/75087 (owner: Ori.livneh) [23:18:57] (PS3) Ori.livneh: Refactor sysctl [operations/puppet] - https://gerrit.wikimedia.org/r/75087 [23:24:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 23:24:46 UTC 2013 [23:25:32] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [23:28:22] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 23:28:20 UTC 2013 [23:28:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 23:28:51 UTC 2013 [23:29:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [23:29:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours [23:32:42] RECOVERY - Puppet freshness on cp1043 is OK: puppet ran at Mon Jul 22 23:32:37 UTC 2013 [23:33:22] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [23:42:47] (PS1) Pyoungmeister: adding zinc as rtt solr host [operations/puppet] - https://gerrit.wikimedia.org/r/75265 [23:50:28] (CR) MaxSem: [C: 1] adding zinc as rtt solr host [operations/puppet] - https://gerrit.wikimedia.org/r/75265 (owner: Pyoungmeister) [23:53:21] (Abandoned) Andrew Bogott: Half-assed attempt to track the post-merge hook on sockpuppet. [operations/puppet] - https://gerrit.wikimedia.org/r/75159 (owner: Andrew Bogott) [23:53:38] (Abandoned) Andrew Bogott: Submodule update --init on stafford post merge. [operations/puppet] - https://gerrit.wikimedia.org/r/75160 (owner: Andrew Bogott) [23:54:52] RECOVERY - Puppet freshness on cp1042 is OK: puppet ran at Mon Jul 22 23:54:50 UTC 2013 [23:55:33] PROBLEM - Puppet freshness on cp1042 is CRITICAL: No successful Puppet run in the last 10 hours [23:57:42] RECOVERY - Puppet freshness on cp1044 is OK: puppet ran at Mon Jul 22 23:57:34 UTC 2013 [23:58:22] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [23:58:52] RECOVERY - Puppet freshness on cp1041 is OK: puppet ran at Mon Jul 22 23:58:42 UTC 2013 [23:59:22] PROBLEM - Puppet freshness on cp1041 is CRITICAL: No successful Puppet run in the last 10 hours