[00:01:33] (PS2) Chad: More elasticsearch tools [puppet] - https://gerrit.wikimedia.org/r/164270
[00:24:29] (PS4) Krinkle: [WIP] Implement role::ci::slave::localbrowser (Chromium) [puppet] - https://gerrit.wikimedia.org/r/163791
[00:28:12] (PS5) Krinkle: [WIP] Implement role::ci::slave::localbrowser (Chromium) [puppet] - https://gerrit.wikimedia.org/r/163791
[00:28:22] (CR) Krinkle: "Ori: Thanks for that great feedback. Done in the next patchset." [puppet] - https://gerrit.wikimedia.org/r/163791 (owner: Krinkle)
[00:29:57] PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:31:16] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:32:17] ACKNOWLEDGEMENT - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ori.livneh Tim debugging HHVM bug
[00:32:17] ACKNOWLEDGEMENT - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds ori.livneh Tim debugging HHVM bug
[00:35:06] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.491 second response time
[00:35:57] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 67656 bytes in 0.257 second response time
[00:41:08] Anybody who wants to get a segfault bt for me?
[00:41:11] * Any op
[00:41:47] hoo: sure
[00:42:06] ori: Just tell me which host to send my failing stuff to
[00:43:20] ori: How about mw1024 ? :P
[00:45:04] sorry, I misread your question. I can't do that at the moment, but file a bug with the request and I'll do it later
[00:47:36] ori: https://bugzilla.wikimedia.org/show_bug.cgi?id=71542
[01:08:24] Re: the earlier scribuntu/lua bug, was that fixed so fast that there isn't a bug# ? (I just want a link, to close the onwiki thread with.
:)
[01:16:26] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: /mnt/tmpfs 0 MB (0% inode=99%):
[01:38:24] (CR) coren: [C: 1] "This should work." [puppet] - https://gerrit.wikimedia.org/r/164234 (owner: coren)
[01:38:37] * Coren could use an ee on ^^
[01:38:40] eye*
[01:56:11] (CR) BBlack: [C: 1] Fix apparmor to allow for new cert scheme. [puppet] - https://gerrit.wikimedia.org/r/164234 (owner: coren)
[01:57:03] Coren: ^
[01:57:20] Saw. Thank you. Alex only noticed that timebomb by accident this evening.
[01:57:38] (CR) coren: [C: 2] Fix apparmor to allow for new cert scheme. [puppet] - https://gerrit.wikimedia.org/r/164234 (owner: coren)
[01:58:46] New release '14.04.1 LTS' available. What? A point releast on an LTS?
[02:05:26] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 1 failures
[02:08:57] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:22:07] RECOVERY - Disk space on ocg1003 is OK: DISK OK
[02:24:08] !log LocalisationUpdate completed (1.24wmf22) at 2014-10-02 02:24:08+00:00
[02:24:15] Logged the message, Master
[02:31:27] * Coren looks into those fails.
[02:38:50] Ah, the abstractions dir is only there iff there is something that depends on it. Vexing.
[02:47:19] (CR) Cscott: "Probably wait until I deploy the last collection config path tomorrow during SWAT. And I'm a little confused since I think the mediawiki-" [puppet] - https://gerrit.wikimedia.org/r/162814 (owner: Dzahn)
[02:47:20] !log LocalisationUpdate completed (1.25wmf1) at 2014-10-02 02:47:20+00:00
[02:47:28] Logged the message, Master
[02:49:37] (PS1) coren: Fix to the ssl_certs apparmor profile install [puppet] - https://gerrit.wikimedia.org/r/164272
[02:50:32] bblack: Still around for a fix fix for carbon and labstore1001?
[02:50:38] ^^
[02:51:01] Funny how, apparently, only two boxen total in all of prod has no /abstractions/ profile at all.
:-)
[03:00:08] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[03:00:28] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[03:03:27] (PS1) Hoo man: Remove all references to pmtpa from role::cache [puppet] - https://gerrit.wikimedia.org/r/164273
[03:03:29] (PS1) Hoo man: Remove squid monitoring from torrus [puppet] - https://gerrit.wikimedia.org/r/164274
[03:11:29] (PS2) Springle: remove pmtpa db's from coredb role config [puppet] - https://gerrit.wikimedia.org/r/164244 (owner: Dzahn)
[03:12:31] (CR) Springle: [C: 2] remove pmtpa db's from coredb role config [puppet] - https://gerrit.wikimedia.org/r/164244 (owner: Dzahn)
[03:14:25] (CR) Springle: [C: -2] "until friday (for amaranth)" [puppet] - https://gerrit.wikimedia.org/r/164253 (owner: Dzahn)
[03:14:35] (CR) Springle: [C: -2] "until friday (for amaranth)" [dns] - https://gerrit.wikimedia.org/r/164257 (owner: Dzahn)
[03:14:46] Coren: even the fix is a little scary, because I kinda hate apparmor, and I imagine it doing awful things to running services on those boxes that weren't ready for it :)
[03:15:00] but, it's just carbon and labstore, not mw* or cp*, so meh
[03:15:19] (CR) BBlack: [C: 1] Fix to the ssl_certs apparmor profile install [puppet] - https://gerrit.wikimedia.org/r/164272 (owner: coren)
[03:20:51] bblack: It was hotfixed there, but getting it into puppet is muy important. :-)
[03:22:05] (CR) coren: [C: 2] "This matches a hotfix already deployed on carbon and labstore1001 (the only two boxes that lacked the directories)" [puppet] - https://gerrit.wikimedia.org/r/164272 (owner: coren)
[03:23:52] bblack: That said, it would probaby best to restart apparmor on mw* and cp* to make extra juicy certain that the new profile has been taken into account.
I worry that a restart of apache might be problematic in the future otherwise.
[03:24:02] bblack: I leave that to your best judgment though.
[03:25:22] Well, not restart but reload (restart is certainly overkill, though wouldn't be harmful)
[04:03:16] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]
[04:08:17] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Oct 2 04:08:17 UTC 2014 (duration 8m 16s)
[04:08:23] Logged the message, Master
[04:22:58] !log springle Synchronized wmf-config/db-eqiad.php: depool db1073 (duration: 00m 13s)
[04:23:05] Logged the message, Master
[04:32:54] (PS1) Ori.livneh: Forward syslog messages logged by the kernel about HHVM to fluorine [puppet] - https://gerrit.wikimedia.org/r/164278
[04:33:03] ^ TimStarling
[04:34:24] (PS2) Ori.livneh: Forward syslog messages logged by the kernel about HHVM to fluorine [puppet] - https://gerrit.wikimedia.org/r/164278
[04:40:39] !log springle Synchronized wmf-config/db-eqiad.php: upgraded db1073, repool, warm up (duration: 00m 07s)
[04:40:45] Logged the message, Master
[04:50:25] (Abandoned) Springle: Enable extra_port 3307 on MariaDB 10 slaves [puppet] - https://gerrit.wikimedia.org/r/157335 (owner: Springle)
[05:04:48] !log springle Synchronized wmf-config/db-eqiad.php: divert writes away from external storage cluster 25 (duration: 00m 10s)
[05:04:53] Logged the message, Master
[05:17:16] (PS1) Springle: Depool es1008 for upgrade. Prepare es1009 as new es3 master. [mediawiki-config] - https://gerrit.wikimedia.org/r/164280
[05:18:23] (CR) Springle: [C: 2] Depool es1008 for upgrade. Prepare es1009 as new es3 master. [mediawiki-config] - https://gerrit.wikimedia.org/r/164280 (owner: Springle)
[05:18:30] (Merged) jenkins-bot: Depool es1008 for upgrade.
Prepare es1009 as new es3 master. [mediawiki-config] - https://gerrit.wikimedia.org/r/164280 (owner: Springle)
[05:18:50] (PS1) BBlack: authdns: gdnsd 2.x compat for scripts [puppet] - https://gerrit.wikimedia.org/r/164281
[05:18:56] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[05:19:24] !log springle Synchronized wmf-config/db-eqiad.php: depool es1008 (duration: 00m 08s)
[05:19:32] Logged the message, Master
[05:19:53] (CR) BBlack: [C: 2] authdns: gdnsd 2.x compat for scripts [puppet] - https://gerrit.wikimedia.org/r/164281 (owner: BBlack)
[05:21:08] !log Manual sync-common on mw1053
[05:21:15] Logged the message, Master
[05:24:28] (PS1) BBlack: no-op test change for authdns scripts [dns] - https://gerrit.wikimedia.org/r/164282
[05:27:35] (CR) BBlack: "recheck" [dns] - https://gerrit.wikimedia.org/r/164282 (owner: BBlack)
[05:27:42] (CR) jenkins-bot: [V: -1] no-op test change for authdns scripts [dns] - https://gerrit.wikimedia.org/r/164282 (owner: BBlack)
[05:28:02] that was fast :p
[05:29:52] hey springle - I see mw1163 has stopped spamming errors
[05:31:50] MaxSem: it was complaining about db1073, which i've since repooled after upgrade
[05:32:13] i guess the basic problem is still there
[05:32:27] the low error rate suggests osmething freaky like APC
[05:32:51] maybe
[05:33:41] (PS1) Springle: promote es1009 to es3-master [dns] - https://gerrit.wikimedia.org/r/164283
[05:34:46] after that, there was some problems with various db hosts on mw1093 (LoadBalancer choked on something?)
[05:35:48] springle: you won't be able to merge dns for a few minutes, I'm breaking the scripts :)
[05:36:30] (PS1) BBlack: authdns: Fix v2 support in scripts [puppet] - https://gerrit.wikimedia.org/r/164284
[05:36:37] grmbl
[05:36:44] (CR) BBlack: [C: 2 V: 2] authdns: Fix v2 support in scripts [puppet] - https://gerrit.wikimedia.org/r/164284 (owner: BBlack)
[05:36:46] bblack: no worries. no hurry :)
[05:37:15] MaxSem: don't know, havn't checked those yet
[05:37:36] some may have been es1008
[05:38:12] yup, that one looks distinctive in logs, all hosts running long queries didn't like it:)
[05:38:37] (PS1) Springle: prepare es1008 for upgrade [puppet] - https://gerrit.wikimedia.org/r/164285
[05:39:31] meanwhile, exception logs are overflown with testwiki woes
[05:39:36] (CR) BBlack: "recheck" [dns] - https://gerrit.wikimedia.org/r/164282 (owner: BBlack)
[05:40:25] (CR) BBlack: [C: 2] no-op test change for authdns scripts [dns] - https://gerrit.wikimedia.org/r/164282 (owner: BBlack)
[05:41:00] springle: ok it works again
[05:41:10] awesome
[05:41:24] (PS2) Springle: promote es1009 to es3-master [dns] - https://gerrit.wikimedia.org/r/164283
[05:41:46] (PS2) Springle: prepare es1008 for upgrade [puppet] - https://gerrit.wikimedia.org/r/164285
[05:42:01] (CR) Springle: [C: 2] promote es1009 to es3-master [dns] - https://gerrit.wikimedia.org/r/164283 (owner: Springle)
[05:42:47] (CR) Springle: [C: 2] prepare es1008 for upgrade [puppet] - https://gerrit.wikimedia.org/r/164285 (owner: Springle)
[05:52:14] <_joe_> bblack, springle good morning
[05:52:16] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[05:52:31] <_joe_> (or goodnight, or afternoon)
[05:56:20] :)
[05:59:28] !log upgrade es1008 mariadb 10, restart
[05:59:37] Logged the message, Master
[06:00:10] !log mw1053 still flooding error logs with "Unrecognized job type 'EchoNotificationDeleteJob'." Disabling Puppet and jobrunner for now, planning to investigate during SF daytime hours.
[06:00:15] Logged the message, Master
[06:04:39] (PS1) Springle: repool es1008 [mediawiki-config] - https://gerrit.wikimedia.org/r/164286
[06:05:23] (CR) Springle: [C: 2] repool es1008 [mediawiki-config] - https://gerrit.wikimedia.org/r/164286 (owner: Springle)
[06:05:29] (Merged) jenkins-bot: repool es1008 [mediawiki-config] - https://gerrit.wikimedia.org/r/164286 (owner: Springle)
[06:07:17] !log springle Synchronized wmf-config/db-eqiad.php: repool es1008, renable writes to external storage cluster 25 (duration: 00m 06s)
[06:07:29] Logged the message, Master
[06:07:58] ugh. read_only
[06:08:08] so close to a clean switch
[06:27:45] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: puppet fail
[06:28:15] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: puppet fail
[06:28:19] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: puppet fail
[06:28:25] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: puppet fail
[06:28:36] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: puppet fail
[06:28:46] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail
[06:28:47] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: puppet fail
[06:29:15] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:55] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:26] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:35] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on
db1048 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:56] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:05] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:05] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:05] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:06] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:06] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:25] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:28] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:35] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:35] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:36] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:45] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:46] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:55] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:56] PROBLEM - puppet last run on db1006 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:36:54] !log springle Synchronized wmf-config/db-eqiad.php: depool es1010 (duration: 00m 07s)
[06:37:01] Logged the message, Master
[06:37:54] (PS3) Ori.livneh: Forward syslog messages logged by the kernel about HHVM to fluorine [puppet] - https://gerrit.wikimedia.org/r/164278
[06:40:27] (PS1) Springle: prepare es1010 for upgrade [puppet] - https://gerrit.wikimedia.org/r/164290
[06:41:06] (CR) Springle: [C: 2] prepare es1010 for upgrade
[puppet] - https://gerrit.wikimedia.org/r/164290 (owner: Springle)
[06:42:55] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[06:44:57] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:45:06] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:45:25] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[06:45:26] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:45:26] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:45:26] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:45:36] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:45:38] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:45:56] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:46:05] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:46:25] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:46:26] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:46:26] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:46:26] RECOVERY - puppet
last run on mw1118 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:46:36] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:46:37] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:46:37] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:46:55] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[06:47:00] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:47:07] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[06:47:15] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:47:16] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:47:25] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:47:45] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:48:05] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:48:25] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:53:16] RECOVERY - puppet last run on db1006 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:53:43] !log upgrade es1010 mariadb 10, restart
[06:53:53] Logged the message, Master
[06:58:39] !log springle Synchronized wmf-config/db-eqiad.php: repool es1010 (duration: 00m 07s)
[06:58:46] Logged the message, Master
[07:00:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0
[07:03:57] (PS1) Ori.livneh: Increase the stack size soft limit to 32MiB [puppet] - https://gerrit.wikimedia.org/r/164291
[07:06:59] (PS3) Springle: WIP: Cleanup the Sanitarium [software] - https://gerrit.wikimedia.org/r/163147
[07:07:36] _joe_: if you recover, https://gerrit.wikimedia.org/r/#/c/164291/ and https://gerrit.wikimedia.org/r/#/c/164278/ are the important ones
[07:10:14] (CR) Springle: [C: 2] WIP: Cleanup the Sanitarium [software] - https://gerrit.wikimedia.org/r/163147 (owner: Springle)
[07:12:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]
[07:13:26] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0
[07:27:31] !log Stopping Jenkins
[07:27:39] Logged the message, Master
[07:30:56] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Puppet has 1 failures
[07:33:41] !log Jenkins restarting
[07:33:50] Logged the message, Master
[07:34:28] (PS1) Springle: prepare dbproxy100[12] for trials [puppet] - https://gerrit.wikimedia.org/r/164293
[07:34:38] (PS1) BBlack: use cp1008 to test authdns stuff too [puppet] - https://gerrit.wikimedia.org/r/164294
[07:38:56] (CR) BBlack: [C: 2] use cp1008 to test authdns stuff too [puppet] - https://gerrit.wikimedia.org/r/164294 (owner: BBlack)
[07:43:46] PROBLEM - puppet last run on cp1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:44:01] (PS1) BBlack: fix gitrepo for role::authdns::testns [puppet] - https://gerrit.wikimedia.org/r/164295
[07:44:16] (CR) BBlack: [C: 2 V: 2] fix gitrepo for role::authdns::testns [puppet] -
https://gerrit.wikimedia.org/r/164295 (owner: BBlack)
[07:45:11] (PS2) Ori.livneh: Increase the stack size soft limit to 32MiB [puppet] - https://gerrit.wikimedia.org/r/164291
[07:45:20] (CR) Ori.livneh: [C: 2 V: 2] Increase the stack size soft limit to 32MiB [puppet] - https://gerrit.wikimedia.org/r/164291 (owner: Ori.livneh)
[07:46:12] (CR) Hashar: [C: 1] use scap's embedded linking, remove lint script [puppet] - https://gerrit.wikimedia.org/r/160691 (https://bugzilla.wikimedia.org/68255) (owner: Filippo Giunchedi)
[07:46:47] RECOVERY - puppet last run on cp1008 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[07:48:17] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[08:04:52] (PS1) Ori.livneh: Collapse /etc/hhvm/fcgi into /etc/hhvm [puppet] - https://gerrit.wikimedia.org/r/164296
[08:06:13] (CR) Hashar: [C: -1] "I got it for SauceLabs." [puppet] - https://gerrit.wikimedia.org/r/163791 (owner: Krinkle)
[08:10:26] !log Jenkins has been upgraded to latest LTS version 1.565.3
[08:10:36] Logged the message, Master
[08:23:58] (CR) Giuseppe Lavagetto: [C: 2] Collapse /etc/hhvm/fcgi into /etc/hhvm [puppet] - https://gerrit.wikimedia.org/r/164296 (owner: Ori.livneh)
[08:25:49] (PS4) Giuseppe Lavagetto: Forward syslog messages logged by the kernel about HHVM to fluorine [puppet] - https://gerrit.wikimedia.org/r/164278 (owner: Ori.livneh)
[08:26:00] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Forward syslog messages logged by the kernel about HHVM to fluorine [puppet] - https://gerrit.wikimedia.org/r/164278 (owner: Ori.livneh)
[08:57:11] _joe_: hhvm related I have a rather annoying dependencies issue when installing hhvm build-deps :/
[08:57:18] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: /mnt/tmpfs 0 MB (0% inode=99%):
[08:57:29] <_joe_> hashar: yea I know
[08:57:36] there
is some conflict between libjpeg62-dev and libjpeg8-dev :/
[08:57:40] <_joe_> sorry but didn't have the time to look into that
[08:57:41] anyway bug filled at https://bugzilla.wikimedia.org/show_bug.cgi?id=71413 :D
[08:58:43] (PS1) Filippo Giunchedi: syslog: switch to lithium [dns] - https://gerrit.wikimedia.org/r/164298
[08:59:06] I'm trying again with switching syslog to lithium, anyone for review?
[09:01:50] <_joe_> godog: isn't this going to trigger the swift rsyslog bug?
[09:02:00] changing the dns? no
[09:02:14] <_joe_> so you don't need to restart syslog?
[09:02:29] <_joe_> IDK if/how long it caches name resolution
[09:03:04] yesterday I wasn't able to make it pick it up the new address without a restart, didn't check the code tho
[09:03:21] but yes likely a restart, after the ttl has expired I'll start with the swift boxes
[09:06:30] <_joe_> godog: lithium is already configured I guess
[09:06:48] yep, I provisioned it yesterday
[09:07:14] (CR) Giuseppe Lavagetto: [C: 1] syslog: switch to lithium [dns] - https://gerrit.wikimedia.org/r/164298 (owner: Filippo Giunchedi)
[09:07:35] (CR) Filippo Giunchedi: [C: 2 V: 2] syslog: switch to lithium [dns] - https://gerrit.wikimedia.org/r/164298 (owner: Filippo Giunchedi)
[09:07:40] \o/ thanks
[09:08:24] <_joe_> !log restarting node-ocg on ocg1001; a _lot_ of deleted files with the FD still opened
[09:08:34] Logged the message, Master
[09:09:06] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [mediawiki-config] - https://gerrit.wikimedia.org/r/163815 (owner: Hashar)
[09:09:12] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [puppet] - https://gerrit.wikimedia.org/r/163814 (owner: Hashar)
[09:10:01] <_joe_> wtf, a dir there takes 32 GB!!!
[09:11:22] <_joe_> !log removing /mnt/tmpfs/fd29e937fea41d186175bcb880ef96980825dd1c.rdf2latex from ocg1001, it contains a 32 gb pdf
[09:11:31] Logged the message, Master
[09:12:28] RECOVERY - Disk space on ocg1001 is OK: DISK OK
[09:12:49] more bugs to be filled :D
[09:13:23] <_joe_> I think there is a bug for this already
[09:28:21] !log start rolling depooling/restart/pooling of swift frontends in eqiad to pick up syslog change
[09:28:30] Logged the message, Master
[09:37:38] PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: Connection refused
[09:37:48] PROBLEM - Swift HTTP frontend on ms-fe1001 is CRITICAL: Connection refused
[09:40:12] icinga-wm: shush!
[09:40:28] godog: schedule a downtime :)
[09:40:48] RECOVERY - Swift HTTP frontend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.044 second response time
[09:41:38] RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.023 second response time
[09:52:45] godog: so syslog moved ??
[09:52:50] :-D
[09:53:58] (PS1) Giuseppe Lavagetto: nagios_common: use a template for contacts. [puppet] - https://gerrit.wikimedia.org/r/164301
[09:54:18] akosiaris: yep :) now rolling-restart swift to avoid a python logging bug, then we'll need to restart rsyslog everywhere I fear :(
[09:55:58] <_joe_> godog: that can be done via salt quite easily
[09:58:09] yeah. finally. nfs1's syslog is the only blocker for shutting down pmtpa netapps
[09:59:28] PROBLEM - RAID on analytics1010 is CRITICAL: CRITICAL: Active: 7, Working: 7, Failed: 1, Spare: 0
[10:00:07] but the replication is already broken, correct?
[10:05:19] (CR) Alexandros Kosiaris: [C: -1] "Minor comment, otherwise LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/164301 (owner: Giuseppe Lavagetto)
[10:05:43] mark yes and most disks zeroed
[10:06:08] the ones not zeroed are the ones belonging to pmtpa_home
[10:07:03] <_joe_> it's always ones and zeroes
[10:07:44] i want to sanitize them with the string hax0r
[10:07:56] no that's wrong
[10:07:58] File:Sting.ogg
[10:07:59] h@x0R is better
[10:09:11] (CR) Giuseppe Lavagetto: nagios_common: use a template for contacts. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/164301 (owner: Giuseppe Lavagetto)
[10:30:54] (CR) JanZerebecki: [C: -1] "As the goal is not to actually run it in labs, but just to test it there, I think it is a bad idea to disable quite a part we want to test" [puppet] - https://gerrit.wikimedia.org/r/164103 (owner: Yuvipanda)
[10:36:16] (PS2) Springle: prepare dbproxy100[12] for trials [puppet] - https://gerrit.wikimedia.org/r/164293
[10:39:13] (CR) JanZerebecki: [C: -1] "I think having a different role for labs is bad as it increases the difference between labs and production." [puppet] - https://gerrit.wikimedia.org/r/158355 (owner: JanZerebecki)
[10:39:31] (Abandoned) JanZerebecki: Avoid referencing private contacts in icinga::monitor on labs.
[puppet] - https://gerrit.wikimedia.org/r/158355 (owner: JanZerebecki)
[10:54:38] !log restarted rsyslog in esams
[10:54:49] Logged the message, Master
[10:57:28] !log restarted rsyslog in ulsfo
[10:57:35] Logged the message, Master
[10:57:54] !log restarted rsyslog in codfw
[10:57:59] Logged the message, Master
[11:03:30] (PS5) Christopher Johnson (WMDE): Moves file ensure directory to init.pp, changes $title to $name Change-Id: If2e66a090581e10e350f3e7f9795e3b43c6b25da [puppet] - https://gerrit.wikimedia.org/r/162873
[11:04:53] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 205, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/3/1: down -
[11:16:26] right :P
[11:18:00] sigh
[11:18:06] i'm not authorized according to icinga
[11:20:58] sounds like you need to tell it to respect your authority
[11:36:04] PROBLEM - LDAPS on sanger is CRITICAL: Connection refused
[11:36:14] PROBLEM - LDAP on sanger is CRITICAL: Connection refused
[11:36:24] PROBLEM - Certificate expiration on sanger is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[11:44:12] (CR) Alexandros Kosiaris: [C: -1] "I 'd rather you used .19 and not .32. It 'll be a more compact use of IP space (less holes) and less prone to cause us difficulties if we " [dns] - https://gerrit.wikimedia.org/r/164238 (owner: Dzahn)
[11:44:27] shit
[11:44:33] a sanger
[11:44:55] for a moment I thought it was the new LDAP mirror
[11:45:52] i just disabled it
[11:46:45] did the imap sync ever finish ?
[11:46:50] no
[11:46:58] nowhere near
[11:47:22] i'm at like 25% now
[11:48:31] ETA ?
[11:48:36] somewhere in 2015 ?
[11:48:41] yeah roughly
[11:48:56] it's much quicker to setup my own imap server again ;)
[11:49:21] but this gmail thing must be working for someone ;)
[11:53:05] RECOVERY - LDAPS on sanger is OK: TCP OK - 0.031 second response time on port 636
[11:53:10] wtf
[11:53:27] RECOVERY - LDAP on sanger is OK: TCP OK - 0.037 second response time on port 389
[11:53:31] puppet ...
[11:53:37] the puppet cron job
[11:53:44] RECOVERY - Certificate expiration on sanger is OK: SSL_CERT OK - X.509 certificate for sanger.wikimedia.org from Wikimedia CA valid until Oct 11 20:23:26 2015 GMT (expires in 374 days)
[11:54:15] so the apparmor change from yesterday ? apparmor needs a restart for starters
[11:54:16] removed it
[11:54:31] then some servers don't even have apparmor installed
[11:55:18] still figuring it out
[11:56:02] (CR) Reedy: [C: -1] "Need to rebase/redo this..." [mediawiki-config] - https://gerrit.wikimedia.org/r/90704 (owner: Reedy)
[11:56:06] PROBLEM - LDAPS on sanger is CRITICAL: Connection refused
[11:56:13] _joe_: ^ Am I good to merge that when rebased?
:)
[11:56:22] (removing of most of the misc docroot folders)
[11:56:25] PROBLEM - LDAP on sanger is CRITICAL: Connection refused
[11:56:44] PROBLEM - Certificate expiration on sanger is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[11:58:04] !log Migrated all mediawiki-core-regression* jobs to Zuul cloner {{bug|71549}}
[11:58:10] Logged the message, Master
[11:59:49] (CR) Alexandros Kosiaris: [C: 2] "LGTM and i'll self merge this noew to see if the torrus compiler (yes compiler) successfully compiles the configuration" [puppet] - https://gerrit.wikimedia.org/r/164248 (owner: Dzahn)
[12:03:39] !log restarted apparmor throughout the fleet
[12:03:44] Logged the message, Master
[12:03:55] that is when I just love salt
[12:07:02] !log restarting rsyslog in eqiad
[12:07:08] akosiaris: I hope it goes as smoothly :)
[12:07:09] Logged the message, Master
[12:07:13] another happy salt user
[12:08:39] hehe
[12:09:00] hashar: you say "happy"... ;)
[12:28:04] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[12:28:37] looking
[12:31:53] (PS3) Reedy: Remove all superfluous docroot folders [mediawiki-config] - https://gerrit.wikimedia.org/r/90704
[12:32:13] !log rolling-restart swift on ms-be1*, saw increased load possibly as a cause of 5xx spike
[12:32:19] Logged the message, Master
[12:35:19] (PS3) Springle: prepare dbproxy100[12] for trials [puppet] - https://gerrit.wikimedia.org/r/164293
[12:39:07] (PS4) Springle: prepare dbproxy100[12] for trials [puppet] - https://gerrit.wikimedia.org/r/164293
[12:39:54] godog: was the rsyslog restart across the fleet?
[12:39:59] or just lithium's rsyslog?
[12:40:14] if it was the former, that's probably the cause of swift's increased load [12:40:26] godog is well aware of that [12:40:33] that swift bug ;) [12:41:14] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [12:43:50] paravoid: yes likely that, I tried to be careful and restart swift first and exclude it from the salt runs [12:45:21] clearly not enough though, I was trying to think about the solutions you outlined [12:58:44] akosiaris: looks like we're off nfs1 for syslog, just 6/7 pmtpa boxes still logging there which I didn't bother messing with [12:59:45] note we still have references to /home/wikipedia/syslog in puppet manifests [12:59:49] godog: whee :) [13:00:04] including the fatalmonitor script which tail/grep the syslog output [13:00:04] K4: Dear anthropoid, the time has come. Please deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141002T1300). [13:00:06] Reedy: happy days [13:00:22] hashar: That needs updating, and moving to fluorine [13:00:39] Vaguely on my todo list [13:00:51] hashar: yep it is a symlink on lithium ATM [13:01:09] fluorine would be ideal since it has all the log already [13:01:46] will have to update a bunch of doc :D or maybe make fatalmonitor to whine that logs are now on fluorine whenever the file does not exist [13:02:17] fluorine doesn't need the rest of misc::deployment::common_scripts though [13:04:03] moaar refactoring needed. I am not volunteering though [13:04:04] * Reedy refactors [13:06:45] (03PS1) 10Reedy: Move fatalmonitor to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/164314 [13:07:08] gah [13:07:47] (03PS2) 10Reedy: Move fatalmonitor to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/164314 [13:09:48] Reedy: a bit of formatting pls? :) [13:11:05] formatting? [13:11:38] err, wrapping [13:13:45] of what? the fatalmonitor script itself? 
:P [13:14:37] yeah, arguably it won't make it less ugly but at least readable [13:15:43] haha [13:16:24] What do you suggest doing? [13:16:44] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: puppet fail [13:16:50] (03CR) 10Anomie: Fully disable all mwlib formats; use OCG service instead. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164099 (owner: 10Cscott) [13:19:03] Reedy: I'll try with a PS [13:23:09] !log Zuul deadlocked somehow again :( [13:23:21] Logged the message, Master [13:23:29] (03PS3) 10Filippo Giunchedi: Move fatalmonitor to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/164314 (owner: 10Reedy) [13:23:50] godog: :) [13:23:55] Yeah, that's much more readable [13:26:00] Reedy: hehe https://google-styleguide.googlecode.com/svn/trunk/shell.xml is generally helpful for this stuff [13:29:56] (03CR) 10Filippo Giunchedi: [C: 031] "would it be of use in beta too?" [puppet] - 10https://gerrit.wikimedia.org/r/164314 (owner: 10Reedy) [13:30:11] !log Zuul back around :] [13:30:18] Logged the message, Master [13:30:20] (03CR) 10Cscott: Fully disable all mwlib formats; use OCG service instead. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164099 (owner: 10Cscott) [13:31:10] It's more legacy... I think beta "just" uses logstash [13:31:11] modules/beta/manifests/fatalmonitor.pp [13:31:41] ah ok nevermind then [13:33:40] yeah I don't think anyone use fatalmonitor on beta [13:33:59] did it ever work (in the past 6 months) [13:34:08] hoo: work where? 
[13:34:15] on beta [13:35:06] no idea :) [13:35:55] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [13:40:19] (03PS1) 10Mark Bergsma: Indent ACLs [puppet] - 10https://gerrit.wikimedia.org/r/164320 [13:40:21] (03PS1) 10Mark Bergsma: Add comments to explain the routers' functions [puppet] - 10https://gerrit.wikimedia.org/r/164321 [13:40:23] (03PS1) 10Mark Bergsma: Formatting of some options for improved readability [puppet] - 10https://gerrit.wikimedia.org/r/164322 [13:40:25] (03PS1) 10Mark Bergsma: Use local_parts router option instead of a condition [puppet] - 10https://gerrit.wikimedia.org/r/164323 [13:40:27] (03PS1) 10Mark Bergsma: Use Exim's macro facility to define INSTANCEPROJECT and MAILDOMAIN [puppet] - 10https://gerrit.wikimedia.org/r/164324 [13:42:55] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: /mnt/tmpfs 456 MB (1% inode=99%): [13:45:35] (03PS5) 10Springle: prepare dbproxy100[12] for trials [puppet] - 10https://gerrit.wikimedia.org/r/164293 [13:52:14] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=97%): [13:53:28] %&*@#$& [13:53:31] my bad [13:53:32] (03PS6) 10Springle: prepare dbproxy100[12] for trials [puppet] - 10https://gerrit.wikimedia.org/r/164293 [13:55:14] RECOVERY - Disk space on lithium is OK: DISK OK [14:06:14] RECOVERY - Disk space on ocg1002 is OK: DISK OK [14:09:57] cmjohnson1: you are not near eqiad, are you? [14:10:21] im here now [14:11:46] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=analytics1010&service=RAID [14:11:53] i think it is sde [14:13:31] that's the hadoop primary namenode [14:14:01] ottomata sde1[0](F) [14:14:30] its the name partition, which has 4 disks in raid 1 for super safety, so we are ok for now i think but we should get swapped [14:14:37] sooner rather than later to be safe? 
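Editor's note: the `sde1[0](F)` quoted above looks like the member notation from /proc/mdstat, where a trailing `(F)` marks a faulty device. Below is a minimal sketch for picking failed members out of such a line. The sample line is hypothetical (only `sde1[0](F)` comes from the log), and the function name `failed_members` is made up for illustration.

```python
import re

# Matches md member tokens such as "sde1[0](F)" or "sdf1[1]".
MEMBER = re.compile(r"(?P<dev>\w+)\[(?P<slot>\d+)\](?:\((?P<flag>[A-Z])\))?")


def failed_members(mdstat_line):
    """Return the device names flagged (F) in one mdstat array line."""
    return [
        m.group("dev")
        for m in MEMBER.finditer(mdstat_line)
        if m.group("flag") == "F"
    ]


# Hypothetical four-disk RAID 1 line; the other three members are invented.
sample = "md2 : active raid1 sde1[0](F) sdf1[1] sdg1[2] sdh1[3]"
print(failed_members(sample))
```

With three healthy mirrors left, the array keeps serving data, which matches the "ok for now, but swap it soon" reading in the channel.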
[14:15:09] we can also failover to the standby namenode if we need to [14:15:18] can those be hot swapped? that's a cisco :/ [14:15:43] they can be hot swapped...I have spares of them lying around [14:15:51] awesoome [14:16:42] should I make an RT to track it, or do you wanna just swap it? [14:21:57] cmjohnson1: ^? [14:22:21] make a RT plz [14:22:45] k [14:23:56] cmjohnson1: https://rt.wikimedia.org/Ticket/Display.html?id=8520 [14:25:39] apergos: [14:25:42] can you review this? [14:25:42] https://gerrit.wikimedia.org/r/#/c/164124/ [14:25:51] hoping to get it out before the metrics meeting today [14:25:56] i think toby might want to talk about it [14:26:40] (03PS1) 10Gilles: Prerender thumbnails at upload time on all wikis except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164336 [14:33:54] (03PS2) 10Mark Bergsma: Use Exim's macro facility to define INSTANCEPROJECT and MAILDOMAIN [puppet] - 10https://gerrit.wikimedia.org/r/164324 [14:33:56] (03PS2) 10Mark Bergsma: Use local_parts router option instead of a condition [puppet] - 10https://gerrit.wikimedia.org/r/164323 [14:33:58] (03PS2) 10Mark Bergsma: Formatting of some options for improved readability [puppet] - 10https://gerrit.wikimedia.org/r/164322 [14:35:19] (03PS1) 10coren: Tool Labs: fix mail to bind to LDAP and not use -x [puppet] - 10https://gerrit.wikimedia.org/r/164337 [14:35:55] you're not making it prettier are you [14:36:47] (03CR) 10coren: [C: 031] "Moar pretty" [puppet] - 10https://gerrit.wikimedia.org/r/164324 (owner: 10Mark Bergsma) [14:38:14] (03CR) 10Filippo Giunchedi: [C: 031] "clarified offline with Gilles, if original size < thumb size the scalers will 500 right away, and the job will fail" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164336 (owner: 10Gilles) [14:38:36] (03CR) 10coren: [C: 031] "Clear improvement" [puppet] - 10https://gerrit.wikimedia.org/r/164323 (owner: 10Mark Bergsma) [14:40:38] (03CR) 10coren: [C: 031] "De gustibus et coloribus. 
I find the no_X syntax clearer, but consistency is good too if we generally use the = false" [puppet] - 10https://gerrit.wikimedia.org/r/164322 (owner: 10Mark Bergsma) [14:41:46] ^ people not familiar with exim get confused by that as they can't find the no_ or not_ options as easily in manuals [14:41:50] (03CR) 10coren: [C: 031] "Moar dox" [puppet] - 10https://gerrit.wikimedia.org/r/164321 (owner: 10Mark Bergsma) [14:42:16] (03PS1) 10Filippo Giunchedi: syslog-ng: create archive directory [puppet] - 10https://gerrit.wikimedia.org/r/164338 [14:42:23] (03CR) 10coren: [C: 031] "Cosmetics ftw" [puppet] - 10https://gerrit.wikimedia.org/r/164320 (owner: 10Mark Bergsma) [14:43:08] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: puppet fail [14:46:55] (03CR) 10coren: [C: 032] Use local_parts router option instead of a condition [puppet] - 10https://gerrit.wikimedia.org/r/164323 (owner: 10Mark Bergsma) [14:47:05] (03CR) 10coren: [C: 032] Use Exim's macro facility to define INSTANCEPROJECT and MAILDOMAIN [puppet] - 10https://gerrit.wikimedia.org/r/164324 (owner: 10Mark Bergsma) [14:47:15] (03CR) 10coren: [C: 032] Formatting of some options for improved readability [puppet] - 10https://gerrit.wikimedia.org/r/164322 (owner: 10Mark Bergsma) [14:47:22] (03CR) 10coren: [C: 032] Add comments to explain the routers' functions [puppet] - 10https://gerrit.wikimedia.org/r/164321 (owner: 10Mark Bergsma) [14:47:28] (03CR) 10BryanDavis: [C: 031] use scap's embedded linking, remove lint script [puppet] - 10https://gerrit.wikimedia.org/r/160691 (https://bugzilla.wikimedia.org/68255) (owner: 10Filippo Giunchedi) [14:47:31] (03CR) 10coren: [C: 032] Indent ACLs [puppet] - 10https://gerrit.wikimedia.org/r/164320 (owner: 10Mark Bergsma) [14:47:40] (03CR) 10coren: [C: 032] Tool Labs: fix mail to bind to LDAP and not use -x [puppet] - 10https://gerrit.wikimedia.org/r/164337 (owner: 10coren) [14:50:45] marktraceur, ^d: Which of us wants to SWAT today? 
[14:51:10] making our build [14:51:21] * anomie would rather not [14:52:04] how much is in swat today besides our stuff? [14:52:29] aude: I see 5 other patches listed [14:52:32] ok [14:52:42] <^d> I can. [14:52:46] cscott, tgr, gi11es, Glaisher: Ping for SWAT [14:52:52] ^d: Good! (: [14:52:54] anomie: pong [14:52:57] pong [14:53:01] pong [14:53:08] pong [14:53:11] (03CR) 10BryanDavis: Move fatalmonitor to fluorine (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/164314 (owner: 10Reedy) [14:53:14] ping! [14:53:22] (isn't that how we play this game?) [14:53:24] packet loss [14:53:44] <^d> brb in 5m and we'll start swatting. [14:55:01] bd808: indeed, it shows I don't mind the \'s :) but yeah much better without those [14:55:22] godog: :) I hate line continuation characters [14:56:01] They are like pointy sticks for my eye to run into when reading a line of shell script code [14:57:22] hehe like remembering semicolumns, except \ can be omitted [14:59:53] (03PS4) 10Filippo Giunchedi: Move fatalmonitor to fluorine [puppet] - 10https://gerrit.wikimedia.org/r/164314 (owner: 10Reedy) [15:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141002T1500). Please do the needful. [15:00:59] waiting on jenkins [15:02:16] (03CR) 10Rush: "Can you make your commit message something more relevant to the change? As is it will not make sense once merged." [puppet] - 10https://gerrit.wikimedia.org/r/162873 (owner: 10Christopher Johnson (WMDE)) [15:02:24] <^d> Alrighty, let's see what we've got here. [15:02:39] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:03:40] <^d> cscott: You're first. [15:03:47] whoo! [15:04:22] (03CR) 10Chad: [C: 032] Fully disable all mwlib formats; use OCG service instead. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/164099 (owner: 10Cscott) [15:04:36] (03Merged) 10jenkins-bot: Fully disable all mwlib formats; use OCG service instead. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164099 (owner: 10Cscott) [15:05:57] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 05s) [15:06:02] Logged the message, Master [15:06:20] !log demon Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 04s) [15:06:26] <^d> cscott: Ok, you're live. [15:06:29] Logged the message, Master [15:06:59] <^d> tgr: Your patch needs a rebase. [15:07:27] looks like jenkins is almost done [15:08:19] (03CR) 10Chad: [C: 032] Enable DynamicPageList extension on srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164114 (https://bugzilla.wikimedia.org/68346) (owner: 10Glaisher) [15:08:27] <^d> Glaisher: Ok, you're up now [15:08:31] (03Merged) 10jenkins-bot: Enable DynamicPageList extension on srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164114 (https://bugzilla.wikimedia.org/68346) (owner: 10Glaisher) [15:08:55] !log Replaced exim4-deamon-light by exim4-daemon-heavy on tools-mail [15:09:01] Logged the message, Master [15:09:04] (03PS2) 10Gergő Tisza: Enable image dimension logging in MediaViewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163279 [15:09:11] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 04s) [15:09:17] (03PS1) 10BBlack: fix recdns forwarding for private zones (new ns1 addr) [puppet] - 10https://gerrit.wikimedia.org/r/164340 [15:09:17] Logged the message, Master [15:09:29] ^d: done [15:09:33] <^d> ty [15:09:41] updating submodule [15:09:47] (03CR) 10Chad: [C: 032] Enable image dimension logging in MediaViewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163279 (owner: 10Gergő Tisza) [15:10:00] (03Merged) 10jenkins-bot: Enable image dimension logging in MediaViewer [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/163279 (owner: 10Gergő Tisza) [15:10:16] cajoel: I'm looking at your labs instance 'flow-localpuppet' do you know if it's still in use and/or needed? [15:10:18] <_joe_> Reedy: I guess so [15:10:20] akosiaris: anything in particular I should before shutting nfs1 ? [15:10:21] (03CR) 10BBlack: [C: 032 V: 032] fix recdns forwarding for private zones (new ns1 addr) [puppet] - 10https://gerrit.wikimedia.org/r/164340 (owner: 10BBlack) [15:11:00] (03PS1) 10Chad: .gitconfig: not sure how push.default = simple snuck in there [puppet] - 10https://gerrit.wikimedia.org/r/164341 [15:11:26] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 04s) [15:11:33] ^d: https://gerrit.wikimedia.org/r/#/c/164342/ [15:11:35] !log demon Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 04s) [15:11:36] <^d> tgr: You're live ^ [15:11:39] * aude adds to wiki [15:11:39] Coren: incidentally, I noticed it's the direct IP for virt0 in our recdns here: https://gerrit.wikimedia.org/r/#/c/164340/1/templates/powerdns/recursor.conf.erb . Not sure if that's relevant to any ongoing stuff recently with labs resolution, maybe should be labs-ns[01] IPs there? [15:12:32] ^d: thanks! the actual logging will go out with the train, I'll test after that [15:12:36] (03PS6) 10Christopher Johnson (WMDE): Abstracts Sprint install with defined resource type phabricator::libext Change-Id: If2e66a090581e10e350f3e7f9795e3b43c6b25da [puppet] - 10https://gerrit.wikimedia.org/r/162873 [15:13:10] ^d: working. 
thanks [15:13:34] (03CR) 10Chad: [C: 032] Prerender thumbnails at upload time on all wikis except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164336 (owner: 10Gilles) [15:13:44] the mwlib/ocg change looks good to me [15:14:05] (03PS2) 10Ori.livneh: add citoid.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/164238 (owner: 10Dzahn) [15:14:07] <^d> aude: Doing you last :) [15:14:13] ok [15:14:25] <^d> gi11es: Merging yours now. [15:14:44] (03Merged) 10jenkins-bot: Prerender thumbnails at upload time on all wikis except commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164336 (owner: 10Gilles) [15:15:25] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 06s) [15:15:33] Logged the message, Master [15:15:46] <^d> gi11es: That's half of yours live ^ [15:17:05] (03CR) 10Yuvipanda: "Should add the contacts to hiera before this gets merged, since otherwise contacts.cfg will just be empty." [puppet] - 10https://gerrit.wikimedia.org/r/164301 (owner: 10Giuseppe Lavagetto) [15:17:22] <^d> Soon as jenkins catches up we'll do the two submodule syncs for 25wmf1 [15:17:45] ^d: everthing looks fine, I've just uploaded a test file to enwiki [15:17:59] <^d> :) [15:18:50] <^d> gi11es: I did leave a comment on the core change. We might want to tidy that up in master anyway. [15:18:55] (03PS7) 10Christopher Johnson (WMDE): Abstracts Sprint install with defined resource type phabricator::libext [puppet] - 10https://gerrit.wikimedia.org/r/162873 [15:19:21] <^d> (not a blocker to deploying though) [15:21:38] !log shutting down nfs1 [15:21:43] \O/ [15:21:46] Logged the message, Master [15:21:50] * ^d twiddles thumbs [15:22:19] * aude rages at jenkins [15:22:21] yeah our tests are wayyy too slow [15:22:43] * hashar points bad tests that keeps hitting the whole stack for each assertion [15:23:12] <^d> Plus the whole "setup and tear down the database" thing. [15:23:17] <^d> Which is terribly inefficient. 
[15:23:24] hehe indeed [15:23:36] https://integration.wikimedia.org/ci/job/mwext-Wikidata-testextension/62/testReport/(root)/ gives you some duration indications [15:23:56] some tests there take several seconds :( [15:23:57] nice [15:24:22] <^d> Grr, zuul just says queued. [15:24:25] <^d> With unknown time. [15:24:32] <^d> Oh there we go. [15:24:35] <^d> Now we have eta. [15:24:45] lovely isn't it ? [15:24:52] RECOVERY - check configured eth on labstore1001 is OK: NRPE: Unable to read output [15:24:53] I am not sure why it cancelled the builds though [15:25:11] ^d: both patches (if passing) should be merged about at the same time [15:25:16] in the order they have been +2ed [15:25:22] <^d> yeah [15:25:37] zuul speculates the ones ahead in the queue will pass [15:27:09] (03Restored) 10Ori.livneh: apache: add vhost_combined log format to defaults.conf [puppet] - 10https://gerrit.wikimedia.org/r/163551 (owner: 10Ori.livneh) [15:27:16] ApiQueryContinueTest takes a good 10 seconds :/ [15:29:13] PROBLEM - check configured eth on labstore1001 is CRITICAL: bond0 reporting no carrier. [15:30:50] well done jenkins [15:31:57] !log demon Synchronized php-1.25wmf1/includes/jobqueue/jobs/ThumbnailRenderJob.php: (no message) (duration: 00m 05s) [15:32:06] Logged the message, Master [15:32:10] <^d> gi11es: ^ [15:32:10] (03PS2) 10Ori.livneh: apache: add vhost_combined log format to defaults.conf [puppet] - 10https://gerrit.wikimedia.org/r/163551 [15:32:27] !log demon Synchronized php-1.25wmf1/extensions/Wikidata: (no message) (duration: 00m 20s) [15:32:30] <^d> aude: ^ [15:32:33] godog, _joe_: are you guys busy? i'd like to merge this ^^^ so we can have some access logs on the apaches [15:32:34] checking [15:32:35] Logged the message, Master [15:32:50] looks good [15:32:53] (03CR) 10Cscott: "OK, https://gerrit.wikimedia.org/r/164099 has been deployed and it looks like Tampa load is zero." 
[puppet] - 10https://gerrit.wikimedia.org/r/162814 (owner: 10Dzahn) [15:33:12] PROBLEM - check configured eth on labstore1001 is CRITICAL: bond0 reporting no carrier. [15:33:23] thanks! [15:33:27] <^d> yw [15:33:32] <^d> ori: Can you look at a simple puppet patch for my homedir? [15:33:41] sure [15:33:43] <_joe_> Reedy: I'm preparing an evil followup to your docroot consolidation patch [15:33:55] <^d> ori: https://gerrit.wikimedia.org/r/#/c/164341/ I screwed up yesterday and my tin .gitconfig is a little fubar'd :) [15:34:11] (03CR) 10Ori.livneh: [C: 032] .gitconfig: not sure how push.default = simple snuck in there [puppet] - 10https://gerrit.wikimedia.org/r/164341 (owner: 10Chad) [15:34:37] ^d: all clear [15:34:44] <^d> wheee [15:35:28] ori: sure, what's the relationship with https://gerrit.wikimedia.org/r/#/c/162541/ btw? I mean, the plan [15:35:33] RECOVERY - check configured eth on labstore1001 is OK: NRPE: Unable to read output [15:35:38] <^d> Ok, swat's over. We'll be back for another showing in 7 1/2 hours. [15:35:41] <^d> :) [15:36:19] has SWAT ever happened in real life? [15:36:19] godog: (a) we specify the vhost_combined log format in default.conf, so if we want to fix it everywhere, we should define the log format there; (b) the %O format specifier isn't supported in apache 2.2 unless mod_logio is enabled, so i changed that to %b [15:36:32] <^d> ori: ty! [15:36:52] <^d> godog: how do you mean? [15:37:13] ^d: like with everybody physically in the same room [15:37:16] <_joe_> ori: everywhere but on the appservers vhost_combined _is_ already defined in apache2.conf [15:37:30] <_joe_> so changing that there made sense [15:37:34] <^d> godog: During hackathons when people come up to the deployers and say "hey can you deploy X?" [15:38:19] ^d: hah indeed! :) [15:38:20] _joe_: it is? so why is mw1019:/var/log/apache2/other_vhosts_access.log contain just "vhost_combined"? 
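Editor's note: one likely answer to the question above is that Apache's CustomLog takes either a LogFormat nickname or a literal format string. If no LogFormat defines the nickname, the argument is used as the format itself, so every log line is the bare word. The sketch below is a toy model of that lookup, not Apache's code; the format string is simplified and the names `resolve_format` and `render` are hypothetical.

```python
def resolve_format(arg, formats):
    """Model CustomLog: use the nickname's format if defined, else the argument as-is."""
    return formats.get(arg, arg)


def render(fmt, fields):
    """Substitute a few format specifiers with request values (toy version)."""
    out = fmt
    for spec, value in fields.items():
        out = out.replace(spec, value)
    return out


# Example request values; 192.0.2.1 is a documentation address.
fields = {"%v": "en.wikipedia.org", "%h": "192.0.2.1", "%>s": "301", "%b": "440"}

# Simplified stand-in for a vhost_combined definition (assumption, not the real one).
defined = {"vhost_combined": "%v %h %>s %b"}

print(render(resolve_format("vhost_combined", {}), fields))       # nickname missing
print(render(resolve_format("vhost_combined", defined), fields))  # nickname defined
```

On the %b versus %O point raised above: %b logs the response size without headers, while %O logs bytes sent including headers and needs mod_logio.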
[15:38:46] _joe_: i don't see vhost_combined defined in apache2.conf [15:38:48] bblack: Oooo. Nice catch. This is not the problem we had but it's definitely /a/ problem. [15:38:49] <_joe_> ori: everywhere but on the appservers :) [15:39:07] _joe_: oh, i see what you're saying [15:39:13] _joe_: well, ok, in that case we can move it back [15:39:16] <_joe_> "This should go in /modules/apache/files/defaults.conf" <--- maybe [15:39:20] <_joe_> :) [15:39:41] isn't it better to standardize it there? i'm kind of on the fence [15:39:43] i could go either way [15:41:14] <_joe_> ori: me too, I was just presenting the fact :) [15:41:33] * Coren hopes it's not a rasor wire fence, that'd be uncomfortable. [15:41:39] ok, so let's go with jeremyb's patch since it was earlier [15:41:50] but i'll amend it to remove the other log format line, per godog's suggestion [15:42:07] and it appears that mod_logio is baked in to precise's apache2, so %O is fine [15:43:21] I am off see you tomorrow :] [15:43:22] apergos: https://gerrit.wikimedia.org/r/#/c/164124/ ping again, if you are around, i want to merge this soon, if you aren't around i will merge soon. You can still comment and I will make a new change if you want me to do things differently [15:44:21] (03PS2) 10Ori.livneh: import LogFormat s from apache2 package [puppet] - 10https://gerrit.wikimedia.org/r/162541 (owner: 10Jeremyb) [15:46:15] ori: taking a look [15:47:20] many thanks [15:47:35] (03CR) 10Manybubbles: "We have to package and install the python library for this to work, right?" 
[puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [15:51:05] ori: I think we're using 'combined' [15:51:17] /etc/apache2/sites-available/default: CustomLog ${APACHE_LOG_DIR}/access.log combined [15:51:42] godog: modules/apache/files/defaults.conf:11:CustomLog /var/log/apache2/other_vhosts_access.log vhost_combined [15:52:04] see /var/log/apache2/other_vhosts_access.log on any app server [15:53:03] ori: sure, but you are removing the combined logformat in the last PS no? [15:53:08] PROBLEM - puppet last run on sanger is CRITICAL: CRITICAL: Puppet last ran 14426 seconds ago, expected 14400 [15:53:39] godog: oh, d'oh. [15:53:54] godog: i saw your comment on jeremyb's patch and i thought it meant that he was introducing it [15:54:30] comms fail :( my idea was to have a diff with one green line, namely adding vhost_combined [15:54:40] but in that case should it be %O or %b? [15:54:52] maybe %b for consistency with the other ones? [15:55:20] yes, we can switch later [15:55:24] (03CR) 10Ottomata: Rsync Hive generated webstats pagecounts_all_sites dataset (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/164124 (owner: 10Ottomata) [15:55:31] OK, let's do this: I'll make my patch just introduce vhost_combined with %b [15:55:39] and jeremyb's patch about changing all %bs to %Os [15:55:45] that's true to the intent, i think [15:55:50] (03PS3) 10Ottomata: Rsync Hive generated webstats pagecounts_all_sites dataset [puppet] - 10https://gerrit.wikimedia.org/r/164124 [15:56:16] ori: yep sounds good, thanks! [15:57:06] sorry ottomata, was trying to make soup... 
body on the blink for days now, eating is an adventure in the bad way [15:57:34] (03PS3) 10Ori.livneh: mediawiki: add vhost_combined log format to apache2.conf [puppet] - 10https://gerrit.wikimedia.org/r/163551 [15:57:37] oof [15:57:54] I have the tb open, if anything looks awry that I think is controversial I'll poke you, otherwise I'll just fix if needed [15:58:04] ok cool, thanks [15:58:48] (03CR) 10QChris: [C: 031] Rsync Hive generated webstats pagecounts_all_sites dataset [puppet] - 10https://gerrit.wikimedia.org/r/164124 (owner: 10Ottomata) [15:58:58] (03CR) 10Ottomata: [C: 032 V: 032] Rsync Hive generated webstats pagecounts_all_sites dataset [puppet] - 10https://gerrit.wikimedia.org/r/164124 (owner: 10Ottomata) [15:59:11] (03PS3) 10Ori.livneh: import LogFormat s from apache2 package [puppet] - 10https://gerrit.wikimedia.org/r/162541 (owner: 10Jeremyb) [15:59:31] godog: done. being a selfish person, i only care about my patch, because i don't really care about the %b/%O difference [15:59:32] (03CR) 10Filippo Giunchedi: "syslog has been moved to lithium, I've shut nfs1 down and stopped notifications in icinga" [puppet] - 10https://gerrit.wikimedia.org/r/159442 (owner: 10Dzahn) [15:59:37] godog: up to you if you want to do both or just one [15:59:59] (03PS4) 10Ori.livneh: mediawiki: add vhost_combined log format to apache2.conf [puppet] - 10https://gerrit.wikimedia.org/r/163551 [16:00:31] "this is my patch. 
there are many like it but this one is mine" [16:01:18] heh [16:01:45] doh, node name typo [16:01:46] ori: TBH I don't mind %b vs %O either [16:02:20] the things we look at in the apache logs are: referer, ip, response status [16:02:27] (03PS1) 10Ottomata: dataset1001, not datasets1001 in $hosts_allow in role::analytics::rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/164357 [16:02:36] (03CR) 10Ottomata: [C: 032 V: 032] dataset1001, not datasets1001 in $hosts_allow in role::analytics::rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/164357 (owner: 10Ottomata) [16:02:56] (03PS5) 10Filippo Giunchedi: mediawiki: add vhost_combined log format to apache2.conf [puppet] - 10https://gerrit.wikimedia.org/r/163551 (owner: 10Ori.livneh) [16:03:11] (03CR) 10Filippo Giunchedi: [C: 031] mediawiki: add vhost_combined log format to apache2.conf [puppet] - 10https://gerrit.wikimedia.org/r/163551 (owner: 10Ori.livneh) [16:03:57] godog: mind if i merge? i can babysit [16:04:41] ori: +1, will it notify apache or is it manual? [16:04:46] manual [16:04:57] godog: "+1" could be interpreted in two ways in this context ;) [16:05:35] ori: haha true, yes I don't mind if you are merging it [16:05:46] (03PS6) 10Ori.livneh: mediawiki: add vhost_combined log format to apache2.conf [puppet] - 10https://gerrit.wikimedia.org/r/163551 [16:05:47] ottomata: what is '$source' in the uh [16:05:51] (03CR) 10Ori.livneh: [C: 032 V: 032] mediawiki: add vhost_combined log format to apache2.conf [puppet] - 10https://gerrit.wikimedia.org/r/163551 (owner: 10Ori.livneh) [16:06:04] ty! [16:06:10] in dataset::cron::pagecounts_all_sites ? 
[16:06:31] apergos: I didn't want to hardcode a node name into the module [16:06:31] so [16:06:33] its in the role [16:06:48] https://gerrit.wikimedia.org/r/#/c/164124/3/manifests/role/dataset.pp [16:07:03] (03PS1) 10Giuseppe Lavagetto: mediawiki: consolidate apache configs [puppet] - 10https://gerrit.wikimedia.org/r/164358 [16:07:14] ori: np, thanks to you and jeremyb for taking care of it, when I saw that in the logs I was like duuuuude [16:07:23] (03PS2) 10Giuseppe Lavagetto: mediawiki: consolidate apache configs [puppet] - 10https://gerrit.wikimedia.org/r/164358 [16:07:29] godog: malkovich malkovich malkovich [16:07:50] CustomLog /var/log/malkovich.log malkovich_malkovich [16:07:54] ok I can't think of a reasonable default value for it [16:08:17] ori: haha that'll do [16:08:53] <_joe_> put that in the apache config while nobody's watching, with some obscure comment [16:09:02] <_joe_> I'm sure in 10 years it will still be there [16:09:28] as long as it is logrotated away yeah it'll live forever [16:10:14] <_joe_> godog: and don't forget a comment like # migrated here for SOX compliance [16:10:23] ottomata: the cron job should require the destination, other than that it looks ok to me [16:11:16] you mean not have a default? [16:11:34] I mean that source has no default right now, and I can't think of a good one so I guess I can't complain about it [16:11:41] (03CR) 10Ori.livneh: "I'll update this" [puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [16:11:56] _joe_: haah plus some threats like "we tried removing this, but everything broke" [16:13:04] "don't remove this, needed for stats collection" it will stay there forever, guaranteed [16:13:24] ? 
[16:13:30] that's to godog [16:13:31] sorry [16:13:34] ohk, ha [16:13:54] (03PS2) 10Ori.livneh: HHVM build-deps: add condition to exec, move to contint [puppet] - 10https://gerrit.wikimedia.org/r/164250 [16:14:00] (03CR) 10Ori.livneh: [C: 032 V: 032] HHVM build-deps: add condition to exec, move to contint [puppet] - 10https://gerrit.wikimedia.org/r/164250 (owner: 10Ori.livneh) [16:14:35] how much stuff is in there to be rsynced right now, ottomata? [16:15:03] apergos: haha even better [16:16:21] apergos, i'm rsyncing now [16:16:45] 32G [16:16:47] right now apergos [16:16:49] ah fine [16:17:20] (03PS1) 10Mark Bergsma: Use exim4-daemon-heavy on the tools mail relay [puppet] - 10https://gerrit.wikimedia.org/r/164360 [16:17:27] does this work ^ ? [16:17:36] last time I tried that it didn't, but that was a looong time ago [16:19:54] (03CR) 10Mark Bergsma: "Testing is what Labs is for, isn't it ;-)" [puppet] - 10https://gerrit.wikimedia.org/r/164360 (owner: 10Mark Bergsma) [16:20:01] (03CR) 10Mark Bergsma: [C: 032] "Testing is what Labs is for, isn't it ;-)" [puppet] - 10https://gerrit.wikimedia.org/r/164360 (owner: 10Mark Bergsma) [16:21:35] (03PS1) 10Mark Bergsma: Revert "Use exim4-daemon-heavy on the tools mail relay" [puppet] - 10https://gerrit.wikimedia.org/r/164362 [16:21:43] (03CR) 10Mark Bergsma: [C: 032] Revert "Use exim4-daemon-heavy on the tools mail relay" [puppet] - 10https://gerrit.wikimedia.org/r/164362 (owner: 10Mark Bergsma) [16:21:53] (03CR) 10Mark Bergsma: [V: 032] Revert "Use exim4-daemon-heavy on the tools mail relay" [puppet] - 10https://gerrit.wikimedia.org/r/164362 (owner: 10Mark Bergsma) [16:22:04] lame [16:23:13] heh [16:23:28] are you using toolsbeta to test? [16:23:40] no [16:23:45] what is toolsbeta? :) [16:24:01] mark: ah, toolsbeta is to tools what labs is to prod :) [16:24:07] testing ground for testing patches before merging [16:24:09] oh god [16:24:12] although I haven't used it in a while. 
[16:24:16] well i always test in production too [16:24:20] hahah :D [16:24:20] so i'll do that in tools as well ;-) [16:29:15] (03PS3) 10Chad: More elasticsearch tools [puppet] - 10https://gerrit.wikimedia.org/r/164270 [16:29:23] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 4 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [16:29:42] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 4 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [16:29:57] (03CR) 10Chad: "Yeah, we'll have to install it somehow. I think there's a pretty straightforward way to turn a pip package to deb." [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [16:30:44] !log graceful'd all apaches for I98bcdbfc7: mediawiki: add vhost_combined log format to apache2.conf [16:30:54] Logged the message, Master [16:31:07] (03CR) 10Ottomata: [C: 031] "Just skimmed this briefly, and it looks pretty well organized. +1 from me." [puppet] - 10https://gerrit.wikimedia.org/r/163068 (owner: 10Catrope) [16:33:40] * bblack uses toolsbetabetabeta [16:34:04] (03CR) 10Manybubbles: "Going from pip to deb might be less painful then building a normal package but its still a package to build." [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [16:34:57] (03CR) 10Chad: "Yeah, however. I'm not married to any particular way, long as it's ok by ops :)" [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [16:36:51] (03CR) 10Ottomata: "I can build you a pip -> deb package. Have done that a few times. What where?" [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [16:38:08] <^d> ottomata: elasticsearch-py [16:38:12] <^d> Is the pip package. [16:38:30] (03CR) 10Filippo Giunchedi: [C: 031] "looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [16:38:53] <^d> it's the official client supported by ES, which I'm using to write some maintenance wrappers to ease our jobs :) [16:38:57] anyone from flow around? [16:39:04] https://www.mediawiki.org/w/index.php?title=Topic:S1b3m6774w8qpf8t&fromnotif=1#flow-post-s3ermfxsleu9goor blows up [16:39:32] (03CR) 10BryanDavis: "You could use trebuchet like mediawiki/tools/scap does rather than packaging. For scap was to add the /src/deployment/scap/scap/bin direct" [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [16:39:44] ^d: https://pypi.python.org/pypi/elasticsearch [16:39:44] ? [16:39:53] 1.2.0 ok? [16:40:17] <^d> godog: I thought about doing that (single script with subdommands) but started getting sucked down a rabbit hole of over-engineering it :p [16:40:19] <^d> *commands [16:40:45] <^d> ottomata: Yeah, 1.2 should be fine. Lemme check which one I've been developing against locally to confirm. [16:41:16] hmm, i'm going to do this from upstream here [16:41:20] https://github.com/elasticsearch/elasticsearch-py [16:41:22] and use tags [16:41:24] cool.
[16:41:40] <^d> Yep, 1.2 [16:41:47] ^d: hehe it is easy to fall into that pitfall alright, either works really as long as there isn't much boilerplate duplicated, doesn't look like it tho [16:42:25] (03PS1) 10Mark Bergsma: Install exim4-daemon-heavy the ugly way [puppet] - 10https://gerrit.wikimedia.org/r/164366 [16:43:00] (03PS2) 10Filippo Giunchedi: syslog-ng: create archive directory [puppet] - 10https://gerrit.wikimedia.org/r/164338 [16:43:05] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] syslog-ng: create archive directory [puppet] - 10https://gerrit.wikimedia.org/r/164338 (owner: 10Filippo Giunchedi) [16:43:15] (03CR) 10Mark Bergsma: [C: 032] Install exim4-daemon-heavy the ugly way [puppet] - 10https://gerrit.wikimedia.org/r/164366 (owner: 10Mark Bergsma) [16:43:45] mark: I'll hold off puppet-merge btw, many patches from you [16:43:51] PROBLEM - puppet last run on vanadium is CRITICAL: CRITICAL: Puppet has 1 failures [16:44:05] aude: do you know about "Fatal error: Base lambda function for closure not found at ..../WikibaseLib.default.php on line 18" ? [16:44:08] you can merge, they're all for toollabs [16:44:17] ack, merging [16:44:32] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [16:45:02] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [16:45:56] Are all redirects rules in operations/puppet/modules/mediawiki/files/apache/sites/redirects nowadays? [16:46:47] I'm looking for something which made http://bits.wikimedia.org/robots.txt a 404/redirect [16:47:29] manybubbles: need to graceful apached [16:47:33] s* [16:47:54] i don't know how better to avoid this, but it's annoying [16:47:57] aude: cool. So long as you know abpout it. Its spewer all over the error logs [16:48:03] weird [17:07:06] i know but can't do anything, except ask someone to graceful [17:07:06] i think it's a php bug [17:07:06] with apc [17:07:11] labs seems to be down? 
no response in ~5 minutes from http://en.wikipedia.beta.wmflabs.org/ or http://mwui.wmflabs.org/ [17:07:12] Ah, an error message now: Request: GET http://en.wikipedia.beta.wmflabs.org/wiki/Special:Watchlist, from 24.68.108.64 via deployment-cache-text02 frontend ([10.68.16.16]:80), Varnish XID 1115233304 [17:07:12] Forwarded for: 24.68.108.64 [17:07:12] Error: 503, Service Unavailable at Thu, 02 Oct 2014 16:57:01 GMT [17:07:13] (in a firefox "private" window, so not HHVM related) [17:07:15] http://en.wikipedia.beta.wmflabs.org/ not loading [17:07:15] godog: [17:07:15] hm [17:07:15] i'm working on debianizing that python elasticsearch package [17:07:15] but, I also need [17:07:15] https://packages.debian.org/search?keywords=python-urllib3 [17:07:15] RECOVERY - puppet last run on vanadium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:07:15] how should I proceed? can I just stick that in our apt under a different distribution? [17:07:16] ottomata: ouch, not in precise? but yeah precise-wikimedia perhaps [17:07:16] do I have to rebuild it? [17:07:16] or can I just manually include the existing deb? [17:07:16] ottomata: if there's a deb built for precise then yeah it is fine to include deb and dsc, otherwise build the dsc for precise [17:07:16] ok lemme try that [17:07:16] do I need a gerrit repo? or just rebuild and add to our apt? [17:07:17] ottomata: heh good question, I don't think we have a policy but if it is unmodified then I'd say a repo is overkill, as long as the dsc is there of course [17:07:17] perhaps paravoid or akosiaris have thoughts on that [17:07:17] there is a dsc [17:07:17] <_joe_> ottomata: if you're just rebuilding with none/minimal modifications, just upload it [17:07:17] <_joe_> with the whole shebang (including the changes file) [17:07:17] k...
[17:07:22] <_joe_> greg-g: some issues labs-wide AFAICS [17:08:10] <_joe_> greg-g: mark and Coren are solving it I guess [17:08:19] (03CR) 10Chad: "Also: what's the proper place to put elastictool.py?" [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [17:09:03] greg-g: labs is unclogging now; bastion-restricted is back. [17:10:54] godog: should I change the version in changelog somehow? [17:11:00] 1.9.1-1~precise-wikimedia? [17:11:02] or something? [17:12:07] ottomata: yep consensus seems like ~1wmf1 or sth like that [17:12:36] oof there are build deps that aren't avail for precise [17:12:42] godog, can I just dpkg -i them for the build? [17:13:35] ottomata: heh good question, is python-urllib3 in trusty I'm assuming? [17:14:29] I'm asking because if it is then it isn't so bad to have missing build deps in precise [17:14:30] hm, yes! [17:14:53] hm. maybe we could just build this for trusty....? [17:15:02] ^d, do you need this script on the elasticsearch servers themselves? [17:15:07] (03PS1) 10coren: Tool Labs: terminology fix [puppet] - 10https://gerrit.wikimedia.org/r/164370 [17:15:08] or can you just run stuff from a trusty box? [17:17:50] I think it'd be hitting the http api, so anywhere (?) [17:18:51] <_joe_> godog: It's in trusty yes [17:19:14] <_joe_> ottomata: backporting the package should be straightforward [17:22:20] manybubbles, they're looking at the Flow error now, thanks. (sidenote: pls ping me, or #wikimedia-corefeatures next time :) (#wikimedia-flow even redirects there! :D )) [17:22:37] quiddity: cool! thanks [17:23:49] thanks Coren !
[17:25:39] _joe_, yes, but there are a buncha build debs that i'd have to backport too [17:25:39] :/ [17:25:56] <_joe_> eh [17:26:01] <_joe_> it sucks [17:26:12] !log aaron Synchronized php-1.25wmf1/maintenance/findMissingFiles.php: aa2eb3c0de08256822a2b0c985ebb3a6145d28cd (duration: 00m 05s) [17:26:26] Logged the message, Master [17:39:34] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [17:39:45] <^d> ottomata: Eh, well the next script I was wanting to write is es-restart-quick, which would need to call ES' init script. [17:39:57] <^d> Unless I dsh it, I guess. [17:45:08] (03PS1) 10Mark Bergsma: Replace user_forward shellout by an Exim LDAP query [puppet] - 10https://gerrit.wikimedia.org/r/164386 [17:46:22] (03CR) 10Krinkle: "How old our install of PhantomJS is is irrelevant. There is no such thing as a "new PhantomJS" by my definition. The nature of that softwa" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [17:47:14] (03PS2) 10Mark Bergsma: Replace user_forward shellout by an Exim LDAP query [puppet] - 10https://gerrit.wikimedia.org/r/164386 [17:52:22] (03PS1) 10Ori.livneh: Tag NavigationTiming events with PHP5/HHVM [puppet] - 10https://gerrit.wikimedia.org/r/164388 [17:53:25] (03CR) 10Ori.livneh: [C: 032] Tag NavigationTiming events with PHP5/HHVM [puppet] - 10https://gerrit.wikimedia.org/r/164388 (owner: 10Ori.livneh) [17:53:49] (03CR) 10Dzahn: "thank you! i love it when things like this are simply merged by others when i come back" [puppet] - 10https://gerrit.wikimedia.org/r/164248 (owner: 10Dzahn) [17:57:05] ^d, hm. [17:57:36] !log going to try to restart lsearchd on the misc pool machines to see if that makes it responsive [17:57:40] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:57:44] Logged the message, Master [18:00:04] Reedy, greg-g: Dear anthropoid, the time has come. 
Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141002T1800). [18:00:46] Oh man, we should totally have you guys at a special table in the front of the metrics meeting [18:00:58] so everyone can watch you deploying with tense looks on your faces during the other talks [18:01:07] andrewbogott: :P [18:05:23] * tonythomas gets *** System restart required *** on logging to bastion. [18:06:09] which bastion? [18:06:11] (03CR) 10Amire80: [C: 04-1] "I'll redo it with the new way to configure the domain." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/147918 (owner: 10Amire80) [18:06:27] (03PS6) 10Krinkle: [WIP] Implement role::ci::slave::localbrowser (Chromium) [puppet] - 10https://gerrit.wikimedia.org/r/163791 [18:06:43] tonythomas: 'tis ok, you can ignore those, I think [18:06:45] bastion1 [18:07:49] ha. was kidding - just wrote in case someone was doing a restart in a few minutes. [18:08:05] that happens when somebody installed a new kernel package but didn't reboot yet.. mostly [18:08:34] anecdotal; my hubby is reporting unusual number of 500s on frwikisource; is something going on?
[18:13:50] (03PS1) 10Reedy: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164396 [18:13:52] (03PS1) 10Reedy: testwiki to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164397 [18:13:54] (03PS1) 10Reedy: Wikipedias to 1.25wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164398 [18:13:56] (03PS1) 10Reedy: group0 to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164399 [18:14:35] (03CR) 10Reedy: [C: 032] Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164396 (owner: 10Reedy) [18:14:42] (03Merged) 10jenkins-bot: Add symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164396 (owner: 10Reedy) [18:14:45] (03CR) 10Reedy: [C: 032] testwiki to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164397 (owner: 10Reedy) [18:14:52] (03Merged) 10jenkins-bot: testwiki to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164397 (owner: 10Reedy) [18:15:35] !log reedy Started scap: testwiki to 1.25wmf2 [18:15:43] Logged the message, Master [18:28:40] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: puppet fail [18:32:26] !log replacing failed disk es1005 [18:32:36] Logged the message, Master [18:35:04] (03PS1) 10Chad: Another ES node script: restart a node! [puppet] - 10https://gerrit.wikimedia.org/r/164401 [18:35:32] oof, ^d, getting this package in trusty is looking tougher and tougher [18:35:37] was hoping it would be a simple pip conversion :/ [18:35:40] <^d> :( [18:36:12] urllib3 has a bunch of build depends that don't have packages for precise or before, and the trusty packages depend on a python version not available in precise. [18:36:38] sorry, 'getting this package in precise is looking tougher and tougher' [18:36:46] getting it in trusty shouldn,'t be hard though. [18:39:55] !log reedy Finished scap: testwiki to 1.25wmf2 (duration: 24m 19s) [18:40:05] ^d, is it worth it to you for me to build python-elasticsearch for trusty? 
[18:40:06] Logged the message, Master [18:40:22] <^d> ottomata: Trusty or precise? [18:40:38] trusty. [18:41:24] <^d> Well if the ES boxes are all precise I'm not thinking so. [18:41:24] (03CR) 10Reedy: [C: 032] Wikipedias to 1.25wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164398 (owner: 10Reedy) [18:41:31] (03Merged) 10jenkins-bot: Wikipedias to 1.25wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164398 (owner: 10Reedy) [18:41:54] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf2 [18:42:00] Logged the message, Master [18:42:33] (03CR) 10Reedy: [C: 032] group0 to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164399 (owner: 10Reedy) [18:42:40] (03Merged) 10jenkins-bot: group0 to 1.25wmf2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164399 (owner: 10Reedy) [18:43:08] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf2 [18:43:15] Logged the message, Master [18:44:01] PROBLEM - Apache HTTP on mw1030 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50412 bytes in 0.044 second response time [18:44:21] PROBLEM - Apache HTTP on mw1164 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50412 bytes in 0.038 second response time [18:44:51] !log reedy Purged l10n cache for 1.24wmf21 [18:45:01] Logged the message, Master [18:45:13] graceful apached? [18:45:17] s* [18:45:26] gah, so annoying [18:46:23] (03CR) 10Ottomata: "Sub commands would be awesome. Use docopt! It is great!" 
[puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [18:46:44] (03PS1) 10Reedy: Remove 1.24wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164403 [18:47:10] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.055 second response time [18:47:28] !log graceful'ed apache on mw1030,mw1164 [18:47:28] (03CR) 10Ottomata: "I gave it a go at creating python-elasticsearch deb package from pip, but couldn't do to a urllib3 dependency. urlib3 (and many of its dep" [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [18:47:30] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [18:47:35] aude: ^ fixed [18:47:35] Logged the message, Master [18:47:40] (03PS2) 10Reedy: Remove 1.24wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164403 [18:48:01] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:48:36] thanks [18:48:50] * aude makes a bug to see if we can remove the closure [18:49:02] even if it is an apc issue, i am tired of this [18:49:09] <^d> ottomata: What if we just upgraded everything to precise? :p [18:49:16] (03CR) 10Reedy: [C: 032] Remove 1.24wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164403 (owner: 10Reedy) [18:49:23] (03Merged) 10jenkins-bot: Remove 1.24wmf17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164403 (owner: 10Reedy) [18:49:37] to trusty? [18:49:38] ha, maybe! [18:49:47] actually [18:49:54] ^d, we are going to be reinstalling all these nodes soon, eh? [18:50:01] the new ones will be trusty, eh? [18:50:01] <^d> We will, yeah. [18:50:11] so, as long as elasticsearch debs are avail for trusty...:) [18:50:12] <^d> I imagine we'll go trusty on the new boxes too, yeah [18:51:21] (03PS2) 10Reedy: Have production and Labs Redis sessions use same structure. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/164027 (https://bugzilla.wikimedia.org/59838) (owner: 10Mattflaschen) [18:51:26] (03CR) 10Reedy: [C: 032] Have production and Labs Redis sessions use same structure. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164027 (https://bugzilla.wikimedia.org/59838) (owner: 10Mattflaschen) [18:51:33] (03Merged) 10jenkins-bot: Have production and Labs Redis sessions use same structure. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164027 (https://bugzilla.wikimedia.org/59838) (owner: 10Mattflaschen) [18:54:51] * Reedy feels like making a new wikivoyage [18:58:07] (03PS2) 10RobH: settting codfw es servers mgmt [dns] - 10https://gerrit.wikimedia.org/r/164215 [19:01:50] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:07:50] (03PS4) 10Reedy: Initial setup for Persian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163356 (https://bugzilla.wikimedia.org/71382) (owner: 10Glaisher) [19:08:09] (03CR) 10Reedy: [C: 032] Initial setup for Persian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163356 (https://bugzilla.wikimedia.org/71382) (owner: 10Glaisher) [19:08:16] (03Merged) 10jenkins-bot: Initial setup for Persian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163356 (https://bugzilla.wikimedia.org/71382) (owner: 10Glaisher) [19:08:50] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [19:08:53] Ok, I'm chatting with Papaul on how we allocate IP addresses and install information for new systems. I'm bringing the conversation out of PM and into this channel in case anyone else is interested in following along. [19:09:23] So, my example for this is the recent allocation of misc. 
server pollux in codfw [19:09:28] dns change https://gerrit.wikimedia.org/r/#/c/162649/ [19:09:36] install server change https://gerrit.wikimedia.org/r/#/c/162762/ [19:09:45] Reedy: oh, that's sweet to see more voyage [19:10:18] So, every single server gets an asset tag [19:10:28] and our mgmt network is only divided into site subnets [19:10:47] so every mgmt ip in codfw is in the same subnet of every single mgmt server ip in codfw [19:11:01] !log reedy Synchronized database lists: fawikivoyage (duration: 00m 16s) [19:11:06] as soon as a server arrives, it gets an asset tag (eg: wmf3454) [19:11:10] Logged the message, Master [19:11:33] and then we open up the template file templates/10.in-addr.arpa [19:12:04] .... ahhh, im explaining for him and he isnt in this channel, haha [19:12:06] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: fawikivoyage [19:12:15] Logged the message, Master [19:12:32] robh: what is the naming convention for misc in codfw? [19:12:34] !log reedy Synchronized wmf-config/InitialiseSettings.php: fawikivoyage (duration: 00m 16s) [19:12:41] Logged the message, Master [19:12:41] proper star names =] [19:12:46] mutante's suggestion [19:12:58] * YuviPanda suggests Britney Spears [19:13:00] https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Miscellaneous_Servers [19:13:02] not that kinda 'star'? [19:13:12] https://en.wikipedia.org/wiki/List_of_proper_names_of_stars [19:14:13] YuviPanda: is that going to "kerneloops, i did it again" [19:14:20] haha [19:14:40] 'funrolloops I did it again'? [19:14:50] I wanted stats :/ [19:14:59] but stars are cool too :) [19:15:10] matanya: stats? 63129312.wikimedia.org ? [19:15:22] * state [19:15:29] ah :) [19:15:42] maybe not enough of them [19:16:07] mutante: echo $RANDOM > /etc/hostname :) [19:16:58] <^d> matanya: Coindicentally, $RANDOM is how you spell eiximenis. 
[19:18:01] subtract vowels, will make it easier :) [19:19:58] $RNDM [19:20:01] I wonder if $ is a vowel [19:20:30] imagines the chaos if matanya actually sent that to salt "*" [19:20:35] hehe [19:20:49] tbh, I liked srv123 [19:21:08] sometimes it is nice to do things like server provisioning without too much mental work involved [19:21:08] !log Updated scap to eff0d01 (Fix format specifier for error message) [19:21:16] Logged the message, Master [19:21:22] Reedy: Try again :) [19:21:52] /nick user123 [19:22:55] mutante: naming servers and not services means you care about the wrong thing [19:22:57] aude: about? [19:22:58] :) [19:23:03] Reedy: yep [19:23:08] new wikivoyage? [19:23:09] I just created fawikivoyage [19:23:11] Yup [19:23:12] mutante: go anthropomorphize services :) [19:23:14] hah [19:23:15] !log Trebuchet reports for scap sync "231/234 minions completed fetch; 230/234 minions completed checkout" Some stale entries need to be removed from Trebuchet redis cache [19:23:18] ok [19:23:24] Logged the message, Master [19:23:52] i can take care of the sites table [19:24:00] Thanks [19:24:31] i think running populateSitesTable for wikidataclient dblist will take care of it [19:24:39] will try [19:24:44] domas: it's like all appservers are going to have individual names, just misc :) [19:24:51] domas: eh.. it's not like [19:25:00] mutante: misc has too much custom [19:25:07] yea [19:25:15] or [19:25:24] has to be done currently by site group [19:25:47] domas: agree, best would be to have at least 1 in each dc [19:26:10] mutante: at least? at most! [19:27:38] !log hosts that failed Trebuchet update of scap: virt0.wikimedia.org, fenari.wikimedia.org, mw1110.eqiad.wmnet, mw1053.eqiad.wmnet. mw1053.eqiad.wmnet only failed checkout [19:28:47] bd808: why is fenari still in there after removal from dsh?
[19:29:23] mutante: trebuchet uses salt and also has a local redis cache of hosts [19:30:03] i see, so that means another step to properly decom a host means having to clean from redis cache ? hmmm [19:30:13] Yah. It's lame. [19:30:16] bd808: would be gone once i delete salt key? [19:30:21] nope [19:30:25] ok..hmm [19:30:45] mutante: https://wikitech.wikimedia.org/wiki/Trebuchet#Removing_minions_from_redis [19:30:56] https://gdash.wikimedia.org/dashboards/reqerror/ [19:31:01] what's up with 5xx here ^ [19:31:20] err 500, not 5xx [19:31:46] deploy I guess [19:31:48] * Reedy pokes [19:32:56] "17:57 manybubbles: going to try to restart lsearchd on the misc pool machines to see if that makes it responsive" [19:33:03] "18:15 logmsgbot: reedy Started scap: testwiki to 1.25wmf2" [19:33:15] oh, maybe not me then? [19:34:06] there are errors 'Unknown site ID configured: fawikivoyage' for wikidata change notification jobs [19:34:12] I don't _think_ my change would have done that - lsearchd isn't used much on the misc pool. I blieve the is part of the problem with it. [19:34:12] should be resolved shortly [19:34:46] aude: is that the spike? [19:34:54] doubt it [19:35:04] manybubbles: I only just added that wiki, so it can't be [19:35:32] the fatal/exception logs don't look to be spamming [19:37:10] There's a fuckload of File does not exist: /srv/mediawiki/docroot/wikibooks.org/upload.wikimedia.org, referer: http://en.m.wikibooks.org/wiki/Signals_and_Systems [19:37:53] All with mobile referers [19:38:20] Reedy: was just trying to see if there is a spike in those [19:38:29] Would that graph as a 5xx? Seems 404 to me [19:38:39] bd808: presumably [19:38:44] but there's still seeming a lot of them [19:39:11] There was a big spike in 404s on the graph too but it dies down [19:39:41] The 404 spike looks to have started around the same time though [19:40:04] What does 5xx actually measure? [19:40:05] it's all pre scap too [19:40:21] $1M question [19:40:44] Something at varnish... 
but what [19:40:55] I imagine that file not found is probably 404 [19:40:57] ori: ^^ [19:41:08] what makes the 500 resp graph? [19:42:32] tailing the fatal and exception logs for a few minutes doesn't have anything that sounds out [19:43:47] neither does apache2.log I believe [19:43:58] http://en.wikipedia.org/en.wikipedia.org/w/index.php?title=Wikipedia_talk:WikiProject_Amusement_Parks&action=edit§ion=new [19:44:05] What the hell is making links like that? [19:44:07] i did see exception (bug) on wikidata [19:44:12] but never saw it in the logs [19:44:12] If you factor out file not found apache2.log "only" grows by ~1k a minute [19:44:17] also last night [19:44:29] don't know if there is something weird going on [19:44:34] Oct 2 19:43:37 mw1029: [error] [client 10.64.32.106] File does not exist: /srv/mediawiki/docroot/wikipedia.org/upload.wikimedia.org [19:44:34] Oct 2 19:43:37 mw1072: [error] [client 10.64.32.106] File does not exist: /srv/mediawiki/docroot/foundation/meta.wikimedia.org [19:44:34] Oct 2 19:43:37 mw1055: [error] [client 10.64.0.105] File does not exist: /srv/mediawiki/docroot/mediawiki/m.mediawiki.org [19:44:34] Oct 2 19:43:37 mw1055: [error] [client 10.64.0.105] File does not exist: /srv/mediawiki/docroot/wikipedia.org/upload.wikimedia.org [19:44:34] Oct 2 19:43:37 mw1219: [error] [client 10.64.32.104] File does not exist: /srv/mediawiki/docroot/wikipedia.org/Castillejo_de_Robledo [19:45:07] But as above, they're "only" 404s [19:45:22] Where does one run `varnishncsa`? [19:45:33] or `varnishlog -i Backend_health -O` [19:45:33] I think you might need root/ops [19:45:35] bblack: ^^ [19:45:55] 5xx != 500 [19:46:05] usually varnish -> be is 503? [19:46:19] but yeah, I can look [19:46:36] please [19:46:43] So 500 would be backend requests that got a 500 response from mw in theory? [19:46:44] it's a little stabbing in the dark atm [19:46:59] s/mw/apache/ [19:47:29] I believe so [19:48:09] Uh [19:48:14] hmm [19:52:00] hhvm is puking all over the place. 
Maybe that's it? [19:52:17] Lots of "AH01070: Error parsing script headers" [19:52:49] tail -f apache2.log|grep proxy_fcgi:error on fluorine [19:53:32] populated sites table for fawikivoyage [19:53:44] gah, i have to add my name? [19:54:32] cmjohnson: pinggeeee on the analytics1010 disk [19:55:18] !log populated sites table for fawikivoyage [19:55:25] Logged the message, Master [19:57:32] ottomata: it's on schedule for tomorrow [19:57:47] thanks [19:58:40] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.001 second response time [19:59:11] oh is it misc? [19:59:25] I never think to look at misc [20:02:06] yeah graphite is dead [20:10:59] !log stopping -> starting uwsgi/apache -type stuff on tungsten [20:11:08] Logged the message, Master [20:16:50] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time [20:21:19] <^d> ottomata: https://gerrit.wikimedia.org/r/#/c/163944/ is easy :) [20:22:10] sure [20:22:14] (03PS3) 10Ottomata: Remove unused bash functions. Nothing calls this. [puppet] - 10https://gerrit.wikimedia.org/r/163944 (owner: 10Chad) [20:22:32] (03CR) 10Ottomata: [C: 032 V: 032] Remove unused bash functions. Nothing calls this. [puppet] - 10https://gerrit.wikimedia.org/r/163944 (owner: 10Chad) [20:23:04] <^d> ottomata: ty! 
[20:23:24] (03CR) 10Dzahn: [C: 032] add citoid.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/164238 (owner: 10Dzahn) [20:25:01] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [20:29:33] (03PS1) 10Ori.livneh: webperf::navtiming: Handle SaveTiming events [puppet] - 10https://gerrit.wikimedia.org/r/164417 [20:30:19] (03PS2) 10Ori.livneh: webperf::navtiming: Handle SaveTiming events [puppet] - 10https://gerrit.wikimedia.org/r/164417 [20:31:33] (03CR) 10Ori.livneh: [C: 032] webperf::navtiming: Handle SaveTiming events [puppet] - 10https://gerrit.wikimedia.org/r/164417 (owner: 10Ori.livneh) [20:33:07] (03CR) 10Dzahn: "DNS has been added, i was about to try adding the LVS config but i see you already do it here. just one thing. i think you also want a "$s" [puppet] - 10https://gerrit.wikimedia.org/r/163068 (owner: 10Catrope) [20:38:02] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:39:05] (03PS1) 10Dzahn: remove deprecated labsnfs class and files [puppet] - 10https://gerrit.wikimedia.org/r/164421 [20:40:05] (03CR) 10Dzahn: "thanks Tim. see -> https://gerrit.wikimedia.org/r/#/c/164421/1 now" [puppet] - 10https://gerrit.wikimedia.org/r/164247 (owner: 10Dzahn) [20:40:18] Coren: I'm about to build a new default Trusty image. Any requests? [20:40:37] * Coren ponders. [20:40:44] (03CR) 10Dzahn: "or, alternatively if you want to keep it but make it work for eqiad, maybe just https://gerrit.wikimedia.org/r/#/c/164247/ ?" [puppet] - 10https://gerrit.wikimedia.org/r/164421 (owner: 10Dzahn) [20:41:05] andrewbogott: Nothing comes to mind. [20:41:24] Coren: andrewbogott ^ those are for switching labsnfs from pmtpa.. (or removing the transitional class) [20:41:54] mutante: cool, I will look. [20:42:15] mutante: Actually, those are now completely irrelevant - we aren't using autofs anymore. 
[20:42:38] you have the choice, one patch switches it from virt0 to virt1000, the other one deletes it:) [20:42:46] Going to sync a NavigationTiming schema change with greg's OK [20:43:08] * greg-g nods [20:43:29] mutante: Ah, I only noticed the latter. [20:43:52] (03CR) 10coren: [C: 031] "You cannot expunge the taint of autofs fast enough." [puppet] - 10https://gerrit.wikimedia.org/r/164421 (owner: 10Dzahn) [20:44:22] !log ori Synchronized php-1.25wmf1/extensions/NavigationTiming: Update NavigationTiming for cherry-picks (duration: 00m 04s) [20:44:31] Logged the message, Master [20:44:38] (03CR) 10Andrew Bogott: [C: 031] remove deprecated labsnfs class and files [puppet] - 10https://gerrit.wikimedia.org/r/164421 (owner: 10Dzahn) [20:44:48] (03CR) 10coren: [C: 04-2] "This needs to die, not be resuscitated. :-)" [puppet] - 10https://gerrit.wikimedia.org/r/164247 (owner: 10Dzahn) [20:46:53] !log ori Synchronized php-1.24wmf22/extensions/NavigationTiming: Update NavigationTiming for cherry-picks (duration: 00m 03s) [20:46:59] Logged the message, Master [20:50:00] RECOVERY - RAID on es1005 is OK: OK: optimal, 1 logical, 2 physical [20:52:16] ori: 1.24wmf22 isn't in use! [20:56:59] Reedy: doh [20:58:43] (03Abandoned) 10Dzahn: labsnfs - replace labstore1 with labstore1001 [puppet] - 10https://gerrit.wikimedia.org/r/164247 (owner: 10Dzahn) [21:00:34] (03CR) 10Dzahn: [C: 032] remove deprecated labsnfs class and files [puppet] - 10https://gerrit.wikimedia.org/r/164421 (owner: 10Dzahn) [21:03:09] (03CR) 10Raimond Spekking: Fully disable all mwlib formats; use OCG service instead. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164099 (owner: 10Cscott) [21:12:30] ^d: is this probably still used or something old? [21:12:33] # search QA scripts for ops use [21:12:46] include search::searchqa [21:13:26] <^d> the hell is this crud? 
[21:13:32] finds that only on iron [21:13:38] so ops bastion [21:13:52] also, 246 class search::searchqa::phase1 { [21:14:31] puppet:///files/searchqa/bin [21:16:41] added by Jeff in 2012.. making patch [21:19:46] (03PS1) 10Chad: Revert "Prerender thumbnails at upload time on all wikis except commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164428 [21:20:33] (03CR) 10BryanDavis: [C: 031] "We'd like to see if this causes the 500 error rate to drop back to "normal"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164428 (owner: 10Chad) [21:20:49] (03PS1) 10Dzahn: remove old 'searchqa' classes and files [puppet] - 10https://gerrit.wikimedia.org/r/164429 [21:21:05] <^d> gi11es: fyi: we're going to revert the prerendering from this morning's swat. [21:21:28] <^d> We've had a pretty large spike in 500s from varnish today and we suspect that change. [21:22:18] 'out_dir_prefix' => '/tmp/fire_in_the_hole-', # where we'll drop search response logs [21:22:29] <^d> *snicker* [21:22:53] <^d> bd808: Just waiting on jenkins and I'll +2. [21:24:37] <^d> This might take a bit, zuul's dashboard is like a christmas tree. [21:25:05] <^d> well, an xmas tree with half the lights out. [21:25:24] (03CR) 10Dzahn: "JeffGreen: consider this asking via gerrit if you expect this to still be in use or not" [puppet] - 10https://gerrit.wikimedia.org/r/164429 (owner: 10Dzahn) [21:25:55] greg-g: moar jenkins hardware plz [21:30:35] (03CR) 10Manybubbles: [C: 031] remove old 'searchqa' classes and files [puppet] - 10https://gerrit.wikimedia.org/r/164429 (owner: 10Dzahn) [21:33:11] !log added LVS BGP config setup to cr[12]-codfw [21:33:17] Logged the message, Master [21:39:43] (03CR) 10Dzahn: "Alex, fenari is down now. This is my reply to your "So the difficult problem to solve before merging this.." 
above in PS3, see this helper" [puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn) [21:40:46] (03PS1) 10Yuvipanda: icinga: Move permission fixing HACKs into module [puppet] - 10https://gerrit.wikimedia.org/r/164470 [21:41:15] mutante: wanna merge a couple of icinga patches? :) [21:42:08] (03CR) 10Chad: [C: 032] Revert "Prerender thumbnails at upload time on all wikis except commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164428 (owner: 10Chad) [21:42:24] <^d> bd808: Now we wait again. [21:42:31] (03Merged) 10jenkins-bot: Revert "Prerender thumbnails at upload time on all wikis except commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164428 (owner: 10Chad) [21:42:47] <^d> Oh that was fast. [21:43:24] !log demon Synchronized wmf-config/InitialiseSettings.php: Disabling prerendering of images from this mornings swat (duration: 00m 04s) [21:43:26] <^d> Ok, and we're live. [21:43:35] Logged the message, Master [21:44:29] (03PS1) 10BBlack: Use codfw LVS-based recdns [puppet] - 10https://gerrit.wikimedia.org/r/164473 [21:47:22] (03PS2) 10BBlack: Use codfw LVS-based recdns [puppet] - 10https://gerrit.wikimedia.org/r/164473 [21:48:27] (03PS3) 10BBlack: Use codfw LVS-based recdns [puppet] - 10https://gerrit.wikimedia.org/r/164473 [21:50:47] <^d> bd808: Doesn't look back to "normal" but it's still declining. [21:53:13] ^d: I don't see any change I can attribute to your sync -- https://graphite.wikimedia.org/render?from=-1hours&until=now&width=500&height=400&target=cactiStyle(reqstats.500)&target=cactiStyle(alias(timeShift(reqstats.500%2C%221w%22)%2C%22last%20week%22))&title=500%20responses&_uniq=0.5698605675715953 [21:53:43] <^d> Indeed, it was declining before too. [21:54:47] probably declining as it only does the prerender once per image? [21:54:54] I'm guessing randomly [21:55:17] It would vary with upload rate, yes [21:55:45] But we just turned it off, so unless there were a lot of queued jobs... 
which could be the case I guess [21:57:06] (03PS1) 10Brian Wolff: Experimentally enable vips for larger (>50MP) tiff files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164476 (https://bugzilla.wikimedia.org/52045) [21:57:07] <^d> bd808: I guess we'll just keep an eye [21:59:05] <^d> bd808: fwiw, the spike started at 18:00utc, which is when the train went out. [21:59:29] <^d> Should we start looking at something in 1.25wmf1 (went to wikipedias) as suspect? [22:01:16] The version switch wasn't until 18:39 though based on SAL [22:01:26] 18:15 logmsgbot: reedy Started scap: testwiki to 1.25wmf2 [22:01:35] 18:39 logmsgbot: reedy Finished scap: testwiki to 1.25wmf2 (duration: 24m 19s) [22:02:01] 18:41 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf2 [22:02:32] So anything that happened before 18:41 would be hard to pin on 1.25wmf1 [22:02:42] s/1/2/ [22:02:53] GWToolset bulk uploads? [22:02:53] <^d> Well I was done by 15:30ish, so that would've been an almost 1.5h lag [22:03:47] Reedy: I didn't see that happening in the commons recent changes... [22:04:45] Erm [22:04:49] Wikisource can't be found? [22:04:53] Updating something? [22:05:02] ? [22:05:03] WFM [22:05:28] ohnoes a new episode of Weird Qcoder00 Networking Issues XII [22:05:34] Qcoder00: https://en.wikisource.org/wiki/Main_Page loads for me [22:05:34] lol [22:05:40] OK [22:05:47] Google wasn't liking the link [22:06:04] I'm loading via ipv6 [22:07:51] <^d> bd808: I'm kind of out of ideas. [22:08:00] <`808db> ^d: Me too [22:09:19] <`808db> It's actually heading back up now [22:11:35] (03CR) 10RobH: [C: 031] "lots of small changes, all looks sane." 
[puppet] - 10https://gerrit.wikimedia.org/r/164473 (owner: 10BBlack) [22:11:44] (03CR) 10Brian Wolff: "See also: https://lists.wikimedia.org/pipermail/multimedia/2014-October/000861.html" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164476 (https://bugzilla.wikimedia.org/52045) (owner: 10Brian Wolff) [22:13:22] Googles bugged up :( [22:15:46] (03CR) 10Dzahn: [C: 032] "ok, since it just moves the identical things. removing them would be another story. for some reason this can't be submitted like this thou" [puppet] - 10https://gerrit.wikimedia.org/r/164470 (owner: 10Yuvipanda) [22:16:15] mutante: ah, reason being it is dependent on https://gerrit.wikimedia.org/r/#/c/164239/1 :) [22:18:49] (03PS2) 10Dzahn: icinga: Move permission fixing HACKs into module [puppet] - 10https://gerrit.wikimedia.org/r/164470 (owner: 10Yuvipanda) [22:20:06] mutante: you could also +2 the previous patch... [22:20:09] (03CR) 10Dzahn: [C: 032] "removed dependency on ganglios change which i don't know about and doesn't seem to be related" [puppet] - 10https://gerrit.wikimedia.org/r/164470 (owner: 10Yuvipanda) [22:20:39] I wonder how useful our torrus installation is these days [22:20:44] (I poked it a bit yesterday) [22:21:43] (03PS2) 10Yuvipanda: icinga: Get rid of ganglios [puppet] - 10https://gerrit.wikimedia.org/r/164239 [22:23:16] YuviPanda: Max concurrent service checks (3200) has been reached. :p [22:23:42] just sayin.. not related to patch of course [22:24:05] (03PS1) 10Rush: phab readd migration tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164479 [22:24:06] mutante: :) I'm unsure what makes checks concurrent... 
[22:24:26] (03PS2) 10Rush: phab readd migration tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164479 [22:25:15] (03CR) 10jenkins-bot: [V: 04-1] phab readd migration tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164479 (owner: 10Rush) [22:25:23] (03CR) 10Rush: [C: 032] phab readd migration tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164479 (owner: 10Rush) [22:25:49] (03CR) 10Rush: [V: 032] phab readd migration tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164479 (owner: 10Rush) [22:26:47] chasemp: jenkins, actual syntax error [22:26:57] yeah just saw that [22:27:08] good thing palladium was smarter than me :) [22:28:41] YuviPanda: how "surprising".. icinga reload broken [22:28:54] hmm, broken by what? [22:28:54] not saying because of your change, just that it is all the time [22:28:59] checks that now [22:29:13] Error: Could not find any hostgroup matching 'openldap_corp_mirror_codfw' [22:29:24] a common one, hosts added to a group that doesn't exist [22:29:25] mutante: can you look at that real quick, I don't see the syntax error yet [22:29:31] but it's been a long day [22:29:35] need a second pair of eyes [22:30:29] ok trying something I guess [22:30:33] (03PS3) 10Rush: phab readd migration tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164479 [22:30:33] maybe class foo() is dumb [22:30:43] chasemp: no, missing $, i was about to upload [22:30:50] ah crap ok thanks [22:31:13] (03CR) 10jenkins-bot: [V: 04-1] phab readd migration tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164479 (owner: 10Rush) [22:31:43] (03PS4) 10Dzahn: phab readd migration tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164479 (owner: 10Rush) [22:32:43] YuviPanda: and this is why it always turns into fixing something else when touching icinga [22:32:56] mutante: heh, true [22:33:03] chasemp: jenkins likes now [22:33:04] * YuviPanda would offer to help if he had neon access.
4 more weeks [22:33:08] tx [22:33:24] (03PS5) 10Rush: phab readd migration tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164479 [22:33:30] (03CR) 10Rush: [C: 032 V: 032] phab readd migration tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164479 (owner: 10Rush) [22:37:00] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: puppet fail [22:38:13] !log icinga_broken_due_to_missing_hostgroup_counter incremented [22:38:19] Logged the message, Master [22:40:22] <`808db> bblack: 500 rate is crawling back down but I don't think we still have any good idea of what caused it to spike or if we caused it to trend down. [22:40:26] <`808db> bblack: https://graphite.wikimedia.org/render?from=-8hours&until=now&width=500&height=400&target=cactiStyle(reqstats.500)&target=cactiStyle(alias(timeShift(reqstats.500%2C%221w%22)%2C%22last%20week%22))&title=500%20responses&_uniq=0.5698605675715953 [22:40:34] (03PS1) 10Rush: phab add lock file for tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164481 [22:41:10] (03CR) 10Dzahn: [C: 04-1] "missing '" [puppet] - 10https://gerrit.wikimedia.org/r/164481 (owner: 10Rush) [22:41:18] (03CR) 10jenkins-bot: [V: 04-1] phab add lock file for tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164481 (owner: 10Rush) [22:41:32] chasemp: tools.lock' [22:41:45] does that fix "creates must be a fully qualified path"? [22:41:49] on it [22:41:55] yes I think so [22:41:57] cool [22:42:03] (03PS2) 10Rush: phab add lock file for tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164481 [22:42:49] YuviPanda: so, there is this: 3 @monitor_group { "openldap_corp_mirror_${::site}": description => 'Corp OIT LDAP Mirror' } [22:43:02] (03CR) 10Rush: [C: 032] phab add lock file for tools repo [puppet] - 10https://gerrit.wikimedia.org/r/164481 (owner: 10Rush) [22:43:05] YuviPanda: but it does not create that for $site "codfw" for some reason [22:43:15] ah, hmm [22:43:22] are hostgroups created by naggen2 as well?
[22:43:24] * YuviPanda checks [22:43:36] hmm, they aren't [22:43:50] hmm, I wonder how hostgroups are defined [22:44:00] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [22:44:03] by @monitor_group [22:44:24] it's the exact same thing that happened last time [22:45:16] i bet you just adding the group would work [22:45:42] (03PS1) 10Ori.livneh: HHVM: Increase the stack size soft limit to 64MiB [puppet] - 10https://gerrit.wikimedia.org/r/164482 [22:46:12] (03PS1) 10Kaldari: Enable WikiGrok on en.wiki for beta testing on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164483 [22:47:00] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [22:48:13] <^d> so many 5xxs :( [22:48:43] YuviPanda: it must just be delayed.. puppet_hosts.cfg: hostgroups openldap_corp_mirror_codfw [22:48:53] ah [22:48:57] mutante: does puppet start now? [22:49:16] waits for the result of that [22:49:55] no [22:50:34] YuviPanda: no, so the above is defining a host to be in that group, but the group is not being created still [22:50:39] (03CR) 10Ori.livneh: [C: 032] HHVM: Increase the stack size soft limit to 64MiB [puppet] - 10https://gerrit.wikimedia.org/r/164482 (owner: 10Ori.livneh) [22:50:42] and that is still the issue [22:50:46] :( [22:50:55] I've no idea how nagios_hostgroup works [22:50:57] * YuviPanda goes to read docs [22:51:53] YuviPanda: arr.. wut? [22:51:54] puppet_hostgroups.cfg: empty [22:52:02] all host groups are gone? 
wah [22:52:09] (03PS2) 10Reedy: Experimentally enable vips for larger (>50MP) tiff files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164476 (https://bugzilla.wikimedia.org/52045) (owner: 10Brian Wolff) [22:52:15] (03CR) 10Reedy: [C: 032] Experimentally enable vips for larger (>50MP) tiff files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164476 (https://bugzilla.wikimedia.org/52045) (owner: 10Brian Wolff) [22:52:22] (03Merged) 10jenkins-bot: Experimentally enable vips for larger (>50MP) tiff files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164476 (https://bugzilla.wikimedia.org/52045) (owner: 10Brian Wolff) [22:52:30] YuviPanda: was something else merged earlier that touched hostgroups file maybe? [22:52:40] not that I'm aware of [22:52:43] nagios.pp hasn't been touched [22:52:58] and a git grep hostgroups.cfg doesn't show anything... [22:53:04] !log reedy Synchronized wmf-config/: Experimentally enable vips for larger (>50MP) tiff files (duration: 00m 15s) [22:53:11] Logged the message, Master [22:53:17] well, with puppet_hostgroups.cfg being empty, we will have more errors [22:53:20] until they are all back [22:53:40] starts to look through gerrit [22:54:21] YuviPanda: wait, /etc/icinga/puppet_hostgroups.cfg vs /etc/nagios/puppet_hostgroups.cfg ? [22:54:29] did stuff move from icinga to nagios? [22:54:52] modules/icinga/files/icinga.cfg:cfg_file=/etc/nagios/puppet_hostgroups.cfg [22:54:59] mutante: it's always been in /etc/nagios [22:55:17] mutante: since we use nagios_hostgroup with path being $::nagios_config_dir [22:55:19] which is /etc/nagios [22:55:26] since when is that? [22:56:13] puppet_hostextinfo.cfg: ASCII text [22:56:14] puppet_hostgroups.cfg: empty [22:56:19] puppet_servicegroups.cfg: empty [22:56:20] puppet_services.cfg: ASCII text [22:56:28] that's like super confusing [22:56:58] mutante: in where? 
[22:57:04] /etc/icinga [22:57:32] /etc/nagios has hostgroups and servicegroups but nothing else [22:57:59] so here's back to the original issue: [22:58:07] puppet_hostgroups.cfg: hostgroup_name openldap_corp_mirror [22:58:10] puppet_hostgroups.cfg: hostgroup_name openldap_corp_mirror_eqiad [22:58:13] but it needs: [22:58:15] mutante: yup, if you look at icinga.cfg [22:58:22] openldap_corp_mirror_codfw [22:58:29] it has those two things in /etc/nagios and rest in /etc/icinga [22:58:37] says the reason is 'backwards compat with old nagios install' [22:58:43] which is mostly BS, we should move it all to /etc/icinga [23:00:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [23:00:04] RoanKattouw, ^d, marktraceur, MaxSem: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141002T2300). Please do the needful. [23:00:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [23:00:29] ok, swat is mine [23:03:17] YuviPanda: i added the missing group manually so we can see if our _actual_ change was a nop [23:03:41] mutante: it's been there for quite a long time, I see commits in 2013, and more [23:03:44] mutante: ok [23:04:07] YuviPanda, do you have root already? [23:04:13] so as long as puppet doesn't remove it ... :p [23:04:13] mutante: note [23:04:15] mutante: nope [23:04:19] err [23:04:21] MaxSem: nope [23:04:25] MaxSem: next month, I hope [23:04:31] neon sucks away time [23:06:32] !log maxsem Synchronized php-1.25wmf2/extensions/MobileFrontend/: (no message) (duration: 00m 04s) [23:06:38] Logged the message, Master [23:07:01] YuviPanda: so. yea, puppet finished catalog, the "execs" work like before. nop .. except it was a manual fix [23:13:04] cajoel: did you shutdown sanger? [23:13:37] mutante: hmm, so hostgroups still fucked up? 
[23:13:42] mutante: I'll clean up the /etc/nagios thing tomorrow [23:14:05] (03CR) 10MaxSem: [C: 032] Enable WikiGrok on en.wiki for beta testing on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164483 (owner: 10Kaldari) [23:14:12] (03Merged) 10jenkins-bot: Enable WikiGrok on en.wiki for beta testing on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164483 (owner: 10Kaldari) [23:14:26] YuviPanda: not anymore, but only because i added the group, not because puppet did [23:14:37] MaxSem: lemme know whenever that config change is live [23:14:44] mutante: right, and puppet will remove it next time? [23:15:29] so far it doesn't seem to care either way [23:15:30] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/164483 (duration: 00m 03s) [23:15:37] Logged the message, Master [23:15:53] kaldari, [23:18:49] YuviPanda: no, it doesn't remove it, it's honeybadger [23:19:00] 'honeybadger'? [23:19:22] YuviPanda: http://www.youtube.com/watch?v=4r7wHMg5Yjg [23:19:42] YuviPanda: http://knowyourmeme.com/memes/honey-badger [23:20:57] mutante: haha [23:21:22] ACKNOWLEDGEMENT - Certificate expiration on sanger is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 daniel_zahn RT #6163 [23:21:22] ACKNOWLEDGEMENT - LDAP on sanger is CRITICAL: Connection refused daniel_zahn RT #6163 [23:21:22] ACKNOWLEDGEMENT - LDAPS on sanger is CRITICAL: Connection refused daniel_zahn RT #6163 [23:21:22] ACKNOWLEDGEMENT - puppet last run on sanger is CRITICAL: CRITICAL: Puppet last ran 41300 seconds ago, expected 14400 daniel_zahn RT #6163 [23:31:44] (03PS1) 10Andrew Bogott: You know what I hate? [puppet] - 10https://gerrit.wikimedia.org/r/164489 [23:33:24] (03CR) 10Andrew Bogott: [C: 032] You know what I hate? [puppet] - 10https://gerrit.wikimedia.org/r/164489 (owner: 10Andrew Bogott) [23:34:32] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[23:45:45] (03CR) 10Dzahn: [C: 031] Remove squid monitoring from torrus [puppet] - 10https://gerrit.wikimedia.org/r/164274 (owner: 10Hoo man) [23:47:12] (03PS1) 10Reza: Enable Echo for Persian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164491 [23:48:24] (03CR) 10Calak: [C: 031] Enable Echo for Persian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164491 (owner: 10Reza) [23:53:48] (03PS1) 10Dzahn: torrus - remove pmpta subtrees and renderers [puppet] - 10https://gerrit.wikimedia.org/r/164492