[00:02:46] (03PS1) 10Yuvipanda: icinag: Remove unused global variables [puppet] - 10https://gerrit.wikimedia.org/r/164494 [00:02:59] mutante: ^ [00:03:06] should be a icinga no-op [00:03:13] since it doesn't actually modify any code there [00:03:17] (03PS1) 10Dzahn: mediawiki_singlenode - don't use pmtpa servername [puppet] - 10https://gerrit.wikimedia.org/r/164496 [00:03:46] andrewbogott: I should spend some time killing mediawiki_singlenode [00:08:23] (03PS1) 10Dzahn: protoproxy - remove pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/164497 [00:10:26] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: puppet fail [00:11:31] (03PS1) 10Dzahn: ganglia_new - remove pmtpa from configuration [puppet] - 10https://gerrit.wikimedia.org/r/164498 [00:23:34] (03PS7) 10Krinkle: [WIP] Implement role::ci::slave::localbrowser (Chromium) [puppet] - 10https://gerrit.wikimedia.org/r/163791 [00:29:42] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [00:33:24] (03PS8) 10Krinkle: [WIP] Implement role::ci::slave::localbrowser (Chromium) [puppet] - 10https://gerrit.wikimedia.org/r/163791 [00:37:32] (03PS9) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 [00:38:51] (03PS10) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 [00:58:44] (03PS11) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 [01:00:02] (03CR) 10Krinkle: "Fix operations-puppet-puppetlint warnings" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [01:06:11] Krinkle: is the idea behind the $id parameter that you want to allow multiple instances? [01:08:13] (03CR) 10Ori.livneh: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [01:08:24] ori: Yes [01:08:56] Krinkle: it won't work, because Puppet classes can only be instantiated once [01:09:06] you'd need to make it a define() [01:09:37] ori: It should be considered a generic thing (a little bit like nodejs or python). In that I don't want there to be some canonical singleton global xvfb window [01:09:52] yeah, it should be a define () { } then [01:09:54] in that sense it's more like php or nodejs, and not like redis or apache. [01:10:00] except much more primitive [01:10:21] but xvfb is not flexible enough to play nice with people just stuffing things into the default window. I'd rather keep it separate. 
[01:10:21] https://docs.puppetlabs.com/learning/definedtypes.html#beyond-singletons [01:11:09] (03PS12) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 [01:29:30] (03PS13) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 [01:29:32] (03CR) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [01:30:53] (03CR) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [01:31:01] (03PS14) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 [01:47:17] if you wanted to import to a cluster wiki from xml dump , where would you do that? [02:17:11] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Puppet has 1 failures [02:17:41] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Puppet has 1 failures [02:35:00] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [02:35:20] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [02:44:39] !log LocalisationUpdate completed (1.25wmf1) at 2014-10-03 02:44:39+00:00 [02:44:47] Logged the message, Master [02:58:52] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: Puppet has 1 failures [03:03:44] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: Puppet has 1 failures [03:06:45] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:07:45] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
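To make ori's class-vs-define point above concrete: a Puppet class is a singleton, so a parameterized xvfb class can never yield two displays on one node, while a defined type can be declared once per unique title. The following is a minimal sketch of that shape only; the resource name, parameters, and template path are illustrative assumptions, not the contents of change 163791.

    # Hypothetical sketch of xvfb as a defined type rather than a class.
    # Every name below is an assumption made for illustration.
    define xvfb::display(
        $display    = 0,
        $resolution = '1024x768x24',
    ) {
        # The (hypothetical) template would consume $display and $resolution.
        file { "/etc/init/xvfb-${title}.conf":
            ensure  => present,
            content => template('xvfb/xvfb.conf.erb'),
        }
        service { "xvfb-${title}":
            ensure   => running,
            provider => 'upstart',
            require  => File["/etc/init/xvfb-${title}.conf"],
        }
    }
    # Unlike "class { 'xvfb': }", this can be declared repeatedly:
    xvfb::display { 'cibrowser': display => 94 }
    xvfb::display { 'scratch':   display => 95 }
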
[03:07:54] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: Puppet has 1 failures [03:12:05] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 1 failures [03:13:34] PROBLEM - puppet last run on mw1023 is CRITICAL: CRITICAL: Puppet has 1 failures [03:15:36] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Puppet has 1 failures [03:17:04] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [03:17:28] !log LocalisationUpdate completed (1.25wmf2) at 2014-10-03 03:17:28+00:00 [03:17:33] Logged the message, Master [03:19:04] RECOVERY - puppet last run on tmh1002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [03:25:08] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [03:29:19] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [03:30:39] RECOVERY - puppet last run on mw1023 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [03:32:48] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [03:48:04] (03PS4) 10BBlack: Use codfw LVS-based recdns [puppet] - 10https://gerrit.wikimedia.org/r/164473 [04:04:40] !log springle Synchronized wmf-config/db-eqiad.php: depool db1063 (duration: 00m 08s) [04:04:47] Logged the message, Master [04:06:21] (03PS5) 10BBlack: Use codfw LVS-based recdns [puppet] - 10https://gerrit.wikimedia.org/r/164473 [04:13:03] (03PS1) 10Ori.livneh: Get rid of role::apachesync [puppet] - 10https://gerrit.wikimedia.org/r/164508 [04:13:40] (03PS6) 10BBlack: Use codfw LVS-based recdns [puppet] - 10https://gerrit.wikimedia.org/r/164473 [04:15:00] (03PS1) 10Springle: prepare db1063 for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/164509 [04:16:20] (03CR) 10Springle: [C: 032] prepare db1063 for upgrade [puppet] - 10https://gerrit.wikimedia.org/r/164509 (owner: 10Springle) [04:16:35] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Oct 3 04:16:35 UTC 2014 (duration 16m 34s) [04:16:43] Logged the message, Master [04:22:12] !log upgrade db1063 mariadb 10 [04:22:19] Logged the message, Master [04:33:21] (03PS1) 10Rush: phab fixup the legal footer [puppet] - 10https://gerrit.wikimedia.org/r/164510 [04:36:52] (03PS1) 10Rush: phab update for local mwoauth provider [puppet] - 10https://gerrit.wikimedia.org/r/164511 [04:37:04] (03CR) 10Rush: [C: 032] phab fixup the legal footer [puppet] - 10https://gerrit.wikimedia.org/r/164510 (owner: 10Rush) [04:41:48] (03CR) 10Rush: [C: 032] phab update for local mwoauth provider [puppet] - 10https://gerrit.wikimedia.org/r/164511 (owner: 10Rush) [06:27:38] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: puppet fail [06:28:28] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: puppet fail [06:28:47] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail [06:28:47] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: puppet fail [06:29:08] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: puppet fail [06:29:28] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:37] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:37] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:57] PROBLEM - puppet last run on mw1052 is CRITICAL: 
CRITICAL: Puppet has 2 failures [06:29:57] PROBLEM - puppet last run on search1007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:57] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:58] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:58] PROBLEM - puppet last run on analytics1010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:01] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:01] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:07] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 3 failures [06:30:08] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 3 failures [06:30:17] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:17] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:18] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:18] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:18] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:47] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:48] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:58] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:39:28] PROBLEM - puppet last run on ssl1003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:45:01] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:45:41] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:46:00] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:00] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:46:10] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:46:20] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:46:21] RECOVERY - puppet last run on search1007 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:46:21] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:46:21] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:46:30] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:46:42] RECOVERY - 
puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:46:51] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:46:51] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:47:11] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:47:20] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:47:30] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:47:31] RECOVERY - puppet last run on analytics1010 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:47:31] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:47:51] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:49:10] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures [06:49:20] (03CR) 10Ori.livneh: "This looks good. Two small suggestions:" [puppet] - 10https://gerrit.wikimedia.org/r/164358 (owner: 10Giuseppe Lavagetto) [06:50:00] (03PS1) 10Giuseppe Lavagetto: mediawiki: reimage mw1022 to HAT [puppet] - 10https://gerrit.wikimedia.org/r/164517 [06:50:04] (03CR) 10Ori.livneh: "enough to ensure it _isn't_ globbed, I mean." [puppet] - 10https://gerrit.wikimedia.org/r/164358 (owner: 10Giuseppe Lavagetto) [06:55:51] RECOVERY - puppet last run on ssl1003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:59:19] <_joe_> !log depooling mw1022, then reimaging it [06:59:25] Logged the message, Master [07:04:14] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: reimage mw1022 to HAT [puppet] - 10https://gerrit.wikimedia.org/r/164517 (owner: 10Giuseppe Lavagetto) [07:05:57] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [07:07:18] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:13:58] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [07:14:28] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [07:14:57] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [07:15:28] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [07:26:10] <_joe_> gwicke: you around? [07:26:40] <_joe_> if so... 
I was looking at your latest ticket and I have a few questions [07:30:44] hello [07:33:30] <_joe_> ciao hashar [07:36:35] so many things to do :/ [07:36:42] I don't know which one to start with [07:36:53] <_joe_> start with coffee + cigarette [07:37:20] <_joe_> when you have too many things to do, rebel to the modern-world rush to do more, and do less [07:37:23] <_joe_> :P [07:39:51] I started with a beer and indeed a cigarette [07:40:12] <_joe_> a beer at 9 AM? [07:40:13] <_joe_> wow [07:40:21] mostly kidding [07:40:28] went back home at something like 2 am [07:40:46] yesterday was my monthly "philosophical discussion group" [07:40:52] read: geeks having beers and trolling constantly [07:42:26] <_joe_> lol [07:43:19] oh the good news, I have an issue for which I already filed a bug \O/ [07:45:16] _joe_: do you have any idea what linux process accounting is for? That fills up labs instances with tons of logs in /var/log/account/ [07:46:48] oh found out [07:46:56] <_joe_> it is used to account for single users use of the system resources [07:47:29] <_joe_> and to have an audit trail of what's done on a server [07:47:32] on some instances yesterday the log takes 200MB :/ [07:48:29] and I am discovering `lastcomm` [07:51:00] (03PS2) 10Giuseppe Lavagetto: nagios_common: use a template for contacts. [puppet] - 10https://gerrit.wikimedia.org/r/164301 [07:53:19] (03CR) 10Giuseppe Lavagetto: "@Yuvi: at the moment contacts.cfg is not served from here in prod, so we can safely merge this and create the file later, or were you refe" [puppet] - 10https://gerrit.wikimedia.org/r/164301 (owner: 10Giuseppe Lavagetto) [07:53:57] (03CR) 10Giuseppe Lavagetto: nagios_common: use a template for contacts. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/164301 (owner: 10Giuseppe Lavagetto) [07:56:26] (03PS4) 10Hashar: contint: python3.4 on Trusty labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/164071 [07:57:44] (03CR) 10Hashar: "Thanks Ori :-)" [puppet] - 10https://gerrit.wikimedia.org/r/164250 (owner: 10Ori.livneh) [07:58:13] (03CR) 10Hashar: "Rebased :]" [puppet] - 10https://gerrit.wikimedia.org/r/164071 (owner: 10Hashar) [08:03:08] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [08:03:49] PROBLEM - puppet last run on mw1022 is CRITICAL: CRITICAL: Puppet has 1 failures [08:16:08] RECOVERY - puppet last run on mw1022 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [08:17:03] (03PS1) 10Hashar: labs: reduce acct archiving retention [puppet] - 10https://gerrit.wikimedia.org/r/164520 (https://bugzilla.wikimedia.org/69604) [08:18:26] (03PS2) 10Hashar: labs: reduce acct archiving retention [puppet] - 10https://gerrit.wikimedia.org/r/164520 (https://bugzilla.wikimedia.org/69604) [08:18:28] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [08:20:23] !log unexporting, offline, destroying /vol/home_pmtpa on nas1-a [08:20:29] Logged the message, Master [08:20:42] gogod: thanks for shutting down nfs1 [08:23:07] (03CR) 10Hashar: "Turns out we had an incident documentation with deployment-salt (puppet and salt master) having /var filed up and thus causing puppet and " [puppet] - 10https://gerrit.wikimedia.org/r/164520 (https://bugzilla.wikimedia.org/69604) (owner: 10Hashar) [08:24:12] akosiaris: no problem! [08:31:02] !log deleting snmp community from nas1-a, nas1-b.
I guess librenms is going to start complaining [08:31:08] Logged the message, Master [08:34:25] (03PS3) 10Giuseppe Lavagetto: mediawiki: consolidate apache configs [puppet] - 10https://gerrit.wikimedia.org/r/164358 [08:55:52] (03PS1) 10Filippo Giunchedi: syslog-ng: update to trusty [puppet] - 10https://gerrit.wikimedia.org/r/164523 [08:55:54] (03PS1) 10Filippo Giunchedi: syslog-ng: filter out swift noise [puppet] - 10https://gerrit.wikimedia.org/r/164524 [09:15:18] (03PS4) 10Giuseppe Lavagetto: mediawiki: consolidate apache configs [puppet] - 10https://gerrit.wikimedia.org/r/164358 [09:23:20] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures [09:23:50] PROBLEM - Memcached on ms-fe2001 is CRITICAL: Connection refused [09:24:41] PROBLEM - Swift HTTP backend on ms-fe2001 is CRITICAL: Connection refused [09:25:01] PROBLEM - Swift HTTP frontend on ms-fe2001 is CRITICAL: Connection refused [09:25:39] godog: is that ^ you ? [09:26:20] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [09:33:06] akosiaris: ah yes that's indeed me, initial provisioning [09:33:21] didn't expect icinga to pick it up that quickly, good work icinga-wm [09:35:14] !log springle Synchronized wmf-config/db-eqiad.php: repool db1063 (duration: 00m 07s) [09:35:21] Logged the message, Master [09:36:41] PROBLEM - NTP on ms-fe2001 is CRITICAL: NTP CRITICAL: Offset unknown [09:38:31] RECOVERY - Swift HTTP frontend on ms-fe2001 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.091 second response time [09:38:32] RECOVERY - NTP on ms-fe2001 is OK: NTP OK: Offset -0.04611456394 secs [09:41:00] (03CR) 10Hashar: [C: 04-1] "Thanks for the contint class refactor that part is fine to me." [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [10:04:49] (03PS1) 10Filippo Giunchedi: swift: fix system::role description [puppet] - 10https://gerrit.wikimedia.org/r/164527 [10:04:51] (03PS1) 10Filippo Giunchedi: swift: use fully-qualified vars [puppet] - 10https://gerrit.wikimedia.org/r/164528 [10:07:11] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: fix system::role description [puppet] - 10https://gerrit.wikimedia.org/r/164527 (owner: 10Filippo Giunchedi) [10:16:49] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [10:16:59] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [10:24:29] <_joe_> godog: ^^ [10:26:06] (03PS2) 10Giuseppe Lavagetto: swift_new: include role::swift::base [puppet] - 10https://gerrit.wikimedia.org/r/163809 [10:28:34] (03CR) 10Giuseppe Lavagetto: [C: 032] swift_new: include role::swift::base [puppet] - 10https://gerrit.wikimedia.org/r/163809 (owner: 10Giuseppe Lavagetto) [10:29:30] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [10:29:31] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [10:39:10] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: puppet fail [10:43:49] !log Shutdown amaranth.toolserver.org's switchport on asw-d-pmtpa [10:43:57] Logged the message, Master [10:58:11] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [11:23:15] mark: are we going to route 10.0.0.0/16 to eqiad or codfw ? 
[11:23:22] it is the last pmtpa private subnet [11:23:26] neither? [11:23:55] keep it as a reserve then [11:24:01] sounds fine to me [11:25:08] eqiad uses 10.64.0.0/12, and codfw has 10.128.0.0/12 [11:25:14] er [11:25:17] 10.192.0.0/12 [11:25:29] ulsfo is 10.128.0.0/12 i believe, a bit much, but oh well [11:25:44] so yeah, 10.0.0.0/12 becomes mostly free, except there are some things in there like the service ips [11:26:31] (03PS2) 10Springle: remove Tampa db's from site and dsh [puppet] - 10https://gerrit.wikimedia.org/r/164253 (owner: 10Dzahn) [11:27:48] (03CR) 10Springle: [C: 032] remove Tampa db's from site and dsh [puppet] - 10https://gerrit.wikimedia.org/r/164253 (owner: 10Dzahn) [11:28:58] (03PS2) 10Springle: remove Tampa db and es servers from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/164249 (owner: 10Dzahn) [11:30:15] (03CR) 10Springle: [C: 032] remove Tampa db and es servers from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/164249 (owner: 10Dzahn) [11:34:44] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Per IRC discussion with mark we are not going to assign 10.0.0.0/16 to any DC but rather keep it as a reserve. So removing the entire DHCP" [puppet] - 10https://gerrit.wikimedia.org/r/164241 (owner: 10Dzahn) [11:37:33] !log shutdown db60 db68 db69 db71 db72 db73 db74 es4 es7 es10 [11:37:39] Logged the message, Master [11:41:57] _joe_: whoop, sorry [11:49:45] !log aude Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 10s) [11:49:50] Logged the message, Master [11:50:23] (03PS1) 10Glaisher: Enable EducationProgram extension on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164532 (https://bugzilla.wikimedia.org/71381) [11:51:03] !log reedy Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 15s) [11:51:08] Logged the message, Master [11:51:19] haha [11:51:32] O_o [11:51:52] (03PS1) 10Reedy: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164534 [11:51:57] thanks aude [11:52:00] hah [11:52:11] (03CR) 10Reedy: [C: 032] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164534 (owner: 10Reedy) [11:52:21] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164534 (owner: 10Reedy) [11:53:05] (03PS3) 10Springle: remove Tampa db and es servers [dns] - 10https://gerrit.wikimedia.org/r/164257 (owner: 10Dzahn) [11:53:24] springle: woo :) [11:53:43] :) [11:54:20] I'm guessing the tampa dbs were really just replicating to give an off eqiad backup? [11:54:36] right, and amaranth for toolserver [11:54:54] <_joe_> Reedy: I'd like your opinion on https://gerrit.wikimedia.org/r/#/c/164358/ [11:56:49] _joe_: Awesome. Looks ok at a quick glance [11:57:03] Will look a bit more thoroughly in a bit [11:57:35] (03CR) 10Alexandros Kosiaris: [C: 032] Remove squid monitoring from torrus [puppet] - 10https://gerrit.wikimedia.org/r/164274 (owner: 10Hoo man) [11:58:24] <_joe_> Reedy: no rush [11:59:16] (03CR) 10Springle: [C: 031] remove Tampa db and es servers [dns] - 10https://gerrit.wikimedia.org/r/164257 (owner: 10Dzahn) [12:02:07] volunteers for https://gerrit.wikimedia.org/r/#/c/164528/ ? _joe_ ? [12:13:06] (03CR) 10Alexandros Kosiaris: [C: 031] "But please don't forget to purge from hosts!" 
[puppet] - 10https://gerrit.wikimedia.org/r/164429 (owner: 10Dzahn) [12:17:53] (03PS2) 10Filippo Giunchedi: swift: use fully-qualified vars [puppet] - 10https://gerrit.wikimedia.org/r/164528 [12:18:00] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: use fully-qualified vars [puppet] - 10https://gerrit.wikimedia.org/r/164528 (owner: 10Filippo Giunchedi) [12:22:54] (03PS1) 10Alexandros Kosiaris: Add a ferm service for ssh on all bastionhosts [puppet] - 10https://gerrit.wikimedia.org/r/164542 [12:28:38] (03PS1) 10Springle: depool es1004 for upgrade & clone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164544 [12:29:33] (03CR) 10Springle: [C: 032] depool es1004 for upgrade & clone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164544 (owner: 10Springle) [12:29:40] (03Merged) 10jenkins-bot: depool es1004 for upgrade & clone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164544 (owner: 10Springle) [12:29:58] (03PS2) 10Alexandros Kosiaris: Add a ferm service for ssh on all bastionhosts [puppet] - 10https://gerrit.wikimedia.org/r/164542 [12:30:57] (03CR) 10Alexandros Kosiaris: "Daniel, thanks for bringing this up again. So seems like we are good to go. I commented in the pad, seems like the only blocker is" [puppet] - 10https://gerrit.wikimedia.org/r/96424 (owner: 10Dzahn) [12:30:57] !log springle Synchronized wmf-config/db-eqiad.php: depool es1004 (duration: 00m 06s) [12:31:04] Logged the message, Master [12:43:21] PROBLEM - Swift HTTP frontend on ms-fe2002 is CRITICAL: Connection refused [12:44:00] scheduling downtime [12:44:29] PROBLEM - Memcached on ms-fe2002 is CRITICAL: Connection refused [12:53:11] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM, apart from a minor issue" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/164542 (owner: 10Alexandros Kosiaris) [12:56:19] RECOVERY - Swift HTTP frontend on ms-fe2002 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.094 second response time [12:56:38] RECOVERY - Memcached on ms-fe2002 is OK: TCP OK - 0.043 second response time on port 11211 [13:00:04] K4: Respected human, time to deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141003T1300). Please do the needful. 
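akosiaris's change 164542 above ("Add a ferm service for ssh on all bastionhosts") exercises the ferm::service wrapper from operations/puppet. The change body is not quoted in this log, so the sketch below only shows the typical shape of such a declaration; the role class name and the open source-range are assumptions.

    # Sketch: allow inbound ssh on bastion hosts. ferm::service drops a
    # rule fragment into /etc/ferm/conf.d and reloads the firewall;
    # bastions must accept ssh from everywhere, so no srange is given.
    class role::bastionhost {
        ferm::service { 'bastion-ssh':
            proto => 'tcp',
            port  => 'ssh',
        }
    }
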
[13:04:08] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.010 second response time [13:07:02] !log Updated minor_mime to varbinary(100) on image|filearchive|oldimage on foundationwiki [13:07:11] Logged the message, Master [13:08:48] RECOVERY - Memcached on ms-fe2001 is OK: TCP OK - 0.043 second response time on port 11211 [13:14:07] (03PS1) 10Springle: remove old pmtpa es nodes [puppet] - 10https://gerrit.wikimedia.org/r/164547 [13:15:20] (03CR) 10Springle: [C: 032] remove old pmtpa es nodes [puppet] - 10https://gerrit.wikimedia.org/r/164547 (owner: 10Springle) [13:16:48] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.022 second response time [13:23:40] (03CR) 10Alexandros Kosiaris: [C: 032] Use codfw LVS-based recdns [puppet] - 10https://gerrit.wikimedia.org/r/164473 (owner: 10BBlack) [13:24:12] (03CR) 10Alexandros Kosiaris: [C: 032] network.pp - remove fenari [puppet] - 10https://gerrit.wikimedia.org/r/164154 (owner: 10Dzahn) [13:26:52] (03PS7) 10BBlack: Use codfw LVS-based recdns [puppet] - 10https://gerrit.wikimedia.org/r/164473 [13:32:23] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [13:34:13] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [13:39:44] (03PS1) 10Filippo Giunchedi: swift: fix auth url usage [puppet] - 10https://gerrit.wikimedia.org/r/164549 [13:40:24] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: fix auth url usage [puppet] - 10https://gerrit.wikimedia.org/r/164549 (owner: 10Filippo Giunchedi) [13:41:19] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [13:46:26] (03PS1) 10Ottomata: Add Fabian Kaelin to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/164550 [13:46:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [13:47:29] (03CR) 10Ottomata: [C: 032 V: 032] Add Fabian Kaelin to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/164550 (owner: 10Ottomata) [13:52:32] (03CR) 10Jsahleen: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/163841 (owner: 10KartikMistry) [13:56:49] (03PS1) 10Filippo Giunchedi: lvs: add swift in codfw [puppet] - 10https://gerrit.wikimedia.org/r/164552 [13:57:15] (03PS1) 10Manybubbles: [Beta] Have Cirrus build index for regexes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164553 [13:57:58] bblack: https://gerrit.wikimedia.org/r/#/c/164552/ looks easy enough, if you have 5min [13:57:59] (03PS2) 10Manybubbles: [Beta] Have Cirrus build index for regexes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164553 [13:58:03] (03PS24) 10Alexandros Kosiaris: Citoid puppetization [puppet] - 10https://gerrit.wikimedia.org/r/163068 (owner: 10Catrope) [14:00:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Overall this is a really good patch. I added a couple of parts missing like the LVS balancer IPs (Daniel noticed as well, thanks), the act" [puppet] - 10https://gerrit.wikimedia.org/r/163068 (owner: 10Catrope) [14:00:49] (03CR) 10Alexandros Kosiaris: [C: 032] remove fenari [dns] - 10https://gerrit.wikimedia.org/r/163313 (owner: 10Dzahn) [14:03:47] If I merge a patch to mediawiki-config with [beta] in the name that is obviously for beta should I sync it to prod or just let the next person notice that its for beta and ignore it? 
[14:04:05] * aude would sync it [14:04:07] because beta configs _are_ synced to prod and they can get confusing [14:05:57] (03CR) 10Alexandros Kosiaris: [C: 032] replace sanger,sfo-aaa1 with ldap1/ldap2.corp [puppet] - 10https://gerrit.wikimedia.org/r/164139 (owner: 10Dzahn) [14:08:55] (03CR) 10BBlack: [C: 031] lvs: add swift in codfw [puppet] - 10https://gerrit.wikimedia.org/r/164552 (owner: 10Filippo Giunchedi) [14:09:57] (03PS1) 10Filippo Giunchedi: swift: fix credentials variable [puppet] - 10https://gerrit.wikimedia.org/r/164554 [14:10:35] (03PS2) 10Filippo Giunchedi: lvs: add swift in codfw [puppet] - 10https://gerrit.wikimedia.org/r/164552 [14:10:42] godog: another thing I noticed while looking at that patch: the old monitor check for ms-fe in eqiad uses the hostname "ms-fe.eqiad.wmnet", which is redundant for ms-fe.svc.eqiad.wmnet in DNS [14:10:43] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] lvs: add swift in codfw [puppet] - 10https://gerrit.wikimedia.org/r/164552 (owner: 10Filippo Giunchedi) [14:10:51] (03PS2) 10Filippo Giunchedi: swift: fix credentials variable [puppet] - 10https://gerrit.wikimedia.org/r/164554 [14:10:58] we could probably switch that and kill the old DNS name (well, assuming some app code doesn't use that old name?) [14:10:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: fix credentials variable [puppet] - 10https://gerrit.wikimedia.org/r/164554 (owner: 10Filippo Giunchedi) [14:12:11] bblack: yeah we can change the monitor, but afair the non-svc name is used in apps too [14:13:17] well we may as well keep the monitor matching the app then [14:14:29] yep, I'm trying to remember where to look for the upload pipeline but we'll get it right for codfw and going forward [14:16:10] (03CR) 10Gilles: "No doubt that it introduced an increase of 500s, I just assumed those were harmless since the PHP thumbnailing script gives up right away " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164428 (owner: 10Chad) [14:17:04] K4: Dear anthropoid, the time has come. Please deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141003T1417). [14:18:31] !log launched iodine:/opt/otrs/bin/otrs.RebuildFulltextIndex.pl per bugzilla #64473 [14:18:38] Logged the message, Master [14:29:42] (03PS8) 10BBlack: Use codfw LVS-based recdns [puppet] - 10https://gerrit.wikimedia.org/r/164473 [14:30:14] (03CR) 10Jgreen: [C: 031] "Yep, purge at will, I think this tool is no longer applicable for our search architecture." [puppet] - 10https://gerrit.wikimedia.org/r/164429 (owner: 10Dzahn) [14:30:50] (03CR) 10BBlack: [C: 032] Use codfw LVS-based recdns [puppet] - 10https://gerrit.wikimedia.org/r/164473 (owner: 10BBlack) [14:32:06] anyone mind if I merge and sync some beta configuration now?
[14:33:37] silence means assent [14:33:46] * manybubbles will deploy a noop change now [14:34:09] (03CR) 10Manybubbles: [C: 032] [Beta] Have Cirrus build index for regexes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164553 (owner: 10Manybubbles) [14:34:16] (03Merged) 10jenkins-bot: [Beta] Have Cirrus build index for regexes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164553 (owner: 10Manybubbles) [14:35:07] !log manybubbles Synchronized wmf-config/CirrusSearch-labs.php: Noop - just keeps beta config in sync (duration: 00m 04s) [14:35:15] Logged the message, Master [14:35:34] (03PS1) 10Filippo Giunchedi: swift: fix credentials #2 [puppet] - 10https://gerrit.wikimedia.org/r/164557 [14:36:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: fix credentials #2 [puppet] - 10https://gerrit.wikimedia.org/r/164557 (owner: 10Filippo Giunchedi) [14:37:46] !log testing dns server upgrade on baham [14:37:50] Logged the message, Master [14:38:19] (03PS1) 10Filippo Giunchedi: swift: this isn't python, sorted vs sort [puppet] - 10https://gerrit.wikimedia.org/r/164558 [14:38:36] this back and forth is getting old [14:39:00] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: this isn't python, sorted vs sort [puppet] - 10https://gerrit.wikimedia.org/r/164558 (owner: 10Filippo Giunchedi) [14:39:42] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: puppet fail [14:42:11] PROBLEM - Host ms-fe.svc.codfw.wmnet is DOWN: CRITICAL - Network Unreachable (10.2.1.27) [14:43:03] icinga check works! :) [14:43:12] a page! [14:43:14] haha, whooops, acknowledging [14:43:16] sorry [14:43:17] yay! [14:44:37] (03PS1) 10BBlack: No-op change for authdns-update test [dns] - 10https://gerrit.wikimedia.org/r/164559 [14:45:00] (03CR) 10BBlack: [C: 032] No-op change for authdns-update test [dns] - 10https://gerrit.wikimedia.org/r/164559 (owner: 10BBlack) [14:45:44] I'm assuming the codfw svc range isn't routable internally anywhere yet [14:46:01] hmmm probably not, I haven't looked [14:46:22] I did set up the pybal bgp stuff on the codfw routers yesterday, so you should be able to hit it from within codfw at least, manually [14:46:30] (if you want to check that) [14:46:40] but eqiad, etc may not know about 10.2.1.0/24 -> codfw [14:49:11] bblack: mhh ok starting with codfw, assuming puppet has ran on lvs2001 since my change I don't see the ip assigned to any interface yet, the pybal config for codfw/swift should be available already, anything I might be missing? [14:50:14] looking [14:50:16] vlan perhaps, but I'm assuming it goes on private vlan and that's it [14:50:48] vlan doesn't matter [14:50:55] are you sure it's 2001? [14:52:14] bblack: not sure what you mean [14:52:39] ah it's on lvs2003, trying a fresh puppet run now [14:53:17] swift is 'class' => "low-traffic", [14:53:31] which is lvsx00[36] in eqiad/codfw [14:53:50] Anyone looking into the mass of segfaults from the apaches? 640 in the last hour according to logstash. [14:53:50] but I still don't see it after a puppet run there [14:53:58] They seem to be spread out across the cluster [14:54:50] <_joe_> bd808: mmmh segfaults look scary [14:55:41] Have we upgraded php5 or an extension recently? [14:55:59] <_joe_> a couple of weeks ago AFAIK [14:56:29] <_joe_> bd808: what did you search for in logstash? [14:56:31] This logstash report does not look good -- https://logstash.wikimedia.org/#dashboard/temp/O3rcWzy9QDSoXSAtNmDs1Q [14:56:43] godog: ah there was a missing bit in the LVS patch [14:56:59] (go us for our data redundancy without hiera!)
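bblack's "data redundancy without hiera" quip refers to how LVS services were declared at the time: the same service has to appear both in a global service hash (where 'class' => 'low-traffic' maps it onto the x00[36] balancers) and in the per-balancer list that actually binds the VIP, which was the missing bit here. A rough sketch of the two places, with illustrative keys rather than the literal manifests; only the 10.2.1.27 address is taken from this log:

    # Place 1: the service catalog entry.
    $lvs_services = {
        'swift' => {
            'description' => 'Swift frontend, ms-fe.svc.codfw.wmnet',
            'class'       => 'low-traffic',
            'ip'          => { 'codfw' => '10.2.1.27' },
        },
    }
    # Place 2: the per-balancer service list. Forgetting to add 'swift'
    # here is the kind of omission that left lvs2003 without the VIP.
    node /^lvs200[36]\.codfw\.wmnet$/ {
        $lvs_balancer_services = ['swift']  # plus the other low-traffic services
    }

Keeping one copy of this data and looking it up from both places is the general cure for this class of duplication, which is where the hiera changes later in this log are headed.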
[14:57:29] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:57:42] the sip part in the role ? that is usually the one I am missing [14:57:43] <_joe_> bd808: it began yesterday evening, lemme take a look to the sal and one of the servers [14:57:46] (03PS1) 10BBlack: Add swift to codfw LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/164563 [14:58:12] (03CR) 10BBlack: [C: 032 V: 032] Add swift to codfw LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/164563 (owner: 10BBlack) [14:58:29] _joe_: Here's a simple search for them in logstash -- normalized_message.raw:"Segmentation fault (11)" AND type:apache2 [14:58:29] bblack: ow :( yeah I wouldn't have discovered that without much digging :( thanks! [14:59:06] it's basically the same data one could infer from the file you did edit, that's what sucks [15:01:11] indeedly [15:01:51] ok so even in codfw, that IP doesn't route yet [15:02:02] I'll start looking at the routers [15:02:09] but lvs2003 is listening now [15:02:22] <_joe_> bd808: this is indeed bad [15:03:27] <^demon|away> bblack: re: swift in codfw ^. Are we preemptively allocating the lvs stuff now, or does it need to wait on actual hardware for the service? [15:04:20] ^demon|away: if you mean the IP addrs and LVS config, I've been waiting so we're not defining a bunch of dead services with failing monitors [15:04:25] _joe_: The only thing I know for certain about php5 segfaults is that once they start happening it's a real pain in the butt to track down what php code and/or extension is causing them. :( [15:04:39] <^demon|away> bblack: wasn't sure, thx. will wait then. [15:04:42] <_joe_> yeah, hhvm is much better at that [15:05:12] <_joe_> bd808: we can only basically look at the logs and/or strace a process until it dies :( [15:06:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [15:06:57] ^demon|away: the 500's graph looks "normal" again. I think from what I saw of para.void's analysis it was a bad bot. We should probably put a revert of the revert of gi11es' change back up for swat on Monday. [15:07:46] <_joe_> or gdb [15:08:06] <^demon|away> Yeah, I saw his notes last night, can likely revert. [15:08:11] <^demon|away> s/likely// [15:09:56] bd808: you sure? my thing would generate 500s with the current code, up to 8 per upload for really tiny files. they're harmless 500s, but if you're tracking 500s to identify issues, I should just make the code smarter and not request thumbnails that can't be rendered [15:11:18] gi11es: It would be awesome to keep it from making spurious 500s, but the big spike yesterday seems to correlate with a user agent of "Java/1.7.0_45" and bad unicode url handling. 
[15:11:48] bd808: I was going to work on avoiding the 500s right now, I should have it completed for monday's SWAT [15:11:56] cool [15:14:44] !log dropping 10.2.1.0/24 aggregate + static routes in cr2-pmtpa [15:14:50] Logged the message, Master [15:14:58] _joe_: good evening [15:15:22] !log adding 10.2.1.0/24 aggregate in cr-[12].codfw [15:15:28] Logged the message, Master [15:16:59] <_joe_> gwicke: I've seen your RT about parsoid cache being non-performant [15:17:08] <_joe_> and I wanted more info before acting [15:17:56] _joe_: okay [15:18:18] (03PS8) 10Rush: Abstracts Sprint install with defined resource type phabricator::libext [puppet] - 10https://gerrit.wikimedia.org/r/162873 (owner: 10Christopher Johnson (WMDE)) [15:18:21] _joe_: there was a thread on the ops list a while ago [15:18:35] <_joe_> yeah I remember that [15:18:49] <_joe_> so can you add to the ticket some url that we can test? [15:19:03] !log shutdown pdf2 & pdf3 [15:19:08] <_joe_> I'd like us to be able to troubleshoot this, instead of just banning random urls [15:19:11] Logged the message, Master [15:19:19] <_joe_> bye bye pdf [15:19:43] yay ;) [15:19:49] PROBLEM - Host pdf2 is DOWN: PING CRITICAL - Packet loss = 100% [15:19:56] _joe_: let me copy over the info from the thread [15:20:13] * mark hugs icinga-wm [15:20:57] <_joe_> no I don't need _that_ [15:21:05] PROBLEM - Host pdf3 is DOWN: PING CRITICAL - Packet loss = 100% [15:21:06] <_joe_> I'd like to know what test did you do today [15:21:15] _joe_: same test [15:21:20] <_joe_> ok [15:21:36] it also shows up in VE performance graphs [15:21:45] YuviPanda: yoho? [15:21:57] chasemp: 'sup [15:21:57] <_joe_> oh ok for no-cache requests [15:22:04] _joe_: the miss rate is also higher than expected [15:22:15] https://gerrit.wikimedia.org/r/#/c/162873 for real [15:22:18] which makes sense since varnish holds onto a lot of variants [15:22:22] can you test on your labs thing once merged [15:22:25] or at least look for blow up [15:22:27] bblack: happy days, looks like the route is announced and I can connect and swift replies [15:22:27] godog: [15:22:28] root@baham:~# curl http://ms-fe.svc.codfw.wmnet/monitoring/backend [15:22:28]
<html><h1>Unauthorized</h1><p>This server could not verify that you are authorized to access the document you requested.</p></html>
[15:22:32] yeah [15:23:01] yep looking at the monitoring urls now [15:23:10] _joe_: you could also try a POST instead [15:23:14] <_joe_> gwicke: sorry to ask, but where are the VE performance graphs? [15:23:35] RECOVERY - Host ms-fe.svc.codfw.wmnet is UP: PING OK - Packet loss = 0%, RTA = 44.08 ms [15:23:58] _joe_: https://gdash.wikimedia.org/dashboards/ve/ [15:24:10] towards the end [15:24:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [15:24:43] <_joe_> gwicke: not really, to be honest [15:25:21] <_joe_> it looks like it's pretty stable across last week [15:26:05] it's a very slow degradation over the last 1/2 year [15:26:30] https://graphite.wikimedia.org/render/?title=VisualEditor%20save%20completion%20time,%20one-minute%20sliding%20window,%20last%20year&vtitle=milliseconds&from=-1year&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=alias(color(ve.performance.user.saveComplete.median,%22blue%22),%22Median%22)&target=alias(color(ve.performance.user.saveComplete.75 [15:26:31] percentile,%22red%22),%2275th%20percentile%22) [15:27:14] yep, I was asking about that some time ago as well [15:27:39] (I think I added the 1y graph a couple months ago?) [15:34:21] <_joe_> !log purging varnish cache for parsoid (RT 8528) [15:34:26] Logged the message, Master [15:35:21] <_joe_> gwicke: done [15:36:16] _joe_: muchos gracias! [15:40:00] chasemp: gah, lost connectivity just after I said 'sup' [15:40:02] 'sup? [15:41:12] (03PS1) 10Giuseppe Lavagetto: swift: clean up [puppet] - 10https://gerrit.wikimedia.org/r/164566 [15:41:14] (03PS1) 10Giuseppe Lavagetto: hiera: use hiera to lookup the cluster [puppet] - 10https://gerrit.wikimedia.org/r/164567 [15:41:14] no problem [15:41:16] (03PS1) 10Giuseppe Lavagetto: puppet: drop global variable $puppet_version [puppet] - 10https://gerrit.wikimedia.org/r/164568 [15:41:55] YuviPanda: out of time atm but will want to merge https://gerrit.wikimedia.org/r/#/c/162873/ maybe monday and you can verify it doesn't kill phab::labs? [15:42:15] chasemp: yup, sure [15:42:16] (03CR) 10Rush: "seems good, need to coordinate w/ yuvi regarding making sure this doesn't kill labs phab. probably monday?" [puppet] - 10https://gerrit.wikimedia.org/r/162873 (owner: 10Christopher Johnson (WMDE)) [15:43:31] (03CR) 10BryanDavis: [C: 031] "I took the initiative to cherry-pick this on deployment-salt and forced a puppet run on deployment-bastion. It applied cleanly." [puppet] - 10https://gerrit.wikimedia.org/r/164520 (https://bugzilla.wikimedia.org/69604) (owner: 10Hashar) [15:46:58] morning [15:49:40] <_joe_> hi ori [15:55:22] bd808: zend segfaults possibly related to https://bugzilla.wikimedia.org/show_bug.cgi?id=71542 [15:55:47] (03CR) 10Ori.livneh: [C: 031] "weee, nice" [puppet] - 10https://gerrit.wikimedia.org/r/164358 (owner: 10Giuseppe Lavagetto) [16:10:13] (03PS4) 10Ori.livneh: misc::maintenance: clean-up [puppet] - 10https://gerrit.wikimedia.org/r/160232 [16:24:02] (03CR) 10Filippo Giunchedi: [C: 031] swift: clean up [puppet] - 10https://gerrit.wikimedia.org/r/164566 (owner: 10Giuseppe Lavagetto) [16:32:24] ottomata: hey! wanna merge some (only 2!) trivial icinga changes? [16:32:36] I'm not going to do more today, being friday and all [16:32:38] (03CR) 10Filippo Giunchedi: [C: 031] "having the puppet compiler output would be great too" [puppet] - 10https://gerrit.wikimedia.org/r/160232 (owner: 10Ori.livneh) [16:37:09] YuviPanda: add me as reviewer! [16:37:58] oh you did! 
[16:38:40] ottomata: :D [16:43:38] (03PS2) 10Ottomata: icinag: Remove unused global variables [puppet] - 10https://gerrit.wikimedia.org/r/164494 (owner: 10Yuvipanda) [16:44:01] icinag :D [16:44:09] (03PS3) 10Ottomata: icinga: Get rid of ganglios [puppet] - 10https://gerrit.wikimedia.org/r/164239 (owner: 10Yuvipanda) [16:44:26] omen nomen [16:45:00] (03CR) 10Ottomata: [C: 032] icinag: Remove unused global variables [puppet] - 10https://gerrit.wikimedia.org/r/164494 (owner: 10Yuvipanda) [16:47:23] ottomata: I was just about to fix that icinag name :'( [16:48:44] ahhh well: ) [16:49:23] hah [16:49:26] PROBLEM - Parsoid on wtp1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:56] also I just noticed the topic changed from RT duty to OPs duty. The inconsistent capitalisation annoys me :p [16:51:16] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.033 second response time [16:52:27] PROBLEM - Parsoid on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:29] ottomata: :) the only remaining global variable is used in the percona test, should refactor [17:01:35] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.013 second response time [17:01:58] hmm, what's going on with parsoid? [17:03:41] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Parsoid%2520eqiad&tab=m&vn= [17:03:54] I'm going to restart parsoids [17:04:33] !log restarting parsoids after CPU spike [17:04:39] Logged the message, Master [17:07:36] I'm seeing a lot of failed API requests in the logs [17:12:18] jgage: yo, yt? [17:12:50] saved some logs for later investigation [17:13:32] were there any recent API changes? [17:13:49] ottomata: you forgot to merge https://gerrit.wikimedia.org/r/#/c/164239/! :) [17:13:49] Parsoid is seeing a lot of errors from the PHP API cluster [17:13:53] YuviPanda: I know [17:13:58] i'm doing 3 things at once right now:) [17:14:00] oh [17:14:00] ok [17:14:01] sorry :)
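The "HTTP 5xx req/min on tungsten" PROBLEM/RECOVERY pairs just below, like the earlier ones, come from an icinga check that queries graphite and alerts when a given share of recent datapoints crosses a limit, hence wording like "20.00% of data above the critical threshold [500.0]". A sketch of such a check declaration follows; the define name, metric path, and window are assumptions patterned on the alert text, not the actual manifest:

    # Illustrative graphite-backed alert: critical when at least 20% of
    # the datapoints in the window exceed 500 5xx responses/min; the
    # RECOVERY text "Less than 1.00% above the threshold [250.0]"
    # suggests 250 as the warning level.
    monitoring::graphite_threshold { 'http-5xx-req-per-min':
        description => 'HTTP 5xx req/min',
        metric      => 'reqstats.5xx',
        from        => '15min',
        warning     => 250,
        critical    => 500,
        percentage  => 20,
    }
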
[17:28:21] bah, "'reason not specified'" [17:28:33] seeing a whole lot of 2014-10-03 17:27:59 mw1117 eswiki: "action=expandtemplates&!prop" "10.64.32.88" "10.64.32.88" "" "Parsoid/0.1" [17:28:51] bblack: the load on the api cluster doesn't look especially high [17:28:53] oh wrong log [17:29:05] there's still a lot of failures though [17:29:21] oh I thought I read "API load spiking" on IRC [17:29:27] I didn't actually look at the graph :) [17:29:30] andrewbogott: no one logged the stopping puppet [17:29:37] so i wonder if maybe wasnt intentional [17:29:51] I guess I'll turn it back on! [17:29:57] I would [17:30:06] !log enabling puppet on tungsten which is disabled for mysterious reasons [17:30:15] Logged the message, Master [17:30:20] gwicke, looks like load on parsoid cluster is getting back to normal .. at least on a few cores. [17:30:25] and I can see folks poking the system regularly [17:30:59] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet last ran 76789 seconds ago, expected 14400 [17:32:21] hmm I probably did that yesterday ^ [17:32:35] bblack: ok, should we not have enabled it? [17:32:40] cuz we totally just did [17:32:52] though i imagine its not the cause of this issue [17:33:00] since you are pasting those crazy requests ;] [17:33:13] * andrewbogott watches puppet run nervously [17:33:21] gwicke: Sorry, was out to lunch. Is the problem still present? What symptoms are you getting? [17:33:30] no it's fine [17:33:48] I just meant I probably accidentally left it off yesterday [17:34:29] Coren: the problem is still there [17:35:04] parsoid still sees lots of API request failures [17:35:10] where? [17:35:24] /var/log/parsoid/parsoid.log on any of the parsoid boxes [17:35:28] * Coren goes to dive into logs. [17:36:03] (03PS1) 10QChris: Allow local server-status requests on http [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/164583 (https://bugzilla.wikimedia.org/71606) [17:36:07] parsoid bypasses Varnish for most requests, and the timing pattern looks like connection issues [17:36:13] which points towards network or LVS [17:36:32] what does the failure look like in the log? [17:36:44] [warning/api][enwiki/Czarnoborsko?oldid=545000380] Failed API request, 0 retries remaining. ? [17:36:53] [warning/api][fawiki/نبیل_مسکین‌یار?oldid=9677207] Failed API request, 0 retries remaining. [17:36:59] [warning/api][enwiki/Jemielno?oldid=545000800] Failed API request, 0 retries remaining. [17:37:11] * Coren wishes for better than Failed API request as diagnostic. [17:37:18] I doubt it's a network issue. It's all local to eqiad right? [17:37:26] i wish i had looked at a parsoid log when its optimal ;] [17:37:35] we had similar issues before where LVS got confused by a DNS change [17:37:37] Hm, this puppet run on tungsten is editing resolv.conf [17:37:48] -nameserver 91.198.174.6 [17:37:49] +nameserver 208.80.153.254 [17:37:51] oh [17:37:56] we cycled those yesterday right? [17:38:01] to move off esams to codfw [17:38:05] yeah I pushed that this morning [17:38:08] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:38:09] ah ha [17:38:22] should we manually fix tungsten? [17:38:26] since its puppet is broken? [17:38:31] what's wrong with tungsten? [17:38:39] oh, it just lceared, nm [17:38:45] yeah andrewbogott ran it [17:38:47] andrewbogott: but that is expected [17:38:59] puppet is happy, it just made that one edit [17:39:04] (03PS1) 10Andrew Bogott: Enable puppet as a part of firstrun. 
[puppet] - 10https://gerrit.wikimedia.org/r/164588 [17:39:05] gwicke: got confused by what kind of dns change? [17:39:14] (and has there been such a DNS change?) [17:40:05] the last time this happened (talking to gwicke), the ips of boxes where changed [17:40:06] (03PS2) 10Andrew Bogott: Enable puppet as a part of firstboot.sh. [puppet] - 10https://gerrit.wikimedia.org/r/164588 [17:40:16] * gwicke nods [17:40:19] and lvs had to be restarted because it had cached the old box ips [17:40:27] but this isnt exactly the same [17:40:36] we just changed the resolv.conf of every system, not the system IP addresses [17:40:38] <_joe_> we flushed all cache for parsoid from varnishes earlier [17:40:55] <_joe_> the two might be related [17:41:21] (03CR) 10Andrew Bogott: [C: 032] Enable puppet as a part of firstboot.sh. [puppet] - 10https://gerrit.wikimedia.org/r/164588 (owner: 10Andrew Bogott) [17:41:25] so this would just be it getting slammed repopulating the cache? [17:41:30] I'm not convinced about that, as the load on the API cluster is not very high [17:41:37] yea they are green and yellow in ganglia [17:41:47] and the load on the parsoid cluster is normal too after a spike [17:42:08] gwicke: the rate of those "0 retries" lines is unchanged from the previous log rotation [17:42:15] ~10-20% seems to be the norm [17:42:16] hmmm [17:42:16] I don't understand why tungsten has a jobqueue [17:42:28] bblack: let me double-check that the logging isn't misleading here [17:42:35] it might have changed since I last looked at it [17:42:46] <_joe_> gwicke: did you cycle restart parsoid nodes already [17:43:00] at least on wtp1009 [17:43:07] <_joe_> I strongly suspect that spike at 100% cpu didn't do anything good to them [17:43:10] andrewbogott: that's probably just graphite checks *on* tungsten for jobqueues elsewhere [17:43:22] _joe_: yes, I did [17:43:22] andrewbogott: similar to labmon1001 having graphite checks for things on labs [17:43:34] Ah, that makes sense. So is the jobqueue troubled? [17:43:57] 28 doesn't seem like very much, unless it's x10000 or something [17:44:18] e.g. parsoid.log.2.gz: '0 retries' is 386268 / 3220837 . Last 10000 lines currently is 2184 / 10000 [17:44:22] so the thing that makes me suspicious is that there aren't any countdowns [17:44:29] which is a little higher, but doesn't seem unreasonable [17:44:33] <_joe_> ok sorry but it's to late in my tz, see you on monday [17:44:37] normally it starts with '6 retries remaining' [17:44:41] _joe_: have a nice weekend [17:44:43] or something like that [17:46:14] RECOVERY - RAID on analytics1010 is OK: OK: Active: 7, Working: 8, Failed: 0, Spare: 1 [17:46:31] so was the parsoid cache cleared because there was a config change? [17:46:33] (03CR) 10Yuvipanda: [C: 04-1] "Minor quibbles." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/164520 (https://bugzilla.wikimedia.org/69604) (owner: 10Hashar) [17:46:49] robh: no, because it had accumulated a lot of stale variants [17:46:57] and performance had degraded for cache misses [17:47:10] so no reason for varnish to change what its handing out [17:47:11] ok [17:47:21] so we did wipe the cache, right? did this start about that time? [17:47:39] checking admin log [17:47:46] cuz _joe_ just said 'earlier' [17:47:46] bblack: yes, it started not too long after [17:47:58] 15:34 _joe_: purging varnish cache for parsoid (RT 8528) [17:48:01] maybe we could try restarting those varnishes? 
[17:48:04] so that's the most likely proximal cause in terms of human action [17:48:50] so i can see the restart varnish commands for dsh on the page for parsoid [17:48:55] oh, wait, parsoid [17:49:08] we need to restart varnish on cp1045/1058? [17:49:18] I don't know [17:49:21] * robh hasnt done it just asking [17:49:25] shall we try? [17:49:26] =] [17:49:28] (03CR) 10Andrew Bogott: "I'd support shortening the rotation period but I have mixed feelings about moving the logs. If /var fills up we can still at least log in" [puppet] - 10https://gerrit.wikimedia.org/r/164520 (https://bugzilla.wikimedia.org/69604) (owner: 10Hashar) [17:49:33] it could only make it so much worse! [17:49:39] I don't think stabbing at things is helpful if it's not an emergency [17:49:50] robh: yes [17:50:00] we need better data on what is failing how, all parsoid is telling us is that something went wrong, somewhere [17:50:19] bblack: the parsoid folks haven't gotten to logstash yet [17:50:24] * robh is chekcing cp1045 if it has any telling log errors [17:50:31] so we're stabbing. maybe it's network. maybe it's lvs. maybe it's varnish needs a reboot (prod varnish hasn't ever needed reboots) [17:50:38] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [17:50:56] can we trace a request that fails somehow and find out how/why it fails? [17:51:13] bblack: we just cleared about 400G of persistent varnish cache per box that has built up over a year or so [17:51:28] it's conceivable that something didn't work perfectly [17:51:30] which should make it faster! :) [17:51:43] but in general, suspecting a varnish code bug isn't a good bet [17:52:00] I'll go stab just to make a point :p [17:52:03] well, we know that the variants are a bug [17:52:19] I haven't kept up with development to see if it's fixed in 4.0 [17:53:03] andrewbogott: I don't think https://gerrit.wikimedia.org/r/#/c/164520/2 puts the logs in /etc, just the config file that says how long to keep the logs in /etc [17:53:09] !log restarted parsoid varnishes [17:53:19] Logged the message, Master [17:53:44] the variants thing are a possible issue with our VCL and how we're using varnish, they're not a varnish code bug other than "suboptimal performance in a suboptimal scenario" [17:53:45] bblack: https://www.varnish-cache.org/trac/wiki/VCLExampleEnableForceRefresh [17:53:55] (03CR) 10Andrew Bogott: "My mistake, that's exactly what this patch does. Nevermind!" [puppet] - 10https://gerrit.wikimedia.org/r/164520 (https://bugzilla.wikimedia.org/69604) (owner: 10Hashar) [17:54:01] "The downside of this approach is that it will not free up the older objects until they expire, as of Varnish 3.0.2. This is considered a flaw and a fix is expected." [17:54:27] alright, a flaw then ;) [17:54:39] well either way, a restart ain't gonna fix it [17:54:56] the cache is persistent, remember? unless there's a real bug. [17:55:33] alright, we are going to make proper monitoring + dispatching to logstash a high priority in the next week. [17:56:26] we thought we would wait till after qr since we fixed the previous memory issue and deployed it, but that was a bit too long of a wait (based on this incident today). 
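On the acct thread that resurfaces above (hashar's change 164520): process accounting appends to /var/log/account/pacct, and the retention knob is a logrotate snippet under /etc, which is andrewbogott's point that the patch only touches the config file while the logs stay in /var. A minimal sketch of the shape of such a change; the labs-only guard and the rotate count are assumptions, not the merged values:

    # Sketch: keep fewer compressed pacct archives on labs instances so
    # accounting logs cannot fill /var (the deployment-salt incident
    # hashar mentions in his review comment).
    if $::realm == 'labs' {
        file { '/etc/logrotate.d/acct':
            ensure  => present,
            owner   => 'root',
            group   => 'root',
            mode    => '0444',
            content => "/var/log/account/pacct {\n    daily\n    rotate 2\n    compress\n    notifempty\n}\n",
        }
    }
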
[17:56:52] so externally things seem to work okay [17:57:38] so maybe it was a false alarm, based on misleading parsoid logging [17:57:56] I saved a log at the time of the cpu spike for later checking [17:58:54] ok, my (unverified) theory is that we still have some problematic pages, and the cache purge caused the spike on the cluster by triggering re-parses of those pages and locking up cores over time. [17:59:49] but, we are going to add logging now to track pages that continue to parse beyond 60 sec. [18:00:43] thanks gwicke and ops for investigating. [18:01:30] subbu: logging at least a little bit of the API request error message would be helpful [18:02:28] yes. [18:03:24] so on to other mysteries, what's up with OCG health? [18:03:27] CRITICAL: ocg_job_status 297914 msg (>=100000 critical): ocg_render_job_queue 0 msg [18:03:38] on all 3x ocg boxes for the past two days straight, basically [18:03:54] is it fine and we just need to raise that limit? [18:04:07] (03CR) 10Nuria: Allow local server-status requests on http (031 comment) [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/164583 (https://bugzilla.wikimedia.org/71606) (owner: 10QChris) [18:05:05] cscott: ^^ [18:06:31] bblack: yes, we just need to raise that limit [18:06:39] bblack: i've been discussing this with jgreen [18:06:45] ok [18:06:59] the ocg servers aren't at steady state yet, they just started gc'ing stuff yesterday evening after being put in production on monday [18:07:23] (04:28:53 PM) cscott-free: with 3 day expiry, i expect our job status count and cache size to level off tomorrow, so we should be able to set some new reasonable triggers then. [18:07:23] (04:29:02 PM) Jeff_Green: k [18:07:23] (04:29:30 PM) cscott-free: they are at roughly 170,000 now (current limit is 100,000). [18:07:23] (04:29:42 PM) cscott-free: that should be about 2/3 of the steady state [18:07:24] (04:30:41 PM) cscott-free: so maybe 300,000 should be within the range of normal, maybe put the warning above that level, like 350k, and then critical at twice that? [18:07:24] (04:30:57 PM) cscott-free: but we'll know better tomorrow/friday. [18:08:00] i'd probably warn at 400k and critical at 800k, looking at graphite today [18:08:36] and i'm fine if you want to go ahead and do that now. i'm going to continue keeping an eye on this over the next week to understand typical load better. [18:08:43] do we have any idea about correlating that to the actual capacity of the backends? it might be better to set the limits based on that, so we know when to add more [18:09:35] bblack: well, the job status queue is just a redis set. is there anyone in #ops who really understands redis scaling? [18:10:28] not me! :) [18:10:30] bblack: i'm not concerned with the cache size as a measure of capacity. we seem to have enough disk allocated among our three current servers to cache about 3 days' worth of requests, but that's only giving us a 25% hit rate. [18:11:20] so i'd say that we're probably already slightly underprovisioned there in terms of cache utilization, but it's still early days yet. [18:11:42] (03PS1) 10BBlack: Raise ocg health check limits to 400k/800k [puppet] - 10https://gerrit.wikimedia.org/r/164602 [18:11:43] and the cache is strictly FIFO; it's possible we'd do slightly better with an LRU cache or some such.
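On cscott's FIFO-versus-LRU point: the two policies differ only in what they evict when full. FIFO drops the oldest insertion regardless of popularity; LRU refreshes an entry's position on every hit. A minimal sketch of both in Python, purely for illustration (this is not OCG's actual cache code, which expires by age rather than size):

```python
from collections import OrderedDict

class FIFOCache:
    """Evicts the oldest *insertion*; hits don't extend an entry's life."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
    def get(self, key):
        return self.data.get(key)            # no reordering on a hit
    def put(self, key, value):
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # drop the oldest insertion

class LRUCache(FIFOCache):
    """Evicts the least *recently used*; popular entries survive."""
    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)           # a hit refreshes recency
        return self.data[key]
    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # now the least recently used
```

Under LRU, a popular PDF that keeps being re-requested within the window stays cached, which is why LRU tends to raise the hit rate when a minority of documents account for most requests, exactly the scenario cscott describes.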
[18:12:01] (03CR) 10QChris: Allow local server-status requests on http (031 comment) [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/164583 (https://bugzilla.wikimedia.org/71606) (owner: 10QChris) [18:12:39] bblack: http://graphite.wikimedia.org/ click on Graphite.ocg.pdf.status_objects.value to play along at home. [18:13:14] look at the past two days, and you can see the first gc kick in at 8pm yesterday (UTC, I assume) [18:13:33] (03PS2) 10BBlack: Raise ocg health check limits to 400k/800k [puppet] - 10https://gerrit.wikimedia.org/r/164602 [18:14:09] (03CR) 10Cscott: [C: 031] Raise ocg health check limits to 400k/800k [puppet] - 10https://gerrit.wikimedia.org/r/164602 (owner: 10BBlack) [18:14:09] LRU does sound like probably a big win [18:14:41] !log maxsem Synchronized php-1.25wmf1/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/#/c/164543/ (duration: 00m 04s) [18:14:44] jenkins is being lazy [18:14:47] Logged the message, Master [18:14:53] !log maxsem Synchronized php-1.25wmf2/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/#/c/164543/ (duration: 00m 04s) [18:14:59] Logged the message, Master [18:15:00] depends on how frequently content is reused w/in the 3 day window. if we're serving the same popular file 10 times in 3 days, then it's only 10% overhead to re-generate it once every 3 days. [18:15:08] (03CR) 10BBlack: [C: 032 V: 032] Raise ocg health check limits to 400k/800k [puppet] - 10https://gerrit.wikimedia.org/r/164602 (owner: 10BBlack) [18:15:23] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2039: active_shards: 5962: relocating_shards: 0: initializing_shards: 4: unassigned_shards: 148 [18:15:33] but yes, potentially a win. [18:16:51] (03PS1) 10Andrew Bogott: Increase /var partition to 10g for new instances. [puppet] - 10https://gerrit.wikimedia.org/r/164605 [18:17:37] Hi, I'm Ldorfman (Liron Dorfman) from the Hebrew Wikipedia [18:17:59] Got this error for more than 12 hours already: Request: POST http://he.wikipedia.org/w/index.php?title=%D7%91%D7%A8%D7%A7_%D7%90%D7%95%D7%91%D7%9E%D7%94&action=submit, from 10.64.0.105 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 2487612769 Forwarded for: 85.64.161.132, 91.198.174.103, 208.80.154.77, 10.64.0.105 Error: 503, Service Unavailable at Fri, 03 Oct 2014 18:13:58 GMT [18:18:02] Request: POST http://he.wikipedia.org/w/index.php?title=%D7%91%D7%A8%D7%A7_%D7%90%D7%95%D7%91%D7%9E%D7%94&action=submit, from 10.64.0.105 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 2487612769 Forwarded for: 85.64.161.132, 91.198.174.103, 208.80.154.77, 10.64.0.105 Error: 503, Service Unavailable at Fri, 03 Oct 2014 18:13:58 GMT [18:18:12] Is this the place to report it anyway? [18:18:49] Ldorfman: this may be the right place! In theory Coren is on support rotation today, although someone else may also be able to help you. [18:18:57] (03CR) 10Gage: [C: 031] "yay!" [puppet] - 10https://gerrit.wikimedia.org/r/164605 (owner: 10Andrew Bogott) [18:19:10] Thanks [18:19:19] So, do I have to do something else? [18:19:27] Ldorfman, did you enable the HHVM beta feature? [18:19:44] No, I guess... I don't know what that is? [18:19:47] ... [18:19:55] then you probably didn't :) [18:19:57] (03PS2) 10Andrew Bogott: Increase /var partition to 10g for new instances. [puppet] - 10https://gerrit.wikimedia.org/r/164605 [18:20:16] andrewbogott: ^ yay!
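The series bblack points at above can also be pulled programmatically through graphite's render API (format=json is standard graphite); the metric path is inferred from the tree name quoted in the discussion and should be treated as an assumption:

```python
import json
import urllib.request

# Sketch: pull the OCG status-object series from graphite's render API
# and print the most recent non-null datapoint. The metric path below
# is inferred from the tree name quoted in the channel.
URL = ('http://graphite.wikimedia.org/render'
       '?target=ocg.pdf.status_objects.value&from=-2days&format=json')

with urllib.request.urlopen(URL) as resp:
    series = json.loads(resp.read().decode())

for s in series:
    points = [p for p in s['datapoints'] if p[0] is not None]
    if points:
        value, ts = points[-1]
        print('%s: %s at %s' % (s['target'], value, ts))
```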
[18:20:21] I mean, I haven't made any special settings changes in the last days [18:20:26] YuviPanda: it hasn't worked yet... [18:20:32] takes ~ an hour to test [18:20:40] cscott: sorry, my client was muted and window buried [18:20:42] ah ok [18:20:51] no worries [18:20:53] …and that's if Jenkins ever tests [18:21:19] bblack, ^^ varnish errors [18:22:15] cscott: new one: /srv is almost maxed on ocg102 [18:22:18] err ocg1002 [18:22:43] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/164605 (owner: 10Andrew Bogott) [18:22:59] that ocg1002 is always getting into trouble [18:23:23] yeah that's odd isn't it [18:23:41] maybe the round robin isn't so round [18:23:41] looks similar to yesterday, huge files 'deleted' but still open by some process [18:23:54] seems like a pointy robin [18:23:54] in /srv this time? [18:23:58] yeah [18:24:07] lsof|grep srv|grep deleted [18:24:27] So, can I leave you now? Take my mail for any questions. It's Liron@wikimedia.org.il . [18:24:40] ocg1001 looks similar actually [18:25:16] ocg1003 looks ok, but hasn't gotten the collector threshold adjustments yet [18:25:36] i think lsof might be lying to you somehow [18:25:38] running puppet there [18:25:42] well df is saying the same thing [18:25:43] !log aaron Synchronized php-1.25wmf2/extensions/CentralAuth: (no message) (duration: 00m 05s) [18:25:45] Is Jenkins broken for everyone or just me? [18:25:49] df shows 99% inode utilization [18:25:52] Logged the message, Master [18:25:54] still pretty amusing [18:25:56] oop [18:25:58] Something I may add: I can make any other edits in other articles [18:26:14] Ldorfman: so it's just this one URL? [18:26:15] i just restarted ocg on ocg1002, which got rid of all the deleted messages in lsof, but didn't free up any space on /srv [18:26:16] The problem is in a specific article from what I saw till now [18:26:23] yes [18:26:34] Ah, there he is, I just had to complain [18:26:36] cscott: nm re inodes, i was reading it wrong [18:26:45] but block use is 96% [18:26:48] (03CR) 10Andrew Bogott: [C: 032] Increase /var partition to 10g for new instances. [puppet] - 10https://gerrit.wikimedia.org/r/164605 (owner: 10Andrew Bogott) [18:27:25] yeah, the cache lifetime is still too long I guess. cf bblack's question earlier about whether we are underprovisioned. [18:27:38] It's the one about president Obama and someone added things we have to correct. [18:27:48] yo yo SF peeps [18:28:01] If anybody wants to come to the Mesosphere BBQ tonight all are welcome [18:28:18] http://www.evite.com/l/9tmxnER5Py [18:28:39] thanks preilly :) [18:28:46] cscott: I dunno how to answer that unless we can collect stats on how often those files are requested [18:29:19] yeah, i'm trying to collect that. really i need to log all requests over some period of time and make histograms or some such. [18:29:26] jgage: yeah no worries lot’s of good food and drink and Ryan Lane should be around too [18:29:33] can i export data from logstash easily? [18:29:43] Well, Brandon, I will go back to my regular "patrolling duties". If I get anything else, I'll come back. [18:29:56] cscott: well... maybe? [18:30:05] cscott: disk utilization is 305G/output vs 131G/postmortem if that's useful [18:30:17] Jeff_Green: oh, the postmortem directory is 130G on ocg1002. that should certainly have a smaller lifetime, i'll have to tweak that. [18:30:23] k [18:30:28] in the meantime, it's always safe to nuke postmortem in an emergency. [18:30:33] : OK? 
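On the lsof-versus-df confusion above: the space pinned by deleted-but-still-open files can be tallied from /proc directly, without trusting lsof's formatting. A rough sketch (Linux-specific; run as root to inspect every process):

```python
import os
import stat

# Sketch: total bytes pinned by regular files that are deleted
# (st_nlink == 0) but still open via some process's fd -- what
# `lsof | grep srv | grep deleted` was hinting at on the ocg hosts.
total = 0
for pid in filter(str.isdigit, os.listdir('/proc')):
    fd_dir = '/proc/%s/fd' % pid
    try:
        fds = os.listdir(fd_dir)
    except OSError:                 # raced with process exit, or no permission
        continue
    for fd in fds:
        try:
            st = os.stat(os.path.join(fd_dir, fd))   # follows the fd link
        except OSError:
            continue
        if stat.S_ISREG(st.st_mode) and st.st_nlink == 0:
            total += st.st_size
print('space pinned by deleted-but-open files: %.1f GB' % (total / 1e9))
```

If this total is small while df still shows the filesystem nearly full, the usage is real files on disk, which matches what the channel eventually found: the 130G postmortem directory, not phantom deleted handles.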
[18:30:41] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2039: active_shards: 5993: relocating_shards: 0: initializing_shards: 4: unassigned_shards: 117 [18:30:44] Ldorfman: can you submit another edit attempt? (also I see others have edited while you've been failing?) [18:30:45] cscott: It would be possible to query the backing elasticsearch instance to get documents out [18:30:45] (which is what i'm going to do now) [18:30:47] cscott: cool, we should put that in motd :-) [18:30:58] Jeff_Green: yes, could you puppetize that? [18:31:03] Ldorfman: I have some logging running to try to see the failure [18:31:53] OK. Let me try again [18:32:04] cscott: I'll make a note to self. [18:32:06] part of the issue is that mwalker coded a time-based expiry for the cache, rather than a size-based one. it's not trivial, but i should probably refactor this so that we have a constant 200G cache (or some such) and just vary our hit rate. [18:32:44] so many things on the to-do list. [18:33:26] cscott: indeed [18:33:43] Well, it happened again. [18:33:52] cscott: or yet another case for the invention of an LRU filesystem [18:35:07] Ldorfman: yeah I captured 6 or so hits that time [18:35:39] they all look the same [18:35:55] Tried something a bit different now - only part of the edit I tried to make. [18:36:14] ... and still: Request: POST http://he.wikipedia.org/w/index.php?title=%D7%91%D7%A8%D7%A7_%D7%90%D7%95%D7%91%D7%9E%D7%94&action=submit, from 10.64.0.103 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 2489374784 Forwarded for: 85.64.161.132, 91.198.174.103, 208.80.154.77, 10.64.0.103 Error: 503, Service Unavailable at Fri, 03 Oct 2014 18:35:37 GMT [18:36:34] basically every time you submit that, it gets through all the varnish layers, and when it gets to the part where it makes a request to mediawiki, all I get is: [18:36:44] 441 BackendClose b ipv4_10_2_2_1 [18:37:00] (as in, we send all the request headers to mediawiki, and then it just aborts the connection and doesn't respond) [18:37:40] You know what - maybe I should try doing it in IE (I use Chrome) [18:37:54] maybe if I get it all fresh it may work... [18:38:06] it does seem strange that others can edit it and you can't [18:38:23] or try chrome with a fresh incognito window and log in again? [18:38:38] either way there's still a bug somewhere on our end [18:40:32] * Jeff_Green changing venues, BIAB [18:40:37] OK... I'm trying in IE [18:42:09] let's see now... [18:42:15] (03PS2) 10Chad: Another ES node script: restart a node! [puppet] - 10https://gerrit.wikimedia.org/r/164401 [18:42:17] (03PS1) 10Chad: Adding tools for banning/unbanning an ES node [puppet] - 10https://gerrit.wikimedia.org/r/164617 [18:42:58] from the processing time, it doesn't look good.... [18:43:05] (03CR) 10jenkins-bot: [V: 04-1] Adding tools for banning/unbanning an ES node [puppet] - 10https://gerrit.wikimedia.org/r/164617 (owner: 10Chad) [18:43:22] Well, indeed - got the same thing: Request: POST http://he.wikipedia.org/w/index.php?title=%D7%91%D7%A8%D7%A7_%D7%90%D7%95%D7%91%D7%9E%D7%94&action=submit, from 10.64.32.105 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 2489932913 Forwarded for: 85.64.161.132, 91.198.174.103, 208.80.154.77, 10.64.32.105 Error: 503, Service Unavailable at Fri, 03 Oct 2014 18:42:30 GMT [18:43:56] Let me try something... [18:45:17] Didn't work out...
[18:45:20] )-: [18:45:53] very weird... [18:46:31] !log restored dbtree from manual backup (should have been synced by scripts) [18:46:33] tgr: looks like the missing file script is on the h's...so a few more days to go [18:46:36] es = external storage? [18:46:37] Logged the message, Master [18:46:42] and now elasticsearch as well? [18:46:47] Krinkle: that, but in some cases elastic search [18:46:47] (03PS1) 10Ori.livneh: mediawiki::hhvm: add 'furl' cli tool [puppet] - 10https://gerrit.wikimedia.org/r/164619 [18:47:01] mutante: do we not have a different abbreviation for it? What are the hosts called? [18:47:19] had a bot trigger for that, but it's gone :/ [18:47:22] !es [18:47:34] external storage hosts are esXXXX, elastic search hosts are elasticsearchXXXX [18:47:54] i too dislike the ambiguous es acronym [18:48:11] see here https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Cluster_Servers [18:48:18] oops they're elasticXXXX [18:48:59] Ldorfman: I'd say file a bugzilla bug, maybe note that it was discussed on IRC and doesn't seem to directly be a varnish problem (mediawiki gets the request and closes the connection without responding) [18:49:11] someone more familiar with the app layer will have to investigate it [18:49:33] es is never elasticsearch [18:49:53] Krinkle: https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions [18:50:07] mutante: ^ es != elastic [18:50:12] if it does someplace, it needs to change now. [18:50:34] k :) [18:50:40] ^demon|away: :P [18:50:48] es is a name for elasticsearch outside of wmf, so we will always have this problem [18:50:59] "ES node" (Chad's commit earlier) also made me think of javascript (what doesn't..) [18:51:00] some config options are called es.*, there's a hadoop-es project, etc [18:51:02] ECMAScript Node [18:51:07] heh [18:51:08] i'm sure it is es in a lot of scripts, wasn't talking about just server names [18:51:16] ah, ok [18:51:28] i get what you mean [18:52:27] bblack, can you file the bugzilla bug for me? [18:52:39] <^demon|away> I don't like es = external store since we now have elasticsearch. [18:52:47] <^demon|away> It's a little confusing, but too late to change things now. [18:52:53] I'm not used to it, so it may take a longer time for me... [18:53:28] I will appreciate it. [18:53:33] well, external storage called dibs on it, though. [18:54:01] Undefined variable: groups in /srv/mediawiki/php-1.25wmf1/extensions/CentralAuth/specials/SpecialCentralAuth.php on line 264 [18:54:18] csteipp: I wonder what that var should be [18:55:32] <^demon|away> Krinkle: Yeah, shit happens :) [18:55:41] https://github.com/wikimedia/mediawiki-extensions-CentralAuth/commit/1a858c57b86dc85e99a3e91ee496af185d0aa25d [18:56:01] bblack: OK? [18:56:11] https://gerrit.wikimedia.org/r/#/c/162809/ [18:59:39] (03CR) 10Nemo bis: "IIRC before this change the situation was better, though I have no idea why and it might be a coincidence. https://bugzilla.wikimedia.org/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154234 (owner: 10Tim Starling) [19:02:20] bblack: Well, I just managed to make the needed change in the text as an anonymous user (logged out of my account)! [19:02:49] heh [19:03:01] Ldorfman: I'm not familiar with filing bugzilla either :) [19:03:11] So, that's fine with me, content-wise. [19:03:58] If someone else can open a case on this issue or go on with checking it, fine with me... I'll leave you for now.
[19:04:00] bye [19:05:32] !log purging old 'searchqa' scripts and logs from iron (gerrit 164429 removes from puppet) [19:05:41] Logged the message, Master [19:06:27] (03PS2) 10Dzahn: remove old 'searchqa' classes and files [puppet] - 10https://gerrit.wikimedia.org/r/164429 [19:08:37] (03PS1) 10Andrew Bogott: Move /var/log out of /var [puppet] - 10https://gerrit.wikimedia.org/r/164626 [19:08:39] Coren: ^ ? I'm still reading about lvm but probably won't get that sorted today, and I need a new image fairly soon to handle the ldap changes. [19:08:42] !log ran "sudo -u ocg -g ocg nodejs-ocg scripts/run-garbage-collect.js -c /home/cscott/config.js" on ocg100x boxes to clean up cache before the weekend [19:09:24] Coren: If I were to set up lvm, it would entail creating a big partition in lvmbuilder and then chopping it up with puppet on firstrun? Or do you think there's a way to make it part of the original image? [19:09:59] It can be part of the original image; IIRC [19:10:32] (03CR) 10coren: [C: 032] "This is a reasonable compromise." [puppet] - 10https://gerrit.wikimedia.org/r/164626 (owner: 10Andrew Bogott) [19:13:12] andrewbogott: Can you give a nod to https://gerrit.wikimedia.org/r/#/c/164370/ ? It's cosmetic but will make Mark happy. :-) [19:14:10] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2039: active_shards: 6057: relocating_shards: 0: initializing_shards: 4: unassigned_shards: 53 [19:14:14] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2039: active_shards: 6057: relocating_shards: 0: initializing_shards: 4: unassigned_shards: 53 [19:14:14] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2039: active_shards: 6057: relocating_shards: 0: initializing_shards: 4: unassigned_shards: 53 [19:14:14] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2039: active_shards: 6057: relocating_shards: 0: initializing_shards: 4: unassigned_shards: 53 [19:14:41] (03CR) 10Andrew Bogott: [C: 032] "/me bites back impulse to bikeshed over a motd that no users ever see anyway" [puppet] - 10https://gerrit.wikimedia.org/r/164370 (owner: 10coren) [19:16:03] (03CR) 10Dzahn: [C: 032] "deleted. 
servers were just iron, in this case, not actual search servers since they were just tools for ops" [puppet] - 10https://gerrit.wikimedia.org/r/164429 (owner: 10Dzahn) [19:19:25] (03PS1) 10Ottomata: Fix typo in hive role that caused hive to not work in labs [puppet] - 10https://gerrit.wikimedia.org/r/164628 [19:26:00] unsurprisingly I think I killed Zuul :-D [19:26:08] !log Zuul in some kind of death loop [19:26:14] Logged the message, Master [19:26:57] :/ [19:28:16] greg-g: https://gerrit.wikimedia.org/r/#/c/164629/ :D [19:28:21] heavily spammed [19:28:27] !log Restarting Zuul sorry :-/ [19:28:33] Logged the message, Master [19:29:19] :( [19:29:29] hashar :) damn :P [19:29:40] FlorianSW: sorry :/ [19:29:58] FlorianSW: you want to amend that commit, remove the Change-Id so it generates a new Change in Gerrit :D [19:29:59] hashar: no problem :D Just resorting my inbox :P [19:30:15] 101% my fault [19:30:18] hashar: i want? ... yeah, i want :D [19:31:26] hashar: https://gerrit.wikimedia.org/r/#/c/164630/ :) [19:31:35] abandon the old? [19:31:47] FlorianSW: yeah [19:32:00] hashar: ok :) [19:32:00] FlorianSW: you probably don't want to have to scroll pages and pages of Jenkins-bot spam [19:33:50] andrewbogott: ^ [19:33:51] err [19:33:52] (03PS2) 10Ori.livneh: mediawiki::hhvm: add 'furl' cli tool [puppet] - 10https://gerrit.wikimedia.org/r/164619 [19:33:57] (03PS1) 10Yuvipanda: toollabs: Remove gridengine ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/164631 [19:33:58] andrewbogott: ^ [19:34:07] andrewbogott: we can merge this and then remove them by hand, I suppose [19:34:08] (03CR) 10Ori.livneh: [C: 032 V: 032] mediawiki::hhvm: add 'furl' cli tool [puppet] - 10https://gerrit.wikimedia.org/r/164619 (owner: 10Ori.livneh) [19:34:12] (03PS2) 10Andrew Bogott: toollabs: Remove gridengine ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/164631 (owner: 10Yuvipanda) [19:34:35] YuviPanda: we want to monitor the grid though, don't we? [19:34:40] Do you think that never worked? [19:34:43] andrewbogott: yes, I need to write a diamond collector for it [19:34:46] !log when running puppet merge: fatal: Unable to create '/var/lib/git/operations/puppet/.git/refs/remotes/origin/production.lock': File exists. [19:34:51] ok [19:34:52] Logged the message, Master [19:34:56] andrewbogott: this might've worked when we had ganglia running, but ganglia on labs has been dead for.... as long as I can remember [19:35:03] whoa [19:35:11] (03CR) 10Andrew Bogott: [C: 032] "This is spamming me so I'm anxious to see it merged" [puppet] - 10https://gerrit.wikimedia.org/r/164631 (owner: 10Yuvipanda) [19:35:14] YuviPanda: since the migration to eqiad iirc [19:35:19] hashar: ah, right [19:35:21] I attempted to rebuild it but gave up [19:35:29] hashar: yeah, but we've got decent graphite now :) [19:35:46] Jeff_Green: $ stat /srv/deployment/ocg/output/ff [19:35:46] Modify: 46726-01-20 04:58:19.000000000 +0000 [19:35:49] exciting! [19:35:58] such future [19:36:00] hashar, YuviPanda, can I kill and delete the labs ganglia project?
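For the diamond collector YuviPanda mentions needing: diamond collectors are Python classes that subclass diamond.collector.Collector and publish values from a collect() method. A hypothetical skeleton for gridengine follows; the qstat output parsing is invented for illustration and untested against real SGE output:

```python
import subprocess

import diamond.collector


class GridEngineCollector(diamond.collector.Collector):
    """Hypothetical sketch: publish gridengine job counts via diamond."""

    def collect(self):
        # `qstat -u '*'` lists jobs for all users; skip the two header
        # lines and read the state column (r, qw, Eqw, ...). The column
        # index is an assumption.
        out = subprocess.check_output(['qstat', '-u', '*']).decode()
        states = []
        for line in out.splitlines()[2:]:
            cols = line.split()
            if len(cols) > 4:
                states.append(cols[4])
        self.publish('jobs.running', sum(s.startswith('r') for s in states))
        self.publish('jobs.pending', sum(s.startswith('qw') for s in states))
```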
[19:36:08] It still has a bit of cruft sitting there [19:36:26] +1 [19:36:29] andrewbogott: it might still potentially be used for testing ganglia changes in prod, but I've never touched it [19:36:44] I'd see if it has been touched in the last few months by anyone, and if not, killlll [19:37:14] some part of it is not puppetized so I guess most changes occur directly in prod [19:37:20] heh [19:37:28] quelle surprise [19:38:12] hm, looks like no instances in the project now, so I'll just ignore [19:38:16] andrewbogott: ah, ok [19:38:24] andrewbogott: can you remove the files and the cron entry as well? [19:40:03] YuviPanda: sure, presuming it's only on one node [19:40:14] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Puppet has 1 failures [19:40:19] andrewbogott: yeah, tools-master, I think [19:40:23] tools-shadow might also have it [19:43:21] (03PS2) 10Jforrester: Switch from SpecialCite to CiteThisPage on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158121 [19:43:42] (03Abandoned) 10Jforrester: Switch SpecialCite out for CiteThisPage on phase0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158120 (owner: 10Jforrester) [19:43:47] (03PS2) 10Ottomata: Fix typo in hive role that caused hive to not work in labs [puppet] - 10https://gerrit.wikimedia.org/r/164628 [19:43:49] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:43:57] (03CR) 10Ottomata: [C: 032] Fix typo in hive role that caused hive to not work in labs [puppet] - 10https://gerrit.wikimedia.org/r/164628 (owner: 10Ottomata) [19:44:05] (03CR) 10Ottomata: [V: 032] Fix typo in hive role that caused hive to not work in labs [puppet] - 10https://gerrit.wikimedia.org/r/164628 (owner: 10Ottomata) [19:44:41] (03CR) 10Jforrester: "PS2 is a rebase to squash I6be938e5 into this and onto master." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158121 (owner: 10Jforrester) [19:44:58] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [19:45:38] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/164631 (owner: 10Yuvipanda) [19:46:20] (03PS1) 10Manybubbles: [beta only] Add wikimedia-extra plugin preview [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/164633 [19:46:52] (03PS1) 10Hashar: Restore Icinga alarms for contint to #wikimedia-operations [puppet] - 10https://gerrit.wikimedia.org/r/164635 [19:52:50] (03PS2) 10Yuvipanda: icinga: Restore alarms for contint to #wikimedia-operations [puppet] - 10https://gerrit.wikimedia.org/r/164635 (owner: 10Hashar) [19:52:57] (03CR) 10Yuvipanda: [C: 031] icinga: Restore alarms for contint to #wikimedia-operations [puppet] - 10https://gerrit.wikimedia.org/r/164635 (owner: 10Hashar) [19:58:28] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [20:02:28] (03PS1) 10Andrew Bogott: Followup to f2b7cfc5e0d1f5aa7f38f54360f13cf79a20be6b [puppet] - 10https://gerrit.wikimedia.org/r/164643 [20:03:00] Is jenkins broken or just very sleepy today? [20:04:07] cscott: ooh what happened there? [20:04:22] (03PS3) 10RobH: settting codfw es servers mgmt [dns] - 10https://gerrit.wikimedia.org/r/164215 [20:04:43] apparently the nodejs `utimes` syscall wants seconds, not milliseconds.
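That `utimes` discovery explains cscott's "46726-01-20" mtime above: a millisecond timestamp fed to an API that expects seconds lands tens of thousands of years in the future. The same mistake is easy to reproduce in Python, where os.utime likewise takes seconds since the epoch (the file path here is hypothetical):

```python
import os
import time

# Sketch: reproduce the ms-vs-seconds mtime bug on a scratch file.
path = '/tmp/utimes-demo'   # hypothetical path
open(path, 'w').close()

now_s = time.time()         # seconds since the epoch, what utime expects
now_ms = now_s * 1000       # milliseconds -- the buggy value

# Passing milliseconds where seconds are expected (needs 64-bit time_t):
os.utime(path, (now_ms, now_ms))
mtime = os.stat(path).st_mtime
print('mtime is ~%d years in the future' %
      ((mtime - now_s) / (365.25 * 24 * 3600)))
# prints something like "mtime is ~44000 years in the future",
# i.e. a date on the order of cscott's 46726-01-20
```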
[20:05:15] https://gerrit.wikimedia.org/r/164638 [20:06:19] (03PS2) 10Manybubbles: [beta only] Deploy preview versions of plugins [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/164633 [20:12:43] (03CR) 10RobH: [C: 032] settting codfw es servers mgmt [dns] - 10https://gerrit.wikimedia.org/r/164215 (owner: 10RobH) [20:12:59] jenkins why you so slow [20:13:02] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2040: active_shards: 6115: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [20:13:32] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2040: active_shards: 6115: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [20:13:45] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2040: active_shards: 6115: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [20:13:45] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2040: active_shards: 6115: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [20:13:46] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2040: active_shards: 6115: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [20:13:47] (03CR) 10RobH: [V: 032] settting codfw es servers mgmt [dns] - 10https://gerrit.wikimedia.org/r/164215 (owner: 10RobH) [20:13:52] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2040: active_shards: 6115: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [20:20:48] (03PS1) 10RobH: setting codfw es server install parameters [puppet] - 10https://gerrit.wikimedia.org/r/164648 [20:21:10] (03PS2) 10Ori.livneh: Restore "Set bloom cache config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162333 (owner: 10Aaron Schulz) [20:21:35] (03PS3) 10Ori.livneh: Restore "Set bloom cache config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162333 (owner: 10Aaron Schulz) [20:21:52] (03CR) 10Ori.livneh: [C: 032] Restore "Set bloom cache config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162333 (owner: 10Aaron Schulz) [20:22:18] (03CR) 10Ori.livneh: [V: 032] Restore "Set bloom cache config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162333 (owner: 10Aaron Schulz) [20:22:56] !log ori Synchronized wmf-config/mc.php: Ie1ed821a7: Set bloom cache config (duration: 00m 03s) [20:23:04] Logged the message, Master [20:23:20] (03PS2) 10RobH: setting codfw es server install parameters [puppet] - 10https://gerrit.wikimedia.org/r/164648 [20:23:51] !log restarted zuul [20:23:56] Logged the message, Master [20:24:11] So is zuul not showing stats on integration a known issue? [20:24:21] oh, wait, now it's showing, had an error for the first minute... [20:24:28] robh: I just restarted it a second ago, it had crashed [20:24:38] I think [20:24:42] ahh [20:24:43] ok [20:25:01] and yeah... i just ask shit and don't read the backlog like I should ;] [20:25:16] even though it was the last thing put into channel =P [20:25:48] (03PS1) 10Dzahn: add check if salt-minion is running to base [puppet] - 10https://gerrit.wikimedia.org/r/164651 [20:26:26] robh: I'm still not sure tests are actually running though... [20:26:33] they don't seem to be [20:26:39] the build queue shows a bunch of clocks. [20:26:42] in the interface [20:26:46] https://integration.wikimedia.org/ci/ [20:26:57] and idle slots [20:27:02] some stuff is moving though [20:27:20] (03CR) 10Dzahn: "@palladium:~# /usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 --ereg-argument-array '^/usr/bin/python /usr/bin/salt-minion'" [puppet] - 10https://gerrit.wikimedia.org/r/164651 (owner: 10Dzahn) [20:27:22] (03PS2) 10Andrew Bogott: Followup to f2b7cfc5e0d1f5aa7f38f54360f13cf79a20be6b [puppet] - 10https://gerrit.wikimedia.org/r/164643 [20:27:27] andrewbogott: something isn't right [20:27:33] (03PS3) 10Andrew Bogott: toollabs: Remove gridengine ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/164631 (owner: 10Yuvipanda) [20:27:33] oh, wait, it's going [20:27:39] suddenly the operations puppet queue just filled [20:27:41] andrewbogott: it's ok [20:27:49] yeah, must just take a few to get going [20:27:51] stuff just started flying through [20:27:57] \o/ [20:28:13] not mine yet... but it's working [20:28:13] (03PS1) 10Manybubbles: [beta] More cirrus config updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164654 [20:28:18] (03CR) 10Andrew Bogott: [V: 032] Followup to f2b7cfc5e0d1f5aa7f38f54360f13cf79a20be6b [puppet] - 10https://gerrit.wikimedia.org/r/164643 (owner: 10Andrew Bogott) [20:28:38] (03CR) 10Andrew Bogott: [C: 032] Followup to f2b7cfc5e0d1f5aa7f38f54360f13cf79a20be6b [puppet] - 10https://gerrit.wikimedia.org/r/164643 (owner: 10Andrew Bogott) [20:28:45] anyone mind if I do another beta deploy?
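Dzahn's check_procs invocation above (-w 1:1 -c 1:1 with an argument regex) alerts unless exactly one matching salt-minion process exists. What it asserts can be expressed as a small standalone sketch against /proc; this is just an illustration of the logic, not the nagios plugin itself:

```python
import os
import re
import sys

# Sketch: exit with nagios-style codes unless exactly one process has
# a command line matching the regex from Dzahn's check.
PATTERN = re.compile(r'^/usr/bin/python /usr/bin/salt-minion')

count = 0
for pid in filter(str.isdigit, os.listdir('/proc')):
    try:
        with open('/proc/%s/cmdline' % pid, 'rb') as f:
            # argv entries are NUL-separated in /proc/<pid>/cmdline
            cmdline = f.read().replace(b'\0', b' ').decode().strip()
    except OSError:                 # process exited mid-walk
        continue
    if PATTERN.match(cmdline):
        count += 1

if count == 1:
    print('PROCS OK: 1 process')
    sys.exit(0)
print('PROCS CRITICAL: %d processes' % count)
sys.exit(2)
```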
[20:28:54] hrmm [20:29:04] the build executor slots are mostly idle [20:29:10] there is a backlog, they should all be full [20:29:20] (03CR) 10Chad: [C: 031] [beta] More cirrus config updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164654 (owner: 10Manybubbles) [20:30:05] !log ran 'sudo -u ocg -g ocg nodejs-ocg scripts/run-garbage-collect.js -c /home/cscott/config.js' from /home/cscott/ocg/mw-ocg-service in order to clear caches (working around https://gerrit.wikimedia.org/r/164644 ) [20:30:13] Logged the message, Master [20:30:20] !log (the above was on ocg100x.eqiad.wmnet) [20:30:26] Logged the message, Master [20:30:34] (03CR) 10Manybubbles: [C: 032] [beta] More cirrus config updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164654 (owner: 10Manybubbles) [20:30:43] (03Merged) 10jenkins-bot: [beta] More cirrus config updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164654 (owner: 10Manybubbles) [20:31:18] !log manybubbles Synchronized wmf-config/CirrusSearch-labs.php: noop update to sync beta configs (duration: 00m 04s) [20:31:24] Logged the message, Master [20:33:18] ACKNOWLEDGEMENT - Host pdf2 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn -- replaced by OCG [20:33:18] ACKNOWLEDGEMENT - Host pdf3 is DOWN: CRITICAL - Plugin timed out after 15 seconds daniel_zahn -- replaced by OCG [20:37:45] !log disabling puppet on rbf1002 to test bloom filter config [20:37:51] Logged the message, Master [20:37:56] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [20:38:07] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [20:40:34] (03CR) 10Catrope: "OK, that's fine, I'll split it up." [puppet] - 10https://gerrit.wikimedia.org/r/163068 (owner: 10Catrope) [20:41:16] ok, why has it still not gotten to my code review... [20:45:29] (03PS3) 10RobH: setting codfw es server install parameters [puppet] - 10https://gerrit.wikimedia.org/r/164648 [20:45:33] andrewbogott: something isn't right still [20:45:40] cuz my patchset hasn't processed at all [20:45:43] and it shows idle slots [20:46:00] robh: did you submit it before I restarted zuul? [20:46:10] I think when you restart zuul it forgets everything, so you have to 'recheck' to get back in the queue [20:46:15] oh [20:46:41] that explains it [20:46:56] * robh rebases just to get a check [20:47:05] that was fast [20:47:12] sorry about that andrewbogott ;] [20:47:31] (03CR) 10RobH: [C: 032] setting codfw es server install parameters [puppet] - 10https://gerrit.wikimedia.org/r/164648 (owner: 10RobH) [20:48:18] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [20:48:18] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[21:07:09] (03PS1) 10Ori.livneh: rbf100{1,2}: set vm.overcommit_memory = 1, per redis requirements [puppet] - 10https://gerrit.wikimedia.org/r/164668 [21:08:43] PROBLEM - Parsoid on wtp1001 is CRITICAL: Connection refused [21:09:24] (03CR) 10Aaron Schulz: [C: 031] rbf100{1,2}: set vm.overcommit_memory = 1, per redis requirements [puppet] - 10https://gerrit.wikimedia.org/r/164668 (owner: 10Ori.livneh) [21:10:10] (03PS2) 10Ori.livneh: rbf100{1,2}: set vm.overcommit_memory = 1, per redis requirements [puppet] - 10https://gerrit.wikimedia.org/r/164668 [21:12:56] (03PS1) 10BBlack: add router mon for [cm]r[12]-(codfw|ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/164669 [21:13:39] cmon jenkins I don't have all day [21:13:50] (03CR) 10BBlack: [C: 032] add router mon for [cm]r[12]-(codfw|ulsfo) [puppet] - 10https://gerrit.wikimedia.org/r/164669 (owner: 10BBlack) [21:14:31] some router alerts are likely about to pop up. don't freak out, nothing's wrong [21:15:53] RECOVERY - Parsoid on wtp1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.013 second response time [21:18:48] PROBLEM - Redis on rbf1001 is CRITICAL: Connection refused [21:22:39] and I need to step away again in a few mins [21:22:45] feel free to silence them in icinga as necc :) [21:25:58] RECOVERY - Redis on rbf1001 is OK: TCP OK - 0.003 second response time on port 6379 [21:30:12] (03PS1) 10Aaron Schulz: If persistence is neither "both" nor "rdb", disable rdb snapshots in redis [puppet] - 10https://gerrit.wikimedia.org/r/164674 [21:31:14] PROBLEM - BGP status on cr1-codfw is CRITICAL: Return code of -1 is out of bounds [21:31:15] (03PS2) 10Ori.livneh: If persistence is neither "both" nor "rdb", disable rdb snapshots in redis [puppet] - 10https://gerrit.wikimedia.org/r/164674 (owner: 10Aaron Schulz) [21:31:30] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 108, down: 12, dormant: 0, excluded: 0, unused: 0BRxe-5/3/2: down - BRxe-5/0/0: down - BRxe-5/3/1: down - Transit: ! Telia [10Gbps DF]BRxe-5/0/2: down - BRxe-5/0/1: down - BRxe-5/2/2: down - BRxe-5/1/1: down - BRxe-5/2/3: down - BRxe-5/0/3: down - BRxe-5/1/2: down - BRxe-5/1/3: down - BRxe-5/1/0: down - BR [21:31:53] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: Return code of -1 is out of bounds [21:31:54] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 63, down: 2, dormant: 0, excluded: 0, unused: 0BRem1: down - BRxe-1/3/0: down - Transit: ! Telia [10Gbps DF]BR [21:32:03] PROBLEM - BGP status on cr2-codfw is CRITICAL: Return code of -1 is out of bounds [21:32:23] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 107, down: 12, dormant: 0, excluded: 0, unused: 0BRxe-5/3/2: down - BRxe-5/0/0: down - BRxe-5/3/1: down - BRxe-5/0/2: down - BRxe-5/0/1: down - BRxe-5/2/2: down - BRxe-5/1/1: down - BRxe-5/2/3: down - BRxe-5/0/3: down - BRxe-5/1/2: down - BRxe-5/1/3: down - BRxe-5/1/0: down - BR [21:32:33] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: Return code of -1 is out of bounds [21:32:36] andrewbogott: you know that labswiki is flooding the memcached logs? [21:32:40] is labswiki == wikitech? [21:32:44] i think that is bblack :) [21:32:52] ori: yep, wikitech. What's it saying? 
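The vm.overcommit_memory = 1 change above is a stock redis requirement: BGSAVE fork()s the whole server process, and under the kernel's default heuristic overcommit the fork can be refused for a large dataset even though copy-on-write means the child will actually touch few pages. A small sketch for verifying the setting; reading /proc/sys is standard Linux, and the message text paraphrases redis's own startup warning:

```python
# Sketch: verify the sysctl the patch sets. 0 = heuristic overcommit
# (the default), 1 = always allow, 2 = strict accounting.
with open('/proc/sys/vm/overcommit_memory') as f:
    mode = int(f.read())

if mode == 1:
    print('OK: vm.overcommit_memory = 1')
else:
    # Paraphrase of redis's startup warning: with mode 0 or 2, the
    # BGSAVE fork of a large redis can fail with ENOMEM even though
    # copy-on-write keeps the child's real memory use small.
    print('WARNING: vm.overcommit_memory = %d; background saves may '
          'fail to fork under memory pressure' % mode)
```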
[21:32:53] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 72, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqiad:xe-4/2/1 (Giglinx/Zayo, ETYX/084858//ZYO) {#1062} [10Gbps MPLS]BRem1: down - BR [21:32:55] re: router interfaces [21:33:03] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 10.128.128.1, interfaces up: 35, down: 3, dormant: 0, excluded: 0, unused: 0BRfe-0/0/6: down - BRfe-0/0/5: down - BRfe-0/0/7: down - laptop connectionBR [21:33:14] Reedy, andrewbogott: still seeing memcached/db errors for labswiki in dberror.log/memcached-serious.log :/ [21:33:18] andrewbogott: 2014-10-03 21:31:45 virt1000 labswiki: Memcached error: Error connecting to 127.0.0.1:11211: Connection refused [21:33:23] andrewbogott: is nutcracker running? [21:33:38] (03CR) 10Ori.livneh: [C: 032 V: 032] If persistence is neither "both" nor "rdb", disable rdb snapshots in redis [puppet] - 10https://gerrit.wikimedia.org/r/164674 (owner: 10Aaron Schulz) [21:33:51] (03CR) 10Ori.livneh: [C: 032] rbf100{1,2}: set vm.overcommit_memory = 1, per redis requirements [puppet] - 10https://gerrit.wikimedia.org/r/164668 (owner: 10Ori.livneh) [21:33:51] 122 8345 1 0 Sep14 ? 03:31:27 /usr/sbin/nutcracker --mbuf-size=65536 [21:34:17] andrewbogott: mind if i take a look? [21:34:31] wrong port? 11211 vs 11212 ? [21:34:34] ori: have at [21:34:55] Is this something that's been happening for weeks? [21:37:18] andrewbogott: virt1000 needs role::mediawiki::common [21:37:53] (03PS1) 10Yuvipanda: icinga: Move purge resource script into module [puppet] - 10https://gerrit.wikimedia.org/r/164676 [21:37:54] what's missing specifically? [21:38:32] (03CR) 10jenkins-bot: [V: 04-1] icinga: Move purge resource script into module [puppet] - 10https://gerrit.wikimedia.org/r/164676 (owner: 10Yuvipanda) [21:39:27] And is this something that only just broke or has been broken for a long time? [21:39:46] sorry, doing too many things at once. is it supposed to be using its own memcached instance, separate from the one used by the cluster? [21:40:43] andrewbogott: and yes, it's been broken since sept. 3rd, it looks like [21:40:56] yes, local memcache [21:41:00] https://dpaste.de/a8ru/raw [21:41:09] (03PS2) 10Yuvipanda: icinga: Move purge resource script into module [puppet] - 10https://gerrit.wikimedia.org/r/164676 [21:42:31] andrewbogott: anyways, yes, as hashar noticed, it's trying to connect to the wrong port [21:42:43] andrewbogott: where is this specified? usually it's /srv/mediawiki/wmf-config/mc.php [21:43:19] everything is in operations-mediawiki-config [21:43:50] * ori boggles [21:44:00] and boggots [21:44:35] there's a duplicate wiki in /srv/org/wikimedia but that's defunct and can be ignored. [21:44:52] the actual install is running in the normal /srv/mediawiki location [21:45:16] Openstack packages should pull in all the dependencies needed for the wiki, hence no role::mediawiki::common [21:46:20] (03PS1) 10Yuvipanda: icinga: Move misc files/dirs into module [puppet] - 10https://gerrit.wikimedia.org/r/164678 [21:46:22] (03PS1) 10Yuvipanda: icinga: Remove misc/icinga.pp [puppet] - 10https://gerrit.wikimedia.org/r/164679 [21:47:12] (03PS1) 10Yuvipanda: nagios_common: Move check_graphite license file into folder [puppet] - 10https://gerrit.wikimedia.org/r/164680 [21:47:24] mutante: ottomata finally, misc/icinga.pp is no more!
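For the wrong-port theory in the labswiki thread above (mediawiki dialing 127.0.0.1:11211 while nutcracker may be listening elsewhere, with 11212 the other port floated), a quick connect test settles it faster than reading configs. The port numbers come from the discussion; which one is actually bound is exactly what would be checked:

```python
import socket

# Sketch: which local ports actually accept connections? 11211 is what
# labswiki was dialing; 11212 was the alternative suggested above.
for port in (11211, 11212):
    s = socket.socket()
    s.settimeout(1.0)
    try:
        s.connect(('127.0.0.1', port))
    except OSError as e:
        print('port %d: closed (%s)' % (port, e))
    else:
        print('port %d: open' % port)
    finally:
        s.close()
```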
[21:47:29] should wait until monday to merge, tho [21:47:43] and I think I'd want to get access to neon myself before I fix the warts in icinga/init.pp [21:52:21] (03CR) 10Dzahn: "Cscott: originally there used to be pdf1-3, then at some point pdf1 died and just pdf2 and pdf3 remained. if the mediawiki config mentione" [puppet] - 10https://gerrit.wikimedia.org/r/162814 (owner: 10Dzahn) [21:53:15] YuviPanda: cool! and not merging on Friday afternoon is appreciated [21:55:38] !log updated the default labs trusty image: updated packages, updated ldap setup, new /var/log partition [21:55:45] Logged the message, Master [21:56:14] (03PS1) 10Yuvipanda: icinga: Move checkcommands.erb into module [puppet] - 10https://gerrit.wikimedia.org/r/164681 [21:56:16] (03PS1) 10Yuvipanda: icinga: Remove unused check plugins [puppet] - 10https://gerrit.wikimedia.org/r/164682 [21:56:25] andrewbogott: yay. you should maybe do that for precise too? [21:56:49] YuviPanda: Building new precise images is pretty much broken, I'm hoping we can live without. [21:56:54] aaah [21:57:01] well, let that be then :) [21:57:04] I should maybe just remove the option from the gui [21:57:06] * YuviPanda should migrate tools to trusty at some point [21:57:10] until someone begs for it [21:57:21] andrewbogott: lots of things are still precise. betalabs, toollabs... [21:57:25] since lots of prod is still precise... [21:57:34] Hm... [21:57:47] I don't think we should remove the precise option until most of prod has migrated [21:59:11] I think with that, I have to call the 'make icinga into a module' work complete [21:59:38] 3 files left in files/icinga, but I can't really think of a proper place to put them [22:00:04] 6 patches only! nice [22:00:09] * YuviPanda does a fistpump [22:00:36] I may have to build a new base image then, not sure new ones will work after there stops being an ldap server named virt1000 [22:01:43] (03CR) 10Dzahn: "see inline comments. i went through the files you suspected are created from package, and most of them are, yes, see details though" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/164678 (owner: 10Yuvipanda) [22:02:02] YuviPanda: ^ there: [22:02:10] dpkg -L icinga-common [22:02:13] mutante: w00t, that's helpful [22:02:36] mutante: I wonder if the permissions / ownership are useful [22:02:44] mutante: since a lot of the other code just flippantly does 'root' [22:03:20] mutante: is neon's /etc/icinga/conf.d empty?
[22:03:40] YuviPanda: yes, empty directory [22:03:43] ah cool [22:03:55] unsure how it got there [22:04:04] mutante: in that case, I'm going to remove everything except the rw inside nagios [22:05:27] YuviPanda: the permissions can't be just root:root though, those exec's were indeed needed at some point [22:05:35] yeah [22:05:46] icinga user for example needs to be able to write to the "rw" [22:05:51] hence it being called rw [22:05:53] but I suppose they're no longer needed, but we should probably also fix the root:root into icinga [22:06:30] i'm not so sure about them not being needed, just gotta test it [22:06:55] yeah [22:07:29] mutante: I was intending to wait until I've neon access to test and merge those (or wait for all the issues which prevent it from running on labs get fixed) [22:08:08] (03PS2) 10Yuvipanda: nagios_common: Move check_graphite license file into folder [puppet] - 10https://gerrit.wikimedia.org/r/164680 [22:08:10] (03PS2) 10Yuvipanda: icinga: Move checkcommands.erb into module [puppet] - 10https://gerrit.wikimedia.org/r/164681 [22:08:12] (03PS2) 10Yuvipanda: icinga: Remove unused check plugins [puppet] - 10https://gerrit.wikimedia.org/r/164682 [22:08:14] mutante: but if you want to lend a hand next week, that'll be awesome too [22:08:14] (03PS2) 10Yuvipanda: icinga: Move misc files/dirs into module [puppet] - 10https://gerrit.wikimedia.org/r/164678 [22:08:16] (03PS2) 10Yuvipanda: icinga: Remove misc/icinga.pp [puppet] - 10https://gerrit.wikimedia.org/r/164679 [22:08:18] (03CR) 10Dzahn: [C: 031] icinga: Remove misc/icinga.pp [puppet] - 10https://gerrit.wikimedia.org/r/164679 (owner: 10Yuvipanda) [22:08:34] oh man, I was supposed to do this to get shinken running, but I've hardly spent any time on shinken the last week :( [22:09:52] (03CR) 10Dzahn: [C: 032] nagios_common: Move check_graphite license file into folder [puppet] - 10https://gerrit.wikimedia.org/r/164680 (owner: 10Yuvipanda) [22:10:15] YuviPanda: hmm.. less dependencies would make more merges :) [22:10:21] :D true, true. [22:10:23] that license file, would have done it [22:10:37] yeah, but it's also slightly more complex local workflow for me. [22:10:48] but true, it will make merging easier [22:10:49] i understand, yea, you'd have to reset between each commit [22:10:53] yeah [22:10:56] or have a fuckload of branches [22:11:03] and some of them *are* dependent [22:11:32] yeah, when they are actually dependent then it's very right to actually make them in gerrit too [22:11:39] (03CR) 10Nuria: [C: 031] Allow local server-status requests on http [puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/164583 (https://bugzilla.wikimedia.org/71606) (owner: 10QChris) [22:11:45] (03PS3) 10Yuvipanda: nagios_common: Move check_graphite license file into folder [puppet] - 10https://gerrit.wikimedia.org/r/164680 [22:11:50] mutante: yeah. 
moved that one out, tho ^ [22:12:20] and rebased the other series to not include it [22:12:20] (03PS3) 10Yuvipanda: icinga: Move checkcommands.erb into module [puppet] - 10https://gerrit.wikimedia.org/r/164681 [22:12:22] (03PS3) 10Yuvipanda: icinga: Remove unused check plugins [puppet] - 10https://gerrit.wikimedia.org/r/164682 [22:12:44] YuviPanda: there, one less :) [22:12:52] :) [22:16:12] (03Abandoned) 10Yuvipanda: quarry: Specify uid of user explicitly [puppet] - 10https://gerrit.wikimedia.org/r/152062 (owner: 10Yuvipanda) [22:16:22] (03PS1) 10Aaron Schulz: Fixed the parser cache type for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164688 [22:18:17] (03CR) 10Aaron Schulz: [C: 032] Fixed the parser cache type for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164688 (owner: 10Aaron Schulz) [22:18:34] (03Merged) 10jenkins-bot: Fixed the parser cache type for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164688 (owner: 10Aaron Schulz) [22:18:50] (03CR) 10Krinkle: "@Hashar: The virtual window is a deamon service unrelated and separate from any individual job. This is part of upstart, it doesn't say an" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [22:19:14] (03PS3) 10Yuvipanda: labs: reduce acct archiving retention [puppet] - 10https://gerrit.wikimedia.org/r/164520 (https://bugzilla.wikimedia.org/69604) (owner: 10Hashar) [22:19:16] (03CR) 10Krinkle: "So those issues don't apply here." [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [22:19:25] andrewbogott: AaronSchulz's patch should fix it, we should be able to confirm in a minute [22:19:29] (03PS15) 10Krinkle: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 [22:19:36] !log aaron Synchronized wmf-config/InitialiseSettings.php: Fixed the parser cache type for labswiki (duration: 00m 03s) [22:19:45] Logged the message, Master [22:19:47] andrewbogott: by the way, it makes sense for wikidev to have shell access to virt1000; they do for all other deployment targets [22:20:07] (03CR) 10Yuvipanda: [C: 04-1] "Reduced to 7 days." [puppet] - 10https://gerrit.wikimedia.org/r/164520 (https://bugzilla.wikimedia.org/69604) (owner: 10Hashar) [22:20:24] let me know when to sync on virt1000 [22:24:56] (03CR) 10Krinkle: "If we want a separate window for each run, we wouldn't need any of this and we'd just use "xvfb-run --auto-servernum". But we simply don't" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [22:25:34] i LOVE this one [22:25:42] https://gerrit.wikimedia.org/r/#/c/164099/ [22:27:52] (03PS2) 10Dzahn: pdf servers - remove from dsh,dhcp,ganglia [puppet] - 10https://gerrit.wikimedia.org/r/162814 [22:30:08] (03CR) 10Dzahn: [C: 032] "very harmless now. 
first of all the servers are already down, second this just removes them from dsh (not used), dhcp (can't be reinstalled" [puppet] - 10https://gerrit.wikimedia.org/r/162814 (owner: 10Dzahn) [22:30:33] (03PS3) 10Dzahn: pdf servers - remove from dsh,dhcp,ganglia [puppet] - 10https://gerrit.wikimedia.org/r/162814 [22:30:40] (03CR) 10Dzahn: [C: 032] pdf servers - remove from dsh,dhcp,ganglia [puppet] - 10https://gerrit.wikimedia.org/r/162814 (owner: 10Dzahn) [22:31:35] !log restarted Parsoid servers after another gradual cpu load creep [22:31:42] Logged the message, Master [22:33:15] !log Updated integration/phpunit to 6c1d11d (Regenerate autoloader) [22:33:22] Logged the message, Master [22:35:10] (03PS3) 10Dzahn: remove 10.0.0.0/16 Tampa subnet from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/164241 [22:35:49] (03CR) 10Dzahn: "Alex, ok thanks, updated in PS3 to remove the entire subnet" [puppet] - 10https://gerrit.wikimedia.org/r/164241 (owner: 10Dzahn) [22:36:33] YuviPanda: https://gerrit.wikimedia.org/r/#/c/164496/ [22:36:49] YuviPanda: https://gerrit.wikimedia.org/r/#/c/164497/ [22:37:05] !log NoConnectedServersError("No connected Gearman servers") in zuul.log on gallium [22:37:09] (03CR) 10Yuvipanda: [C: 031] "Although I think the general role should die in a fire at some point." [puppet] - 10https://gerrit.wikimedia.org/r/164496 (owner: 10Dzahn) [22:37:11] Logged the message, Master [22:37:30] (03CR) 10Yuvipanda: [C: 031] protoproxy - remove pmtpa [puppet] - 10https://gerrit.wikimedia.org/r/164497 (owner: 10Dzahn) [22:37:33] Krinkle: what's the fix for that? Restart zuul or gearman or both? [22:38:53] (03CR) 10Dzahn: [C: 032] "ok, yep, pmtpa.wmflabs couldn't work anymore. so i'd say all this could do is unbreak things or nothing" [puppet] - 10https://gerrit.wikimedia.org/r/164496 (owner: 10Dzahn) [22:40:57] !log Trying a soft restart of zuul on gallium [22:41:01] (03CR) 10Dzahn: [C: 032] "@site == "pmtpa" isn't really possible anymore" [puppet] - 10https://gerrit.wikimedia.org/r/164497 (owner: 10Dzahn) [22:41:03] Logged the message, Master [22:41:31] bd808: fix for what? [22:41:45] Krinkle: zuul not seeing gearman [22:41:55] NoConnectedServersError("No connected Gearman servers") [22:42:00] in the zuul log [22:42:27] bd808: Depends. What led to this? What was being updated or restarted? [22:42:56] Neither? It just croaked. I did do a reload an hour or so ago [22:43:12] but it seemed fine until just a few minutes ago [22:43:30] The reload was to deploy some new triggers [22:43:38] and they worked after the reload [22:44:01] This command just hangs now: /usr/local/bin/zuul-gearman.py status [22:44:54] And the status page says "Queue only mode: preparing to exit, queue length: 6" [22:45:06] So I think it's time to stop && start [22:46:33] !log Restarting zuul on gallium [22:46:39] (03PS1) 10Dzahn: delete snuggle.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164690 [22:46:40] Logged the message, Master [22:47:27] * YuviPanda goes to sleep [22:47:46] YuviPanda|zzzz: because of snuggle ? thanks! good night [22:49:18] bd808: zuul is fine though, the problem is in Gearman/Jenkins. Afaik Zuul doesn't ensure Gearman is running. [22:49:22] I'm not sure how to restart gearman [22:49:44] (03PS1) 10Dzahn: delete sanger SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164692 [22:49:52] but even the connection to that may be push instead of pull. e.g. the slave connections have to be reinitiated from the slave side [22:49:55] Restarting zuul seems to have fixed it.
I think the gearman process is spawned from zuul [22:50:12] It's running jobs again now [22:50:27] (03PS1) 10Aaron Schulz: Removed obsolete comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164693 [22:50:52] (03PS1) 10Dzahn: delete nfs[12].pmtpa SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/164694 [22:51:52] Krinkle: Based on this diagram, I think that the zuul service on gallium controls both gearman and zuul -- https://upload.wikimedia.org/wikipedia/commons/e/e9/Integrationwikimediaci-zuul_git_flows.svg [22:51:53] (03PS1) 10Dzahn: delete contacts.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164695 [22:52:47] bd808: ok [22:53:09] (03PS1) 10Dzahn: remove virt-star.pmtpa SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164696 [22:53:12] (03CR) 10Krinkle: [C: 031] Removed obsolete comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164693 (owner: 10Aaron Schulz) [22:56:04] (03CR) 10CSteipp: [C: 031] lists.wm.org - raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/161177 (https://bugzilla.wikimedia.org/38516) (owner: 10Chmarkine) [22:56:10] (03PS1) 10Dzahn: remove ishmael SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164697 [22:57:35] (03PS1) 10Dzahn: remove blog SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164698 [22:59:14] (03PS2) 10Dzahn: remove blog SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/164698 [23:23:37] (03CR) 10Dzahn: [C: 032] icinga: Restore alarms for contint to #wikimedia-operations [puppet] - 10https://gerrit.wikimedia.org/r/164635 (owner: 10Hashar) [23:24:57] (03CR) 10Dzahn: [V: 032] icinga: Restore alarms for contint to #wikimedia-operations [puppet] - 10https://gerrit.wikimedia.org/r/164635 (owner: 10Hashar) [23:45:36] CUSTOM - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war [23:48:54] (03CR) 10Dzahn: "tested. works. outputs a custom message to both channels now." [puppet] - 10https://gerrit.wikimedia.org/r/164635 (owner: 10Hashar)
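On bd808's gearman question: zuul can run an embedded gearman server (geard), which is consistent with a zuul restart bringing the workers back. Gearman's admin protocol is plain text on the job-server port, so a status probe needs nothing more than a socket. A sketch, assuming the conventional gearman port 4730 on gallium:

```python
import socket

# Sketch: gearman's plain-text admin protocol. 'status' lists each
# registered function as NAME\tTOTAL\tRUNNING\tAVAILABLE_WORKERS,
# terminated by a line containing a single '.'. A refused connection
# or a hang here is the "No connected Gearman servers" situation.
s = socket.create_connection(('gallium.wikimedia.org', 4730), timeout=5)
s.sendall(b'status\n')
buf = b''
while not buf.endswith(b'.\n'):
    chunk = s.recv(4096)
    if not chunk:
        break
    buf += chunk
s.close()
print(buf.decode())
```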