[00:01:57] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Puppet has 1 failures
[00:02:50] Krenair: Swat is ready?
[00:04:18] Krinkle, umm... I did a single patch earlier
[00:04:25] Okay :)
[00:04:35] the swat window ended 5 minutes ago
[00:08:15] Krenair: Yeah, well, it's not unusual to take longer
[00:08:31] it doesn't take much longer than 20 minutes, usually
[00:08:45] AndyRussG: ejegg|away: There are two changes to DonationInterface's current wmf branch that were merged but not deployed
[00:08:49] the morning one is long, from my subjective view
[00:09:20] Krinkle: I don't know much about that one, but ejegg|away said he'd be back a bit later this eve
[00:09:36] The extension submodule update to core is now pulled on tin due to my deployment; however, I've left the submodule itself as-is (not deployed), so beware
[00:10:46] !log krinkle@tin Synchronized php-1.26wmf24/extensions/EventLogging/modules/ext.eventLogging.core.js: Increase maxUrlSize to 2000 (duration: 00m 17s)
[00:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:23:18] 6operations, 7HHVM: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#1673341 (10matmarex)
[00:25:53] (03PS1) 10Dzahn: contint: update Apache config for 2.4 syntax [puppet] - 10https://gerrit.wikimedia.org/r/240936
[00:27:28] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[00:28:06] (03PS2) 10Dzahn: contint: update Apache config for 2.4 syntax [puppet] - 10https://gerrit.wikimedia.org/r/240936
[00:30:41] (03PS1) 10Alex Monk: tcpircbot: Allow per-infile channel lists [puppet] - 10https://gerrit.wikimedia.org/r/240939
[00:31:37] (03CR) 10jenkins-bot: [V: 04-1] tcpircbot: Allow per-infile channel lists [puppet] - 10https://gerrit.wikimedia.org/r/240939 (owner: 10Alex Monk)
[00:31:39] (03PS1) 10Dzahn: limn: make compatible with Apache 2.4 syntax [puppet] - 10https://gerrit.wikimedia.org/r/240941
[00:33:28] (03PS1) 10Dzahn: graphite: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/240942
[00:33:39] 82 > 79 characters
[00:33:40] * Krenair rages
[00:34:54] (03PS2) 10Alex Monk: tcpircbot: Allow per-infile channel lists [puppet] - 10https://gerrit.wikimedia.org/r/240939
[00:48:03] yuvipanda, around?
[00:56:42] (03PS1) 10Alex Monk: [WIP] Move from ircecho to tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/240945
[01:09:03] Krenair: do you need something reviewed?
[01:09:25] I was wondering where best to test the commit above
[01:09:31] e.g. which labs project
[01:09:49] oh, dunno. yeah, that's a yuvi question.
[01:10:24] thanks anyway
[01:10:25] * Krenair sleeps
[01:11:39] good night!
[01:25:43] (03PS1) 10Gergő Tisza: [WIP] LDAP support [software/sentry] - 10https://gerrit.wikimedia.org/r/240949 (https://phabricator.wikimedia.org/T97133)
[01:27:18] (03PS2) 10Gergő Tisza: [WIP] LDAP support [software/sentry] - 10https://gerrit.wikimedia.org/r/240949 (https://phabricator.wikimedia.org/T97133)
[01:28:28] !log krinkle@tin Synchronized php-1.26wmf24/extensions/WikimediaEvents: I5608f8ffd1c - Fix trailing comma (duration: 00m 17s)
[01:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:28:37] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp3017_v6
[01:32:27] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK
[01:42:27] !log mwscript deleteEqualMessages.php --wiki alswiki
[01:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:42:56] !log mwscript deleteEqualMessages.php --wiki itwikiquote
[01:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:44:02] !log mwscript deleteEqualMessages.php --wiki roa_tarawiki
[01:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:48:13] 6operations, 10Analytics-EventLogging, 10MediaWiki-extensions-NavigationTiming, 6Performance-Team, 5Patch-For-Review: Increase maxUrlSize from 1000 to 1500 - https://phabricator.wikimedia.org/T112002#1673362 (10ori) 5Open>3Resolved a:3Krinkle
[01:55:37] 6operations, 6Performance-Team, 6Release-Engineering-Team, 10Traffic, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1673434 (10ori) Brandon's conclusions seem right to me. What is left to do here? ("Delete the branches automa...
[02:05:56] 6operations, 6Performance-Team, 6Release-Engineering-Team, 10Traffic, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1673451 (10Krinkle) If the only source of traffic to old wmf-branches' w/static directories is from facebook...
[02:13:17] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100%
[02:15:09] Hi Krinkle! We're not sure why our deployment branch of DonationInterface has only recently become a problem
[02:15:58] but we're trying to get a patch merged to make-wmf-branch that will dissociate the branch we deploy to the payments cluster from the one used on donatewiki
[02:16:10] ejegg: Ah, that's a separate issue though
[02:16:15] https://gerrit.wikimedia.org/r/230705
[02:16:17] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[02:16:18] PROBLEM - IPsec on cp2009 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[02:16:57] PROBLEM - IPsec on cp2015 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[02:17:06] ejegg: I was just pointing out that there were patches backported to the deployed branch used by the main cluster, which Gerrit automatically applies to the matching mediawiki-core wmf branch, but which were not deployed. As such, anyone deploying code is now seeing a dirty git status for a submodule update that isn't applied.
[02:17:07] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[02:17:07] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[02:17:07] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - ok: 5 connecting: cp1046_v4, cp1046_v6, cp1060_v6
[02:17:07] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[02:17:07] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[02:17:27] PROBLEM - IPsec on cp2021 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[02:17:37] In general, you shouldn't merge cherry-picked or backported commits until right before you intend to deploy them.
[02:17:42] Krinkle: oh, you're talking about a core branch, not a DonationInterface branch?
[02:17:46] PROBLEM - IPsec on cp2003 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[02:17:57] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[02:17:58] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - ok: 6 connecting: cp1046_v4, cp1046_v6
[02:18:09] On core, we only commit changes to fundraising/REL1_25
[02:18:17] I'm talking about DonationInterface
[02:18:20] are those leaking into the wmf/ branches somehow?
[02:18:32] https://github.com/wikimedia/mediawiki/commits/wmf/1.26wmf24
[02:18:43] 5 hours ago two commits were backported to DonationInterface's current production branch
[02:18:46] but not deployed
[02:19:21] I did merge master to deployment 5 hours ago, in order to deploy to the payments cluster
[02:19:38] same way as we've been doing since I got here
[02:19:50] So there is a single branch for both?
[02:19:58] But one is not meant to be deployed in the main cluster?
[02:20:48] Again, I'm not sure why it's only coming up now, but I think the problem is in make-wmf-branch's special extensions config
[02:21:24] It's got DonationInterface pointing at our payments cluster deployment branch
[02:21:31] I don't mind separate branch policies. That's okay-ish.
[02:21:41] but that patch I linked above puts it back in the normal list
[02:22:43] I'd have +2ed it myself, but wasn't sure if patches to that repo need to be deployed immediately after merge
[02:22:49] But if make-wmf-branch uses "deployment" for the main cluster cut, that means commits added to deployment are expected to be deployed. Is it correct that the current assumption is that we deploy from "deployment" once when the wmf branch is cut, but then ignore updates (e.g. updates are for the payments cluster only)?
[02:22:57] If that's the case, then there should be two separate branches.
[02:23:16] My only concern is that mediawiki-core wmf branches should not point to a branch that receives commits you don't intend to deploy when they are merged.
[02:23:36] point to an extension branch*
[02:23:42] For the main cluster, the extension is only used for l10n strings on donatewiki
[02:23:55] so wmf branches can take master
[02:24:03] Okay
[02:24:38] Yeah, I agree our payments cluster deploys shouldn't be interfering with the main cluster deploys
[02:24:48] i'm just puzzled why it hasn't been a problem all along
[02:25:03] It has been. It's been bugging me every time I deploy something.
[02:25:10] I guess other deployers just ignored it until now.
[02:25:13] oho!
[02:25:44] Is that make-wmf-branch config patch all we need to fix the issue?
[02:25:57] Yeah, the next time release engineering cuts a branch it'll cut it from master
[02:26:28] ok. Do you know if I need to deploy make-wmf-branch after I +2 the patch?
[02:26:44] ejegg: So those two changes, you added to the DI deployment branch; are they safe to roll out now?
[02:26:52] yes, definitely
[02:27:42] basically, any DI change ought to be safe to roll out to the main cluster since the translated strings are the only things used there
[02:28:52] ejegg: Nope, nothing to deploy from make-wmf-branch. It's a script used once a week by deployers to auto-create wmf branches everywhere
[02:29:12] ah, cool. I see it checks to make sure it's the latest version when it runs
[02:29:12] afaik it's run from developers' local machines
[02:29:23] merging now!
[02:30:15] ah, you already did - thanks!
[02:30:33] oh, that explains the sync-dir lock error I got
[02:30:38] :D
[02:30:43] Yeah, I was going to deploy it.
[02:30:44] Go ahead
[02:31:31] ?? No, I mean I just tried to +2 the config.json patch, and got a 'patch is closed' error 'cause I hadn't reloaded the page since you +2ed it
[02:31:40] I haven't touched anything on tin
[02:31:54] right
[02:32:01] then someone else is deploying
[02:32:07] hmm
[02:32:08] something else rather
[02:32:09] l10nupdate
[02:32:15] (03CR) 10MZMcBride: "The commit message is confusing me. Is this only for Beta Labs?" [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev)
[02:32:26] Right, it's that time of day
[02:33:29] !log l10nupdate@tin Synchronized php-1.26wmf24/cache/l10n: l10nupdate for 1.26wmf24 (duration: 11m 45s)
[02:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:46:14] 6operations, 10RESTBase, 6Services: RESTBase and domain renames - https://phabricator.wikimedia.org/T113307#1673464 (10GWicke) My vote would be to shelve this until we actually have a need to preserve old data for a significant project across a rename. For the small projects that were renamed so far option 1...
[02:53:36] !log krinkle@tin Synchronized php-1.26wmf24/extensions/DonationInterface: 381faf5 (duration: 00m 21s)
[02:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:57:37] !log sync-common failed on mw1010.eqiad.wmnet
[02:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:01:51] 6operations, 10RESTBase: uneven load on restbase workers - https://phabricator.wikimedia.org/T113579#1673470 (10GWicke) We are using the cluster mode that relies on the Linux kernel's native accept scheduling, which isn't completely fair. The cluster module [also supports a round-robin scheme](https://nodejs.o...
[03:05:06] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:26:17] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: puppet fail
[03:28:50] RECOVERY - Disk space on labstore1002 is OK: DISK OK
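(The "dirty git status" Krinkle describes above arises when the core wmf branch's superproject records a newer extension commit than the one checked out in the staging copy. A minimal sketch of how a deployer could spot such drift with plain git plumbing - this is illustrative only, not the actual deploy tooling, and the staging path is hypothetical:)

```python
# Sketch, not the real deploy tooling: list submodules whose checked-out
# commit differs from the one recorded in the superproject's index.
import subprocess

STAGING = '/srv/mediawiki-staging/php-1.26wmf24'  # hypothetical path

def drifted_submodules(repo):
    # `git submodule status` prefixes a line with '+' when the submodule's
    # checked-out commit does not match the superproject's recorded one.
    out = subprocess.check_output(
        ['git', '-C', repo, 'submodule', 'status'],
        universal_newlines=True)
    return [line.split()[1] for line in out.splitlines()
            if line.startswith('+')]

for path in drifted_submodules(STAGING):
    print('submodule with unsynced commits: %s' % path)
```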
[03:49:37] (03CR) 10Smalyshev: "For starters, yes. Once we're sure it works on beta we do the same on production." [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev)
[03:53:57] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:05:47] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:11:07] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [100000000.0]
[04:23:25] (03CR) 10MZMcBride: "One of these two changesets should explain why the change is being made." [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev)
[04:27:52] (03CR) 10Smalyshev: "I thought there's a lot of discussion in the bug. But I can add a short summary to the commit msg too." [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev)
[04:33:17] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[04:49:13] (03PS1) 10Aaron Schulz: Set "async" for SQL parser cache everywhere else [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240963
[04:57:05] (03CR) 10Ori.livneh: [C: 032] Set "async" for SQL parser cache everywhere else [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240963 (owner: 10Aaron Schulz)
[04:57:11] (03Merged) 10jenkins-bot: Set "async" for SQL parser cache everywhere else [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240963 (owner: 10Aaron Schulz)
[04:58:18] !log ori@tin Synchronized wmf-config/CommonSettings.php: I252970886: Set "async" for SQL parser cache everywhere else (duration: 00m 18s)
[04:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:31:46] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1673599 (10Papaul) @Coren Thanks for working on this. Daniel helped me to test my access to the different bastion servers bast1000, bast2001, bast3001, hooft a...
[05:43:56] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100%
[05:44:46] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 35.00 ms
[05:58:39] PROBLEM - puppet last run on db1011 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:06:56] 6operations, 10Wikimedia-General-or-Unknown, 7Database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1673613 (10Nemo_bis) You can do `7z e -so T26675.xml.7z | grep -C 100 "186704908"`.
[06:24:27] RECOVERY - puppet last run on db1011 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:30:26] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:47] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: puppet fail
[06:32:26] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:28] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:37] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:47] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:56] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:06] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:07] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:16] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:37] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:38] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:07] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:56:17] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:56:26] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:56:46] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:56:48] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:57:06] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:57:07] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:57:37] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:58:27] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:27] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:59:26] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:05:57] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:09:37] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61484 bytes in 7.125 second response time
[07:20:37] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:23:57] ^^^ that one is gitblit going wild on antimony
[07:24:02] known issue
[07:29:46] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61469 bytes in 0.327 second response time
[07:39:18] 6operations, 10RESTBase: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#1673665 (10fgiunchedi) summing up yesterday's ops/services sync up: * we'll take the multi-instance opportunity to separate `/var` for logs/data [ops] * we'll investigate syslog, both local and remote for r...
[08:01:19] 6operations, 10RESTBase: uneven load on restbase workers - https://phabricator.wikimedia.org/T113579#1673691 (10fgiunchedi) >>! In T113579#1673470, @GWicke wrote: > > @fgiunchedi, do you see any concrete issues caused by less-than-perfect load balancing at low load? not as far as I can tell but wanted to fla...
[08:04:26] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 42
[08:14:02] (03CR) 10Hashar: "From a quick discussion I had with Tyler this week, seems that is going to be tested on the beta cluster first. Ie we are going to migrat" [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani)
[08:14:48] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:19:17] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 42
[08:22:07] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61469 bytes in 0.053 second response time
[08:30:00] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/226910 (owner: 10Hashar)
[08:30:15] (03CR) 10Filippo Giunchedi: [C: 031] graphite: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/240942 (owner: 10Dzahn)
[08:44:57] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:46:46] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61469 bytes in 4.727 second response time
[08:47:12] (03PS1) 10Hashar: nodepool: switch info logs from hourly to daily [puppet] - 10https://gerrit.wikimedia.org/r/240986
[08:51:35] (03CR) 10DCausse: [C: 031] [elasticsearch] Update recover_after_nodes value [puppet] - 10https://gerrit.wikimedia.org/r/238850 (owner: 10EBernhardson)
[09:11:27] RECOVERY - Restbase root url on restbase2001 is OK: HTTP OK: HTTP/1.1 200 - 15118 bytes in 0.119 second response time
[09:11:58] RECOVERY - Restbase endpoints health on restbase2001 is OK: All endpoints are healthy
[09:12:27] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[09:13:17] PROBLEM - puppet last run on restbase2002 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[09:13:27] PROBLEM - puppet last run on restbase2005 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[09:13:47] PROBLEM - puppet last run on restbase2003 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[09:13:49] <_joe_> godog: ^^
[09:13:55] <_joe_> I guess that's you
[09:14:17] PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: Puppet last ran 18 hours ago
[09:14:22] it is, yeah, will recover shortly - I just enabled puppet again
[09:14:58] I'm not sure why it fired though, bad timing perhaps
[09:15:36] RECOVERY - Restbase root url on restbase2002 is OK: HTTP OK: HTTP/1.1 200 - 15118 bytes in 0.117 second response time
[09:16:07] RECOVERY - Restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[09:16:58] RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:26:17] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:27:18] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[09:30:06] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61469 bytes in 0.383 second response time
[09:30:16] RECOVERY - Restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[09:30:37] RECOVERY - puppet last run on restbase2003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:30:57] RECOVERY - Restbase root url on restbase2003 is OK: HTTP OK: HTTP/1.1 200 - 15118 bytes in 0.123 second response time
[09:33:07] RECOVERY - puppet last run on restbase2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:41:28] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:41:36] RECOVERY - puppet last run on restbase2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:56:17] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61485 bytes in 0.678 second response time
[10:08:56] 6operations, 6Services, 7RESTBase-architecture: Separate /var on restbase100x - https://phabricator.wikimedia.org/T113714#1674039 (10mobrovac) 3NEW
[10:09:31] 6operations, 10RESTBase: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#1674048 (10mobrovac) >>! In T112648#1673665, @fgiunchedi wrote: > summing up yesterday's ops/services sync up: > > * we'll take the multi-instance opportunity to separate `/var` for logs/data [ops] Tracked...
[10:09:45] 6operations, 6Services, 7RESTBase-architecture: Separate /var on restbase100x - https://phabricator.wikimedia.org/T113714#1674039 (10mobrovac)
[10:09:48] 6operations, 10RESTBase: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#1674051 (10mobrovac)
[10:10:08] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:10:37] PROBLEM - Apache HTTP on mw1208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.009 second response time
[10:11:17] 6operations, 10ops-eqiad: Decommission es1001-es1010 - https://phabricator.wikimedia.org/T113080#1674061 (10jcrespo) Let me abandon it and you can do it at your own pace.
[10:11:54] (03Abandoned) 10Jcrespo: Removing dns entries for es1001-es1010 for decom [dns] - 10https://gerrit.wikimedia.org/r/240668 (owner: 10Jcrespo)
[10:12:27] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.224 second response time
[10:17:26] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:17:48] 6operations, 10ops-eqiad: Decommission es1001-es1010 - https://phabricator.wikimedia.org/T113080#1674065 (10jcrespo) a:5jcrespo>3Cmjohnson
[10:24:47] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 61486 bytes in 0.092 second response time
[10:24:47] (03CR) 10Zfilipin: [C: 031] nodepool: switch info logs from hourly to daily [puppet] - 10https://gerrit.wikimedia.org/r/240986 (owner: 10Hashar)
[10:30:56] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0]
[10:36:02] (03CR) 1020after4: [C: 031] "python decorators: not so great actually" [tools/scap] - 10https://gerrit.wikimedia.org/r/240912 (owner: 10Thcipriani)
[10:36:56] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[11:30:19] one back
[11:30:47] RECOVERY - NFS read/writeable on labs instances on labstore1002 is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.023 second response time
[11:36:47] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[11:41:32] morebots, there?
[11:41:32] I am a logbot running on tools-exec-1205.
[11:41:32] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[11:41:32] To log a message, type !log .
[11:42:55] !log upgrading linux-image-generic on labnet1002 to get us away from the recently-crashed 3.13.0-59
[11:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:56:24] 6operations: Update kernel on db1011 - https://phabricator.wikimedia.org/T113720#1674242 (10MoritzMuehlenhoff) 3NEW
[11:57:18] 6operations: Update kernel on db1011 - https://phabricator.wikimedia.org/T113720#1674249 (10jcrespo) p:5Triage>3Low
[11:57:30] 6operations: Update kernel on db1011 - https://phabricator.wikimedia.org/T113720#1674251 (10jcrespo) a:3jcrespo
[12:02:54] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1674252 (10Krenair) Yeah, bast4001 is ops-only ('bastiononly' doesn't get you there). I don't think bast3001 exists, that's called hooft... And bast1000 is act...
[12:04:45] (03PS1) 10Mobrovac: RESTBase: config update [puppet] - 10https://gerrit.wikimedia.org/r/241018
[12:05:33] (03PS1) 10DCausse: Fix TTMServer config to use the extra plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241019 (https://phabricator.wikimedia.org/T113711)
[12:09:57] (03CR) 10Nikerabbit: [C: 031] "Good catch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241019 (https://phabricator.wikimedia.org/T113711) (owner: 10DCausse)
[12:10:07] (03PS2) 10Nikerabbit: Fix TTMServer config to use the extra plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241019 (https://phabricator.wikimedia.org/T113711) (owner: 10DCausse)
[12:13:20] who's on deployment-puppetmaster?
[12:14:19] re [12:14:34] alex@alex-laptop:~$ ssh deployment-puppetmaster.deployment-prep.eqiad.wmflabs w [12:14:34] 12:14:20 up 30 days, 12:30, 1 user, load average: 0.00, 0.01, 0.12 [12:14:34] USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT [12:14:34] mobrovac pts/0 bastion-01.basti 12:11 9.00s 0.20s 0.01s sshd: mobrovac [priv] [12:14:35] alex@alex-laptop:~$ [12:14:51] yes i know [12:14:53] :) [12:15:10] but somebody left ops/puppet there in not a good shape [12:15:19] and i need to c-p and test a RB config change [12:15:23] so ... [12:15:33] can't even stash it [12:15:45] * mobrovac grunts [12:15:49] ew, unmerged changes [12:15:52] bad discipline [12:15:56] rebase in progress; onto e5e8cb1 [12:16:18] !log Seems Zuul/Jenkins is in trouble somehow :-/ [12:16:18] yeah [12:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:16:56] mobrovac: /var/lib/git/operations/puppet is rebased automatically via a cronjob, but in case of conflict it leaves the working space in a dirty state -:( [12:16:59] one moment morebots [12:17:01] mobrovac, * [12:17:20] hehehe [12:17:21] kk [12:17:22] thnx Krenair [12:17:39] try now mobrovac [12:17:58] wd clean [12:17:59] yey [12:18:01] thnx Krenair! [12:18:45] I aborted the rebase, reset --hard to HEAD^ (remove a cherry pick of an old version of https://gerrit.wikimedia.org/r/#/c/239998/) and rebased to origin/production [12:19:17] now the final merged version of that commit is there [12:20:12] ah ok [12:20:15] thnx for the tip [12:21:48] !log Nodepool is dead, or at least not adding new slaves anymore [12:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:27:18] known issue with ipsec on cp boxes btw? critical since 10h [12:29:09] !sal [12:29:09] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [12:34:49] What are you guys up to today? [12:34:52] :D [12:36:07] jynus: do we have any issue with m5-master ? From labnodepool1001.eqiad.wmnet (10.64.20.18 ) I got some timeout waiting for deadlock and even a "OperationalError: (OperationalError) (2006, "MySQL server has gone away (error(32, 'Broken pipe'))") None None" [12:36:37] jynus: maybe it was just a transient network error :} [12:38:39] PROBLEM - Disk space on analytics1026 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:38:39] PROBLEM - Disk space on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:39:17] PROBLEM - Hadoop Namenode - Primary on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [12:40:27] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:41:10] hashar, there was a short labs outage
[12:41:37] jynus: yeah seems the network has been flappy :D don't bother with m5-master, it is most probably all fine
[12:41:44] maybe the issue was labs, not mysql; I didn't receive an alert, but I will check anyway
[12:42:05] It takes 0 time, and I do it anyway once a day
[12:43:03] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] RESTBase: config update [puppet] - 10https://gerrit.wikimedia.org/r/241018 (owner: 10Mobrovac)
[12:43:07] RECOVERY - Hadoop Namenode - Primary on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode
[12:44:34] yes, I see a drop in activity at the same time as the labs network issue, but it never stopped giving service
[12:45:03] your error is consistent with a connectivity issue
[12:45:37] but traffic from production was not affected
[12:45:45] I can kill your processes if you want
[12:46:16] jynus: let me restart it on my side, should clear them
[12:46:32] !log stopping nodepool to clear out left-over mysql connections
[12:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:47:54] one thing that cannot be done in mysql is restarting your connections for you
[12:48:21] applications should be prepared to retry them - for example, in case of a failover
[12:48:32] jynus: seems that is not properly handled
[12:49:32] when I am talking about mysql, I actually mean "any client application for a server"
[12:51:12] Nodepool keeps a permanent connection to the mysql server. But apparently it does not handle reconnection :/
[12:51:30] that is rather dumb
[12:52:01] !log stop puppet on restbase in production -- config deployment
[12:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:53:47] PROBLEM - Hadoop NameNode Primary Is Active on analytics1001 is CRITICAL: Hadoop.NameNode.FSNamesystem.tag_HAState CRITICAL: standby
[12:54:00] !log restarted rabbitmq-server on labcontrol1001
[12:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:58:43] (03PS1) 10Alexandros Kosiaris: otrs: Install libyaml-libyaml-perl [puppet] - 10https://gerrit.wikimedia.org/r/241024
[13:00:26] Krenair: so… any chance we could deploy the https://phabricator.wikimedia.org/T113718 fix today?
[13:00:46] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:05:58] hashar: yes, that happens in many applications - persistent connections cannot be failed over easily, and they are a problem. Some refresh connections frequently, others don't.
[13:06:28] there is a bug for that, but there is not much to do from the server side
[13:09:08] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:10:14] !log enable puppet on restbase in production
[13:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:11:57] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 4 failures
[13:12:04] !log added debdeploy 0.0.7 to carbon
[13:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:12:21] !log restbase deploying e42bf0fc
[13:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:13:26] jynus: ack. I guess I will have to dig into the nodepool code and make it more robust
[13:13:44] sometimes it is a config option
[13:14:03] at least in java pools of connections, etc.
[13:14:15] it is using python sqlalchemy apparently
[13:14:16] persistent vs. precreated
[13:14:21] I will find out :-}
[13:14:52] also, monitoring will help (do not know if someone got alerted)
[13:16:27] PROBLEM - Restbase root url on restbase1001 is CRITICAL: Connection refused
[13:17:57] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[13:21:47] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy
[13:22:07] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15118 bytes in 0.005 second response time
[13:23:59] _joe_: these endpoints notifications ^^ make me smile a little every time i see them :)
[13:24:06] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 3 failures
[13:29:54] (03PS1) 10Muehlenhoff: Various bugfixes and feature tweaks [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/241031
[13:31:31] (03CR) 10Muehlenhoff: [C: 032 V: 032] Various bugfixes and feature tweaks [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/241031 (owner: 10Muehlenhoff)
[13:32:40] I'll restart the salt master on palladium in about 10 minutes, please speak out if I should delay that
[13:38:37] PROBLEM - nova-compute process on labvirt1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[13:39:17] PROBLEM - nova-compute process on labvirt1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[13:39:57] PROBLEM - nova-compute process on labvirt1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[13:42:26] RECOVERY - nova-compute process on labvirt1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[13:42:57] RECOVERY - nova-compute process on labvirt1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[13:43:46] RECOVERY - nova-compute process on labvirt1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[13:43:50] andrewbogott: seems that did it :-)
[13:44:05] instances are gone and new ones are spawning properly
[13:44:13] oh? I still have a couple with pending deletes
[13:44:19] yeah
[13:44:25] but the bulk of them have been honored
[13:44:41] !log restarted saltmaster on palladium
[13:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:46:48] andrewbogott: I have a stalled deletion request for labvirt1003
[13:47:04] yeah, looks like things are gradually catching up
[13:47:14] ah no it is gone already
[13:47:32] so the network flap caused compute to lose their connection with whatever central system there is?
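(For the Nodepool reconnection problem being discussed: SQLAlchemy can recycle idle connections and test them before handing them out, turning a dropped TCP session into a transparent reconnect rather than the "2006 MySQL server has gone away" error seen earlier. A minimal sketch, assuming Nodepool's engine setup can be amended - the DSN is a placeholder, and pool_pre_ping requires SQLAlchemy >= 1.2:)

```python
# Sketch of the kind of fix Nodepool would need; the DSN is a placeholder.
from sqlalchemy import create_engine

engine = create_engine(
    'mysql+pymysql://nodepool:SECRET@m5-master.eqiad.wmnet/nodepooldb',
    pool_recycle=3600,   # retire connections idle for more than an hour
    pool_pre_ping=True,  # cheap liveness check before each checkout;
)                        # needs SQLAlchemy >= 1.2

# A stale connection is now replaced transparently instead of raising
# OperationalError 2006 ("MySQL server has gone away"):
with engine.connect() as conn:
    conn.execute('SELECT 1')
```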
[13:48:02] I am merely wondering whether the issue is the network flap or a weird side effect from Nodepool
[13:48:06] PROBLEM - nova-compute process on labvirt1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[13:48:33] MatmaRex, hi
[13:48:35] MatmaRex, yes, I could, if you think it's important enough
[13:48:37] PROBLEM - nova-compute process on labvirt1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[13:48:40] I think it was from the network going down, but I'm not sure
[13:49:58] RECOVERY - nova-compute process on labvirt1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[13:50:35] Krenair: it seems to be problematic for some en.wp maintenance tasks… i think it's moderately important and really trivial
[13:50:36] RECOVERY - nova-compute process on labvirt1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[13:50:40] 6operations, 10Traffic: cp1046 is crashing and becoming unresponsive - https://phabricator.wikimedia.org/T113639#1674499 (10BBlack) See also T113184 for previous crash. Apparently it crashed again this morning (while still depooled). Will downtime it in icinga and start tracking down what the real problem is...
[13:51:20] Krenair: let me just read through all that code again to verify that there are no more gotchas, can you deploy it in ~30 minutes?
[13:51:34] MatmaRex, ok
[13:54:01] 6operations, 10Traffic: cp1046 is crashing and becoming unresponsive - https://phabricator.wikimedia.org/T113639#1674503 (10BBlack) Checked the serial console, and it's showing this: ``` Alert! System fatal error during previous boot PCI Express Error Uncorrectable Memory Error Management Engine Mode...
[13:54:16] 6operations, 10Traffic: cp1046 is crashing and becoming unresponsive - https://phabricator.wikimedia.org/T113639#1674506 (10BBlack) p:5High>3Normal
[13:54:27] 6operations, 10ops-eqiad, 10Traffic: cp1046 is crashing and becoming unresponsive - https://phabricator.wikimedia.org/T113639#1671942 (10BBlack)
[13:54:39] 6operations, 10ops-eqiad, 10Traffic: cp1046 is crashing and becoming unresponsive - https://phabricator.wikimedia.org/T113639#1674510 (10BBlack) a:3Cmjohnson
[13:56:36] bblack: I am assuming cp1046 is depooled
[13:57:10] yes, completely
[13:57:24] ok...thx
[13:58:56] !log nodepool back in operations
[13:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:59:04] (03PS1) 10Hashar: contint: libssl-dev for python [puppet] - 10https://gerrit.wikimedia.org/r/241037
[14:00:04] may someone please merge the contint patch https://gerrit.wikimedia.org/r/241037 that adds libssl-dev? :-}
[14:02:17] (03PS1) 10Muehlenhoff: Add some debugging notes [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/241039
[14:02:53] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add some debugging notes [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/241039 (owner: 10Muehlenhoff)
[14:03:56] !log starting a Cassandra repair on restbase1006 (nodetool repair -pr -dc eqiad)
[14:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:04:24] PROBLEM - Host lvs1010 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:55] PROBLEM - Confd template for /etc/pybal/pools/maps on lvs1011 is CRITICAL: NRPE: Command check_confd_etc_pybal_pools_maps not defined
[14:05:24] PROBLEM - Confd template for /etc/pybal/pools/maps-https on lvs1011 is CRITICAL: NRPE: Command check_confd_etc_pybal_pools_maps-https not defined
[14:05:44] PROBLEM - Confd template for /etc/pybal/pools/misc_web on lvs1011 is CRITICAL: NRPE: Command check_confd_etc_pybal_pools_misc_web not defined
[14:05:49] heh
[14:05:52] ignore those
[14:06:04] PROBLEM - Confd template for /etc/pybal/pools/misc_web-https on lvs1011 is CRITICAL: NRPE: Command check_confd_etc_pybal_pools_misc_web-https not defined
[14:06:10] race condition on initial puppetization, sometimes icinga can hit it before it's fully configured
[14:06:15] PROBLEM - Confd template for /etc/pybal/pools/ocg on lvs1011 is CRITICAL: NRPE: Command check_confd_etc_pybal_pools_ocg not defined
[14:06:16] 6operations, 10Wikimedia-General-or-Unknown, 7Database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1674526 (10Halfak) Indeed, but the text may be more than 100 lines, so I was parsing to get the whole revision block. However, my pa...
[14:06:25] PROBLEM - Confd template for /etc/pybal/pools/parsoidcache on lvs1011 is CRITICAL: NRPE: Command check_confd_etc_pybal_pools_parsoidcache not defined
[14:06:35] PROBLEM - Confd template for /etc/pybal/pools/parsoidcache-https on lvs1011 is CRITICAL: NRPE: Command check_confd_etc_pybal_pools_parsoidcache-https not defined
[14:06:35] PROBLEM - salt-minion processes on lvs1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:06:45] PROBLEM - service on lvs1011 is CRITICAL: NRPE: Command check_confd-state not defined
[14:06:45] PROBLEM - Confd template for /etc/pybal/pools/stream on lvs1011 is CRITICAL: NRPE: Command check_confd_etc_pybal_pools_stream not defined
[14:07:04] PROBLEM - Confd template for /etc/pybal/pools/stream-https on lvs1011 is CRITICAL: NRPE: Command check_confd_etc_pybal_pools_stream-https not defined
[14:07:15] PROBLEM - Confd template for /etc/pybal/pools/upload on lvs1011 is CRITICAL: NRPE: Command check_confd_etc_pybal_pools_upload not defined
[14:07:34] PROBLEM - Confd template for /etc/pybal/pools/upload-https on lvs1011 is CRITICAL: NRPE: Command check_confd_etc_pybal_pools_upload-https not defined
[14:07:54] downtimed lvs101[012]
[14:13:46] <_joe_> bblack: converting everything to jessie?
[14:13:54] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me, the packaged python-cryptography also uses it." [puppet] - 10https://gerrit.wikimedia.org/r/241037 (owner: 10Hashar)
[14:13:57] <_joe_> oh, scratch that
[14:14:09] (03CR) 10Muehlenhoff: [C: 032 V: 032] contint: libssl-dev for python [puppet] - 10https://gerrit.wikimedia.org/r/241037 (owner: 10Hashar)
[14:14:18] danke !
[14:14:41] de rien
[14:14:50] we are such an international organization
[14:15:15] would be even better if I thanked you in chinese and you replied in italian
[14:15:31] 6operations, 10RESTBase, 10RESTBase-Cassandra: column family cassandra metrics size - https://phabricator.wikimedia.org/T113733#1674571 (10fgiunchedi) 3NEW a:3fgiunchedi
[14:15:36] to be fair, that accounts for about 1% of the school French I still remember :-)
[14:16:02] sounds sufficient
[14:16:04] RECOVERY - Host lvs1010 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[14:16:42] _joe_: we're using new servers in eqiad that are jessie, but they're not ready yet (instead of converting existing lvs100[1-6])
[14:17:10] all the LVS at the other DCs are converted now
[14:17:24] MatmaRex, all okay?
[14:18:16] Krenair: still combing it, found a bad copy-paste that is thankfully harmless. i think we can do it
[14:19:14] RECOVERY - Confd template for /etc/pybal/pools/misc_web-https on lvs1011 is OK: No errors detected
[14:19:26] RECOVERY - Confd template for /etc/pybal/pools/ocg on lvs1011 is OK: No errors detected
[14:19:35] RECOVERY - Confd template for /etc/pybal/pools/parsoidcache on lvs1011 is OK: No errors detected
[14:19:45] RECOVERY - salt-minion processes on lvs1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:19:45] RECOVERY - Confd template for /etc/pybal/pools/parsoidcache-https on lvs1011 is OK: No errors detected
[14:19:54] RECOVERY - service on lvs1011 is OK: OK - confd is active
[14:19:55] RECOVERY - Confd template for /etc/pybal/pools/stream on lvs1011 is OK: No errors detected
[14:20:05] RECOVERY - Confd template for /etc/pybal/pools/maps on lvs1011 is OK: No errors detected
[14:20:15] RECOVERY - Confd template for /etc/pybal/pools/stream-https on lvs1011 is OK: No errors detected
[14:20:24] RECOVERY - Confd template for /etc/pybal/pools/maps-https on lvs1011 is OK: No errors detected
[14:20:25] RECOVERY - Confd template for /etc/pybal/pools/upload on lvs1011 is OK: No errors detected
[14:20:36] RECOVERY - Confd template for /etc/pybal/pools/upload-https on lvs1011 is OK: No errors detected
[14:20:42] bblack (or anyone that knows) does varnish have access to the POST body of an http request?
[14:20:45] RECOVERY - Confd template for /etc/pybal/pools/misc_web on lvs1011 is OK: No errors detected
[14:21:11] 6operations, 10ops-eqiad, 10Traffic: cp1046 is crashing and becoming unresponsive - https://phabricator.wikimedia.org/T113639#1674597 (10Cmjohnson) A pull of the idrac log revealed this error Record: 50 Date/Time: 09/25/2015 02:12:31 Source: system Severity: Critical Description: Multi-bit me...
[14:21:24] RECOVERY - Disk space on stat1002 is OK: DISK OK
[14:21:26] nuria: not for logging purposes, if that's what you mean
[14:21:58] bblack: yes, i was wondering if we could send posts also as part of the data that varnishkafka gets into hive
[14:22:05] !log 'sudo -u hdfs hdfs haadmin -transitionToActive analytics1001-eqiad-wmnet' per otto on analytics1001
[14:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:22:13] bblack: use case is the API usage
[14:22:24] RECOVERY - Hadoop NameNode Primary Is Active on analytics1001 is OK: Hadoop.NameNode.FSNamesystem.tag_HAState OKAY: active
[14:22:35] RECOVERY - Disk space on analytics1027 is OK: DISK OK
[14:22:35] RECOVERY - Disk space on analytics1026 is OK: DISK OK
[14:22:37] nuria: no, we can't use POST for that
[14:22:47] bblack: k got it
[14:23:35] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:28:11] (03PS1) 10Muehlenhoff: Some installation docs [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/241048
[14:28:42] (03CR) 10Muehlenhoff: [C: 032 V: 032] Some installation docs [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/241048 (owner: 10Muehlenhoff)
[14:29:16] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[14:31:50] Krenair: we're just waiting for jenkins, right?
[14:31:56] yes
[14:33:46] s/jenkins/the lame test suite running under Zend/
[14:33:47] :D
[14:34:44] hashar: we were actually waiting for jenkins, all the suites were finished when i looked on integration.wm.o
[14:35:14] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:35:18] that happens whenever someone force-merges a patch :-/
[14:36:26] !log krenair@tin Synchronized php-1.26wmf24/includes/specials/SpecialMovepage.php: https://gerrit.wikimedia.org/r/#/c/241045/ (duration: 00m 17s)
[14:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:37:09] MatmaRex, all done other than snapshot1001, which I think apergos was going to reinstall or something
[14:37:39] thanks Krenair, it works correctly now on en.wp!
[14:38:16] 6operations, 10ops-eqiad, 10Traffic: cp1046 is crashing and becoming unresponsive - https://phabricator.wikimedia.org/T113639#1674662 (10BBlack) So mem errors on all DIMMs, and the other one on the console about the PCI bus. Bad CPU? Bad board?
[14:43:15] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:43:29] !log installed rpcbind and apport security updates on various servers
[14:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:52:18] 6operations, 10ops-eqiad, 10Traffic: cp1046 is crashing and becoming unresponsive - https://phabricator.wikimedia.org/T113639#1674722 (10chasemp) I would think bad board yeah
[14:56:02] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1674730 (10BBlack) lvs1010: eth1 -> asw2-a5 xe-0/0/11 - link dead on both ends of the connection, seems to be configured correctly. lvs1010 + lvs1012: no link t...
[15:00:35] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[15:03:24] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [5000000.0]
[15:07:07] (03CR) 10Chad: [C: 032] De-decorate inside_git_dir [tools/scap] - 10https://gerrit.wikimedia.org/r/240912 (owner: 10Thcipriani)
[15:09:20] (03Merged) 10jenkins-bot: De-decorate inside_git_dir [tools/scap] - 10https://gerrit.wikimedia.org/r/240912 (owner: 10Thcipriani)
[15:10:44] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[15:23:51] 6operations, 10RESTBase, 10RESTBase-Cassandra: column family cassandra metrics size - https://phabricator.wikimedia.org/T113733#1674805 (10fgiunchedi) p:5High>3Normal
[15:29:49] (03PS5) 10Filippo Giunchedi: swift: aggregate and report container object/byte stats [puppet] - 10https://gerrit.wikimedia.org/r/240358 (https://phabricator.wikimedia.org/T92322)
[15:35:18] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: rename cassandra cluster - https://phabricator.wikimedia.org/T112257#1674855 (10fgiunchedi)
[15:35:21] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1674854 (10fgiunchedi)
[15:40:15] (03PS2) 10Chad: Remove PHP localization cache code [tools/scap] - 10https://gerrit.wikimedia.org/r/240440
[15:43:22] (03PS1) 10Chad: Remove wikiversions.cdb code now that we use wikiversions.php [tools/scap] - 10https://gerrit.wikimedia.org/r/241061
[15:53:05] (03PS2) 10Alexandros Kosiaris: otrs: Install libyaml-libyaml-perl [puppet] - 10https://gerrit.wikimedia.org/r/241024
[15:53:12] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] otrs: Install libyaml-libyaml-perl [puppet] - 10https://gerrit.wikimedia.org/r/241024 (owner: 10Alexandros Kosiaris)
[16:05:12] (03PS1) 10Thcipriani: [WIP] Add optional canary deploy group and check [tools/scap] - 10https://gerrit.wikimedia.org/r/241066 (https://phabricator.wikimedia.org/T113073)
[16:08:18] (03PS12) 10Chad: A context manager for managing nested loggers [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4)
[16:08:21] (03PS1) 10Chad: Convert log.Timer and log.Stats to use context logger [tools/scap] - 10https://gerrit.wikimedia.org/r/241067
[16:09:58] (03CR) 10Chad: [C: 032] A context manager for managing nested loggers [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4)
[16:10:15] (03Merged) 10jenkins-bot: A context manager for managing nested loggers [tools/scap] - 10https://gerrit.wikimedia.org/r/239028 (owner: 1020after4)
[16:12:08] (03CR) 10Chad: [C: 032] Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 (owner: 10Chad)
[16:13:57] (03CR) 10Chad: [C: 032] Use context logger and stop passing one to sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/240126 (owner: 10Chad)
[16:14:04] (03CR) 10Chad: [C: 032] Convert log.Timer and log.Stats to use context logger [tools/scap] - 10https://gerrit.wikimedia.org/r/241067 (owner: 10Chad)
[16:16:54] !log cp1046 stopped icinga checks for hardware troubleshooting
[16:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:17:28] !log duplicating database otrs to otrsupgradetest for testing the upgrade procedure
[16:17:33] 288GB btw
[16:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:17:51] akosiaris: the database size?
[16:18:07] jynus: ^. I promise I'll be good and delete afterwards and revoke the grants
[16:18:26] JohnFLewis: the directory with all the .frm and .ibd files
[16:18:35] so in a sense, yes the database
[16:18:38] heh :)
[16:19:04] I really don't want to know how much it would be in a dump
[16:19:16] although it's usually less due to the indexes not being there
[16:22:42] wat?
[16:23:16] how are you duplicating that?
[16:23:41] isn't otrs in m-something?
[16:24:09] ^akosiaris
[16:24:58] (03PS8) 10Chad: Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959
[16:25:01] (03PS3) 10Chad: Use context logger and stop passing one to sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/240126
[16:25:04] (03PS1) 10Chad: Convert most of utils to use context loggers too! [tools/scap] - 10https://gerrit.wikimedia.org/r/241068
[16:25:10] jynus: mysqldump | mysql. m2
[16:25:37] jynus: I'll leave it running in a screen over the weekend
[16:25:42] I suppose it won't hurt ?
[16:25:47] it will
[16:25:52] unless
[16:25:52] sigh
[16:25:55] (03Merged) 10jenkins-bot: Convert log.Timer and log.Stats to use context logger [tools/scap] - 10https://gerrit.wikimedia.org/r/241067 (owner: 10Chad)
[16:25:57] --single-transaction
[16:26:35] yeah I got that
[16:26:48] check xtrabackup - you can clone a single database at binary-copy speed
[16:27:19] hmm I remember there is support for that in the backups that we do
[16:27:26] :-)
[16:27:33] 6operations, 10ops-codfw: wipe working spare disk in codfw - https://phabricator.wikimedia.org/T112783#1674995 (10Papaul) ssl2002 / DGF5YQ1 has a 500GB SATA 3.5" disk and I have a 160GB 3.5" disk that I can use for the test.
[16:27:39] not that we ever used it though
[16:27:53] it is not important for the backups, we are never going to use them
[16:28:06] but mysqldump will take 10 days to recover that
[16:28:20] if you had to do a logical dump
[16:28:26] use mydumper
[16:28:51] I showed it to otto the other day, when he did something similar
[16:28:58] pm for details
[16:29:23] even if it does not impact reads and writes, keeping a single transaction open for long affects the purging
[16:29:30] 6operations, 10ops-codfw: wipe working spare disk in codfw - https://phabricator.wikimedia.org/T112783#1675008 (10RobH) Perfect! Just unplug its production network connection while you do a USB stick install of an OS on the spare 160GB disk alone. Then you can do the test erase on the 160GB disk. I note unp...
[16:30:19] in any case, m2, you say?
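(The dump-and-reload being discussed - mysqldump piped into mysql with --single-transaction so InnoDB reads come from one consistent snapshot instead of locking the source tables - would look roughly like this. The host and schema names follow the m2/otrs example in the log but are illustrative, not a definitive procedure; note jynus's caveat that a long-lived single transaction affects InnoDB purging:)

```python
# Sketch of the dump-and-reload, assuming shell access to the m2 slave;
# host and schema names are illustrative.
import subprocess

HOST = 'db2011.codfw.wmnet'   # the m2 slave mentioned above

dump = subprocess.Popen(
    ['mysqldump', '--single-transaction', '--quick',
     '-h', HOST, 'otrs'],
    stdout=subprocess.PIPE)
# Feed the dump straight into the target schema.
load = subprocess.call(
    ['mysql', '-h', HOST, 'otrsupgradetest'],
    stdin=dump.stdout)
dump.stdout.close()
if dump.wait() != 0 or load != 0:
    raise RuntimeError('dump/reload failed')
```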
[16:31:28] you can abuse its slave, db2011
[16:31:41] I even have some spare machines on codfw to lend you
[16:33:36] (03PS1) 10Dduvall: Propagate structured logging from target to deploy host [tools/scap] - 10https://gerrit.wikimedia.org/r/241074 (https://phabricator.wikimedia.org/T113085)
[16:33:54] jynus: ok I'll abuse the slave in codfw then
[16:34:01] :-)
[16:39:16] PROBLEM - very high load average likely xfs on ms-be1005 is CRITICAL: CRITICAL - load average: 237.90, 164.90, 82.94
[16:44:32] (03PS1) 10Thcipriani: Fix log_context to detect positional argument [tools/scap] - 10https://gerrit.wikimedia.org/r/241076
[16:45:17] (03CR) 10Thcipriani: [C: 032] Fix log_context to detect positional argument [tools/scap] - 10https://gerrit.wikimedia.org/r/241076 (owner: 10Thcipriani)
[16:48:34] (03PS1) 10Florianschmidtwelzow: Use new page name for wmf release notes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241079 (https://phabricator.wikimedia.org/T67306)
[16:48:40] (03Merged) 10jenkins-bot: Fix log_context to detect positional argument [tools/scap] - 10https://gerrit.wikimedia.org/r/241076 (owner: 10Thcipriani)
[16:48:47] 6operations, 10Traffic, 7Pybal: pybal-related issue on host start can break service IPs... - https://phabricator.wikimedia.org/T113597#1675149 (10chasemp) I have seen `Memory allocation problem` when referencing a pool that doesn't exist, like on lvs1003: this works `ipvsadm -Ln -t 10.2.2.30:9200` but: (bad po...
[16:49:06] (03CR) 10Florianschmidtwelzow: "Follow up: Ib166b2114fd2f0779" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239752 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow)
[16:50:02] (03CR) 10Greg Grossmeier: [C: 031] Use new page name for wmf release notes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241079 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow)
[16:50:35] RECOVERY - Host cp1046 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms
[16:50:35] RECOVERY - IPsec on cp2015 is OK: Strongswan OK - 8 ESP OK
[16:50:36] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet has 1 failures
[16:50:44] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - 8 ESP OK
[16:50:45] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - 8 ESP OK
[16:50:55] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - 8 ESP OK
[16:50:56] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - 8 ESP OK
[16:51:01] (03CR) 10Greg Grossmeier: "I may have jumped the gun and already merged https://gerrit.wikimedia.org/r/241078 which is the counterpart to this change." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241079 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow)
[16:51:05] RECOVERY - IPsec on cp2003 is OK: Strongswan OK - 8 ESP OK
[16:51:25] RECOVERY - IPsec on cp2021 is OK: Strongswan OK - 8 ESP OK
[16:51:26] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - 8 ESP OK
I'm seeing old css/js in debug mode on T99096 [16:51:35] RECOVERY - IPsec on cp2009 is OK: Strongswan OK - 8 ESP OK [16:51:55] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - 8 ESP OK [16:52:05] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - 8 ESP OK [16:52:15] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - 8 ESP OK [16:53:09] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1675231 (10greg) No, Beta Cluster doesn't use wmfXX release branches. [16:54:47] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1675267 (10Krenair) I think it does. beta remains on php-master all the time and so this problem is m... [16:54:51] 6operations, 10ops-eqiad, 10Traffic: cp1046 is crashing and becoming unresponsive - https://phabricator.wikimedia.org/T113639#1675268 (10Cmjohnson) Steps Taken -removed all of B side Dimm and cpu -cleared log -rebooted - booted the kernel without any issues -Swapping cpu2 with cpu1 and booted without issue... [16:55:14] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a branch - https://phabricator.wikimedia.org/T99096#1675271 (10Krenair) [16:55:29] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a branch - https://phabricator.wikimedia.org/T99096#1675273 (10greg) Bah, you're right, I was conflating tasks in my head with another related one. [17:00:59] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1675313 (10Cmjohnson) Lvs1010 => asw2-a5 xe-0/0/11 is up. Verified light on the fiber and checked the sfp module; it was the wrong type. Replaced and confirmed a... [17:01:04] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a branch - https://phabricator.wikimedia.org/T99096#1675315 (10Jdlrobson) [17:02:27] (03CR) 10Chad: [C: 032] Convert most of utils to use context loggers too! [tools/scap] - 10https://gerrit.wikimedia.org/r/241068 (owner: 10Chad) [17:02:50] (03Merged) 10jenkins-bot: Convert most of utils to use context loggers too!
[tools/scap] - 10https://gerrit.wikimedia.org/r/241068 (owner: 10Chad) [17:11:02] robh / mutante: https://phabricator.wikimedia.org/P2093 a nice paste for you two to look at :) [17:16:14] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:20:29] !log reboot ms-be1005, xfs [17:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:21:55] PROBLEM - puppet last run on mw2067 is CRITICAL: CRITICAL: Puppet has 1 failures [17:24:44] RECOVERY - very high load average likely xfs on ms-be1005 is OK: OK - load average: 24.19, 8.74, 3.16 [17:25:57] 6operations, 10netops: setup new equinix out of band mgmt access - https://phabricator.wikimedia.org/T113771#1675419 (10RobH) 3NEW a:3Cmjohnson [17:26:57] !log restarting and upgrading db1051 mysql (depooled) [17:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:28:54] jynus: are the master IDs (server_id) globally unique or does each slave give the same master a different ID? [17:29:12] no [17:29:31] every id has to be unique if they are part of the same topology, or replication will not work [17:29:46] the algorithm used in puppet makes sure of that [17:30:16] there could be the same id on different replication hierarchies, but we do not do that for our own sanity [17:31:10] in other words, it is currently unique on all our infrastructure, and we intend to keep it like that [17:31:53] (independently of the slave) [17:33:04] right, so we just need to store in config (or cache) the single ID of db1052 to know what to select on any slave [17:34:29] mmmm [17:34:38] what if we failover? [17:35:07] well, I suppose it doesn't matter, we need a puppet change anyway to start pt-heartbeat there [17:35:49] (03CR) 10Chad: [C: 032] Propagate structured logging from target to deploy host [tools/scap] - 10https://gerrit.wikimedia.org/r/241074 (https://phabricator.wikimedia.org/T113085) (owner: 10Dduvall) [17:35:53] but yes, I have them on tendril easily queryable, if you need them [17:37:37] come to think of it, of course the IDs are unique, since the @@server_id query implies they are actual mysql server IDs and not pt specific [17:37:49] * AaronSchulz needs to fully wake up [17:37:51] I could return all of them, and in the special cases [17:38:04] where it has 2 masters, it may still be useful [17:38:21] that only happens on dbstore, labs and analytics [17:38:29] none of the main production servers [17:39:06] (03Merged) 10jenkins-bot: Propagate structured logging from target to deploy host [tools/scap] - 10https://gerrit.wikimedia.org/r/241074 (https://phabricator.wikimedia.org/T113085) (owner: 10Dduvall) [17:39:35] Aaron, I do not care too much about the details, we can figure it out later [17:39:41] I care more about the architecture [17:39:57] should this be agent-based, or on the standard mysql port? [17:40:32] I needed it separately because of the max_connections issue [17:40:52] what do you mean by agent based? [17:41:01] with a separate protocol [17:41:28] e.g. an http health check [17:41:58] possibly, will that already exist? [17:42:17] it is in the works, I linked it on the ticket [17:42:25] also, for the two masters case, MW knows which master has the DB it cares about, so it only needs 1 result [17:42:49] can MW see the heartbeat tables directly or does it not have access? [17:42:59] I just need to add the script to xinetd [17:43:07] I can check [17:43:09] (e.g.
direct sql) [17:43:15] probably it doesn't [17:43:20] that would be the KISS thing ;) [17:43:34] yes, and there it is the other problem [17:43:42] maybe it shouldn't be on a separate table [17:43:48] *database [17:44:15] but in __wmf_heartbeat, on enwiki [17:44:52] as I have now __wmf_checksums, outside of mediawiki [17:45:27] well MW assume lag is per-server for all DBs in the shard defined by wmf-config...there is no concept that per-DB parallel replication might cause different lags per DB. It also reuses connections in a way that assumes the lag is the same. [17:45:28] the question is I need it too, independently of the software, for other applications/ops checks [17:45:49] though whether it should keep assuming that could maybe be questioned [17:46:59] by "reuses connection" I mean USE to go from enwiki to frwiki or whatever [17:47:17] no attempt to recheck lag is done there [17:47:55] RECOVERY - puppet last run on mw2067 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:47:58] so given that, as long as wikiuser has read access, I think a separate DB seems fine [17:48:10] I can confirm it does not [17:48:15] (now) [17:48:35] it is trivial to add that permission to wikiuser [17:48:52] "trivial" [17:49:12] have to do a dangerous operation on all databases, but that is what I do everyday [17:50:03] so, in summary: add the half-working pt-heartbeat to all masters in puppet [17:50:21] add permission to wikiuser to read that table [17:50:56] and I think that's it [17:51:11] assuming there is a "get_lag()" function on mediawiki [17:52:30] also, I'd like to store the host => ID mapping in wmf-config (e.g. "db1052" => X) like we have the host => IP map already [17:53:31] why store it, if you can query it dynamically? [17:53:32] but yeah that's probably it, then I can write the new class to handle getLagTimes() [17:54:11] ahm they are different hosts [17:55:10] 6operations, 10netops: setup new equinix out of band mgmt access - https://phabricator.wikimedia.org/T113771#1675564 (10RobH) It seems that Equinx also requires that we have a turn up call for this link (the RT ticket 9557 has the call details). [17:55:23] I can query the master to get the ID and store it in apc. As long as it doesn't fall out too much. [17:55:37] no, that is ok [17:55:39] 6operations, 10netops: setup new equinix out of band mgmt access - https://phabricator.wikimedia.org/T113771#1675568 (10RobH) Once the cross-connect is done and Chris attaches it to mr1-eqiad, then we can schedule the turn up call. (Lame they cannot just enable it for us and go.) [17:56:11] the master query is literally SELECT @@server_id; [17:57:27] 6operations, 10Wikimedia-Mailing-lists, 7Documentation: Overhaul Mailman documentation - https://phabricator.wikimedia.org/T109534#1675582 (10JohnLewis) Wikitech documentation seems done to me now! Going to work on Meta-Wiki docs and see what I can spin up from there. [17:57:32] right, I just hate master queries (especially cross-DC), but with caching they will basically never happen so I guess that's fine [17:57:56] (03PS1) 10Andrew Bogott: Openstack: Mark failure of nova-network or nova-conductor as critical. [puppet] - 10https://gerrit.wikimedia.org/r/241101 [17:57:58] (03PS1) 10Andrew Bogott: Openstack: Add monitoring for the keystone service [puppet] - 10https://gerrit.wikimedia.org/r/241102 [17:58:27] Coren, yuvipanda, does one of you want to do the equivalent of ^^ for NFS checks? [17:59:07] Yeah, I'll do that now for most of them. 
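A rough shell-level sketch of the lag check being designed above, not the eventual MediaWiki code; the table layout follows pt-heartbeat's defaults and the host names are examples from the thread, so treat the details as assumptions:
```
# Fetch the master's server_id once; the thread above proposes caching this
# in APC so the master is almost never queried directly.
master_id=$(mysql -h db1052.eqiad.wmnet -BN -e 'SELECT @@server_id')

# On a slave, lag is the age of the newest heartbeat row written by that
# master (pt-heartbeat updates its heartbeat table roughly once a second).
mysql -h db1051.eqiad.wmnet -BN -e "
    SELECT TIMESTAMPDIFF(SECOND, ts, UTC_TIMESTAMP())
    FROM heartbeat.heartbeat
    WHERE server_id = ${master_id}"
```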
[17:59:30] thanks [18:04:00] !log performing schema change on m5-master "nova" [18:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:04:07] andrewbogott^ [18:04:19] jynus: fantastic, I’ll test [18:04:44] andrewbogott: afaict, the critical ones are already paging - nfs-on-labs-instances is the 'everything works fine' one and is set critical => true [18:05:15] Which puzzles me, actually - if the network was down that test should have failed too. [18:06:14] And it did fire, at [2015-09-25 11:04:47] [18:06:41] Man, we should really set up a page to go off if paging isn’t working [18:07:10] Coren: is it possible that that page is only sent to the three of us, and we all had paging disabled on account of nighttime? [18:07:12] I would have to do some trivial deployment on mediawiki for topology changes, but as there is a high chance of breaking recent changes, I will leave it for Monday [18:07:32] Why did that not page? [18:07:36] jynus: that fixed the problem. thanks! [18:08:37] they say it was not their fault [18:08:58] but 2 people missing the exact migration is too much of a coincidence [18:09:05] andrewbogott: If it did, then that's a serious issue. None of the labs infrastructure should page only us three. [18:09:06] anyway, fixed now [18:09:15] jynus: you mean the busted schema? I interpreted them as saying not that it wasn’t their fault but rather that all evidence had been destroyed :) [18:09:30] ah! [18:10:39] since the migration script has been refactored since I ran it [18:11:05] robh: Can you join Coren and me over here and help us understand why pages didn’t go off this morning? [18:11:07] well migration scripts are supposed to not be refactored, ever :-) [18:11:20] jynus: yes indeed [18:11:56] jynus: This is yet another case of the OpenStack people reacting in surprise that anyone is using their software in production. The devs just build a fresh cloud every day and are kind of surprised that I don’t. [18:17:00] huh, I don't even have to touch LoadMonitorMySQL really [18:17:45] what does LoadMonitorMySQL do? [18:18:40] I see [18:18:46] but that is an issue [18:18:57] I do not know if you are working on that too [18:19:20] but MySQL Load == lag is a very bad model [18:20:32] there is T112541 [18:21:03] LoadMonitor has a scaleLoads() method but we don't use it [18:21:14] it used to check for active connection count in the old days [18:21:15] and I commented that it should be a function of lag, # of connections and query latency [18:21:34] we only use it to avoid lagged slaves for consistency reasons [18:22:04] it is ok, but even if it is implemented somewhere else [18:22:20] at some point, we can reach max_connections [18:22:43] is there a phab task for revamping LoadMonitor to weight these factors?
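(The answer follows below. As a purely hypothetical illustration of the weighting idea just raised - lag, connection count and query latency folded into one score - here is a shell sketch; the weights are invented and the latency term is left out for brevity, so this is not MediaWiki's LoadMonitor:)
```
# Hypothetical composite load score for one slave, combining two of the
# three factors mentioned in the thread.
host=db1051.eqiad.wmnet
lag=$(mysql -h "$host" -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master/ {print $2}')
conns=$(mysql -h "$host" -BN -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'" | awk '{print $2}')
# Made-up weights, for illustration only; query latency sampling is omitted.
awk -v l="$lag" -v c="$conns" 'BEGIN { printf "score: %.2f\n", 10 * l + c / 50 }'
```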
[18:23:04] more or less the one I just wrote up here [18:23:28] (also note that we don't use lag as a weight, just as a threshold for preferred selection) [18:23:36] yes [18:23:54] that is why I would like to unify all that [18:23:57] we don't even have a load model at all, just saying ;) [18:24:07] both health check and load [18:24:25] but anyway [18:24:43] we can have your ticket very soon, no problem with that [18:24:53] 6operations: Quoted booleans probably stopping a lot of pages - https://phabricator.wikimedia.org/T113781#1675710 (10Andrew) 3NEW a:3Andrew [18:25:02] I am just brainstorming a different connection model [18:25:45] but based on specific issues we had, not on nothing [18:26:00] those issues, happily, did not include lagging [18:26:08] so not a factor for that [18:26:19] 6operations, 10Continuous-Integration-Infrastructure: Forbid quoted booleans in puppet manifests - https://phabricator.wikimedia.org/T113783#1675730 (10Andrew) 3NEW [18:30:18] 6operations, 10Wikimedia-Mailing-lists, 7Documentation: Overhaul Mailman documentation - https://phabricator.wikimedia.org/T109534#1675755 (10JohnLewis) Meta-Wiki is done. The blocker for this now (in my opinion) is proposing a new https://meta.wikimedia.org/wiki/Mailing_lists/Standardization and then putti... [18:34:45] 6operations, 10Continuous-Integration-Config: Forbid quoted booleans in puppet manifests - https://phabricator.wikimedia.org/T113783#1675765 (10hashar) [18:36:11] arg. and now a disk from db1051 fails? [18:37:31] 6operations, 10Continuous-Integration-Config: Forbid quoted booleans in puppet manifests - https://phabricator.wikimedia.org/T113783#1675776 (10hashar) Seems like we want to enable the relevant puppet-lint check http://puppet-lint.com/checks/quoted_booleans/ Unfortunately some of our classes expect quoted boo... [18:38:13] 6operations, 10Beta-Cluster, 7Shinken: Make the Shinken IRC alert bot use colors - https://phabricator.wikimedia.org/T113785#1675779 (10greg) 3NEW [18:39:02] (03PS1) 10Hashar: puppet-lint: enable quoted_booleans-check [puppet] - 10https://gerrit.wikimedia.org/r/241111 (https://phabricator.wikimedia.org/T113783) [18:40:24] 6operations, 10Continuous-Integration-Config, 5Patch-For-Review: Forbid quoted booleans in puppet manifests - https://phabricator.wikimedia.org/T113783#1675795 (10hashar) Example: https://integration.wikimedia.org/ci/job/operations-puppet-puppetlint-strict/29104/console [18:41:38] 6operations, 10ops-eqiad: db1051 degraded raid (disk) - https://phabricator.wikimedia.org/T113786#1675797 (10jcrespo) 3NEW [18:43:05] (03PS2) 10Dzahn: graphite: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/240942 [18:43:59] ACKNOWLEDGEMENT - RAID on db1051 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo T113786 [18:44:10] (03CR) 10Andrew Bogott: "I love this but need to do a lot of fixing first" [puppet] - 10https://gerrit.wikimedia.org/r/241111 (https://phabricator.wikimedia.org/T113783) (owner: 10Hashar) [18:46:36] (03CR) 10Dzahn: "there are quite a few other checks disabled that we can maybe enable already or are closer to being fixed globally.
for example, how about" [puppet] - 10https://gerrit.wikimedia.org/r/241111 (https://phabricator.wikimedia.org/T113783) (owner: 10Hashar) [18:47:23] (03CR) 10Dzahn: [C: 032] graphite: make compatible with Apache 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/240942 (owner: 10Dzahn) [18:48:15] 6operations: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1675811 (10Krenair) [18:48:46] 6operations, 10Continuous-Integration-Config, 5Patch-For-Review: Forbid quoted booleans in puppet manifests - https://phabricator.wikimedia.org/T113783#1675815 (10greg) Only the quoted_booleans warnings (71 of 'em): {P2094} [18:50:22] 6operations: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1675816 (10Krenair) It's possible that this was what broke in https://gerrit.wikimedia.org/r/#/c/232675/ and then got fixed in https://gerrit.wikimedia.org/r/#/c/24... [18:50:31] (03PS1) 10Chad: Comment typofix [tools/scap] - 10https://gerrit.wikimedia.org/r/241114 [18:51:45] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1675820 (10Cmjohnson) lvs1007 eth1 => asw-b xe-5/1/0 (3934) lvs1008 eth1 => asw-b xe-5/1/2 (3935) lvs1010 eth1 => asw-b xe-6/1/0 (3936) lvs1011 eth1 => asw-b xe... [18:52:53] (03CR) 10Dzahn: "so you want these to start sending SMS?" [puppet] - 10https://gerrit.wikimedia.org/r/241101 (owner: 10Andrew Bogott) [18:53:13] 6operations, 10CirrusSearch, 6Discovery, 7Documentation: Decide on and document the implementation for multi data centre CirrusSearch - https://phabricator.wikimedia.org/T105708#1675828 (10chasemp) [18:53:14] 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1675829 (10chasemp) [18:53:15] 6operations: Evaluate traffic flow between the Jobrunners and the Cirrus cluster - https://phabricator.wikimedia.org/T105705#1675826 (10chasemp) 5Open>3Resolved I don't think there is more to do on this task [18:56:19] (03CR) 10Dzahn: "looks right technically if that's the goal. 
but before we all get them let us know how to react and what it typically breaks" [puppet] - 10https://gerrit.wikimedia.org/r/241101 (owner: 10Andrew Bogott) [19:02:05] (03CR) 10Dzahn: "lgtm, i'd just recommend to start without paging and confirm it's all good in the web ui and then switch on the paging" [puppet] - 10https://gerrit.wikimedia.org/r/241102 (owner: 10Andrew Bogott) [19:02:55] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [19:05:21] (03CR) 10Dzahn: [C: 031] "i heard no complaints about "wikiartpedia" domains being parked, add "visualwikipedia" now" [dns] - 10https://gerrit.wikimedia.org/r/197362 (owner: 10Dzahn) [19:07:28] !log configuring and enabling lvs-cross-row ports on asw-b-eqiad for lvs1007,8,10,11 [19:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:07:54] (03PS1) 10Dzahn: park border-wikipedia.de [dns] - 10https://gerrit.wikimedia.org/r/241122 [19:09:15] (03PS2) 10Dzahn: park border-wikipedia.de [dns] - 10https://gerrit.wikimedia.org/r/241122 [19:11:19] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1675859 (10BBlack) ^ above works for lvs1010 and lvs1011 eth2's [19:13:25] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1675866 (10BBlack) Current state: lvs1010, lvs1011, lvs1012: Aside from the row D issues in T112781, the other 3 ports seem correctly configured and working. T... [19:14:05] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:15:12] (03PS1) 10Dzahn: park softwarewikipedia domains [dns] - 10https://gerrit.wikimedia.org/r/241123 [19:23:02] hashar: https://gerrit.wikimedia.org/r/#/c/240936/ [19:23:38] mutante: Guten Tag :-) [19:24:12] hashar: Comment ça va ? [19:24:23] mutante: yeah seen that one passing, I have no clue whether that ifversion will work [19:24:37] it works on others [19:25:12] for example just merged that for graphite and before for others [19:25:14] lets try now ? 
i'm taking a larger change apart [19:25:24] because nobody wanted to merge it all at once [19:25:29] ok:) [19:25:32] yeah that makes sense [19:25:44] (03CR) 10Hashar: [C: 031] "Lets try :-}" [puppet] - 10https://gerrit.wikimedia.org/r/240936 (owner: 10Dzahn) [19:25:51] you can restart apache as needed [19:25:57] (03PS3) 10Dzahn: contint: update Apache config for 2.4 syntax [puppet] - 10https://gerrit.wikimedia.org/r/240936 [19:26:08] the only visible impacts would be the Zuul status page ( https://integration.wikimedia.org/zuul/ ) not critical [19:26:10] gallium or anything else that i forget [19:26:24] and the Jenkins web ui https://integration.wikimedia.org/ci/ [19:26:26] so nothing critical [19:26:30] gallium only [19:26:31] good,ok [19:26:41] (03CR) 10Dzahn: [C: 032] contint: update Apache config for 2.4 syntax [puppet] - 10https://gerrit.wikimedia.org/r/240936 (owner: 10Dzahn) [19:26:42] nowadays gallium hosts the servers (Jenkins/Zuul) [19:26:48] all the jobs are on labs instances [19:27:11] there is another service (Nodepool) but it uses different manifest / role etc so that is really a different service [19:27:29] the production slave lanthanum got removed :} [19:27:40] running it on gallium [19:27:50] ah, see, i still thought about lanthanum then [19:27:51] ok [19:28:33] yup so nowadays there is only a handful of jobs running on prod on gallium. Mostly to publish stuff to doc/integration.wikimedia.org sites [19:28:50] until we figure out a better alternative [19:29:42] ok, so apache2ctl configtest [19:29:44] Syntax OK [19:29:47] \O/ [19:29:56] restart restart! [19:29:59] unrelated warning: [19:30:06] Useless use of AllowOverride in line 20 [19:30:11] in 50-server-status.conf [19:30:11] heh [19:30:31] I have seen a patch about /status recently, I think by ori [19:30:31] !log restarted apache on gallium [19:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:38] not a big deal apparently [19:30:40] yes, that's what my "heh" referred to [19:30:50] well, it kind of was but it's fixed [19:31:13] yea, this warning is not important [19:31:33] 6operations, 10CirrusSearch, 6Discovery, 5codfw-rollout: Implement multi-DC support in CirrusSearch - https://phabricator.wikimedia.org/T105709#1675938 (10chasemp) [19:31:34] 6operations, 6Discovery, 5codfw-rollout: [EPIC] Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1675939 (10chasemp) [19:31:36] 6operations, 10CirrusSearch, 6Discovery, 7Documentation: Decide on and document the implementation for multi data centre CirrusSearch - https://phabricator.wikimedia.org/T105708#1675935 (10chasemp) 5Open>3Resolved a:3chasemp The scheme has been added as a section here https://wikitech.wikimedia.org/w... [19:31:44] mutante: congratulations! [19:33:14] (03PS10) 10Dzahn: [WIP] Update apache rules for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [19:33:18] hashar: :) thanks.
i'm just taking apart that ^ [19:33:40] it used to be huge in the beginning [19:33:50] the goal is to rebase it into nothing [19:34:23] mutante: limn I am wondering whether we are still using it [19:35:06] exactly https://gerrit.wikimedia.org/r/#/c/240941/ [19:35:08] got introduced 3 years ago by the "old" analytics team ( https://blog.wikimedia.org/2012/07/25/meet-the-analytics-team/ ) [19:35:15] https://www.mediawiki.org/wiki/Analytics/Limn "slowly dieing" [19:35:48] schroedinger's service :) [19:35:49] might be worth pinging milimetric to ask whether limn can be shot / removed etc [19:36:09] ok, adding him [19:36:54] (03CR) 10Dzahn: "@milimetric ok? unless limn is not used anymore, then we would remove it i suppose" [puppet] - 10https://gerrit.wikimedia.org/r/240941 (owner: 10Dzahn) [19:37:43] 6operations, 10Wikimedia-General-or-Unknown: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1675947 (10Nemo_bis) [19:40:09] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1675969 (10brion) [19:41:51] 6operations, 10Wikimedia-Mailing-lists, 7Documentation: Overhaul Mailman documentation - https://phabricator.wikimedia.org/T109534#1675977 (10JohnLewis) 5Open>3Resolved and with https://meta.wikimedia.org/wiki/Mailing_lists/Standardization I am going to call this closed! (a documentation ticket actually... [19:42:00] robh mutante ^^ :D [19:42:38] another task bits the dust [19:42:42] bites even. [19:42:52] JohnFLewis: very nice!:) [19:42:54] must be friday, i've lost the ability to type [19:43:16] robh: heard about the rename script? [19:43:35] the docs to rename a list are just a little bit shorter now [19:43:50] yea you linked me before i like [19:43:53] not sure if we want to rename all -l though :) [19:43:59] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1676000 (10brion) [19:44:03] on this note: https://meta.wikimedia.org/wiki/Talk:Mailing_lists/Standardization -- all comments welcome about building a standardization policy (for users AND ops) [19:44:12] i wouldnt just rename it unless they ask and i wouldnt make a big deal about it [19:44:17] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1660709 (10brion) [19:44:18] emphasis to show it is relevant :) [19:44:23] unless they are actively hating -l there is no reason to force a rename on exisitn things [19:44:24] imo [19:44:27] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1660709 (10brion) [19:44:30] (though its just that, opinion!) [19:45:14] robh: yeah. we can work away from -l secretively when people poke us for really misc things anyways ;) [19:46:31] mutante: thank you for the ping about gallium :-} [19:46:53] and i love that renaming is scripted [19:47:00] no more one off mistakes [19:48:04] robh: john: agreed about the renaming , yea [19:48:34] robh: :) yea, exactly. that showed during the migration [19:48:48] hashar: welcome [19:48:55] *T* list names are hell [19:49:18] never capitalise things with mailman! 
it's too hard for it to understand :) [19:52:58] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1676032 (10brion) @Joe I still see a lot of failures, but now they come with a giant WMF error page: ``` 2015-09-25T19:50:40+0000: Runner loop 0 process in slot 3 gave... [19:54:29] 6operations: Rename "Dzahn" to "Daniel Zahn" in Gerrit - https://phabricator.wikimedia.org/T113792#1676037 (10Dzahn) 3NEW [19:55:15] 6operations: Rename "Dzahn" to "Daniel Zahn" in Gerrit - https://phabricator.wikimedia.org/T113792#1676048 (10Dzahn) p:5Triage>3Low [19:56:00] mutante: iirc it requires an LDAP rename [19:56:15] 6operations, 10Gerrit: Rename "Dzahn" to "Daniel Zahn" in Gerrit - https://phabricator.wikimedia.org/T113792#1676057 (10Krenair) [19:57:22] (03CR) 10Dzahn: "Gergő, i kept splitting this into smaller tasks and use this one as a todo list. until it rebases into nothing or just the mediawiki chang" [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [19:58:23] valhallasw`cloud: yea, afair Gerrit and LDAP and wikitech .. and X [19:58:40] that's why i always waited [20:03:09] 6operations, 10Wikimedia-General-or-Unknown: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1676112 (10Krenair) 5Open>3Resolved a:3Krenair ```USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND w... [20:07:34] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1676135 (10brion) Hmm maxtime=60 ? Do we really want that in the URL? :) [20:09:15] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1676146 (10Paladox) Seems that https://commons.wikimedia.org/wiki/File:Wikimania_2014_-_Technology_VI_-_Views_-_FastCCI.webm and https://commons.wikimedia.org/wiki/File... [20:09:33] 6operations, 10MediaWiki-Debug-Logger: Set up a service IP for logstash - https://phabricator.wikimedia.org/T113104#1676147 (10Umherirrender) [20:10:55] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1676161 (10brion) @Paladox yes, those are linked on the duplicate bug report. [20:11:56] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1676162 (10Paladox) Ok. but I mean I have re run them and still taking a while. [20:14:21] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1676167 (10brion) @Paladox please stop re-running transcodes; it interferes with our ability to track what's going on and fix the problem to have other people resetting... [20:17:25] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1676169 (10Paladox) Oh sorry I didn't know I shoulden have done that sorry. [20:17:26] 6operations, 10ops-codfw: wipe working spare disk in codfw - https://phabricator.wikimedia.org/T112783#1676170 (10Papaul) After installation {F2635025} After Erase {F2635027} it took less than a minute to erase the data from the disk. Not only the data was erased from the disk but the process also damaged the... 
[20:26:46] (03CR) 10Ori.livneh: [C: 032] Remove wikiversions.cdb code now that we use wikiversions.php [tools/scap] - 10https://gerrit.wikimedia.org/r/241061 (owner: 10Chad) [20:26:57] (03CR) 10Andrew Bogott: "> so you want these to start sending SMS?" [puppet] - 10https://gerrit.wikimedia.org/r/241101 (owner: 10Andrew Bogott) [20:27:02] (03CR) 10Ori.livneh: [C: 032] Remove PHP localization cache code [tools/scap] - 10https://gerrit.wikimedia.org/r/240440 (owner: 10Chad) [20:27:22] (03Merged) 10jenkins-bot: Remove PHP localization cache code [tools/scap] - 10https://gerrit.wikimedia.org/r/240440 (owner: 10Chad) [20:27:25] (03Merged) 10jenkins-bot: Remove wikiversions.cdb code now that we use wikiversions.php [tools/scap] - 10https://gerrit.wikimedia.org/r/241061 (owner: 10Chad) [20:29:06] thcipriani: I'll merge https://gerrit.wikimedia.org/r/#/c/232843/ and babysit in prod if you rebase [20:31:47] ori: kk, rebasing, and double-checking. Thanks! [20:32:15] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 42 [20:33:21] (03PS1) 10Ori.livneh: apache: Remove useless use of AllowOverride directive from status.conf [puppet] - 10https://gerrit.wikimedia.org/r/241156 [20:33:35] (03CR) 10Ori.livneh: [C: 032 V: 032] apache: Remove useless use of AllowOverride directive from status.conf [puppet] - 10https://gerrit.wikimedia.org/r/241156 (owner: 10Ori.livneh) [20:33:48] heh [20:33:53] in queue :) [20:34:12] it's in your queue to check the queue?:) [20:34:30] it's the "in" queue then :) [20:34:46] hah, ok:) [20:35:58] a lot of emails in the past 2 minutes, might be a good time to try and evaluate what an appropriate value for the check is actually [20:36:12] yea, how much is it [20:36:41] (03Abandoned) 10Ori.livneh: Add simple haveged module; apply on fermium [puppet] - 10https://gerrit.wikimedia.org/r/231973 (https://phabricator.wikimedia.org/T82576) (owner: 10Ori.livneh) [20:36:44] ori, did you leave a cherry-pick of an operations/puppet commit on beta earlier? [20:37:10] Krenair: not in the past few days [20:37:21] about a week ago, I think, I did cherry-pick one [20:38:25] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:38] ori, not "update mod_status configuration"? [20:38:56] PROBLEM - puppet last run on mc2010 is CRITICAL: CRITICAL: puppet fail [20:38:57] no [20:39:02] though that is my patch [20:39:05] i did not use it on beta [20:39:08] (03PS1) 10Smalyshev: Enable A/B test for combined language search. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/241226 (https://phabricator.wikimedia.org/T3837) [20:39:12] maybe mutante or hashar? 
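A quick way to answer the cherry-pick question above, sketched under the assumption that the beta puppetmaster keeps its operations/puppet checkout under /var/lib/git (the path pattern used elsewhere in this log):
```
# On the beta puppetmaster: list any local commits sitting on top of the
# upstream production branch, i.e. cherry-picks that were never merged.
cd /var/lib/git/operations/puppet
git fetch origin
git log --oneline origin/production..HEAD
```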
[20:39:26] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: puppet fail [20:39:35] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 42 [20:39:46] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:48] (03PS10) 10Thcipriani: Add deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) [20:39:55] PROBLEM - puppet last run on strontium is CRITICAL: CRITICAL: Puppet has 2 failures [20:39:55] PROBLEM - puppet last run on mw2007 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:55] PROBLEM - puppet last run on mw2063 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:06] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:16] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:17] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:40:17] PROBLEM - puppet last run on mw1236 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:25] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Puppet has 2 failures [20:40:25] PROBLEM - puppet last run on mw2128 is CRITICAL: CRITICAL: Puppet has 3 failures [20:40:33] palladium had a hiccup [20:40:39] puppet failures appear to be ephemeral [20:41:41] eh, i did not do things on beta with that patch [20:41:54] thanks for checking palladium [20:42:00] 12:30 I have seen a patch about /status recently, I think by ori [20:42:06] RECOVERY - puppet last run on mw1236 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:42:06] maybe hashar was doing something with it? [20:42:52] since we can confirm it's merged in production though [20:42:58] it should also be on beta [20:43:34] RECOVERY - puppet last run on mw2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:43:36] * ori nods [20:43:36] ori: rebased, although I realized I need to put a key into puppet/private files in order for puppet not to fail. [20:43:44] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:43:46] thcipriani: np, i can do that [20:43:54] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:44:04] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [20:44:04] RECOVERY - puppet last run on mw2128 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [20:44:33] thcipriani: what is the qualified name of the key? i don't see it in the patch [20:45:06] key_file => 'servicedeploy_rsa', ? [20:45:44] 6operations, 10Gerrit: Rename "Dzahn" to "Daniel Zahn" in Gerrit - https://phabricator.wikimedia.org/T113792#1676462 (10hashar) **TL;DR: have your wikitech account renamed to "Daniel Zahn" to update your LDAP `cn` which is used as the displayed name in Gerrit.** ----- + @laner who wrote the LdapAuthenticati... 
[20:45:51] yep, that's it [20:45:58] !log starting a Cassandra repair on xeon (nodetool repair -pr) [20:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:06] ori: should just need to go in: /var/lib/git/labs/private/files/ssh/tin/servicedeploy_rsa [20:46:25] 6operations, 10Gerrit: Rename "Dzahn" to "Daniel Zahn" in Gerrit - https://phabricator.wikimedia.org/T113792#1676471 (10hashar) May I say RFC 4519 (http://tools.ietf.org/html/rfc4519) does not have a field for shellAccount or wikiAccount ? But then IANALdap guy. [20:46:25] er, puppet/private (I guess) [20:47:02] mutante: I have replied on your Gerrit name issue. In short gotta rename your wiki account on wikitech to change the LDAP 'cn'. The long read is at https://phabricator.wikimedia.org/T113792#1676462 [20:47:23] thcipriani: i can generate a private key file, but we will need to update the key fingerprint in your patch [20:47:40] hashar: thank you :) [20:48:11] ori: that's fine if you want to send me that info. I could gpg-encrypt the private-key I have that matches the patch. Either way. [20:54:59] (03CR) 10Milimetric: [C: 031] "Limn is still used. It turns out "slowly dying" meant *really* slow. Thanks for the update to the templates." [puppet] - 10https://gerrit.wikimedia.org/r/240941 (owner: 10Dzahn) [20:58:53] (03PS2) 10Dzahn: limn: make compatible with Apache 2.4 syntax [puppet] - 10https://gerrit.wikimedia.org/r/240941 [20:59:08] (03CR) 10Milimetric: "Any other blockers on this or can we merge it? The code we want to deploy to it is in fairly good shape, we'd like to get started testing" [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [20:59:58] (03CR) 10Dzahn: [C: 032] limn: make compatible with Apache 2.4 syntax [puppet] - 10https://gerrit.wikimedia.org/r/240941 (owner: 10Dzahn) [21:02:42] (03PS11) 10Ori.livneh: Add deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [21:02:47] thcipriani: updated, could you double check? 
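A sketch of the key handoff being arranged above; the file name and repo path come from the thread, while the key parameters are assumptions:
```
# Generate the service deploy keypair (ssh-keygen prompts for a passphrase),
# then print the public-key fingerprint that the puppet patch references.
ssh-keygen -t rsa -b 4096 -f servicedeploy_rsa -C servicedeploy
ssh-keygen -lf servicedeploy_rsa.pub

# The private half then lands in the private puppet repo, e.g.:
#   /var/lib/git/labs/private/files/ssh/tin/servicedeploy_rsa
```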
[21:02:52] * thcipriani looks [21:03:39] (03CR) 10jenkins-bot: [V: 04-1] Add deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [21:04:56] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [21:05:06] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [21:05:24] RECOVERY - puppet last run on strontium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:06:16] RECOVERY - puppet last run on mc2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:07:15] RECOVERY - puppet last run on mw2063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:07:22] ori: the change to a file seems fine, lgtm (other than the missing '$') [21:10:39] (03PS11) 10Dzahn: [WIP] Update apache rules for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [21:11:20] (03PS12) 10Dzahn: mediawiki: update apache rules for 2.4 [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [21:11:50] (03CR) 10Dzahn: [C: 031] "just mediawiki module left (and things that are lowercase maybe), so renamed this" [puppet] - 10https://gerrit.wikimedia.org/r/225552 (owner: 10Gergő Tisza) [21:14:05] (03PS12) 10Thcipriani: Add deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) [21:30:47] err WTH [21:31:11] hhvm.log is overflowing with crap like Sep 25 21:25:56 mw1020: #012Warning: Unknown modifier '\': [([^\s,]+)\s*=\s*([^\s,]+)[\+\-]] [21:32:35] isn't there a task for that? [21:32:59] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: automated invocation of Cassandra repair jobs - https://phabricator.wikimedia.org/T92355#1106995 (10GWicke) [21:33:14] Yup [21:33:49] yes [21:34:00] https://phabricator.wikimedia.org/T112922 [21:34:26] as per usual, _joe__ is the one with the answers [21:34:45] "php sucks" [21:34:49] heh [21:36:11] Reedy: apparently HHVM has a patch for LLVM so you can write byte code instead ? [21:38:29] _joe__: well done :-) [21:38:33] have a good week-end [21:39:08] ori: the change to a file seems fine, lgtm (other than the missing '$') -- missing where? [21:39:33] ori: I updated the patch, fine now [21:39:39] ah ok :) [21:41:53] (03PS13) 10Ori.livneh: Add deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [21:42:01] (03CR) 10Ori.livneh: [C: 032 V: 032] Add deploy-service user [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [21:46:24] thcipriani: I forced a run on restbase1001 and on tin. no failures or anything or either host, but on tin the servicedeploy_rsa key was not provisioned [21:46:49] hmmm [21:49:28] (03PS1) 10Dzahn: mailman: raise queue monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/241232 [21:49:48] that's strange, well, I'm in the new keyholder group, so that's good. 
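The arming that follows on tin and mira looks roughly like this on the deploy host; the arm subcommand is quoted by the icinga check itself, and the passphrase location is given a few lines down:
```
# Load the service key into keyholder's shared ssh-agent; it prompts for the
# key's passphrase (iron:/srv/passwords/services-deployment-key-passphrase
# per the thread).
sudo keyholder arm
# A status subcommand (if present in this version of the tool) reports
# whether all configured keys are armed.
sudo keyholder status
```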
[21:50:43] thcipriani: never mind, it worked [21:50:48] i am an idiot and can't read [21:51:27] heh, cool, I was in the middle of typing, "I can't imagine why it didn't work" [21:51:30] (03PS2) 10Dzahn: mailman: raise queue monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/241232 [21:51:52] !log Armed new servicedeploy_rsa on tin [21:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:52:08] (03CR) 10John F. Lewis: [C: 031] mailman: raise queue monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/241232 (owner: 10Dzahn) [21:52:25] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [21:52:32] * ori already on it :) [21:52:47] (03CR) 10Dzahn: [C: 032] mailman: raise queue monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/241232 (owner: 10Dzahn) [21:53:06] !log Armed new servicedeploy_rsa on mira [21:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:53:21] :) [21:54:15] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [21:55:00] thcipriani: could i ask you to do one small follow-up task? update https://wikitech.wikimedia.org/wiki/Keyholder to mention the new key and passphrase. the passphrase is iron:/srv/passwords/services-deployment-key-passphrase [21:55:23] and possibly email ops@ about it [21:55:29] ori: absolutely, thanks for the merge! [21:55:35] np, thanks for the awesome work [21:58:00] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Dan Foy - https://phabricator.wikimedia.org/T113324#1676694 (10DFoy) @coren : My public key is at https://office.wikimedia.org/wiki/User:Dfoy . I've signed the 'L3' document, and the DFoy wikitech user is mine. [22:01:27] ori: thcipriani well done [22:01:49] yes, i certainly did an amazing job at pressing that button [22:01:54] thcipriani was only nominally involved [22:02:08] ori: you offered to babysit/debug, thank you :) [22:02:20] thcipriani writes nice patches :) [22:02:25] +1 [22:02:30] ori: You could've asked Paladox to rebase it for you [22:02:33] aww, you guys. [22:05:25] Reedy: I could have, very true [22:06:00] friday love fests I like [22:10:24] (03PS2) 10Andrew Bogott: Openstack: Mark failure of nova-network or nova-conductor as critical. [puppet] - 10https://gerrit.wikimedia.org/r/241101 [22:11:34] (03CR) 10Andrew Bogott: [C: 032] Openstack: Mark failure of nova-network or nova-conductor as critical. [puppet] - 10https://gerrit.wikimedia.org/r/241101 (owner: 10Andrew Bogott) [22:12:41] all right ya’ll, here it comes [22:13:18] hm… apparently gerrit won’t accept a 12-patch set. [22:13:25] Anticlimax [22:13:44] Is that like spam prevention? [22:14:08] 12-patch ? [22:16:04] !log sodium - shutting down [22:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:16:52] !! [22:16:53] (03PS2) 10Andrew Bogott: Openstack: Add monitoring for the keystone service [puppet] - 10https://gerrit.wikimedia.org/r/241102 [22:16:55] (03PS1) 10Andrew Bogott: Rsync: Unquote booleans [puppet] - 10https://gerrit.wikimedia.org/r/241235 (https://phabricator.wikimedia.org/T113783) [22:16:57] (03PS1) 10Andrew Bogott: Logstash: Fixed a quoted boolean in a code comment. 
[puppet] - 10https://gerrit.wikimedia.org/r/241236 (https://phabricator.wikimedia.org/T113783) [22:16:58] woooo [22:16:59] (03PS1) 10Andrew Bogott: interface: dequote booleans [puppet] - 10https://gerrit.wikimedia.org/r/241237 (https://phabricator.wikimedia.org/T113783) [22:17:01] (03PS1) 10Andrew Bogott: Cassandra: dequote some booleans. [puppet] - 10https://gerrit.wikimedia.org/r/241238 (https://phabricator.wikimedia.org/T113783) [22:17:03] (03PS1) 10Andrew Bogott: Grafana: dequote booleans. [puppet] - 10https://gerrit.wikimedia.org/r/241239 (https://phabricator.wikimedia.org/T113783) [22:17:05] (03PS1) 10Andrew Bogott: Gerrit role: dequote booleans [puppet] - 10https://gerrit.wikimedia.org/r/241240 (https://phabricator.wikimedia.org/T113783) [22:17:07] (03PS1) 10Andrew Bogott: Mark salt grain bool values with # lint:ignore:quoted_booleans [puppet] - 10https://gerrit.wikimedia.org/r/241241 (https://phabricator.wikimedia.org/T113783) [22:17:09] (03PS1) 10Andrew Bogott: webserver::php5 unquote a boolean. [puppet] - 10https://gerrit.wikimedia.org/r/241242 (https://phabricator.wikimedia.org/T113783) [22:17:11] (03PS1) 10Andrew Bogott: Webserver ca: disable the quoted-bool lint check [puppet] - 10https://gerrit.wikimedia.org/r/241243 (https://phabricator.wikimedia.org/T113783) [22:17:13] (03PS1) 10Andrew Bogott: Diamond: Turn off lint check for quoted bools. [puppet] - 10https://gerrit.wikimedia.org/r/241244 (https://phabricator.wikimedia.org/T113783) [22:17:15] (03PS1) 10Andrew Bogott: Disable quoted_boolean lint check around is_virtual refs. [puppet] - 10https://gerrit.wikimedia.org/r/241245 (https://phabricator.wikimedia.org/T113783) [22:17:17] (03PS1) 10Andrew Bogott: Root out a long chain of quoted bools in nagios/icinga/nrpe [puppet] - 10https://gerrit.wikimedia.org/r/241246 (https://phabricator.wikimedia.org/T113783) [22:17:19] there we go! [22:17:21] no more lucid :D [22:17:50] nice guys [22:18:07] 6operations, 10Wikimedia-Mailing-lists: Upgrade to Mailman 3.0 - https://phabricator.wikimedia.org/T97492#1676766 (10Dzahn) [22:18:08] \o/ [22:18:09] 6operations, 7Mail: Upgrade Exim to >=4.73 - https://phabricator.wikimedia.org/T83541#1676770 (10Dzahn) [22:18:10] 6operations: mailman - replace lighttpd - https://phabricator.wikimedia.org/T84053#1676768 (10Dzahn) [22:18:12] 6operations: Get rid of all Ubuntu Lucid (10.04) installs - https://phabricator.wikimedia.org/T80945#1676771 (10Dzahn) [22:18:14] 6operations: shutdown sodium after mailman has migrated to jessie VM - https://phabricator.wikimedia.org/T82698#1676763 (10Dzahn) 5Open>3Resolved root@sodium:/backup# shutdown -h now W: molly-guard: SSH session detected! Please type in hostname of the machine to shutdown: sodium Broadcast message from dzahn... [22:18:18] mutante: spammer [22:18:19] 6operations: Get rid of all Ubuntu Lucid (10.04) installs - https://phabricator.wikimedia.org/T80945#1676772 (10Dzahn) 5Open>3Resolved [22:18:45] Reedy: it's all automatic phabricator wizardry :) [22:19:01] just shows how much dependency, heh [22:19:53] (03PS2) 10Smalyshev: Enable A/B test for combined language search. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/241226 (https://phabricator.wikimedia.org/T3837) [22:20:27] (03PS3) 10Andrew Bogott: Openstack: Add monitoring for the keystone service [puppet] - 10https://gerrit.wikimedia.org/r/241102 [22:21:56] (03CR) 10Andrew Bogott: [C: 032] Openstack: Add monitoring for the keystone service [puppet] - 10https://gerrit.wikimedia.org/r/241102 (owner: 10Andrew Bogott) [22:22:29] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Decommission sodium - https://phabricator.wikimedia.org/T110142#1676788 (10Dzahn) it's shutdown now. it's also removed from puppet/icinga/salt. this ticket should be complete after DNS removal, disk wiping and taking hardware out of rack or reclaim [22:28:57] (03PS2) 10Andrew Bogott: Rsync: Unquote booleans [puppet] - 10https://gerrit.wikimedia.org/r/241235 (https://phabricator.wikimedia.org/T113783) [22:28:59] (03PS2) 10Andrew Bogott: Logstash: Fixed a quoted boolean in a code comment. [puppet] - 10https://gerrit.wikimedia.org/r/241236 (https://phabricator.wikimedia.org/T113783) [22:29:01] (03PS2) 10Andrew Bogott: interface: dequote booleans [puppet] - 10https://gerrit.wikimedia.org/r/241237 (https://phabricator.wikimedia.org/T113783) [22:29:03] (03PS2) 10Andrew Bogott: Cassandra: dequote some booleans. [puppet] - 10https://gerrit.wikimedia.org/r/241238 (https://phabricator.wikimedia.org/T113783) [22:29:05] (03PS2) 10Andrew Bogott: Grafana: dequote booleans. [puppet] - 10https://gerrit.wikimedia.org/r/241239 (https://phabricator.wikimedia.org/T113783) [22:29:07] (03PS2) 10Andrew Bogott: Gerrit role: dequote booleans [puppet] - 10https://gerrit.wikimedia.org/r/241240 (https://phabricator.wikimedia.org/T113783) [22:29:09] (03PS2) 10Andrew Bogott: Mark salt grain bool values with # lint:ignore:quoted_booleans [puppet] - 10https://gerrit.wikimedia.org/r/241241 (https://phabricator.wikimedia.org/T113783) [22:29:11] (03PS2) 10Andrew Bogott: webserver::php5 unquote a boolean. [puppet] - 10https://gerrit.wikimedia.org/r/241242 (https://phabricator.wikimedia.org/T113783) [22:29:13] (03PS2) 10Andrew Bogott: Webserver ca: disable the quoted-bool lint check [puppet] - 10https://gerrit.wikimedia.org/r/241243 (https://phabricator.wikimedia.org/T113783) [22:29:15] (03PS2) 10Andrew Bogott: Diamond: Turn off lint check for quoted bools. [puppet] - 10https://gerrit.wikimedia.org/r/241244 (https://phabricator.wikimedia.org/T113783) [22:29:17] (03PS2) 10Andrew Bogott: Disable quoted_boolean lint check around is_virtual refs. [puppet] - 10https://gerrit.wikimedia.org/r/241245 (https://phabricator.wikimedia.org/T113783) [22:29:19] (03PS2) 10Andrew Bogott: Root out a long chain of quoted bools in nagios/icinga/nrpe [puppet] - 10https://gerrit.wikimedia.org/r/241246 (https://phabricator.wikimedia.org/T113783) [22:30:40] 6operations, 10Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#1676828 (10greg) What are the explicit blockers of this upgrade, now? I guess one that might not be a task is "time and prioritization" but that's fine. [22:32:33] (03PS3) 10Andrew Bogott: Rsync: Unquote booleans [puppet] - 10https://gerrit.wikimedia.org/r/241235 (https://phabricator.wikimedia.org/T113783) [22:32:35] (03PS3) 10Andrew Bogott: Logstash: Fixed a quoted boolean in a code comment. 
[puppet] - 10https://gerrit.wikimedia.org/r/241236 (https://phabricator.wikimedia.org/T113783) [22:32:37] (03PS3) 10Andrew Bogott: interface: dequote booleans [puppet] - 10https://gerrit.wikimedia.org/r/241237 (https://phabricator.wikimedia.org/T113783) [22:32:39] (03PS3) 10Andrew Bogott: Cassandra: dequote some booleans. [puppet] - 10https://gerrit.wikimedia.org/r/241238 (https://phabricator.wikimedia.org/T113783) [22:32:41] (03PS3) 10Andrew Bogott: Grafana: dequote booleans. [puppet] - 10https://gerrit.wikimedia.org/r/241239 (https://phabricator.wikimedia.org/T113783) [22:32:43] (03PS3) 10Andrew Bogott: Gerrit role: dequote booleans [puppet] - 10https://gerrit.wikimedia.org/r/241240 (https://phabricator.wikimedia.org/T113783) [22:32:45] (03PS3) 10Andrew Bogott: Mark salt grain bool values with # lint:ignore:quoted_booleans [puppet] - 10https://gerrit.wikimedia.org/r/241241 (https://phabricator.wikimedia.org/T113783) [22:32:47] (03PS3) 10Andrew Bogott: webserver::php5 unquote a boolean. [puppet] - 10https://gerrit.wikimedia.org/r/241242 (https://phabricator.wikimedia.org/T113783) [22:32:49] (03PS3) 10Andrew Bogott: Webserver ca: disable the quoted-bool lint check [puppet] - 10https://gerrit.wikimedia.org/r/241243 (https://phabricator.wikimedia.org/T113783) [22:32:51] (03PS3) 10Andrew Bogott: Diamond: Turn off lint check for quoted bools. [puppet] - 10https://gerrit.wikimedia.org/r/241244 (https://phabricator.wikimedia.org/T113783) [22:32:53] (03PS3) 10Andrew Bogott: Disable quoted_boolean lint check around is_virtual refs. [puppet] - 10https://gerrit.wikimedia.org/r/241245 (https://phabricator.wikimedia.org/T113783) [22:32:55] (03PS3) 10Andrew Bogott: Root out a long chain of quoted bools in nagios/icinga/nrpe [puppet] - 10https://gerrit.wikimedia.org/r/241246 (https://phabricator.wikimedia.org/T113783) [22:33:16] * andrewbogott is done for now, sorry about the noise [22:34:10] it's wonderful :) [22:36:20] 6operations, 10Wikimedia-Mailing-lists: Upgrade Mailman to version 3 - https://phabricator.wikimedia.org/T52864#1676829 (10JohnLewis) Blockers: * 2.1 upgrade support for Mailman 3 (serious and tested, right now it's 'partial' and patched in 2010) * Debian upstream packaging (likely next Debian distro) * Actual... [22:36:57] greg-g: with the above we can launch into a massive debate if you want ;) [22:38:23] wee [23:10:57] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [23:37:35] PROBLEM - puppet last run on mw1023 is CRITICAL: CRITICAL: Puppet has 1 failures [23:38:05] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:54:35] !log ori@tin Synchronized php-1.26wmf24/includes/GlobalFunctions.php: 37c6972f94: Made wfIsBadImage() use APC (duration: 00m 17s) [23:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
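For context on the !log entries above: the "Synchronized ..." lines are emitted by scap's single-file sync, invoked from the staging directory on the deploy host roughly as follows (the exact invocation is an assumption based on the tooling of this period):
```
# From /srv/mediawiki-staging on tin:
sync-file php-1.26wmf24/includes/GlobalFunctions.php '37c6972f94: Made wfIsBadImage() use APC'
```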