[00:00:04] RoanKattouw, ^d, tgr: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150224T0000). Please do the needful. [00:07:49] <^demon|away> tgr: Ping for your swat [00:07:52] <^demon|away> (sorry for the delay) [00:08:00] aye [00:08:09] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 1 below the confidence bounds [00:10:06] !log demon Synchronized php-1.25wmf17/extensions/MultimediaViewer: (no message) (duration: 00m 06s) [00:10:10] Logged the message, Master [00:10:21] !log demon Synchronized php-1.25wmf18/extensions/MultimediaViewer: (no message) (duration: 00m 09s) [00:10:24] Logged the message, Master [00:10:28] <^demon|away> tgr: All live, plz verify [00:18:08] ^demon|away: works, thanks! [00:18:16] <^demon|away> yw [00:22:19] (03CR) 10BryanDavis: "This might be a good time to add an `"autoloader-suffix": "_mediawiki_config"` setting to to stabilize the generated autoloader's classnam" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188393 (https://phabricator.wikimedia.org/T85182) (owner: 10Legoktm) [00:24:23] ^demon|away: Can you also do https://gerrit.wikimedia.org/r/192083 ? [00:24:29] <^demon|away> looking [00:25:29] (03CR) 10Chad: [C: 032] Revert "Limit runJobs output to warning and higher severity" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192083 (owner: 10Hoo man) [00:26:50] thanks [00:29:53] 6operations, 10Incident-20150205-SiteOutage, 6MediaWiki-Core-Team, 10Wikimedia-Logstash, 5Patch-For-Review: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1061042 (10bd808) 5Open>3stalled The configuration change to begin testing this in pro... [00:30:12] * ^demon|away twiddles thumbs [00:30:50] twiddling thumbs is one approach, doing it right the first time is the other [00:31:38] <^demon|away> ...what did I do wrong? 
[00:32:15] oh, I misunderstood and thought you were twiddling your thumbs re: the task update immediately above [00:32:16] my bad [00:32:51] <^demon|away> No, waiting for zuul to decide it's ok to merge my prod config change :) [00:33:26] * ori should be a little less jumpy [00:42:54] (03CR) 10Chad: [V: 032] Revert "Limit runJobs output to warning and higher severity" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192083 (owner: 10Hoo man) [00:43:13] <^demon|away> hoo: Done [00:43:13] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 07s) [00:43:21] Logged the message, Master [00:45:32] 6operations, 10Incident-20150205-SiteOutage, 6MediaWiki-Core-Team, 10Wikimedia-Logstash, 5Patch-For-Review: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1061071 (10ori) If you are not sure that using Monolog is a good solution, evaluate it fur... [00:50:59] !log twentyafterfour Synchronized php-1.25wmf18/cache/l10n: (no message) (duration: 00m 03s) [00:51:05] Logged the message, Master [00:56:30] !log on osmium, removing the packages I just installed since I will do it in a chroot instead [00:56:34] Logged the message, Master [01:00:30] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [01:12:06] (03PS1) 10Andrew Bogott: Add labs config files for Openstack version Juno [puppet] - 10https://gerrit.wikimedia.org/r/192483 [01:13:01] (03CR) 10jenkins-bot: [V: 04-1] Add labs config files for Openstack version Juno [puppet] - 10https://gerrit.wikimedia.org/r/192483 (owner: 10Andrew Bogott) [01:15:59] (03PS2) 10Andrew Bogott: Add labs config files for Openstack version Juno [puppet] - 10https://gerrit.wikimedia.org/r/192483 [01:19:20] hello: any idea as to how to get the phab task that corresponds to this ticket: https://rt.wikimedia.org/Ticket/Display.html?id=7509 [01:20:30] nuria, 
procurement tickets have not been migrated [01:20:52] MaxSem: aham, so how do we "reactivate" the task? cc ori [01:21:07] nuria: i'm just chatting with robh about it, he's in charge of procurements [01:21:16] ori: k [01:47:18] !log ori Synchronized php-1.25wmf17/extensions/MobileFrontend/includes/modules/MobileUserModule.php: Testing a theory for T90411 with a live-hack to MobileFrontend. Will revert momentarily. (duration: 00m 07s) [01:47:22] Logged the message, Master [01:52:39] !log ori Synchronized php-1.25wmf17/extensions/MobileFrontend/includes/modules/MobileUserModule.php: Reverting live-hack (duration: 00m 07s) [01:52:43] Logged the message, Master [01:54:09] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail [02:13:49] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:16:27] !log l10nupdate Synchronized php-1.25wmf17/cache/l10n: (no message) (duration: 00m 02s) [02:16:31] Logged the message, Master [02:17:34] !log LocalisationUpdate completed (1.25wmf17) at 2015-02-24 02:16:30+00:00 [02:17:38] Logged the message, Master [02:17:59] !log l10nupdate Synchronized php-1.25wmf18/cache/l10n: (no message) (duration: 00m 01s) [02:18:02] Logged the message, Master [02:19:07] !log LocalisationUpdate completed (1.25wmf18) at 2015-02-24 02:18:03+00:00 [02:19:10] Logged the message, Master [02:24:20] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [02:34:42] (03CR) 10Nuria: "FYI to @bblack that 1) self-inflicted huge traffic spikes are already happening and 2) it is hard for a huge spike to go undetected. 
We t" [puppet] - 10https://gerrit.wikimedia.org/r/192370 (owner: 10Ori.livneh) [02:48:39] (03PS1) 10Springle: repool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192497 [02:49:09] (03CR) 10Springle: [C: 032] repool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192497 (owner: 10Springle) [02:49:31] (03Merged) 10jenkins-bot: repool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192497 (owner: 10Springle) [02:50:27] !log springle Synchronized wmf-config/db-eqiad.php: repool db1066, warm up (duration: 00m 06s) [02:50:31] Logged the message, Master [04:07:05] !log springle Synchronized wmf-config/db-eqiad.php: raise db1066 load (duration: 00m 07s) [04:07:08] Logged the message, Master [04:30:44] (03CR) 10Tim Landscheidt: Tools: Install at (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/191521 (https://phabricator.wikimedia.org/T72324) (owner: 10Tim Landscheidt) [04:38:39] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: Puppet has 1 failures [04:41:20] (03CR) 10Tim Landscheidt: Use apt::repository instead of file resources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/123903 (owner: 10Tim Landscheidt) [04:50:03] (03PS1) 10Chmarkine: Point rel=canonical to HTTPS for ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192502 [04:52:41] (03PS2) 10Chmarkine: Point rel=canonical to HTTPS for ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192502 (https://phabricator.wikimedia.org/T90527) [04:55:58] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [04:56:59] (03CR) 10Tim Landscheidt: "Hmmm?! This change references that LDAP's "min number"?" 
[puppet] - 10https://gerrit.wikimedia.org/r/190978 (https://phabricator.wikimedia.org/T87527) (owner: 10Tim Landscheidt) [05:03:03] (03PS1) 10Springle: prepare db1016 to take over as m1 master [puppet] - 10https://gerrit.wikimedia.org/r/192503 [05:04:43] (03CR) 10Springle: [C: 032] prepare db1016 to take over as m1 master [puppet] - 10https://gerrit.wikimedia.org/r/192503 (owner: 10Springle) [05:16:53] (03CR) 10Tim Landscheidt: [C: 04-1] "You're right, I only tested that it will deploy properly, but not what :-). Current diff:" [puppet] - 10https://gerrit.wikimedia.org/r/148172 (owner: 10Tim Landscheidt) [05:33:23] labs is dead [05:33:33] for whoever cares [05:44:49] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: puppet fail [05:47:02] GerardM-: 'labs is dead' is useless. more context please, if you want help. [05:47:28] GerardM-: a little more respect and a little less dismissiveness would also be welcome [05:54:46] (I am looking into it now) [05:58:19] GerardM-: alright, this time I eat my own hat. I didn't notice because shinken-wm is silent on -labs [05:58:25] sorry about the snark. let me dig [06:00:30] looks like virt1005 is ailing [06:04:21] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:08:08] why is it even in the pool. [06:08:09] wat [06:10:37] am going to try paging andrewbogott_afk [06:15:29] 6operations, 10Beta-Cluster, 6Labs, 10Tool-Labs: A virt host seems down, taking down all instances with it - https://phabricator.wikimedia.org/T90530#1061420 (10yuvipanda) p:5High>3Unbreak! [06:19:38] I can barely use this app. .. what's up? [06:20:35] abogott: hey! virt1005 seems to have died and taken instances with it. [06:20:43] abogott: which is strange since you took it out of rotation when it died earlier. [06:21:02] abogott: but I checked the 'current host' on wikitech for them and they all said virt1005 [06:21:03] Hm? It should be empty. 
Isn't that the one that died on Tuesday? [06:21:42] ask the nova cmdline [06:21:48] abogott: yeah, and got a confusing answer [06:21:52] | OS-EXT-SRV-ATTR:host | virt1012 | [06:21:54] | OS-EXT-SRV-ATTR:hypervisor_hostname | virt1005.eqiad.wmnet | [06:21:55] Wikitech is probably out of date [06:21:57] and it's down [06:22:00] but it says state 'ACTIVE' [06:22:21] That's normal I think. But... The instances died? [06:22:37] abogott: yup, unreachable. this also included tools-webproxy, so tools is dead... [06:22:49] (and deployment-db1, so deployment-prep is also dead) [06:22:56] What the heck [06:23:01] abogott: so I just count the 'host' and ignore the 'hypervisor_hostname'? [06:23:08] abogott: I did a nova stop and nova start, no effect. [06:23:13] Can you log in to 1005? [06:23:43] yup [06:24:12] Is it mostly happy or mostly broken? [06:24:32] abogott: seems 'happy'. no qemu, no load, no active processes... [06:24:43] abogott: 1012 has qemu processes running, etc [06:24:58] Can you locate an instance in 1012 that works? [06:25:06] looking [06:26:27] Also try service nova compute restart on 1005 [06:26:50] Grrrrrrrr autocorrect makes command lines hard [06:27:15] right. let me try [06:27:18] on 1005? [06:27:39] Yeah [06:27:50] abogott: it's stopped atm. [06:27:52] let me start? [06:28:10] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:19] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:20] !log starting nova-compute on virt1005 [06:28:22] Hm... 
yes, try but stop again if no help [06:28:24] Logged the message, Master [06:28:29] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 3 failures [06:28:30] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:30] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:04] abogott: no help, I stopped [06:29:09] !log stopped nova-compute on virt1005 [06:29:10] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:12] Logged the message, Master [06:30:05] abogott: hmm, can't actually find a machine in virt1012 that works. let me try to list them all. [06:30:14] Other labs instances work? [06:30:30] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:40] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:56] See if you can find something on 1012 that wasn't on 1005 before. Might not be anything. [06:31:32] abogott: yeah, tools-login works fine, for example [06:31:39] Ok [06:31:58] I'll be online for real in a minute. [06:32:34] abogott: ok [06:34:03] abogott: openstack-juno-testing doesn't seem to have been on virt1005, and is on virt1012 now, and is also dead [06:34:09] (from what I can see. no ping, no ssh) [06:34:15] nova still says 'active' [06:34:48] so does bbdevel [06:36:26] ok, looking at virt1012. [06:36:34] andrewbogott: thanks. let me know how I can help. [06:36:44] andrewbogott: I did a nova stop and a nova start on tools-webproxy, didn't seem to help much [06:38:04] andrewbogott: I'm going to email labs-l [06:38:14] ok [06:39:22] 6operations, 10Incident-20150205-SiteOutage, 6MediaWiki-Core-Team, 10Wikimedia-Logstash, 5Patch-For-Review: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1061469 (10bd808) >>! In T88732#1061071, @ori wrote: > If you are not sure that using Mono... 
[06:40:22] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Feb 24 06:39:18 UTC 2015 (duration 39m 17s) [06:40:24] Logged the message, Master [06:44:17] it’s pretty clear that compute wasn’t running on virt1005. [06:44:34] So I suspect that the instances are still happily running on 1012, and they’ve fallen off the network. [06:44:42] Why now, though? [06:45:40] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:45:40] YuviPanda: can you work on building alternate web proxies while I troubleshoot this? [06:45:50] andrewbogott: right. ok. [06:46:30] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:40] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:46:50] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:47:41] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:47:59] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:48:39] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:48:45] andrewbogott: bah, it got created on virt1012 [06:48:55] hmm, I wonder if it boots up [06:48:56] hm, well, good test case! [06:48:59] yeah [06:49:00] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:13] tools-webproxy-01.eqiad.wmflabs [06:51:01] andrewbogott: interesting. DNS doesn't see it, and pinging the IP itself (10.68.17.139) is giving me a dest not reachable. [06:51:06] I wonder if this is network related... 
[06:51:20] since I *do* see qemu running on virt1012 [06:51:23] yeah, I’m pretty sure that the nic that virt1012 uses for instances crapped out [06:51:29] I know not why [06:51:34] oh ouch [06:51:42] the physical NIC? [06:51:42] paravoid: up? [06:51:47] Dunno, maybe. [06:51:59] I wonder if it is labnet that's having problems. [06:52:24] the network host looks ok, and of course it’s handling other instances fine… [06:52:29] right [06:55:43] physical nics seem fine, not complaining [06:55:48] hm... [06:56:51] firewall looks ok... [07:08:37] YuviPanda: what time did things start dying? [07:08:45] andrewbogott: let me look at shinken [07:09:09] andrewbogott: about 11AM IST, so that was 1h40m ago [07:09:26] that’s… arbitrary [07:09:26] andrewbogott: I'm going to try creating another instance now. see where it lands up [07:09:41] ok. I’m going to reboot virt1012 because that might help. It’s not ideal since it’ll cycle power on all instances… [07:09:57] andrewbogott: right. but if they’re unreachable anyway, even from inside labs... [07:11:15] andrewbogott: I also installed nethogs on virt1012 and 1010, and they have comparable amounts of network traffic from nova... [07:13:13] !log suspending all instances on virt1012 [07:13:17] Logged the message, Master [07:46:59] PROBLEM - Host virt1012 is DOWN: PING CRITICAL - Packet loss = 100% [07:50:40] RECOVERY - Host virt1012 is UP: PING OK - Packet loss = 0%, RTA = 1.75 ms [07:55:52] !log ‘nova reset-state --active’ and ‘nova reboot’ for EVERY instance on virt1012 [07:55:54] Logged the message, Master [07:57:49] ok, YuviPanda, are instances starting to come back up for you? [08:00:02] andrewbogott: it's back down [08:00:19] shit [08:00:48] oh, actually, I think this is fine... [08:00:57] it’s just the delay in the reboot… [08:01:12] (which leaves me wondering, what state was it in before? I didn’t reboot it twice…) [08:01:36] andrewbogott: it booted back up as soon as virt1012 came back up [08:01:49] Yeah, seems like. 
That surprises me [08:01:57] When virt1005 rebooted everything was in a SHUTOFF state after it came up [08:02:08] Anyway, it’s back, again [08:03:42] <_joe_> what is back? virt1012? [08:04:02] and the instances on tit [08:04:03] *it [08:04:13] _joe_: yeah, reboot cured all [08:04:16] messily [08:04:21] <_joe_> andrewbogott: I was almost sure [08:04:54] but now I'm suspicious [08:04:59] _joe_: Sadly, we don’t know what happened or how to prevent in the future :( [08:05:40] <_joe_> I think it's most probably a kernel bug of some sort, but didn't take the time to investigate that properly [08:05:59] yeah [08:06:12] Hopefully it’s rare; there are three identical servers and this is the first time this has happened [08:07:28] YuviPanda: some instances are still showing ERROR state but seem to actually be fine. I’m going to test and clean up and such… [08:07:41] It’s not all that late, as long as I don’t wake up for the storage maintenance tomorrow :) [08:08:21] andrewbogott: heh, right. [08:08:25] YuviPanda: go ahead and send an ‘all clear’ email as soon as you’re ready. [08:08:31] andrewbogott: yup. looking around still [08:21:55] YuviPanda: hold off on the all-clear, I’m still seeing some issues [08:22:08] andrewbogott: hmm, tools are back up, so I emailed... [08:22:10] whoops [08:22:13] it’s ok [08:22:23] I’ll follow up individually if I can’t rescue these instances [08:24:24] ok [08:31:04] (03PS1) 10Springle: grants for misc db shards [puppet] - 10https://gerrit.wikimedia.org/r/192506 [08:31:06] (03CR) 10jenkins-bot: [V: 04-1] grants for misc db shards [puppet] - 10https://gerrit.wikimedia.org/r/192506 (owner: 10Springle) [08:31:10] (03PS2) 10Springle: grants for misc db shards [puppet] - 10https://gerrit.wikimedia.org/r/192506 [08:31:22] (03CR) 10Springle: [C: 032] grants for misc db shards [puppet] - 10https://gerrit.wikimedia.org/r/192506 (owner: 10Springle) [08:32:08] <_joe_> springle: ewww :) [08:32:50] <_joe_> it's not your fault, obviously. 
managing databases via puppet is always awkward [08:38:17] 6operations: Fix the puppet catalog compiler - https://phabricator.wikimedia.org/T90417#1061653 (10Joe) 5Open>3Resolved [08:48:32] YuviPanda: ok, I think i’m going to try sleeping now… [08:48:33] hopefully [08:48:52] andrewbogott: not used ATM, I've cleaned the 'swift' project a bit [08:48:58] (greetings) [08:49:25] (03PS2) 10Giuseppe Lavagetto: coredb::s1: use hiera, role [puppet] - 10https://gerrit.wikimedia.org/r/185921 [08:49:29] godog: great, thanks. [08:49:45] It’s not urgent, I’m just trying to get some ganglia colors back into the yellows and greens. [08:50:11] In theory the natural lifecycle of labs instances means that things rebalance over time as long as folks clean up old stuff. [08:50:50] godog: so you deleted those instances already? (I lost my backscroll so don’t remember precisely what we’re talking about :) ) [08:51:20] andrewbogott: yep I've deleted three medium instances and recreated a small one [08:51:30] that’ll help! [08:52:26] YuviPanda: I’m tuning out now. Thanks for the page earlier. [09:00:00] andrewbogott: yw! Thanks for being awesome [09:33:07] 6operations, 7HTTPS, 3HTTPS-by-default: Point rel=canonical to HTTPS for Russian Wikipedia (ruwiki) - https://phabricator.wikimedia.org/T90527#1061834 (10Aklapper) [09:36:46] _joe_: least eww i can find. at least it is simple without crazy puppet layers :) [09:38:51] <_joe_> springle: yes I agree, mine is more of a general statement that puppet is bad at things that are stateful [09:38:58] <_joe_> (and it's designed to be) [09:39:25] true [09:39:36] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1061871 (10Aklapper) >>! In T85141#1055784, @Dzahn wrote: > DBD::mysql::db selectcol_arrayref failed: Unknown column 'tag' in 'where clause' Tags were introduced in 4.4 and are priva... 
[09:48:47] <_joe_> springle: I'm gonna merge https://gerrit.wikimedia.org/r/#/c/185921, or if you're moving to mariadb::core soon and just need help, lemme know [09:51:15] 6operations, 6Phabricator: Create policy projects and convert people projects to open - https://phabricator.wikimedia.org/T90491#1061895 (10Aklapper) [09:51:16] 10Ops-Access-Requests, 6operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1061894 (10Aklapper) [09:51:18] 6operations, 6Phabricator: Create policy projects and convert people projects to open - https://phabricator.wikimedia.org/T90491#1060488 (10Aklapper) [09:52:59] _joe_: so long as it's a no-op, fine by me [09:53:32] <_joe_> springle: it should be, but it's pointless if your plans for moving to mariadb::core are within this quarter [09:53:41] <_joe_> also, I can help if needed on that [09:54:23] _joe_: they aren't decided by quarter, but by necessity as boxes switch to mariadb 10. 
i haven't put a timeline on it [09:54:37] <_joe_> ok [09:54:45] too many other things dictate progress [09:54:55] <_joe_> yeah I get that [09:58:21] 6operations, 3wikis-in-codfw: Console on mc2001 is unresponsive - https://phabricator.wikimedia.org/T90559#1061914 (10Joe) 3NEW a:3Joe [09:58:32] 6operations, 10ops-codfw, 3wikis-in-codfw: Console on mc2001 is unresponsive - https://phabricator.wikimedia.org/T90559#1061914 (10Joe) [10:01:17] 6operations, 6Phabricator, 6Project-Creators: Create policy projects and convert people projects to open - https://phabricator.wikimedia.org/T90491#1061932 (10Qgil) [10:09:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] Conntrack collector for diamond (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/192335 (owner: 10coren) [10:10:18] 7Blocked-on-Operations, 6operations, 10RESTBase, 6Scrum-of-Scrums, and 3 others: RESTBase production hardware - 4 of 6 ready - https://phabricator.wikimedia.org/T76986#1061945 (10fgiunchedi) clarification: restbase1001 is up and racked but currently running into an issue with the debian installer and netwo... [10:25:30] 6operations: NIC misassigned (double entries) by jessie installer - https://phabricator.wikimedia.org/T90236#1061958 (10fgiunchedi) this seems to be https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=765577 [10:37:48] <_joe_> godog: yes seems to be it [10:38:44] indeed [10:40:25] I'm testing with passing 'debug' at boot, it didn't trigger the race 2/2 so far [10:41:46] <_joe_> how are you passing debug at boot? 
[10:42:05] type 'server debug' at the boot: prompt [10:51:14] right, that worked 5/5 [10:52:49] 6operations: NIC misassigned (double entries) by jessie installer - https://phabricator.wikimedia.org/T90236#1061973 (10fgiunchedi) of course booting with `debug` makes the race go away, I've been successfully booting restbase1001 5 out of 5 times [10:56:50] <_joe_> godog: ok I'll try [10:58:44] _joe_: let me know how it goes, it'd be a lame fix but seems reliable so far [11:05:35] (03CR) 10Alexandros Kosiaris: [C: 032] cassandra: add cassandra::metrics class and deps [puppet] - 10https://gerrit.wikimedia.org/r/191654 (https://phabricator.wikimedia.org/T78514) (owner: 10Filippo Giunchedi) [11:06:17] (03CR) 10Alexandros Kosiaris: [C: 032] report cassandra metrics with metrics-graphite [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/191652 (https://phabricator.wikimedia.org/T78514) (owner: 10Filippo Giunchedi) [11:21:08] Request: POST http://cs.wiktionary.org/w/index.php?title=Modul:Source/Rejzek&action=submit, from 10.20.0.131 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 3302443440 [11:21:12] Forwarded for: 88.101.17.127, 10.20.0.138, 10.20.0.138, 10.20.0.131 [11:21:15] Error: 503, Service Unavailable at Tue, 24 Feb 2015 11:20:47 GMT [11:21:47] 6operations: NIC misassigned (double entries) by jessie installer - https://phabricator.wikimedia.org/T90236#1061997 (10fgiunchedi) to clarify, restbase1002 also had problems but eventually won a race and the installation went through. I've left restbase1001 behind on purpose to be able to debug this, other rest... [11:21:55] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Ports are wrong (how did 50001 come up?) 
otherwise LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/191938 (https://phabricator.wikimedia.org/T89867) (owner: 10Dzahn) [11:22:38] (03CR) 10Alexandros Kosiaris: [C: 032] add zotero role class skeleton [puppet] - 10https://gerrit.wikimedia.org/r/191925 (https://phabricator.wikimedia.org/T89867) (owner: 10Dzahn) [11:22:48] PROBLEM - puppet last run on restbase1001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:23:39] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [11:27:01] 6operations, 5Patch-For-Review: Assign an internal LVS service IP for zotero - https://phabricator.wikimedia.org/T89870#1062001 (10akosiaris) >>! In T89870#1055205, @Dzahn wrote: > fwiw, i asked bblack and he checked and said we can use the entire /24 range of IPs here even though it's not a real /24 network a... [11:27:18] PROBLEM - Host restbase1001 is DOWN: PING CRITICAL - Packet loss = 100% [11:27:48] that's me [11:29:00] RECOVERY - Host restbase1001 is UP: PING OK - Packet loss = 0%, RTA = 1.46 ms [11:32:33] 6operations, 10RESTBase-Cassandra: cassandra fails to start up after provisioning / first puppet run - https://phabricator.wikimedia.org/T90564#1062016 (10fgiunchedi) 3NEW [11:32:53] 6operations, 10RESTBase-Cassandra: cassandra fails to start up after provisioning / first puppet run - https://phabricator.wikimedia.org/T90564#1062023 (10fgiunchedi) a:3fgiunchedi [11:33:02] 6operations, 10RESTBase-Cassandra: cassandra fails to start up after provisioning / first puppet run - https://phabricator.wikimedia.org/T90564#1062016 (10fgiunchedi) p:5Triage>3Normal [11:47:36] apergos: https://phabricator.wikimedia.org/T84148 ? 
:) [11:48:45] 6operations: Incident response protocol needs a refresh - https://phabricator.wikimedia.org/T89800#1062052 (10faidon) p:5Triage>3High a:3faidon [11:48:55] <_joe_> godog: confirmed 'server debug' fixes the race condition [11:49:14] _joe_: every time so far? [11:49:22] <_joe_> yes [11:49:29] <_joe_> tried rebooting 5 times [11:49:43] <_joe_> it's pretty fast as the cabling is in the wrong port anyway [11:49:44] 6operations, 10ops-eqiad, 10RESTBase, 6Services: restbase1006 faulty disk controller - https://phabricator.wikimedia.org/T89639#1062055 (10faidon) [11:49:44] <_joe_> :P [11:50:12] lame fix nevertheless unfortunately, it needs to be removed afterwards from grub config as d-i rightfully carries that over [11:51:03] paravoid: not forgotten, patch in the works [11:51:27] 6operations: db1021 %iowait up - https://phabricator.wikimedia.org/T87277#1062058 (10faidon) Also see T84050: > Additionally, we're missing all kinds of MegaCli checks, like battery status errors, missing logical drives, predictive errors, different configured from runtime settings (e.g. the usual "configure Wri... [11:52:15] 1 UBN, 23 needs triage, 65 High [11:52:17] * paravoid cries [11:53:33] paravoid: is the UBN the wikitech LDAP one? [11:53:59] no, your virt ticket [11:54:24] T90530 A virt host seems down, taking down all instances with it [11:54:29] which needs an update please :) [11:55:13] yes [11:57:48] 6operations, 10Beta-Cluster, 6Labs, 10Tool-Labs: A virt host seems down, taking down all instances with it - https://phabricator.wikimedia.org/T90530#1062069 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Fixed now - the instances were all running but no network access. Restarting nova network was of no... 
[11:58:30] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "@gwicke, we have considered that but either we ship compiled jars right away or try to build them as part of the deb build process, which " [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/191652 (https://phabricator.wikimedia.org/T78514) (owner: 10Filippo Giunchedi) [11:58:31] 6operations, 10Beta-Cluster, 6Labs, 10Tool-Labs: Investigate and do incident report for strange virt1012 issues - https://phabricator.wikimedia.org/T90566#1062073 (10yuvipanda) 3NEW a:3yuvipanda [12:00:02] 6operations, 7HTTPS, 3HTTPS-by-default: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1062083 (10faidon) p:5High>3Normal [12:00:22] 6operations, 10Beta-Cluster, 6Labs, 10Tool-Labs: Investigate and do incident report for strange virt1012 issues - https://phabricator.wikimedia.org/T90566#1062073 (10yuvipanda) a:5yuvipanda>3Andrew [12:00:33] paravoid: updated, created a subtask to document / incident response [12:00:57] thanks! [12:01:15] I’ll follow up on the other ubn when Coren or andrewbogott_afk come back [12:01:35] 6operations, 10Beta-Cluster, 6Labs, 10Tool-Labs: A virt host seems down, taking down all instances with it - https://phabricator.wikimedia.org/T90530#1062087 (10yuvipanda) [12:01:44] (03PS1) 10Filippo Giunchedi: cassandra: update submodule [puppet] - 10https://gerrit.wikimedia.org/r/192536 [12:02:19] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: update submodule [puppet] - 10https://gerrit.wikimedia.org/r/192536 (owner: 10Filippo Giunchedi) [12:02:35] (03PS2) 10Filippo Giunchedi: cassandra: add cassandra::metrics class and deps [puppet] - 10https://gerrit.wikimedia.org/r/191654 (https://phabricator.wikimedia.org/T78514) [12:02:48] 6operations, 7Graphite, 5Patch-For-Review: migrate graphite to new hardware - https://phabricator.wikimedia.org/T85909#1062104 (10faidon) So, what's the next step here and when/how is it going to happen? 
[12:02:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add cassandra::metrics class and deps [puppet] - 10https://gerrit.wikimedia.org/r/191654 (https://phabricator.wikimedia.org/T78514) (owner: 10Filippo Giunchedi) [12:06:01] sigh trebuchet fail there, investigating [12:07:53] 6operations, 10Wikimedia-General-or-Unknown: Varnish: Mobile site redirect interferes with OAuth authorization process - https://phabricator.wikimedia.org/T74186#1062117 (10faidon) So, Chris, what needs to happen here? [12:08:25] 6operations, 7Monitoring: Refactor RAID checks (check-raid) - https://phabricator.wikimedia.org/T84050#1062118 (10faidon) [12:13:11] (03Abandoned) 10Faidon Liambotis: rcstream: make lvs health check fetch /nginx_status [puppet] - 10https://gerrit.wikimedia.org/r/145997 (https://bugzilla.wikimedia.org/67957) (owner: 10Ori.livneh) [12:13:19] 6operations, 10Wikimedia-Stream: stream.wikimedia.org: Uneven distribution of client connections on backends - https://phabricator.wikimedia.org/T69957#1062141 (10faidon) [12:16:33] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [12:17:34] (03PS1) 10Filippo Giunchedi: cassandra: fix metrics-graphite.jar path [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/192537 [12:17:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: fix metrics-graphite.jar path [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/192537 (owner: 10Filippo Giunchedi) [12:18:32] (03PS1) 10Filippo Giunchedi: cassandra: update submodule [puppet] - 10https://gerrit.wikimedia.org/r/192538 [12:18:43] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: update submodule [puppet] - 10https://gerrit.wikimedia.org/r/192538 (owner: 10Filippo Giunchedi) [12:20:12] 6operations: 503 on http://wikiversity.org/ - https://phabricator.wikimedia.org/T88774#1062155 (10faidon) 5Open>3Resolved a:3faidon [12:29:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% 
above the threshold [250.0] [12:42:09] 6operations, 6MediaWiki-Core-Team, 10Wikimedia-General-or-Unknown: Run our own Tor client for Tor block - https://phabricator.wikimedia.org/T32716#1062214 (10faidon) Sorry, going back to this very old ticket — why do we need to use our Onionoo server rather than using torproject.org's? I haven't heard any c... [12:42:17] (03PS1) 10Filippo Giunchedi: deployment: true vs 'true' [puppet] - 10https://gerrit.wikimedia.org/r/192543 [12:43:40] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] deployment: true vs 'true' [puppet] - 10https://gerrit.wikimedia.org/r/192543 (owner: 10Filippo Giunchedi) [13:04:56] lunch, I'm currently trying to figure out why trebuchet is not considering git fat while deploying cassandra metrics [13:12:48] 6operations, 10ops-codfw, 3wikis-in-codfw: Console on mc2013 is unresponsive - https://phabricator.wikimedia.org/T90580#1062279 (10Joe) 3NEW [13:13:01] 6operations, 10ops-codfw, 3wikis-in-codfw: Console on mc2001 is unresponsive - https://phabricator.wikimedia.org/T90559#1062286 (10Joe) a:5Joe>3None [13:19:06] 6operations, 10ops-codfw, 3wikis-in-codfw: Console on mc2013 is unresponsive - https://phabricator.wikimedia.org/T90580#1062290 (10Joe) racadm reset did bring this back up! [13:19:18] 6operations, 3wikis-in-codfw: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1062294 (10Joe) [13:19:19] 6operations, 10ops-codfw, 3wikis-in-codfw: Console on mc2013 is unresponsive - https://phabricator.wikimedia.org/T90580#1062291 (10Joe) 5Open>3Resolved a:3Joe [13:25:23] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [13:44:13] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [13:55:54] 6operations, 10Datasets-General-or-Unknown, 10Wikidata: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1062351 (10Lydia_Pintscher) @hoo: could you have a look? 
[13:59:09] operations, Datasets-General-or-Unknown, Wikidata: Wikidata dumps contain old-style serialization. - https://phabricator.wikimedia.org/T74348#1062361 (hoo) >>! In T74348#1062351, @Lydia_Pintscher wrote: > @hoo: could you have a look? Just kicked of the download of a dump, I'll verify some old revisio... [14:00:34] operations, Citoid: Puppetize zotero - https://phabricator.wikimedia.org/T89867#1062370 (akosiaris) [14:00:35] operations, Citoid: Update the citoid/deploy branch to not contain zotero deploy - https://phabricator.wikimedia.org/T89872#1062372 (akosiaris) [14:00:36] operations, Citoid: Configure citoid to use outbound proxy - https://phabricator.wikimedia.org/T89875#1062373 (akosiaris) [14:00:37] operations, Citoid: Backport and using zotero-standalone for the zotero service - https://phabricator.wikimedia.org/T89866#1062375 (akosiaris) [14:00:38] operations, Citoid: Configure zotero to use an outbound proxy - https://phabricator.wikimedia.org/T89874#1062374 (akosiaris) [14:00:39] operations, Citoid: Assign hardware for the zotero service - https://phabricator.wikimedia.org/T89869#1062371 (akosiaris) [14:00:40] operations, Citoid: Configure citoid to use the new zotero service - https://phabricator.wikimedia.org/T89873#1062376 (akosiaris) [14:00:42] Blocked-on-Operations, operations, Citoid, Scrum-of-Scrums, Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1062369 (akosiaris) [14:00:49] operations, MediaWiki-Core-Team, Wikimedia-General-or-Unknown: Run our own Tor client for Tor block - https://phabricator.wikimedia.org/T32716#1062377 (Nemo_bis) > Sorry, going back to this very old ticket — why do we need to use our Onionoo server rather than using torproject.org's? At the time, the...
[14:06:41] operations: cp* boxes, pagecache issues & trying newer kernels - https://phabricator.wikimedia.org/T83809#1062390 (BBlack) [14:06:42] operations, HTTPS-by-default: jessie kernel vm subsystem issues for upload caches - https://phabricator.wikimedia.org/T88996#1062391 (BBlack) [14:21:03] PROBLEM - MariaDB disk space on db2011 is CRITICAL: DISK CRITICAL - free space: /srv 86680 MB (5% inode=99%): [14:21:05] operations, HTTPS-by-default: varnish disk cache auditing/correction - https://phabricator.wikimedia.org/T90583#1062454 (BBlack) NEW a:BBlack [14:21:26] (PS2) BBlack: Bump to 3.0.6plus-wm5 for 2x patches, jessie+ only [debs/varnish] (3.0.6-plus-wm) - https://gerrit.wikimedia.org/r/188825 [14:21:59] Blocked-on-Operations, ContentTranslation-Deployments, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, and 3 others: Separate config for Beta and Production for CXServer - https://phabricator.wikimedia.org/T88793#1062469 (Pginer-WMF) [14:23:14] (PS3) BBlack: Bump to 3.0.6plus-wm5 for 2x patches, jessie+ only [debs/varnish] (3.0.6-plus-wm) - https://gerrit.wikimedia.org/r/188825 (https://phabricator.wikimedia.org/T90583) [14:27:09] Blocked-on-Operations, ContentTranslation-Deployments, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, and 2 others: Provide proxy details to use for Yandex - https://phabricator.wikimedia.org/T89117#1062489 (Pginer-WMF) [14:32:15] (PS1) Filippo Giunchedi: Depend on rsync [debs/git-fat] (debian) - https://gerrit.wikimedia.org/r/192548 [14:34:43] operations, HTTPS-by-default: varnish disk cache auditing/correction - https://phabricator.wikimedia.org/T90583#1062527 (BBlack) https://gerrit.wikimedia.org/r/#/c/188825/ <- since it doesn't seem to autolink into here based on the commit msg [14:34:48] (PS1) Filippo Giunchedi: trebuchet: install git-fat on debian too [puppet] - https://gerrit.wikimedia.org/r/192549 [14:35:22] ^ both trivial
if someone has time, fixes trebuchet + git-fat on jessie [14:35:39] (03CR) 10jenkins-bot: [V: 04-1] trebuchet: install git-fat on debian too [puppet] - 10https://gerrit.wikimedia.org/r/192549 (owner: 10Filippo Giunchedi) [14:39:12] (03PS2) 10Filippo Giunchedi: trebuchet: install git-fat on debian too [puppet] - 10https://gerrit.wikimedia.org/r/192549 [14:39:53] (03CR) 10Yuvipanda: [C: 031] trebuchet: install git-fat on debian too [puppet] - 10https://gerrit.wikimedia.org/r/192549 (owner: 10Filippo Giunchedi) [14:41:47] YuviPanda: TYVM [14:48:23] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Depend on rsync [debs/git-fat] (debian) - 10https://gerrit.wikimedia.org/r/192548 (owner: 10Filippo Giunchedi) [14:49:00] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1062554 (10Dzahn) >>! In T85141#1061871, @Aklapper wrote: >>>! In T85141#1055784, @Dzahn wrote: >> DBD::mysql::db selectcol_arrayref failed: Unknown column 'tag' in 'where clause' >... [14:49:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] trebuchet: install git-fat on debian too [puppet] - 10https://gerrit.wikimedia.org/r/192549 (owner: 10Filippo Giunchedi) [14:51:05] 6operations, 7Tracking: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) (tracking) - https://phabricator.wikimedia.org/T65899#1062557 (10Aklapper) Wondering what is the status here. {T80945} (get rid of 10.04) might block this one? [14:51:08] gwicke mobrovac restbase1001 is online btw! [14:51:20] opaaa [14:51:21] nice! [14:51:28] godog: the cassandra issue has been resolved? [14:51:41] why couldn't puppet install cassandra.env.sh? [14:52:30] mobrovac: I think because the parent directory is missing, and that is because the cassandra package isn't installed yet [14:53:00] ah i see [14:53:09] so we have to wait for the next puppet run? 
[14:53:16] to answer your question, no that still hasn't been fixed but not a huge issue [14:53:24] no I've fixed that manually, it can't recover by itself [14:57:04] i see restbase is there && well-configured, cassandra is running [14:58:53] yep, I'm about to do a rolling restart of cassandra to pick up metrics [14:59:02] ok [14:59:15] !log rolling restart cassandra to pick up metrics configuration [14:59:17] Logged the message, Master [14:59:18] i'm trying to start restbase on 1001, but it doesn't listen to me [14:59:21] need to investigate [15:00:05] andrewbogott, cmjohnson1: Respected human, time to deploy WMF Labs maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150224T1500). Please do the needful. [15:00:41] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 5Patch-For-Review: Create a static HTML version of Bugzilla - https://phabricator.wikimedia.org/T85140#1062586 (10Dzahn) >>! In T85140#1056325, @Nemo_bis wrote: >> To get rid of Bugzilla > > This is not going to happen in this decade anyway. It is very muc... 
[15:01:37] operations, ops-codfw, wikis-in-codfw: PXE doesn't work on mc2017-18 - https://phabricator.wikimedia.org/T90586#1062590 (Joe) NEW [15:03:31] apergos: maint-announce is in a bit of a mess [15:03:32] operations, Citoid: Configure citoid to use the new zotero service - https://phabricator.wikimedia.org/T89873#1062600 (Mvolz) [15:03:34] Blocked-on-Operations, operations, Citoid, Scrum-of-Scrums, Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1062599 (Mvolz) [15:03:35] operations, Citoid: Configure citoid to use outbound proxy - https://phabricator.wikimedia.org/T89875#1062601 (Mvolz) [15:04:25] operations, Citoid: Configure citoid to use outbound proxy - https://phabricator.wikimedia.org/T89875#1047771 (Mvolz) [15:05:26] operations, Analytics, Analytics-Kanban: Upgrade Analytics Cluster to Trusty, and then to CDH 5.3 - https://phabricator.wikimedia.org/T1200#1062615 (Ottomata) Open>Resolved All is good! [15:06:02] (PS1) Giuseppe Lavagetto: dhcp: add entry for codfw memcaches [puppet] - https://gerrit.wikimedia.org/r/192551 [15:06:06] (PS1) Dzahn: add thcipriani to deployers admin group [puppet] - https://gerrit.wikimedia.org/r/192552 (https://phabricator.wikimedia.org/T90467) [15:07:38] operations, ops-codfw, wikis-in-codfw: PXE doesn't work on mc2017-18 - https://phabricator.wikimedia.org/T90586#1062626 (Cmjohnson) these 2 servers are the same as 1017 and 1018 in eqiad. Try using the other port on the NIC card....for some reason they're set up backwards. Double check MAC. [15:09:05] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to ANALYTICS RESOURCES for joal - https://phabricator.wikimedia.org/T89357#1062634 (Dzahn) The 2 patches linked in this ticket are both merged. What's missing?
[15:09:15] godog: can't seem to start restbase, i can do so manually, but /etc/init.d script nor systemctl seem to start it [15:09:19] (on rb1001) [15:09:24] any ideas? [15:10:51] <_joe_> mobrovac: does the log say something? [15:10:59] sadly, nope [15:11:30] <_joe_> mh I mean the systemd log [15:11:31] nothing in /var/log/restbase, and syslog reads: Feb 24 14:57:26 restbase1001 systemd[1]: Started LSB: REST storage API and backend orchestration layer. [15:11:32] Feb 24 11:33:57 restbase1001 restbase[4514]: Error: Cannot find module '/usr/lib/restbase/deploy/restbase/server.js' [15:11:35] ^ this? [15:11:57] mutante: that seems to be an older entry [15:12:13] ah, yes [15:12:14] permissions on /srv/deployment/restbase look too strict to me [15:12:25] lemme check [15:12:37] <_joe_> mobrovac: systemctl status restbase will give you a peek at the logs too [15:12:52] (03CR) 10Gilles: "Are you guys talking about sending an extra header to tag the requests, and trigger a 304 if it's already been preloaded with a different " [puppet] - 10https://gerrit.wikimedia.org/r/190821 (https://phabricator.wikimedia.org/T89088) (owner: 10Gilles) [15:13:04] godog: indeed, puppet mess-up ? [15:13:24] _joe_: true, forgot that :P [15:13:27] thnx [15:13:43] mobrovac: not sure tbh, I wonder why and how it worked on the other hosts [15:13:47] <_joe_> mobrovac: it's one of the niceties of systemd [15:14:07] <_joe_> oh noes I said systemd is nice in a logged channel! 
[15:14:13] PROBLEM - MariaDB disk space on db2011 is CRITICAL: DISK CRITICAL - free space: /srv 86780 MB (5% inode=99%): [15:14:24] lol [15:14:46] will fix the perms manually for now, but this is indeed strange [15:14:51] (03PS2) 10Giuseppe Lavagetto: dhcp: add entry for codfw memcaches [puppet] - 10https://gerrit.wikimedia.org/r/192551 [15:15:14] (03CR) 10Giuseppe Lavagetto: [C: 032] dhcp: add entry for codfw memcaches [puppet] - 10https://gerrit.wikimedia.org/r/192551 (owner: 10Giuseppe Lavagetto) [15:15:28] mobrovac: I'll open a ticket, I'm sure we'll forget otherwise [15:15:47] /srv/deployment/dropwizard/ has the right perms though [15:15:59] godog: kk [15:16:06] issue with trebuchet messing up permissions that only happens rarely, but i think happened before [15:16:31] gwicke: have you seen wrong permissions on /srv/deployment/restbase on restbase hosts before? [15:17:03] he's still asleep [15:17:04] :) [15:17:15] all of the other hosts have the correct perms [15:20:03] (03PS2) 10Dzahn: LVS configuration for zotero service [puppet] - 10https://gerrit.wikimedia.org/r/191938 (https://phabricator.wikimedia.org/T89867) [15:21:52] I think it might be related to this https://phabricator.wikimedia.org/T87843 [17:56:20] twentyafterfour: look at https://gerrit.wikimedia.org/r/192568 for a possible cleaner solution in the future [17:56:20] Bye :) [17:56:20] hmm [17:56:20] coren isn't here. [18:00:05] maxsem, kaldari: Dear anthropoid, the time has come. Please deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150224T1800). 
[18:00:11] PROBLEM - manage_nfs_volumes_running on labstore1001 is CRITICAL: Connection refused by host [18:00:11] PROBLEM - RAID on labstore1001 is CRITICAL: Connection refused by host [18:00:11] PROBLEM - configured eth on labstore1001 is CRITICAL: Connection refused by host [18:00:12] PROBLEM - puppet last run on labstore1001 is CRITICAL: Connection refused by host [18:00:12] PROBLEM - dhclient process on labstore1001 is CRITICAL: Connection refused by host [18:00:31] PROBLEM - Disk space on labstore1001 is CRITICAL: Connection refused by host [18:00:50] PROBLEM - salt-minion processes on labstore1001 is CRITICAL: Connection refused by host [18:00:50] PROBLEM - DPKG on labstore1001 is CRITICAL: Connection refused by host [18:01:21] (03PS1) 10Daniel Kinzler: Canonical location for sitelist-1.0.xsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192561 [18:01:43] (03PS1) 10Giuseppe Lavagetto: dhcp: add entry for codfw memcaches [puppet] - 10https://gerrit.wikimedia.org/r/192563 [18:02:18] <_joe_> grrrit-wm: you're a bit behind, baby [18:02:34] (03CR) 10Giuseppe Lavagetto: [C: 032] dhcp: add entry for codfw memcaches [puppet] - 10https://gerrit.wikimedia.org/r/192563 (owner: 10Giuseppe Lavagetto) [18:02:39] (03CR) 10Alexandros Kosiaris: [C: 032] "Looks fine, +2 but not merging until we actually got zotero running in production. 
Merging will not really hurt but rather generate an (un" [puppet] - 10https://gerrit.wikimedia.org/r/191938 (https://phabricator.wikimedia.org/T89867) (owner: 10Dzahn) [18:02:43] (03CR) 10Alexandros Kosiaris: [C: 031] Use apt::repository instead of file resources [puppet] - 10https://gerrit.wikimedia.org/r/123903 (owner: 10Tim Landscheidt) [18:06:40] RECOVERY - manage_nfs_volumes_running on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/sbin/manage-nfs-volumes [18:06:41] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:06:41] RECOVERY - dhclient process on labstore1001 is OK: PROCS OK: 0 processes with command name dhclient [18:06:41] RECOVERY - configured eth on labstore1001 is OK: NRPE: Unable to read output [18:06:41] RECOVERY - RAID on labstore1001 is OK: OK: optimal, 72 logical, 72 physical [18:07:00] RECOVERY - Disk space on labstore1001 is OK: DISK OK [18:07:20] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:07:20] RECOVERY - DPKG on labstore1001 is OK: All packages OK [18:10:00] (03PS1) 10ArielGlenn: sudo for joal on analytics, stats [puppet] - 10https://gerrit.wikimedia.org/r/192581 (https://phabricator.wikimedia.org/T89357) [18:13:12] (03CR) 10ArielGlenn: [C: 032] sudo for joal on analytics, stats [puppet] - 10https://gerrit.wikimedia.org/r/192581 (https://phabricator.wikimedia.org/T89357) (owner: 10ArielGlenn) [18:13:51] morebots, still there? [18:13:51] I am a logbot running on tools-exec-11. [18:13:51] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [18:13:51] To log a message, type !log . [18:16:09] PROBLEM - NTP on labstore1001 is CRITICAL: NTP CRITICAL: Offset unknown [18:28:45] YuviPanda: when you’re no longer fretting about tools, can you look at shinken and give me the start/end time for last night’s unpleasantness? 
[18:29:47] andrewbogott: first alert was received at 11:01 AM IST [18:30:15] andrewbogott: recovery at about 1:31 [18:37:20] RECOVERY - RAID on restbase1006 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [18:42:12] I got some VCL questions, anyone know about cookie expiration and time formatting? [18:47:53] twentyafterfour: Will you do the backport etc? [19:00:05] twentyafterfour, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150224T1900). [19:02:11] hoo: I intend to but I don't know exactly how to do it ... I need to submit the same patch to the 1.25wmf18 branch of that extension and then update 1.25wmf18 branch of core to point to the new submodule revision? [19:02:26] yeah [19:02:47] ok I think I can manage that :) [19:02:47] You can just click on cherry pick to in the gerrit interface to do the cherry-pick [19:03:16] hoo: ok, I'm not very good with gerrit, so thanks for the tip [19:06:33] milimetric, sprintf? [19:26:45] do I just update an extension submodule directly on tin or do I need to do it via gerrit [19:27:42] twentyafterfour: Gerrit of course [19:28:02] twentyafterfour: https://gerrit.wikimedia.org/r/192604 merge that tomorrow before creating the branches, please [19:28:06] I ask because it looks like other submodule changes happened directly on the repo [19:29:15] twentyafterfour: Sometimes people forget to submodule update and stuff ends up in a dirty state on tin [19:29:19] but that's not nice [19:29:45] also Chris from time to time needs to apply security patches w/o pushing them to gerrit [19:31:25] right there are a bunch of those right now [19:31:50] my local checkout of core doesn't even have any submodules... 
I'm confused [19:32:15] twentyafterfour: The submodules only exist in the wmf branches [19:32:29] I strongly recommend having two local checkouts of core [19:32:35] (03PS3) 10Chmarkine: Point rel=canonical to HTTPS for ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192502 [19:32:36] One for master and one for the wmf/* branches [19:32:41] ahh [19:32:50] Because switching between those two is a nightmare if you work on any extensions [19:32:51] why separate checkout [19:32:57] yeah [19:33:03] I can see that [19:33:19] I mean, I do 99% of my work in an extension :) [19:34:52] if an account on en.wp was flagged as a bot what would prevent edits from being flagged? [19:35:06] I have an account doing programmic things but apparently it's creating noise [19:35:13] but is supposed to be flagged already as a "bot" [19:35:45] chasemp: a) this is the wrong channel b) remember to set the bot parameter [19:36:12] chasemp: If you're using the API to do these edits, then you need to explicitly set something like &bot=1 [19:36:18] :) what is the right channel for this? It's in relation to an ops bot so kinda cheating here but not crazy I think :) [19:36:22] I should remember this better because I wrote this API module [19:36:27] not using the api [19:36:31] (03PS4) 10Chmarkine: Point rel=canonical to HTTPS for all ru projects All Russian Wikimedia projects are HTTPS only, so change the canonical links to have search engines update their indexes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192502 (https://phabricator.wikimedia.org/T90527) [19:37:05] chasemp: You should really use the api... [19:37:19] bot=1 might also work in some points in the interface (like for rollbacks) [19:37:28] but in general I guess you get undefined behaviour [19:37:44] I can't use the api as this is for measuring editing performance from the perspective of the end user [19:44:07] hello bblack, have a little time? 
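[annotation] The "&bot=1" advice above (19:36:12) refers to the `bot` parameter of the MediaWiki `action=edit` API module. A minimal sketch of what such a request body might look like follows; the helper names, the sample title, and the token value are made up for illustration, and this only builds the POST body rather than sending anything. It does not apply to chasemp's case, since that tool deliberately goes through the web interface.

```python
from urllib.parse import urlencode


def build_bot_edit_params(title, text, token, summary=""):
    """Build POST parameters for an action=edit request flagged as a bot edit.

    Hypothetical helper for illustration; the parameter names (action, title,
    text, summary, bot, token) are those of the MediaWiki edit API module.
    """
    return {
        "action": "edit",
        "format": "json",
        "title": title,
        "text": text,
        "summary": summary,
        # Marks the edit as a bot edit; only honoured if the account has the bot right.
        "bot": "1",
        # The token is placed last so a truncated request fails instead of saving.
        "token": token,
    }


def encode_body(params):
    """URL-encode the parameters into an application/x-www-form-urlencoded body."""
    return urlencode(params)


if __name__ == "__main__":
    body = encode_body(
        build_bot_edit_params("Sandbox", "hello", "abc+\\", summary="perf probe")
    )
    print(body)
```

Without the flag the edit is saved as an ordinary edit even when the account is in the bot group, which matches the "you need to explicitly set something like &bot=1" remark.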
[19:45:06] hoo: false alarm even the people complaining are apparently showing a view that does not exclude bots [19:45:09] and don't know it :) [19:45:26] :P [19:51:39] hoo: yeah, the newbies contributions page ignores the bot flag, which is what they were looking at [19:51:53] so I got very confused and then realized that, no, what I had done earlier was still working :P [19:59:54] (03PS1) 10QChris: Upgrade phabricator plugins to add add-project action [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/192613 (https://phabricator.wikimedia.org/T89967) [20:00:16] (03PS1) 1020after4: Group1 wikis to 1.25wmf18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192614 [20:04:02] ok what did I do wrong: https://integration.wikimedia.org/ci/job/mediawiki-extensions-zend/4381/console [20:04:20] this is for https://gerrit.wikimedia.org/r/#/c/192611/ [20:04:26] couldn't merge [20:04:49] <^demon|lunch> https://integration.wikimedia.org/ci/job/mediawiki-extensions-zend/4381/consoleFull says Flow test went boom [20:05:02] !log Updated gerrit plugin its-phabricator to 25a34d7564cffb90a87110a971782195ba2db467 [20:05:08] Logged the message, Master [20:05:15] <^demon|lunch> Failed asserting that '1424721465' matches expected '1424721464' [20:05:20] <^demon|lunch> Yay, off by 1 [20:05:41] !log Updated gerrit plugin its-phabricator-from-bugzilla to 03b936b2cd8fa6adfdbee0ef68eb4b31944936c2 [20:05:44] Logged the message, Master [20:10:59] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). 
[20:19:01] PROBLEM - DPKG on cp1070 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:19:10] PROBLEM - salt-minion processes on cp1065 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:19:21] PROBLEM - DPKG on cp1069 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:19:38] (03CR) 10JanZerebecki: [C: 031] Point rel=canonical to HTTPS for all ru projects All Russian Wikimedia projects are HTTPS only, so change the canonical links to have search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192502 (https://phabricator.wikimedia.org/T90527) (owner: 10Chmarkine) [20:19:40] PROBLEM - DPKG on cp1057 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:19:49] PROBLEM - DPKG on cp1064 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:19:49] PROBLEM - DPKG on cp1065 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:19:50] PROBLEM - DPKG on cp1056 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:19:50] PROBLEM - salt-minion processes on cp1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:20:00] PROBLEM - DPKG on cp1060 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:20:00] PROBLEM - salt-minion processes on cp1060 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:20:11] PROBLEM - DPKG on amssq42 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:20:20] RECOVERY - salt-minion processes on cp1065 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:20:20] PROBLEM - salt-minion processes on cp1070 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:20:26] ^ that's all me, and it's all fine [20:20:30] PROBLEM - salt-minion processes on amssq42 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:20:30] RECOVERY - DPKG on 
cp1069 is OK: All packages OK [20:20:49] RECOVERY - DPKG on cp1057 is OK: All packages OK [20:20:50] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [20:20:50] RECOVERY - DPKG on cp1056 is OK: All packages OK [20:21:00] RECOVERY - salt-minion processes on cp1060 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:21:20] PROBLEM - DPKG on lvs1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:21:21] PROBLEM - puppet last run on amssq52 is CRITICAL: CRITICAL: Puppet has 2 failures [20:21:33] RECOVERY - salt-minion processes on cp1070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:22:40] RECOVERY - salt-minion processes on amssq42 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:23:10] RECOVERY - salt-minion processes on cp1064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:24:09] RECOVERY - DPKG on cp1064 is OK: All packages OK [20:24:30] RECOVERY - DPKG on lvs1004 is OK: All packages OK [20:27:50] RECOVERY - DPKG on cp1070 is OK: All packages OK [20:28:30] RECOVERY - DPKG on cp1065 is OK: All packages OK [20:28:42] RECOVERY - DPKG on cp1060 is OK: All packages OK [20:30:01] RECOVERY - DPKG on amssq42 is OK: All packages OK [20:30:20] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [20:31:30] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:31:47] !log restarting cr1-eqiad/re1 chassis-control; should not be traffic-disrupting [20:31:52] Logged the message, Master [20:32:50] (03PS1) 10QChris: Drop gerrit's filtering of actions based on bugtracker system [puppet] - 10https://gerrit.wikimedia.org/r/192623 [20:32:52] (03PS1) 10QChris: Explicitly associate 'Patch-For-Review' project to tasks [puppet] - 10https://gerrit.wikimedia.org/r/192624 
(https://phabricator.wikimedia.org/T89967) [20:35:11] (03PS1) 10Aude: Enable WikibaseClient on Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192625 [20:36:14] (03CR) 10Aude: [C: 04-2] "still need to populate sites table etc." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192625 (owner: 10Aude) [20:38:20] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:39:44] (03CR) 10Rush: [C: 032] Drop gerrit's filtering of actions based on bugtracker system [puppet] - 10https://gerrit.wikimedia.org/r/192623 (owner: 10QChris) [20:39:53] (03CR) 10Rush: [C: 032] Explicitly associate 'Patch-For-Review' project to tasks [puppet] - 10https://gerrit.wikimedia.org/r/192624 (https://phabricator.wikimedia.org/T89967) (owner: 10QChris) [20:39:59] RECOVERY - puppet last run on amssq52 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:43:23] 19:58:48 There was 1 failure: [20:43:26] 19:58:48 [20:43:28] 19:58:48 1) Flow\Tests\Import\HistoricalUIDGeneratorTest::testRoundTrip with data set #0 (1424721464) [20:43:30] 19:58:48 Failed asserting that '1424721465' matches expected '1424721464'. [20:43:32] 19:58:48 [20:43:33] 19:58:48 /srv/ssd/jenkins-slave/workspace/mediawiki-extensions-zend/src/extensions/Flow/tests/phpunit/Import/HistoricalUIDGeneratorTest.php:31 [20:43:36] 19:58:48 /srv/ssd/jenkins-slave/workspace/mediawiki-extensions-zend/src/tests/phpunit/MediaWikiTestCase.php:132 [20:47:08] sounds like a bad test to me [20:47:18] like a race condition or such [20:49:23] make your assertion subtract 1 ;) [20:49:51] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [20:50:26] (03CR) 10Tim Landscheidt: "My primary motivation is https://phabricator.wikimedia.org/T63897 ("Move LabsDB aliases and NAT to DNS and LabsDB servers"), i. e. 
to be a" [software] - 10https://gerrit.wikimedia.org/r/191846 (owner: 10Tim Landscheidt) [20:53:43] greg-g: around? [20:53:44] aude: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around. [20:53:49] aaah [20:54:26] greg-g: when twentyafterfour is done, i'd like to enable wikibase on wikibooks, per https://wikitech.wikimedia.org/wiki/Deployments#Upcoming [20:55:06] think everything is ready and no hurry [20:59:00] twentyafterfour: that test probably uses the current second, twice or something please file a bug after you are done, so someone fixes that [20:59:00] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [21:00:18] (03CR) 1020after4: [C: 032] "deploying 1.25wmf18" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192614 (owner: 1020after4) [21:00:24] (03PS1) 10Ottomata: Parameterize yarn.nodemanager.resource.cpu-vcores, default to $::processorcount - 1 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/192651 [21:00:26] (03Merged) 10jenkins-bot: Group1 wikis to 1.25wmf18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192614 (owner: 1020after4) [21:00:48] aude: almost done [21:00:52] twentyafterfour: ok [21:02:02] aude: open to also deploying another config change?: https://gerrit.wikimedia.org/r/#/c/192502/4 [21:02:32] (03CR) 10Ottomata: [C: 032] Parameterize yarn.nodemanager.resource.cpu-vcores, default to $::processorcount - 1 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/192651 (owner: 10Ottomata) [21:03:20] (03PS1) 10Ottomata: Update cdh module with cpu-vcores set [puppet] - 10https://gerrit.wikimedia.org/r/192670 [21:03:49] jzerebecki: https://phabricator.wikimedia.org/T90639 [21:04:10] i think that is better for swat, since it's not related to wikibooks / wikidata [21:04:10] perfect [21:04:17] (03PS1) 10RobH: rewriting techblog.w.o to blog.w.o [apache-config] - 10https://gerrit.wikimedia.org/r/192675 
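[annotation] The off-by-one failure discussed above (1424721465 vs. 1424721464) was guessed in channel to come from a test that "uses the current second, twice". The sketch below illustrates that general failure mode and one way around it, assuming a made-up ID format; `make_uid` and `timestamp_of` are hypothetical stand-ins and not the Flow `HistoricalUIDGenerator` code, whose actual cause is not established in this log.

```python
import time


def make_uid(timestamp):
    """Hypothetical generator: encode a Unix timestamp (seconds) into an ID string."""
    return f"{timestamp:010d}-0001"


def timestamp_of(uid):
    """Recover the timestamp that was encoded by make_uid."""
    return int(uid.split("-")[0])


def round_trip_flaky():
    """Reads the clock twice; fails whenever a second boundary falls in between."""
    uid = make_uid(int(time.time()))
    return timestamp_of(uid) == int(time.time())


def round_trip_stable(now=None):
    """Reads the clock at most once and reuses that value for the comparison."""
    now = int(time.time()) if now is None else now
    uid = make_uid(now)
    return timestamp_of(uid) == now


if __name__ == "__main__":
    # A fixed input makes the check deterministic regardless of wall-clock time.
    print(round_trip_stable(1424721464))
```

Subtracting 1 in the assertion, as joked at 20:49:23, would only move the flakiness; pinning or injecting the timestamp removes it.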
[21:04:33] aude: ok, was worth a try :) [21:05:34] https://gerrit.wikimedia.org/r/#/c/192561/ might be ok though [21:06:29] mh checking [21:06:48] ok so now to deploy this I do `sync-dir php-1.25wmf18` and then sync-wikiversions [21:06:53] ? [21:07:09] making sure I'm not missing anything [21:07:25] did you update a submodule in wmf18? [21:07:40] yes [21:07:47] which one? [21:07:52] popups [21:08:04] small javascript change [21:08:08] i would sync just that extension [21:08:25] sync-dir php-1.25wmf18/extensions/Popups [21:08:32] ideally with a message/reason [21:08:55] then sync wikiversions [21:09:01] so nothing in the parent repo matters? even though the parent repo had to be updated to point to the new submodule revision? [21:09:04] (03CR) 10Ottomata: [C: 032] Update cdh module with cpu-vcores set [puppet] - 10https://gerrit.wikimedia.org/r/192670 (owner: 10Ottomata) [21:09:08] i don't think so [21:09:11] ok [21:09:31] don't forget ot run git submodule update --init --recurisve [21:09:35] --recursive [21:09:50] it needs to re-init? [21:09:57] I just ran submodule update [21:10:20] i don't think it does [21:10:30] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:10:45] the directory extensions/Popups is the only thing that changed in the parent repo, so you are already syncing that [21:11:20] so I don't need update --init --recursive? [21:11:28] init is only needed the first time (e.g. 
for wikibase, in cloned fresh mediawiki) [21:11:30] twentyafterfour: that should be sufficient if there is no subsubmodule in popups, [21:11:36] (03PS2) 10RobH: setting codfw wtp production dns entries [dns] - 10https://gerrit.wikimedia.org/r/191944 [21:11:46] I* [21:11:55] for my submodule updates [21:12:13] --init is also needed when new submodules got added [21:12:29] ok [21:12:59] but it is safe and quick to always do --init --recursive then you will notice when something unexpected was changed [21:13:45] i usually check https://wikitech.wikimedia.org/wiki/How_to_deploy_code even after having done this numerous times [21:13:56] just to be sure i didn't forget something this time [21:14:04] ok ... other extensions changed when I did recursive [21:14:25] hm [21:14:47] (03CR) 10Rush: [C: 031] rewriting techblog.w.o to blog.w.o [apache-config] - 10https://gerrit.wikimedia.org/r/192675 (owner: 10RobH) [21:14:59] (03CR) 10Rush: "please be careful thanks :D" [apache-config] - 10https://gerrit.wikimedia.org/r/192675 (owner: 10RobH) [21:15:01] can I just sync the whole thing? there are a bunch of security patches sitting on extension branches [21:15:04] could be one of them had a security path [21:15:05] patch [21:15:31] yes a bunch of security patches at the moment [21:16:11] time tradeoff, syncing each of the changed submodules is probably faster than the full thing [21:16:16] aude: I assume you're being taken care of now (sorry, just got back from two appts, one being a smog check) [21:16:18] so should I sync the whole branch and test it on mediawiki.org before I push untested code live to all teh pedias? [21:16:29] what things changed? 
[21:16:47] greg-g: i'll take care of it when twentyafterfour is done (no hurry for him) [21:16:58] Submodule path 'extensions/CheckUser': rebased [21:17:01] been well-communicated, etc ot the community [21:17:11] Submodule path 'extensions/ContentTranslation': rebased [21:17:21] twentyafterfour: i am not sure [21:17:33] Submodule path 'extensions/Scribunto': rebased [21:17:35] (03CR) 10RobH: [C: 032] setting codfw wtp production dns entries [dns] - 10https://gerrit.wikimedia.org/r/191944 (owner: 10RobH) [21:17:42] all security patches [21:17:45] might be ok [21:18:05] but not 100% sure enough to say [21:18:44] looking at them, i think ok [21:19:05] I don't think those are new commits. they were there before I started [21:19:16] yeah [21:20:10] if you sync the branch before sync-wikiversions, then it we can check it on test2.wikipedia etc [21:20:24] and won't be anywhere totally critical yet [21:20:35] does this mean they changed or will a second submodule update --init --recursive say the same? [21:21:57] second time says the same thing [21:22:05] apparently it's not a good test for "something changed" [21:22:40] (03CR) 10JanZerebecki: [C: 031] Canonical location for sitelist-1.0.xsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192561 (owner: 10Daniel Kinzler) [21:23:20] Are we in any kind of deployment window? (seems like you guys might be due to discussions, but i dont see it on deployments) [21:23:33] robh: we are [21:23:33] i'd like to push some changes to redirects on cluster but its non-time sensitive so it can wait if folks are. 
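Editor's note on the submodule discussion above: re-running `git submodule update --init --recursive` printed the same "rebased" lines both times, so its output did not show what had actually changed. A minimal sketch of an alternative check follows, assuming the one-character prefixes that `git submodule status` documents (`+` for a checkout that differs from the recorded commit, `-` for an uninitialized submodule, `U` for merge conflicts). The function names are hypothetical and this is not part of the deployment tooling discussed in the channel.

```python
import subprocess


def changed_submodules(status_output: str) -> list[str]:
    """Return submodule paths whose status line carries a '+', '-' or 'U' prefix."""
    changed = []
    for line in status_output.splitlines():
        if not line.strip():
            continue
        flag = line[0]             # ' ', '+', '-' or 'U'
        fields = line[1:].split()  # [sha1, path, optional "(describe)"]
        if len(fields) < 2:
            continue
        if flag in "+-U":
            changed.append(fields[1])
    return changed


def changed_submodules_in(repo_dir: str) -> list[str]:
    """Run `git submodule status --recursive` in repo_dir and parse the result."""
    result = subprocess.run(
        ["git", "submodule", "status", "--recursive"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return changed_submodules(result.stdout)
```

Running `changed_submodules_in(".")` from the branch checkout before syncing would list only the paths that still differ from what the parent repo records; paths containing spaces are not handled in this sketch.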
[21:23:35] cool [21:23:37] robh: overtime from last [21:23:41] train deploy ralmost done [21:23:44] and then have to do some wikidata stuff [21:23:51] which includes scap [21:23:58] * aude can add to the calendar [21:24:01] when you guys finish the wikidata stuff and are all done, ping me pls =] [21:24:01] !log increased Hadoop nodemanager cpu-vcores to facter $processcount - 1, this should increase hadoop cluster utilization [21:24:02] I'm just gonna sync the popups and test real quick then sync wikiversions [21:24:04] Logged the message, Master [21:24:09] my change is not a big deal, it can totally wait [21:24:14] and its for something in two days [21:24:28] robh: ok, thanks [21:24:46] (03CR) 10RobH: [C: 031] "I'll push this later today." [apache-config] - 10https://gerrit.wikimedia.org/r/192675 (owner: 10RobH) [21:25:02] !log twentyafterfour Synchronized php-1.25wmf18/extensions/Popups: hotfix for Popups (see https://gerrit.wikimedia.org/r/#/c/192465/) (duration: 00m 06s) [21:25:05] Logged the message, Master [21:25:10] my paranoia with redirects in apache makes me want to do this after everyone else is quite finished, heh. [21:26:49] robh: ok :) [21:27:18] apparently my hotfix didn't fix anything [21:27:42] twentyafterfour: did you check on test2 or test.wikipedia? [21:28:23] yes [21:28:32] also might take a minute, if it is a js change [21:28:41] and www.wikimedia.org [21:28:42] due to resource loader caching [21:28:50] also try with debug=true? [21:28:53] ahh, how long does that cache take [21:29:05] can take a minute [21:29:17] what do you mean with debug=true [21:29:30] e.g. https://test2.wikipedia.org/wiki/Kitten?debug=true [21:29:36] bypasses resourceloader [21:29:55] nice to know [21:30:03] :) [21:30:21] doesn't help [21:30:29] so the bug remains, somehow [21:30:34] I'll dig a bit deeper [21:31:11] :/ [21:32:19] oh my beta features aren't enabled on testwiki [21:32:35] Uh-oh. Is that supposed to happen? 
[21:32:35] :( [21:32:36] still broken [21:33:06] well, works on some links [21:35:46] bblack: hola, yt? [21:37:45] twentyafterfour: I think it works [21:38:13] se4598|away: it's only working on a select few links, can't tell why [21:39:28] Then it's ok: the TextExtracts extension (the backend) doesn't return text for all links. If you look at the network requests and there's a new request on hover, then it's totally fine [21:40:33] anchored links won't work with that change, but better than nothing [21:40:44] ok [21:40:51] deploying it then [21:42:55] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: sync-wikiversions group1 to 1.25wmf18 [21:43:02] Logged the message, Master [21:43:38] aude: I'm out [21:43:45] you can do your wikidata stuff [21:44:00] nuria: yes [21:44:28] twentyafterfour: ok, thanks [21:44:43] bblack: helloooo, i have some questions about a project for which we need to change VCL and .. ahem ... (you are going to love it) [21:44:52] bblack: add at least a couple of cookies per user [21:46:46] nuria: do you have a link to technical details? I assume you mean a couple of fixed cookie names, which many users will use? [21:47:19] bblack: yes, doc is here (we will edit it further, today wikitech login was down) [21:47:51] bblack: https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_visit_solution [21:50:59] in short: one cookie would go on .<>.org and one would go on <>.<>.org [21:51:58] for VCL, we need to know how to format "now" into "year-month" [21:52:03] and how to set a cookie's expiration date [21:52:13] (03CR) 10Dzahn: [C: 04-1] "this is (nowadays) the wrong repository.
The Apache redirects have been moved by joe and ori into the puppet repo inside the mediawiki mod" [apache-config] - 10https://gerrit.wikimedia.org/r/192675 (owner: 10RobH) [21:52:17] !log aude Started scap: Updates for enabling Wikibase on Wikibooks [21:52:22] Logged the message, Master [21:52:25] (03CR) 10Faidon Liambotis: [C: 04-1] "redirects.conf is auto-generated by redirects.dat. There's a PHP script in the same directory. You shouldn't update it manually." [apache-config] - 10https://gerrit.wikimedia.org/r/192675 (owner: 10RobH) [21:52:29] mutante: doh... well, glad i asked you [21:52:52] (03CR) 10Faidon Liambotis: "Oh and of course, what Daniel said :)" [apache-config] - 10https://gerrit.wikimedia.org/r/192675 (owner: 10RobH) [21:54:13] why has no one updated the damned wikitech docs on this? [21:54:19] =OP [21:54:20] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "This repository is not used anymore, there is a file in the home dir of it tht should state that quite clearly. This change should be done" [apache-config] - 10https://gerrit.wikimedia.org/r/192675 (owner: 10RobH) [21:54:43] https://gerrit.wikimedia.org/r/#/c/192675/ [21:54:48] bah, wrong link [21:54:48] https://wikitech.wikimedia.org/wiki/Application_servers [21:54:59] oh, it has, haha, im just stupid [21:57:39] nuria: yeah there are a lot of cache implications there, we would have to generate the cookie in VCL and emit stats for frontend cache hits from VCL as well, otherwise the cache defeats your scheme. Then there are also the standard cookie-naming issues. [21:58:40] bblack: i saw there is a vcl_hash to cache content with cookies but I am not clear from our code if we are using that [21:59:01] bblack: maybe we should have a short meeting to discuss? [21:59:32] (03Abandoned) 10Dzahn: fix some indentation and test if gerrit works [dns] - 10https://gerrit.wikimedia.org/r/192040 (owner: 10Dzahn) [21:59:36] nuria: well we'd ignore the cookie for cache-hashing purposes so as not to fragment. 
but the larger issue is that we'd keep serving old cookie values to clients from the cache if they weren't being generated in VCL instead of in MediaWiki [21:59:50] nuria: a meeting is fine, maybe set it up for later this week? [22:00:17] bblack: will do, do you work on central time? [22:00:38] that is the timezone I occupy, but it doesn't always have a lot of bearing on when I'm awake :) [22:00:44] (03PS1) 10Ottomata: Alert critical if important Hadoop service processes are down [puppet] - 10https://gerrit.wikimedia.org/r/192701 (https://phabricator.wikimedia.org/T89730) [22:01:15] bblack: that seems to be the trend at wikipedia man ... we will try middle of the day, will send meeting in a sec [22:01:25] ok [22:04:41] (03PS2) 10Ottomata: Alert critical if important Hadoop service processes are down [puppet] - 10https://gerrit.wikimedia.org/r/192701 (https://phabricator.wikimedia.org/T89730) [22:08:21] !log aude Finished scap: Updates for enabling Wikibase on Wikibooks (duration: 16m 03s) [22:08:28] Logged the message, Master [22:09:56] 12 [10000ms] at runtime/ext_mysql: slow query: SELECT MASTER_POS_WAIT('db1052-bin.001714', 135813027, 10) [22:10:16] this is repeated lots it's taken every slot in fatalmonitor [22:10:24] <^d> Those have been around a lot the last week [22:10:29] <^d> Different db* hosts [22:10:56] just popped up though, was not on the page until moments ago [22:11:38] that indicates a slowdown in a master, I think that's the slaves reading from master [22:11:40] i don't see why... [22:11:49] * ^d has seen a lot of these [22:12:10] <^d> Ouch, that's really bad [22:12:13] <^d> Tons of them right now [22:12:13] yeah several hundred of them popped up all at once [22:12:31] it's all for db1052-bin.001714 [22:12:32] they stopped ~ 3 min ago [22:12:38] <^d> Any opsen about who could poke db1052? [22:12:40] maybe with the scap done? 
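Editor's note on the slow-query flood above: the channel judged by eye that every entry pointed at `db1052-bin.001714`. A minimal sketch of tallying such lines per binlog host follows, assuming log lines shaped like the two quoted in this log. The function name and the regular expression are illustrative assumptions, not part of fatalmonitor or any existing tool.

```python
import re
from collections import Counter

# Matches e.g. MASTER_POS_WAIT('db1052-bin.001714', 135813027, 10)
MASTER_POS_WAIT_RE = re.compile(
    r"MASTER_POS_WAIT\('(?P<host>[\w.-]+?)-bin\.(?P<seq>\d+)',"
    r"\s*(?P<pos>\d+),\s*(?P<timeout>\d+)\)"
)


def tally_master_pos_wait(lines):
    """Count log lines per master host named in a MASTER_POS_WAIT slow query."""
    counts = Counter()
    for line in lines:
        match = MASTER_POS_WAIT_RE.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts
```

Feeding it the lines of a log excerpt gives a per-host count, which makes it easier to tell whether one master or several are involved. The leading number on each quoted line is deliberately ignored here, since its meaning is not stated in the log.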
[22:12:43] (03Abandoned) 10RobH: rewriting techblog.w.o to blog.w.o [apache-config] - 10https://gerrit.wikimedia.org/r/192675 (owner: 10RobH) [22:13:05] (03PS1) 10GWicke: Set a per-worker heap limit for restbase [puppet] - 10https://gerrit.wikimedia.org/r/192703 [22:14:02] (03PS1) 10RobH: rewriting techblog.wikimedia.org to blog.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/192704 [22:17:53] (03CR) 10Aude: [C: 032] Enable WikibaseClient on Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192625 (owner: 10Aude) [22:18:13] (03Merged) 10jenkins-bot: Enable WikibaseClient on Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192625 (owner: 10Aude) [22:20:37] !log aude Synchronized wmf-config/Wikibase.php: Enable Wikibooks sitelinks on Wikidata (duration: 00m 06s) [22:20:41] Logged the message, Master [22:21:32] * aude grumble.... have to bump cache epoch on wikidata again for this [22:21:42] (03PS2) 10GWicke: Set a per-worker heap limit for restbase [puppet] - 10https://gerrit.wikimedia.org/r/192703 [22:24:19] (03PS1) 10Aude: Bump cache epoch for Wikidata, for wikibooks site links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192707 [22:24:29] ^d what's the trouble with db1052? 
besides an uptime of 706 days :P [22:24:36] icinga is green [22:24:50] (03CR) 10Aude: [C: 032] Bump cache epoch for Wikidata, for wikibooks site links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192707 (owner: 10Aude) [22:24:51] just the slow queries [22:24:51] * jgage looks at scrollback [22:24:54] (03Merged) 10jenkins-bot: Bump cache epoch for Wikidata, for wikibooks site links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192707 (owner: 10Aude) [22:25:15] jgage: 5 [10000ms] at runtime/ext_mysql: slow query: SELECT MASTER_POS_WAIT('db1052-bin.001714', 135204311, 10) [22:25:29] lots of those in the logs all of a sudden [22:25:51] !log aude Synchronized wmf-config/Wikibase.php: Bump cache epoch for Wikidata (duration: 00m 06s) [22:25:55] Logged the message, Master [22:26:13] (03PS3) 10Awight: Revert "Enable FundraisingTranslateWorkflow on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/160494 [22:26:39] based on ganglia graphs it looks like there was a cpu spike and corresponding drop in net traffic for ~30 minutes which abated about 15 mins ago [22:26:45] are you still seeing symptoms after that? [22:26:50] <^d> Not really no [22:27:07] !log aude Synchronized wmf-config/InitialiseSettings.php: Wikibase config for Wikibooks (duration: 00m 06s) [22:27:12] Logged the message, Master [22:27:38] this makes me realize how little i know about troubleshooting problems with our db servers :\ [22:27:56] it was probably a hiccup but we've seen the same thing several times in the past couple of weeks [22:28:03] <^d> Yeah, that ^ [22:28:12] hmm [22:28:30] <^d> twentyafterfour: Maybe they're the same as usual, but just more visible since we've shut up some of the more annoying warnings/fatals :) [22:28:31] <_joe_> hey, db problems? Or just a transient issue and now it's ok?
[22:28:43] <^d> transient [22:28:53] !log aude Synchronized wikidataclient.dblist: Enable Wikibase Client on Wikibooks (duration: 00m 06s) [22:28:56] <_joe_> btw the slow queries reporting is one of the nice things about HHVM [22:28:57] Logged the message, Master [22:29:10] ^d: seemed it happened when scapping [22:29:19] (03PS1) 10Jdlrobson: Correctly configure human infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192709 [22:29:23] <_joe_> I kinda remember looking at those with sean [22:29:31] <_joe_> I'll ping him tomorrow [22:30:02] ganglia graphs definitely confirm a change in behavior starting about a week ago. inbound net traffic doubled and cpu usage went up a bit. [22:30:50] <_joe_> jgage: where? [22:31:03] db1052 [22:31:05] https://ganglia.wikimedia.org/latest/?r=month&cs=&ce=&m=cpu_report&c=MySQL+eqiad&h=db1052.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS [22:31:55] I can't remember if previous errors were for the same host or different ones... [22:31:56] <_joe_> jgage: there has been a clear jump yesterday [22:32:05] yeah [22:32:11] <_joe_> did we deploy something yesterday? [22:32:25] disk usage is up too [22:32:48] <_joe_> twentyafterfour: we are probably reading more from that server [22:32:58] <^d> twentyafterfour: this one, and others. [22:33:03] <^d> 1038 too, if memory serves [22:33:28] <^d> Umm, what's up with 1033? 
[22:33:29] <^d> https://ganglia.wikimedia.org/latest/?c=MySQL%20eqiad&h=db1033.eqiad.wmnet&m=cpu_report&r=month&s=by%20name&hc=4&mc=2 [22:33:59] <^d> scratch 1038, it looks ok [22:34:10] <_joe_> mmmh https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20eqiad&h=db1052.eqiad.wmnet&r=custom&z=default&jr=&js=&st=1424817134&cs=2%2F23%2F2015%204%3A44&ce=2%2F24%2F2015%2015%3A33&v=923936&m=mysql_innodb_buffer_pool_pages_dirty&vl=pages&ti=mysql_innodb_buffer_pool_pages_dirty&z=large [22:35:11] <_joe_> this is a bit worrying [22:35:22] <_joe_> I'll ask sean tomorrow about this [22:35:33] what could even cause that? seems weird [22:36:27] <_joe_> intensive writes [22:37:00] <_joe_> it's honestly too late in the evening for me to dive into mysql troubleshooting if it's not an emergency :) [22:37:37] meh apparently it's not an emergency just a curiosity [22:37:53] I don't even know that it caused any issues whatsoever [22:40:26] the unpurges trxes is consistent with high cpu [22:40:33] (03CR) 10Aude: [C: 032] Canonical location for sitelist-1.0.xsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192561 (owner: 10Daniel Kinzler) [22:40:41] (03Merged) 10jenkins-bot: Canonical location for sitelist-1.0.xsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192561 (owner: 10Daniel Kinzler) [22:40:43] * AaronS wonders if there is a long running REPEATABLE_READ trx [22:41:46] or any long-running trx with writes [22:43:03] !log aude Synchronized docroot/mediawiki/xml/: Add sitelist export-import docs (duration: 00m 07s) [22:43:08] Logged the message, Master [22:43:35] * aude done [22:43:42] robh ^ [22:43:58] sweet, thanks [22:44:10] turns out i did my change in the old and now decommissioned repo and now my new change is in code review [22:44:16] so i likely wont be pushing right now =P [22:44:36] the entire non-emergency thing makes me think a opsen code review is worth waiting for [22:46:11] robh: ok [22:48:37] (03CR) 10Ottomata: [C: 032] Set a per-worker 
heap limit for restbase [puppet] - 10https://gerrit.wikimedia.org/r/192703 (owner: 10GWicke) [22:55:13] (03PS4) 10BBlack: 3.0.6plus-wm5 for storage fixes, jessie+ only [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/188825 [22:56:15] (03CR) 10BBlack: [C: 032 V: 032] "Tested on cp1008 (jessie + ext4), works as advertised." [debs/varnish] (3.0.6-plus-wm) - 10https://gerrit.wikimedia.org/r/188825 (owner: 10BBlack) [22:58:58] !log aaron Synchronized php-1.25wmf17/includes/jobqueue/jobs/RecentChangesUpdateJob.php: 6f6d7e57be0ccff8ed2473b7d250e77703c7a6dd (duration: 00m 09s) [22:59:02] Logged the message, Master [23:01:27] ah [23:03:05] not a *huge* issue but should have backported the messages patch to wmf17 [23:03:42] * aude see if l10nupdate actually takes care of it [23:08:56] csteipp, grab me in person next time you're in office, let's discuss the redirection matters:) [23:09:11] MaxSem: I'll do that.. tomorrow sometime [23:23:27] 6operations: Wikitech registration for prior SVN user - https://phabricator.wikimedia.org/T90658#1064224 (10Dragons_flight) 3NEW [23:29:53] 6operations, 7Monitoring: create ganglia aggregator hosts - https://phabricator.wikimedia.org/T80459#1064229 (10Dzahn) [23:30:12] AaronS: yt? 
[23:30:38] * AaronS was looking at some mysql status commands [23:30:38] 6operations: Document, clean up, and make a policy for dsh groups - https://phabricator.wikimedia.org/T80415#1064231 (10Dzahn) [23:31:15] AaronS: https://tendril.wikimedia.org/report/trxlist?host=^db&schema=wik&user=wik [23:31:54] _joe_: ^ if you're still up [23:32:23] 6operations: rename gerrit2 account in LDAP - https://phabricator.wikimedia.org/T80648#1064239 (10Dzahn) [23:33:50] yep, long running trx [23:33:53] all RecentChangesUpdateJob::purgeExpiredRows jobrunners, apparently within commit between batches [23:33:58] springle: if those are killed, new ones shouldn't come up [23:34:06] ah good [23:34:22] the batch commit call was missing and MW jobs use trxs ever since the hhvm migration [23:34:27] I backported that now [23:37:29] !log killed runaway RecentChangesUpdateJob::purgeExpiredRows transactions from jobrunners. db1033 db1038 db1040 db1052 db1058 [23:37:36] Logged the message, Master [23:41:10] 6operations, 7Monitoring: graph interface drops in ganglia - https://phabricator.wikimedia.org/T80515#1064258 (10Dzahn) [23:41:34] 6operations, 7Monitoring, 7network: nagios monitor transit/peering links and alert on low/high traffic - https://phabricator.wikimedia.org/T80273#1064260 (10Dzahn) [23:42:29] 6operations, 7Monitoring: revamp apaches ganglia grouping - https://phabricator.wikimedia.org/T79947#1064270 (10Dzahn) [23:43:37] 6operations: Automate bare metal builds to work similar to labs - https://phabricator.wikimedia.org/T79997#1064279 (10Dzahn) [23:43:59] 6operations, 7Pybal: Add pybal check to ensure service IP is bound - https://phabricator.wikimedia.org/T79730#1064282 (10Dzahn) [23:44:19] 6operations, 7Monitoring: Fix torrus to not destroy stats when varnish is restarted - https://phabricator.wikimedia.org/T79127#1064285 (10Dzahn) [23:44:58] 6operations, 10Deployment-Systems: generate dsh nodegroups out of puppet data - https://phabricator.wikimedia.org/T79126#1064294 (10Dzahn) 
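Editor's note on the long-running transactions above: the diagnosis was that `RecentChangesUpdateJob::purgeExpiredRows` deleted rows in batches but the commit between batches was missing, so one transaction stayed open for the whole purge. A minimal sketch of the intended pattern follows, using Python's standard `sqlite3` module purely as a stand-in. The table layout, function name and batch size are assumptions for illustration, and this is not MediaWiki's implementation.

```python
import sqlite3


def purge_expired_rows(conn: sqlite3.Connection, cutoff: int, batch_size: int = 100) -> int:
    """Delete rows older than cutoff in small batches, committing after each batch."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM recentchanges WHERE rowid IN "
            "(SELECT rowid FROM recentchanges WHERE rc_timestamp < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        # Committing here is the step the log says was missing: it ends the
        # transaction after every batch instead of holding it open throughout.
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```

With the commit inside the loop, each transaction covers at most one batch, which is the property the backported fix was meant to restore.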
[23:51:36] 6operations, 6Release-Engineering: Thumbnails error on foreign repo config - https://phabricator.wikimedia.org/T84647#1064367 (10Dzahn) [23:52:50] 6operations: gitblit cannot list HEAD commit in operations/puppet.git (possibly others) - https://phabricator.wikimedia.org/T83916#1064376 (10Dzahn) [23:53:28] that sounds dupey... [23:53:53] 6operations, 6Release-Engineering: Thumbnails error on foreign repo config - https://phabricator.wikimedia.org/T84647#1064378 (10greg) [23:53:56] that's very possible, these are imported RT tickets that i just now made public [23:54:02] 6operations, 6Labs: Wikitech registration for prior SVN user - https://phabricator.wikimedia.org/T90658#1064379 (10chasemp) [23:54:04] in order to find what can be closed.. yea [23:54:10] https://phabricator.wikimedia.org/T52152 ? [23:54:15] or perhaps https://phabricator.wikimedia.org/T54648 [23:55:08] 6operations, 6Labs: Wikitech registration for prior SVN user - https://phabricator.wikimedia.org/T90658#1064224 (10chasemp) @andrew or @robh can we change that email on the signup page to operations@phab.wm.o? and https://phabricator.wikimedia.org/T55793#555374. [23:55:12] 6operations: Job queue ganglia monitoring @terbium stopped working - https://phabricator.wikimedia.org/T84705#1064383 (10Dzahn) [23:55:16] 6operations, 6Labs: Wikitech registration for prior SVN user - https://phabricator.wikimedia.org/T90658#1064385 (10chasemp) p:5Triage>3Normal [23:56:35] 6operations, 6Release-Engineering: Thumbnails error on foreign repo config - https://phabricator.wikimedia.org/T84647#1064401 (10greg) 5Open>3Resolved a:3greg Those files are producing correct thumbnails for me now (even random sized thumbails I was throwing at it). 
I hope this was fixed a while ago :) [23:56:42] 6operations, 6Release-Engineering: Thumbnails error on foreign repo config - https://phabricator.wikimedia.org/T84647#1064404 (10greg) a:5greg>3None [23:58:12] 6operations: gitblit cannot list HEAD commit in operations/puppet.git (possibly others) - https://phabricator.wikimedia.org/T83916#1064408 (10chasemp) 5Open>3declined a:3chasemp @fgiunchedi, I'm declining this just because I think diffusion has replaced gitblit for ops use for this case? Unless someone co... [23:58:41] 6operations: Allow strace/gdb attachment to processes running as a user one can sudo as - https://phabricator.wikimedia.org/T84257#1064412 (10Dzahn) [23:58:46] down with gitblit, long live diffusion [23:59:30] 6operations, 7Monitoring: Job queue ganglia monitoring @terbium stopped working - https://phabricator.wikimedia.org/T84705#1064418 (10greg) [23:59:35] 6operations: Replace mysql with mariadb on virt1000 (et al) - https://phabricator.wikimedia.org/T84470#1064420 (10Dzahn)