[00:00:04] RoanKattouw, ^d, Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150305T0000). [00:00:29] incubator might have some? [00:01:38] PROBLEM - Cassandra database on restbase1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [00:01:54] meta? office? species? ten? wikimania20xxwiki? [00:02:27] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [00:03:28] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [00:04:11] interestingly, /test2?/ are considered special [00:04:56] that's expected [00:04:57] RECOVERY - Cassandra database on restbase1005 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [00:05:04] they shouldn't remain under wikipedia.org IMO [00:06:57] <^d> You guys keep wanting to change so much [00:06:57] (03PS3) 10MaxSem: Kill some usages of 'wiki' group in mobile-related settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194074 (https://phabricator.wikimedia.org/T91340) [00:07:02] <^d> First the dbnames [00:07:04] <^d> Then test.wp.o [00:07:07] <^d> What's next?!? [00:07:22] PORT STUFF TO HACK [00:07:40] Krenair, looks good? ^^^^ [00:08:07] * ^d will keep using the wiki group until someone pries it from his cold dead hands!! [00:09:03] I'm wondering if we store a record of calls to such functions. [00:09:28] It's probably okay. [00:09:48] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [00:10:14] parser functions? 
geodata stores its data, that's why I now looked [00:10:18] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 29096 seconds ago, expected 28800 [00:12:46] MaxSem, did you count the number of rows in geo_tags on each wiki? [00:13:39] some have a couple of rows e.g. from image exif, but that's not what we're interested in supporting on non-content wikis [00:14:08] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [00:15:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 29395 seconds ago, expected 28800 [00:17:32] (03PS1) 10Ori.livneh: Fix lint errors [software/statsdlb] - 10https://gerrit.wikimedia.org/r/194425 [00:18:04] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix lint errors [software/statsdlb] - 10https://gerrit.wikimedia.org/r/194425 (owner: 10Ori.livneh) [00:20:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 29696 seconds ago, expected 28800 [00:20:38] (03PS2) 10Thcipriani: Parameterize roles for labs [puppet] - 10https://gerrit.wikimedia.org/r/194413 (https://phabricator.wikimedia.org/T90592) [00:22:49] MaxSem, are you confident it's not going to break anything? [00:22:59] yep [00:23:53] okay, shall I sync this then? [00:24:04] or I can [00:24:44] okay, which one of us shall sync it? 
[00:25:06] (03CR) 10MaxSem: [C: 032] Kill some usages of 'wiki' group in mobile-related settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194074 (https://phabricator.wikimedia.org/T91340) (owner: 10MaxSem) [00:25:15] (03Merged) 10jenkins-bot: Kill some usages of 'wiki' group in mobile-related settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194074 (https://phabricator.wikimedia.org/T91340) (owner: 10MaxSem) [00:25:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 29996 seconds ago, expected 28800 [00:26:04] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/194074/ (duration: 00m 06s) [00:26:11] Logged the message, Master [00:26:38] PROBLEM - LVS HTTP IPv4 on restbase.svc.eqiad.wmnet is CRITICAL: Connection refused [00:27:10] ^ ? just got paged when about to hit the door [00:27:20] gwicke? [00:27:27] is that in prod now? [00:27:35] yes [00:27:50] well, it just got deployed [00:28:01] doubtful that it's being used:) [00:28:14] it's not used yet, am working on it [00:28:27] see my earlier log [00:29:05] ok [00:29:53] (03PS1) 10Thcipriani: Include hiera classes in lab instance role [puppet] - 10https://gerrit.wikimedia.org/r/194426 (https://phabricator.wikimedia.org/T90592) [00:30:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 30296 seconds ago, expected 28800 [00:30:23] they did ramp up what was available through it yesterday: https://gerrit.wikimedia.org/r/#/c/194244/ [00:34:09] (03PS1) 10Ori.livneh: Add .travis.yml [software/statsdlb] - 10https://gerrit.wikimedia.org/r/194427 [00:34:17] MaxSem, deploying wiki ->wikipedia change now? [00:34:19] (03CR) 10Ori.livneh: [C: 032 V: 032] Add .travis.yml [software/statsdlb] - 10https://gerrit.wikimedia.org/r/194427 (owner: 10Ori.livneh) [00:34:28] yup [00:34:31] problems? [00:34:33] don't break zero :)) [00:34:39] not yet ;) [00:34:55] well. 
you plusoned it [00:35:09] hehe [00:35:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 30595 seconds ago, expected 28800 [00:35:19] you are still the one to blame ;) [00:38:31] (03PS1) 10Ori.livneh: Configure for Coverity scan [software/statsdlb] - 10https://gerrit.wikimedia.org/r/194431 [00:38:41] (03CR) 10Ori.livneh: [C: 032 V: 032] Configure for Coverity scan [software/statsdlb] - 10https://gerrit.wikimedia.org/r/194431 (owner: 10Ori.livneh) [00:40:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 30896 seconds ago, expected 28800 [00:41:43] yurik, at worst you'll see {{#coordinates}} tags appearing instead of being parsed [00:42:09] probably. [00:42:14] Krenair, wa? where? [00:42:24] wherever it breaks, if it has done somewhere [00:42:41] It probably hasn't. Hopefully. [00:45:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 31196 seconds ago, expected 28800 [00:50:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 31496 seconds ago, expected 28800 [00:55:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 31796 seconds ago, expected 28800 [00:59:02] (03PS1) 10Ori.livneh: Fix defects pointed out by Coverity scan [software/statsdlb] - 10https://gerrit.wikimedia.org/r/194433 [00:59:23] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix defects pointed out by Coverity scan [software/statsdlb] - 10https://gerrit.wikimedia.org/r/194433 (owner: 10Ori.livneh) [00:59:42] csteipp: have you played around with coverity at all? their cloud service is free for FOSS projects and it's pretty nice [01:00:02] ori: Only what I can see without being a project owner [01:00:16] I haven't contributed enough to any java/c projects to get into any projects.. 
[01:00:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 32096 seconds ago, expected 28800 [01:00:54] ori: But yeah, I recommended we setup varnishkafka to be regularly scanned [01:01:10] ooh, good idea [01:05:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 32396 seconds ago, expected 28800 [01:05:17] RECOVERY - LVS HTTP IPv4 on restbase.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 - 5975 bytes in 0.008 second response time [01:07:13] 10Ops-Access-Requests, 6operations, 6Security: define in Puppet or remove user account - milimetric - https://phabricator.wikimedia.org/T90956#1090940 (10Tnegrin) approved [01:10:08] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:10:18] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 32695 seconds ago, expected 28800 [01:12:10] (03CR) 10Dzahn: "there's rsyslogd on 58786/udp and a python process on 8053/udp. nrpe, sshd, exim, bacula-fd, ntpd should be covered by defaults in base. n" [puppet] - 10https://gerrit.wikimedia.org/r/172434 (owner: 10John F. Lewis) [01:13:34] (03PS1) 10Ori.livneh: Enforce upper/lower bound on buffer size [software/statsdlb] - 10https://gerrit.wikimedia.org/r/194442 [01:13:40] (03CR) 10Dzahn: "8053 is diamond, should be fine i think" [puppet] - 10https://gerrit.wikimedia.org/r/172434 (owner: 10John F. Lewis) [01:13:50] (03CR) 10Ori.livneh: [C: 032 V: 032] Enforce upper/lower bound on buffer size [software/statsdlb] - 10https://gerrit.wikimedia.org/r/194442 (owner: 10Ori.livneh) [01:15:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 32996 seconds ago, expected 28800 [01:16:06] is having rsyslogd running related to being a ganglia::web or anything ganglia? 
[01:20:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 33296 seconds ago, expected 28800 [01:25:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 33596 seconds ago, expected 28800 [01:26:20] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [01:30:18] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 33895 seconds ago, expected 28800 [01:35:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 34199 seconds ago, expected 28800 [01:40:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 34496 seconds ago, expected 28800 [01:45:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 34795 seconds ago, expected 28800 [01:46:50] (03PS1) 10GWicke: Remove restbase1006 from cassandra & restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/194445 [01:48:05] Coren: you around for a review? [01:49:10] (03PS2) 10Ori.livneh: Remove restbase1006 from cassandra & restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/194445 (owner: 10GWicke) [01:49:19] (03CR) 10Ori.livneh: [C: 032 V: 032] Remove restbase1006 from cassandra & restbase hosts [puppet] - 10https://gerrit.wikimedia.org/r/194445 (owner: 10GWicke) [01:49:25] ori: thanks! 
[01:50:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 35096 seconds ago, expected 28800 [01:51:21] (03PS2) 10Dzahn: formalizing milimetric's access to stat1001 [puppet] - 10https://gerrit.wikimedia.org/r/193396 (owner: 10RobH) [01:52:18] (03CR) 10Dzahn: [C: 032] "we have the approval now" [puppet] - 10https://gerrit.wikimedia.org/r/193396 (owner: 10RobH) [01:55:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 35396 seconds ago, expected 28800 [01:58:50] 10Ops-Access-Requests, 6operations, 6Security: define in Puppet or remove user account - milimetric - https://phabricator.wikimedia.org/T90956#1091021 (10Dzahn) >>! In T90956#1074055, @RobH wrote: > https://gerrit.wikimedia.org/r/#/c/193396/ is the patchset for this change, once we have Toby's approval. @Tn... [01:59:38] 10Ops-Access-Requests, 6operations, 6Security: define in Puppet or remove user account - milimetric - https://phabricator.wikimedia.org/T90956#1091022 (10Dzahn) a:5Tnegrin>3Dzahn [01:59:46] 6operations, 6Security: Define in Puppet or remove rogue user accounts not currently defined in admin/data.yaml - https://phabricator.wikimedia.org/T90923#1091032 (10Dzahn) [01:59:47] 10Ops-Access-Requests, 6operations, 6Security: define in Puppet or remove user account - milimetric - https://phabricator.wikimedia.org/T90956#1091031 (10Dzahn) 5Open>3Resolved [02:00:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 35695 seconds ago, expected 28800 [02:02:48] (03CR) 10Dzahn: [C: 031] "and rsyslogd comes from base and that port is randomly chosen and other host that already have base::firewall also don't have a hole or it" [puppet] - 10https://gerrit.wikimedia.org/r/172434 (owner: 10John F. 
Lewis) [02:05:13] !log l10nupdate Synchronized php-1.25wmf19/cache/l10n: (no message) (duration: 00m 02s) [02:05:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 35996 seconds ago, expected 28800 [02:05:21] Logged the message, Master [02:06:20] !log LocalisationUpdate completed (1.25wmf19) at 2015-03-05 02:05:17+00:00 [02:06:27] Logged the message, Master [02:10:17] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet last ran 36296 seconds ago, expected 28800 [02:15:17] RECOVERY - check_puppetrun on payments1001 is OK: OK: Puppet is currently enabled, last run 174 seconds ago with 0 failures [02:17:33] !log l10nupdate Synchronized php-1.25wmf20/cache/l10n: (no message) (duration: 00m 02s) [02:17:42] Logged the message, Master [02:18:40] !log LocalisationUpdate completed (1.25wmf20) at 2015-03-05 02:17:37+00:00 [02:18:46] Logged the message, Master [02:25:50] 6operations: Create wikimania2016 wiki - https://phabricator.wikimedia.org/T85374#1091043 (10Lixxx235) This needs to be added to the interwiki table as "wm2016", keeping in line with the previous Wikimania wikis. [02:30:34] 6operations: Create wikimania2016 wiki - https://phabricator.wikimedia.org/T85374#1091053 (10Krenair) Looks like they did it: https://meta.wikimedia.org/wiki/Interwiki_map [02:32:46] 6operations: Create wikimania2016 wiki - https://phabricator.wikimedia.org/T85374#1091054 (10Lixxx235) it's not working (if you try to search "wm2016:" on any wiki) nor is it showing up in Special:Interwiki [02:34:32] 6operations: Create wikimania2016 wiki - https://phabricator.wikimedia.org/T85374#1091055 (10Krenair) Yeah, I think we have to run a script every few months to make changes to that page take effect. 
[02:38:35] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Mar 5 02:37:32 UTC 2015 (duration 37m 31s) [02:38:43] Logged the message, Master [02:42:59] 6operations: Create wikimania2016 wiki - https://phabricator.wikimedia.org/T85374#1091064 (10Dzahn) 5Resolved>3Open [02:43:16] 6operations: Create wikimania2016 wiki (update interwiki) - https://phabricator.wikimedia.org/T85374#1091067 (10Dzahn) [02:43:48] 7Puppet, 6Labs, 5Patch-For-Review: Enable including classes via hiera for labs - https://phabricator.wikimedia.org/T90592#1091070 (10scfc) [02:51:52] 7Puppet, 6Labs, 5Patch-For-Review: Enable including classes via hiera for labs - https://phabricator.wikimedia.org/T90592#1091079 (10scfc) @thcipriani: I want to include a class (let's say `mailclient`) in all instances of a project except for some specified nodes (i. e., say for instance `mailrelay` "do not... [02:56:24] 6operations, 7HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#1091089 (10Dzahn) [02:57:33] 6operations, 7HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#1091091 (10Dzahn) p:5Triage>3Low [02:59:07] mutante, we own w.wiki now? [02:59:47] huh, so we do [03:01:46] Krenair: yea, it's back (as in "we had it before but for a very short time) [03:02:03] why? [03:02:07] ICANN rules [03:02:17] meant we couldn't have it? or..? [03:02:54] the guy who owns .wiki gave it to us, then ICANN said he couldn't do that just yet, some time went by, now it's fine [03:03:02] lol [03:03:37] Any idea if there's a sane way to compare two .cdb files? [03:05:47] you want to check for the missing interwiki link, do you [03:06:02] there is package "tinycdb" in Debian [03:06:15] "command-line utility to create, analyze, dump and query cdb files".. 
[03:06:52] i guess dump and then diff the dumps [03:07:26] 6operations, 7HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#1091096 (10BBlack) Since this will be through the primary prod SSL clusters, we'll need to add it to our standard GlobalSign setup (a separate w.wiki SNI cert, and also adding w.wiki to our unified cert for older... [03:07:37] Yeah, or just use strings and go through diff [03:07:49] not nice, but can be done with "standard" packages [03:10:06] hoo: yeah, doesn't seem useful [03:34:53] 6operations, 6Security: Define in Puppet or remove rogue user accounts not currently defined in admin/data.yaml - https://phabricator.wikimedia.org/T90923#1091114 (10Dzahn) [03:44:26] 6operations, 10Continuous-Integration, 5Patch-For-Review: invalid byte sequence in US-ASCII - puppet issues with UTF-8 - https://phabricator.wikimedia.org/T91453#1091126 (10Dzahn) Thank you! that confirms we are past the UTF-8 related bugs and just see the next ones to fix now that we didn't see before. I s... 
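The "dump and then diff the dumps" idea from 03:06 could look roughly like the sketch below. It assumes each .cdb file has already been dumped to text with tinycdb's `cdb -d`, and that the dump uses cdbmake-style records (`+klen,dlen:key->data`, one per line, closed by a blank line). The names `parse_cdb_dump` and `diff_cdb_dumps` are hypothetical helpers for illustration, not part of tinycdb or of any script mentioned in the log.

```python
from typing import Dict, List, Tuple


def parse_cdb_dump(dump: bytes) -> Dict[bytes, bytes]:
    """Parse cdbmake-style records (+klen,dlen:key->data) into a dict.

    Assumption: the input is the text dump of one .cdb file. A cdb may
    hold duplicate keys; this sketch keeps only the last value per key.
    """
    records: Dict[bytes, bytes] = {}
    pos = 0
    while pos < len(dump) and dump[pos:pos + 1] == b"+":
        comma = dump.index(b",", pos)
        colon = dump.index(b":", comma)
        klen = int(dump[pos + 1:comma])
        dlen = int(dump[comma + 1:colon])
        key_start = colon + 1
        key = dump[key_start:key_start + klen]
        data_start = key_start + klen + 2  # skip the "->" separator
        records[key] = dump[data_start:data_start + dlen]
        pos = data_start + dlen + 1  # skip the newline that ends the record
    return records


def diff_cdb_dumps(
    old: bytes, new: bytes
) -> Tuple[List[bytes], List[bytes], List[bytes]]:
    """Return (added, removed, changed) keys between two dumps."""
    a, b = parse_cdb_dump(old), parse_cdb_dump(new)
    added = sorted(b.keys() - a.keys())
    removed = sorted(a.keys() - b.keys())
    changed = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return added, removed, changed


# Made-up sample data, only to show the call shape.
old_dump = b"+3,5:one->Hello\n+3,7:two->Goodbye\n\n"
new_dump = b"+3,5:one->Hello\n+3,3:two->Bye\n+6,1:wm2016->1\n\n"
added, removed, changed = diff_cdb_dumps(old_dump, new_dump)
print(added, removed, changed)
```

Parsing by the declared lengths rather than splitting on newlines keeps keys or values that themselves contain newlines intact, which a plain `strings | diff` pass would not.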
[03:48:10] 6operations, 10Continuous-Integration: Could not find class role::ci::website::labs on integration puppetmaster - https://phabricator.wikimedia.org/T91613#1091130 (10Dzahn) 3NEW [03:48:50] 6operations, 10Continuous-Integration: Could not find class role::ci::website::labs on integration puppetmaster - https://phabricator.wikimedia.org/T91613#1091140 (10Dzahn) p:5Triage>3Normal [03:50:22] 6operations, 10Continuous-Integration, 6Labs: Could not find class role::ci::website::labs on integration puppetmaster - https://phabricator.wikimedia.org/T91613#1091130 (10Dzahn) [03:51:46] 6operations, 10Continuous-Integration, 5Patch-For-Review: invalid byte sequence in US-ASCII - puppet issues with UTF-8 - https://phabricator.wikimedia.org/T91453#1091143 (10Dzahn) p:5Triage>3Normal [03:52:28] PROBLEM - salt-minion processes on rhenium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [03:53:26] 6operations, 10Continuous-Integration, 5Patch-For-Review: invalid byte sequence in US-ASCII - puppet issues with UTF-8 - https://phabricator.wikimedia.org/T91453#1084541 (10Dzahn) >>! In T91453#1089996, @Krinkle wrote: > In the last run, only the following remained: let's fix that over here T91613 because i... [03:53:32] 6operations: Create wikimania2016 wiki (update interwiki) - https://phabricator.wikimedia.org/T85374#1091147 (10Glaisher) Someone (@Krenair?) needs to run [[ https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/scap/files/updateinterwikicache | updateinterwikicache ]]. [03:58:55] 6operations, 10Continuous-Integration, 5Patch-For-Review: invalid byte sequence in US-ASCII - puppet issues with UTF-8 - https://phabricator.wikimedia.org/T91453#1091151 (10Dzahn) 5Open>3Resolved I'll say it's resolved because the few non-ASCII chars we had in .pp files have been removed and in .erb file... [04:09:18] 6operations: pybal issue? 
- https://phabricator.wikimedia.org/T90839#1091160 (10Dzahn) Joe has said to check the number of open files by pybal on LVS hosts because he has seen a "too many open files" warning on lvs1003 before. I checked for that on lvs1001, and then lvs1002 because that's the one that has the O... [04:11:09] PROBLEM - configured eth on rhenium is CRITICAL: Connection refused by host [04:11:19] PROBLEM - RAID on rhenium is CRITICAL: Connection refused by host [04:11:19] PROBLEM - dhclient process on rhenium is CRITICAL: Connection refused by host [04:11:47] PROBLEM - puppet last run on rhenium is CRITICAL: Connection refused by host [04:11:48] PROBLEM - Disk space on rhenium is CRITICAL: Connection refused by host [04:12:08] PROBLEM - DPKG on rhenium is CRITICAL: Connection refused by host [04:14:37] RECOVERY - salt-minion processes on rhenium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [04:14:37] RECOVERY - dhclient process on rhenium is OK: PROCS OK: 0 processes with command name dhclient [04:14:37] RECOVERY - RAID on rhenium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [04:14:45] !log started nagios-nrpe on rhenium [04:14:53] Logged the message, Master [04:14:59] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [04:15:07] RECOVERY - Disk space on rhenium is OK: DISK OK [04:15:28] RECOVERY - DPKG on rhenium is OK: All packages OK [04:15:28] RECOVERY - configured eth on rhenium is OK: NRPE: Unable to read output [04:16:37] "unable to read output" is a kind of recovery :p [04:16:48] :) [04:40:24] 6operations, 7Graphite: scale statsd reporting/aggregation (plan) - https://phabricator.wikimedia.org/T89857#1091200 (10GWicke) [04:40:26] 6operations, 10RESTBase: Investigate apparent restbase request rate under-reporting in graphite: statsd issue? - https://phabricator.wikimedia.org/T89846#1091199 (10GWicke) [04:44:41] 6operations: pybal issue? 
- https://phabricator.wikimedia.org/T90839#1091201 (10Dzahn) | | lvs1001 | lvs1002 | lvs1003 | lvs1004 | lsof (all) | 191 | 140 | 407 | 185 | lsof (regular files) | 28 | 28 | 29 | 29 | lsof (TCP connects) | 153 | 102 | 312 | 146 | number of links | 13 | 13 | 13 | 13 | | number of L... [04:52:08] RECOVERY - HHVM queue size on mw1184 is OK: OK: Less than 30.00% above the threshold [10.0] [04:52:57] RECOVERY - HHVM busy threads on mw1184 is OK: OK: Less than 30.00% above the threshold [76.8] [05:08:22] 6operations, 10Continuous-Integration, 6Labs: Could not find class role::ci::website::labs on integration puppetmaster - https://phabricator.wikimedia.org/T91613#1091217 (10Dzahn) role::ci::website exists in production, role::ci::website::**labs** does not. That's the class setting up doc.wm and integration... [05:12:16] (03CR) 10Dzahn: [C: 032] noc apache,adjust old path to this file in comment [puppet] - 10https://gerrit.wikimedia.org/r/194249 (owner: 10Dzahn) [05:47:13] (03PS1) 10Dzahn: ensure there is always a newline in chained certs [puppet] - 10https://gerrit.wikimedia.org/r/194455 (https://phabricator.wikimedia.org/T84543) [05:58:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR [05:59:17] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0 [06:28:28] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: puppet fail [06:28:37] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:48] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:58] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:19] PROBLEM - puppet last run on mc2011 is CRITICAL: CRITICAL: puppet fail [06:29:28] 
PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:27] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:47] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:58] 6operations, 6Services: Don't configure a cassandra node as its own seed - https://phabricator.wikimedia.org/T91617#1091263 (10GWicke) 3NEW [06:38:41] 6operations, 6Services: Don't configure a cassandra node as its own seed - https://phabricator.wikimedia.org/T91617#1091270 (10GWicke) [06:45:38] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:46:08] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:46:19] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:46:58] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:47:08] RECOVERY - puppet last run on mc2011 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:47:18] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:48:27] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [07:04:15] awww [07:04:19] puppet o'clock [07:04:24] * YuviPanda awaits _joe_ [07:04:34] so I wonder if logrotate can sighup puppet or something [07:05:02] <_joe_> YuviPanda: hi [07:05:09] :D [07:35:42] (03CR) 10John F. Lewis: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/194246 (https://phabricator.wikimedia.org/T90837) (owner: 10Dzahn) [07:37:20] (03CR) 10John F. 
Lewis: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/194248 (https://phabricator.wikimedia.org/T90837) (owner: 10Dzahn) [07:38:42] (03CR) 10John F. Lewis: [C: 031] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/194247 (https://phabricator.wikimedia.org/T90837) (owner: 10Dzahn) [07:59:22] (03PS4) 10Yuvipanda: Tools: Deploy root web automatically [puppet] - 10https://gerrit.wikimedia.org/r/148172 (owner: 10Tim Landscheidt) [08:03:40] (03CR) 10Yuvipanda: [C: 032] "Let's see how this goes!" [puppet] - 10https://gerrit.wikimedia.org/r/148172 (owner: 10Tim Landscheidt) [08:27:27] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Puppet last ran 12 days ago [08:28:28] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [08:58:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR [08:58:56] <_joe_> mmmh is there any planned maintenance? [09:08:16] has something changed in the mediawiki train deploy process? I can't find the wmf/1.25wmf20 branch in any deployed extension? [09:09:33] Reedy: ^ :) [09:26:15] (03CR) 10Qgil: "This seems uncontroversial. Can someone please merge?" 
[puppet] - 10https://gerrit.wikimedia.org/r/194126 (https://phabricator.wikimedia.org/T545) (owner: 10Aklapper) [09:28:22] 10Ops-Access-Requests, 6operations: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1091624 (10Nikerabbit) 3NEW [09:40:43] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Put clearer Login/Register instructions on the Phabricator login page [puppet] - 10https://gerrit.wikimedia.org/r/194126 (https://phabricator.wikimedia.org/T545) (owner: 10Aklapper) [09:42:48] PROBLEM - RAID on mw2001 is CRITICAL: Connection refused by host [09:43:09] PROBLEM - configured eth on mw2001 is CRITICAL: Connection refused by host [09:43:19] PROBLEM - dhclient process on mw2001 is CRITICAL: Connection refused by host [09:43:28] PROBLEM - mediawiki-installation DSH group on mw2001 is CRITICAL: Host mw2001 is not in mediawiki-installation dsh group [09:43:38] PROBLEM - nutcracker port on mw2001 is CRITICAL: Connection refused by host [09:43:58] PROBLEM - nutcracker process on mw2001 is CRITICAL: Connection refused by host [09:44:09] PROBLEM - puppet last run on mw2001 is CRITICAL: Connection refused by host [09:44:19] PROBLEM - salt-minion processes on mw2001 is CRITICAL: Connection refused by host [09:44:39] PROBLEM - DPKG on mw2001 is CRITICAL: Connection refused by host [09:44:48] PROBLEM - Disk space on mw2001 is CRITICAL: Connection refused by host [09:50:10] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0 [09:53:40] RECOVERY - RAID on mw2001 is OK: OK: no RAID installed [09:54:08] RECOVERY - configured eth on mw2001 is OK: NRPE: Unable to read output [09:54:09] RECOVERY - salt-minion processes on mw2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:54:18] RECOVERY - dhclient process on mw2001 is OK: PROCS OK: 0 processes with command name dhclient [09:54:29] RECOVERY - DPKG on mw2001 is OK: All 
packages OK [09:54:38] RECOVERY - Disk space on mw2001 is OK: DISK OK [09:56:09] PROBLEM - puppet last run on mw2001 is CRITICAL: CRITICAL: Puppet has 3 failures [09:58:33] (03PS1) 10Alexandros Kosiaris: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 [09:59:25] (03CR) 10jenkins-bot: [V: 04-1] Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 (owner: 10Alexandros Kosiaris) [10:02:37] (03PS2) 10Alexandros Kosiaris: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 [10:05:14] akosiaris: hello! We can also get deb packages built in jenkins :) [10:05:49] I have created a job template based on http://jenkins-debian-glue.org/, a bunch of shell script wrappers around cowbuilder. Makes it trivial to generate .deb on patch proposals [10:05:59] (03PS1) 10Giuseppe Lavagetto: apache: install mpm files after the package [puppet] - 10https://gerrit.wikimedia.org/r/194472 [10:06:01] (03PS1) 10Giuseppe Lavagetto: mediawiki: correct dependency of php.ini file [puppet] - 10https://gerrit.wikimedia.org/r/194473 [10:06:31] hashar: yeah I know. I've actually been puppetizing my dev environment on labs, hence that ^ [10:07:03] I am packaging Zuul right now :) [10:07:13] my new machine is a Debian/Jessie, that makes it easier [10:07:14] so it is not exactly the same thing, more like get to a point the package gets built correctly in a pristine environment [10:07:25] after that, jenkins sounds fine [10:08:15] akosiaris: an example build for pybal https://integration.wikimedia.org/ci/job/operations-debs-pybal-debian-glue/63/ [10:08:37] heh, lintian for free ? [10:08:43] + piuparts [10:09:05] iirc it even creates the cowbuilder base images automatically [10:09:25] sudo DIST= ARCH=amd64 cowbuilder --update --basepath /var/cache/pbuilder/base-precise-amd64.cow --configfile=/tmp/tmp.unqW7CWCOH [10:09:26] :D [10:09:47] -precise ? [10:09:55] DIST= nothing ? [10:10:00] so running on a precise host ? 
[10:10:03] yeah it ran on Precise labs instance and default to the host distribution [10:10:13] should be running on Trusty now [10:10:31] and we might be able to build the package against different DIST [10:12:25] yeah, it should be easy [10:15:34] creating a Precise image right now https://phabricator.wikimedia.org/P360 [10:18:15] heh /packaging ? [10:18:19] talking about FHS [10:19:12] variant=buildd I see, nice [10:19:42] I have updated the paste https://phabricator.wikimedia.org/P360 [10:19:57] yeah buildd seems to ship a bunch of required packages [10:20:03] else it just failed :/ [10:24:34] akosiaris: just to confirm, we need a debian branch per distribution right? Ie debian-precise / debian-jessie ? [10:26:23] (03PS1) 10Giuseppe Lavagetto: network: add codfw to the appservers network ranges [puppet] - 10https://gerrit.wikimedia.org/r/194484 [10:27:58] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR [10:32:42] (03CR) 10Giuseppe Lavagetto: [C: 032] apache: install mpm files after the package [puppet] - 10https://gerrit.wikimedia.org/r/194472 (owner: 10Giuseppe Lavagetto) [10:35:56] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: correct dependency of php.ini file [puppet] - 10https://gerrit.wikimedia.org/r/194473 (owner: 10Giuseppe Lavagetto) [10:36:53] hashar: hmm I assume you mean git-buildpackage. 
Well it is a sane way forward to have a branch per distro [10:37:05] I guess [10:37:26] IIRC it is the suggested way [10:37:35] (03PS1) 10Giuseppe Lavagetto: Revert "mediawiki: correct dependency of php.ini file" [puppet] - 10https://gerrit.wikimedia.org/r/194485 [10:37:51] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Revert "mediawiki: correct dependency of php.ini file" [puppet] - 10https://gerrit.wikimedia.org/r/194485 (owner: 10Giuseppe Lavagetto) [10:37:58] <_joe_> grrr I hate puppet [10:38:02] any idea how I can add othermirror in an existing cow image ? [10:38:14] hashar: hooks ? [10:38:29] I tend to use hooks a lot [10:38:35] 6operations, 10ops-fundraising, 10Wikimania-Hackathon-2015, 10Wikimedia-Hackathon-2015: overhaul fundraising cluster monitoring - https://phabricator.wikimedia.org/T91508#1091751 (10Qgil) [10:38:41] I created a Precise image without the apt.wm.o mirrors :( [10:38:45] otherwise, I don't think you can [10:39:00] I can still --login in it [10:39:04] edit the source.list [10:39:05] and save :) [10:39:28] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: puppet fail [10:39:37] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think this can go as-is. 
I have to reinstall 250 appservers with trusty this month and I don't want to have to add the default to all of" [puppet] - 10https://gerrit.wikimedia.org/r/194402 (owner: 10BBlack) [10:39:48] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: puppet fail [10:39:49] PROBLEM - puppet last run on mw1050 is CRITICAL: CRITICAL: puppet fail [10:39:58] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: puppet fail [10:39:59] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: puppet fail [10:40:05] hashar: pbuilder login --save-after-login [10:40:05] <_joe_> that's my fault ^^ but it will end soon [10:40:10] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: puppet fail [10:40:18] PROBLEM - puppet last run on mw1085 is CRITICAL: CRITICAL: puppet fail [10:40:19] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: puppet fail [10:40:28] PROBLEM - puppet last run on mw1184 is CRITICAL: CRITICAL: puppet fail [10:40:29] PROBLEM - puppet last run on mw1070 is CRITICAL: CRITICAL: puppet fail [10:40:29] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: puppet fail [10:40:38] * akosiaris ignores icinga-wm [10:40:38] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: puppet fail [10:40:39] (03PS2) 10Giuseppe Lavagetto: network: add codfw to the appservers network ranges [puppet] - 10https://gerrit.wikimedia.org/r/194484 [10:40:48] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: puppet fail [10:40:59] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: puppet fail [10:40:59] PROBLEM - puppet last run on mw1058 is CRITICAL: CRITICAL: puppet fail [10:40:59] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: puppet fail [10:40:59] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: puppet fail [10:41:08] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: puppet fail [10:41:09] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: puppet fail [10:41:29] PROBLEM - puppet last run on 
mw1169 is CRITICAL: CRITICAL: puppet fail [10:41:33] _joe_: there is an error ^ [10:41:42] too much copy paste in that patchset [10:41:42] <_joe_> akosiaris: I already fixed that [10:41:52] <_joe_> akosiaris: where is an error? [10:42:08] https://gerrit.wikimedia.org/r/194484 [10:42:09] RECOVERY - puppet last run on mw1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:42:10] $all_network_subnets['production']['eqiad']['private']['private1-d-codfw']['ipv4'], [10:42:14] eqiad [10:42:15] ? [10:42:29] <_joe_> oh yes I was correcting it, as stated :) [10:42:36] ok [10:43:20] <_joe_> I was still trying to figure out a way not to have any error on first install of an appserver [10:44:32] (PS3) Giuseppe Lavagetto: network: add codfw to the appservers network ranges [puppet] - https://gerrit.wikimedia.org/r/194484 [10:45:32] akosiaris: if you have a moment, please review my two pending patches. [10:45:40] (CR) Giuseppe Lavagetto: [C: 2] network: add codfw to the appservers network ranges [puppet] - https://gerrit.wikimedia.org/r/194484 (owner: Giuseppe Lavagetto) [10:45:46] bah apt.wm.o is commented out on gallium :( [10:45:49] #deb http://apt.wikimedia.org/wikimedia precise-wikimedia main universe [10:45:53] ^--- oop [10:48:49] ah /etc/apt/sources.list.d/wikimedia.list [10:51:29] PROBLEM - puppet last run on mw2001 is CRITICAL: CRITICAL: Puppet has 1 failures [10:51:58] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 64, down: 1, dormant: 0, excluded: 0, unused: 0 xe-0/0/3: down - ! TiNet {#1065} [10:53:55] hmmm? 
[10:54:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] lvs: init.pp lint (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/190689 (owner: 10Matanya) [10:55:13] (03CR) 10Alexandros Kosiaris: [C: 032] base: move selector outside resource block [puppet] - 10https://gerrit.wikimedia.org/r/191589 (owner: 10Matanya) [10:55:20] (03PS4) 10Alexandros Kosiaris: base: move selector outside resource block [puppet] - 10https://gerrit.wikimedia.org/r/191589 (owner: 10Matanya) [10:55:26] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] base: move selector outside resource block [puppet] - 10https://gerrit.wikimedia.org/r/191589 (owner: 10Matanya) [10:57:29] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [10:58:08] RECOVERY - puppet last run on mw1070 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [10:58:08] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:58:09] RECOVERY - puppet last run on mw1095 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [10:58:29] RECOVERY - puppet last run on mw1102 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:58:29] RECOVERY - puppet last run on mw1058 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [10:58:48] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [10:58:58] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:58:59] RECOVERY - puppet last run on mw1085 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [10:59:06] akosiaris: any reason you didn't comment on lines 36 and 53 ? 
[10:59:09] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:59:09] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [10:59:09] RECOVERY - puppet last run on mw1184 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [10:59:18] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:59:29] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:59:48] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [10:59:49] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [10:59:49] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [10:59:58] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:59:58] RECOVERY - puppet last run on tmh1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:00:49] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [11:01:23] (03PS4) 10Matanya: lvs: init.pp lint [puppet] - 10https://gerrit.wikimedia.org/r/190689 [11:01:38] RECOVERY - puppet last run on mw2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [11:03:05] * hashar whistles [11:03:59] (03PS1) 10Alexandros Kosiaris: Stabilize puppet hashes interpolation on yarn-site.xml.erb [puppet/cdh] - 10https://gerrit.wikimedia.org/r/194488 [11:05:18] (03PS1) 10Matanya: ipsec: fqdn is a fact [puppet] - 10https://gerrit.wikimedia.org/r/194489 [11:05:31] matanya: yeah, I missed them [11:05:40] thanks for 
spotting them out [11:05:47] ok, i fixed them too, thanks much for the review [11:09:22] 6operations: Provide dh-virtualenv package on apt.wikimedia.org Precise distribution - https://phabricator.wikimedia.org/T91631#1091791 (10hashar) 3NEW [11:09:26] (03PS3) 10Yuvipanda: beta: Complain if there have been *any* cherry-picks for 48h [puppet] - 10https://gerrit.wikimedia.org/r/193078 (https://phabricator.wikimedia.org/T76392) [11:09:40] (03CR) 10Yuvipanda: [C: 032 V: 032] beta: Complain if there have been *any* cherry-picks for 48h [puppet] - 10https://gerrit.wikimedia.org/r/193078 (https://phabricator.wikimedia.org/T76392) (owner: 10Yuvipanda) [11:09:41] know I understand how easy it is to get a package backported [11:09:54] took me less than 10 minutes to get dh-virtualenv build for Precise [11:10:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] ipsec: fqdn is a fact (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/194489 (owner: 10Matanya) [11:12:11] (03PS1) 10Giuseppe Lavagetto: mediawiki: correct codfw nutcracker servers list [puppet] - 10https://gerrit.wikimedia.org/r/194490 [11:12:29] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: puppet fail [11:13:36] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: correct codfw nutcracker servers list [puppet] - 10https://gerrit.wikimedia.org/r/194490 (owner: 10Giuseppe Lavagetto) [11:16:13] (03PS2) 10Matanya: ipsec: facts in the right scope [puppet] - 10https://gerrit.wikimedia.org/r/194489 [11:17:04] (03CR) 10jenkins-bot: [V: 04-1] ipsec: facts in the right scope [puppet] - 10https://gerrit.wikimedia.org/r/194489 (owner: 10Matanya) [11:18:20] (03PS3) 10Matanya: ipsec: facts in the right scope [puppet] - 10https://gerrit.wikimedia.org/r/194489 [11:18:55] matanya: there was a reason I did not comment on the $site variables [11:19:01] those are not facts [11:19:10] they are badly named, granted but not facts [11:19:11] (03CR) 10jenkins-bot: [V: 04-1] ipsec: facts in the right scope [puppet] - 
10https://gerrit.wikimedia.org/r/194489 (owner: 10Matanya) [11:19:18] ok :) [11:19:19] feel like renaming it ? [11:19:22] yes [11:19:26] thanks! [11:19:27] but not now [11:19:38] i mean, i'll do it [11:19:46] yeah, no worries [11:19:55] but a better fix needs to be done for the long run [11:20:13] actually, that entire $site detection thing is not nice [11:20:21] actually it is a duplicate [11:20:31] why is it badly named? :) [11:20:38] $site vs $::site [11:20:48] it will work but causes confusion [11:20:59] RECOVERY - nutcracker port on mw2001 is OK: TCP OK - 0.000 second response time on port 11212 [11:21:00] oh is there a local-scope $site? [11:21:06] yes [11:21:08] $::site is not a fact either [11:21:10] RECOVERY - nutcracker process on mw2001 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:21:27] yeah, but it is a top scope variable set by realm.pp [11:21:52] and from what I see, that detection is reoccuring in manifests/role/ipsec.pp [11:21:59] sigh [11:22:00] which is ... unfortunate [11:22:03] <_joe_> sigh [11:22:24] * akosiaris laughing just for a change [11:22:38] * matanya is stepping on toes, as usual [11:23:00] <_joe_> matanya: it's not you of course :) [11:23:24] yes, the whole $site thing is annoying [11:24:48] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [11:25:35] akosiaris: would DC work instead of site ? [11:25:55] or location [11:26:09] you mean as a variable name ? yeah it would [11:26:34] yes, ok. 
will use location [11:26:45] actually, don't [11:27:15] I 'd say leave it as is for now, and I 'll rework the entire logic in there [11:27:22] ok [11:27:28] cause it is redoing stuff it should not be doing [11:27:43] so, just the 3 facts change would be lovely [11:27:47] (03PS4) 10Matanya: ipsec: facts in the right scope [puppet] - 10https://gerrit.wikimedia.org/r/194489 [11:28:50] (03CR) 10Alexandros Kosiaris: [C: 032] ipsec: facts in the right scope [puppet] - 10https://gerrit.wikimedia.org/r/194489 (owner: 10Matanya) [11:34:59] (03PS2) 10Yuvipanda: Tools: Install at [puppet] - 10https://gerrit.wikimedia.org/r/191521 (https://phabricator.wikimedia.org/T72324) (owner: 10Tim Landscheidt) [11:35:49] (03CR) 10Yuvipanda: [C: 032 V: 032] Tools: Install at [puppet] - 10https://gerrit.wikimedia.org/r/191521 (https://phabricator.wikimedia.org/T72324) (owner: 10Tim Landscheidt) [11:36:32] (03PS2) 10Yuvipanda: Remove "shell" from monthly Phabricator statistics email. [puppet] - 10https://gerrit.wikimedia.org/r/193852 (owner: 10Aklapper) [11:37:07] (03CR) 10Yuvipanda: [C: 032 V: 032] Remove "shell" from monthly Phabricator statistics email. [puppet] - 10https://gerrit.wikimedia.org/r/193852 (owner: 10Aklapper) [11:37:23] ACKNOWLEDGEMENT - Host restbase1006 is DOWN: PING CRITICAL - Packet loss = 100% alexandros kosiaris awaiting controller [11:38:02] apergos: I added you to https://gerrit.wikimedia.org/r/#/c/190940/ [11:41:05] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Values from mwyaml backend don't override values from ops/pupppet yaml files in hieradata/labs - https://phabricator.wikimedia.org/T90466#1091844 (10yuvipanda) This seems to be fixed now? \o/ [11:47:28] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [11:47:39] sigh [11:48:29] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[11:49:10] (PS3) Yuvipanda: Parameterize roles for labs [puppet] - https://gerrit.wikimedia.org/r/194413 (https://phabricator.wikimedia.org/T90592) (owner: Thcipriani) [11:49:27] (CR) Yuvipanda: [C: 2] "\o/" [puppet] - https://gerrit.wikimedia.org/r/194413 (https://phabricator.wikimedia.org/T90592) (owner: Thcipriani) [11:51:25] operations, Community-Liaison: Please generate a list of task IDs and number of their subscribers, ordered by number of subscribers, for the "top 100" tasks in the VisualEditor project - https://phabricator.wikimedia.org/T90860#1091849 (Elitre) [11:52:27] (PS1) Giuseppe Lavagetto: jobrunner: add ganglia config for codfw [puppet] - https://gerrit.wikimedia.org/r/194494 [11:53:29] (PS1) Alexandros Kosiaris: WIP: Puppet module for the zotero service [puppet] - https://gerrit.wikimedia.org/r/194495 (https://phabricator.wikimedia.org/T89867) [11:53:45] YuviPanda: I'll have a look. also while you are here, remind me where you are testing salt syndic? [11:54:05] apergos: haven’t gotten to it yet. I’ll probably test it on deployment-prep [11:54:09] ah ha [11:54:19] (CR) jenkins-bot: [V: -1] WIP: Puppet module for the zotero service [puppet] - https://gerrit.wikimedia.org/r/194495 (https://phabricator.wikimedia.org/T89867) (owner: Alexandros Kosiaris) [11:54:23] thank good to know [11:54:25] *thanks [11:54:37] thank you jenkins-bot [11:55:19] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 64, down: 1, dormant: 0, excluded: 0, unused: 0 xe-0/0/3: down - ! TiNet {#1065} [11:56:11] apergos: I’ll keep you posted when I get to it. 
My pain point / end goal is toollabs, however :) [11:56:18] ah [12:00:02] (03PS2) 10Alexandros Kosiaris: WIP: Puppet module for the zotero service [puppet] - 10https://gerrit.wikimedia.org/r/194495 (https://phabricator.wikimedia.org/T89867) [12:03:18] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [12:04:30] akosiaris: I could use some building advice https://phabricator.wikimedia.org/P361$130-151 :-D [12:04:54] I can manage to get my local package dh-virtualenv installed via gbp --build [12:05:03] though it works just fine via gbp --login :( [12:06:06] heh? it can't find python ? [12:06:18] dh-vitualenv [12:06:27] I suppose virtualenv not vitualenv [12:06:41] OH MY GOD [12:07:46] akosiaris: thank you very much! [12:07:58] hashar: :-) [12:08:15] I am still wondering why it is not going to install some others [12:14:17] hashar: probably cascading failures of the vitualenv thing [12:14:23] at least I 've seen that before [12:16:41] akosiaris: maybe dh-virtualenv badly interacts with the python-* build deps [12:17:23] * hashar read the doc [12:19:10] (03PS1) 10Matanya: sysctl: move selector outside resource block [puppet] - 10https://gerrit.wikimedia.org/r/194497 [12:25:01] (03PS2) 10Yuvipanda: Include hiera classes in lab instance role [puppet] - 10https://gerrit.wikimedia.org/r/194426 (https://phabricator.wikimedia.org/T90592) (owner: 10Thcipriani) [12:25:33] (03CR) 10Yuvipanda: [C: 032 V: 032] "I quite like this. 
We should have a proper ENC at some point in the future, but for now, this beats futzing around with wikitech for proje" [puppet] - 10https://gerrit.wikimedia.org/r/194426 (https://phabricator.wikimedia.org/T90592) (owner: 10Thcipriani) [12:25:54] (03PS1) 10Matanya: varnish: fix param order [puppet] - 10https://gerrit.wikimedia.org/r/194500 [12:32:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0 [12:39:16] 7Puppet, 6Labs, 5Patch-For-Review: Enable including classes via hiera for labs - https://phabricator.wikimedia.org/T90592#1091895 (10yuvipanda) Note that this is just a stopgap until T85279 is done. [12:45:36] (03CR) 10Alexandros Kosiaris: [C: 032] varnish: fix param order [puppet] - 10https://gerrit.wikimedia.org/r/194500 (owner: 10Matanya) [12:46:59] (03CR) 10Alexandros Kosiaris: [C: 032] sysctl: move selector outside resource block [puppet] - 10https://gerrit.wikimedia.org/r/194497 (owner: 10Matanya) [12:48:39] (03CR) 10Alexandros Kosiaris: [C: 032] lvs: init.pp lint [puppet] - 10https://gerrit.wikimedia.org/r/190689 (owner: 10Matanya) [12:49:40] 7Puppet, 6Labs, 5Patch-For-Review: Enable including classes via hiera for labs - https://phabricator.wikimedia.org/T90592#1091898 (10yuvipanda) Also note that this could be considered a security escalation - someone who compromises wikitech can now compromise all labs instances, *but* if wikitech is compromi... [13:05:40] 6operations: Upgrade salt to 2014.7 (investigating) - https://phabricator.wikimedia.org/T88971#1091912 (10yuvipanda) [13:11:40] (03PS4) 10Alexandros Kosiaris: txstatsd: ensure $init_file attributes [puppet] - 10https://gerrit.wikimedia.org/r/185424 [13:12:06] YuviPanda: why externally blocked https://phabricator.wikimedia.org/T85442 ? 
[13:12:26] apergos: oh, I think I moved it to the wrong column [13:12:29] I thought that was ‘done' [13:12:30] grr [13:12:34] I think it is [13:12:45] yeah, I’ll move it to the appropriate column [13:12:47] thanks for catching [13:12:51] I haven't added them for jessie yet as I didn't know we were running it, but I'll do that on the upgradde anyways [13:13:08] I get incoming of those in my mail and check them [13:13:09] apergos: right. if you’ve added them for precise and trusty I’m good [13:13:27] (03CR) 10Alexandros Kosiaris: [C: 032] txstatsd: ensure $init_file attributes [puppet] - 10https://gerrit.wikimedia.org/r/185424 (owner: 10Alexandros Kosiaris) [13:13:29] still look like I'm on track to salt upgrade in deployment prep today [13:13:31] we'll see [13:14:00] that would mean upgrade across the cluster next week [13:14:23] 6operations, 10Beta-Cluster, 6Labs: Backport new salt-syndic packages - https://phabricator.wikimedia.org/T85442#1091926 (10yuvipanda) 5Open>3Resolved Installed fine, will re-open if it doesn't actually work :) [13:14:42] apergos: wheeeee. so then we’ll get auto accepting keys :D [13:14:55] I dunno about that, if we want them I mean [13:15:50] (03PS1) 10Phuedx: [WikiGrok] Add the "filmProducer" campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194503 [13:16:16] apergos: we would want them on per-project puppetmasters on labs, I guess [13:16:18] at least on some of them [13:16:22] but yeah, maybe not. 
let’s see [13:17:05] (03PS1) 10BBlack: progress on cp3xxx storage config [puppet] - 10https://gerrit.wikimedia.org/r/194504 [13:18:56] (03CR) 10Phuedx: [C: 031] [WikiGrok] Create 'screenwriter' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194378 (owner: 10Bmansurov) [13:19:05] (03CR) 10Phuedx: [C: 031] [WikiGrok] Create 'film director' campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194373 (owner: 10Bmansurov) [13:33:24] 6operations, 10Beta-Cluster: File upload area resorts to 0777 permissions to for uploaded content - https://phabricator.wikimedia.org/T75206#1091974 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Fixed now with all the www-data work. [13:37:09] (03CR) 10Yuvipanda: [C: 04-1] "Yup, this won't actually be of any use. We didn't want to let people run the rm too often, since that could possibly swamp the machine its" [puppet] - 10https://gerrit.wikimedia.org/r/193825 (https://phabricator.wikimedia.org/T87484) (owner: 10Hashar) [13:37:20] (03PS2) 10Yuvipanda: tools: Remove tomcat node definitions from puppet [puppet] - 10https://gerrit.wikimedia.org/r/193561 (https://phabricator.wikimedia.org/T91066) [13:38:27] (03PS1) 10Yuvipanda: beta: Send stats to graphite.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194508 (https://phabricator.wikimedia.org/T75881) [13:38:39] any deployers around? [13:38:49] * YuviPanda eyes MatmaRex [13:38:56] wanna deploy a beta-only change? [13:39:10] or walk me through one? [13:40:33] * Krenair waves [13:40:42] I don't think MatmaRex is a deployer [13:41:31] YuviPanda, um. Are the labs hosts going to be able to connect to labmon1001.eqiad.wmnet? [13:41:41] Krenair: yup [13:41:46] Krenair: we have a hoooleeee in the waaaalll [13:41:49] well, in the firewall [13:42:02] the real address of labsdb is also labsdb1001.eqiad.wmnet, for example [13:42:08] and the NFS server is labstore1001.eqiad.wmnet... 
[13:42:13] so yes, they can, if explicitly allowed [13:42:19] Interesting [13:42:22] Krenair: labmon1001 is also what hosts graphite.wmflabs.org [13:42:26] Anyway let me find the page on config changes... [13:42:50] https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#In_your_own_repo_via_gerrit [13:44:32] YuviPanda, It's quite simple, but I can do it if you want [13:45:03] Krenair: let me do it, and I”ll poke if I run into issues. [13:45:07] ok [13:45:11] Krenair: also, I guess step 1 is ‘merge it’? [13:45:24] Krenair: if so, can I get a +1? [13:45:36] no, read the instructions :) [13:45:51] that's part of step 3 [13:45:51] oh [13:45:52] right [13:45:52] in gerrit: rebase, review, and merge the change [13:45:59] I missed the ‘in gerrit’ part [13:46:10] anyway, on to tin [13:46:29] alright, no pending changes [13:46:40] YuviPanda, if we no longer want to send the UDP profiling data to deployment-fluoride, and to this new host instead, +1 [13:47:15] hmm, let me verify that deployment-fluoride is useless.. [13:48:01] Krenair: yup, it’s useless for statsd [13:48:17] oh wait. [13:48:20] it’s the UDP Profiler host... [13:48:23] not the statsd one... [13:48:29] uh, but it’s being used as the statsd one? [13:48:32] * YuviPanda is confused, verifies again [13:50:26] 6operations, 10ops-esams: Rack ms-be3006 and ms-be3007 - https://phabricator.wikimedia.org/T91637#1091995 (10mark) 3NEW [13:51:09] ori, maybe you can help? [13:51:22] oh, it's like 6am there. never mind [13:51:52] YuviPanda: need help? [13:52:01] what are you deploying? [13:52:06] aude: I’m not sure anymore.. [13:52:11] ah [13:52:12] basically the patch in https://phabricator.wikimedia.org/T75881 [13:52:18] but I’m not sure if the patch is the right thing. 
[13:52:43] hmmm, i don't know enough to say [13:53:01] it's labs only though [13:53:30] fun thing about config changes is that it's normal to submit them yourself, +2 yourself and deploy [13:53:46] heh [13:53:47] yeah [13:53:54] tbf, that’s also true of ops/puppet [13:53:59] :/ [13:54:32] YuviPanda, if it helps, this is the line for prod: [13:54:33] $wgUDPProfilerHost = 'statsd.eqiad.wmnet'; [13:54:37] also [13:54:39] $wgUDPProfilerPort = 8125; [13:54:39] $wgAggregateStatsID = $wgVersion; [13:55:01] so it sounds like $wgUDPProfilerHost is supposed to point to the statsd host :) [13:55:16] Krenair: yup, that’s my impression as well [13:55:26] well, the worst that can happen is that it’s a noop :D [13:55:29] so maybe let’s do it [13:55:40] YuviPanda, could always live hack it on beta, sync the file, see if it works [13:55:55] pfffft [13:55:56] what Krenair says [13:56:02] if it's fine, revert changes on deployment-bastion, merge patch, sync to prod. [13:56:18] otherwise, revert changes and pretend you never touched it :p [13:56:46] ’tis ok, I like leaving permanent records of fucking up [13:56:49] localisation cache busted again? https://phabricator.wikimedia.org/T91638 [13:56:58] * aude wants to take care of https://phabricator.wikimedia.org/T85374 [13:57:03] when you are done [13:57:17] MatmaRex, localisation cache? [13:57:23] MatmaRex, are you thinking of https://phabricator.wikimedia.org/T91638 ? [13:57:30] oh bah [13:57:36] I am mixing up my tabs [13:57:49] aude, I was going to do that [13:57:55] Krenair: hm, ok [13:58:17] Krenair: I’m actually going to just livehack on beta and sync file [13:59:11] (03PS1) 10Alex Monk: Interwiki CDB update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194513 (https://phabricator.wikimedia.org/T85374) [13:59:27] Why is Gerrit throwing more HTTP 500 Internal Server Errors than usual today? 
[13:59:31] aude, ^ [13:59:32] :) [13:59:39] gerrit, no idea [13:59:59] there is no diff to see but assume the change is good [14:00:14] I was trying to get a reasonable diff last night [14:00:21] It was generated from the script, so.. it should be good [14:00:25] yeah [14:00:34] But I'd really prefer to check these things before syncing them to prod :) [14:00:37] and you uploaded to gerrit, which sometimes people forget [14:00:41] we can check on tin [14:00:51] * aude looks [14:02:52] (03CR) 10Alex Monk: [C: 032] Interwiki CDB update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194513 (https://phabricator.wikimedia.org/T85374) (owner: 10Alex Monk) [14:02:54] (03PS1) 10Jgreen: cnames for rkhunter proxy for boron/heka [dns] - 10https://gerrit.wikimedia.org/r/194514 [14:02:58] (03Merged) 10jenkins-bot: Interwiki CDB update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194513 (https://phabricator.wikimedia.org/T85374) (owner: 10Alex Monk) [14:02:59] it's not on tin [14:03:23] $title = Title::newFromText( 'wm2016:User:Aude' ); [14:03:41] should resolve as interwiki [14:04:18] hmm. internal server errors from gerrit [14:04:27] yeah :( [14:04:33] intermittent [14:04:43] aude, fixed, try now :p [14:04:43] Krenair: looks good [14:04:47] ok [14:04:53] 'mInterwiki' => 'wm2016', [14:05:22] !log krenair Synchronized wmf-config/interwiki.cdb: Interwiki cache update (duration: 00m 06s) [14:05:26] :) [14:05:29] Logged the message, Master [14:05:37] (03CR) 10Jgreen: [C: 032 V: 031] cnames for rkhunter proxy for boron/heka [dns] - 10https://gerrit.wikimedia.org/r/194514 (owner: 10Jgreen) [14:05:52] seems fine in prod [14:05:55] certainly working [14:05:57] yay [14:06:31] 6operations, 5Patch-For-Review: Create wikimania2016 wiki (update interwiki) - https://phabricator.wikimedia.org/T85374#1092038 (10Krenair) 5Open>3Resolved Done. 
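The interwiki check discussed above (`wm2016:User:Aude` resolving with `'mInterwiki' => 'wm2016'`) can be pictured with a small sketch. This is only an illustration in Python, not MediaWiki's actual `Title` code: the prefix table and the `split_interwiki` helper are hypothetical, and lowercasing the prefix is an assumption of the sketch.

```python
# Hypothetical prefix table for illustration only; in production the
# known prefixes come from the interwiki cache (wmf-config/interwiki.cdb).
INTERWIKI_PREFIXES = {"wm2016"}


def split_interwiki(title, prefixes=INTERWIKI_PREFIXES):
    """Return (interwiki_prefix, rest_of_title).

    The prefix is '' when the text before the first colon is not a
    known interwiki prefix, in which case the title is left untouched.
    """
    head, sep, rest = title.partition(":")
    candidate = head.strip().lower()
    if sep and candidate in prefixes:
        return candidate, rest
    return "", title


print(split_interwiki("wm2016:User:Aude"))
print(split_interwiki("User:Aude"))
```

With a table that lacks the short prefix, the first call would fall through to the local-title case, which is roughly the situation before the cache update was synced.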
[14:06:48] aude, confusingly, Reedy ran an interwiki cache update for this wiki already [14:07:13] except it was for 'wikimania2016' instead of the shorter 'wm2016' they specifically configure on meta [14:07:16] maybe it wasn't added to the interwikimap [14:07:17] oh [14:08:47] oh, the other thing I noticed was broken earlier [14:08:52] https://phabricator.wikimedia.org/diffusion/OPUP/ [14:08:59] ^d, ^ [14:09:09] git corruption [14:09:19] might be related to the gerrit 500s? Jeff_Green ^ [14:11:23] huh [14:11:35] YuviPanda, how's that beta profiling thing? [14:11:56] Krenair: I’ve a feeling it was a noop before and is a noop now. [14:12:04] tcpdumping to see if anything is being sent at all [14:12:34] fun [14:13:16] nothing so far [14:13:24] not justto labmon but to *any* statsd [14:13:34] I’ve a feeling logging-labs.php is very, very outdated... [14:13:46] 6operations: Provide dh-virtualenv 0.8 package on apt.wikimedia.org Precise distribution - https://phabricator.wikimedia.org/T91631#1092053 (10hashar) [14:14:35] YuviPanda, comparing it to logging.php, probably [14:14:37] MatmaRex, I ran var_dump( wfMessage( 'logentry-block-block' )->text() ); on tin and it took quite a long time to complete first time... and returned the english version [14:14:55] Krenair: yeah. also prod has statsd set in CommonSettings.php [14:15:20] there’s a realm guard there [14:15:29] so it’s ok.. [14:15:46] Krenair: scap is sending data to statsd [14:15:47] .t...!JBscap.rsync_common:5273|ms [14:15:53] but nothing else is going through [14:16:01] I’m going to revert, sync, and abandon patch [14:16:07] ok [14:17:00] Krenair: oh wait, your cdb update overwrote my local patch... [14:17:04] maybe that’s why? 
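The payload spotted in the tcpdump output above, `scap.rsync_common:5273|ms`, follows the plain-text statsd line format `<name>:<value>|<type>`, where `ms` marks a timing in milliseconds. A minimal Python sketch of reading such a line follows; the `parse_statsd_line` helper is hypothetical and written for this note, not taken from scap or MediaWiki.

```python
def parse_statsd_line(line):
    """Split a statsd line 'name:value|type[|@rate]' into its parts.

    Returns (name, value, metric_type). Any optional sample-rate field
    after the type is ignored in this sketch.
    """
    name, sep, rest = line.strip().partition(":")
    fields = rest.split("|")
    if not sep or not name or len(fields) < 2:
        raise ValueError("not a statsd line: %r" % line)
    return name, float(fields[0]), fields[1]


name, value, metric_type = parse_statsd_line("scap.rsync_common:5273|ms")
print(name, value, metric_type)
```

When watching a capture for traffic to a statsd host, anything that does not fit this shape is probably not a metric datagram.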
[14:17:22] let’s merge and deploy and try [14:17:32] (03PS2) 10Yuvipanda: beta: Send stats to graphite.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194508 (https://phabricator.wikimedia.org/T75881) [14:17:38] merging stuff in operations/mediawiki-config kills any live hacks on beta? [14:17:46] looks like [14:17:54] I’m pro that, actually [14:17:54] (03CR) 10Anomie: [C: 04-1] "Needs manual rebase." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193835 (owner: 10KartikMistry) [14:17:57] :| [14:17:59] (03CR) 10Yuvipanda: [C: 032] beta: Send stats to graphite.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194508 (https://phabricator.wikimedia.org/T75881) (owner: 10Yuvipanda) [14:18:06] does jenkins merge on mediawiki-config? [14:18:11] yes [14:18:27] Krenair: is there a project that bug could be sensibly associated to? [14:18:30] (In theory, of course.) [14:18:36] MatmaRex, i18n cache issues? [14:19:02] yeah. i don't want to just slap #operations on it [14:19:07] (03CR) 10Yuvipanda: [V: 032] beta: Send stats to graphite.wmflabs.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194508 (https://phabricator.wikimedia.org/T75881) (owner: 10Yuvipanda) [14:20:17] Krenair: hmm, looks like I can’t deploy as myself because I’m not part of the deployers group? [14:20:24] don't you have root? [14:20:27] I do... [14:20:33] but I’m afraid of just using root here [14:20:36] right [14:20:37] since it’ll probably mess up some permissions [14:20:45] hmm, maybe I sudo to you? :P [14:20:48] lol [14:21:00] are you krenair on the cluster? [14:21:00] 6operations: Provide dh-virtualenv 0.8 package on apt.wikimedia.org Precise distribution - https://phabricator.wikimedia.org/T91631#1092064 (10hashar) I originally used v0.7, I need 0.8. Got it from Debian experimental. [14:21:05] yes [14:21:05] and are you ok with me doing that? :) [14:21:12] Try not to break anything in my name, please. 
[14:21:15] alright [14:21:43] interesting [14:21:51] Permission denied (publickey). [14:21:51] fatal: The remote end hung up unexpectedly [14:22:03] I was under the impression I didn’t need to have key forwarding for this. [14:22:26] I’m no longer you. [14:22:43] Krenair: ^ [14:23:01] Krenair: can you just sync for me this time:? [14:23:05] okay [14:23:53] (PS6) KartikMistry: CX: Publish translations to the Main namespace by default [mediawiki-config] - https://gerrit.wikimedia.org/r/193835 [14:24:13] !log krenair Synchronized wmf-config/logging-labs.php: https://gerrit.wikimedia.org/r/#/c/194508/ (duration: 00m 07s) [14:24:19] YuviPanda [14:24:19] Logged the message, Master [14:24:25] thanks [14:25:10] (CR) coren: Labs: Puppetize labstore1003 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/194395 (owner: coren) [14:25:44] Krenair: alright, testing on beta now [14:25:57] (PS6) coren: Labs: Puppetize labstore1003 [puppet] - https://gerrit.wikimedia.org/r/194395 [14:27:24] MatmaRex, ok so [14:27:56] alex@alex-laptop:~/Development/MediaWiki (master)$ grep logentry-block-block languages/i18n/pl.json [14:27:56] alex@alex-laptop:~/Development/MediaWiki (master)$ [14:28:16] It's not translated into that language. [14:28:22] Krenair: btw, https://phabricator.wikimedia.org/T71269 happened. everyone in deployment-prep has sudo now (including jenkins, heh) [14:28:28] operations, ops-esams: Rack ms-be3006 and ms-be3007 - https://phabricator.wikimedia.org/T91637#1092076 (coren) p:Triage>Normal [14:28:33] YuviPanda, everyone? all members? [14:28:39] yeah, all members [14:28:43] operations, ops-esams: Rack ms-be3006 and ms-be3007 - https://phabricator.wikimedia.org/T91637#1092078 (coren) a:Cmjohnson [14:28:50] ok.. [14:29:10] Krenair: oh. hmm. like, actually not translated? 
[14:29:17] MatmaRex, not as far as I can tell [14:29:52] https://translatewiki.net/w/i.php?title=MediaWiki:Logentry-block-block/pl [14:29:59] !gerrit c891ff00ed86cfbf1c81dbbfae943948721945f1 [14:29:59] https://gerrit.wikimedia.org/ [14:30:03] bloody [14:30:12] gj wm-bot [14:30:17] 10Ops-Access-Requests, 6operations: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1092083 (10coren) p:5Triage>3Normal This needs approval language from the manager, please. [14:30:19] https://gerrit.wikimedia.org/r/#/c/152003/ [14:30:26] okay, that's recently merged… duh [14:30:33] Krenair: right, so that patch is a noop [14:30:34] oh well [14:30:43] Krenair: want to close the bug and trout me? thanks [14:31:54] MatmaRex, leaving it open for the i18n team to comment [14:32:25] 6operations, 6Services: Don't configure a cassandra node as its own seed - https://phabricator.wikimedia.org/T91617#1092098 (10coren) p:5Triage>3Normal [14:32:30] not much to comment on, really [14:32:47] it could probably have been implemented with a fallback to the old, translated messages, but… meh [14:33:32] (03PS2) 10BBlack: progress on cp3xxx storage config [puppet] - 10https://gerrit.wikimedia.org/r/194504 [14:38:32] (03PS1) 10Yuvipanda: wdq-mm: Stop monit [puppet] - 10https://gerrit.wikimedia.org/r/194517 [14:38:42] (03PS2) 10Yuvipanda: wdq-mm: Stop monit [puppet] - 10https://gerrit.wikimedia.org/r/194517 [14:38:52] (03CR) 10Yuvipanda: [C: 032 V: 032] wdq-mm: Stop monit [puppet] - 10https://gerrit.wikimedia.org/r/194517 (owner: 10Yuvipanda) [14:39:18] 6operations, 10ops-esams: Rack ms-be3006 and ms-be3007 - https://phabricator.wikimedia.org/T91637#1092104 (10Cmjohnson) a:5Cmjohnson>3mark Reassigning to Mark for setup. 
[14:41:47] (03PS3) 10BBlack: progress on cp3xxx storage config [puppet] - 10https://gerrit.wikimedia.org/r/194504 [14:42:03] (03CR) 10BBlack: [C: 032 V: 032] progress on cp3xxx storage config [puppet] - 10https://gerrit.wikimedia.org/r/194504 (owner: 10BBlack) [14:49:12] (03PS1) 10BBlack: cp3014 -> jessie, s/mobile/upload/ [puppet] - 10https://gerrit.wikimedia.org/r/194519 [14:49:51] (03CR) 10BBlack: [C: 032 V: 032] cp3014 -> jessie, s/mobile/upload/ [puppet] - 10https://gerrit.wikimedia.org/r/194519 (owner: 10BBlack) [14:50:06] (03PS1) 10KartikMistry: CX: Add Kyrgyz (ky) and Punjabi( pa) in target [puppet] - 10https://gerrit.wikimedia.org/r/194522 [14:50:31] !log depooled cp3014 in esams pybal [14:50:38] Logged the message, Master [14:54:26] <^d> Krenair: wtf did you do to OPUP? [14:55:04] I haven't touched OPUP, apart from trying to view some files I was linked to there [14:56:10] * ^d gives Krenair all the blame! [14:56:59] PROBLEM - puppet last run on cp3014 is CRITICAL: Timeout while attempting connection [14:57:06] 6operations, 10Continuous-Integration, 3Continuous-Integration-Isolation, 5Patch-For-Review, 7Upstream: [upstream] Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1092134 (10hashar) I am now hitting a wall because dh-virtualenv pip doesn't have network access: ``` I: dh_virtu... [14:58:32] <^d> Krenair: I really dunno what's up without shell access to the actual repo itself [14:58:48] You don't have that? [14:58:55] hm [14:58:59] <^d> I don't have shell to Phab [14:59:34] Unable to Retrieve Paths ? [14:59:47] huh? [14:59:50] corrupt ? [14:59:56] bblack: https://phabricator.wikimedia.org/diffusion/OPUP/ [15:00:01] <^d> It's bitching about 6777169a016abedd803050b7a629a69aa148d5f4 being corrupt [15:00:06] yeah [15:00:29] * YuviPanda waits patiently for wdq-mm to segfault [15:01:00] hmm.. 
bare repo [15:01:20] 6operations, 10ops-esams: Rack QFX5100 switches - https://phabricator.wikimedia.org/T91643#1092139 (10mark) 3NEW a:3mark [15:01:21] <^d> Yeah, phab (and gerrit) stores repos as bare. No need for a working directory :) [15:01:22] 6777169a016abedd803050b7a629a69aa148d5f4 is my last commit [15:01:29] (the cp3014 -> jessie one) [15:01:52] <^d> Maybe just drop the corrupted object and wait for it to fetch again? [15:01:58] with the git commandline, the commit is fine and pulls back to me matching [15:02:06] so I think this is phab-specific? [15:02:12] same md5sum [15:02:17] on my laptop and phab [15:02:19] <^d> Yeah, if it's good on disk [15:02:19] <^d> Ugh [15:02:36] 7dc5b69f2b591be5441791d4773d8b9a 77169a016abedd803050b7a629a69aa148d5f4 [15:02:52] 10Ops-Access-Requests, 6operations: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1092148 (10Arrbee) This is approved. Thanks. [15:03:50] hmm it does have weird permissions though [15:03:58] -r-------- 1 phd phd 221 Mar 5 14:49 77169a016abedd803050b7a629a69aa148d5f4 [15:04:36] oh nice [15:04:42] now it is complaining about another object [15:04:56] I did a chmod go+r on the above one [15:05:44] <^d> hmm, permissions [15:05:57] is this the first time this is happening ? [15:06:11] <^d> First time I'm aware of it [15:06:26] <^d> But we also have a couple hundred repos now and I haven't looked at all of them [15:06:34] find . -perm 0400 -ls | wc -l [15:06:34] 121 [15:06:36] what the... [15:07:11] all of them created today, up to 5 hours in the past [15:07:49] 4:30 is more like it [15:08:02] ls -la [15:08:09] wrong window obviously [15:08:56] akosiaris: hola [15:09:00] :) [15:09:10] kart_: hey [15:12:09] ^d: Krenair, so multiple repositories have this behaviour. operations/puppet, operations/dns, mediawiki/extensions/ContentTranslation and the list goes on [15:12:19] <^d> Well shit. [15:12:24] akosiaris: will you be here in next 1 hourish? 
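The `find . -perm 0400` check and the `chmod go+r` fix mentioned above can be sketched in Python. This is a minimal illustration, not what was run on the phab host: the function names are hypothetical, and it assumes (not confirmed in the log) that the objects are requested read-only (0444) at creation and that a restrictive umask such as 077 strips the group/other bits, leaving 0400.

```python
import os
import stat


def create_with_umask(path, umask, data=b""):
    """Create a file requesting mode 0444 under the given umask (hypothetical helper)."""
    old = os.umask(umask)
    try:
        # The kernel applies the umask: 0444 & ~0o077 == 0400, 0444 & ~0o022 == 0444.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o444)
    finally:
        os.umask(old)  # always restore the previous umask
    with os.fdopen(fd, "wb") as fh:
        fh.write(data)


def find_owner_only(root):
    """Yield regular files under root whose permission bits are exactly 0400."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and stat.S_IMODE(st.st_mode) == 0o400:
                yield path


def add_group_other_read(path):
    """Equivalent of `chmod go+r`: add read for group and others, keep other bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IRGRP | stat.S_IROTH)
```

Running `find_owner_only` before any change gives the same kind of count as the `wc -l` above; applying `add_group_other_read` to each result only repairs existing files, so whatever process creates new ones with the wrong umask still has to be fixed separately.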
[15:12:31] kart_: yes [15:12:37] 7Blocked-on-Operations, 6operations, 10Citoid, 6Scrum-of-Scrums, 6Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1092170 (10Mvolz) @akosiaris - ouch. Is there any control there/ did you test different times for the response from the server the url is being requested... [15:12:37] akosiaris: thanks! [15:12:41] was there a related puppet change recently for the phab host? [15:12:54] (that might have spread a familiar umask problem around?) [15:14:07] bblack: https://gerrit.wikimedia.org/r/#/c/194126/ [15:14:32] that is the only one that roughly coincides with the time the problem started [15:14:36] but it makes no sense to me [15:15:54] !log depool cp3018 in esams pybal [15:16:02] Logged the message, Master [15:16:07] bblack: ok I take it back... there are repos with problems since Feb 11 [15:16:56] yeah so that's probably the standard umask issue. we just have to find where to fix it. [15:17:23] somewhere, some puppet "exec" stanzas need "umask => 022", and/or file/directory definitions need explicit modes where they lack them now [15:17:41] (03PS1) 10BBlack: cp3018 -> jessie, s/upload/mobile/ [puppet] - 10https://gerrit.wikimedia.org/r/194528 [15:18:06] (03CR) 10BBlack: [C: 032 V: 032] cp3018 -> jessie, s/upload/mobile/ [puppet] - 10https://gerrit.wikimedia.org/r/194528 (owner: 10BBlack) [15:22:41] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:51] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 1 failures [15:24:41] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59704 bytes in 0.224 second response time [15:25:01] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [15:25:21] PROBLEM - Host cp3018 is DOWN: PING CRITICAL - Packet loss = 100% [15:27:04] 6operations, 10Citoid: Update the citoid/deploy branch to not contain zotero deploy - 
https://phabricator.wikimedia.org/T89872#1092204 (10Mvolz) @akosiaris is it safe to assume from your comment on T89866 this task should be "remove xulrunner?" instead of remove zotero? [15:27:06] (03CR) 10Anomie: "Why'd you lose the labs part?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193835 (owner: 10KartikMistry) [15:28:51] 6operations, 10Citoid: Update the citoid/deploy branch to not contain zotero deploy - https://phabricator.wikimedia.org/T89872#1092206 (10akosiaris) No, I actually meant remote the zotero parts. I already got a patch ready, will upload it soon [15:30:31] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:30:31] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:31:10] RECOVERY - Host cp3018 is UP: PING WARNING - Packet loss = 37%, RTA = 89.79 ms [15:31:17] !log restarted phd (phabricator daemon) on iridium [15:31:23] Logged the message, Master [15:34:22] 6operations, 6Phabricator: Unhandled Exception ("CommandException") in diffusion - https://phabricator.wikimedia.org/T91648#1092235 (10demon) Known, we spotted on IRC a bit ago. Something's up with umasks. [15:34:46] <^d> akosiaris: Nemo_bis filed ^ for our diffusion umask issue, added [operations] [15:35:56] 7Blocked-on-Operations, 6operations, 10Citoid, 6Scrum-of-Scrums, 6Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1092238 (10akosiaris) @Mvolz, so these are on my DSL line so take them with a grain of salt. Also they are for the exact same content so they may not be... [15:37:26] ^d: yeah, I restarted phd to rule out some weird state in it (or to confirm it) [15:38:40] akosiaris, did you see zotero use significant CPU in your test, or was it just waiting for the external site? 
[15:39:01] gwicke: probably the later [15:39:30] k [15:40:27] the most effective way to improve that latency would be to store the results ;) [15:41:22] I think zotero supports that [15:41:27] well at least the firefox extension does [15:41:43] but we are assuming the same thing is being queried over and over [15:41:49] yeah, we also have this rest api / storage service that could do it [15:42:18] yeah, what is it called ? I dont remember :P [15:42:33] ;) [15:42:51] PROBLEM - salt-minion processes on cp3014 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:43:08] 10Ops-Access-Requests, 6operations: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1092253 (10Dzahn) It's a bit confusing that production access is needed to manage something on a wmflabs.org URL. Isn't that a labs instace? [15:43:23] (03PS1) 10Alexandros Kosiaris: Minor lint [puppet] - 10https://gerrit.wikimedia.org/r/194534 [15:43:53] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Minor lint [puppet] - 10https://gerrit.wikimedia.org/r/194534 (owner: 10Alexandros Kosiaris) [15:45:45] ^d: Krenair bblack: So I restarted phd and the new objects have correct permissions [15:46:03] which means we got the culprit [15:46:10] RECOVERY - salt-minion processes on cp3014 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:46:16] <^d> Command failed with error #128! COMMAND git log --skip=0 -n 15 --pretty=format:'%H:%P' '55d0cc8302c49104e59da9b8b355a4ca145b49b0' -- STDOUT (empty) STDERR fatal: loose object 558e179ed1643dc9b8c33a7fd3b5001d0ef21dba (stored in ./objects/55/8e179ed1643dc9b8c33a7fd3b5001d0ef21dba) is corrupt [15:46:17] <^d> Still [15:46:22] <^d> But mostly better [15:46:30] yeah, I need to set permissions for old objects [15:46:34] <^d> Ah ok [15:46:55] but I am happy we got the culrpit. 
Now it's in the hand of the phab team ;-) [15:48:10] 10Ops-Access-Requests, 6operations: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1092260 (10coren) @dzahn stat1003 is real metal in prod. [15:50:23] kart_: Ping for SWAT in 10 minutes. [15:50:43] manybubbles, ^d, thcipriani, marktraceur, Krenair: Who wants SWAT this morning? [15:51:14] I can do if on one wants it [15:51:21] kart_: around for swat? [15:51:30] <^d> I can [15:52:26] ^d: if you _want_ to, be my guest. I can do it too [15:52:29] <^d> Actually, scratch that. I want to finish figuring out my packet loss [15:52:37] Whoever SWATs, I note the current situation is that 193835 merge-conflicts with the other two, and the rebase in PS6 lost half the patch that's still mentioned in the summary :/ [15:52:54] (03PS1) 10coren: labstore1002 to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/194537 (https://phabricator.wikimedia.org/T85604) [15:54:13] 7Blocked-on-Operations, 6operations, 10Citoid, 6Scrum-of-Scrums, 6Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1092276 (10Mvolz) You might try varying the sessionid every request and see if that helps. (I have experienced some issues which I think may be due to ha... [15:54:36] (03PS2) 10coren: labstore1002 to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/194537 (https://phabricator.wikimedia.org/T91640) [15:56:10] (03PS2) 10Giuseppe Lavagetto: jobrunner: add ganglia config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/194494 [15:56:14] anomie: back now. [15:56:21] need coffee. sigh. 
[15:56:52] (03PS1) 10BBlack: fix cp30xx storage sizes [puppet] - 10https://gerrit.wikimedia.org/r/194538 [15:56:54] (03PS1) 10BBlack: repool cp301[48] backends [puppet] - 10https://gerrit.wikimedia.org/r/194539 [15:56:56] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: add ganglia config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/194494 (owner: 10Giuseppe Lavagetto) [15:57:16] (03CR) 10Giuseppe Lavagetto: [V: 032] jobrunner: add ganglia config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/194494 (owner: 10Giuseppe Lavagetto) [15:57:25] (03CR) 10BBlack: [C: 032 V: 032] fix cp30xx storage sizes [puppet] - 10https://gerrit.wikimedia.org/r/194538 (owner: 10BBlack) [15:57:45] 6operations, 6Phabricator: Unhandled Exception ("CommandException") in diffusion - https://phabricator.wikimedia.org/T91648#1092284 (10akosiaris) I am the one who responded to this. After some quick checks like verifying the md5sum of the above object I noticed the object had permissions 0400 and readable only... [15:58:03] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Values from mwyaml backend don't override values from ops/pupppet yaml files in hieradata/labs - https://phabricator.wikimedia.org/T90466#1092286 (10thcipriani) 5Open>3Resolved [15:58:21] bah, I lost the race there [15:58:43] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Values from mwyaml backend don't override values from ops/pupppet yaml files in hieradata/labs - https://phabricator.wikimedia.org/T90466#1059622 (10thcipriani) @yuvipanda: yup—fixed this https://gerrit.wikimedia.org/r/#/c/193165/ [15:59:04] bblack: Want to give a quick nod to https://gerrit.wikimedia.org/r/194537 ? I don't like to self+2. :-) [16:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150305T1600). [16:00:22] anomie and kart_: fun fun. 
Can you fix up the patches and let me know when they are ready? are any ready right now [16:00:23] ? [16:00:56] manybubbles: just submitted. [16:00:58] (03CR) 10BBlack: [C: 031] labstore1002 to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/194537 (https://phabricator.wikimedia.org/T91640) (owner: 10coren) [16:01:31] (03PS7) 10KartikMistry: CX: Publish translations to the Main namespace by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193835 [16:01:33] manybubbles: give me a minute more. [16:01:43] manybubbles: we can go with 193835 [16:02:32] (03PS2) 10BBlack: repool cp301[48] backends [puppet] - 10https://gerrit.wikimedia.org/r/194539 [16:02:39] (03CR) 10BBlack: [C: 032 V: 032] repool cp301[48] backends [puppet] - 10https://gerrit.wikimedia.org/r/194539 (owner: 10BBlack) [16:02:47] (03CR) 10Manybubbles: [C: 032] CX: Publish translations to the Main namespace by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193835 (owner: 10KartikMistry) [16:03:25] untracked files on tin. is it known? [16:03:26] !log repooled cp301[48] in pybal [16:03:30] Logged the message, Master [16:04:20] Is Mr. Jenkins stuck somewhere? [16:04:43] bblack: Thank you. [16:05:03] looks stucked [16:05:09] * Coren "patiently" waits for jenkins. [16:05:40] around 4am zuul stopped running things [16:05:48] ETA: 0 min, queued 1 hr 58 min ago [16:05:50] I don't know *who's* 4am [16:05:52] https://integration.wikimedia.org/zuul/ [16:06:03] 6operations, 7HTTPS, 3HTTPS-by-default: Point rel=canonical to HTTPS for all Russian Wikimedia projects - https://phabricator.wikimedia.org/T90527#1092313 (10Nemo_bis) [16:06:14] who do we have kick zuul? [16:06:15] who can restart it? [16:07:12] lets see if I can [16:07:42] no: gallium.wikimedia.org [16:07:51] sorry, no: Permission denied (publickey). [16:08:10] I don't have permissions to log in to gallium. [16:08:16] ^d: do you have permissions for ^^^^ [16:08:47] <^d> is it zuul or gearman? 
[16:09:00] dunno [16:09:06] zuul isn't picking up any new jobs [16:09:07] <^d> usually the latter [16:09:09] they don't show up in its list [16:09:09] oh, ya'll are talking about it over here too [16:09:16] its stucked [16:09:19] please unsticked it [16:09:27] https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL <-- what I tried re zuul/jenkins [16:09:55] mw1094 is logging lots of errors now [16:10:07] ap_proxy_connect_backend disabling worker for (127.0.0.1) for 0s [16:10:14] is it sad? [16:10:31] <^d> oh gearman you already did hmm [16:10:34] ^d: I tried disconnecting/reconnecting gearman, no effect [16:10:36] yeah [16:10:50] <^d> I'll kick zuul [16:10:55] yay! [16:10:55] k [16:11:19] hopefully we don't miss many events [16:12:04] do I have to do something to have it re-pick-up events? [16:12:09] <^d> Waiting for jobs to complete.... [16:12:10] <^d> .... [16:12:11] <^d> ...... [16:12:52] manybubbles: yeah, commenting "recheck" on the gerrit change [16:13:04] (03CR) 10Manybubbles: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193835 (owner: 10KartikMistry) [16:13:06] manybubbles: if this is blocking a swat deploy, you can force merge it and jfdi [16:13:10] <^d> zuul's not back [16:13:15] <^d> yeah, just skip jenkins [16:13:17] (03CR) 10Manybubbles: [V: 032] CX: Publish translations to the Main namespace by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193835 (owner: 10KartikMistry) [16:13:17] <^d> and merge [16:13:18] manybubbles: *after* it comes back up :) [16:13:33] <^d> ... waiting for jobs to complete ....................................................................................................................... [16:13:35] <^d> still going [16:13:40] (03CR) 10KartikMistry: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/194522 (owner: 10KartikMistry) [16:14:02] looks like there have been two jobs stuck for a couple hours. Might have to be a bit meaner to the process. 
[16:14:29] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Publish translations to the Main namespace by default (duration: 00m 05s) [16:14:32] greg-g, I don't think we can always force merge? [16:14:35] Logged the message, Master [16:14:42] Krenair: no? [16:14:47] <^d> Stopped and started [16:14:49] 6operations, 6Phabricator: Unhandled Exception ("CommandException") in diffusion - https://phabricator.wikimedia.org/T91648#1092330 (10akosiaris) I think I just fixed all of them. Turns out there were also directories created with the wrong permissions during. My commands to fix it were cd /srv/phab/repos s... [16:14:50] is there a way to just kill the stuck jobs? [16:14:53] for deployment branches in some repos [16:15:02] !log manybubbles Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: CX: remove labs customization (duration: 00m 07s) [16:15:07] Logged the message, Master [16:15:07] /zuul/status.json: Service Temporarily Unavailable :( [16:15:12] kart_: ^^^^^ [16:15:21] tada, maybe? [16:15:30] do your rechecks now and see what happens [16:15:32] greg-g: we can always force merge. But its ucouth [16:15:45] (03CR) 10Manybubbles: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193774 (https://phabricator.wikimedia.org/T89635) (owner: 10KartikMistry) [16:15:52] manybubbles: that's what I thought [16:15:53] (03CR) 10Manybubbles: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193775 (https://phabricator.wikimedia.org/T89337) (owner: 10KartikMistry) [16:16:09] gah, still getting "/zuul/status.json: Service Temporarily Unavailable" [16:16:21] greg-g: I used to do it before we had jenkins jobs set up on some of the java stuff [16:16:36] <^d> Starting the service gives us two zuul-server processes [16:16:40] * ^d doesn't know if that's right [16:16:45] <^d> zuul 31721 5.0 0.2 247912 19564 ? 
Sl 16:16 0:00 /usr/bin/python /usr/local/bin/zuul-server -c /etc/zuul/zuul-server.conf [16:16:45] <^d> zuul 31729 40.0 0.2 247912 18684 ? Sl 16:16 0:01 /usr/bin/python /usr/local/bin/zuul-server -c /etc/zuul/zuul-server.conf [16:17:04] hmmm [16:17:08] kart_: are you manually rebasing the last two? [16:17:27] ^d: is one a child of the other? [16:17:49] manybubbles: yes. give a minute. [16:17:49] making a service often does some double fork bullshit - maybe it is that but only half way? [16:17:55] kart_: sure sure [16:18:03] <^d> manybubbles: Ah, yes. [16:18:05] <^d> It is a child [16:18:24] taha, I see jobs on the zuul status page [16:18:37] ^d: its probably normal [16:18:46] i took the road this morning to go onsite i had to turn back home, because the roads were so bad [16:18:54] remember that python has the GIL so subprocess is pretty common [16:19:05] papaul: where? [16:19:10] texas [16:19:12] dallas [16:19:36] and texas drivers are crazy [16:19:49] meanwhile Idaho is having a 20+ year low snow year; the world is upside down [16:20:06] papaul: snow? [16:20:15] bd808: of course! [16:20:16] yes and ice [16:20:27] we're goign to get some more ice tonight. well, maybe starting around 3 [16:20:40] hopefully I won't have lose power and internet again. so much rage. [16:20:52] manybubbles: but it is daylight savings on sunday, right? [16:21:04] so it will be like summer, sort of :) [16:21:55] (03Abandoned) 10KartikMistry: CX: Enable Content Translation in pawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193774 (https://phabricator.wikimedia.org/T89635) (owner: 10KartikMistry) [16:21:56] bah, DST is this weekend [16:22:07] (03PS2) 10KartikMistry: CX: Enable Content Translation in kywiki and pawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193775 [16:22:17] and starts 3 weeks of confusion about meetings, deploy times... 
[16:22:23] manybubbles: merged two into one to make things easy :) [16:22:56] (03CR) 10Manybubbles: [C: 032] CX: Enable Content Translation in kywiki and pawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193775 (owner: 10KartikMistry) [16:23:17] kart_: lets see if jenkins wants it [16:23:23] greg-g: and I have to fly to SF the next day! [16:23:26] wonderful! [16:23:33] Its almost not worth trying to sleep that night [16:23:36] :( [16:24:13] akosiaris: can you merge, https://gerrit.wikimedia.org/r/#/c/194522/ [16:24:35] 6:45 takeoff means arrive aroudn 5:30 means leave the house around 5 would mean wake up around 4:30..... [16:25:02] 6operations, 6Phabricator: Unhandled Exception ("CommandException") in diffusion - https://phabricator.wikimedia.org/T91648#1092356 (10chasemp) Thanks @akosiaris! I have no explanation for this. Updating the json config file does trigger a restart of PHD but I'm having trouble linking up this cause with that... [16:25:20] with one less hour its just a bit silly. [16:25:40] 11 - 4:30 but really 3:30. [16:25:48] good for me. DST. I can attend SoS :) [16:26:00] unless! I get myself sick and fall asleep at 9! [16:26:08] then 9-3:30 is great! [16:26:31] go jenkins go! 
[16:26:50] hmm [16:26:55] not sure it is going to go manybubbles [16:27:00] :( [16:27:04] (03CR) 10Manybubbles: [V: 032] CX: Enable Content Translation in kywiki and pawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/193775 (owner: 10KartikMistry) [16:27:05] I am waiting for like 13 mins already [16:27:21] akosiaris: hours for me :D [16:27:32] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] CX: Add Kyrgyz (ky) and Punjabi( pa) in target [puppet] - 10https://gerrit.wikimedia.org/r/194522 (owner: 10KartikMistry) [16:27:49] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT: CX: Enable Content Translation in kywiki and pawiki (duration: 00m 07s) [16:27:54] Logged the message, Master [16:27:56] kart_: ^^^^^ [16:28:14] manybubbles: thanks [16:29:06] SWAT complete! jouncebot, I give you back the deploying conch. Keep it safe for the next deployer. [16:29:32] PROBLEM - Varnishkafka Delivery Errors per minute on cp3014 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [20000.0] [16:29:35] akosiaris: kart_ it's still going (your job): https://integration.wikimedia.org/ci/job/operations-puppet-doc/12342/console [16:29:43] manybubbles: looks good. atleast can see CX in ky and pa [16:30:33] PROBLEM - Apache HTTP on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:30:41] greg-g: hmm it started kind of late... [16:30:52] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:31:11] akosiaris: no, that "recheck" was sent when zuul was restarting and not taking jobs [16:31:24] it started as soon as it was told to [16:31:36] ok [16:31:46] kart_: cool. 
https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=146983&oldid=146963 [16:31:53] going go go and get some lunch [16:32:54] manybubbles|lunc: \0/ [16:33:27] 6operations: Provide dh-virtualenv 0.8 package on apt.wikimedia.org Precise distribution - https://phabricator.wikimedia.org/T91631#1092418 (10coren) p:5Triage>3Normal [16:33:47] akosiaris: thanks [16:36:05] 6operations, 6Phabricator: Unhandled Exception ("CommandException") in diffusion - https://phabricator.wikimedia.org/T91648#1092429 (10coren) Should this be closed as resolved or renamed to something more indicative of "keeping an eye on it?" [16:38:51] * kart_ off to cxserver deployment as nothing left in SWAT [16:39:43] PROBLEM - HTTPS on cp3014 is CRITICAL: Return code of 110 is out of bounds [16:39:47] 7Puppet, 6operations: removing admin::groups from hiera doesn't revoke permissions - https://phabricator.wikimedia.org/T89961#1092436 (10coren) p:5Triage>3Low [16:40:20] 10Ops-Access-Requests, 6operations: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1092438 (10Ottomata) Ah, k, just talked to Dan and clarified something. These users will need to be in the 'statistics-users' group. That is all. :) [16:40:43] RECOVERY - HTTPS on cp3014 is OK: SSLXNN OK - 36 OK [16:43:30] !log Updated cxserver to 2695a31 [16:43:35] Logged the message, Master [16:44:00] (03PS1) 10coren: Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T91354) [16:44:23] (03CR) 10jenkins-bot: [V: 04-1] Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T91354) (owner: 10coren) [16:46:13] PROBLEM - HTTPS on cp3014 is CRITICAL: Return code of 110 is out of bounds [16:46:30] (03CR) 10Hoo man: "@aude: Will this have an effect on the site group of sourceswiki (in case we repopulate the db)?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T91354) (owner: 10coren) [16:47:32] RECOVERY - HTTPS on cp3014 is OK: SSLXNN OK - 36 OK [16:48:12] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Values from mwyaml backend don't override values from ops/pupppet yaml files in hieradata/labs - https://phabricator.wikimedia.org/T90466#1092523 (10greg) a:3thcipriani [16:48:50] (03CR) 10coren: "As a note, this will also require a patch to the DbListTests::testDatabaseNamesUseProjectNameAsSuffix test because it doesn't know that so" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T91354) (owner: 10coren) [16:52:08] hoo: Do you know where the tests live in gerrit? [16:52:09] (03CR) 10Andrew Bogott: "This is awesomely cryptic! A comment explaining about that echo might be nice :)" [puppet] - 10https://gerrit.wikimedia.org/r/194455 (https://phabricator.wikimedia.org/T84543) (owner: 10Dzahn) [16:52:59] Coren: same repo [16:53:08] tests folder :) [16:53:18] hoo: That's WAY too easy! :-) [16:54:41] (03PS1) 10Giuseppe Lavagetto: mediawiki: arrange mw-related clusters across rows [puppet] - 10https://gerrit.wikimedia.org/r/194551 [16:55:41] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: arrange mw-related clusters across rows [puppet] - 10https://gerrit.wikimedia.org/r/194551 (owner: 10Giuseppe Lavagetto) [16:57:13] PROBLEM - HTTPS on cp3014 is CRITICAL: Return code of 110 is out of bounds [16:57:15] 6operations, 6Phabricator: Unhandled Exception ("CommandException") in diffusion - https://phabricator.wikimedia.org/T91648#1092536 (10chasemp) p:5Triage>3Low a:3chasemp [16:57:42] 6operations, 6Phabricator: Unhandled Exception ("CommandException") in diffusion - https://phabricator.wikimedia.org/T91648#1092538 (10chasemp) 5Open>3stalled >>! In T91648#1092429, @coren wrote: > Should this be closed as resolved or renamed to something more indicative of "keeping an eye on it?" assigne... 
[16:58:21] (03PS2) 10Giuseppe Lavagetto: mediawiki: arrange mw-related clusters across rows [puppet] - 10https://gerrit.wikimedia.org/r/194551 [16:58:33] RECOVERY - HTTPS on cp3014 is OK: SSLXNN OK - 36 OK [16:58:41] 6operations, 6Phabricator: Unhandled Exception ("CommandException") in diffusion - https://phabricator.wikimedia.org/T91648#1092543 (10coren) Cool. :-) [17:00:04] kart_, ^d: Respected human, time to deploy Content Translation/cxserver (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150305T1700). Please do the needful. [17:00:35] ^d: I'm stuck in issues. See: -releng [17:01:37] I still say that jouncebot should say "Human minion, hear and obey: deploy X" :-) [17:02:03] PROBLEM - HTTPS on cp3014 is CRITICAL: Return code of 110 is out of bounds [17:02:34] Coren: we're minions. [17:02:42] Waiting for July 2015 :) [17:03:20] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: arrange mw-related clusters across rows [puppet] - 10https://gerrit.wikimedia.org/r/194551 (owner: 10Giuseppe Lavagetto) [17:03:23] RECOVERY - HTTPS on cp3014 is OK: SSLXNN OK - 36 OK [17:04:19] 6operations, 10ops-esams: Rack QFX5100 switches - https://phabricator.wikimedia.org/T91643#1092568 (10mark) One of the QFXes has been racked at height 40 in OE13 (and can likely remain there). Serial console and management network have been connected, so the switch can be configured. The switch is NOT labeled... [17:04:52] (03PS3) 10Alexandros Kosiaris: Puppet module for the zotero service [puppet] - 10https://gerrit.wikimedia.org/r/194495 (https://phabricator.wikimedia.org/T89867) [17:06:04] (03PS2) 10Giuseppe Lavagetto: codfw: assign IPs to mediawiki appservers in row c [dns] - 10https://gerrit.wikimedia.org/r/194302 [17:09:30] ^d: around? 
[17:09:40] <^d> yes [17:09:42] ^d: git fetch in wmf19 please :) [17:09:52] (issue still not solved :/) [17:09:55] (03PS1) 10Alexandros Kosiaris: Beta: Assign proxy for zotero [puppet] - 10https://gerrit.wikimedia.org/r/194552 [17:10:06] ^d: why can't kart fetch btw ? [17:10:26] <^d> heck if i know [17:10:29] bbiab: bank run [17:10:33] <^d> kart_: done [17:10:48] do you forward your ssh agent into tin ? [17:11:23] which begs the question why those repos have ssh as a remote and not https btw [17:12:26] !log kartik Started scap: Update ContentTranslation [17:12:31] Logged the message, Master [17:12:57] 6operations, 10ops-eqiad, 10RESTBase, 6Services: restbase1006 faulty disk controller - https://phabricator.wikimedia.org/T89639#1092608 (10Cmjohnson) Due to the weather conditions in our area as well as the HP distribution center in Kentucky, the HP field engineer Edwin Robles will be bringing the new con... [17:13:21] ^d: I'm only deploying to wmf19 today as we've issue(s) with wmf20 [17:13:28] twentyafterfour is looking into it. [17:13:38] akosiaris: at least partially hysterical raisins. I think there may be plans afoot to change that. [17:14:07] bd808: that'd be nice [17:15:02] The weekly new branch deploy still does quite a bit of "push patches from tin to gerrit" but that could be done via an alternate remote [17:15:42] PROBLEM - HTTPS on cp3014 is CRITICAL: Return code of 110 is out of bounds [17:15:51] bd808: branches are cut on tin ? [17:16:38] akosiaris: Depends on who is doing the deploy. The things that really get pushed gerrit are more symlinks and other patches that are created there [17:16:52] The process is ... not pretty [17:16:52] RECOVERY - HTTPS on cp3014 is OK: SSLXNN OK - 36 OK [17:17:13] point taken [17:17:26] akosiaris: Gory details at -- https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys [17:18:10] twentyafterfour: did you forget to create the wmf20 branch for extensions yesterday? 
I didn't find this branch in any extension, but in core? :) [17:18:26] FlorianSW: that's the case. [17:19:10] kart_: will they be created later? [17:19:24] FlorianSW: I'm creating them now [17:19:34] twentyafterfour: ah great, thanks :) [17:19:55] 10Ops-Access-Requests, 6operations: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1092629 (10Dzahn) >>! In T91625#1092260, @coren wrote: > @dzahn stat1003 is real metal in prod. Yes, i was asking why prod because it says " for the purpose of being able to manage Language... [17:20:06] <^d> The "creating the branch" bits can technically run from anywhere [17:21:22] PROBLEM - HTTPS on cp3014 is CRITICAL: Return code of 110 is out of bounds [17:21:29] <^d> Actually, lemme see what we can do about this now [17:21:56] ^d? [17:22:16] <^d> I'm going to fix checkoutMediaWiki to only do https instead of ssh [17:23:27] (03PS1) 10Chad: Checkout mediawiki with https instead of ssh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194553 [17:25:06] 7Blocked-on-Operations, 6operations, 10Citoid, 6Scrum-of-Scrums, 6Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1092652 (10GWicke) I wouldn't worry too much about external resources being slow at this point, as there's not much we can do apart from doing those requ... [17:25:46] (03CR) 10Chad: [C: 032] Checkout mediawiki with https instead of ssh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194553 (owner: 10Chad) [17:25:55] RECOVERY - HTTPS on cp3014 is OK: SSLXNN OK - 36 OK [17:26:25] <^d> twentyafterfour: Ok, future branches will get checked out with https :) [17:26:48] ^d: how does that fix the problem? 
[17:27:01] <^d> It should make fetching for kart_ work :) [17:27:33] <^d> Fixing a different problem :) [17:27:37] oh [17:28:04] !log kartik Finished scap: Update ContentTranslation (duration: 15m 37s) [17:28:09] Logged the message, Master [17:28:31] <^d> Submodules should already be https [17:28:47] <^d> Yep [17:29:03] ^d: :) [17:30:10] (03Merged) 10jenkins-bot: Checkout mediawiki with https instead of ssh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194553 (owner: 10Chad) [17:31:12] !log demon Synchronized multiversion/checkoutMediaWiki.php: (no message) (duration: 00m 06s) [17:31:16] Logged the message, Master [17:31:33] <^d> !log updated php-1.25wmf(19|20) remotes to use https instead of ssh [17:31:38] Logged the message, Master [17:31:51] <^d> akosiaris: Fixed :D [17:36:28] (03PS2) 10Alexandros Kosiaris: Beta: Assign proxy for zotero [puppet] - 10https://gerrit.wikimedia.org/r/194552 [17:36:30] (03PS4) 10Alexandros Kosiaris: Puppet module for the zotero service [puppet] - 10https://gerrit.wikimedia.org/r/194495 (https://phabricator.wikimedia.org/T89867) [17:36:53] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail [17:37:30] 'night [17:37:52] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 2 failures [17:42:10] 10Ops-Access-Requests, 6operations, 6Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#1092706 (10chasemp) @mark could you approve this? Unsure who else to ask. 
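The ssh-to-https remote switch logged at 17:31 comes down to rewriting the remote URL so that a fetch no longer depends on a forwarded ssh agent. A minimal sketch of that rewrite follows, assuming remotes of the form ssh://user@host:port/path. The host gerrit.example.org, the "/r" path prefix and the function name are illustrative assumptions, not values taken from checkoutMediaWiki.php.

```python
from urllib.parse import urlsplit, urlunsplit


def ssh_remote_to_https(url: str, https_prefix: str = "/r") -> str:
    """Rewrite an ssh:// remote URL as an https:// URL on the same host.

    The user name and ssh port are dropped. https_prefix is a hypothetical
    path prefix for the https endpoint; adjust it to the real server layout.
    URLs that are not ssh:// are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme != "ssh":
        return url
    return urlunsplit(("https", parts.hostname, https_prefix + parts.path, "", ""))


if __name__ == "__main__":
    # Hypothetical remote, for illustration only.
    print(ssh_remote_to_https("ssh://someuser@gerrit.example.org:29418/mediawiki/core.git"))
```

The resulting URL could then be applied to an existing checkout with `git remote set-url origin <new-url>`.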
[17:46:27] (03PS1) 10Giuseppe Lavagetto: ganglia: add codfw jobrunners to the web views [puppet] - 10https://gerrit.wikimedia.org/r/194554 [17:46:29] !log pushing 1.25wmf20 branches which were missed by yesterday's deployment [17:46:34] Logged the message, Master [17:47:14] 6operations, 10Continuous-Integration, 6Labs: Could not find class role::ci::website::labs on integration puppetmaster - https://phabricator.wikimedia.org/T91613#1092736 (10Krinkle) As far as I know a "role::ci::website::labs" role never existed, nor would we want one. There is no website in labs...? I'm cu... [17:48:26] (03CR) 10Giuseppe Lavagetto: [C: 032] ganglia: add codfw jobrunners to the web views [puppet] - 10https://gerrit.wikimedia.org/r/194554 (owner: 10Giuseppe Lavagetto) [17:51:40] 6operations, 10Continuous-Integration, 6Labs: Could not find class role::ci::website::labs on integration puppetmaster - https://phabricator.wikimedia.org/T91613#1092748 (10Dzahn) Yea, that's what i was wondering too. Where does it even come from, i also couldn't find it in the puppet repo. [17:53:34] 6operations, 10Continuous-Integration, 6Labs: Could not find class role::ci::website::labs on integration puppetmaster - https://phabricator.wikimedia.org/T91613#1092760 (10scfc) The class is referenced by the instance's configuration (cf. https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000474.eqiad.wm... [17:56:45] woohoo, zotero running in las [17:57:00] ^d: thanks! [17:57:22] akosiaris: mutante: Any advice on https://phabricator.wikimedia.org/T91525 would be welcome. [17:57:33] Short of changing it to an `exec` using pip-install. [17:57:36] <^d> akosiaris: np. it was mostly my fault anyway :) [17:57:40] (assuming that works?) [17:59:06] Krinkle: -rwx------ 1 root root 281 Mar 2 21:50 /usr/local/bin/tox [17:59:08] what ? 
[17:59:23] Exactly [17:59:27] PROBLEM - HTTPS on cp3014 is CRITICAL: Return code of 110 is out of bounds [17:59:47] It worked on the old instances, but when we set out to re-create our instances this month. Half our stuff became root-only and broke CI. [17:59:56] I've added a bunch of umask in different places, but not sure how to fix this one [18:00:12] (03CR) 10Rush: "resigning just because I'm cool with either outcome!" [puppet] - 10https://gerrit.wikimedia.org/r/194402 (owner: 10BBlack) [18:01:47] RECOVERY - HTTPS on cp3014 is OK: SSLXNN OK - 36 OK [18:06:58] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:07:04] Krinkle: I honestly have no suggestion right now.. I assume you only experience this on trusty hosts ? [18:07:15] or jessie hosts ? [18:07:24] akosiaris: We don't have Jessie hosts yet [18:07:30] Trusty only indeed [18:07:32] Why is that? [18:09:01] (03PS3) 10Alexandros Kosiaris: Beta: Assign proxy for zotero [puppet] - 10https://gerrit.wikimedia.org/r/194552 [18:09:03] (03PS5) 10Alexandros Kosiaris: Puppet module for the zotero service [puppet] - 10https://gerrit.wikimedia.org/r/194495 (https://phabricator.wikimedia.org/T89867) [18:09:13] I can't figure why a normal puppet run would set such a restrictive umask [18:09:28] 0077 [18:09:44] 6operations, 10Continuous-Integration, 6Labs: Could not find class role::ci::website::labs on integration puppetmaster - https://phabricator.wikimedia.org/T91613#1092796 (10Dzahn) >>! In T91613#1092760, @scfc wrote: > Someone needs to uncheck the corresponding marker on the configuration page. Logged in to... [18:09:45] I 'd expect a user to do it, but not a daemon [18:10:14] daemons should be 0022 [18:10:39] mind if I take a look at this tomorrow? it's like 20:00 here and I am beat [18:11:23] * Coren is back. 
[18:11:25] 6operations, 10Continuous-Integration, 6Labs: Could not find class role::ci::website::labs on integration puppetmaster - https://phabricator.wikimedia.org/T91613#1092799 (10Dzahn) It does exist as a "puppet group" in wikitech (https://wikitech.wikimedia.org/wiki/Special:NovaPuppetGroup depending on your proj... [18:11:44] akosiaris: Afaik all of puppet was changed by ops to run in restrictive umask. Requiring manifests to individually expand it as needed. [18:11:52] akosiaris: This is both on precise and trusty [18:12:09] It's part of /usr/local/sbin/puppet-run [18:12:16] which is what is popped off of cron [18:13:10] The tox/pip permission denied has only shown up on trusty so far, but it might happen on precise as well. I didn't let it get that far and instead depooled half our cluster and back to the old instances. [18:13:11] oh damn [18:13:22] yeah I see the log line [18:13:31] It's taking over a week now just to re-create the pool of 10 instances. For them to be like the old instances were, fresh. [18:13:43] # We pass show-diff, show the log may be sensitive, [18:13:43] # so make sure it's sufficiently protected [18:13:43] umask 077 [18:13:46] We've been out of sync with upstream for about 6 months :) [18:13:51] 6operations, 10Citoid: Configure citoid to use outbound proxy - https://phabricator.wikimedia.org/T89875#1092807 (10Mvolz) Yes, the native scraper makes outbound requests. Zotero only scrapes a limited set of websites (it matches the urls by regex). If the URL doesn't match any of these, we scrape it in-house.... [18:14:48] Krinkle: yeah I think I can fix that [18:14:56] 6operations: pybal issue? - https://phabricator.wikimedia.org/T90839#1092808 (10Dzahn) regarding ocg: 16:30 < bblack> yeah there's no pybal bug either, the raw ipvs tables have just ocg100[12] IPs: 16:30 < bblack> TCP 10.2.2.31:8000 wrr -> 10.64.32.151:8000 Route 10 0 564 -> 10.64.48.42:8000 Route 10 0 565 reg... [18:15:15] akosiaris: Hm.. what do you have in mind? 
[18:15:47] enforcing the permissions of the log file on every run and dropping the umask [18:16:10] 10Ops-Access-Requests, 6operations, 6Security: define in Puppet or remove user account - tnegrin - https://phabricator.wikimedia.org/T90932#1092812 (10Dzahn) a:5Tnegrin>3mark [18:16:34] 10Ops-Access-Requests, 6operations, 6Security: define in Puppet or remove user account - tfinc - https://phabricator.wikimedia.org/T90927#1092814 (10Dzahn) a:5Ottomata>3mark [18:19:07] 10Ops-Access-Requests, 6operations, 6Security: define in Puppet or remove user account - santhosh - https://phabricator.wikimedia.org/T90937#1092828 (10Dzahn) @santhosh hi @chasemp after having cleaned a couple other users and seeing the oxygen/gadolinium combination more than once, i think the chance is 9... [18:21:06] (03PS1) 10Alexandros Kosiaris: Drop the umask from puppet-run, ensure log file permissions [puppet] - 10https://gerrit.wikimedia.org/r/194557 [18:21:26] (03PS2) 10coren: Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T91354) [18:21:34] (03CR) 10jenkins-bot: [V: 04-1] Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T91354) (owner: 10coren) [18:22:35] Ah bah; of course - the jenkins test for the chaneg in mediawiki-config doesn't use the changed test in that same changeset. [18:22:45] It does [18:22:57] well, it should and does for the mediawiki tsts [18:23:14] (03CR) 10Alexandros Kosiaris: [C: 032] Drop the umask from puppet-run, ensure log file permissions [puppet] - 10https://gerrit.wikimedia.org/r/194557 (owner: 10Alexandros Kosiaris) [18:23:26] Ah! Because typo! 
[18:23:30] mea culpa [18:23:47] (03PS3) 10coren: Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T91354) [18:23:57] !log depooled cp3014 frontend-only (esams upload) [18:24:04] Logged the message, Master [18:24:06] There we go. [18:24:29] Ah, ew! That file uses tabs [18:24:55] akosiaris: that fix isn't right [18:25:04] akosiaris: the umask->chmod one [18:25:15] akosiaris: what if the file doesn't exist? [18:25:22] (03PS4) 10coren: Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T91354) [18:25:27] you have to touch it, then chmod it [18:25:43] I'm also not convinced that it's a good idea to rely on the default umask [18:25:45] paravoid: yeah you are right [18:26:00] not about the umask, about the touch [18:26:02] we could do a File { owner => root, group => root, mode => 0644 } or something, globally [18:26:12] it would not work in this case [18:26:14] with that syntax [18:26:25] that is a package provider [18:26:34] so File defaults are unimportant [18:27:04] well the package shouldn't assume a umask then [18:27:08] dpkg doesn't :) [18:27:20] good luck explaining that upstream [18:27:29] https://tickets.puppetlabs.com/browse/PUP-1328 [18:27:43] they just sit on that ticket for 2 years [18:29:38] This is /not/ my day. 
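The umask exchange above can be reproduced in a few lines. This is a rough sketch, not the contents of puppet-run or of the merged patch: it shows why a file requested as 0755 ends up as -rwx------ under umask 077, and what the "touch it, then chmod it" approach looks like for the log file. The file names and the 0600 log mode are assumptions chosen for illustration.

```python
import os
import stat
import tempfile


def mode_of(path):
    """Return only the permission bits of a file."""
    return stat.S_IMODE(os.stat(path).st_mode)


def create_with_umask(path, umask):
    """Create a file requesting mode 0755 under the given umask; return the resulting mode."""
    old = os.umask(umask)
    try:
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o755)
        os.close(fd)
    finally:
        os.umask(old)  # always restore the previous umask
    return mode_of(path)


def ensure_log(path, mode=0o600):
    """Touch, then chmod: works whether or not the file exists, whatever the umask is."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_APPEND, mode)
    os.close(fd)
    os.chmod(path, mode)  # chmod is not affected by the umask
    return mode_of(path)


def demo():
    with tempfile.TemporaryDirectory() as d:
        restrictive = create_with_umask(os.path.join(d, "tool-077"), 0o077)
        relaxed = create_with_umask(os.path.join(d, "tool-022"), 0o022)
        log_mode = ensure_log(os.path.join(d, "example.log"))
        return restrictive, relaxed, log_mode


if __name__ == "__main__":
    print([oct(m) for m in demo()])
```

Protecting the one sensitive file explicitly keeps the restrictive mode from leaking into everything else the run installs, which is the trade-off discussed above.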
[18:29:46] (03PS1) 10Alexandros Kosiaris: Fix for e988ac4 [puppet] - 10https://gerrit.wikimedia.org/r/194561 [18:30:34] (03PS5) 10coren: Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T91534) [18:30:38] (03CR) 10jenkins-bot: [V: 04-1] Fix for e988ac4 [puppet] - 10https://gerrit.wikimedia.org/r/194561 (owner: 10Alexandros Kosiaris) [18:31:29] (03CR) 10Krinkle: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/194561 (owner: 10Alexandros Kosiaris) [18:32:19] (03PS1) 10Giuseppe Lavagetto: ganglia: add codfw appservers [puppet] - 10https://gerrit.wikimedia.org/r/194562 [18:33:53] (03CR) 10Alexandros Kosiaris: [C: 032] Fix for e988ac4 [puppet] - 10https://gerrit.wikimedia.org/r/194561 (owner: 10Alexandros Kosiaris) [18:36:05] 6operations, 10Citoid: Puppetize zotero - https://phabricator.wikimedia.org/T89867#1092975 (10akosiaris) https://gerrit.wikimedia.org/r/#/c/194495/ And it works works in labs fine (albeit without the proxy right now). Will move on with the proxy changes tomorrow and request team review [18:40:26] 7Blocked-on-Operations, 6operations, 10Citoid, 6Scrum-of-Scrums, 6Services: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1092995 (10akosiaris) After puppetization ( https://gerrit.wikimedia.org/r/#/c/194495/ ) deployment-zotero01 in live in Beta and surving requests. My sim... [18:40:51] (03CR) 10Dzahn: [C: 032] dbtree: add to misc varnish config [puppet] - 10https://gerrit.wikimedia.org/r/194246 (https://phabricator.wikimedia.org/T90837) (owner: 10Dzahn) [18:40:57] akosiaris: paravoid: Want me to try and re-create the instance now? [18:41:20] me? 
[18:41:31] Regarding the umask stuff [18:41:36] I noticed you were talking about it [18:41:43] oh [18:41:53] Krinkle: if the puppet master has picked up the change, yeah do it [18:41:58] akosiaris: btw, there were two other issues related to permissions on https://phabricator.wikimedia.org/T91524 as well. The other two were fixed with umask. [18:42:11] akosiaris: ok [18:42:21] (03PS3) 10coren: labstore1002 to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/194537 (https://phabricator.wikimedia.org/T91640) [18:43:13] PROBLEM - RAID on mw2008 is CRITICAL: Connection refused by host [18:43:32] PROBLEM - configured eth on mw2008 is CRITICAL: Connection refused by host [18:43:43] PROBLEM - dhclient process on mw2008 is CRITICAL: Connection refused by host [18:43:48] (03CR) 10Dzahn: [C: 032] dbtree: add Apache config, move to own docroot [puppet] - 10https://gerrit.wikimedia.org/r/194248 (https://phabricator.wikimedia.org/T90837) (owner: 10Dzahn) [18:43:48] !log depool cp3018 esams pybal [18:43:53] Logged the message, Master [18:44:03] PROBLEM - mediawiki-installation DSH group on mw2008 is CRITICAL: Host mw2008 is not in mediawiki-installation dsh group [18:44:13] PROBLEM - nutcracker port on mw2008 is CRITICAL: Connection refused by host [18:44:26] oh, heh, need dsh groups for codfw wikis now?:) [18:44:32] PROBLEM - DPKG on mw2008 is CRITICAL: Connection refused by host [18:44:32] PROBLEM - nutcracker process on mw2008 is CRITICAL: Connection refused by host [18:44:33] PROBLEM - Disk space on mw2008 is CRITICAL: Connection refused by host [18:44:33] PROBLEM - puppet last run on mw2008 is CRITICAL: Connection refused by host [18:44:43] PROBLEM - salt-minion processes on mw2008 is CRITICAL: Connection refused by host [18:44:53] PROBLEM - HHVM processes on mw2008 is CRITICAL: Connection refused by host [18:44:55] wee. 
2xxx is good news [18:46:05] (03PS1) 10BBlack: swap cp301[48] roles, depool temporarily [puppet] - 10https://gerrit.wikimedia.org/r/194563 [18:46:34] (03CR) 10BBlack: [C: 032 V: 032] swap cp301[48] roles, depool temporarily [puppet] - 10https://gerrit.wikimedia.org/r/194563 (owner: 10BBlack) [18:46:52] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:49:39] 6operations, 10Continuous-Integration, 6Labs: Could not find class role::ci::website::labs on integration puppetmaster - https://phabricator.wikimedia.org/T91613#1093037 (10Krinkle) >>! In T91613#1092760, @scfc wrote: > The class is referenced by the instance's configuration (cf. https://wikitech.wikimedia.o... [18:49:47] 6operations, 10Continuous-Integration, 6Labs: Could not find class role::ci::website::labs on integration puppetmaster - https://phabricator.wikimedia.org/T91613#1093038 (10Krinkle) 5Open>3Resolved a:3Krinkle [18:52:06] ottomata: ping re: graphite1001? [18:52:12] ori: or you ^ [18:52:20] what's up with it? [18:52:21] https://gdash.wikimedia.org/dashboards/reqerror/ is a sad panda [18:52:29] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=graphite1001&nostatusheader [18:52:31] * ori looks [18:54:12] mutante: Do we have a scap rsync-slave in codfw yet? That would be a great place to start [18:54:50] bd808: i dunno know, i think not yet [18:56:16] let's not get back into the case like we had in pmtpa where the boxes there were all syncing directly from tin [18:56:48] yes, true [18:56:53] PROBLEM - HTTPS on cp3014 is CRITICAL: Return code of 255 is out of bounds [18:56:53] PROBLEM - Varnish HTTP upload-backend on cp3014 is CRITICAL: Connection refused [18:57:16] bblack: is it possible that the problem is not on graphite1001 but the varnishes' statsd instance? 
[18:57:28] * ori will brb [18:57:32] it's possible, but if so it's not due to a recent config change there [18:57:38] (at least, not that I'm aware of) [18:57:46] mutante: I'd actually like to see a tin-like box in codfw that got a full sync of /srv/mediawiki-staging that included all the .git/ stuff that we don't send to the rsync-slaves [18:57:51] (most of them are still on existing precise install/config, still have same error) [18:58:03] RECOVERY - Varnish HTTP upload-backend on cp3014 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.178 second response time [18:58:04] bd808: for a moment i thought we would add "apaches-codfw" then i thought "is it hhvm-codfw" now, then i thought "all doesnt't matter, we are only using mediawiki-installation anyways and that is not per datacenter" [18:58:12] RECOVERY - HTTPS on cp3014 is OK: SSLXNN OK - 36 OK [18:58:17] Then we could in theory deploy from codfw if needed [18:58:57] *nod* The old apaches-* groups are replaced by salt commands now right? [18:58:59] ori: it's entirely possible it's due to https://phabricator.wikimedia.org/T91464 , but I'm not sure why that would be a new problem, I think it's been that way for a while [18:59:20] bd808: that sounds good to me (having a tin-like box) [19:00:12] bd808: yes, dsh is not used anymore for Apache config deploy. 
What we still use it for is, afaict, only that one group, mediawiki-installation, which influences what we scap to and that Icinga checks for [19:01:37] i guess we can delete a bunch of old groups but it might also make sense to have one group for eqiad and one for codfw instead of just that single one [19:02:10] !log repooled cp301[48] in pybal [19:02:15] Logged the message, Master [19:02:46] (03PS1) 10Dzahn: activate Apache site for dbtree [puppet] - 10https://gerrit.wikimedia.org/r/194567 (https://phabricator.wikimedia.org/T90837) [19:03:01] PROBLEM - HHVM queue size on mw1230 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [80.0] [19:03:07] (03PS1) 10BBlack: repool cp301[48] [puppet] - 10https://gerrit.wikimedia.org/r/194568 [19:03:21] (03CR) 10BBlack: [C: 032 V: 032] repool cp301[48] [puppet] - 10https://gerrit.wikimedia.org/r/194568 (owner: 10BBlack) [19:04:12] i found a phabricator project for codfw appservers but it had no members yet [19:04:15] https://phabricator.wikimedia.org/tag/codfw-appserver-setup/ [19:05:14] duplicate of https://phabricator.wikimedia.org/tag/wikis-in-codfw/ ? [19:05:18] cmjohnson: You around? [19:05:31] yep, i am here..what's up? 
[19:05:32] RECOVERY - Varnishkafka Delivery Errors per minute on cp3014 is OK: OK: Less than 1.00% above the threshold [0.0] [19:05:42] RECOVERY - configured eth on mw2008 is OK: NRPE: Unable to read output [19:05:51] RECOVERY - salt-minion processes on mw2008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:06:01] RECOVERY - dhclient process on mw2008 is OK: PROCS OK: 0 processes with command name dhclient [19:06:01] RECOVERY - HHVM processes on mw2008 is OK: PROCS OK: 1 process with command name hhvm [19:06:21] RECOVERY - nutcracker port on mw2008 is OK: TCP OK - 0.000 second response time on port 11212 [19:06:32] RECOVERY - RAID on mw2008 is OK: OK: no RAID installed [19:06:41] RECOVERY - nutcracker process on mw2008 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:06:41] RECOVERY - DPKG on mw2008 is OK: All packages OK [19:06:42] RECOVERY - Disk space on mw2008 is OK: DISK OK [19:06:43] cmjohnson: labstore1002 fauks to enter the PERC Bios [19:06:49] cmjohnson: faikls* [19:06:52] fails** [19:07:07] cmjohnson: Can you check if the wriring got funky while we were fixing 1001? [19:07:42] coren: can it wait until tomorrow...it's snowing pretty hard here and the roads are terrible [19:08:11] cmjohnson: Oh, sorry. I misinterpreted "here" to "the DC". Of course it can. Ping me when you get there? [19:08:35] cmjohnson: Want me to phab it up for you? [19:08:38] oh, I am here as I am around...but did not attempt to drive to the data center today. [19:08:50] yeah if you can send me a phab task that would be great [19:08:52] PROBLEM - puppet last run on mw2008 is CRITICAL: CRITICAL: Puppet has 6 failures [19:09:00] cmjohnson: Will do. Thanks again. [19:09:17] racadm serveraction powerdown [19:09:22] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: install/deploy codfw appservers - https://phabricator.wikimedia.org/T85227#1093093 (10Dzahn) [19:09:26] Err, yeah, that but it another window. 
[19:09:27] :-) [19:09:28] racadm coren coffeeup [19:09:51] PROBLEM - HHVM rendering on mw2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1525 bytes in 8.249 second response time [19:09:58] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#1093094 (10Dzahn) [19:10:02] RECOVERY - puppet last run on mw2008 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:10:06] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Setup the api appservers cluster in codfw - https://phabricator.wikimedia.org/T86892#1093095 (10Dzahn) [19:10:16] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Setup the main appservers cluster in codfw - https://phabricator.wikimedia.org/T86893#1093096 (10Dzahn) [19:12:01] PROBLEM - Apache HTTP on mw2008 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1525 bytes in 1.583 second response time [19:13:03] 6operations, 10ops-eqiad: labstore1002 fails to enter PERC bios, hangs on detecting devices - https://phabricator.wikimedia.org/T91677#1093111 (10coren) 3NEW a:3Cmjohnson [19:13:22] 6operations, 3wikis-in-codfw: setup deployment server in codfw (tin equivalent) - https://phabricator.wikimedia.org/T91678#1093119 (10Dzahn) 3NEW [19:15:04] 6operations, 10ops-eqiad: cp1047 down - https://phabricator.wikimedia.org/T88045#1093140 (10Cmjohnson) a:3Cmjohnson [19:15:10] 6operations, 3wikis-in-codfw: setup deployment server in codfw (tin equivalent) - https://phabricator.wikimedia.org/T91678#1093141 (10Dzahn) Do we have hardware we can assign for this? 
[19:16:31] (03PS2) 10Giuseppe Lavagetto: ganglia: add codfw appservers [puppet] - 10https://gerrit.wikimedia.org/r/194562 [19:16:45] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] ganglia: add codfw appservers [puppet] - 10https://gerrit.wikimedia.org/r/194562 (owner: 10Giuseppe Lavagetto) [19:16:52] PROBLEM - HHVM queue size on mw1184 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [80.0] [19:18:52] 6operations, 3wikis-in-codfw: setup deployment server in codfw (tin equivalent) - https://phabricator.wikimedia.org/T91678#1093163 (10Dzahn) [19:20:12] PROBLEM - HHVM busy threads on mw1184 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [115.2] [19:20:16] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Need release.wm.o access to do future release(s) - https://phabricator.wikimedia.org/T91424#1093167 (10Dzahn) a:3coren [19:20:22] PROBLEM - HHVM busy threads on mw1230 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [115.2] [19:22:26] (03PS2) 10Dzahn: activate Apache site for dbtree [puppet] - 10https://gerrit.wikimedia.org/r/194567 (https://phabricator.wikimedia.org/T90837) [19:23:21] (03CR) 10Dzahn: [C: 032] activate Apache site for dbtree [puppet] - 10https://gerrit.wikimedia.org/r/194567 (https://phabricator.wikimedia.org/T90837) (owner: 10Dzahn) [19:27:09] (03PS6) 10Ori.livneh: Puppet module for the zotero service [puppet] - 10https://gerrit.wikimedia.org/r/194495 (https://phabricator.wikimedia.org/T89867) (owner: 10Alexandros Kosiaris) [19:28:14] ori: heh, I saw "zerotolerance" when I first glanced at that [19:38:13] (03PS1) 10Dzahn: dbtree: move .htaccess into main config [puppet] - 10https://gerrit.wikimedia.org/r/194573 [19:38:41] (03PS2) 10Dzahn: dbtree: move .htaccess into main config [puppet] - 10https://gerrit.wikimedia.org/r/194573 (https://phabricator.wikimedia.org/T90837) [19:38:53] (03PS3) 10Dzahn: dbtree: move .htaccess into main config [puppet] - 10https://gerrit.wikimedia.org/r/194573 
(https://phabricator.wikimedia.org/T90837) [19:41:53] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:41:58] (03CR) 10Dzahn: [C: 032] "@springle: this replaces the .htaccess you added" [puppet] - 10https://gerrit.wikimedia.org/r/194573 (https://phabricator.wikimedia.org/T90837) (owner: 10Dzahn) [19:44:12] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1456 bytes in 0.204 second response time [19:44:28] 7Puppet, 6operations, 10Continuous-Integration: Puppet class Mediawiki::Packages::Fonts fails to install various fonts - https://phabricator.wikimedia.org/T91685#1093266 (10Krinkle) 3NEW [19:44:33] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Chad H needs release.wm.o access to do future release(s) - https://phabricator.wikimedia.org/T91424#1093273 (10greg) [19:46:02] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.189 second response time [19:46:32] RECOVERY - HHVM rendering on mw1184 is OK: HTTP OK: HTTP/1.1 200 OK - 66744 bytes in 0.138 second response time [19:47:12] RECOVERY - HHVM queue size on mw1230 is OK: OK: Less than 30.00% above the threshold [10.0] [19:49:22] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.059 second response time [19:49:31] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 66745 bytes in 0.200 second response time [19:50:32] PROBLEM - HHVM queue size on mw1230 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [80.0] [19:52:23] springle: do you approve of adding the CNAME for db1011? [19:52:27] tendril-backend [19:52:38] seems like a spattering of hhvm queue size overages...ori are you guys doing anyting or is this just normal stuff? 
[19:52:50] i don't know, i haven't investigated [19:52:52] i'll take a look [19:53:22] RECOVERY - HHVM queue size on mw1184 is OK: OK: Less than 30.00% above the threshold [10.0] [19:54:12] RECOVERY - HHVM busy threads on mw1184 is OK: OK: Less than 30.00% above the threshold [76.8] [19:54:54] <_joe_> chasemp: I just restarted both servers [19:55:01] RECOVERY - HHVM queue size on mw1230 is OK: OK: Less than 30.00% above the threshold [10.0] [19:55:11] ah ok _joe_ thanks [19:56:42] RECOVERY - HHVM busy threads on mw1230 is OK: OK: Less than 30.00% above the threshold [76.8] [19:56:47] thanks _joe_ [19:56:49] (03CR) 10Dzahn: [C: 032] add dbtree and point to misc-web [dns] - 10https://gerrit.wikimedia.org/r/194247 (https://phabricator.wikimedia.org/T90837) (owner: 10Dzahn) [19:56:57] <_joe_> ok/win 40 [19:58:11] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [19:58:40] 6operations, 5Patch-For-Review: dbtree - duplicated code in 2 locations - puppetize config - https://phabricator.wikimedia.org/T90837#1093322 (10Dzahn) dbtree is now here: http://dbtree.wikimedia.org/ behind misc-web like other tools and entirely independent of mediawiki deploy [19:59:13] <_joe_> on mw2008, the problem is just that mediawiki-config still doesn't work in codfw :) [20:01:56] (03PS1) 10Dzahn: noc: adjust link to dbtree in index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194579 (https://phabricator.wikimedia.org/T90837) [20:02:21] (03PS2) 10Krinkle: noc: adjust link to dbtree in index.html [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194579 (https://phabricator.wikimedia.org/T90837) (owner: 10Dzahn) [20:02:42] (03CR) 10Krinkle: [C: 031] "Nice :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/194579 (https://phabricator.wikimedia.org/T90837) (owner: 10Dzahn) [20:09:21] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:10:20] !next 
[20:10:30] jouncebot: next [20:10:30] In 3 hour(s) and 49 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150306T0000) [20:14:21] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1444 bytes in 0.187 second response time [20:34:47] 10Ops-Access-Requests, 6operations: Access to stat1003 for Niklas and Kartik - https://phabricator.wikimedia.org/T91625#1093412 (10Ottomata) The data is aggregated in production, and then placed in a public place, where dashboards that live in labs can access them. [20:38:50] (03PS1) 10Dzahn: dbtree: add http->https protocol redirect [puppet] - 10https://gerrit.wikimedia.org/r/194583 (https://phabricator.wikimedia.org/T90837) [20:41:21] (03PS2) 10Dzahn: dbtree: add http->https protocol redirect [puppet] - 10https://gerrit.wikimedia.org/r/194583 (https://phabricator.wikimedia.org/T90837) [21:06:11] Krenair: Ping? [21:06:15] pong [21:06:24] what's up? [21:07:00] Can you take a look at https://gerrit.wikimedia.org/r/#/c/194549/ and see if that would break the universe? It's an obviously correct change, but it may have impact I cannot see. [21:07:15] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1093524 (10GWicke) [21:07:44] (By "obviously correct" I mean that sourceswiki is absolutely a wikisource, not that changing it this way is the correct way to go about it) [21:08:04] 7Blocked-on-Operations, 6operations, 10RESTBase, 10hardware-requests, 7RESTBase-architecture: RESTBase production hardware - 5 of 6 ready - https://phabricator.wikimedia.org/T76986#1093527 (10GWicke) [21:10:18] Coren, that changes what configuration applies to oldwikisource [21:11:04] Krenair: Is there a reliable way to test what effective change of configuration this would cause? 
[21:11:11] YuviPanda, maybe you can help me? [21:11:31] MaxSem: I think that he is in zzzmode at this time of day. Can I help? [21:12:19] thought I seen him active ) [21:12:52] Coren, well... I guess you can deploy to the test server and set debug headers to route your requests there [21:12:59] I DIDN’T DO IT [21:13:00] but it'd be better to verify the change properly [21:13:02] MaxSem: ‘sup? [21:13:21] I'm working on puppetization of a service, in vagrant so far, and can't get upstart to work: https://gerrit.wikimedia.org/r/#/c/189149/ [21:13:31] MaxSem: Ah! So he is. He *should* be sleeping, but his work pattern is inscrutable and mysterious. Or maybe just random. :-) [21:14:07] YuviPanda: Seriously though, do you roll dice every morning to decide what timezone you'll be in? :-) [21:14:29] Coren: :D My girlfriend left back to the UK day before, so my sleep patterns are more… free again [21:15:05] "if you roll 1 on d20, it's a critical fail so you're in all the timezones" [21:15:16] MaxSem: i’m awake but I’m far too sleepy to debug upstart. Coren could help maybe? [21:15:26] !log restarting Jenkins (and kill -9 ing it) [21:15:33] Logged the message, Master [21:15:37] I spek upstart gud. :-) [21:16:15] * Coren reads the upstart conf [21:16:59] MaxSem: You've checked that the name, usr, port and log variables expand to the right things in practice, right? [21:17:13] user* [21:17:21] 'wikisource' => true, [21:17:21] 'sourceswiki' => true, // FIXME: Why isn't this part of wikisource? [21:17:28] you'd want to update those lines [21:17:48] Krenair: Hah. FIXME indeed. [21:18:01] you'd be deploying DPL there [21:18:25] 'sourceswiki' => '//bits.wikimedia.org/favicon/wikisource.ico', would be obsolete [21:19:30] Yeah, clearly going through InitialiseSettings is a necessary part of the job. [21:19:44] and disabling VE. James_F ^ (context: https://gerrit.wikimedia.org/r/#/c/194549/5 ) [21:20:15] Hmm. Yeah, that shouldn't have VE on by default. [21:20:18] Go for it. 
[21:20:21] Wait. VE is enabled on sourceswiki? [21:20:29] (CR) Jforrester: [C: 1] "OK by me to disable VE." [mediawiki-config] - https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T91534) (owner: coren) [21:20:47] Coren: special.dblist is in all but isn't in wikisource.dblist, so… yes? [21:21:00] Heh. [21:21:19] (PS1) Andrew Bogott: Move californium to a public ip [dns] - https://gerrit.wikimedia.org/r/194618 [21:21:58] Krenair: Seems to me most of the artefacts of the switch are fixes. :-) [21:22:29] wmgMFRemovePageActions - I don't... what [21:22:42] why is wikisource specifically configured there [21:23:18] Coren, this may have an effect on wikidata due to wmgWikibaseSiteGroup [21:23:33] Oh, ew. [21:24:18] Although, I don't think it has wikidata client enabled [21:24:21] wgEnotifMinorEdits is another difference. [21:24:36] unlike.. probably most other wikisource sites [21:25:00] Yeah, sourceswiki has obviously been forgotten often when dealing with the wikisources. [21:25:19] (CR) Dzahn: [C: 1] "lgtm, but only if it also moves from the labs vlan to public1-b-eqiad" [dns] - https://gerrit.wikimedia.org/r/194618 (owner: Andrew Bogott) [21:26:06] Coren, yes: https://gist.github.com/MaxSem/765d00b89129eb2e20e4 [21:29:36] Coren, anyway, basically you need to clean up InitialiseSettings and friends [21:29:51] Krenair: I knew you were the right one to ask. :-) [21:30:42] I suggest working out every config change that results, and either specifically preventing it (by special-casing sourceswiki), or mention it in the commit message [21:30:45] (PS2) Dzahn: Move californium to a public ip [dns] - https://gerrit.wikimedia.org/r/194618 (https://phabricator.wikimedia.org/T84772) (owner: Andrew Bogott) [21:31:06] just running sudo -u www-data /usr/bin/java -jar /usr/share/java/jetty-runner.jar --port 4242 /srv/hierator/hierator.war | logger -t hierator works :/ [21:31:08] Krenair: The sense, you are making it. 
[21:31:14] :) [21:32:06] MaxSem: afaict, your upstart script should be okay. Checking /etc/default/hierator [21:32:16] (CR) Dzahn: [C: 2] dbtree: add http->https protocol redirect [puppet] - https://gerrit.wikimedia.org/r/194583 (https://phabricator.wikimedia.org/T90837) (owner: Dzahn) [21:32:21] hi operations [21:32:32] could someone help me with a varnishncsa question? [21:37:41] PROBLEM - Host californium is DOWN: PING CRITICAL - Packet loss = 100% [21:38:23] (CR) Dzahn: "ping Yuvi, still up for this?" [puppet] - https://gerrit.wikimedia.org/r/177427 (https://phabricator.wikimedia.org/T71604) (owner: Yuvipanda) [21:39:24] (CR) Dzahn: "adding apergos" [puppet] - https://gerrit.wikimedia.org/r/177427 (https://phabricator.wikimedia.org/T71604) (owner: Yuvipanda) [21:39:44] (CR) Andrew Bogott: [C: 2] Move californium to a public ip [dns] - https://gerrit.wikimedia.org/r/194618 (https://phabricator.wikimedia.org/T84772) (owner: Andrew Bogott) [21:39:53] (CR) Alex Monk: [C: -1] "As discussed on IRC, this would cause a lot of configuration changes (extension deployments, un-deployments) that need to be reviewed care" [mediawiki-config] - https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T91534) (owner: coren) [21:41:25] !log moved californium to a public ip on labs-hosts1-b-eqiad, rebooted [21:41:31] Logged the message, Master [21:43:12] you moved it out of labs-hosts1 and into public1-b-eqiad [21:43:13] mforns, might be better to ask your question [21:43:37] I assume you already searched wikitech etc.? [21:43:53] Krenair, I want to know if varnishncsa has some kind of size limit on the logs? [21:44:12] does it truncate urls if they are too long? 
[21:45:37] a quick google search reveals http://t42817.web-varnish-misc.wwwtalk.info/truncated-urls-in-log-t42817.html which suggests it does (or did, in 2007) [21:48:26] Coren, /etc/default/hierator is empty because everything is in hierator.conf [21:49:47] Krenair, thanks! But I can actually read logs with more than 500 chars... [21:50:13] MaxSem: Hm. I don't get it, I see no reason why that wouldn't work. You get nothing in the logs either? [21:50:36] (PS1) Dzahn: noc: redirect old dbtree URLs to new location [puppet] - https://gerrit.wikimedia.org/r/194701 [21:51:19] (PS2) Dzahn: noc: redirect old dbtree URLs to new location [puppet] - https://gerrit.wikimedia.org/r/194701 (https://phabricator.wikimedia.org/T90837) [21:52:56] (PS6) coren: Move sourceswiki special.dblist->wikisource.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/194549 (https://phabricator.wikimedia.org/T14423) [21:53:34] Krenair: As I expected, a bit of investigation shows that the divergences were mostly unintended with sourceswiki just being forgotten. [21:54:04] (CR) Dzahn: [C: 2] noc: redirect old dbtree URLs to new location [puppet] - https://gerrit.wikimedia.org/r/194701 (https://phabricator.wikimedia.org/T90837) (owner: Dzahn) [21:54:37] yeah [21:56:41] operations, Patch-For-Review: dbtree - duplicated code in 2 locations - puppetize config - https://phabricator.wikimedia.org/T90837#1093731 (Dzahn) - does proto redirect http->https now - old noc.wm.org/dbtree URL gets 301ed to new place - mw-config change for index.html link added to SWAT @springle just... [21:57:33] Coren, so the list of changes I brought up on IRC was not exhaustive [21:57:39] I just went through a few of them [21:57:45] Coren, Mar 5 21:51:42 mediawiki-vagrant kernel: [50537.181887] init: hierator pre-start process (6148) terminated with status 1. commenting out [ ! 
-r /etc/default/<%= scope['name'] %> ] && { stop; exit 0; } made it work [21:58:01] hmm [21:58:36] Coren, wmgMFRemovePageActions does not need to be mentioned. it is a no-op at the moment [21:58:36] wikisource's specific value is identical to the default [21:58:55] (PS1) Andrew Bogott: Move californium to public IP. [puppet] - https://gerrit.wikimedia.org/r/194703 [21:59:28] MaxSem: Odd, an empty /etc/default/hierator should not have triggered that test. [21:59:53] Is it just me or is gerrit insanely slow? [22:00:04] (PS2) Dzahn: Move californium to public IP. [puppet] - https://gerrit.wikimedia.org/r/194703 (https://phabricator.wikimedia.org/T84772) (owner: Andrew Bogott) [22:00:19] Coren: hashar may be upgrading, I updated the package an hour ago at his request [22:00:25] (CR) Dzahn: [C: 1] Move californium to public IP. [puppet] - https://gerrit.wikimedia.org/r/194703 (https://phabricator.wikimedia.org/T84772) (owner: Andrew Bogott) [22:00:38] (PS3) Dzahn: Move californium to public IP. [puppet] - https://gerrit.wikimedia.org/r/194703 (https://phabricator.wikimedia.org/T84772) (owner: Andrew Bogott) [22:01:37] (CR) Andrew Bogott: [C: 2] Move californium to public IP. [puppet] - https://gerrit.wikimedia.org/r/194703 (https://phabricator.wikimedia.org/T84772) (owner: Andrew Bogott) [22:01:48] hmm, since I don't use that file I guess I should just nuke it [22:04:02] MaxSem: Remember to remove the line where it is sourced in the script too. [22:04:10] yep [22:06:20] (PS1) Andrew Bogott: Move californium to a public ip, part two [puppet] - https://gerrit.wikimedia.org/r/194707 [22:07:11] (PS2) Dzahn: Move californium to a public ip, part two [puppet] - https://gerrit.wikimedia.org/r/194707 (https://phabricator.wikimedia.org/T84772) (owner: Andrew Bogott) [22:07:22] (PS1) GWicke: Improve the RESTBase API documentation [puppet] - https://gerrit.wikimedia.org/r/194708 [22:07:45] Coren, thanks for help! 
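A possible explanation for the "pre-start process ... terminated with status 1" message above, offered as a sketch rather than a diagnosis: Upstart runs script sections with `sh -e`, and if a guard of the form `[ ! -r FILE ] && { stop; exit 0; }` is the last command in the script, a readable FILE makes the test false, so the script itself ends with status 1. The log does not show the full pre-start script, so the guard's position is an assumption. The function names below are hypothetical, and `stop` is replaced by a no-op (`:`) because initctl is not available outside an Upstart job.

```shell
#!/bin/sh
# Hypothetical reproduction of the guard outside Upstart.
# Assumption: the guard is the final command of the pre-start script.

# Guard written as an && list. When the file IS readable the test is false,
# the list short-circuits, and the script's exit status is that of the test: 1.
guard_last() {
    sh -e -c '[ ! -r "$1" ] && { :; exit 0; }' sh "$1"
}

# Same guard written with if. A false condition with no else branch
# leaves the exit status at 0, so the job would be allowed to start.
guard_safe() {
    sh -e -c 'if [ ! -r "$1" ]; then :; exit 0; fi' sh "$1"
}

conf=$(mktemp)   # readable and empty, like the empty /etc/default/hierator

status=0; guard_last "$conf" || status=$?
echo "&& form exits with $status"

status=0; guard_safe "$conf" || status=$?
echo "if form exits with $status"

rm -f "$conf"
```

If this is what happened, it would also fit the observation that commenting out the guard made the job start: the file being empty was never the problem, only that it was readable.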
[22:10:10] (CR) Dzahn: [C: 1] "californium.wikimedia.org has address 208.80.154.147" [puppet] - https://gerrit.wikimedia.org/r/194707 (https://phabricator.wikimedia.org/T84772) (owner: Andrew Bogott) [22:10:52] PROBLEM - puppet last run on rbf2002 is CRITICAL: CRITICAL: puppet fail [22:11:30] (CR) Andrew Bogott: [C: 2] Move californium to a public ip, part two [puppet] - https://gerrit.wikimedia.org/r/194707 (https://phabricator.wikimedia.org/T84772) (owner: Andrew Bogott) [22:14:24] Who can deal with T90658? (someone who can't register on wikitech because of an old SVN account) [22:15:50] Maybe ^demon|busy ? [22:16:10] <^demon|busy> I don't think so [22:16:34] <^demon|busy> There's some script for it [22:17:34] Coren: (since you're listed as "On Ops duty") Can you handle T90658, or do you know who can? [22:17:35] the process used to be emailing ops-requests [22:18:36] BTW, https://phabricator.wikimedia.org/T60687 [22:23:00] There may be details in the RT ticket behind https://phabricator.wikimedia.org/T55793 [22:23:04] Coren? [22:23:23] (RT#5923) [22:24:49] (PS2) GWicke: Improve the RESTBase API documentation [puppet] - https://gerrit.wikimedia.org/r/194708 [22:24:53] Every time I grep through the puppet repo, it complains about "modules/admin/files/home/akosiaris/.my.cnf" [22:25:14] which is just a symlink to /root/.my.cnf [22:26:47] Krenair, regarding RT 5923, there's just the actual steps listed, not more info [22:27:00] the actual steps to fix the issue? [22:27:38] yes [22:27:45] I can paste them here if wanted [22:28:32] best to document it on https://phabricator.wikimedia.org/T90658, probably [22:28:52] RECOVERY - puppet last run on rbf2002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [22:29:39] operations, Labs: Wikitech registration for prior SVN user - https://phabricator.wikimedia.org/T90658#1094001 (Aklapper) Copying the steps from similar RT #5923: 1. 
modify-ldap-user --mail=example@example.com --cn=Example example 2. change-ldap-passwd --random example 3. Login to wikitech as Example, using... [22:29:44] Krenair, done [22:29:50] thanks andre__afk [22:29:55] (CR) Dzahn: [C: 1] Improve the RESTBase API documentation [puppet] - https://gerrit.wikimedia.org/r/194708 (owner: GWicke) [22:31:24] ^demon|busy, ^ [22:32:43] (CR) Dzahn: Improve the RESTBase API documentation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/194708 (owner: GWicke) [22:33:36] (CR) GWicke: Improve the RESTBase API documentation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/194708 (owner: GWicke) [22:34:09] (CR) Dzahn: [C: 2] Improve the RESTBase API documentation [puppet] - https://gerrit.wikimedia.org/r/194708 (owner: GWicke) [22:34:22] Coren: I will upgrade Jenkins tomorrow [22:38:31] operations, Labs: Wikitech registration for prior SVN user - https://phabricator.wikimedia.org/T90658#1094027 (Dzahn) on a related note: https://wikitech.wikimedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=signup mentions both, RT and Bugzilla, we should get that updated to Phab. 
[22:39:39] operations, Labs: Wikitech registration for prior SVN user - https://phabricator.wikimedia.org/T90658#1094036 (Dzahn) docs here https://wikitech.wikimedia.org/wiki/Add-labs-user#Giving_users_Labs_access.2C_if_they_already_have_an_SVN_account [22:43:05] (PS1) Dzahn: noc: unload mod_ssl - not used anymore [puppet] - https://gerrit.wikimedia.org/r/194722 [22:43:54] (CR) Dzahn: [C: 2] noc: unload mod_ssl - not used anymore [puppet] - https://gerrit.wikimedia.org/r/194722 (owner: Dzahn) [22:52:43] (PS1) GWicke: Increase restbase heap limit slightly from 250 to 300mb [puppet] - https://gerrit.wikimedia.org/r/194729 [22:54:57] (PS2) GWicke: Increase restbase heap limit slightly from 250 to 300mb [puppet] - https://gerrit.wikimedia.org/r/194729 [23:03:25] (PS1) Dzahn: noc: remove broken symlink to pybal [mediawiki-config] - https://gerrit.wikimedia.org/r/194735 [23:10:19] (PS1) Dzahn: noc: add link to db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/194736 (https://phabricator.wikimedia.org/T90837) [23:11:31] (PS1) BBlack: esams: add all caches to private vlan [dns] - https://gerrit.wikimedia.org/r/194738 [23:11:40] (PS1) BBlack: wmf-reimage: support renaming host [puppet] - https://gerrit.wikimedia.org/r/194739 [23:11:42] (PS1) BBlack: support both vlan domainnames for esams bits [puppet] - https://gerrit.wikimedia.org/r/194740 [23:12:00] (PS2) Dzahn: noc: add link to db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/194736 (https://phabricator.wikimedia.org/T90837) [23:12:54] (CR) BBlack: [C: 2] wmf-reimage: support renaming host [puppet] - https://gerrit.wikimedia.org/r/194739 (owner: BBlack) [23:13:19] (PS3) Dzahn: noc: add link to db.php [mediawiki-config] - https://gerrit.wikimedia.org/r/194736 (https://phabricator.wikimedia.org/T90837) [23:13:31] (CR) BBlack: [C: 2] support both vlan domainnames for esams bits [puppet] - 
https://gerrit.wikimedia.org/r/194740 (owner: BBlack) [23:18:08] (PS1) Dzahn: noc: rm broken symlinks to mediawikiview/VE dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/194741 [23:18:27] (PS2) BBlack: esams: add all caches to private vlan [dns] - https://gerrit.wikimedia.org/r/194738 [23:24:39] (PS1) Dzahn: noc: add link to new pybal config files [mediawiki-config] - https://gerrit.wikimedia.org/r/194742 [23:25:12] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1459 bytes in 0.178 second response time [23:26:11] (PS2) Dzahn: noc: add link to new pybal config files [mediawiki-config] - https://gerrit.wikimedia.org/r/194742 [23:26:34] (PS3) BBlack: esams: add all caches to private vlan [dns] - https://gerrit.wikimedia.org/r/194738 [23:27:45] (CR) Dzahn: [C: 2] Increase restbase heap limit slightly from 250 to 300mb [puppet] - https://gerrit.wikimedia.org/r/194729 (owner: GWicke) [23:28:19] (CR) BBlack: [C: 2] esams: add all caches to private vlan [dns] - https://gerrit.wikimedia.org/r/194738 (owner: BBlack) [23:29:28] operations, Patch-For-Review: dbtree - duplicated code in 2 locations - puppetize config - https://phabricator.wikimedia.org/T90837#1094167 (Dzahn) Open>Resolved [23:33:52] (CR) Chad: [C: 2] noc: adjust link to dbtree in index.html [mediawiki-config] - https://gerrit.wikimedia.org/r/194579 (https://phabricator.wikimedia.org/T90837) (owner: Dzahn) [23:34:16] (Merged) jenkins-bot: noc: adjust link to dbtree in index.html [mediawiki-config] - https://gerrit.wikimedia.org/r/194579 (https://phabricator.wikimedia.org/T90837) (owner: Dzahn) [23:38:01] !log demon Synchronized docroot/noc/index.html: (no message) (duration: 00m 06s) [23:38:08] Logged the message, Master [23:38:25] (PS1) BBlack: esams private vlan -> jessie default [puppet] - 
https://gerrit.wikimedia.org/r/194746 [23:38:27] (PS1) BBlack: esams cache dhcp: use private vlan + default jessie [puppet] - https://gerrit.wikimedia.org/r/194747 [23:38:28] <^demon|busy> mutante: ^^ done [23:39:14] ^demon|busy: thank you [23:39:14] (Abandoned) BBlack: switch default PXE installer to jessie [puppet] - https://gerrit.wikimedia.org/r/194402 (owner: BBlack) [23:39:19] <^demon|busy> anytime [23:39:50] (CR) BBlack: [C: 2 V: 2] esams private vlan -> jessie default [puppet] - https://gerrit.wikimedia.org/r/194746 (owner: BBlack) [23:40:13] paravoid, hi [23:40:23] (CR) BBlack: [C: 2] esams cache dhcp: use private vlan + default jessie [puppet] - https://gerrit.wikimedia.org/r/194747 (owner: BBlack) [23:41:32] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [23:43:15] bblack, hi [23:43:47] (PS1) BBlack: depool cp3022 for reinstall [puppet] - https://gerrit.wikimedia.org/r/194751 [23:43:59] (CR) BBlack: [C: 2 V: 2] depool cp3022 for reinstall [puppet] - https://gerrit.wikimedia.org/r/194751 (owner: BBlack) [23:44:15] !log depooled cp3022 in pybal [23:44:19] mforns: what's up? [23:44:20] Logged the message, Master [23:44:35] hey bblack, I have a question on varnishncsa [23:44:41] ok [23:45:09] in analytics we are trying to troubleshoot some query strings that get truncated at 1014 bytes by varnishncsa [23:45:25] do you know something about this? [23:46:11] unfortunately I don't really, no [23:46:35] mforns: new problem, or always been this way? 
[23:46:56] we had not so large logs before [23:47:13] now we have them and they are failing validation, because they get truncated [23:47:55] it's entirely possible that at one of the various layers of stats proxying, a single entry has to fit in a UDP packet, which would be 1500 bytes minus various protocol and encoding overhead and other metadata, leaving you with 1014 for the URL [23:47:56] but, yea, I suppose varnishncsa continues as always has been [23:48:18] aha [23:48:28] I was also looking at: https://www.varnish-cache.org/trac/browser/bin/varnishncsa/varnishncsa.c?rev=79c2d962221bb7ce3582caa1a0be5df4841e0832#L271 [23:48:49] file a phab task for ops+analytics + needs triage and we'll sort out who can sort it out. maybe ottomata initially until/unless he assigns to another [23:49:01] in line 271 there is this trimline method that seems to me that truncates the query string, is it possible? [23:49:14] ok, makes sense [23:55:11] * legoktm will have some stuff for swat in a few minutes [23:56:06] mforns: AFAICS, the code around there is just splitting the query string from the rest of the URL (if applicable) and trimming whitespace off the ends [23:56:25] bblack, ok [23:57:44] bblack, ok thanks! I'll add operations to the phab task we already have and speak with ottomata [23:57:53] ok :) [23:57:59] tnx! [23:58:36] operations, Analytics-EventLogging, Analytics-Kanban: EventLogging query strings are truncated to 1014 bytes by varnishncsa - https://phabricator.wikimedia.org/T91347#1094281 (mforns) [23:59:11] (PS1) Hoo man: Change dispatchChanges parameters for Wikidata [puppet] - https://gerrit.wikimedia.org/r/194758
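For the varnishncsa truncation discussed above, one way to gauge how many entries are affected is to look for query strings whose length sits exactly at the reported limit. The following is a rough sketch only: the input format (one query string per line) and the helper name `flag_truncated` are assumptions for illustration, not the real varnishncsa or EventLogging format, and 1014 is simply the byte count quoted by mforns. A legitimate query string can also happen to be exactly that long, so matches are candidates, not proof.

```shell
#!/bin/sh
# Hypothetical helper: report input lines whose byte length equals the limit.
# Assumption: one query string per line on stdin.
LIMIT=1014   # byte count quoted in the discussion above

flag_truncated() {
    # LC_ALL=C makes awk count bytes rather than multibyte characters.
    LC_ALL=C awk -v limit="$LIMIT" \
        'length($0) == limit { print NR ": " length($0) " bytes" }'
}

# Build one short line and one line of exactly LIMIT bytes to exercise the helper.
long=$(printf '%*s' "$LIMIT" '' | tr ' ' 'a')
printf '%s\n%s\n' '?event=short' "$long" | flag_truncated
```

Feeding it an extract of the raw log and counting the matches would show whether the truncation clusters at one length, which is what a fixed buffer or packet size limit would produce.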