[00:02:11] 1 home.gateway.home.gateway (192.168.1.1) 2.132 ms 2.110 ms 2.526 ms
[00:02:11] 2 89-168-60-1.dynamic.dsl.as9105.com (89.168.60.1) 9.155 ms 10.219 ms 10.972 ms
[00:02:11] 3 host-78-151-238-209.as13285.net (78.151.238.209) 13.704 ms * 15.207 ms
[00:02:11] 4 * * *
[00:02:15] It's all just * * * from there
[00:02:20] (to bast1001.wikimedia.org)
[00:02:25] yeah that sounds like a network problem
[00:02:35] never mind my comment about security, for now
[00:02:54] yeah, not on that channel
[00:03:02] well regardless :)
[00:03:25] we had security issues earlier related to routing, and I thought this was related, but now I think not
[00:03:57] mmm ASN as domain name
[00:04:03] no problems from here
[00:06:58] PROBLEM - puppet last run on ms-be2002 is CRITICAL puppet fail
[00:08:38] Krenair: TalkTalk is your ISP?
[00:08:42] yeah
[00:09:46] (CR) Negative24: "Going to do one more test (apply role with a new instance and see if it works). After I'll give the go ahead for merging." [puppet] - https://gerrit.wikimedia.org/r/222987 (https://phabricator.wikimedia.org/T104827) (owner: Negative24)
[00:10:24] Krenair: I think the problem is localized to TalkTalk, but not sure yet
[00:10:31] we go through CW to reach them
[00:15:18] vodafone I guess, or ascio, or whatever they are
[00:15:42] !log Updated WikibaseQualityConstraints data on wikidata (wikidatawiki.wbqc_constraints)
[00:15:46] Logged the message, Master
[00:15:52] oh ascio is just the registrar heh
[00:15:58] hmmmm
[00:18:06] yeah
[00:18:16] I can connect to everywhere fine if I proxy through my vps
[00:19:22] Ops-Access-Requests, operations, Discovery, Wikidata, and 2 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1443648 (Dzahn) Here's a patch to add a new admin group to start with, add the 2 users and give them `privileges: ['ALL = NOPASSWD: /usr/sbin...
[00:19:23] Krenair: did it recover now?
[00:19:38] yep
[00:20:21] it was some problem with the various network entities on your side of the pond, they seem to have fixed it now on their own
[00:20:54] (we were showing a route to TalkTalk through vodafone while you were broken, and our traceroute here was broken as well. Now it's routing through C&W instead of vodafone and working again)
[00:21:46] the route while broken looked funny too, who knows what was going on there....
[00:22:09] RECOVERY - puppet last run on ms-be2002 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures
[00:23:56] bblack, interesting, thanks for looking into that
[00:24:40] Ops-Access-Requests, operations, Discovery, Wikidata, and 2 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1443653 (Smalyshev) WDQS service will have two services, implementing Blazegraph database and the updater. I don't have yet debian service co...
[00:27:29] (CR) Dzahn: "oh. so i was about to downvote it because you have "precise" but i changed it to "trusty". but then.. i did that because i saw the adminbo" [debs/adminbot] - https://gerrit.wikimedia.org/r/223735 (owner: Merlijn van Deen)
[00:35:44] !log krenair Synchronized php-1.26wmf13/extensions/VisualEditor/modules/ve-mw/ui/inspectors/ve.ui.MWLinkAnnotationInspector.js: https://gerrit.wikimedia.org/r/#/c/223983/ (duration: 00m 12s)
[00:35:48] Logged the message, Master
[00:36:26] ^ this is totally within the swat window, by the way
[00:38:03] (it fixed the issue)
[00:38:18] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[00:38:18] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused
[00:39:22] ^^^ got it
[00:39:30] thanks, i was just about to
[00:39:44] !log starting restbase1004
[00:39:48] Logged the message, Master
[00:40:16] (CR) Dzahn: [C: 2] Add url to adminlogbot output [debs/adminbot] - https://gerrit.wikimedia.org/r/180890 (owner: Merlijn van Deen)
[00:40:18] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[00:41:21] (CR) Dzahn: [C: 2] "i checked on tools-exec-1203" [debs/adminbot] - https://gerrit.wikimedia.org/r/223735 (owner: Merlijn van Deen)
[00:41:23] (Merged) jenkins-bot: Update debian/changelog and debian/control [debs/adminbot] - https://gerrit.wikimedia.org/r/223735 (owner: Merlijn van Deen)
[00:42:09] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.002 second response time on port 9042
[00:48:48] robh: can you help me out?
[00:52:44] operations: URGENT: Mail alias needed vpe-staff to route to eng-mgt - https://phabricator.wikimedia.org/T105431#1443705 (JKrauska) NEW
[00:53:55] jgage: hey!
[00:54:16] mutante might be around
[00:55:21] yes, i can do that
[00:55:30] operations: URGENT: Mail alias needed vpe-staff to route to eng-mgt - https://phabricator.wikimedia.org/T105431#1443719 (Dzahn) a:Dzahn
[00:56:15] mutante: thanks!
[00:58:33] mutante: turns out because I still have both groups in LDAP, Google group set up an internal alias when I modified the group name..
[00:58:48] so we're all ok for the moment (less urgent than I originally thought)
[00:58:55] however, we do need the long term fix (exim)
[00:59:01] cajoel: oh, heh, i literally just committed that
[00:59:18] should i remove it again?
[00:59:23] nah
[00:59:24] it's fine
[00:59:25] thanks
[01:00:07] and puppet applied it .. now
[01:00:34] operations: URGENT: Mail alias needed vpe-staff to route to eng-mgt - https://phabricator.wikimedia.org/T105431#1443720 (Dzahn) Open>Resolved done as requested +# T105431 forward for OIT / cajoel +vpe-staff: eng-mgt
[01:03:39] RECOVERY - Incoming network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0]
[01:06:39] (CR) Dzahn: "merged both.. built 1.7.9 on the building host in prod, copper. uploaded it to carbon to /srv/wikimedia/incoming/ .. tried to import it wi" [debs/adminbot] - https://gerrit.wikimedia.org/r/180890 (owner: Merlijn van Deen)
[01:13:07] PROBLEM - Incoming network saturation on labstore1001 is CRITICAL 21.43% of data above the critical threshold [100000000.0]
[01:23:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds
[01:23:48] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused
[01:23:48] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[01:37:46] * jgage checks restbase1004
[01:38:12] !log cassandra restarted on restbase1004
[01:38:17] Logged the message, Master
[01:38:48] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[01:40:15] Puppet, operations, Discovery, Wikidata, and 2 others: Make a puppet role that sets up a query service and loads it - https://phabricator.wikimedia.org/T95679#1443755 (Lydia_Pintscher)
[01:40:49] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.004 second response time on port 9042
[01:44:39] YuviPanda: hey Yuvi, following up again about getting your sessions on the hackathon schedule
[01:44:54] pls :)
[01:46:09] jgage: thank you!
[01:47:13] my pleasure :)
[01:50:13] !log bounced cassandra on restbase1004
[01:50:16] ..and once more
[01:50:18] Logged the message, Master
[01:50:46] that truncate command looks very attractive, considering it's only a cache at this point
[01:54:12] (PS1) BBlack: skip default vcl_recv for upload-frontend [puppet] - https://gerrit.wikimedia.org/r/223996
[01:54:26] (PS1) BBlack: text varnish: pass all "Authorization: OAuth " requests [puppet] - https://gerrit.wikimedia.org/r/223997 (https://phabricator.wikimedia.org/T105387)
[01:54:49] operations, Easy, Patch-For-Review: server admin log should include year in date (again) - https://phabricator.wikimedia.org/T85803#1443764 (Dzahn)
[01:55:20] operations: Create exim Mailing Aliases - https://phabricator.wikimedia.org/T105433#1443768 (JKrauska) NEW
[01:55:41] operations: Create exim Mailing Aliases - https://phabricator.wikimedia.org/T105433#1443775 (JKrauska)
[01:56:54] operations, Mail: Create exim Mailing Aliases - https://phabricator.wikimedia.org/T105433#1443777 (Krenair)
[01:58:54] (PS7) Negative24: Phabricator: Create differential puppet role [puppet] - https://gerrit.wikimedia.org/r/222987 (https://phabricator.wikimedia.org/T104827)
[02:00:23] logmsgbot: hey
[02:01:20] !log upgraded morebots for production
[02:01:35] ..
[02:01:50] !log will it log
[02:02:54] mutante: the upgrade is causing puppet breakage, are you already on that?
[02:03:23] mutante: https://dpaste.de/mV9p
[02:03:24] andrewbogott: where? tools-exec ?
[02:03:30] everywhere
[02:03:34] well, throughout tools
[02:03:48] eh.. it should only be installed on exec nodes
[02:03:54] at least i know the fix
[02:04:18] maybe it’s just exec nodes… I got 30 emails though :)
[02:04:24] touch /usr/lib/adminbot ; apt-get remove adminbot ; apt-get install adminbot
[02:04:33] ugh.. ok.. on it
[02:04:51] touch /usr/lib/adminbot/README :/
[02:04:54] in theory we should be able to use salt wildcards to do things like this now :)
[02:05:04] what is the saltmaster
[02:05:09] labcontrol1001
[02:05:14] I can do it, though, if you want.
[02:05:21] yes, please, so:
[02:05:25] I suppose if I touch that file on instances that don’t have the package installed it’ll just fail right?
[02:05:55] so only the second touch command
[02:05:59] the README file
[02:06:10] yep, I’m verifying that it fixes one before I salt it
[02:06:12] and yes, it will only fail when /usr/lib/adminbot doesnt exist
[02:06:37] Did you remove that file from the new package, or add it, or… I’ve never seen this happen when a .deb is upgraded.
[02:06:44] no, we did not touch it
[02:07:13] andrewbogott: while at it, that package should not be installed on tools-login
[02:07:21] but it was without me doing anything
[02:07:42] ok, I’m touching that file everywhere...
[02:07:53] also, it should only influence precise
[02:08:07] PROBLEM - Restbase root url on restbase1001 is CRITICAL - Socket timeout after 10 seconds
[02:08:36] operations, Performance-Team: Adapt all the things to localized Special: namespaces - https://phabricator.wikimedia.org/T105434#1443784 (BBlack) NEW
[02:08:44] I purged it from tools-login, we’ll see if puppet reinstalls
[02:09:04] well if you purge it it might have deleted that path again
[02:09:16] apt-get remove; apt-get install worked for sure
[02:09:21] !log restarted restbase on restbase1001
[02:09:24] bblack, you shouldn't be special-casing any other namespace, but the same is true for the others
[02:09:40] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 00m 35s)
[02:09:48] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-10 02:09:48+00:00
[02:09:49] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.022 second response time
[02:10:19] andrewbogott: https://phabricator.wikimedia.org/T105169#1443760
[02:10:55] mutante: yep, seems reasonable, I don’t know why apt was so complainy
[02:10:58] and the !log doesnt even work in the new version ? arrg
[02:11:13] morebots, you there?
[02:11:13] I am a logbot running on tools-exec-1213.
[02:11:13] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[02:11:13] To log a message, type !log .
[02:11:17] !log testing
[02:11:18] this is just sad, already so many problems just to get here
[02:11:47] running that touch command seems to have satisfied puppet.
[02:11:54] something :)
[02:12:02] but if it doesnt work .. still sigh
[02:12:18] I’m going to go back to not working :) If you want to set this aside for now I can look at the bot tomorrow, I wrote most of the most recent revisions so if it’s stupid it’s probably my fault.
[02:12:23] operations, HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1443793 (Chmarkine)
[02:12:28] But it might be worth reverting things in the meantime so that we have a log
[02:12:48] andrewbogott: well, there have been changes to add the year to the date
[02:13:23] If !log is silent that probably means it’s timing out attempting to talk to wikitech
[02:13:24] andrewbogott: yes, sorry for breaking, did not expect this one at all and nobody seems to have touched the README
[02:13:37] hmm
[02:13:47] the only changes were about the log format itself
[02:13:52] well, that’s just a guess, it was the problem last time that happened.
[02:13:57] yea.. *nod*
[02:15:12] I’ll be back in ~45 and can relieve you then if you want. For debugging… note that we know what node it’s running on, so you can log in and twiddle with the source there.
[02:15:17] morebots, where are you running?
[02:15:18] I am a logbot running on tools-exec-1213.
[02:15:18] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[02:15:18] To log a message, type !log .
[02:15:24] behold ^ :)
[02:16:21] yes, so i did restart the one for production, but not the other instances
[02:16:30] and i got the info from
[02:16:43] become morebots; qstat
[02:16:51] i'll try more debugging
[02:20:14] i think i must take the reverting way because even that means removing it from APT, reprepro, salt ..
[02:20:25] and i have an important errand
[02:24:37] PROBLEM - puppet last run on mw1073 is CRITICAL Puppet has 1 failures
[02:25:31] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jul 10 02:25:31 UTC 2015 (duration 25m 30s)
[02:25:40] !log l10nupdate Synchronized php-1.26wmf13/cache/l10n: (no message) (duration: 05m 45s)
[02:27:54] !log downgraded morebots on tools-exec-1213
[02:28:01] come on ..
[02:28:29] !log LocalisationUpdate completed (1.26wmf13) at 2015-07-10 02:28:29+00:00
[02:29:37] !log temp. puppet disabled on tools-exec-1213
[02:32:43] !log are you back
[02:33:12] it doesnt come back with the old version either :/
[02:33:26] there must be something else, but just following the docs how to restart it
[02:33:32] and same package as before for this one
[02:34:24] firewall changes?
[02:34:38] andrewbogott: so i temp. disabled puppet on that one exec-node where this bot instance runs just so it doesnt re-break the downgraded package
[02:34:55] yet, it still wont log . so if there is another issue, hopefully the new version will also be just fine
[02:35:11] Puppet, Labs, Phabricator: On labs phabricator references security extension even though it isn't present - https://phabricator.wikimedia.org/T104904#1443812 (Negative24) Possibly related: {P934} Only got these errors on a newly created instance.
[02:35:30] unfortunately i really have to go and drive a car.. but i will be back asap
[02:39:48] RECOVERY - puppet last run on mw1073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:53:39] morebots, you back?
[02:53:39] I am a logbot running on tools-exec-1213.
[02:53:39] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[02:53:40] To log a message, type !log .
[02:53:47] !log testing the log by logging a test
[02:53:53] Logged the message, Master
[02:56:40] mutante: I restarted the bot and it seems to be ok. I can’t tell if I’m running the old version or the new version though :)
[02:58:44] andrewbogott: does the bot work?
[02:59:03] yeah, but I don’t know if it’s the modified version or not
[02:59:10] !log please log this with the year
[02:59:14] probably best to leave things in a working state for now anyway since it’s getting late
[02:59:15] Logged the message, Master
[02:59:25] hrm its not
[02:59:36] I think mutante reverted because it was broken
[02:59:39] roger
[02:59:51] I'll prod at it tomorrow, since I want to do something right
[03:00:17] ‘k
[03:04:39] andrewbogott: i'm back. which version is it with dpkg -l | grep adminbot
[03:04:54] 1.7.6
[03:04:58] that is the old one .. yes
[03:05:12] ok, now i wonder why it worked for you
[03:05:16] but not for me earlier
[03:05:24] also with 1.7.6 and also having restarted it
[03:05:28] Did you kill and restart on tools-login?
[03:05:31] oh, weird
[03:05:34] does that mean the new one could also work?
[03:05:43] maybe? No idea what’s going on.
[03:05:48] yea, i claim i followed the docs
[03:05:55] qstat .. qdel etc
[03:06:13] yeah, that’s all I did
[03:06:14] so the other exec nodes will have 1.7.9
[03:06:21] but we did not kill any of them on the grid
[03:06:32] so they also still run as old versions
[03:06:48] talking about the copies of morebots that are not in this channel but in other channels
[03:06:58] * andrewbogott nods
[03:07:01] hmmm
[03:07:15] thinking how to best go ahead for now
[03:07:19] But maybe we should leave them be until tomorrow, since things are sort of working :)
[03:07:25] as long as we dont restart anything it's ok like this
[03:07:32] Tomorrow we can restart labs-logbot and see if it picks up the new version
[03:07:41] sounds ok, yes
[03:07:56] if you dont mind leaving puppet disabled until then
[03:08:01] i just disabled on this one exec node
[03:08:09] so it doesnt try reinstall 1.7.9
[03:08:24] does puppet ensure->latest?
[03:08:44] yes
[03:08:53] that's why it failed on all so quick
[03:09:07] when i just added a new version to repo
[03:09:30] we can also enable puppet again, let it fail, do the workaround
[03:09:37] have 1.7.9 installed like on all others
[03:09:43] just not restart it again
[03:10:46] want me to do that?
[03:12:44] yeah, restarting puppet seems good
[03:13:16] ok, hold on. there will be one more fail mail and then that's it
[03:14:02] !log re-enabling puppet on tools-exec-1213, working around adminbot package install fail
[03:14:08] Logged the message, Master
[03:15:46] andrewbogott: ok, puppet is enabled and no errors. and meanwhile it thinks 1.7.6 is the latest version again
[03:15:48] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[03:15:53] so it didnt even try to install something
[03:15:58] hm, ok
[03:16:04] because i had removed it from repo
[03:16:06] did you rearrange things in reprepro?
[03:16:08] ah, great.
[03:16:11] but earlier it was still cache
[03:17:00] root@carbon:~# reprepro ls adminbot
[03:17:09] adminbot | 1.7.8 | trusty-wikimedia | amd64, i386, source
[03:17:09] adminbot | 1.7.6 | trusty-wikimedia | amd64, i386
[03:17:20] i dont even know why it sees 1.7.6 as precise
[03:17:25] but that wasnt my change
[03:17:42] either way its not broken now and unless somebody restarts all the grid jobs
[03:17:54] let's leave it like this and i will certainly continue tomorrow
[03:18:22] yeah, agreed.
[03:18:37] thanks andrew, sorry for the hassle.. have a good night
[03:18:42] no problem.
[03:18:49] My laptop wants to reboot, so… see you later :)
[03:18:59] cu
[03:21:34] (Abandoned) BBlack: sslcert: refactor std_cert [puppet] - https://gerrit.wikimedia.org/r/223492 (owner: BBlack)
[03:22:03] (PS1) BBlack: Revert "sslcert::std_cert explicit deps" [puppet] - https://gerrit.wikimedia.org/r/224000
[03:22:05] (PS1) BBlack: sslcert::certificate use $group for crt, consistency with chained [puppet] - https://gerrit.wikimedia.org/r/224001
[03:22:07] (PS1) BBlack: sslcert::certificate consistency re title-vs-name [puppet] - https://gerrit.wikimedia.org/r/224002
[03:22:09] (PS1) BBlack: sslcert: refactor ::certificate/::std_cert around secret() [puppet] - https://gerrit.wikimedia.org/r/224003
[03:22:24] (CR) BBlack: [C: 2 V: 2] Revert "sslcert::std_cert explicit deps" [puppet] - https://gerrit.wikimedia.org/r/224000 (owner: BBlack)
[03:22:47] (CR) BBlack: [C: 2 V: 2] sslcert::certificate use $group for crt, consistency with chained [puppet] - https://gerrit.wikimedia.org/r/224001 (owner: BBlack)
[03:23:05] (CR) BBlack: [C: 2 V: 2] sslcert::certificate consistency re title-vs-name [puppet] - https://gerrit.wikimedia.org/r/224002 (owner: BBlack)
[03:37:46] (PS1) BBlack: openstack::nova::compute: use secret() for key [puppet] - https://gerrit.wikimedia.org/r/224009
[03:45:18] PROBLEM - Restbase root url on restbase1004 is CRITICAL: Connection refused
[03:47:08] RECOVERY - Restbase root url on restbase1004 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.006 second response time
[03:53:42] (PS1) BBlack: Add ecc-uni.wikimedia.org cert [puppet] - https://gerrit.wikimedia.org/r/224011 (https://phabricator.wikimedia.org/T86654)
[03:54:04] (CR) BBlack: [C: 2 V: 2] Add ecc-uni.wikimedia.org cert [puppet] - https://gerrit.wikimedia.org/r/224011 (https://phabricator.wikimedia.org/T86654) (owner: BBlack)
[04:01:05] (PS2) BBlack: sslcert: refactor ::certificate/::std_cert around secret() [puppet] - https://gerrit.wikimedia.org/r/224003
[04:01:08] (PS1) BBlack: test puppetized dual-cert on cp1008 [puppet] - https://gerrit.wikimedia.org/r/224012
[04:01:47] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 25.00% of data above the critical threshold [100000000.0]
[04:01:54] (CR) jenkins-bot: [V: -1] test puppetized dual-cert on cp1008 [puppet] - https://gerrit.wikimedia.org/r/224012 (owner: BBlack)
[04:03:48] operations, Traffic, HTTPS: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1443860 (Chmarkine) >>! In T102814#1431376, @Reedy wrote: > > It would seem arbcom-(de|nl|en) are the main ones to worry about notifying... No need to worry about these. arbcom-...
[04:04:52] (PS2) BBlack: test puppetized dual-cert on cp1008 [puppet] - https://gerrit.wikimedia.org/r/224012
[04:06:12] operations, Traffic, HTTPS: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1443861 (BBlack) Right, there's no real cert issue for any of these, it's just a matter of notifications.
[04:12:08] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[04:12:19] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[04:14:57] ^ oops that was me, fixed
[04:15:57] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[04:16:07] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[04:26:00] (PS1) Springle: repool db1037 as s6 logpager; depool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/224014
[04:26:48] (CR) Springle: [C: 2] repool db1037 as s6 logpager; depool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/224014 (owner: Springle)
[04:26:54] (Merged) jenkins-bot: repool db1037 as s6 logpager; depool db1030 [mediawiki-config] - https://gerrit.wikimedia.org/r/224014 (owner: Springle)
[04:28:04] !log springle Synchronized wmf-config/db-eqiad.php: repool db1037; depool db1030 (duration: 00m 13s)
[04:28:09] Logged the message, Master
[04:33:37] !log springle Synchronized wmf-config/db-eqiad.php: depool db1037; repool db1030 (revert below) (duration: 00m 12s)
[04:33:42] Logged the message, Master
[04:35:04] (PS1) Springle: Revert "repool db1037 as s6 logpager; depool db1030". [mediawiki-config] - https://gerrit.wikimedia.org/r/224015
[04:35:40] (CR) Springle: [C: 2] Revert "repool db1037 as s6 logpager; depool db1030". [mediawiki-config] - https://gerrit.wikimedia.org/r/224015 (owner: Springle)
[04:36:49] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Jul 10 04:36:49 UTC 2015 (duration 36m 48s)
[04:36:53] Logged the message, Master
[04:38:09] PROBLEM - puppet last run on mw1231 is CRITICAL Puppet has 1 failures
[04:47:35] (CR) Chmarkine: [C: 1] Remove wap and mobile subdomains [dns] - https://gerrit.wikimedia.org/r/223972 (https://phabricator.wikimedia.org/T104942) (owner: BBlack)
[04:54:57] RECOVERY - puppet last run on mw1231 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures
[04:57:38] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[05:01:18] RECOVERY - Incoming network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0]
[05:08:47] PROBLEM - Incoming network saturation on labstore1001 is CRITICAL 14.29% of data above the critical threshold [100000000.0]
[05:26:31] (CR) BryanDavis: [C: 1] "+1 for concept" [puppet] - https://gerrit.wikimedia.org/r/223997 (https://phabricator.wikimedia.org/T105387) (owner: BBlack)
[05:28:06] (PS1) Ori.livneh: varnishrls: segment responses by cache-control max-age [puppet] - https://gerrit.wikimedia.org/r/224016 (https://phabricator.wikimedia.org/T104277)
[05:28:23] (PS2) Ori.livneh: varnishrls: segment responses by cache-control max-age [puppet] - https://gerrit.wikimedia.org/r/224016 (https://phabricator.wikimedia.org/T104277)
[05:30:00] (CR) Ori.livneh: [C: 2] varnishrls: segment responses by cache-control max-age [puppet] - https://gerrit.wikimedia.org/r/224016 (https://phabricator.wikimedia.org/T104277) (owner: Ori.livneh)
[05:43:26] (PS2) Dzahn: add admin group 'wikidata query service deployers' [puppet] - https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185)
[05:49:58] Ops-Access-Requests, operations, Discovery, Wikidata, and 2 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1443964 (Dzahn) Thanks for the details. @Smalyshev. I amended and uploaded a new patchset to reflect that. How about `privileges: ['ALL = (...
[05:51:03] <_joe_> mutante: thanks :)
[06:03:08] RECOVERY - Incoming network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0]
[06:07:58] operations, Monitoring: Fix monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729#1443988 (akosiaris)
[06:19:34] operations, MediaWiki-ResourceLoader, HHVM, MW-1.26-release, and 3 others: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1443992 (Joe) So, @Krinkle's patchset did have an effect in slowing down the growth on the memory occupation, but didn't stop it. The...
[06:19:37] operations, Monitoring: Fix monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729#1443995 (ori) >>! In T83729#918015, @fgiunchedi wrote: > more to the ticket's point, there doesn't seem to be any agreement of what a healthy poolcounter service looks like (besides of course the proces...
[06:30:58] PROBLEM - puppet last run on mw1088 is CRITICAL Puppet has 1 failures
[06:33:37] PROBLEM - puppet last run on elastic1022 is CRITICAL Puppet has 2 failures
[06:33:38] PROBLEM - puppet last run on iron is CRITICAL Puppet has 1 failures
[06:35:08] PROBLEM - puppet last run on lvs2001 is CRITICAL Puppet has 1 failures
[06:35:27] PROBLEM - Incoming network saturation on labstore1001 is CRITICAL 10.71% of data above the critical threshold [100000000.0]
[06:36:18] PROBLEM - puppet last run on mw2184 is CRITICAL Puppet has 1 failures
[06:36:37] PROBLEM - puppet last run on analytics1010 is CRITICAL Puppet has 1 failures
[06:37:18] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 1 failures
[06:37:27] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:38:09] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures
[06:38:38] PROBLEM - puppet last run on mw1039 is CRITICAL Puppet has 1 failures
[06:45:08] RECOVERY - puppet last run on iron is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:46:18] RECOVERY - puppet last run on mw1088 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:58] RECOVERY - puppet last run on elastic1022 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:47:48] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:47:48] RECOVERY - puppet last run on mw2184 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:07] RECOVERY - puppet last run on analytics1010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:17] RECOVERY - puppet last run on mw1039 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:48:29] RECOVERY - puppet last run on lvs2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:57] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:42:21] good morning
[07:43:24] hey hashar
[07:43:47] !log reimage ms-be2013 T105213
[07:43:51] Logged the message, Master
[07:48:42] operations, Performance-Team: Adapt all the things to localized Special: namespaces - https://phabricator.wikimedia.org/T105434#1444134 (Krinkle) Can be and is by default, but not necessarily: * Don't localise special page name: https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/a71b5a41/...
[07:54:34] if some bright puppet gurus could help on a puppet include issue
[07:55:14] from the role::nodepool class, it can't find the 'nodepool' module when invoking it with class { '::nodepool': } :-(
[07:55:21] seems like a scope lookup issue
[07:56:55] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/223849 (https://phabricator.wikimedia.org/T104996) (owner: Dzahn)
[07:58:16] hashar: does puppet give a clear error, or just 'can't find class ::nodepool'?
[07:58:54] Could not find declared `class ::nodepool` at /etc/puppet/manifests/role/nodepool.pp:25 on node `labnodepool1001.eqiad.wmnet`
[07:58:59] valhallasw`cloud: puppet is not very helpful :D
[07:59:30] task being https://phabricator.wikimedia.org/T105406
[07:59:34] with example
[07:59:58] that must be an issue with puppet autoloader / namespace lookup
[08:00:23] hashar: role::zuul::install does the same, so the syntax looks sane
[08:00:34] potentially
[08:00:40] <_joe_> hashar: need help with puppet?
[08:00:50] _joe_: definitely [08:00:58] I have hit my level of incompetency there [08:01:05] task https://phabricator.wikimedia.org/T105406 has the detail [08:01:25] andrew and I got the nodepool role class and puppet module merged yesterday for labnodepool1001.eqiad.wmnet [08:01:30] but puppet goes wild [08:02:04] <_joe_> ok lemme take a look [08:02:31] at least I reproduced it on a lab instance [08:02:36] with a Precise puppet master [08:03:07] a related puppet bug is http://projects.puppetlabs.com/issues/10848 [08:03:17] seems class { '::foo': } behave differently than include ::foo [08:03:51] <_joe_> hashar: why not use include was my first curiosity [08:04:08] the role class has to load ::nodepool by giving it bunch of parameters [08:04:48] https://serverfault.com/questions/349046/could-not-find-class-and-yet-it-is-there suggests to just strace puppetmaster to see what is going on. I don't see any other nodepool-named stuff in ops/puppet, though [08:04:55] but I'll leave it to the experts now :-) [08:05:20] _joe_: https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/nodepool.pp#L16 [08:05:53] <_joe_> hashar: for the parameters, use hiera [08:06:03] <_joe_> in general [08:08:22] maybe I should rename the module entry point [08:08:29] to something like module/nodepool/manifests/server.pp [08:08:31] that might fix it [08:08:34] <_joe_> nope [08:08:42] <_joe_> keep calm and let's find the problem here [08:08:58] oh I am calm [08:09:01] <_joe_> :) [08:09:07] just pissed off that I can't find the problem :-/ [08:10:10] hashar: it's puppet, it's not your fault :-p [08:11:18] <_joe_> hashar: I find a different problem with your example [08:15:03] re [08:16:00] <_joe_> so well, hashar, if you do a simple inclusion, all works [08:16:51] 6operations, 6Multimedia, 6Performance-Team, 10Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1444161 (10Gilles) I didn't 
realize we did, so essentially if we introduce a brand new size, in terms of thumbnails we're actua...
[08:17:21] <_joe_> hashar: so I guess you screwed up something in some creative way here
[08:17:53] <_joe_> also, where is this happening? labnodepool is in prod right?
[08:21:04] (CR) Muehlenhoff: [C: 1] "The rules look good to me." [puppet] - https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[08:22:26] (CR) Muehlenhoff: [C: -2] "https://gerrit.wikimedia.org/r/#/c/223751/ contains a more general purpose solution." [puppet] - https://gerrit.wikimedia.org/r/223540 (owner: Matanya)
[08:22:39] <_joe_> hashar_: found the problem
[08:22:45] yeah that is in prod
[08:22:45] oh
[08:22:53] * hashar send beers next to Roma
[08:22:55] <_joe_> you created a nodepool module in labs/private
[08:23:08] <_joe_> which I can imagine has a correspondence in private
[08:23:19] <_joe_> which means you have two modules with the same name, yuck
[08:23:31] <_joe_> I don't even want to imagine what is happening in prod
[08:24:47] <_joe_> uhm actually no it works, it seems, apart from the fact that the private data doesn't get copied
[08:25:24] (CR) Muehlenhoff: "Although I have to add that adding the rules to the mw-rc-irc::ircserver class seems cleaner to me (since this is where the actual service" [puppet] - https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[08:26:35] _joe_: hoooooo
[08:26:44] so "just" have to change the private stuff :D
[08:31:32] (CR) Gilles: [C: 1] skip default vcl_recv for upload-frontend [puppet] - https://gerrit.wikimedia.org/r/223996 (owner: BBlack)
[08:33:27] <_joe_> hashar: nope
[08:36:38] <_joe_> hashar: still can't reproduce that error
[08:36:56] (PS1) Chmarkine: Secure GeoIP and WMF-Last-Access cookies [puppet] - https://gerrit.wikimedia.org/r/224029 (https://phabricator.wikimedia.org/T105451)
[08:39:59] operations,
ops-codfw, Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1444204 (fgiunchedi) a:fgiunchedi>Papaul @papaul please go ahead and order replacement the installation assumes all disks are present to go ahead otherwi...
[08:41:11] _joe_: I reproduced it on a labs instance though
[08:42:12] <_joe_> hashar: ok, where exactly?
[08:44:47] <_joe_> I mean what puppetmaster?
[08:44:56] on nodepool-t105406.integration.eqiad.wmflabs
[08:44:57] <_joe_> this really doesn't make sense :)
[08:45:03] <_joe_> that's the master?
[08:45:13] with the integration-puppetmaster.integration.eqiad.wmflabs has the puppet master
[08:45:51] but gotta tweak a few entries in hiera to make it pass on labs
[08:46:01] (PS1) Amire80: Set a different wmgContentTranslationDefaultSourceLanguage for English [mediawiki-config] - https://gerrit.wikimedia.org/r/224031 (https://phabricator.wikimedia.org/T105327)
[08:46:25] <_joe_> hashar: yes you're not reproducing this issue
[08:47:52] <_joe_> but locally with puppet apply --noop I obtain an error on the class declaration, since we don't pass dib_base_path
[08:48:27] ahrgh
[08:48:32] might be yet another error so
[08:49:11] (CR) KartikMistry: [C: 1] Set a different wmgContentTranslationDefaultSourceLanguage for English [mediawiki-config] - https://gerrit.wikimedia.org/r/224031 (https://phabricator.wikimedia.org/T105327) (owner: Amire80)
[08:50:27] <_joe_> Error: Invalid parameter pruge on File[/etc/nodepool/elements] at /home/joe/Code/WMF/puppet/modules/nodepool/manifests/init.pp:141 on node
[08:50:39] <_joe_> I am definitely evaluating the class on my computer :)
[08:50:44] great!
[08:50:54] <_joe_> but it's puppet 3.7 I guess
[08:52:22] (PS2) Filippo Giunchedi: Reduce read concurrency back to 32 [puppet] - https://gerrit.wikimedia.org/r/223957 (owner: GWicke)
[08:52:28] (CR) Filippo Giunchedi: [C: 2 V: 2] Reduce read concurrency back to 32 [puppet] - https://gerrit.wikimedia.org/r/223957 (owner: GWicke)
[08:52:31] (PS1) Giuseppe Lavagetto: BOGUS: attempt at making nodepool compile in my tests [puppet] - https://gerrit.wikimedia.org/r/224032
[08:53:54] PROBLEM - puppet last run on db2001 is CRITICAL puppet fail
[08:54:04] so the parameters errors would cause the class to fail initializing
[08:54:12] and puppet reports a faulty / misleading error ?
[08:54:14] <_joe_> hashar: I don't think so
[08:54:24] <_joe_> lemme dig, it will take some time
[08:55:35] (CR) Filippo Giunchedi: [C: 1] skip default vcl_recv for upload-frontend [puppet] - https://gerrit.wikimedia.org/r/223996 (owner: BBlack)
[08:56:42] (CR) Nikerabbit: [C: 1] Set a different wmgContentTranslationDefaultSourceLanguage for English [mediawiki-config] - https://gerrit.wikimedia.org/r/224031 (https://phabricator.wikimedia.org/T105327) (owner: Amire80)
[08:57:47] (PS1) Muehlenhoff: Install conntrack on hosts with the base::firewall class [puppet] - https://gerrit.wikimedia.org/r/224033
[08:59:43] RECOVERY - puppet last run on db2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:01:18] (CR) Filippo Giunchedi: [C: 1] deployment::server: move releases::upload into role [puppet] - https://gerrit.wikimedia.org/r/223464 (owner: Dzahn)
[09:06:20] <_joe_> hashar: so, even the puppet compiler seems not to complain
[09:06:25] <_joe_> so now I'm baffled
[09:06:26] :-(((
[09:06:47] you suggested the nodepool class in private could be the issue
[09:06:52] maybe it can be renamed?
[09:07:45] <_joe_> there is no nodepool class there, and the directory simply isn't linked into the /etc/puppet directory
[09:10:31] (CR) Alexandros Kosiaris: [C: 1] Install conntrack on hosts with the base::firewall class [puppet] - https://gerrit.wikimedia.org/r/224033 (owner: Muehlenhoff)
[09:12:38] (PS8) Alexandros Kosiaris: Enable firejail containment for zotero [puppet] - https://gerrit.wikimedia.org/r/220434 (https://phabricator.wikimedia.org/T98852) (owner: Muehlenhoff)
[09:13:21] (CR) Alexandros Kosiaris: [C: 2 V: 2] Enable firejail containment for zotero [puppet] - https://gerrit.wikimedia.org/r/220434 (https://phabricator.wikimedia.org/T98852) (owner: Muehlenhoff)
[09:13:45] (CR) Chmarkine: [C: -1] "Based on my understanding of https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution, I feel that simply adding " [puppet] - https://gerrit.wikimedia.org/r/224029 (https://phabricator.wikimedia.org/T105451) (owner: Chmarkine)
[09:17:16] operations, Traffic, HTTPS, HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1444258 (Chmarkine)
[09:26:30] (CR) Alexandros Kosiaris: [C: -1] ferm rules for bacula director (2 comments) [puppet] - https://gerrit.wikimedia.org/r/223849 (https://phabricator.wikimedia.org/T104996) (owner: Dzahn)
[09:30:50] (CR) Muehlenhoff: "Looks good to me, but why "bacula-sd-standalone"?
I think this class should rather only provide the rule for the sd/9103 (and drop it from" [puppet] - https://gerrit.wikimedia.org/r/223851 (https://phabricator.wikimedia.org/T104996) (owner: Dzahn)
[09:31:00] (PS2) Alexandros Kosiaris: Enable firejail for citoid [puppet] - https://gerrit.wikimedia.org/r/219811 (https://phabricator.wikimedia.org/T98851) (owner: Muehlenhoff)
[09:31:19] (CR) Alexandros Kosiaris: [C: -2] "should be done in https://gerrit.wikimedia.org/r/#/c/223849/" [puppet] - https://gerrit.wikimedia.org/r/223851 (https://phabricator.wikimedia.org/T104996) (owner: Dzahn)
[09:35:52] (CR) Alexandros Kosiaris: [C: 2] Enable firejail for citoid [puppet] - https://gerrit.wikimedia.org/r/219811 (https://phabricator.wikimedia.org/T98851) (owner: Muehlenhoff)
[09:40:51] (PS4) Alexandros Kosiaris: Enable firejail for mathoid [puppet] - https://gerrit.wikimedia.org/r/219331 (https://phabricator.wikimedia.org/T101870) (owner: Muehlenhoff)
[09:52:10] (CR) Muehlenhoff: "The ferm rules look good to me, but that seems like the kind of service which we should exempt from connection tracking (as per https://ge" [puppet] - https://gerrit.wikimedia.org/r/223844 (https://phabricator.wikimedia.org/T104970) (owner: Dzahn)
[09:54:47] (CR) Alexandros Kosiaris: [C: 2] Enable firejail for mathoid [puppet] - https://gerrit.wikimedia.org/r/219331 (https://phabricator.wikimedia.org/T101870) (owner: Muehlenhoff)
[10:01:34] operations, Services, Patch-For-Review: Service containment for nodejs-based services with firejail - https://phabricator.wikimedia.org/T101870#1444337 (MoritzMuehlenhoff)
[10:01:37] operations, Services, Patch-For-Review: containment for zotero - https://phabricator.wikimedia.org/T98852#1444335 (MoritzMuehlenhoff) Open>Resolved This is now enabled in production.
[10:01:53] operations, Services, Patch-For-Review: containment for Citoid - https://phabricator.wikimedia.org/T98851#1444338 (MoritzMuehlenhoff) Open>Resolved This is now enabled in production.
[10:01:56] operations, Services, Patch-For-Review: Service containment for nodejs-based services with firejail - https://phabricator.wikimedia.org/T101870#1350131 (MoritzMuehlenhoff)
[10:21:14] (PS1) Giuseppe Lavagetto: role::nodepool: use secret() [puppet] - https://gerrit.wikimedia.org/r/224039 (https://phabricator.wikimedia.org/T105406)
[10:24:21] today I learned about secret()
[10:25:55] (CR) Giuseppe Lavagetto: [C: 2] role::nodepool: use secret() [puppet] - https://gerrit.wikimedia.org/r/224039 (https://phabricator.wikimedia.org/T105406) (owner: Giuseppe Lavagetto)
[10:29:38] Puppet, operations, Continuous-Integration-Isolation, Nodepool, Patch-For-Review: puppet yields: Could not find declared class ::nodepool at /etc/puppet/manifests/role/nodepool.pp:25 - https://phabricator.wikimedia.org/T105406#1444425 (Joe)
[10:29:41] (Abandoned) Matanya: poolcounter: don't track connections on the firewall [puppet] - https://gerrit.wikimedia.org/r/223540 (owner: Matanya)
[10:30:39] Puppet, operations, Continuous-Integration-Isolation, Nodepool, Patch-For-Review: puppet yields: Could not find declared class ::nodepool at /etc/puppet/manifests/role/nodepool.pp:25 - https://phabricator.wikimedia.org/T105406#1444428 (Joe) Problem solved in production, now labnodepool reports ```...
[10:43:41] Puppet, operations, Continuous-Integration-Isolation, Nodepool, Patch-For-Review: puppet yields: Could not find declared class ::nodepool at /etc/puppet/manifests/role/nodepool.pp:25 - https://phabricator.wikimedia.org/T105406#1444447 (Joe) Open>Resolved p:Triage>High
[10:44:13] PROBLEM - puppet last run on mw1122 is CRITICAL Puppet has 1 failures
[10:44:33] PROBLEM - mathoid on sca1001 is CRITICAL: Connection refused
[10:50:03] RECOVERY - mathoid on sca1001 is OK: HTTP OK: HTTP/1.1 200 OK - 888 bytes in 0.046 second response time
[10:50:56] operations, MediaWiki-ResourceLoader, HHVM, MW-1.26-release, and 3 others: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1444458 (Joe) I re-did the test now, I'm seeing that the ratio got worse, moving from ~ 51% to 54% over the last few hours, so it's cle...
[10:55:13] RECOVERY - puppet last run on mw1122 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures
[10:58:25] (PS1) Hashar: nodepool: switch SSH public key to private repo [puppet] - https://gerrit.wikimedia.org/r/224049
[10:59:34] <_joe_> why?
[10:59:42] <_joe_> the public key is public...
[11:00:21] (CR) Hashar: "The related labs/private.git change is https://gerrit.wikimedia.org/r/#/c/224047/ using secret() as devised by _joe_" [puppet] - https://gerrit.wikimedia.org/r/224049 (owner: Hashar)
[11:01:02] _joe_: this way if we change the key, we do a single change to the private repo
[11:01:10] <_joe_> hashar: no.
[11:01:12] instead of updating the private one in private and the public one in operations/puppet ?
[11:01:19] <_joe_> the private repo is for private data
[11:01:34] I thought it would be simpler to keep them next to each other since that is a pair
[11:06:20] (PS2) Hashar: nodepool: fill in SSH public key [puppet] - https://gerrit.wikimedia.org/r/224049
[11:06:43] (CR) Hashar: "Per _joe fill in the SSH Public key directly in the operations/puppet role class." [puppet] - https://gerrit.wikimedia.org/r/224049 (owner: Hashar)
[11:06:54] _joe_: https://gerrit.wikimedia.org/r/#/c/224049/ should be better :}
[11:07:05] and thanks a ton for the module mess and use of secret()
[11:10:05] operations, Performance-Team, Traffic, Performance: Optimize prod's resource domains for SPDY/HTTP2 - https://phabricator.wikimedia.org/T94896#1444505 (Krinkle)
[11:16:24] lunchhh
[11:23:54] operations, Monitoring: Fix monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729#1444556 (fgiunchedi) a:fgiunchedi>None unlikely I'll be able to work on this anytime soon -> up for grabs
[11:38:54] (PS22) Paladox: Rename all main WikimediaIncubator settings to have a wg prefix [mediawiki-config] - https://gerrit.wikimedia.org/r/207909
[11:44:13] PROBLEM - puppet last run on mw2102 is CRITICAL puppet fail
[12:02:23] RECOVERY - Incoming network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0]
[12:02:53] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused
[12:03:03] RECOVERY - puppet last run on mw2102 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:03:13] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[12:04:52] on it ^^
[12:07:02] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[12:08:42] RECOVERY - Cassanda CQL query interface on restbase1004 is OK:
TCP OK - 0.001 second response time on port 9042
[12:15:43] PROBLEM - Incoming network saturation on labstore1001 is CRITICAL 18.52% of data above the critical threshold [100000000.0]
[12:20:22] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[12:20:39] really?
[12:22:03] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused
[12:24:22] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[12:25:53] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.001 second response time on port 9042
[12:26:36] (CR) Muehlenhoff: [C: 1] "Current version looks good, also successfully tested on a labs instance. I'll merge this on Monday unless there are further objections." [puppet] - https://gerrit.wikimedia.org/r/223560 (owner: Matanya)
[12:37:23] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[12:38:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0]
[12:39:02] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused
[12:40:21] !log bounce cassandra on restbae1004
[12:40:26] Logged the message, Master
[12:41:23] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[12:42:53] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.010 second response time on port 9042
[12:46:08] (CR) Andrew Bogott: nodepool: fill in SSH public key (1 comment) [puppet] - https://gerrit.wikimedia.org/r/224049 (owner: Hashar)
[12:52:03] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[12:53:44]
(PS2) Muehlenhoff: Install conntrack on hosts with the base::firewall class [puppet] - https://gerrit.wikimedia.org/r/224033
[12:53:53] (CR) Muehlenhoff: [C: 2 V: 2] Install conntrack on hosts with the base::firewall class [puppet] - https://gerrit.wikimedia.org/r/224033 (owner: Muehlenhoff)
[13:12:09] (PS1) coren: Labs: Script to back labstore filesystems up [puppet] - https://gerrit.wikimedia.org/r/224064 (https://phabricator.wikimedia.org/T105027)
[13:17:53] operations: Remove poolcounter from mw1154 for housecleaning - https://phabricator.wikimedia.org/T105380#1444743 (Southparkfan)
[13:17:54] operations: Revert mw1154 from being a poolcounter after helium is deemed fine again - https://phabricator.wikimedia.org/T105379#1444740 (Southparkfan) Open>Resolved a:Southparkfan Patch has been merged.
[13:26:03] operations, MediaWiki-ResourceLoader, HHVM, MW-1.26-release, and 3 others: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1444761 (Joe) I tried to reproduce what I'm seeing in production in a testing environment, and I can't seem to trigger that behaviour:...
[13:28:05] <_joe_> what's worse than an annoying bug? An annoying bug you can't reproduce in a test
[13:29:42] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[13:37:27] <_joe_> !log temporarily repooled mw1031
[13:37:32] Logged the message, Master
[13:38:21] andrewbogott: good morning! _joe_ figured out the puppet error causing nodepool class to not be found
[13:38:32] yeah, I saw — duplicate module name in private, right?
[13:38:34] andrewbogott: the module was defined in private repo :-/
[13:38:53] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused
[13:38:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[13:39:04] (PS2) Muehlenhoff: Add ferm rules for swift proxies [puppet] - https://gerrit.wikimedia.org/r/223537
[13:39:09] makes sense, in retrospect
[13:39:13] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[13:43:27] !log bounce cassandra on restbae1004
[13:43:50] (CR) Hashar: nodepool: fill in SSH public key (1 comment) [puppet] - https://gerrit.wikimedia.org/r/224049 (owner: Hashar)
[13:44:39] (PS3) Hashar: nodepool: fill in SSH public key [puppet] - https://gerrit.wikimedia.org/r/224049
[13:44:53] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[13:46:23] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.004 second response time on port 9042
[13:48:21] (PS1) Muehlenhoff: Add ferm rules for swift backends [puppet] - https://gerrit.wikimedia.org/r/224071 (https://phabricator.wikimedia.org/T104965)
[13:52:05] (CR) Faidon Liambotis: [C: 1] "Duh!"
[puppet] - https://gerrit.wikimedia.org/r/223996 (owner: BBlack)
[13:52:22] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[13:53:52] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused
[13:54:26] (PS2) BBlack: skip default vcl_recv for upload-frontend [puppet] - https://gerrit.wikimedia.org/r/223996
[13:54:34] (CR) BBlack: [C: 2 V: 2] skip default vcl_recv for upload-frontend [puppet] - https://gerrit.wikimedia.org/r/223996 (owner: BBlack)
[13:56:10] (CR) Andrew Bogott: "why not in modules/nodepool/files?" [puppet] - https://gerrit.wikimedia.org/r/224049 (owner: Hashar)
[13:58:17] andrewbogott: Seems to me the modules should be as agnostic as possible aren't they ? With custom configuration to be handled via hiera / role class / global level.
[13:58:36] hashar: hm, maybe...
[13:59:00] Although we don’t aspire to reuse modules really. Do you see examples elsewhere?
[13:59:03] that is why i originally put the pub key directly as a role parameter
[14:00:09] yeah, I see what you mean...
[14:00:25] I’m not sure what the right answer is, but let’s just go back to the version with it inline since that doesn’t require an answer :)
[14:00:32] sorry for the run-around
[14:01:28] operations, ops-eqiad: install 10g NIC card to labnet1002 - https://phabricator.wikimedia.org/T103849#1444882 (Cmjohnson) Fairly confident the card is not working properly. Working with Dell to RMA.
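A rough sketch of how the duplicate `nodepool` module diagnosed above could yield "Could not find declared class ::nodepool". It assumes Puppet's documented first-match-wins lookup across modulepath entries; the directory names, their order and the helper functions are hypothetical, since the log never states the actual modulepath on the puppetmaster.

```python
import os
import tempfile


def find_module(modulepath, name):
    """Return the first directory on modulepath holding a module called `name`."""
    for base in modulepath:
        candidate = os.path.join(base, name)
        if os.path.isdir(candidate):
            return candidate  # later entries with the same name are shadowed
    return None


def find_class_manifest(modulepath, name):
    """Return the init.pp that would define class ::name, or None.

    Only the first matching module directory is consulted, so a same-named
    module without manifests/init.pp hides the real one further down.
    """
    module_dir = find_module(modulepath, name)
    if module_dir is None:
        return None
    manifest = os.path.join(module_dir, "manifests", "init.pp")
    return manifest if os.path.isfile(manifest) else None


def build_demo_tree(root):
    """Create a hypothetical layout: a data-only private module plus the real one."""
    private = os.path.join(root, "private", "modules")
    public = os.path.join(root, "puppet", "modules")
    os.makedirs(os.path.join(private, "nodepool", "files"))
    os.makedirs(os.path.join(public, "nodepool", "manifests"))
    with open(os.path.join(public, "nodepool", "manifests", "init.pp"), "w") as f:
        f.write("class nodepool { }\n")
    return private, public


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        private, public = build_demo_tree(root)
        # Private first: the data-only module shadows the real class.
        print(find_class_manifest([private, public], "nodepool"))
        # Public first: the class manifest is found.
        print(find_class_manifest([public, private], "nodepool"))
```

Under these assumptions the lookup fails only when the data-only module comes first, which would fit the class resolving on one machine and not on another with a different modulepath.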
[14:02:47] andrewbogott: tis ok :-}
[14:02:59] that is simple enough I dont mind the run around
[14:04:10] (PS4) Hashar: nodepool: fill in SSH public key [puppet] - https://gerrit.wikimedia.org/r/224049
[14:04:27] (CR) Hashar: "Back to patchset 2 with inlined ssh pub key passed to the class" [puppet] - https://gerrit.wikimedia.org/r/224049 (owner: Hashar)
[14:04:41] patchset 4 should work https://gerrit.wikimedia.org/r/#/c/224049/4/manifests/role/nodepool.pp,unified
[14:04:47] so for the story
[14:04:52] I ended up to bed at 1am yesterday
[14:04:59] cause of that ::nodepool madness
[14:05:09] forgot to reproduce with private repo ://
[14:06:03] (PS5) Andrew Bogott: nodepool: fill in SSH public key [puppet] - https://gerrit.wikimedia.org/r/224049 (owner: Hashar)
[14:07:16] (CR) Andrew Bogott: [C: 2] nodepool: fill in SSH public key [puppet] - https://gerrit.wikimedia.org/r/224049 (owner: Hashar)
[14:10:11] yikes
[14:14:45] I should have done much of that on a labs instance
[14:14:49] :-/
[14:17:35] operations, Wikimedia-IEG-grant-review: move iegreview to a VM - https://phabricator.wikimedia.org/T105007#1444980 (Dzahn) This is an actual application, not static HTML, so i don't want to put it on bromine.eqiad.wmnet but would prefer another separate VM.
(@akosiaris what do you think)
[14:19:47] * mobrovac on rb1004 cassandra
[14:22:32] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[14:23:54] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.010 second response time on port 9042
[14:23:54] (PS1) Dzahn: switch policysite role from zircon to bromine [puppet] - https://gerrit.wikimedia.org/r/224078 (https://phabricator.wikimedia.org/T105006)
[14:25:21] (PS1) Hashar: nodepool: create dib_base_path (/srv/dib) [puppet] - https://gerrit.wikimedia.org/r/224079
[14:25:26] (PS2) Faidon Liambotis: (WIP) Remove support for Ubuntu Lucid/10.04 [puppet] - https://gerrit.wikimedia.org/r/179888
[14:26:02] (CR) jenkins-bot: [V: -1] nodepool: create dib_base_path (/srv/dib) [puppet] - https://gerrit.wikimedia.org/r/224079 (owner: Hashar)
[14:27:39] (PS2) Hashar: nodepool: create dib_base_path (/srv/dib) [puppet] - https://gerrit.wikimedia.org/r/224079
[14:28:22] (PS1) BBlack: Remove partman/raid1-varnish.cfg (no longer in use) [puppet] - https://gerrit.wikimedia.org/r/224080
[14:28:38] (CR) BBlack: [C: 2 V: 2] Remove partman/raid1-varnish.cfg (no longer in use) [puppet] - https://gerrit.wikimedia.org/r/224080 (owner: BBlack)
[14:28:52] (PS1) Dzahn: policysite: update Apache for 2.4, switch misc-web [puppet] - https://gerrit.wikimedia.org/r/224081 (https://phabricator.wikimedia.org/T105006)
[14:29:25] (PS1) Andrew Bogott: Move mc-labs to mc-deploymentprep [mediawiki-config] - https://gerrit.wikimedia.org/r/224082
[14:30:56] (PS2) Dzahn: switch policysite role from zircon to bromine [puppet] - https://gerrit.wikimedia.org/r/224078 (https://phabricator.wikimedia.org/T105006)
[14:31:18] (CR) Andrew Bogott: [C: -1] "This is insufficient clearly, since I can't figure out where this file is referenced."
[mediawiki-config] - https://gerrit.wikimedia.org/r/224082 (owner: Andrew Bogott)
[14:31:57] operations, Wikimedia-Wikimania-Scholarships: move wikimania_scholarships to a VM - https://phabricator.wikimedia.org/T105003#1445035 (Dzahn) This is an actual app, as opposed to static HTML content, so i would like to put this on something other than bromine. (@akosiaris)
[14:33:07] (PS2) Dzahn: policysite: update Apache for 2.4, switch misc-web [puppet] - https://gerrit.wikimedia.org/r/224081 (https://phabricator.wikimedia.org/T105006)
[14:33:40] (PS3) Faidon Liambotis: (WIP) Remove support for Ubuntu Lucid/10.04 [puppet] - https://gerrit.wikimedia.org/r/179888
[14:34:01] (CR) Dzahn: [C: 2] switch policysite role from zircon to bromine [puppet] - https://gerrit.wikimedia.org/r/224078 (https://phabricator.wikimedia.org/T105006) (owner: Dzahn)
[14:34:09] (CR) Dzahn: [C: 2] policysite: update Apache for 2.4, switch misc-web [puppet] - https://gerrit.wikimedia.org/r/224081 (https://phabricator.wikimedia.org/T105006) (owner: Dzahn)
[14:35:23] (CR) BBlack: "Yeah we may have to hold off on switching Last-Access for now.
In general, we'll probably eventually exclude HTTP on the primary clusters" [puppet] - https://gerrit.wikimedia.org/r/224029 (https://phabricator.wikimedia.org/T105451) (owner: Chmarkine)
[14:36:15] (Abandoned) Andrew Bogott: Move mc-labs to mc-deploymentprep [mediawiki-config] - https://gerrit.wikimedia.org/r/224082 (owner: Andrew Bogott)
[14:38:02] (PS1) Eevans: update collector version [software/cassandra-metrics-collector] - https://gerrit.wikimedia.org/r/224084
[14:38:52] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused
[14:39:13] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[14:39:18] (Abandoned) Eevans: update collector version [software/cassandra-metrics-collector] - https://gerrit.wikimedia.org/r/224084 (owner: Eevans)
[14:43:13] (PS1) Andrew Bogott: Don't use nutcracker on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/224087
[14:44:12] * mobrovac on cass rb1004
[14:44:46] (PS1) Giuseppe Lavagetto: hhvm: set apc expiration correctly [puppet] - https://gerrit.wikimedia.org/r/224088 (https://phabricator.wikimedia.org/T104769)
[14:45:02] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[14:45:46] operations, Patch-For-Review: move policysite to a VM - https://phabricator.wikimedia.org/T105006#1445079 (Dzahn) ``` [terbium:~] $ apache-fast-test policy.url bromine.eqiad.wmnet testing 2 urls on 1 servers, totalling 2 requests spawning threads.. http://policy.wikimedia.org * 301 Moved Permanently http...
[14:45:51] operations, Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1445085 (Dzahn)
[14:45:52] operations, Patch-For-Review: move policysite to a VM - https://phabricator.wikimedia.org/T105006#1445083 (Dzahn) Open>Resolved a:Dzahn
[14:46:07] (PS2) Andrew Bogott: Don't use nutcracker on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/224087 (https://phabricator.wikimedia.org/T102993)
[14:46:23] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.001 second response time on port 9042
[14:46:54] (CR) Giuseppe Lavagetto: [C: 2] hhvm: set apc expiration correctly [puppet] - https://gerrit.wikimedia.org/r/224088 (https://phabricator.wikimedia.org/T104769) (owner: Giuseppe Lavagetto)
[14:47:43] operations, Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1433059 (Dzahn) So, i moved all the sites that are just static content to bromine.eqiad.wmnet. The remaining 3 sites are actual apps as opposed to that. Therefore i think a separate VM...
[14:47:49] (PS1) Filippo Giunchedi: update collector version [software/cassandra-metrics-collector] - https://gerrit.wikimedia.org/r/224090
[14:48:35] (CR) Filippo Giunchedi: [C: 2 V: 2] update collector version [software/cassandra-metrics-collector] - https://gerrit.wikimedia.org/r/224090 (owner: Filippo Giunchedi)
[14:54:25] is there now swat on Fridays?
[14:54:31] um… no swat, I mean
[14:54:33] (PS1) Filippo Giunchedi: diamond: add upstart/systemd service stats [puppet] - https://gerrit.wikimedia.org/r/224093
[14:54:35] (PS1) Filippo Giunchedi: diamond: service stats puppet integration [puppet] - https://gerrit.wikimedia.org/r/224094
[14:54:58] (PS1) Muehlenhoff: WIP: ferm rules for elasticsearch [puppet] - https://gerrit.wikimedia.org/r/224095 (https://phabricator.wikimedia.org/T104962)
[14:55:12] andrewbogott: not unless greg-g approves a change to go because friday deploys are usually evil
[14:55:19] (CR) jenkins-bot: [V: -1] diamond: add upstart/systemd service stats [puppet] - https://gerrit.wikimedia.org/r/224093 (owner: Filippo Giunchedi)
[14:55:21] ok
[14:55:32] And gregs not here this week, I think?
[14:55:44] Krenair or thcipriani can I get a review for https://gerrit.wikimedia.org/r/#/c/224087/ ?
[14:56:01] It doesn’t need to get merged today of course
[14:56:51] is memcached running on port 11000 on silver?
[14:57:13] yes
[14:57:32] I chatted with it a bit over telnet to verify it was returning the same things as nutcracker
[14:57:48] (PS1) Filippo Giunchedi: cassandra: upgrade cassandra-metrics-collector to latest version [puppet] - https://gerrit.wikimedia.org/r/224096
[14:58:02] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0]
[14:58:04] <_joe_> well, greg is not here, but someone is filling his boots of course
[14:59:12] any particular reason why we're special casing wikitech with/without nutcracker btw?
[14:59:32] RECOVERY - Incoming network saturation on labstore1001 is OK Less than 10.00% above the threshold [75000000.0]
[15:00:26] godog: mostly because nutcracker has failed a couple of times in the last couple of weeks. And also the wikitech use case is dumb — nutcracker is just a proxy so it’s not contributing anything.
[15:00:51] So it’s a choice between needless complexity in config vs.
needless complexity of actually running software
[15:01:12] andrewbogott: true, however nutcracker has been showing some problems in production too
[15:01:31] operations, Labs, wikitech.wikimedia.org: intermittent wikitech (or nutcracker) failures - https://phabricator.wikimedia.org/T105131#1445155 (fgiunchedi)
[15:01:43] the other day I catched it while it was hung, ^ is the result
[15:02:10] godog: yeah, probably the same thing as happening on wikitech.
[15:02:21] But, wikitech should be minimal, reliable even when the cluster is broken
[15:02:51] we also have wikitech-static for that though
[15:02:56] true
[15:03:04] Ops-Access-Requests, operations, Continuous-Integration-Isolation: Get Dan Duvall TEMP root to labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T102133#1445172 (hashar)
[15:03:17] operations, Continuous-Integration-Infrastructure, Continuous-Integration-Isolation, Nodepool: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1445175 (hashar)
[15:03:39] operations, Continuous-Integration-Infrastructure, Continuous-Integration-Isolation, Nodepool: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1445178 (hashar) a:hashar Lets give this a try. Will do it in operations/puppet.git for now then "upstream" it in the .deb package.
[15:03:43] andrewbogott: I didn't want to rain on your parade but it does feeling like sweeping it under the rug :)
[15:04:29] godog: yeah, that’s reasonable. In addition to the recent failures, though, every few weeks someone asks me “Why are we running nutcracker on wikitech? It doesn’t do anything?
[15:04:32] "
[15:04:54] So I had that task opened from quite a while ago, I’m just trying to clean up. I don’t feel that strongly, if you want to make a case to leave things as is in the bug.
[15:05:49] I'll give it a go
[15:06:13] Coren: are we backed up now?
[15:06:42] andrewbogott: In progress, for the tools fs.
It's going well enough, so I'll start others and maps later today. [15:07:02] PROBLEM - HHVM rendering on mw2127 is CRITICAL - Socket timeout after 10 seconds [15:08:43] RECOVERY - HHVM rendering on mw2127 is OK: HTTP OK: HTTP/1.1 200 OK - 72040 bytes in 1.323 second response time [15:09:18] Coren: cool [15:09:34] andrewbogott, did you read the readme in operations/mediawiki-config? it explains all about the -labs, -production, -eqiad, etc. suffixes [15:10:04] Krenair: no, but I think I understand it now. Does my patch interfere with that somehow? [15:10:28] 6operations, 6Labs, 10wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1445193 (10fgiunchedi) [15:10:32] andrewbogott, https://gerrit.wikimedia.org/r/#/c/224082/ would've broken things [15:10:40] Krenair: yep, hence abandoned [15:12:17] 6operations, 6Labs, 10wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1437053 (10fgiunchedi) note some nutcracker problems have been observed in production too in the past, what was the situation there is unknown though ``` #wikimedia-operations_2015-03.l... [15:13:12] 6operations, 6Services, 10service-template-node, 7service-runner: Log levels not being respected on service-runner services on SCA - https://phabricator.wikimedia.org/T105500#1445202 (10mobrovac) 3NEW a:3mobrovac [15:16:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: upgrade cassandra-metrics-collector to latest version [puppet] - 10https://gerrit.wikimedia.org/r/224096 (owner: 10Filippo Giunchedi) [15:18:19] 6operations, 6Services, 10service-template-node, 7service-runner: Log levels not being respected on service-runner services on SCA - https://phabricator.wikimedia.org/T105500#1445226 (10mobrovac) Differences: - working: `service-runner @ 0.1.8` and `bunyan @ 1.3.5` - not working: `service-runner @ 0.1.10`... 
[15:23:17] 6operations, 6Multimedia, 6Performance-Team, 10Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1445230 (10Edokter) I think we should build on multiples of 120 and 160. That should result in the following common cached imag... [15:24:32] PROBLEM - Cassanda CQL query interface on restbase1004 is CRITICAL: Connection refused [15:25:02] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [15:25:20] !log bounce cassandra on restbase1004 [15:25:25] k cool [15:25:26] thnx godog [15:25:34] PROBLEM - Restbase root url on restbase1004 is CRITICAL - Socket timeout after 10 seconds [15:25:59] np mobrovac [15:26:16] on rb [15:26:53] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [15:26:53] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [15:27:16] godog: you restarted rb on rb1004? [15:27:32] ah no, started 10h ago [15:27:38] but root url critical? [15:27:38] uf [15:27:45] mobrovac: I think bad timing [15:27:50] probably [15:28:13] RECOVERY - Cassanda CQL query interface on restbase1004 is OK: TCP OK - 0.004 second response time on port 9042 [15:28:57] godog: nope, all workers died, but the master didn't respawn them [15:29:06] !log restbase restarted restbase on restbase1004 [15:29:10] Logged the message, Master [15:29:35] 6operations, 6Multimedia, 6Performance-Team, 10Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1445268 (10Krinkle) >>! In T65440#1442794, @Gilles wrote: > Wouldn't it be interesting to explore using 800 displayed as 400px...
[15:30:59] mobrovac: ah that thing again, seems to correlate with cassandra [15:31:12] RECOVERY - Restbase root url on restbase1004 is OK: HTTP OK: HTTP/1.1 200 - 15149 bytes in 0.005 second response time [15:31:12] most likely, yes [15:32:43] mobrovac: btw I've merged the pending code review for cassandra, we should reenable puppet cc gwicke [15:33:31] kk [15:33:58] let's make sure there are no additional changes there on the nodes though [15:34:24] yep [15:34:39] (03PS1) 10Chad: Beta: Move wikidata.beta.wmflabs.org to static mappings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224100 [15:35:33] PROBLEM - puppet last run on restbase1001 is CRITICAL Puppet last ran 20 hours ago [15:37:24] RECOVERY - puppet last run on restbase1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:37:55] (03PS2) 10Chad: Multiversion: Remove beta/prod distinction in site detection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224100 [15:38:01] (03CR) 10jenkins-bot: [V: 04-1] Multiversion: Remove beta/prod distinction in site detection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224100 (owner: 10Chad) [15:38:06] mobrovac: https://phabricator.wikimedia.org/P937 [15:38:49] ah right, godog, we need those settings lowered [15:40:47] (03PS3) 10Chad: Beta: Move wikidata.beta.wmflabs.org to static mappings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224100 [15:41:15] (03CR) 10Chad: "Restored PS1 in PS3. PS2 tried to do it all and broke tests. Needs more work."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/224100 (owner: 10Chad) [15:43:51] mobrovac: ack, I'm asking because there's a new version of the metrics collector lined up in puppet [15:44:35] oh right [15:45:01] (03PS3) 10BBlack: sslcert: refactor ::certificate/::std_cert around secret() [puppet] - 10https://gerrit.wikimedia.org/r/224003 [15:45:29] godog: we needed to make sure that those settings were applied over night; let me create a puppet patch for those [15:46:49] gwicke: ack [15:46:55] (03CR) 10BBlack: [C: 032] sslcert: refactor ::certificate/::std_cert around secret() [puppet] - 10https://gerrit.wikimedia.org/r/224003 (owner: 10BBlack) [15:47:08] (03PS1) 10Hashar: nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) [15:47:53] (03PS5) 10Krinkle: Separate private wiki results in mwgrep [puppet] - 10https://gerrit.wikimedia.org/r/214037 (owner: 10Alex Monk) [15:48:55] (03PS3) 10Hashar: nodepool: create dib_base_path (/srv/dib) [puppet] - 10https://gerrit.wikimedia.org/r/224079 [15:49:09] (03PS2) 10Hashar: nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) [15:49:32] systemd is not that hard to configure :-} [15:49:54] nope, it's rather straightforward [15:51:26] (03PS1) 10GWicke: Improve stability by reducing concurrent reads / writes [puppet] - 10https://gerrit.wikimedia.org/r/224103 [15:52:09] <_joe_> systemd is awesome [15:52:18] indeed [15:52:31] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Puppetize Postgres 9.4 + Postgis 2.1 role for Maps Deployment - https://phabricator.wikimedia.org/T105070#1445320 (10akosiaris) The puppet postgres module that already exists in https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modul... 
[15:52:38] (03PS6) 10Krinkle: mwgrep: Split results between public and private wikis [puppet] - 10https://gerrit.wikimedia.org/r/214037 (owner: 10Alex Monk) [15:52:48] (03CR) 10Krinkle: "Rebased to resolve merge conflict." [puppet] - 10https://gerrit.wikimedia.org/r/214037 (owner: 10Alex Monk) [15:53:03] godog: https://gerrit.wikimedia.org/r/#/c/224103/ [15:53:11] (03CR) 10Krinkle: [C: 031] "Added line break between the sections and changed to sentence case." [puppet] - 10https://gerrit.wikimedia.org/r/214037 (owner: 10Alex Monk) [15:53:54] (03PS2) 10Filippo Giunchedi: Improve stability by reducing concurrent reads / writes [puppet] - 10https://gerrit.wikimedia.org/r/224103 (owner: 10GWicke) [15:54:03] (03CR) 10Mobrovac: [C: 031] Improve stability by reducing concurrent reads / writes [puppet] - 10https://gerrit.wikimedia.org/r/224103 (owner: 10GWicke) [15:54:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] "*rubberstamp*" [puppet] - 10https://gerrit.wikimedia.org/r/224103 (owner: 10GWicke) [15:54:19] hehehe [15:54:32] (03PS3) 10BBlack: test puppetized dual-cert on cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/224012 [15:55:00] godog: are you re-enabling puppet? [15:55:12] gwicke: yeah I'll confirm the changes first and reenable [15:55:19] cool, thanks! [15:56:12] now to fix the source of those bursty writes.. [15:56:35] (03CR) 10BBlack: [C: 032] test puppetized dual-cert on cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/224012 (owner: 10BBlack) [15:56:39] 6operations, 6Discovery, 10Maps, 6Services, 3Discovery-Maps-Sprint: Puppetize Kartotherian for maps deployment - https://phabricator.wikimedia.org/T105074#1445332 (10akosiaris) service::node should be used to create the puppet module and then the puppet role using that module. The best examples of this a... 
[15:57:30] gwicke: check in with urandom, he might be on it [15:57:46] yeah, I'm just creating a task [15:59:06] 6operations, 7Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1445341 (10Dzahn) a:3Dzahn [15:59:21] 6operations, 7Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1433059 (10Dzahn) [15:59:24] (03CR) 10Andrew Bogott: [C: 032] nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [15:59:35] andrewbogott: that one is totally untested :D [15:59:53] 6operations, 10Wikimedia-Wikimania-Scholarships: move wikimania_scholarships to a VM - https://phabricator.wikimedia.org/T105003#1445352 (10Dzahn) [16:00:07] hashar: ah, maybe you should test on labs a bit then :) [16:00:43] PROBLEM - puppet last run on restbase1006 is CRITICAL Puppet last ran 20 hours ago [16:00:43] PROBLEM - puppet last run on restbase1004 is CRITICAL Puppet last ran 21 hours ago [16:00:49] 6operations: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1445357 (10Dzahn) Alex said we should look a bit at grafana usage first. 
[16:01:18] (03PS2) 10Filippo Giunchedi: diamond: service stats puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/224094 [16:01:20] (03PS2) 10Filippo Giunchedi: diamond: add upstart/systemd service stats [puppet] - 10https://gerrit.wikimedia.org/r/224093 [16:01:33] PROBLEM - puppet last run on restbase1003 is CRITICAL Puppet last ran 20 hours ago [16:01:54] PROBLEM - puppet last run on restbase1005 is CRITICAL Puppet last ran 20 hours ago [16:02:53] 6operations: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1445367 (10Dzahn) 3NEW [16:03:20] 6operations, 7Tracking: tracking: move all misc services from zirconium to a VM - https://phabricator.wikimedia.org/T104946#1445375 (10Dzahn) [16:03:21] 6operations: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1445374 (10Dzahn) [16:03:32] 6operations: reclaim zirconium - https://phabricator.wikimedia.org/T105510#1445376 (10Dzahn) p:5Triage>3Low [16:08:37] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review: Protect background jobs from unhandled exceptions - https://phabricator.wikimedia.org/T104581#1445396 (10mobrovac) 5Open>3Resolved a:3mobrovac This has been fixed and confirmed to work in production. 
[16:09:24] RECOVERY - puppet last run on restbase1005 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [16:09:52] (03PS1) 10Hashar: nodepool: setup python logger [puppet] - 10https://gerrit.wikimedia.org/r/224106 [16:10:08] (03CR) 10Hashar: [C: 04-1] "Have to test that one" [puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [16:10:22] RECOVERY - puppet last run on restbase1004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:11:07] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1445411 (10BBlack) Now that all of the other supporting work is merged in puppet, cp1008 aka "pinkunicorn.wikimedia.org" now has a fully-puppetized test of... [16:11:13] 6operations, 7Icinga, 5Patch-For-Review: monitor HTTP on bromine.eqiad.wmnet - https://phabricator.wikimedia.org/T104948#1445412 (10Dzahn) this has been added: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=bromine&service=Static+Bugzilla+HTTP [16:11:38] (03CR) 10Hashar: "Probably need to test that on labs first." [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [16:12:04] andrewbogott: thanks for all the reviews :} https://gerrit.wikimedia.org/r/#/c/224079/3 should be able to land [16:12:14] andrewbogott: forgot to add a parameter [16:12:16] 6operations, 7Icinga, 5Patch-For-Review: monitor HTTP on bromine.eqiad.wmnet - https://phabricator.wikimedia.org/T104948#1445414 (10Dzahn) resolved (because it already checks the Apache per se) or more checks, one for each virtual host. [16:12:50] 6operations, 10ops-codfw, 7Swift: ms-be2013 - swift-storage/sdc1 is not accessible: Input/output error - https://phabricator.wikimedia.org/T105213#1445415 (10Papaul) Will have the replacement disk on site on Monday.
[16:12:56] 6operations, 7Icinga, 5Patch-For-Review: monitor HTTP on bromine.eqiad.wmnet - https://phabricator.wikimedia.org/T104948#1445416 (10Dzahn) p:5Normal>3Low [16:13:25] andrewbogott: but no hurry. I am going off now! Have a good afternoon [16:16:44] RECOVERY - puppet last run on restbase1003 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:19:52] 6operations, 6Analytics-Backlog, 10Deployment-Systems, 6Performance-Team, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1445428 (10Milimetric) Hey guys, so we are trying to get away from servicing ad-hoc data requests, but we... [16:19:53] RECOVERY - puppet last run on restbase1006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:23:39] (03PS1) 10Chad: Prevent race condition when writing settings to cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224107 (https://phabricator.wikimedia.org/T103744) [16:25:53] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.67% of data above the critical threshold [1000.0] [16:26:44] 6operations, 10MediaWiki-ResourceLoader, 7HHVM, 5MW-1.26-release, and 3 others: HHVM memory leaks result in OOMs & 500 spikes - https://phabricator.wikimedia.org/T104769#1445453 (10Joe) Turns out I couldn't reproduce it because well, we did set the wrong hhvm ini key :P [16:30:21] (03PS4) 10Andrew Bogott: nodepool: create dib_base_path (/srv/dib) [puppet] - 10https://gerrit.wikimedia.org/r/224079 (owner: 10Hashar) [16:32:34] (03CR) 10Andrew Bogott: [C: 032] nodepool: create dib_base_path (/srv/dib) [puppet] - 10https://gerrit.wikimedia.org/r/224079 (owner: 10Hashar) [16:33:12] 6operations, 6Services, 5Patch-For-Review: Service containment for nodejs-based services with firejail - https://phabricator.wikimedia.org/T101870#1445491 (10MoritzMuehlenhoff) [16:33:15] 6operations, 10Mathoid, 6Services, 5Patch-For-Review: Confine Mathoid with 
firejail - https://phabricator.wikimedia.org/T103094#1445489 (10MoritzMuehlenhoff) 5Open>3Resolved Firejail is now enabled in production. [16:35:26] !log restbase deploying hotfix for T105509 [16:35:30] Logged the message, Master [16:36:47] Hi, can anyone help me figure out what's up with an i18n message that isn't getting included in a CentralNotice banner correctly? In Japanese, https://meta.wikimedia.org/w/index.php?title=Wikinews/Licensure_Poll/GFDL_CC-BY-SA/O/For&banner=wm2015register&uselang=ja&force=1 However the messages should be https://meta.wikimedia.org/wiki/MediaWiki:Centralnotice-wm2015register-text1/ja and https://meta.wikimedia.org/wiki/MediaWiki:Centralno [16:39:15] 6operations, 6Analytics-Backlog, 10Analytics-EventLogging, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1445525 (10ggellerman) [16:40:13] ostriches: https://gerrit.wikimedia.org/r/#/c/223979/ getting in the way of my vagrant [16:43:38] {{done}} [16:45:05] (03CR) 10Dduvall: "How'd the compiler test go?" [puppet] - 10https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: 10Dduvall) [16:47:49] 6operations, 7Easy, 5Patch-For-Review: server admin log should include year in date (again) - https://phabricator.wikimedia.org/T85803#1445566 (10JanZerebecki) As seen in the backtrace https://phabricator.wikimedia.org/T105169#1444010 this needs to also work with the old log entries that have less than 5 spa... [16:49:07] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1445569 (10BBlack) Updated the packages on cp1065 for some live-testing of just that part. 
[16:53:27] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1445585 (10RobH) [16:54:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 35.71% of data above the critical threshold [500.0] [16:56:47] (03PS2) 10RobH: removing rapidssl_ca sha1 intermediary from repo [puppet] - 10https://gerrit.wikimedia.org/r/223816 [16:58:07] (03PS3) 10RobH: removing rapidssl_ca sha1 intermediary from repo [puppet] - 10https://gerrit.wikimedia.org/r/223816 [16:59:01] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default, 5Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1445596 (10BBlack) I'm going to be out on vacation starting Friday the 17th, and I also probably shouldn't turn on ECDSA just before the weekend today eithe... [16:59:32] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1445597 (10RobH) My patchset was incorrect, as I simply removed the file from the repo, and removed the stanza from certificates.pp entirely. I've changed it now to work as Brandon suggested on the patchset,... [17:00:11] (03CR) 10RobH: [C: 032] "This now is setup to work as Brandon's recent patchsets for a similar task (linked in the patchset comments in gerrit.)" [puppet] - 10https://gerrit.wikimedia.org/r/223816 (owner: 10RobH) [17:01:03] (03PS2) 10Hashar: nodepool: setup python logger [puppet] - 10https://gerrit.wikimedia.org/r/224106 [17:01:09] (03CR) 10jenkins-bot: [V: 04-1] nodepool: setup python logger [puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [17:01:35] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1445603 (10RobH) 5Open>3stalled First of two patchsets is merged, I'll follow up on this in a day or so to ensure its been properly removed via puppet. 
[17:03:03] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1445613 (10RobH) [17:05:03] PROBLEM - puppet last run on wtp1017 is CRITICAL Puppet has 1 failures [17:05:12] PROBLEM - puppet last run on neon is CRITICAL Puppet has 1 failures [17:05:23] PROBLEM - puppet last run on eventlog1001 is CRITICAL Puppet has 1 failures [17:05:23] PROBLEM - puppet last run on californium is CRITICAL Puppet has 1 failures [17:05:23] PROBLEM - puppet last run on pc1001 is CRITICAL Puppet has 1 failures [17:05:43] PROBLEM - puppet last run on mc1009 is CRITICAL Puppet has 1 failures [17:05:53] PROBLEM - puppet last run on mw2106 is CRITICAL Puppet has 1 failures [17:06:12] PROBLEM - puppet last run on mc1011 is CRITICAL Puppet has 1 failures [17:06:13] PROBLEM - puppet last run on mw1109 is CRITICAL Puppet has 1 failures [17:06:13] PROBLEM - puppet last run on cp2009 is CRITICAL Puppet has 1 failures [17:06:13] PROBLEM - puppet last run on cp2007 is CRITICAL Puppet has 1 failures [17:06:14] PROBLEM - puppet last run on db2046 is CRITICAL Puppet has 1 failures [17:06:14] PROBLEM - puppet last run on mw2197 is CRITICAL Puppet has 1 failures [17:06:14] PROBLEM - puppet last run on mw2202 is CRITICAL Puppet has 1 failures [17:06:22] PROBLEM - puppet last run on mw2112 is CRITICAL Puppet has 1 failures [17:06:23] PROBLEM - puppet last run on mw2041 is CRITICAL Puppet has 1 failures [17:06:23] PROBLEM - puppet last run on cp3046 is CRITICAL Puppet has 1 failures [17:06:24] PROBLEM - puppet last run on nescio is CRITICAL Puppet has 1 failures [17:06:28] hmmmm [17:06:33] PROBLEM - puppet last run on mw1147 is CRITICAL Puppet has 1 failures [17:06:42] PROBLEM - puppet last run on uranium is CRITICAL Puppet has 1 failures [17:06:43] PROBLEM - puppet last run on analytics1039 is CRITICAL Puppet has 1 failures [17:06:43] PROBLEM - puppet last run on ocg1001 is CRITICAL Puppet has 1 failures [17:06:53] PROBLEM - puppet last run on 
analytics1015 is CRITICAL Puppet has 1 failures [17:07:02] PROBLEM - puppet last run on mw1252 is CRITICAL Puppet has 1 failures [17:07:07] it's the rapidssl_CA, it's going to hit every machine [17:07:13] PROBLEM - puppet last run on mw1130 is CRITICAL Puppet has 1 failures [17:07:22] PROBLEM - puppet last run on mw2099 is CRITICAL Puppet has 1 failures [17:07:43] PROBLEM - puppet last run on mw2160 is CRITICAL Puppet has 1 failures [17:07:52] PROBLEM - puppet last run on mw2183 is CRITICAL Puppet has 1 failures [17:07:53] ah wait, not every machine, just a race condition on those running right as the merge happened [17:08:04] I think. verifying. [17:08:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [17:09:31] yes, just race-condition. [17:10:08] the merge deletes a file in the same move as deleting the reference to the file. so the file vanishes from the puppet fileserver for clients that were still trying to run a catalog compiled a few seconds earlier and still holding the old reference [17:10:18] they should all recover on their next run ~20 minutes later. [17:10:43] RECOVERY - puppet last run on wtp1017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:10:50] ^ I ran that one manually to check [17:11:45] 6operations, 7Mail: Create exim Mailing Aliases - https://phabricator.wikimedia.org/T105433#1445667 (10RobH) 5Open>3Resolved a:3RobH Done, these will go live once puppet runs on the mail systems. [17:14:53] RECOVERY - Disk space on labstore2001 is OK: DISK OK [17:16:24] (03PS1) 10BBlack: parsoid SSL: remove pointless SNI, add OCSP [puppet] - 10https://gerrit.wikimedia.org/r/224113 [17:17:06] 6operations, 6Multimedia, 6Performance-Team, 10Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1445719 (10Whatamidoing-WMF) >>! 
In T65440#1444161, @Gilles wrote: > In that respect, introducing 400px as a new default has th... [17:17:12] (03CR) 10BBlack: [C: 032 V: 032] parsoid SSL: remove pointless SNI, add OCSP [puppet] - 10https://gerrit.wikimedia.org/r/224113 (owner: 10BBlack) [17:21:25] (03PS2) 10Chad: Prevent race condition when writing settings to cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224107 (https://phabricator.wikimedia.org/T103744) [17:21:33] RECOVERY - puppet last run on mw1147 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:21:44] RECOVERY - puppet last run on analytics1039 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:21:52] RECOVERY - puppet last run on ocg1001 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:22:03] RECOVERY - puppet last run on mw1252 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:22:13] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:22:13] RECOVERY - puppet last run on mw1130 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:22:22] RECOVERY - puppet last run on californium is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:22:22] RECOVERY - puppet last run on pc1001 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:22:23] RECOVERY - puppet last run on mw2099 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:22:42] RECOVERY - puppet last run on mc1009 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:23:11] RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:23:19] (03PS1) 10GWicke: Bump up compactors back to 10 [puppet] - 10https://gerrit.wikimedia.org/r/224114 [17:23:30] RECOVERY - puppet last run on cp2007 is OK Puppet is currently enabled, 
last run 1 minute ago with 0 failures [17:23:30] RECOVERY - puppet last run on cp2009 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:23:40] RECOVERY - puppet last run on nescio is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:23:41] RECOVERY - puppet last run on mc1011 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:23:52] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [17:24:10] RECOVERY - puppet last run on mw1109 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:24:20] RECOVERY - puppet last run on db2046 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:24:20] (03CR) 10BryanDavis: [C: 031] Prevent race condition when writing settings to cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224107 (https://phabricator.wikimedia.org/T103744) (owner: 10Chad) [17:24:21] RECOVERY - puppet last run on mw2106 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:24:31] RECOVERY - puppet last run on cp3046 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:24:40] RECOVERY - puppet last run on mw2041 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:24:41] RECOVERY - puppet last run on uranium is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:25:01] RECOVERY - puppet last run on mw2160 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:25:01] RECOVERY - puppet last run on mw2202 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:25:10] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:25:11] RECOVERY - puppet last run on mw2183 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:25:41] RECOVERY - puppet last 
run on mw2197 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:25:51] !log rebooting labstore2001 (experiments with the new raid setup caused the mapper table to fill) [17:25:55] Logged the message, Master [17:25:59] !log installed python security updates on mc* [17:26:03] Logged the message, Master [17:27:44] (03PS2) 10GWicke: Bump up compactors back to 10 [puppet] - 10https://gerrit.wikimedia.org/r/224114 [17:28:11] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [17:29:31] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 44.54 ms [17:31:20] RECOVERY - puppet last run on mw2112 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:31:40] (03CR) 10Chad: [C: 032] Prevent race condition when writing settings to cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224107 (https://phabricator.wikimedia.org/T103744) (owner: 10Chad) [17:31:46] (03Merged) 10jenkins-bot: Prevent race condition when writing settings to cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224107 (https://phabricator.wikimedia.org/T103744) (owner: 10Chad) [17:32:00] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Bump up compactors back to 10 [puppet] - 10https://gerrit.wikimedia.org/r/224114 (owner: 10GWicke) [17:32:42] gwicke: ^ [17:32:44] !log demon Synchronized wmf-config/CommonSettings.php: prevent race condition on writing settings (duration: 00m 13s) [17:32:48] Logged the message, Master [17:32:49] bd808: ^^^ [17:32:52] godog: thank you! [17:33:00] will slowly roll it out in a bit [17:33:10] kk [17:34:17] ostriches: I don't see a storm of errors so I guess it didn't make it worse :) [17:34:36] I can't see any history of it in logstash to see when it happened last anyway [17:34:48] I was mainly fixing this on the filed task + seeing the code being wrong [17:35:05] yeah. it *should* have been pretty rare [17:36:47] bd808: Resolved the task. Thanks for the review [17:36:54] sure!
[17:38:06] (PS2) BBlack: text varnish: pass all "Authorization: OAuth " requests [puppet] - https://gerrit.wikimedia.org/r/223997 (https://phabricator.wikimedia.org/T105387)
[17:38:24] (CR) BBlack: [C: 2 V: 2] text varnish: pass all "Authorization: OAuth " requests [puppet] - https://gerrit.wikimedia.org/r/223997 (https://phabricator.wikimedia.org/T105387) (owner: BBlack)
[17:42:34] (PS4) Faidon Liambotis: (WIP) Remove support for Ubuntu Lucid/10.04 [puppet] - https://gerrit.wikimedia.org/r/179888
[17:45:57] PROBLEM - puppet last run on cp2021 is CRITICAL puppet fail
[17:46:36] (CR) Aaron Schulz: Prevent race condition when writing settings to cache (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/224107 (https://phabricator.wikimedia.org/T103744) (owner: Chad)
[17:49:26] !log rolling restart of the cassandra cluster to apply https://gerrit.wikimedia.org/r/#/c/224114/
[17:49:30] Logged the message, Master
[17:51:28] RECOVERY - puppet last run on cp2021 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[17:51:42] (PS1) BBlack: misc SSL: switch to unified like others [puppet] - https://gerrit.wikimedia.org/r/224117
[17:51:57] (PS1) BBlack: move nginx->varnish dep to common ssl::local [puppet] - https://gerrit.wikimedia.org/r/224118
[17:57:44] (CR) Manybubbles: [C: 1] "Fine by me." [puppet] - https://gerrit.wikimedia.org/r/214037 (owner: Alex Monk)
[17:59:12] (PS1) BBlack: ciphersuite: remove non-FS Camellia options [puppet] - https://gerrit.wikimedia.org/r/224120
[18:00:22] !log ansible -i production restbase -a 'nodetool setcompactionthroughput 90'
[18:00:26] Logged the message, Master
[18:06:27] (CR) BBlack: [C: 2] move nginx->varnish dep to common ssl::local [puppet] - https://gerrit.wikimedia.org/r/224118 (owner: BBlack)
[18:09:13] operations, Services, service-deployment-requests: New Service Request mobileapps - https://phabricator.wikimedia.org/T105538#1446023 (bearND) NEW
[18:09:51] (PS2) BBlack: openstack::nova::compute: use secret() for key [puppet] - https://gerrit.wikimedia.org/r/224009
[18:10:11] (PS2) BBlack: ciphersuite: remove non-FS Camellia options [puppet] - https://gerrit.wikimedia.org/r/224120
[18:10:33] (CR) Andrew Bogott: [C: 1] openstack::nova::compute: use secret() for key [puppet] - https://gerrit.wikimedia.org/r/224009 (owner: BBlack)
[18:10:37] operations, Services, Mobile Content Service, service-deployment-requests: New Service Request mobileapps - https://phabricator.wikimedia.org/T105538#1446033 (bearND)
[18:11:09] (CR) BBlack: [C: 2] ciphersuite: remove non-FS Camellia options [puppet] - https://gerrit.wikimedia.org/r/224120 (owner: BBlack)
[18:11:25] !log ansible -i production restbase -a 'nodetool setcompactionthroughput 120'
[18:11:30] Logged the message, Master
[18:12:03] (PS3) BBlack: openstack::nova::compute: use secret() for key [puppet] - https://gerrit.wikimedia.org/r/224009
[18:12:14] (CR) BBlack: [C: 2 V: 2] openstack::nova::compute: use secret() for key [puppet] - https://gerrit.wikimedia.org/r/224009 (owner: BBlack)
[18:12:38] operations, Icinga, Patch-For-Review: monitor HTTP on bromine.eqiad.wmnet - https://phabricator.wikimedia.org/T104948#1446046 (Dzahn) Open>Resolved
[18:12:46] (CR) Andrew Bogott: [C: 1] nova monitoring instaces and salt keys: add new options [puppet] - https://gerrit.wikimedia.org/r/211432 (owner: ArielGlenn)
[18:15:58] operations: Ferm rules for netmon1001 - https://phabricator.wikimedia.org/T105410#1446060 (Dzahn) p:Triage>Normal
[18:16:12] operations, Wikimedia-IRC, Patch-For-Review: enable IPv6 on irc.wikimedia.org - https://phabricator.wikimedia.org/T105422#1446061 (Dzahn) p:Triage>Low
[18:16:37] operations: move grafana from zirconium to a VM - https://phabricator.wikimedia.org/T105008#1446062 (Dzahn) p:Triage>Normal
[18:16:44] operations, Wikimedia-IEG-grant-review: move iegreview to a VM - https://phabricator.wikimedia.org/T105007#1446063 (Dzahn) p:Triage>Normal
[18:16:55] operations, Wikimedia-Wikimania-Scholarships: move wikimania_scholarships to a VM - https://phabricator.wikimedia.org/T105003#1446064 (Dzahn) p:Triage>Normal
[18:17:45] operations, ops-eqiad: logstash1003 - RAID failed - https://phabricator.wikimedia.org/T104592#1446066 (Dzahn) p:Triage>Normal
[18:18:01] operations, Traffic, HTTPS, HTTPS-by-default: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1446067 (Dzahn) p:Triage>Normal
[18:18:10] (PS2) Andrew Bogott: mediawiki_singlenode: rename defined type [puppet] - https://gerrit.wikimedia.org/r/211335 (owner: Dzahn)
[18:18:17] operations, Patch-For-Review: Mediawiki font packages: switch to Jessie - https://phabricator.wikimedia.org/T102623#1446070 (Dzahn) p:Triage>Normal
[18:19:16] (CR) Andrew Bogott: [C: 2] mediawiki_singlenode: rename defined type [puppet] - https://gerrit.wikimedia.org/r/211335 (owner: Dzahn)
[18:20:19] operations: sitemap.wikimedia.org ? - https://phabricator.wikimedia.org/T101486#1446091 (Dzahn) I feel this should be a Search and Discovery thing but i see the project tag was removed. Is there a better one?
[18:20:37] andrewbogott: :) thanks
[18:23:56] (PS2) Andrew Bogott: tabs -> 4 spaces [debs/adminbot] - https://gerrit.wikimedia.org/r/180896 (owner: Merlijn van Deen)
[18:24:02] (CR) jenkins-bot: [V: -1] tabs -> 4 spaces [debs/adminbot] - https://gerrit.wikimedia.org/r/180896 (owner: Merlijn van Deen)
[18:26:58] (PS3) Andrew Bogott: tabs -> 4 spaces + other pep8 fixes [debs/adminbot] - https://gerrit.wikimedia.org/r/180896 (owner: Merlijn van Deen)
[18:27:50] (CR) Andrew Bogott: [C: 2] tabs -> 4 spaces + other pep8 fixes [debs/adminbot] - https://gerrit.wikimedia.org/r/180896 (owner: Merlijn van Deen)
[18:28:53] (PS9) Andrew Bogott: Add a labsproject fact that doesn't rely on ldap config. [puppet] - https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684)
[18:33:08] (CR) Andrew Bogott: [C: 2] Add a labsproject fact that doesn't rely on ldap config. [puppet] - https://gerrit.wikimedia.org/r/220991 (https://phabricator.wikimedia.org/T93684) (owner: Andrew Bogott)
[18:37:10] andrewbogott: so should we go back to adminbot and try a new version?
[18:37:20] mutante: sure.
[18:37:21] or do we have to up the version after the above
[18:37:38] eh, the above is just pep8 fixes. You can rebuild if you feel like it but it’s not necessary
[18:37:41] ok, so let's try with the one in our own channel
[18:37:45] since we might have to rebuild anyway :)
[18:38:09] hmm.. i should probably install the package manually on one box?
[18:39:52] andrewbogott: re-included with reprepro
[18:41:28] PROBLEM - puppet last run on cp2008 is CRITICAL Puppet has 1 failures
[18:43:48] PROBLEM - puppet last run on cp1066 is CRITICAL Puppet has 1 failures
[18:48:51] mutante: there's a copy in ~/src
[18:49:05] but it's an old one it seems
[18:49:29] yeah, it was one to test my +url change
[18:49:35] valhallasw`cloud: it should get 1.7.9 hopefully
[18:49:53] on the precise boxes that is
[18:50:02] mutante: but that one had a bug, right?
[18:50:21] valhallasw`cloud: yea, but it is strange because nobody changed the README
[18:50:29] or anything that looks related
[18:50:37] mutante: I meant the bot not working
[18:50:46] not sure what's happening with the README either
[18:50:49] (PS2) BBlack: misc SSL: switch to unified like others [puppet] - https://gerrit.wikimedia.org/r/224117
[18:51:50] valhallasw`cloud: well, i'm not sure anymore because when i restarted the bot it was like neither of the versions worked, then andrew restarts it and it works :)
[18:52:14] mutante: there were errors in the log: https://phabricator.wikimedia.org/T105169#1444010
[18:52:40] valhallasw`cloud: oh, didnt see that yet.. ok
[18:52:48] PROBLEM - puppet last run on mw2044 is CRITICAL Puppet has 1 failures
[18:52:50] (CR) BBlack: [C: 2] misc SSL: switch to unified like others [puppet] - https://gerrit.wikimedia.org/r/224117 (owner: BBlack)
[18:53:45] ValueError: need more than 4 values to unpack
[18:53:45] 2015-07-10 02:52:32,718 ERROR: Died in main event loop
[18:53:47] eh...
[18:53:48] yeah
[18:53:52] it's the date parsing I think
[18:53:57] sigh..ok
[18:54:03] I'll take a look
[18:54:09] cool
[18:55:07] PROBLEM - puppet last run on mw2171 is CRITICAL Puppet has 1 failures
[18:55:28] PROBLEM - puppet last run on mw1155 is CRITICAL Puppet has 1 failures
[18:58:37] RECOVERY - puppet last run on cp2008 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[18:58:58] RECOVERY - puppet last run on cp1066 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[19:00:04] bd808: Respected human, time to deploy Logstash cluster maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150710T1900). Please do the needful.
[19:00:25] (PS1) Krinkle: grafana: Set custom default dashboard [puppet] - https://gerrit.wikimedia.org/r/224129
[19:03:59] Can someone check icinga to see if the logstash cluster's elasticsearch health check is marked for a maintenance window yet?
[19:04:14] * bd808 is looking but is not sure if he can tell
[19:04:32] (CR) Dzahn: "without having a personal opinion i just notice that we keep having different points of view on _where_ to put ferm rules (role classes vs" [puppet] - https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[19:04:48] bd808: sure, one sec
[19:05:02] also, mutante / andrewbogott, what is your idea on logbot actually having a bot flag or not? IMO it would make sense to not have the edits marked as bot, as the messages are really human messages
[19:05:30] but maybe there's a good reason to keep them as bot messages? too many of them?
[19:05:58] bd808: it's not. did you issue the command? i can do it if you tell me the duration.
[19:06:19] jgage: 24 hours?
[19:06:30] ok
[19:07:41] (PS3) Dzahn: ferm rules for IRCd [puppet] - https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943)
[19:08:26] bd808: done
[19:08:27] RECOVERY - puppet last run on mw2044 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[19:08:31] valhallasw`cloud: i tend to agree they are human messages
[19:08:33] thanks jgage
[19:09:12] valhallasw`cloud: but some are messages from logmsgbot too
[19:09:20] mutante: mmm, true
[19:10:04] James_F,
[19:10:20] !log krenair Synchronized php-1.26wmf13/extensions/VisualEditor/lib/ve/src/ce/nodes/ve.ce.TableNode.js: https://gerrit.wikimedia.org/r/#/c/224122/ (duration: 00m 12s)
[19:10:24] Logged the message, Master
[19:10:36] like that :)
[19:10:47] RECOVERY - puppet last run on mw2171 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures
[19:10:56] it's fun enough that one bot tells another bot what to do
[19:10:57] (PS1) BBlack: Increase puppet-run interval from 20m to 30m [puppet] - https://gerrit.wikimedia.org/r/224131
[19:11:08] RECOVERY - puppet last run on mw1155 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:11:57] Hm. Maybe I should just make labs ones non-bot instead of all of them
[19:12:20] 21:11 <@valhallasw`cloud> !log testlabs this will fail
[19:12:20] 21:11 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Testlabs/SAL, Master
[19:12:22] * valhallasw`cloud cheers
[19:12:23] (CR) Faidon Liambotis: [C: 1] Increase puppet-run interval from 20m to 30m [puppet] - https://gerrit.wikimedia.org/r/224131 (owner: BBlack)
[19:12:26] (PS4) Dzahn: ferm rules for IRCd [puppet] - https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943)
[19:12:47] (CR) BBlack: [C: 2] Increase puppet-run interval from 20m to 30m [puppet] - https://gerrit.wikimedia.org/r/224131 (owner: BBlack)
[19:12:54] valhallasw`cloud: oh, is that the new version , i see the year :))
[19:13:14] *nod* the issue was it didn't parse the headers correctly anymore because they don't contain years yet
[19:13:19] 2015-07-10 would be even nicer
[19:13:35] ah
[19:13:54] we can do that as well
[19:14:42] i like it because then it sorts by date
[19:14:47] and xkcd
[19:15:01] https://xkcd.com/1179/
[19:16:11] (CR) Dzahn: "convinced. moved to module where the service is setup" [puppet] - https://gerrit.wikimedia.org/r/223886 (https://phabricator.wikimedia.org/T104943) (owner: Dzahn)
[19:16:50] mutante: that was my comment ;D
[19:17:08] mutante: https://wikitech.wikimedia.org/wiki/Nova_Resource:Testlabs/SAL ? :)
[19:17:20] yes, to both of you:)
[19:17:33] xkcd is always relevant
[19:18:17] PROBLEM - Incoming network saturation on labstore2001 is CRITICAL 14.81% of data above the critical threshold [100000000.0]
[19:19:20] !log Upgraded logstash1001 to elasticsearch 1.6.0
[19:19:24] Logged the message, Master
[19:21:31] (PS2) Dzahn: ferm rules for bacula director, storage [puppet] - https://gerrit.wikimedia.org/r/223849 (https://phabricator.wikimedia.org/T104996)
[19:21:45] (CR) Dzahn: ferm rules for bacula director, storage (2 comments) [puppet] - https://gerrit.wikimedia.org/r/223849 (https://phabricator.wikimedia.org/T104996) (owner: Dzahn)
[19:21:49] indeed, anytime someone needs an ops ruling on something, there is a relevant xkcd comic.
[19:22:22] !log Upgraded logstash1002 to elasticsearch 1.6.0
[19:22:27] Logged the message, Master
[19:23:59] (PS3) Dzahn: ferm rules for bacula [puppet] - https://gerrit.wikimedia.org/r/223849 (https://phabricator.wikimedia.org/T104996)
[19:25:10] (Abandoned) Dzahn: ferm rules for bacula storage [puppet] - https://gerrit.wikimedia.org/r/223851 (https://phabricator.wikimedia.org/T104996) (owner: Dzahn)
[19:26:03] (PS6) Dzahn: deployment::server: move releases::upload into role [puppet] - https://gerrit.wikimedia.org/r/223464
[19:26:09] !log Upgraded logstash1003 to elasticsearch 1.6.0
[19:26:14] Logged the message, Master
[19:26:48] (CR) Dzahn: [C: 2] deployment::server: move releases::upload into role [puppet] - https://gerrit.wikimedia.org/r/223464 (owner: Dzahn)
[19:26:49] ugh. I still don't get why Gerrit doesn't just add a Change-Id if none is present
[19:26:56] pushing over https would be so much easier
[19:27:09] valhallasw`cloud: afaik you can
[19:27:15] valhallasw`cloud: after you set the "http password" in gerrit
[19:27:26] mutante: yes, the pushing works, but I didn't have a Change-Id
[19:27:30] ah
[19:27:52] i think "git review" does that for me
[19:28:39] bd808: I'm getting 'unknown index' errors from logstash, is that known?
[19:28:44] *nod*. Unfortunately, git review requires ssh to set itself up
[19:28:52] (PS2) Merlijn van Deen: Make bot flag configurable [debs/adminbot] - https://gerrit.wikimedia.org/r/180889
[19:28:53] gwicke: I'm trying to figure it out now
[19:28:54] (PS1) Merlijn van Deen: ISO-fy date format & make date parsing fail gently [debs/adminbot] - https://gerrit.wikimedia.org/r/224171
[19:28:56] (CR) jenkins-bot: [V: -1] Make bot flag configurable [debs/adminbot] - https://gerrit.wikimedia.org/r/180889 (owner: Merlijn van Deen)
[19:28:58] (CR) jenkins-bot: [V: -1] ISO-fy date format & make date parsing fail gently [debs/adminbot] - https://gerrit.wikimedia.org/r/224171 (owner: Merlijn van Deen)
[19:28:59] oh well, this also works
[19:29:06] except for the flake8 part
[19:29:15] gwicke: I just upgraded the elasticsearch versions there and something isn't quite right
[19:29:16] it's voting now:)
[19:29:20] since Elee fixed all the issues
[19:29:40] elee: ^ the bot is being worked on
[19:29:41] yeah, and it should :)
[19:29:50] bd808: kk, noticed your earlier log a moment later
[19:30:28] pep8's 80 char maximum is crazy though
[19:31:33] (PS2) Merlijn van Deen: ISO-fy date format & make date parsing fail gently [debs/adminbot] - https://gerrit.wikimedia.org/r/224171
[19:31:52] (PS3) Merlijn van Deen: Make bot flag configurable [debs/adminbot] - https://gerrit.wikimedia.org/r/180889
[19:31:56] valhallasw`cloud: we even looked that up recently
[19:31:58] PROBLEM - puppet last run on silver is CRITICAL puppet fail
[19:32:00] "Some teams strongly prefer a longer line length. For code maintained exclusively or primarily by a team that can reach agreement on this issue, it is okay to increase the nominal line length from 80 to 100 characters (effectively increasing the maximum length to 99 characters), provided that comments and docstrings are still wrapped at 72 characters."
[19:32:18] there was a great talk on pycon about it
[19:32:19] !log kibana not seeing indices after upgrading elasticsearch to 1.6.0; investigating
[19:32:23] Logged the message, Master
[19:32:27] PROBLEM - puppet last run on cp1064 is CRITICAL Puppet has 1 failures
[19:32:45] 'if you're focusing on appeasing pep8, you're probably missing the bigger picture'
[19:33:15] https://www.youtube.com/watch?v=wf-BqAjZb8M
[19:33:19] PROBLEM - puppet last run on ms-be2015 is CRITICAL puppet fail
[19:33:55] valhallasw`cloud: Does it have the bot flag now?
[19:34:02] andrewbogott: it's configurable per-channel
[19:34:05] sorry
[19:34:09] valhallasw`cloud: so instead of focusing on it i should watch an hour of video about it, hehe :)
[19:34:10] currently it always uses the bot flag
[19:34:23] I made it per-channel configurable now
[19:34:29] valhallasw`cloud: fair enough :)
[19:34:43] that's cool .. configurable
[19:34:57] I think it should definitely be non-bot for labs, but other usages might have other reasonable choices
[19:34:57] PROBLEM - puppet last run on cp1048 is CRITICAL Puppet has 1 failures
[19:35:18] PROBLEM - puppet last run on labsdb1006 is CRITICAL puppet fail
[19:35:23] valhallasw`cloud: did you also fix the bug that prevents logging? I stepped away for a bit, may have missed good news
[19:35:42] except I need to add wiki_bot = True to all config files, it seems
[19:35:43] gwicke: curl -s 'localhost:9200/logstash-2015.07.10/_aliases?ignore_unavailable=true&ignore_missing=true' has different response from 1.6.0 hosts in mixed cluster apparently and that's freaking kibana out
[19:35:46] (CR) Dzahn: [C: 1] ISO-fy date format & make date parsing fail gently [debs/adminbot] - https://gerrit.wikimedia.org/r/224171 (owner: Merlijn van Deen)
[19:35:47] yeah, that was fixed as well
[19:35:50] that's https://gerrit.wikimedia.org/r/#/c/224171/
[19:36:16] after those it should be a new version again
[19:36:32] I'll add a changelog change
[19:36:35] +1
[19:36:54] gwicke: are you seeing errors in other places than kibana?
[19:37:11] (CR) Dzahn: [C: 1] Make bot flag configurable [debs/adminbot] - https://gerrit.wikimedia.org/r/180889 (owner: Merlijn van Deen)
[19:37:18] PROBLEM - puppet last run on ms-be1002 is CRITICAL Puppet has 2 failures
[19:37:18] PROBLEM - puppet last run on ms-fe2002 is CRITICAL puppet fail
[19:37:21] valhallasw`cloud: do you need the pep8 change? I just shortened lines an hour ago
[19:37:28] PROBLEM - puppet last run on ms-fe1004 is CRITICAL Puppet has 1 failures
[19:37:29] PROBLEM - puppet last run on ms-be2006 is CRITICAL Puppet has 2 failures
[19:37:38] PROBLEM - puppet last run on ms-be1010 is CRITICAL Puppet has 3 failures
[19:37:53] andrewbogott: I can rewrite it to be in 80 chars, I guess, but imo 80 chars is just absurdly small
[19:37:59] PROBLEM - puppet last run on ms-fe1001 is CRITICAL Puppet has 2 failures
[19:38:08] andrewbogott: it's still within the official standard, somewhat . "Some teams strongly prefer a longer line length. For code maintained exclusively or primarily by a team that can reach agreement on this issue, it is okay to increase the nominal line length from 80 to 100 characters "
[19:38:08] PROBLEM - puppet last run on ms-be2011 is CRITICAL Puppet has 1 failures
[19:38:09] (CR) Merlijn van Deen: [C: -1] Make bot flag configurable [debs/adminbot] - https://gerrit.wikimedia.org/r/180889 (owner: Merlijn van Deen)
[19:38:20] ^ failures from deployment::server: move releases::upload into role ?
[19:38:27] PROBLEM - puppet last run on ms-be1014 is CRITICAL Puppet has 1 failures
[19:38:37] PROBLEM - puppet last run on mc1012 is CRITICAL Puppet has 2 failures
[19:38:38] PROBLEM - puppet last run on ms-be1005 is CRITICAL Puppet has 2 failures
[19:38:49] yeah, it just seems excessive to change the standard for a single (new!) line.
[19:38:49] PROBLEM - puppet last run on db2056 is CRITICAL Puppet has 1 failures
[19:38:56] bblack: shouldnt be, should only influence deployment servers.. but double checking
[19:38:58] PROBLEM - puppet last run on ms-be2001 is CRITICAL Puppet has 2 failures
[19:38:58] PROBLEM - puppet last run on neon is CRITICAL Puppet has 2 failures
[19:39:08] PROBLEM - puppet last run on ms-be1007 is CRITICAL Puppet has 2 failures
[19:39:08] PROBLEM - puppet last run on logstash1002 is CRITICAL Puppet has 1 failures
[19:39:08] PROBLEM - puppet last run on ms-fe2003 is CRITICAL Puppet has 1 failures
[19:39:08] PROBLEM - puppet last run on mc1014 is CRITICAL Puppet has 2 failures
[19:39:18] PROBLEM - puppet last run on ms-fe3002 is CRITICAL Puppet has 3 failures
[19:39:18] PROBLEM - puppet last run on sca1002 is CRITICAL Puppet has 1 failures
[19:39:28] PROBLEM - puppet last run on ms-be2007 is CRITICAL Puppet has 1 failures
[19:39:29] PROBLEM - puppet last run on ms-be1008 is CRITICAL Puppet has 1 failures
[19:39:38] PROBLEM - puppet last run on ms-be3001 is CRITICAL Puppet has 2 failures
[19:39:38] PROBLEM - puppet last run on mc1013 is CRITICAL Puppet has 1 failures
[19:39:39] PROBLEM - puppet last run on bast4001 is CRITICAL Puppet has 1 failures
[19:39:47] andrewbogott: fair enough, I'll hack a bit.
[19:39:58] thanks
[19:40:07] PROBLEM - puppet last run on analytics1022 is CRITICAL Puppet has 1 failures
[19:40:09] bblack: hmm.. no.. it finishes fine when i run puppet manually on a random box
[19:40:19] RECOVERY - puppet last run on ms-be1014 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[19:40:19] PROBLEM - puppet last run on mc1007 is CRITICAL Puppet has 2 failures
[19:40:19] mutante: wait the bot is being worked on?
[19:40:24] * elee crys
[19:40:27] !log Kibana seems to be broken by mixed 1.6.0/1.3.9 cluster
[19:40:31] Logged the message, Master
[19:40:34] I'm joking =]
[19:40:37] PROBLEM - puppet last run on mc1010 is CRITICAL Puppet has 1 failures
[19:40:38] PROBLEM - puppet last run on hafnium is CRITICAL Puppet has 1 failures
[19:40:48] PROBLEM - puppet last run on ms-be2008 is CRITICAL Puppet has 1 failures
[19:40:49] RECOVERY - puppet last run on ms-be2001 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures
[19:40:56] hmm what is up, puppet
[19:41:08] PROBLEM - puppet last run on ms-be1001 is CRITICAL Puppet has 2 failures
[19:41:10] (PS3) Merlijn van Deen: ISO-fy date format & make date parsing fail gently [debs/adminbot] - https://gerrit.wikimedia.org/r/224171
[19:41:12] (CR) jenkins-bot: [V: -1] ISO-fy date format & make date parsing fail gently [debs/adminbot] - https://gerrit.wikimedia.org/r/224171 (owner: Merlijn van Deen)
[19:41:28] RECOVERY - puppet last run on ms-be2007 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[19:42:08] PROBLEM - puppet last run on mw1211 is CRITICAL Puppet has 1 failures
[19:42:18] RECOVERY - puppet last run on mc1007 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[19:42:22] (PS4) Merlijn van Deen: Make bot flag configurable [debs/adminbot] - https://gerrit.wikimedia.org/r/180889
[19:42:28] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures
[19:42:37] PROBLEM - puppet last run on mw1045 is CRITICAL Puppet has 1 failures
[19:42:48] PROBLEM - puppet last run on mw2100 is CRITICAL Puppet has 1 failures
[19:42:57] PROBLEM - puppet last run on snapshot1002 is CRITICAL Puppet has 1 failures
[19:42:58] PROBLEM - puppet last run on mw1164 is CRITICAL Puppet has 1 failures
[19:42:58] PROBLEM - puppet last run on mw1205 is CRITICAL Puppet has 1 failures
[19:42:58] (PS4) Merlijn van Deen: ISO-fy date format & make date parsing fail gently [debs/adminbot] - https://gerrit.wikimedia.org/r/224171
[19:43:07] PROBLEM - puppet last run on snapshot1001 is CRITICAL Puppet has 1 failures
[19:43:55] !log rebooting logstash1004
[19:43:58] PROBLEM - puppet last run on mw2168 is CRITICAL Puppet has 1 failures
[19:44:00] Logged the message, Master
[19:44:44] bd808: will you be rebooting all of the logstash nodes? i only scheduled the downtime for the elasticsearch service, but i can update it.
[19:44:47] RECOVERY - puppet last run on mw2100 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[19:45:00] oh right kernel upgrade
[19:45:07] PROBLEM - Host logstash1004 is DOWN: PING CRITICAL - Packet loss = 100%
[19:45:11] i'll inform icinga
[19:45:14] jgage: yeah, so just the jessie ones
[19:45:17] (CR) Andrew Bogott: [C: 2] ISO-fy date format & make date parsing fail gently [debs/adminbot] - https://gerrit.wikimedia.org/r/224171 (owner: Merlijn van Deen)
[19:45:18] RECOVERY - puppet last run on ms-be2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:45:48] RECOVERY - Host logstash1004 is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms
[19:45:48] RECOVERY - puppet last run on ms-fe1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:46:01] (PS5) Andrew Bogott: Make bot flag configurable [debs/adminbot] - https://gerrit.wikimedia.org/r/180889 (owner: Merlijn van Deen)
[19:46:19] valhallasw`cloud: are you changing the bot configs or shall I do that?
[19:46:37] andrewbogott: I fixed the configuration to fall back to True, but I'll update the labs one
[19:46:41] bd808: perhaps this would be a good time to run apt-get upgrade on all of them. i did logstash1002 a few days ago while debugging a ganglia problem.
[19:46:48] RECOVERY - puppet last run on mw1205 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:46:49] valhallasw`cloud: ok
[19:46:50] logstash kernel update is installing linux-meta right?
[19:46:57] (our 3.19 thing)
[19:46:58] RECOVERY - puppet last run on logstash1002 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[19:46:58] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[19:47:06] moritz already upgraded, they're just waiting for reboot
[19:47:07] valhallasw`cloud: actually, change the prod one too please?
[19:47:13] bblack: I think so, yes
[19:47:14] (CR) Dpatrick: [C: 1] mwgrep: Split results between public and private wikis [puppet] - https://gerrit.wikimedia.org/r/214037 (owner: Alex Monk)
[19:47:16] ok, has to be that then, that's his package :)
[19:47:26] I think wikitech static cares about recent changes, not sure if the bot flag will change that
[19:47:42] yep, linux-meta confirmed installed on logstash1005
[19:47:49] (CR) Andrew Bogott: [C: 2] Make bot flag configurable [debs/adminbot] - https://gerrit.wikimedia.org/r/180889 (owner: Merlijn van Deen)
[19:47:55] jgage: I totally can do a full apt-get upgrade
[19:48:04] (PS1) Merlijn van Deen: Updated changelog [debs/adminbot] - https://gerrit.wikimedia.org/r/224173
[19:48:07] RECOVERY - puppet last run on mw1211 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:48:18] RECOVERY - puppet last run on mc1012 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures
[19:48:19] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:48:36] (CR) Dzahn: [C: 2] Updated changelog [debs/adminbot] - https://gerrit.wikimedia.org/r/224173 (owner: Merlijn van Deen)
[19:48:38] (Merged) jenkins-bot: Updated changelog [debs/adminbot] - https://gerrit.wikimedia.org/r/224173 (owner: Merlijn van Deen)
[19:48:46] i'll build 1.7.10
[19:48:48] RECOVERY - puppet last run on snapshot1002 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures
[19:48:55] andrewbogott: ok, done
[19:48:56] (CR) Legoktm: [C: 1] mwgrep: Split results between public and private wikis [puppet] - https://gerrit.wikimedia.org/r/214037 (owner: Alex Monk)
[19:48:58] mutante: thanks!
[19:48:59] RECOVERY - puppet last run on ms-fe3002 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[19:49:19] RECOVERY - puppet last run on ms-be3001 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[19:49:29] RECOVERY - puppet last run on silver is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[19:49:48] RECOVERY - puppet last run on analytics1022 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[19:49:57] RECOVERY - puppet last run on mw2168 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:50:27] RECOVERY - puppet last run on hafnium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:50:37] RECOVERY - puppet last run on ms-be2008 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[19:50:48] RECOVERY - puppet last run on ms-be1007 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[19:50:48] RECOVERY - puppet last run on labsdb1006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:52:24] !log adminbot - built and imported 1.7.10 into APT repo
[19:52:29] Logged the message, Master
[19:54:18] RECOVERY - puppet last run on mw1045 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[19:54:48] RECOVERY - puppet last run on ms-be1002 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[19:54:58] RECOVERY - puppet last run on sca1002 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[19:56:29] RECOVERY - puppet last run on db2056 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures
[19:56:58] RECOVERY - puppet last run on ms-fe1004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:57:08] RECOVERY - puppet last run on ms-be1010 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[19:57:48] RECOVERY - puppet last run on cp1064 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[19:57:48] RECOVERY - puppet last run on ms-be2011 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[19:58:16] operations, Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1446271 (Gage) Regarding running puppetmaster on !Precise: when I tried with Trusty I got this: ``` Error: Could not retrieve catalog from remote server: Error 400 on S...
[19:58:48] RECOVERY - puppet last run on snapshot1001 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures
[19:58:48] RECOVERY - puppet last run on mc1014 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[19:59:17] RECOVERY - puppet last run on bast4001 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[20:00:18] RECOVERY - puppet last run on ms-be1005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:00:18] RECOVERY - puppet last run on mc1010 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures
[20:00:27] RECOVERY - puppet last run on cp1048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:00:58] RECOVERY - puppet last run on ms-be2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:01:17] RECOVERY - puppet last run on ms-be1008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:02:38] RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:02:48] RECOVERY - puppet last run on ms-be1001 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures
[20:03:13] (PS1) Dzahn: adminbot: remove README from postinst script [debs/adminbot] - https://gerrit.wikimedia.org/r/224176
[20:03:17] RECOVERY - puppet last run on mc1013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:04:35] (CR) Merlijn van Deen: [C: 1] adminbot: remove README from postinst script [debs/adminbot] - https://gerrit.wikimedia.org/r/224176 (owner: Dzahn)
[20:04:37] (PS2) Dzahn: adminbot: remove README from postinst script [debs/adminbot] - https://gerrit.wikimedia.org/r/224176
[20:04:57] RECOVERY - puppet last run on ms-fe2002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:05:18] (CR) Dzahn: [C: 2] adminbot: remove README from postinst script [debs/adminbot] - https://gerrit.wikimedia.org/r/224176 (owner: Dzahn)
[20:05:20] (Merged) jenkins-bot: adminbot: remove README from postinst script [debs/adminbot] - https://gerrit.wikimedia.org/r/224176 (owner: Dzahn)
[20:06:38] RECOVERY - puppet last run on mw1164 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:07:55] !log ran apt-get upgrade on logstash1004
[20:07:59] Logged the message, Master
[20:10:51] !log `service elasticsearch start` not starting on logstash1004; investigating
[20:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:11:28] open that SAL log and see the new format :)
[20:11:37] robh:
[20:11:53] proper date!
[20:12:18] and it outputs the URL it logs to
[20:12:27] morebots: are you there
[20:12:27] I am a logbot running on tools-exec-1209.
[20:12:27] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[20:12:27] To log a message, type !log <msg>.
[20:14:07] nice, mutante
[20:14:11] big endian dates ftw
[20:26:08] (CR) Hashar: Prevent race condition when writing settings to cache (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/224107 (https://phabricator.wikimedia.org/T103744) (owner: Chad)
[20:27:49] (PS1) Hashar: nodepool: fix typo pruge -> purge [puppet] - https://gerrit.wikimedia.org/r/224177
[20:36:08] bd808: how's it going with elasticsearch?
[20:43:53] jgage: I just got it up. there's a weird init script problem that we hit
[20:44:12] I've been talking to manybubbles about it
[20:44:12] not cool / cool
[20:44:26] i noticed he was logged in and thought uh-oh :)
[20:44:45] we have 3 lines in /etc/default/elasticsearch that should produce $ES_JAVA_OPTS in the env
[20:45:18] but something on that host was making only the last line end up being exported to the runner script
[20:45:30] huh
[20:45:39] and it had unexpanded env var references
[20:45:58] exact same config is in beta and working fine
[20:45:59] hmm i see the fix
[20:46:08] <3 java
[20:46:21] this is dash/systemd
[20:47:12] it's systemd running a shell script that sources a shell script and then calls another shell script.
[20:47:35] the sourced script is not acting like it was actually sourced properly
[20:48:36] I'll write a puppet patch and somebody can help me ponder if the fix that worked is "right" or not
[20:48:56] (PS1) Dzahn: adminbot: add package build and upload docs [debs/adminbot] - https://gerrit.wikimedia.org/r/224182
[20:49:12] (PS2) Dzahn: adminbot: add package build and upload docs [debs/adminbot] - https://gerrit.wikimedia.org/r/224182
[20:50:28] (CR) Merlijn van Deen: [C: 1] adminbot: add package build and upload docs [debs/adminbot] - https://gerrit.wikimedia.org/r/224182 (owner: Dzahn)
[20:51:27] (CR) Andrew Bogott: [C: 2] adminbot: add package build and upload docs [debs/adminbot] - https://gerrit.wikimedia.org/r/224182 (owner: Dzahn)
[21:00:35] (PS1) BryanDavis: Work around /etc/default/elasticsearch variable expansion issue [puppet] - https://gerrit.wikimedia.org/r/224185
[21:00:45] jgage: ^
[21:07:54] operations, Multimedia, Performance-Team, Wikimedia-Site-requests: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1446373 (Gilles) >>! In T65440#1445230, @Edokter wrote: > I think we should build on multiples of 120 and 160. That should re...
[21:10:57] (CR) Dzahn: [C: 1] add ferm rules for memcached [puppet] - https://gerrit.wikimedia.org/r/222556 (owner: Muehlenhoff)
[21:17:13] (CR) BryanDavis: Prevent race condition when writing settings to cache (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/224107 (https://phabricator.wikimedia.org/T103744) (owner: Chad)
[21:18:14] (CR) Dzahn: "so re: the @resolve part of this. i kind of like it and also added it before myself so that we don't hardcode IP addresses in puppet modul" [puppet] - https://gerrit.wikimedia.org/r/223537 (owner: Muehlenhoff)
[21:19:36] (CR) Dzahn: "P.S. .. but if we put hardcoded IP addresses in there then it will be a reason why people say ferm rules should go to role classes, not th" [puppet] - https://gerrit.wikimedia.org/r/223537 (owner: Muehlenhoff)
[21:19:52] (PS1) Gergő Tisza: Remove code duplication from monolog config [mediawiki-config] - https://gerrit.wikimedia.org/r/224188
[21:24:01] operations: move calcium to a VM - https://phabricator.wikimedia.org/T105553#1446395 (Dzahn) NEW
[21:24:34] (CR) BryanDavis: [C: 1] Remove code duplication from monolog config [mediawiki-config] - https://gerrit.wikimedia.org/r/224188 (owner: Gergő Tisza)
[21:25:17] (CR) Gage: [C: 2] "Seems like a reasonable workaround for a gross bug" [puppet] - https://gerrit.wikimedia.org/r/224185 (owner: BryanDavis)
[21:27:44] operations: move OTRS to a VM? - https://phabricator.wikimedia.org/T105554#1446402 (Dzahn) NEW
[21:29:33] operations: move racktables and RT to a VM - https://phabricator.wikimedia.org/T105555#1446410 (Dzahn) NEW
[21:30:23] operations: move racktables and RT to a VM - https://phabricator.wikimedia.org/T105555#1446421 (Dzahn) p:Triage>Normal
[21:30:53] operations: move OTRS to a VM? - https://phabricator.wikimedia.org/T105554#1446422 (Dzahn) p:Triage>Normal
[21:31:23] operations: move calcium to a VM - https://phabricator.wikimedia.org/T105553#1446426 (Dzahn) p:Triage>Normal
[21:32:31] operations, OTRS: move OTRS to a VM? - https://phabricator.wikimedia.org/T105554#1446429 (Krenair)
[21:36:05] operations, Labs: puppet error when trying to update labs host - https://phabricator.wikimedia.org/T105556#1446454 (Smalyshev) NEW a:yuvipanda
[21:47:23] (CR) Smalyshev: add admin group 'wikidata query service deployers' (1 comment) [puppet] - https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) (owner: Dzahn)
[21:48:42] Ops-Access-Requests, operations, Discovery, Wikidata, and 2 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1446483 (Smalyshev) @DZhan should I be using systemd or upstart or sysv configurations for the services? I'm not sure if /usr/sbin/service is...
[21:50:02] (PS1) Dzahn: site.pp - add comments about server roles [puppet] - https://gerrit.wikimedia.org/r/224191
[21:54:05] (CR) Dzahn: add admin group 'wikidata query service deployers' (1 comment) [puppet] - https://gerrit.wikimedia.org/r/223984 (https://phabricator.wikimedia.org/T105185) (owner: Dzahn)
[22:13:50] Ops-Access-Requests, operations, Discovery, Wikidata, and 2 others: Need deploy rights for Wikidata Query Service - https://phabricator.wikimedia.org/T105185#1446526 (Dzahn) so far i have simply copied that from other examples we have for existing services. that made me assume it's the right way fo...
[22:14:47] operations, Labs: puppet error when trying to update labs host - https://phabricator.wikimedia.org/T105556#1446527 (yuvipanda) Do service puppetmaster restart on your puppetmaster and try again?
[22:18:59] 6operations, 6Labs: puppet error when trying to update labs host - https://phabricator.wikimedia.org/T105556#1446528 (10Smalyshev) 5Open>3Resolved Yay! That helped. Sorry, should have thought about restarting it. [22:20:24] Any details in sending out e-mails that we know of? [22:20:45] from wiki@wikimedia.org, I haste to add [22:20:56] * odder just sent a couple of e-mails and didn't receive their copies [22:24:28] PROBLEM - Incoming network saturation on labstore2001 is CRITICAL 10.34% of data above the critical threshold [100000000.0] [22:25:11] 6operations, 10OTRS: upgrade iodine to jessie or find a new host with jessie for OTRS - https://phabricator.wikimedia.org/T105125#1446533 (10Dzahn) How about moving it to a VM ? (T105554) that also makes it jessie. [22:25:36] 6operations, 10OTRS: move OTRS to a VM? - https://phabricator.wikimedia.org/T105554#1446402 (10Dzahn) would also resolve T105125 [22:27:29] odder, I sent an email via EmailUser from one of my accounts to the other, it worked and I received a copy on the source email [22:28:16] odder: i also see mail from wiki@wikimedia in log [22:29:48] :-( [22:30:06] maybe it's my e-mail provider that sucks, although I tried sending a few other e-mails before and it worked [22:30:34] about 12 hours ago you appear in it but not after that [22:31:48] (03PS1) 10BBlack: Revert "test secret() again on cp1008" [puppet] - 10https://gerrit.wikimedia.org/r/224195 [22:32:05] (03CR) 10BBlack: [C: 032 V: 032] Revert "test secret() again on cp1008" [puppet] - 10https://gerrit.wikimedia.org/r/224195 (owner: 10BBlack) [22:34:32] mutante: can you expand on 'appear'? As in, I used Special:EmailUser or was the recipient of a message sent with it? [22:34:59] odder: as in "grep odder" [22:36:22] aha. 
[22:41:41] (03PS1) 10Dzahn: tendril: let puppet git clone on changes [puppet] - 10https://gerrit.wikimedia.org/r/224196 (https://phabricator.wikimedia.org/T98816) [22:45:54] (03PS2) 10Dzahn: tendril: let puppet git clone on changes [puppet] - 10https://gerrit.wikimedia.org/r/224196 (https://phabricator.wikimedia.org/T98816) [22:49:34] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1446582 (10BBlack) They're replacing the old ones. I guess technically we could name them anything, but it will be less-confusing months from now if they're lvs1001-6. We could/should perhaps... [22:49:40] (03CR) 10Dzahn: [C: 032] tendril: let puppet git clone on changes [puppet] - 10https://gerrit.wikimedia.org/r/224196 (https://phabricator.wikimedia.org/T98816) (owner: 10Dzahn) [22:51:14] !log tendril: very short maintenance downtime [22:51:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:51:42] why do we need to include the link? :? [22:52:15] for lazy clicking? [22:52:34] very as long as a puppet run takes on neon :p [22:52:58] I guess but it seems pointless [22:53:04] i.e. it would be faster to buy a new server? [22:53:05] JohnFLewis: the same bot has more than one URL [22:53:08] mutante: tell us how long then :p [22:53:24] mutante: yeah but still, makes a sort sweet message - long [22:55:12] JohnFLewis: 65.23 seconds [22:55:24] JohnFLewis: when that's the most annoying bot message in this channel maybe we should revisit the patch ;) [22:55:31] plus 10 seconds it took me to copy the missing config.php :p [22:55:34] hrmm [22:55:59] I actually like the link [22:56:58] bd808: agreed :) [23:00:13] 6operations, 7Database, 5Patch-For-Review: move tendril to gerrit repo and puppetize cloning - https://phabricator.wikimedia.org/T98816#1446681 (10Dzahn) removed the entire directory to get rid of .git remnants etc and let puppet freshly clone it once again. done. 
i still had to copy the config.php in plac... [23:01:36] 6operations, 7Database, 5Patch-For-Review: move tendril to gerrit repo and puppetize cloning - https://phabricator.wikimedia.org/T98816#1446702 (10Dzahn) currently there is "config.php.template" . see @neon:/srv/tendril/lib# diff config.php config.php.template [23:06:56] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1446752 (10GWicke) A lot of the issues we encountered were caused by T105509. Since deploying the RESTBase hotfix to that issue, compact... [23:10:30] 10Ops-Access-Requests, 6operations: New production SSH key for AWight - https://phabricator.wikimedia.org/T105563#1446757 (10awight) 3NEW [23:11:42] 10Ops-Access-Requests, 6operations: New production SSH key for AWight - https://phabricator.wikimedia.org/T105563#1446766 (10awight) Second proof of self-ness here: https://office.wikimedia.org/wiki/User:Awight/pubkey3 [23:25:14] 10Ops-Access-Requests, 6operations: New production SSH key for AWight - https://phabricator.wikimedia.org/T105563#1446770 (10RobH) Adam, I currently have: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDCVYUIFsCWN5odLFNGGMUJ/dgC2BB/EJ14srHC61BIWouLlahZdCOT5F2Zeuhs+aTigaTWtaFrYAOIfiChcPNSffVEMI+RTbMSZ9gXJxY294aDVe3xdd... [23:26:15] 10Ops-Access-Requests, 6operations: New production SSH key for AWight - https://phabricator.wikimedia.org/T105563#1446775 (10awight) Yes, please delete the old key, that's from a retired laptop. [23:28:39] Mutante, know what's up with labs puppet? 
[23:28:55] (03PS1) 10RobH: updating adam wight's public ssh key [puppet] - 10https://gerrit.wikimedia.org/r/224200 [23:29:05] I only saw errors from tools hosts, fwiw [23:29:19] mobileandrew: not yet, what is it [23:29:32] Yeah, I worry only because it could reflect an nfs issue [23:29:34] see -labs [23:31:38] (03CR) 10RobH: [C: 032] updating adam wight's public ssh key [puppet] - 10https://gerrit.wikimedia.org/r/224200 (owner: 10RobH) [23:31:49] 10Ops-Access-Requests, 6operations: New production SSH key for AWight - https://phabricator.wikimedia.org/T105563#1446780 (10RobH) Well, either web entry into phabricator or an officewiki entry would work, both is more than enough. (As such, I'm just going to process this request now. https://gerrit.wikimed... [23:34:58] Krenair, feel free to email me if anything really breaks... Otherwise I'll fix puppet in the morning. Thanks for checking. [23:35:12] ok [23:36:00] 10Ops-Access-Requests, 6operations: New production SSH key for AWight - https://phabricator.wikimedia.org/T105563#1446782 (10RobH) 5Open>3Resolved a:3RobH [23:38:30] it's the adminbot again [23:40:37] 10Ops-Access-Requests, 6operations: New production SSH key for AWight - https://phabricator.wikimedia.org/T105563#1446784 (10awight) Thank you! [23:54:14] (03PS1) 10Dzahn: tendril: add config template [puppet] - 10https://gerrit.wikimedia.org/r/224205 [23:54:44] (03PS2) 10Dzahn: tendril: add config template [puppet] - 10https://gerrit.wikimedia.org/r/224205 (https://phabricator.wikimedia.org/T98816) [23:55:01] (03PS3) 10Dzahn: tendril: add config template [puppet] - 10https://gerrit.wikimedia.org/r/224205 (https://phabricator.wikimedia.org/T98816) [23:56:18] bblack, do you know how varnish config in beta works? 
[23:57:54] (03PS4) 10Smalyshev: Add definitions for WDQS service [puppet] - 10https://gerrit.wikimedia.org/r/223663 [23:58:36] (03CR) 10jenkins-bot: [V: 04-1] Add definitions for WDQS service [puppet] - 10https://gerrit.wikimedia.org/r/223663 (owner: 10Smalyshev)