[00:00:01] yea, agree
[00:00:17] mutante, do you need to list kartotherian & tilerator there?
[00:00:49] if those are the 2 services ("units"), yea
[00:00:58] (PS1) Nuria: [WIP] mark incoming requests without cookies as such [puppet] - https://gerrit.wikimedia.org/r/244626
[00:01:35] yurik: but you don't have anything like "service tilerator start/stop" either?
[00:01:59] ah, because tilerator-admins is a separate admin group
[00:02:01] mutante, i do have access to the start/stop/unmask for both of them
[00:02:03] as it should be
[00:02:38] not sure if they are units or not
[00:02:46] (need to read up on units)
[00:02:56] (PS2) Nuria: [WIP] Mark incoming requests without cookies in x-analytics [puppet] - https://gerrit.wikimedia.org/r/244626
[00:05:01] (PS1) Dzahn: admin: let kartotherian and tilerator admins read logs [puppet] - https://gerrit.wikimedia.org/r/244627 (https://phabricator.wikimedia.org/T115067)
[00:05:12] yurik: ^ i'm basically just copying other admin groups and don't want to introduce a difference unless we'd find out we really have to
[00:05:59] but you did not have journalctl and other groups usually do.. so yea
[00:06:45] i'm adding alex as you suggested
[00:07:00] mutante, so will this patch give access to the log, or will it give access only after the units are set up?
[00:07:47] i can give a +1 of course, but it won't matter much as i don't know the implications of it )
[00:07:50] i think you need the patch either way and setup the units maybe or they already exist
[00:08:01] ok, let's merge then
[00:08:07] (CR) Yurik: [C: 1] admin: let kartotherian and tilerator admins read logs [puppet] - https://gerrit.wikimedia.org/r/244627 (https://phabricator.wikimedia.org/T115067) (owner: Dzahn)
[00:08:23] if not, will work with akosiaris tomorrow
[00:08:36] yes, that please, i can't just merge access changes like that
[00:09:11] can we get another ops to review it today?
friday is not so good for service deployments :
[00:09:13] :)
[00:09:45] i don't know, it's kind of the worst time of day for that
[00:09:53] time zone wise
[00:10:05] yeah, i know. It's 3am here ))
[00:10:17] we need an opsen in japan
[00:10:25] ack
[00:11:00] yurik: what is a hostname this is on
[00:11:01] +1
[00:11:35] mutante, maps-test2001.codfw.wmnet
[00:13:48] bblack, if around, can you +2 a minor security change - https://gerrit.wikimedia.org/r/#/c/244627/
[00:13:49] yurik: https://phabricator.wikimedia.org/P2171
[00:14:27] mutante, only kartotherian - but scroll back to where it was crashing
[00:14:33] i recovered since then
[00:14:37] to an older version
[00:15:39] mutante, also, you pasted august :)
[00:15:46] i'm afraid that's all i have: -- Logs begin at Thu 2015-08-13 01:37:46 UTC, end at Fri 2015-10-09 00:14:56 UTC.
[00:16:06] ah
[00:16:07] mutante, Oct 08 23:08:30 is when it was happening
[00:16:28] Oct 08 23:08:24 maps-test2001 firejail[31631]: parent is shutting down, bye...
[00:16:31] Oct 08 23:08:24 maps-test2001 systemd[1]: kartotherian.service: main process exited, code=exited, status=1/FAILURE
[00:16:34] Oct 08 23:08:24 maps-test2001 systemd[1]: Unit kartotherian.service entered failed state.
[00:16:50] you want the Trace after that right
[00:17:08] mutante, i want the lines before the ones in the bug https://phabricator.wikimedia.org/T115067
[00:17:16] yurik: you might want to set up more verbose local logging, see for example https://wikitech.wikimedia.org/wiki/RESTBase#Debugging
[00:17:27] RoanKattouw: all good?
[00:21:05] gwicke, something weird is going on with the logging - the log level is at "warn", but i don't see any lines at all since 3 months ago
[00:22:33] AndyRussG: Should be
[00:22:35] yurik: perhaps double-check that it's really using the config you are looking at?
[00:22:50] yurik: https://phabricator.wikimedia.org/P2171#8885
[00:22:56] there's the trace
[00:23:07] also, if the process has write rights when using local files
[00:23:08] RoanKattouw: coolo!
[00:24:11] mutante, thanks! i'm beginning to guess that it's the firejail that's causing it ((
[00:24:16] Blocked-on-Operations, Discovery-Maps-Sprint, Patch-For-Review, service-runner: Kartotherian service logs inaccessible (systemd?) and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1714222 (Dzahn) here's the trace from Oct 8 --> https://phabricator.wikimedia.org/P2171#8885
[00:25:37] gwicke, it might be a file issue - will need to work with akosiaris on that. have you ever seen -- (node) warning: possible EventEmitter memory leak detected. 11 listeners added. Use emitter.setMaxListeners() to increase limit.
[00:28:32] (CR) Dzahn: [C: 2] deployment: fix firewalling for mira pt.2 [puppet] - https://gerrit.wikimedia.org/r/244624 (https://phabricator.wikimedia.org/T113351) (owner: Dzahn)
[00:31:07] yurik: yeah, I have seen that; it's often a bug in your code (forgetting to remove event listeners / use once() for example), but if you have a legitimate use case for > 10 listeners, you have to bump up the limit
[00:31:15] google should yield instructions for how
[00:31:30] but, really check that you want many listeners first
[00:31:44] gwicke, the problem is that it runs fine under my own account on the same server
[00:31:56] but not as a service
[00:32:46] i have cloned the depl repo under my home dir, and copied the config file with a slight logging modification - works fine
[00:33:01] will explore it further
[00:36:39] operations, Patch-For-Review: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1714245 (Dzahn) fixed. mira has firewalling now :) root@mira:/etc/ferm/conf.d# /etc/init.d/ferm start * Starting Firewall ferm [ OK ] see the resulti...
[00:37:32] (CR) Dzahn: "ferm is fixed now on mira and runs there.
we are getting closer. after double-checking that we can go ahead here and apply it to tin" [puppet] - https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: Hoo man)
[00:38:14] operations, Patch-For-Review: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1714246 (Dzahn)
[00:38:17] RoanKattouw: it's a bit weird, load.php on enwiki gives me the new code, but visiting the site I get the old centralNotice API
[00:38:36] Maybe caching for RL modules has gotten longer?
[00:38:44] Though I guess that wouldn't explain it...
[00:38:53] yurik: did you try starting the service with logstash logging only?
[00:39:02] that would rule out firejail file access issues
[00:39:11] operations, Patch-For-Review: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1662592 (Dzahn)
[00:39:13] operations, Deployment-Systems, Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1714249 (Dzahn)
[00:40:01] Maybe something w/ the localStorage module stash?
[00:40:10] operations, Deployment-Systems, Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1190083 (Dzahn) firewalling also done now. fixed in T113351 and active now.
[00:41:49] gwicke, i just managed to replicate it - it works with workers:0, but fails with ncpu
[00:42:08] might need to raise the event listeners
[00:42:14] (PS2) Dzahn: deactivate indiawikipedia.com [dns] - https://gerrit.wikimedia.org/r/244081
[00:43:26] (PS3) Dzahn: deactivate indiawikipedia.com [dns] - https://gerrit.wikimedia.org/r/244081
[00:44:49] (CR) Dzahn: [C: 2] deactivate indiawikipedia.com [dns] - https://gerrit.wikimedia.org/r/244081 (owner: Dzahn)
[00:44:57] RoanKattouw: ostriches: rmoen: Krenair: Maybe there's a problem with the version of JS that's being requested?
I don't have the new just-deployed code on enwiki
[00:44:59] https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=ext.centralNotice.bannerController%2CbannerHistoryLogger%2CchoiceData%2Cdisplay%2CgeoIP%2CkvStore%2CkvStoreMaintenance%2ClegacySupport%2CstartUp%7Cext.centralauth.centralautologin%7Cext.gadget.WatchlistBase%2CWatchlistGreenIndicators%7Cext.uls.init%2Cinterface%2Cpreferences%2Cwebfonts%7Cext.visualEditor.desktopArticleTarget.init%7Cext.visualEditor.
[00:44:59] track%2Cve%7Cjquery.byteLength%2Ccookie%2CtabIndex%2Cthrottle-debounce%2Ctipsy%7Cmediawiki.Title%2CUri%2Capi%2Ccldr%2Ccookie%2CjqueryMsg%2Clanguage%2Ctemplate%2Cuser%7Cmediawiki.language.data%2Cinit%7Cmediawiki.libs.pluralruleparser%7Cmediawiki.page.startup%7Cmediawiki.special.changeslist%7Cmediawiki.template.regexp%7Cmmv.head%7Cskins.vector.js%7Cuser.defaults&skin=vector&version=ff89d45e9785
[00:45:10] ^ fetches the old code
[00:45:22] https://en.wikipedia.org/w/load.php?modules=ext.centralNotice.display&debug=false
[00:45:28] ^ fetches the new just-deployed code
[00:45:43] The first URL is the one in my network console on enwiki
[00:46:03] Any pointers? I should smoke test, but on the new code 8p
[00:47:25] greg-g: Krinkle: ^ ?
[00:49:25] Is it just that I should wait? Did I miss something obvious?
[00:50:10] Helloooooooo?
[00:55:11] (CR) Yurik: "correct, lets keep it as is for now. Where would you like to document it? Tilerator in the intro already warns about it -- https://githu" [puppet] - https://gerrit.wikimedia.org/r/244437 (https://phabricator.wikimedia.org/T112914) (owner: Alexandros Kosiaris)
[00:56:56] thcipriani: ^ ?
[00:57:14] marktraceur: ^ ?
[00:57:22] * AndyRussG pings like a maniac
[00:57:50] maybe ori knows
[00:57:57] can anyone please tell me whassup with a RL deploy?
[00:58:44] ori: ^ ? Where'dja put mi code?
[00:59:26] Reedy: ^ ?
[01:00:12] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[01:02:21] MaxSem: ^ ?
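A minimal sketch of the listener handling gwicke describes at 00:31:07 for the kartotherian EventEmitter warning, using Node's built-in events module. The emitter names (bus, workers), the withListener helper and the limit of 32 are hypothetical and are not taken from kartotherian or service-runner; the log does not settle whether the warning comes from a genuine leak or from legitimate per-worker listeners.

```javascript
const { EventEmitter } = require('events');

// Hypothetical emitter; in a real service this could be a socket, a stream or `process`.
const bus = new EventEmitter();

// 1. once(): the handler detaches itself after it has fired.
bus.once('ready', () => {});
bus.emit('ready');
const afterOnce = bus.listenerCount('ready');

// 2. on() + removeListener(): detach explicitly when the work is finished,
//    even if the work throws.
function withListener(emitter, event, handler, work) {
  emitter.on(event, handler);
  try {
    return work();
  } finally {
    emitter.removeListener(event, handler);
  }
}
withListener(bus, 'data', () => {}, () => bus.emit('data', 42));
const afterRemove = bus.listenerCount('data');

// 3. Only when many listeners are genuinely needed: raise the per-emitter
//    limit instead of ignoring the warning. 32 is an arbitrary example value.
const workers = new EventEmitter();
workers.setMaxListeners(32);
for (let i = 0; i < 16; i++) {
  workers.on('message', () => {});
}
const workerListeners = workers.listenerCount('message');

console.log({ afterOnce, afterRemove, workerListeners, limit: workers.getMaxListeners() });
```

The first two patterns keep the listener count from growing; the third only silences the warning, so it is worth confirming that the listeners are really wanted before using it.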
[01:03:53] AndyRussG, touch and resync?
[01:04:16] MaxSem: Well I didn't do the deploy... should I try that?
[01:04:23] What do I touch?
[01:04:28] do that
[01:04:51] Resync as in run sync-dir like I would normally?
[01:04:54] cd CentralNotice && find -type f -exec touch {} \;
[01:05:18] It was a SWAT deploy
[01:05:23] then yeah, sync-dir
[01:05:31] MaxSem: what will that do?
[01:05:38] Hmm
[01:05:43] updates all file timestamps
[01:05:46] Sorry, I was in a meeting, should have /away-ed
[01:05:58] I'm not sure if updating file timestamps still works, but we could try
[01:06:19] AndyRussG: How can I tell, for myself, whether I'm looking at new code or old code?
[01:06:35] RoanKattouw: np, I'm just going nuts here, MaxSem has kindly been keeping me company :)
[01:06:55] errrgh
[01:06:57] riiight
[01:07:07] hashes are now used
[01:07:08] RoanKattouw: in the JS console, the old code will have a mw.centralNotice.getData() function
[01:07:26] OK, so if mw.centralNotice.getData exists, it's old
[01:07:30] And the new code will have a mw.centralNotice.getDataProperty() function
[01:07:35] RoanKattouw: yep
[01:08:12] OK confirmed that I get old code even with a hard refresh in incognito mode
[01:08:45] MaxSem: Hashes where? RL versions?
[01:09:02] did hashes of actual module content
[01:10:07] OK, I will try to touch the files, see if that works
[01:10:24] RoanKattouw: K thanks much!
[01:10:38] It does work on mediawiki.org for me BTW
[01:10:48] And on dewiki
[01:10:54] operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Cassandra/CQL query interface monitoring - https://phabricator.wikimedia.org/T93886#1714316 (Eevans) MVP here: https://github.com/eevans/icinga-checkcql, comments welcome.
[01:10:55] Hmm!
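A small sketch of why touching files may not help if module versions are derived from content hashes, as MaxSem suggests at 01:07:07. It assumes a SHA-1 of the file contents stands in for the version string; contentVersion and the temporary file are hypothetical, and this is not ResourceLoader's actual implementation.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical version string: a short hash of the file contents only.
function contentVersion(file) {
  return crypto.createHash('sha1').update(fs.readFileSync(file)).digest('hex').slice(0, 12);
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'rl-demo-'));
const file = path.join(dir, 'module.js');
fs.writeFileSync(file, 'function getData() { return 1; }\n');

const versionBefore = contentVersion(file);
const mtimeBefore = fs.statSync(file).mtimeMs;

// Equivalent of `touch`: move the timestamps without changing the contents.
const later = new Date(Date.now() + 60 * 1000);
fs.utimesSync(file, later, later);
const versionAfterTouch = contentVersion(file);
const mtimeAfterTouch = fs.statSync(file).mtimeMs;

// An actual content change.
fs.writeFileSync(file, 'function getDataProperty() { return 1; }\n');
const versionAfterEdit = contentVersion(file);

console.log({
  mtimeChanged: mtimeAfterTouch !== mtimeBefore,
  touchChangedVersion: versionAfterTouch !== versionBefore,
  editChangedVersion: versionAfterEdit !== versionBefore
});

// Clean up the temporary files.
fs.unlinkSync(file);
fs.rmdirSync(dir);
```

Under that assumption the timestamp moves but the version does not, which would be consistent with RoanKattouw's doubt at 01:05:58 that updating timestamps still works.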
[01:12:10] RoanKattouw: funny, on mediawiki.org I'm getting the old, though dewiki gives me the new
[01:13:40] I had thought it might be a problem with the localStorage module stash, but no, enwiki was pulling the old RL versions off the network, I guess due to the version URL param to load.php
[01:14:33] Yeah, probably
[01:14:35] Talking to ori IRL now
[01:14:44] K thx!
[01:15:06] FWIW I got no improvement after deleting the localStorage stash
[01:15:23] No, I know
[01:15:25] I tried in incognito mode
[01:15:28] (Ctrl+Shift+N in chrome)
[01:17:30] Ah yeah right
[01:17:34] !log catrope@tin Synchronized php-1.27.0-wmf.1/extensions/CentralNotice/resources/subscribing/ext.centralNotice.display.js: Add trailing newline to try to flush out ResourceLoader issue (duration: 02m 15s)
[01:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:18:14] !log Getting failures from sync-file / scap because mira.codfw.wmnet doesn't respond to ssh
[01:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious
[01:18:58] * RoanKattouw waits a minute for the startup cache to expire
[01:20:19] Oh, grah, the newline gets minified away of course
[01:20:31] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:20:32] hah, d'oh
[01:20:56] it responds to me...
[01:22:11] Ahm
[01:22:13] Oh, also
[01:22:14] wmf.1
[01:22:16] Silly me
[01:22:19] Everything is on wmf.2 now
[01:22:53] although it doesn't seem to respond internally for l10nupdate/mwdeploy?
[01:23:42] can get to it from bastions but not tin
[01:23:46] Aha
[01:23:54] Sounds like a firewall issue then
[01:24:01] Krenair: Would you mind filing a task about that?
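A sketch of the "newline gets minified away" observation at 01:20:19, assuming the module version is a hash of the minified output. naiveMinify is a deliberately crude stand-in (it only trims lines and drops empty ones) and minifiedVersion is hypothetical; neither is the code ResourceLoader actually uses.

```javascript
const crypto = require('crypto');

// Crude stand-in for a minifier: trim each line and drop empty ones.
function naiveMinify(source) {
  return source
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join('\n');
}

// Hypothetical version string: a short hash of the minified output.
function minifiedVersion(source) {
  return crypto.createHash('sha1').update(naiveMinify(source)).digest('hex').slice(0, 12);
}

const original = 'function getDataProperty( key ) {\n\treturn data[ key ];\n}';
const withTrailingNewline = original + '\n';
const withCodeChange = original.replace('data[ key ]', 'data[ key ] || null');

// A trailing newline disappears during minification, so the hash is unchanged.
const newlineChangesVersion =
  minifiedVersion(original) !== minifiedVersion(withTrailingNewline);

// A change that survives minification produces a different hash.
const codeChangesVersion =
  minifiedVersion(original) !== minifiedVersion(withCodeChange);

console.log({ newlineChangesVersion, codeChangesVersion });
```

Under these assumptions only an edit that survives minification changes the version, which is why a whitespace-only change would not be expected to flush the cached module.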
[01:24:17] !log catrope@tin Synchronized php-1.27.0-wmf.2/extensions/CentralNotice/resources/subscribing/ext.centralNotice.display.js: Add period to try to flush out ResourceLoader issue (duration: 01m 26s)
[01:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:24:26] At least I can report that the new CentralNotice code that does show up on dewiki smoke tests
[01:24:28] urandom: I love your coding style. https://github.com/eevans/icinga-checkcql/blob/master/check.js is a pleasure to read.
[01:24:54] Whee
[01:24:56] OK that fixed it for me
[01:25:08] nice
[01:25:15] * ori goes home, having not helped in any way whatsoever
[01:25:20] :)
[01:25:27] operations: ssh from tin to mira broken - https://phabricator.wikimedia.org/T115075#1714322 (Krenair) NEW
[01:25:31] Why did RL have to break when Timo is on a plane :S
[01:25:47] Thanks Krenair
[01:25:58] it's a test
[01:25:59] RoanKattouw: cool! yeah enwiki has the new code from here too... :D
[01:26:04] and we failed it, kids
[01:26:28] Eh depends on what was being tested
[01:26:48] RoanKattouw: should I worry the same issue might have happened to other projects but that it hasn't been noticed?
[01:26:56] RoanKattouw_away: also thanks a ton!!!!!
[01:27:12] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[01:28:05] ori: oh, thank you!
[01:46:43] PROBLEM - Restbase endpoints health on restbase-test2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:47:32] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[01:48:22] RECOVERY - Restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy
[01:55:56] Blocked-on-Operations, Discovery-Maps-Sprint, Patch-For-Review, service-runner: Kartotherian service logs inaccessible (systemd?)
and not updated (/var/log) - https://phabricator.wikimedia.org/T115067#1714347 (Yurik) a:akosiaris
[02:32:53] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 06m 11s)
[02:33:01] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: puppet fail
[02:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:34:53] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: puppet fail
[02:35:42] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[02:37:22] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 3.004 second response time on port 9042
[02:42:32] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[02:45:53] PROBLEM - puppet last run on wtp2002 is CRITICAL: CRITICAL: puppet fail
[02:49:22] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 3.001 second response time on port 9042
[02:54:31] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[02:56:29] (PS2) : On Beta Cluster: Use different logo for login form [mediawiki-config] - https://gerrit.wikimedia.org/r/243732 (https://phabricator.wikimedia.org/T115078) (owner: Jdlrobson)
[03:00:13] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[03:02:14] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[03:04:42] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.004 second response time on port 9042
[03:09:53] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[03:13:12] RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[03:16:32] RECOVERY - Analytics Cassanda CQL
query interface on aqs1002 is OK: TCP OK - 0.011 second response time on port 9042
[03:38:33] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[03:46:53] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 3.003 second response time on port 9042
[04:30:23] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection refused
[04:30:52] PROBLEM - Analytics Cassandra database on aqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon
[04:43:39] (PS1) Dzahn: deployment: fix firewalling for sync-file/scap tin [puppet] - https://gerrit.wikimedia.org/r/244633 (https://phabricator.wikimedia.org/T115075)
[04:44:13] RECOVERY - Analytics Cassandra database on aqs1003 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon
[04:44:29] (PS2) Dzahn: deployment: fix firewalling for sync-file/scap tin [puppet] - https://gerrit.wikimedia.org/r/244633 (https://phabricator.wikimedia.org/T115075)
[04:45:22] operations, Patch-For-Review: ssh from tin to mira broken - https://phabricator.wikimedia.org/T115075#1714454 (Dzahn)
[04:45:23] operations, Patch-For-Review: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1714453 (Dzahn)
[04:46:57] operations, Patch-For-Review: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1662592 (Dzahn) ..and this is why it's good we did this only on mira and not on tin yet. -> T115075
[04:47:02] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.006 second response time on port 9042
[04:49:30] (CR) Dzahn: [C: 2] deployment: fix firewalling for sync-file/scap tin [puppet] - https://gerrit.wikimedia.org/r/244633 (https://phabricator.wikimedia.org/T115075) (owner: Dzahn)
[04:53:55] (CR) Dzahn: "root@mira:~# iptables -L | grep "tin.eqiad"" [puppet] - https://gerrit.wikimedia.org/r/244633 (https://phabricator.wikimedia.org/T115075) (owner: Dzahn)
[04:55:41] PROBLEM - Analytics Cassanda CQL query interface on aqs1002 is CRITICAL: Connection timed out
[04:56:34] operations, Patch-For-Review: ssh from tin to mira broken - https://phabricator.wikimedia.org/T115075#1714464 (Dzahn) this happened with T113351#1714245 . but we just applied this on mira and not on tin yet partly for this reason, double-check if there are missing rules for deployment. please see the fix...
[04:57:00] operations, Patch-For-Review: ssh from tin to mira broken - https://phabricator.wikimedia.org/T115075#1714466 (Dzahn) a:Dzahn
[04:57:06] operations, Patch-For-Review: ssh from tin to mira broken - https://phabricator.wikimedia.org/T115075#1714467 (Dzahn) Open>Resolved
[04:57:07] operations, Patch-For-Review: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1662592 (Dzahn)
[04:57:13] RECOVERY - Analytics Cassanda CQL query interface on aqs1002 is OK: TCP OK - 0.998 second response time on port 9042
[04:58:15] operations: ssh from tin to mira broken - https://phabricator.wikimedia.org/T115075#1714322 (Dzahn)
[04:59:05] operations, Patch-For-Review: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1714472 (Dzahn) ssh from tin fixed. do we have any other issues to fix before we can put this on tin ?
[05:09:49] operations: ssh from tin to mira broken - https://phabricator.wikimedia.org/T115075#1714477 (Dzahn) ``` root@tin:~# nmap mira.codfw.wmnet -p 22 ..
Host is up (0.034s latency). PORT STATE SERVICE 22/tcp open ssh ```
[05:13:12] !log @RoanKattouw re: sync-file ssh to mira.codfw.wmnet: fixed! sorry. -> T115075#1714464
[05:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:14:50] Krenair: ^
[05:14:55] ..off
[05:18:45] operations, MediaWiki-extensions-BounceHandler: BounceHandler still HTTP posting to test2.wikipedia.org API in production - https://phabricator.wikimedia.org/T114984#1714478 (01tonythomas) >>! In T114984#1712361, @Jgreen wrote: >> but the bounce emails gets POSTed back to the API of test2.wikipedia.org fr...
[05:34:38] (PS3) Dzahn: deactivate wikimania.asia [dns] - https://gerrit.wikimedia.org/r/244103
[05:35:32] (PS4) Dzahn: deactivate wikimania.asia [dns] - https://gerrit.wikimedia.org/r/244103
[05:36:32] (PS5) Dzahn: deactivate wikimania.asia [dns] - https://gerrit.wikimedia.org/r/244103
[05:40:50] (PS1) Dzahn: apache: remove wikimania.asia redirect [puppet] - https://gerrit.wikimedia.org/r/244635
[05:44:54] (CR) Dzahn: [C: 1] "yep, that's right. role logstash does:" [puppet] - https://gerrit.wikimedia.org/r/244412 (https://phabricator.wikimedia.org/T104964) (owner: Muehlenhoff)
[05:49:57] (CR) Dzahn: "@akosiaris what do you think about this?
because i know we have discussed it a couple times one way or another" [puppet] - https://gerrit.wikimedia.org/r/242180 (owner: Muehlenhoff)
[05:51:19] (CR) Dzahn: [C: 1] Hiera-based assignment of grains for debdeploy [puppet] - https://gerrit.wikimedia.org/r/243142 (https://phabricator.wikimedia.org/T111006) (owner: Muehlenhoff)
[05:52:31] (CR) Dzahn: "once https://phabricator.wikimedia.org/T113351 comes to the conclusion that we fixed all remaining issues" [puppet] - https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: Hoo man)
[06:30:32] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 7 failures
[06:30:34] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: puppet fail
[06:31:01] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:32] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:12] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:12] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:12] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:21] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: puppet fail
[06:32:22] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:41] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:03] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[06:39:12] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[06:42:22] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:42:22] PROBLEM - HTTP error ratio anomaly
detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 4 below the confidence bounds
[06:51:59] (PS1) Ori.livneh: tcpconnstates: avoid duplicating 'network' key path segment [puppet] - https://gerrit.wikimedia.org/r/244637
[06:52:19] (PS2) Ori.livneh: tcpconnstates: avoid duplicating 'network' key path segment [puppet] - https://gerrit.wikimedia.org/r/244637
[06:52:26] (CR) Ori.livneh: [C: 2 V: 2] tcpconnstates: avoid duplicating 'network' key path segment [puppet] - https://gerrit.wikimedia.org/r/244637 (owner: Ori.livneh)
[06:52:29] (PS11) Muehlenhoff: Use one node definition for both tin and mira [puppet] - https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: Hoo man)
[06:52:31] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 5 below the confidence bounds
[06:56:02] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:56:41] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:57:32] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:33] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:57:42] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:37] (CR) Muehlenhoff: [C: 1] "Looks good to me. But we should coordinate before applying this on tin: Add it to https://wikitech.wikimedia.org/wiki/Deployments and merg" [puppet] - https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: Hoo man)
[07:01:02] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 5 below the confidence bounds
[07:29:42] (PS1) Ori.livneh: Diamond: enable TCP collector [puppet] - https://gerrit.wikimedia.org/r/244640
[07:29:55] (PS2) Ori.livneh: Diamond: enable TCP collector [puppet] - https://gerrit.wikimedia.org/r/244640
[07:30:07] (CR) Ori.livneh: [C: 2 V: 2] Diamond: enable TCP collector [puppet] - https://gerrit.wikimedia.org/r/244640 (owner: Ori.livneh)
[07:46:03] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0]
[07:48:15] (PS2) Muehlenhoff: Enable ferm on tin [puppet] - https://gerrit.wikimedia.org/r/240083
[07:48:57] (CR) Muehlenhoff: "All rules are now in place (and already running on mira), only needs some coordination with releng." [puppet] - https://gerrit.wikimedia.org/r/240083 (owner: Muehlenhoff)
[07:49:46] (Abandoned) Muehlenhoff: Enablke ferm for hadoop standby [puppet] - https://gerrit.wikimedia.org/r/237100 (owner: Muehlenhoff)
[08:00:12] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:00:39] (CR) Alexandros Kosiaris: [C: 1] "Yes, please!! base::firewall was just fine in site.pp for the migration but now that this has been almost done, moving it to roles is the " [puppet] - https://gerrit.wikimedia.org/r/242180 (owner: Muehlenhoff)
[08:16:09] ori: I guess we'll just delete the old metrics after https://gerrit.wikimedia.org/r/#/c/244637 ?
[08:16:55] godog: already on it
[08:19:13] sweet
[08:20:50] !log Purged graphite[12]001:/var/lib/carbon/whisper/servers/*/TcpConnStatesCollector and graphite[12]001:/var/lib/carbon/whisper/servers/*/network/work; cleaning up after https://gerrit.wikimedia.org/r/#/c/244637/
[08:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:21:07] godog: {{done}}
[08:21:49] \o/ thanks ori
[08:26:11] (CR) Filippo Giunchedi: [C: -1] "should this go to the role class instead?" [puppet] - https://gerrit.wikimedia.org/r/244614 (https://phabricator.wikimedia.org/T114059) (owner: Dzahn)
[08:26:25] (CR) Filippo Giunchedi: [C: -1] "should this go to the role class instead?" [puppet] - https://gerrit.wikimedia.org/r/244617 (https://phabricator.wikimedia.org/T114059) (owner: Dzahn)
[08:27:00] https://youtu.be/I6G0CnBSWVk?t=169
[08:35:50] (Abandoned) Filippo Giunchedi: cassandra: WIP support for multiple instances [puppet] - https://gerrit.wikimedia.org/r/231512 (https://phabricator.wikimedia.org/T95253) (owner: Filippo Giunchedi)
[08:46:20] (PS3) Alexandros Kosiaris: maps: Add tileratorui service [puppet] - https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T112914)
[08:46:22] (PS2) Alexandros Kosiaris: tilerator: comment about the port argument [puppet] - https://gerrit.wikimedia.org/r/244437 (https://phabricator.wikimedia.org/T112914)
[08:46:24] (PS1) Alexandros Kosiaris: maps: reorg the role classes [puppet] - https://gerrit.wikimedia.org/r/244645
[08:55:18] (PS1) Jcrespo: Hide tools shard as no labsdb host is shown [software/dbtree] - https://gerrit.wikimedia.org/r/244646
[08:56:25] (CR) Jcrespo: [C: 2] Hide tools shard as no labsdb host is shown [software/dbtree] - https://gerrit.wikimedia.org/r/244646 (owner: Jcrespo)
[08:56:34] (CR) Jcrespo: [V: 2] Hide tools shard as no labsdb host is shown [software/dbtree] - https://gerrit.wikimedia.org/r/244646
(owner: Jcrespo)
[08:57:33] (PS1) Filippo Giunchedi: restbase: move to systemd unit file [puppet] - https://gerrit.wikimedia.org/r/244647
[08:58:16] (CR) jenkins-bot: [V: -1] restbase: move to systemd unit file [puppet] - https://gerrit.wikimedia.org/r/244647 (owner: Filippo Giunchedi)
[09:01:10] (PS2) Filippo Giunchedi: restbase: move to systemd unit file [puppet] - https://gerrit.wikimedia.org/r/244647
[09:01:23] (CR) Filippo Giunchedi: [C: -1] restbase: move to systemd unit file [puppet] - https://gerrit.wikimedia.org/r/244647 (owner: Filippo Giunchedi)
[09:05:54] (CR) Alexandros Kosiaris: [C: 2] tilerator: comment about the port argument [puppet] - https://gerrit.wikimedia.org/r/244437 (https://phabricator.wikimedia.org/T112914) (owner: Alexandros Kosiaris)
[09:06:13] (CR) Alexandros Kosiaris: [C: 2] maps: reorg the role classes [puppet] - https://gerrit.wikimedia.org/r/244645 (owner: Alexandros Kosiaris)
[09:11:08] !log poweroff sodium, remove salt key, remove puppet storedconfigs in preparation for reinstall reinstall as a VM for temporary puppetmaster testing.
[09:11:18] !log poweroff rhodium, remove salt key, remove puppet storedconfigs in preparation for reinstall reinstall as a VM for temporary puppetmaster testing.
[09:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:11:20] sigh
[09:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:11:38] somehow I keep returning to our last lucid box...
[09:12:32] good riddance!
[09:17:52] !log deployed visual glitch fix to dbtree
[09:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:18:42] ^I've rebased directly from terbium, please shout at me if I supposed not to do that
[09:29:33] (CR) Daniel Kinzler: "QChris: if only there was a way to do this that didn't involve setting up this exact version of gerrit with our exact hacks on my own box." [puppet] - https://gerrit.wikimedia.org/r/242237 (owner: Daniel Kinzler)
[09:36:53] PROBLEM - puppet last run on mw2071 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:38:21] (PS1) Revi: Modify timezone for cswiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/244649 (https://phabricator.wikimedia.org/T115048)
[09:50:24] (PS1) Jcrespo: Add pt-heartbeat start & execution script to mariadb [puppet/mariadb] - https://gerrit.wikimedia.org/r/244651 (https://phabricator.wikimedia.org/T114752)
[09:56:54] (PS2) Jcrespo: Add pt-heartbeat start & execution script to mariadb [puppet/mariadb] - https://gerrit.wikimedia.org/r/244651 (https://phabricator.wikimedia.org/T114752)
[10:03:43] (PS5) Giuseppe Lavagetto: Convert logging from print to twisted.python.log [debs/pybal] - https://gerrit.wikimedia.org/r/244138
[10:04:33] RECOVERY - puppet last run on mw2071 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[10:08:53] (CR) Hashar: [C: 1] gerrit: add cert expiry check [puppet] - https://gerrit.wikimedia.org/r/244618 (https://phabricator.wikimedia.org/T114059) (owner: Dzahn)
[10:34:13] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[10:38:33] (PS1) Mobrovac: RESTBase: add global domain and Mathoid spec [puppet] - https://gerrit.wikimedia.org/r/244656 (https://phabricator.wikimedia.org/T102030)
[10:38:40] godog: ^^
[10:44:02] (PS1) Jcrespo: Set MariaDB 10 as the default version when using WMF packages
[puppet/mariadb] - 10https://gerrit.wikimedia.org/r/244657 [10:51:57] (03PS6) 10Giuseppe Lavagetto: Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 [10:58:14] (03CR) 10Alex Monk: "Looks like this also should have allowed the reverse (ssh from Mira to Tin) but didn't" [puppet] - 10https://gerrit.wikimedia.org/r/244633 (https://phabricator.wikimedia.org/T115075) (owner: 10Dzahn) [10:59:59] mobrovac: good to go now? [11:00:25] godog: yup, lemme disable puppet before you merge [11:00:56] !log restarting db1022's mysql (depooled) for configuration testing [11:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:01:41] mobrovac: kk [11:01:55] godog: kk, puppet disabled in prod, let's roll [11:02:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] RESTBase: add global domain and Mathoid spec [puppet] - 10https://gerrit.wikimedia.org/r/244656 (https://phabricator.wikimedia.org/T102030) (owner: 10Mobrovac) [11:02:52] mobrovac: yup, merged [11:02:55] godog: could you force puppet in staging? [11:03:54] mobrovac: yep [11:04:32] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra/CQL query interface monitoring - https://phabricator.wikimedia.org/T93886#1714965 (10mobrovac) >>! In T93886#1714316, @Eevans wrote: > MVP here: https://github.com/eevans/icinga-checkcql, comments welcome. +1, nice work @eevans. O... 
[11:05:49] mobrovac: should be good to go [11:05:57] thnx godog [11:06:00] * mobrovac deploying to staging [11:10:03] RECOVERY - Restbase root url on restbase-test2001 is OK: HTTP OK: HTTP/1.1 200 - 15118 bytes in 0.119 second response time [11:10:03] RECOVERY - Restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [11:10:43] expected ^ restbase was stopped there while looking into multi-instance cassandra [11:11:36] godog: all good, proceeding to prod [11:12:12] (03CR) 10Joal: "Sorry nuria, I can't help here: never coded Varnish stuff, so I would only say bullshit :)" [puppet] - 10https://gerrit.wikimedia.org/r/244626 (owner: 10Nuria) [11:12:20] mobrovac: kk, don't forget to !log [11:12:27] nope [11:13:38] godog: force puppet run in prod, please [11:16:19] !log force-run puppet on restbase after merging https://gerrit.wikimedia.org/r/#/c/244656/ [11:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:16:40] (03PS3) 10Muehlenhoff: Hiera-based assignment of grains for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/243142 (https://phabricator.wikimedia.org/T111006) [11:19:27] mobrovac: {{done}} [11:19:31] cheers [11:19:53] (03CR) 10Muehlenhoff: [C: 032 V: 032] Hiera-based assignment of grains for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/243142 (https://phabricator.wikimedia.org/T111006) (owner: 10Muehlenhoff) [11:22:40] !log restbase deploying aaee7c31 [11:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:25:03] PROBLEM - puppet last run on mc2002 is CRITICAL: CRITICAL: puppet fail [11:25:16] (03PS1) 10Muehlenhoff: Remove debdeploy grain for now, needs additional handling for systems without a value set via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/244667 (https://phabricator.wikimedia.org/T111006) [11:25:44] (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove debdeploy grain for now, needs additional handling for systems without a 
value set via Hiera [puppet] - 10https://gerrit.wikimedia.org/r/244667 (https://phabricator.wikimedia.org/T111006) (owner: 10Muehlenhoff) [11:26:32] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: puppet fail [11:26:43] RECOVERY - puppet last run on mc2002 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [11:27:22] PROBLEM - puppet last run on mc2004 is CRITICAL: CRITICAL: puppet fail [11:28:23] PROBLEM - puppet last run on mc2001 is CRITICAL: CRITICAL: puppet fail [11:29:12] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: puppet fail [11:29:30] 6operations, 6Services, 3Discovery-Maps-Sprint: Kartotherian does not start in producton - https://phabricator.wikimedia.org/T115074#1714990 (10mobrovac) [11:30:24] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail [11:30:33] 6operations, 6Services, 3Discovery-Maps-Sprint: Kartotherian does not start in producton - https://phabricator.wikimedia.org/T115074#1714993 (10mobrovac) Also, make sure #service-runner's at the latest version (0.2.10) [11:31:03] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: puppet fail [11:31:53] ^ the puppet failures on ms-fe3001 and mc2* are from me, fixed now, should recover soon [11:31:59] (03PS1) 10Giuseppe Lavagetto: Re-adding PyBalConfigurationObserverError [debs/pybal] - 10https://gerrit.wikimedia.org/r/244669 [11:32:26] (03CR) 10jenkins-bot: [V: 04-1] Re-adding PyBalConfigurationObserverError [debs/pybal] - 10https://gerrit.wikimedia.org/r/244669 (owner: 10Giuseppe Lavagetto) [11:33:49] (03PS2) 10Giuseppe Lavagetto: Re-adding PyBalConfigurationObserverError [debs/pybal] - 10https://gerrit.wikimedia.org/r/244669 [11:36:42] 6operations, 6Services, 3Discovery-Maps-Sprint: Kartotherian does not start in producton - https://phabricator.wikimedia.org/T115074#1714999 (10Yurik) Update: the service runs as a single instance, but fails as ncpu. 
This might be related to the fact that mapnik shared lib is used by a large number of npms [11:37:10] (03PS1) 10Muehlenhoff: Only create salt grain for non-empty returns from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/244671 (https://phabricator.wikimedia.org/T111006) [11:49:19] (03CR) 10Yurik: [C: 031] maps: Add tileratorui service [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T112914) (owner: 10Alexandros Kosiaris) [11:49:46] akosiaris, around? could you +2 https://gerrit.wikimedia.org/r/#/c/244627/ [11:50:02] !log bounce mathoid on sca100[12], stray instance found not running firejail [11:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:51:16] mobrovac: {{done}} [11:51:30] grazie godog! [11:51:33] works now [11:51:35] all good [11:52:22] (03CR) 10Yurik: "Please make sure we have access to the logs -- similar to https://gerrit.wikimedia.org/r/#/c/244627/1/modules/admin/data/data.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T112914) (owner: 10Alexandros Kosiaris) [11:53:23] RECOVERY - puppet last run on mc2004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [11:54:22] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:55:12] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [11:56:03] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1715039 (10fgiunchedi) >>! In T114711#1711861, @faidon wrote: > Having one zone per row sounds fine to me, as is the table with the final allocation of zones. > > I'm worrying a bit about two things... 
[11:56:04] RECOVERY - puppet last run on mc2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:56:11] mobrovac: sweet, no problem [11:56:23] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [11:56:53] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [11:58:58] 6operations, 10Mathoid, 10RESTBase, 6Services, 5Patch-For-Review: Document and hook up public mathoid end point in RB - https://phabricator.wikimedia.org/T102030#1715040 (10mobrovac) 5Open>3Resolved This has been deployed in production [11:59:06] 6operations, 10Mathoid, 10RESTBase, 6Services: Document and hook up public mathoid end point in RB - https://phabricator.wikimedia.org/T102030#1715043 (10mobrovac) [12:21:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [12:25:03] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [12:34:24] yurik: that's a sudo request. I can't just +2 unfortunately. need some input from the rest of ops [12:34:45] akosiaris, it was written by another op :) [12:36:00] I know [12:36:17] there's a policy for sudo requests. need to be discussed in meeting [12:36:44] mark may be able to expedite it [12:36:54] akosiaris, sure. From what i understood, this was copied from other services [12:36:59] mark ? 
https://gerrit.wikimedia.org/r/#/c/244627/1/modules/admin/data/data.yaml [12:56:12] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [13:01:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [13:02:01] I think the increased number of failures is Applebot hitting bad image urls [13:04:59] this 3rd-party error happens a lot: many external tools assume that a thumb is obtained by adding XXXpx- at the front, and that is not true for long urls [13:05:49] should we care? Should we return a different error code? [13:06:22] 6operations, 10MediaWiki-extensions-BounceHandler: BounceHandler still HTTP posting to test2.wikipedia.org API in production - https://phabricator.wikimedia.org/T114984#1715136 (10Jgreen) > How about we change that to meta.wikimedia.org ? We have the API listening in Meta - https://meta.wikimedia.org/w/api.php... [13:10:13] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [13:13:53] jynus: other than 404 ? [13:14:28] well, not we return 503 [13:14:31] *now [13:14:54] not sure what the issue is, but the way you describe it, it sounds like we should return 404 [13:15:05] but 503 is proxy error, so is that returned by varnish ? 
[13:16:03] I can check [13:16:05] well, generically it's service unavailable but not sure which urls you are talking about [13:16:34] let me give you an example [13:16:50] this image: https://commons.wikimedia.org/wiki/File:Fran%C3%A7ois_Ier_montre_%C3%A0_Marguerite_de_Navarre,_sa_s%C5%93ur,_les_vers_qu'il_vient_d'%C3%A9crire_sur_une_vitre_avec_son_diamant_-_Fleury_Fran%C3%A7ois_Richard_-_MBA_Lyon_2014.jpg [13:17:10] its thumb is https://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Fran%C3%A7ois_Ier_montre_%C3%A0_Marguerite_de_Navarre%2C_sa_s%C5%93ur%2C_les_vers_qu'il_vient_d'%C3%A9crire_sur_une_vitre_avec_son_diamant_-_Fleury_Fran%C3%A7ois_Richard_-_MBA_Lyon_2014.jpg/521px-thumbnail.jpg [13:17:24] (note the thumbnail.jpg at the end) [13:17:35] fine up to now [13:18:00] and the original is https://upload.wikimedia.org/wikipedia/commons/5/54/Fran%C3%A7ois_Ier_montre_%C3%A0_Marguerite_de_Navarre%2C_sa_s%C5%93ur%2C_les_vers_qu%27il_vient_d%27%C3%A9crire_sur_une_vitre_avec_son_diamant_-_Fleury_Fran%C3%A7ois_Richard_-_MBA_Lyon_2014.jpg [13:18:29] many applications assume, incorrectly [13:18:34] that the thumb will be on [13:19:47] cannot find the exact hit now :-) [13:19:56] (03CR) 10Rush: [C: 031] "yeah if no longer applicable please delete :)" [puppet] - 10https://gerrit.wikimedia.org/r/244555 (owner: 10Dzahn) [13:20:07] but on [name]/521px-[name] [13:20:36] upload or commons ? 
[13:20:40] for example: /wikipedia/commons/thumb/6/6b/Kitagawa_Utamaro_-_Toji_san_bijin_(Three_Beauties_of_the_Present_Day)From_Bijin-ga_(Pictures_of_Beautiful_Women),_published_by_Tsutaya_Juzaburo_-_Google_Art_Project.jpg/200px-Kitagawa_Utamaro_-_Toji_san_bijin_(Three_Beauties_of_the_Present_Day)From_Bijin-ga_(Pictures_of_Beautiful_Women),_published_by_Tsutaya_Juzaburo_-_Google_Art_Project.jpg [13:20:55] upload [13:21:17] they hit https://upload.wikimedia.org/wikipedia/commons/thumb/6/6b/Kitagawa_Utamaro_-_Toji_san_bijin_%28Three_Beauties_of_the_Present_Day%29From_Bijin-ga_%28Pictures_of_Beautiful_Women%29,_published_by_Tsutaya_Juzaburo_-_Google_Art_Project.jpg/200px-Kitagawa_Utamaro_-_Toji_san_bijin_%28Three_Beauties_of_the_Present_Day%29From_Bijin-ga_%28Pictures_of_Beautiful_Women%29,_published_by_Tsutaya_Juzaburo_-_Google_Art_Project.jpg [13:21:55] instead of the correct: https://upload.wikimedia.org/wikipedia/commons/thumb/6/6b/Kitagawa_Utamaro_-_Toji_san_bijin_%28Three_Beauties_of_the_Present_Day%29From_Bijin-ga_%28Pictures_of_Beautiful_Women%29,_published_by_Tsutaya_Juzaburo_-_Google_Art_Project.jpg/200px-thumbnail.jpg [13:22:40] 6operations, 10Architecture, 10Incident-20150423-Commons, 10MediaWiki-RfCs, and 6 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1715143 (10BBlack) Unless we have a bug, we already have the behavior you're asking for, on the primary cluster... [13:23:55] which leads me to the original thought, should we care about this incorrect behaviour? [13:24:18] incorrect by 3rd party apps, not us [13:25:22] yes, we should. 
we should be emitting a 404, not crash and have varnish handle it [13:26:22] I forget: https://phabricator.wikimedia.org/T106517 [13:26:26] *forgot [13:26:44] the entire thumbnailing infrastructure is under a lot of discussion btw [13:26:50] there was a meeting like a month ago [13:26:57] about the state of thumbnailing [13:27:37] I know, not caring much, it is just that from time to time, I forget about that ticket and we have a spike of 503 [13:27:52] not caring much == do not consider an urgency [13:28:18] (03PS3) 10Rush: Fix phabricator basedir in vcs.pp / phabricator-ssh-hook [puppet] - 10https://gerrit.wikimedia.org/r/244506 (https://phabricator.wikimedia.org/T100519) (owner: 1020after4) [13:33:11] 6operations, 6Services: Set up external uptime metrics for REST API - https://phabricator.wikimedia.org/T115022#1715150 (10chasemp) Sure, I created you an analyst account. There should be an email for you. There is an API as well, here is an example client lib https://github.com/jasonarewhy/catchpoint-api-py... [13:33:29] (03PS4) 10Rush: Fix phabricator basedir in vcs.pp / phabricator-ssh-hook [puppet] - 10https://gerrit.wikimedia.org/r/244506 (https://phabricator.wikimedia.org/T100519) (owner: 1020after4) [13:35:24] (03CR) 10Rush: [C: 032] Fix phabricator basedir in vcs.pp / phabricator-ssh-hook [puppet] - 10https://gerrit.wikimedia.org/r/244506 (https://phabricator.wikimedia.org/T100519) (owner: 1020after4) [13:36:37] 6operations, 7Varnish, 7Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 501 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1715152 (10jcrespo) We have a particular case that happens a lot, for files with long names like [[ https://commons.wikimedia.org/wiki/File:... 
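[note] The naming rule discussed above (long file names get `NNNpx-thumbnail.<ext>` instead of `NNNpx-<name>`) can be sketched roughly as follows. This is a minimal illustration, not the actual MediaWiki code: the function name `thumb_component` and the constant `ASSUMED_ABBRV_THRESHOLD` are hypothetical, and the byte threshold of 160 is an assumption about the repo configuration rather than something stated in this log.

```python
import os

# Assumption: names longer than this many bytes are abbreviated.
# The real value is a per-repo setting and is not given in the log.
ASSUMED_ABBRV_THRESHOLD = 160


def thumb_component(file_name: str, width: int,
                    threshold: int = ASSUMED_ABBRV_THRESHOLD) -> str:
    """Return the last path component of a thumbnail URL (hypothetical helper).

    Short names yield "<width>px-<name>"; names whose UTF-8 encoding exceeds
    the threshold yield "<width>px-thumbnail.<ext>", which is the case that
    third-party clients in the discussion get wrong.
    """
    name = file_name
    if len(file_name.encode("utf-8")) > threshold:
        # splitext keeps the leading dot; it is '' when there is no extension
        ext = os.path.splitext(file_name)[1]
        name = "thumbnail" + ext
    return f"{width}px-{name}"


if __name__ == "__main__":
    print(thumb_component("Example.jpg", 200))
    print(thumb_component("A" * 200 + ".jpg", 521))
```

A client that always prepends `NNNpx-` to the full name would match only the first case; for long names it would request a path that does not exist.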
[13:41:23] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 7 below the confidence bounds [13:43:21] (03CR) 10Ottomata: "Ok Ok ok, I'm going to move camus properties into puppet......." [puppet] - 10https://gerrit.wikimedia.org/r/244601 (https://phabricator.wikimedia.org/T113521) (owner: 10Madhuvishy) [13:43:41] (03CR) 10BBlack: [C: 04-1] "This should be in templates/varnish/analytics.inc.vcl.erb with the rest of the analytics code, probably just as inlines in the primary ana" [puppet] - 10https://gerrit.wikimedia.org/r/244626 (owner: 10Nuria) [13:45:42] (03CR) 10Hashar: [C: 031] varnish: minor lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/243854 (owner: 10Dzahn) [13:46:12] (03PS1) 10Rush: phab: vcs user manages ssh interaction [puppet] - 10https://gerrit.wikimedia.org/r/244679 [13:46:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 7 below the confidence bounds [13:47:38] (03PS2) 10Rush: phab: vcs user manages ssh interaction [puppet] - 10https://gerrit.wikimedia.org/r/244679 [13:47:55] (03PS2) 10Alexandros Kosiaris: Update servermon configuration for 0.7 [puppet] - 10https://gerrit.wikimedia.org/r/223347 [13:50:32] (03CR) 10Rush: [C: 032] phab: vcs user manages ssh interaction [puppet] - 10https://gerrit.wikimedia.org/r/244679 (owner: 10Rush) [13:53:36] !log Restarted Zuul, had a deadlocked job [13:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:20] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra/CQL query interface monitoring - https://phabricator.wikimedia.org/T93886#1715162 (10fgiunchedi) >>! In T93886#1714316, @Eevans wrote: > MVP here: https://github.com/eevans/icinga-checkcql, comments welcome. LGTM overall! some poi... 
[14:00:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [14:04:23] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [14:05:23] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [14:05:25] (03PS4) 10BBlack: varnish: minor lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/243854 (owner: 10Dzahn) [14:05:55] (03PS2) 10Muehlenhoff: Only create salt grain for non-empty returns from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/244671 (https://phabricator.wikimedia.org/T111006) [14:06:21] (03CR) 10BBlack: [C: 032] varnish: minor lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/243854 (owner: 10Dzahn) [14:06:33] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra/CQL query interface monitoring - https://phabricator.wikimedia.org/T93886#1715180 (10Eevans) >>! In T93886#1714965, @mobrovac wrote: >>>! In T93886#1714316, @Eevans wrote: >> MVP here: https://github.com/eevans/icinga-checkcql, com... 
[14:11:22] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [14:16:24] (03PS3) 10Muehlenhoff: Only create salt grain for non-empty returns from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/244671 (https://phabricator.wikimedia.org/T111006) [14:16:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [14:26:56] (03CR) 10Muehlenhoff: [C: 032 V: 032] Only create salt grain for non-empty returns from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/244671 (https://phabricator.wikimedia.org/T111006) (owner: 10Muehlenhoff) [14:28:16] (03PS2) 10Rush: phab: set diffusion.ssh-host for diffusion [puppet] - 10https://gerrit.wikimedia.org/r/244683 [14:28:25] (03CR) 10Rush: [C: 032 V: 032] phab: set diffusion.ssh-host for diffusion [puppet] - 10https://gerrit.wikimedia.org/r/244683 (owner: 10Rush) [14:31:43] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:33:23] 6operations, 5Patch-For-Review: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1715205 (10Krenair) See my comment on https://gerrit.wikimedia.org/r/#/c/244633/2 [14:34:22] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [14:37:32] !log more configuration testing (with puppet disabled) and several mysql restarts on db1022 [14:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:32] (03PS1) 10Ottomata: Add camus module, use camus::job to import eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/244684 (https://phabricator.wikimedia.org/T115114) [14:38:58] (03PS2) 10Ottomata: Add camus module, use camus::job to import eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/244684 (https://phabricator.wikimedia.org/T115114) 
[14:39:35] (03CR) 10jenkins-bot: [V: 04-1] Add camus module, use camus::job to import eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/244684 (https://phabricator.wikimedia.org/T115114) (owner: 10Ottomata) [14:41:11] (03PS3) 10Ottomata: Add camus module, use camus::job to import eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/244684 (https://phabricator.wikimedia.org/T115114) [14:41:48] (03CR) 10jenkins-bot: [V: 04-1] Add camus module, use camus::job to import eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/244684 (https://phabricator.wikimedia.org/T115114) (owner: 10Ottomata) [14:43:30] (03PS4) 10Ottomata: Add camus module, use camus::job to import eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/244684 (https://phabricator.wikimedia.org/T115114) [14:44:04] (03CR) 10jenkins-bot: [V: 04-1] Add camus module, use camus::job to import eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/244684 (https://phabricator.wikimedia.org/T115114) (owner: 10Ottomata) [14:44:49] (03PS1) 10Muehlenhoff: Migrate initial manually configured debdeploy grains to values read from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/244685 (https://phabricator.wikimedia.org/T111006) [14:45:12] (03PS5) 10Ottomata: Add camus module, use camus::job to import eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/244684 (https://phabricator.wikimedia.org/T115114) [14:45:14] (03PS2) 10Muehlenhoff: Migrate initial manually configured debdeploy grains to values read from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/244685 (https://phabricator.wikimedia.org/T111006) [14:46:28] (03CR) 10Muehlenhoff: [C: 032 V: 032] Migrate initial manually configured debdeploy grains to values read from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/244685 (https://phabricator.wikimedia.org/T111006) (owner: 10Muehlenhoff) [14:46:37] (03CR) 10Ottomata: [C: 032] Add camus module, use camus::job to import eventlogging data [puppet] - 
10https://gerrit.wikimedia.org/r/244684 (https://phabricator.wikimedia.org/T115114) (owner: 10Ottomata) [14:46:45] (03PS6) 10Ottomata: Add camus module, use camus::job to import eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/244684 (https://phabricator.wikimedia.org/T115114) [14:46:51] (03CR) 10Ottomata: [V: 032] Add camus module, use camus::job to import eventlogging data [puppet] - 10https://gerrit.wikimedia.org/r/244684 (https://phabricator.wikimedia.org/T115114) (owner: 10Ottomata) [14:47:06] moritzm: :) merge yours ok? [14:47:21] (03CR) 10Alex Monk: "See I6c0143f9c" [puppet] - 10https://gerrit.wikimedia.org/r/244633 (https://phabricator.wikimedia.org/T115075) (owner: 10Dzahn) [14:47:53] did that a second ago [14:48:00] oh ok [14:48:03] ja [14:48:03] see [14:48:22] (03PS1) 10Alex Monk: Update a couple of firewall rules to include mira alongside tin [puppet] - 10https://gerrit.wikimedia.org/r/244686 (https://phabricator.wikimedia.org/T113351) [14:51:53] 6operations, 7Database: implement performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1715270 (10jcrespo) With its default configuration, due to our extreme configuration of the following variables: ``` #max_connections = 5000 #table_open_cache = 50000 #table... 
[14:54:02] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection refused [14:55:39] (03PS1) 10Ottomata: Suffix each kafka broker with :9092 for camus::job [puppet] - 10https://gerrit.wikimedia.org/r/244687 (https://phabricator.wikimedia.org/T115114) [14:55:44] (03CR) 10jenkins-bot: [V: 04-1] Suffix each kafka broker with :9092 for camus::job [puppet] - 10https://gerrit.wikimedia.org/r/244687 (https://phabricator.wikimedia.org/T115114) (owner: 10Ottomata) [14:55:50] (03PS2) 10Ottomata: Suffix each kafka broker with :9092 for camus::job [puppet] - 10https://gerrit.wikimedia.org/r/244687 (https://phabricator.wikimedia.org/T115114) [14:55:52] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.005 second response time on port 9042 [14:56:36] (03CR) 10Ottomata: [C: 032] Suffix each kafka broker with :9092 for camus::job [puppet] - 10https://gerrit.wikimedia.org/r/244687 (https://phabricator.wikimedia.org/T115114) (owner: 10Ottomata) [15:00:37] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1715316 (10Andrew) We should be able to use out-of-warranty 8g systems for: - labservices2001 (designate, pdns, future ldap) - labmetal2001 -- a host d... 
[15:02:03] (03PS1) 10Ottomata: Use camus::job for webrequest data import [puppet] - 10https://gerrit.wikimedia.org/r/244688 (https://phabricator.wikimedia.org/T115114) [15:03:35] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1715323 (10coren) I'd call labnet2001 "labnet"; it's the network control node independently of the technology (imo) [15:03:40] (03CR) 10Ottomata: [C: 032] Use camus::job for webrequest data import [puppet] - 10https://gerrit.wikimedia.org/r/244688 (https://phabricator.wikimedia.org/T115114) (owner: 10Ottomata) [15:05:44] (03PS1) 10Ottomata: Add slog alias for tailing syslog to otto's .bash_aliases [puppet] - 10https://gerrit.wikimedia.org/r/244689 [15:05:57] (03PS2) 10Ottomata: Add slog alias for tailing syslog to otto's .bash_aliases [puppet] - 10https://gerrit.wikimedia.org/r/244689 [15:06:02] (03CR) 10Ottomata: [C: 032 V: 032] Add slog alias for tailing syslog to otto's .bash_aliases [puppet] - 10https://gerrit.wikimedia.org/r/244689 (owner: 10Ottomata) [15:06:59] !log starting nodetool cleanup on restbase-test2002 [15:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:37] (03PS1) 10Ottomata: Remove unused camus cron job [puppet] - 10https://gerrit.wikimedia.org/r/244690 (https://phabricator.wikimedia.org/T115114) [15:12:28] (03CR) 10Ottomata: [C: 032] Remove unused camus cron job [puppet] - 10https://gerrit.wikimedia.org/r/244690 (https://phabricator.wikimedia.org/T115114) (owner: 10Ottomata) [15:13:23] (03PS1) 10Muehlenhoff: releases: move base::firewall into the role [puppet] - 10https://gerrit.wikimedia.org/r/244691 [15:14:19] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1715347 (10Andrew) [15:15:19] !log bouncing Cassandra on restbase-test2001-a [15:15:26] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:18] (03PS2) 10Ottomata: Add cron that schedules camus imports for mediawiki Avro Binary data [puppet] - 10https://gerrit.wikimedia.org/r/244601 (https://phabricator.wikimedia.org/T113521) (owner: 10Madhuvishy) [15:16:48] (03PS1) 10Muehlenhoff: dnsrecursor: Move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/244692 [15:20:00] (03PS1) 10John F. Lewis: mw_rc_irc: standarise puppet naming [puppet] - 10https://gerrit.wikimedia.org/r/244695 [15:21:33] (03CR) 10Ottomata: "Done, camus properties now are here instead of refinery. Madhu, I've moved your mediawiki properties file over here, and templatized kafk" [puppet] - 10https://gerrit.wikimedia.org/r/244601 (https://phabricator.wikimedia.org/T113521) (owner: 10Madhuvishy) [15:22:47] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1715359 (10Andrew) We have two 8g servers out of warranty, and two 16g servers out of warranty. So if we don't mind gobbling up all of those we can pha... [15:25:01] 6operations, 10RESTBase-Cassandra, 5Patch-For-Review: better cassandra process checks - https://phabricator.wikimedia.org/T108306#1715375 (10Eevans) [15:25:37] (03CR) 10John F. Lewis: [C: 04-1] "change to modules/cdh keeps getting noticed in my changes." [puppet] - 10https://gerrit.wikimedia.org/r/244695 (owner: 10John F. Lewis) [15:29:13] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [15:29:14] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [15:29:52] (03Abandoned) 10John F. Lewis: mw_rc_irc: standarise puppet naming [puppet] - 10https://gerrit.wikimedia.org/r/244695 (owner: 10John F. 
Lewis) [15:32:34] 6operations, 10Architecture, 10Incident-20150423-Commons, 10MediaWiki-RfCs, and 6 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#1715422 (10GWicke) > Are you sure that the cache_parsoid varnish cluster is not (perhaps indirectly) involve in... [15:32:51] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1715423 (10Eevans) restbase-test2001.codfw.wmnet has now been configured for 2 instances ({a,b}), but something is amiss with the data sizes. ```... [15:35:09] !log performing Cassandra cleanup on restbase-test2003.codfw [15:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:36:01] 6operations, 10Traffic, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1715433 (10Joe) I finally got a repeatable way to reproduce this behaviour: - set a service to test IdleConnection only on a backend running apache - stop apache... [15:36:52] <_joe_> ottomata: you have one patch to merge on palladium [15:37:32] 6operations, 7Database: implement performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1715447 (10jcrespo) By setting the following configuration: ``` # Activating P_S by default performance_schema = 1 # downsizing performance schema memory usage: T99485 performance_schema_digests_size... 
[15:37:53] _joe_: ah, sorry, yeah, was about to submit one more...then standup caught me [15:38:05] (03PS1) 10Ottomata: Logrotate camus log files [puppet] - 10https://gerrit.wikimedia.org/r/244696 (https://phabricator.wikimedia.org/T110598) [15:40:14] 6operations, 7Database: implement performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1715464 (10jcrespo) Setting `performance_schema_digests_size = 5000` [15:41:26] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Replace uses of monitoring::ganglia with monitoring::graphite_* [5 pts] - https://phabricator.wikimedia.org/T90642#1715469 (10Ottomata) [15:46:08] (03CR) 10Ottomata: [C: 032] Logrotate camus log files [puppet] - 10https://gerrit.wikimedia.org/r/244696 (https://phabricator.wikimedia.org/T110598) (owner: 10Ottomata) [15:46:19] (03PS1) 10John F. Lewis: mw_rc_irc: rename module to standard naming [puppet] - 10https://gerrit.wikimedia.org/r/244699 [15:46:41] <_joe_> ottomata: if I score points on the analytics board by solving the ganglia thing, what do I win? [15:46:48] ganglia thing? [15:47:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:47:12] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [15:47:15] <_joe_> https://phabricator.wikimedia.org/T90642 [15:47:32] <_joe_> I mean it's like a competition or do you get like a teddy bear when you reach 10 points, a pony when you score 10000 etc.? [15:47:34] _joe_: it is already done! [15:47:35] ohhh [15:47:36] hahah [15:47:38] the pts [15:47:42] <_joe_> eheh [15:47:47] (03PS2) 10John F. 
Lewis: mw_rc_irc: rename module to standard naming [puppet] - 10https://gerrit.wikimedia.org/r/244699 [15:47:56] <_joe_> sorry, making fun of all religions (including Scrum) is part of my dna :P [15:48:00] i think the only real value is making grace very happy [15:48:33] _joe_, you'll be happy to know, that this quarter we resolved more tasks and increased our point velocity compared to last quarter [15:48:41] <_joe_> although if scoring more than 40 points a week got you free beer that'd be cool [15:48:43] which to me means: we successfully implemented more bureaucracy [15:48:47] <_joe_> ottomata: I already know [15:48:57] <_joe_> I was in the QR meeting :P [15:49:00] hehe [15:49:27] more tasks were created and dragged across the kanban board! [15:49:33] woohoo! [15:50:01] (03PS1) 10Rush: phab: manage phab-ssh service [puppet] - 10https://gerrit.wikimedia.org/r/244700 [15:50:33] (03PS1) 10John F. Lewis: lists: use more sane values for queue checks [puppet] - 10https://gerrit.wikimedia.org/r/244701 [15:51:14] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1715480 (10Eevans) An additional point of data, in the logs on xenon.eqiad.wmnet (10.64.0.200), there are CF id mismatches relating to the CF in q... [15:51:21] (03PS2) 10Rush: phab: manage phab-ssh service [puppet] - 10https://gerrit.wikimedia.org/r/244700 [15:52:11] (03CR) 10Rush: [C: 032] phab: manage phab-ssh service [puppet] - 10https://gerrit.wikimedia.org/r/244700 (owner: 10Rush) [15:56:55] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1715483 (10fgiunchedi) on the peer side (xenon in this case) this shows ```lines=5 INFO [STREAM-IN-/10.192.16.154] 2015-10-09 15:18:14,136 Strea... 
[15:59:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [16:00:27] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, and 2 others: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1715484 (10chasemp) >>! In T100519#1707242, @Legoktm wrote: >>>! In T100519#1706391, @chasemp wrote: >> @demon explained some of the histor... [16:04:44] !log rolling restart cassandra test cluster T95253 [16:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:39] 6operations, 6Analytics-Kanban, 10netops, 5Patch-For-Review: Puppetize a server with a role that sets up Cassandra on Analytics machines [13 pts] {slug} - https://phabricator.wikimedia.org/T107056#1715503 (10kevinator) 5Open>3Resolved [16:12:30] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Fix active namenode monitoring so that ANY active namenode is an OK state. [8 pts] - https://phabricator.wikimedia.org/T89463#1715523 (10kevinator) 5Open>3Resolved [16:14:00] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1715528 (10RobH) I just want to ensure we have a summary so far, since the task description has since shifted to be slightly innaccurate: * This is a p... [16:18:06] (03PS1) 10Muehlenhoff: etherpad: move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/244706 [16:18:11] 6operations, 7Database: implement performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1715535 (10jcrespo) This seem to work ok for now, with a 100M footprint: ``` # Activating P_S by default performance_schema = 1 # downsizing performance schema memory usage: T99485 performance_schema_di... 
[16:18:22] (03PS7) 10Giuseppe Lavagetto: Convert logging from print to twisted.python.log [debs/pybal] - 10https://gerrit.wikimedia.org/r/244138 [16:19:22] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [16:20:09] (03PS1) 10Muehlenhoff: lists: move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/244709 [16:21:53] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Replace uses of monitoring::ganglia with monitoring::graphite_* [5 pts] - https://phabricator.wikimedia.org/T90642#1715547 (10kevinator) 5Open>3Resolved [16:22:43] PROBLEM - Restbase endpoints health on restbase-test2003 is CRITICAL: /media/math/{format} is CRITICAL: Test Mathoid - test formula returned the unexpected status 500 (expecting: 200) [16:24:24] RECOVERY - Restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [16:24:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [16:30:02] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [16:33:11] (03PS1) 10Jcrespo: Add the posibility of enabling the performance_schema engine [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/244710 [16:34:09] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1715592 (10Andrew) [16:39:03] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [16:39:32] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [16:41:14] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on 
port 11212 [16:44:24] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [16:49:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [16:51:22] PROBLEM - Restbase endpoints health on restbase-test2003 is CRITICAL: /page/title/{title} is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200) [16:53:12] RECOVERY - Restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [16:55:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [17:03:04] (03PS1) 10Jcrespo: [WIP] Enabling Async IO on newer kernels and selectively, P_S [puppet] - 10https://gerrit.wikimedia.org/r/244713 [17:06:43] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [17:08:06] (03CR) 10Dzahn: "my idea was that i wanted it right next to where the cert is installed" [puppet] - 10https://gerrit.wikimedia.org/r/244614 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [17:09:17] (03CR) 10Dzahn: "i found it makes sense to have it right next to where the cert is installed, if that would have been in the role i'd added this there too." [puppet] - 10https://gerrit.wikimedia.org/r/244617 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [17:09:29] (03PS2) 10John F. 
Lewis: lists: use more sane values for queue checks [puppet] - 10https://gerrit.wikimedia.org/r/244701 [17:11:02] (03CR) 10Dzahn: "@fillipo here i put it in the role, just whereever the cert is installed" [puppet] - 10https://gerrit.wikimedia.org/r/244618 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [17:12:01] (03CR) 10Dzahn: [C: 032] lists: use more sane values for queue checks [puppet] - 10https://gerrit.wikimedia.org/r/244701 (owner: 10John F. Lewis) [17:17:00] (03CR) 10Dzahn: "besides the list server we also use spamassassin on role::mail::mx hosts, i checked it's also not "spamd" but "debian-spamd" on mx1001" [puppet] - 10https://gerrit.wikimedia.org/r/244555 (owner: 10Dzahn) [17:19:41] (03CR) 10John F. Lewis: [C: 031] "seems Debian uses the new id and group so this can be removed as unnecessary." [puppet] - 10https://gerrit.wikimedia.org/r/244555 (owner: 10Dzahn) [17:21:51] (03CR) 10Jdlrobson: "Do we have any suitable logos in another config variable somewhere...?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243732 (https://phabricator.wikimedia.org/T115078) (owner: 10Jdlrobson) [17:21:53] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [17:24:12] (03PS2) 10Dzahn: admin: remove spamd from enforce-users-groups [puppet] - 10https://gerrit.wikimedia.org/r/244555 [17:24:39] (03CR) 10Dzahn: [C: 032] "yep, don't see it anywhere, also not with salt" [puppet] - 10https://gerrit.wikimedia.org/r/244555 (owner: 10Dzahn) [17:26:59] andrewbogott, wait, are you deploying that backport now? 
[17:29:28] jdlrobson: betacommons.png betawikidata.png betawiki.png betawikiversity.png [17:31:43] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:32:02] (03PS1) 10Chad: Support easy cloning of git repositories from Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/244715 [17:32:38] (03CR) 10jenkins-bot: [V: 04-1] Support easy cloning of git repositories from Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/244715 (owner: 10Chad) [17:33:42] (03CR) 10Dzahn: "/srv/mediawiki/w/static/images/project-logos# ls *beta*" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243732 (https://phabricator.wikimedia.org/T115078) (owner: 10Jdlrobson) [17:34:01] (03PS2) 10Chad: Support easy cloning of git repositories from Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/244715 [17:35:54] Where does our PHP fatal error page come from these days? [17:37:07] https://github.com/wikimedia/operations-mediawiki-config/blob/master/hhvm-fatal-error.php [17:38:55] we really need to get those dblists out of the root of the directory [17:39:47] thanks Reedy [17:40:25] 6operations, 6Services: Set up external uptime metrics for REST API - https://phabricator.wikimedia.org/T115022#1715753 (10GWicke) @chasemp: Thanks, am logged in & am getting nice info. Small tidbits: - uptime last month, globally: 99.977% - mean latency for Foobar around 200ms from most locations, 70ms from... 
[17:41:52] (03PS1) 10Ori.livneh: IdleConnection: set keepalive [debs/pybal] - 10https://gerrit.wikimedia.org/r/244717 (https://phabricator.wikimedia.org/T113151) [17:41:57] ^ _joe_ [17:42:19] (03CR) 10jenkins-bot: [V: 04-1] IdleConnection: set keepalive [debs/pybal] - 10https://gerrit.wikimedia.org/r/244717 (https://phabricator.wikimedia.org/T113151) (owner: 10Ori.livneh) [17:42:43] why you always gotta be in my shit, jenkins [17:45:52] 6operations, 6Services: Set up external uptime metrics for REST API - https://phabricator.wikimedia.org/T115022#1715775 (10GWicke) 5Open>3Resolved a:3GWicke [17:55:34] chasemp: ping [17:55:45] pong? [17:56:03] hey, just noticed that the test is still hitting rest.wikimedia.org, which is deprecated [17:56:25] could you change that to https://en.wikipedia.org/api/rest_v1/page/html/Foobar ? [17:57:46] at http://portal.catchpoint.com/ui/Content/Tests/TestDetail.aspx?id=97770 [17:58:20] done [17:58:30] grazie! [17:58:30] (03PS1) 10BryanDavis: vagrant-lxc: Fix sudo rule for finding lcx command paths [puppet] - 10https://gerrit.wikimedia.org/r/244718 (https://phabricator.wikimedia.org/T115080) [18:00:32] (03PS2) 10Rush: vagrant-lxc: Fix sudo rule for finding lcx command paths [puppet] - 10https://gerrit.wikimedia.org/r/244718 (https://phabricator.wikimedia.org/T115080) (owner: 10BryanDavis) [18:02:19] (03CR) 10Rush: [C: 032] vagrant-lxc: Fix sudo rule for finding lcx command paths [puppet] - 10https://gerrit.wikimedia.org/r/244718 (https://phabricator.wikimedia.org/T115080) (owner: 10BryanDavis) [18:04:05] (03CR) 10Dzahn: "https://commons.wikimedia.org/wiki/Category:Wikimedia_beta_logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243732 (https://phabricator.wikimedia.org/T115078) (owner: 10Jdlrobson) [18:07:07] andrewbogott, okay I'm just going to sync this then [18:07:34] in future, please do not merge to the deployment branches unless you're planning to actually deploy the commit [18:08:55] (03CR) 10Mdann52: "I apologise for 
that - it is just moving chunks of text about, changing the bug ID's to Txxxxx and one fix to some broken code. I don't kn" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) (owner: 10Mdann52) [18:13:19] !log krenair@tin Synchronized php-1.27.0-wmf.2/extensions/OpenStackManager/nova/OpenStackNovaProject.php: https://gerrit.wikimedia.org/r/#/c/244707/ (duration: 01m 13s) [18:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:20:55] (03PS2) 10Ori.livneh: IdleConnection: set keepalive [debs/pybal] - 10https://gerrit.wikimedia.org/r/244717 (https://phabricator.wikimedia.org/T113151) [18:21:46] (03CR) 10Dzahn: [C: 031] "after Alex comments on other changes, i am convinced this is right now:)" [puppet] - 10https://gerrit.wikimedia.org/r/243123 (owner: 10Muehlenhoff) [18:22:26] (03PS2) 10Dzahn: releases: Move the base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/243123 (owner: 10Muehlenhoff) [18:22:52] (03CR) 10Dzahn: [C: 032] releases: Move the base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/243123 (owner: 10Muehlenhoff) [18:24:00] (03PS1) 10John F. Lewis: icinga: ensure hiera lookups for all contact_group defs [puppet] - 10https://gerrit.wikimedia.org/r/244722 [18:24:10] (03PS2) 10John F. 
Lewis: icinga: ensure hiera lookups for all contact_group defs [puppet] - 10https://gerrit.wikimedia.org/r/244722 [18:24:58] (03CR) 10Dzahn: [C: 031] Move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/243122 (owner: 10Muehlenhoff) [18:26:01] (03PS1) 10Catrope: Enable Flow beta feature on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244724 (https://phabricator.wikimedia.org/T115100) [18:26:13] (03CR) 10Dzahn: [C: 031] Move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/242180 (owner: 10Muehlenhoff) [18:27:19] (03CR) 10Catrope: [C: 04-2] "Hold until evening SWAT on Oct 12" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244724 (https://phabricator.wikimedia.org/T115100) (owner: 10Catrope) [18:27:25] RoanKattouw: Woo. Exciting. [18:29:36] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1715930 (10GWicke) I think I encountered a similar problem after accidentally bootstrapping a node off itself in the past (by having it listed as... [18:29:52] (03PS2) 10Dzahn: lists: move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/244709 (owner: 10Muehlenhoff) [18:29:59] (03CR) 10Alex Monk: [C: 031] Rename Azerbaijani Wikisource project and namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242096 (https://phabricator.wikimedia.org/T114002) (owner: 10Siebrand) [18:30:21] (03CR) 10Dzahn: [C: 032] lists: move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/244709 (owner: 10Muehlenhoff) [18:30:23] (03CR) 10John F. Lewis: [C: 031] "http://puppet-compiler.wmflabs.org/982/ tells me all I need to know!" [puppet] - 10https://gerrit.wikimedia.org/r/244722 (owner: 10John F. Lewis) [18:30:41] mutante: ^^ +2? :) [18:32:38] JohnFLewis: that's really good! 
the contact groups for analytics servers were messed up all this time [18:32:51] just a minute while i was watching the last change [18:33:03] this should also unblock the access requests [18:33:11] okay :) hopefully once we confirm it actually works on neon we can: [18:33:44] a) fix the two icinga ARs, b) convert all $nagios_contact_groups to hiera and c) fix a regression :) [18:33:56] JohnFLewis: full ack! :) thank you [18:35:10] (03PS2) 10Dzahn: videoscalers: move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/242180 (owner: 10Muehlenhoff) [18:35:22] (03PS2) 10Dzahn: imagescalers: move base::firewall include into the role [puppet] - 10https://gerrit.wikimedia.org/r/243122 (owner: 10Muehlenhoff) [18:37:05] (03CR) 10Dzahn: "yep, works now on mira. let's use https://gerrit.wikimedia.org/r/#/c/223458/ though to achieve the same" [puppet] - 10https://gerrit.wikimedia.org/r/240083 (owner: 10Muehlenhoff) [18:37:07] (03PS2) 10Alex Monk: Modify timezone for cswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244649 (https://phabricator.wikimedia.org/T115048) (owner: 10Revi) [18:38:01] (03PS3) 10Dzahn: icinga: ensure hiera lookups for all contact_group defs [puppet] - 10https://gerrit.wikimedia.org/r/244722 (https://phabricator.wikimedia.org/T111243) (owner: 10John F. Lewis) [18:38:03] (03CR) 10Alex Monk: [C: 031] Modify timezone for cswiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244649 (https://phabricator.wikimedia.org/T115048) (owner: 10Revi) [18:39:06] (03PS4) 10Dzahn: icinga: ensure hiera lookups for all contact_group defs [puppet] - 10https://gerrit.wikimedia.org/r/244722 (https://phabricator.wikimedia.org/T111243) (owner: 10John F. Lewis) [18:43:02] (03CR) 10Dzahn: [C: 032] "Thank you for this! As mentioned on T111243, T105229 and related discussion the contact groups override never actually worked. 
this means " [puppet] - 10https://gerrit.wikimedia.org/r/244722 (https://phabricator.wikimedia.org/T111243) (owner: 10John F. Lewis) [18:44:32] (03PS1) 10Reedy: Move dblists to dblist folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244727 [18:44:34] (03PS1) 10Reedy: Delete dblist symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244728 [18:44:41] (03CR) 10jenkins-bot: [V: 04-1] Delete dblist symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244728 (owner: 10Reedy) [18:44:50] lol [18:45:12] oh, damn it, unit tests [18:45:42] dbconfigTests::testDbAssignedToAnExistingCluster [18:45:51] does that dbconfig test need the symlink you delete ? [18:46:03] yeah, I didn't update it to the new location of the file [18:47:05] (03CR) 10MZMcBride: "Related: ." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244727 (owner: 10Reedy) [18:48:46] (03CR) 10Giuseppe Lavagetto: [C: 031] "With the caveat that this might not solve our problem unless we can significantly reduce tcp_keepalive_time, it's still a valid correction" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/244717 (https://phabricator.wikimedia.org/T113151) (owner: 10Ori.livneh) [18:48:56] Apparently I'm blind [18:51:50] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1715971 (10ori) @Joe pointed out on IRC that the default `tcp_keepalive_time` is 300s, which is much longer than we'd like to take to recogni... [18:55:02] Reedy: Coding blind is pretty impressive. 
[18:57:17] (03PS2) 10Reedy: Move dblists to dblist folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244727 [18:57:19] (03PS2) 10Reedy: Delete dblist symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244728 [18:57:24] (03CR) 10jenkins-bot: [V: 04-1] Delete dblist symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244728 (owner: 10Reedy) [18:57:27] (03CR) 10jenkins-bot: [V: 04-1] Move dblists to dblist folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244727 (owner: 10Reedy) [18:57:42] katie: even more impressive than it sounds :) http://blog.freecodecamp.com/2015/01/a-vision-of-coding-without-opening-your-eyes.html [18:58:22] whee, missing / [19:00:52] (03PS3) 10Reedy: Move dblists to dblist folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244727 [19:00:54] (03PS3) 10Reedy: Delete dblist symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244728 [19:01:02] (03CR) 10jenkins-bot: [V: 04-1] Delete dblist symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244728 (owner: 10Reedy) [19:02:29] (03CR) 10Dzahn: "confirmed on neon, the override DOES work, but still only for the NTP service on erbium and only that, not other services on it or other h" [puppet] - 10https://gerrit.wikimedia.org/r/244722 (https://phabricator.wikimedia.org/T111243) (owner: 10John F. Lewis) [19:03:29] (03CR) 10JanZerebecki: [C: 031] Explicitly set wmgMFNearby = false for wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244591 (https://phabricator.wikimedia.org/T114869) (owner: 10Aude) [19:04:40] Krenair: I +2’d, waited for the merge, and then ran sync-common on silver. Did I miss a step? [19:05:28] Oh, I suppose ‘git fetch’ on the deployment host [19:06:47] :-) [19:07:32] (03CR) 10Dzahn: "root@neon:/etc/icinga# diff -u3 puppet_services.cfg /home/dzahn/puppet_services.cfg" [puppet] - 10https://gerrit.wikimedia.org/r/244722 (https://phabricator.wikimedia.org/T111243) (owner: 10John F. 
Lewis) [19:08:37] (03PS4) 10Reedy: Move dblists to dblist folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244727 [19:08:39] (03PS4) 10Reedy: Delete dblist symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244728 [19:09:01] JohnFLewis: you know what's also still fun. the size of puppet_services.cfg the change was in line "114164" [19:09:26] it used to create duplicate entries [19:11:43] (03CR) 10Nuria: ">This should be in templates/varnish/analytics.inc.vcl.erb with the rest of >the analytics code, probably just as inlines in the primary a" [puppet] - 10https://gerrit.wikimedia.org/r/244626 (owner: 10Nuria) [19:12:13] (03PS1) 10John F. Lewis: base: don't set contact_groups unless they override icinga defaults [puppet] - 10https://gerrit.wikimedia.org/r/244732 [19:12:19] ebernhardson: Interesting piece. [19:12:29] (03PS2) 10John F. Lewis: base: don't set contact_groups unless they override icinga defaults [puppet] - 10https://gerrit.wikimedia.org/r/244732 [19:13:34] (03PS1) 10Ori.livneh: Provide a smooth migration path of dblist files to dblists/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244733 [19:15:50] (03CR) 10Dzahn: Update a couple of firewall rules to include mira alongside tin (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/244686 (https://phabricator.wikimedia.org/T113351) (owner: 10Alex Monk) [19:16:10] (03CR) 10Nuria: Add cron that schedules camus imports for mediawiki Avro Binary data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244601 (https://phabricator.wikimedia.org/T113521) (owner: 10Madhuvishy) [19:18:14] (03PS1) 10John F. Lewis: nrpe: convert contact_groups to hiera [puppet] - 10https://gerrit.wikimedia.org/r/244734 [19:18:24] (03PS2) 10John F. 
Lewis: nrpe: convert contact_groups to hiera [puppet] - 10https://gerrit.wikimedia.org/r/244734 [19:18:30] (03PS1) 10Reedy: Update comments/hints for WMF MW version format changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244735 [19:19:00] (03CR) 10jenkins-bot: [V: 04-1] nrpe: convert contact_groups to hiera [puppet] - 10https://gerrit.wikimedia.org/r/244734 (owner: 10John F. Lewis) [19:19:59] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1716052 (10AKoval_WMF) I also find these steps a little confusingly written, tbh @RobH. But I do think I get the premise. Let's see if I understand this situation correctl... [19:21:13] (03CR) 10Amire80: Enable CX suggestions in ast, bn, ml, nb, ta and ukwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244142 (https://phabricator.wikimedia.org/T112848) (owner: 10KartikMistry) [19:21:16] (03PS3) 10John F. Lewis: nrpe: convert contact_groups to hiera [puppet] - 10https://gerrit.wikimedia.org/r/244734 [19:21:31] (03PS4) 10John F. Lewis: nrpe: convert contact_groups to hiera [puppet] - 10https://gerrit.wikimedia.org/r/244734 [19:22:27] 7Puppet, 10Continuous-Integration-Config, 5Continuous-Integration-Scaling: Puppetize npm/grunt manual setup - https://phabricator.wikimedia.org/T113903#1716053 (10hashar) [19:23:27] (03PS1) 10Amire80: Fix nbwiki to nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244736 [19:23:59] (03PS2) 10Ori.livneh: Provide a smooth migration path of dblist files to dblists/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244733 [19:24:04] Reedy: ^ [19:24:16] (03CR) 10John F. Lewis: [C: 031] "http://puppet-compiler.wmflabs.org/985/ is a lot more promising." [puppet] - 10https://gerrit.wikimedia.org/r/244734 (owner: 10John F. Lewis) [19:24:47] (03CR) 10John F. Lewis: [C: 031] "no real change - just something meta and style wise." 
[puppet] - 10https://gerrit.wikimedia.org/r/244732 (owner: 10John F. Lewis) [19:27:49] (03CR) 10Amire80: "Follow-up fix in I5a1a0a9cd5fc50e7c3c97d6839713249845a190b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244142 (https://phabricator.wikimedia.org/T112848) (owner: 10KartikMistry) [19:29:08] subbu: btw, I added tcp connection reporting to our base server monitoring stack, so now we have, e.g., http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1444418898.572&target=servers.wtp1001.network.connections.ESTABLISHED [19:30:37] ori, lovely. [19:30:39] Woo [19:31:37] (03CR) 10Ori.livneh: [C: 032] Provide a smooth migration path of dblist files to dblists/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244733 (owner: 10Ori.livneh) [19:31:43] (03Merged) 10jenkins-bot: Provide a smooth migration path of dblist files to dblists/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244733 (owner: 10Ori.livneh) [19:33:00] You could've waited a few minutes and I would've done that :p [19:33:18] a few MINUTES?? [19:33:20] * ori yawns [19:33:34] !log ori@tin Synchronized multiversion/MWWikiversions.php: I9d4cbd3d67: Provide a smooth migration path of dblist files to dblists/ (duration: 01m 13s) [19:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:33:46] I'm not on my laptop atm. [19:34:16] I forgive you. 
[19:46:39] (03PS1) 10Paladox: [Timeline] Update path to extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244739 [19:46:58] (03PS2) 10Paladox: [Timeline] Update path to extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244739 [19:50:15] (03PS1) 10Ori.livneh: Add MEDIAWIKI_DBLIST_DIR define, set to MEDIAWIKI_STAGING_DIR by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244740 [19:50:21] (03CR) 10jenkins-bot: [V: 04-1] Add MEDIAWIKI_DBLIST_DIR define, set to MEDIAWIKI_STAGING_DIR by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244740 (owner: 10Ori.livneh) [19:53:56] (03CR) 10Nuria: "Making note for self." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244626 (owner: 10Nuria) [19:54:57] (03CR) 10Alex Monk: Update a couple of firewall rules to include mira alongside tin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244686 (https://phabricator.wikimedia.org/T113351) (owner: 10Alex Monk) [19:58:10] (03PS1) 10Reedy: Add dblist to many paths [puppet] - 10https://gerrit.wikimedia.org/r/244743 [20:01:45] (03PS2) 10Alex Monk: Update a couple of firewall rules to include mira alongside tin [puppet] - 10https://gerrit.wikimedia.org/r/244686 (https://phabricator.wikimedia.org/T113351) [20:02:37] (03PS1) 10Hashar: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) [20:02:42] 7Puppet, 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 5Patch-For-Review: Puppetize npm/grunt manual setup - https://phabricator.wikimedia.org/T113903#1716267 (10hashar) a:3hashar [20:04:49] (03CR) 10Hashar: [C: 04-1] "That one is for an early EU morning deploy since it has the potential to SEVERLY disrupt the whole CI. 
I think I will:" [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar) [20:05:53] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [20:13:37] (03PS5) 10Reedy: Move dblists to dblist folder [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244727 [20:14:42] (03CR) 10Hashar: "I have updated the Jenkins job ( https://gerrit.wikimedia.org/r/#/c/243997/ )" [puppet] - 10https://gerrit.wikimedia.org/r/243992 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [20:18:22] (03CR) 10Dzahn: [C: 032] base: don't set contact_groups unless they override icinga defaults [puppet] - 10https://gerrit.wikimedia.org/r/244732 (owner: 10John F. Lewis) [20:19:44] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra/CQL query interface monitoring - https://phabricator.wikimedia.org/T93886#1716348 (10Eevans) >>! In T93886#1715162, @fgiunchedi wrote: > * error messages should include a description of what when wrong How about this? https://gith... [20:20:14] (03PS5) 10Dzahn: nrpe: convert contact_groups to hiera [puppet] - 10https://gerrit.wikimedia.org/r/244734 (owner: 10John F. Lewis) [20:21:15] (03CR) 10Dzahn: [C: 032] "confirmed by compiler diff" [puppet] - 10https://gerrit.wikimedia.org/r/244734 (owner: 10John F. Lewis) [20:22:37] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1716351 (10JohnLewis) >>! In T107445#1711531, @Selsharbaty-WMF wrote: > At this point we will have one list only, right? education-collab-private. And the archives will no... 
[20:23:31] (03PS2) 10Ottomata: Introducing aqs.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/242134 (owner: 10Alexandros Kosiaris) [20:23:39] (03CR) 10Ottomata: [C: 032 V: 032] Introducing aqs.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/242134 (owner: 10Alexandros Kosiaris) [20:25:45] (03CR) 10Hashar: [C: 031 V: 032] "Cherry picked on beta cluster puppetmaster. The Jenkins job has been updated as well." [puppet] - 10https://gerrit.wikimedia.org/r/243992 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [20:30:50] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [20:30:57] (03CR) 10Dzahn: Update a couple of firewall rules to include mira alongside tin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244686 (https://phabricator.wikimedia.org/T113351) (owner: 10Alex Monk) [20:31:17] (03CR) 10Dzahn: Update a couple of firewall rules to include mira alongside tin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244686 (https://phabricator.wikimedia.org/T113351) (owner: 10Alex Monk) [20:31:39] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:32:05] (03CR) 10Reedy: "Should this one have been updated too?" 
[puppet] - 10https://gerrit.wikimedia.org/r/244743 (owner: 10Reedy) [20:33:38] (03PS3) 10Alex Monk: Update a couple of firewall rules to include mira alongside tin [puppet] - 10https://gerrit.wikimedia.org/r/244686 (https://phabricator.wikimedia.org/T113351) [20:35:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [20:39:00] (03PS1) 10ArielGlenn: dumps: don't escape commands not run in shell [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/244799 [20:39:02] (03PS1) 10ArielGlenn: dumps: unfix a camelcase, imported module not fixed up yet [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/244800 [20:39:04] (03PS1) 10ArielGlenn: dumps; fix another indentation screwup from the pylint [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/244801 [20:39:15] (03CR) 10Dzahn: "works fine :) diff here: https://phabricator.wikimedia.org/P2179" [puppet] - 10https://gerrit.wikimedia.org/r/244734 (owner: 10John F. Lewis) [20:40:18] (03CR) 10Dzahn: "now analytics people will actually get notifications for a whole bunch of things they never got while the code make it look like they woul" [puppet] - 10https://gerrit.wikimedia.org/r/244734 (owner: 10John F. Lewis) [20:40:31] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [20:45:31] (03PS1) 10BryanDavis: vagrant-lxc: Update sudoer rules for v1.2.0+ [puppet] - 10https://gerrit.wikimedia.org/r/244802 (https://phabricator.wikimedia.org/T115080) [20:46:17] (03CR) 10Reedy: [C: 04-1] "InitialiseSettings?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/244739 (owner: 10Paladox) [20:48:42] (03CR) 10Ori.livneh: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244740 (owner: 10Ori.livneh) [20:51:12] (03CR) 10Paladox: "Well currently jenkins wont run the extension-unittests-generic test because the extension dosent follow the naming of the repo in the fil" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244739 (owner: 10Paladox) [20:51:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [20:53:32] (03CR) 10Reedy: "The globals are wrong anyway" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244739 (owner: 10Paladox) [20:55:27] ori: the poor define '/srv...' are not pleasing the tests unfortunately :-( ( re https://gerrit.wikimedia.org/r/244740 ) [20:56:05] ori: maybe multi version need to be taught which define to load, and then one can inject fixtures . ebernhardson added such list of define() recently [20:57:49] Why do the other ones work fine though? :/ [20:58:27] (03CR) 10BryanDavis: "Tested via cherry-pick on vagrant-lxc-trusty.mediawiki-core-team.eqiad.wmflabs with vagrant-lxc v1.2.1 (latest release)" [puppet] - 10https://gerrit.wikimedia.org/r/244802 (https://phabricator.wikimedia.org/T115080) (owner: 10BryanDavis) [20:58:48] (03CR) 10Yuvipanda: [C: 032] vagrant-lxc: Update sudoer rules for v1.2.0+ [puppet] - 10https://gerrit.wikimedia.org/r/244802 (https://phabricator.wikimedia.org/T115080) (owner: 10BryanDavis) [20:58:57] hashar: getting tests into mediawiki-config was a real pain, not having those hardcoded paths would help a ton but i just didn't have the time to refactor [20:59:49] (03CR) 10ArielGlenn: [C: 031] "no other changes needed that I can see as far as the dumps and related jobs go. I didn't check the rest." 
[puppet] - https://gerrit.wikimedia.org/r/244743 (owner: Reedy)
[20:59:57] ty apergos :)
[20:59:59] ebernhardson: the holy grail would be to clone both wmf branches and run integration tests with all projects we have :D
[21:00:27] yw and I'm ugh midnight. gone to bed, mus tleave house at 8 am with bells on
[21:00:29] ebernhardson: anyway, your patch was a good step forward
[21:00:33] or at least with bags packed and on my shoulder
[21:00:36] good night
[21:00:42] apergos: sleep well !
[21:00:55] thanks!
[21:01:55] safe travels!
[21:02:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[21:03:06] (PS3) Yuvipanda: toollabs: lint fixes [puppet] - https://gerrit.wikimedia.org/r/243857 (owner: Dzahn)
[21:03:18] (CR) Yuvipanda: [C: 2 V: 2] "Thanks for the patch!" [puppet] - https://gerrit.wikimedia.org/r/243857 (owner: Dzahn)
[21:03:22] (PS1) Ori.livneh: Make dbconfigTests pass when there are no configs to validate [mediawiki-config] - https://gerrit.wikimedia.org/r/244804
[21:03:27] Reedy: ^
[21:03:31] hashar: yeah :/
[21:04:06] (CR) Ori.livneh: [C: 2] Make dbconfigTests pass when there are no configs to validate [mediawiki-config] - https://gerrit.wikimedia.org/r/244804 (owner: Ori.livneh)
[21:04:12] (Merged) jenkins-bot: Make dbconfigTests pass when there are no configs to validate [mediawiki-config] - https://gerrit.wikimedia.org/r/244804 (owner: Ori.livneh)
[21:04:21] wtf gerrit
[21:04:29] mediawiki-config->invoke('reedy', 'ori')->wait('success')
[21:04:32] https://gerrit.wikimedia.org/r/#/c/243857/ is 'submitted' but has the button greyed and not actually merged
[21:05:05] though
[21:05:12] https://gerrit.wikimedia.org/r/#/c/244804/1/tests/dbconfigTest.php that one is not ideal :D
[21:05:27] it's a shitty test
[21:05:33] it's an integration test, not a unit test
[21:05:44] but i figured deleting it outright, as i
would like to do, would be more controversial
[21:06:22] (PS3) Ottomata: Add cron that schedules camus imports for mediawiki Avro Binary data [puppet] - https://gerrit.wikimedia.org/r/244601 (https://phabricator.wikimedia.org/T113521) (owner: Madhuvishy)
[21:06:24] unit tests should test units of code; this is testing that the person who wrote the configuration file didn't make a mistake
[21:06:29] well you could have marked it as skipped whenever the section is empty()
[21:06:48] but it's not actually skipped
[21:07:00] the test checks that no configuration violates the constraint
[21:07:10] if the config is empty, then no configuration violates the constraint
[21:07:13] the test passed
[21:07:16] it wasn't skipped
[21:07:17] yup true
[21:07:19] skipped would be not checking it
[21:07:24] (CR) Ottomata: Add cron that schedules camus imports for mediawiki Avro Binary data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/244601 (https://phabricator.wikimedia.org/T113521) (owner: Madhuvishy)
[21:08:01] (PS4) Dzahn: toollabs: lint fixes [puppet] - https://gerrit.wikimedia.org/r/243857
[21:08:10] (CR) Madhuvishy: [C: 1] Add cron that schedules camus imports for mediawiki Avro Binary data [puppet] - https://gerrit.wikimedia.org/r/244601 (https://phabricator.wikimedia.org/T113521) (owner: Madhuvishy)
[21:08:34] (PS5) Yuvipanda: toollabs: lint fixes [puppet] - https://gerrit.wikimedia.org/r/243857 (owner: Dzahn)
[21:08:38] (PS4) Ottomata: Add cron that schedules camus imports for mediawiki Avro Binary data [puppet] - https://gerrit.wikimedia.org/r/244601 (https://phabricator.wikimedia.org/T113521) (owner: Madhuvishy)
[21:09:01] (CR) Yuvipanda: [V: 2] toollabs: lint fixes [puppet] - https://gerrit.wikimedia.org/r/243857 (owner: Dzahn)
[21:09:08] (PS5) Ottomata: Add cron that schedules camus imports for mediawiki Avro Binary data [puppet] - https://gerrit.wikimedia.org/r/244601
(https://phabricator.wikimedia.org/T113521) (owner: Madhuvishy)
[21:09:16] (CR) Ottomata: [C: 2 V: 2] Add cron that schedules camus imports for mediawiki Avro Binary data [puppet] - https://gerrit.wikimedia.org/r/244601 (https://phabricator.wikimedia.org/T113521) (owner: Madhuvishy)
[21:11:06] ebernhardson: am going to merge https://gerrit.wikimedia.org/r/#/c/240305/, can you confirm that the dynamic scripting stuff is turned off? :)
[21:14:55] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[21:17:08] (PS14) Dzahn: move misc/labsdebrepo out of misc to module [puppet] - https://gerrit.wikimedia.org/r/194796
[21:19:03] (PS15) Yuvipanda: move misc/labsdebrepo out of misc to module [puppet] - https://gerrit.wikimedia.org/r/194796 (owner: Dzahn)
[21:19:25] (CR) Yuvipanda: [C: 2] "\o/ I'll change the roles on the instances in https://tools.wmflabs.org/watroles/role/misc::labsdebrepo by hand now." [puppet] - https://gerrit.wikimedia.org/r/194796 (owner: Dzahn)
[21:20:11] (CR) Dzahn: "thank you very much :)" [puppet] - https://gerrit.wikimedia.org/r/194796 (owner: Dzahn)
[21:21:57] (PS1) John F. Lewis: monitoring: append sms to contact groups, don't override with admins,sms [puppet] - https://gerrit.wikimedia.org/r/244806
[21:25:07] (CR) Yuvipanda: "And https://tools.wmflabs.org/watroles/role/misc::labsdebrepo is empty!" [puppet] - https://gerrit.wikimedia.org/r/194796 (owner: Dzahn)
[21:25:56] yuvipanda: :) "misc" will die soon, yay
[21:26:38] mutante: :D I've updated docs too
[21:27:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[21:34:54] hey opsen, anything weird change with mariadb recently?
beta cluster is down and we can't get it started quickly:
[21:34:56] 21:33 < Krenair> Starting MySQL
[21:34:57] yuvipanda: dynamic scripting is off, yes
[21:34:59] 21:33 < Krenair> * Couldn't find MySQL manager (/usr/bin/mysqlmanager) or server (/usr/bin/mysqld_safe)
[21:35:24] greg-g: or.i claimed responsibility on -labs I think, said he is fixing.
[21:35:33] (PS4) Dzahn: Update a couple of firewall rules to include mira alongside tin [puppet] - https://gerrit.wikimedia.org/r/244686 (https://phabricator.wikimedia.org/T113351) (owner: Alex Monk)
[21:35:44] greg-g: when he's done please swat him with [[Labs_labs_labs]]
[21:36:03] heh
[21:36:18] yuvipanda: also we've just finished up the script to copy indices from one cluster to another (just needs review, but appears to work). Along with adding code so we can set per-cluster shard/replica counts. in theory we are go for next week
[21:36:28] (CR) Dzahn: [C: 2] Update a couple of firewall rules to include mira alongside tin [puppet] - https://gerrit.wikimedia.org/r/244686 (https://phabricator.wikimedia.org/T113351) (owner: Alex Monk)
[21:36:29] ebernhardson: wooo awesome.
[21:36:43] ebernhardson: I won't be around much next week (offsite) but can probably help with +2s
[21:36:49] kk
[21:37:08] ebernhardson: and since you've root on it too I guess you can troubleshoot / kill it if it takes down something more important :D
[21:37:15] ebernhardson: and I guess we can do load / perf testing afterwards...
[21:37:59] yup
[21:38:29] cooool
[21:38:36] we're doing ok on the 6 week deadline too I guess
[21:40:44] i was hoping to move a bit faster, but it turns out the multi-cluster stuff needed a few more steps
[21:40:59] Ops-Access-Requests, operations, Discovery-Wikidata-Query-Service-Sprint, Icinga, Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1716485 (Dzahn) >>!
In T111243#1707062, @Smalyshev wrote: > @DZahn...
[21:49:38] operations: Education Alias - https://phabricator.wikimedia.org/T115150#1716501 (Krenair) Adding #operations
[21:53:26] again? really...
[21:56:10] (PS1) John F. Lewis: fix hiera key for wdqs (contactgroups) [puppet] - https://gerrit.wikimedia.org/r/244813
[21:56:19] mutante: ^^
[21:56:27] PROBLEM - puppet last run on mw2076 is CRITICAL: CRITICAL: puppet fail
[21:57:39] (PS1) John F. Lewis: move all non-default contact_group variables to hiera [puppet] - https://gerrit.wikimedia.org/r/244814
[21:57:54] (PS2) John F. Lewis: fix hiera key for wdqs (contactgroups) [puppet] - https://gerrit.wikimedia.org/r/244813
[21:58:04] (PS2) John F. Lewis: move all non-default contact_group variables to hiera [puppet] - https://gerrit.wikimedia.org/r/244814
[21:58:16] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[21:58:25] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[21:58:33] (CR) John F. Lewis: [C: -1] "Do not merge yet. Needs to be run in a compilers to see if no monitoring groups are lost." [puppet] - https://gerrit.wikimedia.org/r/244814 (owner: John F. Lewis)
[21:58:35] operations: Education Alias - https://phabricator.wikimedia.org/T115150#1716531 (Dzahn) Hi @eross and all, while it is true that there was an education@ mail alias in the past that was controlled by ops, nowadays this is not the case anymore. from our git log: "dzahn: deactivate education@ mail alias" This...
[22:03:27] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[22:03:46] (PS1) BryanDavis: vagrant::mediawiki: Make port forwards public [puppet] - https://gerrit.wikimedia.org/r/244815 (https://phabricator.wikimedia.org/T115139)
[22:07:13] operations: Education Alias - https://phabricator.wikimedia.org/T115150#1716543 (eross) Yes this was helpful. Thank you!
[22:08:06] operations, Varnish, Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 501 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1716552 (Tgr) (Image in question: [[ https://commons.wikimedia.org/wiki/File:Kitagawa_Utamaro_-_Toji_san_bijin_(Three_Beauties_of_the_Prese...
[22:14:43] operations, Varnish, Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 501 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1470600 (Tgr) Filed T115155 in any case.
[22:15:00] (CR) John F. Lewis: "http://puppet-compiler.wmflabs.org/986/ shows all but the CI ones are correct (analytics/team-services become admins,analytics/team-servic" [puppet] - https://gerrit.wikimedia.org/r/244814 (owner: John F. Lewis)
[22:16:52] operations: Education Alias - https://phabricator.wikimedia.org/T115150#1716598 (Dzahn) @eross Ok, i'll close this one here then or you can just use the same ticket to follow-up with the OIT side of things?
[22:17:40] (PS3) Dzahn: fix hiera key for wdqs (contactgroups) [puppet] - https://gerrit.wikimedia.org/r/244813 (https://phabricator.wikimedia.org/T111243) (owner: John F. Lewis)
[22:17:51] (CR) Dzahn: [C: 2] fix hiera key for wdqs (contactgroups) [puppet] - https://gerrit.wikimedia.org/r/244813 (https://phabricator.wikimedia.org/T111243) (owner: John F. Lewis)
[22:19:39] operations: Education Alias - https://phabricator.wikimedia.org/T115150#1716601 (eross) You can close this ticket, I am OIT. I just wanted to make sure with the alias. Thank you !
[22:20:05] operations: Education Alias - https://phabricator.wikimedia.org/T115150#1716604 (Dzahn) Open>Resolved a:Dzahn sure, no problem
[22:20:18] operations, Mail: Education Alias - https://phabricator.wikimedia.org/T115150#1716607 (Dzahn)
[22:22:47] (PS3) John F.
Lewis: move all non-default contact_group variables to hiera [puppet] - https://gerrit.wikimedia.org/r/244814
[22:22:57] (PS4) John F. Lewis: move all non-default contact_group variables to hiera [puppet] - https://gerrit.wikimedia.org/r/244814
[22:23:28] (CR) John F. Lewis: [C: 1] "Gallium now has groups added via host hiera file. This is good to go." [puppet] - https://gerrit.wikimedia.org/r/244814 (owner: John F. Lewis)
[22:25:54] RECOVERY - puppet last run on mw2076 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:31:03] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail
[22:33:48] (CR) BryanDavis: "tested via cherry-pick on vagrant-lxc-trusty.mediawiki-core-team. Respects user changes as expected." [puppet] - https://gerrit.wikimedia.org/r/244815 (https://phabricator.wikimedia.org/T115139) (owner: BryanDavis)
[22:34:46] (PS2) Yuvipanda: vagrant::mediawiki: Make port forwards public [puppet] - https://gerrit.wikimedia.org/r/244815 (https://phabricator.wikimedia.org/T115139) (owner: BryanDavis)
[22:35:10] (CR) Yuvipanda: [C: 2 V: 2] "All hail Bryan Davis, master of the vagrants." [puppet] - https://gerrit.wikimedia.org/r/244815 (https://phabricator.wikimedia.org/T115139) (owner: BryanDavis)
[22:56:24] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[22:57:26] operations, Phabricator, Release-Engineering-Team, Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1716696 (chasemp)
[23:11:02] operations, Wikimedia-Mailing-lists, Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1716724 (Dzahn) @JohnLewis thanks for taking this (and using the script for it we have since recently which makes the list renaming process more standard than the former...
[23:31:14] (CR) Dzahn: [C: 1] "now that overriding contact groups for non-ops works for non-critical services, this follow-up will also fix it for critical services." [puppet] - https://gerrit.wikimedia.org/r/244806 (owner: John F. Lewis)
[23:40:46] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/987/argon.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/244699 (owner: John F. Lewis)
[23:42:43] (CR) Dzahn: [C: 1] "meanwhile merged https://gerrit.wikimedia.org/r/#/c/244686/" [puppet] - https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) (owner: Hoo man)
[23:48:06] (PS1) Yuvipanda: puppet: Have a 'secret' repository for self hosted puppetmasters [puppet] - https://gerrit.wikimedia.org/r/244827 (https://phabricator.wikimedia.org/T112005)
[23:48:27] andrewbogott: chasemp ^ if you can look at that slight(?) abomination
[23:49:00] yes but probably not tonight as I have to run now
[23:50:02] chasemp: kk
[23:54:31] (CR) Dzahn: "and merged this one to add to the list: https://gerrit.wikimedia.org/r/#/c/244686/" [puppet] - https://gerrit.wikimedia.org/r/240083 (owner: Muehlenhoff)