[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151201T0000). [00:00:04] MatmaRex: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:05] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858788 MB (50% inode=99%) [00:00:23] hi. [00:00:32] hm 2 am. must be bedtime [00:00:38] happy swatting [00:02:13] night apergos :) [00:02:26] :-) [00:05:04] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 386 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858788 MB (50% inode=99%) [00:10:04] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 385 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858784 MB (50% inode=99%) [00:15:13] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 385 MB (5% inode=76%): /dev 32199 MB (99% inode=99%): /run 6441 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /a 858784 MB (50% inode=99%) [00:15:43] PROBLEM - puppet last run on es2010 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:16:33] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:16:33] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:16:34] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:16:45] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: 
Puppet last ran 6 hours ago [00:17:03] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:03] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:04] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:04] PROBLEM - puppet last run on mw2196 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:04] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:06] :/ [00:17:13] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:13] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:13] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:14] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:14] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:15] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:15] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:23] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:24] PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:24] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:25] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:25] PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:25] PROBLEM - puppet last run on mw2096 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:33] PROBLEM - puppet last run on wtp2016 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:33] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Puppet last 
ran 6 hours ago [00:17:34] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:34] PROBLEM - puppet last run on mw2092 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:34] PROBLEM - puppet last run on mw2067 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:34] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:34] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:35] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:35] PROBLEM - puppet last run on mw2003 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:43] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:43] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:43] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:45] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:45] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:45] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:53] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:53] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:54] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:54] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:17:55] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:18:15] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:18:15] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: Puppet last ran 
6 hours ago [00:18:15] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:18:16] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:18:16] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:18:16] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:18:16] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:18:23] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:18:33] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:18:34] PROBLEM - puppet last run on elastic1013 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:18:43] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:18:45] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:18:45] PROBLEM - puppet last run on mw2184 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [00:19:13] killed icinga-wm and is checking a random puppet run [00:28:14] the one I looked at really didn't run puppet for about 6 hours... [00:28:32] it was still executing puppet-run, but not the agent itself [00:29:03] W: Failed to fetch http://ubuntu.wikimedia.org/ubuntu/dists/trusty-updates/main/source/Sources Hash Sum mismatch [00:29:06] W: Failed to fetch http://ubuntu.wikimedia.org/ubuntu/dists/trusty-updates/main/binary-amd64/Packages Hash Sum mismatch [00:29:10] E: Some index files failed to download. They have been ignored, or old ones used instead. 
[00:29:41] basically, apt-get update failed because of the above, and that in turn blocked puppet from running in the puppet-run cron [00:30:15] (because puppet-run is a shellscript running under -e) [00:30:17] (03CR) 10TTO: [C: 031] Enable subpages on custom aliases from 112 to 119 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254475 (owner: 10Dereckson) [00:31:12] aaah, i just got to the manual puppet-run part [00:31:22] yea, so the Sum mismatch thing we saw before [00:31:27] but only on trusty [00:34:23] probably the better question is why our puppet agent alerts are so wonky to begin with [00:34:27] bblack: looks like our own repo gets out of sync with ubuntu upstream, i remember we had a check for that [00:34:39] that would maybe explain the Sum mismatch [00:35:07] probably apt-get update failure shouldn't block puppet-run anyways [00:35:16] agreed [00:35:44] (otherwise we have some chicken and egg problems if we use puppet to update an apt sources file poorly) [00:35:55] could be solved with salt, but still ugly [00:36:26] there is already a timeout there though.. [00:36:51] but that doesn't catch this error condition where just one source fails [00:37:23] regardless, the timeout would cause abnormal exit code, which would prevent the actual agent running [00:39:45] exit code 100 [00:39:50] from that apt-get update [00:40:14] tries to find the check for the apt repo itself [00:40:35] and indeed "Ubuntu mirror in sync with upstream" [00:40:42] /srv/mirrors/ubuntu is over 6 hours old. 
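The failure mode being diagnosed here (a cron wrapper running under `set -e`, where exit code 100 from `apt-get update` aborts the script before the agent step is reached) can be sketched roughly as follows. This is a hypothetical illustration, not the actual `puppet-run` script; `apt_update` and `run_agent` are made-up stand-in functions:

```shell
#!/bin/bash
# Hypothetical sketch of a puppet-run-style cron wrapper.
set -e                       # any failing command aborts the whole script

apt_update() {
    # stand-in for "apt-get update", which exits 100 on a hash sum mismatch
    return 100
}

run_agent() {
    echo "running puppet agent"   # stand-in for the actual agent invocation
}

# Without "|| true", the non-zero exit from apt_update would abort the
# script right here (because of set -e) and run_agent would never execute,
# which is exactly the stall seen on the affected hosts.
apt_update || true
run_agent
```

With the guard in place the agent still runs on a stale package index, which is usually preferable to not running at all.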
[00:40:48] but just WARN so far [00:40:51] (03PS1) 10BBlack: puppet-run: do not let apt failures block agent [puppet] - 10https://gerrit.wikimedia.org/r/256148 [00:41:32] I think faidon last messed with that script, will wait for him to object heh [00:41:50] ah:) seems good [00:47:01] !log catrope@tin Synchronized php-1.27.0-wmf.7/resources/src/mediawiki/mediawiki.ForeignStructuredUpload.js: SWAT (duration: 00m 30s) [00:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:47:52] !log catrope@tin Synchronized php-1.27.0-wmf.7/extensions/VisualEditor: SWAT (duration: 00m 29s) [00:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:49:22] bblack: did you do this? /srv/mirrors/ubuntu is over 0 hours old. [00:50:13] so the issue is gone now because the repo is in sync again and so apt-get update works again [00:50:50] no, I didn't do it [00:51:02] AFAICS, they actually started working about 6 hours after they started failing [00:51:24] something got itself back in sync on its own? [00:51:38] yea [00:51:45] where something is reprepro update i think [00:52:06] but i don't see a cron for it yet [00:53:30] ah, found it. modules/mirrors [00:54:04] so it's a race [00:54:13] this gets updated every.. you guessed it.. 6 hours [00:54:22] hour => '*/6', [00:54:42] at minute 43 [00:59:44] so if.. ubuntu updated packages in the last 6 hours and our mirror does not have them yet.. then apt-get update fails.. then puppet run fails, and we need to be lucky enough to do this with the right timing [00:59:49] because puppet runs at minute 44 [00:59:59] and the apt upgrade at minute 43 [01:00:47] eh, i mean "update-ubuntu-mirror" [01:07:25] you mean, this is a race for all clients that run near minute 44? 
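The timing described above can be reconstructed as two crontab entries. The minute values (43 and 44) and the `*/6` schedule are the ones quoted in the discussion; the paths and exact entry forms are hypothetical:

```
# mirror sync: hour => '*/6', at minute 43 (hypothetical path)
43 */6 * * *  root  /usr/local/sbin/update-ubuntu-mirror
# puppet-run on the observed client: minute 44 (hypothetical path)
44 *    * * *  root  /usr/local/sbin/puppet-run
# If upstream changed packages since the last sync, an apt-get update
# that lands between syncs (or during one) can see an inconsistent
# index and fail with a hash sum mismatch, aborting puppet-run.
```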
[01:07:30] (03CR) 10Dzahn: [C: 031] "the apt-get update command failed when update-ubuntu-mirror failed or was not running often enough because then our repo is not in sync wi" [puppet] - 10https://gerrit.wikimedia.org/r/256148 (owner: 10BBlack) [01:07:36] because the clients should be splayed [01:09:15] bblack: eh.. yea.. i guess. i just picked a random server and it happened to be that one minute apart :p [01:09:38] but there were too many affected servers for that then [01:23:26] (03PS1) 10Dzahn: mirrors: sync ubuntu mirrors more often [puppet] - 10https://gerrit.wikimedia.org/r/256155 [01:24:40] bblack: ^ i guess that too and i'll also add faidon [01:27:34] (03CR) 10Dzahn: "i'm also not sure about the rolematcher.py part,maybe it's safer to replace it with the new hostname as well, unless we are really sure th" [puppet] - 10https://gerrit.wikimedia.org/r/254465 (owner: 10Andrew Bogott) [01:38:22] (03PS1) 10Yuvipanda: ores: Add paging icinga check for home page [puppet] - 10https://gerrit.wikimedia.org/r/256158 (https://phabricator.wikimedia.org/T119340) [01:38:33] mutante: ^ is the thing I was asking about [01:39:06] mutante: so I need to make sure that it pages only me and halfak [01:39:11] (once I add him as a contact) [01:39:23] o/ [01:40:12] don't want to page everyone yet :) [01:42:23] yuvipanda: what's a good name for that "team" (you and halfak) [01:42:32] mutante: 'ores'? [01:42:38] we can be team-ores if you want to stick to that [01:42:43] instead of just ores [01:42:45] okay [01:43:13] let me start by adding this as email notification for just you [01:43:21] then if it works we switch it to paging? 
[01:43:36] mutante: +1 [01:44:05] (03PS1) 10Ori.livneh: Add proof-of-concept variants of [[en:Barack Obama]] for T119797 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256159 [01:44:20] * yuvipanda goes afk for about 15mins [01:45:30] (03CR) 10Ori.livneh: [C: 032] Add proof-of-concept variants of [[en:Barack Obama]] for T119797 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256159 (owner: 10Ori.livneh) [01:45:51] (03Merged) 10jenkins-bot: Add proof-of-concept variants of [[en:Barack Obama]] for T119797 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256159 (owner: 10Ori.livneh) [01:46:18] James_F: $wgEchoSharedTrackingDB = 'wikishared'; merged on gerrit but not tin [01:46:31] OK to merge? [01:46:51] (03CR) 10Dzahn: [C: 04-1] "you are using "ores" as the contact group in the service definition but where you add the group it's "revscoring"." [puppet] - 10https://gerrit.wikimedia.org/r/256158 (https://phabricator.wikimedia.org/T119340) (owner: 10Yuvipanda) [01:47:06] oh, it's labs only [01:47:07] yeah [01:47:10] merged [01:47:52] !log ori@tin Synchronized docroot and w: (no message) (duration: 00m 30s) [01:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:49:08] mutante: am idiot, etc. is that patch enough to add emails for me and him though? [01:49:38] (03PS2) 10Dzahn: ores: Add paging icinga check for home page [puppet] - 10https://gerrit.wikimedia.org/r/256158 (https://phabricator.wikimedia.org/T119340) (owner: 10Yuvipanda) [01:49:50] yuvipanda: yea, i think it is, i was about to merge that now [01:50:03] i'll check on neon [01:50:38] mutante: \o/. 
need to add a contact for halfak on private tho [01:50:58] yes, i was about to say that [01:51:00] i'll add that too [01:51:17] (03PS3) 10Dzahn: ores: Add paging icinga check for home page [puppet] - 10https://gerrit.wikimedia.org/r/256158 (https://phabricator.wikimedia.org/T119340) [01:51:50] yuvipanda: already exists :) [01:52:38] haha really? [01:52:45] have we already been paging halfak for things? [01:52:46] yea, i think i added him [01:52:53] mutante: is it setup to page too? [01:52:55] no, emailing [01:53:12] Oh! I get pages when the db slaves are angry [01:53:17] Rather emails [01:53:20] Are those "pages"? [01:53:21] ah [01:53:33] no, by "pages" we mean SMS [01:53:36] halfak: well, depends :D are they enough for you or do you want SMS too? [01:53:39] I don't get a cool little pager to wear on my belt? [01:53:49] I want SMS for ORES [01:53:52] ok [01:54:00] mutante: so he has a google voice number. I'm not sure how to set that one up [01:54:12] it will be trickier to make sure you get SMS for _just this one service but not the other services_ [01:54:13] halfak: do you know of any way to send you an SMS via email? [01:54:24] we might have to just use 2 contacts [01:54:28] Maybe... I do have gvoice. [01:54:29] mutante: if we want to be hacky we can create a halfak-sms user [01:54:50] halfak: right, so for other US people we just send mail to @ and it gets sent as SMS [01:54:56] halfak: I think gvoice might have something like that [01:55:09] yea, so it depends on your provider [01:55:12] Looks like I can set up a filter in gmail to do it. [01:55:15] if they have a gateway for this [01:55:24] ok, i'm just slow to reply here [01:55:41] I gotta go for about 15mins to get ready for an event in the evening. [01:55:47] afk for a bit, I'll be back soon [01:56:09] usually a gateway exists if they are a bigger company that owns their own network and not a reseller [01:56:15] mutante: feel free to merge etc :D [01:56:20] and thank you very very much! 
[01:57:17] ok! merging the email part [01:57:25] (03CR) 10Dzahn: [C: 032] ores: Add paging icinga check for home page [puppet] - 10https://gerrit.wikimedia.org/r/256158 (https://phabricator.wikimedia.org/T119340) (owner: 10Yuvipanda) [02:01:19] halfak: would it matter to you that you get SMS for this one specific service but NOT for the other things you get email for already? [02:01:34] halfak: or would it be more like "either SMS for all or for none" [02:02:17] the second part is easier, the first one also works just a little hacky with 2 separate users [02:02:17] mutante, it looks like a gmail filter is my best bet for getting an SMS with a gvoice message, so we can do email across the board. [02:02:55] halfak: ah:) ok! [02:03:28] fwiw, we can also setup timezones so that it does not do this 24/7 [02:03:43] mutante, for ORES, it should get me out of bed. [02:03:48] alright [02:04:53] halfak: if there are any issues with the gvoice part let us know. we can probably also use what ops uses for this [02:05:16] mutante, will do. Thanks. :) [02:18:30] !log restarted hhvm on mw1147 [02:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:22:06] yuvipanda: for when you are back .. so that check isn't realized on neon like this.- the contact group is, but the service and host are not [02:22:54] i'll try to find out but also dinner .. [02:23:23] it doesn't break anything though [02:23:36] like the icinga config check is fine, so i'll see later [02:23:58] sees a puppet fail on labcontrol1001 where it would be added [02:25:02] bah, duplicate definition because service called "main_page" exists twice now, hah [02:25:05] fixing that [02:26:09] all of them should be as specific as possible. 
-> "main_page_ores" [02:26:52] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.7) (duration: 10m 14s) [02:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:57] (03PS1) 10Dzahn: ores/tools: fix duplicate icinga service name [puppet] - 10https://gerrit.wikimedia.org/r/256160 [02:29:16] (03PS2) 10Dzahn: ores/tools: fix duplicate icinga service name [puppet] - 10https://gerrit.wikimedia.org/r/256160 (https://phabricator.wikimedia.org/T119340) [02:31:14] (03PS3) 10Dzahn: ores/tools: fix duplicate icinga service name [puppet] - 10https://gerrit.wikimedia.org/r/256160 [02:32:37] 6operations, 6Labs, 10Tool-Labs, 7Icinga: tool labs instance distribution monitoring is broken - https://phabricator.wikimedia.org/T119929#1840320 (10Dzahn) 3NEW [02:33:53] (03PS4) 10Dzahn: ores/tools: fix duplicate icinga service name [puppet] - 10https://gerrit.wikimedia.org/r/256160 [02:34:00] (03CR) 10Dzahn: [C: 032] ores/tools: fix duplicate icinga service name [puppet] - 10https://gerrit.wikimedia.org/r/256160 (owner: 10Dzahn) [02:39:22] 6operations, 10Analytics-EventLogging, 7Icinga: eventlog2001 - CRITICAL status of defined EventLogging jobs - https://phabricator.wikimedia.org/T119930#1840331 (10Dzahn) 3NEW [02:42:40] (03PS1) 10Dzahn: ores monitoring: fix typo for contact group param [puppet] - 10https://gerrit.wikimedia.org/r/256161 [02:43:01] (03CR) 10jenkins-bot: [V: 04-1] ores monitoring: fix typo for contact group param [puppet] - 10https://gerrit.wikimedia.org/r/256161 (owner: 10Dzahn) [02:43:04] (03PS2) 10Dzahn: ores monitoring: fix typo for contact group param [puppet] - 10https://gerrit.wikimedia.org/r/256161 [02:43:39] (03CR) 10Dzahn: [C: 032 V: 032] ores monitoring: fix typo for contact group param [puppet] - 10https://gerrit.wikimedia.org/r/256161 (owner: 10Dzahn) [02:47:06] started icinga-wm [02:47:15] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
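The duplicate-definition failure being fixed in the patches above (two Icinga services both called "main_page") comes down to Icinga requiring the service description to be unique per monitored host. A hypothetical Icinga 1.x object fragment showing the shape of the fix (host, template, and command names are made up, not the actual WMF config):

```
# Before: both the ores and the tools monitoring declared
#   service_description  main_page
# on the same host, which Icinga rejects as a duplicate definition.
# After: each check gets a service-specific name, e.g.:
define service {
    use                  generic-service
    host_name            ores-web
    service_description  main_page_ores
    check_command        check_http
}
```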
[02:57:05] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [03:09:53] Reedy, Krenair, mutante: " !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.7) (duration: 10m 14s)" -- it looks like it all worked! [03:10:01] yay [03:10:29] logs all look good, no hidden soft errors [03:10:31] nice! [03:10:47] now from mira ?:) [03:12:01] we should probably run a manual test from there first, but yeah I guess that's a next step [03:12:55] cool [03:13:18] mutante: thanks for taking care of it! [03:13:34] this guy is using vim to present but doesn't have relno turned on [03:13:37] this is distracting [03:13:57] * bd808 doesn't use relno [03:14:42] (this episode of vim confessions brought to you by ...) [03:15:24] Emacs! [03:15:50] It's always funny when competing shows buy airtime on each other :p [03:16:19] heh :D [03:16:44] I guess for people not used to relno using relno is terrible [03:16:47] yuvipanda: except..it still doesn't work >;p [03:16:55] for some reason it does not create the host now [03:17:14] also https://github.com/jupyter/jupyterhub/issues/109 fun, I was looking for a solution for this and bam now apparently I told people I'll work on it many many months ago [03:17:17] but i can absolutely confirm i did _the exact same thing_ in another place and it works [03:17:33] mutante: hmm that's strange.. [03:18:06] mutante: it's the same thing we're using for tools too [03:18:20] yes, and the same thing i use for blog certs [03:18:49] what a perfect example of apergos' law [03:19:18] 1.5 hours later because it was "5 minutes" [03:19:54] heh [03:20:09] stares at modules/icinga/manifests/monitoring/certs.pp and it's _the same thing_...grrr [03:21:55] oh, and i use absolute line numbers all the time:) [03:22:55] v2jd ftw [03:22:58] :D [03:23:07] (v3kd is worse without relno) [03:23:46] bd808: also, re your wikitech-l post about varnish + htcp, I'm wondering if varnish (or something?) 
+ rcstream would work [03:24:28] since rcstream is much easier to deal with than htcp [03:24:57] yeah I think you'd want something semi-robust for the purge messages [03:25:18] is HTCP more robust than rcstream over internet level links? [03:25:21] maybe even go so far as to use kafka [03:25:30] also doesn't HTCP require communication with ops while rcstream doesn't [03:25:35] oh yeah, we can use rcstream over kafka :D [03:26:08] * yuvipanda has vaguely grand plans involving kafka for https://phabricator.wikimedia.org/T100082 [03:26:30] k8s had daemonsets, so am waiting for someone to setup kafka 0.9 on them [03:26:42] 0.9 has authentication [03:27:07] Unlike what g.wicke linked I think it would only be for "local" traffic inside some network (uni, big corp, maybe isp) [03:27:29] now that we are https everywhere it would probably be trickier to do [03:27:29] +1 but how do you get HTCP from *us* to them? [03:27:41] oh yeahhhhhhhhh, I had totally forgotten about that [03:28:03] 7Blocked-on-Operations, 6operations, 10RESTBase, 6Services: Switch RESTBase to use Node.js 4.2 - https://phabricator.wikimedia.org/T107762#1840369 (10GWicke) [03:28:30] bblack: yeah, you're right, I think that'll require fucking around with self-signed certs and trust issues [03:28:37] or using a different domain [03:29:41] for the purge messages off the top of my head... kafka topic with N days of history and custom consumer on the appliance [03:30:16] 0.9 has TLS for kafka and authentication too [03:30:20] if the last purge you knew about wasn't still in the topic then dump all cache and start over [03:31:03] right [03:31:12] but I guess you'll need a different domain [03:31:22] 7Blocked-on-Operations, 6operations, 10RESTBase, 6Services: Switch RESTBase to use Node.js 4.2 - https://phabricator.wikimedia.org/T107762#1840372 (10GWicke) @yuvipanda, @moritzmuehlenhoff, @faidon: What is the current plan for unstable Debian repositories / backports? Is there a way to access the node 4.2... 
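The catch-up rule sketched in the discussion above (replay purges if the last one you applied is still retained in the topic, otherwise flush the whole cache and start over) can be written out as a small sketch. This is illustrative only, not a real Kafka client; `last_applied` and `earliest_retained` are hypothetical stand-ins for the consumer's last processed offset and the oldest offset still kept by the topic's retention window:

```shell
#!/bin/bash
# Hypothetical sketch of the purge-consumer catch-up decision.
plan_resync() {
    local last_applied="$1" earliest_retained="$2"
    # No saved position, or our position fell off the end of the topic's
    # retention window: we may have missed purges, so drop everything.
    if [ -z "$last_applied" ] || [ "$last_applied" -lt "$earliest_retained" ]; then
        echo "flush-all"
    else
        # Still within retention: replay only the purges we haven't seen.
        echo "replay-from-$((last_applied + 1))"
    fi
}

plan_resync 120 100   # prints "replay-from-121"
plan_resync 80  100   # prints "flush-all": fell behind retention
```

Polling this decision every few minutes, as suggested below, would keep an appliance honest without needing a persistent connection.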
[03:31:26] you could poll every 5 minutes or so and keep up I think. The appliance would want to bypass cache for authed users too [03:31:37] but tls is the big trick [03:31:57] yeah [03:32:20] esp. with hsts and cert pinning [03:32:24] err [03:32:25] cert preload? [03:33:07] maybe you could do something fancy where the private key isn't on the appliance and is instead fetched from us at runtime and somehow secured in ram? [03:33:17] there would always be attacks though [03:33:25] yeah, I don't think we'll want to let the private key leave us [03:33:37] we could possibly find a way to implement what cloudflare does [03:33:48] but that's going to be hard I think [03:35:21] if it was easy we would already be doing it. I guess the big question is whether or not it is worth the effort [03:35:41] yeah [03:37:23] it might be interesting if WP0 contracts had the option to provide edge cache like that [03:38:17] I think they had to switch from Host based to IP based whitelisting for the SSL switchover [03:38:53] but absent some sort of highly secured physical appliance I'm not sure how to pull off TLS [03:39:32] yeah [03:39:37] I think it's a non-starter [03:40:55] (03PS1) 10Dzahn: ores: disable new monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/256165 [03:41:45] (03PS2) 10Dzahn: ores: disable new monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/256165 [03:41:52] (03CR) 10Dzahn: [C: 032] ores: disable new monitoring check [puppet] - 10https://gerrit.wikimedia.org/r/256165 (owner: 10Dzahn) [03:52:14] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [04:22:05] (03PS1) 10Andrew Bogott: Wikitech: Explicitly rebuild smw data four times/day [puppet] - 10https://gerrit.wikimedia.org/r/256170 [04:52:10] (03CR) 10Santhosh: CX: Use ContentTranslationRESTBase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 (https://phabricator.wikimedia.org/T111562) (owner: 10KartikMistry) 
[05:12:35] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [05:13:05] PROBLEM - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [05:13:35] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: puppet fail [05:16:14] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [150.0] [05:16:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [150.0] [05:17:35] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [05:20:35] PROBLEM - Mobile HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [150.0] [05:24:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [05:24:25] RECOVERY - Mobile HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [05:24:35] RECOVERY - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:27:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [05:35:04] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [05:39:36] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [05:40:35] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:44:37] 6operations, 10procurement: Upgrade 100[7-9] to match codfw (and restbase100[1-6]) hardware - https://phabricator.wikimedia.org/T119935#1840448 (10GWicke) 3NEW [05:45:04] 
6operations, 10procurement: Upgrade 100[7-9] to match codfw (and restbase100[1-6]) hardware - https://phabricator.wikimedia.org/T119935#1840457 (10GWicke) [05:46:07] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1840459 (10GWicke) [05:46:45] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1840448 (10GWicke) [06:06:44] PROBLEM - puppet last run on dbstore2001 is CRITICAL: CRITICAL: puppet fail [06:30:15] PROBLEM - puppet last run on mw2052 is CRITICAL: CRITICAL: puppet fail [06:30:24] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:04] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:15] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: puppet fail [06:31:35] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:36] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:04] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:05] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:35] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:36] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:04] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:05] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:45] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:47:17] (03CR) 10Faidon Liambotis: [C: 04-1] Labs: switch PAM handling to use pam-auth-update (035 comments) [puppet] 
- 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) (owner: 10coren) [06:52:40] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Dec 1 06:52:40 UTC 2015 (duration 52m 39s) [06:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:54:27] (03CR) 10Faidon Liambotis: [C: 04-1] ""Hash sum mismatches' are mirror inconsistencies that happened in the upstream mirror. They have nothing to do with /our/ syncing, unless " [puppet] - 10https://gerrit.wikimedia.org/r/256155 (owner: 10Dzahn) [06:55:36] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:56:35] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:56:45] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:56:46] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:06] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:03:17] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 7 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1840488 (10Arrbee) [07:15:25] PROBLEM - Check size of conntrack table on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:15:46] PROBLEM - RAID on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:15:55] PROBLEM - salt-minion processes on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:16:05] PROBLEM - Disk space on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:16:05] PROBLEM - DPKG on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:16:24] PROBLEM - puppet last run on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:16:45] PROBLEM - configured eth on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:16:55] PROBLEM - dhclient process on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:17:41] (03CR) 10Faidon Liambotis: [C: 04-1] "So this is currently like that by design, although I can see arguments for changing it." [puppet] - 10https://gerrit.wikimedia.org/r/256148 (owner: 10BBlack) [07:18:36] RECOVERY - configured eth on planet1001 is OK: OK - interfaces up [07:18:44] RECOVERY - dhclient process on planet1001 is OK: PROCS OK: 0 processes with command name dhclient [07:19:05] RECOVERY - Check size of conntrack table on planet1001 is OK: OK: nf_conntrack is 0 % full [07:19:34] RECOVERY - RAID on planet1001 is OK: OK: no RAID installed [07:19:35] RECOVERY - salt-minion processes on planet1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:19:45] RECOVERY - Disk space on planet1001 is OK: DISK OK [07:19:45] RECOVERY - DPKG on planet1001 is OK: All packages OK [07:19:56] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [07:25:14] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [07:26:14] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:26:44] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [07:27:05] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [07:27:45] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [07:28:15] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is 
currently enabled, last run 1 minute ago with 0 failures [07:28:25] RECOVERY - puppet last run on mw2052 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [07:28:36] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:50:48] 7Blocked-on-Operations, 6operations, 10RESTBase, 6Services: Switch RESTBase to use Node.js 4.2 - https://phabricator.wikimedia.org/T107762#1840539 (10MoritzMuehlenhoff) Debian unstable will stick with the 4.2 LTS release, while Debian experimental will follow the latest releases (like 5.2 currently). We ca... [08:09:30] (03PS8) 10Giuseppe Lavagetto: etcd: auth puppettization [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) [08:15:27] (03PS1) 10Giuseppe Lavagetto: admin: add my secondary ssh key (w. OTP) [puppet] - 10https://gerrit.wikimedia.org/r/256173 [08:17:15] (03CR) 10Giuseppe Lavagetto: [C: 032] admin: add my secondary ssh key (w.
OTP) [puppet] - 10https://gerrit.wikimedia.org/r/256173 (owner: 10Giuseppe Lavagetto) [08:34:55] (03CR) 10Nikerabbit: [C: 031] CX: Use ContentTranslationRESTBase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255102 (https://phabricator.wikimedia.org/T111562) (owner: 10KartikMistry) [08:50:16] (03Abandoned) 10Isart: fixing user/pass on MySQL_PS template [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256008 (owner: 10Isart) [09:23:45] (03PS1) 10Jcrespo: For some reason nodepool db doesn't exist, making backups fail [puppet] - 10https://gerrit.wikimedia.org/r/256175 [09:25:24] (03CR) 10Jcrespo: [C: 032] For some reason nodepool db doesn't exist, making backups fail [puppet] - 10https://gerrit.wikimedia.org/r/256175 (owner: 10Jcrespo) [09:25:59] (03PS1) 10Merlijn van Deen: package_builder: add option to use built packages during build [puppet] - 10https://gerrit.wikimedia.org/r/256176 [09:34:35] (03PS1) 10Jcrespo: It turns out that the correct name is nodepooldb [puppet] - 10https://gerrit.wikimedia.org/r/256177 [09:35:02] (03CR) 10Jcrespo: [C: 032] It turns out that the correct name is nodepooldb [puppet] - 10https://gerrit.wikimedia.org/r/256177 (owner: 10Jcrespo) [09:38:02] (03PS5) 10Addshore: Set WDQS 5min expiry for internal access Port:8888 [puppet] - 10https://gerrit.wikimedia.org/r/256039 (https://phabricator.wikimedia.org/T119941) [09:43:02] (03PS1) 10Hashar: Simplify phabricator statistics emails [puppet] - 10https://gerrit.wikimedia.org/r/256179 [09:50:14] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 644 [09:52:15] mmm [09:52:22] is that FR? 
[09:53:25] yes it is [09:55:14] RECOVERY - check_mysql on db1008 is OK: Uptime: 1524643 Threads: 6 Questions: 24730186 Slow queries: 10255 Opens: 22420 Flush tables: 2 Open tables: 64 Queries per second avg: 16.220 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:09:44] 6operations, 6Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1840761 (10faidon) 3NEW [10:12:48] (03PS1) 10Addshore: Fix 403 -> 404 typo in webpref asset-check.js [puppet] - 10https://gerrit.wikimedia.org/r/256182 [10:20:18] addshore: btw I'm clinic duty this week, I'll try to get to the wqds reviews but unlikely [10:20:50] godog: that would be amazing :) they should both be quite trivial! [10:20:52] addshore: also related, what do we need to do to get https://gerrit.wikimedia.org/r/#/c/255720/ deployed to production too? [10:21:01] I'm just about to reply there :) [10:21:57] godog: https://gerrit.wikimedia.org/r/255720 replied! [10:22:22] ha-ha, thanks! [10:22:48] and as for the WDQS stuff, I can baby sit the changes & check everything is fine if you do go for them! [10:24:55] addshore: yeah I haven't been involved much with wqds so don't have much context really [10:25:45] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1840856 (10jcrespo) [10:28:24] godog: so basically the file I touch is the nginx config which is on each of the wdqs hosts [10:28:29] (03PS1) 10Addshore: Add wikidata mainpage to webpref asset-check [puppet] - 10https://gerrit.wikimedia.org/r/256187 (https://phabricator.wikimedia.org/T117555) [10:28:31] (03PS1) 10Addshore: Add WD Q64 static version to webpref asset-check [puppet] - 10https://gerrit.wikimedia.org/r/256188 (https://phabricator.wikimedia.org/T117555) [10:29:14] godog: the /raw/ entry for the blazegraph bit can be seen on port 9999. 
nginx is acting as a proxy to limit what can be done and most importantly force the setting of the max execution time header :) [10:29:48] (03CR) 10jenkins-bot: [V: 04-1] Add WD Q64 static version to webpref asset-check [puppet] - 10https://gerrit.wikimedia.org/r/256188 (https://phabricator.wikimedia.org/T117555) (owner: 10Addshore) [10:30:55] good morning :-} [10:31:16] any idea why salt times out when I first use it then after a few tries ends up working reliably ? [10:31:31] <_joe_> for some value of reliably [10:31:44] seems the server does not keep open connections with salt-minions and it takes a while to warm up all the minions [10:32:07] <_joe_> yeah that's completely horrible [10:32:09] so I end up doing a few salt --timeout 60 --show-timeout cmd.run hostname [10:32:12] to warm it up [10:32:20] <_joe_> sigh [10:32:39] hashar: apergos is fixing it up, see https://phabricator.wikimedia.org/T115287 for the gory details [10:32:55] (03PS1) 10Jcrespo: Retry replicated transactions infinitely [puppet] - 10https://gerrit.wikimedia.org/r/256189 [10:33:07] (03PS2) 10Addshore: Add WD Q64 static version to webpref asset-check [puppet] - 10https://gerrit.wikimedia.org/r/256188 (https://phabricator.wikimedia.org/T117555) [10:33:39] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1840875 (10fgiunchedi) p:5High>3Normal seems to be working fine too on a jessie host, I can't see from wikitech what classes are applied to limn1, maybe that has to do with it... 
[10:34:02] the server does indeed keep open connections [10:34:07] but that's a whole different story [10:35:52] ah [10:36:06] well the task is a long read, but I am happy to know it is a known issue :-} [10:36:28] I found some log errors on my salt master, will look at it ( 2015-12-01 10:31:41,024 [salt.loaded.int.returner.local_cache][ERROR ] An extra return was detected from minion integration-slave-trusty-1012.integration.eqiad.wmflabs, please verify the minion, this could be a replay attack ) :D [10:36:48] the extra return means that there are two salt minions on the host in question [10:36:51] (probably) [10:38:18] (03PS1) 10Addshore: Add authors to wikidatabuilder cron-build.sh [puppet] - 10https://gerrit.wikimedia.org/r/256190 [10:38:26] apergos: indeed! thanks a bunch [10:38:34] yw [10:43:47] (03CR) 10Jcrespo: [C: 032] "Applied live." [puppet] - 10https://gerrit.wikimedia.org/r/256189 (owner: 10Jcrespo) [10:49:16] (03PS1) 10Addshore: Remove Einsteinium hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/256191 [10:49:18] (03PS1) 10Addshore: Remove wdqs-roots group from rutherfordium [puppet] - 10https://gerrit.wikimedia.org/r/256192 [10:49:20] (03PS1) 10Addshore: Remove unused wdqs-roots group [puppet] - 10https://gerrit.wikimedia.org/r/256193 [10:50:49] !log restarting on cassandra instances on restbase1008 (to effect openjdk security updates and a few depending libs) [10:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:50:59] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1840893 (10fgiunchedi) @tgr I've copied them to `/tmp/rsvg-T112421/` on `deployment-bastion` [11:06:43] (03CR) 10Aklapper: [C: 031] "Thanks. 
No strong opinions hence +1" [puppet] - 10https://gerrit.wikimedia.org/r/256179 (owner: 10Hashar) [11:12:08] 6operations, 10Incident-20150825-Redis, 5Patch-For-Review: Enable memory overcommit for all redis hosts with persistance - https://phabricator.wikimedia.org/T91498#1840942 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi looks like this is completed, tentatively resolving [11:13:25] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [5000000.0] [11:16:33] 6operations, 7Database: External Storage on codfw (es2005-2010) is consuming 100-90GB of disk space per server and per month and it has 370GB available - https://phabricator.wikimedia.org/T119056#1840955 (10jcrespo) They will run out of disk space around the 15 March 2016. [11:24:47] 6operations, 10Traffic, 10Wikimedia-Stream, 5Patch-For-Review: rcstream service on port 443 is broken, spamming logs - https://phabricator.wikimedia.org/T118956#1840985 (10fgiunchedi) paging @joe @ori looks like https://gerrit.wikimedia.org/r/253917 could be merged? 
[11:28:59] 6operations, 6Services, 5Patch-For-Review, 7RESTBase-architecture: Separate /var on restbase100x - https://phabricator.wikimedia.org/T113714#1841011 (10fgiunchedi) a:3fgiunchedi [11:31:06] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [11:33:00] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: Decom observium.wikimedia.org - https://phabricator.wikimedia.org/T118790#1841046 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi doesn't look like there's either `observium` database or a `observium` user in production, resolving [11:37:39] (03CR) 10Alexandros Kosiaris: [C: 031] "I think it's OK to decouple the successful execution of those 2 things (apt-get update and puppet runs) and not rely on this script failin" [puppet] - 10https://gerrit.wikimedia.org/r/256148 (owner: 10BBlack) [11:38:43] 6operations, 10Beta-Cluster-Infrastructure, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#1841074 (10fgiunchedi) [11:39:14] (03PS2) 10Alexandros Kosiaris: etherpad: Remove some old ignored settings [puppet] - 10https://gerrit.wikimedia.org/r/256032 [11:39:21] (03CR) 10Alexandros Kosiaris: [C: 032] etherpad: Remove some old ignored settings [puppet] - 10https://gerrit.wikimedia.org/r/256032 (owner: 10Alexandros Kosiaris) [11:46:58] 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#1841099 (10Lydia_Pintscher) p:5Normal>3High Folks, Can we please get this moving? It is nearly a year since the ticket was opened. 
[11:51:34] (03PS4) 10Giuseppe Lavagetto: base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) [11:52:16] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [11:52:21] <_joe_> godog: mind to re-look? ^^ [11:52:30] <_joe_> the PS, not the alarm [11:52:33] (03CR) 10jenkins-bot: [V: 04-1] base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [11:52:44] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [11:52:47] <_joe_> rubocop fascism [11:54:26] yeah I find it silly honestly [11:54:47] <_joe_> also, I can't make my function work without do |args| [11:54:51] <_joe_> as puppet barfs [11:54:58] <_joe_> what am I supposed to do? [11:56:09] <_joe_> hashar: ^^ [11:57:28] (03PS1) 10Muehlenhoff: openldap: Document setup of cn=repluser and cn=admin [puppet] - 10https://gerrit.wikimedia.org/r/256206 [12:00:16] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [12:00:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [12:01:34] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:01:45] _joe_: I don't even know what "newfunction(:puppet_ssldir, :type => :rvalue) do |args" does [12:01:46] :( [12:02:31] <_joe_> hashar: yeah my point is that puppet forces me to do that [12:02:36] yup [12:02:44] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
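The `newfunction(:puppet_ssldir, :type => :rvalue) do |args|` idiom being puzzled over above is Puppet 3.x's custom-function API: the Ruby block passed to `newfunction` becomes the function body, and Puppet always invokes it with a single array of arguments, which is why the `do |args|` wrapper cannot be avoided. A plain-Ruby sketch of the registration pattern (the registry, the role-based logic, and the ssldir paths below are illustrative stand-ins, not Puppet's actual internals):

```ruby
# Illustrative stand-in for Puppet's function registry (not the real API).
FUNCTIONS = {}

# Mimics Puppet 3.x's newfunction: the block becomes the function body.
# :type => :rvalue marks the function as returning a value.
def newfunction(name, opts = {}, &block)
  FUNCTIONS[name] = block
end

# The `do |args|` block is required because the caller always hands the
# function one array of arguments.
newfunction(:puppet_ssldir, :type => :rvalue) do |args|
  # Hypothetical logic: pick an ssldir based on the node's role.
  args.first == 'master' ? '/var/lib/puppet/server/ssl' : '/var/lib/puppet/ssl'
end

puts FUNCTIONS[:puppet_ssldir].call(['master'])  # => /var/lib/puppet/server/ssl
puts FUNCTIONS[:puppet_ssldir].call(['agent'])   # => /var/lib/puppet/ssl
```

In real Puppet 3.x code the same block would be registered inside `module Puppet::Parser::Functions` in a module's `lib/puppet/parser/functions/` directory; only the block-argument convention is demonstrated here.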
[12:02:51] flake8 has a way to ignore a single line with # noqa [12:03:05] can't find a similar per line ignore rule in rubocop though [12:04:39] <_joe_> hashar: I think I have a good idea [12:04:44] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [12:05:25] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [12:06:42] _joe_: another way is to ignore that rule entirely via .rubocop.yml [12:07:13] (03CR) 10Filippo Giunchedi: [C: 04-1] base::certificates: add puppet's CA to the trusted store (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [12:08:36] (03CR) 10JanZerebecki: [C: 031] Remove Einsteinium hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/256191 (owner: 10Addshore) [12:09:30] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#1841136 (10fgiunchedi) FWIW ipvsadm 1.28 is in stretch, once all pybal machines are jessie we could backport it to `jessie-backports` [12:10:01] (03CR) 10JanZerebecki: [C: 031] Remove wdqs-roots group from rutherfordium [puppet] - 10https://gerrit.wikimedia.org/r/256192 (owner: 10Addshore) [12:10:50] <_joe_> godog: you're actually wrong :) [12:10:50] (03CR) 10JanZerebecki: [C: 031] Remove unused wdqs-roots group [puppet] - 10https://gerrit.wikimedia.org/r/256193 (owner: 10Addshore) [12:10:57] (03CR) 10Giuseppe Lavagetto: base::certificates: add puppet's CA to the trusted store (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [12:13:18] _joe_: heh, it does really seem to go against puppet's grain [12:13:32] <_joe_> yeah [12:14:09] <_joe_> godog: well tbh it's our own fault [12:14:30] <_joe_> godog: normally you won't have your ssldir vary based on totally unrelated variables 
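On the per-line ignore question raised above: rubocop does support an inline directive analogous to flake8's `# noqa`, though it requires naming the offending cop explicitly (cop names vary across rubocop versions; `Metrics/LineLength` and `Style/GlobalVars` are assumed here). A minimal sketch:

```ruby
# flake8 suppresses a single Python line with a trailing "# noqa".
# rubocop's equivalent is an inline disable directive naming the cop:
long_line = 'x' * 120 # rubocop:disable Metrics/LineLength

# A span of lines can also be bracketed with disable/enable comments:
# rubocop:disable Style/GlobalVars
$counter = 0
# rubocop:enable Style/GlobalVars

puts long_line.length  # => 120
```

The directives are ordinary comments, so they are inert at runtime; alternatively, a whole cop can be switched off repo-wide via `.rubocop.yml`, as mentioned later in the discussion.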
:P [12:15:05] yeah but still it'd have to be hardcoded in the manifest [12:15:26] (03CR) 10Filippo Giunchedi: base::certificates: add puppet's CA to the trusted store (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [12:15:34] <_joe_> godog: normally there is no reason to have the ssldir vary between the master and the client, either [12:15:50] <_joe_> it's really only self-inflicted pain [12:16:20] <_joe_> what would be cleaner, btw? [12:16:36] <_joe_> copying the CA cert around on the master? [12:16:46] yeah [12:16:49] <_joe_> I think that will be even more hoorrible in the end [12:17:12] <_joe_> and heh, thinking again, no it might work [12:17:30] <_joe_> well, there is a race condition on the master, heh [12:17:36] <_joe_> so no, it can't work [12:17:59] which race condition? [12:18:24] <_joe_> you wouldn't be able to run puppet on the masters the first time, as the config is brought up by the puppet classes. So, in the case of a self-hosted puppetmaster what will happen is: [12:18:42] <_joe_> 1) puppet runs fine until we become the puppetmaster [12:19:22] <_joe_> 2) we become the puppetmaster; and we thus install somewhere accessible via puppet the CA (which will lie around unversioned, btw) [12:19:54] <_joe_> and no, it might wor [12:20:50] <_joe_> nope, when we're becoming puppetmaster, we would read the ssldir which is still the old one, at the next puppet run it will be fixed [12:21:16] <_joe_> so yeah, not a big deal; the big deal is having an unversioned file lying around within our puppet dirs [12:21:28] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: clarify how to download a package [puppet] - 10https://gerrit.wikimedia.org/r/256125 (owner: 10Merlijn van Deen) [12:21:29] <_joe_> I think that's way worse than what I did [12:21:35] does it have to be in the puppet dirs if it is an eyesore? 
[12:21:55] I have this feeling that the hardcoded ssldir in that function will bite us down the roat [12:21:56] (03PS2) 10Alexandros Kosiaris: package_builder: clarify how to download a package [puppet] - 10https://gerrit.wikimedia.org/r/256125 (owner: 10Merlijn van Deen) [12:21:58] road [12:22:10] <_joe_> godog: well, wait [12:22:38] <_joe_> we can use it everywhere we need to define a non-default ssldir [12:22:40] <_joe_> in fact [12:23:10] <_joe_> so the hardcoded value is in one place only [12:23:29] <_joe_> as it is now (actually in a manifest) [12:24:42] <_joe_> godog: let me try and tell me what do you think about it [12:25:02] yup, sounds good [12:25:45] 6operations, 7Graphite, 7Monitoring: Add monitoring for analytics-statsv service - https://phabricator.wikimedia.org/T117994#1841164 (10fgiunchedi) p:5Triage>3Normal [12:26:04] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [150.0] [12:26:05] 6operations, 10ops-codfw: es2010 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T117848#1841166 (10fgiunchedi) p:5Triage>3Normal [12:26:36] 6operations, 10Traffic: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#1841168 (10fgiunchedi) p:5Triage>3Normal [12:28:12] 6operations: Enforce password requirements for account creation on wikitech - https://phabricator.wikimedia.org/T118386#1841178 (10fgiunchedi) p:5Triage>3Normal @andrew iirc password minimums on wikitech are now being enforced? 
[12:28:39] 6operations, 10Traffic: cache_upload should give an informative 404 rather than 403 on req.http.host != upload.wikimedia.org - https://phabricator.wikimedia.org/T118394#1841180 (10fgiunchedi) p:5Triage>3Normal [12:29:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [12:30:29] <_joe_> godog: actually you gave me an idea about how to solve this [12:32:04] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [12:32:31] 6operations, 10RESTBase: API portal loads on domains without RESTBase, but lacks styling - https://phabricator.wikimedia.org/T118410#1841187 (10fgiunchedi) p:5Triage>3Normal anything for #operations in this case? [12:35:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] openldap: Make slapd.conf 0440 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/256009 (owner: 10Muehlenhoff) [12:35:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [12:38:55] 6operations: uwsgi takes a long time to restart (Debian Jessie in labs) - https://phabricator.wikimedia.org/T118495#1841193 (10fgiunchedi) p:5Triage>3Normal I can't reproduce on graphite2.eqiad.wmflabs using `service`, it is a specific command taking a long time to return or uwsgi to come up and bind the por... [12:40:02] 6operations, 7Database, 7Performance: number of database updates multiplied x3 since 29 October - https://phabricator.wikimedia.org/T117398#1841197 (10fgiunchedi) p:5Triage>3Normal [12:48:26] 6operations: Investigate why mw2208 is powered off - https://phabricator.wikimedia.org/T118857#1841216 (10fgiunchedi) the machine has been indeed reinstalled but puppet hasn't run yet, the SEL doesn't contain any faults tho ``` /admin1-> racadm getsel Record: 1 Date/Time: 01/16/2015 00:14:21 Source:... 
[12:59:39] !log reimage mw2208 [12:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:00:33] 6operations: Investigate why mw2208 is powered off - https://phabricator.wikimedia.org/T118857#1841242 (10fgiunchedi) I'm going to reinstall it since puppet didn't run anyway yet and it yield an error when forcing a manual run [13:00:44] 6operations: Investigate why mw2208 is powered off - https://phabricator.wikimedia.org/T118857#1841243 (10fgiunchedi) p:5Triage>3Normal [13:11:29] 6operations, 10RESTBase, 6Services: API portal loads on domains without RESTBase, but lacks styling - https://phabricator.wikimedia.org/T118410#1841273 (10Krenair) not sure, probably more for #services [13:17:14] (03CR) 10Muehlenhoff: openldap: Make slapd.conf 0440 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/256009 (owner: 10Muehlenhoff) [13:19:36] (03PS1) 10Muehlenhoff: opnldap: Allow passing a higher size limit for LDAP queries [puppet] - 10https://gerrit.wikimedia.org/r/256213 [13:20:27] (03PS2) 10Muehlenhoff: openldap: Allow passing a higher size limit for LDAP queries [puppet] - 10https://gerrit.wikimedia.org/r/256213 [13:28:29] 6operations, 10RESTBase, 6Services: Switch RESTBase to use service::node - https://phabricator.wikimedia.org/T118401#1841281 (10fgiunchedi) p:5Triage>3Normal [13:29:49] 6operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#1841283 (10fgiunchedi) p:5Triage>3Normal [13:32:03] 6operations, 6Services: Update node_js to latest 0.10.x release - https://phabricator.wikimedia.org/T119218#1841290 (10fgiunchedi) p:5Triage>3Normal [13:34:44] PROBLEM - configured eth on mw2208 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:35:13] PROBLEM - dhclient process on mw2208 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[13:35:42] PROBLEM - mediawiki-installation DSH group on mw2208 is CRITICAL: Host mw2208 is not in mediawiki-installation dsh group [13:35:53] PROBLEM - nutcracker port on mw2208 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:36:03] PROBLEM - nutcracker process on mw2208 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:38:43] RECOVERY - configured eth on mw2208 is OK: OK - interfaces up [13:39:13] RECOVERY - dhclient process on mw2208 is OK: PROCS OK: 0 processes with command name dhclient [13:39:52] RECOVERY - nutcracker port on mw2208 is OK: TCP OK - 0.000 second response time on port 11212 [13:40:03] RECOVERY - nutcracker process on mw2208 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [13:41:18] andre__: https://tools.wmflabs.org/contact/ might be of use to you as well [13:41:44] andre__: it's basically a search engine for groups/users in ldap/phabricator to help find contact information for a given project easily [13:42:58] uh nice! thanks [13:44:08] it can be a bit slow when it needs to get a lot of information from phabricator (it tries to map ldap users to phabricator users) [13:44:31] hey guys. what's the channel to ask about the direction of wikimedia technology and policy, such as citoid and reflinks and the templates and everything? [13:44:55] i thought that was #wikimedia-tech but nobody's talking in there forever [13:45:52] i'm wanting to be abreast of the future of citations. i thought someone said that ideally the citations would someday move to wikidata or something like that. [13:46:36] i mean the references would move there as a single centralized repository, and wikipedia articles would cite things from there. [13:49:30] dtm: actually high level talks like that are usually on the mailinglist. talk on irc is usually more directly between people (unless it is an architecture or office meeting). 
[13:51:15] (03PS2) 10Muehlenhoff: openldap: Make slapd.conf 0440 [puppet] - 10https://gerrit.wikimedia.org/r/256009 [13:52:17] but as far as I am aware, beyond citoid, there are no major things being planned atm. [13:52:39] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: clean up history dumps code, dump only missing pages for reruns [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/254820 (owner: 10ArielGlenn) [13:53:34] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: fix up cleanup of old files from previous run [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/254836 (owner: 10ArielGlenn) [13:54:34] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: new option 'cleanup' to require cleanup when dump rerun [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/254837 (owner: 10ArielGlenn) [13:54:42] thedj, okay. well there was someone i had been speaking with in one of these technical channels long ago, about things like cite.php and the things i mentioned [13:55:07] thedj, so is this the only technical channel on irc? i just see activity in here and #wikipedia-en [13:55:29] no, there are chans for almost every focus area. [13:55:32] oh wait there's #mediawiki. maybe that was it. [13:56:10] mediawiki for more 3rd party facing, wikimedia-dev for the majority of developer talk [13:56:39] #mediawiki is for non-wikipedia-related foreign deployments? [13:56:49] non-WMF [13:56:54] yeah [13:56:56] ok. [13:57:05] 6operations, 6Services: Update node_js to latest 0.10.x release - https://phabricator.wikimedia.org/T119218#1841349 (10MoritzMuehlenhoff) > @cscott wrote: > For that reason I'm a bit surprised that the maintained ubuntu releases aren't following releases on node's stable branch. I *hope* they are cherry-picki... [13:57:38] dtm: https://meta.wikimedia.org/wiki/IRC/Channels#MediaWiki_and_technical [13:58:24] thedj, oh thanks [13:58:41] holy cow they have AI [13:58:50] is that for antiabuse?
[13:59:31] it's also not really complete/up-to-date [13:59:46] dtm: among others, yes [13:59:57] machine learning to detect vandalism [14:00:16] there was a blogpost about it yesterday as a matter of fact. [14:00:29] somebody went on an irc channel creation spamming spree [14:00:36] as is evident by that list [14:01:03] lots of groups and focus areas. :) [14:01:15] "focus" lol yeah [14:01:37] (03CR) 10BBlack: "Well, what we had here didn't really notify us of the broken mirror either. It just stopped running puppet on trusty hosts for 6 hours si" [puppet] - 10https://gerrit.wikimedia.org/r/256148 (owner: 10BBlack) [14:02:04] so are you guys the WMF sysadmins? [14:02:56] some here are yes. Others are just those who want to observe what goes on, or need to regularly communicate with that team. [14:03:47] clearly we need a channel for each article. i don't know how they think we'll ever get this encyclopedia done [14:04:12] thedj, as a fellow sysadmin, i salute you [14:06:40] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Set up backend per-IP limits on varnish for WDQS - https://phabricator.wikimedia.org/T119917#1841372 (10BBlack) It would be best to use the header `X-Client-IP` as the notion of the client IP address for these sorts of purposes. This is intend... [14:07:03] 6operations, 6Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Set up backend per-IP limits on varnish for WDQS - https://phabricator.wikimedia.org/T119917#1841374 (10BBlack) [14:09:08] dtm: i'm actually one of those ppl observing :) Though I'm a backup sysadmin in my actual job. [14:10:13] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail [14:11:17] thedj, well done. [14:13:01] (03PS1) 10ArielGlenn: dumps: import and method name fixes for monitor.py after the refactor [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256216 [14:14:23] thedj, so are all these warning alerts in here accurate? they're not overly sensitive?
that's a lot of critical failures [14:15:06] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: import and method name fixes for monitor.py after the refactor [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256216 (owner: 10ArielGlenn) [14:15:09] the rate of warning/critical alerts in here is usually roughly appropriate for what the team needs to deal with (maybe a little heavy in some cases, it's not perfect) [14:15:12] Just trying to edit testwiki Talk:Sandbox: [cb216c27] Exception Caught: Request to parsoid for 'html' to 'wikitext' conversion of content connected to title "Talk:Sandbox" failed: 503 [14:15:18] Should I file it, or is it a known issue? [14:15:22] but most of them don't amount to publicly-visible problems, as things are usually redundant [14:17:03] bblack, well i do understand that the system can consider something critical, that users would never notice. just curious. [14:17:18] i mean i'm just surprised that it's resulting in notifications [14:18:05] so what's happening in times like that? is a watchdog automatically correcting it or does it not get any intevention? [14:18:19] dtm: remember that there are hundreds of servers, for dozens of different tasks. Some of them in production rotation, some not... etc [14:18:57] dtm: depends on the message :) [14:19:00] filed it anyway https://phabricator.wikimedia.org/T119967, probably not a big deal though [14:19:08] tto: seems smart [14:19:22] sorry for interrupting your discussion :) [14:19:49] word. [14:20:14] * dtm flags tto as CRITICAL [14:20:26] try not to be so critical, tto. this is a collegiate environment. [14:20:42] ha! [14:21:41] hm, icinga does love CRITICAL, so much so that it normally says it twice in a row [14:22:13] having said that.. this type of discussion does really belong in #mediawiki or #wikimedia-dev, so I propose we move it there, so that operations ppl can use this for operations monitoring and discussion. 
[14:22:20] also I've never been sure why puppet fails are considered CRITICAL, but then I know barely anything about ops [14:26:50] tto, sounds good [14:27:05] well puppet fails are CRITICAL to us for internal reasons. Usually it means our infrastructure is (at least temporarily) misbehaving, or it means we've committed a bad configuration change. [14:27:19] but that doesn't necessarily mean CRITICAL in terms of affecting site operations for end users. [14:28:04] bblack, so is that something that you adapt to? aren't they kinda continuous? [14:29:24] (03PS1) 10ArielGlenn: dumps: fix bad indent in logger from refactor [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256218 [14:29:36] (03PS5) 10Giuseppe Lavagetto: base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) [14:30:03] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: fix bad indent in logger from refactor [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256218 (owner: 10ArielGlenn) [14:30:41] (03CR) 10jenkins-bot: [V: 04-1] base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [14:34:57] <_joe_> godog: apart from jenkins niceness, can you take a look at the last iteration of https://gerrit.wikimedia.org/r/252681 ? [14:37:39] dtm: the changes are continuous too :) [14:37:53] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:43:18] 10Ops-Access-Requests, 6operations, 6Multimedia, 5Patch-For-Review: Give Bartosz access to stat1003 ("researchers" and "statistics-users") - https://phabricator.wikimedia.org/T119404#1841451 (10ArielGlenn) @Trevor, can you sign off please? [14:44:52] bblack, so what would happen if nobody was monitoring and adjusting it all day?
[14:45:07] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant jgirault an jan_drewniak access to the eventlogging db on stat1003 and hive to query webrequests tables on stat1002 - https://phabricator.wikimedia.org/T118998#1841452 (10ArielGlenn) @jgirault, can you verify that you have access to stat1002 please... [14:45:16] dtm: things would slowly go crazy [14:45:41] but probably the majority of the alerts we get are either immediately or slightly-delayed reactions to things we're actually doing [14:45:49] so if nobody was doing anything, the rate would be considerably lower :) [14:45:53] ha! [14:46:04] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1841453 (10ArielGlenn) @RobH, any luck from the vendors? [14:46:05] so you're making production changes all day? [14:46:10] yup [14:46:26] why's that? isn't there a schedule? [14:46:40] there's a schedule for major code deploys, and we try to warn and predict on outages [14:46:56] but the volume of work is pretty huge, and most of it doesn't impact user service [14:47:13] wow. [14:48:02] bblack, could i ask what kind of things require continuous unscheduled production changes? what's an example? [14:48:19] and again, i'm just curious. i have no complaint. ;) [14:48:19] <_joe_> dtm: operations for large websites rarely uses deployment schedules [14:48:43] <_joe_> for a series of reasons, including the fact you are often in reactive mode [14:49:00] i guess the biggest reason i could imagine is abuse [14:49:14] <_joe_> and the velocity at which you want to move to keep up with development; especially here with our very limited resources [14:49:14] and WMF is probably one of the biggest global targets, i'd imagine. 
[14:49:59] <_joe_> dtm: I worked with teams for web operations who supposedly worked on a strict ITIL change-management diet; they were always struggling [14:50:25] (03CR) 10Alexandros Kosiaris: [C: 04-1] package_builder: add option to use built packages during build (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/256176 (owner: 10Merlijn van Deen) [14:52:06] <_joe_> FWIW, this is not my first gig working for a large-scale web operation, and I talk with a lot of ops from other places - almost no one does deployment schedules for ops [14:52:24] _joe_, interesting. lol. [14:53:11] <_joe_> dtm: this doesn't mean that e.g. when we transitioned to HHVM we didn't have a precise rollout schedule [14:53:27] <_joe_> which we fell behind on day 3 of 150, or something like that :P [14:53:29] wow ITIL? i'd never heard of that. its wikipedia article is the height of obnoxiousness and i want to delete it so bad ;) [14:53:30] (03CR) 10Alexandros Kosiaris: [C: 031] openldap: Make slapd.conf 0440 [puppet] - 10https://gerrit.wikimedia.org/r/256009 (owner: 10Muehlenhoff) [14:53:30] akosiaris: maybe a mkdir -p ${BUILDRESULT} ? [14:53:43] last i heard, the IT industry was a pure free-for-all. glad to know there are some standards! ;) [14:53:59] <_joe_> dtm: ITIL is even more obnoxious when you have to suffer it, I've been told [14:54:00] still a bloodbath though huh [14:54:05] akosiaris: now that I got it to work, I'm super happy with package_builder \o/ [14:54:37] <_joe_> valhallasw`cloud: package_builder is awesome :) [14:55:11] (03PS6) 10Giuseppe Lavagetto: base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) [14:55:21] <_joe_> let's see if the ruby cop is now happy...
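valhallasw`cloud's "maybe a mkdir -p ${BUILDRESULT} ?" suggestion above is the usual fix for a build script that assumes its output directory already exists. A minimal sketch of why mkdir -p is safe to run unconditionally; the path used here is a placeholder for illustration, not the actual package_builder setting:

```shell
# BUILDRESULT stands in for the directory built packages land in;
# this demo path is hypothetical.
BUILDRESULT=/tmp/package_builder_demo/result

# mkdir -p creates every missing parent directory and exits 0 if the
# directory already exists, so repeated builds never fail on this step.
mkdir -p "$BUILDRESULT"
mkdir -p "$BUILDRESULT"   # idempotent: the second call is a no-op

[ -d "$BUILDRESULT" ] && echo "result dir ready"
```

Without -p, the second invocation (or a missing parent) would make mkdir exit non-zero and abort the build.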
[14:56:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] openldap: Allow passing a higher size limit for LDAP queries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/256213 (owner: 10Muehlenhoff) [14:56:45] so i'm going to guess that you guys don't get quite enough of that $58.5 million for 2015 huh? [14:57:17] valhallasw`cloud: happy that you like it [14:58:57] <_joe_> dtm: I think the wages of people working on what the operations team does here at a similarly sized website might well exceed $60 million/year [14:59:40] _joe_, wow. [14:59:57] _joe_, so you're saying that you're on a mostly volunteer staff? [15:00:04] by far [15:00:55] i mean i knew that as of a few years ago. but i thought there were only a few volunteers in total a few years ago. i didn't realize there were that many volunteers now. [15:01:09] <_joe_> no, I'm saying that we're not in numbers that allow us to slow down our work, and doing deploys for operations only at scheduled times of the week definitely slows you down [15:01:48] <_joe_> dtm: maybe I wasn't clear - I said that google, facebook, alibaba, baidu, amazon spend way more than 60 mil/year on "operations" people [15:01:55] oh. [15:01:58] <_joe_> just on wages :P [15:02:08] yeah i think that's what they make *each* [15:02:11] at google [15:02:18] eh no [15:02:18] golden handcuffs. [15:03:23] (03CR) 10Alexandros Kosiaris: [C: 04-1] openldap: Document setup of cn=repluser and cn=admin (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/256206 (owner: 10Muehlenhoff) [15:07:32] (03CR) 10Alexandros Kosiaris: [C: 031] Allow configurable LDAP indices in openldap module [puppet] - 10https://gerrit.wikimedia.org/r/255973 (owner: 10Muehlenhoff) [15:08:40] (03CR) 10Ottomata: [C: 04-1] "Hold off on this for a bit.
Joal is looking into some differences between what we tag as client_ips, and what is now coming in on the IP " [puppet] - 10https://gerrit.wikimedia.org/r/253474 (https://phabricator.wikimedia.org/T118557) (owner: 10BBlack) [15:08:55] (03CR) 10Alexandros Kosiaris: [C: 031] Add LDAP index for sudoUser [puppet] - 10https://gerrit.wikimedia.org/r/255978 (owner: 10Muehlenhoff) [15:09:12] i'm not finding any stats pages about wikipedia performance. like the number of articles served per second. [15:09:17] (03PS3) 10Aude: Enable data access for wikinews, mediawiki.org and wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255063 (https://phabricator.wikimedia.org/T109780) [15:10:02] oh maybe it's WP:STAT [15:12:03] (03PS1) 10ArielGlenn: dumps: fix config so adds-changes dumps skip labswiki [puppet] - 10https://gerrit.wikimedia.org/r/256226 [15:12:20] don't worry, guys. i'm working on Wikipedia 1.0 because the future of wikipedia is offline traffic. ;) taking that burden right off ya. 
[15:13:55] (03CR) 10ArielGlenn: [C: 032] dumps: fix config so adds-changes dumps skip labswiki [puppet] - 10https://gerrit.wikimedia.org/r/256226 (owner: 10ArielGlenn) [15:14:32] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: Puppet has 78 failures [15:14:32] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [5000000.0] [15:14:42] dtm: _joe_ : https://grafana.wikimedia.org/dashboard/db/save-timing?panelId=8&fullscreen [15:14:45] Edit count per minute [15:15:58] (03PS3) 10Muehlenhoff: openldap: Allow passing a higher size limit for LDAP queries [puppet] - 10https://gerrit.wikimedia.org/r/256213 [15:16:02] Krinkle, thanks i was looking also at http://ganglia.wikimedia.org/latest/ [15:18:39] (03PS2) 10Merlijn van Deen: package_builder: add option to use built packages during build [puppet] - 10https://gerrit.wikimedia.org/r/256176 [15:19:06] (03CR) 10Merlijn van Deen: package_builder: add option to use built packages during build (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/256176 (owner: 10Merlijn van Deen) [15:19:27] dtm: also http://reportcard.wmflabs.org/graphs/edits [15:19:52] for higher aggregates [15:20:06] Krinkle, do we have a general stat of outbound traffic? 
[15:20:13] per second or whatever [15:20:23] https://grafana-admin.wikimedia.org/dashboard/db/varnish-traffic [15:20:33] (03PS3) 10Merlijn van Deen: package_builder: add option to use built packages during build [puppet] - 10https://gerrit.wikimedia.org/r/256176 [15:20:35] https://grafana.wikimedia.org/dashboard/db/varnish-traffic [15:20:39] (03PS2) 10Muehlenhoff: openldap: Document setup of cn=repluser and cn=admin [puppet] - 10https://gerrit.wikimedia.org/r/256206 [15:20:45] request count at the bottom of https://grafana.wikimedia.org/dashboard/db/varnish-http-errors [15:21:02] usually around 8M http requests per minute [15:21:11] to end-users [15:21:21] from the outer layers of varnish proxies [15:22:04] amounting to 4-35 Gbps of traffic per data centre [15:22:16] depends a fair bit - from https://grafana.wikimedia.org/dashboard/db/varnish-traffic [15:23:11] this includes of course page resources (e.g. images, JS, CSS etc.) [15:23:18] and API traffic and more [15:23:47] it doesn't include non-HTTP traffic, for that the ganglia graphs are probably more useful. [15:24:07] Though the global ethernet traffic can be confusing since it also includes internal traffic. [15:24:20] a single request may be counted at multiple layers that way [15:24:43] I prefer to use the varnish traffic at grafana for that reason [15:25:20] Krinkle, wow. [15:26:08] i believe the biggest web site i worked on was 28,800 banner ads per second, from dozens of data centers. and not very dynamic. but that was in 1999. [15:26:08] dtm: The identifiers relate to clusters per https://wikitech.wikimedia.org/wiki/Clusters [15:26:50] dtm: Yeah, it's big :D [15:26:51] back when a couple terabytes of hard drives cost $1m [15:28:09] Transmission is outgoing I believe. 
[15:28:22] So that's upload to users / http responses download from users [15:28:53] It shows nicely how Europe wakes up and goes to sleep each day (the red line) [15:29:51] (03PS1) 10Andrew Bogott: Remove labs_ldap_dns_ip_override for labcontrol1002. [puppet] - 10https://gerrit.wikimedia.org/r/256228 (https://phabricator.wikimedia.org/T119762) [15:30:12] Krinkle, what's non-http traffic? can i mount wikipedia via nfs? [15:30:13] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [15:30:42] dtm: non-http traffic is for example: MySQL queries between web servers and databases [15:30:50] When composing the response to a request [15:30:54] (which is internal of course) [15:31:14] There is also external non-http, such as event notification services. We have a websocket stream for real-time events as edits happen. [15:31:34] https://wikitech.wikimedia.org/wiki/RCStream - http://codepen.io/Krinkle/pen/laucI/?editors=101 [15:31:53] And there's various things like redis and memcached, which are also non-http protocols. [15:32:07] There are some external mounting services, in a way [15:32:25] We provide massive XML dumps of our databases periodically [15:32:29] Which are available over HTTP I think? [15:32:36] FTP * [15:33:11] Or maybe we do HTTP only and some of the volunteer mirrors use FTP, I'm not sure [15:33:33] You can e.g. get an XML dump of all en.wikipedia page content or all en.wikipedia page meta data etc.
[15:33:59] (03PS3) 10Muehlenhoff: Allow configurable LDAP indices in openldap module [puppet] - 10https://gerrit.wikimedia.org/r/255973 [15:35:18] (03CR) 10Muehlenhoff: [C: 032 V: 032] Allow configurable LDAP indices in openldap module [puppet] - 10https://gerrit.wikimedia.org/r/255973 (owner: 10Muehlenhoff) [15:37:20] http [15:37:33] and yes at least one mirror makes them available over ftp [15:38:55] dtm: IRC is another :) [15:39:46] We host one of the servers for freenode, and we also have our own irc server (read-only) for consuming edit notifications. It's obsoleted by the websocket feed, but still used by lots of tools. [15:40:16] http://wikipulse.herokuapp.com/ this has a lot less information in the graph but it is kinda cute [15:40:25] for edit frequencies :-) [15:40:56] this sort of discussion should happen on #wikimedia-tech FYI [15:42:02] then there is the labs environment [15:42:09] and the user tool platform too [15:42:23] godog: unlikely to get around to those wdqs patches today then? ;) [15:43:14] addshore: I'm afraid so [15:43:14] (03PS7) 10Giuseppe Lavagetto: base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) [15:43:15] <_joe_> addshore: use puppetswat for those! [15:43:35] oh wait, puppet swat! [15:43:39] indeed [15:43:44] <_joe_> addshore: it's in the deployments calendar [15:43:45] as long as they are truly truly minor [15:44:46] they are indeed minor, and only 1 is actually blocking me so I can just put that one in! 
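The XML dumps mentioned above follow a predictable naming scheme on dumps.wikimedia.org. A sketch of how a consumer or mirror might construct a download URL; the wiki name and date below are examples only, and the exact file list varies per dump run:

```shell
# Example values: any wiki database name and dump date can be substituted.
wiki=enwiki
date=20151201
base="https://dumps.wikimedia.org/$wiki/$date"

# Full page content for the wiki, bzip2-compressed.
articles="$base/$wiki-$date-pages-articles.xml.bz2"
echo "$articles"
```

From there a mirror would fetch the file over HTTP (or re-export it over FTP, as at least one volunteer mirror does).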
[15:44:58] (03CR) 10Filippo Giunchedi: [C: 031] "+1 only because as stated in the function the only reason it exists is to cater for self-hosted labs puppetmaster, and untangling that wou" [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [15:46:13] (03CR) 10Alexandros Kosiaris: [C: 031] "I like the idea and the approach as well" [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [15:48:35] (03CR) 10ArielGlenn: "The deal with these changes on master is that it's not the branch in use. I'm working towards being able to merge the ariel branch into m" [dumps] - 10https://gerrit.wikimedia.org/r/207699 (owner: 10Dereckson) [15:48:38] 6operations: Weird message from the Facebook team to list admins - https://phabricator.wikimedia.org/T119232#1841537 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi @Selsharbaty-WMF it wouldn't necessary show up in moderation queue and very unlikely it is security related, closing but please reopen if you see... [15:48:47] (03PS2) 10Muehlenhoff: Add LDAP index for sudoUser [puppet] - 10https://gerrit.wikimedia.org/r/255978 [15:50:59] apergos: _joe_ should I seek approval or just leave it on https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=212865&oldid=212854 for now? [15:51:19] (03CR) 10coren: Labs: switch PAM handling to use pam-auth-update (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) (owner: 10coren) [15:51:40] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add LDAP index for sudoUser [puppet] - 10https://gerrit.wikimedia.org/r/255978 (owner: 10Muehlenhoff) [15:52:40] get greg's approval I guess, since there's the freeze on, I mean you can certainly keep it in the list there [15:52:49] addshore: [15:53:08] (03CR) 10Andrew Bogott: "puppet compiler confirms this is a no-op." 
[puppet] - 10https://gerrit.wikimedia.org/r/256228 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott) [15:53:08] greg-g ^^ [15:55:51] (03PS5) 10coren: Labs: switch PAM handling to use pam-auth-update [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) [15:55:57] paravoid: ^^ with your fixes [15:56:01] 6operations: sort of SLA clarification grafana.wikimedia.org - https://phabricator.wikimedia.org/T119558#1841557 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi yup grafana is supported, thanks for the heads up. We're deprecating gdash-based dashboards and moving ours there too. closing but reopen if you have... [15:56:43] (03CR) 10jenkins-bot: [V: 04-1] Labs: switch PAM handling to use pam-auth-update [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) (owner: 10coren) [15:56:45] (03PS2) 10Andrew Bogott: Remove labs_ldap_dns_ip_override for labcontrol1002. [puppet] - 10https://gerrit.wikimedia.org/r/256228 (https://phabricator.wikimedia.org/T119762) [15:57:34] (03PS6) 10coren: Labs: switch PAM handling to use pam-auth-update [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) [15:57:47] * Coren mumbles 'dwim!' at puppet. [15:58:32] 6operations, 6WMDE-Analytics-Engineering: sort of SLA clarification grafana.wikimedia.org - https://phabricator.wikimedia.org/T119558#1841565 (10Addshore) [15:58:40] (03CR) 10Andrew Bogott: [C: 032] Remove labs_ldap_dns_ip_override for labcontrol1002. [puppet] - 10https://gerrit.wikimedia.org/r/256228 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott) [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151201T1600). Please do the needful. [16:00:04] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. 
Please be available during the process. [16:00:18] 6operations, 6Reading-Admin, 10Reading-Community-Engagement: Improve UX Strategic Test - https://phabricator.wikimedia.org/T117826#1841570 (10fgiunchedi) p:5Triage>3Normal triaging as part of #operations, feel free to change it of course! [16:00:18] (03PS1) 10Andrew Bogott: Puppet compiler lies! [puppet] - 10https://gerrit.wikimedia.org/r/256230 [16:00:41] (03PS2) 10Andrew Bogott: Revert "Remove labs_ldap_dns_ip_override for labcontrol1002." [puppet] - 10https://gerrit.wikimedia.org/r/256230 [16:00:44] 6operations, 7Database: Create a Master-master topology between datacenters for easier failover - https://phabricator.wikimedia.org/T119642#1841572 (10fgiunchedi) p:5Triage>3High [16:00:59] 6operations, 10hardware-requests: spare swift disks order - https://phabricator.wikimedia.org/T119698#1841574 (10fgiunchedi) p:5Triage>3Normal [16:01:28] apergos: addshore that looks fine to me [16:01:36] awesome :) [16:01:46] (03CR) 10Andrew Bogott: [C: 032] Revert "Remove labs_ldap_dns_ip_override for labcontrol1002." [puppet] - 10https://gerrit.wikimedia.org/r/256230 (owner: 10Andrew Bogott) [16:02:11] (03PS1) 10Jcrespo: [WIP] Unfinished db table check tools References: T104459 [software] - 10https://gerrit.wikimedia.org/r/256231 [16:02:39] ok, you wanna note 'approved by greg' on the slot? for whoever has puppet swat from ops [16:02:47] then gtg [16:02:54] apergos: I have just added the line :) [16:02:55] James_F|Away: I can SWAT if you're around [16:02:58] perfect [16:03:04] 6operations, 6Commons: image magick striping colour profile of PNG files [probably regression] - https://phabricator.wikimedia.org/T113123#1841577 (10fgiunchedi) p:5Triage>3Normal I'm getting 404 on the link above, do you have other samples @bawolff ? 
[16:03:28] greg-g: if you are also okay with this one then I would also love to add it (as there is nothing else in the slot) https://gerrit.wikimedia.org/r/#/c/256111/ [16:03:54] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1841579 (10fgiunchedi) p:5Triage>3Normal [16:03:56] addshore: yup, good to me [16:04:48] excellent! [16:09:01] (03CR) 10Alexandros Kosiaris: package_builder: add option to use built packages during build (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/256176 (owner: 10Merlijn van Deen) [16:09:30] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1841584 (10Eevans) >>! In T116247#1839888, @Ottomata wrote: > @gwicke and I discussed the schema/revision in meta issue in IRC today. He had an idea... [16:10:18] (03PS8) 10Giuseppe Lavagetto: base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) [16:11:33] (03PS1) 10ArielGlenn: salt minion will wait up to 10 seconds to reauth to master [puppet] - 10https://gerrit.wikimedia.org/r/256235 [16:11:54] thcipriani: Hey, sorry, arround now. [16:11:56] thcipriani: (And thank you.) 
[16:12:09] James_F: np :) [16:12:43] PROBLEM - HHVM rendering on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:49] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250474 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:12:51] (03CR) 10ArielGlenn: [C: 032] salt minion will wait up to 10 seconds to reauth to master [puppet] - 10https://gerrit.wikimedia.org/r/256235 (owner: 10ArielGlenn) [16:13:29] (03Merged) 10jenkins-bot: Enable VisualEditor for all accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250474 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:13:44] PROBLEM - Apache HTTP on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:52] <_joe_> on it ^^ [16:13:53] PROBLEM - DPKG on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:14:02] PROBLEM - Disk space on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:14:13] PROBLEM - SSH on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:42] PROBLEM - RAID on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:14:43] PROBLEM - salt-minion processes on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:15:03] PROBLEM - nutcracker process on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:15:12] PROBLEM - configured eth on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:15:33] PROBLEM - Check size of conntrack table on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:15:33] PROBLEM - nutcracker port on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:15:40] 6operations, 5Continuous-Integration-Scaling, 7Icinga, 7Monitoring, 7Nodepool: flapping "permission denied" disk space alarm for temporary image on labnodepool1001 - https://phabricator.wikimedia.org/T104975#1841595 (10coren) 5Open>3Resolved a:3coren This is now fixed for the labstores as well (T11... [16:16:13] PROBLEM - dhclient process on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:16:22] PROBLEM - HHVM processes on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:16:38] <_joe_> uhm it seems like it died of memory exhaustion [16:16:59] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1841600 (10ArielGlenn) New repo is already available, yay! In the meantime, looked into the initial delay of minions that don't respond to commands after having been 'idle' for awhile; indeed... [16:17:46] <_joe_> yup :/ [16:17:48] <_joe_> https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=mw1127.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1448986636&g=mem_report&z=large&c=API%20application%20servers%20eqiad [16:18:16] 6operations, 10Datasets-General-or-Unknown, 6WMDE-Analytics-Engineering, 10Wikidata, 5Patch-For-Review: Push dumps.wm.o logs files to stat1002 - https://phabricator.wikimedia.org/T118739#1841606 (10ArielGlenn) a:3ArielGlenn [16:18:52] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [16:19:12] 6operations, 10Datasets-General-or-Unknown: Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#1841610 (10ArielGlenn) Reminder that the monthly fulls are running again; if people could keep an eye out and let me know p... 
[16:19:15] 6operations: bond eth interfaces on ms1001 - https://phabricator.wikimedia.org/T89829#1841612 (10coren) [16:19:40] <_joe_> andrewbogott: failures on labs-ns0 [16:19:53] _joe_: thanks, looking [16:19:54] 6operations: bond eth interfaces on ms1001 - https://phabricator.wikimedia.org/T89829#1841613 (10ArielGlenn) It is needed if we want it to share the downloads with dataset1001, which I do. [16:21:16] (03CR) 10Filippo Giunchedi: [C: 04-1] "thanks for the review Isart!" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256007 (owner: 10Isart) [16:21:27] !log thcipriani@tin Synchronized dblists/visualeditor-default.dblist: SWAT: Enable VisualEditor for all accounts on eswiki. Part I [[gerrit:250474]] (duration: 01m 29s) [16:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:21:37] * James_F checks. [16:21:52] RECOVERY - DPKG on mw1127 is OK: All packages OK [16:21:53] RECOVERY - Disk space on mw1127 is OK: DISK OK [16:22:12] RECOVERY - dhclient process on mw1127 is OK: PROCS OK: 0 processes with command name dhclient [16:22:13] RECOVERY - SSH on mw1127 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:22:13] RECOVERY - HHVM processes on mw1127 is OK: PROCS OK: 6 processes with command name hhvm [16:22:22] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1841636 (10fgiunchedi) p:5Triage>3Normal [16:22:34] RECOVERY - RAID on mw1127 is OK: OK: no RAID installed [16:22:43] RECOVERY - salt-minion processes on mw1127 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:22:48] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor for all accounts on eswiki. 
Part II [[gerrit:250474]] (duration: 00m 28s) [16:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:22:54] <_joe_> that's the oom killer finally killing hhvm [16:23:02] RECOVERY - nutcracker process on mw1127 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:23:03] RECOVERY - configured eth on mw1127 is OK: OK - interfaces up [16:23:11] (03CR) 10Jcrespo: "here (Filippo win me on the comment)" (032 comments) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256007 (owner: 10Isart) [16:23:23] RECOVERY - Check size of conntrack table on mw1127 is OK: OK: nf_conntrack is 0 % full [16:23:23] RECOVERY - nutcracker port on mw1127 is OK: TCP OK - 0.000 second response time on port 11212 [16:23:58] thcipriani: Everything looks OK. Thank you. [16:24:07] James_F: thanks for checking! [16:24:16] 6operations, 6Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1841638 (10fgiunchedi) p:5Triage>3Normal [16:24:24] (03CR) 10Giuseppe Lavagetto: [C: 032] base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [16:24:31] (03PS9) 10Giuseppe Lavagetto: base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/252681 (https://phabricator.wikimedia.org/T114638) [16:24:46] <_joe_> uhm let's see if it works in production [16:25:07] <_joe_> ofc it didn't with the compiler, because settings::ssldir is something weird [16:27:28] (03PS1) 10Giuseppe Lavagetto: Revert "base::certificates: add puppet's CA to the trusted store" [puppet] - 10https://gerrit.wikimedia.org/r/256239 [16:27:41] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Revert "base::certificates: add puppet's CA to the trusted store" [puppet] - 10https://gerrit.wikimedia.org/r/256239 (owner: 10Giuseppe Lavagetto) [16:28:21] 
<_joe_> sigh [16:28:34] <_joe_> /var/lib/puppet/ssl vs /var/lib/puppet/ssl/ [16:28:39] * _joe_ headdesks [16:28:54] * _joe_ hangs himself [16:29:46] 6operations, 10ops-codfw, 6Labs: rack and connect labstore-array4-codfw in codfw - https://phabricator.wikimedia.org/T93215#1841654 (10coren) 5Open>3Invalid Given that plans for a NFS-serving labstore in codfw are put on indefinite hiatus and that the current role of that server is to serve as destinatio... [16:31:12] 6operations: Redirect revisions from svn.wikimedia.org to https://phabricator.wikimedia.org/rSVN - https://phabricator.wikimedia.org/T119846#1841661 (10fgiunchedi) p:5Triage>3Low [16:31:13] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: puppet fail [16:31:22] PROBLEM - puppet last run on elastic1010 is CRITICAL: CRITICAL: puppet fail [16:31:32] PROBLEM - puppet last run on mw1085 is CRITICAL: CRITICAL: puppet fail [16:31:32] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: puppet fail [16:31:33] PROBLEM - puppet last run on ms-be2021 is CRITICAL: CRITICAL: puppet fail [16:31:33] PROBLEM - puppet last run on lvs1009 is CRITICAL: CRITICAL: puppet fail [16:31:42] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: puppet fail [16:31:52] PROBLEM - puppet last run on mw2132 is CRITICAL: CRITICAL: puppet fail [16:31:53] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail [16:31:54] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: puppet fail [16:31:54] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: puppet fail [16:31:54] PROBLEM - puppet last run on mw1055 is CRITICAL: CRITICAL: puppet fail [16:32:02] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: puppet fail [16:32:03] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: puppet fail [16:32:03] PROBLEM - puppet last run on wtp2019 is CRITICAL: CRITICAL: puppet fail [16:32:03] PROBLEM - puppet last run on mw1151 is CRITICAL: CRITICAL: puppet fail [16:32:04] PROBLEM - puppet 
last run on mw2088 is CRITICAL: CRITICAL: puppet fail [16:32:12] _joe_, that is your mistake, right? [16:32:13] PROBLEM - puppet last run on analytics1045 is CRITICAL: CRITICAL: puppet fail [16:32:13] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: puppet fail [16:32:14] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: puppet fail [16:32:15] <_joe_> yes [16:32:21] <_joe_> but I reverted immediately [16:32:21] 6operations, 7Database: External Storage on codfw (es2005-2010) is consuming 100-90GB of disk space per server and per month and it has 370GB available - https://phabricator.wikimedia.org/T119056#1841664 (10fgiunchedi) p:5Triage>3Normal [16:32:22] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: puppet fail [16:32:22] ok, so I do not worry [16:32:30] yeah, I know, takes a while [16:32:33] PROBLEM - puppet last run on mw1076 is CRITICAL: CRITICAL: puppet fail [16:32:33] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CRITICAL: puppet fail [16:32:43] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: puppet fail [16:32:43] PROBLEM - puppet last run on mw1098 is CRITICAL: CRITICAL: puppet fail [16:32:43] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: puppet fail [16:32:43] PROBLEM - puppet last run on mw2177 is CRITICAL: CRITICAL: puppet fail [16:32:43] PROBLEM - puppet last run on mw2061 is CRITICAL: CRITICAL: puppet fail [16:32:44] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: puppet fail [16:32:44] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: puppet fail [16:32:44] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: puppet fail [16:32:45] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail [16:32:49] <_joe_> that many? 
[16:32:51] <_joe_> wtf [16:32:52] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [16:32:52] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: puppet fail [16:32:52] PROBLEM - puppet last run on db2062 is CRITICAL: CRITICAL: puppet fail [16:32:52] PROBLEM - puppet last run on mw1149 is CRITICAL: CRITICAL: puppet fail [16:32:52] PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: puppet fail [16:32:53] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: puppet fail [16:33:02] PROBLEM - puppet last run on elastic1009 is CRITICAL: CRITICAL: puppet fail [16:33:02] PROBLEM - puppet last run on mw2037 is CRITICAL: CRITICAL: puppet fail [16:33:12] PROBLEM - puppet last run on mw2035 is CRITICAL: CRITICAL: puppet fail [16:33:23] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: puppet fail [16:33:32] PROBLEM - puppet last run on mc1014 is CRITICAL: CRITICAL: puppet fail [16:33:33] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: puppet fail [16:33:33] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: puppet fail [16:33:46] I think the slave makes the change take a lot of time [16:33:52] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: puppet fail [16:33:52] PROBLEM - puppet last run on mw1079 is CRITICAL: CRITICAL: puppet fail [16:33:53] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: puppet fail [16:33:53] PROBLEM - puppet last run on suhail is CRITICAL: CRITICAL: puppet fail [16:34:04] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: puppet fail [16:34:21] and in general it being not transactional at all [16:34:33] _joe_: Yeah, when a puppet storm starts, it tends to explode a lot. It'll recover just as quickly once the checks start pulling in.
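The /var/lib/puppet/ssl vs /var/lib/puppet/ssl/ mixup _joe_ diagnosed above is a generic path-comparison pitfall: as raw strings the two spellings differ even though they name the same directory. A minimal shell sketch of the problem and the usual fix, stripping the trailing slash with POSIX parameter expansion before comparing:

```shell
a=/var/lib/puppet/ssl
b=/var/lib/puppet/ssl/

# Compared as raw strings, the two spellings of the same directory differ.
if [ "$a" = "$b" ]; then naive=same; else naive=different; fi

# ${var%/} strips one trailing slash, normalising both sides first.
if [ "${a%/}" = "${b%/}" ]; then normalized=same; else normalized=different; fi

echo "naive=$naive normalized=$normalized"
```

The same applies to config values like puppet's ssldir: anything that compares or concatenates such paths needs a consistent convention about trailing slashes.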
[16:34:46] <_joe_> well, now I have to figure out what the hell failed, because WTF [16:34:52] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:34:57] _joe_: Because period-of-puppet-run and period-of-icinga-check mismatch. :-( [16:35:33] _joe_: What I do is do a manual puppet run on some of the affected boxen to make sure the next run will be clean, then let it settle down. [16:37:26] <_joe_> something tells me this has to do with ruby 1.8 vs ruby 1.9 [16:37:32] <_joe_> vs ruby whatever [16:37:48] is this going to break puppet client auth from fixing itself? [16:38:41] (trying one now) [16:39:31] apparently it fixes itself fine [16:40:36] that is the first sign of skynet: puppet gaining auto-fixing properties at 16:39 on judgement day [16:40:44] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:42:15] at 18:45 it will have read all Wikipedias and understood that humans are the problem for true Neutral POV [16:43:01] <_joe_> bblack: no this change is mostly harmless [16:43:16] <_joe_> I just can't understand how it is possible that it doesn't work [16:43:48] <_joe_> I mean for some reason I can't understand, when I test it locally it works fine [16:45:06] (03PS1) 10ArielGlenn: dumps: fix exception when dryrun is enabled without logging [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256244 [16:45:08] (03PS1) 10ArielGlenn: dumps abstracts: varname fixup after refactor [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256245 [16:45:25] (03PS1) 10Filippo Giunchedi: scap: restore mw2208 after reimage [puppet] - 10https://gerrit.wikimedia.org/r/256246 (https://phabricator.wikimedia.org/T118857) [16:45:38] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] scap: restore mw2208 after reimage [puppet] - 10https://gerrit.wikimedia.org/r/256246 (https://phabricator.wikimedia.org/T118857) (owner: 10Filippo Giunchedi) [16:46:52] PROBLEM -
RAID on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:46:58] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: fix exception when dryrun is enabled without logging [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256244 (owner: 10ArielGlenn) [16:47:33] PROBLEM - nutcracker port on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:47:33] PROBLEM - Check size of conntrack table on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:47:46] 6operations, 5Patch-For-Review: Investigate why mw2208 is powered off - https://phabricator.wikimedia.org/T118857#1841783 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi see related T93857, idrac was changed [16:47:53] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps abstracts: varname fixup after refactor [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256245 (owner: 10ArielGlenn) [16:48:03] PROBLEM - DPKG on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:48:04] PROBLEM - Disk space on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:48:13] PROBLEM - dhclient process on mw1127 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:48:13] PROBLEM - SSH on mw1127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:12] (03CR) 10Thiemo Mättig (WMDE): [C: 04-1] Avoid breaking full phabricator URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/242237 (owner: 10Daniel Kinzler) [16:49:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] package_builder: add option to use built packages during build (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/256176 (owner: 10Merlijn van Deen) [16:50:52] PROBLEM - salt-minion processes on mw1127 is CRITICAL: Timeout while attempting connection [16:51:30] (03CR) 10Alexandros Kosiaris: [C: 031] openldap: Allow passing a higher size limit for LDAP queries [puppet] - 10https://gerrit.wikimedia.org/r/256213 (owner: 10Muehlenhoff) [16:53:24] RECOVERY - Check size of conntrack table on mw1127 is OK: OK: nf_conntrack is 0 % full [16:53:24] RECOVERY - nutcracker port on mw1127 is OK: TCP OK - 0.000 second response time on port 11212 [16:53:43] RECOVERY - Apache HTTP on mw1127 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.568 second response time [16:53:53] RECOVERY - DPKG on mw1127 is OK: All packages OK [16:53:53] RECOVERY - Disk space on mw1127 is OK: DISK OK [16:54:03] RECOVERY - SSH on mw1127 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:54:03] RECOVERY - dhclient process on mw1127 is OK: PROCS OK: 0 processes with command name dhclient [16:54:42] RECOVERY - RAID on mw1127 is OK: OK: no RAID installed [16:54:43] RECOVERY - HHVM rendering on mw1127 is OK: HTTP OK: HTTP/1.1 200 OK - 65907 bytes in 0.552 second response time [16:54:43] RECOVERY - salt-minion processes on mw1127 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:54:52] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.028 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.155.135 [16:55:22] !log restarting cassandra instances on restbase2001 (to effect openjdk security updates and a few depending libs) [16:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:59] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1841824 (10fgiunchedi) decommissioning still going ``` Sending 686 files, 616799776807 bytes total. Already sent 385 files, 263287846227... [16:57:33] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:57:50] (03PS7) 10Thcipriani: RESTBase configuration for scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/252887 [16:57:54] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:58:03] RECOVERY - puppet last run on wtp2019 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:58:13] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:58:13] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:58:24] RECOVERY - puppet last run on mw1076 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:58:33] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:58:44] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:58:44] RECOVERY - puppet last run on mw1149 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:58:52] RECOVERY - puppet last run on db2062 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:58:52] 
RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [16:58:53] RECOVERY - puppet last run on elastic1009 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:58:58] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1841834 (10jcrespo) Immediate problems have been fixed and purging has been restarted, however, the long term problem persist until we can schedule some maintenance for defragmenting and co... [16:59:03] RECOVERY - puppet last run on mw1157 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:59:13] RECOVERY - puppet last run on elastic1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:59:22] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:59:23] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1841836 (10jcrespo) 5Open>3Resolved [16:59:23] RECOVERY - puppet last run on mw1085 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:59:23] RECOVERY - puppet last run on mc1014 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:59:24] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:59:24] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:59:33] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:59:33] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:59:33] RECOVERY - puppet last run on lvs1009 is OK: OK: Puppet is currently enabled, last run 
1 minute ago with 0 failures [16:59:43] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:59:53] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:59:53] RECOVERY - puppet last run on mw1249 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:59:53] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:59:53] RECOVERY - puppet last run on mw1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:59:54] RECOVERY - puppet last run on suhail is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:00:02] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:02] RECOVERY - puppet last run on mw1151 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:00:03] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:04] Deploy window Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151201T1700) [17:00:04] Addshore: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[17:00:12] RECOVERY - puppet last run on analytics1045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:22] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:00:23] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:34] RECOVERY - puppet last run on ms-be1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:00:38] !log restarting cassandra instances on restbase2002 (to effect openjdk security updates and a few depending libs) [17:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:43] RECOVERY - puppet last run on mw1098 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:00:43] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:00:43] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:43] RECOVERY - puppet last run on mw2177 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:44] RECOVERY - puppet last run on mw2061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:00:52] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:00:53] RECOVERY - puppet last run on analytics1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:01:03] RECOVERY - puppet last run on mw2037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:01:19] (03PS2) 10Dzahn: Simplify phabricator statistics emails [puppet] - 10https://gerrit.wikimedia.org/r/256179 (owner: 10Hashar) [17:01:22] RECOVERY - puppet last run on mw2035 is OK: OK: Puppet is currently enabled, last run 1 minute 
ago with 0 failures [17:01:47] (03CR) 10Dzahn: [C: 032] Simplify phabricator statistics emails [puppet] - 10https://gerrit.wikimedia.org/r/256179 (owner: 10Hashar) [17:02:02] RECOVERY - puppet last run on mw1079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:02:02] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:02:02] RECOVERY - puppet last run on mw2132 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:02:14] RECOVERY - puppet last run on mw2088 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:02:54] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:05:40] !log restarting cassandra instances on restbase200[3-6] (to effect openjdk security updates and a few depending libs) [17:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:08:02] 6operations: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1841878 (10fgiunchedi) what was the specific error from paramiko? ecdsa-sha2-nistp256 support has been introduced in 1.10 with https://github.com/paramiko/paramiko/commit/ebdbf... [17:08:09] (03PS8) 10Thcipriani: RESTBase configuration for scap3 deployment [puppet] - 10https://gerrit.wikimedia.org/r/252887 [17:08:53] ACKNOWLEDGEMENT - configured eth on lvs1007 is CRITICAL: eth3 reporting no carrier. daniel_zahn https://phabricator.wikimedia.org/T104458 [17:08:53] ACKNOWLEDGEMENT - configured eth on lvs1008 is CRITICAL: eth3 reporting no carrier. daniel_zahn https://phabricator.wikimedia.org/T104458 [17:08:53] ACKNOWLEDGEMENT - configured eth on lvs1009 is CRITICAL: eth3 reporting no carrier. 
daniel_zahn https://phabricator.wikimedia.org/T104458
[17:09:09] 6operations, 6Labs: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1841879 (10fgiunchedi)
[17:12:07] 6operations, 10ops-ulsfo: ulsfo temperature-related exceptions - https://phabricator.wikimedia.org/T119631#1841884 (10fgiunchedi) do we have monitorable PDUs in ulsfo? that might help too, see also {T119631}
[17:12:56] do I just wave my arms around to find someone for puppetswat? The wikipage says yuvi will be doing them but he seems afk :)
[17:14:22] addshore: it's actually akosiaris and andrewbogott this week
[17:14:28] they just didn't add themselves to the deployments page
[17:14:32] awesome! :) ahh okay!
[17:14:58] i just pinged them so we shall see
[17:15:27] oh man, I’m the worst, I’m always nagging people to add themselves to the calendar
[17:15:45] :D
[17:15:51] at least you're here ;)
[17:16:23] andrewbogott: Week of Nov 30th: No normal deploys - high priority SWATs only (pre-approval from Greg and/or Katie needed) (Why: First of Dec Fundraising)
[17:16:32] so keep in mind for the puppet swat, i did not review the patches
[17:16:44] but you may have to push them to next week if they break that freeze
[17:16:50] addshore: give me a minute...
[17:16:51] next week is a normal deployment week.
[17:17:14] so you may have been pinged simply to triage them to the following week ;]
[17:17:30] well, they have been approved by greg ;)
[17:18:17] 6operations: ulsfo: add a DNS recursor - https://phabricator.wikimedia.org/T82996#1841898 (10fgiunchedi)
[17:18:25] addshore: you putting 'approved by greg' im not sure counts man
[17:18:34] anyone can put that, no offense
[17:18:58] i don't see him voting on the patchsets though
[17:18:59] robh: indeed, he approved in this channel around an hour ago (if you check the logs) although they are not the nicest to link to!
[17:19:04] * robh is still looking at tasks, oh
[17:19:07] oh, ok
[17:19:28] greg-g: Yo, just say approved again so andrewbogott can see it! (im just trying to make sure we respect your code freeze ;)
[17:19:36] * robh goes looking in backlog
[17:19:46] 6operations: Implement Outage Communications Protocol - https://phabricator.wikimedia.org/T79134#1841901 (10fgiunchedi)
[17:19:48] [16:01:28] and [16:03:56] in http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20151201.txt ;)
[17:20:08] oh, found it
[17:20:12] 6operations: Implement Outage Communications Protocol - https://phabricator.wikimedia.org/T79134#1841905 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi we've been actively working on this, see also https://wikitech.wikimedia.org/wiki/Incident_response for the current process
[17:20:14] andrewbogott: indeed, greg approved those to go
[17:20:24] I approve two of addshore's patches
[17:20:25] :)
[17:20:26] yep, reading...
[17:20:30] It's a shame you can't link to a line number in those logs ;)
[17:20:31] heh, or that
[17:20:48] (03PS6) 10Andrew Bogott: Set WDQS 5min expiry for internal access Port:8888 [puppet] - 10https://gerrit.wikimedia.org/r/256039 (https://phabricator.wikimedia.org/T119941) (owner: 10Addshore)
[17:21:12] things greg won't ever be mad about: paranoia during code freezes
[17:21:34] i could likely drop 'during code freezes' and the statement stands.
[17:21:45] you could stop at paranoia
[17:21:48] heh
[17:22:03] (03CR) 10Andrew Bogott: [C: 032] Set WDQS 5min expiry for internal access Port:8888 [puppet] - 10https://gerrit.wikimedia.org/r/256039 (https://phabricator.wikimedia.org/T119941) (owner: 10Addshore)
[17:22:28] (03PS4) 10Andrew Bogott: Add served-by header [puppet] - 10https://gerrit.wikimedia.org/r/256111 (https://phabricator.wikimedia.org/T119508) (owner: 10Smalyshev)
[17:22:38] addshore: what hosts do those apply to?
[17:22:39] 6operations, 10Traffic: LVS testing needs to include internal services testing - https://phabricator.wikimedia.org/T83467#1841910 (10fgiunchedi)
[17:22:49] andrewbogott: wdqs100[12].eqiad.wmnet
[17:24:57] addshore: I don’t suppose you have a login there?
[17:25:01] nope
[17:25:34] ok. that patch breaks nginx. Want me to revert, or debug?
[17:26:03] short term, debug would be great
[17:26:40] nginx: [emerg] invalid port in "88888" of the "listen" directive in /etc/nginx/sites-enabled/wdqs:11
[17:26:43] maybe one too many 8’s?
[17:26:53] oh balls,
[17:26:55] >65536 ?
[17:27:03] yup, the top post has 1 too many 8s...
[17:27:08] *port
[17:27:19] i really should’ve caught that myself
[17:27:36] addshore: want to fix?
[17:27:44] yup, making patch now, 2 secs
[17:27:53] PROBLEM - WDQS HTTP on wdqs1001 is CRITICAL: Connection refused
[17:28:22] PROBLEM - WDQS HTTP Port on wdqs1001 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused
[17:28:33] PROBLEM - WDQS SPARQL on wdqs1001 is CRITICAL: Connection refused
[17:29:43] PROBLEM - puppet last run on wdqs1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:30:14] (03PS1) 10Addshore: WDQS fix port 88888 to 8888 [puppet] - 10https://gerrit.wikimedia.org/r/256254
[17:30:15] andrewbogott: ^^
[17:30:22] after that, the error is:
[17:30:23] nginx: [emerg] "proxy_set_header" directive is not allowed here in /etc/nginx/sites-enabled/wdqs:47
[17:30:49] I think we need to revert this and y’all need a labs test box
[17:30:52] okay, revert, this is just silly
[17:31:23] (03PS1) 10Addshore: Revert "Set WDQS 5min expiry for internal access Port:8888" [puppet] - 10https://gerrit.wikimedia.org/r/256255
[17:31:25] andrewbogott: ^^
[17:31:35] yeah revert it, something is not right there
[17:31:38] (03Abandoned) 10Addshore: WDQS fix port 88888 to 8888 [puppet] - 10https://gerrit.wikimedia.org/r/256254 (owner: 10Addshore)
[17:31:47] like the use of if ?
[17:31:52] https://www.nginx.com/resources/wiki/start/topics/depth/ifisevil/
[17:31:56] (03PS2) 10Andrew Bogott: Revert "Set WDQS 5min expiry for internal access Port:8888" [puppet] - 10https://gerrit.wikimedia.org/r/256255 (owner: 10Addshore)
[17:32:15] akosiaris: indeed if is evil but it should work :/
[17:32:30] but it doesn't
[17:32:37] akosiaris: indeed!
[17:32:50] SMalyshev, addshore, do you have a test system already? If not, would you like to set something up right now?
[17:33:02] (03CR) 10Andrew Bogott: [C: 032] Revert "Set WDQS 5min expiry for internal access Port:8888" [puppet] - 10https://gerrit.wikimedia.org/r/256255 (owner: 10Addshore)
[17:33:04] I have a test system, yes
[17:33:38] ok
[17:34:33] RECOVERY - WDQS HTTP Port on wdqs1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 80
[17:34:52] RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5615 bytes in 0.001 second response time
[17:35:53] RECOVERY - puppet last run on wdqs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:36:13] RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5615 bytes in 0.002 second response time
[17:36:14] Unlike proxy_pass, you cannot put proxy_set_header inside an if block
[17:36:48] (03PS1) 10Addshore: WIP Revert "Revert "Set WDQS 5min expiry for internal access Port:8888"" [puppet] - 10https://gerrit.wikimedia.org/r/256256
[17:36:56] (03CR) 10Addshore: [C: 04-1] WIP Revert "Revert "Set WDQS 5min expiry for internal access Port:8888"" [puppet] - 10https://gerrit.wikimedia.org/r/256256 (owner: 10Addshore)
[17:37:59] 6operations, 6Labs: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1841969 (10coren) >>! In T106871#1841878, @fgiunchedi wrote: > what was the specific error from paramiko? ecdsa-sha2-nistp256 support has been introduced in 1.10 with...
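The two nginx errors hit above can be reproduced with a minimal config. The following sketch is a hypothetical reconstruction (header names, paths, and the "backend" upstream are illustrative, not the actual wdqs template): a TCP port is 16-bit so "listen 88888" is rejected, and proxy_set_header is only valid in http, server, and location context, never inside an "if" block. One way to get conditional behavior without "if" is nginx's map directive:

```nginx
# Broken shape (illustrative):
server {
    listen 88888;            # [emerg] invalid port: ports must be <= 65535
    location /sparql {
        if ($http_x_internal) {
            proxy_set_header X-Cache-TTL "300";   # [emerg] proxy_set_header is
        }                                         # not allowed inside "if"
        proxy_pass http://backend;
    }
}

# Working shape: valid port, and the conditional value computed with "map"
# (which lives at http level) so proxy_set_header stays in location context.
map $http_x_internal $cache_ttl {
    default "";
    "1"     "300";
}
server {
    listen 8888;
    location /sparql {
        proxy_set_header X-Cache-TTL $cache_ttl;
        proxy_pass http://backend;
    }
}
```

Testing a change like this with `nginx -t` against the candidate config (or on a labs box, as suggested below in the log) catches both [emerg] errors before puppet ships them to production.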
[17:39:13] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [17:39:33] RECOVERY - mediawiki-installation DSH group on mw2208 is OK: OK [17:40:03] addshore: want me to merge https://gerrit.wikimedia.org/r/#/c/256111/ still? [17:40:17] (03PS5) 10Andrew Bogott: Add served-by header [puppet] - 10https://gerrit.wikimedia.org/r/256111 (https://phabricator.wikimedia.org/T119508) (owner: 10Smalyshev) [17:40:29] andrewbogott: sure! [17:40:44] (03PS3) 10Luke081515: Allow sysops to add and remove accounts from bot group on mai.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239854 (https://phabricator.wikimedia.org/T111898) (owner: 10Mdann52) [17:41:02] (03CR) 10jenkins-bot: [V: 04-1] Allow sysops to add and remove accounts from bot group on mai.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239854 (https://phabricator.wikimedia.org/T111898) (owner: 10Mdann52) [17:41:13] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:41:57] (03CR) 10Luke081515: "Sorry, I can't rebase it via gerrit, someone other need to." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/239854 (https://phabricator.wikimedia.org/T111898) (owner: 10Mdann52) [17:41:59] (03CR) 10Andrew Bogott: [C: 032] Add served-by header [puppet] - 10https://gerrit.wikimedia.org/r/256111 (https://phabricator.wikimedia.org/T119508) (owner: 10Smalyshev) [17:42:47] (03PS2) 10Addshore: Set WDQS 5min expiry for internal access Port:8888 (try2) [puppet] - 10https://gerrit.wikimedia.org/r/256256 [17:43:11] SMalyshev: andrewbogott ^^ it looks like that is actually the correct solution for the first patch [17:43:45] (03PS3) 10Addshore: Set WDQS 5min expiry for internal access Port:8888 (try2) [puppet] - 10https://gerrit.wikimedia.org/r/256256 [17:44:18] 6operations, 10Wikimedia-Planet: update *.planet.wikimedia.org certificate on cluster (implement) - https://phabricator.wikimedia.org/T119987#1842006 (10RobH) 3NEW a:3Dzahn [17:45:05] 6operations, 10Wikimedia-Planet: update *.planet.wikimedia.org certificate on cluster (implement) - https://phabricator.wikimedia.org/T119987#1842006 (10RobH) Daniel, if you didn't want to implement this, I can instead (either works!) Just let me know. [17:46:12] (03CR) 10Dereckson: "Actually, at 46b087b491a4564451e39e661955f88893c02277 (the current HEAD of ariel branch), there is no .gitignore file." 
[dumps] - 10https://gerrit.wikimedia.org/r/207699 (owner: 10Dereckson) [17:46:18] (03PS1) 10BBlack: varnish::reqstats: make ensure=>absent actually work [puppet] - 10https://gerrit.wikimedia.org/r/256258 [17:46:20] (03PS1) 10BBlack: varnish: ability to remove a named instance [puppet] - 10https://gerrit.wikimedia.org/r/256259 [17:46:22] (03PS1) 10BBlack: cache_misc: switch to new port/instance mapping [puppet] - 10https://gerrit.wikimedia.org/r/256260 (https://phabricator.wikimedia.org/T119396) [17:46:33] (03Abandoned) 10Dereckson: Created .gitignore [dumps] - 10https://gerrit.wikimedia.org/r/207699 (owner: 10Dereckson) [17:47:14] (03CR) 10ArielGlenn: "hmm I'll check, I have it in my repo but if not..." [dumps] - 10https://gerrit.wikimedia.org/r/207699 (owner: 10Dereckson) [17:47:46] addshore: If you’re desperate and get a +1 from SMalyshev I can try again, but better yet you could try it on labs and submit for tomorrow’s swat. [17:47:53] (03CR) 10jenkins-bot: [V: 04-1] varnish: ability to remove a named instance [puppet] - 10https://gerrit.wikimedia.org/r/256259 (owner: 10BBlack) [17:48:12] It would be great to get it in but I'll let SMalyshev look at it (if he has time now) [17:48:13] (03CR) 10jenkins-bot: [V: 04-1] cache_misc: switch to new port/instance mapping [puppet] - 10https://gerrit.wikimedia.org/r/256260 (https://phabricator.wikimedia.org/T119396) (owner: 10BBlack) [17:48:32] the exact case we hit is actually covered here http://serverfault.com/questions/506972/nginx-why-i-cant-put-proxy-set-header-inside-an-if-clause [17:49:54] (03PS1) 10Thcipriani: Update hieradata for trebuchet module [puppet] - 10https://gerrit.wikimedia.org/r/256261 (https://phabricator.wikimedia.org/T119988) [17:51:21] 6operations, 6Labs: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1842041 (10coren) FTR, the version provided by Trusty is `1.10.1-1git1build1` but I can confirm that the above patch did 
not make it into it (checking `paramiko/hostke... [17:54:10] andrewbogott: did you run into any issues with the served-by header? [17:54:36] puppet is happy. I’ll leave it to you to verify that it actually does something. [17:54:50] (03PS2) 10BBlack: cache_misc: switch to new port/instance mapping [puppet] - 10https://gerrit.wikimedia.org/r/256260 (https://phabricator.wikimedia.org/T119396) [17:54:52] (03PS2) 10BBlack: varnish: ability to remove a named instance [puppet] - 10https://gerrit.wikimedia.org/r/256259 [17:55:43] well that certainly is interesting if puppet is happy, as it would appear the header remains unset ;) I wonder if varnish strips it for some reason.. [17:55:56] (03CR) 10jenkins-bot: [V: 04-1] cache_misc: switch to new port/instance mapping [puppet] - 10https://gerrit.wikimedia.org/r/256260 (https://phabricator.wikimedia.org/T119396) (owner: 10BBlack) [17:56:50] addshore: it might not be merged on 1002 yet, let me do that [17:57:23] (03PS3) 10BBlack: cache_misc: switch to new port/instance mapping [puppet] - 10https://gerrit.wikimedia.org/r/256260 (https://phabricator.wikimedia.org/T119396) [17:57:29] <_joe_> andrewbogott: check 1001, it's the one on which I had disabled the cron [17:58:03] addshore: yeah I see it in the config but not in the response... weird [17:58:23] _joe_: sorry, I was talking about wdqs1002, not labcontrol1002 [17:58:29] but I can still check labcontrol1001 if you want me to :) [17:59:07] Coren: hah! what's the script in question in https://phabricator.wikimedia.org/T106871#1842041 ? for some reason I had assumed it ran on labstore on jessie [17:59:11] hmm... looks like nginx is not sending it [17:59:17] SMalyshev: at a guess that must be something varnishy, especially if it appears correctly in the nginx conf after a puppet run [17:59:29] oh.. [17:59:51] godog: It does. 
I'm trying to figure out why 1.15 *also* does not work at this moment (but lots of internal uses of paramiko are under Trusty, which is our least common denominator)
[17:59:53] even though the config says it should... no idea why.
[18:00:52] Coren: I see, thanks for taking a look!
[18:01:11] addshore: the config looks normal but no header in nginx response... I'll try to test on labs and see what's wrong
[18:01:25] SMalyshev: fun nginx is fun..
[18:01:47] yeah... who knows, maybe location resets headers or proxy does something?
[18:03:21] SMalyshev: so that's no header even when making the request straight to nginx?
[18:03:39] X-Served-By: wdqs1001
[18:03:52] addshore: that's what I see
[18:04:03] yeh, so varnish is doing something then :/
[18:04:15] addshore: no, I mean I see no header from nginx
[18:04:18] do you?
[18:04:22] yes
[18:04:28] hmm that's weird.
[18:04:36] from stat1002 curl --verbose --proxy webproxy:8080 "http://wdqs1001.eqiad.wmnet:80/status"
[18:04:42] and I get the header in the response
[18:04:44] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1842104 (10Zdzislaw)
[18:04:47] ohh now I see it too
[18:04:55] it didn't restart?
[18:05:24] waaaaait... I see it on static requests but not on sparql requests
[18:05:52] looks like proxy does mess with it
[18:06:22] so my trick of putting it in server seems to have failed and we need (at least a copy) inside location probably
[18:06:25] 6operations, 6Labs: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1842110 (10coren) More data: Jessie (which labstore1001 has) has 1.15.1 which //does// include the above patch; but it doesn't work. I found out why - it's an obvious...
[18:06:36] godog: It's even more stupid than I thought. ^^
[18:06:40] SMalyshev: indeed I also see it on /status through varnish but not for the sparql endpoint itself
[18:06:52] 6operations, 10ops-eqdfw, 10RESTBase: check for spare disk bays in restbase1007-1009 - https://phabricator.wikimedia.org/T119896#1842111 (10Cmjohnson) 5Open>3Resolved Confirmed that restbase1007-1009 will take up to 6 more disks. I have disk carriers on-site.
[18:06:55] oh nginx you are so much fun
[18:06:58] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1842114 (10GWicke)
[18:07:01] SMalyshev: indeed ;)
[18:07:19] although it looks like https://gerrit.wikimedia.org/r/#/c/256256/ should be okay now SMalyshev
[18:07:37] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1840448 (10GWicke)
[18:08:03] addshore: let me test it on labs
[18:09:17] Coren: heheheh oops
[18:09:37] godog: That had me facepalm. :-)
[18:09:40] SMalyshev: hah, and for the issue we just ran into it looks like there is a module called more_set_headers which is to be used with proxy_pass...
[18:10:51] Coren: seems like a pretty serious paramiko bug to me heh
[18:11:54] godog: I'll say. It means that no matter what keys you have, the only one it will ever try to use is ssh-rsa unless the server *refuses* to negotiate it.
[18:11:57] <_joe_> paramiko is horrible as far as ciphers et al are concerned
[18:12:47] _joe_: No argument from me there.
[18:12:55] addshore: yeah http://stackoverflow.com/questions/14501047/how-to-add-a-response-header-on-nginx-when-using-proxy-pass#14508087 looks like it has to be inside location to work for proxy... no idea why
[18:13:16] SMalyshev: so that would require having it twice I imagine
[18:13:26] addshore: oh crap: "These directives are inherited from the previous level if and only if there are no add_header directives defined on the current level."
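The paramiko bug Coren describes, "no matter what keys you have, the only one it will ever try to use is ssh-rsa", boils down to picking a host key algorithm from a fixed preference list without checking which keys the client actually holds. A toy sketch of that negotiation logic, not paramiko's actual code (function and parameter names are illustrative):

```python
def negotiate_host_key_alg(server_offered, known_key_types,
                           preference=("ssh-rsa", "ecdsa-sha2-nistp256")):
    """Pick a host key algorithm both sides can use.

    The buggy behaviour walks the client's preference list and takes the
    first algorithm the server offers -- even when the client holds no
    known host key of that type, so verification fails later.  The fix
    sketched here considers only algorithms for which a known key exists.
    """
    usable = [alg for alg in preference if alg in known_key_types]
    for alg in usable:
        if alg in server_offered:
            return alg
    return None  # no mutually usable algorithm
```

With only an ECDSA key known and a server offering both RSA and ECDSA, the buggy order picks ssh-rsa and fails; the filtered version settles on ecdsa-sha2-nistp256.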
[18:13:52] * SMalyshev should read every line of the manual, not just every third one
[18:13:56] (03PS1) 10Chad: Gerrit: use Diffusion for repo browsing [puppet] - 10https://gerrit.wikimedia.org/r/256262
[18:14:11] so because add_header is set in the sparql location it then ignores the add_header for served by?
[18:14:26] godog: lulz. Have you read the commit message for that patch? "More sophisticated key negotiation is still necessary in the case
[18:14:26] where we have an ECDSA key for the server and it offers us both RSA
[18:14:26] and ECDSA. In this case, we will pick RSA and fail because we don't
[18:14:26] have it. [...]"
[18:14:36] so yeah, if you have different headers, you need to repeat them on each level. fun.
[18:14:57] !log restarted apache on iridium to deploy https://phabricator.wikimedia.org/rPHEXd724c51a4144f7de548546875097da57ee1b38d7
[18:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:15:06] addshore: yeah looks like using add_header resets it for current level. So the fix would be to repeat it
[18:15:15] *makes a patch*
[18:15:17] (03PS2) 10Chad: Gerrit: use Diffusion for repo browsing [puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607)
[18:15:18] Coren: hehe no I missed that, well at least we know what's fixable next!
[18:16:09] (03PS1) 10Addshore: WDQS - Repeat addheader X-Served-By for sparql location [puppet] - 10https://gerrit.wikimedia.org/r/256263 (https://phabricator.wikimedia.org/T119508)
[18:16:10] SMalyshev: ^^
[18:16:45] (03CR) 10Smalyshev: [C: 031] WDQS - Repeat addheader X-Served-By for sparql location [puppet] - 10https://gerrit.wikimedia.org/r/256263 (https://phabricator.wikimedia.org/T119508) (owner: 10Addshore)
[18:16:56] godog: Indeed.
[18:17:02] I'm sure we could come up with some quote about always meaning always half the time in nginx... :P
[18:18:11] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1842154 (10RobH) @cmjohnson already checked the disk bays of restbase1007-1009 on T119896 and they do have room to accommodate more disks. >>! In T119896#1842111, @Cmjohnson wro...
[18:19:02] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10vm-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1842158 (10JMinor) @Joe @mark any updates here or other info I can provide for this request? I don't have m...
[18:19:04] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1842159 (10RobH) This seems to have a cross-over/duplication of disk expansion with T119659. As such, I think we should drop the disk requests for restbase1007-1009 off T119659,...
[18:19:08] addshore: yeah :)
[18:23:43] (03PS2) 10Andrew Bogott: WDQS - Repeat addheader X-Served-By for sparql location [puppet] - 10https://gerrit.wikimedia.org/r/256263 (https://phabricator.wikimedia.org/T119508) (owner: 10Addshore)
[18:23:44] SMalyshev: how about the other patch? :)
[18:24:25] addshore: need to test it, will do it today. I'll try to finish it before meetings but if not then a bit later
[18:24:41] okay, if you give it a +1 then I'll stick it in the puppet swat for tomorrow
[18:25:08] (03CR) 10Andrew Bogott: [C: 032] WDQS - Repeat addheader X-Served-By for sparql location [puppet] - 10https://gerrit.wikimedia.org/r/256263 (https://phabricator.wikimedia.org/T119508) (owner: 10Addshore)
[18:26:34] andrewbogott: SMalyshev lovely, that one works on the sparql endpoint now!
[18:26:44] cool!
[18:26:46] x-served-by:wdqs1002 :)
[18:27:08] yay!
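[Editor's note] The add_header behavior debugged above is documented nginx semantics: add_header directives are inherited from the enclosing level only if the current level defines none of its own, so a server-level header silently disappears in any location block that adds its own header. A minimal sketch of the fix being discussed (server names, ports, and paths here are made up for illustration, not the actual WDQS config):

```nginx
server {
    # Applies to every location that has NO add_header of its own.
    add_header X-Served-By wdqs1002;

    location /status {
        # No add_header here, so X-Served-By is inherited.
        proxy_pass http://localhost:9999;
    }

    location /sparql {
        add_header Cache-Control "public, max-age=300";
        # This level now defines add_header, so inheritance stops:
        # X-Served-By must be repeated or it is dropped on /sparql responses.
        add_header X-Served-By wdqs1002;
        proxy_pass http://localhost:9999;
    }
}
```

This matches SMalyshev's reading of the manual quote: headers are inherited "if and only if" the current level sets none, hence "repeat them on each level".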
[18:28:14] addshore: btw, I plan to reload the DBs from dump once the new dump is ready (due to the fix for duplication bug) so do not be alarmed if hosts go down and pop up for next couple of days
[18:28:24] We could tentatively go for 256256 ? ;)
[18:28:37] SMalyshev: awesome! :)
[18:29:20] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1842228 (10RobH) I've chatted with @gwicke in IRC. Having the disk space is useful even without the expansion of the cpu/ram. As such, the disk upgrade will be handled on the ol...
[18:29:49] ahh, of course it needs a manual rebase now anyway
[18:33:19] (03PS4) 10Addshore: Set WDQS 5min expiry for internal access Port:8888 (try2) [puppet] - 10https://gerrit.wikimedia.org/r/256256
[18:34:11] addshore: did you test https://gerrit.wikimedia.org/r/#/c/256256/4 on labs?
[18:36:08] (03CR) 10Paladox: [C: 031] "I +1 but diffusion doesn't let you see un merged patches like gitblit does." [puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad)
[18:36:31] andrewbogott: I don't have access to the test query box but I'm just going to quickly put the config on an nginx install and confirm it works
[18:36:42] ok
[18:36:46] addshore: I can set you up with access to test box
[18:37:34] Can someone rebase https://gerrit.wikimedia.org/r/#/c/239854/? I can't, but this is needed. Thanks!
[18:38:56] SMalyshev: that would be great!
[18:39:13] addshore: what's your wikitech username?
[18:39:17] addshore
[18:39:23] (03PS1) 10Andrew Bogott: Remove reference to labs-ns1placeholder.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/256265 (https://phabricator.wikimedia.org/T119762)
[18:40:41] addshore: you should now be able to access wdq-beta.eqiad.wmflabs
[18:42:50] woops something is wrong there with the patch, let me see
[18:43:07] <<<<<<< HEAD ;)
[18:43:12] yup
[18:43:57] I wonder where it messes up... let me update
[18:44:29] ahh, it's local puppetmaster messed up
[18:45:22] (03PS1) 10Giuseppe Lavagetto: base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/256267 (https://phabricator.wikimedia.org/T114638)
[18:45:25] (03CR) 10Andrew Bogott: "The puppet compiler still thinks that labs-ns1.wikimedia.org is 208.80.153.15. I don't know why, but this can't be merged until everyone " [puppet] - 10https://gerrit.wikimedia.org/r/256265 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott)
[18:45:53] <_joe_> andrewbogott: what do you mean?
[18:46:26] _joe_: you mean regarding my last gerrit comment?
[18:46:40] <_joe_> yes
[18:47:09] Everything I check agrees that labs-ns1.wikimedia.org = labs-ns1placeholder.wikimedia.org = 208.80.154.102
[18:47:21] <_joe_> what and how do you check?
[18:48:05] whereas https://puppet-compiler.wmflabs.org/1398/labcontrol1002.wikimedia.org/ suggests that the compiler is getting the old ip for labs-ns1, 208.80.153.15
[18:48:21] um… dig and dig @ns0 and dig @ns1
[18:48:22] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0]
[18:48:31] _joe_: are you getting 208.80.153.15 from your digs?
[18:49:12] addshore: ok, now puppet is clean... it's a weird setup there, sometimes goes wrong
[18:49:29] <_joe_> andrewbogott: how does a puppet compilation imply the result of dig?
[18:50:29] _joe_: well… I don't know where the puppet compiled is getting the ip 208.80.153.15
[18:50:31] if not from dns
[18:50:37] *compiler
[18:50:49] <_joe_> andrewbogott: from a fact maybe?
[18:51:01] <_joe_> andrewbogott: what I should dig for?
[18:51:16] just a second, let me sort out the whole chain
[18:51:22] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0]
[18:52:21] <_joe_> andrewbogott: no wait
[18:52:57] * andrewbogott waits
[18:53:16] <_joe_> dig labs-ns1.wikimedia.org @labs-recursor1.wikimedia.org is your problem
[18:53:29] <_joe_> you _must_ clear the recursor cache, or reduce the ttl a lot
[18:53:36] <_joe_> you're like 3 minutes out
[18:53:56] (03PS1) 10Ori.livneh: Configure Redis diamond collector to observe all jobqueue instances [puppet] - 10https://gerrit.wikimedia.org/r/256268
[18:54:03] <_joe_> ok I have to leave, it's the 8th consecutive day of 10+ hours of work
[18:54:16] _joe_: have a good night!
[18:54:26] <_joe_> I might be back later, I want to finish this ssldir thing
[18:54:40] (03CR) 10Chad: "I know. Inline comment to Christian about if we can avoid it for the time being." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad)
[18:54:59] SMalyshev: looks like it worked?
[18:55:28] addshore: looks like it
[18:56:30] and it's listening correctly curl --verbose "http://wdq-beta:8888/status"
[18:57:05] yep looks like queries work... let me check if I can see the timeout
[18:57:22] (03CR) 10Andrew Bogott: [C: 032] "ok, the compiler is happy now." [puppet] - 10https://gerrit.wikimedia.org/r/256265 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott)
[18:57:28] I'm guessing that's recorded in the blazegraph log somewhere?
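[Editor's note] _joe_'s diagnosis above ("you _must_ clear the recursor cache, or reduce the ttl a lot; you're like 3 minutes out") is the standard caching-resolver behavior: after the authoritative record changes, a recursor keeps serving the cached answer until the original TTL expires. A toy sketch of that mechanism (the class, names, and TTLs are invented for illustration; this is not how pdns-recursor is implemented):

```python
# Minimal sketch of why the puppet compiler still saw the old labs-ns1 IP:
# a caching resolver serves the cached answer until the record's TTL expires,
# even after the authoritative data has changed.
import time


class CachingResolver:
    """Toy recursor: caches one (value, expiry) pair per name."""

    def __init__(self, authoritative, clock=time.monotonic):
        self.authoritative = authoritative  # name -> (ip, ttl_seconds)
        self.clock = clock
        self.cache = {}

    def resolve(self, name):
        now = self.clock()
        hit = self.cache.get(name)
        if hit and hit[1] > now:
            return hit[0]  # still within TTL: serve the (possibly stale) answer
        ip, ttl = self.authoritative[name]
        self.cache[name] = (ip, now + ttl)
        return ip


# Fake clock so the example is deterministic.
t = [0.0]
auth = {"labs-ns1.wikimedia.org": ("208.80.153.15", 3600)}
r = CachingResolver(auth, clock=lambda: t[0])

r.resolve("labs-ns1.wikimedia.org")  # primes the cache with the old IP
auth["labs-ns1.wikimedia.org"] = ("208.80.154.102", 3600)  # record updated

t[0] = 180  # three minutes later: cache still valid, old IP returned
print(r.resolve("labs-ns1.wikimedia.org"))  # 208.80.153.15

t[0] = 3601  # after the original TTL: fresh answer
print(r.resolve("labs-ns1.wikimedia.org"))  # 208.80.154.102
```

Hence the two fixes named in the log: flush the recursor cache, or lower the TTL before the change so the stale window shrinks.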
[18:57:39] addshore: yes
[18:57:50] yep, works too
[18:58:09] (03CR) 10Smalyshev: [C: 031] "Works for me on wdq-test labs machine" [puppet] - 10https://gerrit.wikimedia.org/r/256256 (owner: 10Addshore)
[18:58:28] (03PS5) 10Andrew Bogott: Set WDQS 5min expiry for internal access Port:8888 (try2) [puppet] - 10https://gerrit.wikimedia.org/r/256256 (owner: 10Addshore)
[18:58:34] addshore: didn't test 8888 port so if you can test it does what you need it to do then it'd be fine
[18:58:52] but on my side everything seems to be fine
[18:59:45] (03CR) 10Andrew Bogott: [C: 032] "Thanks for testing :)" [puppet] - 10https://gerrit.wikimedia.org/r/256256 (owner: 10Addshore)
[19:00:55] (03CR) 10Smalyshev: [C: 031] "it's a relic" [puppet] - 10https://gerrit.wikimedia.org/r/256191 (owner: 10Addshore)
[19:01:00] andrewbogott: thanks for your time :)
[19:01:13] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[19:01:32] addshore: you're merged and applied, want to verify that it's doing something useful?
[19:01:40] (03CR) 10Smalyshev: [C: 031] Remove wdqs-roots group from rutherfordium [puppet] - 10https://gerrit.wikimedia.org/r/256192 (owner: 10Addshore)
[19:02:03] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[19:02:18] (03CR) 10Smalyshev: [C: 031] Remove unused wdqs-roots group [puppet] - 10https://gerrit.wikimedia.org/r/256193 (owner: 10Addshore)
[19:02:29] andrewbogott: {{doing}}
[19:02:56] (03PS1) 10Andrew Bogott: Remove the now-redundant and unused labs-ns1placeholder. [dns] - 10https://gerrit.wikimedia.org/r/256270 (https://phabricator.wikimedia.org/T119762)
[19:05:55] andrewbogott: SMalyshev so accessing it directly inside the network on 8888 works and looks good
[19:06:05] ok, great
[19:06:08] addshore: great
[19:06:11] looks like it is still inaccessible to webproxy.eqiad.wmnet though >.>
[19:06:16] * andrewbogott declares puppetswat to be over, finally
[19:06:26] um...
[19:06:42] that I cannot explain
[19:06:51] curl --verbose --proxy webproxy:8080 "http://wdqs1001.eqiad.wmnet:8888/status" should work afaik, I'll look into it
[19:08:03] my guess is webproxy is not covered by $INTERNAL
[19:09:06] webproxy.eqiad.wmnet is an alias for carbon.wikimedia.org.
[19:09:06] carbon.wikimedia.org has address 208.80.154.10
[19:09:08] We should figure out / unify webproxy vs url-downloader.
[19:09:16] not internal at least by name
[19:09:18] :)
[19:09:40] maybe it has internal interface dunno
[19:11:34] Well, in that case srange => '(($INTERNAL @resolve(carbon.wikimedia.org)))' would probably work, although I will check INTERNAL first..
[19:11:57] INTERNAL is 10.0.0.0/8 & 2620:0:100::/56
[19:12:32] PROBLEM - puppet last run on db2042 is CRITICAL: CRITICAL: puppet fail
[19:12:54] I get 2620:0:861:1:208:80:154:10 for webproxy, but I can only imagine the requests on the other side leave through the external interface not local one
[19:15:09] what does webproxy.eqiad.wmnet have to do with internal requests ?
[19:16:12] akosiaris: well all requests from analytics hosts are blocked, hence use of webproxy for http requests
[19:16:56] 6operations, 6Services: Consider using XFS for Cassandra data partition - https://phabricator.wikimedia.org/T120004#1842458 (10GWicke) 3NEW
[19:17:12] 6operations, 6Services: Discussion: Use XFS for Cassandra data partition? - https://phabricator.wikimedia.org/T120004#1842469 (10GWicke)
[19:17:55] addshore: ask that they are not instead of bypassing the block?
[19:18:07] akosiaris: so either also allow access to that port for carbon too, or open something direct from analytics to that machine on that port (though I don't know how to do this one)
[19:18:33] no, webproxy is not meant to be used to query internal services
[19:18:43] if something is blocked, ask that it is not
[19:19:15] 7Puppet, 6Phabricator, 6Release-Engineering-Team: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1842484 (10mmodell) Clone {rPHDEP} (master) into `/srv/phab/`, then run `git submodule update` and that's it! For example: ```lang=bash sudo git clone https://gerrit.wikimedi...
[19:19:35] 6operations, 6Services: Discussion: Use XFS for Cassandra data partition? - https://phabricator.wikimedia.org/T120004#1842487 (10GWicke)
[19:21:30] so, basically, we need a firewall exception to get to wdqs*:8888 from analytics.
[19:22:02] ottomata: ^^
[19:24:05] ja makes sense, addshore, what are you trying to do again?
[19:24:24] access the wikidata query service on a port which allows extended execution time
[19:25:59] I don't see where this access is blocked anywhere in puppet ;)
[19:26:29] it's on the routers' ACLs
[19:26:40] not in puppet
[19:26:48] ahh!
[19:26:49] we should probably have that documented someplace
[19:27:52] (03PS1) 10Chad: pep8: fix most warnings in wmfdumpsmirror.py [puppet] - 10https://gerrit.wikimedia.org/r/256275
[19:30:33] Should I make a ticket for adding the exception?
[19:30:40] addshore: yes
[19:30:50] netops and analytics projects please
[19:31:45] (03CR) 10ArielGlenn: [C: 032] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/256275 (owner: 10Chad)
[19:32:14] hm, ooook...
[19:33:59] 6operations, 10Analytics, 6Discovery, 6WMDE-Analytics-Engineering, and 3 others: Add firewall exception to get to wdqs*:8888 from analytics cluster - https://phabricator.wikimedia.org/T120010#1842561 (10Addshore) 3NEW
[19:34:13] (03PS2) 10Ori.livneh: Configure Redis diamond collector to observe all jobqueue instances [puppet] - 10https://gerrit.wikimedia.org/r/256268
[19:34:22] (03CR) 10Ori.livneh: [C: 032 V: 032] Configure Redis diamond collector to observe all jobqueue instances [puppet] - 10https://gerrit.wikimedia.org/r/256268 (owner: 10Ori.livneh)
[19:35:14] 6operations, 6Analytics-Backlog, 6Discovery, 6WMDE-Analytics-Engineering, and 3 others: Add firewall exception to get to wdqs*:8888 from analytics cluster - https://phabricator.wikimedia.org/T120010#1842581 (10madhuvishy)
[19:35:18] tfinc: To answer your question of whether I'm coming into the office... yes.
[19:35:24] :)
[19:35:32] Deskana: did you want to go to ElasticON ?
[19:35:36] 6operations, 10Analytics, 6Discovery, 6WMDE-Analytics-Engineering, and 3 others: Add firewall exception to get to wdqs*:8888 from analytics cluster - https://phabricator.wikimedia.org/T120010#1842585 (10Addshore)
[19:35:43] tfinc: I seem to recall saying no, but I can't remember why.
[19:35:52] tfinc: Let me look at the dates.
[19:36:00] 6operations, 6Analytics-Backlog, 6Discovery, 6WMDE-Analytics-Engineering, and 3 others: Add firewall exception to get to wdqs*:8888 from analytics cluster - https://phabricator.wikimedia.org/T120010#1842561 (10Addshore)
[19:36:01] Deskana: ok, it's $1k so i'm not going to nudge much
[19:36:08] Erik and I will likely attend
[19:36:35] tfinc: I think I'm getting myself confused with another conference.
[19:36:56] tfinc: All things being equal I would like to go. But, I wonder whether that $1k would be better spent elsewhere. What do you think?
[19:37:13] Deskana: do you have other conferences that you are considering ?
[19:37:31] tfinc: None. I'm also probably going to skip Wikimania next year.
[19:37:41] this year our travel/wikimania/conference are all one budget so i'm a bit more conservative than usual on the big ticket items
[19:37:44] RECOVERY - puppet last run on db2042 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[19:38:06] elasticon is $1k? are they out of their minds?
[19:38:16] holy crap
[19:38:41] * Deskana has just realised he started this conversation in the operations channel by accident
[19:38:42] That's quite a jump from last year iirc.
[19:38:43] 6operations, 6Analytics-Backlog, 6Discovery, 6WMDE-Analytics-Engineering, and 3 others: Add firewall exception to get to wdqs*:8888 from analytics cluster - https://phabricator.wikimedia.org/T120010#1842561 (10Addshore)
[19:38:45] ori: that's the early bird price, it only goes up from there. i've reached out to nick to see if there are discounts
[19:39:02] many thanks all!
[19:39:32] yeah that's a good idea
[19:40:22] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[19:41:46] addshore: thanks for the webperf patch!
[19:41:50] (03PS2) 10Ori.livneh: Fix 403 -> 404 typo in webpref asset-check.js [puppet] - 10https://gerrit.wikimedia.org/r/256182 (owner: 10Addshore)
[19:42:02] (03CR) 10Ori.livneh: [C: 032] Fix 403 -> 404 typo in webpref asset-check.js [puppet] - 10https://gerrit.wikimedia.org/r/256182 (owner: 10Addshore)
[19:42:04] ori: no worries :)
[19:42:13] (03CR) 10Ori.livneh: [V: 032] Fix 403 -> 404 typo in webpref asset-check.js [puppet] - 10https://gerrit.wikimedia.org/r/256182 (owner: 10Addshore)
[19:42:32] (03PS1) 10Chad: varnish: clean up a bunch of pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/256278
[19:46:52] (03PS4) 10BBlack: cache_misc: switch to new port/instance mapping [puppet] - 10https://gerrit.wikimedia.org/r/256260 (https://phabricator.wikimedia.org/T119396)
[19:46:54] (03PS3) 10BBlack: varnish: ability to remove a named instance [puppet] - 10https://gerrit.wikimedia.org/r/256259 (https://phabricator.wikimedia.org/T119396)
[19:46:56] (03PS1) 10BBlack: update varnish directors for full instance names [puppet] - 10https://gerrit.wikimedia.org/r/256279 (https://phabricator.wikimedia.org/T119396)
[19:46:58] (03PS1) 10BBlack: cache_maps: switch to new port/instance mapping [puppet] - 10https://gerrit.wikimedia.org/r/256280 (https://phabricator.wikimedia.org/T119396)
[19:55:27] (03CR) 1020after4: [C: 031] Gerrit: use Diffusion for repo browsing [puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad)
[20:06:58] (03CR) 10coren: [C: 031] "Yep." [dns] - 10https://gerrit.wikimedia.org/r/256270 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott)
[20:08:52] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000000.0]
[20:09:43] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail
[20:12:02] (03CR) 10Andrew Bogott: [C: 032] Remove the now-redundant and unused labs-ns1placeholder. [dns] - 10https://gerrit.wikimedia.org/r/256270 (https://phabricator.wikimedia.org/T119762) (owner: 10Andrew Bogott)
[20:14:27] (03CR) 10BBlack: [C: 032] varnish: ability to remove a named instance [puppet] - 10https://gerrit.wikimedia.org/r/256259 (https://phabricator.wikimedia.org/T119396) (owner: 10BBlack)
[20:14:53] (03PS1) 10Chad: pep8: minor whitespace fix in deploy.py [puppet] - 10https://gerrit.wikimedia.org/r/256288
[20:15:05] (03CR) 10BBlack: [C: 032] "Compiler-verified as a no-op on a couple of tiers and cluster types" [puppet] - 10https://gerrit.wikimedia.org/r/256279 (https://phabricator.wikimedia.org/T119396) (owner: 10BBlack)
[20:17:03] (03PS1) 10Chad: ps_mem.py: fix ori's code to be pep8 compliant :) [puppet] - 10https://gerrit.wikimedia.org/r/256289
[20:19:33] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail
[20:19:55] !log temp. disabling puppet on cache::misc
[20:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:20:45] (03PS1) 10Chad: pep8: fix list-last-n-good-dumps style, mostly in/not in stuff [puppet] - 10https://gerrit.wikimedia.org/r/256291
[20:23:22] (03PS1) 10Chad: check-raid.py: minor pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/256292
[20:24:32] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[20:30:45] (03PS2) 10Dzahn: updating star.planet.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/254431 (owner: 10RobH)
[20:32:12] (03CR) 10Dzahn: [C: 032] updating star.planet.wikimedia.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/254431 (owner: 10RobH)
[20:33:24] (03PS1) 10Chad: new_wmf_service.py: fix pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/256311
[20:34:03] (03CR) 10BBlack: [C: 04-1] "There are still lots of issues here with logging (xcps, rls, statsd), do not merge" [puppet] - 10https://gerrit.wikimedia.org/r/256280 (https://phabricator.wikimedia.org/T119396) (owner: 10BBlack)
[20:35:09] (03CR) 10jenkins-bot: [V: 04-1] new_wmf_service.py: fix pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/256311 (owner: 10Chad)
[20:37:32] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:44:13] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0]
[20:54:51] !log re-enabled puppet on cache::misc, replaced planet cert
[20:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:55:03] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0]
[20:56:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0]
[21:01:44] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[21:04:54] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[21:05:54] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[21:19:44] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1842934 (10Milimetric) @jcrespo, just open up a different task and assign it to me if I can help in any way. Don't worry about tagging analytics projects, it seems everyone's confused abou...
[21:26:00] (03PS2) 10Jdlrobson: Enable RelatedArticles on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255553 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov)
[21:49:11] (03PS1) 10Andrew Bogott: Switch everything to the new openldap ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/256346 (https://phabricator.wikimedia.org/T101299)
[21:49:34] (03CR) 10Andrew Bogott: [C: 04-1] "Do not merge -- this needs to be staged with other changes." [puppet] - 10https://gerrit.wikimedia.org/r/256346 (https://phabricator.wikimedia.org/T101299) (owner: 10Andrew Bogott)
[21:50:01] 6operations, 10Wikimedia-Planet: update *.planet.wikimedia.org certificate on cluster (implement) - https://phabricator.wikimedia.org/T119987#1842995 (10Dzahn) done. the cert has been replaced on all cache::misc servers (16 now, not just 4). I also had to restart nginx. disabled puppet on all, enabled on a si...
[21:51:11] ostriches: Gerrit pulls in new user accounts from ldap, right? Do you know where the name of that ldap server is configured? I can't find it in puppet.
[21:51:21] 6operations, 10Wikimedia-Planet: update *.planet.wikimedia.org certificate on cluster (implement) - https://phabricator.wikimedia.org/T119987#1842999 (10Dzahn) 5Open>3Resolved now valid until 03/05/2017
[21:51:47] (03PS5) 10MaxSem: WIP: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262)
[21:52:02] andrewbogott, hieradata/common.yaml: ldap_host: 'ldap-eqiad.wikimedia.org'
[21:52:15] oh no
[21:52:18] that can't be the right one
[21:52:28] $ldap::role::config::labs::ldapconfig
[21:52:40] (03CR) 10jenkins-bot: [V: 04-1] WIP: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem)
[21:52:47] It's in modules/gerrit/manifests/instance.pp
[21:53:14] (03CR) 10MaxSem: WIP: OSM replication for maps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem)
[21:53:46] ostriches: ok, I see it.
[21:54:11] that seems like a weird place to get that setting
[21:54:32] Probably? The gerrit puppet code is old and crusty.
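[Editor's note] Several of Chad's patches in this window are pep8 cleanups, including one described as "mostly in/not in stuff". That refers to the pycodestyle warning E713 ("test for membership should be 'not in x'"): `not x in y` and `x not in y` evaluate identically, but only the latter is the idiomatic, pep8-clean spelling. A minimal before/after illustration (the `good_dumps`/`status` names are invented, not taken from the actual wmfdumpsmirror/list-last-n-good-dumps code):

```python
# pep8/pycodestyle E713: "test for membership should be 'not in x'".
# Both spellings produce the same result; only the second is pep8-clean.
good_dumps = ["20151102", "20151120"]

# Before (triggers E713):
if not "20151201" in good_dumps:
    status = "missing"

# After (idiomatic, same behavior):
if "20151201" not in good_dumps:
    status = "missing"

print(status)  # missing
```

The two forms are equivalent because `not in` is a single comparison operator, not a negation applied afterwards, which is exactly why the linter treats the first form as purely stylistic to rewrite.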
[21:54:56] maybe not, it seems to get used in a bunch of places
[21:55:09] (03PS6) 10MaxSem: WIP: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262)
[21:55:41] thanks ostriches
[21:56:01] (03CR) 10jenkins-bot: [V: 04-1] WIP: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) (owner: 10MaxSem)
[21:56:15] andrewbogott: yw
[21:56:42] (03PS2) 10Andrew Bogott: Switch everything to the new openldap ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/256346 (https://phabricator.wikimedia.org/T101299)
[21:57:25] andrewbogott: fwiw, that requires a restart of the gerrit app to pick up a config change.
[21:57:58] ostriches: ok… no notifies in puppet?
[21:58:24] Lemme double check now that you mention it
[21:58:47] Ignore me it subscribes that file
[21:59:22] * andrewbogott is dying to find all the undocumented differences between the opendj and openldap interfaces
[21:59:30] ostriches: better yet!
[22:07:12] (03Abandoned) 10Dzahn: mirrors: sync ubuntu mirrors more often [puppet] - 10https://gerrit.wikimedia.org/r/256155 (owner: 10Dzahn)
[22:08:11] (03PS2) 10Dzahn: Remove Einsteinium hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/256191 (owner: 10Addshore)
[22:08:24] (03CR) 10Dzahn: [C: 032] Remove Einsteinium hiera configuration [puppet] - 10https://gerrit.wikimedia.org/r/256191 (owner: 10Addshore)
[22:08:37] (03PS2) 10Dzahn: Remove wdqs-roots group from rutherfordium [puppet] - 10https://gerrit.wikimedia.org/r/256192 (owner: 10Addshore)
[22:08:56] (03PS2) 10Dzahn: Remove unused wdqs-roots group [puppet] - 10https://gerrit.wikimedia.org/r/256193 (owner: 10Addshore)
[22:09:21] (03CR) 10Dzahn: [C: 032] Remove wdqs-roots group from rutherfordium [puppet] - 10https://gerrit.wikimedia.org/r/256192 (owner: 10Addshore)
[22:11:18] (03CR) 10Dzahn: [C: 032] Remove unused wdqs-roots group [puppet] - 10https://gerrit.wikimedia.org/r/256193 (owner: 10Addshore)
[22:23:02] mmhhmm, many thanks for removing that confusion mutante
[22:26:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0]
[22:28:40] (03CR) 10Dzahn: [C: 04-1] "tested, but see results below:" [puppet] - 10https://gerrit.wikimedia.org/r/255149 (https://phabricator.wikimedia.org/T119532) (owner: 10JanZerebecki)
[22:28:42] 6operations, 6Services: Discussion: Use XFS for Cassandra data partition? - https://phabricator.wikimedia.org/T120004#1843083 (10mobrovac) +1 to XFS. > I'm of the opinion of not introducing differences if we can avoid it, but if there's a case to be made that it performs better than ext4 for us then f...
[22:29:14] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [5000000.0]
[22:29:33] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0]
[22:31:48] (03PS2) 10Dzahn: Add authors to wikidatabuilder cron-build.sh [puppet] - 10https://gerrit.wikimedia.org/r/256190 (owner: 10Addshore)
[22:32:25] (03CR) 10Dzahn: [C: 032] Add authors to wikidatabuilder cron-build.sh [puppet] - 10https://gerrit.wikimedia.org/r/256190 (owner: 10Addshore)
[22:33:23] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [5000000.0]
[22:33:34] ACKNOWLEDGEMENT - Varnish traffic logger - erbium on cp1008 is CRITICAL: PROCS CRITICAL: 0 processes with args varnishncsa-erbium.pid, UID = 111 (varnishlog) daniel_zahn scheduled downtime
[22:34:30] ACKNOWLEDGEMENT - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: puppet fail daniel_zahn its test by definition
[22:39:12] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[22:41:10] !log stop/start ntp service on lvs2002
[22:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:45:01] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[22:45:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[22:46:10] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[22:47:16] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1843114 (10Milimetric) >>! In T119541#1840875, @fgiunchedi wrote: > seems to be working fine too on a jessie host, I can't see from wikitech what classes are applied to limn1, mayb...
[22:47:54] (03CR) 10Dzahn: "@Paladox i realize it's still here. sorry, was just forwarding info" [puppet] - 10https://gerrit.wikimedia.org/r/251836 (owner: 10Paladox)
[22:48:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0]
[22:53:41] (03PS1) 10Dzahn: icinga: add virtual host for ores (test) [puppet] - 10https://gerrit.wikimedia.org/r/256352
[23:00:13] (03PS1) 10Rush: labtest: initial hiera override values [puppet] - 10https://gerrit.wikimedia.org/r/256353
[23:00:16] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1843156 (10MaxSem) puppet-test02.maps-team.eqiad.wmflabs is an example of this failure on an up-to-date jessie image.
[23:00:20] RECOVERY - NTP on lvs2002 is OK: NTP OK: Offset -0.002350568771 secs
[23:00:45] (03PS2) 10Rush: labtest: initial hiera override values [puppet] - 10https://gerrit.wikimedia.org/r/256353
[23:00:56] 7Puppet, 6operations, 6Labs: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1843161 (10MaxSem)
[23:15:54] 6operations, 6Phabricator: this is a test subtask - https://phabricator.wikimedia.org/T120033#1843210 (10RobH) 3NEW a:3RobH
[23:16:20] 6operations, 6Phabricator: this is the second test sub-task - https://phabricator.wikimedia.org/T120034#1843218 (10RobH) 3NEW a:3RobH
[23:16:43] 6operations, 6Phabricator: this is the second test sub-task - https://phabricator.wikimedia.org/T120034#1843218 (10RobH)
[23:16:45] 6operations, 6Phabricator: this is a test subtask - https://phabricator.wikimedia.org/T120033#1843210 (10RobH)
[23:21:43] (03CR) 10Rush: [C: 032] labtest: initial hiera override values [puppet] - 10https://gerrit.wikimedia.org/r/256353 (owner: 10Rush)
[23:30:59] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1843266 (10RobH) I see what @greg means about how content is not copied in during merge. So it is not as ideal as how we handled it in RT, but I propose the following workflow in phabricator: * Ad...
[23:34:20] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1843275 (10RobH) I also want to note on record that we should not disclose our maint-announcements by default. This is why I stated that they should be auto-generated in the S4 operations private...
[23:42:46] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1843286 (10RobH)
[23:42:48] 6operations, 6Phabricator: this is a test subtask - https://phabricator.wikimedia.org/T120033#1843285 (10RobH) 5Open>3Resolved
[23:42:49] (03PS1) 10Rush: initial labtestcontrol2001.wikimedia.org node definition [puppet] - 10https://gerrit.wikimedia.org/r/256358
[23:43:57] (03PS2) 10Rush: initial labtestcontrol2001.wikimedia.org node definition [puppet] - 10https://gerrit.wikimedia.org/r/256358
[23:44:21] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[23:45:59] (03CR) 10Rush: [C: 031] "all my love" [puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad)
[23:47:45] (03CR) 10Rush: [C: 032] initial labtestcontrol2001.wikimedia.org node definition [puppet] - 10https://gerrit.wikimedia.org/r/256358 (owner: 10Rush)
[23:49:04] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1843290 (10RobH) Ok, I've had an IRC discussion with @chase and @dzahn about this workflow, and I want to modify my suggestion to the following: * Add in the #maint-announce project and have email...
[23:58:11] (03PS1) 10Rush: add labtest realm for ldap/manifests/role/config.pp [puppet] - 10https://gerrit.wikimedia.org/r/256359
[23:59:50] (03PS2) 10Rush: add labtest realm for ldap/manifests/role/config.pp [puppet] - 10https://gerrit.wikimedia.org/r/256359