[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160217T0000). [00:00:04] Krenair bmansurov AaronSchulz jdlrobson aude anomie: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:11] here [00:00:16] * anomie is available [00:00:33] * aude waves [00:01:13] hey [00:01:42] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old. [00:02:16] guess I'll do it [00:02:18] will do mine last [00:02:49] present [00:03:09] bmansurov, this is one that broke earlier, right? [00:03:17] yes [00:04:01] I'll be putting stuff on mw1017 first so hopefully we'll know before it can do anything that bad :) [00:04:17] ok [00:04:18] bmansurov, do you have the wikimediadebug extension? [00:04:25] no [00:04:54] is that the one for chrome? [00:04:59] or for firefox [00:04:59] yes [00:05:02] how do you test these changes? [00:05:02] If a non-tech fundraiser is interested in getting icinga alerts for the payments cluster without joining the fr-tech mailing list, where would he subscribe to those? [00:05:46] ejegg, probably have to be added by someone in ops-private, or maybe by jeff [00:05:53] puppet-private* by ops [00:06:11] thanks Krenair ! [00:06:18] Krenair: I put the change to my LocalSettings.php file [00:06:34] bmansurov, right but what is your test plan after this is in prod? [00:06:44] Krenair: oh I just visit a page [00:06:48] to see if a survey is there [00:07:06] https://en.wikipedia.org/wiki/Book?quicksurvey=true for example [00:07:32] and this will only occur in 0.005% of page views? [00:07:32] PROBLEM - puppet last run on db1030 is CRITICAL: CRITICAL: Puppet has 1 failures [00:07:42] oh okay, so quicksurvey=true forces it to appear? 
[00:07:45] Krenair: yes, but the query parameter bypasses that restriction [00:07:47] yes [00:07:59] got it [00:08:13] (PS4) Alex Monk: Enable survey at reduced sample rate [mediawiki-config] - https://gerrit.wikimedia.org/r/270985 (https://phabricator.wikimedia.org/T125946) (owner: Bmansurov) [00:08:21] (CR) Alex Monk: [C: 2] Enable survey at reduced sample rate [mediawiki-config] - https://gerrit.wikimedia.org/r/270985 (https://phabricator.wikimedia.org/T125946) (owner: Bmansurov) [00:08:55] (Merged) jenkins-bot: Enable survey at reduced sample rate [mediawiki-config] - https://gerrit.wikimedia.org/r/270985 (https://phabricator.wikimedia.org/T125946) (owner: Bmansurov) [00:12:08] umm [00:12:18] Krenair: I see it's working [00:12:32] thanks [00:12:36] why do there still appear to be DBReadOnlyError exceptions on mw1165 enwiktionary? [00:14:10] and mw1167 [00:14:27] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/270985 (duration: 01m 34s) [00:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:14:35] bmansurov, ^ should be everywhere now [00:14:43] ok thanks [00:15:38] greg-g: i would like to deploy the fix for https://phabricator.wikimedia.org/T127095 tomorrow, sometime earlier than swat [00:15:42] AaronSchulz? [00:15:57] * AaronSchulz looks [00:16:07] hmm, I don't have my yubikey atm [00:16:20] it is too late to be preparing that for backport now [00:16:56] aude: /me nods OK [00:16:58] AaronSchulz, no matter, will I see anything in the logs if there are issuess? [00:17:00] issues* [00:17:20] I'd hope so :) [00:18:13] okay, and you are ready to test this?
[00:18:22] (PS3) Dzahn: phabricator: fix 16 lint warnings [puppet] - https://gerrit.wikimedia.org/r/269904 [00:18:24] Operations, Editing-Department, Performance-Team, Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2033767 (Tgr) For the increase in `centralauth:user` key, any change that caused session to be initialized more often could have contributed. Introducing [[ https:... [00:19:00] (CR) Dzahn: [C: 2] phabricator: fix 16 lint warnings [puppet] - https://gerrit.wikimedia.org/r/269904 (owner: Dzahn) [00:19:53] greg-g: thanks [00:20:08] bblack, hi, any thoughts on https://phabricator.wikimedia.org/T126730 ? [00:20:15] AaronSchulz? [00:20:33] it seems like a fairly significant change [00:20:45] yeah [00:20:49] okay, good [00:21:05] oh, merge conflict [00:22:20] yurik: no, not at the moment. [00:22:22] rebasing locally [00:23:27] (PS2) Alex Monk: Set $wgCentralAuthUseSlaves on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/266603 (owner: Aaron Schulz) [00:23:44] (CR) Alex Monk: [C: 2] Set $wgCentralAuthUseSlaves on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/266603 (owner: Aaron Schulz) [00:24:12] (Merged) jenkins-bot: Set $wgCentralAuthUseSlaves on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/266603 (owner: Aaron Schulz) [00:25:33] AaronSchulz, it's on mw1017, everything look ok? [00:25:43] !log sync-common taking a while to terminate [00:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:25:52] * AaronSchulz flips the x-debug bit [00:26:14] !log ...
after showing rsync common finished line [00:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:27:03] login/out and such seem ok for me [00:27:22] (CR) Dzahn: "oops, should have waited until iridium runs puppet again, but i swear it's harmless" [puppet] - https://gerrit.wikimedia.org/r/269904 (owner: Dzahn) [00:28:13] (PS2) Dzahn: ruthenium: Updated update_parsoid.sh to run it as regular user [puppet] - https://gerrit.wikimedia.org/r/271047 (owner: Subramanya Sastry) [00:28:22] (CR) Alex Monk: [C: 2] Strip references for experimentation [mediawiki-config] - https://gerrit.wikimedia.org/r/271074 (https://phabricator.wikimedia.org/T126390) (owner: Jdlrobson) [00:28:33] (PS3) Dduvall: Program Dashboard configuration for initial labs rollout [puppet] - https://gerrit.wikimedia.org/r/271149 (https://phabricator.wikimedia.org/T105967) [00:28:37] thanks Krenair ping me when i should test :) [00:28:48] jdlrobson, I just noticed it's -labs only, so I think it should be fine [00:29:10] (PS4) Dduvall: Program Dashboard configuration for initial labs rollout [puppet] - https://gerrit.wikimedia.org/r/271149 (https://phabricator.wikimedia.org/T105967) [00:29:12] (CR) Dzahn: [C: 2] ruthenium: Updated update_parsoid.sh to run it as regular user [puppet] - https://gerrit.wikimedia.org/r/271047 (owner: Subramanya Sastry) [00:29:14] but keep an eye on beta and let me know if needs reverting, ok jdlrobson?
[00:29:22] (Merged) jenkins-bot: Strip references for experimentation [mediawiki-config] - https://gerrit.wikimedia.org/r/271074 (https://phabricator.wikimedia.org/T126390) (owner: Jdlrobson) [00:29:42] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/266603 (duration: 01m 30s) [00:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:29:48] AaronSchulz, ^ [00:29:55] I see [00:30:18] nothing weird in the logs [00:32:22] RECOVERY - puppet last run on db1030 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [00:32:26] (PS2) Dzahn: yubiauth: adjust class name after move to role module [puppet] - https://gerrit.wikimedia.org/r/271139 [00:32:30] Krenair: https://logstash.wikimedia.org/#dashboard/temp/AVLspFHbptxhN1XayYhp looks nice though [00:33:21] !log krenair@tin Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/271074 (duration: 01m 29s) [00:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:33:49] ugh, I really need to fix those read-only errors...getting kind of annoying now [00:35:11] kind of annoying = tail -f exception.log is useless? :p [00:35:56] Krenair: not seeing it in action yet [00:37:06] Krenair: think something went wrong [00:37:26] jdlrobson, with the beta change? ok... [00:37:34] yeh not seeing on beta cluster [00:37:34] let's see [00:37:51] well, it's on deployment-tin:/srv/mediawiki-staging [00:38:29] jdlrobson, I think it's syncing [00:40:15] aude, your changes are going through jenkins btw [00:40:49] ok [00:41:06] Krenair: cool. Got a little confused by your message about it being fine. [00:41:50] jdlrobson, should be done now?
[00:42:22] (CR) Dzahn: [C: 2] yubiauth: adjust class name after move to role module [puppet] - https://gerrit.wikimedia.org/r/271139 (owner: Dzahn) [00:43:53] (PS2) Dzahn: ruthenium: Clone the parsoid repo with 0775 mode [puppet] - https://gerrit.wikimedia.org/r/271082 (owner: Subramanya Sastry) [00:44:11] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old. [00:44:21] RECOVERY - puppet last run on auth1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:44:22] Is there an "easy" way to issue a Varnish purge request for a thumbnail that is stuck in cache in codfw? [00:44:41] can it not be purged in the normal way? [00:45:07] will action=purge on the File page do it? [00:45:15] This image -- https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/Airbus_A310-222%28F%29%2C_Oasis_International_Airlines_JP6534732.jpg/120px-Airbus_A310-222%28F%29%2C_Oasis_International_Airlines_JP6534732.jpg [00:45:26] boom thanks Krenair [00:45:30] confirmed working [00:45:32] is apparently serving a stale version from codfw [00:45:43] Krenair, are you swating? [00:45:47] the stale version has a black bar at the bottom [00:46:07] (CR) Dzahn: [C: 2] ruthenium: Clone the parsoid repo with 0775 mode [puppet] - https://gerrit.wikimedia.org/r/271082 (owner: Subramanya Sastry) [00:46:08] aude, syncing [00:46:10] yurik, yes [00:46:11] ok [00:46:39] Krenair, do you have time for one minor config change? [00:47:04] I think we're going to end up going over the end of the window as it is [00:47:08] what is the change exactly?
[00:47:15] Krenair, never mind, tomorrow [00:47:25] just some cleanup [00:47:29] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/271082 (owner: Subramanya Sastry) [00:47:31] !log krenair@tin Synchronized php-1.27.0-wmf.13/extensions/Math/MathValidator.php: https://gerrit.wikimedia.org/r/#/c/270981/ (duration: 01m 31s) [00:47:34] aude, ^ [00:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:47:36] checking [00:47:41] yurik, yes, please put it in tomorrow. thanks [00:47:46] looks good [00:47:50] thanks :) [00:48:10] looks like jenkins just did the wmf.14 one too [00:48:25] ok [00:48:25] woah, what? [00:48:38] i think wmf14 can just go on tin [00:48:41] There are two undeployed changes sitting on the wmf.14 branch [00:48:53] what? [00:49:01] and wmf.14 change is.. on testwiki [00:49:08] hmmm [00:49:09] s/change // [00:49:25] I23b9496e11c3aad8f29baf6840da49f50040a566 and Ic185b238198f4de2180d44f0c5317e5190661814 [00:49:28] i wasn't sure if i should put wmf14 in or not [00:49:47] ebernhardson, ^ [00:50:23] (CR) Dzahn: "Notice: /Stage[main]/Role::Parsoid::Testing/Git::Clone[mediawiki/services/parsoid/deploy]/File[/srv/parsoid]/mode: mode changed '0755' to " [puppet] - https://gerrit.wikimedia.org/r/271082 (owner: Subramanya Sastry) [00:50:47] the other is yurik [00:50:51] and it's self-merged?! [00:50:59] Krenair, ?
[00:51:11] what did i do :) [00:52:53] (PS2) Dzahn: Add me (hoo) to the wdqs icinga contact group [puppet] - https://gerrit.wikimedia.org/r/271132 (owner: Hoo man) [00:52:58] Krenair: hasher said this morning that they were going to pull to tin before actually deploying 14 [00:53:00] (CR) Dzahn: [C: 2] Add me (hoo) to the wdqs icinga contact group [puppet] - https://gerrit.wikimedia.org/r/271132 (owner: Hoo man) [00:53:12] well it's now actually deployed [00:53:14] Krenair: and that 14 hadn't been synced anywhere [00:53:24] and there is stuff sitting there [00:53:28] i thought exactly that is what should be avoided [00:53:41] (pull to tin, not deploy, later it gets surprise deployed) [00:53:47] they probably did actually pull to tin before doing that [00:53:57] did you check the status of wmf.14 when merging that? [00:54:22] Krenair, hashar said it was ok to merge a patch on 14 because it wasn't pulled to tin yet [00:54:30] if that's the problem [00:54:52] i cherry-picked it to 14 [00:55:24] yurik, not only did you do that, which may or may not have been okay, but the branch-branch commit that this is a cherry-pick of was self-merged [00:55:30] master-branch* [00:55:55] there seems to be a conflict in: hashar said they were going to pull to tin before deploying vs. hashar said it's ok to merge because it wasn't pulled to tin [00:57:17] tgr, if I deploy https://gerrit.wikimedia.org/r/#/c/271153/1 now are you able to test it? [00:57:22] without the wmf.14 part [00:57:29] Krenair, from #security: hashar, https://gerrit.wikimedia.org/r/#/c/271063/ [00:57:29] i merged it [00:57:29] yurik: great [00:57:29] hashar, so no need to depl / pull / sync, right?
[00:57:29] nop [00:57:30] thx [00:57:32] .14 hasn't been pushed to the app servers yet [00:57:34] it is only on tin for now [00:57:36] will do a mass pull before syncing [00:57:38] whenever .14 is unpaused [00:58:16] okay [00:58:35] In that case, I am going to merge the wmf.14 changes, leave them undeployed, and let hashar deal with it [00:59:25] Krenair, btw, would you mind switching zerowiki to the 2nd tier? we are not looking at it as closely as before [00:59:40] (deployment tier) [00:59:47] (PS2) Bmansurov: Run the survey at normal rate [mediawiki-config] - https://gerrit.wikimedia.org/r/270792 (https://phabricator.wikimedia.org/T125946) [00:59:51] you want to move zerowiki out of group0? [01:00:16] greg-g, ^ fyi [01:00:25] (CR) Aaron Schulz: "I'll amend it to do math/score/captcha first." [mediawiki-config] - https://gerrit.wikimedia.org/r/266609 (https://phabricator.wikimedia.org/T91869) (owner: Aaron Schulz) [01:00:42] (PS2) Dzahn: kibana,wdqs,wikimetrics: lint fix indentation [puppet] - https://gerrit.wikimedia.org/r/269903 [01:00:56] (CR) Dzahn: [C: 2] kibana,wdqs,wikimetrics: lint fix indentation [puppet] - https://gerrit.wikimedia.org/r/269903 (owner: Dzahn) [01:01:05] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/269903 (owner: Dzahn) [01:01:07] Krenair, correct [01:01:32] zero doesn't have a full time engineer to look at it [01:01:46] while I don't have an objection, I'd prefer that people involved with the train deployments deal with those changes [01:02:24] anomie, around?
[01:02:48] Krenair: Yes [01:03:02] sorry, tried to ping tgr earlier by mistake (name on the commit instead of calendar) [01:06:01] (PS1) BryanDavis: vagrant: Add upstart script to start container on boot [puppet] - https://gerrit.wikimedia.org/r/271171 (https://phabricator.wikimedia.org/T127129) [01:06:11] (PS2) Dzahn: ssh: fix lint, indentation [puppet] - https://gerrit.wikimedia.org/r/269609 [01:07:00] (Abandoned) Dzahn: wikitech.m.wikimedia.org -> silver, just showed portal [dns] - https://gerrit.wikimedia.org/r/268734 (https://phabricator.wikimedia.org/T120527) (owner: Dzahn) [01:07:39] (CR) Dzahn: [C: 2] ssh: fix lint, indentation [puppet] - https://gerrit.wikimedia.org/r/269609 (owner: Dzahn) [01:08:34] (PS2) Dzahn: pybal: fix lint, indentation [puppet] - https://gerrit.wikimedia.org/r/269610 [01:10:02] (PS3) Dzahn: dhcp: add es201[1-8] [puppet] - https://gerrit.wikimedia.org/r/271131 (https://phabricator.wikimedia.org/T126006) (owner: Papaul) [01:10:31] (CR) Dzahn: [C: 2] pybal: fix lint, indentation [puppet] - https://gerrit.wikimedia.org/r/269610 (owner: Dzahn) [01:10:37] anomie, on mw1017 [01:10:44] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/269610 (owner: Dzahn) [01:10:46] (wmf.13 only, so not testwiki) [01:10:52] (PS4) Dzahn: dhcp: add es201[1-8] [puppet] - https://gerrit.wikimedia.org/r/271131 (https://phabricator.wikimedia.org/T126006) (owner: Papaul) [01:10:59] Krenair: Seems to work.
[01:12:03] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/271131 (https://phabricator.wikimedia.org/T126006) (owner: Papaul) [01:13:13] !log krenair@tin Synchronized php-1.27.0-wmf.13/extensions/CentralAuth/includes/session/CentralAuthSessionProvider.php: https://gerrit.wikimedia.org/r/#/c/271153/ (duration: 01m 31s) [01:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:13:37] anomie, ^ [01:13:54] Krenair: Seems to work. [01:13:54] ACKNOWLEDGEMENT - salt-minion processes on sca1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion daniel_zahn puppet disabled with no reason specified [01:13:54] ACKNOWLEDGEMENT - salt-minion processes on sca1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion daniel_zahn puppet disabled with no reason specified [01:14:17] why is puppet and salt disabled on sca100x ? [01:14:25] does salt also have to be disabled? [01:14:32] anomie, cool, ty [01:14:39] no reason is specified [01:14:47] which means it could be a bug [01:14:52] mutante: perhaps some trebuchet work? [01:14:56] Operations, Editing-Department, Performance-Team, Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2033915 (Krinkle) >>! In T126700#2033767, @Tgr wrote: > For the increase in `centralauth:user` key, any change that caused session to be initialized more often cou... [01:15:02] PROBLEM - Host payments2001 is DOWN: PING CRITICAL - Packet loss = 100% [01:15:07] PROBLEM - Host payments2003 is DOWN: PING CRITICAL - Packet loss = 100% [01:15:34] PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 100% [01:15:36] mutante: last i know, trebuchet was trying to sync minions there for some services that we migrated to scb100x even though it shouldn't [01:15:36] would that happen to be the codfw pfw breaking again?
[01:16:02] mobrovac: i found it in SAL [01:16:06] 18:41 akosiaris: enable puppet and salt-minion on sca100{1,2}.eqiad.wmnet [01:16:09] 18:39 akosiaris: depool sca1001, sca1002 for citoid [01:16:11] this? [01:16:26] that was during the migration [01:16:35] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100% [01:16:42] which migration? [01:16:48] jessie reinstall? [01:16:55] mutante: i know it still tries to sync them, so i'd keep it as is for now [01:17:02] mutante: citoid from sca to scb [01:17:21] i would not even ask if there was a reason specified and it wasn't popping up in Icinga [01:17:23] PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100% [01:17:28] PROBLEM - Host rigel is DOWN: PING CRITICAL - Packet loss = 100% [01:17:28] mutante: oh and today or tomorrow akosiaris and kart_ should move cxserver [01:17:29] but i dont want to start ignoring icinga all the time [01:17:33] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [01:17:50] mobrovac: yep, i won't start anything [01:17:59] kk [01:19:38] i will keep asking about alerts though until we use scheduled downtime [01:20:21] RECOVERY - Host rigel is UP: PING OK - Packet loss = 0%, RTA = 36.86 ms [01:20:24] Got the fr stuff, but seems npthing can be donw there atm [01:20:26] RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 36.58 ms [01:20:31] RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 36.93 ms [01:20:37] RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 36.91 ms [01:20:41] RECOVERY - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 37.16 ms [01:20:46] RECOVERY - Host payments2001 is UP: PING OK - Packet loss = 0%, RTA = 37.37 ms [01:20:52] RECOVERY - Host payments2003 is UP: PING OK - Packet loss = 0%, RTA = 37.03 ms [01:20:53] chasemp: yes :/ yet another "pfw fell over" [01:21:01] it's getting a bit too regular for my taste [01:21:10] but nothing we can do right now [01:21:19] Yep and k [01:25:12] PROBLEM - 
check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 2 failures [01:25:14] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [01:26:58] !log labservices1001 - out of disk again (T126572) - moved designate-mdns.log files to /srv/var/ [01:27:00] Operations, Labs, Labs-Infrastructure: labservices1001 ran out of disk space - https://phabricator.wikimedia.org/T126572#2033935 (Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmflabs.org/sal/log/AVLs1l6w-0X0Il_jxsDV} [2016-02-17T01:26:57Z] labservices1001 - out of disk... [01:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:27:22] bd808's bot is awesome [01:28:44] Operations, Labs, Labs-Infrastructure: labservices1001 ran out of disk space - https://phabricator.wikimedia.org/T126572#2033936 (Dzahn) /dev/md0 9.1G 6.3G 2.3G 74% / ..but it will come back [01:28:53] RECOVERY - Disk space on labservices1001 is OK: DISK OK [01:30:12] PROBLEM - check_puppetrun on fdb2001 is CRITICAL: CRITICAL: Puppet has 2 failures [01:32:06] (CR) Dzahn: [C: 2] dhcp: add es201[1-8] [puppet] - https://gerrit.wikimedia.org/r/271131 (https://phabricator.wikimedia.org/T126006) (owner: Papaul) [01:32:15] (PS5) Dzahn: dhcp: add es201[1-8] [puppet] - https://gerrit.wikimedia.org/r/271131 (https://phabricator.wikimedia.org/T126006) (owner: Papaul) [01:32:22] (CR) Dzahn: [V: 2] dhcp: add es201[1-8] [puppet] - https://gerrit.wikimedia.org/r/271131 (https://phabricator.wikimedia.org/T126006) (owner: Papaul) [01:33:46] !log krenair@tin Synchronized php-1.27.0-wmf.13/extensions/VisualEditor/extension.json: https://gerrit.wikimedia.org/r/#/c/271174/ (duration: 01m 29s) [01:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:34:05] (CR) Dzahn: "how do i deploy this on the beta puppetmaster?"
[puppet] - https://gerrit.wikimedia.org/r/260937 (owner: Dzahn) [01:34:14] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [01:35:12] RECOVERY - check_puppetrun on fdb2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [01:36:11] !log krenair@tin Synchronized php-1.27.0-wmf.13/extensions/VisualEditor/modules/ve-mw/init/targets/ve.init.mw.DesktopArticleTarget.init.js: touch - https://phabricator.wikimedia.org/T125249#2012068 (duration: 01m 31s) [01:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:37:35] (CR) Dzahn: [C: 2] "confirmed these are not used with "watroles" tool" [puppet] - https://gerrit.wikimedia.org/r/269897 (owner: Tim Landscheidt) [01:37:44] (PS2) Dzahn: Tools: Remove obsolete roles [puppet] - https://gerrit.wikimedia.org/r/269897 (owner: Tim Landscheidt) [01:38:06] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/269897 (owner: Tim Landscheidt) [01:38:30] Operations, Editing-Department, Performance-Team, Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2033942 (Anomie) >>! In T126700#2033915, @Krinkle wrote: > Is that by design, or are you speculating about a potential bug that was introduced? To make session lo... [01:38:33] Operations, Editing-Department, Performance-Team, Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2033943 (Tgr) >>! In T126700#2033915, @Krinkle wrote: > Is that by design, or are you speculating about a potential bug that was introduced? Speculation about a p...
[01:40:17] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 2182 [01:42:45] (CR) Dzahn: "duplicate of https://gerrit.wikimedia.org/r/#/c/260187/ but i was waiting for a review from Yuvi" [puppet] - https://gerrit.wikimedia.org/r/270097 (owner: Tim Landscheidt) [01:44:02] (CR) Dzahn: [C: -1] "can't compile but i expect the same issue as for the other example changes that do this thing with role classes that are not the "foo::bar" [puppet] - https://gerrit.wikimedia.org/r/270025 (owner: Tim Landscheidt) [01:45:28] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [01:46:04] Operations, ops-eqiad, Patch-For-Review: decom iodine - https://phabricator.wikimedia.org/T126483#2033970 (Dzahn) @cmjohnson could you merge this ? https://gerrit.wikimedia.org/r/#/c/269739/ if it's ok with you that i remove all the entries. i just want to make sure the disk gets wiped [01:47:10] (CR) Alex Monk: "Off the top of my head, I think it's something like this:" [puppet] - https://gerrit.wikimedia.org/r/260937 (owner: Dzahn) [01:50:10] Operations, Ops-Access-Requests: Onboarding of Riccardo Coccioli - https://phabricator.wikimedia.org/T126425#2033988 (Dzahn) [01:50:12] Operations, Patch-For-Review: Add riccardo to icinga (contact/paging/permissions) - https://phabricator.wikimedia.org/T126431#2033985 (Dzahn) Open>Resolved a:Dzahn i think it's all done now, right [01:50:19] Operations: Add riccardo to icinga (contact/paging/permissions) - https://phabricator.wikimedia.org/T126431#2033989 (Dzahn) [01:51:02] Operations, Ops-Access-Requests: Onboarding of Riccardo Coccioli - https://phabricator.wikimedia.org/T126425#2014199 (Dzahn) @volans anything you are missing? maybe the Gerrit +2 thing? but you should have it if you are in the "ops" LDAP group.
[01:51:31] Operations, Ops-Access-Requests: Onboarding of Riccardo Coccioli - https://phabricator.wikimedia.org/T126425#2033994 (Dzahn) a:Volans [01:53:06] sent hashar a quick note about the wmf.14 changes lying about waiting for deployment [01:54:58] Operations, Ops-Access-Requests, Discovery-Search-Sprint: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2034011 (Krenair) Should the ops access have included ldap/ops as well? [02:04:04] (PS1) Papaul: Add production DNS for es201[1-9] Bug:T126006 [dns] - https://gerrit.wikimedia.org/r/271176 (https://phabricator.wikimedia.org/T126006) [02:09:21] Operations, ops-codfw, Patch-For-Review: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2034023 (Papaul) [02:11:03] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [24.0] [02:18:12] Operations, Editing-Department, Performance-Team, Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2034037 (Krinkle) >>! In T126700#2033942, @Anomie wrote: >>>! In T126700#2033915, @Krinkle wrote: >> Is that by design, or are you speculating about a potential bu... [02:25:20] Operations, Labs: Manual creation of labs account - https://phabricator.wikimedia.org/T125172#2034049 (Krenair) a:Krenair We really need to find new LDAP admins who will actually process things. I'll go ahead and try to fix your account given what I know and my existing permissions. [02:29:37] Operations, Labs: Manual creation of labs account - https://phabricator.wikimedia.org/T125172#2034052 (Krenair) a:Krenair>Cobi Okay, I followed Ryan's instructions, but a) from a clone of the puppet repo on terbium, so I can actually run the script and b) using wikitech's LDAP credentials instead of...
[02:39:55] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.13) (duration: 18m 59s) [02:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:54:01] (PS1) Yuvipanda: beta: Move parsoid logs off NFS [puppet] - https://gerrit.wikimedia.org/r/271183 (https://phabricator.wikimedia.org/T125624) [02:54:35] (PS2) Yuvipanda: beta: Move parsoid logs off NFS [puppet] - https://gerrit.wikimedia.org/r/271183 (https://phabricator.wikimedia.org/T125624) [02:55:27] (PS3) Yuvipanda: beta: Move parsoid logs off NFS [puppet] - https://gerrit.wikimedia.org/r/271183 (https://phabricator.wikimedia.org/T125624) [02:56:01] (CR) Yuvipanda: [C: 2] beta: Move parsoid logs off NFS [puppet] - https://gerrit.wikimedia.org/r/271183 (https://phabricator.wikimedia.org/T125624) (owner: Yuvipanda) [02:58:40] (PS1) Yuvipanda: beta: Followup Id0e9aa4525f773d086d5f4565f8384c94e595017 [puppet] - https://gerrit.wikimedia.org/r/271184 [02:59:27] (CR) Yuvipanda: [C: 2 V: 2] beta: Followup Id0e9aa4525f773d086d5f4565f8384c94e595017 [puppet] - https://gerrit.wikimedia.org/r/271184 (owner: Yuvipanda) [03:06:10] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.14) (duration: 10m 31s) [03:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:14:43] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [03:15:41] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Feb 17 03:15:40 UTC 2016 (duration 9m 30s) [03:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:23:02] (CR) Yuvipanda: [C: -2] "This brings in and sets up all of X, so nope."
[puppet] - https://gerrit.wikimedia.org/r/270638 (https://phabricator.wikimedia.org/T126933) (owner: Merlijn van Deen) [03:25:13] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] [03:30:48] Operations, Editing-Department, Performance-Team, Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2034097 (ori) > To make session loading and user auto-creation sane and predictable, SessionManager loads the session and (if necessary) auto-creates the user in S... [03:35:28] (CR) BBlack: Maps VCL initial forward-port to Varnish 4 (11 comments) [puppet] - https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) (owner: Ema) [03:42:53] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [24.0] [03:45:03] PROBLEM - salt-minion processes on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:45:03] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:45:14] PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:45:14] PROBLEM - dhclient process on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:45:43] PROBLEM - puppet last run on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:46:13] PROBLEM - Check size of conntrack table on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:46:17] (PS2) KartikMistry: CX: Remove the option to override the certificate of Yandex MT client [puppet] - https://gerrit.wikimedia.org/r/270927 [03:46:22] PROBLEM - RAID on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:46:23] PROBLEM - Disk space on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:52:15] Operations, Traffic, Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2034107 (ori) p:Unbreak!>Normal [04:09:02] PROBLEM - SSH on alsafi is CRITICAL: Server answer [04:10:44] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [04:21:33] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [04:32:21] PROBLEM - NTP on alsafi is CRITICAL: NTP CRITICAL: No response from NTP server [04:45:44] PROBLEM - SSH on alsafi is CRITICAL: Server answer [04:47:39] sigh @alsafi [04:47:50] i will ssh to it and nothing else and predict a recovery [04:49:36] hmm.. or not because i spoke too soon [04:51:02] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [04:51:12] [ganeti2001:~] $ sudo gnt-instance console alsafi.wikimedia.org [04:51:15] .. nothing [04:51:27] but recovery, yea right [04:56:08] (CR) Dzahn: [C: 1] phabricator: allow port 25 for ipv6 too [puppet] - https://gerrit.wikimedia.org/r/270960 (https://phabricator.wikimedia.org/T127053) (owner: ArielGlenn) [04:59:17] (CR) Dzahn: [C: 1] ores: Move role classes to module role [puppet] - https://gerrit.wikimedia.org/r/270102 (owner: Tim Landscheidt) [05:03:12] (PS6) Ori.livneh: add a dependency on xhprof/xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/251137 [05:03:45] (CR) jenkins-bot: [V: -1] add a dependency on xhprof/xhgui [mediawiki-config] - https://gerrit.wikimedia.org/r/251137 (owner: Ori.livneh) [05:05:39] (PS1) Ori.livneh: Dotfiles tweak [puppet] - https://gerrit.wikimedia.org/r/271193 [05:05:58] (PS2) Ori.livneh: Dotfiles tweak [puppet] - https://gerrit.wikimedia.org/r/271193 [05:06:08] (CR) Ori.livneh: [C: 2 V: 2] Dotfiles tweak [puppet] - https://gerrit.wikimedia.org/r/271193 (owner: Ori.livneh) [05:07:23] (PS2) Dzahn: Add production DNS for
es201[1-8] Bug:T126006 [dns] - 10https://gerrit.wikimedia.org/r/271176 (https://phabricator.wikimedia.org/T126006) (owner: 10Papaul) [05:08:30] (03CR) 10Dzahn: [C: 031] "checked against the data in the ticket, which server in which rack etc. looks all good to me. just adjusted the message that 2019 is being" [dns] - 10https://gerrit.wikimedia.org/r/271176 (https://phabricator.wikimedia.org/T126006) (owner: 10Papaul) [05:10:23] PROBLEM - SSH on alsafi is CRITICAL: Server answer [05:10:36] (03CR) 10Dzahn: [C: 031] "haven't tested because i'm actually not sure how to in this case, but more than +1 for the intention" [puppet] - 10https://gerrit.wikimedia.org/r/269902 (owner: 10Tim Landscheidt) [05:12:03] (03CR) 10Dzahn: "(maybe if the patch had the same number of lines + and -, so really just split up and no other fixes, that would help for merging)" [puppet] - 10https://gerrit.wikimedia.org/r/269902 (owner: 10Tim Landscheidt) [05:12:04] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [05:13:40] (03CR) 10Dzahn: [C: 04-1] "as it currently is, it would have to be "simple::lamp" j/k :p" [puppet] - 10https://gerrit.wikimedia.org/r/270106 (owner: 10Tim Landscheidt) [05:48:53] 6Operations, 10Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2034134 (10Dzahn) ``` 13:06 -!- Irssi: Join to #etherpad-lite-dev was synced in 2 secs 13:07 < mutante> hello, how can i export the raw text content of one specific pad? 13:07... [05:55:54] 6Operations, 10Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2034142 (10Dzahn) when i try to wget that pad i get "

You need a password to access this pad

? [05:57:13] 6Operations, 10Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2034143 (10Dzahn) [05:57:33] PROBLEM - SSH on alsafi is CRITICAL: Server answer [05:59:13] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [06:00:25] 6Operations, 10Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2034146 (10Dzahn) GET https://etherpad.wikimedia.org/socket.io/.. [HTTP/1.1 400 Bad Request 257ms] 21:58:52.781 Error: Failed assertion: **Invalid changeset (checkRep failed)... [06:14:03] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [24.0] [06:16:52] PROBLEM - SSH on alsafi is CRITICAL: Server answer [06:22:03] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [06:27:22] PROBLEM - SSH on alsafi is CRITICAL: Server answer [06:29:53] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: puppet fail [06:30:54] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:03] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [06:31:13] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: puppet fail [06:31:13] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:14] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:24] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:43] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [06:32:04] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:52] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:13] PROBLEM - SSH on alsafi is CRITICAL: Server 
answer [06:45:43] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [06:48:23] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [06:53:18] 6Operations, 6Services: Migrate SCA cluster to SCB (Jessie and Node 4.2) - https://phabricator.wikimedia.org/T96017#2034232 (10KartikMistry) [06:53:20] 6Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 3 others: Migrate CXServer to Node 4.2 and Jessie - https://phabricator.wikimedia.org/T107307#2034231 (10KartikMistry) 5Open>3Resolved [06:53:49] 6Operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, and 3 others: Migrate CXServer to Node 4.2 and Jessie - https://phabricator.wikimedia.org/T107307#1491980 (10KartikMistry) This is done yesterday. cxserver is in scb cluster now. [06:56:13] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:57:03] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:57:13] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:57:23] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:24] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:42] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:58:13] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:52] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [24.0] [07:00:53] PROBLEM - SSH on 
alsafi is CRITICAL: Server answer [07:04:24] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [07:05:47] 6Operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, and 3 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2034349 (10KartikMistry) [07:06:21] 6Operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 6Services, and 3 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2034350 (10KartikMistry) a:3KartikMistry [07:14:03] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [07:21:53] PROBLEM - SSH on alsafi is CRITICAL: Server answer [07:25:23] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [07:25:44] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [07:33:49] (03PS1) 10Legoktm: Output PHP version before running phpunit tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271203 [07:34:13] PROBLEM - SSH on alsafi is CRITICAL: Server answer [07:35:14] 6Operations, 10Traffic, 5Patch-For-Review: openssl-1.0.2f introduced minor bug with nginx - https://phabricator.wikimedia.org/T126616#2034405 (10MoritzMuehlenhoff) Proposed patch by upstream at https://trac.nginx.org/nginx/ticket/901#comment:4 (but not yet merged into nginx Mercurial) [07:38:02] (03CR) 10Legoktm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271203 (owner: 10Legoktm) [07:40:49] ^ ori, done [07:44:03] (03PS1) 10Smalyshev: Checkout submodules for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/271204 [07:46:44] PROBLEM - puppet last run on mw2075 is CRITICAL: CRITICAL: puppet fail [07:48:03] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [07:52:33] PROBLEM - High load average on labstore1001 is 
CRITICAL: CRITICAL: 60.00% of data above the critical threshold [24.0] [07:58:32] PROBLEM - SSH on alsafi is CRITICAL: Server answer [08:00:22] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [08:09:03] PROBLEM - SSH on alsafi is CRITICAL: Server answer [08:10:52] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [08:14:44] RECOVERY - puppet last run on mw2075 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [08:17:04] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [08:18:42] RECOVERY - salt-minion processes on sca1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:18:54] RECOVERY - salt-minion processes on sca1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:19:24] PROBLEM - SSH on alsafi is CRITICAL: Server answer [08:19:39] !log enabled puppet on sca1001, sca1002 [08:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:19:50] !log gnt-instance reboot alsafi.wikimedia.org [08:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:20:52] RECOVERY - Disk space on alsafi is OK: DISK OK [08:20:52] RECOVERY - Check size of conntrack table on alsafi is OK: OK: nf_conntrack is 0 % full [08:20:52] RECOVERY - RAID on alsafi is OK: OK: no RAID installed [08:21:13] RECOVERY - SSH on alsafi is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u1 (protocol 2.0) [08:21:13] RECOVERY - salt-minion processes on alsafi is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:21:13] RECOVERY - configured eth on alsafi is OK: OK - interfaces up [08:21:23] RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient [08:21:24] RECOVERY - DPKG on alsafi is OK: All packages OK [08:22:12] RECOVERY - puppet last run on alsafi 
is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:39:42] RECOVERY - NTP on alsafi is OK: NTP OK: Offset -0.0008004903793 secs [08:55:34] (03CR) 10Gergő Tisza: [C: 031] Remove $wgMWOAuthGrantPermissions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264438 (owner: 10Anomie) [09:08:14] (03CR) 10Alexandros Kosiaris: [C: 032] Checkout submodules for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/271204 (owner: 10Smalyshev) [09:08:21] 6Operations, 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: translatewiki-puppetlint-strict does not honor puppet-lint.rc file in /puppet - https://phabricator.wikimedia.org/T116552#2034589 (10siebrand) >>! In T116552#1753624, @hashar wrote: > The Jenkins job is a template '{name}-puppetlint-... [09:14:58] 6Operations, 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: translatewiki-puppetlint-strict does not honor puppet-lint.rc file in /puppet - https://phabricator.wikimedia.org/T116552#2034608 (10hashar) @siebrand eek sorry. So for translatewiki repo you can create: `/puppet/.puppet-lint.rc`: `... [09:23:05] 6Operations, 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: translatewiki-puppetlint-strict does not honor puppet-lint.rc file in /puppet - https://phabricator.wikimedia.org/T116552#2034618 (10siebrand) Thanks for the help, @hashar. That fixed it. [09:23:13] 6Operations, 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: translatewiki-puppetlint-strict does not honor puppet-lint.rc file in /puppet - https://phabricator.wikimedia.org/T116552#2034619 (10siebrand) 5Open>3Resolved a:3siebrand [09:27:51] 6Operations, 10Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2034623 (10akosiaris) That's the second pad that we see corrupted in this way. https://github.com/ether/etherpad-lite/issues/2107 is the upstream issue for this. Last time w... 
[09:28:41] (03CR) 10Alexandros Kosiaris: [C: 032] CX: Remove the option to override the certificate of Yandex MT client [puppet] - 10https://gerrit.wikimedia.org/r/270927 (owner: 10KartikMistry) [09:28:50] (03PS3) 10Alexandros Kosiaris: CX: Remove the option to override the certificate of Yandex MT client [puppet] - 10https://gerrit.wikimedia.org/r/270927 (owner: 10KartikMistry) [09:28:56] (03CR) 10Alexandros Kosiaris: [V: 032] CX: Remove the option to override the certificate of Yandex MT client [puppet] - 10https://gerrit.wikimedia.org/r/270927 (owner: 10KartikMistry) [09:32:43] 6Operations, 10Ops-Access-Requests: Onboarding of Riccardo Coccioli - https://phabricator.wikimedia.org/T126425#2034624 (10Volans) 5Open>3Resolved @dzahn yes I have the +2 on Gerrit. Resolving this. Thanks for everything. [09:39:48] (03PS1) 10Elukey: Remove kafka1013 from mediawiki's kafka brokers list for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271214 [09:44:03] just wanted to say thanks for the great docs on wiki [09:47:17] (03PS2) 10Filippo Giunchedi: swift: run swift-drive-audit staggered once a day [puppet] - 10https://gerrit.wikimedia.org/r/270970 (https://phabricator.wikimedia.org/T126574) [09:51:43] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2034670 (10Tgr) Whether responding to a certain request will require session handling is not in general predictable (cf T104755#2034145), so making application boots... 
[09:55:40] !log depool restbase1008 for raid expansion T119935 [09:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:55:40] 6Operations, 10hardware-requests, 5Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2034674 (10Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmflabs.org/sal/log/AVLup4u2hQaf1CQcCfYG} [2016-02-17T09:55:03Z] ... [09:57:17] !log stop cassandra-a on restbase1008 for raid expansion [09:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:01:58] 6Operations, 5Continuous-Integration-Scaling, 7Nodepool, 7WorkType-NewFunctionality: Backport python-shade from debian/testing to jessie-wikimedia - https://phabricator.wikimedia.org/T107267#2034676 (10hashar) [[ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=804550 | Debian bug 804550 ]] got closed and... [10:10:26] 6Operations, 10hardware-requests, 5Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2034696 (10fgiunchedi) I've ran the mdadm expansion on restbase1008, this time with restbase and cassandra stopped and `/srv` unmounted, currently rebui... 
[10:35:12] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 722 [10:40:12] RECOVERY - check_mysql on db1008 is OK: Uptime: 2487715 Threads: 2 Questions: 17538128 Slow queries: 16654 Opens: 5402 Flush tables: 2 Open tables: 402 Queries per second avg: 7.049 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:40:44] !log refreshed dns server config [10:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:42:04] which was a noop because the change was already applied [10:52:10] 6Operations, 10ops-codfw, 5Patch-For-Review: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2034765 (10jcrespo) Volans did a reimage of another server, so I can handle these ones. I am missing the MAC of the latest server (2009) and proper IPs for the non-mgmt interf... [11:12:57] 6Operations: ntp restart sometimes unrealiable - https://phabricator.wikimedia.org/T126733#2034824 (10MoritzMuehlenhoff) ntpd restarts on trusty are also fairly unreliable; when restarting the ntp service on mw2*, 6 out of 213 hosts failed to restart the ntpd service since they had a stale PID in /var/run/ntpd.pid [11:20:50] 6Operations, 10Traffic, 5Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2034832 (10ema) [11:29:50] !log rolling schema change on wikidatawiki (s5) [11:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:30:30] <_joe_> uhm icinga-wm is silent? [11:30:55] * Ping reply from icinga-wm: 0.67 second(s) [11:32:20] I have no idea what those are [11:32:30] 6Operations: mw2173 mystery install - https://phabricator.wikimedia.org/T126694#2034847 (10Joe) 5Open>3Invalid a:3Joe [11:32:31] network? 
[11:33:21] no mysql has soft-crashed there [11:33:39] 6Operations: mw2173 mystery install - https://phabricator.wikimedia.org/T126694#2021183 (10Joe) This server has been "partly" reimaged by papaul. [11:37:01] icinga also got hurt, probably [11:38:59] jynus: what are we changing? [11:39:12] just curious [11:39:30] aude, wait, there are production problems, that has been postponed [11:39:39] oh [11:43:31] so we probably have to tune down the backups because the had becomed too aggresive [11:44:02] it could be just the *have become [11:44:22] it could be issues on that host, too [11:45:20] swap is horrible there [11:46:57] I will bring it down when backups finish [11:47:41] !log restarting hhvm on mw2* to put glibc update into effect [11:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:48:07] And we need to change the replication alerts into 2: heartbeat lag and replication errors [11:48:56] I do not care if "io thread is stopped", I care: is replication working as it should? and how much time behind is that slave? [11:49:20] there is also a problem with icinga configuration [11:49:35] non-critical alterts paging [11:53:53] aude, T62539 [11:53:53] T62539: [Task] Convert wb_terms term_row_id from INT to BIGINT on wikidatawiki - https://phabricator.wikimedia.org/T62539 [11:54:42] ah, ok [11:54:55] !log restarting schema change on wikidatawiki (s5) T62539 [11:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:59:05] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 59.09% of data above the critical threshold [5000000.0] [11:59:47] !log bump stripe_cache_size to 10240 for md2 on restbase1008 [12:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:10:55] PROBLEM - RAID on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
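The split proposed at 11:48 above (one alert for heartbeat lag, a separate one for replication errors) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the check that was deployed: `ReplicaSample`, its field names, and the 60 s / 300 s thresholds are hypothetical, and the heartbeat timestamp is assumed to come from some heartbeat table replicated from the master. The return codes follow the usual Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass
class ReplicaSample:
    """Hypothetical snapshot of one replica's state (names are assumptions)."""
    heartbeat_ts: datetime      # newest master heartbeat visible on the replica
    last_io_error: str = ""     # empty string means no IO thread error
    last_sql_error: str = ""    # empty string means no SQL thread error


def check_lag(sample, now,
              warn=timedelta(seconds=60),    # assumed threshold
              crit=timedelta(seconds=300)):  # assumed threshold
    """Answers 'how far behind is that slave?' from the heartbeat age only."""
    lag = now - sample.heartbeat_ts
    seconds = int(lag.total_seconds())
    if lag >= crit:
        return CRITICAL, f"lag {seconds}s"
    if lag >= warn:
        return WARNING, f"lag {seconds}s"
    return OK, f"lag {seconds}s"


def check_errors(sample):
    """Answers 'is replication working as it should?' independently of lag."""
    errors = [e for e in (sample.last_io_error, sample.last_sql_error) if e]
    if errors:
        return CRITICAL, "replication errors: " + "; ".join(errors)
    return OK, "no replication errors"


if __name__ == "__main__":
    now = datetime(2016, 2, 17, 12, 19, 0, tzinfo=timezone.utc)
    # A replica 327 seconds behind with no thread errors, like db1045 in the log.
    sample = ReplicaSample(heartbeat_ts=now - timedelta(seconds=327))
    print(check_lag(sample, now))
    print(check_errors(sample))
```

Keeping the two checks separate means a lagging but otherwise healthy replica (as during the s5 schema change) and a replica with a broken thread can be routed to different notification levels.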
[12:12:35] RECOVERY - RAID on restbase1008 is OK: OK: Active: 6, Working: 7, Failed: 0, Spare: 1 [12:13:08] (03PS1) 10ArielGlenn: dumpscheduler: stages files can now specify 'max' for jobs to run on a host [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/271237 [12:13:25] PROBLEM - HTTPS on dataset1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [12:16:07] ^ dataset1001 was me, poor timing of icinga check during service restart [12:19:00] PROBLEM - MariaDB Slave Lag: s5 on db1045 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 327 [12:19:01] PROBLEM - puppet last run on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:19:01] PROBLEM - Disk space on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:19:04] PROBLEM - Check size of conntrack table on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:19:15] <_joe_> uhm just got paged [12:19:16] PROBLEM - RAID on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:19:25] <_joe_> jynus: you know about this slave lag? [12:19:25] PROBLEM - salt-minion processes on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:19:25] that is the alter table [12:19:25] PROBLEM - configured eth on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:19:26] s5 [12:19:36] afecting the slowest servers [12:19:45] PROBLEM - dhclient process on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:19:46] PROBLEM - DPKG on alsafi is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:19:47] hmm that alsafi issue is recurring [12:19:49] within the margin of acceptable [12:20:02] PROBLEM - MariaDB Slave Lag: s5 on db1026 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 318 [12:20:21] PROBLEM - MariaDB Slave Lag: s5 on db1049 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 317 [12:21:15] recent changes will be a bit delayed for 5 minutes [12:21:34] (03PS1) 10ArielGlenn: make dump stages template more flexible [puppet] - 10https://gerrit.wikimedia.org/r/271239 [12:21:59] * apergos does the backread [12:22:04] having a master with 96G of memory and (old) slaves with 64G doesn't help [12:22:28] which are exactly the ones that could not keep up with the master, even pre-warmed [12:23:11] we still have 3 servers health [12:23:15] y [12:26:56] I expect the RECOVERY soon [12:27:06] RECOVERY - dhclient process on alsafi is OK: PROCS OK: 0 processes with command name dhclient [12:27:08] RECOVERY - DPKG on alsafi is OK: All packages OK [12:28:17] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [12:28:17] RECOVERY - Disk space on alsafi is OK: DISK OK [12:28:24] RECOVERY - Check size of conntrack table on alsafi is OK: OK: nf_conntrack is 0 % full [12:28:27] RECOVERY - RAID on alsafi is OK: OK: no RAID installed [12:28:36] RECOVERY - salt-minion processes on alsafi is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:28:45] RECOVERY - configured eth on alsafi is OK: OK - interfaces up [12:30:02] I mroe failover [12:30:52] and we should be out of the woods now [12:31:39] !log restarting apache on krypton for glibc update [12:31:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:35:15] PROBLEM - RAID on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:35:36] even codfw recovers before old production servers [12:36:15] 6Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584#2034902 (10faidon) The patch @GWicke linked to (a data corruption bug of discard on top of MD linear/raid0/raid10) was reworked and finally merged upstream as f3f5da624e0a891c34d8cd513c57f1d9b0c7dadc, aka v4.... [12:37:16] any time now [12:38:42] RECOVERY - MariaDB Slave Lag: s5 on db1026 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [12:38:57] :-) [12:38:57] confirmed rc's are back in sync [12:39:02] RECOVERY - MariaDB Slave Lag: s5 on db1049 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [12:39:04] RECOVERY - RAID on restbase1008 is OK: OK: Active: 6, Working: 7, Failed: 0, Spare: 1 [12:39:18] those three b*** must die! [12:39:35] note the <50 ending [12:39:51] RECOVERY - MariaDB Slave Lag: s5 on db1045 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [12:40:59] can more memory be shoved in to them or are they too old to bother? 
[12:41:08] so, it was worse than I thought, I would expect only 1-2 minutes, it went up to 7 [12:41:16] *had expected [12:41:17] (03CR) 10Tim Landscheidt: "@Dzahn: That's impossible partly because the pre-change classes use nesting and partly because preserving empty lines at the end means git" [puppet] - 10https://gerrit.wikimedia.org/r/269902 (owner: 10Tim Landscheidt) [12:45:19] (03PS1) 10ArielGlenn: dumps: fix typo preventing cleanup of old files on recompression job reruns [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/271244 [12:46:22] there were around 1000 errors visible to the users before and after, even if most of those are probably api-related, it is 2000 errors more than I would have liked: https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?from=1455705096708&to=1455711356048 [12:46:41] (03CR) 10ArielGlenn: [C: 032] dumps: fix typo preventing cleanup of old files on recompression job reruns [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/271244 (owner: 10ArielGlenn) [12:47:57] (03Abandoned) 10Tim Landscheidt: rcstream: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270107 (owner: 10Tim Landscheidt) [12:47:58] finer failover control would had helped quite a lot [12:50:08] also playing around with replication, which is why I do not want to merge https://gerrit.wikimedia.org/r/#/c/270584/ [12:50:42] right [12:51:20] !log rolling restart of ms-fe2* [12:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:52:25] mariadb10's online alter would help in other cases, but not here (PK change) and not yet (only s2 so far) [12:58:05] (03CR) 10Tim Landscheidt: "Oh, yes, I did miss your change. It needs to be rebased, though, to reflect the changes to modules/quarry/manifests/init.pp since then." 
[puppet] - 10https://gerrit.wikimedia.org/r/270097 (owner: 10Tim Landscheidt) [12:58:55] (03Abandoned) 10Tim Landscheidt: simplelamp: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270106 (owner: 10Tim Landscheidt) [13:00:03] 6Operations, 10hardware-requests, 5Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2034936 (10fgiunchedi) after bumping `/sys/block/md2/md/stripe_cache_size` to 32470 speed has increased to ~20MB/s ``` Personalities : [raid1] [raid0]... [13:00:14] (03Abandoned) 10Tim Landscheidt: testsystem: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270105 (owner: 10Tim Landscheidt) [13:01:11] (03Abandoned) 10Tim Landscheidt: analytics: Move role classes to module role [puppet] - 10https://gerrit.wikimedia.org/r/270242 (owner: 10Tim Landscheidt) [13:01:26] (03Abandoned) 10Tim Landscheidt: transparency: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270230 (owner: 10Tim Landscheidt) [13:01:36] (03Abandoned) 10Tim Landscheidt: statsite: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270226 (owner: 10Tim Landscheidt) [13:01:44] (03Abandoned) 10Tim Landscheidt: simplestatic: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270223 (owner: 10Tim Landscheidt) [13:01:52] (03Abandoned) 10Tim Landscheidt: zotero: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270239 (owner: 10Tim Landscheidt) [13:02:00] (03Abandoned) 10Tim Landscheidt: xenon: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270237 (owner: 10Tim Landscheidt) [13:02:06] (03Abandoned) 10Tim Landscheidt: wikimania_scholarships: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270235 (owner: 10Tim Landscheidt) [13:03:01] (03Abandoned) 10Tim Landscheidt: yubiauth: Move role class to module role [puppet] - 
10https://gerrit.wikimedia.org/r/270238 (owner: 10Tim Landscheidt) [13:03:10] I will try to do something about dbstore1001 later [13:03:13] (03Abandoned) 10Tim Landscheidt: sca: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270218 (owner: 10Tim Landscheidt) [13:03:22] (03Abandoned) 10Tim Landscheidt: puppet_compiler: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270214 (owner: 10Tim Landscheidt) [13:03:33] (03Abandoned) 10Tim Landscheidt: pmacct: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270212 (owner: 10Tim Landscheidt) [13:03:44] (03Abandoned) 10Tim Landscheidt: statsdlb: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270225 (owner: 10Tim Landscheidt) [13:03:52] (03Abandoned) 10Tim Landscheidt: parsoid_vd_server: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270207 (owner: 10Tim Landscheidt) [13:04:00] (03Abandoned) 10Tim Landscheidt: wikistats: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270236 (owner: 10Tim Landscheidt) [13:04:07] (03Abandoned) 10Tim Landscheidt: racktables: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270216 (owner: 10Tim Landscheidt) [13:04:14] (03Abandoned) 10Tim Landscheidt: tcpircbot: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270227 (owner: 10Tim Landscheidt) [13:04:20] (03Abandoned) 10Tim Landscheidt: webperf: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270234 (owner: 10Tim Landscheidt) [13:04:26] (03Abandoned) 10Tim Landscheidt: wdqs: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270233 (owner: 10Tim Landscheidt) [13:04:33] (03Abandoned) 10Tim Landscheidt: ve: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270232 (owner: 10Tim Landscheidt) [13:04:39] (03Abandoned) 10Tim Landscheidt: poolcounter: Move role class to 
module role [puppet] - 10https://gerrit.wikimedia.org/r/270213 (owner: 10Tim Landscheidt) [13:05:12] (03CR) 10Elukey: [C: 032] "Discussed in security, merging the change to have it ready on tin." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271214 (owner: 10Elukey) [13:05:25] (03Abandoned) 10Tim Landscheidt: url_downloader: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270231 (owner: 10Tim Landscheidt) [13:05:35] (03Abandoned) 10Tim Landscheidt: torrus: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270229 (owner: 10Tim Landscheidt) [13:05:44] (03Abandoned) 10Tim Landscheidt: tendril: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270228 (owner: 10Tim Landscheidt) [13:05:53] (03Abandoned) 10Tim Landscheidt: piwik: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270210 (owner: 10Tim Landscheidt) [13:06:05] (03Abandoned) 10Tim Landscheidt: spare: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270224 (owner: 10Tim Landscheidt) [13:06:14] (03Abandoned) 10Tim Landscheidt: simplelap: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270222 (owner: 10Tim Landscheidt) [13:06:22] (03Abandoned) 10Tim Landscheidt: parsoid_vd_client: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270206 (owner: 10Tim Landscheidt) [13:06:31] (03Abandoned) 10Tim Landscheidt: servermon: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270221 (owner: 10Tim Landscheidt) [13:06:40] (03Abandoned) 10Tim Landscheidt: parsoid_rt_server: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270205 (owner: 10Tim Landscheidt) [13:06:41] !log stopping kafka on kafka1013 [13:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:06:45] (03Abandoned) 10Tim Landscheidt: sentry: Move role class to module role [puppet] - 
10https://gerrit.wikimedia.org/r/270220 (owner: 10Tim Landscheidt) [13:06:49] (03Abandoned) 10Tim Landscheidt: parsoid_rt_client: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270204 (owner: 10Tim Landscheidt) [13:06:55] (03Abandoned) 10Tim Landscheidt: scb: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270219 (owner: 10Tim Landscheidt) [13:06:59] (03Abandoned) 10Tim Landscheidt: rancid: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270217 (owner: 10Tim Landscheidt) [13:07:03] (03Abandoned) 10Tim Landscheidt: pybal_config: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270215 (owner: 10Tim Landscheidt) [13:07:08] (03Abandoned) 10Tim Landscheidt: ntp: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270201 (owner: 10Tim Landscheidt) [13:07:12] (03Abandoned) 10Tim Landscheidt: kibana: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270194 (owner: 10Tim Landscheidt) [13:07:17] (03Abandoned) 10Tim Landscheidt: noc: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270200 (owner: 10Tim Landscheidt) [13:07:26] (03Abandoned) 10Tim Landscheidt: planet: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270211 (owner: 10Tim Landscheidt) [13:07:35] (03Abandoned) 10Tim Landscheidt: phragile: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270209 (owner: 10Tim Landscheidt) [13:07:41] (03Abandoned) 10Tim Landscheidt: performance: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270208 (owner: 10Tim Landscheidt) [13:07:46] (03Abandoned) 10Tim Landscheidt: memcached: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270197 (owner: 10Tim Landscheidt) [13:11:50] PROBLEM - Kafka Broker Server on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka 
/etc/kafka/server.properties [13:12:39] ---^ this is me, forgot to turn off icinga [13:12:42] sorry :) [13:13:58] elukey: no worries [13:14:17] <_joe_> elukey: all seems good on the appservers side [13:14:30] <_joe_> you did not sync your change, did you? [13:14:33] nope [13:14:39] only merged in gerrit [13:15:01] I have a shell on tin ready to go [13:15:12] <_joe_> I think you should sync it now [13:15:26] (03Abandoned) 10Tim Landscheidt: horizon: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270185 (owner: 10Tim Landscheidt) [13:15:35] (03Abandoned) 10Tim Landscheidt: mw_rc_irc: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270199 (owner: 10Tim Landscheidt) [13:16:33] _joe_ I thought we were trying the reboot first (I only stopped the service, waiting for Joseph to confirm that Event Logging is fine) [13:16:44] I can sync it anyway if you want [13:17:48] <_joe_> nopr [13:17:52] <_joe_> reboot then [13:18:22] 6Operations, 10hardware-requests, 5Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2034981 (10fgiunchedi) after experimenting with readahead settings to 16384 for md2 and set `/sys/block/md2/md/preread_bypass_threshold` to 0 there does... 
[13:18:29] yep I'll wait a second for Event Logging's recovery, otherwise there might be the chance to lose data [13:19:48] (03Abandoned) 10Tim Landscheidt: icinga: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270186 (owner: 10Tim Landscheidt) [13:19:55] <_joe_> elukey: ping me when you reboot [13:19:56] (03Abandoned) 10Tim Landscheidt: ipsec: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270189 (owner: 10Tim Landscheidt) [13:20:05] (03Abandoned) 10Tim Landscheidt: ipmi: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270188 (owner: 10Tim Landscheidt) [13:20:13] (03Abandoned) 10Tim Landscheidt: graphoid: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270184 (owner: 10Tim Landscheidt) [13:20:21] (03Abandoned) 10Tim Landscheidt: grafana: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270183 (owner: 10Tim Landscheidt) [13:20:28] (03Abandoned) 10Tim Landscheidt: gitblit: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270182 (owner: 10Tim Landscheidt) [13:20:32] <_joe_> uhm why is tim abandoning all those patches?
[13:20:35] (03Abandoned) 10Tim Landscheidt: gdash: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270181 (owner: 10Tim Landscheidt) [13:20:39] <_joe_> he's not here I see [13:20:43] (03Abandoned) 10Tim Landscheidt: extdist: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270179 (owner: 10Tim Landscheidt) [13:20:56] (03Abandoned) 10Tim Landscheidt: ipv6relay: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270190 (owner: 10Tim Landscheidt) [13:21:00] RECOVERY - Kafka Broker Server on kafka1013 is OK: PROCS OK: 1 process with command name java, args kafka.Kafka /etc/kafka/server.properties [13:21:01] (03Abandoned) 10Tim Landscheidt: librenms: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270195 (owner: 10Tim Landscheidt) [13:21:06] (03Abandoned) 10Tim Landscheidt: mathoid: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270196 (owner: 10Tim Landscheidt) [13:21:17] (03Abandoned) 10Tim Landscheidt: jsbench: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270193 (owner: 10Tim Landscheidt) [13:21:25] (03Abandoned) 10Tim Landscheidt: jobqueue_redis: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270192 (owner: 10Tim Landscheidt) [13:21:33] (03Abandoned) 10Tim Landscheidt: iegreview: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270187 (owner: 10Tim Landscheidt) [13:21:40] _joe_: they are marked as abandoned with a link to https://gerrit.wikimedia.org/r/#/c/270107/ [13:21:57] _joe_ proceeding with the reboot [13:22:08] (03Abandoned) 10Tim Landscheidt: ircyall: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270191 (owner: 10Tim Landscheidt) [13:22:13] (03Abandoned) 10Tim Landscheidt: mobileapps: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270198 (owner: 10Tim Landscheidt) [13:22:24] PROBLEM - 
Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [13:22:35] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [13:23:01] ---^ this is me waiting to merge the kafka broker removal [13:23:07] (03CR) 10Giuseppe Lavagetto: "This has nothing to do with the role keyword, it's just a namespace collision and it's not caused by that function; we should just move th" [puppet] - 10https://gerrit.wikimedia.org/r/270107 (owner: 10Tim Landscheidt) [13:23:11] (03Abandoned) 10Tim Landscheidt: ocg: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270202 (owner: 10Tim Landscheidt) [13:23:43] (03Abandoned) 10Tim Landscheidt: otrs: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270203 (owner: 10Tim Landscheidt) [13:23:43] !log rebooted kafka1013 for maintenance [13:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:50] (03Abandoned) 10Tim Landscheidt: ganeti: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270180 (owner: 10Tim Landscheidt) [13:23:51] <_joe_> elukey: all seems ok for now [13:23:56] (03Abandoned) 10Tim Landscheidt: etcd: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270177 (owner: 10Tim Landscheidt) [13:24:01] (03Abandoned) 10Tim Landscheidt: cxserver: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270175 (owner: 10Tim Landscheidt) [13:24:06] (03Abandoned) 10Tim Landscheidt: cassandra: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270173 (owner: 10Tim Landscheidt) [13:24:11] (03Abandoned) 10Tim Landscheidt: annualreport: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270171 (owner: 10Tim Landscheidt) [13:24:16] (03Abandoned) 
10Tim Landscheidt: aqs: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270168 (owner: 10Tim Landscheidt) [13:24:24] (03Abandoned) 10Tim Landscheidt: etherpad: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270178 (owner: 10Tim Landscheidt) [13:24:30] (03Abandoned) 10Tim Landscheidt: diamond: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270176 (owner: 10Tim Landscheidt) [13:24:39] (03Abandoned) 10Tim Landscheidt: citoid: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270174 (owner: 10Tim Landscheidt) [13:24:50] (03Abandoned) 10Tim Landscheidt: archiva: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270172 (owner: 10Tim Landscheidt) [13:24:59] (03Abandoned) 10Tim Landscheidt: smokeping: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270170 (owner: 10Tim Landscheidt) [13:25:10] (03Abandoned) 10Tim Landscheidt: access_new_install: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270167 (owner: 10Tim Landscheidt) [13:26:03] <_joe_> elukey: you can merge the change if you want, or revert it [13:27:37] _joe_ the node is up and running, reverting :) [13:27:55] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 59.09% of data above the critical threshold [10.0] [13:31:06] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 60.71% of data above the critical threshold [10.0] [13:31:24] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [10.0] [13:31:35] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [10.0] [13:32:25] ---^ these alarms needs to be tuned [13:35:27] (03PS1) 10Elukey: Revert "Remove kafka1013 from mediawiki's kafka brokers list for 
maintenance." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271247 [13:36:16] (03CR) 10Elukey: [C: 032] Revert "Remove kafka1013 from mediawiki's kafka brokers list for maintenance." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271247 (owner: 10Elukey) [13:42:15] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] [13:42:25] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] [13:42:34] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1018 is OK: OK: Less than 50.00% above the threshold [1.0] [13:45:44] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] [13:48:59] 6Operations, 10Ops-Access-Requests, 3Discovery-Search-Sprint: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2035061 (10fgiunchedi) @krenair, you are indeed correct, ops ldap group membership was missing and added now [13:50:14] 6Operations, 10Ops-Access-Requests, 3Discovery-Search-Sprint: Access for new Discovery OpsEng: Guillaume Lederrey - https://phabricator.wikimedia.org/T125651#2035064 (10Gehel) @Krenair, @fgiunchedi: Thanks for taking care of the lost newbie that I am! [13:50:48] 6Operations, 6Phabricator, 5Patch-For-Review: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2035065 (10fgiunchedi) >>! In T127053#2032649, @mmodell wrote: > From what I can remember, iridium hasn't had full ipv6 support until recently. I guess it's taken more... [13:57:34] (03PS1) 10Jcrespo: Repool db1022 as regular traffic + API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271249 (https://phabricator.wikimedia.org/T120122) [13:58:32] BTW, can I perform a repool real quick?^ [13:58:45] Krinkle, elukey ? 
[13:59:17] jynus: yep already finished, I reverted my changes and the host is up and running [13:59:29] elukey: it'll trigger a warnign though since neither commit went live [13:59:39] icinga will shout any minute now [13:59:57] but jynus can sync that out no worries, it was a no-op [14:00:01] so, ok to merge 2 extra changes? [14:00:31] (03CR) 10Jcrespo: [C: 032] Repool db1022 as regular traffic + API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271249 (https://phabricator.wikimedia.org/T120122) (owner: 10Jcrespo) [14:02:24] checke dhat only my change shows in HEAD~3 [14:02:29] *checked that [14:02:31] !log package upgrades on cp* commence [14:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:02:45] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [14:04:19] jynus: thanks, sorry for the extra work :) [14:04:24] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[14:04:31] 0 extra work actually [14:04:35] literally [14:05:08] as I merge out of hours, I always check diffs, just in case someone merge something for labs only, etc [14:05:10] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1022 after maintenance (duration: 01m 34s) [14:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:05:43] jynus: thx [14:08:05] (03Abandoned) 10Jcrespo: Repool of db1022 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270975 (https://phabricator.wikimedia.org/T120122) (owner: 10Volans) [14:08:24] (03PS1) 10MarcoAurelio: Enable DynamicPageList for Wikimedia Norge chapter wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271252 (https://phabricator.wikimedia.org/T127161) [14:08:43] 6Operations, 7Icinga: improve icinga performance / solve general load issues on neon - https://phabricator.wikimedia.org/T85222#2035119 (10faidon) Thanks @Southparkfan for the thorough analysis. You're pretty much right on all counts :) Looking forward, we've thought of replacing the server with a more powerf... [14:08:54] 6Operations, 7Monitoring: icinga "max concurrent checks" limits reached - https://phabricator.wikimedia.org/T1242#2035125 (10faidon) [14:08:56] 6Operations, 7Icinga: improve icinga performance / solve general load issues on neon - https://phabricator.wikimedia.org/T85222#2035121 (10faidon) 5Open>3stalled p:5Normal>3Low a:3akosiaris [14:16:14] (03PS3) 10Giuseppe Lavagetto: nutcracker: re-organize redis servers list [puppet] - 10https://gerrit.wikimedia.org/r/270969 [14:16:16] (03PS1) 10Giuseppe Lavagetto: role::memcached: add a second redis instance on mc20{01,16} [puppet] - 10https://gerrit.wikimedia.org/r/271258 [14:16:18] (03PS1) 10Giuseppe Lavagetto: ipresolve: add PTR resolution, tests [puppet] - 10https://gerrit.wikimedia.org/r/271259 [14:16:20] (03PS1) 10Giuseppe Lavagetto: role::memcached: add cross-dc Ipsec for the various shards. 
[puppet] - 10https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470) [14:16:22] (03PS1) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [14:16:51] <_joe_> reviewers VERY welcome ^^ [14:20:11] (03CR) 10jenkins-bot: [V: 04-1] role::memcached: add cross-dc Ipsec for the various shards. [puppet] - 10https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [14:21:08] (03CR) 10jenkins-bot: [V: 04-1] role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [14:26:04] (03PS1) 10Bmansurov: Enable the structured language overlay and increase the instrumentation rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271264 (https://phabricator.wikimedia.org/T123980) [14:38:19] (03PS2) 10Giuseppe Lavagetto: role::memcached: add cross-dc Ipsec for the various shards. [puppet] - 10https://gerrit.wikimedia.org/r/271260 (https://phabricator.wikimedia.org/T126470) [14:38:20] (03PS2) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [14:40:12] (03CR) 10jenkins-bot: [V: 04-1] role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [14:41:44] 6Operations, 6Phabricator, 5Patch-For-Review: iridium / phabricator not accepting email via smtp - https://phabricator.wikimedia.org/T127053#2035189 (10mmodell) @fgiunchedi: indeed, that seems right to me. 
[14:41:47] (03PS1) 10Ottomata: 2.2.0-1 release [debs/python-pykafka] (debian) - 10https://gerrit.wikimedia.org/r/271267 [14:46:27] (03PS3) 10Alexandros Kosiaris: Update otrs.TicketExport2Mbox.pl help message [puppet] - 10https://gerrit.wikimedia.org/r/270310 [14:46:38] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Update otrs.TicketExport2Mbox.pl help message [puppet] - 10https://gerrit.wikimedia.org/r/270310 (owner: 10Alexandros Kosiaris) [14:49:43] (03PS2) 10Jcrespo: Use heartbeat when possible to check slave lag [puppet] - 10https://gerrit.wikimedia.org/r/253665 [14:51:11] (03PS1) 10Alexandros Kosiaris: otrs: Set MaxConnectionsPerChild to 4000 [puppet] - 10https://gerrit.wikimedia.org/r/271269 [14:54:39] (03PS3) 10Jcrespo: Use heartbeat when possible to check slave lag [puppet] - 10https://gerrit.wikimedia.org/r/253665 [14:54:43] (03PS2) 10Alexandros Kosiaris: otrs: Set MaxConnectionsPerChild to 4000 [puppet] - 10https://gerrit.wikimedia.org/r/271269 [14:54:49] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] otrs: Set MaxConnectionsPerChild to 4000 [puppet] - 10https://gerrit.wikimedia.org/r/271269 (owner: 10Alexandros Kosiaris) [14:58:29] (03PS4) 10Jcrespo: Use heartbeat when possible to check slave lag [puppet] - 10https://gerrit.wikimedia.org/r/253665 [14:58:33] (03PS1) 10Alexandros Kosiaris: Revert "otrs: Set MaxConnectionsPerChild to 4000" [puppet] - 10https://gerrit.wikimedia.org/r/271272 [14:58:51] (03PS2) 10Alexandros Kosiaris: Revert "otrs: Set MaxConnectionsPerChild to 4000" [puppet] - 10https://gerrit.wikimedia.org/r/271272 [14:58:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "otrs: Set MaxConnectionsPerChild to 4000" [puppet] - 10https://gerrit.wikimedia.org/r/271272 (owner: 10Alexandros Kosiaris) [15:02:30] (03PS1) 10BBlack: base::no_nfs_client - define and use on cp*/lvs* [puppet] - 10https://gerrit.wikimedia.org/r/271274 [15:03:47] (03CR) 10jenkins-bot: [V: 04-1] base::no_nfs_client - define and use on cp*/lvs* [puppet] - 
10https://gerrit.wikimedia.org/r/271274 (owner: 10BBlack) [15:05:01] (03PS2) 10BBlack: base::no_nfs_client - define and use on cp*/lvs* [puppet] - 10https://gerrit.wikimedia.org/r/271274 [15:05:57] <_joe_> bblack: be careful with labs [15:06:05] heh, good point [15:06:11] (03PS4) 10Giuseppe Lavagetto: Rationalize services definitions for labs too. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269955 [15:06:13] (03PS6) 10Giuseppe Lavagetto: Configure redis LockManager in both DCs, use the master everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266514 [15:06:15] (03PS6) 10Giuseppe Lavagetto: Add references to wmfServices for Cirrusearch. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266512 (https://phabricator.wikimedia.org/T114273) [15:06:17] (03PS6) 10Giuseppe Lavagetto: Use wmfMasterDatacenter for picking the master redis config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266513 (https://phabricator.wikimedia.org/T114273) [15:06:19] (03PS6) 10Giuseppe Lavagetto: Reduce poolcounter configuration complexity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266511 (https://phabricator.wikimedia.org/T114273) [15:06:21] (03PS15) 10Giuseppe Lavagetto: Define service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [15:06:47] <_joe_> sorry, GTG out for a bit [15:09:05] (03PS3) 10BBlack: base::no_nfs_client - define and use on cp*/lvs* [puppet] - 10https://gerrit.wikimedia.org/r/271274 [15:13:03] (03CR) 10BBlack: [C: 032] base::no_nfs_client - define and use on cp*/lvs* [puppet] - 10https://gerrit.wikimedia.org/r/271274 (owner: 10BBlack) [15:15:41] 6Operations, 10ops-codfw, 5Patch-For-Review: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2035239 (10Papaul) The reason you are missing the MAC address of es2019 and the production IP is because I left it so Volans can do that. 
Since he is no longer doing the re-i... [15:17:35] (03PS1) 10BBlack: no_nfs_client: fix service dep order [puppet] - 10https://gerrit.wikimedia.org/r/271275 [15:19:49] no one is deploying stuff now (on tin)? [15:20:06] greg said it would be ok if we deploy a bug fix before swat [15:20:57] swat is in ~40 mins [15:21:02] yeah [15:22:01] ok, i'll proceed... [15:23:23] (03PS1) 10Elukey: Add kafka1012 back to the pool of kafka brokers in wmf-config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271276 [15:26:32] (03CR) 10BBlack: [C: 032] no_nfs_client: fix service dep order [puppet] - 10https://gerrit.wikimedia.org/r/271275 (owner: 10BBlack) [15:27:43] (03PS2) 10Elukey: Add kafka1012 back to the pool of kafka brokers in wmf-config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271276 [15:30:46] I've rechecked the patch, and it still works, even if the server is stopped [15:31:20] (03PS5) 10Jcrespo: Use heartbeat when possible to check slave lag [puppet] - 10https://gerrit.wikimedia.org/r/253665 [15:31:31] which will be a great win [15:32:23] (03CR) 10Jcrespo: [C: 032 V: 032] Use heartbeat when possible to check slave lag [puppet] - 10https://gerrit.wikimedia.org/r/253665 (owner: 10Jcrespo) [15:32:32] (03CR) 10Tim Landscheidt: "@Joe: That the role() function is not the cause is what I wanted to prove by removing it (and it is indeed not the cause)." [puppet] - 10https://gerrit.wikimedia.org/r/270107 (owner: 10Tim Landscheidt) [15:33:08] !log stopping puppet on all database hosts (db, dbstore, es, etc.) for lag alert testing [15:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:21] urandom: GTG restbase metrics renaming? [15:34:02] godog: yup, shall i deploy to staging? 
[15:34:09] sure [15:34:10] palladium is not the salt master, palladium is not the salt master [15:34:17] k, un momento [15:35:09] (03PS2) 10ArielGlenn: dumpscheduler: stages files can now specify 'max' for jobs to run on a host [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/271237 [15:35:56] (03CR) 10ArielGlenn: [C: 032] dumpscheduler: stages files can now specify 'max' for jobs to run on a host [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/271237 (owner: 10ArielGlenn) [15:36:37] databases have way too many different hostname prefixes [15:37:19] and 1 too many different puppet roles [15:38:12] !log deploying restbase (15a6c50) in staging [15:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:34] 6Operations: Andrew gets pages many hours too late - https://phabricator.wikimedia.org/T127189#2035307 (10Andrew) 3NEW [15:39:34] not pcs, that have no replication [15:39:49] ok, deploying new check [15:39:56] 6Operations: Andrew gets pages many hours too late - https://phabricator.wikimedia.org/T127189#2035314 (10Andrew) {F3366460} [15:40:45] 6Operations: Andrew gets pages many hours too late - https://phabricator.wikimedia.org/T127189#2035318 (10Andrew) Note from the attached screenshot -- the pages themselves are dated 1:17 UTC, but they are arriving at my phone at 1:31AM CST, which is UTC-5. So, 5:10 minutes after they fired. [15:41:40] (03PS2) 10ArielGlenn: make dump stages template more flexible [puppet] - 10https://gerrit.wikimedia.org/r/271239 [15:43:05] !log restbase staging deploy (15a6c50) complete [15:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:43:09] godog: ^^ [15:43:33] (03CR) 10ArielGlenn: [C: 032] make dump stages template more flexible [puppet] - 10https://gerrit.wikimedia.org/r/271239 (owner: 10ArielGlenn) [15:44:46] urandom: ack, checking [15:45:04] good morning [15:46:50] the palladium pending merge was old me or someone else? 
[15:47:30] it was old me, but nagios comes late to the party [15:47:36] * aude deploying [15:49:02] abuse :P [15:49:50] gwicke: are the moves you listed in https://github.com/wikimedia/hyperswitch/pull/6 enough? Is restbase.external not a catch-all for everything note internal and internal-update? [15:49:59] s/note/not/ [15:50:11] urandom: yep looks like internal/ and external/ are populated now, good to go with deploy in production [15:50:21] yeah looks like also POST/HEAD/etc is included [15:50:36] godog: included in external, yes? [15:51:04] yeah, from staging [15:51:06] urandom: looking through the list, I think there are two old entries we can remove (ab and wikimedia) [15:51:29] godog: ok, there is very little changing in this deploy (read: it's low risk), but i'm going to do a short canary anyway [15:51:46] urandom: ack [15:51:52] and yes, then there are a few more that should be moved to external: {ALL,GET,HEAD,OPTIONS,POST,_robots} [15:51:53] !log package upgrades commencing on lvs* [15:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:17] !log canary deploy of restbase to restbase1001.eqiad.wmnet (15a6c50) [15:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:55] !log canary deploy of restbase to restbase1001.eqiad.wmnet (15a6c50) complete [15:53:58] !log aude@tin Synchronized php-1.27.0-wmf.13/extensions/Wikidata: Fix caching data types bug: T127095 (duration: 01m 44s) [15:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:03] godog: ^^ checking [15:55:03] !log copy restbase.private to restbase.internal on graphite1001 and graphite2001 [15:55:05] PROBLEM - DPKG on lvs1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master 
[15:55:24] PROBLEM - DPKG on lvs1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:55:25] PROBLEM - DPKG on lvs1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:55:34] ^ that's all me, ignore it [15:55:45] PROBLEM - DPKG on lvs1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:55:46] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:56:05] PROBLEM - DPKG on lvs1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:56:54] PROBLEM - DPKG on lvs1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:57:21] godog: updated the rename list at https://github.com/wikimedia/hyperswitch/pull/6#issuecomment-185026109 [15:57:42] gwicke: thanks! copying external now [15:58:11] godog: k, everything looks good [15:58:43] !log copy restbase.{v1_*,sys_*,ALL,GET,HEAD,POST,OPTIONS,_robots} to restbase.external on graphite1001 and graphite2001 [15:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:59:09] godog: shall we move forward so that any gaps in metrics are no larger than necessary? [15:59:46] urandom: yeah fine for me to continue [16:00:05] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160217T1600). Please do the needful. [16:00:05] mafk anomie: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:24] Present and avalaible [16:00:38] !log continuing production-wide restbase deploy (15a6c50) [16:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:45] RECOVERY - DPKG on lvs1004 is OK: All packages OK [16:00:52] you handling this anomie? 
[16:01:04] Krenair: I could, although I have no particular desire to [16:01:05] RECOVERY - DPKG on lvs1006 is OK: All packages OK [16:01:11] ok then [16:01:17] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: Connection refused [16:01:22] you are listed as a deployer [16:01:48] * anomie starts SWAT [16:02:15] (03CR) 10Anomie: [C: 032] Adding WP and WT as namespace aliases for tawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269970 (https://phabricator.wikimedia.org/T126604) (owner: 10MarcoAurelio) [16:02:35] RECOVERY - DPKG on lvs1005 is OK: All packages OK [16:02:48] 6Operations, 10ops-codfw, 5Patch-For-Review: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2035370 (10Papaul) I have been discussing with Jaime on IRC on the db.cfg not being really unattended since he is getting the "confirming partitioning write to disk" message.... [16:02:52] (03Merged) 10jenkins-bot: Adding WP and WT as namespace aliases for tawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/269970 (https://phabricator.wikimedia.org/T126604) (owner: 10MarcoAurelio) [16:02:59] FYI: I'm working on updating the sites tables across all wikis (slowly). There shouldn't be any impact at all, but I wanted to give you a heads up anyway# [16:03:04] RECOVERY - DPKG on lvs1001 is OK: All packages OK [16:03:11] etherpad? 
[16:03:24] RECOVERY - DPKG on lvs1003 is OK: All packages OK [16:03:25] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:03:35] RECOVERY - DPKG on lvs1002 is OK: All packages OK [16:03:51] also there are quite a lot of read only errors for enwiktionary (but i think this is not new) [16:04:03] think i heard this mentioned yesterday [16:04:08] aude: jynu,s is aware [16:04:12] yeah [16:05:01] !log anomie@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespace aliases for tawiki (task [[phab:T126604|]]) (duration: 01m 31s) [16:05:03] mafk: ^ Test please [16:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:35] anomie: on it [16:06:30] anomie: looks it works [16:06:34] Shanmugamp7: ping [16:06:41] can you test it as well, please? [16:06:45] (03PS2) 10Anomie: New user groups configuration for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270645 (https://phabricator.wikimedia.org/T126931) (owner: 10MarcoAurelio) [16:06:51] (03CR) 10Anomie: [C: 032] New user groups configuration for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270645 (https://phabricator.wikimedia.org/T126931) (owner: 10MarcoAurelio) [16:07:21] (03Merged) 10jenkins-bot: New user groups configuration for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270645 (https://phabricator.wikimedia.org/T126931) (owner: 10MarcoAurelio) [16:07:25] mafk: sure [16:07:31] Shanmugamp7: thank you [16:08:43] * mafk wonders how to test that cswiki stuff now :) [16:08:45] RECOVERY - RAID on labstore1001 is OK: OK: Active: 73, Working: 73, Failed: 0, Spare: 0 [16:09:17] !log restbase deploy stalled at restbase1008 (under maintenance) [16:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:21] hm, wait, not swat'd yet [16:09:28] !log anomie@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: New userrights 
and configuration for cswiki (task [[phab:T126931|]]) (duration: 01m 31s) [16:09:29] mafk: ^ Test please [16:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:45] (03PS1) 10Jcrespo: Fix typo that prevented from a fully unattended install [puppet] - 10https://gerrit.wikimedia.org/r/271280 [16:09:48] anomie: will do [16:09:53] need a couple of minutes [16:10:28] mafk: what would happen to the pages that was already there with WP: [16:10:42] !log restbase deploy restarting at restbase1009 [16:10:44] (03PS2) 10Jcrespo: Fix typo that prevented from a fully unattended install [puppet] - 10https://gerrit.wikimedia.org/r/271280 [16:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:45] 6Operations, 10MediaWiki-ResourceLoader, 10MediaWiki-Sites, 7Easy: Missing "poweredby_mediawiki" icon in the footer on MediaWiki - https://phabricator.wikimedia.org/T127194#2035403 (10Josve05a) 3NEW [16:10:55] anomie: cswiki stuff looks OK by special:listgrouprights [16:11:04] Shanmugamp7: you mean redirects? [16:11:14] yep [16:11:35] good question, but now WP: is a synonym for Wikipedia:, as Project: is [16:11:51] 6Operations, 10ops-eqiad: ms-be1008.eqiad.wmnet: slot=9 dev=sdj failed - https://phabricator.wikimedia.org/T127060#2035416 (10Cmjohnson) @fgiunchedi Swapped the disk and added the VD back. A reboot is probably necessary. [16:12:01] (03CR) 10Jcrespo: [C: 032] Fix typo that prevented from a fully unattended install [puppet] - 10https://gerrit.wikimedia.org/r/271280 (owner: 10Jcrespo) [16:12:38] mafk: yes, they no longer exists or unable to goto that page [16:12:55] 6Operations, 10ops-eqiad: Failed drive in labstore1001 array - https://phabricator.wikimedia.org/T127076#2035422 (10Cmjohnson) @chasemp Disk swapped but I do not do anything beyond physically replacing the disk [16:12:59] Shanmugamp7: but they were not content pages, right? 
[16:13:09] Operations: Andrew gets pages many hours too late - https://phabricator.wikimedia.org/T127189#2035424 (Josve05a) >>! In T127189#2035318, @Andrew wrote: > Note from the attached screenshot -- the pages themselves are dated 1:17 UTC, but they are arriving at my phone at 1:31AM CST, which is UTC-5. So, 5:10 min...
[16:13:11] I mean just redirects for the actual wiki pages?
[16:13:22] if so I think they can be deleted?
[16:13:24] just redirects yeah
[16:13:30] Operations, ops-eqiad: Failed drive in labstore1001 array - https://phabricator.wikimedia.org/T127076#2035427 (Cmjohnson) a:Cmjohnson>chasemp
[16:13:55] Operations, ops-eqiad: ms-be1008.eqiad.wmnet: slot=9 dev=sdj failed - https://phabricator.wikimedia.org/T127060#2035429 (Cmjohnson) a:fgiunchedi
[16:14:10] !log Re-populating the sites and site_identifiers table for all Wikipedias and testwikidata
[16:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:14:18] !log restbase deploy (15a6c50) completet
[16:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:14:42] Shanmugamp7, mafk: I should run the namespaceDupes.php maintenance script for you. That'll move all the old redirects into the right namespace.
[16:14:45] godog: ^^^
[16:14:57] anomie: yep, thanks
[16:15:08] anomie: ok thanks
[16:15:21] godog: i had to skip 1008 (obviously)
[16:15:43] urandom: ok! I'm watching https://graphite.wikimedia.org/render/?width=707&height=315&_salt=1455725718.805&from=-60minutes&target=restbase.external.ALL.ALL.sample_rate&target=restbase.internal.ALL.ALL.sample_rate&target=secondYAxis(restbase.ALL.ALL.sample_rate)&target=secondYAxis(restbase.private.ALL.ALL.sample_rate)
[16:16:00] !log Ran namespaceDupes.php on tawiki
[16:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:16:07] Shanmugamp7, mafk: Done.
[16:16:15] !log restbase deploy (15a6c50) complete, sans restbase1008.eqiad.wmnet (down for maintenance during deploy)
[16:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:16:39] mafk: Is the cswiki change all good?
[16:16:39] godog: we'll need to coordinate before repooling 1008 to make sure it is up to date
[16:16:59] urandom: *nod*
[16:17:05] anomie: from what I can see at special:listgrouprights, sysops are able now to add/remove the patroller and rollbacker groups
[16:17:14] ok
[16:17:14] works now anomie
[16:17:15] but I have no sysop account there to test myself
[16:17:23] (PS5) Anomie: Remove $wgMWOAuthGrantPermissions [mediawiki-config] - https://gerrit.wikimedia.org/r/264438
[16:17:26] Operations, Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2035449 (akosiaris) Following advice on https://github.com/ether/etherpad-lite/issues/2107 , pad has been rescued!!! I must admit this is the very first time we 've managed...
[16:17:33] (CR) Anomie: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/264438 (owner: Anomie)
[16:17:35] Operations, Patch-For-Review: Issues with partman/install server/autoinstall for db servers on Jessie - https://phabricator.wikimedia.org/T116902#2035450 (jcrespo) Fixed on https://gerrit.wikimedia.org/r/#/c/267328/ and https://gerrit.wikimedia.org/r/271280 (hopfully). Testing right now to close this ticket.
[16:17:36] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.007 second response time
[16:17:40] Operations, ops-eqiad, Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2035452 (Cmjohnson) Problem I am having is figuring out which disk is /dev/sdc
[16:18:10] !enabling puppet on all of codfw databases (it should be a noop there)
[16:18:12] (Merged) jenkins-bot: Remove $wgMWOAuthGrantPermissions [mediawiki-config] - https://gerrit.wikimedia.org/r/264438 (owner: Anomie)
[16:19:29] plus, those should not page
[16:19:38] Operations, hardware-requests, codfw-rollout, codfw-rollout-Jan-Mar-2016: Log host for codfw (fluorine's equivalent) - https://phabricator.wikimedia.org/T126988#2029020 (RobH)
[16:20:09] godog, urandom: I updated the first graph in https://grafana-admin.wikimedia.org/dashboard/db/restbase
[16:20:17] !log anomie@tin Synchronized wmf-config/CommonSettings.php: SWAT: Remove $wgMWOAuthGrantPermissions (duration: 01m 34s)
[16:20:18] anomie: ^ test please
[16:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:20:53] anomie: Well, nothing disappeared on https://meta.wikimedia.org/wiki/Special:OAuthConsumerRegistration/propose, which is about the best that can be checked since the config variable isn't used anywhere anymore.
[16:21:02] (PS4) Anomie: Undeploy ApiSandbox extension [mediawiki-config] - https://gerrit.wikimedia.org/r/263000
[16:21:12] (CR) Anomie: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/263000 (owner: Anomie)
[16:21:15] PROBLEM - puppet last run on etherpad1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[16:21:16] gwicke: ok! should be reasonably easy to update via json export/import
[16:22:37] (Merged) jenkins-bot: Undeploy ApiSandbox extension [mediawiki-config] - https://gerrit.wikimedia.org/r/263000 (owner: Anomie)
[16:24:20] (PS1) Alexandros Kosiaris: etherpad: Simplify module and role [puppet] - https://gerrit.wikimedia.org/r/271283
[16:25:13] !log anomie@tin Synchronized wmf-config/: SWAT: Undeploy Extension:ApiSandbox (duration: 01m 30s)
[16:25:14] anomie: ^ test please
[16:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:25:40] anomie: Well, it's dead code now so there's nothing much to test. Loading ApiSandbox still loads the core version.
[16:25:42] anomie: Fair enough
[16:25:46] * anomie is done with SWAT
[16:26:16] Operations: Andrew gets pages many hours too late - https://phabricator.wikimedia.org/T127189#2035491 (JanZerebecki) What path are you using? I use icinga -> smtp -> imap -> k9 mail. Are you using an smtp to sms gateway?
[16:27:32] anomie: ;-)
[16:28:08] (PS1) Andrew Bogott: Update admin_project_id in keystone [puppet] - https://gerrit.wikimedia.org/r/271284
[16:29:02] Thanks anomie for SWAT
[16:30:21] godog, urandom: https://grafana-admin.wikimedia.org/dashboard/db/restbase is updated, and metrics are all looking good
[16:30:58] (CR) Andrew Bogott: [C: 2] Update admin_project_id in keystone [puppet] - https://gerrit.wikimedia.org/r/271284 (owner: Andrew Bogott)
[16:33:44] RECOVERY - puppet last run on ms-be1008 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[16:34:12] godog: I think it's okay to remove the old metrics
[16:34:36] (PS2) Alexandros Kosiaris: etherpad: Simplify module and role [puppet] - https://gerrit.wikimedia.org/r/271283
[16:34:58] gwicke: yeah I'll give it some other hours and then remove the old metrics
[16:38:47] (CR) RobH: [C: 2] Add production DNS for es201[1-8] Bug:T126006 [dns] - https://gerrit.wikimedia.org/r/271176 (https://phabricator.wikimedia.org/T126006) (owner: Papaul)
[16:45:36] PROBLEM - RAID on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:46:46] RECOVERY - puppet last run on etherpad1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[16:47:25] RECOVERY - RAID on restbase1008 is OK: OK: Active: 6, Working: 7, Failed: 0, Spare: 1
[16:47:31] so no strange pages/alerts related to replication lag, right?
[16:48:16] !log purged ancient boardvote gpg key from mediawiki fleet. unused since forever.
[16:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:48:27] not afaict jynus
[16:48:31] _joe_: Heh, key was generated in '04 :p
[16:48:46] (PS3) Alexandros Kosiaris: etherpad: Simplify module and role [puppet] - https://gerrit.wikimedia.org/r/271283
[16:48:46] great
[16:51:37] Operations, ops-eqiad: ms-be1008.eqiad.wmnet: slot=9 dev=sdj failed - https://phabricator.wikimedia.org/T127060#2035581 (fgiunchedi) Open>Resolved disk rebuilding, thanks @cmjohnson no reboot needed
[16:51:58] Operations, Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2035584 (Dzahn) 23:51 < webzwo0i> mutante: i debugged the pad. you normally should not do this, but if you have a database backup (and after making export/etherpad to have a...
[16:52:43] Operations, Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2035594 (MBinder_WMF) It's an etherpad miracle. :) That was awesome, thanks folks. I can't say what caused the corruption. I wonder if something existed in the pad that be...
[16:52:55] PROBLEM - RAID on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:54:26] Operations, Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2035600 (MBinder_WMF) @bgerstle-wmf ("Brian") see above
[16:54:44] RECOVERY - RAID on restbase1008 is OK: OK: Active: 6, Working: 7, Failed: 0, Spare: 1
[16:56:26] Operations: Andrew gets pages many hours too late - https://phabricator.wikimedia.org/T127189#2035610 (RobH) The path used by our US staff is icinga dispatches an email to the phone.number@cellularprovider.to.sms.address. This allows us to send the notices for no cost to WMF (as the US cellular providers pro...
[16:57:15] Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584#2035616 (GWicke) > This is all pretty disappointing — the lack of TRIM in our current setup is probably a major performance bottleneck given our utilization of those disks. TRIM does not seem to be as cri...
[16:58:32] (PS1) Ema: Depool ulsfo [dns] - https://gerrit.wikimedia.org/r/271289 (https://phabricator.wikimedia.org/T127094)
[16:59:20] (CR) BBlack: [C: 1] Depool ulsfo [dns] - https://gerrit.wikimedia.org/r/271289 (https://phabricator.wikimedia.org/T127094) (owner: Ema)
[16:59:48] (PS2) Bmansurov: Enable the structured language overlay and increase the instrumentation rate [mediawiki-config] - https://gerrit.wikimedia.org/r/271264 (https://phabricator.wikimedia.org/T123980)
[17:00:15] "I don't know if you ppl at wikimedia know each other but if you can find out who the user "Brian" is could you ask him what browser he is using?"
[17:00:26] (PS3) Bmansurov: Enable the structured language overlay and increase the instrumentation rate [mediawiki-config] - https://gerrit.wikimedia.org/r/271264 (https://phabricator.wikimedia.org/T123980)
[17:00:34] (PS2) Ema: Depool ulsfo [dns] - https://gerrit.wikimedia.org/r/271289 (https://phabricator.wikimedia.org/T127094)
[17:00:56] (CR) Ema: [C: 2 V: 2] Depool ulsfo [dns] - https://gerrit.wikimedia.org/r/271289 (https://phabricator.wikimedia.org/T127094) (owner: Ema)
[17:01:03] Operations: Andrew gets pages many hours too late - https://phabricator.wikimedia.org/T127189#2035646 (RobH) @andrew: If you send an email to yourfullcellnumber@tmomail.net it should hit your phone. You can test to see if it is delayed via those means, please advise? Example: If you phone is on tmoble with...
[17:02:48] !log depooled ulsfo https://phabricator.wikimedia.org/T127094
[17:02:51] Operations, ops-codfw, Traffic, Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2035651 (Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmflabs.org/sal/log/AVLwLycR-0X0Il_jxsQQ} [2016-02-17T17:02:48Z] depo...
[17:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:04:26] (PS4) Alexandros Kosiaris: etherpad: Simplify module and role [puppet] - https://gerrit.wikimedia.org/r/271283
[17:06:07] (PS3) Elukey: Add kafka1012 back to the pool of kafka brokers in wmf-config. [mediawiki-config] - https://gerrit.wikimedia.org/r/271276
[17:13:00] !log Updated the sites and site_identifiers table for on all non-Wikipedias (including Wikidata)
[17:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:15:40] (PS1) Dzahn: partman: fix db.cfg to allow unattended install [puppet] - https://gerrit.wikimedia.org/r/271291
[17:16:17] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2035679 (faidon) a:Gehel
[17:17:36] jzerebecki: Is something funky going on with https://gerrit.wikimedia.org/r/#/c/270884/ because it's not a regular mediawiki repo?
[17:19:58] (CR) Papaul: [V: 1] partman: fix db.cfg to allow unattended install [puppet] - https://gerrit.wikimedia.org/r/271291 (owner: Dzahn)
[17:20:15] (PS2) Dzahn: partman: fix db.cfg to allow unattended install [puppet] - https://gerrit.wikimedia.org/r/271291
[17:20:41] (CR) Dzahn: [C: 2] partman: fix db.cfg to allow unattended install [puppet] - https://gerrit.wikimedia.org/r/271291 (owner: Dzahn)
[17:22:13] (CR) Dzahn: [V: 2] partman: fix db.cfg to allow unattended install [puppet] - https://gerrit.wikimedia.org/r/271291 (owner: Dzahn)
[17:22:14] siebrand: should not (after I fixed that it didn't skip the php53 jobs). but the current error is confusing me...
[17:22:29] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2035715 (demon) So using nginx (or similar) would obviously work for port 9200 since that's the HTTP/REST api. However, I don't believe we could us...
[17:23:27] siebrand: ah it is because the last PS removes . from phpcs.xml
[17:23:44] jzerebecki: argh. Thanks for seeing that.
[17:23:56] yw
[17:25:01] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2035723 (demon) (Although I'm probably being overly paranoid here)
[17:25:59] (PS2) Cmjohnson: decom iodine, former OTRS server [dns] - https://gerrit.wikimedia.org/r/269739 (https://phabricator.wikimedia.org/T126483) (owner: Dzahn)
[17:26:51] Operations, ops-codfw, Patch-For-Review: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2035731 (Dzahn) follow-up fix https://gerrit.wikimedia.org/r/#/c/271291/
[17:34:04] Operations, Analytics, Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2035781 (Milimetric)
[17:34:20] !log restarting pybal on ulsfo/esams backup LVS ( lvs[34]00[34])
[17:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:35:23] Operations, Analytics, Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2035791 (Milimetric) p:Triage>High
[17:35:27] (CR) Hashar: "Probably causes T126699" [puppet] - https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: coren)
[17:36:05] (CR) Cmjohnson: [C: 2] decom iodine, former OTRS server [dns] - https://gerrit.wikimedia.org/r/269739 (https://phabricator.wikimedia.org/T126483) (owner: Dzahn)
[17:37:05] cmjohnson1: thanks
[17:38:00] !log restarting pybal on codfw backup LVS ( lvs200[456] )
[17:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:38:49] !log restarting pybal on eqiad inactive LVS clusters ( lvs1007-12 )
[17:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:39:14] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy
[17:40:16] !log restarting pybal on eqiad backup LVS ( lvs100[456] )
[17:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:41:02] (PS1) Papaul: Add another line to Fix typo that prevented from a fully unattended install [puppet] - https://gerrit.wikimedia.org/r/271302
[17:42:45] !log restarting pybal on ulsfo/esams primary LVS ( lvs[34]00[12])
[17:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:44:31] (PS2) Dzahn: partman: another fix to db.cfg to allow fully unattended install [puppet] - https://gerrit.wikimedia.org/r/271302 (owner: Papaul)
[17:44:36] PROBLEM - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - search_9200 - Could not depool server elastic1021.eqiad.wmnet because of too many down!: apaches_80 - Could not depool server mw1253.eqiad.wmnet because of too many down!
[17:46:17] ^ lvs1007-12 issues are "normal"; they're not active and some of them are still affected by disabled switch ports
[17:46:21] (PS3) Dzahn: partman: another fix to db.cfg to allow fully unattended install [puppet] - https://gerrit.wikimedia.org/r/271302 (owner: Papaul)
[17:47:18] !log restarting pybal on codfw primary LVS ( lvs200[123])
[17:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:47:31] (CR) Dzahn: [C: 2] partman: another fix to db.cfg to allow fully unattended install [puppet] - https://gerrit.wikimedia.org/r/271302 (owner: Papaul)
[17:47:54] Operations, Project-Admins: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2035843 (Aklapper) Thanks @Faidon. Let me summarize where we are: **DONE:** * #DC-Ops: Created new team project. * #ops-security: Created as blue project, without restrictions (as t...
[17:49:00] AaronSchulz: I have to go in ~1h today but can stay longer tomorrow, good for you to postpone (partial) https://gerrit.wikimedia.org/r/#/c/26660 to tomorrow?
[17:49:21] !log restarting pybal on eqiad primary LVS ( lvs100[123] )
[17:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:49:35] i'll amend it and leave it in gerrit for today
[17:50:14] AaronSchulz: sounds good!
[17:52:07] Operations, ops-eqiad, Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#2035855 (Cmjohnson) disks are wiping, disable network port, removed from private1 vlan
[17:52:32] (PS8) Ema: Maps VCL initial forward-port to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279)
[17:54:20] (PS2) Aaron Schulz: Enable deferred writes to codfw swift cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/266609 (https://phabricator.wikimedia.org/T91869)
[17:54:32] (PS3) Aaron Schulz: Enable deferred writes to codfw swift cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/266609 (https://phabricator.wikimedia.org/T91869)
[17:54:50] (PS14) JanZerebecki: contint: Put mysql db on tmpfs for role::ci::slave::labs [puppet] - https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: coren)
[17:55:07] (PS4) Aaron Schulz: Enable deferred writes to codfw swift cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/266609 (https://phabricator.wikimedia.org/T91869)
[17:56:56] Operations, ops-eqiad, Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#2035866 (Cmjohnson) updated tracking sheet
[17:57:16] Operations, ops-eqiad, Patch-For-Review: decom iodine - https://phabricator.wikimedia.org/T126483#2035867 (Cmjohnson) updated tracking sheet
[17:58:49] * Josve05a still hasn't learned what opernations do... thought it fixed bugs such as https://phabricator.wikimedia.org/T127194, but appearently not :/
[18:00:00] Operations, Discovery, Wikimedia-Logstash, Discovery-Search-Sprint, Elasticsearch: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#2035886 (Gehel) Testing in mediawiki-vagrant locally, with elasticsearch 1.7.1, I have the following tests in error: ``` Failing Scenarios...
[18:00:26] Operations, Patch-For-Review: Issues with partman/install server/autoinstall for db servers on Jessie - https://phabricator.wikimedia.org/T116902#2035887 (jcrespo) Open>Resolved Plus https://gerrit.wikimedia.org/r/#/c/267328/
[18:02:49] (Abandoned) Tim Landscheidt: geturls: Fix pyflakes warnings [software] - https://gerrit.wikimedia.org/r/169253 (owner: Tim Landscheidt)
[18:03:10] (Abandoned) Tim Landscheidt: [WIP] swiftrepl: Fix pyflakes warnings [software] - https://gerrit.wikimedia.org/r/205086 (owner: Tim Landscheidt)
[18:03:25] Operations, Traffic, Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#2035896 (BBlack) So, I've figured out some of the things that were confusing me yesterday. To recap that: 1) I now question and need to investigate whether our TTL caps are really e...
[18:04:28] Operations, ops-codfw, Patch-For-Review: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2035902 (Dzahn) one more fix by @Papaul https://gerrit.wikimedia.org/r/#/c/271302/3/modules/install_server/files/autoinstall/partman/db.cfg and it works now
[18:05:06] Operations, Ops-Access-Requests, Analytics-Kanban: All members of analytics team need to have sudo -u hdfs on cluster {hawk} [2 pts] - https://phabricator.wikimedia.org/T126752#2035908 (Dzahn)
[18:06:11] Operations, ops-eqiad: db1063 degraded RAID - https://phabricator.wikimedia.org/T126969#2035910 (Cmjohnson) replaced the disk and it is currently rebuilding Enclosure Device ID: 32 Slot Number: 5 Drive's position: DiskGroup: 0, Span: 2, Arm: 1 Enclosure position: 1 Device Id: 5 WWN: 5000C50095B5AF9C Sequ...
[18:06:54] Is PurgeList safe to use in production?
[18:07:28] Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584#2035911 (GWicke) p:Normal>Low
[18:08:29] (CR) Ema: "I should have addressed all comments from BBlack with the exception of vcl_backend_error vs. vcl_synth and set_last_access_cookie__ which " [puppet] - https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) (owner: Ema)
[18:10:59] Operations, ops-eqiad, Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#2035928 (Cmjohnson) p:Normal>Low
[18:11:39] Operations, hardware-requests, Patch-For-Review: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#2035934 (fgiunchedi) plan: * as soon as the rebuild has finished (~6h) remount `/srv` and restart cassandra-a * widen hint window by a safe margin e.g...
[18:12:36] (CR) JanZerebecki: "!log updated cherry-pick https://gerrit.wikimedia.org/r/#/c/204528/14 on integration-puppetmaster T126699" [puppet] - https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: coren)
[18:14:14] Josve05a: no, we generally just sit around and do nothing, that's why there's no traffic in this channel at all
[18:15:02] (CR) JanZerebecki: "Error: Could not set 'manual' on enable: undefined method `manual_start' for Service[mysql](provider=upstart):Puppet::Type::Service::Provi" [puppet] - https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: coren)
[18:15:39] bblack: ouchhhh ;P
[18:16:03] bblack: lol :p But atill, so many "department" anmes, and no one really explains how they differentiates from each other...
[18:16:15] still*
[18:16:40] Josve05a: operations works on server configuration, release engineering deploys mediawiki
[18:16:48] CUZ THEY'RE SIKRIT!
[18:16:59] here's the public git log https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+status:merged,n,z
[18:17:04] Operations, ops-eqiad: db1063 degraded RAID - https://phabricator.wikimedia.org/T126969#2035946 (jcrespo) Did they gave you problems? Please tell if so.
[18:17:56] "Use heartbeat when possible to check slave lag", I know WIkimedia is voloneer-based, but calling us slaves and checking out heartbeat.... :P
[18:18:26] c'mon, we all know it's just a cover for secret search engine, Flow and VE work!
[18:18:29] volunteer* our* /me gets a real computer+ḱeyboard
[18:18:37] lol
[18:19:14] (PS1) Ottomata: Adding debian/ dir from Ubuntu source package and releasing 1.1~b6 [debs/python-gevent] (debian) - https://gerrit.wikimedia.org/r/271306 (https://phabricator.wikimedia.org/T126075)
[18:19:22] are you one of those that thinks that "master" and "slave" for mysql is "degrading"?
[18:20:07] (PS1) Krinkle: navtiming: Improve parse_ua and add unit tests [puppet] - https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594)
[18:21:12] Well, if this "master" is able to dictate how many "MaxConnections" each of the "slaves'" "child's" gets, then yes...
[18:21:14] (CR) Ottomata: [C: 2 V: 2] 2.2.0-1 release [debs/python-pykafka] (debian) - https://gerrit.wikimedia.org/r/271267 (owner: Ottomata)
[18:21:18] (CR) jenkins-bot: [V: -1] navtiming: Improve parse_ua and add unit tests [puppet] - https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594) (owner: Krinkle)
[18:21:23] (PS2) Krinkle: navtiming: Improve parse_ua and add unit tests [puppet] - https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594)
[18:21:24] https://www.drupal.org/node/2275877
[18:21:46] (PS1) Ottomata: Add python-gevent (>= 1.1b6) to Depends [debs/python-pykafka] (debian) - https://gerrit.wikimedia.org/r/271308
[18:22:07] You are all puppets!
[18:22:14] (PS3) Krinkle: navtiming: Improve parse_ua and add unit tests [puppet] - https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594)
[18:22:30] (PS4) Krinkle: navtiming: Improve parse_ua and add unit tests [puppet] - https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594)
[18:22:37] Operations, RESTBase-Cassandra, Services: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#2035956 (GWicke) The dump with concurrency 100 did not uncover any issues over the last two days. @eevans, @fgiunchedi: I have tested what I wanted to test, so the cluster is...
[18:22:48] (PS1) Cmjohnson: Removing mgmt dns entries for nitrogen [dns] - https://gerrit.wikimedia.org/r/271309
[18:23:35] (CR) jenkins-bot: [V: -1] navtiming: Improve parse_ua and add unit tests [puppet] - https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594) (owner: Krinkle)
[18:24:08] (PS2) Krinkle: mediawiki: Apply public-wiki-rewrites.incl to www.mediawiki.org [puppet] - https://gerrit.wikimedia.org/r/271013 (https://phabricator.wikimedia.org/T127194)
[18:25:36] (CR) Cmjohnson: [C: 2] Removing mgmt dns entries for nitrogen [dns] - https://gerrit.wikimedia.org/r/271309 (owner: Cmjohnson)
[18:26:26] (PS5) Krinkle: navtiming: Improve parse_ua and add unit tests [puppet] - https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594)
[18:26:46] Operations, ops-eqiad: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#2035962 (Cmjohnson)
[18:27:12] Operations, ops-eqiad: return nitrogen to spares - https://phabricator.wikimedia.org/T124717#1963665 (Cmjohnson) Removed network and vlan info on switch, removed mgmt dns, placed physical label on server for removal.
[18:27:26] (PS1) Krinkle: Undo "Set $wgResourceBasePath to /w" for www.mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/271311 (https://phabricator.wikimedia.org/T127194)
[18:27:34] !log no issues found with new mysql, lag monitoring, renabling puppet again on the pending eqiad servers
[18:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:28:16] (CR) Krinkle: [C: 2] Undo "Set $wgResourceBasePath to /w" for www.mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/271311 (https://phabricator.wikimedia.org/T127194) (owner: Krinkle)
[18:28:41] (Merged) jenkins-bot: Undo "Set $wgResourceBasePath to /w" for www.mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/271311 (https://phabricator.wikimedia.org/T127194) (owner: Krinkle)
[18:30:29] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: T127194 (duration: 01m 31s)
[18:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:31:02] (PS2) Ottomata: Add python-gevent (>= 1.1b6) to Depends [debs/python-pykafka] (debian) - https://gerrit.wikimedia.org/r/271308 (https://phabricator.wikimedia.org/T126075)
[18:31:42] (PS3) Dzahn: dhcp: replace tabs with spaces [puppet] - https://gerrit.wikimedia.org/r/268706
[18:31:47] (CR) Krinkle: "pep8 fixes" [puppet] - https://gerrit.wikimedia.org/r/271307 (https://phabricator.wikimedia.org/T112594) (owner: Krinkle)
[18:33:39] (PS4) Dzahn: dhcp: replace tabs with spaces [puppet] - https://gerrit.wikimedia.org/r/268706
[18:33:52] Blocked-on-Operations, Phabricator, scap, Scap3, WorkType-Maintenance: scap::target should use scap's debian package instead of trebuchet - https://phabricator.wikimedia.org/T127215#2035978 (mmodell) NEW a:mmodell
[18:33:56] (CR) Dzahn: [C: 2] dhcp: replace tabs with spaces [puppet] - https://gerrit.wikimedia.org/r/268706 (owner: Dzahn)
[18:34:58] Operations, hardware-requests, codfw-rollout, codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2035991 (RobH)
[18:35:18] (PS12) 20after4: make scap::target use the scap3 package provider [puppet] - https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T127215)
[18:35:39] (CR) jenkins-bot: [V: -1] make scap::target use the scap3 package provider [puppet] - https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T127215) (owner: 20after4)
[18:35:56] !log testing now that alerts still work by stopping db1024 replication (depooled)
[18:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:38:40] Operations, hardware-requests, codfw-rollout, codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2035997 (RobH)
[18:39:06] WARNING slave_io_state Slave_IO_Running: No / WARNING slave_sql_state Slave_SQL_Running: No / WARNING slave_sql_lag Replication lag: 68.211691 seconds
[18:39:43] Operations, Patch-For-Review, audits-data-retention: Broken log rotation for many services (was nginx and varnishkafka on cpXXXX) - https://phabricator.wikimedia.org/T127025#2036003 (mmodell)
[18:39:48] Operations, hardware-requests, codfw-rollout, codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2029006 (RobH)
[18:39:57] (that is a good thing)
[18:40:30] is "deployment-master" the same thing as "beta-master" ?
[18:41:34] deployment master is presumably deployment.eqiad.wmflabs (currently = tin). In beta labs this is something else (they call theirs mira currently I believe).
[18:42:01] ok, now i'm more confused than before :)
[18:42:37] deployment.eqiad.wmnet*
[18:42:38] sorry
[18:43:20] mutante: where do you see these references?
[18:43:44] thank you, i want to find the master that uses the puppet roles defined in /manifests/role/beta.pp
[18:44:10] i'll try the "watroles" tool
[18:44:11] WARNING slave_io_state Slave_IO_Running: No / CRITICAL slave_sql_lag Replication lag: 396.824618 seconds / WARNING slave_sql_state Slave_SQL_Running: No
[18:44:36] puppetmaster!=deployment master
[18:44:45] krenair said "ssh deployment-puppetmaster.deployment-prep.eqiad.wmflabs"
[18:45:35] so the deployment master would then be called deployment-master.deployment-prep ?
[18:46:05] afaik the deployment (scap) master is mira.deployment-prep.eqiad.wmflabs
[18:46:26] so named because one of the deployment masters in production is named mira
[18:46:33] (PS13) 20after4: make scap::target use the scap3 package provider [puppet] - https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T127215)
[18:49:29] Operations, ops-codfw, Traffic, Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036040 (emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 10:49:13 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki...
[18:50:26] Operations, hardware-requests, codfw-rollout, codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2036046 (RobH) I've updated the task description to include terbium's base info as well as linking to its ganglia statisti...
[18:52:21] thanks,seems i have to find out if deployment-puppetmaster is the puppet master of mira.deployment-prep [18:56:23] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2029006 (10RobH) [18:58:19] 6Operations, 10ops-codfw, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036091 (10emailbot) **`Brandon Black`** replied via email on `Wed, 17 Feb 2016 18:58:12 +0000` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failu... [18:58:29] (03PS15) 10JanZerebecki: contint: Put mysql db on tmpfs for role::ci::slave::labs [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [18:59:01] 6Operations, 10DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#2036092 (10kaldari) 3NEW [18:59:03] (03PS2) 10Dzahn: netboot.cfg - replace tabs with spaces [puppet] - 10https://gerrit.wikimedia.org/r/268717 [18:59:42] 6Operations, 10DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#2036105 (10kaldari) [18:59:47] 6Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584#2036106 (10faidon) >>! In T89584#2035616, @GWicke wrote: >> This is all pretty disappointing — the lack of TRIM in our current setup is probably a major performance bottleneck given our utilization of those d... 
[18:59:48] (03PS3) 10Dzahn: netboot.cfg - replace tabs with spaces [puppet] - 10https://gerrit.wikimedia.org/r/268717 [19:00:12] (03CR) 10Dzahn: [C: 032 V: 032] netboot.cfg - replace tabs with spaces [puppet] - 10https://gerrit.wikimedia.org/r/268717 (owner: 10Dzahn) [19:02:08] 6Operations, 10ops-codfw, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036111 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 11:02:02 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki... [19:03:06] 6Operations, 10DBA: Drop wikilove_image_log table from Wikimedia wikis - https://phabricator.wikimedia.org/T127219#2036124 (10jcrespo) p:5Triage>3Low Thanks for the report. I will do it. Triaging it as importance low as usually the risk/reward factor is high, but it has to be done, eventually. [19:03:17] (03PS5) 10Alexandros Kosiaris: etherpad: Simplify module and role [puppet] - 10https://gerrit.wikimedia.org/r/271283 [19:03:23] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] etherpad: Simplify module and role [puppet] - 10https://gerrit.wikimedia.org/r/271283 (owner: 10Alexandros Kosiaris) [19:07:22] (03PS1) 10BBlack: nginx keepalives: enable for maps+misc [puppet] - 10https://gerrit.wikimedia.org/r/271317 (https://phabricator.wikimedia.org/T107749) [19:07:24] (03PS1) 10BBlack: nginx keepalives: enable for text [puppet] - 10https://gerrit.wikimedia.org/r/271318 (https://phabricator.wikimedia.org/T107749) [19:08:10] (03CR) 10BBlack: [C: 032 V: 032] nginx keepalives: enable for maps+misc [puppet] - 10https://gerrit.wikimedia.org/r/271317 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [19:08:35] RECOVERY - puppet last run on labsdb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:08:45] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: Connection refused [19:10:06] (03CR) 10JanZerebecki: [C: 04-1] "My ugly hack 
works for now." [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [19:10:10] etherpad works for me [19:10:29] jynus: pretty sure it's related to the merge above that simplifies the puppet module [19:11:07] (it also works for me) [19:11:09] does anyone remember if labsdb1002's puppet is supposed stopped [19:11:24] I mat have accidentally started [19:11:29] it [19:11:36] (03PS4) 10Bmansurov: Enable the structured language overlay and increase the instrumentation rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271264 (https://phabricator.wikimedia.org/T123980) [19:11:52] checking icinga history [19:11:59] jynus: if you did not have to actually do puppet agent --enable , probably not? [19:12:30] I did that to all databases [19:12:34] ah [19:13:08] (03PS1) 10Papaul: es2014 had wrong MAC and Add es2019 to dhcpd Bug:T126006 [puppet] - 10https://gerrit.wikimedia.org/r/271319 (https://phabricator.wikimedia.org/T126006) [19:13:42] jynus: hmmm, so they had the last puppet run 3 hours ago, neither OK nor CRIT [19:14:21] i think disabled but just very recently [19:17:08] (03PS2) 10Dzahn: es2014 had wrong MAC and Add es2019 to dhcpd Bug:T126006 [puppet] - 10https://gerrit.wikimedia.org/r/271319 (https://phabricator.wikimedia.org/T126006) (owner: 10Papaul) [19:17:47] I think it was just acked or downtimed [19:18:41] ok [19:18:51] !log truncate 1.2T php error log file on labstore1003 from cluebot [19:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:19:09] (03PS1) 10BBlack: tlsproxy: upstreams need distinct names [puppet] - 10https://gerrit.wikimedia.org/r/271321 [19:19:09] it wasn't, jynus [19:19:10] it is not important, it is depooled, and it will not go very far- just in case it spams alerts [19:19:17] it's not showing up yet because it's just been 3 hours [19:19:21] it will in another 3 or so [19:19:27] alright [19:19:36] yuvipanda: oh, only 1.2T uh? 
[19:19:45] 6Operations: audit contractors sheet against cluster access - https://phabricator.wikimedia.org/T114430#2036260 (10RobH) 5Open>3Resolved I finished this up back last fall and its done then (forgot to resolve) [19:19:54] PROBLEM - traffic-pool service on cp1070 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [19:20:03] Krinkle: that's 1 week of growth [19:20:08] I'm sending it to /dev/null [19:20:17] cp1070 is me, working on it, sorry! [19:20:42] (03PS1) 10Jdlrobson: Revert "Strip references for experimentation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271322 (https://phabricator.wikimedia.org/T126390) [19:20:49] (03PS3) 10Papaul: dhcpd: fix MAC of es2014, add es2019 Bug:T126006 [puppet] - 10https://gerrit.wikimedia.org/r/271319 (https://phabricator.wikimedia.org/T126006) [19:20:53] (03PS2) 10Jdlrobson: Revert "Strip references for experimentation" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271322 (https://phabricator.wikimedia.org/T126390) [19:21:06] (03PS3) 10Krinkle: mediawiki: Apply public-wiki-rewrites.incl to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/271013 (https://phabricator.wikimedia.org/T99096) [19:21:17] (03PS4) 10Krinkle: mediawiki: Apply public-wiki-rewrites.incl to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/271013 (https://phabricator.wikimedia.org/T99096) [19:21:26] (03PS4) 10Dzahn: dhcpd: fix MAC of es2014, add es2019 [puppet] - 10https://gerrit.wikimedia.org/r/271319 (https://phabricator.wikimedia.org/T126006) (owner: 10Papaul) [19:21:42] (03PS5) 10Dzahn: dhcpd: fix MAC of es2014, add es2019 [puppet] - 10https://gerrit.wikimedia.org/r/271319 (https://phabricator.wikimedia.org/T126006) (owner: 10Papaul) [19:21:54] PROBLEM - HTTPS on cp1070 is CRITICAL: Return code of 255 is out of bounds [19:21:55] (03CR) 10Dzahn: [C: 032 V: 032] dhcpd: fix MAC of es2014, add es2019 [puppet] - 10https://gerrit.wikimedia.org/r/271319 
(https://phabricator.wikimedia.org/T126006) (owner: 10Papaul) [19:22:56] PROBLEM - DPKG on restbase1008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:23:21] (03PS2) 10BBlack: tlsproxy: upstreams need distinct names [puppet] - 10https://gerrit.wikimedia.org/r/271321 [19:24:25] Krinkle: it actually killed the truncate, it got stuck and I had to kill -9 it [19:25:29] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Log host for codfw (fluorine's equivalent) - https://phabricator.wikimedia.org/T126988#2036278 (10Southparkfan) >>! In T126988#2032005, @RobH wrote: > * 16GB RAM : 4 * 2GB DIMM DDR3 Synchronous 1333 MHz This system actually con... [19:25:52] (03PS3) 10BBlack: tlsproxy: upstreams need distinct names [puppet] - 10https://gerrit.wikimedia.org/r/271321 [19:26:05] (03CR) 10BBlack: [C: 032 V: 032] "compiler-checked" [puppet] - 10https://gerrit.wikimedia.org/r/271321 (owner: 10BBlack) [19:26:40] (03PS1) 10Subramanya Sastry: ruthenium: Tweak the update_parsoid.sh script to make it more robust [puppet] - 10https://gerrit.wikimedia.org/r/271323 [19:27:11] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Log host for codfw (fluorine's equivalent) - https://phabricator.wikimedia.org/T126988#2036281 (10RobH) Noted! We don't order in only 8GB anyhow so luckily my mistake doesn't affect quoting (since I asked for 16GB on the quote... [19:27:14] RECOVERY - HTTPS on cp1070 is OK: SSLXNN OK - 36 OK [19:27:22] (03CR) 10Krinkle: "Tested on mw1017, works as expected." [puppet] - 10https://gerrit.wikimedia.org/r/271013 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [19:27:25] 6Operations, 10Wikimedia-Etherpad: Specific etherpad link hangs then shows JavaScript error - https://phabricator.wikimedia.org/T126379#2036282 (10MBinder_WMF) I'm beginning to think there's something to the iOS-team-specific hypothesis, as another one of our etherpads has started suffering the same issue: htt... 
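Truncating a runaway log in place, as in the 19:18:51 !log entry, frees the space without deleting the file, so the path, owner and mode stay as they were. A minimal sketch using Python's os.truncate on a throwaway file; the file contents and the `truncate_log` helper are made up for illustration, and nothing here reflects the actual command run on labstore1003:

```python
import os
import tempfile

def truncate_log(path):
    """Cut the file to zero bytes in place and return how many bytes it held."""
    size_before = os.path.getsize(path)
    os.truncate(path, 0)  # same file, same permissions, just empty
    return size_before

# Demo on a temporary file standing in for the real error log.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"PHP Warning: something noisy\n" * 1000)
    demo_path = tmp.name

freed = truncate_log(demo_path)
remaining = os.path.getsize(demo_path)
os.remove(demo_path)
print(freed, remaining)
```

On a file of 1.2T the operation can take long enough on some filesystems to look stuck, which may be what happened before the kill -9 mentioned above; that is a guess, not something the log confirms.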
[19:28:20] (03PS2) 10Dzahn: ruthenium: Tweak the update_parsoid.sh script to make it more robust [puppet] - 10https://gerrit.wikimedia.org/r/271323 (owner: 10Subramanya Sastry) [19:28:35] RECOVERY - traffic-pool service on cp1070 is OK: OK - traffic-pool is active [19:33:38] (03PS5) 10Ori.livneh: mediawiki: Apply public-wiki-rewrites.incl to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/271013 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [19:34:01] (03CR) 10Ori.livneh: [C: 032 V: 032] mediawiki: Apply public-wiki-rewrites.incl to www.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/271013 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [19:34:20] 6Operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2036304 (10Krinkle) [19:35:02] (03PS3) 10Dzahn: ruthenium: Tweak the update_parsoid.sh script to make it more robust [puppet] - 10https://gerrit.wikimedia.org/r/271323 (owner: 10Subramanya Sastry) [19:35:16] (03CR) 10Dzahn: [C: 032 V: 032] ruthenium: Tweak the update_parsoid.sh script to make it more robust [puppet] - 10https://gerrit.wikimedia.org/r/271323 (owner: 10Subramanya Sastry) [19:38:33] (03CR) 10Dzahn: "@Tim gotcha, thanks for the explanation, that seems fine" [puppet] - 10https://gerrit.wikimedia.org/r/269902 (owner: 10Tim Landscheidt) [19:41:41] (03CR) 10Hashar: "Jan mind to explain how the problem ends up being fixed? The service mysql is set with provider debian, I am wondering how that fix it." 
[puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [19:42:34] PROBLEM - Host mr1-ulsfo is DOWN: CRITICAL - Network Unreachable (198.35.26.194) [19:42:36] PROBLEM - Host mr1-ulsfo.oob is DOWN: PING CRITICAL - Packet loss = 100% [19:42:59] and there goes the ulsfo maintenance for real :) [19:43:11] https://phabricator.wikimedia.org/T127094 [19:43:41] some fallout like mr1 is expected, there's a few items in ulsfo without redundant power, which shouldn't be critical to actual servers or non-oob traffic links [19:43:57] (and in any case, ulsfo user traffic is depooled for the window) [19:43:57] (03PS1) 10Krinkle: mediawiki: Apply public-wiki-rewrites.incl to *.wikiquote.org [puppet] - 10https://gerrit.wikimedia.org/r/271330 (https://phabricator.wikimedia.org/T99096) [19:44:17] (03CR) 10Krinkle: [C: 04-1] "Not yet tested." [puppet] - 10https://gerrit.wikimedia.org/r/271330 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [19:47:27] jynus: about? [19:47:56] Steinsplitter, sorry? [19:48:15] jynus: since 14th commonswiki db is higly slow for me. [19:48:37] (03CR) 10JanZerebecki: "With the service being debian, which are the generic init independent service wrappers create by debian and thus also work on ubuntu with " [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [19:49:59] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2036334 (10ori) > Whether responding to a certain request will require session handling is not in general predictable (cf T104755#2034145), so making application boo... 
[19:50:37] (03PS2) 10Krinkle: mediawiki: Apply public-wiki-rewrites to wikiquote, wikibooks and wikisource [puppet] - 10https://gerrit.wikimedia.org/r/271330 (https://phabricator.wikimedia.org/T99096) [19:50:39] (03PS1) 10Krinkle: mediawiki: Apply public-wiki-rewrites to wikiversity and wikivoyage [puppet] - 10https://gerrit.wikimedia.org/r/271332 (https://phabricator.wikimedia.org/T99096) [19:50:52] Steinsplitter, do you want me to kill some long running-queries? [19:51:25] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:51:50] Krinkle: let me know once you've tested the changes on mw1017 and I'll merge [19:52:15] ori: thx, I'm waiting for the mw.org to roll out and confirm, so I can undo my mw-config revert first. [19:52:37] meanwhile I'll test the netx two yeah [19:53:08] jynus: ? try to run "select * from recentchanges limit 6;" it runs forever as well. [19:53:10] (03PS1) 10Aude: Don't yet include wikidatasparql for graphoid [puppet] - 10https://gerrit.wikimedia.org/r/271334 [19:53:41] yes, there is one user abusing [19:53:56] (03PS1) 10Aude: Don't yet allow wikidatasparql graph urls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 [19:54:22] and that user is yuvipanda [19:54:38] quarry, probably? [19:54:38] (03PS1) 10Krinkle: Re-apply "Set $wgResourceBasePath to /w for www.mediawiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271337 [19:54:45] but yuvipanda is <3 ... :( [19:54:49] possible [19:54:49] (03PS2) 10Krinkle: Re-apply "Set $wgResourceBasePath to /w for www.mediawiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271337 (https://phabricator.wikimedia.org/T99096) [19:55:00] wah [19:55:02] yes [19:55:03] well, when it is slow [19:55:04] quarry [19:55:06] what happened?! [19:55:07] :-D [19:55:10] it swaps [19:55:14] bah [19:55:18] so it is better to kill [19:55:29] yuvipanda, ok with that? 
[19:55:35] I don't see any active queries running in quarry [19:55:46] s51206 is you? [19:55:58] commonswiki_p [19:56:20] 10.68.18.47:59891 [19:56:22] jynus: ok, then it's probably a bug in quarry that hasn't killed something it should have. [19:56:40] jynus: so yeah, ok to kill, but I am trying to get the SQL it is executing first [19:56:48] the SQL has a json blob prepended [19:56:49] I do not care, yuvipanda, but no privileges here :-) [19:56:50] that has user info [19:56:52] on which exact user is [19:56:54] jynus: yeah totally [19:57:04] jynus: I'm just trying to find my bug so I can prevent it from happening again :) [19:57:32] do not worry so much, as I said to everybody in labs, it is usally not the programmers, fault [19:57:42] but overuse by third party [19:58:08] ACKNOWLEDGEMENT - PyBal backends health check on lvs1009 is CRITICAL: PYBAL CRITICAL - search_9200 - Could not depool server elastic1021.eqiad.wmnet because of too many down!: apaches_80 - Could not depool server mw1214.eqiad.wmnet because of too many down! Brandon Black T112781 [19:58:12] but quarry is specically built to not allow that [19:58:38] when you tell me, I am ready to mass kill [19:58:48] jynus: I can't find them in https://tendril.wikimedia.org/report/slow_queries?host=^labsdb1001&user=&schema=&qmode=eq&query=&hours=1 [19:59:07] (03PS2) 10Dzahn: beta: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260937 [19:59:23] let me grab a screen copy [19:59:26] then kill [19:59:29] (03CR) 10Hashar: [C: 04-1] "Per Tyler that strips stdout :(" [puppet] - 10https://gerrit.wikimedia.org/r/270902 (https://phabricator.wikimedia.org/T110407) (owner: 10Hashar) [19:59:35] jynus: jynus ok [19:59:44] jynus: I can't find it in show processlist either... 
[19:59:55] (03PS2) 10BBlack: nginx keepalives: enable for text [puppet] - 10https://gerrit.wikimedia.org/r/271318 (https://phabricator.wikimedia.org/T107749) [19:59:57] (03CR) 10Hoo man: [C: 031] Don't yet include wikidatasparql for graphoid [puppet] - 10https://gerrit.wikimedia.org/r/271334 (owner: 10Aude) [20:00:04] hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160217T2000). [20:00:04] PROBLEM - RAID on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:00:16] are you checking on the right host? [20:00:21] labsdb1003 [20:00:24] (03CR) 10BBlack: [C: 032 V: 032] nginx keepalives: enable for text [puppet] - 10https://gerrit.wikimedia.org/r/271318 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [20:00:25] commons is there [20:00:56] jynus: quarry is using only labsdb1010 [20:00:58] err [20:01:00] 1001 [20:01:16] then it is something else [20:01:25] by s51206 [20:01:36] (03CR) 10Dzahn: "@Krenair thank you very much, i tried doing that. just ran into this "error: could not apply 2406978... " though" [puppet] - 10https://gerrit.wikimedia.org/r/260937 (owner: 10Dzahn) [20:01:43] Krenair patches? [20:01:51] we haven't merged those [20:02:07] jynus: I don't know if s51206 is me now actually [20:02:25] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [20:02:30] jynus: it isn't me at all. 
[20:02:34] yhea, not you [20:02:35] it's tools.ptwikis tool [20:02:49] you had some long running queries [20:03:00] but not the ones causing the issue [20:03:21] @seen hashar [20:03:21] mutante: Last time I saw hashar they were quitting the network with reason: Quit: Textual IRC Client: www.textualapp.com N/A at 2/17/2016 5:43:04 PM (2h20m17s ago) [20:03:30] jynus: yeah, quarry shouldn't be connecting to labsdb1003 at all [20:03:33] it's hardcoded to 1001 [20:03:44] RECOVERY - RAID on restbase1008 is OK: OK: Active: 6, Working: 7, Failed: 0, Spare: 1 [20:04:01] jynus: feel free to kill, in general. shoot first ask questions later when it comes to labsdb I think [20:05:06] anyone online who works on deployment-puppetmaster ? [20:05:26] (03PS3) 10Krinkle: mediawiki: Apply public-wiki-rewrites to all remaiming wiki domains [puppet] - 10https://gerrit.wikimedia.org/r/271330 (https://phabricator.wikimedia.org/T99096) [20:05:58] (03Abandoned) 10Krinkle: mediawiki: Apply public-wiki-rewrites to wikiversity and wikivoyage [puppet] - 10https://gerrit.wikimedia.org/r/271332 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [20:06:04] the issues are since 14th. 
wondering why nobody complained yet :-o [20:06:05] I am going to throttle his connection because they keep coming [20:06:18] jynus and yuvi, thanks for looking into it :) [20:06:19] jynus: I can also suspend the tool itself [20:06:28] (03PS4) 10Krinkle: mediawiki: Apply public-wiki-rewrites to all remaining wiki domains [puppet] - 10https://gerrit.wikimedia.org/r/271330 (https://phabricator.wikimedia.org/T99096) [20:06:34] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0; fxp0: down [20:10:38] this is especially important now that we are running with reduced redundancy [20:10:44] yeah [20:11:21] Steinsplitter, also understand why certain limits have to be set in extreme cases, to avoid things like this [20:13:31] jynus: just curious (i don't need that), can you set higher limits for a single tool? [20:13:46] there are no default limits [20:13:52] (03PS1) 10Yuvipanda: labs: Allow setting NFS mounts to be soft via hiera [puppet] - 10https://gerrit.wikimedia.org/r/271340 (https://phabricator.wikimedia.org/T127224) [20:13:55] I just created one for this account [20:14:04] hopefully temporarily [20:14:05] chasemp: ^ can you check that patch?
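The "mass kill" discussed above comes down to picking one account's long-running queries out of the process list and issuing KILL for each thread id. A minimal sketch, assuming rows shaped like information_schema.PROCESSLIST (ID, USER, COMMAND, TIME); the sample rows, the 3600-second cutoff and the `kill_statements` helper are all made up, and only the account name s51206 comes from the log:

```python
def kill_statements(processlist, user, min_seconds):
    """Build KILL statements for one user's queries running at least min_seconds."""
    statements = []
    for row in processlist:
        if (row["USER"] == user
                and row["COMMAND"] == "Query"
                and row["TIME"] >= min_seconds):
            statements.append("KILL {:d};".format(row["ID"]))
    return statements

# Made-up rows; a real run would read them from the database server.
rows = [
    {"ID": 101, "USER": "s51206", "COMMAND": "Query", "TIME": 7200},
    {"ID": 102, "USER": "s51206", "COMMAND": "Sleep", "TIME": 9000},
    {"ID": 103, "USER": "u1234", "COMMAND": "Query", "TIME": 8000},
    {"ID": 104, "USER": "s51206", "COMMAND": "Query", "TIME": 30},
]
for statement in kill_statements(rows, "s51206", 3600):
    print(statement)
```

Generating the statements first and reviewing them before execution keeps other users' queries out of the blast radius. The per-account limit mentioned at 20:13:55 could be something like MariaDB's MAX_USER_CONNECTIONS resource limit, but the log does not say which mechanism was used.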
[20:14:35] <3 [20:14:55] Steinsplitter, If you have a specific request, send a task this way [20:15:00] (03CR) 10jenkins-bot: [V: 04-1] labs: Allow setting NFS mounts to be soft via hiera [puppet] - 10https://gerrit.wikimedia.org/r/271340 (https://phabricator.wikimedia.org/T127224) (owner: 10Yuvipanda) [20:15:22] and we can talk, but only if it is reasonable and doesn't affect other users [20:15:23] (03PS1) 10Ori.livneh: navtiming: tee save timing stats for mediawikiwiki to separate metric [puppet] - 10https://gerrit.wikimedia.org/r/271341 [20:15:37] (03PS2) 10Yuvipanda: labs: Allow setting NFS mounts to be soft via hiera [puppet] - 10https://gerrit.wikimedia.org/r/271340 (https://phabricator.wikimedia.org/T127224) [20:15:50] Is the labsdb cluster underpowered (= more servers needed) or are some users just too rude? [20:16:13] most of the time, neither [20:16:20] (03CR) 10Ori.livneh: [C: 032 V: 032] navtiming: tee save timing stats for mediawikiwiki to separate metric [puppet] - 10https://gerrit.wikimedia.org/r/271341 (owner: 10Ori.livneh) [20:16:27] (03CR) 10Catrope: [C: 032] Speed trials: Add mobile and desktop versions with OOjs UI core loaded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271144 (https://phabricator.wikimedia.org/T127125) (owner: 10Jforrester) [20:16:37] but right now a user had 50 long running queries consiming all memory [20:16:41] James_F, RoanKattouw: \o/ thanks. [20:16:50] (03CR) 10Krinkle: "Tested robots.txt, favicon.ico, /w/resources/assets/poweredby_mediawiki_88x31.png, and regular page load via / (root redirect to mainpage)" [puppet] - 10https://gerrit.wikimedia.org/r/271330 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [20:16:52] bah [20:16:52] ori: done [20:16:58] jynus: I just added you to the Quarry change, btw. 
It still enforces the same time limits as it did before, just makes it easier for people to run it against multiple dbs [20:17:27] (03Merged) 10jenkins-bot: Speed trials: Add mobile and desktop versions with OOjs UI core loaded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271144 (https://phabricator.wikimedia.org/T127125) (owner: 10Jforrester) [20:17:32] will better servers be able to serve more users faster?, yes [20:17:43] but do not assume we are not working on that [20:17:51] (at the same time) [20:18:01] ori: how long for the mw.org to be live on all apaches? 40min should suffice? It works for me but could be lucky [20:18:10] (03CR) 10Dzahn: "cherry-picked on deployment-puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/260937 (owner: 10Dzahn) [20:18:12] I don't :) and I know that it's not possible to buy 25 extra servers [20:18:14] it was merged about ~40min ago [20:18:24] was just wondering [20:19:11] 502 Bad Gateway -- nginx/1.9.4 [20:19:19] bblack: ^ [20:19:21] Josve05a: probably me [20:19:28] I'm still working on some related things.... [20:19:29] 502 Bad Gateway on Meta.. [20:19:32] on #wikimedia-tech [20:19:37] ok, things comming back now [20:19:37] 502 Bad Gateways everywhere, >= 75% [20:19:59] would be good to say something on #wikipedia too [20:20:18] I think the problem that caused that is fixed, but I can't be 100% sure yet [20:20:30] seems fixed for me on Commons [20:20:34] unfortunately, this happens at a level so far at the outer edge of the stack, it doesn't really make it into stats/graphs [20:20:39] > well now it works again [20:21:16] there seems to be another user wanting a throttle too [20:21:50] Josve05a: likely https://upload.wikimedia.org/wikipedia/commons/0/03/Server-kitty.jpg :P [20:21:53] Krinkle: 30 minutes should be fine. 
20 for puppet to run everywhere + 10 minute safety margin [20:22:06] (03CR) 10Krinkle: [C: 032] "https://www.mediawiki.org/w/resources/assets/poweredby_mediawiki_88x31.png now works (merged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271337 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [20:22:15] Steinsplitter: I can use meta-templates? :D [20:22:31] (03PS1) 10Papaul: Add es2019 prodcution DNS Bug:T126006 [dns] - 10https://gerrit.wikimedia.org/r/271342 (https://phabricator.wikimedia.org/T126006) [20:22:43] (03Merged) 10jenkins-bot: Re-apply "Set $wgResourceBasePath to /w for www.mediawiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271337 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [20:22:46] Steinsplitter: Does it say "Ebony black on the box under the cat"...? [20:22:47] ori: puppet is 30 itself I think, currently [20:23:38] !log catrope@tin Synchronized docroot/: (no message) (duration: 01m 33s) [20:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:24:03] James_F, ori: Done [20:24:20] RoanKattouw: Thanks. [20:24:36] bblack: yes, looks like you're right. So, Krinkle: 40m -- 30 for Puppet, 10 for safety. 
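The wait agreed on above is plain arithmetic: one full Puppet run interval plus a safety margin. A small sketch of that calculation, assuming the 30-minute interval bblack recalls and taking the 19:34:01 approval of change 271013 as the merge time (the real merge on the puppetmaster may have happened slightly later):

```python
from datetime import datetime, timedelta

PUPPET_INTERVAL_MIN = 30  # agent run interval, per the discussion above
SAFETY_MARGIN_MIN = 10    # extra slack agreed on in the channel

def safe_after(merged_at):
    """Earliest time by which every host should have picked up the change."""
    return merged_at + timedelta(minutes=PUPPET_INTERVAL_MIN + SAFETY_MARGIN_MIN)

merged = datetime(2016, 2, 17, 19, 34, 1)  # assumed merge time, from the log
print(safe_after(merged).strftime("%H:%M:%S"))  # prints 20:14:01
```

That lands a few minutes before Krinkle's 20:18 remark that the change was merged about 40 minutes earlier.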
[20:24:48] k [20:25:09] (03PS5) 10Ori.livneh: mediawiki: Apply public-wiki-rewrites to all remaining wiki domains [puppet] - 10https://gerrit.wikimedia.org/r/271330 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [20:25:18] (03CR) 10Ori.livneh: [C: 032 V: 032] mediawiki: Apply public-wiki-rewrites to all remaining wiki domains [puppet] - 10https://gerrit.wikimedia.org/r/271330 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [20:25:38] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: Re-enable T99096 for mediawiki.org (duration: 01m 29s) [20:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:25:42] 6Operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2036441 (10Stashbot) {nav icon=file, name=Mentioned in SAL, href=https://tools.wmfl... [20:29:29] (03CR) 10Dzahn: [C: 032] "yep, belongs in row d per info on ticket" [dns] - 10https://gerrit.wikimedia.org/r/271342 (https://phabricator.wikimedia.org/T126006) (owner: 10Papaul) [20:30:18] 6Operations, 10Deployment-Systems, 6Performance-Team, 10Traffic, 5Patch-For-Review: Make Varnish cache for /static/$wmfbranch/ expire when resources change within branch lifetime - https://phabricator.wikimedia.org/T99096#2036457 (10Krinkle) [20:33:02] 6Operations, 10OCG-General-or-Unknown, 6Scrum-of-Scrums, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#2036467 (10RobLa-WMF) >>! In T120079#1872382, @Joe wrote: > @Aklapper apparently no one wants to take responsibility for this. And it's a proble... 
[20:40:16] !log es201[1-9] - signing puppet certs, salt-key, initial run [20:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:14] (03PS7) 10Ori.livneh: add a dependency on xhprof/xhgui [mediawiki-config] - 10https://gerrit.wikimedia.org/r/251137 [20:42:18] (03CR) 10Dzahn: [C: 031] "checked instance names using this with "watroles"." [puppet] - 10https://gerrit.wikimedia.org/r/260937 (owner: 10Dzahn) [20:42:34] (03PS3) 10Dzahn: beta: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260937 [20:42:40] (03CR) 10Dzahn: [C: 032] beta: move roles to modules/role/ [puppet] - 10https://gerrit.wikimedia.org/r/260937 (owner: 10Dzahn) [20:42:56] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/260937 (owner: 10Dzahn) [20:46:05] 6Operations, 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: translatewiki-puppetlint-strict does not honor puppet-lint.rc file in /puppet - https://phabricator.wikimedia.org/T116552#2036500 (10hashar) Neat! Well done @siebrand and congratulations for passing puppet-lint! [20:47:25] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: puppet fail [20:57:24] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2036552 (10Anomie) >>! In T126700#2036334, @ori wrote: > But I would have appreciated (and would appreciate still) if you were proactive about doing the work to rule... [20:58:32] (03CR) 10Rush: labs: Allow setting NFS mounts to be soft via hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/271340 (https://phabricator.wikimedia.org/T127224) (owner: 10Yuvipanda) [21:00:05] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160217T2100). Please do the needful. 
[21:01:30] no mobileapps deploy today [21:01:36] no parsoid deploy today [21:02:44] PROBLEM - RAID on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:03:01] (03CR) 10Yuvipanda: labs: Allow setting NFS mounts to be soft via hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/271340 (https://phabricator.wikimedia.org/T127224) (owner: 10Yuvipanda) [21:03:10] chasemp: I have that check a bit above [21:03:31] PROBLEM - MariaDB disk space on labsdb1003 is CRITICAL: DISK CRITICAL - free space: /srv 177552 MB (5% inode=99%) [21:04:04] RECOVERY - RAID on restbase1008 is OK: OK: Active: 6, Working: 7, Failed: 0, Spare: 1 [21:04:10] (03CR) 10Dzahn: "should https://tools.wmflabs.org/watroles/role/misc::limn::instance show us if/which instance is using this? or does that not work becaus" [puppet] - 10https://gerrit.wikimedia.org/r/231144 (owner: 10Faidon Liambotis) [21:04:37] yuvipanda: sorry I couldn't highlight multiple lines so confusing, but mainly what I was trying to say is that logic seems to say [21:04:55] if nfs_mode is unset then it's hard and if it equals hard then make it hard (phrasing!) [21:04:56] (03CR) 10Hoo man: [C: 031] "Ok for beta… you might want to move it into the -labs file if it's possible to have this beta only in a sensible way. Removing it now is m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 (owner: 10Aude) [21:05:06] and if it's set to something not hard then set it to this one value of soft only [21:05:11] chasemp: ah, I see. 
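The objection raised above is easier to see when the described logic is written out. This is a Python mimic of the behaviour as chasemp paraphrases it, not the actual Puppet code from change 271340; `nfs_mount_mode` is a hypothetical name:

```python
def nfs_mount_mode(nfs_mode=None):
    """Unset or 'hard' gives hard; every other value gives soft."""
    if nfs_mode is None or nfs_mode == "hard":
        return "hard"
    return "soft"  # any other value, including a typo, ends up here

for value in (None, "hard", "soft", "hrad"):
    print(repr(value), "->", nfs_mount_mode(value))
```

The last case is the confusing one: a misspelled "hrad" silently produces a soft mount, because the setting looks like a free-form mode but only ever selects between two outcomes.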
[21:05:22] I was suggesting $mode = hiera('mode', 'hard') [21:05:24] RECOVERY - DPKG on restbase1008 is OK: All packages OK [21:05:27] and then putting teh alternative options in hiera [21:05:30] chasemp: hmm, so maybe I should change that to be 'nfs_hard' and treat that as True / False [21:05:40] I don't want random arbitrary mode values [21:05:54] that would make more sense then sure [21:06:03] my qualm is basically, if I set a non-hard value I get this one other thing [21:06:09] which may not be related and that's confusing [21:06:16] yeah [21:06:23] i'll rework it to be bool [21:07:01] RECOVERY - MariaDB disk space on labsdb1003 is OK: DISK OK [21:07:41] jynus: is that you?^ [21:08:16] I guess so, I was taking a look and saw him logged :) [21:08:37] no [21:08:45] did someone increase the volume? [21:09:01] not me [21:09:46] from 2.9T 161G 95% /srv to 2.6T 489G 85% /srv [21:09:57] the partition didn't change size [21:10:12] https://grafana.wikimedia.org/dashboard/db/server-board?panelId=17&fullscreen&from=1455657009515&to=1455743169515&var-server=labsdb1003&var-network=eth0 [21:10:31] maybe the queries I killed were creating temp tables? [21:10:54] (03PS1) 10BBlack: enable tcp_tw_reuse on caches [puppet] - 10https://gerrit.wikimedia.org/r/271350 [21:11:25] it takes some time for mysql/os to actually clean up those [21:11:33] 6Operations, 10ops-codfw, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036604 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 13:11:25 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki... [21:12:14] now is 650MB on the tmp table dir [21:12:39] (03PS3) 10Yuvipanda: labs: Allow setting NFS mounts to be soft via hiera [puppet] - 10https://gerrit.wikimedia.org/r/271340 (https://phabricator.wikimedia.org/T127224) [21:12:43] chasemp: updated [21:12:52] in any case, normal trend isn't that worring [21:13:03] apergos: oh hey! 
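The two df readings quoted at 21:09:46 can be sanity-checked with a little arithmetic. A rough sketch, assuming the figures are df -h output in binary units (T = 1024 G) and that Use% is used / (used + available) rounded up, as GNU df does; the inputs are already rounded, so the result is only approximate:

```python
import math

GIB_PER_TIB = 1024

def use_percent(used_gib, avail_gib):
    """Use% as used / (used + available), rounded up to a whole percent."""
    return math.ceil(100 * used_gib / (used_gib + avail_gib))

before = (2.9 * GIB_PER_TIB, 161)  # used, available at the alert
after = (2.6 * GIB_PER_TIB, 489)   # used, available after the recovery
print(use_percent(*before), use_percent(*after))  # prints 95 85
print(after[1] - before[1])                       # prints 328
```

Roughly 328G came back while the partition size stayed about the same, which fits the theory that the killed queries had been holding temporary tables.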
[21:13:13] heyoh [21:13:16] apergos: did you see my tiny input on the dumps discussion? [21:13:24] requires investigation of large user tables or temp tables [21:13:30] I did, are you signed up on the project, awight? [21:13:32] but not immediate problem [21:13:55] if you are you shoulda got my update that I have a draft list of questions out, sort of a 'first top level' round [21:14:02] (03CR) 10Aude: "this is already set for beta (for now, using 'wdqs-test.wmflabs.org')" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 (owner: 10Aude) [21:14:04] yuvipanda: does this need to come w/ a hiera yaml config to set tha true since it's not default? [21:14:06] apergos: ooh, great thanks. I'll subscribge there [21:14:08] which I read your stuff first before doing that [21:14:16] chasemp: I' [21:14:17] please! 'dumps-rewrite' I think it's called [21:14:21] chasemp: I"ve defaulted it to true [21:14:24] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [21:14:35] chasemp: the second param to hiera( is the default value [21:14:48] yeah I'm having an out of body experience [21:14:49] some of your comments will be especially helpful as we start getting more into specifics of implementation [21:14:50] sounds good [21:15:11] chasemp: kk [21:15:13] 6Operations, 10ops-ulsfo, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036617 (10Southparkfan) [21:15:59] (03CR) 10Yurik: "its not set for beta - beta is using production wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 (owner: 10Aude) [21:17:13] (03CR) 10Yurik: [C: 04-1] Don't yet allow wikidatasparql graph urls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 (owner: 10Aude) [21:17:19] (03CR) 10Dzahn: "< milimetric> mutante: yes, limn is still waiting to die, but we're much closer to killing it now" [puppet] - 
10https://gerrit.wikimedia.org/r/231144 (owner: 10Faidon Liambotis) [21:17:28] apergos: hrm, okay robla style, but I'm still unclear on how we proceed from that questions page. I guess we create a spike to investigate each question? [21:17:35] not yet [21:17:45] first is: make sure we got all the questions [21:17:48] and the way we want the [21:17:49] m [21:17:55] then yes we split them all out into tasks [21:18:02] investigate and answer [21:18:04] cool [21:18:11] oppa robla style [21:18:15] *techno beat* [21:18:23] ba da bum [21:18:30] (03PS2) 10BBlack: enable tcp_tw_reuse on caches [puppet] - 10https://gerrit.wikimedia.org/r/271350 (https://phabricator.wikimedia.org/T107749) [21:18:33] * awight feels hip for the first time in decades [21:18:56] * milimetric always thought awight was hip [21:18:57] after that a second round likely of implementation related questions, we'll get into implementation after [21:19:07] aww [21:19:08] at least that's the map I have in my head righ tnow [21:20:10] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2036638 (10ori) >>! In T126700#2036552, @Anomie wrote: > It doesn't help your case that the timeline doesn't fit at all, @ori. @anomie, the timeline is very useful... [21:20:23] PROBLEM - puppet last run on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:24] PROBLEM - RAID on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:24:03] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 [21:24:04] RECOVERY - RAID on restbase1008 is OK: OK: Active: 6, Working: 7, Failed: 0, Spare: 1 [21:24:13] * robla disappears robla style [21:24:22] poof [21:24:29] 6Operations, 10ops-ulsfo, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036666 (10emailbot) **`Rob Halsell`** replied via email on `Wed, 17 Feb 2016 13:24:03 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure... [21:24:32] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2036668 (10BBlack) >>! In T126700#2034670, @Tgr wrote: > Whether responding to a certain request will require session handling is not in general predictable (cf T104... [21:25:58] (03CR) 10BBlack: [C: 032] enable tcp_tw_reuse on caches [puppet] - 10https://gerrit.wikimedia.org/r/271350 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [21:31:00] (03PS1) 10Thcipriani: Beta: Update git_server for scap [puppet] - 10https://gerrit.wikimedia.org/r/271353 [21:31:57] chasemp: ok, am going to merge my hard change [21:32:09] ok [21:32:24] (03PS4) 10Yuvipanda: labs: Allow setting NFS mounts to be soft via hiera [puppet] - 10https://gerrit.wikimedia.org/r/271340 (https://phabricator.wikimedia.org/T127224) [21:32:38] (03PS1) 10Ottomata: Update Cloudera package reprepro updates to CDH 5.5.2 [puppet] - 10https://gerrit.wikimedia.org/r/271355 [21:33:16] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Allow setting NFS mounts to be soft via hiera [puppet] - 10https://gerrit.wikimedia.org/r/271340 (https://phabricator.wikimedia.org/T127224) (owner: 10Yuvipanda) [21:35:08] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - 
https://phabricator.wikimedia.org/T126700#2036710 (10Tgr) >>! In T126700#2036668, @BBlack wrote: > It's important that all HTTP transactions for URL paths which are sensitive to sessions (meaning they can mo... [21:35:39] !log restbase restarted restbase1002 on nodejs v4.3.0 [21:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:36:03] (03PS2) 10Ottomata: Update Cloudera package reprepro updates to CDH 5.5.2 [puppet] - 10https://gerrit.wikimedia.org/r/271355 [21:36:09] (03CR) 10Ottomata: [C: 032 V: 032] Update Cloudera package reprepro updates to CDH 5.5.2 [puppet] - 10https://gerrit.wikimedia.org/r/271355 (owner: 10Ottomata) [21:39:12] (03CR) 10Aude: "@yurik then what is https://github.com/wikimedia/operations-mediawiki-config/blob/5ce61a32ce04c3227a9ee630b19c301b5ef9e785/wmf-config/Comm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 (owner: 10Aude) [21:39:38] (03CR) 10Ottomata: [C: 031] "Seems like this should be in hiera, but none of the other stuff is so, meh? Shall I merge?" [puppet] - 10https://gerrit.wikimedia.org/r/271353 (owner: 10Thcipriani) [21:41:23] (03CR) 10Yurik: "aude, the code uses first protocol listed as "default". The labs adds an extra one, this way someone in labs can write "wikidataquery://w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 (owner: 10Aude) [21:41:53] (03CR) 10Thcipriani: [C: 031] "@Ottomata please do. 
One less patch cherry-picked to beta-puppetmaster :)" [puppet] - 10https://gerrit.wikimedia.org/r/271353 (owner: 10Thcipriani) [21:42:31] (03CR) 10Yuvipanda: [C: 04-2] "-2 for a few hours because you said labs instead of beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 (owner: 10Aude) [21:42:33] !log mathoid deploying ed98ffe9d [21:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:42:39] 6Operations, 10ops-ulsfo, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036739 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 13:42:32 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki... [21:43:14] (03PS2) 10Aude: Don't yet allow wikidatasparql graph urls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 [21:43:46] (03PS2) 10Ottomata: Beta: Update git_server for scap [puppet] - 10https://gerrit.wikimedia.org/r/271353 (owner: 10Thcipriani) [21:43:54] (03CR) 10Ottomata: [C: 032 V: 032] Beta: Update git_server for scap [puppet] - 10https://gerrit.wikimedia.org/r/271353 (owner: 10Thcipriani) [21:44:19] yuvipanda: The files are named that way, don't blame me… but you should surely blame others [21:44:25] :D [21:44:28] yuvipanda: it's CommonSettings-labs.php [21:44:33] that's what i mean :) [21:44:44] I was talking about yurik's 'aude, the code uses first protocol listed as "default". The labs adds an extra one, this way someone in labs can write "wikidataquery://wdqs-test.wmflabs.org/query=..." instead of "wikidataquery:///?query=..." 
(which automatically goes to the production one)' [21:45:09] yeah, either way :D [21:46:02] Next time I'm going to add a disclaimer :P [21:46:14] Or git blame and blame whoever introduced these files :D [21:46:31] (03CR) 10Yurik: [C: 031] Don't yet allow wikidatasparql graph urls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271336 (owner: 10Aude) [21:46:43] hoo: :D [21:46:53] hoo: Krenair aude I wonder if we can rename the files [21:47:07] I guess [21:47:20] you might break labs briefly though [21:47:24] (yes, I'm trolling rihgt now) [21:47:43] :D [21:48:33] hoo: I'm remounting NFS mounts on beta-cluster, so I could be breaking them too [21:49:21] 6Operations, 6Editing-Department, 6Performance-Team, 7Performance: Severe save latency regression - https://phabricator.wikimedia.org/T126700#2036745 (10Anomie) >>! In T126700#2036668, @BBlack wrote: > Side-track, but this seems very scary to me. On this front, we're probably better off with SessionManage... [21:51:14] PROBLEM - puppet last run on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:54:08] (03CR) 10Yurik: [C: 04-1] "Per discussion with SMalyshev on IRC, we think that this is not needed as graphoid is only requesting this data during varnish cache-miss," [puppet] - 10https://gerrit.wikimedia.org/r/271334 (owner: 10Aude) [21:55:13] PROBLEM - RAID on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:56:42] (03PS1) 10Andrew Bogott: Move novaadmin authority from 'testlabs' to 'admin' [puppet] - 10https://gerrit.wikimedia.org/r/271420 [21:56:53] RECOVERY - RAID on restbase1008 is OK: OK: Active: 6, Working: 7, Failed: 0, Spare: 1 [21:57:44] greg-g, i might go over the service depl - for some reason it is in the same spot as the train (not that they would conflict i think) [21:58:08] there are no depls after this for the next two hours, so should be ok [21:59:37] (03PS2) 10Andrew Bogott: Move novaadmin authority from 'testlabs' to 'admin' [puppet] - 10https://gerrit.wikimedia.org/r/271420 [22:00:07] yurik: yeah, they overlap these two weeks due to us moving the train back so that antoine could do it (but, no train this week due to the save time regression) [22:00:32] greg-g, ok, so any issues with me going longer for graphoid? [22:00:39] nope [22:00:44] cool [22:00:53] I mean, don't take 2 more hours :) [22:01:11] i'll try :-P [22:09:57] (03PS1) 10BBlack: tlsproxy: set files ulimit at 2x conns [puppet] - 10https://gerrit.wikimedia.org/r/271423 [22:12:29] 6Operations: Andrew gets pages many hours too late - https://phabricator.wikimedia.org/T127189#2036804 (10Andrew) Oh, notably, msg.fi.google.com didn't exist until last week, approximately when my pages starting being delayed. Surely no coincidence; maybe google degraded their support of the tmo gateway when th... [22:13:39] (03PS2) 10BBlack: tlsproxy: set files ulimit at 2x conns [puppet] - 10https://gerrit.wikimedia.org/r/271423 [22:15:26] (03CR) 10BBlack: [C: 032] tlsproxy: set files ulimit at 2x conns [puppet] - 10https://gerrit.wikimedia.org/r/271423 (owner: 10BBlack) [22:15:35] 6Operations, 10ops-ulsfo, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036810 (10emailbot) **`Rob Halsell`** replied via email on `Wed, 17 Feb 2016 14:15:08 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure... 
[22:15:50] greg-g, nah, better not rush it, will do it tomorrow [22:16:03] is this a new yurik ? :) [22:16:33] 6Operations, 10ops-ulsfo, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036812 (10emailbot) **`Rob Halsell`** replied via email on `Wed, 17 Feb 2016 14:16:07 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure... [22:18:38] :) [22:33:55] PROBLEM - RAID on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:35:43] (03PS3) 10Andrew Bogott: Move novaadmin authority from 'testlabs' to 'admin' [puppet] - 10https://gerrit.wikimedia.org/r/271420 [22:38:09] (03CR) 10Andrew Bogott: [C: 032] Move novaadmin authority from 'testlabs' to 'admin' [puppet] - 10https://gerrit.wikimedia.org/r/271420 (owner: 10Andrew Bogott) [22:39:04] RECOVERY - RAID on restbase1008 is OK: OK: Active: 6, Working: 7, Failed: 0, Spare: 1 [22:39:56] 6Operations, 10ops-ulsfo, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036863 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 14:39:47 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki... [22:40:17] (03PS1) 10BBlack: nginx keepalives: enable for upload [puppet] - 10https://gerrit.wikimedia.org/r/271429 (https://phabricator.wikimedia.org/T107749) [22:40:55] (03PS1) 10Dzahn: fix typo in FQDN of es2015 [dns] - 10https://gerrit.wikimedia.org/r/271430 [22:41:46] (03CR) 10Dzahn: [C: 032] fix typo in FQDN of es2015 [dns] - 10https://gerrit.wikimedia.org/r/271430 (owner: 10Dzahn) [22:44:03] 6Operations, 10Traffic, 5Patch-For-Review: HTTP/1.1 keepalive for local nginx->varnish conns - https://phabricator.wikimedia.org/T107749#2036867 (10BBlack) I'm enabling this for upload now as well, as I've been testing one esams cache with live hacks for a while now and not seen any issues. Will try to keep... 
[22:49:00] 6Operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2036876 (10chasemp) General summary and outline of above and IRC discussions. Desired acquisition: Two servers with 128G of RAM, 32 cores, and >=8T of local disk after RAID1 or 10, and a 10G interfac... [22:49:02] 6Operations, 10ops-ulsfo, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036875 (10emailbot) **`Rob Halsell`** replied via email on `Wed, 17 Feb 2016 14:48:32 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure... [22:49:04] (03CR) 10BBlack: [C: 032] nginx keepalives: enable for upload [puppet] - 10https://gerrit.wikimedia.org/r/271429 (https://phabricator.wikimedia.org/T107749) (owner: 10BBlack) [22:50:55] PROBLEM - puppet last run on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:51:39] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2036883 (10ori) >>! In T124356#2028810, @BB... 
[22:52:54] bblack: ^ [22:56:16] (03PS4) 10Dzahn: exim: rewriting rule for maint-announce@ mail to phab [puppet] - 10https://gerrit.wikimedia.org/r/268851 (https://phabricator.wikimedia.org/T118176) [22:56:20] ori: if our cache ttl caps were lower, that date would already have fallen off on its own :) [22:56:48] nod [22:57:53] (03PS5) 10Dzahn: exim: rewriting rule for maint-announce@ mail to phab [puppet] - 10https://gerrit.wikimedia.org/r/268851 (https://phabricator.wikimedia.org/T118176) [23:00:38] 6Operations, 10ops-ulsfo, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036903 (10emailbot) **`Rob Halsell`** replied via email on `Wed, 17 Feb 2016 15:00:09 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure... [23:03:29] 7Blocked-on-Operations, 6Operations, 6Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2036928 (10MBinder_WMF) The recent Phab upgrade chatter has had my teams ask me to check on this. I think it may have gotten swallowed by the... [23:03:32] !log ori@mira Synchronized php-1.27.0-wmf.13/extensions/MobileFrontend/includes/MobileFrontend.hooks.php: live-hacked debug logging for T124356 (duration: 02m 16s) [23:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:03:39] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2036930 (10Stashbot) {nav icon=file, name=M... 
[23:05:53] 6Operations, 10ops-ulsfo, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2036938 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 15:05:45 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki... [23:07:15] PROBLEM - puppet last run on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:08:45] !log csteipp@tin Synchronized php-1.27.0-wmf.14/includes: add security patches (duration: 01m 35s) [23:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:54] !log csteipp@tin Synchronized php-1.27.0-wmf.14/resources/src/mediawiki/page/patrol.ajax.js: add security patches (duration: 01m 28s) [23:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:35] PROBLEM - puppet last run on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:13:47] which field should I use for creating a non-public task? 'Security' or 'Visible To'? [23:14:29] twentyafterfour, ^ [23:14:52] hi csteipp [23:15:13] Hi Platonides [23:15:39] Platonides: security, but not for much longer, that feature is going away soon [23:15:50] uh [23:16:44] for now it works [23:17:11] I'm not sure if I did it the way I was supposed to: https://phabricator.wikimedia.org/T127247 [23:17:22] but it's private :P [23:17:55] PROBLEM - puppet last run on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[23:18:13] csteipp: maybe it's also interesting for the security group [23:18:39] although it's something to be fixed by ops [23:18:57] Platonides, Yeah, was just looking at it :) [23:19:22] I didn't want to give out ideas :) [23:19:32] Platonides, the 'visible to' field requires a special permission to change directly [23:19:40] Krenair: but I have it :) [23:19:41] given to security, among other groups [23:19:48] if you don't know what it does, don't touch it :p [23:20:31] 6Operations, 10MediaWiki-Interface, 10Traffic, 5MW-1.27-release, and 3 others: Broken mobile edit section links are showing up in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2036991 (10BBlack) I've executed bans for d... [23:21:15] so are any ops around in about an hour to rubber stamp a puppet change or two? I can pare down the amount of changes just to get the upgrade out of the way and let the bigger patches go until next week [23:21:27] *the phab upgrade* [23:22:06] (03PS1) 10Mobrovac: RESTBase: Enable purging and minor config style changes [puppet] - 10https://gerrit.wikimedia.org/r/271436 [23:22:14] 6Operations, 10ops-codfw, 5Patch-For-Review: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2037000 (10Papaul) [23:24:06] 6Operations, 10ops-codfw, 5Patch-For-Review: es2011-es2019 racking and onsite setup tasks - https://phabricator.wikimedia.org/T126006#2037002 (10Papaul) a:5Papaul>3jcrespo @jcrespo Installation complete an all new es servers. [23:24:51] (03CR) 10Dzahn: "this work, tested. 
this ticket was created in phab:" [puppet] - 10https://gerrit.wikimedia.org/r/268851 (https://phabricator.wikimedia.org/T118176) (owner: 10Dzahn) [23:30:48] !log deployed all missing security patches from wmf14 [23:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:32:10] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.68 ms [23:35:39] PROBLEM - puppet last run on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:35:39] RECOVERY - Host mr1-ulsfo.oob is UP: PING OK - Packet loss = 0%, RTA = 87.58 ms [23:37:21] twentyafterfour: greg-g I have 20m until kid things take over, can look? [23:38:37] sure! [23:39:15] are teh changes up now or in an hour, I'm sorry I may be confused :) [23:39:32] * greg-g looks for the changes [23:39:49] (03PS1) 1020after4: update phabricator to release/2016-02-18/1 [puppet] - 10https://gerrit.wikimedia.org/r/271439 (https://phabricator.wikimedia.org/T120013) [23:40:00] chasemp: https://gerrit.wikimedia.org/r/271439 [23:40:03] chasemp: at least these: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:phab_scap3,n,z [23:40:17] oh I see [23:40:29] :) [23:40:56] that last change I just posted is the minimal root change I need [23:40:56] oh, well, twentyafterfour has the better answers for which to review :) [23:41:14] twentyafterfour: I'm going to merge that one then on grounds that it's cool as puppet is disabled there and you know what you are doing [23:41:19] yes? [23:41:35] chasemp: right. but I plan to enable puppet if possible [23:41:44] after testing of course. I've got a 2 hour window [23:41:50] understood post scap + that gets you to sane place [23:41:53] looking at the other two now [23:42:00] https://gerrit.wikimedia.org/r/#/c/269561 seems to do more than it claims to? 
still looking [23:42:07] (03CR) 10Rush: [C: 032] "yep" [puppet] - 10https://gerrit.wikimedia.org/r/271439 (https://phabricator.wikimedia.org/T120013) (owner: 1020after4) [23:42:24] bblack: given the short notice I think the only reasonable thing is to go with the super simple change that chase just merged [23:42:26] i think https://gerrit.wikimedia.org/r/#/c/270960/ is also good [23:42:42] and ^ [23:42:43] just didnt merge it because puppet was disabled and needed a window [23:42:45] what mutante said [23:42:47] +1 [23:42:50] PROBLEM - puppet last run on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:43:07] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/269561/ needs manual rebase fyi [23:43:19] (03PS14) 10Rush: make scap::target use the scap3 package provider [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T127215) (owner: 1020after4) [23:43:33] grr I rebased it today already ..manually [23:43:45] oh I rebased the other one.. [23:44:01] 6Operations, 10ops-ulsfo, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2037112 (10emailbot) **`Rob Halsell`** replied via email on `Wed, 17 Feb 2016 15:41:26 -0800` `Re: [UnitedLayer #118704] SF8 - Wikimedia: PDU nic failure... [23:44:10] just gerrit things [23:44:11] anyway that change is pretty major and although I have it working in labs I am not very sure about all the consequences in prod [23:44:45] fair point [23:44:47] (03CR) 10BBlack: "There seems to be things in here unrelated to the commitmsg, e.g. the creation of the phab-ssh.service systemd service unit file? Was thi" [puppet] - 10https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) (owner: 1020after4) [23:44:50] plus it depends on the other one which depends on deploying scap from apt, which is ready to go but it got reverted once ... 
[23:44:57] on https://gerrit.wikimedia.org/r/#/c/269560/ [23:45:08] that's pretty scap3 specific, I more or less grok it [23:45:20] but any releng cohorts around who know scap well to countersign this? [23:45:22] it's mainly to get scap deployted from apt [23:45:35] it's got 2 +1s from the affected people [23:45:56] ottomata and thcipriani [23:46:04] which were lost with rebases, stupidly [23:46:08] right [23:46:09] oh I see yeah, header clears it out [23:46:09] but you can see them in the history [23:46:19] PROBLEM - RAID on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:46:34] why does that rb1008 check keep spamming? [23:46:35] (03CR) 10Rush: [C: 032] make scap::target use the scap3 package provider [puppet] - 10https://gerrit.wikimedia.org/r/269560 (https://phabricator.wikimedia.org/T127215) (owner: 1020after4) [23:49:10] https://gerrit.wikimedia.org/r/#/c/269942/ got merged and then reverted... [23:49:15] it needs to eb re-reverted [23:50:16] 6Operations, 10ops-ulsfo, 10Traffic, 5Patch-For-Review: ulsfo possible downtime - PDU swaps in both cabinets - https://phabricator.wikimedia.org/T127094#2037173 (10emailbot) **`UnitedLayer Support Ticket System`** replied via email on `Wed, 17 Feb 2016 15:49:33 -0800` `Re: [UnitedLayer #118704] SF8 - Wiki... [23:50:31] what's teh story on https://gerrit.wikimedia.org/r/#/c/269942/ then? [23:50:38] merge, revert and now it's needed for scap things? [23:51:48] (03PS1) 1020after4: Install the scap package from deb instead of trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/271442 (https://phabricator.wikimedia.org/T114363) [23:51:59] PROBLEM - puppet last run on restbase1008 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:52:07] no reason for revert https://gerrit.wikimedia.org/r/#/c/269949/ [23:52:27] https://gerrit.wikimedia.org/r/#/c/271442/ [23:52:53] "not cleared" .. 
[23:53:11] !log redeployed wmf14 patches [23:53:12] there was some fear that it would break deployments [23:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:53:27] it's been tested on beta, should be ok now [23:53:43] twentyafterfour: I don't know that I understand the changeset enough honestly [23:53:48] I woldn't know how to debug this atm [23:53:54] can we get a releng countersign? [23:54:06] esp since it seems to have been reverted for speculative reasons previously [23:54:09] for which one? [23:54:15] https://gerrit.wikimedia.org/r/#/c/271442/ [23:55:11] RECOVERY - RAID on restbase1008 is OK: OK: Active: 6, Working: 7, Failed: 0, Spare: 1 [23:55:11] (03PS1) 10Bmansurov: Enable the structured language overlay and increase the instrumentation rate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271443 (https://phabricator.wikimedia.org/T123980) [23:55:12] thcipriani, marxarelli and I all tested it on beta after the revert. that's what was requested of us [23:55:17] or of me really [23:55:37] one problem was found and fixed with a new version of the scap package [23:56:01] I don't doubt you I just would like one of those ppls to +1 too since I'm not hip to scap things [23:56:40] * thcipriani looks [23:58:59] if it doesn't merge we should revert the other patch since that one depends on the scap package being installed from apt [23:59:25] gotcha, I'm giving tyler a minute here