[00:00:03] (CR) Yuvipanda: [C: 2] librenms: Move to apache::site [puppet] - https://gerrit.wikimedia.org/r/253495 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:00:21] mutante: do you have credentials for librenms?
[00:00:29] we want to reduce the number of services not behind misc-web
[00:00:35] but we already went through that in the past
[00:00:54] so the ones left by now are probably exceptions that we talked about before
[00:01:08] and the ones that are left.. i hope we can use letsencrypt some time soon
[00:01:57] PlasmaFury: no
[00:02:02] yurik: it's swat
[00:02:16] mutante: heh, ok! so librenms.wikimedia.org wfm then
[00:02:25] greg-g, yeah, i just noticed too - got a bit confused because of the TZ change
[00:02:27] as in shows me the login screen
[00:02:56] yea, and a scary fail message
[00:02:57] (PS1) Yuvipanda: Revert "smokeping: Use apache::site" [puppet] - https://gerrit.wikimedia.org/r/253497
[00:03:01] about criminal prosecution
[00:03:04] yeah
[00:03:16] (PS2) Yuvipanda: Revert "smokeping: Use apache::site" [puppet] - https://gerrit.wikimedia.org/r/253497
[00:03:24] (CR) Yuvipanda: [C: 2 V: 2] Revert "smokeping: Use apache::site" [puppet] - https://gerrit.wikimedia.org/r/253497 (owner: Yuvipanda)
[00:05:35] netmon puppet failure notice will show up, is transient
[00:06:11] RoanKattouw, are you swatting?
[00:06:55] jouncebot: did you ping me?
[00:07:23] jouncebot: next
[00:07:23] In 14 hour(s) and 52 minute(s): CI (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151117T1500)
[00:07:26] hrmm
[00:07:48] twentyafterfour: For this week's branch, check out https://gerrit.wikimedia.org/r/#/c/252587/ before creating/deploying branch.
[00:07:59] mutante: also https://phabricator.wikimedia.org/T118787?workflow=create
[00:08:08] mutante: I think that's behind misc web though?
[00:08:27] Krinkle: ok
[00:09:07] PlasmaFury: yes it is. actually.. just assign that to me
[00:09:20] operations: releases.wikimedia.org should be https only and have hsts set - https://phabricator.wikimedia.org/T118787#1809564 (Dzahn) a:Dzahn
[00:09:25] eh, i took it
[00:09:36] eh, I have a late swat
[00:09:40] there might have been a reason but i'll check
[00:09:53] mutante: thanks
[00:09:55] i do remember doing the puppet setup
[00:09:56] Krinkle: you mean that needs to merge before the branch?
[00:10:13] twentyafterfour: Well, something needs to happen. Right now, master would probably be a regression for prod.
[00:10:21] Coordinate with Aaron, I don't know.
[00:10:34] Krinkle: just you and me?
[00:10:50] legoktm: Would appear that way
[00:11:01] (PS1) Yuvipanda: Revert "Revert "smokeping: Use apache::site"" [puppet] - https://gerrit.wikimedia.org/r/253498
[00:11:07] would you like to deploy or should I?
[00:11:11] (CR) Yuvipanda: [C: 2 V: 2] Revert "Revert "smokeping: Use apache::site"" [puppet] - https://gerrit.wikimedia.org/r/253498 (owner: Yuvipanda)
[00:11:17] egads, there's more webserver crap floating around
[00:11:20] including for horizon
[00:11:22] >_>
[00:12:39] Krinkle: I'll assume me?
[00:12:50] Krenair, ostriches, RoanKattouw - is anyone swatting? I would like to sync graphoid service
[00:12:57] legoktm: I'm not sure what you're asking
[00:13:08] yurik: yes
[00:13:11] Oh, you're not the swatter
[00:13:12] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[00:13:12] I see
[00:13:17] Yeah, go ahead first :)
[00:13:30] (CR) Legoktm: [C: 2] REL1_26 knocks on the door [mediawiki-config] - https://gerrit.wikimedia.org/r/252399 (owner: Florianschmidtwelzow)
[00:13:40] Krinkle, are you talking to me or legoktm ?
[00:13:47] to me
[00:13:48] PlasmaFury: smokeping used to have a role class but it was not used, for some reason it directly included the module and stuff .. hrmm https://phabricator.wikimedia.org/rOPUP6c09e459db77f5796967dfd41a084a3a7cc5b713
[00:13:52] Krinkle: thanks
[00:13:58] (CR) jenkins-bot: [V: -1] REL1_26 knocks on the door [mediawiki-config] - https://gerrit.wikimedia.org/r/252399 (owner: Florianschmidtwelzow)
[00:14:12] PlasmaFury: and webserver/apache stuff in that unused role class and whatnot
[00:14:14] (PS3) Legoktm: REL1_26 knocks on the door [mediawiki-config] - https://gerrit.wikimedia.org/r/252399 (owner: Florianschmidtwelzow)
[00:14:14] ...
[00:14:20] gwicke, sca1001 is having issues in citoid? ^^
[00:14:30] yurik: Lego and I are both swatting per the schedule.
[00:14:31] (CR) Legoktm: [C: 2] REL1_26 knocks on the door [mediawiki-config] - https://gerrit.wikimedia.org/r/252399 (owner: Florianschmidtwelzow)
[00:14:45] ori: can you look at webserver::sysctl_settings and see if that's actually useful still?
[00:14:48] if it is it should be in ::apache
[00:14:53] mutante: ^ you too if you've the time
[00:14:56] Krinkle, ok, i will depl tomorrow morning then, no rush
[00:15:07] yurik: The window is free within the hour though.
[00:15:08] let me take a look
[00:15:11] Oh, hah, I promised to SWAT today but got distracted
[00:15:15] (Merged) jenkins-bot: REL1_26 knocks on the door [mediawiki-config] - https://gerrit.wikimedia.org/r/252399 (owner: Florianschmidtwelzow)
[00:15:18] Am I needed or is someone else already doing it?
[00:15:24] we're taking care of it
[00:16:15] yurik: looks okay now
[00:16:16] i missed jouncebot talking about it, but:
[00:16:19] https://gerrit.wikimedia.org/r/#/c/253063/
[00:16:21] and i'm here
[00:16:30] 00:16:25 sync-master failed: Command '['sudo', '-u', 'mwdeploy', '-g', 'wikidev', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/cache/l10n/*.cdb', '--exclude=*.swp', '--no-perms', 'tin.eqiad.wmnet::common', '/srv/mediawiki-staging']' returned non-zero exit status 23
[00:16:44] !log legoktm@tin Synchronized wmf-config/CommonSettings.php: REL1_26 knocks on the door - https://gerrit.wikimedia.org/r/#/c/252399/ (duration: 00m 27s)
[00:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:17:10] https://www.mediawiki.org/w/api.php?action=query&meta=siteinfo shows REL1_26, yay
[00:17:11] (PS1) Yuvipanda: wikimania_scolarships: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253501 (https://phabricator.wikimedia.org/T118786)
[00:17:16] Krinkle: ok, I'm done
[00:17:27] ori: Hm.. half an hour ago rcs redis 01 and 02 were down for 5 min or so per https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors channel redis. Weird.
[00:17:30] legoktm: OK
[00:17:37] (CR) jenkins-bot: [V: -1] wikimania_scolarships: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253501 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:17:40] (CR) Krinkle: [C: 2] fix wrong IP for codfw redis server [mediawiki-config] - https://gerrit.wikimedia.org/r/253063 (owner: Dzahn)
[00:18:04] (CR) jenkins-bot: [V: -1] fix wrong IP for codfw redis server [mediawiki-config] - https://gerrit.wikimedia.org/r/253063 (owner: Dzahn)
[00:18:31] ori: Ah https://gerrit.wikimedia.org/r/#/c/253487/
[00:18:34] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[00:18:39] (PS4) Krinkle: fix wrong IP for codfw redis server [mediawiki-config] - https://gerrit.wikimedia.org/r/253063 (owner: Dzahn)
[00:18:45] (CR) Krinkle: [C: 2] fix wrong IP for codfw redis server [mediawiki-config] - https://gerrit.wikimedia.org/r/253063 (owner: Dzahn)
[00:19:11] (Merged) jenkins-bot: fix wrong IP for codfw redis server [mediawiki-config] - https://gerrit.wikimedia.org/r/253063 (owner: Dzahn)
[00:20:23] ori: Hm.. so, does that mean some events were missed, or would it try again on the other? Assuming they didn't reboot at the same time. I don't remember whether the streams servers listen to both though. I guess not. So different clients missed different events? Kafka yay :)
[00:20:34] PlasmaFury: i don't know, but be careful with that. it's used by a lot. modules/varnish/manifests/common.pp: include webserver::sysctl_settings and "tlsproxy" and memcached ...
[00:20:53] mutante: yeah I haven't touched that yet
[00:21:03] wmf-config/interwiki.json wmf-config/keys.txt wmf-config/x.php still exist on tin, untracked.
[00:21:14] mutante: I got rid of webserver::apache::site, now getting rid of webserver::php5
[00:21:48] !log krinkle@tin Synchronized wmf-config/jobqueue-codfw.php: (no message) (duration: 00m 26s)
[00:21:52] PlasmaFury: what are you replacing it with?
[00:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:24:29] (PS1) Yuvipanda: planet: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253502 (https://phabricator.wikimedia.org/T118786)
[00:24:31] (PS1) Yuvipanda: contint: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253503 (https://phabricator.wikimedia.org/T118786)
[00:24:34] (PS1) Yuvipanda: rt: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253504 (https://phabricator.wikimedia.org/T118786)
[00:24:36] (PS1) Yuvipanda: icinga: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253505 (https://phabricator.wikimedia.org/T118786)
[00:25:27] (CR) jenkins-bot: [V: -1] planet: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253502 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:25:40] (CR) jenkins-bot: [V: -1] contint: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253503 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:25:54] (CR) jenkins-bot: [V: -1] rt: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253504 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:26:07] (PS2) Yuvipanda: rt: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253504 (https://phabricator.wikimedia.org/T118786)
[00:26:09] (PS2) Yuvipanda: icinga: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253505 (https://phabricator.wikimedia.org/T118786)
[00:26:11] (PS2) Yuvipanda: wikimania_scolarships: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253501 (https://phabricator.wikimedia.org/T118786)
[00:26:13] (PS2) Yuvipanda: contint: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253503 (https://phabricator.wikimedia.org/T118786)
[00:26:15] (PS2) Yuvipanda: planet: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253502 (https://phabricator.wikimedia.org/T118786)
[00:26:17] (PS1) Yuvipanda: bugzilla_static: Fix doc to not use webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253506 (https://phabricator.wikimedia.org/T118786)
[00:26:19] (CR) jenkins-bot: [V: -1] icinga: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253505 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:26:53] operations, Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1809635 (RobH)
[00:31:24] operations, Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1809658 (greg) You can merge in phab also, but comments and content don't come along unless you manually copy.
[00:38:33] (CR) Yuvipanda: [C: 2] wikimania_scolarships: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253501 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:38:44] (PS1) Yuvipanda: base: Increase ephemeral port range everywhere [puppet] - https://gerrit.wikimedia.org/r/253508
[00:40:11] (CR) Yuvipanda: [C: 2] planet: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253502 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:42:35] (CR) Yuvipanda: [C: 2] contint: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253503 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:42:56] (CR) Yuvipanda: [C: 2] rt: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253504 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:43:05] Krinkle: i don't think any events were missed, because they were restarted at different times
[00:43:28] (PS4) Dzahn: puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - https://gerrit.wikimedia.org/r/249655
[00:43:40] ori: That would suggest the stream socket servers connect to both redis servers and de-duplicate (or that the redis servers replicate to one another)
[00:43:57] Is that true?
[00:44:29] (CR) Yuvipanda: [C: 2] icinga: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253505 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:44:39] (CR) Yuvipanda: [C: 2] bugzilla_static: Fix doc to not use webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253506 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:44:39] PlasmaFury: did you run puppet on those servers yet?
[00:44:50] that seems really fast to change all those services
[00:44:51] mutante: yup
[00:44:59] mutante: have them all open in different tabs
[00:45:08] doing neon now
[00:45:20] mutante: magnesium had no change because racktables still includes webserver::php5
[00:45:26] so it'll take effect when I change that
[00:45:27] ok, i'm checking the planet one
[00:45:31] I checked :)
[00:45:40] the only difference is that the sysctl stuff is gone and everything else is fine
[00:45:48] alright
[00:46:00] and then bugzilla_static is a doc fix
[00:46:29] ok
[00:46:46] Krinkle: no, they fail over
[00:47:02] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[00:48:16] (PS5) Dzahn: puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - https://gerrit.wikimedia.org/r/249655
[00:48:56] ori: sorry for being persistent, it's okay if it did drop some, just trying to understand. rcstream only takes 1 redis server as argument. So it would fail over at nginx level?
[00:49:28] (PS1) Yuvipanda: racktables: do not use webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253510 (https://phabricator.wikimedia.org/T118786)
[00:49:30] (PS1) Yuvipanda: wikistats: Doc fix to not refer to webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253511 (https://phabricator.wikimedia.org/T118786)
[00:49:32] Krinkle: *sigh* Ok, you really want to know :) I need to remind myself, then. Let me look more carefully.
[00:49:32] (PS1) Yuvipanda: openstack: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253512 (https://phabricator.wikimedia.org/T118786)
[00:49:34] (PS1) Yuvipanda: horizon: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253513 (https://phabricator.wikimedia.org/T118786)
[00:50:04] (CR) Dzahn: [C: 2] puppetcompiler/puppet-tests: mini lint fixes (meta) [puppet] - https://gerrit.wikimedia.org/r/249655 (owner: Dzahn)
[00:50:09] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[00:50:24] (PS1) Yuvipanda: labs: Remove webserver::php5 from old labslamp [puppet] - https://gerrit.wikimedia.org/r/253515 (https://phabricator.wikimedia.org/T118786)
[00:50:46] (PS2) Yuvipanda: racktables: do not use webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253510 (https://phabricator.wikimedia.org/T118786)
[00:50:47] (PS2) Yuvipanda: wikistats: Doc fix to not refer to webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253511 (https://phabricator.wikimedia.org/T118786)
[00:50:49] that's the last of the direct webserver::php5 referencesssssssssssss
[00:50:49] (PS2) Yuvipanda: labs: Remove webserver::php5 from old labslamp [puppet] - https://gerrit.wikimedia.org/r/253515 (https://phabricator.wikimedia.org/T118786)
[00:50:51] (PS2) Yuvipanda: openstack: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253512 (https://phabricator.wikimedia.org/T118786)
[00:50:53] (PS2) Yuvipanda: horizon: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253513 (https://phabricator.wikimedia.org/T118786)
[00:52:44] PlasmaFury, ssssoooo cruelllll
[00:52:59] somebody give him a medal
[00:54:01] (CR) Yuvipanda: [C: 2] racktables: do not use webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253510 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:55:01] (CR) Yuvipanda: [C: 2] wikistats: Doc fix to not refer to webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253511 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:55:38] (CR) Yuvipanda: [C: 2] openstack: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253512 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:56:35] Krinkle: there is no guarantee that no events were lost. To know exactly how much data is dropped, we'd have to do some experiments. Namely, find out whether the rcstream instance fails instantly when the redis server shuts down the socket on the remote end.
[00:57:23] (CR) Yuvipanda: [C: 2] horizon: Stop using webserver::php5 [puppet] - https://gerrit.wikimedia.org/r/253513 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:57:36] (CR) Yuvipanda: [C: 2] labs: Remove webserver::php5 from old labslamp [puppet] - https://gerrit.wikimedia.org/r/253515 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[00:58:08] ori: Right. but let's assume that it does. What happens? From what I can see, existing clients would be screwed, and some time later nginx healthcheck would depool it, upstart will try restarting, which fails until the redis server is back up.
[00:58:56] Krinkle: yes, correct. But IIRC, we were aware of that to begin with, and explicitly did not make guarantees that no event would be dropped
[00:59:08] Right.
[00:59:26] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[01:00:28] ori: Maybe we can try to mitigate that with kafka.
[01:00:46] (PS1) Yuvipanda: netmon: Remowe webserver::apache [puppet] - https://gerrit.wikimedia.org/r/253517
[01:00:50] later
[01:01:08] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[01:01:16] jesus fucking christ labs http://tools.wmflabs.org/watroles/role/webserver::apache
[01:01:26] so many times 'wooo I can remove this' and then 'boo labs'
[01:01:50] * PlasmaFury shall not be deterred
[01:01:56] Krinkle: yes. lots to talk about on that topic, but not right now
[01:04:39] (PS2) Yuvipanda: netmon: Remowe webserver::apache [puppet] - https://gerrit.wikimedia.org/r/253517 (https://phabricator.wikimedia.org/T118786)
[01:04:41] (PS1) Yuvipanda: labs: Introduce role::simplelap [puppet] - https://gerrit.wikimedia.org/r/253518 (https://phabricator.wikimedia.org/T118786)
[01:04:52] Krinkle: https://phabricator.wikimedia.org/T117824 more interesting
[01:04:54] and now I'll replace webserver::apache and webserver::php5 with the 'role:simplelap'
[01:05:47] awww my bash history on terbium is gone :(
[01:05:55] oh well
[01:06:59] (CR) Dzahn: "could you make the role name match the file name and put it in autoload layout" [puppet] - https://gerrit.wikimedia.org/r/253518 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[01:07:48] mutante: yes good point
[01:08:10] operations, Database: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#1809838 (Legoktm) >>! In T115982#1807648, @jcrespo wrote: > Asking @LegoKTM, as he was the one archiving the tag. I arch...
[01:08:19] labslamp.pp, role::simplelap and role::lamp::labs seemed a bit confusing
[01:08:28] yeah so role::lamp::labs should die
[01:08:31] how about that older one
[01:08:33] ok
[01:08:37] I've a ticket for that
[01:08:43] it's already been removed from list of roles you can select
[01:08:53] and role::simplelap will never actually be present there
[01:09:04] I don't want to replace webserver::php5 or webserver::apache with simplelamp
[01:09:10] since they might have their own mysql install that'll conflict
[01:10:26] Krinkle: how are you doing with that, by the way? are you stuck? would you like me to help?
[01:10:52] PlasmaFury: actually, why is the combination of apache and those 2 modules a "labs" thing
[01:10:53] ori: Was done last year and tested once in labs. https://github.com/Krinkle/node-rcstream
[01:11:01] ori: Haven't looked at it since.
[01:11:07] mutante: because we don't want to use something like that in prod?
[01:11:17] PlasmaFury: why is it good in labs but bad in prod?
[01:11:18] Krinkle: no -- I meant T117824
[01:11:22] ori: right, just realised.
[01:11:56] PlasmaFury: i'm trying to think from the "minimal differences between prod and labs" point of view etc
[01:12:05] ori: Flushing my brain from messageblobstore first since I've got it clear now. With T117824, not sure yet. Gonna try a few things in-browser and indeed varnishlog.
[01:12:13] mutante: right, so these are things where people want to just have a /var/www directory and then put files into it
[01:12:17] mutante: so that's unacceptable in prod
[01:12:19] but ok in labs
[01:12:22] gotcha
[01:12:26] ori: One thing I haven't quite figured out yet (but also haven't tried much) is how its parameters interact with each other (e.g. AND or OR).
[01:12:40] seemed to be AND, applied to the request as a whole.
[01:14:01] mutante: actually, the new roles don't have the words 'labs' in them
[01:14:07] mutante: it's just role::simplelap and role::simplelamp
[01:14:18] PlasmaFury: i don't see the relation yet to including ::apache. are we going to use "simplelap" in production then?
[01:14:47] ah, "lamp" as well
[01:15:09] ?
[01:15:27] (CR) Yuvipanda: [C: 2] netmon: Remowe webserver::apache [puppet] - https://gerrit.wikimedia.org/r/253517 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[01:15:38] are you planning to use the "simplap" role in site.pp?
[01:15:48] simplelap
[01:15:51] (CR) Yuvipanda: [C: 2] "(I'll do the autolayout in another patch since lots of labs-specific roles need moving)" [puppet] - https://gerrit.wikimedia.org/r/253518 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[01:15:58] mutante: no?
[01:16:44] PlasmaFury: ok, so i'm confused if it's labs then or not
[01:16:53] re: the naming
[01:16:55] I don't understand the question
[01:17:13] I'm currently replacing all instances of webserver::php5 and webserver::apache in labs LDAP with role::simplelap
[01:17:30] if it's a role for labs, you used role labs:: in that file
[01:18:07] http://tools.wmflabs.org/watroles/role/webserver::php5 is empty now
[01:18:13] I didn't use role labs::
[01:18:20] that's existed for a long time and I'm killing it soon
[01:19:57] http://tools.wmflabs.org/watroles/role/webserver::apache is empty too now! \o/
[01:25:06] (PS1) Yuvipanda: Put simplelamp/simplelap roles in autolayout [puppet] - https://gerrit.wikimedia.org/r/253523
[01:25:13] mutante: ^ autolayout just for just the two classes
[01:25:46] (PS1) Ori.livneh: define etcd_hosts for codfw [puppet] - https://gerrit.wikimedia.org/r/253524
[01:26:02] (PS2) Ori.livneh: define etcd_hosts for codfw [puppet] - https://gerrit.wikimedia.org/r/253524
[01:26:11] (CR) Ori.livneh: [C: 2 V: 2] define etcd_hosts for codfw [puppet] - https://gerrit.wikimedia.org/r/253524 (owner: Ori.livneh)
[01:26:47] (CR) Dzahn: [C: 1] Put simplelamp/simplelap roles in autolayout [puppet] - https://gerrit.wikimedia.org/r/253523 (owner: Yuvipanda)
[01:29:40] PlasmaFury: ok, cool!
[01:33:24] (PS1) Ori.livneh: redis::instance: cast true / false to 'yes' / 'no' [puppet] - https://gerrit.wikimedia.org/r/253525
[01:35:00] Puppet, operations, Performance: Investigate mysterious_sysctl settings and figure out what to do with them - https://phabricator.wikimedia.org/T118812#1809888 (yuvipanda) NEW
[01:35:27] (PS1) Yuvipanda: webserver: Kill webserver::php5 class [puppet] - https://gerrit.wikimedia.org/r/253527 (https://phabricator.wikimedia.org/T118786)
[01:35:29] (PS1) Yuvipanda: webserver: Remove webserver::apache:* [puppet] - https://gerrit.wikimedia.org/r/253528 (https://phabricator.wikimedia.org/T118786)
[01:35:31] (PS1) Yuvipanda: webserver: Remove randomly placed puppet files [puppet] - https://gerrit.wikimedia.org/r/253529 (https://phabricator.wikimedia.org/T118786)
[01:35:33] (PS1) Yuvipanda: base: Move mysterious_sysctl from webserver [puppet] - https://gerrit.wikimedia.org/r/253530 (https://phabricator.wikimedia.org/T118786)
[01:35:39] ori: mutante THAT IS THE END OF THE WEBSERVER MODULE!
[01:35:41] \o/
[01:35:49] haha, woooo!
[01:35:51] that's amazing
[01:35:59] (PS2) Yuvipanda: Put simplelamp/simplelap roles in autolayout [puppet] - https://gerrit.wikimedia.org/r/253523
[01:36:07] (CR) Yuvipanda: [C: 2 V: 2] Put simplelamp/simplelap roles in autolayout [puppet] - https://gerrit.wikimedia.org/r/253523 (owner: Yuvipanda)
[01:36:17] (PS2) Yuvipanda: webserver: Kill webserver::php5 class [puppet] - https://gerrit.wikimedia.org/r/253527 (https://phabricator.wikimedia.org/T118786)
[01:36:25] (PS2) Yuvipanda: webserver: Remove webserver::apache:* [puppet] - https://gerrit.wikimedia.org/r/253528 (https://phabricator.wikimedia.org/T118786)
[01:36:32] (PS2) Yuvipanda: webserver: Remove randomly placed puppet files [puppet] - https://gerrit.wikimedia.org/r/253529 (https://phabricator.wikimedia.org/T118786)
[01:38:09] (CR) Yuvipanda: [C: 2] webserver: Kill webserver::php5 class [puppet] - https://gerrit.wikimedia.org/r/253527 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[01:38:21] (CR) Yuvipanda: [C: 2] webserver: Remove webserver::apache:* [puppet] - https://gerrit.wikimedia.org/r/253528 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[01:38:32] (CR) Yuvipanda: [C: 2] webserver: Remove randomly placed puppet files [puppet] - https://gerrit.wikimedia.org/r/253529 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[01:38:44] (PS2) Ori.livneh: redis::instance: cast true / false to 'yes' / 'no' [puppet] - https://gerrit.wikimedia.org/r/253525
[01:38:46] (PS2) Yuvipanda: base: Move mysterious_sysctl from webserver [puppet] - https://gerrit.wikimedia.org/r/253530 (https://phabricator.wikimedia.org/T118786)
[01:38:58] mutante: ori CR for the last patch would be welcome: https://gerrit.wikimedia.org/r/#/c/253530/
[01:39:32] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[01:41:15] (PS1) Yuvipanda: quarry: Move to redis::instance [puppet] - https://gerrit.wikimedia.org/r/253531
[01:41:25] (CR) Ori.livneh: [C: 1] "looks fine, but run pcc on a varnish host to make sure" [puppet] - https://gerrit.wikimedia.org/r/253530 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[01:42:57] (PS2) Yuvipanda: quarry: Move to redis::instance [puppet] - https://gerrit.wikimedia.org/r/253531
[01:42:59] (PS3) Yuvipanda: base: Move mysterious_sysctl from webserver [puppet] - https://gerrit.wikimedia.org/r/253530 (https://phabricator.wikimedia.org/T118786)
[01:47:29] (CR) Yuvipanda: [C: 2] "> [ 2015-11-17T01:46:45 ] INFO: Nodes: 1 NOOP 0 DIFF 0 ERROR 0 FAIL" [puppet] - https://gerrit.wikimedia.org/r/253530 (https://phabricator.wikimedia.org/T118786) (owner: Yuvipanda)
[01:47:49] (PS1) Dzahn: icinga: add team-operations user to sms group [puppet] - https://gerrit.wikimedia.org/r/253532 (https://phabricator.wikimedia.org/T114661)
[01:49:12] (PS3) Ori.livneh: redis::instance: cast true / false to 'yes' / 'no' [puppet] - https://gerrit.wikimedia.org/r/253525
[01:49:30] (CR) Ori.livneh: [C: 2 V: 2] redis::instance: cast true / false to 'yes' / 'no' [puppet] - https://gerrit.wikimedia.org/r/253525 (owner: Ori.livneh)
[01:49:37] ok that was a noop!
[01:49:43] ori: mutante officially done now!
[01:49:47] * PlasmaFury scratches that off his list
[01:49:49] that feels nice
[01:50:08] that's really awesome
[01:50:08] ori: https://gerrit.wikimedia.org/r/#/c/253531/ moves quarry's redis to redis::instance
[01:50:12] nice work
[01:50:22] :D thanks!
[01:50:42] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api
[01:51:14] (CR) Ori.livneh: [C: -1] "persist => 'aof' has other effects too -- see the erb file" [puppet] - https://gerrit.wikimedia.org/r/253531 (owner: Yuvipanda)
[01:51:18] (PS3) Yuvipanda: quarry: Move to redis::instance [puppet] - https://gerrit.wikimedia.org/r/253531
[01:53:37] ori: hmm, the only other effect I see is that it sets appendfilename
[01:53:42] nothing else
[01:58:03] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[02:00:22] (PS2) Dzahn: icinga: add team-operations user to sms group [puppet] - https://gerrit.wikimedia.org/r/253532 (https://phabricator.wikimedia.org/T114661)
[02:01:57] (CR) Dzahn: [C: 2] icinga: add team-operations user to sms group [puppet] - https://gerrit.wikimedia.org/r/253532 (https://phabricator.wikimedia.org/T114661) (owner: Dzahn)
[02:02:53] Puppet, operations, Patch-For-Review: Remove the webserver module - https://phabricator.wikimedia.org/T118786#1809955 (yuvipanda) Open>Resolved a:yuvipanda AAAND IT IS GONE! I also moved all instances of webserver::apache and webserver::php5 to role::simplelap in labs instances.
[02:07:50] PROBLEM - Check status of defined EventLogging jobs on eventlog2001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/server-side-events-log consumer/mysql-m4-master consumer/client-side-events-log consumer/all-events-log processor/server-side-0 processor/client-side-0 forwarder/server-side-raw forwarder/legacy-zmq
[02:14:10] (PS1) Dzahn: puppetmaster: puppet-lint fixes [puppet] - https://gerrit.wikimedia.org/r/253535
[02:14:59] (CR) jenkins-bot: [V: -1] puppetmaster: puppet-lint fixes [puppet] - https://gerrit.wikimedia.org/r/253535 (owner: Dzahn)
[02:15:17] (PS2) Dzahn: puppetmaster: puppet-lint fixes [puppet] - https://gerrit.wikimedia.org/r/253535
[02:16:07] (CR) jenkins-bot: [V: -1] puppetmaster: puppet-lint fixes [puppet] - https://gerrit.wikimedia.org/r/253535 (owner: Dzahn)
[02:16:41] (PS1) Dzahn: puppetmaster: remove broken incl. of nagios.pp [puppet] - https://gerrit.wikimedia.org/r/253536
[02:17:27] (PS2) Dzahn: puppetmaster: remove broken incl. of nagios.pp [puppet] - https://gerrit.wikimedia.org/r/253536
[02:17:29] (CR) jenkins-bot: [V: -1] puppetmaster: remove broken incl. of nagios.pp [puppet] - https://gerrit.wikimedia.org/r/253536 (owner: Dzahn)
[02:18:06] (CR) Dzahn: "lol what, and backup.pp also does not, next error:" [puppet] - https://gerrit.wikimedia.org/r/253536 (owner: Dzahn)
[02:18:23] (CR) jenkins-bot: [V: -1] puppetmaster: remove broken incl. of nagios.pp [puppet] - https://gerrit.wikimedia.org/r/253536 (owner: Dzahn)
[02:19:52] (PS3) Dzahn: puppetmaster tests: remove broken includes [puppet] - https://gerrit.wikimedia.org/r/253536
[02:21:37] (CR) Dzahn: "jenkins should like after https://gerrit.wikimedia.org/r/#/c/253536/" [puppet] - https://gerrit.wikimedia.org/r/253535 (owner: Dzahn)
[02:21:42] !log l10nupdate@tin Synchronized php-1.27.0-wmf.6/cache/l10n: l10nupdate for 1.27.0-wmf.6 (duration: 06m 26s)
[02:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:22:54] PROBLEM - RAID on db1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:24:43] RECOVERY - RAID on db1034 is OK: OK: optimal, 1 logical, 2 physical
[02:28:25] uh
[02:30:04] PlasmaFury: got paged about slave lag on the same host.. but then recovered
[02:30:12] and it was a test for the icinga mail change :p
[02:30:21] it was delivered to root@ as well
[02:33:03] yeah
[02:33:06] 'tis all ok I guess
[02:40:01] operations, Database: mysql permission request: racktables from krypton - https://phabricator.wikimedia.org/T118816#1809991 (Dzahn) NEW
[02:40:27] operations, Database: mysql permission request: racktables from krypton - https://phabricator.wikimedia.org/T118816#1809999 (Dzahn)
[02:40:28] operations, Patch-For-Review: move racktables and RT to a VM - https://phabricator.wikimedia.org/T105555#1809998 (Dzahn)
[02:42:37] operations, Icinga, Patch-For-Review: make critical icinga services always send email but keep honoring timezones for pages - https://phabricator.wikimedia.org/T114661#1810000 (Dzahn) - added a new special user "team-operations" (like existing team-services), with timezone 24x7 and email address root@ -...
[02:45:10] 6operations, 7Icinga, 5Patch-For-Review: make critical icinga services always send email but keep honoring timezones for pages - https://phabricator.wikimedia.org/T114661#1810001 (10Dzahn) 5Open>3Resolved and the first one we got here was: ** PROBLEM alert - db1034/MariaDB Slave Lag: s7 is CRITICAL **... [02:45:55] 6operations, 7Icinga: make critical icinga services always send email but keep honoring timezones for pages - https://phabricator.wikimedia.org/T114661#1810004 (10Dzahn) [03:07:22] (03CR) 10TTO: "Any issues? Should I schedule this for SWAT? Just keen to get this moving, with a view to possibly rolling this out for production before " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) (owner: 10TTO) [03:09:22] (03PS1) 10Dereckson: Editatón contra la violencia hacia las mujeres throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253541 (https://phabricator.wikimedia.org/T118702) [03:10:02] (03PS2) 10Dereckson: Editatón contra la violencia hacia las mujeres throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253541 (https://phabricator.wikimedia.org/T118702) [03:10:41] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [03:11:23] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [03:13:01] Luke081515|away and Krenair: in wmf-config/throttle.php I see a if ( isset( $options['IP'] ) && !in_array( $ip, (array) $options['IP'] ) ) { continue; } [03:13:24] It's 'IP', not 'ip' the config key (tricky case). 
[03:14:16] (03CR) 10TTO: Editatón contra la violencia hacia las mujeres throttle rule (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253541 (https://phabricator.wikimedia.org/T118702) (owner: 10Dereckson) [03:15:20] (03CR) 10Dereckson: Editatón contra la violencia hacia las mujeres throttle rule (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253541 (https://phabricator.wikimedia.org/T118702) (owner: 10Dereckson) [03:17:23] https://wikimedia.org/api/rest_v1/metrics/pageviews/top/en.wikipedia/all-access/2015/11/11 is giving me intermittent Varnish errors. [03:17:54] > Request from 10.64.0.103 via cp1065 cp1065 ([10.64.0.102]:3128), Varnish XID 684028254
Forwarded for: 50.190.184.240, 10.64.0.103, 10.64.0.103
Error: 503, Service Unavailable at Tue, 17 Nov 2015 03:16:44 GMT [03:18:52] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [03:19:20] (03Abandoned) 10TTO: Restrict changetags right to sysops and bots only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208088 (https://phabricator.wikimedia.org/T97013) (owner: 10TTO) [03:19:22] (03PS3) 10Dereckson: Editatón contra la violencia hacia las mujeres throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253541 (https://phabricator.wikimedia.org/T118702) [03:19:52] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [03:20:56] (03PS1) 10Dzahn: admin: add my new yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/253542 [03:23:11] (03PS1) 10Dereckson: Improve throttle configuration file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253543 [03:27:31] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [03:32:12] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [03:34:02] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [03:34:28] (03CR) 10Cenarium: "The task was for enwiki only though. 
A discussion at meta looked like it favored restricting as well: https://meta.wikimedia.org/wiki/Wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208088 (https://phabricator.wikimedia.org/T97013) (owner: 10TTO) [03:38:51] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [03:44:31] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [03:45:22] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [03:48:11] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [03:50:52] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [04:24:11] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [04:33:41] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [04:41:15] (03CR) 10Yuvipanda: "I can't actually find anything else outside of this that'll be relevant for persist = aof?" [puppet] - 10https://gerrit.wikimedia.org/r/253531 (owner: 10Yuvipanda) [04:41:47] ori: ^ for when you come back on your second wind :) [04:42:09] * PlasmaFury goes afk for a bit, maybe [04:48:32] PROBLEM - puppet last run on mw2083 is CRITICAL: CRITICAL: puppet fail [05:13:46] ori: there's also a problem with the new redis::instance - rename-command takes multiple instances (so I have one instance of the rename-command line for each command I want to rename). 
Not sure the current hash model can support that [05:18:42] RECOVERY - puppet last run on mw2083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:57:02] PlasmaFury: no, it doesn't. I'll have to think about that. shouldn't be too hard. [05:57:18] the desirable behavior would be for it to take a hash, i suppose [05:58:29] 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1810077 (10kaldari) Who would be the person to poke to actually get it upgraded? I know @ArielGlenn used to handle stuff like this a long time ago, but I have no idea who's purview i... [06:01:41] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1810079 (10ori) [06:18:13] !log restbase cassandra: Increased tombstone_threshold from 0.02 to 0.1 for wikipedia and wikimedia html and data-parsoid to reduce single-sstable compaction rate. 
[06:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:30:51] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:52] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:12] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:22] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:32] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:33] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:52] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:13] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:02] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:32] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:42] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:54:31] (03PS1) 10Yuvipanda: labs: Remove the biglogs class [puppet] - 10https://gerrit.wikimedia.org/r/253551 [06:54:33] (03PS1) 10Yuvipanda: labs: Remove role::labs::lvm::volume (unused) [puppet] - 10https://gerrit.wikimedia.org/r/253552 [06:54:35] (03PS1) 10Yuvipanda: mesos: Delete module [puppet] - 10https://gerrit.wikimedia.org/r/253553 [06:57:12] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:57:12] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:57:12] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:31] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [06:57:33] RECOVERY - puppet last 
run on db1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:42] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:52] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:53] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:02] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:13] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:33] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:10] (03CR) 10Muehlenhoff: [C: 031] admin: add my new yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/253542 (owner: 10Dzahn) [07:01:15] (03CR) 10Yuvipanda: [C: 032] labs: Remove the biglogs class [puppet] - 10https://gerrit.wikimedia.org/r/253551 (owner: 10Yuvipanda) [07:01:28] (03CR) 10Yuvipanda: [C: 032] labs: Remove role::labs::lvm::volume (unused) [puppet] - 10https://gerrit.wikimedia.org/r/253552 (owner: 10Yuvipanda) [07:01:47] (03CR) 10Yuvipanda: [C: 032] "Hey goodbye, Mesos..." 
[puppet] - 10https://gerrit.wikimedia.org/r/253553 (owner: 10Yuvipanda) [07:11:26] (03PS1) 10Yuvipanda: simplelamp: Allow overriding mysql datadir path [puppet] - 10https://gerrit.wikimedia.org/r/253554 (https://phabricator.wikimedia.org/T118784) [07:12:27] (03CR) 10Yuvipanda: [C: 032] simplelamp: Allow overriding mysql datadir path [puppet] - 10https://gerrit.wikimedia.org/r/253554 (https://phabricator.wikimedia.org/T118784) (owner: 10Yuvipanda) [07:15:57] (03PS1) 10Ori.livneh: Delete ipython role [puppet] - 10https://gerrit.wikimedia.org/r/253555 [07:16:53] (03CR) 10Yuvipanda: [C: 032 V: 032] "Jupyterhub shall arise in its place at some point soon!" [puppet] - 10https://gerrit.wikimedia.org/r/253555 (owner: 10Ori.livneh) [07:17:29] ori: thanks [07:18:12] thanks [07:44:30] 6operations, 5Patch-For-Review: Delete / decom sitemap.wikimedia.org - https://phabricator.wikimedia.org/T101486#1810163 (10Aklapper) [07:52:26] (03PS1) 10Muehlenhoff: Add a .gitreview file [debs/linux] - 10https://gerrit.wikimedia.org/r/253560 [07:52:28] (03PS1) 10Muehlenhoff: Replace fix for CVE-2015-5307 with now-merged upstream patch Add fix for CVE-2015-8104 (similar vector) [debs/linux] - 10https://gerrit.wikimedia.org/r/253561 [07:52:30] (03PS1) 10Yuvipanda: k8s: Whitelist the resources namespaced users can create [puppet] - 10https://gerrit.wikimedia.org/r/253562 [07:52:50] (03PS1) 10Yuvipanda: k8s: Stop explicitly depending on the jessie-backports repo [puppet] - 10https://gerrit.wikimedia.org/r/253563 [07:52:52] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add a .gitreview file [debs/linux] - 10https://gerrit.wikimedia.org/r/253560 (owner: 10Muehlenhoff) [07:53:19] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Whitelist the resources namespaced users can create [puppet] - 10https://gerrit.wikimedia.org/r/253562 (owner: 10Yuvipanda) [07:53:24] (03CR) 10Muehlenhoff: [C: 032 V: 032] Replace fix for CVE-2015-5307 with now-merged upstream patch Add fix for CVE-2015-8104 (similar vector) [debs/linux] - 
10https://gerrit.wikimedia.org/r/253561 (owner: 10Muehlenhoff) [07:53:32] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Stop explicitly depending on the jessie-backports repo [puppet] - 10https://gerrit.wikimedia.org/r/253563 (owner: 10Yuvipanda) [07:53:43] PlasmaFury: ssl termination [07:54:03] akosiaris: but isnt' that being done by varnish? [07:54:04] we probably don't need apache in front of nodejs these days [07:54:07] yeah [07:54:12] we should get rid of that [07:54:23] oh, also logging [07:54:24] well [07:54:30] not varnish, but varnish + nginx [07:54:31] hmm [07:54:33] it logs the enduser IPs [07:54:45] I see [07:54:46] but that can be approximated by the nginx logs [07:54:48] yeah [07:55:02] there is no longer any really good reason [07:55:15] :D [07:55:15] in fact, it is not proxying websockets [07:55:21] so it is causing some problems [07:55:26] but that can be fixed [07:55:32] 2.4 support doing that [07:55:36] we can make it use nginx [07:55:46] I bet that has way better websocket proxying than apache does [07:55:57] than 2.4 ? maybe not [07:55:57] * PlasmaFury is inherently distrustful of apache, having never used it outside of wordpress [07:56:03] heh [07:56:20] it is really good software though... just a little bit overcomplicated [07:56:25] heh [07:56:28] half the internet runs on it [07:56:31] true [07:56:40] I think I've always just associated it with PHP [07:56:44] not to mention wikipedia!!!! [07:56:48] that's more my fault than apache's ofc [07:56:56] well, there is no more technical reason for us to [07:56:59] outside of all the config [07:57:04] true [07:57:17] but if we go for mod_event or mod_worker [07:57:26] most of nginx's advantages go away [07:57:43] now that nginx is getting slightly more opencore maaaaybeeee [07:57:59] oh the opencore model... 
I hate that [07:58:21] so we can try putting varnish to talk directly to etherpad [07:58:24] and see what happens [07:58:52] hmm lemme first check what does etherpad do for better logging than the current one [07:58:55] (03PS1) 10Yuvipanda: k8s: Make the c in replicationControllers small [puppet] - 10https://gerrit.wikimedia.org/r/253564 [07:58:59] not that I will find a lot ... [07:59:37] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Make the c in replicationControllers small [puppet] - 10https://gerrit.wikimedia.org/r/253564 (owner: 10Yuvipanda) [08:06:09] (03CR) 10Yuvipanda: "@paravoid has to change the settings in all networking gear (which syslog to observium.wikimedia.org) before this can be merged." [dns] - 10https://gerrit.wikimedia.org/r/253491 (https://phabricator.wikimedia.org/T118790) (owner: 10Dzahn) [08:10:12] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [08:11:00] 6operations, 6Labs: Write a diamond collector to collect active ssh sessions - https://phabricator.wikimedia.org/T118827#1810194 (10yuvipanda) 3NEW [08:11:08] 6operations, 6Labs: Write a diamond collector to collect active ssh sessions - https://phabricator.wikimedia.org/T118827#1810201 (10yuvipanda) a:3yuvipanda [08:11:17] 6operations, 6Labs, 10Tool-Labs: Write a diamond collector to collect active ssh sessions - https://phabricator.wikimedia.org/T118827#1810194 (10yuvipanda) [08:11:49] 6operations, 7Availability: Automate the provisioning and management of MediaWiki clusters - https://phabricator.wikimedia.org/T118829#1810213 (10ori) 3NEW [08:19:47] 6operations, 7Availability: Automate the provisioning and management of MediaWiki clusters - https://phabricator.wikimedia.org/T118829#1810224 (10mobrovac) This would be awesome! I don't want to sound pessimistic, but wouldn't this need a mountain of work in `ops/puppet` ? In particular, all those //if this is... 
[08:23:21] 6operations, 7Availability: Automate the provisioning and management of MediaWiki clusters - https://phabricator.wikimedia.org/T118829#1810225 (10yuvipanda) Kubernetes is way too young and is missing several features (and all container orchestrators are kind of crap still at doing stateful services like dbs),... [08:23:34] 6operations, 7Availability: Automate the provisioning and management of MediaWiki clusters - https://phabricator.wikimedia.org/T118829#1810229 (10yuvipanda) [08:24:02] 6operations, 7Availability: Automate the provisioning and management of MediaWiki clusters - https://phabricator.wikimedia.org/T118829#1810213 (10yuvipanda) (I do believe that it can do some of these in 6months-1year, but way too early to be having that conversation specific to kubernetes, IMO) [08:25:12] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [08:32:03] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster tests: remove broken includes [puppet] - 10https://gerrit.wikimedia.org/r/253536 (owner: 10Dzahn) [08:32:28] (03PS4) 10Alexandros Kosiaris: puppetmaster tests: remove broken includes [puppet] - 10https://gerrit.wikimedia.org/r/253536 (owner: 10Dzahn) [08:35:49] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Backport etcd 2.2 to jessie - https://phabricator.wikimedia.org/T118830#1810237 (10Joe) 3NEW [08:37:01] (03CR) 10Alexandros Kosiaris: [V: 032] puppetmaster tests: remove broken includes [puppet] - 10https://gerrit.wikimedia.org/r/253536 (owner: 10Dzahn) [08:38:41] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [08:41:51] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [5000000.0] [08:43:38] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Upgrade the production etcd cluster to 2.2 - 
https://phabricator.wikimedia.org/T118831#1810252 (10Joe) 3NEW [08:47:23] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [08:48:38] 6operations, 10Analytics, 6Services: Wikimedia pageview API intermittently throwing HTTP 503s - https://phabricator.wikimedia.org/T118817#1810260 (10mobrovac) [08:54:22] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [08:54:35] <_joe_> akosiaris: ^^ [08:54:45] <_joe_> strontium acting up again [08:55:53] done [08:56:13] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [08:56:52] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [09:03:16] 6operations, 7Database: mysql permission request: racktables from krypton - https://phabricator.wikimedia.org/T118816#1810275 (10jcrespo) @Dzahn If I do this, please make sure you create another ticket deleting the previous grant after the migration is complete. 
[09:08:15] (03PS1) 10Muehlenhoff: Update the ABI to 2 [debs/linux] - 10https://gerrit.wikimedia.org/r/253570 [09:09:33] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [5000000.0] [09:10:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds [09:12:01] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [09:12:03] 6operations, 7Database: mysql permission request: racktables from krypton - https://phabricator.wikimedia.org/T118816#1810282 (10jcrespo) p:5Triage>3Normal [09:12:32] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [09:13:39] 6operations, 7Database: Adapt wmf-mariadb10 package for jessie or puppetize differently its service to adapt it to systemd - https://phabricator.wikimedia.org/T116903#1810285 (10jcrespo) 5Open>3stalled I am going to stall it, because official packages do not do it. 
[09:14:42] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 1 below the confidence bounds [09:14:43] (03PS2) 10Muehlenhoff: Update the ABI to 2 [debs/linux] - 10https://gerrit.wikimedia.org/r/253570 [09:16:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update the ABI to 2 [debs/linux] - 10https://gerrit.wikimedia.org/r/253570 (owner: 10Muehlenhoff) [09:26:31] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [09:28:53] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0] [09:37:06] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Upgrade conftool to support credentials form a config file - https://phabricator.wikimedia.org/T118833#1810322 (10Joe) 3NEW [09:42:56] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Upgrade python-etcd to 0.4.2+ - https://phabricator.wikimedia.org/T118834#1810332 (10Joe) 3NEW [09:43:32] (03PS4) 10Faidon Liambotis: Remove observium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/253491 (https://phabricator.wikimedia.org/T118790) (owner: 10Dzahn) [09:43:41] (03PS5) 10Faidon Liambotis: Remove observium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/253491 (https://phabricator.wikimedia.org/T118790) (owner: 10Dzahn) [09:47:40] 6operations, 5Continuous-Integration-Scaling: Upload new Zuul packages on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T118340#1810354 (10hashar) The packaging work is held in our Gerrit repo `integration/zuul.git` with the following branches: | `upstream` | 1cc37f7b469a... 
[09:47:45] (03PS2) 10Muehlenhoff: Uninstall wpasupplicant [puppet] - 10https://gerrit.wikimedia.org/r/252916 [09:49:28] (03CR) 10Muehlenhoff: [C: 032 V: 032] Uninstall wpasupplicant [puppet] - 10https://gerrit.wikimedia.org/r/252916 (owner: 10Muehlenhoff) [09:54:30] !log nodetool decommission on restbase2002 [09:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:55:04] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972#1810371 (10Joe) So, as far as a general puppet interface for this would go I envision something like: - we create a provider for the "user" and "group" puppet resou... [09:55:51] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Puppet has 1 failures [09:56:16] (03CR) 10Faidon Liambotis: [C: 032] Remove observium.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/253491 (https://phabricator.wikimedia.org/T118790) (owner: 10Dzahn) [09:57:38] (03PS1) 10Faidon Liambotis: Fix snapshot::cron::primary role include [puppet] - 10https://gerrit.wikimedia.org/r/253574 [09:58:12] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet last ran 4 days ago [09:58:12] (03PS2) 10Faidon Liambotis: Fix snapshot::cron::primary role include [puppet] - 10https://gerrit.wikimedia.org/r/253574 [09:59:03] (03CR) 10Faidon Liambotis: [C: 032] Fix snapshot::cron::primary role include [puppet] - 10https://gerrit.wikimedia.org/r/253574 (owner: 10Faidon Liambotis) [09:59:36] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "It seems good to me, but if you're using fail the message is confusing: those metrics will not be added as the catalog is failing to compi" [puppet] - 10https://gerrit.wikimedia.org/r/252963 (https://phabricator.wikimedia.org/T118398) (owner: 10Filippo Giunchedi) [10:01:12] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [10:02:22] 
RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:03:33] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet last ran 4 days ago [10:04:01] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Puppet last ran 5 days ago [10:05:41] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:05:53] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:06:31] (that's me just fixing shit that have been alerting for 5-6 days) [10:07:42] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [10:09:17] (03CR) 10Filippo Giunchedi: "minor comments, LGTM overall" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/253040 (https://phabricator.wikimedia.org/T117016) (owner: 10BryanDavis) [10:11:16] godog: I'm guessing you have access to the graphite host then? :) [10:12:11] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [500.0] [10:17:48] addshore: I do indeed [10:18:35] I just made this one :) https://phabricator.wikimedia.org/T118836 [10:22:19] if you feel like it ;) [10:29:56] addshore: heh those requests tend to be more best effort and done in batches, could happen in ~2w when I'm clinic duty tho [10:33:37] okay :) [10:33:42] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:45:52] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [10:47:42] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [10:50:29] (03CR) 10Jcrespo: "This was an ok change (puppet compiler was wrong). 
But I will do it differently." [puppet] - 10https://gerrit.wikimedia.org/r/250428 (owner: 10Jcrespo) [10:51:45] (03Abandoned) 10Jcrespo: Cleaning up config, setting dbs to install mariadb10 by default [puppet] - 10https://gerrit.wikimedia.org/r/250428 (owner: 10Jcrespo) [10:52:38] (03PS3) 10Jcrespo: Explicit all mariadb versions for 5.5 vs 10 [puppet] - 10https://gerrit.wikimedia.org/r/253316 [10:56:13] (03CR) 10Jcrespo: [C: 032] Explicit all mariadb versions for 5.5 vs 10 [puppet] - 10https://gerrit.wikimedia.org/r/253316 (owner: 10Jcrespo) [11:03:22] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [11:04:42] (03PS1) 10Jcrespo: Make mariadb10 as the default version and install libjemalloc1 [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/253584 [11:05:54] (03PS1) 10Jcrespo: Bump version on parent repo [puppet] - 10https://gerrit.wikimedia.org/r/253585 [11:06:16] (03PS2) 10Jcrespo: Bump version on parent repo [puppet] - 10https://gerrit.wikimedia.org/r/253585 [11:07:17] (03CR) 10jenkins-bot: [V: 04-1] Bump version on parent repo [puppet] - 10https://gerrit.wikimedia.org/r/253585 (owner: 10Jcrespo) [11:08:12] jenkis, you are too fast [11:12:13] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [3000.0] [11:12:30] (03CR) 10Jcrespo: "@moritz, anything against installing libjemalloc on all mysqls? 
I compile against it on the mysql packages (and it is only already install" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/253584 (owner: 10Jcrespo) [11:21:27] (03CR) 10Muehlenhoff: "The added dependency is fine, but if the mariadb binary from wmf-mariadb10 links against libjemalloc1, then it should rather be a "Depends" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/253584 (owner: 10Jcrespo) [11:22:16] (03CR) 10Jcrespo: "I know :-)" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/253584 (owner: 10Jcrespo) [11:23:03] (03PS1) 10Filippo Giunchedi: swift: set hourly commons upload threshold to 80% [puppet] - 10https://gerrit.wikimedia.org/r/253587 [11:23:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: set hourly commons upload threshold to 80% [puppet] - 10https://gerrit.wikimedia.org/r/253587 (owner: 10Filippo Giunchedi) [11:23:56] 6operations, 10Analytics, 6Services: Wikimedia pageview API intermittently throwing HTTP 503s - https://phabricator.wikimedia.org/T118817#1810559 (10akosiaris) curl with `--compressed` is succeeding every single time. curl with `--compressed` will set **Accept-Encoding: deflate, gzip ** whereas this does... [11:25:21] (03CR) 10Filippo Giunchedi: RESTBase configuration for scap3 deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/252887 (owner: 10Thcipriani) [11:25:43] (03CR) 10Muehlenhoff: [C: 031] "The current approach doesn't hurt (but we should also update the wmf-mariadb once it's updated the next time)." [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/253584 (owner: 10Jcrespo) [11:27:47] (03CR) 10Jcrespo: "That is the intention." 
[puppet/mariadb] - 10https://gerrit.wikimedia.org/r/253584 (owner: 10Jcrespo) [11:29:27] (03PS1) 10Giuseppe Lavagetto: phabricator: add ServerAlias for phab.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/253588 [11:29:29] (03PS1) 10Giuseppe Lavagetto: phabricator: disallow crawling of phab.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/253589 [11:33:26] (03PS2) 10Giuseppe Lavagetto: phabricator: add ServerAlias for phab.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/253588 [11:34:35] (03PS3) 10Filippo Giunchedi: monitoring: fail on graphite metrics using single quotes [puppet] - 10https://gerrit.wikimedia.org/r/252963 (https://phabricator.wikimedia.org/T118398) [11:35:14] (03CR) 10Filippo Giunchedi: "I expanded the commit message to include more rationale on why the check is in place, how does it look?" [puppet] - 10https://gerrit.wikimedia.org/r/252963 (https://phabricator.wikimedia.org/T118398) (owner: 10Filippo Giunchedi) [11:36:16] (03CR) 10Giuseppe Lavagetto: [C: 032] phabricator: add ServerAlias for phab.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/253588 (owner: 10Giuseppe Lavagetto) [11:38:24] (03CR) 10Jcrespo: [C: 032] Make mariadb10 as the default version and install libjemalloc1 [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/253584 (owner: 10Jcrespo) [11:39:18] (03PS3) 10Jcrespo: Bump version on parent repo [puppet] - 10https://gerrit.wikimedia.org/r/253585 [11:40:35] (03CR) 10Jcrespo: [C: 032] Bump version on parent repo [puppet] - 10https://gerrit.wikimedia.org/r/253585 (owner: 10Jcrespo) [11:42:56] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1810567 (10ArielGlenn) Ah I see you are right; well then the rsync is fine. 
[11:47:57] (03PS1) 10Jcrespo: Enabling ferm and performance schema on db1027 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/253591 [11:49:49] (03CR) 10Jcrespo: "@moritzm As I am doing this one host at at time, I will not wait or expect for your +1. Please tell me also if it is useful to notify you " [puppet] - 10https://gerrit.wikimedia.org/r/253591 (owner: 10Jcrespo) [11:50:06] jynus: sure, absolutely! [11:50:21] the spam or the notification? [11:50:38] :-) [11:51:00] don't wait for my +1 and no need to ping me, I'll see the grrit-wm changes anyway [11:51:08] perfect, then! [11:51:41] I will not send you an email 150 times :-) [11:52:16] (03CR) 10Jcrespo: [C: 032] Enabling ferm and performance schema on db1027 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/253591 (owner: 10Jcrespo) [11:53:49] (03PS2) 10Giuseppe Lavagetto: phabricator: disallow crawling of phab.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/253589 [11:54:49] (03PS2) 10Muehlenhoff: Some further finetuning to server groups [puppet] - 10https://gerrit.wikimedia.org/r/253348 [11:55:03] (03CR) 10Muehlenhoff: [C: 032 V: 032] Some further finetuning to server groups [puppet] - 10https://gerrit.wikimedia.org/r/253348 (owner: 10Muehlenhoff) [11:57:49] ACKNOWLEDGEMENT - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [3000.0] Filippo Giunchedi known, swiftrepl running [12:01:52] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0] [12:01:52] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: Puppet has 1 failures [12:03:18] checking [12:05:13] (03PS1) 10Muehlenhoff: Uninstall apport [puppet] - 10https://gerrit.wikimedia.org/r/253593 [12:06:28] ^that is a repo problem [12:08:41] 6operations, 10vm-requests: VM request for OpenLDAP labs servers - 
https://phabricator.wikimedia.org/T118726#1810603 (10akosiaris) Question: I know the previous hosts did have a public (external IP), not sure of the reasons though. Is there any chance we could have the new hosts using internal IPs ? [12:22:18] 6operations, 10vm-requests: VM request for OpenLDAP labs servers - https://phabricator.wikimedia.org/T118726#1810612 (10MoritzMuehlenhoff) That's good point. I'm fairly sure these can use an internal IP instead; the current firewall rules on nembus/neptunium already limit the access to internal IPs only. I'll... [12:26:13] (03PS1) 10ArielGlenn: keep fewer dataset web server logs, add date to filename [puppet] - 10https://gerrit.wikimedia.org/r/253594 (https://phabricator.wikimedia.org/T118739) [12:27:03] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata, 5Patch-For-Review: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1810626 (10ArielGlenn) need to change the file name format for these logs, otherwise it's going ot be very annoying... [12:32:13] !log depool restbase1002 [12:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:55] 6operations, 10Analytics, 6Services: Wikimedia pageview API intermittently throwing HTTP 503s - https://phabricator.wikimedia.org/T118817#1810628 (10mobrovac) >>! In T118817#1810559, @akosiaris wrote: > curl with `--compressed` is succeeding every single time. curl with `--compressed` will set > > **Accept... [12:34:05] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "toollabs installs python-apport AFAICS, not sure apport is needed though." 
[puppet] - 10https://gerrit.wikimedia.org/r/253593 (owner: 10Muehlenhoff) [12:35:56] fixed db2068, for some reason its first apt sources line was deleted [12:39:10] and a glitch on db1027 with a slow query: https://phabricator.wikimedia.org/P2316 [12:41:32] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [12:41:41] PROBLEM - Restbase root url on restbase1002 is CRITICAL: Connection refused [12:42:05] (will investigate later, db1027 is currently depooled) [12:43:16] known about rb1002 ^^ please ignore [12:43:33] RECOVERY - Restbase root url on restbase1002 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.030 second response time [12:45:23] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [12:54:20] 10Ops-Access-Requests, 6operations, 6WMDE-Analytics-Engineering, 10Wikidata, 5Patch-For-Review: Requesting access to dataset-admins for Addshore - https://phabricator.wikimedia.org/T118739#1810673 (10ArielGlenn) After looking at the other rsyncs you do (erbium, oxygen), and considering the other syncs th... 
[12:54:32] PROBLEM - RAID on db1027 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [12:58:42] RECOVERY - puppet last run on db2068 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [13:00:51] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [13:01:52] <_joe_> citoid is not feeling good, taking a look [13:02:42] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [13:10:42] I learned too late about the RAID [13:19:14] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1810680 (10ArielGlenn) Any info from those vendors yet? [13:22:08] 6operations, 10ops-eqiad: Disk failure on db1027 (RAID degraded) - https://phabricator.wikimedia.org/T118848#1810682 (10jcrespo) 3NEW [13:23:23] ACKNOWLEDGEMENT - RAID on db1027 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo Tracked on https://phabricator.wikimedia.org/T108856 [13:30:36] 6operations, 10Analytics, 6Services: Wikimedia pageview API intermittently throwing HTTP 503s - https://phabricator.wikimedia.org/T118817#1810701 (10mobrovac) a:3mobrovac We have debugged this further and hopefully found the root cause: `preq` (the lib used by RESTBase to issue external requests) forces gz... [13:32:15] 6operations, 7Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1810712 (10jcrespo) What about the rest (they are not on a db list)? ``` chwikimedia.ipblocks_old comcomwiki.ipblocks_old wikimania2005wiki tables *_old zh_cnwiki.old ``` Do I leave them there... [13:34:25] 6operations, 10Analytics, 10CirrusSearch, 6Discovery, 7audits-data-retention: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. 
- https://phabricator.wikimedia.org/T118527#1810713 (10ArielGlenn) Adding @Ottomata and a link to T84618 which is still pending with a number of open... [13:35:40] (03PS4) 10Dereckson: Editatón contra la violencia hacia las mujeres throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253541 (https://phabricator.wikimedia.org/T118702) [13:37:33] PROBLEM - Restbase root url on restbase1002 is CRITICAL: Connection refused [13:37:43] PROBLEM - Restbase endpoints health on restbase1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [13:44:46] known ^^ ignore [13:50:54] I'm going to silence it for an hour [13:55:26] thnx godog [13:58:42] RECOVERY - Restbase root url on restbase1002 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.007 second response time [13:58:53] RECOVERY - Restbase endpoints health on restbase1002 is OK: All endpoints are healthy [14:01:38] (03PS1) 10Jcrespo: Repool db1027, depool db1044 (regular maintenance) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253600 [14:03:02] 6operations, 7Database: defragment db1015, db1035 and db1027 - https://phabricator.wikimedia.org/T110504#1810737 (10jcrespo) db1027 defragmented and ready to be deployed. Only blocked by T118848. [14:03:40] 6operations, 7Database: defragment db1015, db1035 and db1027 - https://phabricator.wikimedia.org/T110504#1810739 (10jcrespo) 5Open>3Resolved [14:04:00] (03CR) 10Jcrespo: [C: 04-1] "Do not deploy until T118848 is fixed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/253600 (owner: 10Jcrespo) [14:05:44] (03PS1) 10Addshore: Prune stat1002 /a/mw-log/archive after 30 days [puppet] - 10https://gerrit.wikimedia.org/r/253601 (https://phabricator.wikimedia.org/T118527) [14:06:18] (03Abandoned) 10Addshore: Prune stat1002 /a/mw-log/archive after 30 days [puppet] - 10https://gerrit.wikimedia.org/r/253601 (https://phabricator.wikimedia.org/T118527) (owner: 10Addshore) [14:06:24] <_joe_> I [14:06:31] 6operations, 10Analytics, 10CirrusSearch, 6Discovery, 7audits-data-retention: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1810745 (10Addshore) [14:06:34] <_joe_> sorry, wrong paste [14:26:30] (03PS1) 10Alexandros Kosiaris: Actually return nodes from nodegen [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/253605 [14:26:42] PROBLEM - puppet last run on restbase1002 is CRITICAL: CRITICAL: puppet fail [14:26:59] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Actually return nodes from nodegen [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/253605 (owner: 10Alexandros Kosiaris) [14:29:09] 6operations, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, 7Database, and 4 others: Wikibase dispatchChanges.php runs slow, creates an absurd amount of database connections - https://phabricator.wikimedia.org/T118162#1810791 (10thiemowmde) The #Wikidata team will check if this is resolved in #Wiki... 
[14:30:01] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [14:32:01] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [14:37:52] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [14:37:56] !log stopping pybal on lvs400[12], fallback to lvs400[34] (pybal 1.10 vs 1.12) [14:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:42:33] PROBLEM - pybal on lvs4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [14:42:33] PROBLEM - pybal on lvs4002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [14:42:47] oh balls.... *wonders is you can make a grafana dashboard editable again after unticking that box*..... [14:44:44] (03PS1) 10Alexandros Kosiaris: Bump version to actually reflect the tags [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/253606 [14:45:04] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Bump version to actually reflect the tags [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/253606 (owner: 10Alexandros Kosiaris) [14:45:32] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [14:46:02] addshore: yeah I ran into that before. You can still export the JSON I think [14:46:22] (and then make a new one, and re-import the json, but with the edit thing toggled) [14:46:29] :/ can I do that and get someone with access to the grafana instance to just delete them? 
[14:46:47] otherwise there will end up being misleading legacy stuff in the wrong place :/ [14:46:50] jzerebecki: ^^ [14:47:11] 6operations, 7Database: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#1810817 (10jcrespo) Aaron is using mediawiki itself to query pt-heartbeat. With that in mind I wonder if exposing a port is still interesting for external servi... [14:47:15] I remember someone saying that you can delete on grafana by posting to the server [14:47:21] via a manual curl from inside the cluster [14:47:29] oooooohhhh [14:47:48] Might have been bd808 ^ [14:48:11] RECOVERY - puppet last run on restbase1002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [14:51:06] (03PS3) 10Hashar: contint: setup zuul-merger on scandium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/252336 (https://phabricator.wikimedia.org/T95046) [14:51:06] !log stopping pybal on lvs200[123], fallback to lvs200[456] (pybal 1.10 vs 1.12) [14:51:06] (03PS2) 10Hashar: contint: pool in zuul-merger on scandium [puppet] - 10https://gerrit.wikimedia.org/r/252337 (https://phabricator.wikimedia.org/T95046) [14:51:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:43] 6operations, 5Continuous-Integration-Scaling: Upload new Zuul packages on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T118340#1810828 (10Andrew) ok, good enough for me :) [14:53:05] (03PS6) 10coren: Tools: Puppetize gridengine complex configuration [puppet] - 10https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821) (owner: 10Tim Landscheidt) [14:55:18] hashar: in your email you say 'review and push the zuul.deb for jessie-wikimedia’ is that because the trusty and precise packages are already in the repo? 
[14:55:26] good morning [14:55:33] nop they are not [14:55:49] but I have manually installed the trusty/precise ones already [14:55:53] PROBLEM - pybal on lvs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [14:55:58] !log swift eqiad-prod: set ms-be1019 / ms-be1020 / ms-be1021 weight 1500 [14:56:00] (03PS2) 10Alexandros Kosiaris: salt: Move the role manifests into role module [puppet] - 10https://gerrit.wikimedia.org/r/253342 [14:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:56:02] the jessie package will be required to get zuul installed on scandium [14:56:15] I did not want to abuse root right on that new machine [14:56:22] PROBLEM - pybal on lvs2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [14:56:28] (03CR) 10coren: [C: 032] "Better way to do this, for sure." [puppet] - 10https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821) (owner: 10Tim Landscheidt) [14:56:37] hashar: would you like me to add precise and trusty packages while I’m at it? [14:56:44] addshore: sounds like an excuse to try https://github.com/m110/grafcli [14:56:45] andrewbogott: sure thing! [14:57:02] PROBLEM - pybal on lvs2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [14:57:39] hashar: why ‘main’ rather than ‘universe’? [14:58:08] andrewbogott: no clue. that is the current state. 
We can change them to universe [14:58:12] well, or thirdparty [14:58:16] jzerebecki: you can delete them using the API :) [14:58:31] andrewbogott: yeah or thirdparty :-} Would have to clean up the old packages from main after that [14:58:57] ok, I think I’ll move everything to thirdparty [14:59:03] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [24.0] [14:59:48] andrewbogott: and I found out we have an old jenkins package in precise-wikimedia/main , the last got put to thirdparty [14:59:57] !log stopping pybal on lvs300[12], fallback to lvs300[34] (pybal 1.10 vs 1.12) [15:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:04] hashar andrewbogott: Dear anthropoid, the time has come. Please deploy CI (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151117T1500). [15:00:07] andrewbogott: so you can nuke Jenkins 1.596.2 from precise-wikimedia/main [15:00:10] jouncebot_: ack [15:00:37] (03PS1) 10Alexandros Kosiaris: Puppet compiler: bump version to 0.0.4 [puppet] - 10https://gerrit.wikimedia.org/r/253608 [15:00:42] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Puppet compiler: bump version to 0.0.4 [puppet] - 10https://gerrit.wikimedia.org/r/253608 (owner: 10Alexandros Kosiaris) [15:00:47] addshore: you deleted all of them? [15:00:49] (03PS2) 10Alexandros Kosiaris: Puppet compiler: bump version to 0.0.4 [puppet] - 10https://gerrit.wikimedia.org/r/253608 [15:00:54] (03CR) 10Alexandros Kosiaris: [V: 032] Puppet compiler: bump version to 0.0.4 [puppet] - 10https://gerrit.wikimedia.org/r/253608 (owner: 10Alexandros Kosiaris) [15:01:22] jzerebecki: ya, reload now, I just re loaded them all [15:01:26] <_joe_> akosiaris: uh you patched it? [15:01:26] * hashar presses Ctrl + R[eload] [15:01:28] <_joe_> great! 
[15:01:33] ACKNOWLEDGEMENT - pybal on lvs2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black testing pybal version update [15:01:33] ACKNOWLEDGEMENT - pybal on lvs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black testing pybal version update [15:01:33] ACKNOWLEDGEMENT - pybal on lvs2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black testing pybal version update [15:01:33] ACKNOWLEDGEMENT - pybal on lvs4001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black testing pybal version update [15:01:33] ACKNOWLEDGEMENT - pybal on lvs4002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black testing pybal version update [15:01:59] _joe_: yup. it did not return the autoguessed nodes [15:02:26] andrewbogott: maybe we can switch to #wikimedia-releng or hangouts ? 
[15:03:00] ACKNOWLEDGEMENT - pybal on lvs3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black testing pybal version update [15:03:00] ACKNOWLEDGEMENT - pybal on lvs3002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal Brandon Black testing pybal version update [15:04:53] (03PS1) 10Giuseppe Lavagetto: lvs: stop spamming bgp announcements from lvs1007-12 [puppet] - 10https://gerrit.wikimedia.org/r/253610 [15:04:58] (03PS4) 10Andrew Bogott: contint: setup zuul-merger on scandium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/252336 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:06:35] (03CR) 10Andrew Bogott: [C: 032] contint: setup zuul-merger on scandium.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/252336 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:07:35] (03PS2) 10Giuseppe Lavagetto: lvs: stop spamming bgp announcements from lvs1007-12 [puppet] - 10https://gerrit.wikimedia.org/r/253610 [15:07:59] (03PS3) 10Andrew Bogott: contint: pool in zuul-merger on scandium [puppet] - 10https://gerrit.wikimedia.org/r/252337 (https://phabricator.wikimedia.org/T95046) (owner: 10Hashar) [15:10:27] <_joe_> akosiaris: actually I defined the compiler tag to the latest via hiera [15:10:29] <_joe_> in labs [15:10:43] <_joe_> no point in committing a change every time I want a new version [15:11:29] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/1305" [puppet] - 10https://gerrit.wikimedia.org/r/253610 (owner: 10Giuseppe Lavagetto) [15:11:34] <_joe_> bblack: ^^ [15:12:09] _joe_: well, that latest is not exactly working right now [15:12:39] <_joe_> uhm? [15:12:58] <_joe_> the latest you created or the one that is live and I am actively using? 
[15:13:02] PROBLEM - puppet last run on scandium is CRITICAL: CRITICAL: Puppet has 1 failures [15:13:18] (03PS3) 10Giuseppe Lavagetto: lvs: stop spamming bgp announcements from lvs1007-12 [puppet] - 10https://gerrit.wikimedia.org/r/253610 [15:13:37] _joe_: the discrepancy between those 2 is my problem right now [15:13:42] <_joe_> uhm [15:13:44] (03PS1) 10Hashar: contint: mount ssd on scandium on /srv/ssd [puppet] - 10https://gerrit.wikimedia.org/r/253611 [15:13:50] (03CR) 10BBlack: [C: 031] lvs: stop spamming bgp announcements from lvs1007-12 [puppet] - 10https://gerrit.wikimedia.org/r/253610 (owner: 10Giuseppe Lavagetto) [15:13:50] <_joe_> what's the host of the compiler? [15:13:52] or why it is forced to 0.0.3 instead of 0.0.4 [15:14:01] Debug: Executing 'git clean -df & git checkout . && git diff HEAD..0.0.3 --exit-code' [15:14:02] grrr [15:14:10] oh shit [15:14:13] I just found out [15:14:16] PEBKAC [15:14:18] <_joe_> akosiaris: what's the problem? [15:14:26] me [15:14:29] that's the problem [15:14:29] <_joe_> which keyboard and which chair? [15:14:31] <_joe_> ahha ok [15:14:35] 0.0.3..HEAD? [15:14:37] <_joe_> I thought it was mine :P [15:15:09] https://wikitech.wikimedia.org/wiki/Hiera:Puppet3-diffs [15:15:11] oh the singular '&' [15:15:23] between clean and check [15:15:44] (03PS2) 10Hashar: contint: mount ssd on scandium on /srv/ssd [puppet] - 10https://gerrit.wikimedia.org/r/253611 [15:15:58] (03CR) 10Rush: [C: 031] "thanks, this is good with me. I'll ask upstream why not serve the same robots anyways but good holdover even if they agree" [puppet] - 10https://gerrit.wikimedia.org/r/253589 (owner: 10Giuseppe Lavagetto) [15:15:59] 3 different places that version is defined ... [15:16:04] <_joe_> 3? 
[15:16:12] (03PS1) 10coren: Tool labs: start gridengine-master by default [puppet] - 10https://gerrit.wikimedia.org/r/253612 (https://phabricator.wikimedia.org/T109316) [15:16:16] <_joe_> I thought only hiera and the default class parameter [15:16:20] code, setup.py, hiera [15:16:24] puppet code* [15:16:31] 3 different repos ... [15:16:37] well, for a definition of repo [15:16:39] <_joe_> well puppet code/hiera is 1 place [15:16:44] <_joe_> logically [15:16:58] nope, you are not winning this argument [15:17:00] <_joe_> you know you set default class values and then override them in hiera :) [15:17:14] and nope [15:17:16] <_joe_> also, of course a software has its version in setup.py :P [15:17:18] 0.0.1 is not a default [15:17:35] I love/hate hiera much like I do puppet itself :P [15:17:44] bblack: I am with ya [15:17:47] <_joe_> ok, that is a point [15:17:47] but with less understanding heh [15:18:00] ah my code just updated [15:18:01] yay! [15:18:08] (03CR) 10Andrew Bogott: [C: 032] contint: mount ssd on scandium on /srv/ssd [puppet] - 10https://gerrit.wikimedia.org/r/253611 (owner: 10Hashar) [15:18:17] and finally I get to test my puppet change [15:18:43] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [15:18:43] <_joe_> puppet change or code change? [15:18:58] [ 2015-11-17T15:18:42 ] INFO: Compiling host lvs2005.codfw.wmnet (production) aaah [15:18:59] so nice... [15:19:03] <_joe_> ahahah [15:19:14] let's see how many hosts it compiles that change for now [15:19:15] <_joe_> akosiaris: you're using the catchall? [15:19:35] yes [15:19:40] <_joe_> with my evil puppet-parsed-in-python thing? 
[15:19:46] yup [15:19:51] <_joe_> :P [15:20:01] I assume if that salt change passes a 10% of hosts ok [15:20:05] <_joe_> I guess I did some minor fuckup at the time [15:20:09] it probably is fine all around [15:20:21] <_joe_> well the compiler will keep grinding [15:20:33] <_joe_> but you get a preview of your results every 5 hosts I think [15:20:37] well, it will stop at some point, no ? [15:20:37] !log restbase start deploy of e749f6ff [15:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:44] the more the merrier for me [15:20:46] <_joe_> akosiaris: no it goes on foreverrr :P [15:20:58] oh, like a nodejs process you mean ? [15:21:00] :P [15:21:03] <_joe_> ahahahahaha [15:21:13] and there they go again [15:21:14] ah ah [15:21:16] :P [15:21:25] <_joe_> we should start to make people pay to hear us complain [15:21:30] hashar, andrewbogott deploying? if you finish early, i would like to deploy graphoid nodejs service [15:21:37] (03PS1) 10coren: Add /~dispenser redirect to www.toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/253614 (https://phabricator.wikimedia.org/T116757) [15:21:49] <_joe_> I guess we're funny [15:21:59] <_joe_> oh it's puppetswat day [15:22:01] yurik: still in process [15:22:14] yurik: and there is a swat in 40 minutes [15:23:21] hashar, its a fairly quick git depl sync - plus it shouldn't affect MW sync [15:25:10] 6operations, 10ops-eqiad: Disk failure on db1027 (RAID degraded) - https://phabricator.wikimedia.org/T118848#1810919 (10Cmjohnson) Swapped disk 32-8...will wait until it finished rebuild before swapping the failing disk. 
[15:25:58] (03PS2) 10coren: Add /~dispenser redirect to www.toolserver.org [puppet] - 10https://gerrit.wikimedia.org/r/253614 (https://phabricator.wikimedia.org/T116757) [15:27:16] (03CR) 10coren: [C: 032] "Trivial addition to redirect list" [puppet] - 10https://gerrit.wikimedia.org/r/253614 (https://phabricator.wikimedia.org/T116757) (owner: 10coren) [15:27:56] 6operations, 10Analytics, 10CirrusSearch, 6Discovery, 7audits-data-retention: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1810931 (10Ottomata) I don't know much about the mw-logs. Maybe @bd808 knows more, or who to ask? [15:28:35] (03PS1) 10Hashar: zuul: create git_dir parent directory [puppet] - 10https://gerrit.wikimedia.org/r/253616 [15:29:24] 6operations, 10Analytics, 6Services: Wikimedia pageview API intermittently throwing HTTP 503s - https://phabricator.wikimedia.org/T118817#1810938 (10mobrovac) 5Open>3Resolved [preq PR #9](https://github.com/wikimedia/preq/pull/9) fixed this issue entirely. IT has been deployed and now everything works as... [15:29:36] _joe_: akosiaris: ^^ [15:30:08] !log restbase end deploy of e749f6ff [15:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:20] <_joe_> mobrovac: eheheh [15:30:29] :P [15:30:33] (03CR) 10Hashar: zuul: create git_dir parent directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/253616 (owner: 10Hashar) [15:31:14] (03PS1) 10Andrew Bogott: Ensure that the zuul-merger parent dir exists. [puppet] - 10https://gerrit.wikimedia.org/r/253617 (https://phabricator.wikimedia.org/T95046) [15:31:44] 6operations: Puppet Compiler: Support wildcards, regexps, or 'all hosts' - https://phabricator.wikimedia.org/T114305#1810943 (10akosiaris) 5Open>3stalled With https://gerrit.wikimedia.org/r/#/c/253605/ merged and deployed, leaving the `LIST_OF_NODES` field empty will now instruct the compiler to go through... 
[15:32:06] (03PS4) 10Giuseppe Lavagetto: lvs: stop spamming bgp announcements from lvs1007-12 [puppet] - 10https://gerrit.wikimedia.org/r/253610 [15:32:38] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs: stop spamming bgp announcements from lvs1007-12 [puppet] - 10https://gerrit.wikimedia.org/r/253610 (owner: 10Giuseppe Lavagetto) [15:32:53] (03CR) 10Andrew Bogott: [C: 032] zuul: create git_dir parent directory [puppet] - 10https://gerrit.wikimedia.org/r/253616 (owner: 10Hashar) [15:33:12] <_joe_> akosiaris: it was that stupid? [15:33:26] _joe_: ETOOMUCHRUBY [15:33:28] <_joe_> oh shit [15:33:31] mobrovac: nice! [15:33:35] (03PS5) 10Giuseppe Lavagetto: lvs: stop spamming bgp announcements from lvs1007-12 [puppet] - 10https://gerrit.wikimedia.org/r/253610 [15:33:47] (03CR) 10Giuseppe Lavagetto: [V: 032] lvs: stop spamming bgp announcements from lvs1007-12 [puppet] - 10https://gerrit.wikimedia.org/r/253610 (owner: 10Giuseppe Lavagetto) [15:33:59] i never get the ETOOMUCHRUBY error :P [15:35:22] RECOVERY - puppet last run on scandium is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:36:00] !log reseating pem 0 on cr1-eqiad [15:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:10] PROBLEM - zuul_merger_service_running on scandium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [15:39:22] (03CR) 10Alexandros Kosiaris: [C: 032] RuboCop: fixed Style/TrailingBlankLines offense [puppet] - 10https://gerrit.wikimedia.org/r/253349 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:39:28] (03PS2) 10Alexandros Kosiaris: RuboCop: fixed Style/TrailingBlankLines offense [puppet] - 10https://gerrit.wikimedia.org/r/253349 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:39:35] (03CR) 10Alexandros Kosiaris: [V: 032] RuboCop: fixed Style/TrailingBlankLines offense [puppet] - 10https://gerrit.wikimedia.org/r/253349 
(https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:39:38] 6operations, 10ops-eqiad, 10netops: cr1-eqiad PEM 0 fan failed - https://phabricator.wikimedia.org/T118721#1810958 (10Cmjohnson) reseated pem0, status is the same 2 alarms currently active Alarm time Class Description 2015-11-17 15:38:00 UTC Minor PEM 0 Fan Failed [15:40:32] (03CR) 10Alexandros Kosiaris: [C: 032] RuboCop: fixed Style/Tab offense [puppet] - 10https://gerrit.wikimedia.org/r/253350 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:40:37] (03PS2) 10Alexandros Kosiaris: RuboCop: fixed Style/Tab offense [puppet] - 10https://gerrit.wikimedia.org/r/253350 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [15:43:00] RECOVERY - zuul_merger_service_running on scandium is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [15:47:17] (03PS1) 10Faidon Liambotis: varnish: switch from libGeoIP to libmaxminddb [puppet] - 10https://gerrit.wikimedia.org/r/253619 [15:48:41] PROBLEM - zuul_merger_service_running on scandium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [15:49:05] (03PS2) 10Faidon Liambotis: varnish: switch from libGeoIP to libmaxminddb [puppet] - 10https://gerrit.wikimedia.org/r/253619 (https://phabricator.wikimedia.org/T99226) [15:54:19] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:54:20] Coren, YuviPanda, andrewbogott, chasemp ^^^^^^^ [15:54:20] Yeah, I'm looking at it now. [15:54:51] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [15:55:00] back up? 
[15:55:01] well that seems related :) [15:55:22] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 3.871 second response time [15:55:23] it is up, just veryyy slow [15:59:14] 6operations, 10Analytics, 10CirrusSearch, 6Discovery, 7audits-data-retention: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old. - https://phabricator.wikimedia.org/T118527#1811003 (10EBernhardson) I believe the medawiki logs were rsync'd over at @ironholds request. They are not... [16:00:19] Hi. [16:00:47] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151117T1600). [16:00:47] James_F Dereckson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:02:09] 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1811010 (10hashar) [16:02:12] 6operations, 5Continuous-Integration-Scaling: Upload new Zuul packages on apt.wikimedia.org for Precise / Trusty / Jessie - https://phabricator.wikimedia.org/T118340#1811007 (10hashar) 5Open>3Resolved Andrew uploaded them all :-} Thank you! [16:02:29] * James_F waves. 
[16:02:41] !log re-enabling pybal on lvs400[12] (upgraded to 1.12) [16:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:51] RECOVERY - zuul_merger_service_running on scandium is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger [16:02:55] !log Zuul-merger deployment aborted / uncomplte (I hate puppet) [16:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:03:14] James_F: Dereckson: we are done with the previous operation so you can do the swat safely :-} [16:04:08] I can SWAT: Dereckson: James_F looks like you are both around. [16:04:21] (03Abandoned) 10Hashar: Ensure that the zuul-merger parent dir exists. [puppet] - 10https://gerrit.wikimedia.org/r/253617 (https://phabricator.wikimedia.org/T95046) (owner: 10Andrew Bogott) [16:04:31] RECOVERY - pybal on lvs4001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [16:04:31] thcipriani: Yup. [16:04:31] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [16:04:39] Indeed. 
[16:04:49] Hello :) [16:08:11] (03PS2) 10Thcipriani: Enable VisualEditor for 50% of new accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250472 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:08:13] RECOVERY - pybal on lvs4002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal [16:08:13] that 5xx spike seems to be text cluster in eqiad only [16:08:13] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250472 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:08:13] strange [16:08:13] (03Merged) 10jenkins-bot: Enable VisualEditor for 50% of new accounts on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250472 (https://phabricator.wikimedia.org/T117410) (owner: 10Jforrester) [16:08:31] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [16:08:41] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:09:11] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [60.0] [16:09:12] Coren: what's going on? [16:09:17] Coren: so something is hammering tools? [16:09:58] paravoid: Something caused a write barrier on labstore1001 causing all the dirty buffers to be flushed at once. [16:10:05] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor for 50% of new accounts on eswiki [[gerrit:250472]] (duration: 00m 26s) [16:10:08] ^ James_F check please [16:10:21] well...if you can check anything :) [16:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:26] paravoid: It's done now, and things have recovered. The tools homepage should pass the next check. 
[16:10:41] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 928423 bytes in 3.749 second response time
[16:11:01] thcipriani: The wiki seems fine…
[16:11:03] chasemp: I don't know what caused it yet; the only things I see in the logs for now are symptoms not cause.
[16:11:06] thcipriani: I'll say "working". :-)
[16:11:13] James_F: nice! :)
[16:11:27] (PS3) BryanDavis: scap: Create wrapper script for master-master rsync [puppet] - https://gerrit.wikimedia.org/r/253040 (https://phabricator.wikimedia.org/T117016)
[16:11:29] could you be a bit more specific?
[16:11:32] what are you seeing?
[16:12:06] (CR) BryanDavis: scap: Create wrapper script for master-master rsync (2 comments) [puppet] - https://gerrit.wikimedia.org/r/253040 (https://phabricator.wikimedia.org/T117016) (owner: BryanDavis)
[16:12:47] paravoid: Saw. Basically, I/O writes went very high for a while while the number of dirty pages was going down quickly. Things recovered basically instantaneously when that hit near-zero. I'm seeing dmesg entries about stalled processes (all kcopyd, consistent with buffers flushed) waiting on device.
[16:13:20] Also 'nfsd:peername failed' which are symptoms of timeouts while also waiting on I/O
[16:13:28] (CR) Jforrester: "Scheduled for next Tuesday, 24 November."
[mediawiki-config] - https://gerrit.wikimedia.org/r/250473 (https://phabricator.wikimedia.org/T117410) (owner: Jforrester)
[16:13:28] (PS1) Hashar: zuul: monitor git-daemon on zuul mergers [puppet] - https://gerrit.wikimedia.org/r/253622 (https://phabricator.wikimedia.org/T118856)
[16:13:28] (PS2) Jforrester: Enable VisualEditor for all new accounts on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/250473 (https://phabricator.wikimedia.org/T117410)
[16:13:38] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=labstore1001.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1447776629&g=cpu_report&z=large&c=Labs%20NFS%20cluster%20eqiad
[16:13:45] doesn't look like something that happened once
[16:14:07] operations, Analytics, Analytics-Kanban, Discovery, and 8 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1811043 (Ottomata)
[16:14:18] paravoid: No, I see a couple spikes in the past several hours - all with the same symptoms.
[16:14:24] there is an rsync running...
[16:15:12] (CR) Hashar: [C: 1] "Seems the command is fine:" [puppet] - https://gerrit.wikimedia.org/r/253622 (https://phabricator.wikimedia.org/T118856) (owner: Hashar)
[16:15:12] paravoid: That's every day, and that shouldn't cause buffers being dirtied (it's the read side of the rsync - the write is in codfw)
[16:15:12] RECOVERY - Persistent high iowait on labstore1001 is OK: OK: Less than 50.00% above the threshold [40.0]
[16:15:20] it causes I/O
[16:15:22] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/253541 (https://phabricator.wikimedia.org/T118702) (owner: Dereckson)
[16:15:38] paravoid: Read I/O, yeah.
[16:15:48] Dereckson: sorry for the delay, got caught up making sure I wasn't crazy looking at the timestamps :)
[16:15:54] No problem.
[16:15:59] it's not like the disks for reading are different than the ones for writing, is it?
[16:16:27] (Merged) jenkins-bot: Editatón contra la violencia hacia las mujeres throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/253541 (https://phabricator.wikimedia.org/T118702) (owner: Dereckson)
[16:16:28] dm-0 0.00 0.00 57.00 620.40 235.20 75202.40 222.73 2569.16 31.57 66.47 28.36 1.48 99.92
[16:16:29] operations, Database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1811053 (Reedy) >>! In T54932#1810712, @jcrespo wrote: > What about the rest (they are not on a db list)? > > ``` > chwikimedia.ipblocks_old > comcomwiki.ipblocks_old > > wikimania2005wiki tab...
[16:16:33] dm-0 0.00 0.00 49.20 164.20 3351.20 20912.80 227.40 3459.95 3664.80 18.03 4757.50 4.69 100.00
[16:16:44] dm-0 0.00 0.00 69.80 110.60 3049.60 11825.60 164.91 4780.98 3088.98 48.56 5007.80 5.54 100.00
[16:16:56] that 99.92 and 100.00 at the end is %util
[16:16:57] of dm-0
[16:17:33] this causes i/o wait
[16:17:33] paravoid: No, but I don't see how a readonly rsync is likely to cause all the dirty buffers being flushed at once.
[16:17:33] what are you talking about?
[16:17:41] paravoid: Of course it hits 100%; the rsync is running at ionice Idle so it'll take "everything else" it can. :-)
[16:18:18] paravoid: The high load was caused by (two, in the recent hours) bursts of very high write I/O matching dirty buffers going down.
[16:18:21] RECOVERY - RAID on db1027 is OK: OK: optimal, 1 logical, 2 physical
[16:19:02] where do you see those spikes?
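The three dm-0 lines pasted at 16:16 look like extended iostat output, and the channel reads the last column as %util. A minimal Python sketch, assuming only that the final whitespace-separated field is %util as stated above (the function names and the 99.0 threshold are hypothetical, not taken from the log), pulls that column out and checks whether the device stayed saturated across the samples:

```python
# Samples copied from the 16:16 pastes in the channel.
SAMPLES = [
    "dm-0 0.00 0.00 57.00 620.40 235.20 75202.40 222.73 2569.16 31.57 66.47 28.36 1.48 99.92",
    "dm-0 0.00 0.00 49.20 164.20 3351.20 20912.80 227.40 3459.95 3664.80 18.03 4757.50 4.69 100.00",
    "dm-0 0.00 0.00 69.80 110.60 3049.60 11825.60 164.91 4780.98 3088.98 48.56 5007.80 5.54 100.00",
]


def parse_iostat_line(line):
    """Return (device, util) for one device line.

    Assumption: the last field is %util, as said in the channel.
    The other columns are deliberately not interpreted here.
    """
    fields = line.split()
    return fields[0], float(fields[-1])


def saturated(lines, threshold=99.0):
    """True if every sample is at or above the (hypothetical) threshold."""
    return all(parse_iostat_line(line)[1] >= threshold for line in lines)


if __name__ == "__main__":
    for sample in SAMPLES:
        print(parse_iostat_line(sample))
    print("saturated:", saturated(SAMPLES))
```

As Coren points out, a device pinned near 100% does not by itself identify the cause, since the rsync runs at ionice Idle and will absorb whatever bandwidth is left over.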
[16:19:18] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: Editaton contra la violencia hacia las mujeres throttle rule [[gerrit:253541]] (duration: 00m 27s)
[16:19:21] http://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&c=Labs+NFS+cluster+eqiad&h=labstore1001.eqiad.wmnet&jr=&js=&event=hide&ts=0&v=37.04&m=load_fifteen&vl=+&ti=Fifteen+Minute+Load+Average
[16:19:25] ^ Dereckson sync'd
[16:19:31] I don't understand why you think it's an isolated incident (or two)
[16:19:35] mark: The dirty buffers? I don't think they're in graphite - I saw it live in meminfo
[16:19:45] that graph pretty much proves that it's not, doesn't it?
[16:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:20:05] paravoid: The load graph?
[16:20:08] yes.
[16:20:32] coinciding with http://ganglia.wikimedia.org/latest/graph.php?r=4hr&z=xlarge&h=labstore1001.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Labs+NFS+cluster+eqiad
[16:20:37] or daily view http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=labstore1001.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Labs+NFS+cluster+eqiad
[16:20:51] paravoid: Yup. See the dip in the network writes?
[16:21:06] I see the increase in network in
[16:21:10] paravoid: That's the points where all the buffers got flushed, stalling writes in general.
[16:21:29] (CR) Andrew Bogott: [C: 1] zuul: monitor git-daemon on zuul mergers [puppet] - https://gerrit.wikimedia.org/r/253622 (https://phabricator.wikimedia.org/T118856) (owner: Hashar)
[16:21:29] (ca. 14:30 and 15:50)
[16:21:31] this is going on for two and a half hours
[16:21:47] paravoid: Yes, but it didn't affect performance until those dips.
[16:22:17] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/252911 (https://phabricator.wikimedia.org/T117857) (owner: Dereckson)
[16:23:03] how did it not affect performance, the load avg graphs are all indicative of increased load for the past two and a half hours?
[16:25:02] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Namespace configuration on ru.wikipedia.org [[gerrit:252911]] (duration: 00m 26s)
[16:25:04] ^ Dereckson check please
[16:25:09] paravoid: because the whole system can function pretty well until high load without serious impact; and indeed nothing complained until the bigger incidents. We might increase the sensitivity of the load alert, though, and catch those earlier before they become major.
[16:25:38] what was the underlying cause of the i/o issue?
[16:25:52] where do we have graphs of individual labs instances?
[16:26:08] 252911 tested
[16:26:30] Puppet, operations, Continuous-Integration-Scaling: On Jessie, puppet does not start zuul-merger via init scripts - https://phabricator.wikimedia.org/T118861#1811101 (hashar) NEW a:hashar
[16:26:37] bblack: fastcci-master.fastcci.eqiad.wmflabs has high write traffic atm, I think it's the most significant factor
[16:26:43] Dereckson: thanks
[16:26:45] yeah, that was what I found.
[16:27:00] hence me looking for graphs
[16:27:06] I found nagf, looking at that now
[16:27:27] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/250709 (https://phabricator.wikimedia.org/T115938) (owner: Dereckson)
[16:27:48] it doesn't look to be very new
[16:27:50] https://graphite.wmflabs.org/render/?title=fastcci-master+Network+bytes+last+day&width=800&height=250&from=-1day&hideLegend=false&uniqueLegend=true&target=alias%28fastcci.fastcci-master.network.eth0.rx_byte%2C%22Bytes+received%22%29&target=alias%28fastcci.fastcci-master.network.eth0.tx_byte%2C%22Bytes+sent%22%29
[16:28:07] (Merged) jenkins-bot: Set import sources on en.wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/250709 (https://phabricator.wikimedia.org/T115938) (owner: Dereckson)
[16:28:31] the maps rsync has been running for a long time too
[16:28:34] I think it's related as well
[16:28:57] it fluctuates between 60 and 80% I/O
[16:29:14] (CR) Giuseppe Lavagetto: [C: 2] pybal: fix config dict types [debs/pybal] - https://gerrit.wikimedia.org/r/253325 (owner: Giuseppe Lavagetto)
[16:29:17] paravoid: No, but looking at historical data on labstore doesn't look like it generally causes issues. fastcci seems to burst every 2h, but the labstore graph doesn't see spikes at that period.
[16:29:28] http://ganglia.wikimedia.org/latest/?c=Virtualization%20cluster%20eqiad&h=labvirt1006.eqiad.wmnet&m=network_report&r=day&s=by%20name&hc=4&mc=2
[16:29:38] paravoid: That's expected - it runs at ionice Idle so it'll try to get all the bandwidth it can.
[16:29:40] that correlates pretty well
[16:29:45] yeah
[16:29:56] operations, Continuous-Integration-Scaling, Patch-For-Review: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1811113 (hashar)
[16:29:59] Puppet, operations, Continuous-Integration-Scaling: On Jessie, puppet does not start zuul-merger via init scripts - https://phabricator.wikimedia.org/T118861#1811111 (hashar) Open>Resolved zuul-merger does not have `ensure => running,` so we can stop it manually without having puppet to start...
[16:30:08] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Set import sources on en.wikiversity [[gerrit:250709]] (duration: 00m 27s)
[16:30:11] ^ Dereckson check please
[16:30:11] (Merged) jenkins-bot: pybal: fix config dict types [debs/pybal] - https://gerrit.wikimedia.org/r/253325 (owner: Giuseppe Lavagetto)
[16:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:30:52] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[16:31:03] Can't check that. We need an admin/importer on en.wikiversity. Will leave a comment on the bug asking for feedback.
[16:31:09] it's spiking up again
[16:31:12] Dereckson: kk
[16:31:14] Comment left.
[16:31:17] so I think we'll have another page soon
[16:31:20] ft
[16:31:32] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=labvirt1006.eqiad.wmnet&m=network_report&s=by+name&mc=2&g=network_report&c=Virtualization+cluster+eqiad
[16:32:15] (CR) Alexandros Kosiaris: [C: 2] "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/1306/console says 240 hosts noop.
The rest are not compiling d" [puppet] - https://gerrit.wikimedia.org/r/253342 (owner: Alexandros Kosiaris)
[16:32:21] (PS3) Alexandros Kosiaris: salt: Move the role manifests into role module [puppet] - https://gerrit.wikimedia.org/r/253342
[16:32:24] paravoid: It seems clear to me that the fastcci load /by itself/ isn't sufficient to cause issues, but that right now it's pushing over the edge.
[16:32:30] (CR) Hashar: "So I have held the pooling of the new zuul-merger because puppet did not start the process on scandium ( T118861 ). It was a mistake on my" [puppet] - https://gerrit.wikimedia.org/r/252337 (https://phabricator.wikimedia.org/T95046) (owner: Hashar)
[16:32:57] (CR) Andrew Bogott: [C: 2] zuul: monitor git-daemon on zuul mergers [puppet] - https://gerrit.wikimedia.org/r/253622 (https://phabricator.wikimedia.org/T118856) (owner: Hashar)
[16:33:15] paravoid: Hm. Maybe not.
[16:33:28] paravoid: fastcci has spikes every 2h, but /this one/ seems bigger than usual.
[16:33:44] no, I don't think it's fastcci
[16:34:00] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/250862 (https://phabricator.wikimedia.org/T116527) (owner: Dereckson)
[16:34:11] paravoid: Yeah; otherwise we'd see that every time.
[16:34:50] (PS4) Alexandros Kosiaris: salt: Move the role manifests into role module [puppet] - https://gerrit.wikimedia.org/r/253342
[16:35:01] it was something that was running on the tools-bastion I think
[16:35:05] (CR) Alexandros Kosiaris: [V: 2] salt: Move the role manifests into role module [puppet] - https://gerrit.wikimedia.org/r/253342 (owner: Alexandros Kosiaris)
[16:35:22] (Merged) jenkins-bot: Set $wgCategoryCollation for bs.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/250862 (https://phabricator.wikimedia.org/T116527) (owner: Dereckson)
[16:35:44] yeah https://tools.wmflabs.org/nagf/?project=tools confirms
[16:36:08] tools-exec too
[16:36:37] https://graphite.wmflabs.org/render/?title=tools+cluster+Disk+space+last+day&width=800&height=250&from=-1day&hideLegend=false&uniqueLegend=true&target=aliasByNode%28sum%28tools.*.diskspace.*.byte_avail%29%2C-3%2C-2%29
[16:36:41] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[16:36:41] wth is that?
[16:36:54] thcipriani: 250862 requires to run updateCollation.php
[16:36:58] icicles?
[16:37:50] paravoid: I'm not sure how that metric is measured. tools.*.diskspace.*?
[16:38:35] Puppet, Labs, Patch-For-Review: dynamicproxy: Move list of blocked user agents to hiera - https://phabricator.wikimedia.org/T90844#1811147 (Krenair) Open>Resolved https://gerrit.wikimedia.org/r/#/c/249182/
[16:39:56] thcipriani: according to https://phabricator.wikimedia.org/T52311#536084 it would be `mwscript updateCollation.php --wiki=bswiki --previous-collation=uca-default`
[16:40:12] PROBLEM - puppet last run on mw2060 is CRITICAL: CRITICAL: puppet fail
[16:40:18] Dereckson: thanks, I was just looking for that :)
[16:40:25] (PS1) Ottomata: Can now require python-pykafka on all eventlogging hosts [puppet] - https://gerrit.wikimedia.org/r/253630 (https://phabricator.wikimedia.org/T109567)
[16:40:42] it's all over now
[16:40:54] and graphs aren't enough to find out what was happening
[16:41:15] Hm. No, but that it was on labvirt1006 does give me a narrowed field.
[16:41:35] look at nagf
[16:41:37] it was tools
[16:41:41] the bastion and a couple of exec nodes
[16:41:47] several exec nodes indeed
[16:41:59] Yeah, tools-exec-1221 is the most likely one.
[16:42:12] <_joe_> YuviPanda: still sure we don't want to ditch nfs for k8s? ;)
[16:42:15] I'm looking at the logs to see what ran there during the problem window.
[16:42:25] !log upgrading / re-enabling pybal on lvs200[123].codfw.wmnet (1.10 -> 1.12)
[16:42:27] _joe_: troll
[16:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:42:39] <_joe_> paravoid: :D
[16:43:08] (PS3) Giuseppe Lavagetto: phabricator: disallow crawling of phab.wmfusercontent.org [puppet] - https://gerrit.wikimedia.org/r/253589
[16:43:10] RECOVERY - pybal on lvs2001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal
[16:43:16] Dereckson: ok, syncing, then running update, hopefully won't take too long.
[16:43:30] RECOVERY - pybal on lvs2002 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal
[16:43:51] RECOVERY - pybal on lvs2003 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal
[16:44:01] RECOVERY - puppet last run on mw2060 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[16:45:28] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Set $wgCategoryCollation for bs.wikipedia.org [[gerrit:250862]] (duration: 00m 26s)
[16:45:44] (CR) Ottomata: [C: 2] Can now require python-pykafka on all eventlogging hosts [puppet] - https://gerrit.wikimedia.org/r/253630 (https://phabricator.wikimedia.org/T109567) (owner: Ottomata)
[16:46:10] Dereckson: sync'd, script run.
[16:46:50] bs.wikipedia is 65 212 articles, 335 471 pages
[16:47:09] (PS12) Alexandros Kosiaris: etherpad: Move role into module [puppet] - https://gerrit.wikimedia.org/r/220085
[16:47:47] suspiciously short script run time. "Collations up-to-date." was the only output.
[16:48:13] (PS4) Giuseppe Lavagetto: phabricator: disallow crawling of phab.wmfusercontent.org [puppet] - https://gerrit.wikimedia.org/r/253589
[16:48:14] It could be because uca-bs is very similar to uca-default?
[16:48:25] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/253589 (owner: Giuseppe Lavagetto)
[16:48:32] Reedy_: any idea ? ^.
[16:49:51] thcipriani: what did you run?
[16:49:54] ie what parameters?
[16:50:09] Reedy_: mwscript updateCollation.php --wiki=bswiki --previous-collation=uca-default
[16:50:37] post-syncing https://gerrit.wikimedia.org/r/#/c/250862/1
[16:50:48] paravoid: aha. https://tools.wmflabs.org/dimensioner/index This allows endusers to create a potentially huge dataset from wikidata and writes out a downloadable db for it. And it /just/ finished a job that overlaps (in time) the period in question.
[16:51:26] * Coren digs deeper
[16:52:01] thcipriani: haha
[16:52:10] thcipriani: It's because the previous collation was uppercase
[16:52:19] rre collation is still uppercase?
[16:52:33] thcipriani: mwscript updateCollation.php --wiki=bswiki --previous-collation=uppercase
[16:52:57] ah, ok. Lemme give that a shot.
[16:52:58] Oh, I thought we updated every wiki to uca-default
[16:53:11] Dereckson: look at the top of that list :(
[16:53:15] 13011 13011 'wgCategoryCollation' => array(
[16:53:15] 13012 13012 » 'default' => 'uppercase',
[16:53:50] yeah, seems to be doing some actual work now
[16:54:01] There's 438554 to update
[16:54:43] oh good. Done with 50,000 so far.
[16:55:06] won't be super quick, but shouldn't take it too long
[16:56:07] Reedy_: kk. Thanks for your help! Should have caught that :\
[16:56:22] I stand corrected.
[16:56:24] thcipriani: yeah, either look at the default, or look at the database
[16:56:32] (PS1) Ottomata: Deploy eventlogging from new server code only repo [puppet] - https://gerrit.wikimedia.org/r/253637 (https://phabricator.wikimedia.org/T118863)
[16:58:07] (CR) Ottomata: [C: 2] Deploy eventlogging from new server code only repo [puppet] - https://gerrit.wikimedia.org/r/253637 (https://phabricator.wikimedia.org/T118863) (owner: Ottomata)
[17:00:04] _joe_ moritzm: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151117T1700).
[17:00:05] operations, Database: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#1811237 (csteipp) @jcrespo, interesting. If you're able to get those numbers into nagios, I can probably figure out a way to get it onto the appservers.
[17:00:30] <_joe_> jouncebot: no one submitted patches anyways
[17:03:09] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Puppet has 1 failures
[17:03:19] Dereckson: 438554 rows processed
[17:03:32] thanks for your help!
[17:05:19] You're welcome. Sorry for the default mess.
[17:05:36] np. All's well that ends well.
[17:08:18] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:12:08] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[17:16:05] operations, Analytics, Analytics-Kanban, Discovery, and 8 others: Send HTTP stats about eventlogging-service to statsd - https://phabricator.wikimedia.org/T118869#1811301 (Ottomata) NEW a:Ottomata
[17:16:21] (CR) JanZerebecki: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/253637 (https://phabricator.wikimedia.org/T118863) (owner: Ottomata)
[17:18:34] (PS1) Giuseppe Lavagetto: phabricator: fix serving robots.txt for phab.wmfusercontent.org [puppet] - https://gerrit.wikimedia.org/r/253638
[17:19:02] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/253638 (owner: Giuseppe Lavagetto)
[17:21:54] (PS1) Giuseppe Lavagetto: phabricator: fix typo in virtualhost [puppet] - https://gerrit.wikimedia.org/r/253639
[17:22:32] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/253639 (owner: Giuseppe Lavagetto)
[17:23:32] (PS1) Alexandros Kosiaris: trebuchet: make the role a module [puppet] - https://gerrit.wikimedia.org/r/253640
[17:25:12] <_joe_> akosiaris: you're daring to touch the trebuchet puppet shitshow?
[17:25:15] <_joe_> man you're brave
[17:25:20] operations, Icinga: make critical icinga services always send email but keep honoring timezones for pages - https://phabricator.wikimedia.org/T114661#1811344 (Dzahn) after some discussion on the mailing list, added a new address alerts@ which is an alias for root@ but lets us filter better. we are using a...
[17:26:01] _joe_: it's the puppet compiler that gives me courage
[17:26:31] <_joe_> akosiaris: oh I see, you're preparing to shift the blame in the end :P
[17:26:47] ;-)
[17:32:22] operations, Wikimedia-General-or-Unknown, user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1811368 (Sjoerddebruin)
[17:35:52] _joe_ if you have nothing to puppet SWAT, can i go earlier? CC: greg-g
[17:36:07] i have graphoid scheduled afterwards
[17:36:08] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:36:25] hmm http://halide-lang.org/
[17:36:34] ^ looks interesting for image scaling, maybe
[17:38:15] bblack, wouldn't image scaling be a highly optimized custom solution?
[17:38:38] i would think a more generic language is for various filters
[17:38:50] well obviously halide is a lot more generic than scaling
[17:39:21] but the upside is a very simple high-level description of the scaling algorithm -> code that runs native-cpu or various GPU hardware, etc.
[17:39:34] as opposed to complicated C code with maybe-bugs that only really works on the host CPU
[17:41:56] i would hope that scaling is such an old problem that they have solved and debugged it for all use cases including via GPU. We might introduce more bugs/security issues if we introduce a new language
[17:42:19] "scaling" isn't one thing, there are lots of algorithms with different tradeoffs
[17:42:41] but if we were to introduce access to more than just scaling, this might make good sense ...
image manipulation via Lua?
[17:42:47] bblack: but those algorithms all perform better in node.js
[17:42:47] image libraries implement them in C, we indirectly use the image libraries, and yes the image libraries sometimes have buffer overflows that can be exploited with a crafted image
[17:43:07] operations, Database: Better mysql monitoring for number of connections and processlist strange patterns - https://phabricator.wikimedia.org/T112473#1811410 (jcrespo) @csteipp Are you suggesting the appservers depending on nagios. I would strongly suggest against that. If you mean "copying" what nagios w...
[17:43:23] halide could have bugs too, but in the long run I tend to trust automated code generated by something like halide over hand-crafted C complexity.
[17:44:00] ori, they perform better if run in a JVM under docker inside a VM on top of intel virtualized hardware in an intel simulated mode
[17:44:20] yurik: sure, as long as it's running in the cloud
[17:44:28] * yurik agrees
[17:45:16] the cloud here is very fast today, thanks to a massive storm system moving through TX
[17:45:17] * yurik also thinks we should invest in FPGA for this
[17:45:47] i am pretty sure we can get it running faster with FPGA than with a generic GPU
[17:46:14] well sure :)
[17:46:24] we could fab our own ASICs and be faster than FPGA too
[17:46:41] yurik: does the ZeroOpts cookie encode any data apart from tls yes/no?
[17:46:51] dr0ptp4kt, ^
[17:47:09] I don't think it does, I looked before
[17:47:27] dr0ptp4kt played with it a while ago, i don't remember what he did with it
[17:47:29] (is tls ever not set?)
[17:47:48] it may still have purpose, there's some TLS-optional things going on in the zero world, kinda
[17:47:49] yurik: bequeathed it to bblack for perpetual maintenance? ;)
[17:48:16] (as in, some partners don't zero-rate for TLS, and so we show them a banner saying hey this isn't free because it's TLS, or something.
At least, that was the case at some past transitional point)
[17:48:17] yep
[17:48:45] although I guess now we could just set those statically by partner, since all traffic is TLS
[17:49:02] (if there are any partners left that don't do TLS-compatible IP whitelisting)
[17:50:26] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1811442 (Andrew) Associated patch can be merged on Thursday, 2015-11-19
[17:50:47] bblack, i could also ping dfoy, he might know something about it
[17:50:48] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to rest base and cassandra nodes - https://phabricator.wikimedia.org/T117473#1811443 (Andrew) p:Triage>Normal
[17:51:07] i don't know if there are any non-ip partners at the moment
[17:51:33] probably not effectively in practice
[17:51:48] if they're not whitelisting on IP, and all the traffic is HTTPS, then Zero isn't working there heh
[17:53:18] Ops-Access-Requests, operations: Requesting access to stat1002 for ejegg - https://phabricator.wikimedia.org/T118320#1811472 (Andrew) @Ejegg, note that this ticket is stalled pending Rob's earlier requests.
[17:53:29] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[17:54:31] !log deployed latest graphoid service
[17:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:00:04] yurik: Respected human, time to deploy Graphoid deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151117T1800). Please do the needful.
[18:01:21] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [5000000.0]
[18:02:03] !log upgrading and re-enabling pybal on lvs300[12].esams.wmnet (1.10 -> 1.12)
[18:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:02:19] (PS1) Ori.livneh: wikimedia.vcl: set CP ('Connection Properties') cookie in vcl_deliver [puppet] - https://gerrit.wikimedia.org/r/253645
[18:07:49] bblack: whenever you are done with pybal and have a moment, there's https://gerrit.wikimedia.org/r/#/c/253645/ . I think I may have promised you at some point not to do exactly what that patch is doing.
[18:08:13] ori: on the CP=H2 thing - I assume it's a cookie so RL javascript can see it? do you want to give it some life beyond the session (1day)? should also not send it if already set.
[18:09:06] I assume it's a cookie so RL javascript can see it -- yes. should also not send it if already set. -- good point
[18:09:08] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[18:09:57] re: life beyond the session -- dunno, i imagine some corporate proxy servers etc don't support SPDY / HTTP 2
[18:09:59] can copy the geoip stuff re checking both Cookie and Orig-Cookie
[18:10:06] yep
[18:10:14] ori: yeah true esp with mobile devices / laptops
[18:10:30] !log swapping failing disk 32:8 on db1027..will cause icinga alert (jynus)
[18:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:10:45] although they'll probably put it to sleep with the browser open and it will still be the same "session" for cookie purposes
[18:11:08] cmjohnson1, thanks, it is depooled for now, so no issue
[18:11:31] we could set it to H2 or H1 anytime there's not an existing CP= value or the CP= value doesn't match X-C-P
[18:11:44] (then it would flip if they move with an open session)
[18:12:21] I should change icinga
checks so that if a server is depooled they do not alert
[18:22:49] (CR) BBlack: [C: -1] wikimedia.vcl: set CP ('Connection Properties') cookie in vcl_deliver (1 comment) [puppet] - https://gerrit.wikimedia.org/r/253645 (owner: Ori.livneh)
[18:23:22] ori: put all that in CR. also, I don't remember why I would've said anything about not doing this before. Now I'm worried I forgot a good reason :)
[18:24:02] bblack: I think you may have simply been making the point about analyzing % of traffic with SPDY via varnishlog instead of sending it to the client just so they can send it back to us
[18:24:19] oh I see
[18:24:22] which makes sense, but isn't what i'm doing here
[18:25:09] yeah it would be really fascinating to see how client perf changes if RL becomes unbundled separate reqs for H2/SPDY3
[18:25:17] aside from the cache miss thing which is nifty too
[18:26:11] see krinkle's analysis in https://phabricator.wikimedia.org/T117824#1811333 for a good example
[18:27:53] nice
[18:28:21] tho, thinking about it more, why "should also not send it if already set"?
you potentially save six header bytes (though there's compression)
[18:28:26] but you make the vcl more complicated that way
[18:28:51] could just always set it if SPDY=[123] and always unset it otherwise
[18:28:57] well more than 6, there's the Set-Cookie:\n stuff too
[18:29:10] ok, fair
[18:29:23] we do have outbound header comp for SPDY though
[18:30:33] multiplied by however many reqs, too
[18:30:44] all 16 of our RL requests are sending Set-Cookie, etc
[18:30:50] + images, etc
[18:31:14] actually the images part is kinda fascinating, in that technically upload + text could not share connection properties
[18:31:40] but that seems like a really corner-case thing to even worry about
[18:32:07] (maybe someone has a proxy in the way only for upload.wm.o but not en.wp.o)
[18:32:14] (or vice-versa)
[18:32:38] oh but we're not doing any kind of domain stuff, it's the request hostname only
[18:33:06] yep
[18:33:11] ughhga
[18:33:21] Coren: paravoid did the investigation hit anything?
[18:33:51] ori: do we have data on how bad fetching 16 stylesheets would really be for non-spdy users?
[18:34:04] considering that those would likely be cache hits on the edge
[18:34:13] YuviPanda: Couple of reasonable candidates; nothing definite as it's not really possible to guess just from the graphs.
[18:34:20] gwicke: edge could still be very far
[18:34:35] it's a long way indonesia to ulsfo
[18:34:38] *from
[18:34:40] _joe_: at least with the way we do k8s NFS outage will only affect tools that use NFS
[18:34:53] ori: sure, just wondering if we actually measured this
[18:34:53] i don't see how it wouldn't be catastrophic for such users
[18:35:07] no, so you are right that we should not speak confidently
[18:35:17] still, i would be very surprised if our intuitions about this were off
[18:35:39] it would run contrary to a lot of what we know about page performance
[18:35:47] Coren: ok! do write up a wiki page or a phab ticket or somesuch maybe.
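The rule floated at 18:11 (set the cookie to H2 or H1 whenever there is no existing CP= value or the existing value does not match the connection's actual property) can be modelled outside VCL. The sketch below is plain Python, not the VCL in the patch under review; `cp_cookie_to_send` and its arguments are hypothetical names, and it looks only at a single Cookie header rather than both Cookie and Orig-Cookie as suggested in the channel.

```python
def current_cp(cookie_header):
    """Return the CP value from a Cookie header string, or None if absent."""
    for part in (cookie_header or "").split(";"):
        name, _, value = part.strip().partition("=")
        if name == "CP":
            return value
    return None


def cp_cookie_to_send(cookie_header, connection_property):
    """Decide what CP value, if any, the response should set.

    connection_property is "H2" or "H1" (hypothetical stand-in for the
    X-C-P value mentioned in the channel). Returns None when the client
    already holds the matching value, so no Set-Cookie is needed.
    """
    if current_cp(cookie_header) == connection_property:
        return None
    return connection_property


if __name__ == "__main__":
    print(cp_cookie_to_send("", "H2"))              # no cookie yet
    print(cp_cookie_to_send("foo=1; CP=H2", "H2"))  # already matches
    print(cp_cookie_to_send("CP=H1", "H2"))         # client moved to an H2 path
```

The third case is the "flip" described at 18:11:44: a client that keeps a session open while moving between networks gets its cookie rewritten on the next response. The simpler alternative raised at 18:28:51 (always set or unset) trades those saved header bytes for less branching in the VCL.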
[18:36:58] 16 also seems like a very large number
[18:37:14] most pages don't have that many page-dependent styles, afaik
[18:38:54] I wouldn't be surprised if having one consolidated default stylesheet plus individual loading of per-page ones would work out okay for non-spdy users
[18:40:25] that would be possible, but "one consolidated default stylesheet" is not something that currently exists; RL doesn't know whether a module is needed on all pages or only a small subset. if it did, it would know to fetch rarely-needed modules in a separate request.request rarely-needed
[18:40:31] s/\..*/
[18:41:12] so it is possible that you are right, but it would require additional work
[18:41:48] (PS3) Dzahn: puppetmaster: puppet-lint fixes [puppet] - https://gerrit.wikimedia.org/r/253535
[18:42:11] however, that work would still be useful in a spdy only world, as compression of one large blob will be better than the same as fragments, assuming it's needed on every view
[18:42:40] akosiaris: if you are around, I have questions about librsvg. It looks like you updated it in November, but now the version on Carbon is different from the build tree in Gerrit.
[18:42:46] maybe, assuming the blob is sufficiently stable
[18:43:00] if its constituent modules changed often enough, that could easily offset any improvement in compression
[18:43:33] yeah, good point
[18:44:43] Blocked-on-Operations, operations, Commons, Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1811704 (Andrew) It looks like Alexandros upgraded things in November. Here's what we currently serve up for Trusty: root@carbon:~# reprepro list trusty...
[18:45:21] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/1309/" [puppet] - https://gerrit.wikimedia.org/r/253535 (owner: Dzahn)
[18:45:28] I mean, you could be right; you're not saying anything completely implausible.
I would measure this if I had more time or if I had a very good reason to believe that my intuitions could be way off, but absent those things this question goes into a large backlog. [18:45:58] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: Decom observium.wikimedia.org - https://phabricator.wikimedia.org/T118790#1811716 (10Andrew) p:5Triage>3Normal [18:46:49] PROBLEM - HHVM processes on mw1158 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:46:54] ori: I think it's important that we figure out a direction for RL in a SPDY world, though [18:47:09] PROBLEM - nutcracker process on mw1158 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:24] even if the steps we take in that direction initially are baby steps [18:47:28] PROBLEM - salt-minion processes on mw1158 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:46] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: Decom observium.wikimedia.org - https://phabricator.wikimedia.org/T118790#1811731 (10Dzahn) I _think_ this is done. it's been deleted from DNS and the redirect was removed. not sure if there is anything else to it. [18:48:11] gwicke: do you think the patch above is counterproductive, or do you simply worry about the absence of a bigger plan? [18:48:29] PROBLEM - dhclient process on mw1158 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:49:07] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: Decom observium.wikimedia.org - https://phabricator.wikimedia.org/T118790#1811738 (10Dzahn) ..maybe there is a database for this that should be removed or mysql grants for it [18:50:27] ori: it looks like a temporary work-around to me, rather than a first step towards RL working better in a mostly-HTTP2 / SPDY world [18:50:51] unbundling everything isn't ideal for most of our clients with HTTP2 support [18:51:03] why not? 
[18:51:25] it increases the download size without any apparent advantage [18:51:56] your point about fast-changing modules is a valid one, but we can group things by stability [18:51:56] sure it does; it avoids collateral cache invalidation [18:53:14] if all the modules change all the time, then yeah, perhaps complete unbundling could be a net win [18:53:16] I mean, what you're saying is not insane or anything, I have had the same thoughts. But at some point the balance tips toward the simple thing that gets most of the benefits. [18:53:21] (03CR) 10Ottomata: [C: 032 V: 032] allow config lines up to 4K length [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/253476 (owner: 10BBlack) [18:53:33] 6operations: Kernel errors on mw1158 - https://phabricator.wikimedia.org/T118888#1811767 (10Andrew) 3NEW [18:53:35] they do change often [18:53:48] bblack: just merged ^. thanks. [18:54:40] !log depooled mw1158 due to kernel errors. https://phabricator.wikimedia.org/T118888 [18:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:52] (03PS2) 10Ori.livneh: wikimedia.vcl: set CP ('Connection Properties') cookie in vcl_deliver [puppet] - 10https://gerrit.wikimedia.org/r/253645 [18:54:52] ori: if you see complete unbundling to be the way forward in general, then your patch makes sense [18:54:59] andrewbogott: my bash_history says sudo -E reprepro --ignore=wrongdistribution include precise-wikimedia librsvg_2.36.1-1wm2_amd64.changes [18:55:12] I have no idea where that 2.40 for trusty-wikimedia came from [18:55:30] akosiaris: ok. That’s unfortunate [18:55:39] gwicke: I do, because I think unbundling has an advantage for developer ergonomics as well; it is better if the naive approach just works. 
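The trade-off argued above (one big bundle versus unbundled modules, and "collateral cache invalidation") can be made concrete with a toy calculation. This is only a minimal sketch: the module names, sizes, and the `bytes_refetched` helper are hypothetical, and it deliberately ignores the compression gains and per-request overhead that were also raised in the discussion.

```python
from typing import Dict, Iterable, List, Set


def bytes_refetched(
    module_sizes: Dict[str, int],
    bundles: Iterable[Set[str]],
    changed: Set[str],
) -> int:
    """Bytes a returning client re-downloads after the `changed` modules are updated.

    Each bundle models one cached response. If any module inside a bundle
    changed, the whole bundle is invalidated and has to be fetched again.
    """
    total = 0
    for bundle in bundles:
        if bundle & changed:
            total += sum(module_sizes[name] for name in bundle)
    return total


# Hypothetical modules and sizes in bytes, purely for illustration.
SIZES: Dict[str, int] = {"base": 40_000, "skin": 25_000, "gadget": 5_000}

ONE_BUNDLE: List[Set[str]] = [{"base", "skin", "gadget"}]
UNBUNDLED: List[Set[str]] = [{"base"}, {"skin"}, {"gadget"}]
# "group things by stability": stable modules together, fast-changing one apart.
BY_STABILITY: List[Set[str]] = [{"base", "skin"}, {"gadget"}]

if __name__ == "__main__":
    changed = {"gadget"}  # assume only the fast-changing module was updated
    for label, layout in [
        ("one bundle", ONE_BUNDLE),
        ("unbundled", UNBUNDLED),
        ("by stability", BY_STABILITY),
    ]:
        print(label, bytes_refetched(SIZES, layout, changed))
```

With these made-up numbers, a change to the small fast-changing module forces a refetch of everything in the single-bundle layout, while the unbundled and stability-grouped layouts only refetch that one module. Whether that outweighs better compression of one large blob depends on how often the modules really change, which is exactly the open question in the conversation.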
[18:56:22] bblack: updated [18:56:29] andrewbogott: I see trusty has by default 2.40 [18:56:38] 2.40.2-1 in fact [18:56:44] 6operations: Kernel errors on mw1158 - https://phabricator.wikimedia.org/T118888#1811787 (10Andrew) Lots of icinga alerts for this host. dmesg says: [10660830.146766] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756! [10660830.153540] invalid opcode: 0000 [#38] SMP [10660830.157993] Modules linked in... [18:56:51] ori: no disagreement there; I've been pushing for REST APIs for a while now.. [18:56:53] paravoid might know [18:57:00] re akosiaris [18:57:08] ? [18:57:15] need to read backlog ? [18:57:15] i remember him having to do something with librsvg, maybe build it with a wmf-specific patch or something [18:57:25] oh [18:57:27] ok thanks [18:57:33] akosiaris: that would add up except those boxes are running 2.40.2-1+wm1 [18:57:37] so someone built something [18:57:41] ori: it's just the two code paths that irk me a bit [18:58:06] andrewbogott: yeah, should be easy to figure out what [18:58:07] lemme see [18:58:09] but, I should go back to my actual work ;) [18:58:15] moritzm: any interest in looking at https://phabricator.wikimedia.org/T118888 ? [18:58:27] gwicke: I am with you on that :/ but past experience has taught me to be tactical and cautious with changes to mediawiki.js [18:59:26] andrewbogott: sure, I've subscribed to the bug, will have a look tomorrow morning [18:59:38] andrewbogott: librsvg (2.40.2-1+wm1) trusty-wikimedia; urgency=medium [18:59:38] * Apply upstream patch for data number parsing - [18:59:38] -- Giuseppe Lavagetto Fri, 09 Jan 2015 08:46:35 +0000 [18:59:46] there's your perp [18:59:47] thanks! I’ve depooled, I’ll ack the alerts. [18:59:53] hunt him down [18:59:54] bblack: um, let me updated it again, since I think you are right about only supporting SPDY 3 [19:00:00] ACKNOWLEDGEMENT - HHVM processes on mw1158 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
andrew bogott https://phabricator.wikimedia.org/T118888 [19:00:01] ACKNOWLEDGEMENT - dhclient process on mw1158 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. andrew bogott https://phabricator.wikimedia.org/T118888 [19:00:01] ACKNOWLEDGEMENT - nutcracker process on mw1158 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. andrew bogott https://phabricator.wikimedia.org/T118888 [19:00:01] ACKNOWLEDGEMENT - salt-minion processes on mw1158 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. andrew bogott https://phabricator.wikimedia.org/T118888 [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151117T1900). Please do the needful. [19:01:39] _joe_: Can you fill me in on where the latest rsvg packages came from? Or better yet, just take on https://phabricator.wikimedia.org/T112421 entirely? [19:01:49] bblack: also, do you think it is actually important (either because it is liable to be set, or because reasoning about this is hard) to check X-Orig-Cookie? vcl_deliver seems a little late in the game for X-Orig-Cookie to matter, no? [19:01:59] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1811824 (10Andrew) ok, further digging suggests that Giuseppe built the latest package. [19:03:29] ori: how terrible is it to lose a rendering server? I just depooled one and note that there are only 8 altogether. [19:03:43] does ‘rendering’ mean ‘thumbnail rendering’? 
[19:04:39] andrewbogott: yes, and not terrible [19:04:52] or rather: not terrible, and yes [19:04:59] ori: ok, good, thanks :) [19:05:41] or rather, good, ok, thanks [19:06:08] heh [19:10:00] (03PS1) 10Dzahn: puppet-lint: re-enable unquoted resource check [puppet] - 10https://gerrit.wikimedia.org/r/253652 [19:11:54] (03PS2) 10Dzahn: puppet-lint: re-enable unquoted resource check [puppet] - 10https://gerrit.wikimedia.org/r/253652 [19:12:38] (03CR) 10Dzahn: [C: 032] puppet-lint: re-enable unquoted resource check [puppet] - 10https://gerrit.wikimedia.org/r/253652 (owner: 10Dzahn) [19:13:51] mutante: thanks ^ [19:14:21] matanya: :) one by one [19:14:49] others are close to being fixed globally too [19:15:54] (03PS3) 10Ori.livneh: wikimedia.vcl: set CP ('Connection Properties') cookie in vcl_deliver [puppet] - 10https://gerrit.wikimedia.org/r/253645 [19:19:20] 6operations, 6Performance-Team, 10Traffic: Update CP cookie VCL once HTTP/2 support lands - https://phabricator.wikimedia.org/T118892#1811858 (10ori) 3NEW [19:20:36] (03CR) 10Ori.livneh: "Filed https://phabricator.wikimedia.org/T118892 so we remember to update this once HTTP/2 lands." [puppet] - 10https://gerrit.wikimedia.org/r/253645 (owner: 10Ori.livneh) [19:22:19] (03CR) 10Ori.livneh: "@YuviPanda, `appendfilename`." [puppet] - 10https://gerrit.wikimedia.org/r/253531 (owner: 10Yuvipanda) [19:23:12] (03CR) 10Yuvipanda: "That's set regardless of it is persist or not, but you're right I'll have to match it. 
However, for this migration it probably wouldn't ma" [puppet] - 10https://gerrit.wikimedia.org/r/253531 (owner: 10Yuvipanda) [19:23:37] (03PS1) 10Dzahn: mysql: fix the last double quoted strings [puppet] - 10https://gerrit.wikimedia.org/r/253653 [19:25:23] bblack: should be good now [19:28:04] (03PS1) 10Yuvipanda: labs: Set realm in hiera [puppet] - 10https://gerrit.wikimedia.org/r/253654 (https://phabricator.wikimedia.org/T101447) [19:28:10] andrewbogott: ^ should allow us to get rid of realm [19:28:13] from the wikitech page [19:28:14] (03PS2) 10Dzahn: admin: add my new yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/253542 [19:29:00] (03PS3) 10Dzahn: admin: add my new yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/253542 [19:29:09] andrewbogott: I can test this by merging it, removing the realm variable from wikitech and creating a new instance. [19:29:31] (03CR) 10Dzahn: [C: 032] admin: add my new yubikey ssh key [puppet] - 10https://gerrit.wikimedia.org/r/253542 (owner: 10Dzahn) [19:30:25] (03PS2) 10Yuvipanda: labs: Set realm in hiera [puppet] - 10https://gerrit.wikimedia.org/r/253654 (https://phabricator.wikimedia.org/T101447) [19:30:35] YuviPanda: I’ve never been clear on if the $realm set in realm.pp is the same as the $realm set in ldap. Aren’t they different scope? 
[19:30:56] andrewbogott: well prod has no realm and the only realm is the realm set in realm.pp [19:30:58] err [19:31:00] prod has no ldap [19:31:17] so I presume that means that the realm set in realm.pp (which is set 'bare' not in any class) must be the same [19:31:28] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Set realm in hiera [puppet] - 10https://gerrit.wikimedia.org/r/253654 (https://phabricator.wikimedia.org/T101447) (owner: 10Yuvipanda) [19:31:34] hm [19:31:37] well, I guess we can find out :) [19:31:41] yeah :D [19:33:47] dammit puppet, why are you so slow [19:34:22] it works [19:34:39] logged in with the yubikey [19:36:42] andrewbogott: yesss that works [19:36:44] we can kill realm now [19:36:49] * YuviPanda kills and starts a new instance to check [19:37:04] YubiPanda? [19:37:34] YuviPanda: please make the comment for your key in admin.yaml 'YuviKey' [19:37:35] Labs realm is never set anywhere in puppet afaik and the production realm is set at the top in case of undef [19:37:51] I imagine the labs realm is set first via ldap in labs context [19:38:03] ori: yea, what i thought :) once it works he has to be YubiPanda [19:38:10] haha :D [19:38:18] I shall do so when I've it setup, ori [19:38:25] Yu(v|b)ikey [19:38:40] chasemp: my patch just makes it hiera('realm', 'production') [19:38:57] chasemp: so if it's set in ldap it takes that if not it picks up from hiera [19:41:29] andrewbogott: chasemp \o/ it works! we can kill realm from ldap and from OSM now [19:41:39] well, from being hardcoded into OSM [19:42:17] can you make sure there aren’t any corner cases when a labs instance has nothing defined in its ldap node definition? [19:42:41] andrewbogott: so role::labs::instance will still be defined [19:42:44] YuviPanda: so this basically says, "look in hiera first else use 'production'" yes?
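The fallback order described above (a realm set in LDAP wins, otherwise the hiera value is used, otherwise the literal default 'production') can be sketched as follows. This is a minimal Python illustration of the lookup order only, not the actual Puppet code: `resolve_realm` and both dictionaries are hypothetical stand-ins for a node's LDAP variables and the hiera data.

```python
from typing import Mapping, Optional

# Default mirrors the second argument of hiera('realm', 'production').
DEFAULT_REALM = "production"


def resolve_realm(
    ldap_vars: Mapping[str, str],
    hiera_data: Mapping[str, str],
    default: str = DEFAULT_REALM,
) -> str:
    """Pick the realm: LDAP value if set, else the hiera value, else the default."""
    ldap_value: Optional[str] = ldap_vars.get("realm")
    if ldap_value:
        # An explicitly set (non-empty) LDAP variable takes precedence.
        return ldap_value
    # No LDAP value: fall back to hiera, and to the default if hiera has no key.
    return hiera_data.get("realm", default)


if __name__ == "__main__":
    print(resolve_realm({"realm": "labs"}, {}))  # value comes from LDAP
    print(resolve_realm({}, {"realm": "labs"}))  # value comes from hiera
    print(resolve_realm({}, {}))                 # falls through to the default
```

The third call is the corner case asked about in the conversation: a node with nothing defined in LDAP and no hiera key still resolves to the default rather than to an undefined value.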
[19:43:32] yeah [19:43:34] andrewbogott: hmm http://tools.wmflabs.org/watroles/variable/instancename/tools-worker-03 [19:43:39] andrewbogott: somehow it's in ldap still [19:43:41] * YuviPanda removes it [19:44:11] it would only be removed if you edited and committed the puppet def in wikitech for that particular instance [19:44:51] PROBLEM - cassandra CQL 10.192.16.153:9042 on restbase2002 is CRITICAL: Connection refused [19:45:37] back in a bit [19:46:15] ok [19:49:06] andrewbogott: ok I explicitly got rid of it and it still works [19:49:31] now to get rid of it from OSM [19:49:34] ugh I gotta clone OSM now [19:52:06] (03PS2) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [19:52:29] YuviPanda: where is the 'labs' realm value being pulled from by hiera now? [19:52:45] chasemp: hieradata/labs.yaml [19:52:47] godog still around? :) [19:53:08] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [19:56:03] addshore: sure, what's up? [19:56:19] (03PS3) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [19:56:20] I was just going to link you to some of the stuff that result as of that thing being merged! :) [19:56:42] https://grafana.wikimedia.org/dashboard/db/wikidata-api-wbgetclaims https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch https://grafana.wikimedia.org/dashboard/db/wikidata-social-followers https://grafana.wikimedia.org/dashboard/db/wikidata-entity-usage [19:56:44] many thanks! [19:57:36] addshore: nice [19:57:37] ! 
[19:57:52] (03PS1) 10Yuvipanda: Include role::labs::instance in labs via puppet [puppet] - 10https://gerrit.wikimedia.org/r/253664 (https://phabricator.wikimedia.org/T101447) [19:58:02] andrewbogott: chasemp ^ for role::labs::instance removal [19:58:03] ori: indeed ;) more to come I hope! [19:58:11] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [19:58:47] addshore: ooohh shiny! good job :D [19:59:29] (03PS4) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [19:59:38] <- off [20:00:07] *waves* [20:00:27] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [20:01:34] (03PS1) 10Jcrespo: [WIP] Use heartbeat when possible to check slave lag [puppet] - 10https://gerrit.wikimedia.org/r/253665 [20:02:08] addshore: btw, prefer tags + descriptive titles rather than '::'-segments [20:02:30] They have tags, I would probably just get rid of the :: [20:02:40] * ori nods [20:02:43] (03PS5) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:03:29] (03CR) 10Ori.livneh: [WIP] Puppetize eventlogging-service with systemd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [20:06:17] (03PS6) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:07:03] 6operations, 7Mail: EXIM Config: Remove yana alias for ywelinder - https://phabricator.wikimedia.org/T118899#1812033 (10JKrauska) 3NEW [20:07:31] 
YuviPanda: see, https://gerrit.wikimedia.org/r/#/c/253664/1/manifests/site.pp is the part that gets us into uncharted empty-ldap-node territory [20:07:36] 6operations, 7Mail: EXIM Config: Remove yana alias for ywelinder - https://phabricator.wikimedia.org/T118899#1812051 (10JKrauska) [20:07:40] It’s probably fine, we just need to be alert. [20:08:03] andrewbogott: so instanceproject and instancename are still set [20:08:11] andrewbogott: but you're right it'll have 0 roles [20:08:20] Hm… true... [20:08:26] although we should be able to get those from metadata [20:08:39] I guess there are still external tools that use ldap queries to enumerate things, huh? [20:08:57] yeah [20:09:00] well, 'external' [20:09:06] shinken and http://tools.wmflabs.org/watroles/variable/instancename/tools-worker-03 [20:09:32] andrewbogott: we can fix those partly with novaobserver :D [20:09:38] I also should finish up that ENC I had... [20:09:57] novaobserver is temporarily doomed :( [20:10:04] aaah [20:10:06] :( [20:10:20] is it buried under a pile of stuff we need to fix first? [20:10:22] 6operations, 10ops-eqiad: Disk failure on db1027 (RAID degraded) - https://phabricator.wikimedia.org/T118848#1812078 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson all the disks have been replaced. root@db1027:~# megacli -PDList -aALL |grep "Firmware state:" Firmware state: Online, Spun Up Firmware state: O... 
[20:10:52] YuviPanda: basically yes, that plus keystone is an incoherent mess [20:11:43] (03PS7) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:11:46] andrewbogott: heh [20:12:51] andrewbogott: ok I'm going to go meet another human being for lunch, I'll be back at some point and merge the instance change and be on the lookout [20:12:59] andrewbogott: I'll also cleanup ldap afterwards [20:15:52] (03PS8) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:16:41] 6operations: Remove Alias - https://phabricator.wikimedia.org/T118900#1812121 (10Krenair) Adding #operations as this relates to the exim aliases in the ops private repository. [20:19:57] (03CR) 10Addshore: [C: 031] keep fewer dataset web server logs, add date to filename [puppet] - 10https://gerrit.wikimedia.org/r/253594 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [20:23:08] (03PS1) 1020after4: 1.27.0-wmf.7 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253674 [20:23:13] (03PS2) 10Dzahn: mysql: fix the last double quoted strings [puppet] - 10https://gerrit.wikimedia.org/r/253653 [20:23:44] (03CR) 1020after4: [C: 032] 1.27.0-wmf.7 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253674 (owner: 1020after4) [20:23:48] (03CR) 10Dzahn: [C: 032] mysql: fix the last double quoted strings [puppet] - 10https://gerrit.wikimedia.org/r/253653 (owner: 10Dzahn) [20:24:06] (03Merged) 10jenkins-bot: 1.27.0-wmf.7 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253674 (owner: 1020after4) [20:24:22] (03Abandoned) 10Dzahn: wikistats: add cronjob for miraheze import script [puppet] - 10https://gerrit.wikimedia.org/r/235959 (https://phabricator.wikimedia.org/T107398) (owner: 10Dzahn) [20:25:12] !log twentyafterfour@tin Started scap: sync new 
branch 1.27.0-wmf.7 and enable for testwiki [20:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:43] !log twentyafterfour@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_4014066321" --threads=4 --lang en --quiet' returned non-zero exit status 1 (duration: 01m 30s) [20:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:30:34] !log twentyafterfour@tin Started scap: sync new branch 1.27.0-wmf.7 and enable for testwiki [20:31:16] !log twentyafterfour@tin scap failed: CalledProcessError Command 'cp -r "/tmp/scap_l10n_2994655917"/* "/srv/mediawiki-staging/php-1.27.0-wmf.7/cache/l10n"' returned non-zero exit status 1 (duration: 00m 42s) [20:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:34:39] grr... checkoutMediawiki is still badly broken [20:35:33] how can I make the l10n directory owned by l10nupdate? I don't belong to the groups that l10nupdate does and l10nupdate doesn't belong to any other groups... [20:35:46] (03PS2) 10Dzahn: delete sitemap.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/253375 (https://phabricator.wikimedia.org/T101486) [20:35:53] I don't see any way that I can do it [20:36:08] other than making the directory world writable? [20:36:27] (03PS3) 10Dzahn: delete sitemap.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/253375 (https://phabricator.wikimedia.org/T101486) [20:36:49] (03PS1) 10Addshore: wgRCWatchCategoryMembership true on group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253682 [20:37:16] (03PS9) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:37:22] (03CR) 10Dzahn: "this used to have sitemap files ..back in 2005." 
[dns] - 10https://gerrit.wikimedia.org/r/253375 (https://phabricator.wikimedia.org/T101486) (owner: 10Dzahn) [20:37:40] (03CR) 10Dzahn: [C: 032] "this used to have sitemap files ..back in 2005." [dns] - 10https://gerrit.wikimedia.org/r/253375 (https://phabricator.wikimedia.org/T101486) (owner: 10Dzahn) [20:37:59] !log twentyafterfour@tin Started scap: sync new branch 1.27.0-wmf.7 and enable for testwiki [20:38:38] twentyafterfour: you should be able to sudo as the l10nupdate user [20:39:27] we really should have rolled back that sudo wmdeploy stuff in the checkout scripts [20:39:57] !log deleted sitemap.wikimedia.org (T101486) [20:39:58] bd808: I can sudo to l10nupdate but that user can't create a directory inside the branch [20:39:59] 6operations, 5Patch-For-Review: Delete / decom sitemap.wikimedia.org - https://phabricator.wikimedia.org/T101486#1812214 (10Dzahn) a:3Dzahn [20:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:40:21] permissions are a horrible mess right now and I don't know how to move forward [20:40:33] (but I worked around it for now with world-writable directory) [20:41:13] (03PS2) 10Merlijn van Deen: toollabs: make sure /tmp and swap are large for all exec hosts [puppet] - 10https://gerrit.wikimedia.org/r/252506 (https://phabricator.wikimedia.org/T118419) [20:41:14] by move forward I mean I don't know how to fix checkoutMediawiki or of the mirroring stuff. I really just want to burn it with fire [20:41:26] (03PS10) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:41:49] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[20:42:21] twentyafterfour: fwiw, i had problems locally with rebuildLocalisationCache.php today [20:42:46] (and with building my cirrus index, which touches l10n stuff) [20:43:01] (03PS1) 10BryanDavis: Revert "checkoutMediaWiki: sudo as mwdeploy for most things" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253684 [20:43:07] and normally don't have such problems [20:44:56] twentyafterfour: I think that revert + the scap and puppet changes to run the sync as root *should* work. You could also easily just remove the mira sync code from scap for now or hide it under a feature flag [20:49:19] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [20:51:47] 6operations: Remove exim aliases for cdeubner - https://phabricator.wikimedia.org/T118900#1812226 (10Aklapper) [20:55:39] PROBLEM - HHVM rendering on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:56:19] PROBLEM - Apache HTTP on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:04:56] (03CR) 10Aaron Schulz: "Would be nice to have all the stuff in https://gerrit.wikimedia.org/r/#/c/253151/ addressed soon." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/253682 (owner: 10Addshore) [21:07:39] !log twentyafterfour@tin Finished scap: sync new branch 1.27.0-wmf.7 and enable for testwiki (duration: 29m 40s) [21:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:07:58] (03PS11) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [21:08:51] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [21:09:42] (03PS12) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [21:14:20] (03PS1) 10Dzahn: mediawiki: remove sitemap.wm.org Apache redirect [puppet] - 10https://gerrit.wikimedia.org/r/253749 (https://phabricator.wikimedia.org/T101486) [21:15:21] (03PS1) 10Ori.livneh: Add additional YubiKey-backed key for self (=ori) [puppet] - 10https://gerrit.wikimedia.org/r/253751 [21:16:08] (03PS2) 10Ori.livneh: Add additional YubiKey-backed key for self (=ori) [puppet] - 10https://gerrit.wikimedia.org/r/253751 [21:16:50] (03CR) 10Ori.livneh: [C: 032 V: 032] Add additional YubiKey-backed key for self (=ori) [puppet] - 10https://gerrit.wikimedia.org/r/253751 (owner: 10Ori.livneh) [21:17:20] (03PS2) 10Dzahn: mediawiki: remove sitemap.wm.org Apache redirect [puppet] - 10https://gerrit.wikimedia.org/r/253749 (https://phabricator.wikimedia.org/T101486) [21:17:28] (03CR) 10Dzahn: [C: 032] mediawiki: remove sitemap.wm.org Apache redirect [puppet] - 10https://gerrit.wikimedia.org/r/253749 (https://phabricator.wikimedia.org/T101486) (owner: 10Dzahn) [21:19:33] 6operations, 5Patch-For-Review: Delete / decom sitemap.wikimedia.org - https://phabricator.wikimedia.org/T101486#1812274 (10Dzahn) 5Open>3Resolved 
[21:19:53] 6operations: Delete / decom sitemap.wikimedia.org - https://phabricator.wikimedia.org/T101486#1340162 (10Dzahn) [21:31:12] (03PS1) 10Rush: hiera_lookup allow debug with '-v' [puppet] - 10https://gerrit.wikimedia.org/r/253753 [21:34:30] 6operations, 10ops-eqiad: Decommission cisco servers, Analytics1003, 1004 and 1010 - https://phabricator.wikimedia.org/T118572#1812301 (10Dzahn) a:3Cmjohnson [21:35:20] 6operations, 7Icinga, 7Monitoring: Monitor all mgmt hosts - https://phabricator.wikimedia.org/T85143#1812304 (10Dzahn) [21:35:44] 6operations: releases.wikimedia.org should be https only and have hsts set - https://phabricator.wikimedia.org/T118787#1812306 (10Dzahn) p:5Triage>3Normal [21:40:49] (03PS13) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [21:43:44] (03PS1) 10Dzahn: releases: enforce http->https redirect behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/253757 (https://phabricator.wikimedia.org/T118787) [21:45:23] twentyafterfour: Train done? [21:48:32] (03PS1) 10Dzahn: planet: add HSTS headers [puppet] - 10https://gerrit.wikimedia.org/r/253758 [21:50:08] (03PS1) 10Dzahn: releases: enable strict transport security [puppet] - 10https://gerrit.wikimedia.org/r/253759 (https://phabricator.wikimedia.org/T253757) [21:52:16] (03PS2) 10Dzahn: releases: enable strict transport security [puppet] - 10https://gerrit.wikimedia.org/r/253759 (https://phabricator.wikimedia.org/T118787) [21:53:07] cccccceillbuheftkukjlgreectrleurrjnfbkbvghnj [21:53:29] damn, and again:) i need to put tape on this button [21:53:30] is that from your yubikey, mutante? [21:53:57] yea, it's really easy to touch that when typing [21:54:21] it's not used though [21:54:43] that one-time password doesnt get you anything [21:56:49] 6operations: How to page when a host is down? 
- https://phabricator.wikimedia.org/T113834#1812338 (10Andrew) 5Open>3Resolved ok, everyone seems to think that this is not a problem, so I will close and reopen if it turns out to be a problem at some point. [22:11:06] (03PS2) 10Rush: hiera_lookup allow debug with '-v' [puppet] - 10https://gerrit.wikimedia.org/r/253753 [22:11:08] (03PS1) 10Dzahn: mediawiki: remove wikimedia.biz Apache config [puppet] - 10https://gerrit.wikimedia.org/r/253763 (https://phabricator.wikimedia.org/T81344) [22:11:44] Krinkle: I didn't promote to group0 yet it's just on testwiki [22:12:52] (03CR) 10Rush: [C: 032] hiera_lookup allow debug with '-v' [puppet] - 10https://gerrit.wikimedia.org/r/253753 (owner: 10Rush) [22:14:15] (03PS1) 1020after4: group0 wikis to 1.27.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253767 [22:14:28] (03CR) 1020after4: [C: 032] group0 wikis to 1.27.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253767 (owner: 1020after4) [22:17:59] (03Merged) 10jenkins-bot: group0 wikis to 1.27.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253767 (owner: 1020after4) [22:18:25] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.27.0-wmf.7 [22:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:19:28] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1812395 (10Aklapper) [22:19:29] 6operations, 6Security-Team: can we get rid of rsvg security patch? 
- https://phabricator.wikimedia.org/T104147#1812394 (10Aklapper) [22:20:21] !log deactivated wikimedia.biz, webhostingwikipedia.com [22:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:21:16] (03PS1) 10Rush: labtest realm introduction [puppet] - 10https://gerrit.wikimedia.org/r/253770 [22:23:41] !log deactivated wikimaps.com, wikimaps.net, wikimaps.org [22:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:27:30] (03CR) 10BBlack: [C: 04-1] wikimedia.vcl: set CP ('Connection Properties') cookie in vcl_deliver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/253645 (owner: 10Ori.livneh) [22:28:49] bblack: why? the two possible states are CP=H2 for SPDY or no CP cookie at all for first request or for HTTP 1.x [22:29:09] not sure what the value of CP=H1 would be since we'd need to assume that whenever CP is unset anyway [22:29:35] no, I'm fine with CP=H2 vs no cookie [22:30:05] but to tell the browser "no cookie", you need header.append("Set-Cookie: CP="), not header.remove(), which would remove a header you had previously set to send to the browser in this response. [22:30:46] (03PS1) 10MaxSem: Bump portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253772 [22:31:28] "append the header that tells the browser to remove" vs "remove a header from the response" (what the patch currently does in the else clause) [22:31:32] yes, I'm a moron [22:31:42] ok :) [22:31:42] don't know why I didn't get that that's what you were saying [22:34:20] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: puppet fail [22:35:29] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [22:36:11] wait, what..
why eeden [22:36:17] checks [22:38:10] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:45:59] (03CR) 10JanZerebecki: [C: 031] planet: add HSTS headers [puppet] - 10https://gerrit.wikimedia.org/r/253758 (owner: 10Dzahn) [22:49:08] bblack: do you think it is sufficient to just call header.append(resp.http.Set-Cookie, "CP="), or should I actually force the cookie to expire? [22:51:27] well [22:51:32] I donno, you could go either way [22:51:57] now that I think about it, CP= is still going to leave a cookie in place [22:52:15] in theory the cookie spec knows about K=V, but I don't think an empty V means don't bother sending it either heh [22:52:48] !log deactivated vikipedia.com.tr, vikipedia.com.tr [22:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:52:59] eh, vikipedi.com.tr shrug [22:53:04] so maybe send them "CP=; Expires=Jan1,1970-in-a-better-format" [22:54:09] !log deactivated indiawikipedia.com [22:54:10] A value might be more reliable too perhaps. 
"CP=X; Expires=Thu, 1 Jan 1970 00:00:01 GMT"
[22:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:54:36] yep
[22:55:58] PROBLEM - puppet last run on mw2053 is CRITICAL: CRITICAL: Puppet has 1 failures
[22:56:40] !log deactivated softwarewikipedia.[com|net|org]
[22:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:57:58] !log deactivated visualwikipedia.[com|net]
[22:57:59] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0]
[22:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:59:29] (CR) Rush: [C: 2] labtest realm introduction [puppet] - https://gerrit.wikimedia.org/r/253770 (owner: Rush)
[23:01:23] !log deactivated wikimedia.xyz
[23:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:04:15] (CR) JanZerebecki: [C: 1] releases: enforce http->https redirect behind misc-web [puppet] - https://gerrit.wikimedia.org/r/253757 (https://phabricator.wikimedia.org/T118787) (owner: Dzahn)
[23:04:48] (CR) JanZerebecki: [C: 1] releases: enable strict transport security [puppet] - https://gerrit.wikimedia.org/r/253759 (https://phabricator.wikimedia.org/T118787) (owner: Dzahn)
[23:05:39] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:06:01] bblack: do we have something like geo_get_top_cookie_domain() in geoip.inc.vcl.erb that can be called from vcl directly?
[23:08:40] ori: vcl functions don't really have return values, so no :/
[23:09:18] no biggie.
performing geoip lookup is expensive; checking if (req.http.X-Connection-Properties ~ "SPDY=3") is not
[23:11:48] (PS4) Ori.livneh: wikimedia.vcl: set CP ('Connection Properties') cookie in vcl_deliver [puppet] - https://gerrit.wikimedia.org/r/253645
[23:12:02] bblack: should do the trick ^
[23:13:44] !log deactivated border-wikipedia.de
[23:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:14:49] (PS1) BryanDavis: Remove deprecated wgRateLimitLog [mediawiki-config] - https://gerrit.wikimedia.org/r/253779
[23:18:17] !log deactivated wikiartpedia.[biz|co|info|me|mobi|net|org]
[23:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:18:40] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: puppet fail
[23:18:58] (PS2) Dzahn: deactivate wekipedia.com [dns] - https://gerrit.wikimedia.org/r/244085
[23:19:15] (CR) Dzahn: [C: 2] deactivate wekipedia.com [dns] - https://gerrit.wikimedia.org/r/244085 (owner: Dzahn)
[23:19:52] * hoo finds most of these domains hilarious :P
[23:20:24] wikiicantbelieveanyonewouldreallythinkthisispedia.org
[23:20:26] hoo: that's cause the most hilarious ones are killed first :)
[23:21:01] but just wait for the 650 that are not in DNS yet
[23:21:18] PROBLEM - puppet last run on mw2110 is CRITICAL: CRITICAL: puppet fail
[23:22:28] RECOVERY - puppet last run on mw2053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:24:58] PROBLEM - puppet last run on labtestmetal2001 is CRITICAL: CRITICAL: puppet fail
[23:27:10] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: puppet fail
[23:27:10] (PS1) Rush: WIP: labs remove old neutron configs [puppet] - https://gerrit.wikimedia.org/r/253780
[23:27:18] (PS2) Rush: WIP: labs remove old neutron configs [puppet] - https://gerrit.wikimedia.org/r/253780
[23:31:38] PROBLEM - puppet last run on labtestvirt2001 is CRITICAL:
CRITICAL: puppet fail
[23:32:38] PROBLEM - puppet last run on labtestneutron2001 is CRITICAL: CRITICAL: puppet fail
[23:34:57] andrewbogott: chasemp ok I'm going to merge and babysit the labsinstance change
[23:35:06] great!
[23:35:32] I don’t actually expect it to break things, but good to keep an eye out
[23:35:52] (PS2) Yuvipanda: Include role::labs::instance in labs via puppet [puppet] - https://gerrit.wikimedia.org/r/253664 (https://phabricator.wikimedia.org/T101447)
[23:36:15] (CR) Yuvipanda: [C: 2 V: 2] "Peeaaaceee on eeaaartth, aaand an eeeeeend to waaar" [puppet] - https://gerrit.wikimedia.org/r/253664 (https://phabricator.wikimedia.org/T101447) (owner: Yuvipanda)
[23:40:10] !log remove realm variable from all instances in ldap
[23:40:14] andrewbogott: ^ \o/
[23:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:40:25] (PS1) Rush: labtest realm lookups for mail [puppet] - https://gerrit.wikimedia.org/r/253783
[23:40:37] YuviPanda: cool!
[23:40:55] andrewbogott: in a while I'll do that for the instance role too
[23:41:00] in my testing (so far) it seems ok
[23:41:31] andrewbogott: ldapvi has been great for making these kind of changes
[23:42:04] yeah, I love not having to write diff files
[23:43:26] !log deactivated wekipedia.om
[23:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:44:58] (PS4) Dzahn: deactivate wikipaedia.net [dns] - https://gerrit.wikimedia.org/r/244090
[23:45:13] (CR) Rush: [C: 2] labtest realm lookups for mail [puppet] - https://gerrit.wikimedia.org/r/253783 (owner: Rush)
[23:47:07] (CR) Dzahn: [C: 2] deactivate wikipaedia.net [dns] - https://gerrit.wikimedia.org/r/244090 (owner: Dzahn)
[23:47:09] YuviPanda: puppet seems upset
[23:47:24] andrewbogott: yeah it's telling me there's a python-yaml conflict...
[23:47:40] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[23:47:43] that sounds… unrelated
[23:48:00] andrewbogott yeah which is why I'm worried
[23:48:25] yeah, maybe not detecting the right realm :(
[23:48:40] yah
[23:48:47] ok
[23:48:50] I ran it on tools-proxy
[23:48:52] and no errors
[23:50:33] It could still have the wrong realm and just work by accident on some hosts
[23:50:48] yeah
[23:50:50] I can’t think of a speedy way to ask puppet ‘what is $::realm’?
[23:51:10] we don't set it as a fact so it really only exists on the master right?
[23:51:50] yeah
[23:51:52] no it's just set as a global variable
[23:52:14] so there's actually a conflict in python-yaml that I don't know how it worked so far...
[23:52:24] let me try on unrelated projects
[23:52:29] you can put in a notify w/ $realm tho to test that theory
[23:52:36] andrewbogott: notify{"realm: ${realm}": }
[23:52:40] if you are skeptical it's a real realm mismatch
[23:52:47] can one of you write and merge that?
[23:53:18] ok, sure
[23:53:31] you got it?
[23:53:33] notify{ "realm is ${realm}": }
[23:53:40] yeah
[23:53:43] kk
[23:53:59] (PS1) Yuvipanda: toollabs: Fix genpp to use require_package [puppet] - https://gerrit.wikimedia.org/r/253788
[23:54:20] so I checked out ores-web-02
[23:54:23] all good
[23:54:30] complete noop
[23:54:37] (PS2) Yuvipanda: toollabs: Fix genpp to use require_package [puppet] - https://gerrit.wikimedia.org/r/253788
[23:54:51] ^ I shall wait for jenkins here
[23:55:00] from what I looked through realm wise (like we said) it's circular and funky but seemed fine
[23:55:03] is it related to the distro version?
[23:55:11] and that's why all -exec nodes but not others?
[23:55:21] or some realm check on those nodes
[23:55:30] well so they all have a python-yaml conflict
[23:55:37] and that patch (I'm waiting for jenkins) fixes that conflict
[23:55:42] I only have no idea why they worked before
[23:56:03] (CR) Yuvipanda: [C: 2] toollabs: Fix genpp to use require_package [puppet] - https://gerrit.wikimedia.org/r/253788 (owner: Yuvipanda)
[23:57:08] (PS1) Andrew Bogott: Display 'realm' during each puppet run. [puppet] - https://gerrit.wikimedia.org/r/253789
[23:57:18] YuviPanda: chasemp: ^ Let me test on a self-hosted box first...
[23:57:24] unless one of you has a test box handy already
[23:57:42] nope
[23:58:18] so if the hiera lookup fails, then it becomes production realm by default
[23:58:23] I do hold on
[23:58:30] is that what we want though
[23:58:37] mutante: if hiera lookup fails we're in for a lot of shit :)
[23:59:00] in the same vein earlier if the ldap lookup fails it'll default to production
[23:59:09] and I think when either ldap or hiera lookups fail they just fail the puppet run
[23:59:19] ok so my patch fixed the failing puppets
[23:59:21] and I've a theory
[23:59:23] andrewbogott: runs fine man, give it a go
[23:59:24] Notice: /Stage[main]/Main/Notify[realm is labtest]/message: current_value absent, should be realm is labtest (noop)
[23:59:38] labtest is my valid test realm here
[23:59:40] which has to do with... ordering!
[23:59:47] earlier we included role::labs::instance via ldap
[23:59:58] and also the exec node stuff via ldap