[00:08:37] (PS1) Catrope: WIP Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068
[00:09:19] (CR) jenkins-bot: [V: -1] WIP Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068 (owner: Catrope)
[00:10:52] (CR) Dzahn: "this is because you copied the GID in admins/data.yaml" [puppet] - https://gerrit.wikimedia.org/r/163068 (owner: Catrope)
[00:11:34] mutante: Yeah I noticed. That's a really nice check to have in Jenkins
[00:11:38] (PS2) Catrope: WIP Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068
[00:12:18] RoanKattouw: yes, we got it with the conversion to .yaml. and very good to have, used to happen a lot before :)
[00:12:21] (CR) jenkins-bot: [V: -1] WIP Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068 (owner: Catrope)
[00:18:58] I'm unable to log into the frack bastion, tellurium...
[00:19:26] awight: are you in office?
[00:19:42] AaronS: yes
[00:19:54] *we* have a package btw ;)
[00:20:24] * awight looks around for backscroll!
[00:20:48] AaronS: aha, the pedal?
[00:21:05] more like the pedal*s*
[00:21:16] tripleplus good
[00:22:14] AaronS: I can cash u out for my half, *walking* now
[00:24:00] (PS3) Catrope: WIP Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068
[00:30:21] (CR) Spage: [C: 1] "Matches the article capitalization. This can go out whenever." [mediawiki-config] - https://gerrit.wikimedia.org/r/162965 (https://bugzilla.wikimedia.org/71204) (owner: Prtksxna)
[00:31:39] (CR) BryanDavis: "Some comments inline about what I think are better ways to integrate this with Trebuchet." (5 comments) [puppet] - https://gerrit.wikimedia.org/r/163068 (owner: Catrope)
[00:34:19] (CR) Catrope: "Bryan: All this is copied directly from the Mathoid module. I'll try out at least some of the fixes you're suggesting, but the hacks in he" [puppet] - https://gerrit.wikimedia.org/r/163068 (owner: Catrope)
[00:37:28] (CR) Catrope: WIP Citoid puppetization (1 comment) [puppet] - https://gerrit.wikimedia.org/r/163068 (owner: Catrope)
[00:38:09] bd808: A lot of that patch is copypasted straight from Moritz's and Alex's Mathoid work, but I'm trying your exec suggestion and getting rid of the directory creations
[00:39:12] RoanKattouw: Yeah. I saw some of that when Moritz tried to sneak the same puppet code into mediawiki-vagrarnt :)
[00:39:40] The directory ownership was most disturbing to me
[00:40:56] PROBLEM - puppet last run on neptunium is CRITICAL: CRITICAL: Puppet has 1 failures
[00:40:57] RoanKattouw, how has your day been?
[00:41:56] PROBLEM - LDAPS on neptunium is CRITICAL: Connection refused
[00:42:25] PROBLEM - LDAP on neptunium is CRITICAL: Connection refused
[00:42:56] RECOVERY - puppet last run on neptunium is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[00:43:16] RECOVERY - LDAP on neptunium is OK: TCP OK - 0.001 second response time on port 389
[00:44:05] RECOVERY - LDAPS on neptunium is OK: TCP OK - 0.050 second response time on port 636
[00:48:22] Bsadowski1: It's been good. I'm subjecting Puppet to my will, I feel powerful now :)
[00:49:17] What is Puppet used for?
[00:50:09] (PS4) Catrope: WIP Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068
[00:50:28] Bsadowski1, for managing server configuration
[00:56:34] it saves us a lot of time ;)
[00:56:55] (sometimes)
[01:00:43] (PS1) Ori.livneh: Allow wikidev users to restart HHVM [puppet] - https://gerrit.wikimedia.org/r/163075
[01:13:23] (CR) Ori.livneh: "Related change to scap: https://gerrit.wikimedia.org/r/#/c/163078/" [puppet] - https://gerrit.wikimedia.org/r/163075 (owner: Ori.livneh)
[01:19:05] PROBLEM - LDAPS on neptunium is CRITICAL: Connection refused
[01:19:25] PROBLEM - LDAP on neptunium is CRITICAL: Connection refused
[01:20:05] RECOVERY - LDAPS on neptunium is OK: TCP OK - 0.003 second response time on port 636
[01:20:25] RECOVERY - LDAP on neptunium is OK: TCP OK - 0.001 second response time on port 389
[01:31:55] (PS1) BBlack: add hhvm-api.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/163080
[01:32:04] oooo
[01:32:30] (CR) BBlack: [C: 2] add hhvm-api.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/163080 (owner: BBlack)
[01:32:50] (CR) BBlack: [C: 1] lvs: add hhvm-api.svc.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/162894 (owner: Giuseppe Lavagetto)
[01:32:59] (CR) BBlack: [C: 1] hhvm: serve API as well [puppet] - https://gerrit.wikimedia.org/r/162902 (owner: Giuseppe Lavagetto)
[01:33:33] I'll let joe deal with deploying that (or me tomorrow, but I'm out for now)
[01:33:40] nod. thanks!
[01:46:27] ori: Halp?
[01:47:03] I cherry-picked https://gerrit.wikimedia.org/r/163068 into deployment-salt:/var/lib/git/operations/puppet but deployment-sca01 still says Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::citoid::beta for i-0000060f.eqiad.wmflabs on node i-0000060f.eqiad.wmflabs
[01:49:02] Neever mind
[01:49:05] It's not puppetmaster::self'ed
[01:59:59] OK so I got that to work
[02:00:09] Now, next problem: duplicate declaration of package nodejs
[02:00:13] How do we handle that problem usually?
[02:09:16] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3616 MB (3% inode=99%):
[02:17:06] (PS5) Catrope: WIP Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068
[02:18:05] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:20:10] (PS6) Catrope: WIP Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068
[02:20:56] PROBLEM - puppet last run on virt0 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:25:02] (PS7) Catrope: WIP Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068
[02:33:59] (CR) Catrope: "Sorry Bryan, I tried using your Exec dependency thing (see patchsets 4-7 roughly) but it doesn't work. Nothing creates the /srv/deployment" [puppet] - https://gerrit.wikimedia.org/r/163068 (owner: Catrope)
[02:35:15] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[02:36:32] (PS8) Catrope: WIP Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068
[02:36:46] !log LocalisationUpdate completed (1.24wmf22) at 2014-09-26 02:36:45+00:00
[02:36:52] Logged the message, Master
[02:37:57] (PS9) Catrope: WIP Citoid puppetization [puppet] - https://gerrit.wikimedia.org/r/163068
[02:38:58] RECOVERY - puppet last run on virt0 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[02:57:09] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:58:11] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:00:59] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:09:48] !log LocalisationUpdate completed (1.25wmf1) at 2014-09-26 03:09:47+00:00
[03:09:52] Logged the message, Master
[03:16:19] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[03:16:22] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[03:56:15] (CR) Glaisher: "https://developer.apple.com/library/ios/documentation/userexperience/conceptual/mobilehig/IconMatrix.html -- iPhone 6 Plus" [mediawiki-config] - https://gerrit.wikimedia.org/r/162638 (https://bugzilla.wikimedia.org/70996) (owner: Glaisher)
[03:56:31] RoanKattouw_away: let me know if you want me to look through that
[04:09:31] ori: I think Roan would love a hand with his puppet stuff for citoid. He was asking for help at scrum of scrums this week and apparently it's not getting easier.
[04:10:12] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Sep 26 04:10:12 UTC 2014 (duration 10m 11s)
[04:10:17] Logged the message, Master
[04:15:49] (CR) Tim Starling: [C: 1] Allow wikidev users to restart HHVM [puppet] - https://gerrit.wikimedia.org/r/163075 (owner: Ori.livneh)
[04:31:24] bd808: cool, i'll take a look
[04:32:02] I think the example he was pointed to to follow is maybe a bit flawed which hasn't helped.
[04:38:42] <_joe_> did someone already re-updated bash?
[04:43:18] <_joe_> !log updating bash, USN-2363
[06:01:18] is it crazy for me to do an OCG deploy now?
[06:01:33] aka at the last possible hour before it is "friday"
[06:02:09] (PS1) Giuseppe Lavagetto: Fix config paths so that hhvm is not broken by default [debs/hhvm] - https://gerrit.wikimedia.org/r/163101
[06:02:21] <_joe_> cscott: it is friday here
[06:02:37] it is friday here, too. i'm hoping the SFO people won't notice. ;)
[06:02:40] <_joe_> and this is an horrible time for a deploy
[06:02:58] <_joe_> it's 8 am in central europe, 7 am in the uk
[06:03:08] <_joe_> so, most eu staffers are on coffee
[06:03:13] <_joe_> or asleep
[06:03:14] it is. i just wanted to give some of the OCG bits a little bit more time to bake in production before they are turned on for everybody on monday.
[06:03:42] <_joe_> cscott: go on then, but the timing is unfortunate.
[06:04:00] <_joe_> cscott: I assume you'll stick around to verify everything is working afterwards
[06:04:07] who needs sleep?
[06:04:29] <_joe_> (I'd do that on my early morning on friday)
[06:04:53] alternatively i could do an ocg deploy first thing monday morning Eastern time, which still gives it a few hours to bake before it gets turns on by default for everyone, which is scheduled for 4pm eastern.
[06:06:58] choices choices
[06:24:19] (PS2) Giuseppe Lavagetto: Fix config paths so that hhvm is not broken by default [debs/hhvm] - https://gerrit.wikimedia.org/r/163101
[06:24:46] heheh
[06:24:53] i like the commit message
[06:25:03] <_joe_> :)
[06:25:24] <_joe_> that's my just-had-coffee mood
[06:25:30] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Fix config paths so that hhvm is not broken by default [debs/hhvm] - https://gerrit.wikimedia.org/r/163101 (owner: Giuseppe Lavagetto)
[06:25:41] <_joe_> just had coffee and everything's broken
[06:25:46] <_joe_> for any value of everything
[06:25:47] <_joe_> :P
[06:27:17] <_joe_> ori: the message meant - we're installing a php.ini file that is meant to allow hhvm to run, and we install it where hhvm will not search it by default
[06:27:29] <_joe_> so yeah, the package _is_ broken by default
[06:29:11] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:25] <_joe_> oh it's 8.30 AM already
[06:29:29] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:51] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:09] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:41:29] !log updated OCG to version f3a6c1cbba118d4a5e1aa019937dc50159fc823d
[06:41:34] Logged the message, Master
[06:45:59] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:46:33] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:52:26] _joe_: i have a trivial change to push for the HHVM beta feature (we promised jared to have a screenshot and James_F improved the copy). it's really trivial, but since it's off-schedule i want to make sure i have a backup. is it all right with you if i go ahead?
[06:52:54] <_joe_> yep
[06:52:59] thanks
[06:53:26] !log ori Synchronized php-1.25wmf1/extensions/WikimediaEvents: Update WikimediaEvents for 0e087daea5 (duration: 00m 07s)
[06:53:31] Logged the message, Master
[06:54:30] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 15.3GB (= 5.0GB critical): /srv/deployment/ocg/output 75.4GB (= 50.0GB critical): /srv/deployment/ocg/postmortem 3.3GB (= 2.0GB critical): ocg_job_status 41778 msg: ocg_render_job_queue 0 msg
[06:54:41] <_joe_> ouch
[06:54:58] <_joe_> cscott: ^^
[06:55:02] i'm working on it
[06:55:08] <_joe_> ok :)
[06:55:10] <_joe_> sorry
[06:55:17] no worries, it's good to check.
[07:06:24] !log ori Synchronized php-1.24wmf22/extensions/WikimediaEvents: Update WikimediaEvents for 791e14cfc1d (duration: 00m 05s)
[07:06:29] Logged the message, Master
[07:06:46]
[07:07:02] oh wow: "739 users are trying this feature."
[07:07:16] that's better :)
[07:07:49] <_joe_> eheh
[07:12:40] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 12.1GB (= 5.0GB critical): /srv/deployment/ocg/output 73.9GB (= 50.0GB critical): /srv/deployment/ocg/postmortem 2.9GB (= 2.0GB critical): ocg_job_status 41879 msg: ocg_render_job_queue 0 msg
[07:41:48] !log Updated our Jenkins Job Builder fork 2d74b16..686265a
[07:41:54] Logged the message, Master
[08:05:20] ori: still around?
[08:07:30] RECOVERY - OCG health on ocg1002 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 39387103B: /srv/deployment/ocg/postmortem 91666B: ocg_job_status 42123 msg: ocg_render_job_queue 0 msg
[08:09:32] !log cleared OCG queues and cache to quiet icinga; will try to get to the root cause tomorrow.
[08:09:39] Logged the message, Master
[08:10:27] <_joe_> hashar: I hope not
[08:10:41] <_joe_> hashar: if this is hhvm-related, I may help, maybe
[08:11:04] _joe_: morning :) Na it is a patch he deployed to the WikimediaEvents extension which causes some issue :]
[08:15:14] (PS1) Christopher Johnson (WMDE): Adds a libext option to class for tag install of Sprint library [puppet] - https://gerrit.wikimedia.org/r/163121
[08:20:02] (PS1) Giuseppe Lavagetto: hhvm: do not install a specific version anymore [puppet] - https://gerrit.wikimedia.org/r/163122
[08:21:18] (CR) Giuseppe Lavagetto: [C: 2 V: 2] hhvm: do not install a specific version anymore [puppet] - https://gerrit.wikimedia.org/r/163122 (owner: Giuseppe Lavagetto)
[08:25:40] RECOVERY - OCG health on ocg1003 is OK: OK: /mnt/tmpfs 844016568B: /srv/deployment/ocg/output 2818947112B: /srv/deployment/ocg/postmortem 389833371B: ocg_job_status 42496 msg: ocg_render_job_queue 0 msg
[08:29:41] hashar: thanks for flagging, fixed
[08:29:45] * ori sleeps
[08:30:01] <_joe_> bye ori
[08:30:24] <_joe_> !log updated hhvm on mw1053, kicked the jr a couple of times, working again now
[08:30:27] (CR) Christopher Johnson (WMDE): "I created a new changeset for this https://gerrit.wikimedia.org/r/#/c/163121/." [puppet] - https://gerrit.wikimedia.org/r/162873 (owner: Christopher Johnson (WMDE))
[08:30:29] Logged the message, Master
[08:32:15] (PS2) Filippo Giunchedi: install-server: install lldpd early [puppet] - https://gerrit.wikimedia.org/r/162847
[08:32:21] (CR) Filippo Giunchedi: [C: 2 V: 2] install-server: install lldpd early [puppet] - https://gerrit.wikimedia.org/r/162847 (owner: Filippo Giunchedi)
[08:34:47] mark: does https://gerrit.wikimedia.org/r/#/c/162887 look good to you? essentially re-using pmtpa service ip range for codfw, anything else needed to make that range assigned?
[08:48:34] (PS4) Hashar: Add *.nijmegen.nl to wgCopyUploadsDomains. [mediawiki-config] - https://gerrit.wikimedia.org/r/162556 (https://bugzilla.wikimedia.org/71191) (owner: Steinsplitter)
[08:50:46] (CR) Hashar: [C: 1] "I have cleared JeremyB -1 since the commit message formatting issue has been addressed." [mediawiki-config] - https://gerrit.wikimedia.org/r/162556 (https://bugzilla.wikimedia.org/71191) (owner: Steinsplitter)
[08:52:09] godog: thx for your reviews to my lame zuul-gearman.py script :]
[08:55:00] (CR) Filippo Giunchedi: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/153986 (owner: Dzahn)
[08:55:21] hashar: haha no worries :) it isn't lame btw
[08:55:47] godog: you raise valid points. I hacked it up because I am only interested in two single word commands 'status' and 'workers'
[08:55:58] but looking at the upstream code, gear supports some other multiple words commands
[08:55:58] bah
[08:57:05] hashar: yep I think going with a dict commandline -> method is going to cause less headaches
[08:57:26] godog: definitely. Chaining argparse sub parsers is not very readable
[09:06:15] <_joe_> I love subparsers!
[09:09:48] (PS3) Giuseppe Lavagetto: lvs: add hhvm-api.svc.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/162894
[09:10:03] (CR) Giuseppe Lavagetto: [C: 2] lvs: add hhvm-api.svc.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/162894 (owner: Giuseppe Lavagetto)
[09:14:31] _joe_: chaining them is a bit cumbersome though. My use case is a command such as: foo.py show unique jobs
[09:14:37] or foo.py status
[09:22:34] (PS1) Giuseppe Lavagetto: Add service ip for apis under hhvm [puppet] - https://gerrit.wikimedia.org/r/163131
[09:26:23] (CR) Giuseppe Lavagetto: [C: 2] Add service ip for apis under hhvm [puppet] - https://gerrit.wikimedia.org/r/163131 (owner: Giuseppe Lavagetto)
[09:37:03] (PS2) Giuseppe Lavagetto: add hhvm-api.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/162896
[09:37:05] PROBLEM - puppet last run on analytics1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:51:17] (PS1) Filippo Giunchedi: eqiad-prod: bump ms-be1013/1014/1015 weight to 3000 [software/swift-ring] - https://gerrit.wikimedia.org/r/163137
[10:23:17] (Draft1) Filippo Giunchedi: swiftstats: fix argparse %(default)s [software] - https://gerrit.wikimedia.org/r/160427
[10:23:24] (CR) Filippo Giunchedi: [C: 2 V: 2] swiftstats: fix argparse %(default)s [software] - https://gerrit.wikimedia.org/r/160427 (owner: Filippo Giunchedi)
[10:24:21] (Draft2) Filippo Giunchedi: thumbstats: fix description [software] - https://gerrit.wikimedia.org/r/160426
[10:24:27] (CR) Filippo Giunchedi: [C: 2 V: 2] thumbstats: fix description [software] - https://gerrit.wikimedia.org/r/160426 (owner: Filippo Giunchedi)
[10:24:49] (PS2) Filippo Giunchedi: swiftstats: fix argparse %(default)s [software] - https://gerrit.wikimedia.org/r/160427
[10:26:29] sorry the spam :(
[10:29:06] (PS6) Yuvipanda: Remove all vumi related code [puppet] - https://gerrit.wikimedia.org/r/158365
[10:29:08] (PS1) Yuvipanda: Setup ldap client in bast1001 [puppet] - https://gerrit.wikimedia.org/r/163141
[10:29:09] tch tch :)
[10:29:42] YuviPanda: I know you are disappoint by my spam compared to your spam
[10:29:55] godog: :) hopefully no more super-long series, tho
[10:30:17] hehe
[10:30:44] PROBLEM - MySQL Slave Delay on db1022 is CRITICAL: CRIT replication delay 367 seconds
[10:31:35] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: CRIT replication delay 324 seconds
[10:33:19] UPDATE /* RenameUserJob::run
[10:33:43] <_joe_> springle: user renaming going on
[10:33:45] (CR) Hoo man: [C: -1] "I thought we agreed to not add stuff to the bastions again..." [puppet] - https://gerrit.wikimedia.org/r/163141 (owner: Yuvipanda)
[10:34:37] _joe_: mmm... but not waiting for a slave in the pool seems odd
[10:34:45] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay -0 seconds
[10:34:46] (CR) Yuvipanda: "Hmm, this was Coren's suggestion (and I wasn't around when the bastion discussion took place, can you give me context?). We can always put" [puppet] - https://gerrit.wikimedia.org/r/163141 (owner: Yuvipanda)
[10:35:05] RECOVERY - MySQL Slave Delay on db1022 is OK: OK replication delay 0 seconds
[10:35:40] hoo: can you tell me where the bastion discussion happened?
[10:35:56] ops list, a while ago?
[10:36:00] (CR) Hoo man: "A misc host sounds ok to me. See also https://gerrit.wikimedia.org/r/126027" [puppet] - https://gerrit.wikimedia.org/r/163141 (owner: Yuvipanda)
[10:36:29] hmm, I see
[10:36:52] Reedy: where do you think I should put 'em?
[10:36:59] since you're on the ldap-admins group :)
[10:37:12] misc host probably makes sense
[10:37:15] tin/terbium
[10:38:08] hmm
[10:38:21] * YuviPanda wonders where it'll fit better, tin or terbium
[10:38:36] terbium seems more misc than tin
[10:38:39] so terbium, perhaps
[10:38:54] it wouldn't be a major issue for it to be on a couple of servers
[10:39:25] springle: :S Sorry for the huge rename... doesn't seem to be common sense to not rename users with 300k edits :S
[10:39:41] heh
[10:40:12] If this gets a problem mid-term we might need to put in a hard limit, again :S
[10:44:15] hoo: this might be some wfWaitForSlaves() type issue. RenameUserJob::run seems well batched. And none of db1022's siblings were much affected. only dbstore1002, which is not pooled
[10:44:42] binlog time
[10:45:53] mh, I'm not sure how we could make these kind of things more graceful for non-pooled DBs
[10:46:12] (CR) Dan-nl: [C: 1] Add *.nijmegen.nl to wgCopyUploadsDomains. [mediawiki-config] - https://gerrit.wikimedia.org/r/162556 (https://bugzilla.wikimedia.org/71191) (owner: Steinsplitter)
[10:46:25] i don't care about non-pooled DBs. just pooled ones, as db1022 is
[10:47:00] So it's not an issue?
[10:47:20] wait, it is :S
[10:47:31] Scrollback is helpful :P
[10:47:33] it is an issue if pooled DBs like db1022 lag. it isn't an issue if dbstore lags :)
[10:50:43] (PS2) Yuvipanda: Setup ldap client in terbiu [puppet] - https://gerrit.wikimedia.org/r/163141
[10:50:49] hoo: Reedy ^ moved to terbium
[10:51:23] (PS3) Hoo man: Setup ldap client in terbium [puppet] - https://gerrit.wikimedia.org/r/163141 (owner: Yuvipanda)
[10:51:25] typos
[10:51:53] did I typo decommission again?
[10:51:54] * YuviPanda sighs
[10:52:15] (CR) Hoo man: [C: 1] Setup ldap client in terbium [puppet] - https://gerrit.wikimedia.org/r/163141 (owner: Yuvipanda)
[11:24:46] (PS1) Alexandros Kosiaris: convert mathoid from high-traffic to low-traffic [puppet] - https://gerrit.wikimedia.org/r/163144
[11:25:25] (PS1) Giuseppe Lavagetto: mediawiki: allow additional pools on one mediawiki host [puppet] - https://gerrit.wikimedia.org/r/163145
[11:25:27] <_joe_> akosiaris: ^^
[11:26:13] <_joe_> I'm using the same servers to serve api and text on hhvm, but I already created separated pybal pools
[11:26:32] <_joe_> so that once we're ready to separate the two, we won't have to do anything traumatic
[11:26:53] heh, ok
[11:27:20] (CR) Alexandros Kosiaris: [C: 2] convert mathoid from high-traffic to low-traffic [puppet] - https://gerrit.wikimedia.org/r/163144 (owner: Alexandros Kosiaris)
[11:28:16] (PS2) Yuvipanda: nagios_common: Move check_ganglia into module [puppet] - https://gerrit.wikimedia.org/r/162967
[11:28:18] (PS2) Yuvipanda: icinga: Move naggen into module [puppet] - https://gerrit.wikimedia.org/r/162936
[11:28:20] (PS2) Yuvipanda: icinga: Move global monitoring hostgroups into module [puppet] - https://gerrit.wikimedia.org/r/162966
[11:28:22] (PS4) Yuvipanda: icinga: Move packages into module [puppet] - https://gerrit.wikimedia.org/r/162925
[11:28:24] (PS4) Yuvipanda: icinga: Remove resource.cfg from refresh list [puppet] - https://gerrit.wikimedia.org/r/162924
[11:28:26] (PS4) Yuvipanda: shinken: Specify config_dir for contacts [puppet] - https://gerrit.wikimedia.org/r/162919
[11:28:28] (PS1) Yuvipanda: icinga: Move initscript into module [puppet] - https://gerrit.wikimedia.org/r/163146
[11:31:44] (PS1) Springle: WIP: Cleanup the Sanitarium [software] - https://gerrit.wikimedia.org/r/163147
[11:32:35] (PS2) Giuseppe Lavagetto: mediawiki: allow additional pools on one mediawiki host [puppet] - https://gerrit.wikimedia.org/r/163145
[11:32:45] (CR) Giuseppe Lavagetto: [C: 2 V: 2] mediawiki: allow additional pools on one mediawiki host [puppet] - https://gerrit.wikimedia.org/r/163145 (owner: Giuseppe Lavagetto)
[11:34:05] (CR) Springle: [C: -2] WIP: Cleanup the Sanitarium [software] - https://gerrit.wikimedia.org/r/163147 (owner: Springle)
[11:43:47] (CR) JanZerebecki: [C: 1] puppetmaster - use ssl_ciphersuite [puppet] - https://gerrit.wikimedia.org/r/153986 (owner: Dzahn)
[11:44:49] (PS4) Yuvipanda: dsh: Move dsh related code into a module [puppet] - https://gerrit.wikimedia.org/r/162570
[11:49:27] (PS2) Reedy: Add robots.txt rewrite rule where wiki is public [puppet] - https://gerrit.wikimedia.org/r/147487
[11:50:00] (CR) Reedy: [C: -1] "Needs rebasing" [puppet] - https://gerrit.wikimedia.org/r/147488 (owner: Reedy)
[11:55:03] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 3.8GB (= 1.0GB warning): /srv/deployment/ocg/output 23974555903B: /srv/deployment/ocg/postmortem 2.1GB (= 2.0GB critical): ocg_job_status 44483 msg: ocg_render_job_queue 0 msg
[11:57:13] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 4.2GB (= 1.0GB warning): /srv/deployment/ocg/output 24017559111B: /srv/deployment/ocg/postmortem 2.1GB (= 2.0GB critical): ocg_job_status 44502 msg: ocg_render_job_queue 0 msg
[11:59:45] (CR) Tim Landscheidt: "At labs-l you wrote that this patch must be applied to self-hosted puppetmasters, yet this is not merged :-)." [puppet] - https://gerrit.wikimedia.org/r/162689 (owner: Andrew Bogott)
[12:01:13] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 2.7GB (= 1.0GB warning): /srv/deployment/ocg/output 24288640572B: /srv/deployment/ocg/postmortem 2.1GB (= 2.0GB critical): ocg_job_status 44538 msg: ocg_render_job_queue 0 msg
[12:09:15] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 2.2GB (= 1.0GB warning): /srv/deployment/ocg/output 26027946244B: /srv/deployment/ocg/postmortem 2.1GB (= 2.0GB critical): ocg_job_status 44615 msg: ocg_render_job_queue 0 msg
[12:16:19] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 3.2GB (= 1.0GB warning): /srv/deployment/ocg/output 26787142202B: /srv/deployment/ocg/postmortem 2.1GB (= 2.0GB critical): ocg_job_status 44682 msg: ocg_render_job_queue 0 msg
[12:20:33] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 8.4GB (= 5.0GB critical): /srv/deployment/ocg/output 22983581863B: /srv/deployment/ocg/postmortem 1.5GB (= 1.0GB warning): ocg_job_status 44716 msg: ocg_render_job_queue 0 msg
[12:22:24] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 1.4GB (= 1.0GB warning): /srv/deployment/ocg/output 27331718384B: /srv/deployment/ocg/postmortem 2.1GB (= 2.0GB critical): ocg_job_status 44740 msg: ocg_render_job_queue 0 msg
[12:22:58] uhhhh
[12:23:02] from InitializeSettings:
[12:23:07] '+amwiki' => array(
[12:23:07] 100 <= 'በር',
[12:23:07] ),
[12:23:18] PHP is not a RTL language, gentlemen…
[12:23:39] haha
[12:24:03] this apparently has been there since the dawn of time
[12:24:40] hahaha
[12:24:45] how did that work
[12:24:47] false?
[12:25:14] it didn't
[12:25:17] hehe
[12:25:22] sure, but php didn't throw an error
[12:25:24] i guess it's just ignored
[12:25:26] (PS1) Bartosz Dziewoński: Remove dead code for amwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/163148
[12:25:51] I think it might treat it as a <= operator, and check if the left is lesser than or equal to the right, result as 'false' and just make it array (false)
[12:25:52] or, amwiki has '0' as alias for the main namespace, i guess?
[12:26:07] haha
[12:26:12] array( 0 => false)
[12:32:25] (PS2) Bartosz Dziewoński: Remove dead code for amwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/163148
[12:34:43] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 18.4GB (= 5.0GB critical): /srv/deployment/ocg/output 24617883610B: /srv/deployment/ocg/postmortem 1.5GB (= 1.0GB warning): ocg_job_status 44862 msg: ocg_render_job_queue 0 msg
[12:35:48] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 2.2GB (= 1.0GB warning): /srv/deployment/ocg/output 30156423997B: /srv/deployment/ocg/postmortem 2.1GB (= 2.0GB critical): ocg_job_status 44873 msg: ocg_render_job_queue 0 msg
[12:40:53] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 923137387B: /srv/deployment/ocg/output 30479904911B: /srv/deployment/ocg/postmortem 2.1GB (= 2.0GB critical): ocg_job_status 44929 msg: ocg_render_job_queue 0 msg
[12:41:52] (PS2) Giuseppe Lavagetto: hhvm: serve API as well [puppet] - https://gerrit.wikimedia.org/r/162902
[12:42:57] hmmm
[12:45:03] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 8.4GB (= 5.0GB critical): /srv/deployment/ocg/output 25202229666B: /srv/deployment/ocg/postmortem 1.5GB (= 1.0GB warning): ocg_job_status 44971 msg: ocg_render_job_queue 0 msg
[12:46:53] RECOVERY - OCG health on ocg1001 is OK: OK: /mnt/tmpfs 851367651B: /srv/deployment/ocg/output 652697158B: /srv/deployment/ocg/postmortem 17979B: ocg_job_status 44990 msg: ocg_render_job_queue 0 msg
[12:48:57] !log cleared OCG caches again when I woke up to buy me more time to investigate the issue properly.
[12:49:02] Logged the message, Master
[12:51:00] (PS1) Yuvipanda: icinga: Move config into module [puppet] - https://gerrit.wikimedia.org/r/163149
[12:51:02] (PS1) Yuvipanda: icinga: Move service definition into module [puppet] - https://gerrit.wikimedia.org/r/163150
[12:53:56] RECOVERY - OCG health on ocg1003 is OK: OK: /mnt/tmpfs 194094355B: /srv/deployment/ocg/output 1677047587B: /srv/deployment/ocg/postmortem 222820B: ocg_job_status 45067 msg: ocg_render_job_queue 0 msg
[13:06:45] (CR) Mark Bergsma: [C: 1] hhvm: serve API as well [puppet] - https://gerrit.wikimedia.org/r/162902 (owner: Giuseppe Lavagetto)
[13:07:35] _joe_: when are we enabling hhvm for api?
[13:08:02] <_joe_> aude: I guess in a couple of hours
[13:08:16] can wikidata opt out (for now)?
[13:08:23] https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team#Error_message_when_trying_to_create_a_new_property_with_HHVM_activated is not too nice already
[13:09:00] would be happier if it were more stable before we enable
[13:09:15] <_joe_> aude: did you retry today?
[13:09:24] https://www.wikidata.org/wiki/Wikidata:Database_reports/Constraint_violations/P373 doesn't work for me.
[13:09:25] yes
[13:09:34] https://test.wikidata.org/w/index.php?title=Q6&diff=8305&oldid=5032&uselang=en *boom*
[13:09:39] <_joe_> ok, is there an open bug?
[13:09:51] we will be backporting some more things for deploy on tuesday that should help
[13:10:00] for our next deployment branch
[13:10:32] https://bugzilla.wikimedia.org/show_bug.cgi?id=64415
[13:11:03] <_joe_> testwikidata is down altogether
[13:11:06] now we have to come up with something to retry the secondary updates
[13:11:25] maybe i broke it with tyring to view a diff
[13:11:47] <_joe_> mmmh lemme check
[13:11:52] anyway, since vast majority of wikidata is edited via api, i don't think that's a good step for us yet
[13:11:56] (PS2) Hashar: zuul: client to easily query Gearman server [puppet] - https://gerrit.wikimedia.org/r/162856
[13:12:05] (CR) Hashar: zuul: client to easily query Gearman server (4 comments) [puppet] - https://gerrit.wikimedia.org/r/162856 (owner: Hashar)
[13:12:22] why is HHVM having so many problems with wikidata specifically?
[13:12:29] mark: https://bugzilla.wikimedia.org/show_bug.cgi?id=64415
[13:12:41] some bug with class_alias
[13:12:47] Because Wikidata isn't the same as the other wiki's? :)
[13:13:47] more in https://bugzilla.wikimedia.org/show_bug.cgi?id=69708
[13:13:56] thanks
[13:14:02] <_joe_> aude: I'll stall deployment then
[13:14:21] should be better on tuesday after daploy
[13:14:24] deploy*
[13:15:07] <_joe_> ok sorry I was pretty sure that bug was resolved
[13:15:16] ori thought so but no
[13:15:30] <_joe_> aude: if hhvm is not manually enabled, however, you won't use hhvm at all
[13:15:40] correct
[13:15:47] <_joe_> aude: so, as long as you don't opt-in to the beta feature
[13:15:54] <_joe_> in wikidata specifically
[13:16:02] we could just turn the beta feature off as last resort
[13:16:04] <_joe_> I don't think you will have a problem
[13:17:55] <_joe_> aude: also, yes; let me speak with ori later
[13:18:05] <_joe_> we'll make a call toghether
[13:18:30] <_joe_> if disabling beta on wikidata can be a better option
[13:18:42] <_joe_> or waiting until tuesday
[13:18:42] might be
[13:19:00] seems to be causing data corruption (e.g. when deleting stuff)
[13:19:17] <_joe_> mmmh
[13:19:21] fine to keep testing on test.wikidata and also ok on wikipedias, since it's not involving editing
[13:19:34] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 6.2GB (= 5.0GB critical): /srv/deployment/ocg/output 2756084659B: /srv/deployment/ocg/postmortem 708001B: ocg_job_status 45304 msg: ocg_render_job_queue 0 msg
[13:19:52] e.g. the secondary updates are not done when deleting so it's impossible to add the same label to another property
[13:21:01] <_joe_> ok, then - I'll take some time off and discuss it later
[13:21:10] <_joe_> I think we should be GTG as-is
[13:21:12] ok
[13:21:35] i say turn it off on wikidata and then hhvm on api should be ok
[13:21:49] <_joe_> ok, laters!
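Editor's note on the zuul-gearman.py thread in this log: godog suggests "a dict commandline -> method" instead of chained argparse subparsers, for commands such as `foo.py show unique jobs` or `foo.py status`. A minimal sketch of that idea might look like the following. It is not the code from the Gerrit change; the handler functions, their return strings, and the `dispatch` helper are hypothetical stand-ins, and a real script would query the Gearman server where the handlers return fixed text.

```python
import argparse


# Hypothetical handlers: placeholders for calls that would query Gearman.
def show_status():
    return "status: ok"


def show_workers():
    return "workers: 0"


def show_unique_jobs():
    return "unique jobs: 0"


# Map the full command line (as a tuple of words) to a handler, so that
# multi-word commands need no nested subparsers.
COMMANDS = {
    ("status",): show_status,
    ("workers",): show_workers,
    ("show", "unique", "jobs"): show_unique_jobs,
}


def build_parser():
    parser = argparse.ArgumentParser(prog="foo.py")
    # Collect every positional word; the dict lookup decides what it means.
    parser.add_argument(
        "command", nargs="+", help="e.g. 'status' or 'show unique jobs'"
    )
    return parser


def dispatch(argv):
    args = build_parser().parse_args(argv)
    key = tuple(args.command)
    try:
        handler = COMMANDS[key]
    except KeyError:
        known = ", ".join(" ".join(words) for words in sorted(COMMANDS))
        raise SystemExit(
            f"unknown command {' '.join(key)!r}; expected one of: {known}"
        )
    return handler()


if __name__ == "__main__":
    import sys

    print(dispatch(sys.argv[1:]))
```

Adding a command is then one new dict entry, which is the readability argument made in the channel; the cost is that argparse no longer generates per-command help.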
[13:21:59] based on the reports, users will understand and not be a problem [13:26:02] (03CR) 10Hashar: "Tried on gallium:" [puppet] - 10https://gerrit.wikimedia.org/r/162856 (owner: 10Hashar) [13:26:20] godog: I enhanced the zuul-gearman.py script based on your recommendation ( https://gerrit.wikimedia.org/r/#/c/162856/ ) [13:26:36] godog: using a map of user command to gear.classname :-] [13:27:36] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 19.8GB (= 5.0GB critical): /srv/deployment/ocg/output 3148383762B: /srv/deployment/ocg/postmortem 43895095B: ocg_job_status 45371 msg: ocg_render_job_queue 0 msg [13:28:32] (03Abandoned) 10Yuvipanda: dsh: Move dsh related code into a module [puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [13:28:34] RECOVERY - OCG health on ocg1002 is OK: OK: /mnt/tmpfs 930974473B: /srv/deployment/ocg/output 3286104467B: /srv/deployment/ocg/postmortem 43924168B: ocg_job_status 45381 msg: ocg_render_job_queue 0 msg [13:29:11] (03PS1) 10Yuvipanda: icinga: Move plugins into module [puppet] - 10https://gerrit.wikimedia.org/r/163153 [13:29:38] (03PS1) 10Mark Bergsma: Send all ariel's mail to Google Apps [puppet] - 10https://gerrit.wikimedia.org/r/163154 [13:30:47] (03CR) 10Yuvipanda: "Tried rebasing this, was *very* complicated :(" [puppet] - 10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn) [13:31:03] (03Restored) 10Yuvipanda: dsh: Move dsh related code into a module [puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [13:34:23] (03PS5) 10Yuvipanda: dsh: Move dsh related code into a module [puppet] - 10https://gerrit.wikimedia.org/r/162570 [13:36:19] hashar: cool I'll take a look! [13:37:46] godog: wanna merge ^? [13:38:44] RECOVERY - Host mathoid.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [13:39:46] !log moved mathoid to low-traffic lvs servers@eqiad [13:39:52] Logged the message, Master [13:40:54] YuviPanda: which? 
[13:41:05] godog: https://gerrit.wikimedia.org/r/162570 [13:41:10] not icinga related [13:41:13] (03PS1) 10BBlack: authdns: set up extra listeners for codfw transitions [puppet] - 10https://gerrit.wikimedia.org/r/163157 [13:41:16] just a simple modularization [13:41:29] PROBLEM - LVS HTTP IPv4 on mathoid.svc.eqiad.wmnet is CRITICAL: Connection timed out [13:42:17] pages [13:42:23] yes, disregard [13:42:26] thanks [13:46:57] YuviPanda: sure, with many files involved I'd suggest running the puppet compiler but I don't know its status ATM, _joe_ perhaps does [13:47:25] godog: hmm, right. although they're just straight up moves, and 99% of the files moved are config files in files/ [13:47:39] (03CR) 10Andrew Bogott: "It's about to be merged (as per my email :) )" [puppet] - 10https://gerrit.wikimedia.org/r/162689 (owner: 10Andrew Bogott) [13:47:40] it's pretty much 2 puppet files with about 20 lines of code [13:50:21] godog: and yeah, puppet compiler is dead atm, IIRC [13:51:28] (03PS6) 10Filippo Giunchedi: dsh: Move dsh related code into a module [puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [13:51:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] dsh: Move dsh related code into a module [puppet] - 10https://gerrit.wikimedia.org/r/162570 (owner: 10Yuvipanda) [13:52:08] Function: MovePage::moveToInternal [13:52:08] Error: 1062 Duplicate entry '2-VlsergeyBot/\xD0\xA4\xD1\x83\xD1\x82\xD0\xB1\xD0\xBE\xD0\xBB\x' for key 'name_title' (10.64.16.12) [13:52:12] YuviPanda: ack, I've merged it [13:52:23] (03PS2) 10BBlack: authdns: set up extra listeners for codfw transitions [puppet] - 10https://gerrit.wikimedia.org/r/163157 [13:52:29] (03CR) 10BBlack: [C: 032 V: 032] authdns: set up extra listeners for codfw transitions [puppet] - 10https://gerrit.wikimedia.org/r/163157 (owner: 10BBlack) [13:54:07] mh puppet-merge gave me an error [13:54:08] fatal: Unable to create [13:54:08] '/var/lib/git/operations/puppet/.git/refs/remotes/origin/production.lock': 
File [13:54:11] exists. [13:54:13] checking [13:54:19] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 5.2GB (= 5.0GB critical): /srv/deployment/ocg/output 5774048299B: /srv/deployment/ocg/postmortem 44318614B: ocg_job_status 45634 msg: ocg_render_job_queue 0 msg [13:55:46] godog: maybe because we were merging close to each other [13:56:12] (03PS1) 10BBlack: authdns: Add trailing comma on extra listeners [puppet] - 10https://gerrit.wikimedia.org/r/163162 [13:56:18] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Puppet has 5 failures [13:56:19] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 5.7GB (= 5.0GB critical): /srv/deployment/ocg/output 7075727818B: /srv/deployment/ocg/postmortem 133952377B: ocg_job_status 45653 msg: ocg_render_job_queue 0 msg [13:57:01] (03CR) 10BBlack: [C: 032 V: 032] authdns: Add trailing comma on extra listeners [puppet] - 10https://gerrit.wikimedia.org/r/163162 (owner: 10BBlack) [13:57:12] bblack: yeah I think so, I had to ssh strontium as gitpuppet to trigger the hook [13:57:14] godog: hmm, can you check the puppet failures on bast? 
[13:57:17] * YuviPanda wonders if that was me [13:57:22] !log Manually declared the global rename Secretary-> VlsergeyBot done after it twice timed out on pages moves on ruwiki [13:57:23] (03CR) 10Alexandros Kosiaris: Followup 6084646d: apply Mathoid directory creation hack to labs too (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/162811 (owner: 10Catrope) [13:57:28] Logged the message, Master [13:57:43] YuviPanda: yes that was me running puppet manually, it failed at first because of the puppet-merge, running again [13:58:43] (03CR) 10Mark Bergsma: [C: 032] Send all ariel's mail to Google Apps [puppet] - 10https://gerrit.wikimedia.org/r/163154 (owner: 10Mark Bergsma) [14:00:11] YuviPanda: looks good [14:00:18] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:00:18] godog: \o/ cool, thanks [14:00:45] np [14:01:27] godog: if you've time later on (but not too close to end-of-your-working-day) I could use reviews/merges on icinga, but I can also wait till next week :) [14:02:42] YuviPanda: my gerrit routine is usually in the morning :)) if it can wait it'll be monday [14:02:54] ah cool [14:03:00] I forgot which tz you were in for a bit :) [14:03:28] YuviPanda: same as yours if you are still BST! [14:03:45] godog: I am! [14:03:54] however these days I manage to wake up by 11:30 AM or noon [14:03:59] in India it's usually around 6PM [14:05:10] you wake up at 6pm while in india? 
:P [14:05:22] godog: for quite a while, yeah [14:05:53] haha wow [14:06:13] godog: was in a mid-atlantic-ish TZ when I was in India [14:06:16] and it drifted constantly [14:07:29] RECOVERY - OCG health on ocg1002 is OK: OK: /mnt/tmpfs 732754450B: /srv/deployment/ocg/output 7264268465B: /srv/deployment/ocg/postmortem 44737794B: ocg_job_status 45758 msg: ocg_render_job_queue 0 msg [14:08:33] (03CR) 10Yuvipanda: "I don't really have a full understanding of what's going on, but comments (especially around the requires, and why they are there) would b" [puppet] - 10https://gerrit.wikimedia.org/r/162689 (owner: 10Andrew Bogott) [14:10:29] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 5.5GB (= 5.0GB critical): /srv/deployment/ocg/output 7687109855B: /srv/deployment/ocg/postmortem 192206018B: ocg_job_status 45784 msg: ocg_render_job_queue 0 msg [14:13:38] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 5.1GB (= 5.0GB critical): /srv/deployment/ocg/output 7713833040B: /srv/deployment/ocg/postmortem 289867754B: ocg_job_status 45811 msg: ocg_render_job_queue 0 msg [14:13:39] (03PS1) 10BBlack: switch NS1 IP to baham (codfw) in local data [dns] - 10https://gerrit.wikimedia.org/r/163164 [14:18:19] * bmansurov is back (gone 00:01:34) [14:18:30] * bmansurov is away: I'm busy [14:20:02] (03PS11) 10Andrew Bogott: Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 [14:21:12] _joe_: https://gerrit.wikimedia.org/r/#/c/163166/ if ori, greg-g etc would be ok with friday deploy on test.wikidata [14:21:23] should help clear most of the issues, afaik [14:21:51] wont' be on wikidata until tuesday, though [14:24:10] (03PS2) 10BBlack: switch NS1 IP to baham (codfw) in local data [dns] - 10https://gerrit.wikimedia.org/r/163164 [14:25:49] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 8.5GB (= 5.0GB critical): /srv/deployment/ocg/output 9321064521B: /srv/deployment/ocg/postmortem 
25863811B: ocg_job_status 45866 msg: ocg_render_job_queue 0 msg [14:29:42] (03CR) 10BBlack: [C: 032] switch NS1 IP to baham (codfw) in local data [dns] - 10https://gerrit.wikimedia.org/r/163164 (owner: 10BBlack) [14:34:40] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/163174 [14:35:05] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-compiler: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/163174 (owner: 10Giuseppe Lavagetto) [14:35:24] (03CR) 10Giuseppe Lavagetto: [V: 032] puppet-compiler: fix typos [puppet] - 10https://gerrit.wikimedia.org/r/163174 (owner: 10Giuseppe Lavagetto) [14:36:19] !log address for ns1 switched in our local dns data - https://gerrit.wikimedia.org/r/163164 [14:36:26] Logged the message, Master [14:36:33] * cscott is still working on OCG [14:36:42] (just in case that helps some correlate any fallout, not that there should be any) [14:36:59] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 7.9GB (= 5.0GB critical): /srv/deployment/ocg/output 9964406709B: /srv/deployment/ocg/postmortem 79634975B: ocg_job_status 45917 msg: ocg_render_job_queue 0 msg [14:39:01] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 5.3GB (= 5.0GB critical): /srv/deployment/ocg/output 9971297437B: /srv/deployment/ocg/postmortem 79748978B: ocg_job_status 45930 msg: ocg_render_job_queue 0 msg [14:43:59] RECOVERY - OCG health on ocg1003 is OK: OK: /mnt/tmpfs 960016513B: /srv/deployment/ocg/output 900831754B: /srv/deployment/ocg/postmortem 262034B: ocg_job_status 45951 msg: ocg_render_job_queue 0 msg [14:46:45] (03PS4) 10Alexandros Kosiaris: openldap module [puppet] - 10https://gerrit.wikimedia.org/r/156322 [14:52:09] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 6.5GB (= 5.0GB critical): /srv/deployment/ocg/output 9495734644B: /srv/deployment/ocg/postmortem 272253962B: ocg_job_status 45981 msg: ocg_render_job_queue 0 msg [14:53:59] RECOVERY - OCG health on 
ocg1001 is OK: OK: /mnt/tmpfs 203126575B: /srv/deployment/ocg/output 1060683803B: /srv/deployment/ocg/postmortem 120224182B: ocg_job_status 45988 msg: ocg_render_job_queue 0 msg [14:54:09] RECOVERY - OCG health on ocg1002 is OK: OK: /mnt/tmpfs 721561677B: /srv/deployment/ocg/output 999737576B: /srv/deployment/ocg/postmortem 47240744B: ocg_job_status 45988 msg: ocg_render_job_queue 0 msg [15:02:39] !log documented what I'm going to clear the OCG queues at https://wikitech.wikimedia.org/wiki/OCG#Pruning_the_queue [15:02:46] Logged the message, Master [15:05:22] (03PS7) 10Ottomata: Remove all vumi related code [puppet] - 10https://gerrit.wikimedia.org/r/158365 (owner: 10Yuvipanda) [15:05:34] (03CR) 10Ottomata: [C: 032 V: 032] Remove all vumi related code [puppet] - 10https://gerrit.wikimedia.org/r/158365 (owner: 10Yuvipanda) [15:05:38] ottomata: yay [15:05:47] ottomata: there's a dependent patch there as well [15:05:49] ja on it [15:05:54] (03PS4) 10Ottomata: Setup ldap client in terbium [puppet] - 10https://gerrit.wikimedia.org/r/163141 (owner: 10Yuvipanda) [15:06:30] (03CR) 10Ottomata: [C: 032 V: 032] Setup ldap client in terbium [puppet] - 10https://gerrit.wikimedia.org/r/163141 (owner: 10Yuvipanda) [15:06:40] ottomata: yay :) force a run on terbium? [15:06:43] on it! [15:06:44] :) [15:08:12] aude: _joe_ I missed context I think, what are the issues? [15:08:36] greg-g: https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team#Error_message_when_trying_to_create_a_new_property_with_HHVM_activated [15:08:46] hhvm issues, so [15:09:08] 1) i think bad idea to have hhvm for wikidata api (since api is involved for most edits) [15:09:27] we could disable hhvm beta on wikidata and keep on test.wikidata until we are more confident [15:09:29] ottomata: wanna help out with the icinga ones? 
andrewbogot.t is busy with ldap stuff elsewhere [15:09:48] 2) we have some fixes for wmf/1.25wmf1, if ori wants to try them on test.wikidata [15:09:55] yup [15:09:58] got time now [15:10:01] ottomata: yay [15:10:04] https://gerrit.wikimedia.org/r/#/c/163166/ [15:10:16] aude: disabling hhvm BF makes sense [15:10:22] agree [15:10:27] aude: can the fixes be tested on beta wikidata? [15:10:29] ottomata: simple ones to start with: https://gerrit.wikimedia.org/r/#/c/162924/, https://gerrit.wikimedia.org/r/#/c/162925/ and https://gerrit.wikimedia.org/r/#/c/162919/ [15:10:35] YuviPanda: it will be hard for me with that name though, but I will slog through :p [15:10:50] ottomata: don't worry, most of these are named 'icinga' and are icinga specific :) [15:10:55] greg-g: i think most/all are [15:10:59] * aude checks [15:11:10] (03PS1) 10Jgreen: attempt to move ocg logs back to /var/log/ocg.log [puppet] - 10https://gerrit.wikimedia.org/r/163176 [15:11:18] https://gerrit.wikimedia.org/r/#/c/163143/ [15:11:48] (03PS5) 10Ottomata: shinken: Specify config_dir for contacts [puppet] - 10https://gerrit.wikimedia.org/r/162919 (owner: 10Yuvipanda) [15:11:54] (03CR) 10Ottomata: [C: 032 V: 032] shinken: Specify config_dir for contacts [puppet] - 10https://gerrit.wikimedia.org/r/162919 (owner: 10Yuvipanda) [15:12:09] (03PS5) 10Ottomata: icinga: Remove resource.cfg from refresh list [puppet] - 10https://gerrit.wikimedia.org/r/162924 (owner: 10Yuvipanda) [15:12:25] i think beta is missing one small patch, but others are there [15:14:04] hm, YuviPanda, since these nagios_common things are classes [15:14:29] why not have a nagios_common/manifests/init.pp file where the parameters are set for all the other classes? 
[15:14:30] aude: I'll let _joe_ / ori decide on just disabling HHVM BF vs pulling in a lot of changes for wikidata on Friday :) [15:14:33] like [15:14:37] greg-g: ok [15:14:41] ottomata: ah, it is on the way :) [15:14:43] nagios_common($config_dir = '/etc/icinga') [15:14:44] ah ok :) [15:14:50] then reference them via variable [15:14:52] ok cool :) [15:14:53] these changes are test.wikidata only, so don't help wikidata until tuesday [15:14:57] ottomata: it's the last stop, and if you notice, a lot of individual $config_dir params go away in latter patches [15:15:02] ah ok cool [15:15:11] if they want hhvm on the api today, then i say disable the bf for wikidata [15:15:26] * aude thinks more scary to friday deploy hhvm on api [15:16:03] YuviPanda: I'm confused about user_macros [15:16:16] it says [15:16:17] # Defines $USERn$ macros for nagios compatible implementations [15:16:17] ottomata: they setup $USERn$ macros for icinga/shinken [15:16:24] aude: /me nods [15:16:35] but, it doesn't actually do that, all it does it install resource.cfg [15:16:41] up to them :) [15:16:43] ottomata: resource.cfg has user macros :) [15:16:51] HMMMMMMMMMMM [15:16:53] ottomata: I kept the resource.cfg name because that's what icinga/shinken put them in [15:16:56] ok sorry, am reading more... [15:17:40] cool, ok, i guess this can be templatized if/when needed [15:17:51] (03CR) 10Ottomata: [C: 032 V: 032] icinga: Remove resource.cfg from refresh list [puppet] - 10https://gerrit.wikimedia.org/r/162924 (owner: 10Yuvipanda) [15:17:55] ottomata: yeah, not necesary right now [15:18:08] (03PS5) 10Ottomata: icinga: Move packages into module [puppet] - 10https://gerrit.wikimedia.org/r/162925 (owner: 10Yuvipanda) [15:20:05] ottomata: I'm unsure if icinga-doc is actually required for the web interface, but didn't want to break it [15:21:38] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). 
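A rough sketch of the layout ottomata describes above: shared settings are declared once as parameters of a base nagios_common class, and the other classes read them through a qualified variable. Only the nagios_common name, the $config_dir parameter with its '/etc/icinga' default, and resource.cfg come from the discussion; the body of nagios_common::user_macros and the puppet:/// source path are assumptions made for illustration, not the code under review.

```puppet
# modules/nagios_common/manifests/init.pp
# Shared parameters live on the base class, as suggested in the discussion.
class nagios_common (
    $config_dir = '/etc/icinga',
) {
}

# modules/nagios_common/manifests/user_macros.pp (hypothetical body)
# Installs resource.cfg, the file that holds the $USERn$ macro definitions.
class nagios_common::user_macros {
    # Evaluate the base class first so its parameter is readable here.
    include nagios_common

    file { "${nagios_common::config_dir}/resource.cfg":
        ensure => present,
        # Assumed source location; the real path is not given in the log.
        source => 'puppet:///modules/nagios_common/resource.cfg',
    }
}
```

With this shape the individual $config_dir parameters on each class become unnecessary, which is consistent with YuviPanda's remark that many of them go away in the later patches.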
[15:21:58] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [15:22:28] (03CR) 10Jgreen: [C: 032 V: 031] attempt to move ocg logs back to /var/log/ocg.log [puppet] - 10https://gerrit.wikimedia.org/r/163176 (owner: 10Jgreen) [15:23:28] PROBLEM - LDAPS on labcontrol2001 is CRITICAL: Connection refused [15:23:29] PROBLEM - LDAP on neptunium is CRITICAL: Connection refused [15:23:33] ottomata: ok to merge your changes? [15:24:01] * YuviPanda thinks so [15:24:06] ok [15:24:06] ah soryr [15:24:07] yup [15:24:16] merged! [15:24:22] am reviewing a couple more, always forget when i'm in the middle of merging a few at once [15:24:28] RECOVERY - LDAPS on labcontrol2001 is OK: TCP OK - 0.052 second response time on port 636 [15:24:36] RECOVERY - LDAP on neptunium is OK: TCP OK - 0.002 second response time on port 389 [15:24:38] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [15:24:46] (03PS6) 10Ottomata: icinga: Move packages into module [puppet] - 10https://gerrit.wikimedia.org/r/162925 (owner: 10Yuvipanda) [15:24:51] (03CR) 10Ottomata: [C: 032 V: 032] icinga: Move packages into module [puppet] - 10https://gerrit.wikimedia.org/r/162925 (owner: 10Yuvipanda) [15:25:06] (03PS3) 10Ottomata: icinga: Move naggen into module [puppet] - 10https://gerrit.wikimedia.org/r/162936 (owner: 10Yuvipanda) [15:25:14] YuviPanda: should I just keep going down the dependencies? [15:25:18] ottomata: yes [15:28:04] (03CR) 10Andrew Bogott: [C: 032] Update labs instances to use the new ldap-eqiad server [puppet] - 10https://gerrit.wikimedia.org/r/162689 (owner: 10Andrew Bogott) [15:29:03] !log puppet is now moving all labs instances to new ldap servers: ldap-eqiad and ldap-codfw [15:29:10] Logged the message, Master [15:29:35] andrewbogott: hmm, self-hosted puppetmasters will need an update as well before the old ones are taken down, I presume? 
[15:29:38] ottomata: ok if I merge this icinga patch? [15:29:41] YuviPanda: yep! [15:29:56] YuviPanda: hopefully that's clear from my email... [15:29:59] yeah [15:30:02] just verifying [15:31:09] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:32:24] andrewbogott: agh [15:32:24] yes. [15:32:27] sorry [15:33:12] andrewbogott: these changes are a bit bigger, so we should probably run icinga after each one, I think. [15:33:54] YuviPanda: I was just checking with ottomata about a pending merge on paladium. I'm not free to look at icinga quite yet [15:34:04] andrewbogott: ahhh, right. sorry, ok [15:34:15] andrewbogott: I'll just let ottomata review/merge them now [15:34:20] cool [15:34:32] I'll be sitting here nervously waiting for tools to break :( [15:34:47] Looks good so far, though. This change is actually pretty trivial on boxes that don't have a puppetmaster [15:34:52] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:35:07] well [15:35:59] wtf, why is labs-ns1 pointed at neptunium now? [15:35:59] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: puppet fail [15:36:57] (03CR) 10Ottomata: icinga: Move naggen into module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/162936 (owner: 10Yuvipanda) [15:37:27] terbium! [15:37:35] i ran there and things were fine! 
[15:37:48] ottomata: that is fixed in a subsequent patch, btw (the dependencies) [15:37:53] the notify is moved out of naggen [15:38:06] hm, YuviPanda [15:38:07] Error: Failed to apply catalog: Could not find dependency Class[Certificates::Globalsign_ca] for File[/etc/ldap/ldap.conf] at /etc/puppet/modules/ldap/manifests/client.pp:272 [15:38:15] which is weird, because puppet ran just fine the first time [15:38:23] after I merged your ldap terbium change [15:38:44] ottomata: that was just changed by andrewbogott [15:38:48] ah [15:38:56] phew ok [15:39:41] PROBLEM - puppet last run on virt0 is CRITICAL: CRITICAL: puppet fail [15:40:16] andrewbogott: ldap clients not working because certificates::globalsign_ca is not included on all of them? [15:40:23] I'm looking [15:40:26] aye k. [15:41:09] (03CR) 10Yuvipanda: icinga: Move naggen into module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/162936 (owner: 10Yuvipanda) [15:42:49] weird, it's defined in the role [15:44:13] andrewbogott: if ( $::realm == 'labs' ) { [15:44:14] ? [15:45:33] (03PS1) 10Andrew Bogott: We need ldap certs on some production boxes as well. [puppet] - 10https://gerrit.wikimedia.org/r/163183 [15:45:38] ottomata: yep! ^ [15:46:49] (03CR) 10Ottomata: [C: 031] We need ldap certs on some production boxes as well. [puppet] - 10https://gerrit.wikimedia.org/r/163183 (owner: 10Andrew Bogott) [15:49:39] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: puppet fail [15:51:21] (03PS4) 10Ottomata: icinga: Move naggen into module [puppet] - 10https://gerrit.wikimedia.org/r/162936 (owner: 10Yuvipanda) [15:52:22] (03PS5) 10Alexandros Kosiaris: openldap module [puppet] - 10https://gerrit.wikimedia.org/r/156322 [15:52:24] (03PS1) 10Alexandros Kosiaris: Introduce role::openldap::oit [puppet] - 10https://gerrit.wikimedia.org/r/163184 [15:52:30] hmm, i can't merge on gerrit? 
[15:52:35] ottomata: yes, gerrit broken [15:52:41] ottomata: because of LDAP [15:52:46] ottomata: people are on it :) [15:52:49] oh, ha ok [15:52:58] ottomata: you can still review the rest of the patches, tho :) [15:53:48] * YuviPanda calms down, acts less pushy [15:53:51] must be the tea [15:54:42] am doing that :) [15:56:51] YuviPanda: where do you intend to put the monitor_group, monitor_service defines, etc.? [15:57:29] ottomata: probably $nagios_common module [15:57:33] (whatever that is named) [15:58:05] (03PS2) 10Hashar: We need ldap certs on some production boxes as well. [puppet] - 10https://gerrit.wikimedia.org/r/163183 (owner: 10Andrew Bogott) [15:58:57] ottomata: we can't use them in labs tho [16:00:03] right exported. [16:00:08] ottomata: yea [16:00:19] soemthing about this icinga::global_hostgroups class rubs me the wrong way, but i'm not sure why... [16:00:34] the name is one thing, trying to come up with a better suggestion [16:00:38] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 5.1GB (= 5.0GB critical): /srv/deployment/ocg/output 6100402343B: /srv/deployment/ocg/postmortem 252729742B: ocg_job_status 46354 msg: ocg_render_job_queue 0 msg [16:00:50] ottomata: yeah, it also rubs me wrong :) also see where it came *from*, it wasn't even inside a class [16:00:57] I am not *fully* sure where they were realized... [16:01:02] or if they were at all [16:01:14] yeah [16:01:50] hmmm [16:02:00] YuviPanda: will this work then? hm [16:02:06] those were global to all puppet instances before [16:02:25] but maybe just realized only on the icinga host? [16:02:26] hm [16:02:29] ottomata: they're exported resources [16:02:34] ottomata: and so are realized only one the one host [16:02:46] (03CR) 10Hashar: [C: 032] "Hack" [puppet] - 10https://gerrit.wikimedia.org/r/163183 (owner: 10Andrew Bogott) [16:03:05] mark: see I can merge after all :] ^^^ [16:03:29] ha, they are virtual exported resources? 
[16:03:34] @monitor_service [16:03:35] :p [16:03:40] @monitor_group [16:03:44] YuviPanda: brb [16:03:50] ottomata: ok [16:05:09] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:05:29] RECOVERY - OCG health on ocg1001 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 688570251B: /srv/deployment/ocg/postmortem 79057116B: ocg_job_status 46414 msg: ocg_render_job_queue 0 msg [16:07:39] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:24:45] (03PS1) 10Cscott: Increase OCG warning/critical space thresholds. [puppet] - 10https://gerrit.wikimedia.org/r/163186 (https://bugzilla.wikimedia.org/71341) [16:27:42] (03CR) 10Ottomata: icinga: Move global monitoring hostgroups into module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/162966 (owner: 10Yuvipanda) [16:28:05] (03PS5) 10Ottomata: icinga: Move naggen into module [puppet] - 10https://gerrit.wikimedia.org/r/162936 (owner: 10Yuvipanda) [16:28:24] (03CR) 10Yuvipanda: icinga: Move global monitoring hostgroups into module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/162966 (owner: 10Yuvipanda) [16:28:48] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 5.4GB (= 5.0GB critical): /srv/deployment/ocg/output 1554427971B: /srv/deployment/ocg/postmortem 79326068B: ocg_job_status 46498 msg: ocg_render_job_queue 0 msg [16:29:07] https://gerrit.wikimedia.org/r/163186 is for ^^^^ [16:29:40] (03CR) 10Ottomata: icinga: Move global monitoring hostgroups into module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/162966 (owner: 10Yuvipanda) [16:30:39] RECOVERY - OCG health on ocg1001 is OK: OK: /mnt/tmpfs 302364723B: /srv/deployment/ocg/output 163598785B: /srv/deployment/ocg/postmortem 64000B: ocg_job_status 46502 msg: ocg_render_job_queue 0 msg [16:31:29] ottomata: updated, but can't seem to push [16:32:11] (03CR) 10Ottomata: [C: 031] 
nagios_common: Move check_ganglia into module [puppet] - 10https://gerrit.wikimedia.org/r/162967 (owner: 10Yuvipanda) [16:32:18] yeah, gerrit probs eh? [16:32:28] ottomata: ya [16:32:38] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:32:45] i got 99 problems but gerrit ain't one [16:32:56] (03CR) 10Ottomata: [C: 031] icinga: Remove ganglios checks [puppet] - 10https://gerrit.wikimedia.org/r/162969 (owner: 10Yuvipanda) [16:33:01] but i feel bad for you, son [16:33:14] cscott: if you tried to merge something, gerrit will be :) [16:33:28] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time [16:34:42] (03CR) 10Ottomata: "This doesn't need a class, does it? Can you just put it into ::packages? Ideally, the ::packages class wouldn't exist either. Both of t" [puppet] - 10https://gerrit.wikimedia.org/r/163146 (owner: 10Yuvipanda) [16:35:04] ottomata: hmm, makes sense. I shall. [16:35:32] (03CR) 10GWicke: [C: 031] Increase OCG warning/critical space thresholds. [puppet] - 10https://gerrit.wikimedia.org/r/163186 (https://bugzilla.wikimedia.org/71341) (owner: 10Cscott) [16:36:51] ottomata: modified, yeah. still can't push [16:38:33] (03PS1) 10Manybubbles: Elasticsearch Drop number of concurrent merges [puppet] - 10https://gerrit.wikimedia.org/r/163188 [16:38:48] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 9.4GB (= 5.0GB critical): /srv/deployment/ocg/output 561253567B: /srv/deployment/ocg/postmortem 114894B: ocg_job_status 46545 msg: ocg_render_job_queue 0 msg [16:40:00] (03CR) 10Manybubbles: "I haven't applied this to the cluster already. I tried a while ago and it didn't take as a dynamic setting. 
I'm not sure if that's fixed" [puppet] - 10https://gerrit.wikimedia.org/r/163188 (owner: 10Manybubbles) [16:41:28] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 5.8GB (= 5.0GB critical): /srv/deployment/ocg/output 772028790B: /srv/deployment/ocg/postmortem 409326B: ocg_job_status 46552 msg: ocg_render_job_queue 0 msg [16:41:34] (03CR) 10Ottomata: "I tend to think that this class should also not exist. There's no good reason to have special classes for ::packages, ::config (files), :" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/163149 (owner: 10Yuvipanda) [16:41:35] bblack: i'm around; let me know if i can support the api varnish change deployment somehow. [16:41:47] YuviPanda: basically the same comment on the ::config class [16:41:49] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 9.4GB (= 5.0GB critical): /srv/deployment/ocg/output 582234200B: /srv/deployment/ocg/postmortem 135409B: ocg_job_status 46553 msg: ocg_render_job_queue 0 msg [16:42:33] ottomata: hmm, makes sense. but I don't want to move them all in one go, so how about I first get rid of everything in the icinga.pp file, and then create an init.pp as the last step? [16:42:47] I'm wary of moving + big refactors in the same patch [16:42:53] so would want to split them up [16:43:38] you can split them up still, ja? 
[16:43:41] instead of making a new ::config class [16:43:48] just have the icinga class [16:43:50] (03PS2) 10Andrew Bogott: Revert "(Re-)enable VisualEditor for Wikitech (labswiki)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161262 [16:43:52] (03PS1) 10Andrew Bogott: Move wikitech to the new ldap server, ldap-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163189 [16:43:56] and move the things out of misc::icinga into your icinga base class [16:44:03] you can do each piece in an individual commit like you are now [16:44:07] config files, packages, whatever [16:45:00] ottomata: hmm, my plan was to move most of the includes right now into a role, and the rest into the init.pp [16:45:10] ottomata: I also personally prefer having them separate, since otherwise they can get a bit large [16:45:27] the commits or the classes? [16:45:37] both :) [16:46:03] ha, the class size shouldn't really matter, classes should be logical units, not just split up for screen size reasons :p [16:46:34] ottomata: hmm [16:46:36] * YuviPanda thinks [16:46:37] (03PS2) 10Andrew Bogott: Move wikitech to the new ldap server, ldap-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163189 [16:46:38] RECOVERY - OCG health on ocg1003 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 1356001589B: /srv/deployment/ocg/postmortem 598138B: ocg_job_status 46568 msg: ocg_render_job_queue 0 msg [16:46:41] YuviPanda: I used to do this too, but faidon convinced me otherwise :) [16:46:47] and he is right :) [16:47:36] (03CR) 10Giuseppe Lavagetto: [C: 031] "Would be +2, but I can't ATM" [puppet] - 10https://gerrit.wikimedia.org/r/163075 (owner: 10Ori.livneh) [16:47:49] ori: are you saying you're basically ready to go and want to do it now? [16:47:50] ottomata: haha :) [16:48:47] bblack: yep [16:48:52] ottomata: alright, makes sense. let me start reworking them... 
[16:48:59] ottomata: I think plugins should still be separate, though [16:49:14] (03PS3) 10BBlack: hhvm: serve API as well [puppet] - 10https://gerrit.wikimedia.org/r/162902 (owner: 10Giuseppe Lavagetto) [16:49:39] * ori got a couple of gerrit 503s [16:49:44] ditto [16:50:29] ottomata: hmm, actually, I can't think of a *clean* way to do that over multiple commits. how about we review the current ones as they are, and I'll start moving them in a subsequent patch? [16:50:37] * YuviPanda starts moving them [16:50:49] (03PS2) 10Ori.livneh: Allow wikidev users to restart HHVM [puppet] - 10https://gerrit.wikimedia.org/r/163075 [16:50:51] ottomata: the current patches make the dependency chains nicer too, so makes it easier for me to merge into one thing [16:51:05] YuviPanda: ja, plugins should likely be separate [16:51:14] things that are logically distinct should have separate classes [16:51:22] so, an icinga module user will always need those base things [16:51:30] packages, base configs, init script(?), service [16:51:31] etc. [16:52:48] bblack: probably soon, if enabling the api doesn't reveal a class of new bugs [16:53:00] but there are app servers to reimage first, too [16:53:15] ready? I'm gonna push the red button now [16:53:27] * ori nods [16:53:44] uhhh [16:53:49] RECOVERY - OCG health on ocg1001 is OK: OK: /mnt/tmpfs 193093639B: /srv/deployment/ocg/output 3933033642B: /srv/deployment/ocg/postmortem 502540B: ocg_job_status 46605 msg: ocg_render_job_queue 0 msg [16:53:49] gerrit won't let me +2 now, wtf? [16:54:32] bblack: it's a sign saying 'Don't push!' :p [16:54:50] when was the last time someone got a +2 through? something's borked [16:54:50] _joe_ had the same thing [16:54:59] hence (CR) Giuseppe Lavagetto: [C: 1] "Would be +2, but I can't ATM" [puppet] [16:55:03] andrewbogott: ^ [16:55:07] bblack: ori yeah, andrewbogott was investigating [16:55:21] ah, ok. no pressure. 
[16:55:35] ok well
[16:55:39] no red button :p
[16:56:16] salt -G role:varnish cmd.run 'sed /.../'
[16:56:44] salt '*' cmd.run 'userdel ori'
[16:57:06] happily that will silently fail on at least a couple of hosts
[16:57:12] :)
[16:57:14] zing
[17:07:50] (PS1) Andrew Bogott: Include CA certs everywhere rather than just when we need them. [puppet] - https://gerrit.wikimedia.org/r/163194
[17:08:36] (CR) jenkins-bot: [V: -1] Include CA certs everywhere rather than just when we need them. [puppet] - https://gerrit.wikimedia.org/r/163194 (owner: Andrew Bogott)
[17:08:44] ottomata: bah, I made a change that implements the init, but can't push it
[17:08:45] grr
[17:08:50] yeah :/
[17:09:22] making some lunch, heading to a cafe soon
[17:09:38] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 7.0GB (= 5.0GB critical): /srv/deployment/ocg/output 1180124651B: /srv/deployment/ocg/postmortem 201551795B: ocg_job_status 46658 msg: ocg_render_job_queue 0 msg
[17:10:38] (Abandoned) Yuvipanda: icinga: Move naggen into module [puppet] - https://gerrit.wikimedia.org/r/162936 (owner: Yuvipanda)
[17:10:44] (Abandoned) Yuvipanda: icinga: Move global monitoring hostgroups into module [puppet] - https://gerrit.wikimedia.org/r/162966 (owner: Yuvipanda)
[17:10:54] (Abandoned) Yuvipanda: nagios_common: Move check_ganglia into module [puppet] - https://gerrit.wikimedia.org/r/162967 (owner: Yuvipanda)
[17:10:56] (PS2) Andrew Bogott: Include CA certs everywhere rather than just when we need them. [puppet] - https://gerrit.wikimedia.org/r/163194
[17:10:58] (Abandoned) Yuvipanda: icinga: Move initscript into module [puppet] - https://gerrit.wikimedia.org/r/163146 (owner: Yuvipanda)
[17:11:08] (Abandoned) Yuvipanda: icinga: Move config into module [puppet] - https://gerrit.wikimedia.org/r/163149 (owner: Yuvipanda)
[17:11:14] (Abandoned) Yuvipanda: icinga: Move service definition into module [puppet] - https://gerrit.wikimedia.org/r/163150 (owner: Yuvipanda)
[17:11:22] (Abandoned) Yuvipanda: icinga: Move plugins into module [puppet] - https://gerrit.wikimedia.org/r/163153 (owner: Yuvipanda)
[17:11:37] (PS1) Yuvipanda: icinga: Move naggen into module [puppet] - https://gerrit.wikimedia.org/r/163196
[17:11:39] (PS1) Yuvipanda: icinga: Move global monitoring hostgroups into module [puppet] - https://gerrit.wikimedia.org/r/163197
[17:11:41] (PS1) Yuvipanda: nagios_common: Move check_ganglia into module [puppet] - https://gerrit.wikimedia.org/r/163198
[17:11:43] (PS1) Yuvipanda: icinga: Move initscript into module [puppet] - https://gerrit.wikimedia.org/r/163199
[17:11:45] (PS1) Yuvipanda: icinga: Move config into module [puppet] - https://gerrit.wikimedia.org/r/163200
[17:11:47] (PS1) Yuvipanda: icinga: Move service definition into module [puppet] - https://gerrit.wikimedia.org/r/163201
[17:11:49] (PS1) Yuvipanda: icinga: Move plugins into module [puppet] - https://gerrit.wikimedia.org/r/163202
[17:11:51] (PS1) Yuvipanda: icinga: Setup icinga class, consolidate from other classes [puppet] - https://gerrit.wikimedia.org/r/163203
[17:11:55] ottomata: managed to push them in again :)
[17:11:59] ottomata: single class structure at the end of last commit
[17:12:47] (CR) BBlack: [C: 1] Include CA certs everywhere rather than just when we need them. [puppet] - https://gerrit.wikimedia.org/r/163194 (owner: Andrew Bogott)
[17:14:39] So, I'm going to do a Bad Thing, and temporarily hotfix the puppet repo on palladium and strontium.
[17:15:58] !log hotfixing /var/lib/git/operations/puppet in hopes of fixing gerrit so I don't have to hotfix no more
[17:16:04] Logged the message, Master
[17:16:08] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: /mnt/tmpfs 6.3GB (= 5.0GB critical): /srv/deployment/ocg/output 5950002212B: /srv/deployment/ocg/postmortem 124734535B: ocg_job_status 46697 msg: ocg_render_job_queue 0 msg
[17:18:51] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 6.0GB (= 5.0GB critical): /srv/deployment/ocg/output 2841923980B: /srv/deployment/ocg/postmortem 300665198B: ocg_job_status 46709 msg: ocg_render_job_queue 0 msg
[17:22:49] !log Manually ran rebuildEntityPerPage.php for Wikidata
[17:22:55] Logged the message, Master
[17:23:39] RECOVERY - OCG health on ocg1002 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 1209241842B: /srv/deployment/ocg/postmortem 143107B: ocg_job_status 46740 msg: ocg_render_job_queue 0 msg
[17:25:04] (PS1) Yuvipanda: icinga: Setup icinga role! [puppet] - https://gerrit.wikimedia.org/r/163205
[17:25:09] RECOVERY - OCG health on ocg1001 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 1220420729B: /srv/deployment/ocg/postmortem 153332B: ocg_job_status 46752 msg: ocg_render_job_queue 0 msg
[17:25:38] bblack, Coren: I've applied the classes for globalsign_ca and wmf_ca on gerrit, still nothing. and ldapsearch still fails as well. So there must be yet another piece that is coincidentally present on labs...
[17:27:24] ottomata: things left in misc/icinga.pp are variables that I've to move, ganglios that I'll probably remove, and then a bunch of misc stuff I don't fully understand
[17:30:26] getting intermittent 503 in gerrit
[17:31:19] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[17:31:32] MaxSem: yeah gerrit is basically hosed right now
[17:31:34] MaxSem: that's me
[17:31:45] :P
[17:32:19] bblack: ytterbium (aka gerrit) continues to reject the cert on ldap-eqiad.wikimedia.org. Even though that cert works elsewhere.
[17:32:24] I'm pretty much out of ideas.
[17:32:28] RECOVERY - DPKG on labmon1001 is OK: All packages OK
[17:33:01] andrewbogott: I saw some traffic earlier in -labs about a duplicate IP screwing with things, which also sounds like it could cause the intermittent 503 issue. maybe that's not fully addressed yet?
[17:33:09] bblack: it's not just the gerrit service, it's other things on that same box. So must be something wrong with the ip chain
[17:33:15] the intermittent 503 is me restarting gerrit
[17:33:19] oh ok
[17:33:20] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[17:33:32] the duplicate IP shouldn't matter since it's the ip for labs-ns1
[17:33:42] well, let me make sure
[17:34:25] Yeah, everyone seems to agree as to the IP of ldap-eqiad.wikimedia.org
[17:35:34] !log "git reset --hard origin" to remove that terrible hotfix on palladium and strontium.
[17:35:41] Logged the message, Master
[17:40:44] andrewbogott: i did an apt-get update on ytterbium and then an apt-get upgrade --dry-run to see what would get upgraded and there are several things that seem plausibly related
[17:40:55] andrewbogott: i admit that that's kind of grasping at straws but..
[17:41:15] what things?
[17:41:42] PROBLEM - LVS HTTP IPv4 on mathoid.svc.eqiad.wmnet is CRITICAL: Connection timed out
[17:41:50] libnss3, for example
[17:42:13] andrewbogott: https://dpaste.de/r63O/raw
[17:42:32] should I be worrying about mathoid now or not?
[17:42:45] nss, at least, is the same version as virt1000
[17:42:47] libcurl3-gnutls
[17:43:02] libgnutls26
[17:43:36] hm, libgnutls26 is different
[17:43:53] !log upgraded libgnutls26 on ytterbium
[17:43:59] Logged the message, Master
[17:44:31] no dice
[17:46:36] [10:41] icinga-wm PROBLEM - LVS HTTP IPv4 on mathoid.svc.eqiad.wmnet is CRITICAL: Connection timed out
[17:46:44] What happened, did someone restart LVS?
[17:49:08] RoanKattouw: i think nothing new happened, it didn't work yet, icinga is just reminding
[17:49:14] (duration 7d..)
[17:49:22] Aah OK
[17:49:34] Is it still ping critical?
[17:50:16] the mathoid LVS is "timed out" and the 2 "mathoid"s on sca1001/1002 are "refused"
[17:50:30] nothing has changed in a week though
[17:50:30] YuviPanda: So I created a new instance (dev-trusty) and applied the same classes and puppetmaster as I did for integration slave 6,7,8; (which also run trusty) - yet here all hell breaks loose
[17:50:31] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[17:50:31] https://gist.githubusercontent.com/Krinkle/b6e5301c6279f1797c65/raw
[17:50:35] Any idea what that's all about?
[17:50:43] I don't see any recent changes to those manifests
[17:50:43] * YuviPanda checks
[17:51:01] That happens
[17:51:09] deployment::target is broken and doesn't create directories properly
[17:51:18] Krinkle: looks like you didn't apply the srv role, or the manifest doesn't apply it
[17:51:29] You can hack around this by creating the /srv/deployment/integration directory
[17:51:36] I know, but I don't want to
[17:51:40] Longer term the puppet manifest should create that dir in some way or other
[17:51:43] I created a dozen instances like this before, puppet always did it correctly
[17:51:57] yeah, seems like something needing a puppet patch
[17:52:18] bd808 claims that this isn't necessary if you use deployment::target correctly, but I couldn't get it to work the way he said, so I went back to Alexandros's hack of just creating all the dirs I need (in puppet, but still)
[17:53:57] Even /srv/deployment doesn't exist on integration-dev-trusty
[17:54:01] let alone /integration
[17:54:08] This worked last week..
[17:54:14] you shouldn't use deployment::target at all, but package { } with a trebuchet provider
[17:54:26] role::wikimania_scholarships worked fine the last time I set up a host with it, but that's been a while. It doesn't try to split half in module and half in role though either.
[17:54:47] Oh yeah, new hottness
[17:55:44] You also technically shouldn't try to stick things inside of the Trebuchet managed directory. Ryan didn't like that part of what I did in scholarships at all.
[17:56:09] that too
[17:56:34] All I know is, I've set up a dozen instances before, and it all worked fine. And I see no changes in contint specific modules and now it's broken. So I suspect something generic changed in ops.
[17:56:49] And maybe we're using those classes wrong, I don't know
[17:57:04] RoanKattouw: alex k. already took perfect care of it it seems. per RT ticket
[17:57:21] RoanKattouw: "The 3 alarms present now are correct and actually useful. That is once the first deploy of the software has been done. ... Acknowledging them and pinging Gabriel"
[17:58:19] Krinkle: Have you tried running git-deploy again from the deployment server? I would hope that isn't necessary but... trebuchet can be fickle at times.
[17:58:29] ACKNOWLEDGEMENT - LVS HTTP IPv4 on mathoid.svc.eqiad.wmnet is CRITICAL: Connection timed out daniel_zahn RT #6077 , RT #8306
[17:58:41] ori: Ugh, really? Then why does every piece of code I've touched recently already use deployment::target :(
[17:58:44] ACKNOWLEDGEMENT - mathoid on sca1001 is CRITICAL: Connection refused daniel_zahn RT #6077 , RT #8306
[17:58:44] ACKNOWLEDGEMENT - mathoid on sca1002 is CRITICAL: Connection refused daniel_zahn RT #6077 , RT #8306
[17:59:03] bd808: which deployment server in this case?
[17:59:08] It's git-clone
[17:59:14] bd808: Yeah sticking things inside of the Trebuchet-managed dir is exactly what Mathoid does. For Citoid I'm trying to avoid it
[17:59:25] Krinkle: Dunno. Whichever one the integration project uses?
[17:59:31] bd808: we don't have any
[18:00:03] it's git::clone ensure>latest in labs, and git-deploy in prod
[18:00:13] (PS2) Krinkle: Fix Elasticsearch in ci [puppet] - https://gerrit.wikimedia.org/r/160524 (owner: Manybubbles)
[18:00:52] Hmmm. and previously something made /srv/deployment/integration for you outside the role then I take it.
[18:01:12] (CR) Krinkle: "@Hashar: Are you saying this change is good or bad? All I know is, the code currently in git for this class is broken. And this patch fixe" [puppet] - https://gerrit.wikimedia.org/r/160524 (owner: Manybubbles)
[18:01:31] uhhh, YuviPanda, i think you abandoned a change that your subsequent changes depend on
[18:01:32] * mutante lets Google translate a Russian language RT ticket, gets "This truly magical elixir was known to the monks of ancient Tibet since before Buddhist times.".. rejects it
[18:01:45] heheh
[18:01:46] oh
[18:01:48] ottomata: oh? the ganglios one, I think. I shall redo that. the rest should be ok
[18:01:50] maybe you depended on many of them
[18:01:52] sorry
[18:01:55] abandoned many of them
[18:02:01] YuviPanda: will you add me as reviewer to these?
[18:02:02] i have lost them
[18:02:05] ottomata: yes
[18:02:06] and there are so many
[18:02:19] ottomata: start from https://gerrit.wikimedia.org/r/#/c/163196/, follow chain
[18:02:39] k
[18:03:16] ah, gerrit still broken?
[18:03:29] yeah
[18:03:52] crap i signed out, now can't sign back in :/
[18:05:18] (PS8) Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - https://gerrit.wikimedia.org/r/159226
[18:06:12] ok welp, YuviPanda, I guess we are just waiting on gerrit then
[18:06:18] since i assume you have amendments to make, ja?
[18:07:01] (PS9) Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - https://gerrit.wikimedia.org/r/159226
[18:09:23] ori: when gerrit is back up and you have a chance, maybe comment on qchris's questions on your change here? https://gerrit.wikimedia.org/r/#/c/157841/
[18:10:02] when you said no gerrit, you meant this? "Authentication unavailable at this time."
[18:10:15] yup
[18:10:19] ldap is broken i think right now
[18:10:19] ottomata: sure.
[18:10:24] ah, makes sense, because of the LDAP work, right?
[18:10:26] they are working on it in -labs
[18:10:27] yup
[18:10:31] ok, thx
[18:12:56] ldap is broken for gerrit, should be working elsewhere
[18:14:16] andrewbogott: there's still no mention in SAL
[18:14:53] !log ldap is broken
[18:14:59] Logged the message, Master
[18:15:00] Nemo_bis: hth
[18:15:32] well, I wanted to do this actually:
[18:15:42] !log untruncated: andrewbogott> ldap is broken for gerrit, should be working elsewhere
[18:15:48] Logged the message, Master
[18:17:49] that's probably better
[18:26:07] (CR) BBlack: [C: 2] Include CA certs everywhere rather than just when we need them.
[puppet] - 10https://gerrit.wikimedia.org/r/163194 (owner: 10Andrew Bogott) [18:28:04] (03PS4) 10BBlack: hhvm: serve API as well [puppet] - 10https://gerrit.wikimedia.org/r/162902 (owner: 10Giuseppe Lavagetto) [18:28:10] (03CR) 10BBlack: [C: 032 V: 032] hhvm: serve API as well [puppet] - 10https://gerrit.wikimedia.org/r/162902 (owner: 10Giuseppe Lavagetto) [18:30:29] ori: the api patch is borked (VCL doesn't compile) [18:31:40] YuviPanda: Oo, gerrit back! (for now at least?) [18:32:46] PROBLEM - puppet last run on amssq42 is CRITICAL: CRITICAL: Puppet has 2 failures [18:32:56] PROBLEM - puppet last run on cp1038 is CRITICAL: CRITICAL: Puppet has 2 failures [18:33:06] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 2 failures [18:33:06] PROBLEM - puppet last run on amssq62 is CRITICAL: CRITICAL: Puppet has 2 failures [18:33:57] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 2 failures [18:33:59] (03PS1) 10BBlack: hhvm_api varnish fixup [puppet] - 10https://gerrit.wikimedia.org/r/163215 [18:34:24] (03CR) 10BBlack: [C: 032 V: 032] hhvm_api varnish fixup [puppet] - 10https://gerrit.wikimedia.org/r/163215 (owner: 10BBlack) [18:34:26] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: Puppet has 2 failures [18:34:46] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 2 failures [18:35:26] PROBLEM - puppet last run on cp1037 is CRITICAL: CRITICAL: Puppet has 2 failures [18:35:36] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 2 failures [18:36:28] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 2 failures [18:36:46] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 2 failures [18:36:57] PROBLEM - puppet last run on cp1040 is CRITICAL: CRITICAL: Puppet has 2 failures [18:37:06] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: puppet fail [18:37:06] PROBLEM - puppet last run on amssq31 is CRITICAL: CRITICAL: Puppet has 2 
failures [18:37:27] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 2 failures [18:37:27] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: puppet fail [18:37:28] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail [18:37:28] PROBLEM - puppet last run on amssq39 is CRITICAL: CRITICAL: Puppet has 2 failures [18:37:46] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 1 failures [18:37:47] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 2 failures [18:37:47] PROBLEM - puppet last run on amssq50 is CRITICAL: CRITICAL: puppet fail [18:37:58] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 2 failures [18:38:01] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail [18:38:01] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: puppet fail [18:38:06] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 2 failures [18:38:12] ^ those will all self-correct as they pick up hhvm_api varnish fixup [puppet] - https://gerrit.wikimedia.org/r/163215 in ~20 mins [18:38:27] PROBLEM - puppet last run on amssq37 is CRITICAL: CRITICAL: Puppet has 2 failures [18:38:27] PROBLEM - puppet last run on amssq45 is CRITICAL: CRITICAL: puppet fail [18:38:28] the intermediate failure isn't terribly consequential, they just stick with the old config [18:38:46] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: puppet fail [18:38:46] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: puppet fail [18:38:56] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail [18:38:56] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: puppet fail [18:38:56] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail [18:39:16] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [18:39:17] PROBLEM - puppet last run on amssq52 is CRITICAL: CRITICAL: puppet fail [18:39:22] oh, even my fixup is 
incorrect [18:39:32] still, nothing consequential for prod traffic [18:39:38] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: puppet fail [18:39:48] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: puppet fail [18:40:08] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: puppet fail [18:40:16] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail [18:41:16] PROBLEM - puppet last run on amssq58 is CRITICAL: CRITICAL: puppet fail [18:41:16] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: puppet fail [18:41:17] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: puppet fail [18:41:17] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: puppet fail [18:41:26] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail [18:41:27] PROBLEM - puppet last run on cp1069 is CRITICAL: CRITICAL: puppet fail [18:42:07] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: puppet fail [18:42:17] PROBLEM - puppet last run on amssq44 is CRITICAL: CRITICAL: puppet fail [18:42:17] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail [18:42:38] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: puppet fail [18:42:46] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: puppet fail [18:42:56] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail [18:43:06] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail [18:43:20] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [18:43:20] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [18:43:36] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: puppet fail [18:43:36] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: puppet fail [18:43:46] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: puppet fail [18:44:06] PROBLEM - puppet last run on amssq43 is CRITICAL: CRITICAL: puppet fail [18:44:07] PROBLEM - puppet last run on 
cp4006 is CRITICAL: CRITICAL: puppet fail [18:44:17] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: puppet fail [18:44:36] (03PS1) 10BBlack: Move hhvm backend choice back to backend templates [puppet] - 10https://gerrit.wikimedia.org/r/163218 [18:44:37] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: puppet fail [18:44:48] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: puppet fail [18:44:50] (03CR) 10BBlack: [C: 032 V: 032] Move hhvm backend choice back to backend templates [puppet] - 10https://gerrit.wikimedia.org/r/163218 (owner: 10BBlack) [18:44:56] PROBLEM - puppet last run on amssq53 is CRITICAL: CRITICAL: puppet fail [18:45:17] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: puppet fail [18:45:37] PROBLEM - puppet last run on cp1039 is CRITICAL: CRITICAL: puppet fail [18:45:49] we really should get a ratelimiter on icinga-wm like syslog [18:45:52] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: puppet fail [18:45:56] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: puppet fail [18:45:56] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail [18:46:00] PROBLEM - approximately-same message repeated 400 times [18:46:28] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail [18:46:37] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: puppet fail [18:46:47] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: puppet fail [18:46:47] bblack: but what if you missed it the first 399 times? [18:46:56] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail [18:47:05] heh heh [18:47:17] at least this icinga spam isn't my fault. 
[18:47:21] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:47:37] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: puppet fail [18:47:47] although poke poke about https://gerrit.wikimedia.org/r/163186 [18:47:53] or i'll make the OCG spam return [18:47:55] incoming ~8 minutes of relative peace followed by 12 minutes of RECOVERY spam [18:47:57] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail [18:48:06] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: puppet fail [18:48:15] I think, ish? [18:49:09] I guess the first recoveries will be around :52 actually, since they started at :32 [18:49:57] (03PS2) 10BBlack: Increase OCG warning/critical space thresholds. [puppet] - 10https://gerrit.wikimedia.org/r/163186 (https://bugzilla.wikimedia.org/71341) (owner: 10Cscott) [18:50:51] (03CR) 10BBlack: [C: 032] Increase OCG warning/critical space thresholds. [puppet] - 10https://gerrit.wikimedia.org/r/163186 (https://bugzilla.wikimedia.org/71341) (owner: 10Cscott) [18:51:07] RECOVERY - puppet last run on amssq42 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [18:51:27] RECOVERY - puppet last run on amssq62 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:51:32] icinga-wm: you're a minute early (according to bblack) [18:51:38] or 51! 
because puppet is bad at math
[18:51:47] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:51:53] whoops
[18:52:21] there's a +/- 1 minute random splay in addition to the fixed (per-node) offsets within the 20-minute window
[18:52:26] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[18:52:46] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[18:52:49] before i hit 'commit' can someone check my understanding of what 'enable active service checks' does in icinga?
[18:53:16] I've never used that button
[18:53:21] are you trying to get rid of a downtime?
[18:53:31] active service checks are disabled for LVS HTTP IPv4 on ocg.svc.eqiad.wmnet, but i'd like to turn them on briefly (while still suppressing notifications) to check whether the problem has been fixed.
[18:53:55] let me go look at it
[18:54:07] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[18:54:27] RECOVERY - puppet last run on amssq31 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[18:55:02] yeah I think you can enable active checks as a first step there
[18:55:06] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[18:55:07] cscott: ^
[18:55:11] cscott: bblack no problem, we can enable the active checks but hit "disable notifications"
[18:55:15] ah
[18:55:31] it already has them disabled I think
[18:55:38] ok. thought it was wise to check before i hit the button.
;) [18:55:47] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:55:47] RECOVERY - puppet last run on amssq37 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [18:55:47] RECOVERY - puppet last run on amssq39 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [18:55:48] let's see what this button does ;) [18:56:02] wait not that button! )#(REHOFID9 NO CARRIER [18:56:03] cscott: let's see if you have permission to execute commands [18:56:15] (03PS1) 10coren: Use the Ubuntu Way of installing SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/163222 [18:56:16] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [18:56:25] mutante: ah, yes. status: "NOT AUTHORIZED" [18:56:25] andrewbogott: ^^ [18:56:27] login to icinga is one thing... hitting buttons another [18:56:36] RECOVERY - puppet last run on cp3011 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [18:56:47] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:56:49] ok I'll turn it on [18:57:07] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [18:57:08] I hate how the back button works with icinga iframes [18:57:16] RECOVERY - puppet last run on amssq50 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [18:57:16] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:57:23] always clicks "open frame in new tab" thing [18:57:26] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [18:57:40] cscott: do you want to request having those permissions.. or ? 
[18:57:56] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [18:57:56] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:58:06] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:58:07] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:58:11] cscott: or we are doing it for you.. [18:58:26] mutante: i don't need the permissions i don't think. if you wanted to grant some, i just need it for tasks with OCG in their name. but i think bblack is doing it for me. [18:58:27] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:58:36] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [18:58:36] RECOVERY - puppet last run on amssq52 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:58:38] i generally prefer to have people who know what they are doing do the things ;) [18:58:47] RECOVERY - puppet last run on amssq45 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:58:58] cscott: ok, fair [18:59:27] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:59:27] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [18:59:37] RECOVERY - puppet last run on amssq58 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [18:59:48] RECOVERY - puppet last run on cp1069 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [18:59:53] (03CR) 10Andrew Bogott: "A million /etc/ldap.confs will still have reference to the certs being in 
/etc/ssl/certs/.pem won't they? Do we need to leave them " [puppet] - 10https://gerrit.wikimedia.org/r/163222 (owner: 10coren) [18:59:57] ottomata: sorry, was pulled away [18:59:59] ottomata: back, and no, I don't have amends :) [19:00:07] ottomata: if you look at the patchset series, at the end there's one that rms most of the individual classes, and adds an init.pp [19:00:17] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:00:19] ottomata: I still want to keep the patches separate, so we can merge them one by one and test as we go [19:00:22] rather than one big bang patch [19:00:36] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [19:00:56] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [19:01:14] Coren: unless you have a burning need to merge that patch today, perhaps you should go back to bed and leave it for next week? [19:01:28] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [19:01:34] this conversation would be so much more readable if someone didn't break all the varnishes :P [19:01:35] (03PS2) 10Ottomata: Use $::instanceproject as Hadoop user group in labs [puppet] - 10https://gerrit.wikimedia.org/r/162961 [19:01:37] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [19:01:47] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [19:01:50] andrewbogott: I have no burning need to merge that at all - it's a big change and I expect we'll want to be (a) deliberate in merging it and (b) not do it on a Friday. 
[19:01:56] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:02:00] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [19:02:02] Coren: great. [19:02:16] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [19:02:22] andrewbogott: You may want to do something like this manually to wikitech maybe? [19:02:26] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:02:37] RECOVERY - puppet last run on amssq44 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [19:02:38] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:02:38] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [19:02:40] ok, OCG is looking good. but i can't fix the problem if I can't reproduce it. So I'm going to start pulling jobs through it again. With https://gerrit.wikimedia.org/r/163186 landed this shouldn't cause icinga to scream. but fair warning. [19:02:52] Coren: Maybe, although I think I understand the wikitech outage pretty well. [19:02:57] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:03:00] Anyway, since nothing is currently broken (knock wood) I'm going to try to relocate and get some lunch. [19:03:17] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:03:17] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [19:03:29] kk. Did you send to outage email? If you want my input on it now's a good time; I'm probably going to go back to bed soon. 
[19:03:36] cscott: notifications for OCG LVS are disabled, so nobody should get paged. no worries [19:03:36] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [19:03:38] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [19:03:38] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:04:00] Coren: sent a draft to you, you can reply with comments and then I'll send to list when I return [19:04:09] cscott: and it's also in a scheduled downtime [19:04:26] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:04:26] RECOVERY - puppet last run on amssq53 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:04:47] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:05:26] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:05:26] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:05:29] (03CR) 10Ottomata: [C: 032 V: 032] Use $::instanceproject as Hadoop user group in labs [puppet] - 10https://gerrit.wikimedia.org/r/162961 (owner: 10Ottomata) [19:05:56] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [19:05:57] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [19:06:16] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [19:06:17] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [19:06:27] RECOVERY - puppet 
last run on cp3014 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:06:33] (03PS2) 10Dzahn: terbium - include misc::noc-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/162536 [19:06:36] !log aaron Synchronized php-1.24wmf22/extensions/CentralAuth: (no message) (duration: 00m 08s) [19:06:43] Logged the message, Master [19:06:57] (03CR) 10Dzahn: [C: 032] terbium - include misc::noc-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/162536 (owner: 10Dzahn) [19:07:49] (03CR) 10Dzahn: [V: 032] terbium - include misc::noc-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/162536 (owner: 10Dzahn) [19:08:46] bah, Duplicate declaration: Monitor_service[http] [19:08:57] !log aaron Synchronized php-1.25wmf1/extensions/CentralAuth: (no message) (duration: 00m 07s) [19:09:04] Logged the message, Master [19:09:10] !log git-deploy: Deploying integration/slave-scripts 08147c42ea42e1a5eca1d29 [19:09:16] Logged the message, Master [19:09:50] mutante: where's that from? [19:09:57] ah, not me [19:09:59] mutante: it's the "OCG health" checks which were spamming #ops earlier this morning. they aren't in downtime (but maybe should be) [19:10:33] YuviPanda: oh, on terbium.. already fixing it, my change did it [19:10:54] YuviPanda: we have monitor_service {'http' in 2 different roles [19:10:56] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: puppet fail [19:11:02] and if they are combined.. duplicate [19:13:52] (03PS1) 10Dzahn: fix duplicate declaration of http check [puppet] - 10https://gerrit.wikimedia.org/r/163226 [19:14:41] cscott: done, i added a downtime to the OCG health.. how long though? [19:15:04] (03CR) 10RobH: [C: 031] "this looks right to me, but considering it touches a LOT of machines, the more eyes the better." [puppet] - 10https://gerrit.wikimedia.org/r/163222 (owner: 10coren) [19:15:18] cscott: disabled notifications. 
just let us know when to turn it back on [19:15:20] (03PS10) 10Catrope: WIP Citoid puppetization [puppet] - 10https://gerrit.wikimedia.org/r/163068 [19:15:23] !log Deployed security patches to CentralAuth [19:15:47] Logged the message, Master [19:17:45] ottomata: think you'll have time today to look at icinga again? [19:18:12] (03PS2) 10Dzahn: fix duplicate declaration of http check [puppet] - 10https://gerrit.wikimedia.org/r/163226 [19:19:06] (03PS1) 10BBlack: authdns: switch to all IPs on all hosts [puppet] - 10https://gerrit.wikimedia.org/r/163227 [19:19:09] (03CR) 10Dzahn: [C: 032] fix duplicate declaration of http check [puppet] - 10https://gerrit.wikimedia.org/r/163226 (owner: 10Dzahn) [19:19:28] ja, YuviPanda, did you submit new patches? [19:19:44] ottomata: ya, including the one with init.pp that removes the small classes [19:19:45] (03CR) 10jenkins-bot: [V: 04-1] authdns: switch to all IPs on all hosts [puppet] - 10https://gerrit.wikimedia.org/r/163227 (owner: 10BBlack) [19:19:57] ottomata: https://gerrit.wikimedia.org/r/#/c/163203/ is that one [19:20:14] ottomata: but I still want to merge the patches that introduce the smaller classes, so we can merge / verify / merge cycle in smaller patches [19:20:25] YuviPanda: ^ that fixes it.. it's just the name of the check [19:20:39] well, other issue now :p [19:20:39] mutante: ah, cool! 
[19:20:42] heh :) [19:20:57] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:22:17] (03PS2) 10Ottomata: icinga: Move naggen into module [puppet] - 10https://gerrit.wikimedia.org/r/163196 (owner: 10Yuvipanda) [19:22:32] (03CR) 10Ottomata: [C: 032 V: 032] icinga: Move naggen into module [puppet] - 10https://gerrit.wikimedia.org/r/163196 (owner: 10Yuvipanda) [19:22:44] (03PS2) 10Ottomata: icinga: Move global monitoring hostgroups into module [puppet] - 10https://gerrit.wikimedia.org/r/163197 (owner: 10Yuvipanda) [19:22:49] (03CR) 10Ottomata: [C: 032 V: 032] icinga: Move global monitoring hostgroups into module [puppet] - 10https://gerrit.wikimedia.org/r/163197 (owner: 10Yuvipanda) [19:22:59] (03PS2) 10Ottomata: nagios_common: Move check_ganglia into module [puppet] - 10https://gerrit.wikimedia.org/r/163198 (owner: 10Yuvipanda) [19:22:59] ah, now i'm also lacking an SSL cert .. the one for noc of course [19:23:05] (03CR) 10Ottomata: [C: 032 V: 032] nagios_common: Move check_ganglia into module [puppet] - 10https://gerrit.wikimedia.org/r/163198 (owner: 10Yuvipanda) [19:23:23] (03PS2) 10Ottomata: icinga: Move initscript into module [puppet] - 10https://gerrit.wikimedia.org/r/163199 (owner: 10Yuvipanda) [19:23:35] (03CR) 10Ottomata: [C: 032 V: 032] "This is being modified in a future change." [puppet] - 10https://gerrit.wikimedia.org/r/163199 (owner: 10Yuvipanda) [19:24:05] uh oh, YuviPanda getting a path merge conflict when attempting to rebase https://gerrit.wikimedia.org/r/#/c/163200/ [19:25:39] ottomata: weird, trying [19:26:19] icinga probs, on it... [19:26:28] ottomata: uh oh [19:27:32] dependency cycle [19:27:32] (File[/etc/icinga/puppet_hostextinfo.cfg] => Service[icinga] => Class[Icinga::Monitor::Service] => Class[Icinga::Naggen] => File[/etc/icinga/puppet_hostextinfo.cfg]) [19:29:25] (03PS2) 10Yuvipanda: icinga: Setup icinga role! 
[puppet] - 10https://gerrit.wikimedia.org/r/163205 [19:29:26] err [19:29:27] ottomata: rebased [19:29:27] (03PS2) 10Yuvipanda: icinga: Move plugins into module [puppet] - 10https://gerrit.wikimedia.org/r/163202 [19:29:29] (03PS2) 10Yuvipanda: icinga: Setup icinga class, consolidate from other classes [puppet] - 10https://gerrit.wikimedia.org/r/163203 [19:29:31] (03PS2) 10Yuvipanda: icinga: Move config into module [puppet] - 10https://gerrit.wikimedia.org/r/163200 [19:29:33] ottomata: uh oh [19:29:33] (03PS2) 10Yuvipanda: icinga: Move service definition into module [puppet] - 10https://gerrit.wikimedia.org/r/163201 [19:29:56] cool, YuviPanda, let's fix this puppet erro before moving more [19:30:08] ottomata: wait, how does File[/etc/icinga/puppet_hostextinfo.cfg] depend on Class[Icinga::Naggen]? [19:30:11] oh [19:30:24] ottomata: it would be fixed if the series of patches are merged, I think [19:30:44] let me look again [19:31:07] ok, yeah, probably so, as teh dependency chain is altered there [19:31:13] ottomata: yes [19:31:50] (03CR) 10Ottomata: [C: 032 V: 032] icinga: Move service definition into module [puppet] - 10https://gerrit.wikimedia.org/r/163201 (owner: 10Yuvipanda) [19:33:05] oh, YuviPanda did you respond to my comment about requires inside of classes (preferably on the resources) rather than on the class usage? [19:33:10] i think here [19:33:10] https://gerrit.wikimedia.org/r/#/c/163200/2 [19:33:13] its not here anymore though... [19:33:20] maybe that was a different (abandoned?) change [19:34:19] ottomata: yeah, probably [19:34:20] looking [19:34:51] ottomata: https://gerrit.wikimedia.org/r/#/c/163149/ [19:35:50] ja, did you address that in a later change I haven't read yet? [19:36:18] you still ahve the notifies here https://gerrit.wikimedia.org/r/#/c/163200/1/manifests/misc/icinga.pp [19:36:23] on the class usages [19:37:01] mutante: ok, thanks. 
[19:37:26] disabling notifications is best, so i can keep an eye on it this weekend and make sure all the checks are staying green. [19:37:30] ottomata: no, I haven't [19:37:35] ottomata: I didn't read it, just read your general comment [19:38:12] ottomata: I'm kinda split on that, though. [19:38:20] ottomata: that's just a lot of duplication... [19:38:24] but does make it more explicit [19:38:58] ja, but your way, the dependency is not captured by the class itself in anyway [19:39:10] its left up to the user of the class to set up the notify dependency [19:39:10] why should it be? [19:39:23] well, because you want changes that this class applies to refresh the service [19:39:30] we could also set up naggen2 to not refresh the service [19:39:30] so the class should be responsible for doing this [19:39:32] and that's valid too [19:39:33] not the user of the class [19:39:34] that is fine. [19:39:38] if that is what you want to do :) [19:39:43] not here [19:39:48] but it shoudl be explicitly documented [19:39:57] as usually config files notify their services [19:40:02] hmm [19:40:18] (03PS1) 10RobH: setting install params for ms-fe2001-2004 [puppet] - 10https://gerrit.wikimedia.org/r/163237 [19:40:19] e.g. https://github.com/wikimedia/puppet-cdh#description [19:40:23] " • In general, services managed by this module do not subscribe to their relevant config files. This prevents accidental deployments of config changes. If you make config changes in puppet, you must apply puppet and then manually restart the relevant services. [19:40:24] ' [19:40:48] ottomata: let me add a similar comment in there :) [19:41:10] I don't want to tie naggen's individual files, etc to knowledge that a Service['icinga'] exists [19:41:26] that makes sense, what about an alias, like we briefly talked about before? [19:41:38] a generic 'nagiosish' service alias [19:41:45] (03CR) 10Ryan Lane: [C: 04-1] "This does not look accurate to me. The normal ubuntu location is /etc/ssl/certs." 
[puppet] - 10https://gerrit.wikimedia.org/r/163222 (owner: 10coren) [19:41:45] that its ok for files to notify [19:41:55] is naggen icinga specific? [19:42:04] or does it work with other nagiosish services? [19:42:08] !log removing root's public_html from fenari - backup kept just in case [19:42:14] Logged the message, Master [19:42:23] ottomata: ah, the naggen2 is what your comment was about, and it got merged already [19:42:32] ah ok [19:42:33] let me add a followuppatch [19:42:36] its relevent here though too I think [19:42:44] ottomata: the config class goes away in the init.pp patch :) [19:42:49] https://gerrit.wikimedia.org/r/#/c/163200/1/manifests/misc/icinga.pp [19:42:52] ah yeah, icinga::config [19:42:55] ok ok, geez [19:42:55] haha [19:42:57] (03CR) 10RobH: [C: 032] setting install params for ms-fe2001-2004 [puppet] - 10https://gerrit.wikimedia.org/r/163237 (owner: 10RobH) [19:42:59] ottomata: :) [19:43:00] YuviPanda: i can't review like this! [19:43:07] ottomata: just once! [19:43:12] i'll keep merging, but let's then discuss the final output! [19:43:38] ottomata: yes! I suggest: 1. merge until the init.pp, verify that it works, 2. CR / discuss that one, then merge that [19:43:48] ok [19:43:56] (03PS3) 10Ottomata: icinga: Move config into module [puppet] - 10https://gerrit.wikimedia.org/r/163200 (owner: 10Yuvipanda) [19:44:15] (03CR) 10Ottomata: [C: 032 V: 032] icinga: Move config into module [puppet] - 10https://gerrit.wikimedia.org/r/163200 (owner: 10Yuvipanda) [19:44:25] (03PS3) 10Ottomata: icinga: Move service definition into module [puppet] - 10https://gerrit.wikimedia.org/r/163201 (owner: 10Yuvipanda) [19:44:47] (03CR) 10Ottomata: [V: 032] icinga: Move service definition into module [puppet] - 10https://gerrit.wikimedia.org/r/163201 (owner: 10Yuvipanda) [19:45:26] (03CR) 10Ryan Lane: "I think the only change necessary is to call update-ca-certificates at the correct times. 
Also, some application's don't use the hash link" [puppet] - 10https://gerrit.wikimedia.org/r/163222 (owner: 10coren) [19:46:09] (03PS3) 10Ottomata: icinga: Move plugins into module [puppet] - 10https://gerrit.wikimedia.org/r/163202 (owner: 10Yuvipanda) [19:46:17] (03CR) 10Ottomata: [C: 032 V: 032] icinga: Move plugins into module [puppet] - 10https://gerrit.wikimedia.org/r/163202 (owner: 10Yuvipanda) [19:46:34] (03Abandoned) 10BBlack: add hhvm-api.svc.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/162896 (owner: 10Giuseppe Lavagetto) [19:47:09] (03CR) 10Dzahn: "this did not install the needed SSL cert yet.. todo.." [puppet] - 10https://gerrit.wikimedia.org/r/162536 (owner: 10Dzahn) [19:47:16] (03CR) 10RobH: [C: 04-1] "so chatting with ryan, it seems these arent the default places, but the original use of /etc/ssl is, so changing my vote." [puppet] - 10https://gerrit.wikimedia.org/r/163222 (owner: 10coren) [19:50:18] ottomata: did puppet work fine? [19:50:50] Is ishmael still used? And, if so, who knows about it? [19:50:54] (03CR) 10Ottomata: icinga: Setup icinga class, consolidate from other classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/163203 (owner: 10Yuvipanda) [19:50:57] (03PS1) 10Dzahn: move noc from fenari to terbium [dns] - 10https://gerrit.wikimedia.org/r/163239 [19:51:00] YuviPanda: no, still busted [19:51:08] (03CR) 10jenkins-bot: [V: 04-1] move noc from fenari to terbium [dns] - 10https://gerrit.wikimedia.org/r/163239 (owner: 10Dzahn) [19:51:11] andrewbogott: the status is.. it's up but has no data.. 
but we want to fix that [19:51:16] but, lets' go ahead and get through these and then fix, if you like [19:51:19] andrewbogott: per sean [19:51:23] it'll be easier for me to trace the dependency cycle and fix [19:51:27] if things aren't moving so much [19:51:37] ottomata: yeah [19:51:44] different cycle now though :) [19:51:46] andrewbogott: but i recently just moved it behind misc-web [19:51:51] ottomata: heh :) [19:51:52] yeah [19:52:12] (03PS1) 10Andrew Bogott: Move ishmael to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163240 [19:52:20] mutante: can you take ownership of ^ ? [19:52:26] left you a comment here: https://gerrit.wikimedia.org/r/#/c/163203/ YuviPanda [19:52:35] Which, if the stakes are low, 'ownership' may mean 'merge blindly' [19:53:03] ottomata: hmm, that's notify vs subscribe, I guess [19:53:08] andrewbogott: ah, just the auth part to the new servers.. that's ok. yea [19:53:27] i was like "eh.. why move the whole tool:?) [19:53:48] YuviPanda: yeah, and that is an aesthetic preference, at that level, since it is identical. that's what i've usually seen, so I think we should stick with it that way [19:54:03] ottomata: I prefer notify, since you can add other files/config without having to go back and add a subscribe? [19:54:04] (03CR) 10Dzahn: [C: 032] Move ishmael to the new ldap servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/163240 (owner: 10Andrew Bogott) [19:54:06] but, I like to organize as much as I can to be sequentially readable, and to set up the deps that way too [19:54:12] you can still do that [19:54:21] YuviPanda, you can use both notify and subscribe [19:54:39] I'd be ok wtih notify, but the file should be organized in dependency order, if possible [19:54:39] ottomata: hmm, I see more notifys than subscribe [19:55:29] ottomata: if you see the files{}, they also require Service['icinga'] [19:55:49] since we set up the Service ourselves, we need to ensure that the service exists before we notify [19:56:07] hmm, does puppet treat notify as a 'before' dependency? [19:56:09] notify is require [19:56:13] sorryu [19:56:13] ah [19:56:13] before [19:56:14] yes [19:56:22] notify is before with refresh :) [19:56:22] ottomata: that makes sense then [19:56:35] ottomata: let me move 'em [19:56:36] k [19:56:41] mutante: thank you! [19:56:56] andrewbogott: can't apply it on neon yet though, because of the unrelated icinga puppet issue [19:57:02] so we gotta check later [19:57:29] Found 1 dependency cycle [19:57:48] YuviPanda: ottomata ^ fyi.. but that's probably what you are already talking about [19:58:23] (03PS3) 10Yuvipanda: icinga: Setup icinga class, consolidate from other classes [puppet] - 10https://gerrit.wikimedia.org/r/163203 [19:58:37] ottomata: ^ [19:58:39] (03PS3) 10Yuvipanda: icinga: Setup icinga role! 
[puppet] - 10https://gerrit.wikimedia.org/r/163205 [19:58:49] mutante: yeah, we're on it [19:59:41] YuviPanda: let me take one more minute to try to convince you to like subscribe better :) [19:59:47] since notify is before [19:59:50] and subscribe is require [20:00:04] requires are much more common in puppet, and it is easier to reason about dependencies if they work in one direction [20:00:47] the more bi-direction dependencies that are expressed, the more difficult it is to keep a clear picture of the chain [20:01:04] also, i'm not sure how you just saw more notifies than subscribes [20:01:06] ottomata: if we use subscribe, then when adding a new config file, two things need to be changed. [20:01:07] but if you just did a grep | wc [20:01:16] yeah, it would undercount subscribes [20:01:18] yeah [20:01:27] YuviPanda: yes that's true [20:01:35] you'd have to change the file and the service [20:01:36] ottomata: having two places to do things feels bad to me [20:01:38] yes [20:01:42] ottomata: also you'd end up with a huge list [20:01:44] of subscribes [20:01:47] well, big list [20:01:48] i dunno, i actually like it, as it is more explicit [20:01:53] yeah i like that :) [20:01:55] all in one place [20:02:01] (well mostly :) [20:02:02] ) [20:02:04] well, all in *two* places :) [20:02:09] two? [20:02:14] mutante: same question re: kibana? [20:02:50] andrewbogott: i don't know, maybe more for ottomata ? [20:02:53] ottomata: one in the subscribe, and one in the file{} [20:02:54] eh? [20:02:57] ottomata: kibana? [20:03:07] we didn't touch kibana :) [20:03:12] ah, no, the service dependency would not be expressed in the file [20:03:31] (03PS1) 10Andrew Bogott: Switch kibana to the new ldap servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/163246 [20:03:34] ottomata: sure, but the file itself would have a piece of info about it (that it needs to be refreshed) embedded elsewhere [20:03:40] err [20:03:43] that it needs to refresh [20:03:45] file { blalba: require => whatever before, package maybe? } [20:03:45] ... [20:03:45] service {boooboo: subscribe => [File[blalba], File[woohoo]], etc. [20:04:02] ottomata: so the tradeoff is 1 service definition having its subscribes distributed, or lots of file definitions having it distributed [20:04:14] well, distributed away from themselves [20:04:14] eh? the subscribes aren't distributed [20:04:22] the services is the one getting refreshed here [20:04:26] ottomata: I think the word I was looking for was 'separated' [20:04:30] so the service says: I want to know whenever this crap changes. [20:04:34] ottomata: will you please merge and watch https://gerrit.wikimedia.org/r/#/c/163246/ when you have a moment? [20:04:44] ottomata: idk, this centralization feels a bit wrong to me, especially since it's a fair number of files [20:04:48] and I've always used notify [20:04:49] :) [20:05:03] hm, reading that I can't believe that it ever worked [20:05:07] bblack: this change .. it looks so simple. but jenkins hates it https://gerrit.wikimedia.org/r/#/c/163239/1/templates/wikimedia.org [20:05:16] YuviPanda: what's your arguemnt for using require instead of before then? [20:05:22] ottomata: since icinga is still broken, if you've no other objections other than this (which as you said, is aesthetic), do you want to merge, and we can fix icinga, and continue this after? [20:05:33] bblack: is that new? "CNAME 'noc.wikimedia.org.' points to known same-zone NXDOMAIN 'terbium.wikimedia.org.' [20:05:50] ottomata: 1. avoids long lists, 2. adding new config file doesn't need me to remember to add it to the services list, 3. 
I really do not like long lists :) [20:06:02] YuviPanda: ,ok [20:06:11] (03CR) 10Ottomata: [C: 032 V: 032] icinga: Setup icinga class, consolidate from other classes [puppet] - 10https://gerrit.wikimedia.org/r/163203 (owner: 10Yuvipanda) [20:06:43] andrewbogott: you want me to merge and run puppet on kibana...what is the kibana host oooOooo... :p [20:07:03] no, YuviPanda, I asked [20:07:09] what is your reason for using require then [20:07:19] if you like notify better, it would follow that you also prefer before over require [20:07:41] ottomata: if you're not the boss of kibana then you can pass the buck to whoever is :) [20:07:57] ottomata: I don't think that follows :) with before, I have to add a list of files to the package definition, (long list), and adding a new file means i've to add an item to that long list [20:07:59] Check and see if ldap is broken before you apply the patch though, to accurately assign lame [20:08:01] mutante: terbium.wikimedia.org does not exist. It's terbium.eqiad.wmnet [20:08:04] *blame [20:08:15] i like "assign lame" [20:08:16] ack, YuviPanda [20:08:18] apache.pp [20:08:19] needs changed [20:08:21] to web.pp [20:08:26] autoloader fail :) [20:08:57] bblack: dduh.. makes sense.. only fenari was public..thx [20:09:16] ottomata: bah [20:09:22] bd808: yt? [20:09:30] doing [20:09:46] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail [20:09:47] ottomata: yeah, what's up? [20:09:56] bd808: https://gerrit.wikimedia.org/r/#/c/163246/ [20:10:04] andrewbogott: i just logged into logstash.wikimedia.org just fine [20:10:10] (03PS1) 10Yuvipanda: icinga: Move web module into appropriate file [puppet] - 10https://gerrit.wikimedia.org/r/163248 [20:10:12] ottomata: ^ [20:10:27] ottomata: I will trust your implication that that is related :) [20:10:27] (03CR) 10Ottomata: [C: 032 V: 032] icinga: Move web module into appropriate file [puppet] - 10https://gerrit.wikimedia.org/r/163248 (owner: 10Yuvipanda) [20:10:45] andrewbogott: eh? 
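[editor's note] The notify-versus-subscribe trade-off argued over above can be put side by side in one small manifest. This is only a sketch: the two file paths are hypothetical, and none of it is taken from the actual icinga module. Per the Puppet relationship docs, `notify` behaves like `before` plus a refresh, and `subscribe` behaves like `require` plus a refresh, so both forms below produce the same ordering and the same restart behaviour. They differ only in where the relationship is written down.

```puppet
# Hypothetical config files, for illustration only.

# YuviPanda's preference: the file carries the relationship.
# Adding another config file later touches only that file's resource.
file { '/etc/icinga/example_notify.cfg':
  ensure  => file,
  content => "# managed by puppet\n",
  notify  => Service['icinga'],   # ordered before the service, refreshes it on change
}

# ottomata's preference: the file knows nothing about the service.
file { '/etc/icinga/example_subscribe.cfg':
  ensure  => file,
  content => "# managed by puppet\n",
}

# The service lists what it wants to hear about. Every new config file
# then also needs an entry here (the "long list" objection).
service { 'icinga':
  ensure    => running,
  subscribe => [
    File['/etc/icinga/example_subscribe.cfg'],
  ],
}
```

As noted in the channel, both forms can be mixed on the same service, so the choice is mostly about readability and about which resource a future editor has to remember to touch.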
[20:10:48] (03CR) 10BryanDavis: [C: 031] Switch kibana to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163246 (owner: 10Andrew Bogott) [20:10:52] ottomata: LGTM. [20:10:58] ottomata: I totally don't know what kibana is [20:11:02] ignorance is bliss [20:11:09] haha, yes, kibana is the web frontend to logstash stuff [20:11:12] it's the frontend for logstash [20:11:13] which is hosted at logstash.wikimedia.org [20:11:43] +1 for service names too [20:12:11] andrewbogott, if I logged in...does that mean it is not broken? [20:12:26] and do I really want to merge that when I am messing with icinga already on a friday afternoon? :D [20:12:36] no :D [20:12:42] ottomata: it probably means it's not broken [20:12:46] and, yes you want to merge today [20:12:54] i do, even though it is not broken? [20:13:04] bd808: what host runs kibana? [20:13:06] It'll be broken when I power down virt0 :) [20:13:07] andrewbogott: icinga itself will also need that change..fwiw [20:13:09] ok [20:13:13] * YuviPanda has to go soon :( [20:13:23] ottomata: logstash1001 I think, let me check [20:13:28] ok, YuviPanda we are back to the same dep cycle nowo [20:13:31] (File[/etc/icinga/puppet_hostextinfo.cfg] => Class[Icinga::Naggen] => Service[icinga] => Class[Icinga] => Class[Icinga::Naggen] => File[/etc/icinga/puppet_hostextinfo.cfg]) [20:13:47] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [20:13:48] (03PS1) 10BBlack: Add hhvm_api varnish backend director [puppet] - 10https://gerrit.wikimedia.org/r/163249 [20:14:00] eh? [20:14:01] ottomata: icinga-wm says it succeeded? [20:14:05] ottomata: if it breaks, you can revert and just rm the bits that refer to virt0 for the short term. [20:14:43] ottomata: I think it should be fine now? [20:14:50] trying again. [20:15:03] (03PS1) 10Andrew Bogott: Update from virt0/virt1000 to virt1000/labcontrol2001 in virt scripts. 
[puppet] - 10https://gerrit.wikimedia.org/r/163250 [20:15:05] (03PS1) 10Andrew Bogott: Minor purge of virt0/pmtpa refs. [puppet] - 10https://gerrit.wikimedia.org/r/163251 [20:15:07] (03PS1) 10Andrew Bogott: Switch over to using a labcontrol2001 as the labs salt secondary. [puppet] - 10https://gerrit.wikimedia.org/r/163252 [20:15:09] (03PS1) 10Andrew Bogott: Purge a bunch of pmtpa labs defs from openstack manifests. [puppet] - 10https://gerrit.wikimedia.org/r/163253 [20:15:11] (03PS1) 10Andrew Bogott: Purged virt0 refs from the ldap module. [puppet] - 10https://gerrit.wikimedia.org/r/163254 [20:15:26] (03PS2) 10Ottomata: Switch kibana to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163246 (owner: 10Andrew Bogott) [20:15:34] (03CR) 10Ottomata: [C: 032 V: 032] Switch kibana to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163246 (owner: 10Andrew Bogott) [20:15:40] nope, YuviPanda [20:15:41] (File[/etc/icinga/puppet_hostextinfo.cfg] => Class[Icinga::Naggen] => Service[icinga] => Class[Icinga] => Class[Icinga::Naggen] => File[/etc/icinga/puppet_hostextinfo.cfg]) [20:15:51] (03CR) 10Ori.livneh: [C: 031] Add hhvm_api varnish backend director [puppet] - 10https://gerrit.wikimedia.org/r/163249 (owner: 10BBlack) [20:16:02] (03PS2) 10BBlack: Add hhvm_api varnish backend director [puppet] - 10https://gerrit.wikimedia.org/r/163249 [20:16:07] (03PS1) 10Andrew Bogott: Move icinga ldap to the new servers. [puppet] - 10https://gerrit.wikimedia.org/r/163255 [20:16:09] (03CR) 10BBlack: [C: 032 V: 032] Add hhvm_api varnish backend director [puppet] - 10https://gerrit.wikimedia.org/r/163249 (owner: 10BBlack) [20:16:18] thanks [20:16:28] ottomata: what does the arrow direction imply? 
[20:16:40] depends [20:16:45] before :) [20:17:19] ottomata: hmm, I don't see how Class[Icinga] => Class[Icinga::Naggen] [20:17:22] happens [20:17:43] ottomata: Apparently varnish load balances across logstash100[123] for logstash.wm.o [20:17:51] * bd808 had forgotten that [20:18:00] (03PS1) 10Andrew Bogott: Move tendril ldap to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163256 [20:18:32] (03PS1) 10Andrew Bogott: Move graphite ldap to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163257 [20:18:56] ottomata: aaaah, I see. it's because notify is also kinda like a require [20:18:58] wait [20:18:58] no [20:19:01] that shouldn't matter [20:19:06] ah [20:19:08] i think because [20:19:11] class { 'icinga::naggen': [20:19:11] require => Class['icinga'], [20:19:31] you are doing before and require at the same time? [20:19:33] yeah [20:19:38] notify is like before [20:19:41] so you are saying [20:19:45] class { 'icinga::naggen': [20:19:46] require => Class['icinga'], [20:19:46] notify => Service['icinga'], [20:19:54] and service icinga is inside of Class icinga [20:19:55] sure, but Service['icinga'] is inside Class['icinga'] [20:19:57] so basically [20:20:08] so this simply says that Class['icinga'] should be *before* class Naggen [20:20:09] no? [20:20:14] (03PS1) 10BBlack: hhvm_api: fix missing semicolon [puppet] - 10https://gerrit.wikimedia.org/r/163258 [20:20:14] well, because its in the class, the class needs to be evalutated first, [20:20:21] The notify will take care of the DAG ordering [20:20:22] hmmm [20:20:32] Can I get someone(s) to merge and test those tendril/graphite/icinga changes? Mutante again? [20:20:36] (03PS2) 10BBlack: hhvm_api: fix missing semicolon [puppet] - 10https://gerrit.wikimedia.org/r/163258 [20:20:42] (03CR) 10BBlack: [C: 032 V: 032] hhvm_api: fix missing semicolon [puppet] - 10https://gerrit.wikimedia.org/r/163258 (owner: 10BBlack) [20:20:48] (03CR) 10coren: "Sorry, but in this case Ryan is mistaken. 
Please read 'man update-ca-certs'" [puppet] - 10https://gerrit.wikimedia.org/r/163222 (owner: 10coren) [20:21:34] YuviPanda: i think you should manage your dependencies inside of your classes :) [20:21:35] Requiring too many edges in the Puppet DAG is a sure way to end up in a mess. Best to only specify things that are absolutely necessary and Puppet can't guess itself [20:21:56] YuviPanda: I"ll have a go at a patch... [20:22:15] bd808: hmm, I have one notify and one require... [20:22:35] someone told yuvi the rule of 'if you break monitoring before the weekend, you become monitoring for the weekend' right? [20:22:37] ottomata: hmm, I think I understand now. [20:22:38] ;D [20:22:53] robh: :) [20:23:27] RECOVERY - puppet last run on cp1039 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:23:36] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [20:23:37] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:24:27] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:26:29] (03PS1) 10Ori.livneh: Disable HHVM beta feature on Wikidata wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163260 [20:26:35] (03PS1) 10Ottomata: Move dependencies into icinga module classes [puppet] - 10https://gerrit.wikimedia.org/r/163261 [20:27:10] ottomata: hmm, that should work, I think. 
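[editor's note] A minimal sketch of why the class-level `require` plus `notify` pair quoted above loops, and of one way to avoid it by moving the relationships onto the resources inside the class. The class names `icinga_demo` and `icinga_demo::naggen` are hypothetical stand-ins. This is an assumption about the general shape of the fix, not a copy of what the merged patches do.

```puppet
# The problematic declaration from the discussion, shown as a comment:
#
#   class { 'icinga::naggen':
#     require => Class['icinga'],    # everything in Class['icinga'] first,
#                                    # including Service['icinga']
#     notify  => Service['icinga'],  # but naggen must come before
#                                    # Service['icinga']
#   }
#
# Service['icinga'] is contained in Class['icinga'], so the two edges
# point in opposite directions and Puppet reports a dependency cycle.

# Hypothetical stand-in for the base class.
class icinga_demo {
  package { 'icinga':
    ensure => present,
  }

  service { 'icinga':
    ensure  => running,
    require => Package['icinga'],
  }
}

# Hypothetical stand-in for the naggen class. It declares the base class
# with a plain include (no class-level ordering) and puts the
# relationships on the one resource that actually needs them.
class icinga_demo::naggen {
  include icinga_demo

  file { '/etc/icinga/puppet_hostextinfo.cfg':
    ensure  => file,
    require => Package['icinga'],   # only the package has to exist first
    notify  => Service['icinga'],   # file before service, refresh on change
  }
}

include icinga_demo::naggen
```

The resulting graph is Package, then File, then Service, with no edge leading back from the service to the file.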
[20:27:13] I'm not sure that's going to work...but gonna try [20:27:38] (03CR) 10Ottomata: [C: 032 V: 032] Move dependencies into icinga module classes [puppet] - 10https://gerrit.wikimedia.org/r/163261 (owner: 10Ottomata) [20:27:51] * YuviPanda isn't too much of a fan of bare 'require's [20:28:23] (03PS1) 10Andrew Bogott: Move servermon ldap to the new ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/163262 [20:28:25] I think one of the reasons this is complicated is because we define our own service {} [20:28:46] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [20:29:12] YuviPanda: I'm not hugely for them either, it depends on the class that is being required [20:29:50] bare requires are how puppet3 should work. parameters should be loaded from hiera [20:29:55] ottomata: I think (over next week or sometime after), we should try and get rid of our custom service {} definition [20:30:13] ok, I now need shepherds and reviewers for… icinga, tendril, graphite, servermon. [20:30:15] and a lot of other FIXMEs that should be handled by the package [20:30:40] (03CR) 10coren: "Sorry, but in this case Ryan is mistaken. Please read 'man update-ca-certificates'" [puppet] - 10https://gerrit.wikimedia.org/r/163222 (owner: 10coren) [20:30:56] RECOVERY - puppet last run on cp1038 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [20:31:29] bd808: can I get a +1 for https://gerrit.wikimedia.org/r/#/c/163189/, and can it be added to the scap list? (Or you can tell me how to do it.) [20:31:52] (03CR) 10BryanDavis: [C: 031] Move wikitech to the new ldap server, ldap-eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163189 (owner: 10Andrew Bogott) [20:32:26] ok, now i'm down to experimenting too, YuviPanda, I've got the dep cycle one link shorter... 
[20:32:28] andrewbogott: Add the patch to https://wikitech.wikimedia.org/wiki/Deployments#Week_of_September_29th in the first swat window [20:32:37] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:32:38] ottomata: what's it now? [20:33:17] RECOVERY - puppet last run on cp1037 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:33:39] andrewbogott: It's a prod no-op so we could technically merge now, but Friday! [20:34:17] (03PS1) 10Ottomata: Include icinga base class in ::web and ::naggen [puppet] - 10https://gerrit.wikimedia.org/r/163267 [20:34:20] ah this makes sense YuviPanda [20:34:30] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:34:30] the problem is that with a class level dependency [20:34:53] ::naggen says it has to have icinga class done first. [20:34:57] RECOVERY - puppet last run on cp1040 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [20:34:57] but the icinga class can't finish first [20:35:26] because the ::naggen class refreshes the service inside of the icinga class...actually, does that make sense? [20:35:30] hmmmmMM [20:35:31] maybe not.. [20:35:32] phew [20:35:33] no [20:35:39] * YuviPanda reparses [20:35:50] (03CR) 10Ottomata: [C: 032 V: 032] Include icinga base class in ::web and ::naggen [puppet] - 10https://gerrit.wikimedia.org/r/163267 (owner: 10Ottomata) [20:35:55] but wanting to refresh the service doesn't mean this has to be done first.... 
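The cycle ottomata describes above can be sketched as a small ordering graph. This is a simplified model, not how Puppet builds its catalog internally: each class is treated as a single node that completes only after the resources it contains, and a `notify` edge is counted as an ordering edge (the notifying resource is applied before its target), which is how Puppet documents notification relationships. The resource names, the edge lists, and the `find_cycle` helper are all hypothetical and only illustrate the shape of the problem. Python's standard-library `graphlib` does the topological sort.

```python
from graphlib import CycleError, TopologicalSorter


def ordering_graph(edges):
    """Build {node: set_of_predecessors} from (applied_first, applied_later) pairs."""
    graph = {}
    for first, later in edges:
        graph.setdefault(first, set())
        graph.setdefault(later, set()).add(first)
    return graph


def find_cycle(edges):
    """Return the nodes of one dependency cycle, or None if the edges form a DAG."""
    try:
        # static_order() is lazy, so list() forces the sort and the cycle check.
        list(TopologicalSorter(ordering_graph(edges)).static_order())
    except CycleError as err:
        # graphlib stores the detected cycle as the second exception argument;
        # its first and last entries are the same node.
        return err.args[1]
    return None


# Hypothetical edges mirroring the discussion (names are assumptions).
cyclic_edges = [
    # The icinga class is only complete once the service it contains is applied.
    ("Service[icinga]", "Class[icinga]"),
    # ::naggen has a class-level require on the icinga class.
    ("Class[icinga]", "File[naggen config]"),
    # A file in ::naggen notifies the service; notify also implies ordering.
    ("File[naggen config]", "Service[icinga]"),
]

# Using a plain include instead of a class-level require drops the middle edge.
fixed_edges = [
    ("Service[icinga]", "Class[icinga]"),
    ("File[naggen config]", "Service[icinga]"),
]

print("with class-level require:", find_cycle(cyclic_edges))
print("with plain include:", find_cycle(fixed_edges))
```

In this model the refresh is not what causes the cycle. The ordering implied by `notify` is, once it is combined with a requirement that the whole class finish first.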
[20:35:57] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:35:58] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:35:58] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:36:06] you can refresh the service *after* it is defined and evaluated as well [20:36:06] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:36:13] yeah. [20:36:28] I assume that's the difference between before and notify [20:36:40] * YuviPanda should read the docs again and not assume [20:36:57] hm, well, at least the use of include does kinda make sense here, as both ::web and ::naggen aren't totally dependent on icinga base class. Only some of the files inside of ::naggen are really dependent. [20:37:00] maybe...? [20:37:20] "Puppet has two types of resource relationships: Ordering; Ordering with notification" [20:37:26] https://docs.puppetlabs.com/puppet/latest/reference/lang_relationships.html#ordering-and-notification [20:37:43] bd808: done, thank you [20:38:13] "An ordering relationship ensures that one resource will be managed before another." [20:38:17] "A notification relationship does the same, but also sends the latter resource a refresh event" [20:38:29] YuviPanda: puppet running [20:38:43] (03CR) 10Andrew Bogott: [C: 032] Update from virt0/virt1000 to virt1000/labcontrol2001 in virt scripts. [puppet] - 10https://gerrit.wikimedia.org/r/163250 (owner: 10Andrew Bogott) [20:38:48] bd808: hmm, right. so a notify is a 'before+also-whenever-it-changes' [20:39:01] yup [20:39:02] I... 
didn't fully know that [20:39:03] (03PS2) 10Dzahn: move noc from fenari to misc-web [dns] - 10https://gerrit.wikimedia.org/r/163239 [20:39:12] (03CR) 10jenkins-bot: [V: 04-1] move noc from fenari to misc-web [dns] - 10https://gerrit.wikimedia.org/r/163239 (owner: 10Dzahn) [20:39:17] perils of learning a language from hacking on it rather than reading docs, I suppose [20:39:22] it's not obvious until you've made a cycle a few times [20:39:31] (03CR) 10Ori.livneh: [C: 032] Disable HHVM beta feature on Wikidata wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163260 (owner: 10Ori.livneh) [20:39:33] (03CR) 10Ryan Lane: [C: 031] "Coren's convinced me via IRC." [puppet] - 10https://gerrit.wikimedia.org/r/163222 (owner: 10coren) [20:39:36] ottomata: yay! [20:39:37] (03Merged) 10jenkins-bot: Disable HHVM beta feature on Wikidata wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163260 (owner: 10Ori.livneh) [20:39:50] (03PS3) 10Dzahn: move noc from fenari to misc-web [dns] - 10https://gerrit.wikimedia.org/r/163239 [20:39:50] YuviPanda: looks good! [20:40:01] I had some really twisted dependency cycles when I first wrote the mw-v multiversion patches [20:40:03] ottomata: icinga -v /etc/icinga/icinga.cfg? [20:40:12] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:40:22] Total Warnings: 0 [20:40:22] Total Errors: 0 [20:40:27] YAY! [20:40:27] i looked at the changes it made too [20:40:29] only good things [20:40:35] ottomata: thanks for merging 'em all! :D [20:40:37] lots of trailing whitespace removals in config files i saw [20:40:38] yup [20:40:40] haven't done role yet [20:40:42] but you gotta go, right? [20:40:45] ottomata: ya [20:40:57] ottomata: I'll do the role and kill icinga.pp next week, and also some other cleanup [20:41:05] ottomata: probably late next week, travelling to India on Monday night [20:41:11] oh, where are you now? 
[20:41:14] ottomata: london :) [20:41:17] ah [20:41:29] haven't left since wikimania :) [20:41:31] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [20:41:32] ha, awesome [20:41:34] :) [20:41:42] YuviPanda: since this is the last one of your changes in in gerrit [20:41:43] (03CR) 10Andrew Bogott: [C: 032] Minor purge of virt0/pmtpa refs. [puppet] - 10https://gerrit.wikimedia.org/r/163251 (owner: 10Andrew Bogott) [20:41:47] (at least in THIS dependency chain) [20:41:52] i'll go ahead and fix it up and merge [20:41:54] it is pretty simple [20:41:57] ottomata: yay! thanks :) [20:42:11] yup [20:42:15] have a good eve! [20:42:19] ottomata: I shall buy you large $BEVERAGEOFCHOICE when we meet :) [20:42:27] * YuviPanda goes afk [20:42:28] (03CR) 10RobH: [C: 031] "i just like being on record of being wrong =P undoing my undoing of the +1" [puppet] - 10https://gerrit.wikimedia.org/r/163222 (owner: 10coren) [20:42:30] (03PS1) 10Ori.livneh: Fix-up for fee38d2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163269 [20:42:40] (03PS1) 10Dzahn: move noc.wm behind misc-web varnish on terbium [puppet] - 10https://gerrit.wikimedia.org/r/163270 [20:42:50] (03PS2) 10Ori.livneh: Fix-up for fee38d2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163269 [20:42:57] (03CR) 10Ori.livneh: [C: 032] Fix-up for fee38d2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163269 (owner: 10Ori.livneh) [20:43:06] (03Merged) 10jenkins-bot: Fix-up for fee38d2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163269 (owner: 10Ori.livneh) [20:45:17] (03PS4) 10Ottomata: icinga: Setup icinga role! 
[puppet] - 10https://gerrit.wikimedia.org/r/163205 (owner: 10Yuvipanda) [20:45:37] !log ori Synchronized wmf-config/CommonSettings.php: Disable HHVM beta-feature on wikidatawiki (duration: 00m 06s) [20:45:43] Logged the message, Master [20:46:48] (03PS2) 10Andrew Bogott: Switch over to using a labcontrol2001 as the labs salt secondary. [puppet] - 10https://gerrit.wikimedia.org/r/163252 [20:47:00] (03CR) 10Ottomata: [C: 032 V: 032] icinga: Setup icinga role! [puppet] - 10https://gerrit.wikimedia.org/r/163205 (owner: 10Yuvipanda) [20:47:58] (03PS3) 10Andrew Bogott: Switch over to using a labcontrol2001 as the labs salt secondary. [puppet] - 10https://gerrit.wikimedia.org/r/163252 [20:49:03] (03PS1) 10Ottomata: Fully qualify icinga module in icinga role [puppet] - 10https://gerrit.wikimedia.org/r/163271 [20:49:13] (03CR) 10Andrew Bogott: [C: 032] Switch over to using a labcontrol2001 as the labs salt secondary. [puppet] - 10https://gerrit.wikimedia.org/r/163252 (owner: 10Andrew Bogott) [20:49:16] (03CR) 10Ottomata: [C: 032 V: 032] Fully qualify icinga module in icinga role [puppet] - 10https://gerrit.wikimedia.org/r/163271 (owner: 10Ottomata) [20:49:26] andrewbogott: :() [20:49:28] :) [20:49:31] merging yo thang [20:49:35] thanks! [20:50:02] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail [20:51:45] (on it!) [20:51:53] so nice not to see Epic there! [20:51:54] :p [20:52:37] normal puppet fail [20:53:07] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [20:54:06] (03PS2) 10Andrew Bogott: Purge a bunch of pmtpa labs defs from openstack manifests. [puppet] - 10https://gerrit.wikimedia.org/r/163253 [20:55:35] (03CR) 10Andrew Bogott: [C: 032] Purge a bunch of pmtpa labs defs from openstack manifests. [puppet] - 10https://gerrit.wikimedia.org/r/163253 (owner: 10Andrew Bogott) [20:58:56] (03PS2) 10Andrew Bogott: Purged virt0 refs from the ldap module. 
[puppet] - 10https://gerrit.wikimedia.org/r/163254 [20:59:46] (03CR) 10Andrew Bogott: [C: 032] Purged virt0 refs from the ldap module. [puppet] - 10https://gerrit.wikimedia.org/r/163254 (owner: 10Andrew Bogott) [21:01:13] (03PS2) 10Andrew Bogott: Move tendril ldap to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163256 [21:02:30] ottomata: know anything about tendril? specifically, how to restart it? [21:03:00] wut is tendril? [21:03:28] Teennndriiillll [21:03:28] https://en.wikipedia.org/wiki/Tendril [21:03:31] oh [21:03:32] wikipedia [21:03:33] wait [21:03:40] I dont know what it is, but I'm'a break it! [21:04:02] a wikitech page! [21:04:07] https://wikitech.wikimedia.org/wiki/Tendril [21:04:36] ahah, this sounds like something mutante knows about [21:07:29] (03PS1) 10Andrew Bogott: Removed references to the old virt0 puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/163278 [21:09:05] (03CR) 10Andrew Bogott: [C: 032] Removed references to the old virt0 puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/163278 (owner: 10Andrew Bogott) [21:13:04] (03PS1) 10Gergő Tisza: Enable image dimension logging in MediaViewer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163279 [21:15:22] ottomata: is icinga somewhat stable at the moment? I'd like to change its ldap servers as well. [21:15:40] * andrewbogott on a rampage [21:16:08] tendril is a tool for the dba mostly [21:16:38] andrewbogott: you're a classic ops person, something working? Let me breaking it >:D [21:16:39] you can login with labs credentials [21:17:24] mutante: it uses ldap for those labs credentials, I need to move it to different ldap servers. 
[21:17:42] mutante: I'm wondering how to restart it and/or force it to pick up those changes so I can test [21:17:51] andrewbogott: yea, springle is the user, but i'd be fine testing the login [21:18:08] mutante: https://gerrit.wikimedia.org/r/#/c/163256/ [21:18:35] andrewbogott: it is like ishmael, an apache site on neon, sharing with icinga [21:18:43] so restarting is restarting apache on neon [21:18:50] mutante: is that true of graphite as well? [21:18:57] also means only works if puppet run on neon is fixed [21:19:02] no, it's not [21:19:09] neon is icinga, ishmael and tendril [21:19:10] I thought puppet was working on neon, maybe not [21:19:12] but that's all [21:19:54] icinga is cool andrewbogott [21:20:07] (03CR) 10Hashar: [C: 031] "Note $wgLDAPServerNames can be given several servers (space separated), for a later change. See inline diff for references." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163189 (owner: 10Andrew Bogott) [21:20:31] andrewbogott: should be [21:20:39] andrewbogott: that means i'll try if ishmael login still works, and if it does, this is identical pretty much [21:21:07] mutante: yep. Probably good to give it a graceful as well to make sure that it's not just coasting on the old config [21:21:37] !log graceful'ed apache on neon [21:21:53] Logged the message, Master [21:22:09] mutante: oh, except I didn't merge the patch yet :) [21:22:11] confirmed it has the new config [21:22:19] (03CR) 10Andrew Bogott: [C: 032] Move tendril ldap to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163256 (owner: 10Andrew Bogott) [21:22:20] that was the idea [21:22:28] i wanted to merge it after confirming ishmael works [21:22:38] which was merged earlier but couldn't be applied yet [21:22:40] oh, sure. [21:22:53] ok, carry on :) [21:23:09] you'll have a few minutes before puppet applies the tendril change [21:23:19] yea, it's "already in progress" [21:23:28] and..
internal server error on icinga :/ [21:23:43] eh, ishmael [21:23:52] oh... [21:24:39] mutante: does that mean I should roll back the tendril and ishmael ldap changes? [21:25:03] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2021: active_shards: 6058: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [21:25:26] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail [21:25:46] andrewbogott: ^ eh.. and puppet run fail? sigh [21:26:00] mutante: I'll look at that while you figure out ishmael [21:26:13] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2022: active_shards: 6061: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [21:26:32] andrewbogott: we have time to just check tendril until springle shows up :) [21:26:36] looking [21:27:30] pssh [21:27:35] mutante: :) [21:27:42] hehee [21:27:50] it's saturday for springle, remember :) [21:28:01] break what you like. it's the weekend and i'm not here [21:28:06] see? [21:30:19] springle: enjoy your weekend! :-] [21:31:17] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 108 seconds ago with 0 failures [21:32:44] hm, I can't tell if ishmael is working or not [21:32:50] tendril seems fine so far [21:33:01] same here [21:33:03] ok, ishmael definitely broken [21:33:08] but possibly not because of ldap :) [21:33:16] I'm going to graceful again and see what tendril thinks [21:33:37] the graceful broke ishmael , but it seems something different. yea. 
agreed [21:35:19] !log gracefulled apache on neon [21:35:24] andrewbogott: tendril seems ok with new settings, if you already did the graceful [21:35:26] Logged the message, Master [21:35:47] mutante: actually, I think it broke in the same way as tendril [21:35:53] um… I mean, as ishmael [21:35:59] not anymore.. [21:36:06] yes [21:36:17] so…. hm [21:36:21] then, i think we should revert [21:36:25] probably the same ldap issue as on gerrit [21:37:38] wait, "as on gerrit"? [21:37:57] how was that solved earlier.. and you are afk now? [21:38:18] (03PS1) 10Dzahn: Revert "Move tendril ldap to the new ldap servers." [puppet] - 10https://gerrit.wikimedia.org/r/163282 [21:38:43] mutante: 'as on gerrit' was pretty scary and not ready for general use. https://gerrit.wikimedia.org/r/163222 [21:38:52] Will wait for that to settle in and then try again. [21:39:15] (03PS1) 10Dzahn: Revert "Move ishmael to the new ldap servers." [puppet] - 10https://gerrit.wikimedia.org/r/163283 [21:39:35] let's see the revert actually fixing it first [21:40:04] andrewbogott: did you merge the change for icinga itself or not? [21:40:11] not [21:40:15] ok [21:40:40] (03CR) 10Dzahn: [C: 032] "we got an Internal Server Error after the next graceful on neon" [puppet] - 10https://gerrit.wikimedia.org/r/163283 (owner: 10Dzahn) [21:41:37] (03CR) 10Dzahn: [C: 032] "we got an Internal Server Error after the next graceful on neon" [puppet] - 10https://gerrit.wikimedia.org/r/163282 (owner: 10Dzahn) [21:42:40] mutante: are you doing puppet and graceful or shall I? [21:42:50] andrewbogott: already doing puppet [21:43:34] andrewbogott: did others get merged and worked, but on other servers that weren't neon? [21:43:47] or did they also just not graceful and would break [21:43:52] mutante: it depends on the service. [21:43:58] apache [21:44:12] But mostly I think these four are it, in terms of using ldap.
[21:44:42] I'm rearranging all these patches to depend on that bit one of Coren's [21:44:45] *big one [21:45:59] andrewbogott: it reverted it.. and it fixed it [21:46:02] so yea... [21:46:05] ok [21:46:33] i was just thinking how many others there were that are merged [21:46:44] where it's also an apache using ldap auth [21:48:51] mutante: I think that all that I've merged (other than just now) are things that change ldap.conf. And things that rely on ldap.conf have hinting about what cert ca to use... [21:48:58] so I suspect that all is well. [21:49:02] You're right to be wary though [21:49:11] how about kibana? [21:50:14] hm... [21:50:19] kibana, maybe also about to break. [21:50:21] Where does that run? [21:50:25] (03PS2) 10Andrew Bogott: Use the Ubuntu Way of installing SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/163222 (owner: 10coren) [21:50:27] flood! [21:50:27] (03PS2) 10Andrew Bogott: Move icinga ldap to the new servers. [puppet] - 10https://gerrit.wikimedia.org/r/163255 [21:50:29] (03PS2) 10Andrew Bogott: Move servermon ldap to the new ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/163262 [21:50:31] that's logstash, i can login, but has it really been restarted? [21:50:31] (03PS2) 10Andrew Bogott: Move graphite ldap to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163257 [21:50:34] (03PS1) 10Andrew Bogott: Move ishmael to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163285 [21:50:35] heh [21:50:36] (03PS1) 10Andrew Bogott: Move tendril ldap to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163286 [21:50:47] mutante: I don't know if it's been restarted. I think so, not positive [21:51:11] mutante: There are 3 apaches that would need to be restarted. 
I can get them [21:51:42] bd808: it would be interesting though if then they also break [21:51:52] 'interesting' [21:52:01] well, then just revert it too [21:52:11] sure [21:52:43] PROBLEM - RAID on db1020 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [21:52:51] all 3 have the new apache config [21:53:52] Should I graceful the apaches there? That layer is all stateless [21:53:56] bd808: can you see if puppet restarted them? [21:54:06] bd808: yes please [21:54:08] (03PS1) 10Dzahn: Revert "Switch kibana to the new ldap servers." [puppet] - 10https://gerrit.wikimedia.org/r/163287 [21:54:13] bd808: yes, and if it stops working, then ^ [21:55:14] `sudo grep -i apache /var/log/puppet.log` is empty so no, puppet did not restart [21:55:48] ? puppet? [21:55:50] which makes sense right? we don't notify apache on config changes typically [21:55:54] .. so it could have shown up on the next manual graceful weeks later.. :p [21:56:00] oh, I see, ignore me [21:56:53] !log sudo apache2ctl graceful on logstash100[123] for ldap change [21:57:00] Logged the message, Master [21:57:12] hmm, looks broken-ish [21:57:47] yea, ... revert [21:57:50] damn [21:58:10] Do we need to restart a cache server? [21:58:20] service I mean [21:58:39] "Wed Sep 24 16:14:58 2014] [error] (70014)End of file found: proxy: prefetch request body failed to 127.0.0.1:9200 (127.0.0.1) from 10.64.0.171 ()" [21:59:15] (03CR) 10Dzahn: [C: 032] "this also broke the service after the next apache graceful, but puppet does not do the graceful in this case" [puppet] - 10https://gerrit.wikimedia.org/r/163287 (owner: 10Dzahn) [21:59:46] bd808: could you run puppet again [22:00:31] !log running puppet on logstash100[123] to revert ldap change [22:00:36] Logged the message, Master [22:01:21] !log sudo apache2ctl graceful on logstash100[123] for ldap revert [22:01:27] Logged the message, Master [22:01:40] and it works again already?
[22:01:43] it does [22:01:48] works again now [22:01:53] yea, that, thanks [22:02:15] * andrewbogott adds it to the patch series :( [22:02:22] (03PS3) 10Andrew Bogott: Use the Ubuntu Way of installing SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/163222 (owner: 10coren) [22:02:24] (03PS3) 10Andrew Bogott: Move icinga ldap to the new servers. [puppet] - 10https://gerrit.wikimedia.org/r/163255 [22:02:26] (03PS2) 10Andrew Bogott: Move tendril ldap to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163286 [22:02:28] (03PS2) 10Andrew Bogott: Move ishmael to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163285 [22:02:30] (03PS3) 10Andrew Bogott: Move servermon ldap to the new ldap servers [puppet] - 10https://gerrit.wikimedia.org/r/163262 [22:02:32] (03PS3) 10Andrew Bogott: Move graphite ldap to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163257 [22:02:34] (03PS1) 10Andrew Bogott: Switch kibana to the new ldap servers. [puppet] - 10https://gerrit.wikimedia.org/r/163290 [22:02:47] any others? [22:03:40] looks like all [22:03:59] andrewbogott: it works fine in labs? https://gerrit.wikimedia.org/r/#/c/162689/ [22:04:20] mutante: yep, works in labs [22:04:57] alright [22:05:54] mutante: is there a way for me to use salt in labs that gives me actual hostnames rather than i-0000names? 
[22:06:18] * bd808 would love that [22:06:41] jeremyb said it was a matter of changing the salt configs [22:08:24] i don't know better [22:09:07] I think they would all need to have new certs signed after [22:09:22] because that name is the cert name on the master [22:20:06] (03PS2) 10Dzahn: move noc.wm behind misc-web varnish on terbium [puppet] - 10https://gerrit.wikimedia.org/r/163270 [22:25:06] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): [22:33:05] RECOVERY - Disk space on fluorine is OK: DISK OK [22:33:21] disregard the fluorine warning, that was me:P [22:33:40] holy crap, the api logs are huuuuuuuge [22:37:46] Reedy (or anyone else who deals with OAuth requests): can haz OAuth approval (or feedback) for the request from User:Dmak78 ? [22:39:38] ragesoss: on mw.org? [22:39:57] greg-g: that's the only place to request OAuth approval, right? [22:40:05] yeah, that's where the request happned. [22:40:08] yeah [22:40:09] the app actually is for Wikipedia. [22:40:11] I think :) [22:40:24] but we definitely put in the request at mw.org [22:40:28] * greg-g tries to remember who's on the list other than csteipp_afk [22:41:38] Deskana ? ^^ [22:42:07] greg-g: https://www.mediawiki.org/w/index.php?title=Special:ListUsers&group=oauthadmin [22:42:26] nonexistent.conf: No such file or directory - unlink nonexistent.conf - ok :P) [22:42:38] ragesoss: yeah, Deskana is your best bet right now [22:42:44] thanks bd808 [22:42:47] (and JohnLewis :) ) [22:42:54] thanks much, all! [22:48:24] Deskana: if you get a chance to look at Dmak78's OAuth consumer request, please do. No rush, but if there are any roadblocks to approval, hoping to clear them by next week. 
[22:51:07] (03PS4) 10Ori.livneh: wmflib: add to_milliseconds() / to_seconds() [puppet] - 10https://gerrit.wikimedia.org/r/159692 [22:51:42] (03CR) 10Ori.livneh: [C: 032 V: 032] wmflib: add to_milliseconds() / to_seconds() [puppet] - 10https://gerrit.wikimedia.org/r/159692 (owner: 10Ori.livneh) [22:52:08] (03PS3) 10Ori.livneh: Allow wikidev users to restart HHVM [puppet] - 10https://gerrit.wikimedia.org/r/163075 [22:55:46] (03CR) 10Andrew Bogott: [C: 032] Allow wikidev users to restart HHVM [puppet] - 10https://gerrit.wikimedia.org/r/163075 (owner: 10Ori.livneh) [22:56:14] andrewbogott: thanks! [22:57:12] (03PS1) 10BryanDavis: integration: Ensure that /srv/deployment/integration is present [puppet] - 10https://gerrit.wikimedia.org/r/163297 [22:57:31] Krinkle: ^ I think that may fix your puppet issue [22:57:56] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: puppet fail [22:58:35] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: puppet fail [22:58:35] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: puppet fail [22:59:35] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: puppet fail [22:59:46] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: puppet fail [23:01:50] (03CR) 10Dzahn: "Error 400 on SERVER: Could not autoload puppet/parser/functions/to_milliseconds: /etc/puppet/modules/wmflib/lib/puppet/parser/functions/to" [puppet] - 10https://gerrit.wikimedia.org/r/159692 (owner: 10Ori.livneh) [23:04:26] (03PS1) 10Dzahn: Revert "wmflib: add to_milliseconds() / to_seconds()" [puppet] - 10https://gerrit.wikimedia.org/r/163302 [23:05:07] (03PS2) 10BryanDavis: integration: Ensure that /srv/deployment/integration is present [puppet] - 10https://gerrit.wikimedia.org/r/163297 [23:05:12] (03CR) 10Dzahn: [C: 032] "puppet fail on all appservers" [puppet] - 10https://gerrit.wikimedia.org/r/163302 (owner: 10Dzahn) [23:07:37] (03CR) 10Dzahn: [C: 032] move noc.wm behind misc-web varnish on terbium [puppet] - 
10https://gerrit.wikimedia.org/r/163270 (owner: 10Dzahn) [23:08:03] (03CR) 10BryanDavis: "Cherry-picked to integration-puppetmaster and tested on integration-dev.eqiad.wmflabs." [puppet] - 10https://gerrit.wikimedia.org/r/163297 (owner: 10BryanDavis) [23:10:18] (03PS1) 10Legoktm: Fix RSS feed for keyword rename of "hiphop" --> "hhvm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/163303 [23:12:43] (03CR) 10Andrew Bogott: [C: 032] integration: Ensure that /srv/deployment/integration is present [puppet] - 10https://gerrit.wikimedia.org/r/163297 (owner: 10BryanDavis) [23:12:56] thanks andrewbogott [23:16:34] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [23:16:42] (03CR) 10Dzahn: [C: 032] "tested with curl from fenari itself over to terbium, replaced broken symlinks into /h/w/ with the actual files, scp'ed from fenari" [dns] - 10https://gerrit.wikimedia.org/r/163239 (owner: 10Dzahn) [23:16:44] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [23:17:36] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [23:17:55] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [23:18:44] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [23:26:37] !log switched noc.wikimedia.org to terbium, behind misc-web [23:26:44] Logged the message, Master [23:35:38] (03PS1) 10Dzahn: remove pdf servers [dns] - 10https://gerrit.wikimedia.org/r/163308 [23:52:26] (03PS1) 10Dzahn: remove unneeded Apache conf from noc [puppet] - 10https://gerrit.wikimedia.org/r/163312 [23:55:29] (03PS1) 10Dzahn: remove fenari [dns] - 10https://gerrit.wikimedia.org/r/163313 [23:56:38] (03CR) 10Dzahn: "duplicate of Change-Id: Ic1bacaae8b3ef6fec206" [puppet] - 
10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn) [23:56:49] (03Abandoned) 10Dzahn: move dsh to module [puppet] - 10https://gerrit.wikimedia.org/r/96413 (owner: 10Dzahn) [23:58:49] Hey opsen [23:58:58] I have a question about deploying random things with git-deploy [23:59:32] It appears to be accepted practice to deploy node_modules directories (containing npm packages that our code depends on) through git-deploy