[00:00:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [00:13:52] !log upgrading wikitech-static to 1.22wmf11 [00:14:03] Logged the message, Master [00:17:52] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [00:24:42] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 00:24:40 UTC 2013 [00:25:22] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [00:43:04] (PS1) Ori.livneh: Don't attempt to send beta labs errors/fatals to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/75821 [00:44:20] Ryan_Lane: got a sec for CR/merge of super simple patch? [00:44:25] sure [00:44:30] the above? [00:44:35] yep :) [00:45:28] heh [00:45:37] yeah template_variables. ick [00:45:46] (CR) Ryan Lane: [C: 2] Don't attempt to send beta labs errors/fatals to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/75821 (owner: Ori.livneh) [00:46:18] much appreciated [00:46:26] yw [00:54:12] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 220 seconds [00:54:43] hrm, i'm not pulling it in with puppetd -tv [00:54:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 00:54:50 UTC 2013 [00:55:03] no? [00:55:08] on labs or in production? [00:55:12] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [00:55:12] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [00:55:15] labs [00:55:21] we made some changes recently [00:55:23] i'll check prod [00:55:25] let me make sure it's still working [00:56:32] heh, actually we shouldn't expect to see anything in prod since the rendered template should remain the same there [00:56:38] but labs isn't pulling it in, at least. [00:58:12] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [01:00:13] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds [01:00:44] ori-l: should be fixed [01:00:56] cron was pointing at the wrong spot [01:01:16] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not parse for environment production: No file(s) found for import of '../private/manifests/passwords.pp' at /etc/puppet/manifests/base.pp:10 on node i-00000390.pmtpa.wmflabs [01:02:03] well shit [01:02:03] heh [01:02:39] look at the bright side: that's probably a better outcome than actually successfully pulling prod's passwords.pp [01:02:44] ah [01:02:48] missing symlink [01:04:15] now it's fixed [01:04:23] heh [01:04:32] that private repo never leaves those systems ;) [01:05:00] seems to be working [01:05:06] great [01:05:28] and it pulled the change, which rendered correctly. sweet! thanks again. [01:05:53] tw [01:05:55] yw* [01:07:04] can you close the bug for that please [01:07:18] there's a bug for that? [01:13:15] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [01:15:15] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [01:18:15] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [01:20:15] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [01:24:45] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 01:24:42 UTC 2013 [01:25:15] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [01:28:15] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [01:30:15] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds [01:33:15] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 186 seconds [01:35:15] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [01:48:17] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [01:49:17] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 24 seconds [01:49:17] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:50:17] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [01:54:57] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 01:54:52 UTC 2013 [01:55:07] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [01:58:17] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [02:00:17] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds [02:15:32] !log LocalisationUpdate completed (1.22wmf11) at Thu Jul 25 02:15:32 UTC 2013 [02:15:43] Logged the message, Master [02:18:11] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 186 seconds [02:20:11] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [02:27:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 02:27:45 UTC 2013 [02:28:11] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [02:29:13] !log LocalisationUpdate completed (1.22wmf10) at Thu Jul 25 02:29:13 UTC 2013 [02:29:23] Logged the message, Master [02:33:11] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [02:35:11] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [02:43:11] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [02:45:11] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [02:47:27] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 25 02:47:27 UTC 2013 [02:47:37] Logged the message, Master [02:55:01] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 02:54:51 UTC 2013 [02:55:11] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [02:59:11] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 212 seconds [03:00:11] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [03:14:15] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 231 seconds [03:15:15] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [03:18:15] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [03:20:15] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds [03:24:45] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 03:24:41 UTC 2013 [03:25:15] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [03:50:03] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [03:50:03] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:50:03] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [03:50:03] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [03:50:03] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [03:50:04] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:03] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours [03:59:13] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 190 seconds [04:00:13] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [04:01:03] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours [04:01:43] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 04:01:41 UTC 2013 [04:02:13] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [04:04:13] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 181 seconds [04:05:13] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds [04:06:03] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours [04:22:22] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours [04:24:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 04:24:46 UTC 2013 [04:25:42] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [04:33:52] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 221 seconds [04:38:55] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 21 seconds [04:48:50] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 221 seconds [04:53:50] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 25 seconds [04:54:50] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 04:54:41 UTC 2013 [04:55:40] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [05:13:46] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 221 seconds [05:17:46] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds [05:24:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 05:24:39 UTC 2013 [05:25:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [05:33:46] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 197 seconds [05:35:51] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [05:43:51] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 221 seconds [05:53:51] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 26 seconds [05:54:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 05:54:47 UTC 2013 [05:55:21] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [05:58:51] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 221 seconds [06:04:51] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 214 seconds [06:08:52] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 27 seconds [06:13:22] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:14:52] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 183 seconds [06:16:18] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [06:18:52] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 221 seconds [06:23:52] PROBLEM - Puppet freshness on analytics1017 is CRITICAL: No successful Puppet run in the last 10 hours [06:23:52] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 27 seconds [06:24:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 06:24:47 UTC 2013 [06:25:22] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [06:54:47] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 06:54:41 UTC 2013 [06:54:53] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run in the last 10 hours [06:55:23] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [06:56:53] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [07:01:10] (CR) Faidon: [C: -1] ""include contint::" doesn't have any place in site.pp. This should be included in the jenkins (= contint) role classes." [operations/puppet] - https://gerrit.wikimedia.org/r/75497 (owner: Hashar) [07:02:58] (PS2) Faidon: phase out misc::contint::test::packages [operations/puppet] - https://gerrit.wikimedia.org/r/75497 (owner: Hashar) [07:03:13] (CR) Faidon: [C: 2] "Oh, I just realized that the subsequent commit moves this under a role class :)" [operations/puppet] - https://gerrit.wikimedia.org/r/75497 (owner: Hashar) [07:06:44] (CR) Faidon: [C: -1] "(3 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/75498 (owner: Hashar) [07:09:35] (CR) Faidon: [C: 1] "Yeah, sure, why not." [operations/software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/75128 (owner: BBlack) [07:13:13] hashar: (moving a thread to this channel to keep things sane) re: fluoride, yes -- ! i have a half-written update i need to finish and post on bugzilla [07:17:04] ori-l: if there is a way I could help let me know :) [07:24:54] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 07:24:47 UTC 2013 [07:25:24] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [07:25:28] ori-l: """ Don't attempt to send beta labs errors/fatals to vanadium""" well done! [07:25:51] ori-l: I noticed that a while ago but was not able to find out where it was defined [07:28:31] yeah, so deployment-fluoride now gets a copy of errors/fatals on udp 8423 [07:32:30] hashar_: hmm, exception.log and fatal.log disappeared from /home/wikipedia/logs [07:32:36] :( [07:32:40] well, they got gzipped and rotated to archive/, but i'd expected them to have been recreated by now [07:32:53] unless no exception / fatal happened [07:33:20] seems unlikely, but they're easy to generate [07:33:20] http://www.pt.wikibooks.beta.wmflabs.org/robots.txt [07:33:30] ;] [07:34:00] that created fatal.log \O/ [07:34:07] No robots.php for beta labs? [07:34:09] maybe we can have log rotate recreate an empty file for us [07:34:18] Elsie: it just www.pt.wikibooks is not recognized [07:34:30] ah yes, and i should have checked the mtime utc [07:34:35] it appears logrotate just ran a few minutes ago [07:34:41] http://pt.wikibooks.beta.wmflabs.org/robots.txt [07:35:15] ori-l: you might want to handle that [07:35:33] also I am not sure how your script will behave when log rotate kick in since it will no more have access to the file [07:35:58] it's not tailing the files; it's getting a copy of the udp stream [07:36:56] niceee [07:37:09] i was only looking in /home/wikipedia/logs because i was in the middle of writing a big bugzilla comment explaining how error logs are handled and i was fact-checking as i was going along. "so fatal.log and exception.log are in /home/wikipedia/logs... uhh.... ummm.. at least i thought they were" [07:38:06] paravoid: I am getting a second jenkins slave so I create a new role for them and took the occasion to cleanup the misc/contint.pp a bit. There is still the ugly iptables stuff that need to be phased out though :( [07:38:21] ori-l: doh [07:38:42] ori-l: if you are on writing doc, maybe you can do it on wikitech wiki, that would apply to production as well [07:46:54] hashar: I'll probably just finish the comment and then copy some parts of it to wikitech [07:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:53:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [07:56:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 07:56:48 UTC 2013 [07:57:09] !log upgrading ceph to 0.67-rc2 [07:57:19] Logged the message, Master [07:57:21] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [07:57:57] (CR) Hashar: "(3 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/75498 (owner: Hashar) [07:59:01] RECOVERY - HTTP radosgw on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.003 second response time [07:59:31] (CR) Hashar: "I was merely ranting, sorry Leslie. The bug can stay closed, people will just have to use icinga.wm.o and we are done :-]" [operations/puppet] - https://gerrit.wikimedia.org/r/75786 (owner: Lcarr) [08:00:11] RECOVERY - HTTP radosgw on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.003 second response time [08:01:01] RECOVERY - HTTP radosgw on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.004 second response time [08:04:51] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [08:09:35] mark: are you around yet ? :-] [08:10:29] mark: apparently cookies are stripped out by Varnish text cache :/ https://bugzilla.wikimedia.org/show_bug.cgi?id=51988#c5 [08:16:20] (PS1) Faidon: radosgw: don't print 100 Continue [operations/puppet] - https://gerrit.wikimedia.org/r/75831 [08:16:21] (PS1) Faidon: ceph-add-disk: update for dumpling [operations/puppet] - https://gerrit.wikimedia.org/r/75832 [08:16:55] (CR) Faidon: [C: 2] radosgw: don't print 100 Continue [operations/puppet] - https://gerrit.wikimedia.org/r/75831 (owner: Faidon) [08:16:58] (CR) Faidon: [C: 2] ceph-add-disk: update for dumpling [operations/puppet] - https://gerrit.wikimedia.org/r/75832 (owner: Faidon) [08:19:12] (PS1) QChris: Fix setting bug status in hooks-bugzilla configuration [operations/puppet] - https://gerrit.wikimedia.org/r/75834 [08:19:18] RECOVERY - Puppet freshness on ms-fe1002 is OK: puppet ran at Thu Jul 25 08:19:13 UTC 2013 [08:19:18] RECOVERY - Puppet freshness on ms-fe1004 is OK: puppet ran at Thu Jul 25 08:19:13 UTC 2013 [08:19:28] RECOVERY - Puppet freshness on ms-fe1003 is OK: puppet ran at Thu Jul 25 08:19:18 UTC 2013 [08:19:28] RECOVERY - Puppet freshness on ms-fe1001 is OK: puppet ran at Thu Jul 25 08:19:18 UTC 2013 [08:20:37] okay, that's not too bad [08:20:43] I'm done already [08:20:49] that's... refreshing :) [08:21:24] :-D [08:21:33] paravoid: wanna clean up the contint roles this morning? [08:21:45] heh [08:21:51] i was going to say "never admit that out loud!" [08:22:17] Note: grrrit-wm doesn't relay MERGED messages from anyone not jenkins-bot anymore, since there's a preceding C: 2 anyway. So... do not panic :) [08:23:49] are you asking me if I want to clean them up? [08:23:53] in that case, no :P [08:23:56] I'd rather you do it [08:23:57] ;-) [08:24:30] !log running swift->ceph thumb sync [08:24:40] Logged the message, Master [08:24:48] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 08:24:46 UTC 2013 [08:24:55] I am willing to do the cleanup but not sure how to reorganize the roles. [08:25:18] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [08:25:22] role::ci::slave to install jenkins + CI dependency seems good enough since Jenkins is not used outside of CI for now [08:25:40] indeed [08:25:44] and role::ci::master [08:26:08] gotta need role::ci::zuul as well :-] [08:26:20] why? [08:26:25] it's all on the same box isn't it? [08:26:53] yup currently [08:27:01] but Zuul / Jenkins master are really independant [08:27:08] though they communicate with each other. [08:27:52] I might also setup a second jenkins master for failover [08:28:06] that means different jenkins/zuul modules, not different role classes :) [08:28:14] Zuul latest version supports multiple master to trigger jobs, so whenever a jenkins is rebooting, jobs are still triggered :-] [08:31:21] paravoid: for gerrit , I am not sure what you meant at https://gerrit.wikimedia.org/r/#/c/75498/2/manifests/role/contint.pp,unified [08:31:29] I did a reply if you can have a look at it [08:33:05] so, could you lay here the hierarchy as you envision it? [08:33:35] role::ci::[...] -> (contint ->) jenkins, zuul is how I envision it [08:33:37] er [08:33:47] role::ci::[...] -> (contint ->) jenkins, zuul, gerrit [08:34:10] jenkins, zuul and gerrit being parameterized [08:36:26] paravoid: something like http://paste.openstack.org/show/41727/ [08:36:58] sec [08:37:05] with the software themselves staying modules as they are with parameters [08:40:07] sounds good [08:40:13] but [08:40:19] this assumes all role::jenkins classes are gone, right? [08:41:11] let me rephrase that: what those role::ci classses will include? [08:43:01] that is the over engineering question [08:43:02] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [08:43:22] I though that role::jenkins::{slave,master} could eventually be reused by another team [08:43:26] such as fundraiser team [08:43:57] but there is probably no point in over engineering that and I can most probably rename role::jenkins::{slave,master} to role::ci::{slave,master} [08:44:03] less class involved this way [08:44:06] well, FR can reuse the "jenkins" classes [08:44:20] and if we see the need for common functionality, we can move from role::ci -> jenkins [08:44:34] but right now the role::jenkins class e.g. assumes you've put SSDs and did all those tmpfs stuff [08:44:38] which is very setup-specific [08:44:52] indeed [08:45:05] so that makes little sense to keep them named role::jenkins which would be missleading [08:45:12] correct [08:45:20] i like that [08:45:35] moving from role::jenkins to role::ci and filling these with the rest [08:45:37] + the role::jenkins::slave::production assume the master is gallium (the ssh key restrict connections to gallium) [08:45:43] right [08:46:07] sounds good will work on that [08:46:31] then you raised a question about including the role::gerrit::production::replicationdest in role::ci::slave [08:46:31] https://gerrit.wikimedia.org/r/#/c/75498/2/manifests/role/contint.pp,unified [08:46:52] that is needed to setup the gerrit-slave user which is used by the Gerrit server to ssh to the Jenkins slave and push the git objects [08:47:10] yeah [08:47:12] that's not a role [08:47:22] but I realize this isn't your thing [08:47:30] so let's fix this later, talking things over with Gerrit people :) [08:47:40] good :-) [08:47:47] I am renaming my class and phasing out role::jenkins so [08:47:52] perfect [08:48:30] want a big huge patch or a lot of small ones ? :-] [08:48:47] one big patch with a clear purpose is fine [08:50:06] thanks :) [08:50:08] you rock :) [08:52:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:53:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [08:54:42] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 08:54:40 UTC 2013 [08:55:22] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [09:15:34] and I found a bug [09:15:35] \O/ [09:17:58] so where did the cookies go [09:18:00] i want my cookies [09:18:32] mark: no idea :-] But we can't login on beta if you want to track it down there [09:21:16] the frontends are eating the cookies [09:21:24] they must be hungry [09:21:40] grbhab need yet another module. [09:22:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:22:25] paravoid: I got to generate a .gitconfig for the nodes which is a basic flat file. Wondering if that should be a module such as git::userconfig which would expand a .gitconfig erb template with things such as username and email. [09:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [09:24:54] i found the orig-cookies [09:25:02] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 09:24:52 UTC 2013 [09:25:22] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [09:30:24] (PS1) Mark Bergsma: Fix typo [operations/puppet] - https://gerrit.wikimedia.org/r/75838 [09:30:40] (CR) Mark Bergsma: [C: 2] Fix typo [operations/puppet] - https://gerrit.wikimedia.org/r/75838 (owner: Mark Bergsma) [09:31:55] (PS3) Hashar: creates role::ci::{master,slave,website} [operations/puppet] - https://gerrit.wikimedia.org/r/75498 [09:33:12] (CR) Hashar: "Rebased / refactored. The change introduces role::ci:: classes which takes care of setting up SSD and including modules. I have moved a" [operations/puppet] - https://gerrit.wikimedia.org/r/75498 (owner: Hashar) [09:35:01] moar coffee [09:35:36] moar puppet activity [09:36:10] moar sleep [09:40:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:41:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.144 second response time [09:45:16] (PS4) MaxSem: Remove mobile hacks from bits [operations/puppet] - https://gerrit.wikimedia.org/r/73342 [09:49:55] (CR) Faidon: [C: 2] creates role::ci::{master,slave,website} [operations/puppet] - https://gerrit.wikimedia.org/r/75498 (owner: Hashar) [09:51:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:51:59] hashar: [09:52:13] i'm just getting uncacheable responses from the mediawiki backends on beta [09:53:08] mark: ? [09:53:13] (the puppet activity bit) [09:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.519 second response time [09:55:06] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 09:54:58 UTC 2013 [09:55:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [09:56:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:56:45] (CR) Aklapper: [C: 1] Fix setting bug status in hooks-bugzilla configuration [operations/puppet] - https://gerrit.wikimedia.org/r/75834 (owner: QChris) [09:57:06] mark: more details , [09:57:07] ? [09:57:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [10:06:10] MaxSem: your previous patchset said after Aug 17 [10:06:32] what changed? [10:06:36] paravoid, we flushed caches yesterday [10:10:06] oh [10:10:26] we did? [10:10:29] why? [10:11:06] https://rt.wikimedia.org/Ticket/Display.html?id=5267 [10:11:58] ok [10:11:59] thanks [10:12:28] nothing on SAL, ticket's still open [10:12:36] I wonder if whoever did this thought of esams [10:12:45] Ryan did [10:13:06] no idea about esams:) [10:15:08] does it work from russia? :) [10:15:42] oh god damnit [10:17:11] (PS1) Mark Bergsma: Use bereq instead of req in vcl_pass/vcl_miss [operations/puppet] - https://gerrit.wikimedia.org/r/75842 [10:17:29] (PS1) Aklapper: Add new bug status 'PATCH_TO_REVIEW' to the queries [operations/puppet] - https://gerrit.wikimedia.org/r/75843 [10:18:21] (PS2) Mark Bergsma: Use bereq instead of req in vcl_pass/vcl_miss [operations/puppet] - https://gerrit.wikimedia.org/r/75842 [10:18:37] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [10:19:18] (CR) Mark Bergsma: [C: 2] Use bereq instead of req in vcl_pass/vcl_miss [operations/puppet] - https://gerrit.wikimedia.org/r/75842 (owner: Mark Bergsma) [10:20:37] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [10:21:27] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [10:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:22:27] lunch bbl [10:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [10:23:57] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused [10:25:07] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 10:24:57 UTC 2013 [10:25:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [10:25:57] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.407 second response time [10:36:27] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown [10:40:26] RECOVERY - NTP on mw31 is OK: NTP OK: Offset -0.008817672729 secs [10:54:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 10:54:41 UTC 2013 [10:55:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [11:23:45] (CR) Physikerwelt: "Deyan... can you check the list of dependencies and specify which dependency is needed at which phase?" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth) [11:25:02] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 11:24:58 UTC 2013 [11:25:22] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [11:26:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:27:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [11:28:42] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [11:29:03] paravoid: back around. Someone in my coworking place is asking me whether we looked at Riak http://basho.com/riak-cloud-storage/ Sees to provide functionalities similar to swift/ceph. [11:29:46] (PS2) Petr Onderka: made indexes into trees [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/75668 [11:30:32] I have looked at it a bit [11:30:39] most of the interesting features are in the paid-for version [11:30:45] like geo replication [11:30:46] yeah that is what I thought :] [11:31:52] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:32:18] that person recently phased out its Ceph installation [11:32:40] he had too many troubles upgrading and noticed file chunks disappearing and causing cluster corruptions [11:32:48] I don't have all the details though [11:40:25] (CR) Deyan: "I can't understand the context -- I can only read Physikerwelt's comments, and not AzaToth's." [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth) [11:59:45] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 11:59:40 UTC 2013 [12:00:18] hashar, is there even one distributed storage that doesn't suck?:P [12:00:25] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [12:06:07] MaxSem: that has been the subject of my IRL conversations here for the last half hour :-] [12:09:33] (CR) Physikerwelt: "I was talking about this file" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth) [12:11:35] we've had several severe issues with ceph [12:11:40] but none related to corruption, fortunately [12:11:49] (or integrity in general) [12:16:21] paravoid, I heard we're not gonna use it? [12:17:49] it's still in pilot [12:17:54] and we're proceeding with it [12:17:59] but drafting plan Bs [12:24:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 12:24:47 UTC 2013 [12:25:01] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [12:26:50] ah finally the Riak evangelist is gone :] [12:29:06] haha [12:29:49] * YuviPanda proposes using MongoDB to replace Ceph [12:29:52] clearly webscale [12:33:32] paravoid: he had some valid concerns about Ceph and like some features of Riak. But I have simply no clue what it invovles [12:33:46] + I had to figure out how to write puppet spec [12:34:15] (PS1) Hashar: git::userconfig to easily craft .gitconfig files [operations/puppet] - https://gerrit.wikimedia.org/r/75855 [12:34:16] (PS1) Hashar: contint: jenkins .gitconfig generated by git::userconfig [operations/puppet] - https://gerrit.wikimedia.org/r/75856 [12:34:19] evil ^^^^^ [12:34:28] mysql, swift, ceph, redis, NFS [12:34:34] do we use anything else for anything related to storage? [12:34:40] persistance, rather? [12:34:41] memcached [12:34:44] we don't use redis for storage [12:34:50] nor mysql [12:34:57] depends on how you define storage actually [12:35:04] paravoid: so I noticed some contint manifest craft a jenkins .gitconfig file. That needs to happen on each slave (aka /var/lib/jenkins-slave/.gitconfig ) so I created a new git::userconfig define in a git module to easily craft a .gitconfig :-D [12:35:37] paravoid: yeah, 'persistance' than storage [12:36:17] (CR) Hashar: "Andrew, Alexandros, that is my first rspec writing :-] Added you as reviewers for your information." [operations/puppet] - https://gerrit.wikimedia.org/r/75855 (owner: Hashar) [12:38:04] (PS1) Manybubbles: Fix syntax error in elasticsearch. [operations/puppet] - https://gerrit.wikimedia.org/r/75858 [12:43:01] (CR) Hashar: "Our way to handle iptables rules in puppet is definitely crazy :-/ I am sure we have a bunch of public host with nrpe listening and not b" [operations/puppet] - https://gerrit.wikimedia.org/r/75777 (owner: Demon) [12:45:25] (PS2) Hashar: fix system_role for role::protoproxy::ssl::beta [operations/puppet] - https://gerrit.wikimedia.org/r/75074 [12:45:53]