[00:00:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [00:13:52] !log upgrading wikitech-static to 1.22wmf11 [00:14:03] Logged the message, Master [00:17:52] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [00:24:42] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 00:24:40 UTC 2013 [00:25:22] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [00:43:04] (PS1) Ori.livneh: Don't attempt to send beta labs errors/fatals to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/75821 [00:44:20] Ryan_Lane: got a sec for CR/merge of super simple patch? [00:44:25] sure [00:44:30] the above? [00:44:35] yep :) [00:45:28] heh [00:45:37] yeah template_variables. ick [00:45:46] (CR) Ryan Lane: [C: 2] Don't attempt to send beta labs errors/fatals to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/75821 (owner: Ori.livneh) [00:46:18] much appreciated [00:46:26] yw [00:54:12] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 220 seconds [00:54:43] hrm, i'm not pulling it in with puppetd -tv [00:54:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 00:54:50 UTC 2013 [00:55:03] no? [00:55:08] on labs or in production? [00:55:12] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [00:55:12] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [00:55:15] labs [00:55:21] we made some changes recently [00:55:23] i'll check prod [00:55:25] let me make sure it's still working [00:56:32] heh, actually we shouldn't expect to see anything in prod since the rendered template should remain the same there [00:56:38] but labs isn't pulling it in, at least. 
[00:58:12] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [01:00:13] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds [01:00:44] ori-l: should be fixed [01:00:56] cron was pointing at the wrong spot [01:01:16] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not parse for environment production: No file(s) found for import of '../private/manifests/passwords.pp' at /etc/puppet/manifests/base.pp:10 on node i-00000390.pmtpa.wmflabs [01:02:03] well shit [01:02:03] heh [01:02:39] look at the bright side: that's probably a better outcome than actually successfully pulling prod's passwords.pp [01:02:44] ah [01:02:48] missing symlink [01:04:15] now it's fixed [01:04:23] heh [01:04:32] that private repo never leaves those systems ;) [01:05:00] seems to be working [01:05:06] great [01:05:28] and it pulled the change, which rendered correctly. sweet! thanks again. [01:05:53] tw [01:05:55] yw* [01:07:04] can you close the bug for that please [01:07:18] there's a bug for that? 
[01:13:15] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [01:15:15] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [01:18:15] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [01:20:15] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [01:24:45] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 01:24:42 UTC 2013 [01:25:15] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [01:28:15] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [01:30:15] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds [01:33:15] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 186 seconds [01:35:15] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [01:48:17] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [01:49:17] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 24 seconds [01:49:17] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[01:50:17] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [01:54:57] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 01:54:52 UTC 2013 [01:55:07] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [01:58:17] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [02:00:17] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds [02:15:32] !log LocalisationUpdate completed (1.22wmf11) at Thu Jul 25 02:15:32 UTC 2013 [02:15:43] Logged the message, Master [02:18:11] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 186 seconds [02:20:11] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [02:27:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 02:27:45 UTC 2013 [02:28:11] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [02:29:13] !log LocalisationUpdate completed (1.22wmf10) at Thu Jul 25 02:29:13 UTC 2013 [02:29:23] Logged the message, Master [02:33:11] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [02:35:11] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [02:43:11] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [02:45:11] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [02:47:27] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 25 02:47:27 UTC 2013 [02:47:37] Logged the message, Master [02:55:01] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 02:54:51 UTC 2013 [02:55:11] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [02:59:11] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 212 seconds [03:00:11] RECOVERY - MySQL Slave Delay on db1008 is OK: OK 
replication delay 7 seconds [03:14:15] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 231 seconds [03:15:15] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [03:18:15] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 187 seconds [03:20:15] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 6 seconds [03:24:45] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 03:24:41 UTC 2013 [03:25:15] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [03:50:03] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [03:50:03] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:50:03] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [03:50:03] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [03:50:03] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [03:50:04] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [03:55:03] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: No successful Puppet run in the last 10 hours [03:59:13] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 190 seconds [04:00:13] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [04:01:03] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: No successful Puppet run in the last 10 hours [04:01:43] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 04:01:41 UTC 2013 [04:02:13] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [04:04:13] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 181 seconds [04:05:13] RECOVERY - MySQL Slave Delay on db1008 is 
OK: OK replication delay 6 seconds [04:06:03] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: No successful Puppet run in the last 10 hours [04:22:22] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: No successful Puppet run in the last 10 hours [04:24:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 04:24:46 UTC 2013 [04:25:42] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [04:33:52] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 221 seconds [04:38:55] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 21 seconds [04:48:50] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 221 seconds [04:53:50] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 25 seconds [04:54:50] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 04:54:41 UTC 2013 [04:55:40] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [05:13:46] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 221 seconds [05:17:46] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 14 seconds [05:24:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 05:24:39 UTC 2013 [05:25:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [05:33:46] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 197 seconds [05:35:51] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 7 seconds [05:43:51] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 221 seconds [05:53:51] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 26 seconds [05:54:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 05:54:47 UTC 2013 [05:55:21] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours 
[05:58:51] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 221 seconds [06:04:51] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 214 seconds [06:08:52] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 27 seconds [06:13:22] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:14:52] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 183 seconds [06:16:18] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [06:18:52] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 221 seconds [06:23:52] PROBLEM - Puppet freshness on analytics1017 is CRITICAL: No successful Puppet run in the last 10 hours [06:23:52] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 27 seconds [06:24:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 06:24:47 UTC 2013 [06:25:22] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [06:54:47] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 06:54:41 UTC 2013 [06:54:53] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run in the last 10 hours [06:55:23] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [06:56:53] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [07:01:10] (CR) Faidon: [C: -1] ""include contint::" doesn't have any place in site.pp. This should be included in the jenkins (= contint) role classes." 
[operations/puppet] - https://gerrit.wikimedia.org/r/75497 (owner: Hashar) [07:02:58] (PS2) Faidon: phase out misc::contint::test::packages [operations/puppet] - https://gerrit.wikimedia.org/r/75497 (owner: Hashar) [07:03:13] (CR) Faidon: [C: 2] "Oh, I just realized that the subsequent commit moves this under a role class :)" [operations/puppet] - https://gerrit.wikimedia.org/r/75497 (owner: Hashar) [07:06:44] (CR) Faidon: [C: -1] "(3 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/75498 (owner: Hashar) [07:09:35] (CR) Faidon: [C: 1] "Yeah, sure, why not." [operations/software/varnish/vhtcpd] - https://gerrit.wikimedia.org/r/75128 (owner: BBlack) [07:13:13] hashar: (moving a thread to this channel to keep things sane) re: fluoride, yes -- ! i have a half-written update i need to finish and post on bugzilla [07:17:04] ori-l: if there is a way I could help let me know :) [07:24:54] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 07:24:47 UTC 2013 [07:25:24] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [07:25:28] ori-l: """ Don't attempt to send beta labs errors/fatals to vanadium""" well done! [07:25:51] ori-l: I noticed that a while ago but was not able to find out where it was defined [07:28:31] yeah, so deployment-fluoride now gets a copy of errors/fatals on udp 8423 [07:32:30] hashar_: hmm, exception.log and fatal.log disappeared from /home/wikipedia/logs [07:32:36] :( [07:32:40] well, they got gzipped and rotated to archive/, but i'd expected them to have been recreated by now [07:32:53] unless no exception / fatal happened [07:33:20] seems unlikely, but they're easy to generate [07:33:20] http://www.pt.wikibooks.beta.wmflabs.org/robots.txt [07:33:30] ;] [07:34:00] that created fatal.log \O/ [07:34:07] No robots.php for beta labs? 
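For readers following the fluoride discussion above: a host that receives its own copy of the errors/fatals stream on udp 8423 could read it roughly as in the sketch below. This is a minimal illustration in Python, not the script actually running on deployment-fluoride. Only the port number comes from the conversation; the function names, buffer size and text decoding are assumptions.

```python
import socket

ERROR_PORT = 8423  # port mentioned in the conversation; everything else is illustrative


def open_listener(host="0.0.0.0", port=ERROR_PORT, timeout=None):
    """Bind a UDP socket that receives a copy of the error/fatal stream."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(timeout)  # None means block until a datagram arrives
    return sock


def read_messages(sock, count, bufsize=65535):
    """Read `count` datagrams and return them as decoded text lines."""
    messages = []
    for _ in range(count):
        payload, _sender = sock.recvfrom(bufsize)
        messages.append(payload.decode("utf-8", errors="replace").rstrip("\n"))
    return messages
```

A consumer built this way depends only on the datagrams it is sent, so rotating or recreating fatal.log and exception.log on the log host should not interrupt it.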
[07:34:09] maybe we can have log rotate recreate an empty file for us [07:34:18] Elsie: it just www.pt.wikibooks is not recognized [07:34:30] ah yes, and i should have checked the mtime utc [07:34:35] it appears logrotate just ran a few minutes ago [07:34:41] http://pt.wikibooks.beta.wmflabs.org/robots.txt [07:35:15] ori-l: you might want to handle that [07:35:33] also I am not sure how your script will behave when log rotate kick in since it will no more have access to the file [07:35:58] it's not tailing the files; it's getting a copy of the udp stream [07:36:56] niceee [07:37:09] i was only looking in /home/wikipedia/logs because i was in the middle of writing a big bugzilla comment explaining how error logs are handled and i was fact-checking as i was going along. "so fatal.log and exception.log are in /home/wikipedia/logs... uhh.... ummm.. at least i thought they were" [07:38:06] paravoid: I am getting a second jenkins slave so I create a new role for them and took the occasion to cleanup the misc/contint.pp a bit. 
There is still the ugly iptables stuff that need to be phased out though :( [07:38:21] ori-l: doh [07:38:42] ori-l: if you are on writing doc, maybe you can do it on wikitech wiki, that would apply to production as well [07:46:54] hashar: I'll probably just finish the comment and then copy some parts of it to wikitech [07:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:53:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [07:56:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 07:56:48 UTC 2013 [07:57:09] !log upgrading ceph to 0.67-rc2 [07:57:19] Logged the message, Master [07:57:21] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [07:57:57] (CR) Hashar: "(3 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/75498 (owner: Hashar) [07:59:01] RECOVERY - HTTP radosgw on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.003 second response time [07:59:31] (CR) Hashar: "I was merely ranting, sorry Leslie. The bug can stay closed, people will just have to use icinga.wm.o and we are done :-]" [operations/puppet] - https://gerrit.wikimedia.org/r/75786 (owner: Lcarr) [08:00:11] RECOVERY - HTTP radosgw on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.003 second response time [08:01:01] RECOVERY - HTTP radosgw on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.004 second response time [08:04:51] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [08:09:35] mark: are you around yet ? 
:-] [08:10:29] mark: apparently cookies are stripped out by Varnish text cache :/ https://bugzilla.wikimedia.org/show_bug.cgi?id=51988#c5 [08:16:20] (PS1) Faidon: radosgw: don't print 100 Continue [operations/puppet] - https://gerrit.wikimedia.org/r/75831 [08:16:21] (PS1) Faidon: ceph-add-disk: update for dumpling [operations/puppet] - https://gerrit.wikimedia.org/r/75832 [08:16:55] (CR) Faidon: [C: 2] radosgw: don't print 100 Continue [operations/puppet] - https://gerrit.wikimedia.org/r/75831 (owner: Faidon) [08:16:58] (CR) Faidon: [C: 2] ceph-add-disk: update for dumpling [operations/puppet] - https://gerrit.wikimedia.org/r/75832 (owner: Faidon) [08:19:12] (PS1) QChris: Fix setting bug status in hooks-bugzilla configuration [operations/puppet] - https://gerrit.wikimedia.org/r/75834 [08:19:18] RECOVERY - Puppet freshness on ms-fe1002 is OK: puppet ran at Thu Jul 25 08:19:13 UTC 2013 [08:19:18] RECOVERY - Puppet freshness on ms-fe1004 is OK: puppet ran at Thu Jul 25 08:19:13 UTC 2013 [08:19:28] RECOVERY - Puppet freshness on ms-fe1003 is OK: puppet ran at Thu Jul 25 08:19:18 UTC 2013 [08:19:28] RECOVERY - Puppet freshness on ms-fe1001 is OK: puppet ran at Thu Jul 25 08:19:18 UTC 2013 [08:20:37] okay, that's not too bad [08:20:43] I'm done already [08:20:49] that's... refreshing :) [08:21:24] :-D [08:21:33] paravoid: wanna clean up the contint roles this morning? [08:21:45] heh [08:21:51] i was going to say "never admit that out loud!" [08:22:17] Note: grrrit-wm doesn't relay MERGED messages from anyone not jenkins-bot anymore, since there's a preceding C: 2 anyway. So... do not panic :) [08:23:49] are you asking me if I want to clean them up? 
[08:23:53] in that case, no :P [08:23:56] I'd rather you do it [08:23:57] ;-) [08:24:30] !log running swift->ceph thumb sync [08:24:40] Logged the message, Master [08:24:48] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 08:24:46 UTC 2013 [08:24:55] I am willing to do the cleanup but not sure how to reorganize the roles. [08:25:18] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [08:25:22] role::ci::slave to install jenkins + CI dependency seems good enough since Jenkins is not used outside of CI for now [08:25:40] indeed [08:25:44] and role::ci::master [08:26:08] gotta need role::ci::zuul as well :-] [08:26:20] why? [08:26:25] it's all on the same box isn't it? [08:26:53] yup currently [08:27:01] but Zuul / Jenkins master are really independant [08:27:08] though they communicate with each other. [08:27:52] I might also setup a second jenkins master for failover [08:28:06] that means different jenkins/zuul modules, not different role classes :) [08:28:14] Zuul latest version supports multiple master to trigger jobs, so whenever a jenkins is rebooting, jobs are still triggered :-] [08:31:21] paravoid: for gerrit , I am not sure what you meant at https://gerrit.wikimedia.org/r/#/c/75498/2/manifests/role/contint.pp,unified [08:31:29] I did a reply if you can have a look at it [08:33:05] so, could you lay here the hierarchy as you envision it? [08:33:35] role::ci::[...] -> (contint ->) jenkins, zuul is how I envision it [08:33:37] er [08:33:47] role::ci::[...] -> (contint ->) jenkins, zuul, gerrit [08:34:10] jenkins, zuul and gerrit being parameterized [08:36:26] paravoid: something like http://paste.openstack.org/show/41727/ [08:36:58] sec [08:37:05] with the software themselves staying modules as they are with parameters [08:40:07] sounds good [08:40:13] but [08:40:19] this assumes all role::jenkins classes are gone, right? [08:41:11] let me rephrase that: what those role::ci classses will include? 
[08:43:01] that is the over engineering question [08:43:02] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [08:43:22] I though that role::jenkins::{slave,master} could eventually be reused by another team [08:43:26] such as fundraiser team [08:43:57] but there is probably no point in over engineering that and I can most probably rename role::jenkins::{slave,master}  to role::ci::{slave,master} [08:44:03] less class involved this way [08:44:06] well, FR can reuse the "jenkins" classes [08:44:20] and if we see the need for common functionality, we can move from role::ci -> jenkins [08:44:34] but right now the role::jenkins class e.g. assumes you've put SSDs and did all those tmpfs stuff [08:44:38] which is very setup-specific [08:44:52] indeed [08:45:05] so that makes little sense to keep them named role::jenkins which would be missleading [08:45:12] correct [08:45:20] i like that [08:45:35] moving from role::jenkins to role::ci and filling these with the rest [08:45:37] + the role::jenkins::slave::production assume the master is gallium (the ssh key restrict connections to gallium) [08:45:43] right [08:46:07] sounds good will work on that [08:46:31] then you raised a question about including the role::gerrit::production::replicationdest in role::ci::slave [08:46:31] https://gerrit.wikimedia.org/r/#/c/75498/2/manifests/role/contint.pp,unified [08:46:52] that is needed to setup the gerrit-slave user which is used by the Gerrit server to ssh to the Jenkins slave and push the git objects [08:47:10] yeah [08:47:12] that's not a role [08:47:22] but I realize this isn't your thing [08:47:30] so let's fix this later, talking things over with Gerrit people :) [08:47:40] good :-) [08:47:47] I am renaming my class and phasing out role::jenkins so [08:47:52] perfect [08:48:30] want a big huge patch or a lot of small ones ? 
:-] [08:48:47] one big patch with a clear purpose is fine [08:50:06] thanks :) [08:50:08] you rock :) [08:52:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:53:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [08:54:42] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 08:54:40 UTC 2013 [08:55:22] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [09:15:34] and I found a bug [09:15:35] \O/ [09:17:58] so where did the cookies go [09:18:00] i want my cookies [09:18:32] mark: no idea :-] But we can't login on beta if you want to track it down there [09:21:16] the frontends are eating the cookies [09:21:24] they must be hungry [09:21:40] grbhab need yet another module. [09:22:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:22:25] paravoid: I got to generate a .gitconfig for the nodes which is a basic flat file. Wondering if that should be a module such as git::userconfig which would expand a .gitconfig erb template with things such as username and email. 
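The git::userconfig idea above amounts to filling a small flat-file template with a user name and an email address. As a rough illustration of the output such a define might produce, here is a Python sketch. The real implementation would be a Puppet define with an ERB template; the function name, the single [user] section and the example identity used below are assumptions.

```python
def render_gitconfig(name, email):
    """Return the text of a minimal .gitconfig containing a [user] section."""
    for value in (name, email):
        # A newline inside a value would break the flat key = value layout.
        if "\n" in value:
            raise ValueError("values must be single-line")
    return "[user]\n\tname = {0}\n\temail = {1}\n".format(name, email)
```

Calling render_gitconfig("Jenkins Bot", "jenkins@example.org") returns a three-line file: a [user] header followed by tab-indented name and email entries, which is the layout git itself writes for these keys.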
[09:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [09:24:54] i found the orig-cookies [09:25:02] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 09:24:52 UTC 2013 [09:25:22] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [09:30:24] (PS1) Mark Bergsma: Fix typo [operations/puppet] - https://gerrit.wikimedia.org/r/75838 [09:30:40] (CR) Mark Bergsma: [C: 2] Fix typo [operations/puppet] - https://gerrit.wikimedia.org/r/75838 (owner: Mark Bergsma) [09:31:55] (PS3) Hashar: creates role::ci::{master,slave,website} [operations/puppet] - https://gerrit.wikimedia.org/r/75498 [09:33:12] (CR) Hashar: "Rebased / refactored. The change introduces role::ci:: classes which takes care of setting up SSD and including modules. I have moved a" [operations/puppet] - https://gerrit.wikimedia.org/r/75498 (owner: Hashar) [09:35:01] moar coffee [09:35:36] moar puppet activity [09:36:10] moar sleep [09:40:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:41:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.144 second response time [09:45:16] (PS4) MaxSem: Remove mobile hacks from bits [operations/puppet] - https://gerrit.wikimedia.org/r/73342 [09:49:55] (CR) Faidon: [C: 2] creates role::ci::{master,slave,website} [operations/puppet] - https://gerrit.wikimedia.org/r/75498 (owner: Hashar) [09:51:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:51:59] hashar: [09:52:13] i'm just getting uncacheable responses from the mediawiki backends on beta [09:53:08] mark: ? 
[09:53:13] (the puppet activity bit) [09:53:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.519 second response time [09:55:06] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 09:54:58 UTC 2013 [09:55:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [09:56:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:56:45] (CR) Aklapper: [C: 1] Fix setting bug status in hooks-bugzilla configuration [operations/puppet] - https://gerrit.wikimedia.org/r/75834 (owner: QChris) [09:57:06] mark: more details , [09:57:07] ? [09:57:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time [10:06:10] MaxSem: your previous patchset said after Aug 17 [10:06:32] what changed? [10:06:36] paravoid, we flushed caches yesterday [10:10:06] oh [10:10:26] we did? [10:10:29] why? [10:11:06] https://rt.wikimedia.org/Ticket/Display.html?id=5267 [10:11:58] ok [10:11:59] thanks [10:12:28] nothing on SAL, ticket's still open [10:12:36] I wonder if whoever did this thought of esams [10:12:45] Ryan did [10:13:06] no idea about esams:) [10:15:08] does it work from russia? 
:) [10:15:42] oh god damnit [10:17:11] (PS1) Mark Bergsma: Use bereq instead of req in vcl_pass/vcl_miss [operations/puppet] - https://gerrit.wikimedia.org/r/75842 [10:17:29] (PS1) Aklapper: Add new bug status 'PATCH_TO_REVIEW' to the queries [operations/puppet] - https://gerrit.wikimedia.org/r/75843 [10:18:21] (PS2) Mark Bergsma: Use bereq instead of req in vcl_pass/vcl_miss [operations/puppet] - https://gerrit.wikimedia.org/r/75842 [10:18:37] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [10:19:18] (CR) Mark Bergsma: [C: 2] Use bereq instead of req in vcl_pass/vcl_miss [operations/puppet] - https://gerrit.wikimedia.org/r/75842 (owner: Mark Bergsma) [10:20:37] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [10:21:27] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [10:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:22:27] lunch bbl [10:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [10:23:57] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused [10:25:07] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 10:24:57 UTC 2013 [10:25:27] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [10:25:57] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.407 second response time [10:36:27] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown [10:40:26] RECOVERY - NTP on mw31 is OK: NTP OK: Offset -0.008817672729 secs [10:54:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 10:54:41 UTC 2013 [10:55:26] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [11:23:45] (CR) Physikerwelt: "Deyan... 
can you check the list of dependencies and specify which dependency is needed at which phase?" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth) [11:25:02] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 11:24:58 UTC 2013 [11:25:22] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [11:26:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:27:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [11:28:42] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [11:29:03] paravoid: back around. Someone in my coworking place is asking me whether we looked at Riak http://basho.com/riak-cloud-storage/ Sees to provide functionalities similar to swift/ceph. [11:29:46] (PS2) Petr Onderka: made indexes into trees [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/75668 [11:30:32] I have looked at it a bit [11:30:39] most of the interesting features are in the paid-for version [11:30:45] like geo replication [11:30:46] yeah that is what I thought :] [11:31:52] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:32:18] that person recently phased out its Ceph installation [11:32:40] he had too many troubles upgrading and noticed file chunks disappearing and causing cluster corruptions [11:32:48] I don't have all the details though [11:40:25] (CR) Deyan: "I can't understand the context -- I can only read Physikerwelt's comments, and not AzaToth's." 
[operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth) [11:59:45] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 11:59:40 UTC 2013 [12:00:18] hashar, is there even one distributed storage that doesn't suck?:P [12:00:25] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [12:06:07] MaxSem: that has been the subject of my IRL conversations here for the last half hour :-] [12:09:33] (CR) Physikerwelt: "I was talking about this file" [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth) [12:11:35] we've had several severe issues with ceph [12:11:40] but none related to corruption, fortunately [12:11:49] (or integrity in general) [12:16:21] paravoid, I heard we're not gonna use it? [12:17:49] it's still in pilot [12:17:54] and we're proceeding with it [12:17:59] but drafting plan Bs [12:24:51] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 12:24:47 UTC 2013 [12:25:01] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [12:26:50] ah finally the Riak evangelist is gone :] [12:29:06] haha [12:29:49] * YuviPanda proposes using MongoDB to replace Ceph [12:29:52] clearly webscale [12:33:32] paravoid: he had some valid concerns about Ceph and like some features of Riak. But I have simply no clue what it invovles [12:33:46] + I had to figure out how to write puppet spec [12:34:15] (PS1) Hashar: git::userconfig to easily craft .gitconfig files [operations/puppet] - https://gerrit.wikimedia.org/r/75855 [12:34:16] (PS1) Hashar: contint: jenkins .gitconfig generated by git::userconfig [operations/puppet] - https://gerrit.wikimedia.org/r/75856 [12:34:19] evil ^^^^^ [12:34:28] mysql, swift, ceph, redis, NFS [12:34:34] do we use anything else for anything related to storage? [12:34:40] persistance, rather? 
[12:34:41] memcached [12:34:44] we don't use redis for storage [12:34:50] nor mysql [12:34:57] depends on how you define storage actually [12:35:04] paravoid: so I noticed some contint manifest craft a jenkins .gitconfig file. That needs to happen on each slave (aka /var/lib/jenkins-slave/.gitconfig ) so I created a new git::userconfig define in a git module to easily craft a .gitconfig :-D [12:35:37] paravoid: yeah, 'persistance' than storage [12:36:17] (CR) Hashar: "Andrew, Alexandros, that is my first rspec writing :-] Added you as reviewers for your information." [operations/puppet] - https://gerrit.wikimedia.org/r/75855 (owner: Hashar) [12:38:04] (PS1) Manybubbles: Fix syntax error in elasticsearch. [operations/puppet] - https://gerrit.wikimedia.org/r/75858 [12:43:01] (CR) Hashar: "Our way to handle iptables rules in puppet is definitely crazy :-/ I am sure we have a bunch of public host with nrpe listening and not b" [operations/puppet] - https://gerrit.wikimedia.org/r/75777 (owner: Demon) [12:45:25] (PS2) Hashar: fix system_role for role::protoproxy::ssl::beta [operations/puppet] - https://gerrit.wikimedia.org/r/75074 [12:45:53] gallium has too many system roles : http://paste.openstack.org/show/41767/ :D [12:46:16] paravoid, we use MySQL for ExternalStorage:P [12:47:10] "storage" [12:48:14] !log gallium cleared out /etc/update-motd.d/05* which adds the roles in motd. They got out of sync. 
Rerunning puppet [12:48:24] Logged the message, Master [12:55:12] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 12:55:07 UTC 2013 [12:56:02] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [12:57:26] (PS2) Hashar: contint: generate .gitconfig files for all jenkins users [operations/puppet] - https://gerrit.wikimedia.org/r/75856 [12:58:00] (CR) Hashar: "Moaar contint related cleanup :-]" [operations/puppet] - https://gerrit.wikimedia.org/r/75856 (owner: Hashar) [13:03:30] (PS3) Hashar: contint: python dependency for publish-console.py [operations/puppet] - https://gerrit.wikimedia.org/r/75632 [13:09:43] (PS1) Mark Bergsma: Fix XFF handling on all Varnish clusters [operations/puppet] - https://gerrit.wikimedia.org/r/75860 [13:24:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 13:24:43 UTC 2013 [13:24:54] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [13:33:07] (PS3) Petr Onderka: made indexes into trees [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/75668 [13:34:05] (CR) Faidon: [C: 2] Fix syntax error in elasticsearch. [operations/puppet] - https://gerrit.wikimedia.org/r/75858 (owner: Manybubbles) [13:34:42] mark: is your patch "use beret instead of req in vcl_pass/vcl_miss" the thing that fixed the login issue on beta? https://gerrit.wikimedia.org/r/#/c/75842/ [13:34:51] The authenticity of host 'stafford.pmtpa.wmnet (10.0.0.24)' can't be established. [13:34:51] yes [13:34:54] RSA key fingerprint is 9a:4b:ce:17:20:e3:66:3d:52:fc:3e:a3:9b:73:c5:44. [13:34:57] Are you sure you want to continue connecting (yes/no)? 
[13:34:59] wtf [13:35:02] from sockpuppet [13:35:08] hmmm [13:35:20] i just ran ssh-keygen -R for analytics1017.eqiad.wmnet [13:35:27] its been reinstalled [13:35:43] on sockpuppet [13:35:50] that wasn't the exact command [13:35:54] hm not in history [13:36:15] (CR) Hashar: "Fixed up login on beta https://bugzilla.wikimedia.org/show_bug.cgi?id=51988" [operations/puppet] - https://gerrit.wikimedia.org/r/75842 (owner: Mark Bergsma) [13:36:33] mark: thank you! Seems beta managed to catch at least two varnish bugs. That is great. [13:36:45] yes [13:37:23] stafford is still in /etc/ssh/ssh_known_hosts [13:37:33] it's the new git user [13:37:49] ah hm k [13:37:57] RECOVERY - Puppet freshness on analytics1017 is OK: puppet ran at Thu Jul 25 13:37:48 UTC 2013 [13:39:50] jeff_green: good morning...want to move db1008 today? [13:40:23] (PS1) Petr Onderka: added support for IPv6 anonymous users [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/75864 [13:42:36] heya paravoid, if I created a .deb for this: [13:42:37] https://pypi.python.org/pypi/pygeoip/ [13:42:39] what should I call it? [13:42:42] python-pygeoip? [13:42:51] python-geoip already exists, but is a different library [13:44:07] RECOVERY - Disk space on analytics1017 is OK: DISK OK [13:45:17] RECOVERY - DPKG on analytics1017 is OK: All packages OK [13:45:50] ottomata: are you sure it did not get packaged already ? 
[13:47:46] (CR) Petr Onderka: [C: 2 V: 2] made indexes into trees [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/75668 (owner: Petr Onderka) [13:48:41] ottomata: cause there is already a python-geoip in debian, seems to come straight from maxmind [13:49:17] yeah, that is a different library [13:49:57] evan is using pygeoip for some editor stuff [13:50:00] (PS2) Mark Bergsma: Fix XFF handling on all Varnish clusters [operations/puppet] - https://gerrit.wikimedia.org/r/75860 [13:50:37] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:37] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:37] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:37] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:37] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:38] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [13:51:07] ottomata: I hate it when there are several python modules all sharing the same name hehe [13:51:16] (PS2) Petr Onderka: added support for IPv6 anonymous users [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/75864 [13:51:34] (CR) Petr Onderka: [C: 2 V: 2] added support for IPv6 anonymous users [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/75864 (owner: Petr Onderka) [13:51:46] ottomata: anyway in this case I would name it python-pygeoip [13:51:52] there is already a bunch of packages named this way [13:52:19] with Source: pygeoip [13:52:23] Package: python-pygeoip [13:53:55] or : https://github.com/maxmind/geoip-api-python but that is python-geoip package [13:54:27] ok great [13:54:28] yeah [13:54:38] yeah i was gonna go wity python-pygeoip, just seems lame [13:54:42] but that is the 
name of the module, so meh :/ [13:54:47] thanks [13:54:55] can you base your soft on that module? [13:54:57] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 13:54:48 UTC 2013 [13:54:57] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [13:59:57] RECOVERY - NTP on analytics1017 is OK: NTP OK: Offset -0.02387273312 secs [14:10:45] ottomata: python-pygeoip is fine [14:10:49] but the real question is [14:10:54] why wouldn't you use python-geoip? [14:11:03] dunno, this is not my thang :) [14:11:27] ask :) [14:11:35] his last day was yesterday :p [14:11:44] lol [14:11:46] look at the code [14:11:54] if I look at it then I own it [14:12:46] it is a thing running out of his homedir on stat1, it is used to generate some data that feeds into som elimn graphs that people like, i am trying to puppetize, and this is the only dep that I don't have [14:14:04] I don't think we should do a random piece of work for an unmaintained piece of code that noone owns and that it's possibly entirely avoided [14:15:16] it should be very easy to debianize, i just want it to keep running as it is [14:15:27] i can make it run with pip install as the stats user manually, if you like [14:15:39] instead of building a deb for it [14:15:54] i prefer someone starting to own it in a way that at least knows why a dependency is A and not B [14:15:55] and just put a comment in the puppet class [14:15:57] tsk, tsk. pip is bad, mmmkay? [14:16:09] not adding features, but just standard maintenance [14:16:16] it would be a non puppetized pip run as an unprivileged user, but ja [14:17:55] is chad h on irc? 
[14:18:11] yes, as ^demon [14:18:15] when hes connected [14:18:19] paravoid, in general I agree, but this is a small thing, and people use code that is most useful to them [14:18:29] oh [14:18:34] i don't want to spend a lot of time on this [14:18:35] I can never remember names [14:18:53] we're only doing this since evan is gone, and it would be nice to have it in puppet [14:18:57] so that it isn't lost [14:19:12] qchris is sort of owning the code, but I don't htink he wants to touch and possible break it right now [14:19:12] it works [14:19:27] it is easy enough to build this .deb as a single dependency [14:19:34] and then I can puppetize it fully [14:20:29] let's add more bloat and increase our maintenance overhead because noone wants to look at the code fearing they'll own it [14:20:54] let's build a deb, then forward port it a year and a half later [14:21:03] paravoid: cranky? [14:21:06] yes [14:21:30] nope: let's add make a single library available for use so that an existing non puppetized working system can be puppetized [14:21:31] hehe [14:23:31] I'll own it [14:23:34] where is that code? [14:24:07] https://gerrit.wikimedia.org/r/#/admin/projects/analytics/editor-geocoding [14:24:42] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 14:24:38 UTC 2013 [14:24:52] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [14:25:40] it is currently checkoud in erosen's homedir on stat1 umm [14:25:43] /home/erosen/src/geowiki [14:26:24] i think not up to date there [14:26:24] but ja [14:27:13] ok [14:27:25] it's exactly two lines diff [14:27:25] YuviPanda: ? [14:27:42] and it'll perform better because it's in C [14:28:29] paravoid: just because it's in C doesn't mean it performs better :-P [14:29:30] (PS1) Ottomata: filter and collector might show more processses than 1 [operations/puppet] - https://gerrit.wikimedia.org/r/75875 [14:29:32] pygeoip is in C, no? 
[14:29:40] oh nope [14:29:42] sorry ready that all wrong [14:30:10] (CR) Ottomata: [C: 2 V: 2] filter and collector might show more processses than 1 [operations/puppet] - https://gerrit.wikimedia.org/r/75875 (owner: Ottomata) [14:33:47] https://gerrit.wikimedia.org/r/75878 [14:34:57] note that the version from pip wouldn't work according to readme.md [14:35:10] paravoid, the API is the same? [14:35:17] Note: `pip inst [14:35:17] all pygeoip` install a non-functioning version, instead `git clone` the repository and run `python s [14:35:20] etup.py install`. [14:35:20] :-) [14:35:56] AzaToth: hmm? [14:36:34] mark: Not urgent but I'm hoping you will look at https://gerrit.wikimedia.org/r/#/c/75347/ sometime. [14:37:16] paravoid, seems to work :) ok! [14:37:45] do you still think it was more worthy to package & maintain a pure python version? :) [14:38:23] oh hush with your "I told you sos"! [14:38:25] :) [14:38:57] :P [14:41:25] if you prefer, we could do try: except [14:41:32] and support both [14:41:36] but I'll leave that to you :P [14:42:29] i'm about to try it all the way through, if it finishes, i think we're good [14:42:39] k [14:42:51] well, we use the maxmind library in a bunch of places anyway [14:42:55] if it's buggy, we have bigger problems [14:42:58] yea [14:43:07] the python module is a straight api binding, nothing special there [14:51:34] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
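A minimal sketch of the `try: except` fallback floated at 14:41, assuming the C-backed MaxMind binding is importable as `GeoIP` (the python-geoip package) and the pure-Python library as `pygeoip`. The helper names `pick_backend` and `open_database` are hypothetical and nothing here is taken from the geowiki code itself.

```python
import importlib


def pick_backend(candidates=("GeoIP", "pygeoip")):
    """Return (name, module) for the first importable candidate, else (None, None)."""
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            # Not installed: fall through to the next candidate.
            continue
    return None, None


def open_database(path):
    """Open a GeoIP database file with whichever binding is installed."""
    name, module = pick_backend()
    if name == "GeoIP":
        # C binding: GeoIP.open(filename, flags)
        return module.open(path, module.GEOIP_STANDARD)
    if name == "pygeoip":
        # Pure-Python library: pygeoip.GeoIP(filename)
        return module.GeoIP(path)
    raise ImportError("neither GeoIP nor pygeoip is installed")
```

Both bindings expose lookups such as `country_code_by_addr`, so a caller would in principle not need to know which one was picked; the C binding is tried first because that is the one already used elsewhere according to the discussion.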
[14:52:30] (PS1) Ottomata: Adding misc::statistics::geowiki [operations/puppet] - https://gerrit.wikimedia.org/r/75881 [14:53:10] (CR) Ottomata: [C: 2 V: 2] Adding misc::statistics::geowiki [operations/puppet] - https://gerrit.wikimedia.org/r/75881 (owner: Ottomata) [14:53:24] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [14:54:44] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 14:54:42 UTC 2013 [14:54:54] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [14:56:59] ew [14:57:04] (PS1) Ottomata: Running geowiki-process-data cron daily [operations/puppet] - https://gerrit.wikimedia.org/r/75882 [14:57:06] git::clone and running code out of therer [14:57:09] we don't do that [14:57:14] uhhhhhhhhhh [14:57:30] from gerrit? [14:57:33] we do no? [14:57:43] no [14:57:53] there are a few cases but these are oversights and need to be fixed [14:57:57] YuviPanda: grrrit-wm has "reset by peer" a couple of times now I seen [14:58:09] AzaToth: seems to be a regular thing happening on toollabs. [14:58:10] Coren: ^ [14:58:16] k [14:58:17] AzaToth: it won't lose any messages, however. [14:58:22] since they're on Redis [14:58:23] so [14:58:42] * Coren reads backlog. [14:58:54] ther'es even a bug for it, I think? [14:59:12] YuviPanda: thought it crashed [14:59:22] AzaToth: It's not clear what causes it. I heard that logs show the IRC server is closing the connection, but I haven't seen anyone do more detailed diagnostics. [14:59:55] AzaToth: that would cause a timeout, no? [15:00:07] AzaToth: and if it crashes I'll see an exception in the log [15:00:20] hmm, let me sse logs anyway [15:00:43] YuviPanda: a exception could have been caught and the client silently reconnected [15:00:57] I didn't mean a segfault type crash [15:01:25] I meant a python like crash where the important exception is masked away a couple of callbacks away [15:01:38] AzaToth: I log errors. 
Don't see any [15:03:02] YuviPanda: nothing intresting around (last hour):26:40? [15:03:36] you have log level ERR+? [15:03:37] AzaToth: nope [15:03:41] i see a restart [15:03:43] no error info [15:03:47] let me poke at it a bit more in a bit [15:03:55] so the client restarted [15:04:16] what language is it? [15:05:20] If you need some background music while poking around in the code, I can recommend http://www.the90sbutton.com [15:05:49] AzaToth: it did. [15:05:51] no error messages [15:05:53] no idea why it is hpapening either [15:06:29] :D [15:06:34] will try out when I'm on less shitty internet [15:06:42] heh [15:06:57] I don't know where the code can be read [15:08:04] AzaToth: for greg-g? [15:08:05] err [15:08:06] grrrit-wm: ? [15:08:14] (PS2) Ottomata: Running geowiki-process-data cron daily [operations/puppet] - https://gerrit.wikimedia.org/r/75882 [15:08:15] it's in labs/tools/grrrit [15:08:23] YuviPanda: perhaps a cron job sending it SIGHUP [15:08:31] YuviPanda: no, I need the code for greg-g [15:08:34] ツ [15:08:41] you can ask for his DNA sequence, doubt you'll get it tho [15:09:20] YuviPanda: reason it's not organized under operations? [15:09:39] AzaToth: because it is running in toollabs? [15:09:51] !log reedy synchronized php-1.22wmf12 [15:10:02] Logged the message, Master [15:10:06] YuviPanda: ok [15:10:40] AzaToth: plus I doubt I can get self +2 under operations/ :P [15:11:01] hehehe [15:11:20] (nor would I want to) [15:12:02] so it could be Coren's glory grid engine who restarts it [15:12:08] -' [15:12:15] that's possible. Maybe it hits memory limits and restart [15:12:16] s [15:12:20] which would explain the lack of an exception [15:12:24] yea [15:12:45] we'll need Coren to do a more detailed analysis [15:12:48] it has a 1000M cieling [15:12:51] *ceiling [15:12:56] and current max mem is only 750 [15:12:56] *diagnosis [15:12:57] M [15:13:48] is it puppetized? [15:13:57] AzaToth: no, because it runs on toollabs? 
[15:14:03] it *can't* be puppetized :) [15:14:06] because toollabs [15:14:20] hmm [15:14:46] toollabs infrastructure is puppetized [15:14:48] not the tools themselves [15:14:52] ok [15:15:38] on which instance is it running on? [15:15:57] AzaToth: toollabs? :D [15:15:59] http://tools.wmflabs.org/?status [15:16:03] lolrrit-wm [15:16:04] is the toolname [15:18:33] Heh. lol. :-) [15:19:21] AzaToth: The gridengine doesn't terminate processes, just restarts them when they exit on their own. [15:19:40] AzaToth: Wait, that's not true: there /is/ a case where it can terminate a process: if it busts the memory limit. [15:19:44] yeah [15:19:52] (In which case the accounting file will tell you) [15:19:53] PROBLEM - MySQL Slave Delay on db1008 is CRITICAL: CRIT replication delay 191 seconds [15:22:38] AzaToth: I have my 23andme DNA sequence in digital form, will that help? [15:22:46] cmjohnson1: yep let's move it! [15:22:53] RECOVERY - MySQL Slave Delay on db1008 is OK: OK replication delay 5 seconds [15:22:54] greg-g: would suffice [15:22:59] cool [15:23:02] greg-g: but I want it in print [15:23:07] A4 [15:23:13] jeff-green: cool anything special or just go ahead and bring it down? [15:23:26] Coren: where's the accounting file? [15:23:43] AzaToth: now you're just being picky [15:23:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:48] cmjohnson1: lemme shut down mysql, I'm not sure I trust the init script during shutdown [15:24:05] AzaToth: You need to use qacct to access it. 
qacct -j will give you final information on a process, including maxvmem [15:24:11] okay...just bring it all down and lmk when i can move it [15:24:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.155 second response time [15:24:49] (Or use jobname and it'll give you every job of that name) [15:25:03] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 15:24:54 UTC 2013 [15:25:53] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [15:26:24] cmjohnson1: good golly mysql is still shutting down... [15:27:24] Coren: none found... [15:27:40] seems lolrrit-vm is not running [15:28:12] and now I accidentally ran "qacct -d" [15:28:20] "qacct -j" [15:28:26] and can't stop it [15:28:43] AzaToth: -wm [15:28:44] not -vm [15:28:57] !log reedy synchronized docroot and w [15:29:07] Logged the message, Master [15:29:55] YuviPanda: still waiting for qacct -j to stop spamming [15:30:23] (PS1) Reedy: Add symlinks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75888 [15:30:24] (PS1) Reedy: test2wiki to 1.22wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75889 [15:30:25] (PS1) Reedy: Add .swp to .gitignore [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75890 [15:30:46] * Reedy kicks grrrit-wm [15:30:48] That output was probably already on your way in miliseconds; now you just have to wait for it to drain. :-) [15:30:52] heh [15:31:12] Reedy: was that late? [15:31:33] More just seconds [15:31:38] true [15:31:39] Coren: why can't I Ctrl-C? [15:32:48] AzaToth: Because the qacct program is long done. It took it miliseconds at most to send that output. That you have limited bandwidth to actually get and display it doesn't change the fact that there is nothing /to/ interrupt. 
[15:33:01] [0:0][azatoth@tools-login ~]$ qacct -j lolgrrit-wm [15:33:02] error: job name lolgrrit-wm not found [15:33:09] Coren: k [15:33:20] YuviPanda: ↑ [15:33:42] AzaToth: lolrrit-wm [15:33:47] AzaToth: No g. [15:34:06] (CR) Ottomata: [C: 2 V: 2] Running geowiki-process-data cron daily [operations/puppet] - https://gerrit.wikimedia.org/r/75882 (owner: Ottomata) [15:34:14] AzaToth: lolrrit-wm :) [15:34:31] (PS5) Physikerwelt: Creating initial debianization [operations/debs/latexml] - https://gerrit.wikimedia.org/r/75513 (owner: AzaToth) [15:34:45] daym [15:35:16] Exit status 137 == SIGKILL [15:35:55] Coren: that's yesterday [15:36:34] unless the time is off [15:36:53] It's synced, and in UTC [15:37:32] But it'd make sense, it's running now since Wed Jul 24 19:22:28 2013 [15:37:37] I.e. restarted immediately. [15:37:43] (CR) Reedy: [C: 2] Add symlinks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75888 (owner: Reedy) [15:37:50] (CR) Reedy: [C: 2] test2wiki to 1.22wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75889 (owner: Reedy) [15:37:54] (Merged) jenkins-bot: Add symlinks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75888 (owner: Reedy) [15:37:56] (CR) Reedy: [C: 2] Add .swp to .gitignore [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75890 (owner: Reedy) [15:37:59] (Merged) jenkins-bot: test2wiki to 1.22wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75889 (owner: Reedy) [15:38:05] Oh, did you want to see information on the /currently/ running job? [15:38:11] yes [15:38:12] Use qstat then, not qacct. :-) [15:38:15] (Merged) jenkins-bot: Add .swp to .gitignore [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75890 (owner: Reedy) [15:39:00] YuviPanda: according to that, it hasn't been restarted since yesterday [15:39:12] Indeed not. 
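The "Exit status 137 == SIGKILL" reading at 15:35 follows the common shell convention that a status above 128 means 128 plus the signal number. A small sketch of that decoding, assuming the accounting record reports the status in that form; `describe_exit_status` is a hypothetical helper, not part of gridengine.

```python
import signal


def describe_exit_status(status):
    """Describe an exit status, treating values above 128 as 128 + signal number."""
    if status > 128:
        signum = status - 128
        try:
            return "killed by " + signal.Signals(signum).name
        except ValueError:
            # Number does not map to a signal known on this platform.
            return "killed by unknown signal %d" % signum
    return "exited with code %d" % status
```

On Linux, `describe_exit_status(137)` gives `killed by SIGKILL`, which is consistent with a job being terminated for exceeding its memory limit.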
[15:40:01] YuviPanda: can't read the log files for lolrrit due to me not be in the inner circle [15:40:30] (PS1) Aude: explicitly load Wikibase i18n and related files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75892 [15:41:15] (PS2) Reedy: explicitly load Wikibase i18n and related files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75892 (owner: Aude) [15:41:20] (CR) Reedy: [C: 2] explicitly load Wikibase i18n and related files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75892 (owner: Aude) [15:41:30] (Merged) jenkins-bot: explicitly load Wikibase i18n and related files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75892 (owner: Aude) [15:44:10] cmjohnson1: ok you can move db1008 at will [15:44:24] alright... [15:44:37] (PS1) Ottomata: Fixing typo [operations/puppet] - https://gerrit.wikimedia.org/r/75893 [15:44:54] (CR) Ottomata: [C: 2 V: 2] Fixing typo [operations/puppet] - https://gerrit.wikimedia.org/r/75893 (owner: Ottomata) [15:45:56] PROBLEM - Host db1008 is DOWN: PING CRITICAL - Packet loss = 100% [15:54:18] !log reedy Started syncing Wikimedia installation... : test2wiki to php-1.22wmf12 and rebuild l10n cache [15:54:29] Logged the message, Master [15:54:38] (PS1) Ottomata: Fixing another typo [operations/puppet] - https://gerrit.wikimedia.org/r/75894 [15:54:46] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 15:54:40 UTC 2013 [15:54:50] (CR) Ottomata: [C: 2 V: 2] Fixing another typo [operations/puppet] - https://gerrit.wikimedia.org/r/75894 (owner: Ottomata) [15:54:56] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [16:00:45] mw1130: rsync: send_files failed to open "/wmf-config/.InitialiseSettings.php.swp" (in common): Permission denied (13) [16:00:45] rarrrgh [16:01:18] -rw------- 1 root root 20480 Jul 25 06:51 .InitialiseSettings.php.swp [16:01:19] ... [16:04:25] jeff_green: moved [16:07:07] sweet. 
thanks [16:11:09] Can a root delete /a/common/wmf-config/.InitialiseSettings.php.swp from tin please? [16:12:55] Reedy: someone get that for you yet or nah? [16:13:06] !log reedy Finished syncing Wikimedia installation... : test2wiki to php-1.22wmf12 and rebuild l10n cache [16:13:40] Reedy: done [16:13:44] Logged the message, Master [16:14:03] Thanks [16:14:48] welcome [16:15:36] Reedy: !!!! [16:15:40] http://en.wikivoyage.org/wiki/Stockholm [16:15:47] [78f46131] 2013-07-25 16:15:28: Fatal exception of type MWException [16:15:52] any details? [16:15:55] Uhh [16:15:59] give me a minute [16:16:27] 2013-07-25 16:15:28 mw1175 enwikivoyage: [78f46131] /wiki/Stockholm Exception from line 315 of /usr/local/apache/common-local/php-1.22wmf11/includes/MagicWord.php: Error: invalid magic word 'noexternallanglinks' [16:16:42] localisation update? [16:16:49] mebbe [16:16:50] Scap run [16:17:12] :] [16:17:15] *scap has been run [16:17:19] those magic words! [16:17:29] Yeaah :/ [16:17:37] running l10nupdate [16:17:41] ok [16:18:46] Uggggh [16:18:53] 10s of extensions out of date on the meta repo [16:19:00] oh, wtf! [16:19:15] * aude put $IP/extensions/Wikibase/client/WikibaseClient.i18n.magic.php  [16:19:20] in extensions list [16:19:28] that should be okay, right? 
[16:19:44] Submodule path 'Wikibase': checked out 'c5eb128a12548356b13359a7ff6e3debe1ec141c' [16:19:48] or does it really need to load the entire extension [16:20:37] if so, we can load wikibase client directly and load the i18n ones specifically for othe other components [16:24:00] sloooooow [16:24:44] if putting the alisas and magic word files in explicit in extension list does not work, then not sure what to do [16:24:58] easiest to just wait and see [16:25:02] ok [16:25:11] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 16:25:01 UTC 2013 [16:25:51] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [16:27:04] wikidata is rather broken [16:27:16] well has issues [16:27:17] it's still doing wmf11 [16:27:17] then wmf10, wmf12 [16:27:29] ok [16:28:22] Reedy: test2wiki is down now with Error: invalid magic word 'noexternallanglinks' [16:28:35] Yup, we know [16:28:41] ok [16:28:41] yeah [16:28:46] waiting for localisation stuff to do it's magic [16:29:06] we could do with some progress stats on these scripts [16:29:17] when i run merge message list, i get [16:29:18] 'wikibaseclientmagic' => "$IP/extensions/Wikibase/client/WikibaseClient.i18n.magic.php", [16:29:28] !log LocalisationUpdate completed (1.22wmf11) at Thu Jul 25 16:29:28 UTC 2013 [16:29:31] even just done X of Y servers = Z% done [16:29:38] Logged the message, Master [16:31:43] so, when i run merge message list with "enwiki", it deos not include wikibase repo files [16:31:49] (since wikibase repo not installed there) [16:32:08] explicitly including it, i get the bad initializtion order error [16:32:27] * aude can just remove that from wikibase, at leats in the branch [16:34:44] Reedy: i am removing the exception [16:34:59] we just trust people configure things correctly but do not need to enforce it currently [16:38:06] aude Reedy do you expect test2wiki back soon? 
[16:38:24] i hope so [16:38:30] me too [16:38:32] more of an issue that wikivoyage is down [16:38:35] ir was [16:38:37] or was [16:38:59] urgh [16:39:07] (PS1) Aude: not convinced this is the right approach [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75901 [16:39:12] Reedy: https://gerrit.wikimedia.org/r/#/c/75901/ [16:39:23] and i'll have a backport for wmf11 / wmf12 [16:40:47] !log LocalisationUpdate completed (1.22wmf10) at Thu Jul 25 16:40:47 UTC 2013 [16:40:58] Logged the message, Master [16:43:01] AzaToth: sorry, just came back [16:43:05] want access to lolrrit? [16:43:16] might just be a ping timeout or somesuch that automatically resets connection [16:43:32] YuviPanda: I could take a look [16:47:02] Reedy: https://gerrit.wikimedia.org/r/#/c/75902/ [16:47:40] assumign the submodules are correct [16:47:43] please check [16:49:15] aude ? [16:49:32] Eloquence: removing enforcement of wikibase initialization order [16:49:41] it's apparently problematic for localisation update [16:49:45] yes. is Reedy actually around or do we need to get someone else to help you? [16:49:52] no, he's working on it [16:49:54] kk [16:50:22] ok, i'm amending [16:50:38] what's going on? wikidata issues? [16:50:47] paravoid: bunch of stuff [16:51:04] i think wrong version of wikibase accidentally got deployed [16:51:24] plus interfering with localisation update by enforcing certain order of loading the extensions [16:51:29] (which i removed) [16:52:34] okay [16:52:38] purely mediawiki-related, right? [16:52:44] yes [16:52:48] wikivoyage is broken :P [16:52:52] i know [16:52:53] so you folks have it all covered I'm guessing? 
[16:52:57] yes [16:53:06] okay, let us know if you need anything [16:53:06] as fast as reedy can work [16:54:40] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 16:54:39 UTC 2013 [16:54:50] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [16:55:10] PROBLEM - Puppet freshness on professor is CRITICAL: No successful Puppet run in the last 10 hours [16:57:10] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [16:57:48] Running updates for 1.22wmf12 (on test2wiki) [16:57:48] 526371 MediaWiki messages are updated [16:57:48] Updated 1141503 messages in total [16:57:55] https://gerrit.wikimedia.org/r/#/c/75907/ is the right version of wikibase that shoudl work better for localisation update [16:58:00] Reedy: if you want that [16:58:53] Reedy: btw, this smells like https://wikitech.wikimedia.org/wiki/Incident_documentation/20130627-Site [16:58:57] and again wikidata [16:59:03] Krinkle: yes [16:59:15] i removed the thing that is problematic [16:59:18] * aude hopes that's all [16:59:32] and it is not quite the same actually [16:59:58] Krinkle: It isn't aborting [17:00:03] ok, good [17:00:36] -rw-rw-r-- 1 reedy wikidev 20973 Jul 25 15:50 ExtensionMessages-1.22wmf12.php [17:00:48] still, is it all correct? all the wikibase stuff there? [17:01:34] 'WikibaseDataModel' => "$IP/extensions/WikibaseDataModel/WikibaseDataModel.i18n.php", [17:01:34] 'WikibaseLib' => "$IP/extensions/Wikibase/lib/WikibaseLib.i18n.php", [17:01:34] 'wikibaseclient' => "$IP/extensions/Wikibase/client/WikibaseClient.i18n.php", [17:01:34] 'Wikibaseclientalias' => "$IP/extensions/Wikibase/client/WikibaseClient.i18n.alias.php", [17:01:34] 'wikibaseclientmagic' => "$IP/extensions/Wikibase/client/WikibaseClient.i18n.magic.php", [17:01:37] can we just undeploy Wikidata and move ahead? 
[17:01:42] i would like https://gerrit.wikimedia.org/r/#/c/75901/ approved and do https://gerrit.wikimedia.org/r/#/c/75907/ instead for now [17:01:57] Reedy: that's for client [17:02:00] where is wikibase repo? [17:02:05] can we just undeploy mobile edits and move ahead? [17:02:48] and https://gerrit.wikimedia.org/r/#/c/75904/ for wmf12 or just pull the latest of mw1.22-wmf11 branch of wikibase [17:02:48] sarcasm detection: FAIL [17:03:09] * aude undeploy MaxSem :) [17:03:15] aude: There isn't any... [17:03:32] Reedy: not good i think [17:04:19] i think wikidata is missing its messages [17:04:37] the wikibase messages [17:06:19] Seems to be just those missing [17:06:41] yeah [17:06:58] a problem for wikidata [17:07:05] (CR) Reedy: [C: 2] not convinced this is the right approach [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75901 (owner: Aude) [17:07:07] although let's worry about wikivoyage first [17:07:15] (Merged) jenkins-bot: not convinced this is the right approach [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75901 (owner: Aude) [17:08:04] for https://gerrit.wikimedia.org/r/#/c/75907/, i at least checked all the wikidata realted extensions have correct submodules [17:08:16] don't know about other stuff though [17:10:34] ok wikivoyage is back [17:10:36] !log LocalisationUpdate completed (1.22wmf12) at Thu Jul 25 17:10:36 UTC 2013 [17:10:46] Logged the message, Master [17:11:00] wikidata is still missing messages [17:11:11] Special:Version is fataling [17:11:15] Of course it is [17:11:16] grrr [17:11:19] yeah [17:11:20] !log reedy synchronized php-1.22wmf11/extensions/Wikibase [17:11:30] Logged the message, Master [17:11:41] rl cache refresh [17:11:54] maybe i just hit something in squid [17:12:03] wikivoyage is not back [17:12:09] everything is slooooooow [17:12:18] grrrrrrr [17:12:20] (CR) Mark Bergsma: [C: -1] "Header stripping should only happen on the front layer" [operations/puppet] - https://gerrit.wikimedia.org/r/75860 
(owner: Mark Bergsma) [17:12:55] Oooo cool: [17:12:58] https://wiki.apache.org/incubator/SamzaProposal [17:13:02] ok, after that revert of yours, the extension messages file for wmf11 looks more sensible [17:13:08] Reedy: good [17:13:09] storm alternative build on Hadoop YARN + Kafka :) [17:13:33] i think it loads only what a wikipedia loads and somehow skipped wikibase repo [17:13:57] it should load everything and not care about initialization order at that point [17:14:18] wmf11 is first at least too [17:14:22] ok [17:14:27] (CR) Mark Bergsma: [C: 2] Add a lean definition (and custom template) for SSL proxies to localhost [operations/puppet] - https://gerrit.wikimedia.org/r/75606 (owner: Mark Bergsma) [17:15:32] (PS8) Mark Bergsma: Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 [17:16:55] (CR) Mark Bergsma: [C: 2] Setup NGINX for HTTPS on the Varnish servers [operations/puppet] - https://gerrit.wikimedia.org/r/75601 (owner: Mark Bergsma) [17:17:29] test2wiki is back, thanks [17:17:33] ood [17:17:34] good [17:18:26] Having version controlled l10n cache would be so bloody useful [17:18:34] in a local git repo clone [17:18:37] just revert back quickly [17:19:24] how about that git-deploy, eh? [17:19:28] (PS1) Mark Bergsma: Install amssq47 as an SSL relay [operations/puppet] - https://gerrit.wikimedia.org/r/75913 [17:20:00] Reedy: yes! 
[17:20:09] (CR) Mark Bergsma: [C: 2] Install amssq47 as an SSL relay [operations/puppet] - https://gerrit.wikimedia.org/r/75913 (owner: Mark Bergsma) [17:23:30] fscking laptop [17:23:36] :( [17:24:22] it's taken to notifying me of low battery without enough time to go and get my power adapter [17:24:27] so it hibernates [17:25:00] god, just what we need in this situation :( [17:25:01] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 17:24:56 UTC 2013 [17:25:51] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [17:27:25] aude: Damn it, wmf12 is dying again [17:27:35] with my patch? [17:27:51] I think I might have missed one [17:28:20] make sure wikibase is on latest commit for mw1.22-wmf11 [17:28:20] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki back to wmf11 [17:28:25] Logged the message, Master [17:28:33] Reedy: how much longer will localisation cache take for wmf11? [17:28:43] It's built [17:28:46] It just needs to sync [17:28:46] ok [17:28:48] ok [17:28:55] which it should do if wmf12 wasn't killing everything [17:28:56] !log reedy Started syncing Wikimedia installation... 
: [17:28:59] hence the revert [17:29:04] ok [17:29:06] Logged the message, Master [17:29:42] if this does not fix for any reason, then we can disable wikibase for wikivoyage until we do get it fixed [17:29:43] lets do it the quicker way [17:29:57] if it's missing language links, oh well [17:30:22] (PS1) Mark Bergsma: Don't install the Ganglia Apache plugin for now [operations/puppet] - https://gerrit.wikimedia.org/r/75915 [17:30:47] and wikidata dispatching to wikipedia only [17:30:54] if required [17:31:35] i see test2 is on wmf11 now [17:31:46] so then we'll know when wmf11 is okay again to reenable wikibase [17:32:07] (that's if the synching does not finish soon and fix) [17:33:32] aaand test2wiki is gone again [17:33:43] yeah it's on wmf11 now [17:33:53] which is broken [17:40:59] * aude also would love if mediawiki would not throw an exception if it can't find a magic word [17:41:14] Indeed [17:41:18] :( [17:41:23] debug log and just don't process it [17:41:24] it's so brittle to introduce new words [17:41:48] wmf11 l10n cache should be nearly all synce [17:41:54] wikidata is just missing messages but otherwise survives at the moment [17:42:06] no magic words there [17:42:11] Reedy: k [17:42:15] bots don't need messags! [17:42:20] it's back [17:42:31] oh my! [17:42:34] * greg-g breathes [17:42:42] wikidata looks good too [17:42:48] yep [17:42:50] same here [17:43:05] Is it bed time yet? [17:43:07] !log reedy synchronized php-1.22wmf11/cache/l10n/ [17:43:10] gah, takes waaaaaaaaaaaaay too long to rebuild stuff [17:43:19] Logged the message, Master [17:43:26] once you get a chance to breathe, mind updating that bug https://bugzilla.wikimedia.org/show_bug.cgi?id=52038 with what happened/cause/etc? 
[17:43:28] and not have backup cache copy [17:43:35] a number of issues [17:43:46] * greg-g nods yeah [17:44:04] * Reedy goes to find a drink [17:44:10] ok :) [17:44:27] :) [17:45:02] Being able to run scap only for specific versions could be useful too [17:45:19] so, git-deploy? :P [17:45:30] uh, yeah [17:48:07] Right [17:48:21] So, the extension-messages change for wmf12 is what broke the wmf11 message cache [17:48:29] We've reverted that, and now scap is fine for wmf11 [17:48:47] but broken for wmf12 due to the exception thrown by wikibase... [17:48:52] Which was been removed? [17:49:07] https://gerrit.wikimedia.org/r/#/c/75904/ [17:49:09] Need merging [17:49:35] yes [17:49:52] is there no way to run localisation update for just wmf12? [17:50:02] that's what we needed [17:50:38] We can do it "manually" then sync-dir [17:50:38] cmjohnson1: Would you prefer i broke testwiki with deployment prep stuff? [17:51:21] heh [17:51:47] I can't really use testwikidatawiki, mediawikiwiki or loginwiki ;) [17:52:41] Reedy? not following you...what are you breaking? [17:52:47] !log reedy synchronized php-1.22wmf12/extensions/Wikibase [17:52:59] Logged the message, Master [17:53:04] cmjohnson1: I need a wiki on the "new" version to get a localisation cache built prior to deploy [17:53:07] Reedy: hmmm [17:53:17] I've been using test2wiki, and hence, that gets broken sometimes [17:53:21] aude: ? [17:53:28] Reedy: I thihnk you mean chrismcmahon ? [17:53:36] Oh, fail [17:53:37] :) [17:53:39] yeah, sorry cmjohnson1 [17:53:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:51] test2wiki is back again, thanks [17:53:56] Reedy: ah makes more sense ...well i say break whatever you like [17:54:03] cmjohnson1: hey! don't say that! 
[17:55:00] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 17:54:54 UTC 2013 [17:55:10] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [17:55:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [17:56:11] Reedy: ? [17:56:27] greg-g: filled in the bug report [17:56:29] ? [17:56:34] Thanks [17:56:40] ok [17:57:08] aude: thanks much, anything to add Reedy ? [17:57:24] we enforce the initialisation order, at least not anytime soon [17:57:30] so things can work the old way [17:57:55] err, we won't enforce [18:02:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [18:04:08] Oh look, deployment window time [18:04:22] oh, fun! [18:05:00] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [18:05:10] !log reedy Started syncing Wikimedia installation... : [18:05:16] looks a bit happier now though [18:05:23] Updating ExtensionMessages-1.22wmf12.php... [18:05:23] done [18:05:23] Updating LocalisationCache for 1.22wmf12... done [18:05:49] good [18:06:29] whew [18:16:58] !log authdns-update to change IPs for db1008 [18:17:07] Logged the message, Master [18:22:09] scap scap scap scap scapping [18:22:49] !log reedy Finished syncing Wikimedia installation... 
: [18:22:56] praise jeebus [18:22:59] Logged the message, Master [18:24:40] :) [18:24:52] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 18:24:47 UTC 2013 [18:24:53] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf12 [18:25:04] Logged the message, Master [18:25:38] (PS1) Reedy: testwiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75921 [18:25:42] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [18:26:51] Reedy: didn't happen to remember to turn on vips today, did you? 'twas a bit crazy I know [18:27:06] Nope... [18:27:16] Should be more than enough time in this window to do it [18:27:53] * greg-g nods [18:28:08] Reedy: when not busy, https://gerrit.wikimedia.org/r/#/c/75617/ would be nice to have :) [18:28:14] Reedy: https://gerrit.wikimedia.org/r/#/c/74514/ [18:29:26] wmf12 has certainly quietened down [18:29:41] (CR) Reedy: [C: 2] testwiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75921 (owner: Reedy) [18:29:48] (Merged) jenkins-bot: testwiki, testwikidatawiki, loginwiki and mediawikiwiki to 1.22wmf12 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75921 (owner: Reedy) [18:30:54] (PS2) Bsitu: Preparation of Echo and Thanks for metawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75795 [18:31:33] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.22wmf11 [18:31:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:43] Logged the message, Master [18:32:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [18:32:56] 217 Warning: require() 
[function.require]: Unable to allocate memory for pool. [18:33:01] I guess APC doesn't like 3 versions [18:34:24] :( [18:35:05] It'll soprt itself out in a few minutes [18:35:34] I wonder if there's a nice way to get it to purge files with a specific prefix [18:35:58] (PS1) Reedy: Wikipedias to 1.22wmf11 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75922 [18:36:10] (CR) Reedy: [C: 2] Wikipedias to 1.22wmf11 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75922 (owner: Reedy) [18:36:17] (Merged) jenkins-bot: Wikipedias to 1.22wmf11 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75922 (owner: Reedy) [18:36:26] (PS2) Reedy: hewikivoyage: also sortPrepend en [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75617 (owner: Tzafrir) [18:36:38] (CR) Reedy: [C: 2] hewikivoyage: also sortPrepend en [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75617 (owner: Tzafrir) [18:36:46] (Merged) jenkins-bot: hewikivoyage: also sortPrepend en [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75617 (owner: Tzafrir) [18:37:17] (PS2) Reedy: Proposed settings for VIPS. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74514 (owner: Brian Wolff) [18:37:38] (PS3) Reedy: Proposed settings for VIPS. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74514 (owner: Brian Wolff) [18:37:52] (CR) Reedy: [C: 2] Settings for VIPS. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74514 (owner: Brian Wolff) [18:38:00] (Merged) jenkins-bot: Settings for VIPS. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74514 (owner: Brian Wolff) [18:39:07] !log reedy synchronized wmf-config/ [18:39:17] Logged the message, Master [18:40:21] tada! [18:42:20] 2 new wikis to be created.. [18:43:10] (CR) Reedy: "Is this waiting on anything in particular? Or can it be deployed?" 
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74268 (owner: Ryan Lane) [18:43:15] (PS2) Reedy: (bug 51803) set up flood flag for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75538 (owner: TTO) [18:43:21] (CR) Reedy: [C: 1] Enable Secure Login everywhere [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74268 (owner: Ryan Lane) [18:43:26] (CR) Reedy: [C: 2] (bug 51803) set up flood flag for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75538 (owner: TTO) [18:43:33] (Merged) jenkins-bot: (bug 51803) set up flood flag for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75538 (owner: TTO) [18:43:46] (PS2) Reedy: (bug 49600) add Portal namespace for sowiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75540 (owner: TTO) [18:43:47] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [18:43:54] (CR) Reedy: [C: 2] (bug 49600) add Portal namespace for sowiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75540 (owner: TTO) [18:44:03] (Merged) jenkins-bot: (bug 49600) add Portal namespace for sowiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75540 (owner: TTO) [18:44:48] !log reedy synchronized wmf-config/InitialiseSettings.php [18:44:58] Logged the message, Master [18:46:12] Reedy: new wikis? [18:46:14] which ones? [18:47:22] Wikivoyage Vietnamese [18:47:27] Wikipedia Tuvan [18:47:29] oh, it gets wikibase [18:47:34] Both do! [18:47:39] is there a way to populate sites table first? 
[18:47:50] or guess it will work [18:47:58] it should work [18:48:03] we fixed the issue [18:49:57] !log authdns-update: change ns1's IP to new service IP; glues were updated [18:50:07] Logged the message, Master [18:54:47] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 18:54:40 UTC 2013 [18:55:37] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [19:14:12] (PS1) Pyoungmeister: grabbing 4 pmtpa mw boxes for ES testing [operations/puppet] - https://gerrit.wikimedia.org/r/75923 [19:14:42] (PS2) Pyoungmeister: removing unused enwikijobqueue check [operations/puppet] - https://gerrit.wikimedia.org/r/75797 [19:25:04] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 19:25:02 UTC 2013 [19:25:11] (CR) Pyoungmeister: [C: 2] removing unused enwikijobqueue check [operations/puppet] - https://gerrit.wikimedia.org/r/75797 (owner: Pyoungmeister) [19:25:31] (PS2) Pyoungmeister: grabbing 4 pmtpa mw boxes for ES testing [operations/puppet] - https://gerrit.wikimedia.org/r/75923 [19:25:34] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [19:29:09] (PS1) Ottomata: Rendering fs.default.name if not using yarn [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/75951 [19:29:27] (CR) Ottomata: [C: 2 V: 2] Rendering fs.default.name if not using yarn [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/75951 (owner: Ottomata) [19:29:35] (PS3) Pyoungmeister: grabbing 4 pmtpa mw boxes for ES testing [operations/puppet] - https://gerrit.wikimedia.org/r/75923 [19:31:55] (PS1) Ori.livneh: ensure => directory on GlusterFS log dir [operations/puppet] - https://gerrit.wikimedia.org/r/75976 [19:38:20] (CR) Asher: [C: 1] grabbing 4 pmtpa mw boxes for ES testing [operations/puppet] - https://gerrit.wikimedia.org/r/75923 (owner: Pyoungmeister) [19:38:32] (PS4) Pyoungmeister: grabbing 4 pmtpa mw boxes for ES testing [operations/puppet] - 
https://gerrit.wikimedia.org/r/75923 [19:38:39] (CR) Ryan Lane: [C: 2] ensure => directory on GlusterFS log dir [operations/puppet] - https://gerrit.wikimedia.org/r/75976 (owner: Ori.livneh) [19:39:02] (CR) Pyoungmeister: [C: 2] grabbing 4 pmtpa mw boxes for ES testing [operations/puppet] - https://gerrit.wikimedia.org/r/75923 (owner: Pyoungmeister) [19:39:33] who has the job queue stuff unmerged on sockpuppet? [19:39:46] I'm merging it [19:40:42] Aaron|home: ^^ [19:49:19] * Aaron|home knows nothing of that [19:54:48] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 19:54:46 UTC 2013 [19:55:38] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [20:00:48] ok...so now shipping things to you with 9 months? [20:01:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:02:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.155 second response time [20:17:14] AzaToth: ^ [20:18:49] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [20:19:53] (PS1) MaxSem: Disable mobile upload CTAs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76001 [20:19:56] does the fact that notpeter is actually peter prove P = NP? [20:21:01] * YuviNoPower adds to quips [20:24:59] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 20:24:52 UTC 2013 [20:25:39] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [20:27:33] did anyone do anything with mw1041 recently? 
it looks out of sync, executing wmf8 [20:28:01] !log Ran sync-common on mw1041 - looked outta sync [20:28:11] Logged the message, Master [20:30:54] (PS1) Jgreen: remove db1008, it has moved to frack [operations/puppet] - https://gerrit.wikimedia.org/r/76002 [20:32:23] (CR) Jgreen: [C: 2 V: 2] remove db1008, it has moved to frack [operations/puppet] - https://gerrit.wikimedia.org/r/76002 (owner: Jgreen) [20:34:16] aha, mw1041 is not in mediawiki-installation DSH group, can someone fix this? [20:34:17] !log olivneh synchronized php-1.22wmf11/extensions/GuidedTour 'E3 deployment: bugfixes for GuidedTour and GettingStarted (1/2)' [20:34:26] Logged the message, Master [20:34:38] !log olivneh synchronized php-1.22wmf11/extensions/GettingStarted 'E3 deployment: bugfixes for GuidedTour and GettingStarted (2/2)' [20:34:46] ^ StevenW [20:34:48] Logged the message, Master [20:35:06] Muchas gracais señor [20:38:04] The apache syslogs are very noisy [20:38:28] aaand [20:38:30] they've all just gone [20:38:32] YuviNoPower: looki [20:38:42] AzaToth: \o/ [20:39:06] Reedy: when is mediawiki going to switch to 1.22wmf12? [20:39:13] mediawiki? [20:39:16] mediawiki.org? [20:39:17] .org [20:39:18] yeah [20:39:24] !mw MediaWIki 1.22/Roadmap [20:39:37] It was done 2 hours 40 minutes ago [20:39:38] silly wm-bot. [20:39:53] the source of mw.org still shows 1.22wmf11 [20:40:09] https://www.mediawiki.org/wiki/Special:Version [20:40:13] bsitu, caching? 
also, http://noc.wikimedia.org/conf/highlight.php?file=wikiversions.dat [20:40:14] MediaWiki 1.22wmf12 (322b84a) [20:40:53] YuviNoPower: nothing [20:40:58] see [20:41:01] grr [20:41:16] MaxSem: thanks, maybe cache [20:42:36] (CR) Dzahn: [C: 2] Add new bug status 'PATCH_TO_REVIEW' to the queries [operations/puppet] - https://gerrit.wikimedia.org/r/75843 (owner: Aklapper) [20:42:40] (CR) Helder.wiki: [C: 1] Enable Secure Login everywhere [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/74268 (owner: Ryan Lane) [20:46:01] !log olivneh synchronized php-1.22wmf12/extensions/GuidedTour 'E3 deployment: bugfixes for GuidedTour and GettingStarted (1/2)' [20:46:10] Logged the message, Master [20:46:21] !log olivneh synchronized php-1.22wmf12/extensions/GettingStarted 'E3 deployment: bugfixes for GuidedTour and GettingStarted (2/2)' [20:46:31] Logged the message, Master [20:46:59] YuviNoPower: offcourse I forgot to check the pid... [20:47:08] AzaToth: hmm? [20:47:11] right [20:47:13] did that change/ [20:47:14] ? [20:48:24] !log olivneh synchronized php-1.22wmf12/extensions/GuidedTour 'E3 deployment: bugfixes for GuidedTour and GettingStarted (1/2)' [20:48:42] !log olivneh synchronized php-1.22wmf12/extensions/GettingStarted 'E3 deployment: bugfixes for GuidedTour and GettingStarted (2/2)' [20:49:24] (PS5) Pyoungmeister: grabbing 4 pmtpa mw boxes for ES testing [operations/puppet] - https://gerrit.wikimedia.org/r/75923 [20:50:00] YuviNoPower: the script is rerun, so logically the pit should have changed [20:50:08] I forgot to check the pid before though [20:50:14] shouldn't that be in accounting? [20:50:42] well [20:50:54] I have 0% knowledge what "accounting" is [20:52:57] YuviNoPower: qstat doesn't show any pid [20:53:15] (CR) Pyoungmeister: [C: 2] grabbing 4 pmtpa mw boxes for ES testing [operations/puppet] - https://gerrit.wikimedia.org/r/75923 (owner: Pyoungmeister) [20:53:15] AzaToth: yeah but should show if the grid restarted it? 
[20:53:18] neither does qacc [20:53:44] YuviPanda: I don't know what data is accounted for [20:53:48] sigh [20:53:59] I guess we just note pid and see if that restarts again? [20:54:04] well, it *does* restart [20:54:06] we just dunno why [20:54:20] 50674 30928 0.0 0.2 803428 17492 ? SNl 20:16 0:00 /usr/bin/nodejs /data/project/lolrrit-wm/lolrrit-wm/src/relay.js [20:54:37] seems it restarted there 20:16 [20:54:49] Coren: ping [20:55:07] Pong. [20:55:32] OooOoo. [20:55:39] That wasn't the grid's doing. [20:55:45] Coren: hmm [20:56:06] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 20:55:57 UTC 2013 [20:56:13] That was the local loop restart -- the /job/ itself has never been interrupted, but a continuous job is, in essence, "while true; do your_stuff; sleep 30; done" [20:56:36] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [20:56:36] hmm [20:56:55] Well, actually, it's more like "while ! your_stuff; do sleep 30; done" but same idea. [20:56:57] so you mean the program silently crashed and then it got restarted [20:57:12] * Coren nods. Without involving the gridengine at all. [20:57:17] ok [20:57:32] That's why as far as /it/ is concerned, it never restarted. [20:58:08] Coren: imo `it` should "log" that it wasn't concerned when "it" _happened_ [20:58:30] That's actually not stupid. [20:58:30] :-P [20:58:30] ツ [20:58:54] "timestamp Your program just exited with value $? but it's not my fault!" [20:59:26] * Coren adds that now since that'll help your debugging. [20:59:43] ty [20:59:56] where does that end up at? [21:02:27] YuviPanda: it could be the exception in lolrrit-wm.err [21:02:56] http://paste.debian.net/18621/ [21:02:59] AzaToth: I just deleted .err [21:03:05] I remember getting that error long long time ago [21:03:15] if it is still there next time, that should be fine [21:03:24] k [21:03:31] I... 
just realized I could've looked at mtime [21:03:52] fyi, instead if deletion a file, just do ":>lolrrit-wm.err" [21:04:05] grr [21:04:23] ":>" is the best bashism ツ [21:04:48] :D next time [21:04:50] I'm going to sleep now [21:04:53] thanks for the help, AzaToth [21:04:58] at least we know it isn't an oom :) [21:05:14] heh [21:05:19] perhaps it is ツ [21:13:15] (PS1) Asher: fix install path for updateinterwikicache [operations/puppet] - https://gerrit.wikimedia.org/r/76008 [21:14:35] (CR) Asher: [C: 2 V: 2] fix install path for updateinterwikicache [operations/puppet] - https://gerrit.wikimedia.org/r/76008 (owner: Asher) [21:19:38] (PS1) Pyoungmeister: also include wikidev group [operations/puppet] - https://gerrit.wikimedia.org/r/76009 [21:21:18] (CR) Pyoungmeister: [C: 2] also include wikidev group [operations/puppet] - https://gerrit.wikimedia.org/r/76009 (owner: Pyoungmeister) [21:23:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:24:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.176 second response time [21:25:36] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 21:25:28 UTC 2013 [21:25:36] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [21:27:17] (PS1) Lcarr: trying explicitely scoping the exec [operations/puppet] - https://gerrit.wikimedia.org/r/76011 [21:27:50] (CR) jenkins-bot: [V: -1] trying explicitely scoping the exec [operations/puppet] - https://gerrit.wikimedia.org/r/76011 (owner: Lcarr) [21:29:03] (Abandoned) Lcarr: trying explicitely scoping the exec [operations/puppet] - https://gerrit.wikimedia.org/r/76011 (owner: Lcarr) [21:30:48] paravoid: if you have the time/energy, i'm trying to figure out how to make the varnish::instance class see the "exec[generate varnish.pyconf]" event that is located in varnish::monitoring::ganglia ... 
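The local restart loop Coren describes above ("while ! your_stuff; do sleep 30; done"), with the exit-status logging AzaToth asks for, can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual grid wrapper: the function name run_until_success, the log path argument, and the exact message wording are all hypothetical. It uses `until` rather than `while !` because negating the command with `!` would hide its real exit status from `$?`.

```shell
#!/bin/sh
# Hypothetical sketch of a "continuous job" wrapper. run_until_success and its
# arguments are illustrative names, not part of any real grid tooling.
#
# Usage: run_until_success LOGFILE DELAY_SECONDS COMMAND [ARGS...]
run_until_success() {
    logfile=$1
    delay=$2
    shift 2

    # Re-run the command until it exits 0. Inside the loop body, $? still
    # holds the exit status of the failed command.
    until "$@"; do
        status=$?
        # Record every silent crash with a UTC timestamp, so a restart is
        # visible even though the scheduler itself never noticed it.
        printf '%s Your program just exited with value %s; restarting in %ss\n' \
            "$(date -u '+%Y-%m-%d %H:%M:%S')" "$status" "$delay" >> "$logfile"
        sleep "$delay"
    done
}

# Example invocation (assumed paths, left commented out on purpose):
# run_until_success "$HOME/relay.restarts.log" 30 /usr/bin/nodejs relay.js
```

To empty such a log without deleting it, the ":>file" idiom mentioned above (written portably as `: > file`) truncates the file in place, so a process that already has it open for appending keeps writing to the same file.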
[21:31:16] instance.pp::81 and montoring/ganglia.pp::10 , in modules/varnish/manifests [21:33:23] Jeff_Green: these db errors are being logged regularly from various eqiad apaches - Thu Jul 25 21:28:34 UTC 2013 mw1162 foundationwiki Error connecting to db1008.eqiad.wmnet: Unknown MySQL server host 'db1008.eqiad.wmnet' (0) [21:35:09] binasher: soudns like https://rt.wikimedia.org/Ticket/Display.html?id=5435 [21:35:30] needs switch port [21:43:01] PROBLEM - DPKG on mw134 is CRITICAL: Connection refused by host [21:43:11] PROBLEM - Apache HTTP on mw131 is CRITICAL: Connection refused [21:43:11] PROBLEM - Disk space on mw134 is CRITICAL: Connection refused by host [21:43:11] PROBLEM - twemproxy process on mw131 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 65534 (nobody), command name nutcracker [21:43:31] PROBLEM - RAID on mw134 is CRITICAL: Connection refused by host [21:43:41] Ryan_Lane: Could I please have a look at https://gerrit.wikimedia.org/r/#/c/75834/ ? [21:44:05] Ryan_Lane: It would help to automatically update the bug status upon submitting related patches in gerrit. [21:47:19] Ryan_Lane: s/I/you/ :-) [21:48:12] RECOVERY - Disk space on mw134 is OK: DISK OK [21:48:31] RECOVERY - RAID on mw134 is OK: OK: no RAID installed [21:49:01] RECOVERY - DPKG on mw134 is OK: All packages OK [21:52:17] (CR) Dzahn: [C: 2] Fix setting bug status in hooks-bugzilla configuration [operations/puppet] - https://gerrit.wikimedia.org/r/75834 (owner: QChris) [21:52:58] qchris: i merged that, i know Andre just introduced that new status, he +1ed it and the explanation sounded sane [21:53:10] mutante: Thanks! 
[21:54:48] yw [21:55:01] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 21:54:55 UTC 2013 [21:55:21] PROBLEM - NTP on mw134 is CRITICAL: NTP CRITICAL: Offset unknown [21:55:31] PROBLEM - NTP on mw133 is CRITICAL: NTP CRITICAL: Offset unknown [21:55:51] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [21:57:31] RECOVERY - NTP on mw133 is OK: NTP OK: Offset 0.0007276535034 secs [22:00:24] (PS1) Ottomata: Puppetizing HA NameNode via Quorum Based JournalNode. [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/76018 [22:05:21] RECOVERY - NTP on mw134 is OK: NTP OK: Offset -0.003217935562 secs [22:25:00] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 22:24:55 UTC 2013 [22:25:50] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [22:26:30] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:28:00] binasher: grrr. ok thanks. [22:28:21] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [22:33:23] LeslieCarr: you want the varnish service to refresh when the generated pyconf file is updated? [22:36:39] (PS1) Jdlrobson: EventLogging: Add schema for mobile watchlist interactions [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76026 [22:38:07] ori-l: i want the varnish service refresh to happen before the pyconf file is generated [22:39:30] LeslieCarr: set the exec to refreshonly => true [22:39:37] and have the varnish service notify => the exec resource [22:45:19] ahha i actually see where the problem is, the ganglia class isn't called within that class, even though it refers to it, and professor didn't have it called at all [22:45:21] woot [22:45:22] thanks :) [22:53:32] greg-g: I can haz LD slot? [22:54:08] RoanKattouw: oh? 
:) [22:54:45] RECOVERY - Puppet freshness on professor is OK: puppet ran at Thu Jul 25 22:54:33 UTC 2013 [22:54:58] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jul 25 22:54:48 UTC 2013 [22:55:48] PROBLEM - Puppet freshness on zinc is CRITICAL: No successful Puppet run in the last 10 hours [22:56:19] RoanKattouw: go forth and prosper with bsitu [23:00:49] RoanKattouw: I will do a config change to enable Echo on meta + an extension change for 1.22wmf11, are you doing a scap? [23:00:49] I don't think I'm doing a scap but let me check [23:00:50] Nope, I won't need to scap [23:00:53] greg-g, i can haz deploy a config change? [23:01:12] MaxSem: you too! [23:01:28] wee:) [23:01:32] "you get an LD! and You get an LD! and YOU get an LD!" [23:01:38] RoanKattouw: I am pushing out the config change now, but It may me a while to prepare the extension change, you can go ahead after I push out the config change [23:01:39] MaxSem: what's the change? [23:01:51] https://gerrit.wikimedia.org/r/76001 [23:01:52] OK, prepping my extension change now [23:02:28] if I knew what CTAs were I'd say something sensible here. [23:02:31] MaxSem: ok [23:02:38] coordinate with RoanKattouw and bsitu [23:03:11] So, our schedule for today is: [23:03:16] 1) bsitu config change (going out now) [23:03:23] 2) Roan VE update (prepping) [23:03:27] 3) bsitu extension update [23:03:31] 4) MAxSem config change [23:03:41] right. [23:03:57] oh Thursday lightning deploys.... [23:05:50] bsitu: Ping me when it's my turn [23:05:58] will do [23:08:22] !log bsitu synchronized echowikis.dblist 'add metawiki to echo dblist' [23:09:51] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Enable htmlemail global and enable echo,thanks for metawiki' [23:10:20] RoanKattouw: done [23:11:07] Thanks [23:11:36] YuviPanda: Where's the Gerrit bot? 
:( [23:12:56] Now doing VE update [23:13:35] * RoanKattouw glares at https://gerrit.wikimedia.org/r/#/c/76005/ [23:13:54] ori-l: https://gerrit.wikimedia.org/r/#/c/76005/ grumble grumble [23:14:38] bsitu: What is the extension you're updating after me? [23:15:33] RoanKattouw: it's an Echo change for 1.22wmf12 [23:15:37] i mean 1.22wmf11 [23:15:40] Oh, OK, so nothing related to GettingStarted? [23:15:44] no [23:16:14] Do you have ori-l in sight? I need him to explain why https://gerrit.wikimedia.org/r/#/c/76005/ was merged but never deployed [23:16:36] yes, let me ask him [23:17:00] It looks like he might have done wmf11 but forgotten wmf12? [23:17:16] he is checking IRC now [23:17:27] ori-l: ^^ [23:17:32] hey [23:17:45] investigating, but the simplest explanation (i screwed up and forgot) seems likely [23:18:02] but let me check [23:18:51] It came in when I ran git pull in wmf12, so it looks like you forgot to me; especially given that you did deploy the same version to wmf11 [23:20:12] yes, that's correct. i'm sorry. what would you like to do? it's a minor change and well-tested on wmf11, so deploying it with your changes will be safe. but if you'd rather revert, that's ok too. [23:20:30] I am seeing some broken message like: in 1.22wmf12, this new message is deployed with the auto-deploy of 1.22wmf12 earlier today, do I need to run scap to update it? [23:21:18] ori-l: I can just deploy it [23:21:26] Just wanted to check if it's OK [23:21:26] MaxSem: your LD might be delayed until Monday [23:21:32] bsitu: Yes, scap [23:21:40] OK, thanks. Sorry 'bout that, RoanKattouw and MaxSem. [23:21:40] But let others come along for the ride first [23:21:42] depending on how long this takes [23:21:46] We're fine [23:21:48] k [23:21:52] RoanKattouw: okay, thx [23:22:02] OK, I've pulled and updated VE in 11 and 12, and GettingStarted in 12 [23:22:10] I'm just not a huge fan of rolling all of them into one so it's harder to test.... but.... 
[23:22:11] But not pushed, because I intend to ride bsitu's scap [23:22:26] Right, that's fair [23:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:26] I can sync them real quick if you like [23:22:31] Let me do that now [23:22:33] thanks [23:22:43] * greg-g gets less tense [23:22:53] I'm on the edge of my seat over here. [23:23:09] Elsie: :P [23:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [23:23:24] Syncing VE to 11 and 12, then GS to 12 [23:23:43] !log catrope synchronized php-1.22wmf11/extensions/VisualEditor 'Update VE to master' [23:23:48] So for the rest of the schedule I propose: [23:23:54] 1) wait for my syncs to finish [23:24:01] 2) let bsitu update Echo (no scap) [23:24:04] 3) let MaxSem do his thing [23:24:07] !log catrope synchronized php-1.22wmf12/extensions/VisualEditor 'Update VE to master' [23:24:15] 4) let bsitu scap to try and fix the message issue [23:24:16] morebots is missing. [23:24:22] Hmm, crap [23:24:27] grrrit-wm is also missing [23:24:31] !log catrope synchronized php-1.22wmf12/extensions/GettingStarted 'Belated GettingStarted sync, wmf11 was synced but wmf12 was not' [23:24:33] stupid bots [23:24:36] Ryan_Lane: Did you touch wikitech recently? morebots is down [23:24:38] logmsgbot doing fine :D [23:25:03] siebrand is diverting Ryan_Lane's attention [23:25:21] yeah, it will take me some time to prepare the change, so you guys can go ahead with your changes, and I will run scap at the end [23:25:22] Yeah I hear [23:25:26] OK [23:25:30] I wonder how many log entries have been missed this year due to morebots. [23:25:32] MaxSem: You're up, go [23:25:48] RoanKattouw: it has a persistent bug, most likely an aggressive tcp idle timeout at rackspace that the bot doesn't know to recover from or avoid via a keepalive. [23:26:29] ecma? [23:26:35] JavaScript. 
[23:26:38] I always think of OOXML [23:26:43] Can anyone else restart grrrit-wm? [23:27:01] I don't know, honestly [23:27:04] (who can) [23:27:11] Sigh. [23:27:28] time to move it to labs? [23:27:40] I think it's already on Labs. [23:27:42] eh, I mean morebots [23:27:49] Yeah, that should be moved. [23:27:59] anywho, MaxSem, you doing your thing? [23:28:06] yup [23:28:09] cool [23:28:12] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/76001/' [23:28:17] ah, there [23:28:20] It should be moved but it should also be fixed. [23:28:42] greg-g, bsitu - I'm done [23:28:46] RoanKattouw: thanks again for that [23:29:16] bsitu: go forth and scap, I suppose, in under the wire [23:29:20] :) [23:29:27] greg-g: thx [23:30:44] ohai morebots [23:30:56] https://bugzilla.wikimedia.org/show_bug.cgi?id=52067 [23:32:35] https://bugzilla.wikimedia.org/show_bug.cgi?id=52068 [23:33:30] :) [23:33:31] morebots: we need to talk. [23:33:31] I am a logbot running on wikitech-static. [23:33:31] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [23:33:31] To log a message, type !log . [23:33:53] why are the bots talking to each other?! [23:33:56] ugh. forget it. 
[23:33:59] :P [23:35:11] https://bugzilla.wikimedia.org/show_bug.cgi?id=52069 [23:35:37] +1 [23:35:59] thanks for that, Elsie [23:36:33] restarting grrit-wm [23:36:40] or however the hell it's spelled [23:36:51] grrrit-wm: hi [23:36:51] heh [23:37:00] (PS1) Lcarr: including monitoring class [operations/puppet] - https://gerrit.wikimedia.org/r/76028 [23:37:02] (CR) Lcarr: [C: 2] including monitoring class [operations/puppet] - https://gerrit.wikimedia.org/r/76028 (owner: Lcarr) [23:37:09] !log tools added myself to lolrrit-wm tool [23:37:15] haha bad grrrrit [23:37:19] Logged the message, Master [23:37:19] (CR) Bsitu: [C: 2] Set $wgAllowHTMLEmail default to true [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75787 (owner: Bsitu) [23:37:20] (Merged) jenkins-bot: Set $wgAllowHTMLEmail default to true [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75787 (owner: Bsitu) [23:37:21] (CR) Bsitu: [C: 2] Preparation of Echo and Thanks for metawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75795 (owner: Bsitu) [23:37:22] (Merged) jenkins-bot: Preparation of Echo and Thanks for metawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/75795 (owner: Bsitu) [23:37:29] haha [23:37:35] (CR) MaxSem: [C: 2] Disable mobile upload CTAs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76001 (owner: MaxSem) [23:37:36] (Merged) jenkins-bot: Disable mobile upload CTAs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76001 (owner: MaxSem) [23:37:45] oh crap. I logged in the wrong channel [23:37:48] can haz grrrit.wikimedia.org alias, plz? [23:38:01] greg-g: :D [23:38:23] I like the new login page for wikitech [23:38:26] it's way cleaner [23:38:47] bsitu: done yet? [23:39:02] create link is gone :( [23:39:02] nope [23:39:02] k [23:39:12] is it better design for create to be within login? 
[23:39:25] create page is about 982473948729837423 times better [23:39:27] https://wikitech.wikimedia.org/w/index.php?title=Special:UserLogin&type=signup&returnto=Main+Page [23:39:43] Create page? [23:39:48] create account [23:39:51] Right. [23:39:59] The token field looks drunk. [23:40:03] yes [23:40:04] it does [23:40:16] it must not be in the format the javascript expects [23:41:06] it's not wrapped in a div [23:43:24] yep. fixed. [23:43:28] now to push in a change [23:48:04] scapping a second [23:48:31] a second?, *in* a second? *for* a second (time)? [23:49:24] oops, should be in a second [23:49:25] :) [23:49:30] :) [23:49:59] well, with that, I'm going to go home. don't break anything [23:50:48] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [23:50:48] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [23:50:48] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [23:50:48] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [23:50:48] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [23:50:49] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [23:51:56] greg-g: hopefully not, sorry about using scap in a lighting window, it's not in my plan, but it looks like the auto-deploy doesn't update the message cache [23:52:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:52:38] bsitu: yeah :/ [23:53:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [23:59:17] !log bsitu Started syncing Wikimedia installation... : IE ajax browser cache fix for 1.22wmf11 and scap to update message cache for 1.22wmf12 [23:59:27] Logged the message, Master