[00:00:09] ekrem is a Wikimedia MediaWiki RC to IRC relay (misc::mediawiki-irc-relay).
[00:00:29] 54 class misc::mediawiki-irc-relay {
[00:00:38] 63 file { "/usr/local/bin/udpmxircecho.py":
[00:00:47] content => template("misc/udpmxircecho.py.erb"),
[00:01:01] (03PS1) 10Ryan Lane: Add GCM cipher and remove DES [operations/puppet] - 10https://gerrit.wikimedia.org/r/83043
[00:01:01] yep
[00:01:34] so, operations/puppet in puppet/templates/misc/udpmxircecho.py.erb
[00:01:39] ok, patch coming
[00:02:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:03:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[00:04:56] http://www.pcworld.com/article/2048222/report-nsa-defeats-many-encryption-efforts.html#tk.fb-pc
[00:04:57] :(
[00:05:53] Bsadowski1: there's relatively little technical info there
[00:06:04] Bsadowski1: it's unlikely that 2048 RSA is a problem
[00:06:18] (We hope. Maybe.)
[00:06:24] 1024 on the other hand....
[00:06:27] It's a multi-pronged approach.
[00:06:47] (03PS1) 10Ori.livneh: Left-trim whitespace from IRC messages [operations/puppet] - 10https://gerrit.wikimedia.org/r/83044
[00:06:49] Super-computers, stealing keys, manipulating development of protocols...
[00:06:49] Elsie: the US gov uses 2048 keys
[00:07:00] Why would it need to spy on itself? :-)
[00:07:01] Elsie: the commit message answers your question partly
[00:07:14] Elsie: if they can break 2048 then other governments can too
[00:07:25] so they'd be recommending higher
[00:07:33] s/recommending/requiring/
[00:07:46] mutante: any chance you could merge that?
[00:07:46] So we may be safe(r) for now, maybe.
[00:08:00] ori-l: That's a nice commit message.
[00:08:12] "n addition to deploying supercomputers to crack encryption, the NSA has worked with U.S. and foreign technology companies to build entry points into their products, the report said."
[00:08:12] (03CR) 10MZMcBride: "Nice commit message." [operations/puppet] - 10https://gerrit.wikimedia.org/r/83044 (owner: 10Ori.livneh)
[00:08:15] In*
[00:08:31] Bsadowski1: Yes, I think everyone has read the articles by now.
[00:08:37] Okay :)
[00:09:05] (03CR) 10Dzahn: [C: 032] Left-trim whitespace from IRC messages [operations/puppet] - 10https://gerrit.wikimedia.org/r/83044 (owner: 10Ori.livneh)
[00:09:27] thanks mutante
[00:09:50] mutante, ori-l: Thanks! For deployment, do we just wait for Puppet to run?
[00:09:55] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 00:09:45 UTC 2013
[00:09:59] Elsie: on it:)
[00:10:03] ori-l: yw
[00:10:05] :D
[00:10:15] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[00:10:17] there, it made the change
[00:10:23] text = sp[1]
[00:10:23] + text = text.lstrip()
[00:11:34] mutante: it doesn't refresh the service
[00:11:45] it should have a notify => Service['udpmxircecho']
[00:11:56] but "service udpmxircecho restart" should do it
[00:11:58] eh, yea, i was just looking for an init script :p sigh
[00:12:12] What does "mx" stand for?
[00:12:16] udpmxircecho: unrecognized service
[00:12:24] and no init.d
[00:12:32] * ori-l facepalms
[00:12:32] provider => base,
[00:13:16] mutante: hang on
[00:14:01] irc 3142 0.0 0.0 43268 1496 ? S May17 0:00 su -s /bin/bash -c /usr/local/bin//udpmxircecho.py rc-pmtpa ekrem.wikimedia.org irc
[00:14:04] irc 3143 0.0 0.2 41724 10860 ? Sl May17 137:30 python /usr/local/bin//udpmxircecho.py rc-pmtpa ekrem.wikimedia.org
[00:14:13] looks on wikitech
[00:15:07] ori-l: duuh.. i got it. i saw this before
[00:15:14] /usr/local/bin/start-ircbot
[00:15:15] PROBLEM - MySQL Processlist on db1052 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 86 statistics
[00:15:20] nohup su -s /bin/bash -c "$bindir/udpmxircecho.py rc-pmtpa ekrem.wikimedia.org" irc
[00:15:44] https://wikitech.wikimedia.org/wiki/IRC#Starting_the_bot
[00:15:57] I was staring at that, but didn't register it.
[00:16:03] oh my
[00:16:09] !log restarting irc bot on ekrem
[00:16:12] I was about to submit a patch to making that subscribe to the Python script
[00:16:12] Logged the message, Master
[00:16:15] RECOVERY - MySQL Processlist on db1052 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 4 statistics
[00:16:22] The original format is back. :-)
[00:16:25] but I'm not futzing with that
[00:16:27] does it work?
[00:17:02] Elsie: so, can you confirm the fix?
[00:17:14] I had to restart my bot.
[00:17:18] It's coming back up now.
[00:17:24] I believe it's fixed.
[00:17:32] cool
[00:17:48] woot. mutante, Reedy, Ryan_Lane -- thanks!
[00:18:06] thanks for the fix!
[00:18:11] :)
[00:19:10] it works, it can spam the other channel again :) ttyl
[00:19:15] PROBLEM - MySQL Processlist on db1052 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 87 statistics
[00:19:21] Thanks again.
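The merged change (https://gerrit.wikimedia.org/r/83044) adds `text = text.lstrip()` after `text = sp[1]` in udpmxircecho.py. A minimal sketch of that fix follows; the tab-separated "channel, message" wire format and the `handle_datagram` name are assumptions for illustration, not the actual relay's parsing code — only the `lstrip()` call is from the patch:

```python
def handle_datagram(data):
    """Split one relay datagram into (channel, message).

    Hypothetical wire format: "<channel>\t<message>". The real
    udpmxircecho.py may parse differently; the relevant part is the
    left-trim of the message text before it is echoed to IRC.
    """
    sp = data.split("\t", 1)
    channel = sp[0]
    text = sp[1]
    text = text.lstrip()  # the merged fix: strip leading whitespace only
    return channel, text
```

Note that `lstrip()` with no argument removes any leading whitespace but leaves trailing whitespace and interior spacing untouched, which is why the channel saw "the original format is back" rather than a fully normalized message.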
[00:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[00:24:05] ori-l: yw
[00:31:15] RECOVERY - MySQL Processlist on db1052 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[00:31:56] (03PS1) 10Ryan Lane: WORK IN PROGRESS: Simplify git-deploy configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/83046
[00:37:19] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[00:40:29] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:45:19] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[00:48:29] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:57:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:59:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[01:12:50] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[01:14:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:15:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.473 second response time
[01:29:50] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 10 hours
[01:38:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:40:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.252 second response time
[01:40:50] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 01:40:44 UTC 2013
[01:41:50] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[02:02:19] !log LocalisationUpdate failed: git pull of extensions failed
[02:02:25] Logged the message, Master
[02:04:00] uh-oh
[02:05:12] Another manganese hardcoding perhaps?
[02:05:23] ! [remote rejected] HEAD -> refs/publish/master/newFromDocumentInsertion (n/a (unpacker error))
[02:05:25] rawr
[02:05:30] error: unpack failed: error Missing unknown 74b6ecd291e1e72df8be62aee6e0e8a0a4b7083a
[02:06:43] reedy@tin:/var/lib/l10nupdate/mediawiki/extensions$ sudo -u l10nupdate git submodule foreach pull
[02:06:44] Entering 'AJAXPoll'
[02:06:44] /usr/lib/git-core/git-submodule: 1: eval: pull: not found
[02:06:44] Stopping at 'AJAXPoll'; script returned non-zero status.
[02:10:30] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 02:10:23 UTC 2013
[02:10:50] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[02:11:26] ...
[02:33:20] !log LocalisationUpdate failed: git pull of extensions failed
[02:33:41] uhoh.
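The "eval: pull: not found" error above comes from `git submodule foreach` evaluating its argument as a shell command: bare `pull` is not a command, so each submodule fails; the argument needs to be `git pull`. A small sketch of the corrected invocation; the helper name and the list-of-arguments shape are illustrative, not from the log:

```python
def submodule_pull_cmd(extra=""):
    """Build the corrected `git submodule foreach` command.

    `foreach` hands its argument to the shell, so the command must be
    the full "git pull", not the bare subcommand "pull" that produced
    "/usr/lib/git-core/git-submodule: 1: eval: pull: not found".
    """
    shell_cmd = ("git pull " + extra).strip()
    return ["git", "submodule", "foreach", shell_cmd]
```

Running the returned list through `subprocess.run` (or joining it for a shell) would visit each submodule and pull, stopping at the first failure just as the log shows.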
[02:34:53] that was me running it a second time after having updated most of the repos
[02:35:14] Needless to say it's not happy
[02:35:18] reedy, u needz to sleep
[02:39:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:40:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[02:52:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:53:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[03:12:20] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[03:13:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:18:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[03:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:24:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.704 second response time
[03:28:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:31:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.173 second response time
[03:35:10] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours
[03:40:20] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 03:40:10 UTC 2013
[03:40:20] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[03:44:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:45:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.280 second response time
[03:49:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:50:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[03:57:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:58:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[04:07:02] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[04:10:23] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 04:10:20 UTC 2013
[04:11:02] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[04:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:24:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.143 second response time
[04:38:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:39:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.080 second response time
[04:40:22] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 04:40:16 UTC 2013
[04:41:02] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[04:49:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:50:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[05:10:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:10:55] PROBLEM - Puppet freshness on sq42 is CRITICAL: No successful Puppet run in the last 10 hours
[05:11:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[05:12:37] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[05:16:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:17:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[05:22:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:24:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[05:28:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:33:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time
[05:38:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:38:55] (03PS1) 10Ori.livneh: StatsD Ganglia backend: document config options [operations/puppet] - 10https://gerrit.wikimedia.org/r/83066
[05:39:57] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 05:39:52 UTC 2013
[05:40:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[05:40:38] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[05:52:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:53:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[06:14:51] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[06:34:48] paravoid: can you check something on the videoscalers for me? since you updated ffmpeg new uploads fail with exit code 1, not clear to me why, could you login and just run avconv to see if that gives any errors
[06:38:05] paravoid: most likely related to bug 53800
[06:39:51] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 06:39:44 UTC 2013
[06:50:23] (03PS1) 10J: restart mw-cgroup on cgconfig restart [operations/puppet] - 10https://gerrit.wikimedia.org/r/83067
[06:57:32] if paravoid is not around, anyone else that can login to tmh1001.eqiad.wmnet and tmh1002.eqiad.wmnet and run sudo service mw-cgroup restart
[07:00:38] on tmh1001 this is done
[07:00:48] on tmh1002 there is a problem with the restart
[07:04:24] apergos: can you try stop and start in that case?
[07:06:11] that's not the issue.. stopping fails, starting fails
[07:06:23] stopping fails because it's not running
[07:06:30] I don't know why starting fails yet
[07:07:37] prestart process terminated with status 1
[07:07:40] is cgconfig running?
[07:07:59] good point, I didn't check that
[07:08:04] the init script sets up /sys/fs/cgroup/memory/mediawiki/
[07:08:29] if it exists and /sys/fs/cgroup/memory/mediawiki/jobs exists too might also be ok as is
[07:10:25] they did exist earlier, I shouldn't have touched anything
[07:10:41] after attempting to restart cgconfig they don't
[07:10:44] figures
[07:10:44] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[07:12:42] apergos: after restarting cgconfig you have to restart mw-cgroup too (until https://gerrit.wikimedia.org/r/#/c/83067/ is merged and this appens automatically)
[07:14:44] yes but right now we have errors from cgroup trying to run
[07:15:55] I did the steps in the cgroup init script by hand
[07:16:11] the directory now seems to be there, let's see if I can get mw-cgroup to run now
[07:17:25] the directories are there now
[07:17:28] let's call that ok
[07:18:24] do you have a way to check that jobs are running properly on tmh1002? or is there something I can look at?
[07:19:06] ps ax | grep avconv
[07:19:32] looking at http://ganglia.wikimedia.org/latest/?c=Video%20scalers%20eqiad&h=tmh1002.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[07:19:37] there is some cpu usage now
[07:19:58] I don't see any of those but there were some ffmpeg2theora
[07:20:22] sounds good too
[07:21:19] thanks for fixing, will reset some failed transcodes and let you know if there are still issues
[07:22:18] ok
[07:22:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[07:23:29] apergos: do you know how to delete an image file from storage? a user is asking on #wikimedia-tech
[07:23:35] (sorry to pull you into yet another issue)
[07:23:55] lemme read over there
[08:07:00] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[08:09:40] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 08:09:35 UTC 2013
[08:10:00] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[08:13:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:14:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[08:22:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:24:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[08:34:35] (03PS1) 10ArielGlenn: fix up rsync of pageviews etc bewtween dataset hosts not owned by backup user [operations/puppet] - 10https://gerrit.wikimedia.org/r/83070
[08:35:44] (03CR) 10ArielGlenn: [C: 032] fix up rsync of pageviews etc bewtween dataset hosts not owned by backup user [operations/puppet] - 10https://gerrit.wikimedia.org/r/83070 (owner: 10ArielGlenn)
[08:35:57] (03PS1) 10Akosiaris: Create a definition to facilidate backup sets [operations/puppet] - 10https://gerrit.wikimedia.org/r/83071
[08:44:11] (03CR) 10Akosiaris: [C: 032] Create a definition to facilidate backup sets [operations/puppet] - 10https://gerrit.wikimedia.org/r/83071 (owner: 10Akosiaris)
[08:49:21] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Fri Sep 6 08:49:13 UTC 2013
[08:53:21] yay
[08:58:51] :-)
[08:59:37] do you know about stafford not running puppet? I was assuming that was disabled cause of the puppetmaster package issues
[08:59:53] if that's not you I'll send email to the list
[09:03:51] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: No successful Puppet run in the last 10 hours
[09:06:55] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[09:08:00] it is because of that
[09:08:55] it turns out it can not be solved by specifying puppetmaster versions
[09:09:13] there are dependencies between puppetmaster and puppet package
[09:09:34] and ensure => latest and blah blah... still searching how to solve it.
[09:09:42] that and the bug
[09:10:05] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 09:09:56 UTC 2013
[09:10:55] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[09:11:02] ok
[09:11:34] maybe I'll ask people to start !log ging when they disable puppet on a host so we have a record (and re-enable too)
[09:11:39] thanks
[09:22:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:23:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[09:36:52] (03PS1) 10TTO: Remove hardcoded accountcreator right for MSU proteins lab [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83075
[09:40:05] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 09:40:01 UTC 2013
[09:59:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:00:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.163 second response time
[10:09:40] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[10:09:50] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 10:09:49 UTC 2013
[10:10:40] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[10:27:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:29:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[10:40:10] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 10:40:06 UTC 2013
[11:09:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[11:09:54] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 11:09:44 UTC 2013
[11:10:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[11:22:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:23:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.229 second response time
[11:30:04] PROBLEM - Puppet freshness on stafford is CRITICAL: No successful Puppet run in the last 10 hours
[11:33:44] (03PS1) 10Akosiaris: Revert commits 38717f3 4f44a53 [operations/puppet] - 10https://gerrit.wikimedia.org/r/83084
[11:34:39] (03CR) 10Akosiaris: [C: 032] Revert commits 38717f3 4f44a53 [operations/puppet] - 10https://gerrit.wikimedia.org/r/83084 (owner: 10Akosiaris)
[11:38:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:39:25] RECOVERY - Puppet freshness on stafford is OK: puppet ran at Fri Sep 6 11:39:22 UTC 2013
[11:39:31] apergos: fixed that for ya ^
[11:39:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.896 second response time
[11:40:34] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 11:40:23 UTC 2013
[11:40:44] PROBLEM - DPKG on stafford is CRITICAL: DPKG CRITICAL dpkg reports broken packages
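The mw-cgroup troubleshooting earlier in the log came down to two missing directories that the init script's prestart normally creates: /sys/fs/cgroup/memory/mediawiki/ and its jobs/ subdirectory, which vanish when cgconfig is restarted. A sketch of "doing the steps by hand" as a function; the paths are from the log, while the function name and configurable base (so it can be exercised outside a real cgroup filesystem) are illustrative:

```python
import os

def ensure_mediawiki_cgroups(base="/sys/fs/cgroup/memory"):
    """Recreate the cgroup directories mw-cgroup's prestart expects.

    Restarting cgconfig remounts the hierarchy and drops these
    directories, so mw-cgroup must be restarted (or the dirs recreated)
    afterwards -- the dependency https://gerrit.wikimedia.org/r/83067
    automates exactly this.
    """
    for sub in ("mediawiki", os.path.join("mediawiki", "jobs")):
        os.makedirs(os.path.join(base, sub), exist_ok=True)
```

In production this must run as root against the real /sys/fs/cgroup/memory mount; the Puppet change makes the mw-cgroup service restart subscribe to cgconfig restarts so the directories are never left missing.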
[11:40:54] got to go. I am best man in wedding a couple of hours from now
[11:41:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[11:46:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:00:31] and he's gone, I'll thank him later :-)
[12:03:14] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:08:04] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 3 copy to table, 4 statistics
[12:13:34] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 12:13:31 UTC 2013
[12:14:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[12:17:38] Monday. that's a few days ;-)
[12:31:10] wiki dooooooooooooown
[12:31:22] now up
[12:31:28] Vito?
[12:31:34] what kind of information is that...
[12:31:42] ~5 minutes down from Italy
[12:43:04] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 12:42:57 UTC 2013
[12:44:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[12:45:19] (03PS1) 10saper: Install localized v3 logo for plwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83086
[12:47:23] (03PS2) 10saper: Install localized v3 logo for plwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83086
[12:56:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.080 second response time
[12:59:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:08:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.839 second response time
[13:12:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:12:35] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 13:12:00 UTC 2013
[13:13:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[13:27:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.481 second response time
[13:32:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:34:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.738 second response time
[13:36:54] (03CR) 10Faidon Liambotis: [C: 032] restart mw-cgroup on cgconfig restart [operations/puppet] - 10https://gerrit.wikimedia.org/r/83067 (owner: 10J)
[13:42:04] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 13:41:59 UTC 2013
[13:43:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[13:52:45] yooo morning paravoid!
[13:52:52] (/ afternoon to you!)
[13:53:45] here is a quick and easy q I want a second opinion on:
[13:54:04] should the analytics/eqiad kafka cluster be called 'kafka-eqiad', 'kafka-analytics'
[13:54:06] somethign else maybe?
[13:54:11] this is mainly for addressing the cluster in zookeeper
[13:55:53] /kafka-analytics
[13:55:54] /kafka-eqiad
[13:55:54] /kafka/analytics
[13:55:54] /kafka/eqiad
[13:55:54] hm
[13:57:09] kafka/eqiad sounds fine to me
[13:57:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.251 second response time
[13:57:47] k cool, i think so too
[13:58:02] while we are still kinda testing things, i might want to version the znodes name
[13:58:09] deleting topics is a mess
[13:58:22] its easier to delete the whole znode and start a new if you want a clean setup
[13:58:23] but ja
[13:58:28] ok, next q!
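The naming question above is about Kafka's ZooKeeper chroot: the znode path chosen (/kafka/eqiad) becomes a suffix on the ZooKeeper connect string, which is how one ZooKeeper ensemble can host several Kafka clusters under distinct paths. A small sketch of assembling that connect string; the helper name and host names are illustrative:

```python
def zk_connect_string(hosts, chroot="/kafka/eqiad"):
    """Build a Kafka zookeeper.connect value with a chroot suffix.

    Kafka appends the chroot once, after the host list, e.g.
    "zk1:2181,zk2:2181/kafka/eqiad" -- not per host. Deleting the
    whole chroot znode then gives a clean slate for a test cluster,
    as discussed in the log.
    """
    return ",".join(hosts) + chroot
```

With this scheme, switching an experimental cluster to a fresh namespace is just a chroot change (e.g. "/kafka/eqiad-v2"), which is why versioning the znode name was floated while topic deletion was still messy.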
[13:58:29] :)
[13:58:43] the varnishkafka.default file in the debian/ doesn't seem to go with varnishkafka
[13:59:06] the DAEMON_OPTS flags aren't opts to varnishkafka
[13:59:25] yeah, the changeset said init script is completely untested
[13:59:29] ahhh ok
[13:59:30] ha
[13:59:35] ok i'll submit some patches then
[13:59:36] the commit message I mean
[13:59:51] so, it seems that start-stop-daemon isn't writing a pidfile like I think it should
[13:59:56] varnishkafka doesn't do a pid file
[13:59:58] that's why I didn't merge it initially :)
[14:00:01] should start-stop-daemon do this on its own?
[14:00:20] if not, shoudl I just make it use the killall-ish functionality instead of the pid file?
[14:01:42] the daemon should do this
[14:01:43] if it doesn't, start-stop-daemon can
[14:01:43] but this means that the daemon shouldn't daemonize (or it'll change the pid and start-stop-daemon will lose it)
[14:01:43] and start-stop-daemon should also do the daemonization
[14:01:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:01:46] so it gets hairy pretty quickly
[14:01:47] no, definitely not
[14:01:49] ok cool
[14:01:52] that is configurable
[14:02:10] varnishkafka doesn't have to daemonize itself
[14:02:15] k i'll look into that danke
[14:02:18] --make-pidfile is okay
[14:04:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.164 second response time
[14:07:22] (03PS1) 10Cmjohnson: siebrand access to stat1 RT5726 [operations/puppet] - 10https://gerrit.wikimedia.org/r/83092
[14:07:43] (03CR) 10jenkins-bot: [V: 04-1] siebrand access to stat1 RT5726 [operations/puppet] - 10https://gerrit.wikimedia.org/r/83092 (owner: 10Cmjohnson)
[14:08:46] cmjohnson1: It is whining about the trailing whitespace? :)
[14:09:26] (03PS1) 10Ottomata: Naming eqiad Kafka cluster /kafka/eqiad in zookeeper [operations/puppet] - 10https://gerrit.wikimedia.org/r/83093
[14:09:33] grr
[14:10:34] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 14:10:31 UTC 2013
[14:10:37] (03CR) 10Ottomata: [C: 032 V: 032] Naming eqiad Kafka cluster /kafka/eqiad in zookeeper [operations/puppet] - 10https://gerrit.wikimedia.org/r/83093 (owner: 10Ottomata)
[14:10:45] (03PS2) 10Cmjohnson: siebrand access to stat1 RT5726 [operations/puppet] - 10https://gerrit.wikimedia.org/r/83092
[14:11:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours
[14:11:06] (03CR) 10jenkins-bot: [V: 04-1] siebrand access to stat1 RT5726 [operations/puppet] - 10https://gerrit.wikimedia.org/r/83092 (owner: 10Cmjohnson)
[14:14:43] (03PS3) 10Cmjohnson: siebrand access to stat1 RT5726 [operations/puppet] - 10https://gerrit.wikimedia.org/r/83092
[14:15:01] (03CR) 10jenkins-bot: [V: 04-1] siebrand access to stat1 RT5726 [operations/puppet] - 10https://gerrit.wikimedia.org/r/83092 (owner: 10Cmjohnson)
[14:15:50] * siebrand grins at cmjohnson1. Now it's one comma too many ;)
[14:16:13] not enough coffee this morning
[14:16:21] * siebrand laughs.
[14:17:53] siebrand: remove the comma after you ...correct?
[14:18:27] cmjohnson1: https://integration.wikimedia.org/ci/job/operations-puppet-validate/6970/console should contain what's wrong, but it looks like it may be the comma to me.
[14:18:39] 14:15:00 err: Could not parse for environment production: Syntax error at '{'; expected '}' at /srv/ssd/jenkins-slave/workspace/operations-puppet-validate/manifests/site.pp:2618
[14:18:42] i see that ..but it's the wrong line
[14:18:48] 2618 wasn't touched
[14:19:03] has to be the commma
[14:19:17] (03PS4) 10Cmjohnson: siebrand access to stat1 RT5726 [operations/puppet] - 10https://gerrit.wikimedia.org/r/83092
[14:20:11] * siebrand cheers cmjohnson1 on!
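The pidfile discussion above hinges on one rule: whoever knows the final PID must record it. If varnishkafka daemonized itself, the PID start-stop-daemon saw at launch would be stale after the fork, which is why the advice is to keep the process in the foreground and either let start-stop-daemon handle backgrounding plus --make-pidfile, or have the daemon write its own pidfile. A sketch of the latter; the function name and path are illustrative:

```python
import os

def write_pidfile(path):
    """Record the current process's PID, newline-terminated.

    Only valid if this process does not fork afterwards -- a later
    daemonizing fork would change the PID and make the file stale,
    the exact trap discussed for start-stop-daemon above.
    """
    with open(path, "w") as f:
        f.write("%d\n" % os.getpid())
    return path
```

The alternative mentioned in the log ("killall-ish" matching by process name) avoids the pidfile but can kill unrelated processes with the same name, so a pidfile written by a non-forking process is the safer convention.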
[14:21:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:22:57] (03CR) 10Cmjohnson: [C: 032 V: 032] siebrand access to stat1 RT5726 [operations/puppet] - 10https://gerrit.wikimedia.org/r/83092 (owner: 10Cmjohnson) [14:23:39] (03PS1) 10Ottomata: Putting kafka log data dir in a subdir of the mount points. [operations/puppet] - 10https://gerrit.wikimedia.org/r/83094 [14:27:36] siebrand: merged [14:27:56] cmjohnson1: oh, really? I expected to have to wait for 3 days. [14:28:39] you had approval so i merged [14:29:24] cmjohnson1: I'm not certain that's correct. Or did Ken also approve? The procedure states that it needs approval from line manager (Howie Fung for me), and director of ops, and a three day waiting period. [14:29:57] (03CR) 10Faidon Liambotis: [C: 032] StatsD Ganglia backend: document config options [operations/puppet] - 10https://gerrit.wikimedia.org/r/83066 (owner: 10Ori.livneh) [14:30:28] that is the requirement so i better fix that [14:30:33] cmjohnson1: Don't want to get you into trouble, or have to wait longer than needed; just pointing out a deviation from the expectation that was raised on https://wikitech.wikimedia.org/wiki/Requesting_shell_access [14:31:03] cmjohnson1: anyway, thanks a bunch. [14:31:29] cmjohnson1: one tip on Gerrit: You only have to Code-Review+2, and don't need to Verified+2. [14:31:47] cool...thx for that [14:31:48] cmjohnson1: If you code review +2, Jenkins will merge the patch set once all verification jobs succeed. [14:32:05] cmjohnson1: See https://gerrit.wikimedia.org/r/#/c/83084/ for an example in the puppet repo. [14:32:05] ah...that is better [14:35:24] (03PS2) 10Ottomata: Putting kafka log data dir in a subdir of the mount points. [operations/puppet] - 10https://gerrit.wikimedia.org/r/83094 [14:36:06] (03CR) 10Ottomata: [C: 032 V: 032] Putting kafka log data dir in a subdir of the mount points. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/83094 (owner: 10Ottomata) [14:36:45] siebrand: i am going to revert the change...will get ken's approval today for you and do it again. [14:36:59] (03CR) 10Faidon Liambotis: [C: 04-1] "(4 comments)" [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 (owner: 10Ottomata) [14:37:15] cmjohnson1: sure. [14:37:42] (03PS1) 10Cmjohnson: Revert "siebrand access to stat1 RT5726 has not been 3 days-need approval" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83095 [14:38:39] (03CR) 10Cmjohnson: [C: 032] Revert "siebrand access to stat1 RT5726 has not been 3 days-need approval" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83095 (owner: 10Cmjohnson) [14:38:54] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:39:54] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [14:41:39] (03CR) 10Ottomata: "(2 comments)" [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 (owner: 10Ottomata) [14:44:04] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 14:43:56 UTC 2013 [14:45:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [14:48:00] (03PS4) 10Ottomata: varnishkafka module. [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 [14:49:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.535 second response time [14:52:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:44] (03PS5) 10Faidon Liambotis: varnishkafka module. 
[operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 (owner: 10Ottomata) [15:06:56] (03CR) 10Faidon Liambotis: [C: 032] Add a varnishkafka module [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 (owner: 10Ottomata) [15:09:27] paravoid: just commit message change? [15:09:33] yes [15:09:35] danke!@ [15:09:38] the trailing dot drove me nuts [15:09:49] hehe [15:09:51] (03CR) 10Faidon Liambotis: [C: 04-1] "(2 comments)" [operations/dns] - 10https://gerrit.wikimedia.org/r/82750 (owner: 10Dzahn) [15:10:44] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 15:10:37 UTC 2013 [15:10:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.906 second response time [15:11:04] PROBLEM - Puppet freshness on sq42 is CRITICAL: No successful Puppet run in the last 10 hours [15:11:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [15:14:58] hey paravoid, another q [15:15:05] i just noticed that in the latest 0.8 branch of kafka [15:15:10] kafka-run-class.sh has been changed a bit [15:15:19] i'm updating the debian/bin//kafka script to match [15:15:30] but, now I also have to update the kafka init.d script to match too [15:15:42] is it so bad for the kafka init script to use the debian/bin/kafka script to start kafka server? [15:15:53] that way I don't have to maintain the changes in more than one place [15:16:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:18:28] where do we install debian/bin/kafka to? [15:18:34] I don't remember :) [15:19:12] ummm /usr/sbin [15:19:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.327 second response time [15:21:56] why wouldn't I like to run something in /usr/sbin from the init script? [15:23:31] is this a quiz?
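[Editor's note: the question above, whether the init script can simply start the daemon through the shared /usr/sbin/kafka wrapper, usually hinges on pid tracking: if the wrapper execs the real process rather than forking it, the daemon inherits the wrapper's pid, so the pidfile start-stop-daemon writes stays correct. A sketch of the difference, illustrative only and with no Kafka involved:]

```shell
# With exec, the child replaces the shell and keeps its pid;
# without exec, the child is a new process with a different pid.
with_exec=$(sh -c 'export PARENT=$$; exec sh -c "echo \$PARENT \$\$"')
without_exec=$(sh -c 'export PARENT=$$; sh -c "echo \$PARENT \$\$"')
echo "with exec:    $with_exec"     # same pid printed twice
echo "without exec: $without_exec"  # two different pids
```

This is why a wrapper invoked from an init script should end in `exec java ... kafka.Kafka "$@"` rather than plain `java ...`: the shell doesn't linger, and the tracked pid is the broker's.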
[15:23:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:59] :) [15:24:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.762 second response time [15:25:35] lol [15:25:41] it's fine [15:25:52] haha, as in it is ok to do so? [15:26:00] unless I misunderstand something, yes [15:26:02] ok cool [15:26:10] did I ever object to this? [15:27:24] naw i don't think so, i think alex wrote the init script, i remember you partially objecting to keeping the server-start command in the /usr/sbin/kafka script [15:27:37] (03CR) 10Faidon Liambotis: [C: 04-1] "I don't like the extra configuration class. Just use a variable depending on $::realm in the main class, IMHO." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81905 (owner: 10Hashar) [15:31:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:38:49] (03CR) 10Jgreen: [C: 032 V: 031] try to workaround otrs missing log object issue [operations/puppet] - 10https://gerrit.wikimedia.org/r/83100 (owner: 10Jgreen) [15:43:04] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 15:43:00 UTC 2013 [15:44:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [15:44:22] (03PS1) 10Andrew Bogott: Revert "Use %transient-key when generating our apt-signing key" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83101 [15:44:23] (03PS1) 10Andrew Bogott: Revert "Lots of ruckus to get our apt repo 'signed'." [operations/puppet] - 10https://gerrit.wikimedia.org/r/83102 [15:44:24] (03PS1) 10Andrew Bogott: Mark our project-local apt repo as trusted.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/83103 [15:45:51] (03CR) 10Andrew Bogott: [C: 032] Revert "Use %transient-key when generating our apt-signing key" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83101 (owner: 10Andrew Bogott) [15:46:03] (03CR) 10Andrew Bogott: [C: 032] Revert "Lots of ruckus to get our apt repo 'signed'." [operations/puppet] - 10https://gerrit.wikimedia.org/r/83102 (owner: 10Andrew Bogott) [15:46:23] (03CR) 10Andrew Bogott: [C: 032] Mark our project-local apt repo as trusted. [operations/puppet] - 10https://gerrit.wikimedia.org/r/83103 (owner: 10Andrew Bogott) [16:07:43] (03PS1) 10coren: Tool Labs: opencv (and python bindings to same) [operations/puppet] - 10https://gerrit.wikimedia.org/r/83105 [16:08:17] hm, paravoid, is this a problem then: [16:08:21] (03CR) 10coren: [C: 032] "Trivial package addition." [operations/puppet] - 10https://gerrit.wikimedia.org/r/83105 (owner: 10coren) [16:08:28] if start-stop-daemon backgrounds /usr/sbin/kafka [16:08:47] and /usr/sbin/kafka actually runs the 'java kafka.Kafka' broker process [16:08:57] will the pidfile that start-stop-daemon creates not quite be correct? [16:09:12] it'll be the pid of the /usr/sbin/kafka script, right? [16:09:35] ottomata2: Depends on whether /usr/sbin/kafka just starts java or execs to it. [16:10:11] just starts it [16:10:18] ah but it could exec to it [16:10:21] hm, ja [16:10:42] When possible, it's always better to exec to it anyways. Don't keep the shell around needlessly. [16:11:14] yeah [16:12:53] is someone around who can build debian packages... 
I'd like to get operations/debs/LaTeXML to apt.wikimedia.org [16:13:24] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 16:13:18 UTC 2013 [16:14:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [16:26:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.581 second response time [16:29:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:30:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.738 second response time [16:40:34] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 16:40:26 UTC 2013 [16:41:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [16:41:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.344 second response time [16:57:28] is someone around who can build debian packages... 
I'd like to get operations/debs/LaTeXML to apt.wikimedia.org [17:10:14] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 17:10:10 UTC 2013 [17:11:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [17:11:47] (03PS1) 10Cmjohnson: removing decom srv281 and sq36 from dns [operations/dns] - 10https://gerrit.wikimedia.org/r/83109 [17:13:57] (03CR) 10Cmjohnson: [C: 032] removing decom srv281 and sq36 from dns [operations/dns] - 10https://gerrit.wikimedia.org/r/83109 (owner: 10Cmjohnson) [17:16:33] (03PS1) 10Cmjohnson: adding sibrand to admins and site for stat1 access RT5726 [operations/puppet] - 10https://gerrit.wikimedia.org/r/83110 [17:21:49] (03CR) 10Reedy: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/83110 (owner: 10Cmjohnson) [17:25:58] reedy: my poor spelling :-P [17:26:22] (03PS2) 10Reedy: adding sibrand to admins and site for stat1 access RT5726 [operations/puppet] - 10https://gerrit.wikimedia.org/r/83110 (owner: 10Cmjohnson) [17:26:29] Stupid bot [17:35:17] (03CR) 10Edenhill: [C: 031] Add a varnishkafka module [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/82885 (owner: 10Ottomata) [17:40:04] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 17:40:00 UTC 2013 [17:41:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [17:43:11] (03PS1) 10Andrew Bogott: Move a bunch of module-specific files into the base module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/83111 [17:43:52] (03CR) 10jenkins-bot: [V: 04-1] Move a bunch of module-specific files into the base module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/83111 (owner: 10Andrew Bogott) [17:50:27] (03PS2) 10Andrew Bogott: Move a bunch of module-specific files into the base module. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/83111 [18:02:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:08] is someone around who can build debian packages... I'd like to get operations/debs/LaTeXML to apt.wikimedia.org [18:03:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [18:08:33] (03CR) 10Physikerwelt: "The debian package is available now." [operations/puppet] - 10https://gerrit.wikimedia.org/r/61767 (owner: 10Physikerwelt) [18:09:44] RECOVERY - Puppet freshness on manganese is OK: puppet ran at Fri Sep 6 18:09:43 UTC 2013 [18:10:04] PROBLEM - Puppet freshness on manganese is CRITICAL: No successful Puppet run in the last 10 hours [18:12:08] <^d> !log puppet & gerrit service stopped on manganese so nobody hits them anymore [18:12:11] Logged the message, Master [18:27:42] yay (for logging that you are disabling puppet) [18:32:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:34:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.602 second response time [18:34:40] !log updated Parsoid to ef1776732 [18:34:43] Logged the message, Master [19:03:55] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: No successful Puppet run in the last 10 hours [19:04:48] (03PS1) 10Andrew Bogott: Move a bunch of templates into the 'base' module. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/83120 [19:07:32] (03PS3) 10Cmjohnson: adding sq36 to decom list [operations/puppet] - 10https://gerrit.wikimedia.org/r/82884 [19:08:27] (03Abandoned) 10Cmjohnson: adding sq36 to decom list [operations/puppet] - 10https://gerrit.wikimedia.org/r/82884 (owner: 10Cmjohnson) [19:17:28] (03CR) 10CSteipp: [C: 031] Add GCM cipher and remove DES [operations/puppet] - 10https://gerrit.wikimedia.org/r/83043 (owner: 10Ryan Lane) [19:24:53] paravoid: I've been saddled with the onus of building an up-to-date phantomjs .deb for WMF [19:25:15] paravoid: can you point me to an outline of the process I should follow? [19:25:44] paravoid: I guess I start by creating a repo under operations/debs? [19:27:59] (03CR) 10Andrew Bogott: [C: 032] Move a bunch of module-specific files into the base module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/83111 (owner: 10Andrew Bogott) [19:33:16] (03PS2) 10Hashar: tweak memcached limit on beta (89GB -> 15GB) [operations/puppet] - 10https://gerrit.wikimedia.org/r/81905 [19:35:05] (03CR) 10Hashar: "PS2 uses a selector based on $::realm instead of a configuration class." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81905 (owner: 10Hashar) [19:36:59] (03CR) 10Hashar: [C: 031] retab and quoting in blogs.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/83034 (owner: 10Dzahn) [19:39:47] (03PS2) 10Andrew Bogott: Move a bunch of templates into the 'base' module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/83120 [19:41:37] (03CR) 10Andrew Bogott: [C: 032] Move a bunch of templates into the 'base' module. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/83120 (owner: 10Andrew Bogott) [19:48:35] (03PS1) 10Yuvipanda: Rename labsproxy module to dynamicproxy [operations/puppet] - 10https://gerrit.wikimedia.org/r/83127 [19:57:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.335 second response time [20:06:39] !log rebooting virt11 [20:06:43] Logged the message, Master [20:11:34] PROBLEM - SSH on virt11 is CRITICAL: Connection refused [20:12:53] PROBLEM - Disk space on virt11 is CRITICAL: Connection refused by host [20:13:13] PROBLEM - RAID on virt11 is CRITICAL: Connection refused by host [20:13:33] PROBLEM - DPKG on virt11 is CRITICAL: Connection refused by host [20:16:03] PROBLEM - Host virt11 is DOWN: PING CRITICAL - Packet loss = 100% [20:19:33] RECOVERY - SSH on virt11 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [20:19:34] RECOVERY - DPKG on virt11 is OK: All packages OK [20:19:43] RECOVERY - Host virt11 is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [20:19:53] RECOVERY - Disk space on virt11 is OK: DISK OK [20:20:13] RECOVERY - RAID on virt11 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0 [20:20:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.262 second response time [20:29:48] (03PS2) 10Chad: Remove obsolete sudo setup [operations/puppet] - 10https://gerrit.wikimedia.org/r/82763 [20:34:23] PROBLEM - NTP on virt11 is CRITICAL: NTP CRITICAL: Offset unknown [20:39:23] RECOVERY - NTP on virt11 is OK: NTP OK: Offset 0.0006934404373 secs [20:54:17] (03PS1) 10Ori.livneh: statsd-ganglia: node 0.10+ compat; add custom filters feature [operations/puppet] - 10https://gerrit.wikimedia.org/r/83132 [21:03:49] 
Hey. Is anyone with access to the servers available and willing to help me dig on a gateway timeout when running a bot? https://bugzilla.wikimedia.org/show_bug.cgi?id=43046 [21:04:13] I thought it was linked to the page size, but apparently it works for larger pages, but not on some smaller ones [21:06:11] (03PS2) 10Dzahn: add careers and jobs DNS entries as requested in RT #5709 [operations/dns] - 10https://gerrit.wikimedia.org/r/82750 [21:10:58] strainu, 250kb pages are large enough [21:11:12] well, yeah [21:11:32] except now it works for a page 220k long [21:11:39] but doesn't work for a page 184k long [21:12:03] It seems it has something to do with the content [21:12:29] and I was hoping to get a more detailed error from the logs [21:15:32] strainu, can you trigger an error now? [21:16:02] yep, Strainubot will shortly make an edit on ro:Lista_monumentelor_istorice_din_București,_sector_1 [21:16:33] looking at logs, poke me when it fails [21:18:23] MaxSem, pywikipediabot just signaled an error [21:18:53] prolly correlates with "Maximum execution time of 180 seconds exceeded" [21:19:01] (03PS1) 10Dzahn: jobs/careers redirects on wikimedia and wikipedia [operations/apache-config] - 10https://gerrit.wikimedia.org/r/83134 [21:19:12] but the change went through [21:19:23] right? [21:19:32] yup, page got saved but it timed out on subsequent parse [21:19:33] (03CR) 10jenkins-bot: [V: 04-1] jobs/careers redirects on wikimedia and wikipedia [operations/apache-config] - 10https://gerrit.wikimedia.org/r/83134 (owner: 10Dzahn) [21:20:10] the parse is used to regenerate the cache, etc? [21:20:42] the parse is used to make you see the page;) [21:21:13] mmm...not sure I understand? It's an api call, what is it to see?
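[Editor's note: the exchange above captures a classic bot hazard: the edit POST saves the page, then the request times out during the follow-up parse, so the client sees an error for an edit that actually succeeded, and a naive retry would double-post. A hedged sketch of a safe retry policy; the functions below are stand-ins for the real api.php calls (action=edit, prop=revisions), not actual MediaWiki client code:]

```shell
# submit_edit stands in for POSTing action=edit; here it "saves" the page
# to a state file but returns failure, mimicking a gateway timeout that
# arrives after the save has gone through.
submit_edit() {
    echo saved > "$1"
    return 1
}
# edit_was_saved stands in for re-checking the page's latest revision
# before deciding whether a retry is safe.
edit_was_saved() {
    grep -q saved "$1" 2>/dev/null
}

state=$(mktemp)
if submit_edit "$state"; then
    echo "edit confirmed"
elif edit_was_saved "$state"; then
    echo "timed out but saved: do not retry"
else
    echo "edit failed: retry is safe"
fi
```

The point is the middle branch: on timeout, check whether the write landed before re-submitting.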
[21:21:38] because when I edit from the page, I don't get that error [21:21:43] not for this size, anyway [21:21:44] still, to update a shitload of tables, a parse is needed [21:22:01] (03PS2) 10Dzahn: jobs/careers redirects on wikimedia and wikipedia [operations/apache-config] - 10https://gerrit.wikimedia.org/r/83134 [21:22:03] (03CR) 10jenkins-bot: [V: 04-1] jobs/careers redirects on wikimedia and wikipedia [operations/apache-config] - 10https://gerrit.wikimedia.org/r/83134 (owner: 10Dzahn) [21:22:16] that's not a bug [21:22:44] ok, so let's say it's linked to the complexity of the page [21:23:04] that one has lots of images and templates in templates [21:23:30] compared to a larger page, but with fewer complex data (coordinates and images) [21:23:41] but then, why does it work from browser and not from API? [21:23:44] wth, where does that come from now.. without matching section [21:24:14] duh, my bad [21:24:45] (03PS3) 10Dzahn: jobs/careers redirects on wikimedia and wikipedia [operations/apache-config] - 10https://gerrit.wikimedia.org/r/83134 [21:25:59] (03CR) 10Dzahn: "(1 comment)" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/83134 (owner: 10Dzahn) [21:26:46] (03CR) 10Dzahn: "- changed to wikipedia-lb for those in wikipedia" [operations/dns] - 10https://gerrit.wikimedia.org/r/82750 (owner: 10Dzahn) [21:34:34] (03PS1) 10Ottomata: Updating kafka script and init scripts with recent changes in Kafka bin/*.sh scripts from 0.8 branch. [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/83137 [21:54:00] (03CR) 10Dzahn: [C: 032] retab and quoting in blogs.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/83034 (owner: 10Dzahn) [21:57:03] (03CR) 10Dzahn: [C: 032] fixes for wikitravel links and updates.
add a trim() when unserializing API data to fix parsing for a lot of wikis sending whitespace [operations/debs/wikistats] - 10https://gerrit.wikimedia.org/r/82564 (owner: 10Dzahn) [22:01:46] hmm, so back in 2010 there was a bug, https://bugzilla.wikimedia.org/show_bug.cgi?id=23231, that requested we "Set wgBlockDisablesLogin to true on Wikimedia private wikis". It specified particular wikis to enable, but since then we've had new private wikis. I'm opening a bug to request that wgBlockDisablesLogin is enabled for IEGCom wiki, but would it be better just to request it for [22:01:47] all private wikis? [22:02:00] ahhhhh paravoid, i'm struggling with doing something in sh that I can do with bash [22:02:06] Nemo_bis, maybe you'd have an opinion as you were the original bug reporter? ^ [22:02:25] i want to say [22:02:47] if [ -n $var -a $@ does not contain --flag ] [22:03:02] i think in bash I can use [[ $@ != *--flag* ]] [22:03:05] but i can't do that in sh [22:05:37] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:06:27] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [22:18:11] (03CR) 10Jalexander: [C: 031] Install localized v3 logo for plwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83086 (owner: 10saper) [22:19:19] (03CR) 10Dzahn: [C: 031] Install localized v3 logo for plwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83086 (owner: 10saper) [22:20:57] (03CR) 10Dzahn: [C: 032] Install localized v3 logo for plwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83086 (owner: 10saper) [22:21:00] (03PS1) 10Ottomata: Adding environment var ZOOKEEPER_URL.
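[Editor's note: for the sh puzzle above, testing that the argument list does not contain --flag without bash's `[[ "$@" != *--flag* ]]`, the portable idiom is to loop over the arguments (a `case` over "$*" can false-positive on substrings). A sketch; the function name and sample arguments are made up:]

```shell
# POSIX-sh replacement for bash's [[ "$@" != *--flag* ]]:
# iterate over the arguments and look for an exact match.
has_flag() {
    needle=$1; shift
    for arg in "$@"; do
        [ "$arg" = "$needle" ] && return 0
    done
    return 1
}

var="something"
set -- --verbose --out file.txt   # sample argument list
if [ -n "$var" ] && ! has_flag --flag "$@"; then
    echo "var set and --flag absent"
fi
```

Quoting `"$var"` and chaining with `&&` instead of the obsolescent `-a` also avoids the word-splitting pitfalls in the original `[ -n $var -a ... ]` form.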
[operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/83191 [22:21:12] (03Merged) 10jenkins-bot: Install localized v3 logo for plwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83086 (owner: 10saper) [22:24:55] anyone available to talk about setting up a cloak? [22:25:33] cajoel: http://meta.wikimedia.org/wiki/IRC/Cloaks [22:26:08] it's done by freenode people .. you need to go via that request form linked there [22:26:26] I made the diff on our office wiki [22:26:34] looks like I need a user page on the meta wiki instead [22:26:47] you need one of http://meta.wikimedia.org/wiki/IRC/Group_Contacts [22:26:56] unfortunately they don't want tickets [22:27:10] (03PS2) 10Ottomata: Updating kafka script and init scripts with recent changes in Kafka bin/*.sh scripts from 0.8 branch. [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/83137 [22:27:52] (03PS2) 10Ottomata: Adding environment var ZOOKEEPER_URL. [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/83191 [22:29:38] !log dzahn synchronized ./wmf-config/InitialiseSettings.php [22:29:41] !log sync-common-file InitSettings, deploy change 83086 [22:29:41] Logged the message, Master [22:29:44] Logged the message, Master [22:30:14] (03CR) 10Dzahn: "22:29 logmsgbot: dzahn synchronized ./wmf-config/InitialiseSettings.php" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/83086 (owner: 10saper) [22:30:17] (03PS1) 10Ottomata: Updating TODO.md with MirrorMaker item [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/83192 [22:30:31] (03CR) 10Ottomata: [C: 032 V: 032] Updating TODO.md with MirrorMaker item [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/83192 (owner: 10Ottomata) [22:37:53] (03PS1) 10Lcarr: adding in management linkage in ulsfo [operations/dns] - 10https://gerrit.wikimedia.org/r/83193 [22:38:25] (03CR) 10Lcarr: [C: 032] adding in management linkage in ulsfo [operations/dns] - 
10https://gerrit.wikimedia.org/r/83193 (owner: 10Lcarr) [22:42:30] how does one go about creating a new lists.wikimedia.org mailing list? [22:43:50] bugzilla [22:44:44] p858snake|l: thanks - product wikimedia, component mailing lists? [22:45:16] that would be a better one for it yes [23:22:56] (03PS1) 10Ori.livneh: Add Ganglia view for NavigationTiming data [operations/puppet] - 10https://gerrit.wikimedia.org/r/83198 [23:23:11] (03CR) 10jenkins-bot: [V: 04-1] Add Ganglia view for NavigationTiming data [operations/puppet] - 10https://gerrit.wikimedia.org/r/83198 (owner: 10Ori.livneh) [23:24:05] (03PS2) 10Ori.livneh: Add Ganglia view for NavigationTiming data [operations/puppet] - 10https://gerrit.wikimedia.org/r/83198 [23:24:40] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:25:31] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [23:25:52] Ryan_Lane: got a minute for a puppet merge? patch is https://gerrit.wikimedia.org/r/#/c/83198/ , pretty low-risk. [23:28:02] looking [23:28:14] yay [23:29:09] (03CR) 10Ryan Lane: [C: 032] Add Ganglia view for NavigationTiming data [operations/puppet] - 10https://gerrit.wikimedia.org/r/83198 (owner: 10Ori.livneh) [23:29:14] double yay [23:29:20] done [23:29:49] could you force a puppet run on nickel? [23:30:57] i should submit a patch giving myself access to that host, p-void suggested that too [23:31:50] (03CR) 10Faidon Liambotis: "Sure, why not. But please, better commit messages :) See http://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines for a good guide." 
[operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/83137 (owner: 10Ottomata) [23:32:59] (03CR) 10Faidon Liambotis: [C: 032] add careers and jobs DNS entries as requested in RT #5709 [operations/dns] - 10https://gerrit.wikimedia.org/r/82750 (owner: 10Dzahn) [23:33:47] (03CR) 10Faidon Liambotis: [C: 032] statsd-ganglia: node 0.10+ compat; add custom filters feature [operations/puppet] - 10https://gerrit.wikimedia.org/r/83132 (owner: 10Ori.livneh) [23:34:29] p-void is merging [23:34:36] p-void has a flight in 8h and hasn't packed yet [23:36:10] that's a loooong time from now [23:36:25] (03PS1) 10Ori.livneh: nickel: add self ('olivneh') w/sudo [operations/puppet] - 10https://gerrit.wikimedia.org/r/83201 [23:37:01] it is, but I also have to fit sleep somewhere in there :P [23:37:10] hm, I wonder if that needs RT request and 3-day embargo [23:37:40] btw, if you're looking for ganglia things to do [23:37:47] the views could use a refactoring and getting out of misc:: [23:37:54] it didn't for https://gerrit.wikimedia.org/r/#/c/81401/ [23:38:00] and yes, I've got plans :P [23:50:38] Where are the ua filtering stuff thingies maintained these days? [23:50:38] https://bugzilla.wikimedia.org/show_bug.cgi?id=53881 [23:52:19] hrm, did puppet run on nickel? [23:52:41] Looks like https://github.com/wikimedia/operations-debs-wikistats doesn't contain any of that logic [23:52:53] e.g. the thing that generates http://stats.wikimedia.org/wikimedia/squids/SquidReportClients.htm [23:53:08] * ori-l shrugs [23:54:12] aww yeah http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&tab=v&vn=perceived_latency [23:54:39] \\\( [23:55:28] i couldn't find a way to prevent the escaping [23:55:35] i experimented [23:57:26] why no DNS? [23:57:54] because.. 
[23:58:04] (03PS1) 10Ori.livneh: Add DNS lookup time to NavigationTiming StatsD reporter [operations/puppet] - 10https://gerrit.wikimedia.org/r/83203 [23:58:07] ..i forgot [23:58:17] lol [23:59:19] (03CR) 10Faidon Liambotis: [C: 032] Add DNS lookup time to NavigationTiming StatsD reporter [operations/puppet] - 10https://gerrit.wikimedia.org/r/83203 (owner: 10Ori.livneh) [23:59:48] ori-l: ah, you should ping me when you need something [23:59:56] I'm just now seeing your request to force the run :)