[00:43:58] New patchset: Tim Starling; "Make timeouts actually work" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/34852
[00:59:02] New review: Tim Starling; "* Tested client read timeout in suggest() by adding a Thread.sleep(60) on the server side." [operations/debs/lucene-search-2] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/34852
[01:43:46] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 301 seconds
[01:45:25] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 15 seconds
[01:59:40] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 232 seconds
[02:00:25] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 277 seconds
[02:24:34] !log LocalisationUpdate completed (1.21wmf4) at Fri Nov 23 02:24:34 UTC 2012
[02:24:45] Logged the message, Master
[02:44:49] RECOVERY - Puppet freshness on cp3019 is OK: puppet ran at Fri Nov 23 02:44:35 UTC 2012
[02:55:01] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[03:02:35] can I get an account creation? https://www.mediawiki.org/w/index.php?title=Developer_access&oldid=608625#User:SHL
[03:03:12] i don't see any of the usual suspects for that task. but i guess that's not a big surprise
[03:03:19] btw, enjoy your turkeys!
[03:17:58] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[03:25:04] New patchset: Tim Starling; "Updated debian directory" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/34854
[03:25:32] Change merged: Tim Starling; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/34481
[03:25:39] Change merged: Tim Starling; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/34852
[03:31:35] Change merged: Tim Starling; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/34854
[03:46:46] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds
[03:47:08] New review: Tim Starling; "Just how temporary is this?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24235
[03:50:04] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds
[04:02:21] !log many lucene search servers failed to bind to port 1099 when they were restarted by the upgrade, restarting manually
[04:02:30] Logged the message, Master
[04:04:22] !log oh yeah, and I upgraded lucene to my version with the timeouts, deployed to pmtpa only via puppet
[04:04:29] Logged the message, Master
[04:11:22] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection refused
[04:12:52] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[04:15:01] * jeremyb guesses that paged?
[04:18:02] not sure what happened there, I specified dsh -F1 but it seemed to treat it as -F2
[04:20:05] !log on fenari: updated the "search" dsh node group based on nmap -sP and fixed the remaining search servers
[04:20:11] Logged the message, Master
[04:21:21] I guess I should have put a sleep in
[05:00:37] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[05:01:04] PROBLEM - Lucene on search1016 is CRITICAL: Connection refused
[05:25:39] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123
[07:12:49] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[08:33:15] hello
[08:34:46] world
[08:42:08] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[08:42:09] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[08:42:09] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[08:48:25] apergos: ping pong :-] I could use some new merges
[08:48:32] oh?
[08:48:54] Zuul configuration is in /etc/zuul/wikimedia which belong to root:jenkins with mode 755 so I can't write to it :-(
[08:49:06] aww
[08:49:16] all right, point me to the patches
[08:49:22] the root cause is that git::clone does not honor the mode 0775 which is passed to it. The workaround is to let the dir belong to jenkins:jenkins https://gerrit.wikimedia.org/r/#/c/34848/ :)
[08:49:30] easy one :)
[08:49:57] the thing is that I originally wanted puppet to deploy the conf automatically so that made sense to use root as a owner
[08:50:16] you don't still want that?
[08:50:26] not really
[08:50:36] how will the conf file be maintained?
[08:50:48] just out of the git repo?
[08:50:50] the reason is that one can send an invalid configuration which then mean that puppet will reload zuul and it will crash
[08:51:03] yeah the conf file is in a dedicated git repo
[08:51:08] integration/zuul-config
[08:51:12] uh huh
[08:51:18] i send patch there, then git pull in /etc/zuul/wikimedia
[08:51:26] so you're going to do manual pulls every so often
[08:51:31] indeed
[08:51:35] and reload the service manually
[08:51:40] then double check that everything works fine
[08:51:41] I expect some time you'll get tired of that but for now
[08:52:01] the plan is to add a linting job to verify the configuration
[08:52:03] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34848
[08:52:33] then improve Zuul so it is able to check the configuration.
[08:52:47] once that is done, I will happily convert that workflow so that puppet does everything for us :-]
[08:52:56] and
[08:53:31] then I have to talk to mark about git::clone() not applying $mode, but that is a different subject
[08:54:18] refresh of Exec[install_zuul] is now running on gallium
[08:54:27] yeah that one too
[08:54:36] doe
[08:54:51] somehow puppet think that a new version has been fetched from git and rerun the installer
[08:55:38] directory stil looks to be owned by root
[08:55:38] I now start to understand why Faidon keep saying that puppet should not be used as a deployment system =]
[08:55:54] drwxr-xr-x 3 root jenkins 4096 Nov 22 13:13 /etc/zuul/wikimedia/
[08:56:04] hmm
[08:57:02] ahh it does not ensure anything, just run git clone / git pull with the $user and $group credentials :(
[08:57:14] not fixing the actual path
[08:57:15] damn
[08:57:15] sorry Ariel :(
[08:57:22] s'ok
[08:58:09] apergos: could you fix the perm temporarily with: chown -R jenkins /etc/zuul/wikimedia
[08:58:11] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[08:58:11] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[08:58:22] done
[08:58:24] I will update the other change I made to git::clone
[08:58:25] thanks!
[08:58:28] I figured that was next :-P
[08:58:53] nice
[08:58:57] thanks!!! :-]
[08:59:06] hopefully I am not going to interrupt you anymore
[08:59:23] ok no worries
[09:04:05] New patchset: Hashar; "git::clone did not honor perms/ownership" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34748
[09:04:31] New review: Hashar; "Patchset 2 adds user and group to the file { $directory: } statement." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/34748
[09:59:14] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[10:14:06] New patchset: Hashar; "dummy generic module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34861
[10:14:06] New patchset: Hashar; "Generic class to install OpenJDK" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34862
[10:14:06] New patchset: Hashar; "OpenJDK JRE/JDK on CI host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34863
[10:22:31] hashar: :-)
[10:22:45] paravoid: hello :-]
[10:23:11] hi
[10:23:15] saw that you're not coming after all
[10:23:18] too bad
[10:23:26] yeah we had something else planned on saturday :-(
[10:23:39] a few friends coming home, completely forgot about that
[10:31:17] New review: Nikerabbit; "I will deploy this next Tuesday if not done before." [operations/mediawiki-config] (master); V: 0 C: 1; - https://gerrit.wikimedia.org/r/34300
[10:32:00] paravoid: if all swift servers are happy, I got some changes pending in puppet :-]
[10:38:40] depends on which servers you're referring to (proxies or backends) and depends on what you mean by happy :)
[10:39:19] what are the changes?
[10:39:23] feel free to add me as a reviewer
[10:39:48] I can have a look but I'm not going to deploy them until Tuesday the earliest
[10:39:52] (vac)
[10:41:13] will add you and paste here
[10:41:23] https://gerrit.wikimedia.org/r/#/c/34861/ creates a "generic" puppet module
[10:41:41] https://gerrit.wikimedia.org/r/#/c/34862/ is a class to install OpenJDK under the new generic module above :-]
[10:42:02] nak
[10:42:23] I eventually got tired of editing the huuuuuuge manifests/generic-definitions.pp :-]
[10:42:39] we should move to a java or openjdk module
[10:42:58] should have thought about that
[10:43:26] New review: Faidon; "We should probably move to a java or openjdk or something, not recreate a huge generic module." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/34862
[10:43:47] a generic module is going to get only slightly better than generic-definitions.pp
[10:44:06] I think we need to truly modularize things :)
[10:44:14] yeah I agree, an openjdk module makes muuuch more sense
[10:44:20] another change that could use some input from you is a module to host some of our shell scripts https://gerrit.wikimedia.org/r/#/c/29937/
[10:45:19] I added myself as a reviewer
[10:45:28] Change abandoned: Hashar; "Per discussion with Faidon on IRC: "a generic module is going to get only slightly better than gener..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34861
[10:45:28] I'll have a look when I'm back from vac
[10:45:33] thanks!
[10:45:52] migrating my change to an "openjdk" package
[10:46:37] I think the analytics people wanted to use (or use already?) oracle jvm/jdk
[10:46:52] I think so
[10:46:54] oracle and openjdk are going to be very similar, so maybe it makes sense to have a common module
[10:47:00] but mobile team wants to use openjdk as well :-)
[10:47:05] ahh
[10:47:06] true
[10:47:06] New patchset: Hashar; "Generic class to install OpenJDK" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34862
[10:47:13] so a module named java
[10:47:15] then java::sun / java::openjdk ?
[10:47:22] maybe, yeah
[10:47:27] not sure, haven't researched it much
[10:47:35] New review: Hashar; "rebased to get rid of the obsolete dependency." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/34862
[10:52:35] luuuve puppet
[10:52:54] New patchset: Hashar; "java module and class to install OpenJDK" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34862
[10:52:54] New patchset: Hashar; "OpenJDK JRE/JDK on CI host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34863
[10:53:37] New review: Hashar; "renamed module from generic::packages::openjdk to java::openjdk thus providing as well the new "java..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/34862
[10:54:23] New review: Hashar; "Parent change https://gerrit.wikimedia.org/r/34862 has been changed, the generic::packages::openjdk ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/34863
[10:55:08] paravoid: thanks for your quick review :-] Enjoy your vacations
[10:55:14] and deb hacking
[11:39:59] New patchset: Mark Bergsma; "Loop over all epoll events to check if we need to read vca_pipes" [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34869
[11:51:30] New patchset: Mark Bergsma; "Read the entire vca pipe at once at Linux typical size of 64 kiB" [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34870
[11:52:03] New patchset: Mark Bergsma; "Read the entire vca pipe at once at Linux typical size of 64 kiB" [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34871
[11:54:25] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34870
[12:40:15] New patchset: Mark Bergsma; "Read the entire vca pipe at once at Linux typical size of 64 kiB" [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34871
[12:51:52] New patchset: Mark Bergsma; "Loop over all epoll events to check if we need to read vca_pipes" [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34869
[12:53:52] New patchset: Mark Bergsma; "Read the entire vca pipe at once at Linux typical size of 64 kiB" [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34871
[12:56:15] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[13:19:12] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[13:34:57] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 184 seconds
[13:35:15] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 190 seconds
[13:43:24] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34724
[13:43:38] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34725
[13:43:59] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34869
[13:44:19] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34871
[13:44:42] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[13:46:12] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[14:05:16] !log Built new varnish 3.0.3plus~rc1-wm6 packages with fixed epoll deadlock, and inserted it into the precise-wikimedia APT repository
[14:05:24] Logged the message, Master
[14:32:15] PROBLEM - Host cp3019 is DOWN: PING CRITICAL - Packet loss = 100%
[14:33:20] !log Dist-upgraded and rebooted all esams bits servers (new Varnish package)
[14:33:26] Logged the message, Master
[14:33:27] RECOVERY - Host cp3019 is UP: PING OK - Packet loss = 0%, RTA = 118.15 ms
[14:36:36] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[14:39:54] PROBLEM - Host arsenic is DOWN: PING CRITICAL - Packet loss = 100%
[14:41:17] PROBLEM - Host sq67 is DOWN: PING CRITICAL - Packet loss = 100%
[14:41:35] RECOVERY - Host arsenic is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms
[14:42:20] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 0.89 ms
[14:42:38] PROBLEM - NTP on arsenic is CRITICAL: NTP CRITICAL: Offset unknown
[14:47:26] RECOVERY - NTP on arsenic is OK: NTP OK: Offset 0.009712815285 secs
[14:50:35] PROBLEM - Host strontium is DOWN: PING CRITICAL - Packet loss = 100%
[14:51:47] RECOVERY - Host strontium is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms
[14:56:53] PROBLEM - Host sq70 is DOWN: PING CRITICAL - Packet loss = 100%
[14:57:47] PROBLEM - Host niobium is DOWN: PING CRITICAL - Packet loss = 100%
[14:57:52] !log Dist-upgraded and rebooted all pmtpa bits servers (new Varnish package)
[14:57:58] Logged the message, Master
[14:58:50] RECOVERY - Host sq70 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[14:59:17] RECOVERY - Host niobium is UP: PING OK - Packet loss = 0%, RTA = 26.77 ms
[15:01:50] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[15:04:27] !log maxsem synchronized php-1.21wmf4/extensions/CategoryTree/ 'https://gerrit.wikimedia.org/r/#/c/34668/'
[15:04:33] Logged the message, Master
[15:05:17] PROBLEM - Host palladium is DOWN: PING CRITICAL - Packet loss = 100%
[15:06:57] RECOVERY - Host palladium is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[15:07:41] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.001 second response time on port 11000
[15:09:32] !log Dist-upgraded and rebooted all eqiad bits servers (new Varnish package)
[15:09:38] Logged the message, Master
[15:27:11] New patchset: Mark Bergsma; "Fix a deadlock of worker threads on the vca_pipe under load" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/34883
[15:27:12] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm6) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/34884
[15:27:36] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/34883
[15:28:08] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/34884
[16:20:07] !log maxsem synchronized php-1.21wmf4/includes/Message.php 'Debugging'
[16:20:14] Logged the message, Master
[17:13:39] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[18:28:23] !log maxsem synchronized php-1.21wmf4/includes/Message.php 'Debugging'
[18:28:30] Logged the message, Master
[18:43:24] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[18:43:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[18:43:24] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[18:59:27] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[18:59:27] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[20:00:11] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[22:41:42] New patchset: Ori.livneh; "Enable PostEdit for ptwiki & svwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34954
[22:43:23] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34954
[22:57:26] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[23:15:50] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Enabling PostEdit on ptwiki & svwiki'
[23:15:58] Logged the message, Master
[23:20:23] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[23:33:50] ori-l, having fun on holiday too? :D