[00:02:18] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:03:30] PROBLEM - HTTP on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:04:05] ... [00:05:56] Useful. [00:06:19] i am on it [00:06:29] it crashed earlier already ..sigh [00:06:48] Yeah. [00:07:11] * James_F rethinks his team moving from Mingle to Bugzilla for prioritisation ahead of our launch. :-) [00:07:27] !log powercycling kaulen, this time no console output at all [00:07:34] Logged the message, Master [00:08:27] James_F: * Starting web server apache2 [ OK ] [00:08:44] mutante: Thanks. :-) [00:08:45] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [00:09:57] RECOVERY - HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.005 seconds [00:13:18] It's been fine for ages [00:13:23] and then twice in one day [00:16:13] http://sphotos-c.ak.fbcdn.net/hphotos-ak-prn1/604117_10151118660426471_1496901487_n.jpg [00:56:25] !log repooled mw58,mw59 (upgrades) srv284 (hw ticket was resolved, reinstalled) [00:56:32] Logged the message, Master [01:09:30] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [01:15:11] !log rebooting srv193 (test.wp) for upgrade [01:15:18] Logged the message, Master [01:17:09] PROBLEM - Host srv193 is DOWN: PING CRITICAL - Packet loss = 100% [01:17:15] lol [01:17:54] RECOVERY - Host srv193 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [01:19:02] Matthew_: heh, did you come to ask about test.wikipedia? :p [01:19:26] mutante: No, I just idle here :) [01:19:35] ok;) [01:28:58] !log sync-apache srv284-only, start apache (was missing all.conf), repool [01:29:05] Logged the message, Master [01:29:07] ..and laters [01:36:58] meh, or not, depooled and look at it again later..needs other stuff.. 
out [01:39:39] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 241 seconds [01:39:48] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 250 seconds [01:42:48] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:44:36] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 26 seconds [02:00:12] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 264 seconds [02:00:30] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 281 seconds [02:24:23] !log LocalisationUpdate completed (1.21wmf4) at Thu Nov 22 02:24:22 UTC 2012 [02:24:30] Logged the message, Master [02:38:18] RECOVERY - Puppet freshness on mw29 is OK: puppet ran at Thu Nov 22 02:38:05 UTC 2012 [02:38:18] RECOVERY - Puppet freshness on mw25 is OK: puppet ran at Thu Nov 22 02:38:14 UTC 2012 [02:38:36] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [02:38:36] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [02:38:36] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [02:39:21] RECOVERY - Puppet freshness on mw32 is OK: puppet ran at Thu Nov 22 02:38:50 UTC 2012 [02:41:19] RECOVERY - Puppet freshness on mw34 is OK: puppet ran at Thu Nov 22 02:40:52 UTC 2012 [02:42:48] RECOVERY - Puppet freshness on mw73 is OK: puppet ran at Thu Nov 22 02:42:43 UTC 2012 [02:44:18] RECOVERY - Puppet freshness on mw70 is OK: puppet ran at Thu Nov 22 02:44:08 UTC 2012 [02:45:21] RECOVERY - Puppet freshness on mw24 is OK: puppet ran at Thu Nov 22 02:45:06 UTC 2012 [02:45:48] RECOVERY - Puppet freshness on mw31 is OK: puppet ran at Thu Nov 22 02:45:23 UTC 2012 [02:46:51] RECOVERY - Puppet freshness on mw71 is OK: puppet ran at Thu Nov 22 02:46:43 UTC 2012 [02:47:18] RECOVERY - Puppet freshness on mw72 is OK: puppet ran at Thu Nov 22 02:46:52 UTC 2012 [02:48:21] RECOVERY - Puppet freshness on mw74 is OK: puppet ran at Thu Nov 22 02:48:13 UTC 2012 [02:50:45] RECOVERY - Puppet freshness on mw22 is OK: puppet ran at Thu Nov 22 02:50:28 UTC 2012 [02:51:48] RECOVERY - Puppet freshness on mw33 is OK: puppet ran at Thu Nov 22 02:51:25 UTC 2012 [02:53:18] RECOVERY - Puppet freshness on mw30 is OK: puppet ran at Thu Nov 22 02:53:10 UTC 2012 [02:54:30] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [02:54:30] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [03:00:21] RECOVERY - Puppet freshness on mw20 is OK: puppet ran at Thu Nov 22 02:59:58 UTC 2012 [03:01:24] RECOVERY - Puppet freshness on mw27 is OK: puppet ran at Thu Nov 22 03:01:10 UTC 2012 [03:03:21] RECOVERY - Puppet freshness on mw26 is OK: puppet ran at Thu Nov 22 03:02:58 UTC 2012 [03:04:24] RECOVERY - Puppet freshness on mw28 is OK: puppet ran at Thu Nov 22 03:04:13 UTC 2012 [03:06:21] RECOVERY - Puppet freshness on mw21 is OK: puppet ran at Thu Nov 22 03:06:10 UTC 2012 [03:34:14] RECOVERY - Puppet freshness on nescio is OK: puppet ran at Thu Nov 22 03:33:54 UTC 2012 [03:46:50] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [03:48:11] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [03:55:50] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [05:24:06] New patchset: Tim Starling; "Add timeouts to RMI communications" [operations/debs/lucene-search-2] (master) - 
https://gerrit.wikimedia.org/r/34481 [05:35:01] PROBLEM - Lucene on search13 is CRITICAL: Connection timed out [05:36:21] RECOVERY - Lucene on search13 is OK: TCP OK - 0.003 second response time on port 8123 [05:46:24] PROBLEM - Lucene on search13 is CRITICAL: Connection timed out [05:51:26] this is getting to be frequent... [05:51:54] i guess it's 8am for apergos... but also a holiday [05:52:00] (search13) [05:52:42] RECOVERY - Lucene on search13 is OK: TCP OK - 3.010 second response time on port 8123 [05:59:09] it's 8 am [06:00:24] ok that should take care of it [06:52:20] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [06:56:20] I've got read timeouts more or less working [06:56:56] except that there's a bug in lsearchd somewhere which means that after a read timeout, it can never contact that host again and spams the log with errors once every 10 seconds [07:03:08] that's a large except [07:15:35] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [07:31:18] New review: Hashar; "Thanks for the clean up!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34607 [08:31:10] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms [08:34:28] PROBLEM - swift-container-auditor on ms-be8 is CRITICAL: Connection refused by host [08:34:28] PROBLEM - swift-account-auditor on ms-be8 is CRITICAL: Connection refused by host [08:34:37] PROBLEM - swift-object-auditor on ms-be8 is CRITICAL: Connection refused by host [08:34:55] PROBLEM - swift-object-replicator on ms-be8 is CRITICAL: Connection refused by host [08:34:55] PROBLEM - swift-container-replicator on ms-be8 is CRITICAL: Connection refused by host [08:34:55] PROBLEM - swift-account-reaper on ms-be8 is CRITICAL: Connection refused by host [08:35:22] PROBLEM - swift-container-server on ms-be8 is CRITICAL: Connection refused by host [08:35:22] PROBLEM - swift-object-server on ms-be8 is CRITICAL: Connection refused by host [08:35:22] PROBLEM - swift-account-replicator on ms-be8 is CRITICAL: Connection refused by host [08:35:40] PROBLEM - swift-account-server on ms-be8 is CRITICAL: Connection refused by host [08:35:40] PROBLEM - SSH on ms-be8 is CRITICAL: Connection refused [08:35:40] PROBLEM - swift-container-updater on ms-be8 is CRITICAL: Connection refused by host [08:35:40] PROBLEM - swift-object-updater on ms-be8 is CRITICAL: Connection refused by host [08:37:11] New patchset: ArielGlenn; "provide for xfs filesystem labels without making the filesystem (ms-bexx)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34525 [08:38:52] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34525 [08:44:17] New patchset: ArielGlenn; "ms-be8 as 720xd with ssds last in disk layout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34693 [08:46:33] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34693 [08:55:55] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [08:58:10] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [09:03:25] RECOVERY - SSH on ms-be8 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [09:03:34] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [09:13:47] hey, looks like several half-baked servers from the lateast batch are in rotation, causing a flood of stuff like 10.0.8.34 apache2[25005]: PHP Warning: 
require(/usr/local/apache/common-local/php-1.21wmf4/index.php) [function.require]: failed to open stream: Permission denied in /usr/local/apache/common-local/live-1.5/index.php on line 3 [09:14:05] also 10.0.8.29 [09:14:39] lemme check those two [09:15:51] oh, srv284, it's been a pain forever [09:16:19] RECOVERY - swift-account-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:16:19] RECOVERY - swift-container-auditor on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:16:28] RECOVERY - swift-account-reaper on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:16:37] RECOVERY - swift-object-server on ms-be8 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:16:47] RECOVERY - swift-object-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [09:17:04] RECOVERY - swift-container-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:17:40] RECOVERY - swift-container-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [09:17:40] RECOVERY - swift-object-updater on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [09:17:40] RECOVERY - swift-container-server on ms-be8 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [09:17:40] RECOVERY - swift-account-server on ms-be8 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:17:49] RECOVERY - swift-object-auditor on ms-be8 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [09:19:48] perms fixed on 284 [09:19:55] RECOVERY - Apache HTTP on srv284 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.073 second response time [09:20:05] thanks! [09:20:52] the other host has something else going on there [09:22:52] it looks ok actually, just the standard bogus docroot errors [09:23:22] !log fixed perms on srv284 php-1.*wmf* and wmf-config, test (were 700 instead of 777) [09:23:33] Logged the message, Master [09:28:51] yeah that was the only host with perm denied in the errors, should be set now [09:30:20] * MaxSem wonders if any instance of 100+ errors in fatalmonitor should be reported to this channel [09:32:20] I don't know enough about what sort of errors wind up in there to say [09:32:45] that one was important because it prevented any pages from being served from that host [09:46:05] RECOVERY - swift-account-replicator on ms-be8 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [10:58:21] so I think I have the zuul deployment ready, then I will have to schedule the deployment with someone from ops. Not sure what is the best way to ask. [10:58:29] should I just drop a mail to ops-l with an RT ticket ? [10:58:44] rt is good, what is zuul? :-D [10:59:02] ooh forgot to give context [10:59:04] also let's see what the schedule looks like, maybe you could propose a couple dates/times [10:59:23] that is a python daemon that listen to Gerrit events and trigger jobs according to a specification. [10:59:53] hmm sounds cool, what kind of jobs were folks thinking of? 
[11:00:00] the change is mostly about applying a role class, double checking that the various templates are properly executed and that the service is running. [11:00:26] got a gerrit link? [11:02:03] apergos: https://gerrit.wikimedia.org/r/#/c/34555/ [11:02:13] the rest of the code is mostly in modules/zuul [11:02:20] ok [11:02:22] been reviewed / merged in by Faidon [11:02:26] and applied on labs [11:02:45] I don't expect any trouble, just need someone with root access to run merge / run puppet if something is not properly set up [11:02:50] right [11:02:56] will write all of that in an RT ticket [11:02:58] * apergos looks at the module for a bit [11:03:18] labs is really great [11:03:36] I have set up a fresh instance, applied the class and that let me fix a few culprit I didn't catch previously [11:03:42] less troubles for the ops! [11:03:57] I'm kind of wondering... [11:03:59] if... [11:04:57] well first what server is this going to go on then? [11:05:12] gallium the contint server [11:05:34] logical [11:05:39] well [11:05:47] there's really no other prep work or whatever? [11:05:52] cause as I see the deployment schedule... [11:06:05] today is really quiet for some hours :-D [11:06:30] well if you want to do it today we surely can :-] [11:06:43] though you are probably busy with swift stuff [11:06:55] there's always stuff to do but [11:07:01] sure why not :-D [11:07:27] what I won't be helpful with is if some zuul piece isn't working right [11:08:05] but it's a pretty self-contained service on a single host so [11:10:39] give me a time frame for today if you like and I'll be available [11:10:40] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [11:11:17] apergos: I guess we want to lunch, so in 2 hours if that works for you ? [11:11:35] sure [11:11:42] ping when you want me [11:11:46] nice [11:11:51] will fill in the RT ticket meanwhile [11:11:57] and grab a sandwich :-) [11:12:13] ok [11:29:42] New patchset: ArielGlenn; "separate stanza for ms-be8 for test of fs labeling" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34704 [11:32:03] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34704 [11:35:56] apergos: https://rt.wikimedia.org/Ticket/Display.html?id=3958 I added you in cc :-) [11:36:14] ok cool [11:37:36] New patchset: Hashar; "deploy zuul on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34555 [11:38:11] New review: Hashar; "PS2: added links to RT and bugzilla." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/34555 [11:39:32] New patchset: ArielGlenn; "fix when we label swift filesystems, got test condition backwards" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34705 [11:40:49] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34705 [11:42:20] rats still no good [11:46:56] New patchset: ArielGlenn; "path for mount command" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34706 [11:48:17] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34706 [11:50:15] \o/ [12:06:19] !log restarted Jenkins to load in the new zuul jobs. 
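The three patchsets merged just above ("provide for xfs filesystem labels without making the filesystem", "fix when we label swift filesystems, got test condition backwards" and "path for mount command") all revolve around the same Puppet pattern: an exec that labels a swift data filesystem only when it is not already labelled. What follows is a minimal sketch of that pattern, not the real manifest; the device, label and path values are placeholders rather than the actual ms-be disk layout, and it assumes xfs_admin is the labelling tool.

    # Sketch only: label an existing XFS filesystem without re-making it.
    # /dev/sdm3 and the label value are illustrative, not the real layout.
    exec { 'label-swift-sdm3':
        command => 'xfs_admin -L swift-sdm3 /dev/sdm3',
        # Skip the exec when the label is already set; getting this test
        # inverted is the "test condition backwards" bug fixed above.
        unless  => 'xfs_admin -l /dev/sdm3 | grep -q swift-sdm3',
        # Execs only find xfs_admin, grep, mount and friends when given a
        # search path, hence the follow-up "path for mount command" change.
        path    => ['/bin', '/sbin', '/usr/bin', '/usr/sbin'],
    }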
[12:06:25] Logged the message, Master [12:10:27] doh [12:10:34] jenkins did not like that restart :( [12:10:36] New patchset: ArielGlenn; "ms-be6-8 and 10 in one stanza for 720xds, toss test stanza" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34710 [12:10:38] uh oh [12:10:52] hm no it is alive [12:10:59] just the GUI being locked somehow [12:12:17] ahh it is busy parsing all the old build files :( [12:12:21] I really need to clean them up [12:13:46] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34710 [12:13:51] sorry for the delay [12:13:56] no worries at all [12:14:11] as you see I;m getting on with my stuff (which can be interrupted at any point) [12:16:58] !log Installed plugins on Jenkins for Zuul deployment: notification, build-timeout [12:17:05] Logged the message, Master [12:20:53] good enough for now. I am going to grab a snack [12:21:05] enjoy [12:35:18] there's a labs acct creation waiting that I can't do. (has SVN already) [12:35:41] 22 12:32:42 <+wm-bot> Change on mediawiki a page Developer access was modified, changed by Jeremyb link https://www.mediawiki.org/w/index.php?diff=608443 edit summary: /* User:SHL */ deferred [12:35:51] * jeremyb goes back to sleep [12:38:04] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [12:38:17] lies [12:39:34] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [12:39:34] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:39:34] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [12:41:22] PROBLEM - swift-container-auditor on ms-be10 is CRITICAL: Connection refused by host [12:41:49] PROBLEM - swift-object-auditor on ms-be10 is CRITICAL: Connection refused by host [12:42:07] PROBLEM - swift-object-updater on ms-be10 is CRITICAL: Connection refused by host [12:42:16] PROBLEM - swift-account-auditor on ms-be10 is CRITICAL: Connection refused by host [12:42:16] PROBLEM - swift-account-reaper on ms-be10 is CRITICAL: Connection refused by host [12:42:16] PROBLEM - swift-container-server on ms-be10 is CRITICAL: Connection refused by host [12:42:16] PROBLEM - swift-container-replicator on ms-be10 is CRITICAL: Connection refused by host [12:42:16] PROBLEM - swift-object-server on ms-be10 is CRITICAL: Connection refused by host [12:42:17] PROBLEM - SSH on ms-be10 is CRITICAL: Connection refused [12:42:17] PROBLEM - swift-object-replicator on ms-be10 is CRITICAL: Connection refused by host [12:42:18] PROBLEM - swift-container-updater on ms-be10 is CRITICAL: Connection refused by host [12:42:18] PROBLEM - swift-account-replicator on ms-be10 is CRITICAL: Connection refused by host [12:42:34] PROBLEM - swift-account-server on ms-be10 is CRITICAL: Connection refused by host [12:55:37] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [12:55:37] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:57:07] PROBLEM - Host ms-be10 is DOWN: PING CRITICAL - Packet loss = 100% [13:02:49] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [13:03:48] 25 minutes to prepare a burger sandwich doh [13:04:02] so many people eat at 1pm in the neighborhood [13:04:06] apergos: I am around :-) [13:04:31] ok [13:04:44] what needs to be done? 
[13:04:56] (my install is broken in some new way on the next host, a perfect time for a break) [13:05:24] merge in https://gerrit.wikimedia.org/r/#/c/34555/ [13:05:30] deploy on sockpuppet [13:05:34] run puppetd -tv on gallium [13:05:37] I think :-] [13:05:57] heh [13:06:44] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34555 [13:07:45] let's see [13:07:56] puppet running now [13:08:15] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Monitor_service[jenkins] is already defined in file /var/lib/git/operations/puppet/manifests/misc/contint.pp at line 152; cannot redefine at /var/lib/git/operations/puppet/manifests/zuul.pp:35 on node gallium.wikimedia.org [13:08:17] woops [13:08:30] arghh [13:08:50] I gave it a name! [13:09:20] ok the same name [13:09:22] not helpful [13:09:25] hehe [13:10:06] sending patch [13:10:22] New patchset: Hashar; "rename zuul monitoring service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34711 [13:11:23] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34711 [13:11:55] let's try again [13:13:30] well it's making some packages present so that's progress [13:14:33] the conf is there :-) [13:15:49] I'll run again and see if anything else gets processed. don't see anything broken [13:15:54] for some reason I had to start the service manually [13:16:30] shoulda waited to see if puppet would have gotten it on the next round [13:16:40] let me stop the service [13:16:49] done [13:16:52] you can puppetd -tv again [13:17:02] it's already going [13:17:19] I think you stopped it in time though, it's just now deciding which config version to apply [13:18:47] the service zuul is defined with enabled=>true and hasrestart => true [13:19:02] so maybe puppet does not enforce it to run [13:19:11] we;ll see [13:19:49] yeah enable is to make sure it is started on boot [13:19:51] which is nice [13:20:06] if one want a service to be always running, we need ensure => running [13:20:20] so that means I can manually stop it whenever needed [13:20:29] which is nice [13:20:30] notice: /Stage[main]/Misc::Contint::Test::Testswarm/File[/etc/testswarm]: Not removing directory; use 'force' to override [13:20:30] in case you need this [13:20:39] ah [13:21:08] well if it was going to restart something I guess it would have [13:21:15] apergos: you can delete /etc/testswarm , we are no more using that tool [13:21:37] gone [13:22:06] started service [13:22:14] looks like you have to do it manually all right [13:22:27] so that is cloning mediawiki/core.git right now [13:22:34] which is a lot of data :-] [13:22:43] ohboy [13:22:49] that should keep it busy for a while [13:23:00] is that going to impact other services on that host (jenkins)? [13:23:06] kind of [13:23:18] Zuul is receiving events from Gerrit and trigger jobs in Jenkins [13:23:23] ok well [13:23:28] but it is only triggering new jobs I have deployed before lunch [13:23:30] so we are fine [13:23:30] good there's not a lot of folks on line playing today :-D [13:23:37] all the other jobs keep running and don't need zuul :-] [13:23:43] always deploy during us holidays [13:23:47] indeed [13:23:48] much quieter! 
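Both hiccups in this deployment, the "Duplicate definition: Monitor_service[jenkins]" failure and the daemon that puppet enabled but never actually started, come down to how the resources are declared. The snippet below is a sketch of the relevant declarations, not the real zuul.pp: the monitor_service arguments and the check command are invented placeholders.

    # Resource titles must be unique across the whole catalog: contint.pp
    # already declares Monitor_service[jenkins], so the zuul check needs a
    # title of its own (the rename in change 34711).
    monitor_service { 'zuul':
        description   => 'zuul process',          # placeholder
        check_command => 'nrpe_check_zuul_procs', # placeholder
    }

    service { 'zuul':
        enable     => true,   # only guarantees start-on-boot
        hasrestart => true,
        # Without "ensure => running" puppet neither starts nor restarts the
        # daemon itself, which is why it had to be started by hand here --
        # and also why it can later be stopped manually without puppet
        # bringing it straight back up.
        # ensure   => running,
    }

Leaving ensure out is a deliberate trade-off, as noted in the conversation: ops keep the freedom to stop the service when needed, at the cost of every start being a manual step.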
[13:23:51] or early in the morning :-]]]]]] [13:23:54] heh [13:24:13] although lately their evening work bleeds into our early mornign so that's been getting cluttered [13:25:03] so that is still cloning but I have no idea how to verify how well it is progressing [13:25:16] git-upload-pack '/mediawiki/core' [13:26:04] PROBLEM - NTP on ms-be10 is CRITICAL: NTP CRITICAL: No response from NTP server [13:26:37] apergos: anyway I think we are done. The service does receive event from Gerrit which mean ssh is properly configured [13:26:47] still have to verify that it trigger jobs in Jenkins though [13:26:55] ok [13:27:22] you wanna close the rt... oh, I have to I guess :-D but let's wait a little bit to be sure [13:28:12] indeed [13:28:17] waiting for the git clone to finish [13:28:28] then I will verify whether it can connect to jenkins [13:28:35] should be fine though, I guess you can install your next box now :-] [13:28:50] I'm in the middle of the install and there's some issue with the disks (again) [13:28:51] I should have proper permissions to sort out most of the issues [13:29:05] conf is under jenkins user umbrella and I can sudo as jenkins user on that box :-] [13:29:49] excellent [13:35:23] cloned!!! ;-] [13:35:36] !g 34540 [13:35:36] https://gerrit.wikimedia.org/r/#q,34540,n,z [13:36:06] yay [13:36:43] no I need to sort out a 403 with Jenkins ;-) [13:37:18] probably forgot some user permissions [13:38:09] little details [13:38:47] the good thing is that I have updated upstream documentation [13:38:54] so I just have to copy paste [13:39:02] i luuuuve open source [13:39:53] !log Jenkins: updated zuul-bot user permission so it can trigger jobs. [13:39:59] Logged the message, Master [13:41:07] SUCCESS http://integration.mediawiki.org/ci/job/mediawiki-core-merge/1/console [13:41:13] apergos: that is a total success :-] [13:41:20] apergos: thanks a lot :-]]]]]]]]] [13:41:27] * hashar does the happy dance [13:42:06] yay! [13:42:32] I love it when a production deployment take less than hour [13:42:37] thanks to labs, puppet and everything [13:42:44] and all the previous work [13:42:50] yeah [13:42:55] ook should I close that ticket now? [13:42:58] I have reinstalled that service like 3 or 4 times [13:43:04] yeah, definitely close the RT ticket [13:43:06] I will close the bug [13:43:37] no the hard part: writing documentation and preparing a mail for wikitech-l [13:43:52] yep [13:43:57] I look forward to seeing the docs [13:44:53] I wrote some doc for ops on https://www.mediawiki.org/wiki/Continuous_integration/Zuul [13:45:01] but anyway, that is not going to replace the good old system [13:45:11] I will have them run in parralel [13:45:30] just like the migration from /mnt/thumb to swift [13:45:31] ;) [13:47:18] ha ha [13:47:52] at least ms5 can really go away [13:48:03] where ms7 cannot because it was used for way too much other random crapola [13:51:44] apergos: I am wondering what 'ms' is for ? Misc Server ? [13:52:03] media server :-P [13:52:18] ahh [13:52:50] we should start up an ops glossary on wikitech just like guillom is doing on mw.org :-) [13:54:09] isn't the mw.org one intended for wikimedia jargon in part? [13:54:24] maybe ops terms could be added to it [13:54:26] "sever admin log" is in it [13:57:10] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [13:57:50] hmm can the install be working? zeroed out a bunch of sectors on the install drives... 
[13:58:03] *server admin log [13:58:05] grr [13:59:02] apergos: hashar: yes, Ops jargon is much welcome in the glossary (and there are many ops terms already :) [13:59:13] thought so :-) [13:59:13] what is the url already ? [13:59:20] will add in a few [13:59:28] forget (though I made a tiny change already) [13:59:44] I should look at the wikitech-l post [14:10:04] RECOVERY - SSH on ms-be10 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:12:38] !log jenkins: configure jobbuilder-bot user permissions [14:12:44] Logged the message, Master [14:15:12] INFO:jenkins_jobs.builder:Reconfiguring jenkins job mediawiki-core-lint [14:15:13] yeahhh [14:15:16] \O/ [14:15:24] woo hoo [14:15:56] that is ap python script which generate Jenkins jobs for us [14:16:01] out of a basic yaml file [14:16:25] we will no more have to click checkbox in Jenkins gui [14:16:32] and can use templates to bulk update jobs [14:16:33] \O/ [14:16:35] so happy [14:19:05] :-) [14:20:20] I kept being interrupted on that project [14:20:30] the last 2 months (since the all staff) have been hard to me :/ [14:20:39] anyway {{done}} [14:21:45] time to celebrate: where is the booze? :-D [14:22:23] ahh having beer with coworker is something I am missing :/ [14:23:25] gah this first puppet run is slooooowww [14:23:48] apt-get update then installing allllll our packages isn't it ? [14:24:16] can't tell what phase it's in [14:24:23] oh [14:24:30] compile puppet.conf [14:24:33] :-/ [14:25:59] moving again [14:29:07] RECOVERY - NTP on ms-be10 is OK: NTP OK: Offset 0.02779483795 secs [14:39:44] RECOVERY - swift-container-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:40:03] RECOVERY - swift-object-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:40:03] RECOVERY - swift-account-reaper on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:40:04] RECOVERY - swift-container-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:40:29] RECOVERY - swift-container-server on ms-be10 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:40:38] RECOVERY - swift-object-server on ms-be10 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:40:56] RECOVERY - swift-account-server on ms-be10 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [14:40:56] RECOVERY - swift-object-updater on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [14:40:57] RECOVERY - swift-container-updater on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [14:41:14] RECOVERY - swift-account-auditor on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [14:41:14] RECOVERY - swift-object-auditor on ms-be10 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [14:42:17] RECOVERY - swift-account-replicator on ms-be10 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:52:02] !log msbe8 and msbe10 installed, not yet deployed [14:52:09] Logged the message, Master [15:17:41] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [15:19:11] RECOVERY - SSH on lvs6 is 
OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:47:53] out again to get my daugher [16:32:53] afk for a few hours [16:51:45] PROBLEM - Varnish HTTP bits on cp3019 is CRITICAL: Connection refused [16:53:42] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [16:54:54] RECOVERY - Varnish HTTP bits on cp3019 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.241 seconds [17:16:39] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [17:34:29] New patchset: Mark Bergsma; "Read vca_pipes ss pointers first to avoid blocking worker threads" [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34724 [17:34:29] New patchset: Mark Bergsma; "Make the worker thread end of vca_pipes non-blocking as well" [operations/debs/varnish] (patches/optimize-epoll-thread) - https://gerrit.wikimedia.org/r/34725 [18:47:22] mark, that's pretty cool [18:47:40] nicely done [18:59:20] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [20:27:01] ahhh puppet will kill me [20:39:45] New patchset: Hashar; "git::clone did not honor $mode parameter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34748 [20:40:24] I guess it is bed time for me :/ [21:12:00] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [21:47:25] PROBLEM - Puppet freshness on cp3019 is CRITICAL: Puppet has not run in the last 10 hours [21:48:36] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/34748 [21:59:30] New patchset: Hashar; "/etc/zuul/wikimedia no more belong to root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34848 [22:16:40] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [22:20:59] New patchset: Hashar; "update puppet-lint rake target" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34850 [22:26:25] RECOVERY - Puppet freshness on ms-be8 is OK: puppet ran at Thu Nov 22 22:26:23 UTC 2012 [22:30:42] and with that off to bed, see folks tomorrow [22:40:40] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [22:40:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:40:40] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [22:56:34] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [22:56:34] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [23:57:47] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
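The two patchsets left pending at the end of the day, "git::clone did not honor $mode parameter" (34748) and "/etc/zuul/wikimedia no more belong to root" (34848), are both about who ends up owning the cloned Zuul configuration. The define below is a deliberately stripped-down, hypothetical clone wrapper, not the actual git::clone from operations/puppet; it only illustrates where a $mode/$owner parameter has to be threaded through before it has any effect.

    # Hypothetical simplified clone define; parameter names and defaults are
    # illustrative, not the real git::clone signature.
    define git_clone_sketch(
        $origin,
        $directory,
        $owner = 'root',
        $group = 'root',
        $mode  = '0644',
    ) {
        exec { "git_clone_${title}":
            command => "/usr/bin/git clone ${origin} ${directory}",
            creates => "${directory}/.git",
            user    => $owner,
        }

        # If $mode is accepted but never applied (the problem 34748 names),
        # the checkout keeps whatever permissions the clone produced and,
        # when puppet ran the exec as root, stays owned by root -- the
        # situation 34848 addresses for /etc/zuul/wikimedia.
        file { $directory:
            ensure  => directory,
            owner   => $owner,
            group   => $group,
            mode    => $mode,
            recurse => true,
            require => Exec["git_clone_${title}"],
        }
    }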