[00:00:07] !log pgehres Started syncing Wikimedia installation... : Deploying Extension:AccountAudit and an Echo thing [00:00:15] Logged the message, Master [00:27:43] New patchset: Bsitu; "Assign high priority to EchoNotificationJob" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61479 [00:32:21] !log pgehres Finished syncing Wikimedia installation... : Deploying Extension:AccountAudit and an Echo thing [00:32:29] Logged the message, Master [00:32:33] good god, finally [00:34:03] AaronSchulz: scap pushes out mediawiki-config as well, yes? [00:36:45] I believe so [00:37:03] pgehres: Is the deployment line clear after this? [00:37:06] Well, it seems to be working on test2wiki [00:37:11] RoanKattouw: all yours [00:37:20] its just not listed in Special:Version [00:37:20] Awesome [00:37:28] I just need Krinkle to merge one more thing and I'm in business [00:37:41] pgehres: Are you sure the extension doesn't have a broken $wgExtensionCredits or something crazy like that? [00:37:48] It's possible [00:38:12] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/53683/9/AccountAudit.php,unified [00:38:31] That should do it [00:38:33] test2wiki? [00:38:39] yeah [00:39:05] It's in $wgExtensionCredits according to eval.php [00:39:20] yeah, and it inserts rows into the db as well [00:39:33] On fenari that is [00:40:21] But not on mw1058 [00:40:53] New patchset: Faidon; "(bug 47807) Strip proxy URLs for mobile requests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61519 [00:41:18] Even though the config totally says it should be there [00:41:28] RoanKattouw: shall I try a sync-dir on wmf-config? [00:42:05] catrope@mw1058:/apache/common/wmf-config$ sudo -u apache php /apache/common/multiversion/MWScript.php eval.php test2wiki [00:42:07] > var_dump($wgAutoloadClasses['AccountAudit']); [00:42:08] NULL [00:42:10] Hmm, not sure [00:42:18] Actually [00:42:19] huh [00:42:25] Why don't you touch InitialiseSettings.php and sync-file it [00:42:32] It might just be a config cache thing [00:43:02] Because: [00:43:05] catrope@mw1058:/apache/common/wmf-config$ grep AccountAudit InitialiseSettings.php -A 2 [00:43:06] 'wmgUseAccountAudit' => array( [00:43:08] 'default' => false, [00:43:09] 'test2wiki' => true [00:43:29] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61519 [00:43:35] !log pgehres synchronized wmf-config/InitialiseSettings.php 'Touching InitialiseSettings' [00:43:43] Logged the message, Master [00:43:53] There it is [00:45:12] RoanKattouw: if it were earlier I would enable on all wikis, but I think I want to go home ... [00:54:57] New review: Krinkle; "@Nemo: If you find something that works for you (try it locally e.g. with Firebug or Chrome Dev Tool..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58082 [00:56:00] mFacenet: I have more detail about my problem and it's super bizarre [00:56:10] I can lock myself out of any box by issuing this: "rsync -Cavz pig/include/ analytics1003.eqiad.wmnet:/home/milimetric" [00:56:25] in that case, i locked myself out of analytics1003 [00:56:41] the command went through fine, but any ssh attempt to that box afterwards is blocked [00:57:03] milimetric, let one of the operations people know, I'm just lurking and was giving advice based on what I've seen in the past [00:57:14] cool, thanks mFacenet [00:57:31] ops peoples: if anyone has any idea what's going on, this is definitely weird ^^ [00:59:04] !log catrope synchronized php-1.22wmf3/extensions/VisualEditor/ 'Update VE' [00:59:11] Logged the message, Master [00:59:34] RoanKattouw: not sure if you saw above, virt0 is the answer [00:59:48] to the question "where is wikitech hosted these days" [01:00:56] Yes [01:01:00] I saw, I got distracted [01:01:11] The config isn't versioned, I guess [01:01:15] Cause I didn't find a .git directory [01:01:24] So I'll just have to update as root? Cause my regular shell user isn't on virt0 [01:07:19] milimetric: don't rsync home [01:07:34] milimetric: you probably also sent .ssh/authorized_keys [01:07:42] which in turn prevented you from accessing the box [01:07:50] sent? [01:07:50] to avoid that, you can go: [01:08:01] i've done the same commands a hundred times [01:08:02] rsync'd [01:08:07] unlikely [01:08:10] today sometimes they work sometimes they lock me out [01:08:17] hm [01:08:19] well, they're in a script file that i haven't changed [01:08:22] well, that is my best guess [01:08:43] because the target is your home directory [01:08:56] and if you were to overwrite .ssh/authorized_keys, you'd be locked out [01:08:59] you mean /home/milimetric instead of /home/milimetric/? [01:09:17] hm [01:09:29] possibly -- that is a good point, because it would effect the perms of the target [01:10:12] i mean, i'm sure someone has said this, but in the short term, just keep incrementing i for analytics100${i} :) [01:10:24] but also [01:10:28] don't target your home directory [01:10:42] target /home/milimetric/tmp/ or something [01:11:18] also, the relative path for rsync is always the home dir, so you can just go: an04:tmp/ to target /home/milimetric/tmp/ [01:11:21] bbl [01:24:02] RoanKattouw: i would be surprised if ryan didn't have labsconsole/wikitech in version control somewhere [01:25:11] Yeah [01:25:17] No .git dir in sight though [01:27:39] virt0 is in... Tampa? [01:29:53] Yup, Tampa [01:53:27] gerrit-wm: Hmm. [02:04:18] !log cleaning up neon's /var/log (100% full), restarting icinga... [02:04:21] sigh [02:04:26] Logged the message, Master [02:04:35] paravoid: VE is now running on wikitech :) [02:04:39] \o/ [02:04:42] you're awesome :) [02:04:44] Writing a mailing list post about it now [02:04:52] It's opt-in though, it's the same config as in prod [02:04:57] So you have to go into your preferences and enable ikt [02:05:24] First edit: https://wikitech.wikimedia.org/w/index.php?title=PowerDNS&action=history :) [02:06:25] nice! [02:07:32] * Coren chucnkes. [02:07:36] chuckles* [02:07:53] My user page is a template. "Sorry, you cannot edit this element". 
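A minimal sketch of the safer rsync invocations suggested above (host and paths are taken from the conversation; whether they fit any other setup is an assumption):

    # Target a subdirectory rather than the home directory itself, so neither
    # ~/.ssh/authorized_keys nor the permissions on /home/milimetric can be clobbered:
    rsync -Cavz pig/include/ analytics1003.eqiad.wmnet:/home/milimetric/tmp/

    # Remote paths without a leading slash are relative to the remote home directory,
    # so this is equivalent:
    rsync -Cavz pig/include/ analytics1003.eqiad.wmnet:tmp/

    # Extra safety regardless of target: never let the transfer touch .ssh at all.
    rsync -Cavz --exclude='.ssh/' pig/include/ analytics1003.eqiad.wmnet:tmp/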
:-) [02:07:55] now if only we merged wikitech & mediawiki.org [02:08:01] hah [02:08:24] Announced on ops@ and wikitech-l [02:15:03] !log LocalisationUpdate completed (1.22wmf2) at Tue Apr 30 02:15:03 UTC 2013 [02:15:10] Logged the message, Master [02:16:49] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [02:18:01] Aaron|home: [02:18:02] - 10.64.32.13 - - [30/Apr/2013:02:17:31 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.28?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:05] - 10.64.32.13 - - [30/Apr/2013:02:17:31 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.29?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:09] - 10.64.32.13 - - [30/Apr/2013:02:17:31 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.2a?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:11] - 10.64.32.13 - - [30/Apr/2013:02:17:32 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.2b?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:14] - 10.64.32.13 - - [30/Apr/2013:02:17:32 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.2c?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:17] - 10.64.32.13 - - [30/Apr/2013:02:17:32 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.2d?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:21] - 10.64.32.13 - - [30/Apr/2013:02:17:32 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.2e?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:23] - 10.64.32.13 - - [30/Apr/2013:02:17:32 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.2f?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:25] pastebins :) [02:18:27] that's strange isn't it? [02:18:27] that's terbium [02:18:53] prefix 4:/ ?! [02:19:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [02:19:13] odd [02:19:38] http://p.defau.lt/?hSRlA6l4vd41toLIVWp45A [02:19:54] are you sure that's not just the logs having weird encoding? [02:20:38] the rest look normal [02:20:43] also, standard apache logging [02:21:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [02:28:07] paravoid: meh, looks like 4/ too me [02:29:03] yes [02:29:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 222 seconds [02:29:16] why look for 4/ under every wiki under each container? 
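For reference, the per-shard listing visible in the access log excerpts above can be reproduced by hand roughly like this ($SWIFT_URL and $TOKEN are placeholders, not the real endpoint or credentials):

    # One container listing per shard suffix 00..ff, each filtered to the 4/ prefix,
    # which is what produces the burst of GETs quoted above.
    for shard in $(printf '%02x ' $(seq 0 255)); do
      curl -s -H "X-Auth-Token: $TOKEN" \
        "$SWIFT_URL/v1/wikipedia-it-local-transcoded.${shard}?limit=9000&prefix=4%2F"
    done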
[02:29:30] container shard that is [02:31:03] anyway [02:31:04] sleep [02:31:06] bye :) [02:31:33] getFileList() called on zone/4/ will do that (since there are 256 shards) [02:31:49] the calling sh script could go through the 256 manually though [02:32:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 28 seconds [02:32:36] that would be less RTTs, with a bit more hard coding [02:33:50] !log LocalisationUpdate completed (1.22wmf3) at Tue Apr 30 02:33:50 UTC 2013 [02:33:59] Logged the message, Master [02:44:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 207 seconds [02:47:07] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [02:54:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 237 seconds [02:57:07] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 18 seconds [03:19:05] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 204 seconds [03:23:05] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 29 seconds [03:27:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:28:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time [03:34:05] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [03:36:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 19 seconds [03:44:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 218 seconds [03:53:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [04:06:11] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Apr 30 04:06:11 UTC 2013 [04:06:20] Logged the message, Master [04:14:06] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [04:19:06] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 22 seconds [04:24:06] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [04:26:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [04:28:06] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [04:49:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [04:52:01] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [04:54:09] New patchset: Tim Starling; "Log for bug 47807" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61537 [04:54:55] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61537 [04:55:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 191 seconds [04:59:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [05:00:01] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [05:19:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 237 seconds [05:23:01] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 19 seconds [05:30:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 219 seconds [05:36:50] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in 
the last 10 hours [05:36:50] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [05:36:50] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [05:36:50] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [05:38:00] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [05:41:40] PROBLEM - RAID on mc15 is CRITICAL: Timeout while attempting connection [05:42:40] RECOVERY - RAID on mc15 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [05:45:01] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 222 seconds [05:47:00] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 24 seconds [06:04:24] New review: Nemo bis; "@Krinkle: there is already bug 36471." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58082 [06:15:52] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [06:17:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:18:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [07:06:00] PROBLEM - Puppet freshness on vanadium is CRITICAL: No successful Puppet run in the last 10 hours [07:09:11] New patchset: Tim Starling; "Mostly rewrite missing.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [07:14:30] New review: Tim Starling; "$_SERVER['REQUEST_URI'] contains the host when the request line is an absolute URL. This could possi..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [08:24:56] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 216 seconds [08:26:56] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 19 seconds [08:33:36] PROBLEM - Puppet freshness on cp1031 is CRITICAL: No successful Puppet run in the last 10 hours [08:35:24] PROBLEM - Host professor is DOWN: PING CRITICAL - Packet loss = 100% [09:06:30] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [09:07:41] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [10:11:01] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [10:23:12] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 188 seconds [10:24:11] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 25 seconds [10:31:30] New review: Hashar; "Daniel, that yet another puppet-lint related change :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61244 [10:32:24] lunnnchhh [10:44:57] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [10:44:57] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [11:01:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:02:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [11:21:11] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:23:02] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [11:27:09] !log refresh-translatable-pages.php finished for mediawikiwiki and metawiki [11:27:17] Logged the message, Master [11:28:21] yay [11:37:29] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [12:01:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:02:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [12:14:34] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 214 seconds [12:15:35] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 29 seconds [12:17:15] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [12:31:13] New patchset: Hashar; "system_role for role::applicationserver::appserver::beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61558 [12:31:45] PROBLEM - search indices - check lucene status page on search1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60051 bytes in 0.013 second response time [12:44:36] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 221 seconds [12:45:36] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 29 seconds [12:56:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [13:09:31] New patchset: Krinkle; "gerrit: Collapse logo in layout on narrow screens" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58082 [13:10:35] New review: Krinkle; "Re-instate -1. The intention is good, but this CSS change doesn't work. It doesn't collapse the wind..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/58082 [13:13:51] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 221 seconds [13:15:51] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [13:19:35] Change abandoned: Nemo bis; "Right, sorry, better if the state is clear." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58082 [13:48:27] New patchset: Cmjohnson; "Adding new key for rfaulk (rt5040) for stat1 access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61567 [13:54:49] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 226 seconds [13:56:32] !log jenkins seems to not be working -- restarting [13:56:40] Logged the message, Master [13:57:12] New patchset: Peachey88; "Adding new key for rfaulk (RT 5040) for stat1 access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61567 [13:58:39] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [13:58:49] PROBLEM - SSH on caesium is CRITICAL: Server answer: [13:58:49] PROBLEM - SSH on gadolinium is CRITICAL: Server answer: [13:58:50] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 16 seconds [13:58:58] PROBLEM - SSH on cp1043 is CRITICAL: Server answer: [13:59:48] RECOVERY - SSH on caesium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:59:49] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:59:59] Change abandoned: Cmjohnson; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61567 [14:01:39] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:02:49] PROBLEM - SSH on caesium is CRITICAL: Server answer: [14:02:51] apergos: ping? [14:02:59] RECOVERY - SSH on cp1043 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:03:25] paravoid: pongggg [14:05:05] :) [14:05:12] need any help with swift? [14:05:30] don't think so, thanks for checking [14:05:48] how did ceph writes deployment go? [14:06:23] works :) [14:06:59] great! and reads are next week? [14:06:59] I need to do some peripheral stuff for now like... documentation :) [14:07:02] ah [14:07:05] details :-D [14:07:08] :-) [14:07:17] and mailing ops@ about the status [14:07:22] New patchset: Cmjohnson; "Adding rfaulk's new key to admins for stat1 access..rt5040" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61569 [14:08:12] that would be nice (docs), seems like icinga for example still has no docs on wikitech :-/ [14:08:30] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61569 [14:08:43] RECOVERY - SSH on caesium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:09:09] yeah [14:09:12] ben was really good at that :) [14:09:46] so what's the holdup with swift? [14:10:02] indeed [14:10:25] waiting for the first round of object replication to complete after having to remove the old ms-be2 devices (new zone) [14:10:39] they moved it to a new rack see, with no other server in there [14:10:51] oh [14:10:52] and apparently one can't change the zone on a device, the only thing you can do is delete them and rebalance [14:11:10] I would have expected it to be done by now tbh but hopefully later today [14:12:37] what was the entry before? [14:12:39] 33%? [14:13:08] no, and that's what's so annoying [14:15:59] hey paravoid, would you have some time today to look at the kafka puppet module? we would love to see this getting merged :D [14:16:32] I'll try to find some time [14:17:13] thaaaaaaank yooouuuuuuuuu!!!! 
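A sketch of the remove-and-rebalance procedure described a few messages up, since a device's zone cannot be edited in place (builder file name, IP, port, device and weight are placeholders):

    swift-ring-builder object.builder remove z2-10.0.6.202:6000/sdb1
    swift-ring-builder object.builder add z5-10.0.6.202:6000/sdb1 100
    swift-ring-builder object.builder rebalance
    # ...then push the updated ring to all nodes and wait for object replication
    # to settle, which is the step being waited on here.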
[14:18:52] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 222 seconds [14:18:53] heh, don't thank me yet :) [14:19:30] well we owe you many thank you's for past reviews as well ;) [14:20:52] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [14:30:04] cmjohnson1: What exactly wasn't working? [14:30:21] Jenkins restarts easily take an hour. Doing that unscheduled takes out CI and Gerrit review entirely. [14:31:36] krinkle: it wasn't reviewing...sorry i did not know it took an hour or so. [14:31:47] cmjohnson1: define 'wasn't reviewing' [14:32:00] Did you check whether it was busy? Zuul can have a long backlog sometimes when it is busy [14:32:19] my changes were not being reviewed by jenkins [14:32:33] Was it listed on https://integration.wikimedia.org/zuul/ ? [14:32:35] krinkle ...no i did not...i should have ...sorry [14:32:36] (as queued) [14:34:25] cmjohnson1: Please never restart Jenkins unless it has crashed. Restarting it without gracefully stopping Zuul first causes a lot of false positives accros Gerrit. And even then restarting takes a looooong time because there is a lot of data. Should only be done scheduled and in correspondence with the CI team (unless it is down, in which a restart is always justified because it would bring it b [14:34:25] ack up) [14:34:46] but it wasn't down I believe :) [14:35:35] Krinkly..duly noted...I didn't realize the ramifications....i will be more careful next time. [14:35:59] thanks :) [14:37:40] (It's a bit like restarting mysql while php is still running requests and Apache even taking new requests) [14:40:45] oh..that's bad! ...yeah I thought the service was more like "morebots"....yeah so lesson learned on that one...(krinkle) [14:41:22] yeah, it's quite a bit bigger than a gerrit bot [14:42:21] hashar: It's still restarting [14:42:46] looking [14:43:11] We'll just have to sit this one out. There is no shortcut to initialising Jenkins [14:45:26] Krinkle: apparently got restarted for some reason. There is no stack trace though [14:45:35] hashar: back scroll ^^^ [14:45:58] * cmjohnson1 apologizes again!  [14:46:55] ahhh [14:47:01] restarted intentionally :-D [14:47:26] Krinkle: the slow start up has been fixed upstream btw [14:47:34] need to upgrade Jenkins to the new version and that would fix it [14:49:14] hashar: Interesting, got a commit or issue I can look at? I wonder what they did to speed it up [14:49:54] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 237 seconds [14:53:54] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 223 seconds [14:54:08] New patchset: coren; "Transition class for switch to Labs NFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [14:54:27] Krinkle: it received a lot of patches apparently. The root issue is https://issues.jenkins-ci.org/browse/JENKINS-8754 [14:54:42] """ I'm marking this resolved based on the fix in 1.485 that does lazy loading of build records.""" [14:55:28] and jenkins upgrade is at https://bugzilla.wikimedia.org/show_bug.cgi?id=47744 [14:57:54] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 23 seconds [14:57:59] Right, so they lazy-load it [14:58:04] That'll fix it. [15:03:02] New patchset: Krinkle; "labsnfs: Transition class for switch to Labs NFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [15:12:57] cmjohnson1: so yeah Jenkins is a bit slow to restart (up to an hour which is a bug). 
Most of the time the issue lie somewhere else :-] [15:13:18] cmjohnson1: and thanks for restarting it, it had at least one thread locked at 100% CPU usage, I guess the restarted cleared it :-] [15:13:46] Krinkle: I need to get the jenkins.deb uploaded on apt.wm.o then do the upgrade + plugin upgrade [15:13:50] hashar: yeah...i did not even think that the jenkins was that complicated. I know now but sorry for the unannounced downtime [15:13:55] Krinkle: will give a poke at it next week probably [15:14:02] cmjohnson1: shit happens :-] [15:14:09] it does [15:14:23] cmjohnson1: that is "just" annoying east coast ops and the european devs. Less than half a hundred people hehea [15:14:29] that is small given the half billions of people we serve [15:14:31] ;-D [15:14:46] heh ..valid point [15:15:06] cmjohnson1: about slowness. Gerrit send a flow of events to Zuul which is a python daemon handling the events. [15:15:13] New review: coren; "Addressed. New changeset incoming." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [15:15:38] cmjohnson1: https://integration.wikimedia.org/zuul/ will shows what is currently running and show a Queue lengths: 0 events, 0 results. The events correspond to events received from Gerrit. [15:15:43] so yeah, Jenkins is what runs the unit tests on patch-submission and (for most mediawiki repositories) does the merge after approving the change. Without it there is basically no linting and no merging happening and development is held up (other than local development of course) [15:16:05] cmjohnson1: then Zuul handle them and some events do not trigger any job. so the event queue can flush very quickly :-] [15:16:28] I should find a way to show the Jenkins build queue as well [15:16:43] hashar: The jenkins queue is shown on every page in jenkins itself [15:16:46] it's in the sidebar [15:16:50] (or you mean show it outside jenkins?) [15:16:59] on the zuul status page [15:17:15] ideally I would like jenkins to send metrics to graphite [15:17:20] New patchset: coren; "labsnfs: Transition class for switch to Labs NFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [15:17:21] http://dev.hubspot.com/blog/bid/81142/Just-open-sourced-Send-Jenkins-Hudson-stats-to-Graphite <:-D [15:17:34] dafu? [15:17:35] sorry http://velohacker.com/2012/01/12/graphing-jenkins-statistics/ [15:17:40] Why did this make a new patchset? [15:17:53] Ah, no it didn't. [15:18:03] Sometimes gerrit can be sooo confusing in its workflow. [15:18:23] that did make a new patchset but not a new change ? :D [15:19:38] hashar: Hm. I note that my responses then end up buried in 2; but I didn't make it into a service because it makes no sense to put an upstart /task/ as one; the ensure => running would just 'start' it at every puppet run. [15:20:41] IMO, YMMV, AATJ. [15:20:59] ahh [15:21:03] that make sense probably :D [15:21:23] and the upstart job is to make it happen at machine startup ? [15:21:36] Specifically, 'on starting autofs' [15:25:31] that make sense [15:25:43] Coren: for the autofs configuration, I can't tell [15:25:51] but I can test it on beta :-D [15:26:05] though that needs /home to be copied I guess [15:28:53] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 223 seconds [15:29:11] The autofs config is known working for tools, at the very least; I expect to worries there. 
[15:31:52] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [15:34:47] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61039 [15:36:58] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [15:36:59] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [15:36:59] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [15:36:59] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [15:46:10] New review: Hashar; "Good to me :-]" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/61578 [15:46:28] ^^ Coren looks fine. I guess I will have to copy the /home from the project storage to labnfs [15:48:52] hashar: rsync ftw. :-) [15:54:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 212 seconds [15:56:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [15:58:01] Coren: yeah alias cp='rsync -av' [16:01:54] New review: Ryan Lane; "Some minor complaints and questions." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/61578 [16:14:53] New review: MaxSem; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [16:16:36] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [16:18:08] New patchset: MaxSem; "Mostly rewrite missing.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [16:24:06] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 237 seconds [16:28:07] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [16:29:34] ori-l: around? [16:30:59] New review: Hashar; "Ah nice! Now we can start writing some unit tests for missing.php." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [16:33:28] New review: Faidon; "I'd prefer if we could get definitions like systemuser & upstart_job in some common modules & not re..." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/60187 [16:35:19] New review: Faidon; "The fail-if-not-Ubuntu is okay, but I can't promise if extra complexity for other platforms will be..." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/60333 [16:36:35] New patchset: Faidon; "Avoid using regexps where string literals would do" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61164 [16:37:21] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61164 [16:39:23] New review: Faidon; "LGTM -- although DocumentRoot could use tabs instead of two spaces to be consistent with the rest of..." [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/60934 [16:41:24] New review: Faidon; "Anything else I can do to help you push this forward? I think the only objection was from Diederik (..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54116 [16:43:35] New review: Diederik; "I have removed my objection." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/54116 [16:47:06] New review: Faidon; "A few inline comments." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/59611 [16:49:21] New patchset: Lcarr; "renaming rfaulk's new key so it doesn't conflict" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61589 [16:50:02] hey paravoid [16:50:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 223 seconds [16:50:28] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61589 [16:52:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [16:55:15] ori-l: heya [16:55:41] I +2 all of your changes -some with a few comments- should I just merge them? [16:55:59] yeah, I was just reading your comments. thanks! and yes, please do [16:56:37] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60094 [16:56:52] New patchset: Faidon; "Create self-standing IPython Notebook Puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60187 [16:57:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60187 [16:57:18] New patchset: Faidon; "Update README" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60332 [16:57:26] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60332 [16:57:41] New patchset: Faidon; "Provide 'certfile' and 'password' parameters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60333 [16:57:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60333 [16:59:30] paravoid: diederik also replied on https://gerrit.wikimedia.org/r/#/c/54116/ , removing his objection [17:00:01] I saw that [17:00:20] we now have to find a way to install without recommends :-) [17:01:14] oh, right [17:02:39] btw, re: Systemuser and Upstart_job -- I don't think the two lines or so that each of them saves you justifies not using the built-in idioms [17:03:05] Upstart_job seems especially bad since it doesn't declare provider => upstart, but instead relies on the symlink to upstart-job for initd compatibility [17:04:13] it also looks for files outside the module IIRC, in ./files [17:04:24] and they have to be hard-coded; no template support. [17:06:40] PROBLEM - Puppet freshness on vanadium is CRITICAL: No successful Puppet run in the last 10 hours [17:06:51] grrrrr [17:12:49] paravoid: any chance you can help me debug a puppet error? [17:14:10] New review: Faidon; "sun-java6-jdk doesn't exist anymore in neither Ubuntu nor Debian, so that build-dep is spurious." [operations/debs/kafka] (master) C: -1; - https://gerrit.wikimedia.org/r/53170 [17:14:43] ori-l: I meant possibly replacing those two defs with some modularized proper versions :) [17:14:50] ori-l: what's the puppet error? 
[17:14:59] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Class[Role::Logging::Eventlogging] is already defined; cannot redefine at /var/lib/git/operations/puppet/manifests/role/logging.pp:243 on node vanadium.eqiad.wmnet [17:15:14] related to MaxSem's change [17:15:23] Ic762640780dac349a144420fa066d91210314d4c [17:16:13] yes [17:16:23] you need to make this class { '::eventlogging': [17:16:45] puppet first tries to resolve the bare class name inside the current namespace, then the top namespace [17:16:51] right, yep [17:16:52] patch coming [17:17:23] it's an annoying feature [17:17:35] I'd very much prefer to always have to qualify includes [17:17:56] hmm, why doesn't it trigger that error on labs? [17:18:36] ah, I didn't actually use the role class [17:19:41] New patchset: Ori.livneh; "Disambiguate scope of included 'eventlogging' class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61594 [17:20:35] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61594 [17:20:50] thanks paravoid, MaxSem [17:21:35] New patchset: coren; "labsnfs: Transition class for switch to Labs NFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [17:26:27] New review: coren; "Responses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [17:26:45] New patchset: Ori.livneh; "Qualify path to IPython executable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61595 [17:27:02] ^ paravoid that's another bugfix :/ sorry [17:27:26] heya mutante [17:27:28] i can take this: [17:27:28] https://rt.wikimedia.org/Ticket/Display.html?id=5039 [17:27:35] but, i'm not sure of the best way to go about it [17:27:43] i've added records before, but not new domains [17:28:31] oo instructions! [17:28:32] https://wikitech.wikimedia.org/wiki/DNS#Adding_a_new_zone [17:29:21] ottomata: must be a trap [17:29:32] :) [17:29:33] New patchset: coren; "labsnfs: Transition class for switch to Labs NFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [17:30:14] ottomata: i think faidon may have stepped out for a sec, any chance you can merge a small bugfix for me? https://gerrit.wikimedia.org/r/#/c/61595/ [17:30:24] appropriate especially since you just chided me the other day to qualify my paths :P [17:30:48] back :) [17:30:50] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61595 [17:31:01] ta da! [17:31:05] sorry :) [17:31:13] thanks guys [17:31:27] * ori-l tries puppetd -tv again [17:32:59] paravoid: https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0Ai_u2wTiMldddHVYLWQ4WGpQalhfMnVhREZFc1o4NlE#gid=0 [17:33:17] IPv6 traffic [17:33:26] mobile site only, as that's what we've got for unsampled [17:33:28] but still [17:34:04] 2% [17:34:06] excellent! [17:34:11] :) [17:34:46] can we have it in a limn graph? :-) [17:35:14] i mean, it's one row of data :P [17:35:32] I know, I mean to track this historically [17:35:36] per month or whatever [17:35:53] you'd have to talk to the rest of my team [17:36:04] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61487 [17:36:41] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61348 [17:37:57] New patchset: Bsitu; "Config Echo to use extension1 db" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61596 [17:38:13] paravoid, ottomata: works! 
thanks again [17:40:26] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61596 [17:40:43] New patchset: Faidon; "New upstream release." [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/61597 [17:41:53] New patchset: Faidon; "New upstream release." [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/61597 [17:42:25] Change merged: Faidon; [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/61597 [17:42:39] mutante, I've got a ticket that you're going to *love* https://rt.wikimedia.org/Ticket/Display.html?id=5042 :P [17:43:16] Thehelpfulone: aaahaha, "love" indeed:p [17:44:00] how many committees are there?:) [17:44:11] New patchset: Bsitu; "Add enwiki to Echo dblist file" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61598 [17:44:22] heh not many that are requesting new wikis, this could be the only new one in a while [17:44:53] you know what they say.. practice makes perfect! ;) [17:44:55] ok [17:45:01] hehe,yea [17:45:02] !log kaldari synchronized wmf-config/InitialiseSettings.php 'Syncing InitialiseSettings for Echo deployment' [17:45:10] Logged the message, Master [17:47:28] !log kaldari synchronized wmf-config/CommonSettings.php 'Syncing CommonSettings for Echo deployment' [17:47:37] Logged the message, Master [17:52:43] New review: coren; "Responses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [17:53:52] Krinkle: jsduck 4.8.0 on gallium [17:54:34] paravoid: marvellous, thanks! [18:03:59] New review: Krinkle; "(dz) manual rebase / fix path conflict." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [18:04:10] New review: Krinkle; "(dz) - don't use the -infile variable anymore" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [18:06:40] New review: Krinkle; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [18:10:49] New patchset: Krinkle; "wikibugs: Set up #mediawiki-visualeditor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37570 [18:12:18] New patchset: RobH; "creating racktables role for eqiad based server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60766 [18:12:54] mutante: I'd like to merge https://gerrit.wikimedia.org/r/#/c/54984/, however that needs the change in operations/puppet (https://gerrit.wikimedia.org/r/#/c/37570/) to be merged first. Is the wikibugs puppetisation complete for this? [18:12:55] god damn it [18:13:03] I see you merged it but it looks like it is still not in #wikimedia-dev somehow [18:13:38] New patchset: RobH; "creating racktables role for eqiad based server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60766 [18:13:56] there we go, accidentally pulled an unrelated file into my patchset. [18:14:54] New patchset: RobH; "creating racktables role for eqiad based server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60766 [18:14:56] who should i talk to about revoking ssh keys, and puting a different one in? Bit of a laptop mishap [18:15:00] Krinkle: no, that won't work, it cant clone on the host [18:15:07] mutante: ^ [18:15:14] wanna look at patchset when you have a moment? [18:16:06] mutante: so the wikibugs puppet code you merged is unused at the moment? [18:16:30] ebernhardson: could you mail ops-requests@rt [18:16:30] What needs to be done to get it live? 
(so we can start updating the configuration through puppet) [18:17:14] mutante: sure, is it @rt.wikimedia.org ? [18:17:27] mutante: never used rt :) [18:17:28] Krinkle: yes, it's unused because last time we tried merging it it failed pretty bad and then were happy to even be able to hack it back to how it was before. [18:17:34] ebernhardson: yes, it is. thanks [18:17:42] If it takes 6 months and more to get it to use puppet, perhaps we can update the configuration without puppet? It's just a simple channel extension (I'm referring to the addition of -visualeditor not mediawiki>wikiimedia-dev, which can happen even with it staying in #mediawiki) [18:18:01] mutante: So it has been reverted? [18:18:30] Krinkle: the root cause is that it is runnin on mchenry, what needs to be done is upgrade that server or move the bot or deploy manually [18:18:32] https://gerrit.wikimedia.org/r/#/c/53973/ doesn't include a "Reverted in .." or something. [18:19:39] ok. well, I don't know how you want it to be done. I just like to get wikibugs to start channelling to #mediawiki-visisualeditor for VisualEditor bugs. The moving of the main stream from #mediawiki to #wikimedia-dev can be done through puppet once that is finished. [18:19:47] New patchset: Pyoungmeister; "db51 -> file per table" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61606 [18:19:57] Krinkle: no, i didn't want to revert everything, there were a whole bunch of changes in between, most of them making it better than before, so i just didnt include it on mchenry to prevent puppet from re-breaking it over and over [18:20:07] ok [18:20:30] PROBLEM - mysqld processes on db51 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:20:40] I don't know how it should be done, i also just tried to merge some of them once not expecting these issues at all [18:21:11] but.. if we can wait for the puppet "git clone" part.. [18:21:48] mutante: What's so special about mchenry? Is it not part of the puppetmaster pool or something? ("upgrade") [18:21:49] see, even moving it "manually" failed, for some reason it wasn't just changing the channel name in the existing unpuppetized init script [18:21:57] and the puppetized stuff started multiple processes of it [18:22:05] yikes [18:22:39] it's special because the git cloning won't work because the client is so old [18:25:16] mutante: what would an 'upgrade' entail? Are there other services on mchenry that need to be disabled temporarily thus making it non-trivial to do? [18:25:20] Or is it an undocumented process? [18:25:27] Krinkle: yes, its the main mail server :o [18:26:21] it's going to happen though one way or another, because we're moving out of Tampa [18:26:38] mutante: would manually applying the change work? (e.g. apply the diff that changes the channel to the configuration file) [18:27:47] maybe, there are a bunch of changes [18:28:17] its also related to using puppetized ircecho [18:29:32] the whole way it is being started right now is still: /usr/local/bin/start-wikibugs-bot [18:31:01] i commented on one of the changes we should include that somewhere, but it was being hoped this is replaced by puppetized ircecho [18:31:15] tail -n0 -f /var/wikibugs/wikibugs.log | \ [18:31:15] /usr/local/bin/ircecho "#mediawiki" wikibugs irc.freenode.net \ [18:31:28] simply changing the channel name there has been tried.. 
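A hypothetical stopgap following the same pattern as the existing start script: a second ircecho instance fed only the VisualEditor lines (the grep pattern and the bot nick are assumptions, and this is not necessarily how the eventual puppetized setup would do it):

    tail -n0 -f /var/wikibugs/wikibugs.log | grep --line-buffered 'VisualEditor' | \
        /usr/local/bin/ircecho "#mediawiki-visualeditor" wikibugs-ve irc.freenode.net &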
[18:33:59] PROBLEM - Puppet freshness on cp1031 is CRITICAL: No successful Puppet run in the last 10 hours [18:35:07] sorry, the only thing i can tell for sure right now is that above is how it works. and i need to focus on making a list of Tampa services and moving other old stuff [18:38:10] mutante: ok. [18:38:35] mutante: Can you give a guestimate on how long it might take to get this resolved (one way or another)? [18:40:26] to manually hack it before actual cloning works: it depends how much the "on duty" person of the week feels like getting into it. to fix the root cause: "a couple weeks" ... hmmm .. [18:41:20] New patchset: Kaldari; "Turning Echo on for English Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61610 [18:42:20] New review: Ottomata; "> There's a zookeeper package for example" [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/53170 [18:42:47] kaldari: :) [18:43:50] New patchset: RobH; "revoke erik b key per rt 5045" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61611 [18:47:22] New review: RobH; "who needs zuul...." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/61611 [18:47:26] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61611 [18:50:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61606 [18:51:01] about to scap [18:51:47] PROBLEM - SSH on mc15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:37] RECOVERY - SSH on mc15 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:56:38] neeed fooooood [18:56:40] back in an hour [18:57:47] PROBLEM - Disk space on mc15 is CRITICAL: Timeout while attempting connection [18:59:47] RECOVERY - Disk space on mc15 is OK: DISK OK [19:02:12] New patchset: Cmjohnson; "Decommission db10 - certs cleaned, changes made to site.pp, decom.pp and dhcpd files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61616 [19:02:50] !log removing db10 certs from puppet [19:02:58] Logged the message, Master [19:07:07] New patchset: Cmjohnson; "Decommission db10 - certs cleaned, changes made to site.pp, decom.pp and dhcpd files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61616 [19:13:19] mutante, Krinkle: there is a puppetized replacement for ircecho, see modules/tcpircbot [19:15:30] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [19:15:50] New patchset: Cmjohnson; "Decommission db10 - certs cleaned, changes made to site.pp, decom.pp and dhcpd files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61616 [19:16:18] Change abandoned: Cmjohnson; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61616 [19:18:38] !log kaldari Started syncing Wikimedia installation... 
: [19:18:45] Logged the message, Master [19:20:40] New patchset: Cmjohnson; "Decommissioning db10" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61620 [19:23:56] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 225 seconds [19:26:55] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 30 seconds [19:28:45] New patchset: Cmjohnson; "Decommissioning db10" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61620 [19:29:53] robh: can you double check that ...especially site.pp change thx [19:29:53] RoanKattouw, FYI: on https://www.mediawiki.org/wiki/VisualEditor:Test clicking edit gives me a "Error loading data from server: parsoidserver-http-bad-status:404" error [19:30:11] Ouch [19:30:20] WTF that's bad [19:30:57] I was about to investigate another related bug but I'll jump right into this one [19:32:30] thanks [19:33:35] Hah, someone tried very hard to break that page [19:33:37] It contains #REDIRECT [[www.google.com]] [19:33:57] And, unsurprisingly, the result is http://parsoid.wmflabs.org/mw/VisualEditor:Test [19:35:43] Thehelpfulone: The Parsoid team is aware, thanks for the report :) [19:45:21] New patchset: Lcarr; "removing deploymentscripts from bast1001 Two reasons : 1 it's not a deployment host and 2 - it breaks puppet since timidity is not installed err: /Stage[main]/Mediawiki/Service[timidity]: Could not evaluate: Could not find init script for 'timidity'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61625 [19:45:28] can i get someone to check this out ? ^^ [19:45:34] New patchset: BBlack; "Work-In-Progress vhtcpd code." [operations/software/varnish/vhtcpd] (master) - https://gerrit.wikimedia.org/r/60390 [19:48:48] !log kaldari Finished syncing Wikimedia installation... 
: [19:48:56] Logged the message, Master [19:51:03] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61620 [19:53:19] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61610 [19:58:42] New patchset: Ori.livneh; "Add README and update dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61627 [19:58:43] ^ that change touches the docs only, if someone wants to merge :) [19:58:43] New patchset: Lcarr; "removing deploymentscripts from bast1001 Two reasons : 1 it's not a deployment host and 2 - it breaks puppet since timidity is not installed err: /Stage[main]/Mediawiki/Service[timidity]: Could not evaluate: Could not find init script for 'timidity'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61625 [19:58:43] rebasetastic [19:58:58] !log kaldari synchronized wmf-config/InitialiseSettings.php 'turning on Echo for English Wikipedia' [19:59:06] Logged the message, Master [19:59:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61625 [19:59:41] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61598 [20:03:41] !log kaldari synchronized echowikis.dblist 'syncing echowikis.dblist for cron jobs' [20:03:50] Logged the message, Master [20:08:30] New patchset: MaxSem; "This condition looks unneeded" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61629 [20:08:35] RECOVERY - Puppet freshness on vanadium is OK: puppet ran at Tue Apr 30 20:08:30 UTC 2013 [20:11:54] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [20:14:49] New review: awjrichards; "This should be fine - I can't think of any reason why we should be dbl checking $wgMobileUrlTemplate" [operations/mediawiki-config] (master); V: 1 C: 1; - https://gerrit.wikimedia.org/r/61629 [20:16:25] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59553 [20:16:45] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61629 [20:18:02] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60885 [20:33:54] New patchset: RobH; "creating racktables role for eqiad based server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60766 [20:38:28] https://gdash.wikimedia.org/ is down [20:38:34] ironically [20:38:47] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 209 seconds [20:39:11] New patchset: Hashar; "beta: change syslog instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61636 [20:39:31] may I get a merge of https://gerrit.wikimedia.org/r/61636 for beta please? :-D [20:39:38] PROBLEM - DPKG on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:39:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [20:40:28] RECOVERY - DPKG on mc15 is OK: All packages OK [20:45:12] New review: Tychay; "Used to change Echo to use job queue instead of synchronously" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61479 [20:45:17] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [20:45:18] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [20:48:40] New patchset: Kaldari; "Setting custom Echo help URL for enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61637 [20:51:04] heya hashar, you there? [20:51:28] ottomata: sprinting a beta migration then off to bed [20:51:34] hmmm ok ok [20:53:14] New patchset: Pgehres; "Enabling Extension:AccountAudit on all wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61639 [20:54:51] hashar sprints a lot [20:55:32] if only I could self.clone() [20:55:45] pgehres, will that be a new special page? https://www.mediawiki.org/wiki/Extension:AccountAudit is missing that bit of info [20:55:57] Thehelpfulone: nope [20:56:06] Just a new backend table and hook onlogin [20:56:29] Thehelpfulone: I am actually updating the docs currently [20:56:34] ah great [20:56:41] the purpose changed since I wrote them [20:58:47] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 225 seconds [20:59:12] binasher: roan and me are investigating post timeouts in the Parsoid varnish, and have tracked it down to Varnish's use of sess_timeout while receiving data from the client [20:59:26] the default is 5 seconds, which is too low for large pages [20:59:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 20 seconds [20:59:52] (For POST requests) [21:00:17] -p sess_timeout=15 in /etc/default/varnish would work, but does not seem to be directly supported by extraopts in the puppet varnish class [21:00:18] gwicke: RoanKattouw: that makes sense [21:00:48] what timeouts are set for the backends within the vcl? [21:00:53] I have not seen any way to set sess_timeout (or other runtime vars) from vcl [21:01:11] None, the VCL is like two lines [21:01:12] sess_timeout doesn't sound like what should be sent in that case [21:01:16] where's the vcl? [21:01:18] this is a client timeout, so cannot afaik be set in the backend section of the vcl [21:01:23] IIRC the default GET timeout is like 60s [21:01:29] And client-side our timeout is something like 100s [21:01:57] binasher: https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=templates/misc/parsoid.vcl.erb;h=d65bcffc9e477903922b0faab1ab848f39e08579;hb=HEAD [21:02:08] gwicke: sess_timeout isn't vcl, it's a command line arg [21:02:22] keen: yep, that's my understanding too [21:02:23] -p sess_timeout=5 [21:02:41] keen: that's what gwicke said above [21:02:55] heh, missed it. ;) [21:03:32] wtf [21:03:50] varnish is running /etc/varnish/default.vcl [21:04:05] I guess the extraopts handling could be extended to allow non-name ones to be passed in [21:04:08] binasher: I think that's because I installed that file as default.vcl ? [21:04:19] why wasn't this setup using our varnish classes? 
[21:04:22] The Varnish setup on cerium/titanium is *extremely* simple [21:04:22] New patchset: Bsitu; "Enable job queue to process web and email notf on non-en wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61647 [21:04:33] in puppet/templates/varnish/varnish-default.erb [21:04:47] binasher: I am working on porting it to the new role stuff as a side project [21:05:07] IIRC when I tried setting it up initially I had trouble with the custom port (8000) when using the older Varnish classes [21:05:24] ah, so puppet/templates/varnish/varnish-default.erb is not used currently for the Parsoid boxes? [21:05:38] Probably not [21:05:54] ok [21:05:55] The setup currently is way too simple and I've been working on fixing it, but I haven't had much time [21:06:12] i think you want to add .first_byte_timeout to the vcl [21:06:24] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [21:06:30] It needs to be rewritten anyway because we'll want consistent hashing between the two like mobile has [21:06:35] binasher: https://www.varnish-cache.org/trac/ticket/849 [21:07:08] Mark suggested he was going to look at that, and I was hoping I could bribe Ryan to work on it with the CR I did for him this week [21:07:40] gwicke: is mediawiki posting to varnish or browsers? [21:07:49] binasher: MediaWiki [21:07:54] ahh [21:08:14] The Varnishes currently have public IPs but won't for much longer [21:08:29] And the traffic all goes through internal LVS VIPs [21:08:46] RoanKattouw: for now it would be good to figure out if we can get -p sess_timout=15 into /etc/default/varnish somehow [21:08:53] Right [21:09:00] Let me see [21:09:30] We can puppetize that file [21:09:35] Not sure how that's done elsewhere? [21:10:09] the default varnish class generates it from puppet/templates/varnish/varnish-default.erb I guess [21:10:26] Yes [21:10:32] I can put in a quick hack for this [21:10:40] Doing that now [21:10:55] RoanKattouw: ah. right. I was supposed to look at that for you [21:11:42] Ryan_Lane: If you actually do the whole thing I'll probably owe you quite a bit more CR :) [21:12:21] puppet/manifests/varnish.pp sets $extraopts if a $name is passed in, but does not seem to support passing in other extraopts currently [21:12:48] :D [21:13:06] or rather, it sets a default name if name == "" [21:13:18] I can't promise anything. I'll try to take a look at it this week. I'm also trying to upgrade openstack in production this week… sooo…. :) [21:13:28] Ryan_Lane: Yeah similar on my end [21:13:43] I just finished putting a bunch of stuff in production so I'm behind [21:13:53] I think I'm going to be absurdly swamped till ams hackathon [21:14:10] Yeah [21:14:20] If I have time I'll try to do it myself and get Mark to help me [21:14:27] because I'm upgrading openstack, switching everything away from gluster to nfs with coren, and getting database auth up in labs with coren [21:15:32] some of these things will be time consuming in a way I'll have free-time, though, so I'll look at it then [21:16:04] RoanKattouw: can you just insert -p sess_timeout=15 on cerium and restart varnish to test it? [21:16:59] Sure [21:17:11] It will get set back by puppet though, probably within the hour [21:17:21] Ryan_Lane: Awesome [21:17:27] RoanKattouw: i don't think so.. 
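Roughly what the manual edit being planned for cerium/titanium looks like; /etc/default/varnish is sourced by the init script, so varnish has to be restarted afterwards. Only the -p sess_timeout part is the point; the rest of the line is illustrative:

    # /etc/default/varnish (excerpt, illustrative)
    DAEMON_OPTS="-a :8000 \
                 -T localhost:6082 \
                 -f /etc/varnish/default.vcl \
                 -S /etc/varnish/secret \
                 -s malloc,1G \
                 -p sess_timeout=15"

(15 is the value being tried here; it gets bumped further down the log.)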
[21:17:39] Oh, right, it won't be [21:17:42] puppet isn't being used to install /etc/default/varnish [21:17:46] He's right [21:17:47] The file looks different though [21:18:09] nm found it [21:18:23] Swap file "/etc/default/.varnish.swp" already exists! [21:18:24] heh [21:18:37] OK done on cerium [21:18:38] this is just something you edited on top of the package default version [21:18:41] Now doing titanium [21:18:44] Yeah [21:18:59] Puppet won't eat my homework because bad bad Roan hasn't puppetized this [21:19:26] OK, restarted both [21:19:32] Let's see if this thing works [21:19:46] * gwicke is trying with Obama [21:20:05] * RoanKattouw has to refresh the Cologne landmarks article because of a token error [21:21:14] I'm still getting a timeout [21:21:36] After 5s or 15s? [21:21:55] Mine takes 21s now as opposed to 7s (but still times out) [21:22:18] 16.41s spent waiting [21:22:39] So the config change seems to have worked, it's just that 15s also isn't long enough to serialize Obama or the Cologne Monuments page [21:23:24] New review: Tychay; "related enwiki change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61479 [21:23:25] maybe up it to 60? [21:23:38] Setting to 60, hang on [21:23:49] I was under the impression that sess_timeout only covers the data retrieval phase from the client [21:24:01] OK try now [21:24:04] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 195 seconds [21:25:04] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [21:26:45] gwicke: OK that worked for some definition of "worked" [21:27:03] Serializing [[Liste der Denkmäler im Kölner Stadtteil Altstadt-Nord]] now gives me a 504 [21:27:09] it times our really quickly for me [21:27:10] After 62 seconds [21:27:36] hm [21:28:50] timeout after about 5 seconds when trying to save Obama [21:29:02] this is using FF [21:29:54] I used FF and got a timeout after 62s [21:30:59] about 6 seconds in chromium [21:31:06] weird [21:31:12] on Obama? [21:31:18] Let me try Obama [21:31:22] In Chromium [21:35:04] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 192 seconds [21:35:27] gwicke: WTF yeah you're right [21:36:55] That's weird; maybe the request is too large? [21:37:57] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [21:38:57] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [21:39:54] !log maxsem synchronized php-1.22wmf3/extensions/MobileFrontend/ 'Weekly mobile deployment' [21:40:02] Logged the message, Master [21:44:11] gwicke: Retrying that request against cerium it seems to be a 503 [21:44:14] Pretty quickly [21:44:24] I'm guessing there's a Varnish setting for limiting the POST body size? [21:44:57] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 217 seconds [21:45:57] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 28 seconds [21:47:53] Ugh, OK [21:47:55] I see [21:48:10] I'm getting a 500 from Parsoid and Varnish takes all errors and translates them to 503s [21:48:12] (thanks Varnish) [21:48:20] I guess that must also be configurable somewhere [21:49:49] RoanKattouw: can you check the log on the parsoid backend? 
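One way to reproduce this outside a browser is to time a large POST with curl and see where it gets cut off; the hostname, port, path and form field below are placeholders rather than the real Parsoid API of the time, the useful part is the -w status/timing output:

    # Hypothetical reproduction -- endpoint and field name are placeholders.
    curl -s -o /dev/null \
         -w 'HTTP %{http_code} after %{time_total}s\n' \
         --data-urlencode "content@big-article.html" \
         http://parsoid-cache.example:8000/somewiki/Some_Title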
[21:49:57] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 202 seconds [21:50:34] Will do [21:50:53] From Varnish ends I either see an Internal Server Error or the backend just going away [21:51:46] Whoa [21:51:49] I found something already [21:51:53] *alright [21:52:01] Check out wtp1003:/var/lib/parsoid/nohup.out [21:52:25] WARNING: DSR inconsistency: cs/s mismatch for node: BODY s: 0; cs: 1 for Khalid_Altowelli (probably unrelated) [21:52:47] Error: maxFieldsSize exceeded, received 2144394 bytes of field data [21:52:51] haha I found it [21:52:51] there we go [21:52:52] Error: maxFieldsSize exceeded, received 2144394 bytes of field data [21:52:54] Yup [21:52:57] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [21:53:02] hrmm, git review is just sitting here for me ow.... [21:53:06] i wonder if its just me. [21:53:17] RobH: Can always Ctrl+C and try agian [21:53:30] But yeah it's slow sometimes :( [21:53:49] i did it a few times [21:53:52] its sitting over 5 minutes [21:55:34] That's really odd [21:55:40] And you clearly do have access to the internet ;) [21:55:54] New patchset: RobH; "creating racktables role for eqiad based server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60766 [21:55:56] .....it went througyh! [21:55:59] huzzzaaaah [21:56:03] !log maxsem synchronized php-1.22wmf2/extensions/MobileFrontend/ 'Weekly mobile deployment' [21:56:10] Logged the message, Master [21:56:11] it took..... 4 minutes 45 seconds that time [21:56:17] i ctrl+c at 5 minutes. [21:56:25] so slow. [21:57:29] hrmm, java is taking up all cpu cycles on manganese. [21:57:37] under gerrit2 user. [21:58:08] !log restarting mysqld on db1025 to enable federated engine for testing [21:58:16] Logged the message, Master [21:58:17] mutante: ^ new changeset, but gerrit is now so slow to be useless. [21:58:27] and none of the gerrit experts are around. [21:58:33] RobH, is it safe to restart it? [21:58:45] It should revive itself [21:58:51] but as zuul is in dev, kinda [21:58:55] i am really not sure [21:59:31] .... pinging fellow ops who are about ... binasher mutante notpeter ? [21:59:53] plus not sure rebooting will solve this [22:00:00] !log maxsem synchronized wmf-config [22:00:06] since it may be all the test queues filling up and slowing it down [22:00:07] Logged the message, Master [22:00:11] hence reboot just makes it happen again. [22:00:24] zuul and gerrit aren't on the same system [22:00:40] ok, this is specifically gerrit on manganese [22:00:49] the gerrit process is pegging high [22:00:54] but not sure if its slow due to that [22:01:14] Probably is [22:01:28] PROBLEM - mysqld processes on db1025 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:01:30] * RoanKattouw aborts his git pull because he realizes it's pointless [22:02:51] Ryan_Lane: any experience troubleshooting gerrit? [22:02:59] i dont wanna restart it if thats not the issue. [22:03:22] definitely don't reboot that box [22:03:25] that's silly [22:03:28] RECOVERY - mysqld processes on db1025 is OK: PROCS OK: 1 process with command name mysqld [22:03:36] which is why i asked since someone asked [22:03:48] rebooting a system is usually not the right answer [22:03:49] sigh. 
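The "maxFieldsSize exceeded" text matches the error raised by the Node.js form-parsing library formidable, whose default cap on form field data is 2 MiB; whether Parsoid's HTTP front end hit that limit directly or through a wrapper is not shown here, so treat the attribution as an educated guess. The reported size is just over that default:

    # 2 MiB default cap vs. the size reported in nohup.out
    echo $((2144394 - 2 * 1024 * 1024))   # => 47242 bytes over a 2 MiB limit

    # counting occurrences on the backend mentioned above
    ssh wtp1003 'grep -c "maxFieldsSize exceeded" /var/lib/parsoid/nohup.out'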
stop mysql, deal with children & wife freaking about dead mouse, start mysql [22:04:29] Ryan_Lane: yep, but still no one knows how to troubleshoot gerrit =P [22:04:33] if it's the gerrit service causing the issue, then you restart the gerrit service [22:04:46] but first you check normal things [22:04:57] like: is a bad bot crawling gitweb? (it isn't) [22:05:04] disk isnt full [22:05:16] iostat isn't showing a lot of disk usage [22:05:23] very little waitio [22:05:38] no swap in use [22:05:41] free memory dropped, but isnt out [22:05:45] !log maxsem synchronized php-1.22wmf2/extensions/MobileFrontend/ 'grrr' [22:05:47] network isnt saturated [22:05:53] Logged the message, Master [22:06:04] if its an issue with gerrit, it appears to be within gerrit or its backend [22:06:13] unfortunately, im not sure what to check specifically for this [22:06:40] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [22:06:44] well i saw demon restarting the gerrit service at least once before fixing something like this [22:06:52] gitweb and git are defunct procs [22:07:07] Jeff_Green: no they aren't [22:07:14] 0:00 [gitweb.cgi] [22:07:17] there is a git [22:07:18] 0:00 [git] [22:07:19] yes yes, but gerrit controls it [22:07:28] oh they're spawned by the cgi or something? [22:07:30] fancy [22:07:33] ahh [22:07:38] i see comment added [22:07:40] PROBLEM - mysqld processes on db1025 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:07:45] so yea, normal. [22:08:15] the java instance of gerrit2 is just pegging well above 100% utlization over and over [22:08:24] why is mediawiki fetching from gerrit? [22:08:40] where do you seethat? [22:08:40] RECOVERY - mysqld processes on db1025 is OK: PROCS OK: 1 process with command name mysqld [22:08:48] specifically gitweb [22:09:00] !log rolling back db1025 change b/c "mysqld: unknown option '--federated'" [22:09:07] Logged the message, Master [22:09:20] err [22:09:25] sorry. that's the referrer [22:13:32] there's two IPs making constant requests to gitweb [22:13:47] i'm going to block them [22:14:17] I hate gitweb [22:15:43] the rss extension maybe? [22:15:49] possibly, yeah [22:16:14] New patchset: Ryan Lane; "Deny a couple ips making a ton of gitweb reqs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61699 [22:16:18] mediawiki is definitely making the req. it's not a referrer. [22:16:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61699 [22:16:52] Ryan_Lane: if that's the problem, maybe you should fix the mediawiki config instead? [22:17:00] (at some point) [22:17:01] how can i? it's not our wikis [22:17:09] ah [22:17:12] sorry :) [22:17:20] heh. no worries [22:17:40] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [22:20:07] bleh. forgot a section [22:21:13] New patchset: Ryan Lane; "Follow up to change 61699" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61703 [22:21:27] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61703 [22:25:59] are things a little better now? 
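The checks described above translate into a handful of one-liners; a sketch of that routine on the Gerrit host, where the gerrit2 process owner and the gitweb angle come from the conversation and the Apache log path is the stock Debian location (it may differ in practice):

    df -h                               # disk full?
    iostat -x 5 3                       # disk throughput and iowait
    free -m                             # swap in use? free memory?
    top -b -n 1 -u gerrit2 | head -20   # is the Gerrit JVM pegging the CPU?
    grep -F 'gitweb' /var/log/apache2/access.log | tail -5   # who is hitting gitweb?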
[22:26:09] RoanKattouw, RobH: ^^ [22:27:45] New patchset: Tim Starling; "Mostly rewrite missing.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [22:28:19] Ryan_Lane: pull wfm [22:28:21] Ryan_Lane: it appears faster for me [22:28:24] great [22:28:27] as in i click things, it works [22:28:28] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61637 [22:28:28] stupid gitweb [22:28:29] thx dude [22:28:31] yw [22:28:45] so gitweb was DoSing our service for us... [22:28:48] =P [22:29:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 232 seconds [22:31:23] New patchset: Ottomata; "Puppetizing Hadoop for CDH4." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [22:32:00] RECOVERY - mysqld processes on db51 is OK: PROCS OK: 1 process with command name mysqld [22:32:08] New patchset: Ottomata; "Puppetizing Hadoop for CDH4." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [22:34:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 233 seconds [22:35:26] New review: Tim Starling; "PS3: use $_SERVER['HTTP_HOST'] and $_SERVER['HTTP_X_FORWARDED_PROTO'] directly, rather than construc..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [22:38:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 233 seconds [22:42:41] ugh spam bots [22:43:18] wow. baidu actually *is* crawling us even though we're denying all bots [22:43:27] New patchset: Ryan Lane; "Kill another bad bot from gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61719 [22:43:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 213 seconds [22:46:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61719 [22:47:52] just ban Baidu from our network:P [22:47:58] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 25 seconds [22:48:01] heh [22:52:17] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61639 [22:52:31] New review: Dzahn; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60766 [22:52:50] Did someone in operations do something with zuul, jenkins or gerrit in the last hour? [22:53:50] heh like kill it? [22:54:30] Krinkle: https://gerrit.wikimedia.org/r/#/c/61699/ [22:54:44] Krinkle, Gerit is being DoSed [22:55:32] mutante: It seems Zuul is unable to process the queue https://integration.wikimedia.org/zuul/ Jenkins is completely idle. not sure what's going on. 
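To find the offenders, one approach is to pull the noisiest client IPs for gitweb URLs out of the Apache access log, and afterwards confirm that a blocked crawler user agent is actually refused; the log path, the field positions (combined log format) and the expected non-200 response are assumptions, not taken from the log:

    # top client IPs requesting gitweb ($1 = client IP, $7 = request path)
    awk '$7 ~ /gitweb/ {print $1}' /var/log/apache2/access.log \
        | sort | uniq -c | sort -rn | head

    # once the deny rule is deployed, a blocked UA should no longer get a 200
    curl -s -o /dev/null -w '%{http_code}\n' -A 'Baiduspider' \
        'https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git'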
[22:55:39] "INFO zuul.Gerrit: Getting information for 61652,1" [22:55:44] repeated 100 times in the logs [22:55:48] it is stuck somehow [22:56:00] !log pgehres synchronized wmf-config/InitialiseSettings.php 'Enabling Extension:AccountAudit on all wikis and wmgEchoHelpPage' [22:56:07] Logged the message, Master [22:56:14] MaxSem: https://gerrit.wikimedia.org/r/#/c/61719/1/templates/apache/sites/gerrit.wikimedia.org.erb [22:56:18] Zuul is requesting the information for change 61652,1 over and over and over again [22:56:35] mutante: gallium.wikimedia.org; tail -f /var/log/zuul/zuul.log [22:57:50] RoanKattouw: gallium.wikimedia.org; tail -f /var/log/zuul/zuul.log [22:57:57] Krinkle: Gerrit was very overloaded just now [22:58:08] Krinkle: You can kill that job because l10n-bot already merged it [22:58:09] seems like it just stopped doing that and is fetching another one now [22:58:20] RoanKattouw: There is no job yet [22:58:52] It has moved on to change 61703,1 now [22:59:00] but still it does it over and over again for 61703,1 [22:59:06] one APi request should be enough [22:59:17] like Gerrit is giving an invalid response [22:59:19] 2013-04-30 22:58:56,676 INFO zuul.Scheduler: Adding operations/puppet, to [22:59:21] 2013-04-30 22:58:56,676 ERROR zuul.IndependentPipelineManager: Unable to find change queue for project operations/puppet [22:59:22] one API request is good enough for anyone [22:59:27] Krinkle: https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues ? [22:59:47] mutante: No, that one was fixed actually [22:59:52] hmm.. ok [23:00:02] mutante: Zuul is still getting events from gerrit, https://integration.wikimedia.org/zuul/ is still alive [23:00:08] queue increasing [23:00:08] ugh, are we still being flooded? [23:00:10] then i wouldnt know of a fix besides waiting [23:00:22] Oh, guess what [23:00:28] It's using SSH to fetch stuff [23:00:45] That means that I can't see what it's doing exactly but it seems to be auth negotiation [23:01:06] it seems like it is "Getting information" repeatedly for each change, but eventually it still does move on to a next one [23:01:07] !log Graceful restart of Zuul on gallium [23:01:07] [02:47:53] just ban Baidu from our network:P [23:01:14] Logged the message, Master [23:01:26] "Note: Restart is not needed when just deploying a configuration change. Zuul can reread configuration from disk while running. This way no Gerrit events are missed. As such, please do not take restarting Zuul lightly, as it means any Gerrit events during that time will be missed and need to be manually re-triggered." 
[23:01:44] MaxSem: i think Ryan did already [23:01:48] There are 142 such events :( [23:01:53] by user agent [23:02:01] RoanKattouw: the queue will be saved [23:02:05] OK [23:02:14] It can't add things to the queue while it is not running [23:02:28] restart only takes a few seconds once it starts shutting down [23:02:42] now gerrit is KIA [23:03:05] RoanKattouw: It is currently waiting for the job for 61703,1 to finish [23:03:16] then it'll restart [23:04:34] and don [23:04:36] done* [23:05:07] restart complete, queue preserved, and back in action [23:05:41] RoanKattouw: Looks like that didn't help still stuck [23:05:42] now at "zuul.Gerrit: Getting information for 61693,1" [23:05:46] repeating it dozens of times in the log [23:05:54] 2013-04-30 23:05:39,853 INFO zuul.Merger: Updating local repository operations/puppet [23:06:18] 2013-04-30 23:06:14,779 ERROR zuul.IndependentPipelineManager: Unable to find change queue for project operations/puppet [23:06:21] wth [23:06:31] Yeah I saw that too [23:06:38] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [23:06:48] Krinkle: I tried ssh jenkins-bot@manganese:29418 and it authenticates just fine [23:06:59] I can't trace what's being returned beyond that because the communication is encrypted [23:07:05] New patchset: Tim Starling; "Mostly rewrite missing.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [23:07:09] going to deploy the missing.php change now [23:07:47] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [23:09:05] it moved on again to another patch [23:11:05] If Zuul tries 20 times for each event, the queue will never clear [23:11:07] Krinkle: fwiw, /var/log/zuul/debug.log [23:11:30] TimStarling, are you deploying right now? [23:11:32] nice [23:11:37] yes [23:11:47] mutante: it seems to be getting better [23:11:51] !log tstarling synchronized wmf-config/missing.php 'rewrite (I17989fc4)' [23:11:51] less repetition [23:11:58] Logged the message, Master [23:12:01] Krinkle: i agree, it still feels like it's just catching up [23:12:08] MaxSem: i think Ryan did already [23:12:11] at least it does not seem stuck on one thing [23:12:31] I blocked a bunch of bots from gitweb [23:12:42] though it seemed to have helped, it looks like it didn't [23:12:50] SetEnvIf User-Agent Baiduspider bad_browser [23:12:51] that one [23:13:12] I'm pushing out new Parsoid code, so please ignore alerts in the next minutes [23:13:59] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [23:14:06] Krinkle: you know what, if i look at a random older log from yesterday, the fact that it is repeatedly getting info for the same patch set seems normal [23:14:25] !log maxsem synchronized php-1.22wmf2/extensions/MobileFrontend/ 'Revert MobileFrontend' [23:14:27] mutante: SNAFU, that's unfortunate [23:14:32] mutante: I also added a couple bad ips [23:14:33] Logged the message, Master [23:14:41] * Krinkle logs upstream bug at zuul [23:14:58] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [23:16:27] Krinkle: the difference is just that it does it way slower [23:17:12] right [23:17:18] but multiple loglines.. like grep 61308 zuul.log.2013-04-29 [23:17:37] hrmmm.. 
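Comparing today against an older rotated log, roughly what mutante does above, makes it easier to tell normal repetition from a stuck retry loop; the change numbers below are just examples taken from the conversation:

    grep -c 'Getting information for 61308' /var/log/zuul/zuul.log.2013-04-29
    grep -c 'Getting information for 61703' /var/log/zuul/zuul.log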
[23:17:54] i _can_ still see my gerrit ui though [23:18:15] !log maxsem synchronized php-1.22wmf3/extensions/MobileFrontend/ 'Revert MobileFrontend' [23:18:22] all done on the Parsoid front [23:18:23] Logged the message, Master [23:19:28] mutante: yeah, gerrit web ui is fairly responsive [23:19:52] mutante: RoanKattouw: I've asked clarkb (Zuul developer) for an eye on it. See #openstack-infra if you're interested. [23:20:02] dunno how debug.log helps, it just shows me it's doing stuff [23:20:13] not even errors [23:21:46] speaking of the gerrit UI being responsive... [23:21:50] it's currently not for me [23:23:29] it just took ~40 seconds to load a change... so maybe it's just under load? [23:23:44] yea [23:24:01] all i can tell it is still working on things even though slow [23:24:21] zuul that is [23:24:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 194 seconds [23:25:59] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [23:28:12] !log kaldari synchronized /php-1.22wmf3/extensions/Echo 'Updating Echo for some minor bugfixes' [23:28:19] Logged the message, Master [23:33:36] !log kaldari synchronized wmf-config/InitialiseSettings.php 'Syncing InitialiseSettings for Echo' [23:33:43] Logged the message, Master [23:37:31] PROBLEM - Disk space on db1059 is CRITICAL: NRPE: Command check_disk_space not defined [23:37:51] PROBLEM - RAID on db1059 is CRITICAL: NRPE: Command check_raid not defined [23:38:21] PROBLEM - DPKG on db1059 is CRITICAL: NRPE: Command check_dpkg not defined [23:44:21] !log kaldari synchronized wmf-config/CommonSettings.php 'Syncing CommonSettings for Echo' [23:44:29] Logged the message, Master [23:45:22] Krinkle: ok. it's back up [23:46:15] Ryan_Lane: https://gist.github.com/Krinkle/5492688 [23:46:30] It re-tried a few times, but it is no longer trying [23:46:37] log is idle [23:46:41] restart it maybe? [23:46:44] perhaps I should restart Zuul as well [23:46:45] * Krinkle does  [23:46:55] maybe it doesn't handle gerrit going away very well [23:47:08] can't see why that would be an issue... [23:47:23] web interface is snappy [23:47:37] log is active again [23:47:46] so far seems the same as before the gerrit restart [23:47:54] and getting faster [23:47:57] speeding up a lot [23:47:59] nice [23:48:14] great [23:48:17] https://integration.wikimedia.org/zuul/ [23:48:22] down from 169 to 137 [23:49:08] yeah. much better [23:49:29] seems it stopped again? [23:49:44] oh. or not [23:49:47] Ryan_Lane: Are you tailing zuul log? [23:49:53] nah. web interface [23:50:00] tail -f /var/log/zuul/zuul.log on gallium [23:50:11] not all events result in jenkins jobs [23:50:15] it is iterating through quickly [23:50:23] 128 left [23:50:26] ah. great [23:51:18] Ryan_Lane: Did you restart Gerrit or some specific serivce? [23:51:24] gerrit service [23:51:30] !log Ryan_Lane restarted the gerrit service [23:51:33] Krinkle, Ryan_Lane: is your gerrit config in puppet where i can see it? [23:51:35] ah. thanks [23:51:37] Logged the message, Master [23:51:42] !log Gracefully restarted Zuul on gallium [23:51:45] jeblair: yep. 
one sec [23:51:49] Logged the message, Master [23:52:44] jeblair: https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=templates/gerrit;h=9dcd5cfa3712315345b026d2820cb243368e5d9c;hb=refs/heads/production [23:52:53] we have a spaghetti code repo [23:53:05] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/gerrit.pp;h=808d3446c89deb4c9ff9d7a732e7df44cf4e1f28;hb=refs/heads/production [23:53:17] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/role/gerrit.pp;h=56513113e26673d8728f358a2b208e1617707812;hb=refs/heads/production [23:53:29] * Krinkle change url from Server_admin_log to Server_Admin_Log (one transcludes the other, proper history) [23:53:32] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=files/gerrit;h=6793f9586cfff4d60b6f5a433aad975e892ada39;hb=refs/heads/production [23:54:10] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/site.pp;h=06a3511c5804dbc73c1775a654ef2feb9f81851f;hb=refs/heads/production [23:54:55] https://github.com/wikimedia/operations-puppet/tree/production/files/gerrit [23:54:55] https://github.com/wikimedia/operations-puppet/blob/production/manifests/gerrit.pp [23:55:24] Alrighty, queue is back to regular size [23:56:52] Krinkle: so old SAL from wikitech-old is _not_ lost as somebody was afraid of [23:57:08] Ryan_Lane, Krinkle: any chance you have cacti/graphite/collectd for that server publicly available? [23:57:12] and yay @ gerrit fix [23:57:57] mutante: Both https://wikitech.wikimedia.org/wiki/Server_admin_log and https://wikitech.wikimedia.org/wiki/Server_Admin_Log show the same page, but the former has a 2 revision history that transcludes the latter. The latter page is the one you want to see when viewing the wiki page [23:58:20] jeblair: ganglia, yes [23:58:29] jeblair: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous+eqiad&h=manganese.wikimedia.org&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [23:58:34] Krinkle, Ryan_Lane: you might want to increase sshd.threads: https://groups.google.com/forum/?fromgroups=#!topic/repo-discuss/uRSRwYUpWnE [23:59:02] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 221 seconds [23:59:08] Krinkle, Ryan_Lane: (in gerrit). ours is set to 100, that msg says 50-200 is not unheard of, and you have 8 set now [23:59:31] * Krinkle yields to Ryan_Lane [23:59:51] ^demon would want to be here too
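For reference, sshd.threads lives in Gerrit's gerrit.config, which uses git-config syntax, so it can be bumped with git config; the site path and the example value are assumptions (the repo-discuss thread cited above only says 50-200 is not unheard of), and Gerrit normally needs a restart to pick the change up:

    # Hypothetical site path; adjust to wherever the Gerrit site lives on manganese.
    sudo git config -f /var/lib/gerrit2/review_site/etc/gerrit.config sshd.threads 50
    # equivalent file contents:
    #   [sshd]
    #           threads = 50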