[00:00:07] !log pgehres Started syncing Wikimedia installation... : Deploying Extension:AccountAudit and an Echo thing [00:00:15] Logged the message, Master [00:27:43] New patchset: Bsitu; "Assign high priority to EchoNotificationJob" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61479 [00:32:21] !log pgehres Finished syncing Wikimedia installation... : Deploying Extension:AccountAudit and an Echo thing [00:32:29] Logged the message, Master [00:32:33] good god, finally [00:34:03] AaronSchulz: scap pushes out mediawiki-config as well, yes? [00:36:45] I believe so [00:37:03] pgehres: Is the deployment line clear after this? [00:37:06] Well, it seems to be working on test2wiki [00:37:11] RoanKattouw: all yours [00:37:20] its just not listed in Special:Version [00:37:20] Awesome [00:37:28] I just need Krinkle to merge one more thing and I'm in business [00:37:41] pgehres: Are you sure the extension doesn't have a broken $wgExtensionCredits or something crazy like that? [00:37:48] It's possible [00:38:12] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/53683/9/AccountAudit.php,unified [00:38:31] That should do it [00:38:33] test2wiki? [00:38:39] yeah [00:39:05] It's in $wgExtensionCredits according to eval.php [00:39:20] yeah, and it inserts rows into the db as well [00:39:33] On fenari that is [00:40:21] But not on mw1058 [00:40:53] New patchset: Faidon; "(bug 47807) Strip proxy URLs for mobile requests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61519 [00:41:18] Even though the config totally says it should be there [00:41:28] RoanKattouw: shall I try a sync-dir on wmf-config? [00:42:05] catrope@mw1058:/apache/common/wmf-config$ sudo -u apache php /apache/common/multiversion/MWScript.php eval.php test2wiki [00:42:07] > var_dump($wgAutoloadClasses['AccountAudit']); [00:42:08] NULL [00:42:10] Hmm, not sure [00:42:18] Actually [00:42:19] huh [00:42:25] Why don't you touch InitialiseSettings.php and sync-file it [00:42:32] It might just be a config cache thing [00:43:02] Because: [00:43:05] catrope@mw1058:/apache/common/wmf-config$ grep AccountAudit InitialiseSettings.php -A 2 [00:43:06] 'wmgUseAccountAudit' => array( [00:43:08] 'default' => false, [00:43:09] 'test2wiki' => true [00:43:29] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61519 [00:43:35] !log pgehres synchronized wmf-config/InitialiseSettings.php 'Touching InitialiseSettings' [00:43:43] Logged the message, Master [00:43:53] There it is [00:45:12] RoanKattouw: if it were earlier I would enable on all wikis, but I think I want to go home ... [00:54:57] New review: Krinkle; "@Nemo: If you find something that works for you (try it locally e.g. with Firebug or Chrome Dev Tool..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58082 [00:56:00] mFacenet: I have more detail about my problem and it's super bizarre [00:56:10] I can lock myself out of any box by issuing this: "rsync -Cavz pig/include/ analytics1003.eqiad.wmnet:/home/milimetric" [00:56:25] in that case, i locked myself out of analytics1003 [00:56:41] the command went through fine, but any ssh attempt to that box afterwards is blocked [00:57:03] milimetric, let one of the operations people know, I'm just lurking and was giving advice based on what I've seen in the past [00:57:14] cool, thanks mFacenet [00:57:31] ops peoples: if anyone has any idea what's going on, this is definitely weird ^^ [00:59:04] !log catrope synchronized php-1.22wmf3/extensions/VisualEditor/ 'Update VE' [00:59:11] Logged the message, Master [00:59:34] RoanKattouw: not sure if you saw above, virt0 is the answer [00:59:48] to the question "where is wikitech hosted these days" [01:00:56] Yes [01:01:00] I saw, I got distracted [01:01:11] The config isn't versioned, I guess [01:01:15] Cause I didn't find a .git directory [01:01:24] So I'll just have to update as root? Cause my regular shell user isn't on virt0 [01:07:19] milimetric: don't rsync home [01:07:34] milimetric: you probably also sent .ssh/authorized_keys [01:07:42] which in turn prevented you from accessing the box [01:07:50] sent? [01:07:50] to avoid that, you can go: [01:08:01] i've done the same commands a hundred times [01:08:02] rsync'd [01:08:07] unlikely [01:08:10] today sometimes they work sometimes they lock me out [01:08:17] hm [01:08:19] well, they're in a script file that i haven't changed [01:08:22] well, that is my best guess [01:08:43] because the target is your home directory [01:08:56] and if you were to overwrite .ssh/authorized_keys, you'd be locked out [01:08:59] you mean /home/milimetric instead of /home/milimetric/? [01:09:17] hm [01:09:29] possibly -- that is a good point, because it would effect the perms of the target [01:10:12] i mean, i'm sure someone has said this, but in the short term, just keep incrementing i for analytics100${i} :) [01:10:24] but also [01:10:28] don't target your home directory [01:10:42] target /home/milimetric/tmp/ or something [01:11:18] also, the relative path for rsync is always the home dir, so you can just go: an04:tmp/ to target /home/milimetric/tmp/ [01:11:21] bbl [01:24:02] RoanKattouw: i would be surprised if ryan didn't have labsconsole/wikitech in version control somewhere [01:25:11] Yeah [01:25:17] No .git dir in sight though [01:27:39] virt0 is in... Tampa? [01:29:53] Yup, Tampa [01:53:27] gerrit-wm: Hmm. [02:04:18] !log cleaning up neon's /var/log (100% full), restarting icinga... [02:04:21] sigh [02:04:26] Logged the message, Master [02:04:35] paravoid: VE is now running on wikitech :) [02:04:39] \o/ [02:04:42] you're awesome :) [02:04:44] Writing a mailing list post about it now [02:04:52] It's opt-in though, it's the same config as in prod [02:04:57] So you have to go into your preferences and enable ikt [02:05:24] First edit: https://wikitech.wikimedia.org/w/index.php?title=PowerDNS&action=history :) [02:06:25] nice! [02:07:32] * Coren chucnkes. [02:07:36] chuckles* [02:07:53] My user page is a template. "Sorry, you cannot edit this element". 
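A minimal sketch of the safer rsync invocations suggested above (host and paths are taken from the conversation; whether they fit any other setup is an assumption):

    # Target a subdirectory rather than the home directory itself, so neither
    # ~/.ssh/authorized_keys nor the permissions on /home/milimetric can be clobbered:
    rsync -Cavz pig/include/ analytics1003.eqiad.wmnet:/home/milimetric/tmp/

    # Remote paths without a leading slash are relative to the remote home directory,
    # so this is equivalent:
    rsync -Cavz pig/include/ analytics1003.eqiad.wmnet:tmp/

    # Extra safety regardless of target: never let the transfer touch .ssh at all.
    rsync -Cavz --exclude='.ssh/' pig/include/ analytics1003.eqiad.wmnet:tmp/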
:-) [02:07:55] now if only we merged wikitech & mediawiki.org [02:08:01] hah [02:08:24] Announced on ops@ and wikitech-l [02:15:03] !log LocalisationUpdate completed (1.22wmf2) at Tue Apr 30 02:15:03 UTC 2013 [02:15:10] Logged the message, Master [02:16:49] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [02:18:01] Aaron|home: [02:18:02] - 10.64.32.13 - - [30/Apr/2013:02:17:31 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.28?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:05] - 10.64.32.13 - - [30/Apr/2013:02:17:31 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.29?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:09] - 10.64.32.13 - - [30/Apr/2013:02:17:31 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.2a?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:11] - 10.64.32.13 - - [30/Apr/2013:02:17:32 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.2b?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:14] - 10.64.32.13 - - [30/Apr/2013:02:17:32 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.2c?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:17] - 10.64.32.13 - - [30/Apr/2013:02:17:32 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.2d?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:21] - 10.64.32.13 - - [30/Apr/2013:02:17:32 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.2e?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:23] - 10.64.32.13 - - [30/Apr/2013:02:17:32 +0000] "GET /swift/v1/wikipedia-it-local-transcoded.2f?limit=9000&prefix=4%2F HTTP/1.1" 204 160 "-" "PHP-CloudFiles/1.7.10" [02:18:25] pastebins :) [02:18:27] that's strange isn't it? [02:18:27] that's terbium [02:18:53] prefix 4:/ ?! [02:19:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [02:19:13] odd [02:19:38] http://p.defau.lt/?hSRlA6l4vd41toLIVWp45A [02:19:54] are you sure that's not just the logs having weird encoding? [02:20:38] the rest look normal [02:20:43] also, standard apache logging [02:21:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [02:28:07] paravoid: meh, looks like 4/ too me [02:29:03] yes [02:29:09] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 222 seconds [02:29:16] why look for 4/ under every wiki under each container? 
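For reference, the per-shard listing visible in the access log excerpts above can be reproduced by hand roughly like this ($SWIFT_URL and $TOKEN are placeholders, not the real endpoint or credentials):

    # One container listing per shard suffix 00..ff, each filtered to the 4/ prefix,
    # which is what produces the burst of GETs quoted above.
    for shard in $(printf '%02x ' $(seq 0 255)); do
      curl -s -H "X-Auth-Token: $TOKEN" \
        "$SWIFT_URL/v1/wikipedia-it-local-transcoded.${shard}?limit=9000&prefix=4%2F"
    done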
[02:29:30] container shard that is [02:31:03] anyway [02:31:04] sleep [02:31:06] bye :) [02:31:33] getFileList() called on zone/4/ will do that (since there are 256 shards) [02:31:49] the calling sh script could go through the 256 manually though [02:32:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 28 seconds [02:32:36] that would be less RTTs, with a bit more hard coding [02:33:50] !log LocalisationUpdate completed (1.22wmf3) at Tue Apr 30 02:33:50 UTC 2013 [02:33:59] Logged the message, Master [02:44:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 207 seconds [02:47:07] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [02:54:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 237 seconds [02:57:07] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 18 seconds [03:19:05] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 204 seconds [03:23:05] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 29 seconds [03:27:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:28:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time [03:34:05] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [03:36:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 19 seconds [03:44:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 218 seconds [03:53:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [04:06:11] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Apr 30 04:06:11 UTC 2013 [04:06:20] Logged the message, Master [04:14:06] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [04:19:06] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 22 seconds [04:24:06] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [04:26:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:27:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [04:28:06] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [04:49:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [04:52:01] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [04:54:09] New patchset: Tim Starling; "Log for bug 47807" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61537 [04:54:55] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61537 [04:55:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 191 seconds [04:59:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 238 seconds [05:00:01] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [05:19:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 237 seconds [05:23:01] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 19 seconds [05:30:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 219 seconds [05:36:50] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in 
the last 10 hours [05:36:50] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [05:36:50] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [05:36:50] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [05:38:00] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [05:41:40] PROBLEM - RAID on mc15 is CRITICAL: Timeout while attempting connection [05:42:40] RECOVERY - RAID on mc15 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [05:45:01] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 222 seconds [05:47:00] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 24 seconds [06:04:24] New review: Nemo bis; "@Krinkle: there is already bug 36471." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58082 [06:15:52] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [06:17:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:18:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [07:06:00] PROBLEM - Puppet freshness on vanadium is CRITICAL: No successful Puppet run in the last 10 hours [07:09:11] New patchset: Tim Starling; "Mostly rewrite missing.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [07:14:30] New review: Tim Starling; "$_SERVER['REQUEST_URI'] contains the host when the request line is an absolute URL. This could possi..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [08:24:56] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 216 seconds [08:26:56] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 19 seconds [08:33:36] PROBLEM - Puppet freshness on cp1031 is CRITICAL: No successful Puppet run in the last 10 hours [08:35:24] PROBLEM - Host professor is DOWN: PING CRITICAL - Packet loss = 100% [09:06:30] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [09:07:41] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [10:11:01] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [10:23:12] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 188 seconds [10:24:11] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 25 seconds [10:31:30] New review: Hashar; "Daniel, that yet another puppet-lint related change :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61244 [10:32:24] lunnnchhh [10:44:57] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [10:44:57] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [11:01:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:02:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [11:21:11] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:23:02] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [11:27:09] !log refresh-translatable-pages.php finished for mediawikiwiki and metawiki [11:27:17] Logged the message, Master [11:28:21] yay [11:37:29] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [12:01:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:02:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [12:14:34] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 214 seconds [12:15:35] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 29 seconds [12:17:15] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [12:31:13] New patchset: Hashar; "system_role for role::applicationserver::appserver::beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61558 [12:31:45] PROBLEM - search indices - check lucene status page on search1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60051 bytes in 0.013 second response time [12:44:36] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 221 seconds [12:45:36] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 29 seconds [12:56:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:57:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [13:09:31] New patchset: Krinkle; "gerrit: Collapse logo in layout on narrow screens" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/58082 [13:10:35] New review: Krinkle; "Re-instate -1. The intention is good, but this CSS change doesn't work. It doesn't collapse the wind..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/58082 [13:13:51] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 221 seconds [13:15:51] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [13:19:35] Change abandoned: Nemo bis; "Right, sorry, better if the state is clear." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/58082 [13:48:27] New patchset: Cmjohnson; "Adding new key for rfaulk (rt5040) for stat1 access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61567 [13:54:49] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 226 seconds [13:56:32] !log jenkins seems to not be working -- restarting [13:56:40] Logged the message, Master [13:57:12] New patchset: Peachey88; "Adding new key for rfaulk (RT 5040) for stat1 access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61567 [13:58:39] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [13:58:49] PROBLEM - SSH on caesium is CRITICAL: Server answer: [13:58:49] PROBLEM - SSH on gadolinium is CRITICAL: Server answer: [13:58:50] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 16 seconds [13:58:58] PROBLEM - SSH on cp1043 is CRITICAL: Server answer: [13:59:48] RECOVERY - SSH on caesium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:59:49] RECOVERY - SSH on gadolinium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [13:59:59] Change abandoned: Cmjohnson; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61567 [14:01:39] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:02:49] PROBLEM - SSH on caesium is CRITICAL: Server answer: [14:02:51] apergos: ping? [14:02:59] RECOVERY - SSH on cp1043 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:03:25] paravoid: pongggg [14:05:05] :) [14:05:12] need any help with swift? [14:05:30] don't think so, thanks for checking [14:05:48] how did ceph writes deployment go? [14:06:23] works :) [14:06:59] great! and reads are next week? [14:06:59] I need to do some peripheral stuff for now like... documentation :) [14:07:02] ah [14:07:05] details :-D [14:07:08] :-) [14:07:17] and mailing ops@ about the status [14:07:22] New patchset: Cmjohnson; "Adding rfaulk's new key to admins for stat1 access..rt5040" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61569 [14:08:12] that would be nice (docs), seems like icinga for example still has no docs on wikitech :-/ [14:08:30] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61569 [14:08:43] RECOVERY - SSH on caesium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:09:09] yeah [14:09:12] ben was really good at that :) [14:09:46] so what's the holdup with swift? [14:10:02] indeed [14:10:25] waiting for the first round of object replication to complete after having to remove the old ms-be2 devices (new zone) [14:10:39] they moved it to a new rack see, with no other server in there [14:10:51] oh [14:10:52] and apparently one can't change the zone on a device, the only thing you can do is delete them and rebalance [14:11:10] I would have expected it to be done by now tbh but hopefully later today [14:12:37] what was the entry before? [14:12:39] 33%? [14:13:08] no, and that's what's so annoying [14:15:59] hey paravoid, would you have some time today to look at the kafka puppet module? we would love to see this getting merged :D [14:16:32] I'll try to find some time [14:17:13] thaaaaaaank yooouuuuuuuuu!!!! 
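A sketch of the remove-and-rebalance procedure described a few messages up, since a device's zone cannot be edited in place (builder file name, IP, port, device and weight are placeholders):

    swift-ring-builder object.builder remove z2-10.0.6.202:6000/sdb1
    swift-ring-builder object.builder add z5-10.0.6.202:6000/sdb1 100
    swift-ring-builder object.builder rebalance
    # ...then push the updated ring to all nodes and wait for object replication
    # to settle, which is the step being waited on here.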
[14:18:52] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 222 seconds [14:18:53] heh, don't thank me yet :) [14:19:30] well we owe you many thank you's for past reviews as well ;) [14:20:52] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [14:30:04] cmjohnson1: What exactly wasn't working? [14:30:21] Jenkins restarts easily take an hour. Doing that unscheduled takes out CI and Gerrit review entirely. [14:31:36] krinkle: it wasn't reviewing...sorry i did not know it took an hour or so. [14:31:47] cmjohnson1: define 'wasn't reviewing' [14:32:00] Did you check whether it was busy? Zuul can have a long backlog sometimes when it is busy [14:32:19] my changes were not being reviewed by jenkins [14:32:33] Was it listed on https://integration.wikimedia.org/zuul/ ? [14:32:35] krinkle ...no i did not...i should have ...sorry [14:32:36] (as queued) [14:34:25] cmjohnson1: Please never restart Jenkins unless it has crashed. Restarting it without gracefully stopping Zuul first causes a lot of false positives accros Gerrit. And even then restarting takes a looooong time because there is a lot of data. Should only be done scheduled and in correspondence with the CI team (unless it is down, in which a restart is always justified because it would bring it b [14:34:25] ack up) [14:34:46] but it wasn't down I believe :) [14:35:35] Krinkly..duly noted...I didn't realize the ramifications....i will be more careful next time. [14:35:59] thanks :) [14:37:40] (It's a bit like restarting mysql while php is still running requests and Apache even taking new requests) [14:40:45] oh..that's bad! ...yeah I thought the service was more like "morebots"....yeah so lesson learned on that one...(krinkle) [14:41:22] yeah, it's quite a bit bigger than a gerrit bot [14:42:21] hashar: It's still restarting [14:42:46] looking [14:43:11] We'll just have to sit this one out. There is no shortcut to initialising Jenkins [14:45:26] Krinkle: apparently got restarted for some reason. There is no stack trace though [14:45:35] hashar: back scroll ^^^ [14:45:58] * cmjohnson1 apologizes again!  [14:46:55] ahhh [14:47:01] restarted intentionally :-D [14:47:26] Krinkle: the slow start up has been fixed upstream btw [14:47:34] need to upgrade Jenkins to the new version and that would fix it [14:49:14] hashar: Interesting, got a commit or issue I can look at? I wonder what they did to speed it up [14:49:54] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 237 seconds [14:53:54] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 223 seconds [14:54:08] New patchset: coren; "Transition class for switch to Labs NFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [14:54:27] Krinkle: it received a lot of patches apparently. The root issue is https://issues.jenkins-ci.org/browse/JENKINS-8754 [14:54:42] """ I'm marking this resolved based on the fix in 1.485 that does lazy loading of build records.""" [14:55:28] and jenkins upgrade is at https://bugzilla.wikimedia.org/show_bug.cgi?id=47744 [14:57:54] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 23 seconds [14:57:59] Right, so they lazy-load it [14:58:04] That'll fix it. [15:03:02] New patchset: Krinkle; "labsnfs: Transition class for switch to Labs NFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [15:12:57] cmjohnson1: so yeah Jenkins is a bit slow to restart (up to an hour which is a bug). 
Most of the time the issue lie somewhere else :-] [15:13:18] cmjohnson1: and thanks for restarting it, it had at least one thread locked at 100% CPU usage, I guess the restarted cleared it :-] [15:13:46] Krinkle: I need to get the jenkins.deb uploaded on apt.wm.o then do the upgrade + plugin upgrade [15:13:50] hashar: yeah...i did not even think that the jenkins was that complicated. I know now but sorry for the unannounced downtime [15:13:55] Krinkle: will give a poke at it next week probably [15:14:02] cmjohnson1: shit happens :-] [15:14:09] it does [15:14:23] cmjohnson1: that is "just" annoying east coast ops and the european devs. Less than half a hundred people hehea [15:14:29] that is small given the half billions of people we serve [15:14:31] ;-D [15:14:46] heh ..valid point [15:15:06] cmjohnson1: about slowness. Gerrit send a flow of events to Zuul which is a python daemon handling the events. [15:15:13] New review: coren; "Addressed. New changeset incoming." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [15:15:38] cmjohnson1: https://integration.wikimedia.org/zuul/ will shows what is currently running and show a Queue lengths: 0 events, 0 results. The events correspond to events received from Gerrit. [15:15:43] so yeah, Jenkins is what runs the unit tests on patch-submission and (for most mediawiki repositories) does the merge after approving the change. Without it there is basically no linting and no merging happening and development is held up (other than local development of course) [15:16:05] cmjohnson1: then Zuul handle them and some events do not trigger any job. so the event queue can flush very quickly :-] [15:16:28] I should find a way to show the Jenkins build queue as well [15:16:43] hashar: The jenkins queue is shown on every page in jenkins itself [15:16:46] it's in the sidebar [15:16:50] (or you mean show it outside jenkins?) [15:16:59] on the zuul status page [15:17:15] ideally I would like jenkins to send metrics to graphite [15:17:20] New patchset: coren; "labsnfs: Transition class for switch to Labs NFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [15:17:21] http://dev.hubspot.com/blog/bid/81142/Just-open-sourced-Send-Jenkins-Hudson-stats-to-Graphite <:-D [15:17:34] dafu? [15:17:35] sorry http://velohacker.com/2012/01/12/graphing-jenkins-statistics/ [15:17:40] Why did this make a new patchset? [15:17:53] Ah, no it didn't. [15:18:03] Sometimes gerrit can be sooo confusing in its workflow. [15:18:23] that did make a new patchset but not a new change ? :D [15:19:38] hashar: Hm. I note that my responses then end up buried in 2; but I didn't make it into a service because it makes no sense to put an upstart /task/ as one; the ensure => running would just 'start' it at every puppet run. [15:20:41] IMO, YMMV, AATJ. [15:20:59] ahh [15:21:03] that make sense probably :D [15:21:23] and the upstart job is to make it happen at machine startup ? [15:21:36] Specifically, 'on starting autofs' [15:25:31] that make sense [15:25:43] Coren: for the autofs configuration, I can't tell [15:25:51] but I can test it on beta :-D [15:26:05] though that needs /home to be copied I guess [15:28:53] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 223 seconds [15:29:11] The autofs config is known working for tools, at the very least; I expect to worries there. 
[15:31:52] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [15:34:47] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61039 [15:36:58] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [15:36:59] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [15:36:59] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [15:36:59] PROBLEM - Puppet freshness on virt1005 is CRITICAL: No successful Puppet run in the last 10 hours [15:46:10] New review: Hashar; "Good to me :-]" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/61578 [15:46:28] ^^ Coren looks fine. I guess I will have to copy the /home from the project storage to labnfs [15:48:52] hashar: rsync ftw. :-) [15:54:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 212 seconds [15:56:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [15:58:01] Coren: yeah alias cp='rsync -av' [16:01:54] New review: Ryan Lane; "Some minor complaints and questions." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/61578 [16:14:53] New review: MaxSem; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [16:16:36] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [16:18:08] New patchset: MaxSem; "Mostly rewrite missing.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [16:24:06] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 237 seconds [16:28:07] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [16:29:34] ori-l: around? [16:30:59] New review: Hashar; "Ah nice! Now we can start writing some unit tests for missing.php." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [16:33:28] New review: Faidon; "I'd prefer if we could get definitions like systemuser & upstart_job in some common modules & not re..." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/60187 [16:35:19] New review: Faidon; "The fail-if-not-Ubuntu is okay, but I can't promise if extra complexity for other platforms will be..." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/60333 [16:36:35] New patchset: Faidon; "Avoid using regexps where string literals would do" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61164 [16:37:21] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61164 [16:39:23] New review: Faidon; "LGTM -- although DocumentRoot could use tabs instead of two spaces to be consistent with the rest of..." [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/60934 [16:41:24] New review: Faidon; "Anything else I can do to help you push this forward? I think the only objection was from Diederik (..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54116 [16:43:35] New review: Diederik; "I have removed my objection." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/54116 [16:47:06] New review: Faidon; "A few inline comments." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/59611 [16:49:21] New patchset: Lcarr; "renaming rfaulk's new key so it doesn't conflict" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61589 [16:50:02] hey paravoid [16:50:08] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 223 seconds [16:50:28] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61589 [16:52:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [16:55:15] ori-l: heya [16:55:41] I +2 all of your changes -some with a few comments- should I just merge them? [16:55:59] yeah, I was just reading your comments. thanks! and yes, please do [16:56:37] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60094 [16:56:52] New patchset: Faidon; "Create self-standing IPython Notebook Puppet module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60187 [16:57:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60187 [16:57:18] New patchset: Faidon; "Update README" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60332 [16:57:26] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60332 [16:57:41] New patchset: Faidon; "Provide 'certfile' and 'password' parameters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60333 [16:57:53] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60333 [16:59:30] paravoid: diederik also replied on https://gerrit.wikimedia.org/r/#/c/54116/ , removing his objection [17:00:01] I saw that [17:00:20] we now have to find a way to install without recommends :-) [17:01:14] oh, right [17:02:39] btw, re: Systemuser and Upstart_job -- I don't think the two lines or so that each of them saves you justifies not using the built-in idioms [17:03:05] Upstart_job seems especially bad since it doesn't declare provider => upstart, but instead relies on the symlink to upstart-job for initd compatibility [17:04:13] it also looks for files outside the module IIRC, in ./files [17:04:24] and they have to be hard-coded; no template support. [17:06:40] PROBLEM - Puppet freshness on vanadium is CRITICAL: No successful Puppet run in the last 10 hours [17:06:51] grrrrr [17:12:49] paravoid: any chance you can help me debug a puppet error? [17:14:10] New review: Faidon; "sun-java6-jdk doesn't exist anymore in neither Ubuntu nor Debian, so that build-dep is spurious." [operations/debs/kafka] (master) C: -1; - https://gerrit.wikimedia.org/r/53170 [17:14:43] ori-l: I meant possibly replacing those two defs with some modularized proper versions :) [17:14:50] ori-l: what's the puppet error? 
[17:14:59] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Class[Role::Logging::Eventlogging] is already defined; cannot redefine at /var/lib/git/operations/puppet/manifests/role/logging.pp:243 on node vanadium.eqiad.wmnet [17:15:14] related to MaxSem's change [17:15:23] Ic762640780dac349a144420fa066d91210314d4c [17:16:13] yes [17:16:23] you need to make this class { '::eventlogging': [17:16:45] puppet first tries to resolve the bare class name inside the current namespace, then the top namespace [17:16:51] right, yep [17:16:52] patch coming [17:17:23] it's an annoying feature [17:17:35] I'd very much prefer to always have to qualify includes [17:17:56] hmm, why doesn't it trigger that error on labs? [17:18:36] ah, I didn't actually use the role class [17:19:41] New patchset: Ori.livneh; "Disambiguate scope of included 'eventlogging' class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61594 [17:20:35] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61594 [17:20:50] thanks paravoid, MaxSem [17:21:35] New patchset: coren; "labsnfs: Transition class for switch to Labs NFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [17:26:27] New review: coren; "Responses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [17:26:45] New patchset: Ori.livneh; "Qualify path to IPython executable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61595 [17:27:02] ^ paravoid that's another bugfix :/ sorry [17:27:26] heya mutante [17:27:28] i can take this: [17:27:28] https://rt.wikimedia.org/Ticket/Display.html?id=5039 [17:27:35] but, i'm not sure of the best way to go about it [17:27:43] i've added records before, but not new domains [17:28:31] oo instructions! [17:28:32] https://wikitech.wikimedia.org/wiki/DNS#Adding_a_new_zone [17:29:21] ottomata: must be a trap [17:29:32] :) [17:29:33] New patchset: coren; "labsnfs: Transition class for switch to Labs NFS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [17:30:14] ottomata: i think faidon may have stepped out for a sec, any chance you can merge a small bugfix for me? https://gerrit.wikimedia.org/r/#/c/61595/ [17:30:24] appropriate especially since you just chided me the other day to qualify my paths :P [17:30:48] back :) [17:30:50] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61595 [17:31:01] ta da! [17:31:05] sorry :) [17:31:13] thanks guys [17:31:27] * ori-l tries puppetd -tv again [17:32:59] paravoid: https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0Ai_u2wTiMldddHVYLWQ4WGpQalhfMnVhREZFc1o4NlE#gid=0 [17:33:17] IPv6 traffic [17:33:26] mobile site only, as that's what we've got for unsampled [17:33:28] but still [17:34:04] 2% [17:34:06] excellent! [17:34:11] :) [17:34:46] can we have it in a limn graph? :-) [17:35:14] i mean, it's one row of data :P [17:35:32] I know, I mean to track this historically [17:35:36] per month or whatever [17:35:53] you'd have to talk to the rest of my team [17:36:04] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61487 [17:36:41] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61348 [17:37:57] New patchset: Bsitu; "Config Echo to use extension1 db" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61596 [17:38:13] paravoid, ottomata: works! 
thanks again [17:40:26] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61596 [17:40:43] New patchset: Faidon; "New upstream release." [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/61597 [17:41:53] New patchset: Faidon; "New upstream release." [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/61597 [17:42:25] Change merged: Faidon; [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/61597 [17:42:39] mutante, I've got a ticket that you're going to *love* https://rt.wikimedia.org/Ticket/Display.html?id=5042 :P [17:43:16] Thehelpfulone: aaahaha, "love" indeed:p [17:44:00] how many committees are there?:) [17:44:11] New patchset: Bsitu; "Add enwiki to Echo dblist file" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61598 [17:44:22] heh not many that are requesting new wikis, this could be the only new one in a while [17:44:53] you know what they say.. practice makes perfect! ;) [17:44:55] ok [17:45:01] hehe,yea [17:45:02] !log kaldari synchronized wmf-config/InitialiseSettings.php 'Syncing InitialiseSettings for Echo deployment' [17:45:10] Logged the message, Master [17:47:28] !log kaldari synchronized wmf-config/CommonSettings.php 'Syncing CommonSettings for Echo deployment' [17:47:37] Logged the message, Master [17:52:43] New review: coren; "Responses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [17:53:52] Krinkle: jsduck 4.8.0 on gallium [17:54:34] paravoid: marvellous, thanks! [18:03:59] New review: Krinkle; "(dz) manual rebase / fix path conflict." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [18:04:10] New review: Krinkle; "(dz) - don't use the -infile variable anymore" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [18:06:40] New review: Krinkle; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [18:10:49] New patchset: Krinkle; "wikibugs: Set up #mediawiki-visualeditor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37570 [18:12:18] New patchset: RobH; "creating racktables role for eqiad based server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60766 [18:12:54] mutante: I'd like to merge https://gerrit.wikimedia.org/r/#/c/54984/, however that needs the change in operations/puppet (https://gerrit.wikimedia.org/r/#/c/37570/) to be merged first. Is the wikibugs puppetisation complete for this? [18:12:55] god damn it [18:13:03] I see you merged it but it looks like it is still not in #wikimedia-dev somehow [18:13:38] New patchset: RobH; "creating racktables role for eqiad based server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60766 [18:13:56] there we go, accidentally pulled an unrelated file into my patchset. [18:14:54] New patchset: RobH; "creating racktables role for eqiad based server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60766 [18:14:56] who should i talk to about revoking ssh keys, and puting a different one in? Bit of a laptop mishap [18:15:00] Krinkle: no, that won't work, it cant clone on the host [18:15:07] mutante: ^ [18:15:14] wanna look at patchset when you have a moment? [18:16:06] mutante: so the wikibugs puppet code you merged is unused at the moment? [18:16:30] ebernhardson: could you mail ops-requests@rt [18:16:30] What needs to be done to get it live? 
(so we can start updating the configuration through puppet) [18:17:14] mutante: sure, is it @rt.wikimedia.org ? [18:17:27] mutante: never used rt :) [18:17:28] Krinkle: yes, it's unused because last time we tried merging it it failed pretty bad and then were happy to even be able to hack it back to how it was before. [18:17:34] ebernhardson: yes, it is. thanks [18:17:42] If it takes 6 months and more to get it to use puppet, perhaps we can update the configuration without puppet? It's just a simple channel extension (I'm referring to the addition of -visualeditor not mediawiki>wikiimedia-dev, which can happen even with it staying in #mediawiki) [18:18:01] mutante: So it has been reverted? [18:18:30] Krinkle: the root cause is that it is runnin on mchenry, what needs to be done is upgrade that server or move the bot or deploy manually [18:18:32] https://gerrit.wikimedia.org/r/#/c/53973/ doesn't include a "Reverted in .." or something. [18:19:39] ok. well, I don't know how you want it to be done. I just like to get wikibugs to start channelling to #mediawiki-visisualeditor for VisualEditor bugs. The moving of the main stream from #mediawiki to #wikimedia-dev can be done through puppet once that is finished. [18:19:47] New patchset: Pyoungmeister; "db51 -> file per table" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61606 [18:19:57] Krinkle: no, i didn't want to revert everything, there were a whole bunch of changes in between, most of them making it better than before, so i just didnt include it on mchenry to prevent puppet from re-breaking it over and over [18:20:07] ok [18:20:30] PROBLEM - mysqld processes on db51 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:20:40] I don't know how it should be done, i also just tried to merge some of them once not expecting these issues at all [18:21:11] but.. if we can wait for the puppet "git clone" part.. [18:21:48] mutante: What's so special about mchenry? Is it not part of the puppetmaster pool or something? ("upgrade") [18:21:49] see, even moving it "manually" failed, for some reason it wasn't just changing the channel name in the existing unpuppetized init script [18:21:57] and the puppetized stuff started multiple processes of it [18:22:05] yikes [18:22:39] it's special because the git cloning won't work because the client is so old [18:25:16] mutante: what would an 'upgrade' entail? Are there other services on mchenry that need to be disabled temporarily thus making it non-trivial to do? [18:25:20] Or is it an undocumented process? [18:25:27] Krinkle: yes, its the main mail server :o [18:26:21] it's going to happen though one way or another, because we're moving out of Tampa [18:26:38] mutante: would manually applying the change work? (e.g. apply the diff that changes the channel to the configuration file) [18:27:47] maybe, there are a bunch of changes [18:28:17] its also related to using puppetized ircecho [18:29:32] the whole way it is being started right now is still: /usr/local/bin/start-wikibugs-bot [18:31:01] i commented on one of the changes we should include that somewhere, but it was being hoped this is replaced by puppetized ircecho [18:31:15] tail -n0 -f /var/wikibugs/wikibugs.log | \ [18:31:15] /usr/local/bin/ircecho "#mediawiki" wikibugs irc.freenode.net \ [18:31:28] simply changing the channel name there has been tried.. 
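A hypothetical stopgap following the same pattern as the existing start script: a second ircecho instance fed only the VisualEditor lines (the grep pattern and the bot nick are assumptions, and this is not necessarily how the eventual puppetized setup would do it):

    tail -n0 -f /var/wikibugs/wikibugs.log | grep --line-buffered 'VisualEditor' | \
        /usr/local/bin/ircecho "#mediawiki-visualeditor" wikibugs-ve irc.freenode.net &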
[18:33:59] PROBLEM - Puppet freshness on cp1031 is CRITICAL: No successful Puppet run in the last 10 hours [18:35:07] sorry, the only thing i can tell for sure right now is that above is how it works. and i need to focus on making a list of Tampa services and moving other old stuff [18:38:10] mutante: ok. [18:38:35] mutante: Can you give a guestimate on how long it might take to get this resolved (one way or another)? [18:40:26] to manually hack it before actual cloning works: it depends how much the "on duty" person of the week feels like getting into it. to fix the root cause: "a couple weeks" ... hmmm .. [18:41:20] New patchset: Kaldari; "Turning Echo on for English Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61610 [18:42:20] New review: Ottomata; "> There's a zookeeper package for example" [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/53170 [18:42:47] kaldari: :) [18:43:50] New patchset: RobH; "revoke erik b key per rt 5045" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61611 [18:47:22] New review: RobH; "who needs zuul...." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/61611 [18:47:26] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61611 [18:50:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61606 [18:51:01] about to scap [18:51:47] PROBLEM - SSH on mc15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:37] RECOVERY - SSH on mc15 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:56:38] neeed fooooood [18:56:40] back in an hour [18:57:47] PROBLEM - Disk space on mc15 is CRITICAL: Timeout while attempting connection [18:59:47] RECOVERY - Disk space on mc15 is OK: DISK OK [19:02:12] New patchset: Cmjohnson; "Decommission db10 - certs cleaned, changes made to site.pp, decom.pp and dhcpd files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61616 [19:02:50] !log removing db10 certs from puppet [19:02:58] Logged the message, Master [19:07:07] New patchset: Cmjohnson; "Decommission db10 - certs cleaned, changes made to site.pp, decom.pp and dhcpd files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61616 [19:13:19] mutante, Krinkle: there is a puppetized replacement for ircecho, see modules/tcpircbot [19:15:30] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61578 [19:15:50] New patchset: Cmjohnson; "Decommission db10 - certs cleaned, changes made to site.pp, decom.pp and dhcpd files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61616 [19:16:18] Change abandoned: Cmjohnson; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61616 [19:18:38] !log kaldari Started syncing Wikimedia installation... 
: [19:18:45] Logged the message, Master [19:20:40] New patchset: Cmjohnson; "Decommissioning db10" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61620 [19:23:56] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 225 seconds [19:26:55] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 30 seconds [19:28:45] New patchset: Cmjohnson; "Decommissioning db10" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61620 [19:29:53] robh: can you double check that ...especially site.pp change thx [19:29:53] RoanKattouw, FYI: on https://www.mediawiki.org/wiki/VisualEditor:Test clicking edit gives me a "Error loading data from server: parsoidserver-http-bad-status:404" error [19:30:11] Ouch [19:30:20] WTF that's bad [19:30:57] I was about to investigate another related bug but I'll jump right into this one [19:32:30] thanks [19:33:35] Hah, someone tried very hard to break that page [19:33:37] It contains #REDIRECT [[www.google.com]] [19:33:57] And, unsurprisingly, the result is http://parsoid.wmflabs.org/mw/VisualEditor:Test [19:35:43] Thehelpfulone: The Parsoid team is aware, thanks for the report :) [19:45:21] New patchset: Lcarr; "removing deploymentscripts from bast1001 Two reasons : 1 it's not a deployment host and 2 - it breaks puppet since timidity is not installed err: /Stage[main]/Mediawiki/Service[timidity]: Could not evaluate: Could not find init script for 'timidity'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61625 [19:45:28] can i get someone to check this out ? ^^ [19:45:34] New patchset: BBlack; "Work-In-Progress vhtcpd code." [operations/software/varnish/vhtcpd] (master) - https://gerrit.wikimedia.org/r/60390 [19:48:48] !log kaldari Finished syncing Wikimedia installation... 
: [19:48:56] Logged the message, Master [19:51:03] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61620 [19:53:19] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61610 [19:58:42] New patchset: Ori.livneh; "Add README and update dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61627 [19:58:43] ^ that change touches the docs only, if someone wants to merge :) [19:58:43] New patchset: Lcarr; "removing deploymentscripts from bast1001 Two reasons : 1 it's not a deployment host and 2 - it breaks puppet since timidity is not installed err: /Stage[main]/Mediawiki/Service[timidity]: Could not evaluate: Could not find init script for 'timidity'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61625 [19:58:43] rebasetastic [19:58:58] !log kaldari synchronized wmf-config/InitialiseSettings.php 'turning on Echo for English Wikipedia' [19:59:06] Logged the message, Master [19:59:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61625 [19:59:41] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61598 [20:03:41] !log kaldari synchronized echowikis.dblist 'syncing echowikis.dblist for cron jobs' [20:03:50] Logged the message, Master [20:08:30] New patchset: MaxSem; "This condition looks unneeded" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61629 [20:08:35] RECOVERY - Puppet freshness on vanadium is OK: puppet ran at Tue Apr 30 20:08:30 UTC 2013 [20:11:54] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [20:14:49] New review: awjrichards; "This should be fine - I can't think of any reason why we should be dbl checking $wgMobileUrlTemplate" [operations/mediawiki-config] (master); V: 1 C: 1; - https://gerrit.wikimedia.org/r/61629 [20:16:25] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/59553 [20:16:45] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61629 [20:18:02] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/60885 [20:33:54] New patchset: RobH; "creating racktables role for eqiad based server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60766 [20:38:28] https://gdash.wikimedia.org/ is down [20:38:34] ironically [20:38:47] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 209 seconds [20:39:11] New patchset: Hashar; "beta: change syslog instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61636 [20:39:31] may I get a merge of https://gerrit.wikimedia.org/r/61636 for beta please? :-D [20:39:38] PROBLEM - DPKG on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:39:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [20:40:28] RECOVERY - DPKG on mc15 is OK: All packages OK [20:45:12] New review: Tychay; "Used to change Echo to use job queue instead of synchronously" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61479 [20:45:17] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours [20:45:18] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [20:48:40] New patchset: Kaldari; "Setting custom Echo help URL for enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61637 [20:51:04] heya hashar, you there? [20:51:28] ottomata: sprinting a beta migration then off to bed [20:51:34] hmmm ok ok [20:53:14] New patchset: Pgehres; "Enabling Extension:AccountAudit on all wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61639 [20:54:51] hashar sprints a lot [20:55:32] if only I could self.clone() [20:55:45] pgehres, will that be a new special page? https://www.mediawiki.org/wiki/Extension:AccountAudit is missing that bit of info [20:55:57] Thehelpfulone: nope [20:56:06] Just a new backend table and hook onlogin [20:56:29] Thehelpfulone: I am actually updating the docs currently [20:56:34] ah great [20:56:41] the purpose changed since I wrote them [20:58:47] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 225 seconds [20:59:12] binasher: roan and me are investigating post timeouts in the Parsoid varnish, and have tracked it down to Varnish's use of sess_timeout while receiving data from the client [20:59:26] the default is 5 seconds, which is too low for large pages [20:59:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 20 seconds [20:59:52] (For POST requests) [21:00:17] -p sess_timeout=15 in /etc/default/varnish would work, but does not seem to be directly supported by extraopts in the puppet varnish class [21:00:18] gwicke: RoanKattouw: that makes sense [21:00:48] what timeouts are set for the backends within the vcl? [21:00:53] I have not seen any way to set sess_timeout (or other runtime vars) from vcl [21:01:11] None, the VCL is like two lines [21:01:12] sess_timeout doesn't sound like what should be sent in that case [21:01:16] where's the vcl? [21:01:18] this is a client timeout, so cannot afaik be set in the backend section of the vcl [21:01:23] IIRC the default GET timeout is like 60s [21:01:29] And client-side our timeout is something like 100s [21:01:57] binasher: https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=templates/misc/parsoid.vcl.erb;h=d65bcffc9e477903922b0faab1ab848f39e08579;hb=HEAD [21:02:08] gwicke: sess_timeout isn't vcl, it's a command line arg [21:02:22] keen: yep, that's my understanding too [21:02:23] -p sess_timeout=5 [21:02:41] keen: that's what gwicke said above [21:02:55] heh, missed it. ;) [21:03:32] wtf [21:03:50] varnish is running /etc/varnish/default.vcl [21:04:05] I guess the extraopts handling could be extended to allow non-name ones to be passed in [21:04:08] binasher: I think that's because I installed that file as default.vcl ? [21:04:19] why wasn't this setup using our varnish classes? 
[21:04:22] The Varnish setup on cerium/titanium is *extremely* simple [21:04:22] New patchset: Bsitu; "Enable job queue to process web and email notf on non-en wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61647 [21:04:33] in puppet/templates/varnish/varnish-default.erb [21:04:47] binasher: I am working on porting it to the new role stuff as a side project [21:05:07] IIRC when I tried setting it up initially I had trouble with the custom port (8000) when using the older Varnish classes [21:05:24] ah, so puppet/templates/varnish/varnish-default.erb is not used currently for the Parsoid boxes? [21:05:38] Probably not [21:05:54] ok [21:05:55] The setup currently is way too simple and I've been working on fixing it, but I haven't had much time [21:06:12] i think you want to add .first_byte_timeout to the vcl [21:06:24] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [21:06:30] It needs to be rewritten anyway because we'll want consistent hashing between the two like mobile has [21:06:35] binasher: https://www.varnish-cache.org/trac/ticket/849 [21:07:08] Mark suggested he was going to look at that, and I was hoping I could bribe Ryan to work on it with the CR I did for him this week [21:07:40] gwicke: is mediawiki posting to varnish or browsers? [21:07:49] binasher: MediaWiki [21:07:54] ahh [21:08:14] The Varnishes currently have public IPs but won't for much longer [21:08:29] And the traffic all goes through internal LVS VIPs [21:08:46] RoanKattouw: for now it would be good to figure out if we can get -p sess_timout=15 into /etc/default/varnish somehow [21:08:53] Right [21:09:00] Let me see [21:09:30] We can puppetize that file [21:09:35] Not sure how that's done elsewhere? [21:10:09] the default varnish class generates it from puppet/templates/varnish/varnish-default.erb I guess [21:10:26] Yes [21:10:32] I can put in a quick hack for this [21:10:40] Doing that now [21:10:55] RoanKattouw: ah. right. I was supposed to look at that for you [21:11:42] Ryan_Lane: If you actually do the whole thing I'll probably owe you quite a bit more CR :) [21:12:21] puppet/manifests/varnish.pp sets $extraopts if a $name is passed in, but does not seem to support passing in other extraopts currently [21:12:48] :D [21:13:06] or rather, it sets a default name if name == "" [21:13:18] I can't promise anything. I'll try to take a look at it this week. I'm also trying to upgrade openstack in production this week… sooo…. :) [21:13:28] Ryan_Lane: Yeah similar on my end [21:13:43] I just finished putting a bunch of stuff in production so I'm behind [21:13:53] I think I'm going to be absurdly swamped till ams hackathon [21:14:10] Yeah [21:14:20] If I have time I'll try to do it myself and get Mark to help me [21:14:27] because I'm upgrading openstack, switching everything away from gluster to nfs with coren, and getting database auth up in labs with coren [21:15:32] some of these things will be time consuming in a way I'll have free-time, though, so I'll look at it then [21:16:04] RoanKattouw: can you just insert -p sess_timeout=15 on cerium and restart varnish to test it? [21:16:59] Sure [21:17:11] It will get set back by puppet though, probably within the hour [21:17:21] Ryan_Lane: Awesome [21:17:27] RoanKattouw: i don't think so.. 
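Roughly what the manual edit being planned for cerium/titanium looks like; /etc/default/varnish is sourced by the init script, so varnish has to be restarted afterwards. Only the -p sess_timeout part is the point; the rest of the line is illustrative:

    # /etc/default/varnish (excerpt, illustrative)
    DAEMON_OPTS="-a :8000 \
                 -T localhost:6082 \
                 -f /etc/varnish/default.vcl \
                 -S /etc/varnish/secret \
                 -s malloc,1G \
                 -p sess_timeout=15"

(15 is the value being tried here; it gets bumped further down the log.)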
[21:17:39] Oh, right, it won't be [21:17:42] puppet isn't being used to install /etc/default/varnish [21:17:46] He's right [21:17:47] The file looks different though [21:18:09] nm found it [21:18:23] Swap file "/etc/default/.varnish.swp" already exists! [21:18:24] heh [21:18:37] OK done on cerium [21:18:38] this is just something you edited on top of the package default version [21:18:41] Now doing titanium [21:18:44] Yeah [21:18:59] Puppet won't eat my homework because bad bad Roan hasn't puppetized this [21:19:26] OK, restarted both [21:19:32] Let's see if this thing works [21:19:46] * gwicke is trying with Obama [21:20:05] * RoanKattouw has to refresh the Cologne landmarks article because of a token error [21:21:14] I'm still getting a timeout [21:21:36] After 5s or 15s? [21:21:55] Mine takes 21s now as opposed to 7s (but still times out) [21:22:18] 16.41s spent waiting [21:22:39] So the config change seems to have worked, it's just that 15s also isn't long enough to serialize Obama or the Cologne Monuments page [21:23:24] New review: Tychay; "related enwiki change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61479 [21:23:25] maybe up it to 60? [21:23:38] Setting to 60, hang on [21:23:49] I was under the impression that sess_timeout only covers the data retrieval phase from the client [21:24:01] OK try now [21:24:04] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 195 seconds [21:25:04] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [21:26:45] gwicke: OK that worked for some definition of "worked" [21:27:03] Serializing [[Liste der Denkmäler im Kölner Stadtteil Altstadt-Nord]] now gives me a 504 [21:27:09] it times our really quickly for me [21:27:10] After 62 seconds [21:27:36] hm [21:28:50] timeout after about 5 seconds when trying to save Obama [21:29:02] this is using FF [21:29:54] I used FF and got a timeout after 62s [21:30:59] about 6 seconds in chromium [21:31:06] weird [21:31:12] on Obama? [21:31:18] Let me try Obama [21:31:22] In Chromium [21:35:04] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 192 seconds [21:35:27] gwicke: WTF yeah you're right [21:36:55] That's weird; maybe the request is too large? [21:37:57] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [21:38:57] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [21:39:54] !log maxsem synchronized php-1.22wmf3/extensions/MobileFrontend/ 'Weekly mobile deployment' [21:40:02] Logged the message, Master [21:44:11] gwicke: Retrying that request against cerium it seems to be a 503 [21:44:14] Pretty quickly [21:44:24] I'm guessing there's a Varnish setting for limiting the POST body size? [21:44:57] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 217 seconds [21:45:57] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 28 seconds [21:47:53] Ugh, OK [21:47:55] I see [21:48:10] I'm getting a 500 from Parsoid and Varnish takes all errors and translates them to 503s [21:48:12] (thanks Varnish) [21:48:20] I guess that must also be configurable somewhere [21:49:49] RoanKattouw: can you check the log on the parsoid backend? 
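One way to reproduce this outside a browser is to time a large POST with curl and see where it gets cut off; the hostname, port, path and form field below are placeholders rather than the real Parsoid API of the time, the useful part is the -w status/timing output:

    # Hypothetical reproduction -- endpoint and field name are placeholders.
    curl -s -o /dev/null \
         -w 'HTTP %{http_code} after %{time_total}s\n' \
         --data-urlencode "content@big-article.html" \
         http://parsoid-cache.example:8000/somewiki/Some_Title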
[21:49:57] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 202 seconds [21:50:34] Will do [21:50:53] From Varnish ends I either see an Internal Server Error or the backend just going away [21:51:46] Whoa [21:51:49] I found something already [21:51:53] *alright [21:52:01] Check out wtp1003:/var/lib/parsoid/nohup.out [21:52:25] WARNING: DSR inconsistency: cs/s mismatch for node: BODY s: 0; cs: 1 for Khalid_Altowelli (probably unrelated) [21:52:47] Error: maxFieldsSize exceeded, received 2144394 bytes of field data [21:52:51] haha I found it [21:52:51] there we go [21:52:52] Error: maxFieldsSize exceeded, received 2144394 bytes of field data [21:52:54] Yup [21:52:57] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [21:53:02] hrmm, git review is just sitting here for me ow.... [21:53:06] i wonder if its just me. [21:53:17] RobH: Can always Ctrl+C and try agian [21:53:30] But yeah it's slow sometimes :( [21:53:49] i did it a few times [21:53:52] its sitting over 5 minutes [21:55:34] That's really odd [21:55:40] And you clearly do have access to the internet ;) [21:55:54] New patchset: RobH; "creating racktables role for eqiad based server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60766 [21:55:56] .....it went througyh! [21:55:59] huzzzaaaah [21:56:03] !log maxsem synchronized php-1.22wmf2/extensions/MobileFrontend/ 'Weekly mobile deployment' [21:56:10] Logged the message, Master [21:56:11] it took..... 4 minutes 45 seconds that time [21:56:17] i ctrl+c at 5 minutes. [21:56:25] so slow. [21:57:29] hrmm, java is taking up all cpu cycles on manganese. [21:57:37] under gerrit2 user. [21:58:08] !log restarting mysqld on db1025 to enable federated engine for testing [21:58:16] Logged the message, Master [21:58:17] mutante: ^ new changeset, but gerrit is now so slow to be useless. [21:58:27] and none of the gerrit experts are around. [21:58:33] RobH, is it safe to restart it? [21:58:45] It should revive itself [21:58:51] but as zuul is in dev, kinda [21:58:55] i am really not sure [21:59:31] .... pinging fellow ops who are about ... binasher mutante notpeter ? [21:59:53] plus not sure rebooting will solve this [22:00:00] !log maxsem synchronized wmf-config [22:00:06] since it may be all the test queues filling up and slowing it down [22:00:07] Logged the message, Master [22:00:11] hence reboot just makes it happen again. [22:00:24] zuul and gerrit aren't on the same system [22:00:40] ok, this is specifically gerrit on manganese [22:00:49] the gerrit process is pegging high [22:00:54] but not sure if its slow due to that [22:01:14] Probably is [22:01:28] PROBLEM - mysqld processes on db1025 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:01:30] * RoanKattouw aborts his git pull because he realizes it's pointless [22:02:51] Ryan_Lane: any experience troubleshooting gerrit? [22:02:59] i dont wanna restart it if thats not the issue. [22:03:22] definitely don't reboot that box [22:03:25] that's silly [22:03:28] RECOVERY - mysqld processes on db1025 is OK: PROCS OK: 1 process with command name mysqld [22:03:36] which is why i asked since someone asked [22:03:48] rebooting a system is usually not the right answer [22:03:49] sigh. 
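The "maxFieldsSize exceeded" text matches the error raised by the Node.js form-parsing library formidable, whose default cap on form field data is 2 MiB; whether Parsoid's HTTP front end hit that limit directly or through a wrapper is not shown here, so treat the attribution as an educated guess. The reported size is just over that default:

    # 2 MiB default cap vs. the size reported in nohup.out
    echo $((2144394 - 2 * 1024 * 1024))   # => 47242 bytes over a 2 MiB limit

    # counting occurrences on the backend mentioned above
    ssh wtp1003 'grep -c "maxFieldsSize exceeded" /var/lib/parsoid/nohup.out'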
stop mysql, deal with children & wife freaking about dead mouse, start mysql [22:04:29] Ryan_Lane: yep, but still no one knows how to troubleshoot gerrit =P [22:04:33] if it's the gerrit service causing the issue, then you restart the gerrit service [22:04:46] but first you check normal things [22:04:57] like: is a bad bot crawling gitweb? (it isn't) [22:05:04] disk isnt full [22:05:16] iostat isn't showing a lot of disk usage [22:05:23] very little waitio [22:05:38] no swap in use [22:05:41] free memory dropped, but isnt out [22:05:45] !log maxsem synchronized php-1.22wmf2/extensions/MobileFrontend/ 'grrr' [22:05:47] network isnt saturated [22:05:53] Logged the message, Master [22:06:04] if its an issue with gerrit, it appears to be within gerrit or its backend [22:06:13] unfortunately, im not sure what to check specifically for this [22:06:40] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [22:06:44] well i saw demon restarting the gerrit service at least once before fixing something like this [22:06:52] gitweb and git are defunct procs [22:07:07] Jeff_Green: no they aren't [22:07:14] 0:00 [gitweb.cgi] [22:07:17] there is a git [22:07:18] 0:00 [git] [22:07:19] yes yes, but gerrit controls it [22:07:28] oh they're spawned by the cgi or something? [22:07:30] fancy [22:07:33] ahh [22:07:38] i see comment added [22:07:40] PROBLEM - mysqld processes on db1025 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:07:45] so yea, normal. [22:08:15] the java instance of gerrit2 is just pegging well above 100% utlization over and over [22:08:24] why is mediawiki fetching from gerrit? [22:08:40] where do you seethat? [22:08:40] RECOVERY - mysqld processes on db1025 is OK: PROCS OK: 1 process with command name mysqld [22:08:48] specifically gitweb [22:09:00] !log rolling back db1025 change b/c "mysqld: unknown option '--federated'" [22:09:07] Logged the message, Master [22:09:20] err [22:09:25] sorry. that's the referrer [22:13:32] there's two IPs making constant requests to gitweb [22:13:47] i'm going to block them [22:14:17] I hate gitweb [22:15:43] the rss extension maybe? [22:15:49] possibly, yeah [22:16:14] New patchset: Ryan Lane; "Deny a couple ips making a ton of gitweb reqs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61699 [22:16:18] mediawiki is definitely making the req. it's not a referrer. [22:16:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61699 [22:16:52] Ryan_Lane: if that's the problem, maybe you should fix the mediawiki config instead? [22:17:00] (at some point) [22:17:01] how can i? it's not our wikis [22:17:09] ah [22:17:12] sorry :) [22:17:20] heh. no worries [22:17:40] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours [22:20:07] bleh. forgot a section [22:21:13] New patchset: Ryan Lane; "Follow up to change 61699" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61703 [22:21:27] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61703 [22:25:59] are things a little better now? 
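The checks described above translate into a handful of one-liners; a sketch of that routine on the Gerrit host, where the gerrit2 process owner and the gitweb angle come from the conversation and the Apache log path is the stock Debian location (it may differ in practice):

    df -h                               # disk full?
    iostat -x 5 3                       # disk throughput and iowait
    free -m                             # swap in use? free memory?
    top -b -n 1 -u gerrit2 | head -20   # is the Gerrit JVM pegging the CPU?
    grep -F 'gitweb' /var/log/apache2/access.log | tail -5   # who is hitting gitweb?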
[22:26:09] RoanKattouw, RobH: ^^ [22:27:45] New patchset: Tim Starling; "Mostly rewrite missing.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [22:28:19] Ryan_Lane: pull wfm [22:28:21] Ryan_Lane: it appears faster for me [22:28:24] great [22:28:27] as in i click things, it works [22:28:28] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61637 [22:28:28] stupid gitweb [22:28:29] thx dude [22:28:31] yw [22:28:45] so gitweb was DoSing our service for us... [22:28:48] =P [22:29:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 232 seconds [22:31:23] New patchset: Ottomata; "Puppetizing Hadoop for CDH4." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [22:32:00] RECOVERY - mysqld processes on db51 is OK: PROCS OK: 1 process with command name mysqld [22:32:08] New patchset: Ottomata; "Puppetizing Hadoop for CDH4." [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/61710 [22:34:00] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 233 seconds [22:35:26] New review: Tim Starling; "PS3: use $_SERVER['HTTP_HOST'] and $_SERVER['HTTP_X_FORWARDED_PROTO'] directly, rather than construc..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [22:38:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 233 seconds [22:42:41] ugh spam bots [22:43:18] wow. baidu actually *is* crawling us even though we're denying all bots [22:43:27] New patchset: Ryan Lane; "Kill another bad bot from gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61719 [22:43:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 213 seconds [22:46:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61719 [22:47:52] just ban Baidu from our network:P [22:47:58] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 25 seconds [22:48:01] heh [22:52:17] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61639 [22:52:31] New review: Dzahn; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60766 [22:52:50] Did someone in operations do something with zuul, jenkins or gerrit in the last hour? [22:53:50] heh like kill it? [22:54:30] Krinkle: https://gerrit.wikimedia.org/r/#/c/61699/ [22:54:44] Krinkle, Gerit is being DoSed [22:55:32] mutante: It seems Zuul is unable to process the queue https://integration.wikimedia.org/zuul/ Jenkins is completely idle. not sure what's going on. 
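To find the offenders, one approach is to pull the noisiest client IPs for gitweb URLs out of the Apache access log, and afterwards confirm that a blocked crawler user agent is actually refused; the log path, the field positions (combined log format) and the expected non-200 response are assumptions, not taken from the log:

    # top client IPs requesting gitweb ($1 = client IP, $7 = request path)
    awk '$7 ~ /gitweb/ {print $1}' /var/log/apache2/access.log \
        | sort | uniq -c | sort -rn | head

    # once the deny rule is deployed, a blocked UA should no longer get a 200
    curl -s -o /dev/null -w '%{http_code}\n' -A 'Baiduspider' \
        'https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git'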
[22:55:39] "INFO zuul.Gerrit: Getting information for 61652,1" [22:55:44] repeated 100 times in the logs [22:55:48] it is stuck somehow [22:56:00] !log pgehres synchronized wmf-config/InitialiseSettings.php 'Enabling Extension:AccountAudit on all wikis and wmgEchoHelpPage' [22:56:07] Logged the message, Master [22:56:14] MaxSem: https://gerrit.wikimedia.org/r/#/c/61719/1/templates/apache/sites/gerrit.wikimedia.org.erb [22:56:18] Zuul is requesting the information for change 61652,1 over and over and over again [22:56:35] mutante: gallium.wikimedia.org; tail -f /var/log/zuul/zuul.log [22:57:50] RoanKattouw: gallium.wikimedia.org; tail -f /var/log/zuul/zuul.log [22:57:57] Krinkle: Gerrit was very overloaded just now [22:58:08] Krinkle: You can kill that job because l10n-bot already merged it [22:58:09] seems like it just stopped doing that and is fetching another one now [22:58:20] RoanKattouw: There is no job yet [22:58:52] It has moved on to change 61703,1 now [22:59:00] but still it does it over and over again for 61703,1 [22:59:06] one APi request should be enough [22:59:17] like Gerrit is giving an invalid response [22:59:19] 2013-04-30 22:58:56,676 INFO zuul.Scheduler: Adding operations/puppet, to [22:59:21] 2013-04-30 22:58:56,676 ERROR zuul.IndependentPipelineManager: Unable to find change queue for project operations/puppet [22:59:22] one API request is good enough for anyone [22:59:27] Krinkle: https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues ? [22:59:47] mutante: No, that one was fixed actually [22:59:52] hmm.. ok [23:00:02] mutante: Zuul is still getting events from gerrit, https://integration.wikimedia.org/zuul/ is still alive [23:00:08] queue increasing [23:00:08] ugh, are we still being flooded? [23:00:10] then i wouldnt know of a fix besides waiting [23:00:22] Oh, guess what [23:00:28] It's using SSH to fetch stuff [23:00:45] That means that I can't see what it's doing exactly but it seems to be auth negotiation [23:01:06] it seems like it is "Getting information" repeatedly for each change, but eventually it still does move on to a next one [23:01:07] !log Graceful restart of Zuul on gallium [23:01:07] [02:47:53] just ban Baidu from our network:P [23:01:14] Logged the message, Master [23:01:26] "Note: Restart is not needed when just deploying a configuration change. Zuul can reread configuration from disk while running. This way no Gerrit events are missed. As such, please do not take restarting Zuul lightly, as it means any Gerrit events during that time will be missed and need to be manually re-triggered." 
[23:01:44] MaxSem: i think Ryan did already [23:01:48] There are 142 such events :( [23:01:53] by user agent [23:02:01] RoanKattouw: the queue will be saved [23:02:05] OK [23:02:14] It can't add things to the queue while it is not running [23:02:28] restart only takes a few seconds once it starts shutting down [23:02:42] now gerrit is KIA [23:03:05] RoanKattouw: It is currently waiting for the job for 61703,1 to finish [23:03:16] then it'll restart [23:04:34] and don [23:04:36] done* [23:05:07] restart complete, queue preserved, and back in action [23:05:41] RoanKattouw: Looks like that didn't help still stuck [23:05:42] now at "zuul.Gerrit: Getting information for 61693,1" [23:05:46] repeating it dozens of times in the log [23:05:54] 2013-04-30 23:05:39,853 INFO zuul.Merger: Updating local repository operations/puppet [23:06:18] 2013-04-30 23:06:14,779 ERROR zuul.IndependentPipelineManager: Unable to find change queue for project operations/puppet [23:06:21] wth [23:06:31] Yeah I saw that too [23:06:38] PROBLEM - Puppet freshness on db10 is CRITICAL: No successful Puppet run in the last 10 hours [23:06:48] Krinkle: I tried ssh jenkins-bot@manganese:29418 and it authenticates just fine [23:06:59] I can't trace what's being returned beyond that because the communication is encrypted [23:07:05] New patchset: Tim Starling; "Mostly rewrite missing.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [23:07:09] going to deploy the missing.php change now [23:07:47] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/61539 [23:09:05] it moved on again to another patch [23:11:05] If Zuul tries 20 times for each event, the queue will never clear [23:11:07] Krinkle: fwiw, /var/log/zuul/debug.log [23:11:30] TimStarling, are you deploying right now? [23:11:32] nice [23:11:37] yes [23:11:47] mutante: it seems to be getting better [23:11:51] !log tstarling synchronized wmf-config/missing.php 'rewrite (I17989fc4)' [23:11:51] less repetition [23:11:58] Logged the message, Master [23:12:01] Krinkle: i agree, it still feels like it's just catching up [23:12:08] MaxSem: i think Ryan did already [23:12:11] at least it does not seem stuck on one thing [23:12:31] I blocked a bunch of bots from gitweb [23:12:42] though it seemed to have helped, it looks like it didn't [23:12:50] SetEnvIf User-Agent Baiduspider bad_browser [23:12:51] that one [23:13:12] I'm pushing out new Parsoid code, so please ignore alerts in the next minutes [23:13:59] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [23:14:06] Krinkle: you know what, if i look at a random older log from yesterday, the fact that it is repeatedly getting info for the same patch set seems normal [23:14:25] !log maxsem synchronized php-1.22wmf2/extensions/MobileFrontend/ 'Revert MobileFrontend' [23:14:27] mutante: SNAFU, that's unfortunate [23:14:32] mutante: I also added a couple bad ips [23:14:33] Logged the message, Master [23:14:41] * Krinkle logs upstream bug at zuul [23:14:58] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds [23:16:27] Krinkle: the difference is just that it does it way slower [23:17:12] right [23:17:18] but multiple loglines.. like grep 61308 zuul.log.2013-04-29 [23:17:37] hrmmm.. 
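Comparing today against an older rotated log, roughly what mutante does above, makes it easier to tell normal repetition from a stuck retry loop; the change numbers below are just examples taken from the conversation:

    grep -c 'Getting information for 61308' /var/log/zuul/zuul.log.2013-04-29
    grep -c 'Getting information for 61703' /var/log/zuul/zuul.log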
[23:17:54] i _can_ still see my gerrit ui though [23:18:15] !log maxsem synchronized php-1.22wmf3/extensions/MobileFrontend/ 'Revert MobileFrontend' [23:18:22] all done on the Parsoid front [23:18:23] Logged the message, Master [23:19:28] mutante: yeah, gerrit web ui is fairly responsive [23:19:52] mutante: RoanKattouw: I've asked clarkb (Zuul developer) for an eye on it. See #openstack-infra if you're interested. [23:20:02] dunno how debug.log helps, it just shows me it's doing stuff [23:20:13] not even errors [23:21:46] speaking of the gerrit UI being responsive... [23:21:50] it's currently not for me [23:23:29] it just took ~40 seconds to load a change... so maybe it's just under load? [23:23:44] yea [23:24:01] all i can tell it is still working on things even though slow [23:24:21] zuul that is [23:24:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 194 seconds [23:25:59] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [23:28:12] !log kaldari synchronized /php-1.22wmf3/extensions/Echo 'Updating Echo for some minor bugfixes' [23:28:19] Logged the message, Master [23:33:36] !log kaldari synchronized wmf-config/InitialiseSettings.php 'Syncing InitialiseSettings for Echo' [23:33:43] Logged the message, Master [23:37:31] PROBLEM - Disk space on db1059 is CRITICAL: NRPE: Command check_disk_space not defined [23:37:51] PROBLEM - RAID on db1059 is CRITICAL: NRPE: Command check_raid not defined [23:38:21] PROBLEM - DPKG on db1059 is CRITICAL: NRPE: Command check_dpkg not defined [23:44:21] !log kaldari synchronized wmf-config/CommonSettings.php 'Syncing CommonSettings for Echo' [23:44:29] Logged the message, Master [23:45:22] Krinkle: ok. it's back up [23:46:15] Ryan_Lane: https://gist.github.com/Krinkle/5492688 [23:46:30] It re-tried a few times, but it is no longer trying [23:46:37] log is idle [23:46:41] restart it maybe? [23:46:44] perhaps I should restart Zuul as well [23:46:45] * Krinkle does  [23:46:55] maybe it doesn't handle gerrit going away very well [23:47:08] can't see why that would be an issue... [23:47:23] web interface is snappy [23:47:37] log is active again [23:47:46] so far seems the same as before the gerrit restart [23:47:54] and getting faster [23:47:57] speeding up a lot [23:47:59] nice [23:48:14] great [23:48:17] https://integration.wikimedia.org/zuul/ [23:48:22] down from 169 to 137 [23:49:08] yeah. much better [23:49:29] seems it stopped again? [23:49:44] oh. or not [23:49:47] Ryan_Lane: Are you tailing zuul log? [23:49:53] nah. web interface [23:50:00] tail -f /var/log/zuul/zuul.log on gallium [23:50:11] not all events result in jenkins jobs [23:50:15] it is iterating through quickly [23:50:23] 128 left [23:50:26] ah. great [23:51:18] Ryan_Lane: Did you restart Gerrit or some specific serivce? [23:51:24] gerrit service [23:51:30] !log Ryan_Lane restarted the gerrit service [23:51:33] Krinkle, Ryan_Lane: is your gerrit config in puppet where i can see it? [23:51:35] ah. thanks [23:51:37] Logged the message, Master [23:51:42] !log Gracefully restarted Zuul on gallium [23:51:45] jeblair: yep. 
one sec [23:51:49] Logged the message, Master [23:52:44] jeblair: https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=templates/gerrit;h=9dcd5cfa3712315345b026d2820cb243368e5d9c;hb=refs/heads/production [23:52:53] we have a spaghetti code repo [23:53:05] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/gerrit.pp;h=808d3446c89deb4c9ff9d7a732e7df44cf4e1f28;hb=refs/heads/production [23:53:17] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/role/gerrit.pp;h=56513113e26673d8728f358a2b208e1617707812;hb=refs/heads/production [23:53:29] * Krinkle change url from Server_admin_log to Server_Admin_Log (one transcludes the other, proper history) [23:53:32] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;f=files/gerrit;h=6793f9586cfff4d60b6f5a433aad975e892ada39;hb=refs/heads/production [23:54:10] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/site.pp;h=06a3511c5804dbc73c1775a654ef2feb9f81851f;hb=refs/heads/production [23:54:55] https://github.com/wikimedia/operations-puppet/tree/production/files/gerrit [23:54:55] https://github.com/wikimedia/operations-puppet/blob/production/manifests/gerrit.pp [23:55:24] Alrighty, queue is back to regular size [23:56:52] Krinkle: so old SAL from wikitech-old is _not_ lost as somebody was afraid of [23:57:08] Ryan_Lane, Krinkle: any chance you have cacti/graphite/collectd for that server publicly available? [23:57:12] and yay @ gerrit fix [23:57:57] mutante: Both https://wikitech.wikimedia.org/wiki/Server_admin_log and https://wikitech.wikimedia.org/wiki/Server_Admin_Log show the same page, but the former has a 2 revision history that transcludes the latter. The latter page is the one you want to see when viewing the wiki page [23:58:20] jeblair: ganglia, yes [23:58:29] jeblair: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous+eqiad&h=manganese.wikimedia.org&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [23:58:34] Krinkle, Ryan_Lane: you might want to increase sshd.threads: https://groups.google.com/forum/?fromgroups=#!topic/repo-discuss/uRSRwYUpWnE [23:59:02] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 221 seconds [23:59:08] Krinkle, Ryan_Lane: (in gerrit). ours is set to 100, that msg says 50-200 is not unheard of, and you have 8 set now [23:59:31] * Krinkle yields to Ryan_Lane [23:59:51] ^demon would want to be here too
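For reference, sshd.threads lives in Gerrit's gerrit.config, which uses git-config syntax, so it can be bumped with git config; the site path and the example value are assumptions (the repo-discuss thread cited above only says 50-200 is not unheard of), and Gerrit normally needs a restart to pick the change up:

    # Hypothetical site path; adjust to wherever the Gerrit site lives on manganese.
    sudo git config -f /var/lib/gerrit2/review_site/etc/gerrit.config sshd.threads 50
    # equivalent file contents:
    #   [sshd]
    #           threads = 50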