[00:02:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.249 second response time [00:02:56] binasher: wouldn't have it any other way \m/ \m/ [00:02:57] package_name => 'mariadb-server-5.5', [00:03:24] pgehres: RoanKattouw: http://pastebin.mozilla.org/2232511 [00:03:41] binasher: I captured an MW backtrace BTW [00:03:55] waitforslaves.log on fluorine [00:04:20] New review: Pyoungmeister; "need to flesh out role::db::labsdb." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53907 [00:06:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:07:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.565 second response time [00:07:56] RECOVERY - Puppet freshness on mw1008 is OK: puppet ran at Thu Mar 21 00:07:47 UTC 2013 [00:10:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:40:40] pgehres: BOOM, I got it to work [00:40:49] .p..vjg......I....def...3MASTER_POS_WAIT('db1041-bin.000603', 604351573, 10)..?.................................. [00:46:32] pgehres: https://gerrit.wikimedia.org/r/55008 [00:48:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [00:52:10] New patchset: Tim Starling; "Add libjpeg-turbo-progs to imagescalers for image rotation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52707 [00:52:39] New review: Tim Starling; "It's OK to put libjpeg-turbo-progs to imagescalers, just not on API apaches." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/52707 [00:52:53] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52707 [00:53:05] yes, we had that discussion on the wrong changeset [00:53:11] I mentioned that this changeset isn't it [00:54:23] TimStarling: what about the screwing with your plans regarding memory usage? [00:57:13] paravoid: T-2 days, HURRY! 
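
The MASTER_POS_WAIT() fragment pasted above is the usual way a maintenance script waits for replicas to catch up to a master binlog position before continuing (cf. waitforslaves.log on fluorine). A minimal sketch of the same call run by hand; the hostname is a placeholder, only the coordinates are taken from the paste:

    # On a replica: block until it has replayed the master's binlog up to the
    # given file/position, or give up after 10 seconds (returns -1 on timeout,
    # NULL if replication is not running). Coordinates come from
    # SHOW MASTER STATUS on the master.
    mysql -h REPLICA_HOST -e \
        "SELECT MASTER_POS_WAIT('db1041-bin.000603', 604351573, 10);"
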
[00:57:24] heh [00:59:36] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [01:04:02] !log catrope synchronized php-1.21wmf11/includes/db/LoadBalancer.php '319ace81250b97dff0e6ee6df4b1f570f468bf5c' [01:04:19] !log catrope synchronized php-1.21wmf12/includes/db/LoadBalancer.php '319ace81250b97dff0e6ee6df4b1f570f468bf5c' [01:19:26] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [01:34:45] PROBLEM - MySQL Recent Restart on db1051 is CRITICAL: NRPE: Command check_mysql_recent_restart not defined [01:34:45] PROBLEM - MySQL disk space on db1051 is CRITICAL: NRPE: Command check_disk_6_3 not defined [01:35:06] PROBLEM - MySQL Replication Heartbeat on db1051 is CRITICAL: NRPE: Command check_mysql_slave_heartbeat not defined [01:35:14] PROBLEM - MySQL Slave Delay on db1051 is CRITICAL: NRPE: Command check_mysql_slave_delay not defined [01:35:14] PROBLEM - Full LVS Snapshot on db1051 is CRITICAL: NRPE: Command check_lvs not defined [01:35:33] PROBLEM - mysqld processes on db1051 is CRITICAL: NRPE: Command check_mysqld not defined [01:35:33] PROBLEM - MySQL Slave Running on db1051 is CRITICAL: NRPE: Command check_mysql_slave_running not defined [01:35:33] PROBLEM - MySQL Idle Transactions on db1051 is CRITICAL: NRPE: Command check_mysql_idle_transactions not defined [01:40:33] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:48:34] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [01:51:17] !log catrope synchronized php-1.21wmf12/includes/LinksUpdate.php 'Debugging hack for move page bug on nlwikimedia' [01:52:42] * Aaron|home eyes RoanKattouw [01:52:52] * RoanKattouw reverts live hack [01:52:59] There appears to be a bug in wmf12 when moving a page [01:53:14] See the backtrace in exception.log for exception 458e1031 [01:53:28] (the title in the exception message is the target title of the move) [01:55:50] OK yeah this is a bug in wmf12 [01:57:12] Reporting, then heading home [01:57:13] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 189 seconds [01:57:43] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 198 seconds [01:58:50] !log catrope synchronized php-1.21wmf12/includes/LinksUpdate.php 'Revert debugging hack' [02:00:14] RoanKattouw: morebots? 
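
The "NRPE: Command check_... not defined" criticals on db1051 above mean the NRPE daemon on that host has no such command configured yet (typical for a freshly provisioned box where the check definitions have not landed), not that MySQL is broken. A rough way to confirm, assuming the stock Debian/Ubuntu NRPE layout rather than whatever site-specific paths are actually in use:

    # From the monitoring host: ask the remote NRPE daemon to run one check.
    /usr/lib/nagios/plugins/check_nrpe -H db1051 -c check_mysqld

    # On db1051 itself: is the command defined at all? Reload NRPE once the
    # config with the command[...] definitions is in place.
    grep -r check_mysqld /etc/nagios/nrpe_local.cfg /etc/nagios/nrpe.d/ 2>/dev/null
    sudo service nagios-nrpe-server restart
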
[02:00:41] sigh [02:03:12] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours [02:07:10] PROBLEM - Puppet freshness on cp1026 is CRITICAL: Puppet has not run in the last 10 hours [02:16:58] !log LocalisationUpdate completed (1.21wmf12) at Thu Mar 21 02:16:57 UTC 2013 [02:18:50] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [02:19:20] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [02:19:40] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 221 seconds [02:19:51] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 231 seconds [02:33:49] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 185 seconds [02:37:49] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [02:38:09] PROBLEM - Puppet freshness on db66 is CRITICAL: Puppet has not run in the last 10 hours [02:41:31] !log LocalisationUpdate completed (1.21wmf11) at Thu Mar 21 02:41:30 UTC 2013 [03:27:26] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [03:27:28] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [03:27:28] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [03:27:28] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [03:41:41] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds [03:41:52] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [04:05:32] New patchset: Tim Starling; "Removed favicon.ico files obsoleted by I35d3af43" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55014 [04:07:41] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [04:09:44] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [04:52:13] New patchset: Ryan Lane; "Use mysqldump --single-transaction where possible" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55015 [04:54:36] New patchset: Ryan Lane; "Use mysqldump --single-transaction where possible" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55015 [04:57:51] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55015 [06:03:02] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [06:31:07] New patchset: Tim Starling; "Roll back all wikis to php-1.21wmf11 due to bug 46397" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55016 [06:31:56] !log tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: [06:32:04] Logged the message, Master [06:32:44] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55016 [06:33:57] !log tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: [06:34:03] Logged the message, Master [06:38:10] New review: Mattflaschen; "-1'ing myself. Do not submit until the error (see previous comments) is fixed." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/54970 [06:38:56] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [06:55:07] New review: Ryan Lane; "Cleaning it up manually is fine, since this is a single system. It removes clutter from the repo." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/52742 [06:55:29] New patchset: Tim Starling; "Remove search::apple-dictionary-bridge from ekrem" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52742 [06:55:48] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52742 [06:58:28] RECOVERY - MySQL Slave Delay on db67 is OK: OK replication delay NULL seconds [07:00:56] PROBLEM - MySQL Slave Running on db67 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error You have an error in your SQL syntax: check the manual that co [07:07:32] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 183 seconds [07:08:12] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 188 seconds [07:17:13] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 19 seconds [07:17:32] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [08:39:04] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [08:39:26] New patchset: Aude; "allow CORS on both www.wikidata.org and wikidata.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55024 [08:40:04] PROBLEM - Puppet freshness on xenon is CRITICAL: Puppet has not run in the last 10 hours [08:42:03] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55024 [08:43:36] New review: Aude; "I don't know if this would mean their edits would be treated as logged out. that would be bad and w..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55024 [08:44:12] New review: Aude; "let's see if this helps and it might :)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55024 [09:14:41] !log olivneh synchronized wmf-config/CommonSettings.php 'Add 'wikidata.org' to to allow CORS' [09:14:47] Logged the message, Master [09:15:26] second time this week i let bash variable expansion eat part of my sync message [09:15:29] damn it [09:15:54] anyways aude ^ synced [09:16:11] ori-l: thanks [09:16:14] * aude hopes it helps [09:29:37] !log created git repo operations/debs/python-statsd [09:29:43] Logged the message, Master [09:32:51] New patchset: Hashar; ".gitreview file" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55028 [09:40:48] New patchset: Hashar; "logmsbot is no more in #wikimedia-tech" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55031 [09:41:39] New review: Hashar; "The previous changes are:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55031 [09:45:10] is Nagios officially dead and buried? [09:46:35] apergos: if you are around, would you be kind enough to kick logmsgbot out of #wikimedia-tech ? The change is https://gerrit.wikimedia.org/r/#/c/55031/ (on fenari ) [09:46:43] ori-l: ^^^ [09:48:43] lookng [09:51:24] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55031 [09:59:46] New review: Nemo bis; "fu I75abf6144d231f7335d025a59e448500d253fe13" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8344 [10:04:31] so [10:04:45] I browse github and open an issue about some file lacking [10:04:51] I then clone the repo and the file is there [10:04:53] WTF? [10:05:13] answer: upstream managed to commit the file before I cloned the repo!!! lightning speed! 
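
The doubled "to to" in the synced log message above is consistent with what ori-l describes: a $variable inside a double-quoted message gets expanded by the local shell (here to nothing) before the message is logged. A small illustration assuming a sync-file-style invocation; the variable name is only a stand-in, since whatever was actually eaten is not recoverable from the log:

    # What likely happened: the shell expanded the (unset) variable away.
    #   sync-file wmf-config/CommonSettings.php "Add 'wikidata.org' to $wgSomeSetting to allow CORS"
    # Escaping the dollar sign keeps it literal in the logged message:
    sync-file wmf-config/CommonSettings.php "Add 'wikidata.org' to \$wgSomeSetting to allow CORS"
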
[10:05:44] I thought it was going to be a bug in github interface [10:12:45] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [10:12:46] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours [10:12:46] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Puppet has not run in the last 10 hours [10:12:46] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Puppet has not run in the last 10 hours [10:12:46] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Puppet has not run in the last 10 hours [10:12:46] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Puppet has not run in the last 10 hours [10:12:46] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Puppet has not run in the last 10 hours [10:13:03] building a debian package [10:13:10] wishhh me luck [10:14:45] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [10:16:25] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 190 seconds [10:16:45] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 193 seconds [10:18:19] dpkg-source: error: aborting due to unexpected upstream changes, see /tmp/python-statsd_1.5.8-1.diff.1kbj2k [10:18:20] almost there [10:22:39] unexpected upstream changes? [10:22:47] is than an eufemism for hash mismatch? [10:25:18] na debian oddity [10:28:45] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [10:28:52] aude: what will "[10:14:40] !log olivneh synchronized wmf-config/CommonSettings.php 'Add 'wikidata.org' to to allow CORS'" hopefully fix? [10:29:25] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [10:48:33] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Puppet has not run in the last 10 hours [10:56:46] yeahhh [10:56:51] only two errors :-] [11:00:33] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [11:06:32] PROBLEM - Puppet freshness on carbon is CRITICAL: Puppet has not run in the last 10 hours [11:10:34] New patchset: MaxSem; "Relax checks for translation memory Solr" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55036 [11:25:52] PROBLEM - Host niobium is DOWN: PING CRITICAL - Packet loss = 100% [11:27:23] !log Rebooting niobium with hyperthreading disabled [11:27:30] Logged the message, Master [11:30:52] RECOVERY - Host niobium is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [11:39:29] New patchset: Hashar; "Initial upstream branch." 
[operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55038 [11:39:29] New patchset: Hashar; "Imported Upstream version 1.5.8" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55039 [11:39:30] New patchset: Hashar; "Merge tag 'upstream/1.5.8'" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55040 [11:39:30] New patchset: Hashar; "Inital deb packaging" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55041 [11:39:35] oops [11:39:57] stupid me [11:40:01] Change abandoned: Hashar; "(no reason)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55038 [11:40:10] Change abandoned: Hashar; "(no reason)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55039 [11:40:17] Change abandoned: Hashar; "(no reason)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55040 [11:40:23] Change abandoned: Hashar; "(no reason)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55041 [11:41:40] mark: hi, when using git build package should we push the upstream code too or just our debian/ directory? [11:41:59] yes too [11:42:09] but probably not through gerrit review [11:44:17] PROBLEM - Host strontium is DOWN: CRITICAL - Plugin timed out after 15 seconds [11:44:25] mark: I guess non ops will not be able to push to gerrit :-D [11:44:29] yeah [11:44:32] not sure how to deal with that [11:44:44] it's just the initial push that is the problem really [11:45:19] that is what I just did :( [11:45:27] git-review pushed the upstream branch under master branch [11:46:40] there is also a pristine-tar branch :D [11:47:17] would you be kind enough to create the upstream / pristine-branch for the python-statsd repo ? https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/python-statsd,branches [11:47:57] then I can submit my various local branches, get them merged and I can get ottomatta to review my debian/* files :-) [11:49:27] <^demon> We could probably grant Create Reference a bit wider on operations/debs/* [11:49:31] <^demon> Then people could make their own branches. [11:50:10] <^demon> Anyway, breakfast. [11:50:13] you are the bosses https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs,access :D [11:50:17] RECOVERY - Host strontium is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [11:50:54] ottomata: good morning :-]  You have missed the discussion about upstream / pristine-tar branches under operations/debs :( [11:51:31] haha, awesome, i am not yet awake [11:51:38] hopefully someone will summarize, maybe in that email thread? [11:51:42] take your time and enjoy the breakfast [12:03:16] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours [12:07:14] PROBLEM - Puppet freshness on cp1026 is CRITICAL: Puppet has not run in the last 10 hours [12:38:56] PROBLEM - Puppet freshness on db66 is CRITICAL: Puppet has not run in the last 10 hours [12:40:46] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: Connection refused [12:48:46] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 637 bytes in 0.001 second response time [12:49:33] !log Rebooting palladium with hyperthreading disabled [12:49:40] Logged the message, Master [12:51:27] PROBLEM - Host palladium is DOWN: PING CRITICAL - Packet loss = 100% [12:55:36] RECOVERY - Host palladium is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [13:00:51] ori-l: hey :) i am missing background. 
what will "[10:14:40] !log olivneh synchronized wmf-config/CommonSettings.php 'Add 'wikidata.org' to to allow CORS'" hopefully fix? [13:10:50] PROBLEM - Host cp3021 is DOWN: CRITICAL - Plugin timed out after 15 seconds [13:14:52] RECOVERY - Host cp3021 is UP: PING OK - Packet loss = 0%, RTA = 82.90 ms [13:15:07] New patchset: QChris; "Make hooks-bugzilla comment on abandoning/restoring a change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55048 [13:25:56] New patchset: Matthias Mullie; "Re-enable AFTv5 on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55049 [13:27:40] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [13:27:40] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [13:27:40] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [13:27:40] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [13:33:09] ^demon: https://gerrit.wikimedia.org/r/54692 [13:33:14] moin moin [13:33:32] <^demon> I don't know ruby either :p [13:34:38] PROBLEM - Host cp3022 is DOWN: PING CRITICAL - Packet loss = 100% [13:34:46] jeremyb_: ask zeljkof :-] [13:35:02] ^demon: i tested it! [13:35:17] ^demon: and it was based on consultation with #ruby :) [13:35:35] hashar, jeremyb_: what's up? ruby superhero needed? :) [13:35:40] * jeremyb_ spies zeljkof [13:35:42] jeremyb_: you might want to add that information to the commit message too :-D [13:35:52] zeljkof: jeremyb is asking for review on a erb template in puppet [13:35:54] zeljkof: https://gerrit.wikimedia.org/r/54692 [13:36:14] zeljkof: erb is not really my game, but I can take a look [13:36:33] it's not even erb. it's just ruby [13:36:34] <^demon> jeremyb_: Then it's probably ok :) [13:36:35] I just don't understand why we have an ERB template to generate the default [13:36:43] uh oh: Code Review - Error, Server Unavailable, 0 [13:36:49] when we could put the settings verbatim in the file and have ircecho handle them :-] [13:36:50] heh [13:37:10] hashar: default? [13:37:21] templates/ircecho/default.erb [13:37:35] that's the only template though [13:37:35] that expand some puppet variables to shell env variables [13:37:45] no shell env [13:37:52] we could just pass them as is and migrate all the ruby stuff to ircecho itself [13:38:07] errr, i guess? [13:38:08] jeremyb_: I have no idea what is going on in the file :) [13:38:16] i see no problem with leaving it as is [13:38:17] not a lot of help from me here [13:38:40] zeljkof: did you see the change was just adding one line? [13:38:57] before change it works in 1.8, not in 1.9.x [13:39:03] after change it works everywhere [13:39:03] jeremyb_: yes [13:39:09] output is the same in both places [13:39:28] jeremyb_: the change looks good to me, but I lack context to see if it could break anything later on [13:39:49] zeljkof: hint, i wrote that file to begin with. :) [13:40:02] i think hopefully i might have sufficient context [13:40:24] jeremyb_: how come you picked ruby? :) [13:40:47] New patchset: Hashar; "futureproof for ruby1.9.x (not in use yet)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54692 [13:40:52] zeljkof: i didn't. 
there's not really any other choices [13:41:11] New review: Hashar; "please amend the commit message to explain why casting is needed with ruby 1.9 :-]" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/54692 [13:41:51] jeremyb_: puppet is written in ruby? [13:41:55] yes [13:42:03] hashar is a fan of extra long commit msgs [13:42:10] jeremyb_: I like it already :) did not know that [13:42:13] ok, have to run for a few mins, bbiab [13:42:20] zeljkof: heh [13:42:57] jeremyb_: you can't just throw away a code without any context nor explaining what it is fixing. That kind of force us to play Sherlock Holmes to find it out. [13:43:44] hashar: it fixes an exception...; i think the inline comment i made indicates that it was related to bare strings that were not arrays [13:44:03] New patchset: Demon; "Block misbehaving bot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55054 [13:44:05] the method i was using in 1.8 to force it to be an array didn't work in 1.9.x [13:44:35] anyway, really have to run for a bit [13:44:47] <^demon> You're shitting me. [13:45:16] ^demon: err? [13:45:41] <^demon> Well, somebody wrote a stupid bot that keeps requesting a tar.gz of DPLForum. [13:45:45] <^demon> That's what 55054 is for. [13:45:59] <^demon> But now I'm also getting bingbot indexing us? It's not listening to robots.txt [13:46:05] right, i read it :) [13:46:13] heh [13:46:21] you can block on useragent too if you like [13:46:41] <^demon> Well the user agent for the DPLForum one is just a generic Safari-looking UA. [13:46:58] <^demon> But it's requesting way too regularly of a stupid file to be a real person. [13:47:18] New review: Hashar; "Shouldn't you update the rule in the Proxy part?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55054 [13:47:39] ^demon: the gerrit apache conf also have a , you might want to block there [13:48:16] ^demon: i meant for bing [13:48:23] <^demon> Ah yes [13:48:24] <^demon> That. [13:48:38] New patchset: Demon; "Block misbehaving bot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55054 [13:48:41] New patchset: Mark Bergsma; "Cache 4xx responses on the mobile backend servers for 5m" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55055 [13:48:52] <^demon> "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" [13:50:11] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55055 [13:50:46] <^demon> mark: Could you look at 55054 for me? [13:51:05] in a bit [13:51:16] babysitting my own change now [13:51:21] <^demon> Ok, thanks. [13:55:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55054 [13:55:18] it's live now [13:55:55] <^demon> Ok, thanks. [13:58:29] RECOVERY - Host cp3022 is UP: PING OK - Packet loss = 0%, RTA = 82.76 ms [14:04:03] <^demon> jeremyb_: Guy is still being an idiot and making requests, but at least he's not eating resources now :p [14:07:24] poke ^demon [14:07:30] * aude needs someone to review https://gerrit.wikimedia.org/r/#/c/49069/ [14:07:53] some people are still ending up at wikidata.org instead of www.wikidata.org [14:08:35] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [14:10:04] New review: Aude; "if you request http://wikidata.org that *does* redirect." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [14:10:29] <^demon> aude: lgtm, but I can't deploy that. 
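
For context on the ircecho/ERB change above: in Ruby 1.8 a bare String could be treated as an array of its lines, which is why the template worked there; Ruby 1.9 dropped String#to_a, so a bare string has to be cast explicitly. The exact method used in the change isn't quoted here, so this is only an assumed illustration of the 1.8-vs-1.9 difference (Ubuntu-style interpreter names; plain ruby works too):

    # Ruby 1.8: a String answers #to_a, yielding its lines.
    ruby1.8 -e 'p "#wikimedia-operations".to_a'        # => ["#wikimedia-operations"]

    # Ruby 1.9: String#to_a is gone, so the same call raises NoMethodError;
    # an explicit cast such as Array(...) works on both versions.
    ruby1.9.1 -e 'p "#wikimedia-operations".to_a'      # NoMethodError
    ruby1.9.1 -e 'p Array("#wikimedia-operations")'    # => ["#wikimedia-operations"]
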
[14:10:34] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [14:10:54] ^demon: not good to me [14:11:04] <^demon> hrm? [14:11:09] * jeremyb_ comments [14:11:13] ok [14:11:38] New patchset: Demon; "Send analytics repos to #wikimedia-analytics" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55057 [14:11:41] what i did was make it like how mediawiki.org is handled [14:11:47] or that's what we want [14:12:55] aude: i don't think that's what you want... [14:13:06] ? [14:13:27] we don't want anyone to end up on wikidata.org without the www [14:13:37] it's been only problems to have both [14:14:18] huh... it helps if i do the git pull in the right repo [14:14:23] New review: Yuvipanda; "Wheeee!" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55057 [14:14:24] yeah :) [14:15:16] The Hebrew Wikivoyage was recently approved by the board of trustees and the language committee (https://bugzilla.wikimedia.org/show_bug.cgi?id=46416). Would anyone around be able to help launch it? [14:15:18] New patchset: Hashar; "expose realm in mw-deployment-vars.sh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55058 [14:15:18] New patchset: Hashar; "(bug 41285) adapt `foreachwiki` for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55059 [14:16:05] WikiJunkie_: you probably want to coordinate with relrod [14:16:07] errr [14:16:09] reedy [14:16:35] when will he be around ? [14:16:55] idk. he's here now but maybe not really here [14:17:02] just hang out here for a while :) [14:17:15] oh ok [14:19:04] * jeremyb_ waits for the git pull [14:19:08] ok [14:23:22] ottomata.meal( 'breakfast' ).isComplete? [14:25:19] yes! [14:25:22] true! [14:25:29] yeah, no likey pristine? [14:25:42] i just did what the tutorial recommended :/ [14:25:58] If you want to be able to exactly recreate the original tarball (orig.tar.gz) from Git you should also specify the --pristine-tar option. This is recommended. [14:26:19] New review: Anomie; "Why not use the getRealmSpecificFilename function from $MW_COMMON_SOURCE/multiversion/MWRealm.sh?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55059 [14:26:24] New review: Jeremyb; "incomplete" [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/49069 [14:26:37] ^demon: aude: ^ [14:27:16] ok, thanks [14:27:53] ottomata: so I got the package ready in a labs instance with all the changes thanks to your tutorial :-] But I would need the creation of 2 branches in the repository`upstream` and `pristine-tar` [14:28:07] ottomata: should be possible at https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/python-statsd,branches if you are part of the `ops` group [14:28:22] New review: Anomie; "In that case, why not throw in WMF_DATACENTER too?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55058 [14:29:00] ah so, what i did before, was give me push direct push rights [14:29:02] can I do that for you? [14:29:36] are you in the ldap ops group? 
[14:29:43] ottomata: nop :-] [14:29:56] ottomata: so I need branch to submit my changes and get them merged for me [14:31:27] ok [14:31:34] the integration group now has push rights [14:31:34] so [14:31:43] you should be able to directly push the upstream and pristine tar branches now [14:31:51] trying [14:32:09] still need the branches :-] [14:32:33] git push gerrit upstream:upstream [14:32:39] mm maybe I should use refs/something [14:33:00] New patchset: Aude; "(bug 45005) Redirect wikidata.org to www.wikidata.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [14:33:43] still need to something about the language subdomains, but not sure exaclty what [14:33:58] ottomata: I can't create the branch (tried: git push gerrit upstream:refs/heads/upstream ) [14:34:13] ottomata: seems it need a Push Branch permission [14:34:52] Change abandoned: Hashar; "??" [operations/debs/python-statsd] (refs/meta/config) - https://gerrit.wikimedia.org/r/55065 [14:35:58] this doesn't woork? [14:35:58] git push gerrit upstream [14:36:13] nop because the branch does not exist in Gerrit :( [14:36:30] hm. [14:36:34] i never had that problem [14:36:41] so we need to either create the branches manually or allow "Create Reference" right [14:37:18] the ldap/ops group is allowed to Create Reference in operations/debs/* https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs,access [14:37:21] must be that :-] [14:37:42] we probably don't want to grant that to the integration group [14:37:49] i'm editing this [14:37:50] https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/python-statsd,access [14:38:05] New review: Jeremyb; "(1 comment)" [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/49069 [14:38:20] ok, i gave create reference perms to integeration grouop [14:38:24] try now? [14:38:36] works! [14:38:40] ah, cool! [14:39:03] https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/python-statsd,branches :-] [14:39:19] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 182 seconds [14:40:09] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 202 seconds [14:40:36] !g 55041 [14:40:36] https://gerrit.wikimedia.org/r/#q,55041,n,z [14:40:45] Change restored: Hashar; "(no reason)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55041 [14:40:54] New patchset: Hashar; "Inital deb packaging" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55041 [14:41:10] stupid gerrit dependencies [14:41:24] Change abandoned: Hashar; "(no reason)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55041 [14:41:41] New patchset: Hashar; "Inital deb packaging" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55066 [14:41:51] New review: Hashar; "resent with https://gerrit.wikimedia.org/r/55066" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55041 [14:42:14] ottomata: that worked thanks! You can remove `integration` from https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/python-statsd,access :-] [14:43:39] yeehaw [14:44:08] mk [14:44:30] arharharhhhhh [14:44:38] the change I pushed are in Gerrit [14:44:41] and abandonned [14:44:51] and Gerrit created a dependency on an abandoned change [14:44:54] :-/ [14:44:55] wa why? 
haha [14:45:01] hehe [14:45:21] Change restored: Hashar; "(no reason)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55040 [14:45:28] Change restored: Hashar; "(no reason)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55039 [14:45:38] Change abandoned: Hashar; "(no reason)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55039 [14:45:45] oh yeah, it gets weird if you abandon a change that's not on a topic branch, right? [14:45:48] Change abandoned: Hashar; "(no reason)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55040 [14:45:52] because then your local is no longer in sync wiiht gerrit [14:45:55] local master [14:45:58] so I sent the changes by mistakes to gerrit [14:46:10] then git pushed them by passing gerrit entirely [14:46:12] git reset --hard origin/master [14:46:17] now the Gerrit DB is no more in sync :/ [14:46:31] Change abandoned: Hashar; "(no reason)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55066 [14:46:54] going to change the Change-Id values and resubmit [14:46:59] PROBLEM - Puppet freshness on sanger is CRITICAL: Puppet has not run in the last 10 hours [14:47:16] errr, what? [14:47:18] hashar: huh? [14:47:23] oh [14:47:44] changing the change-id is a bad solutin [14:47:47] solution* [14:47:55] I dont care [14:48:06] :( [14:49:38] hashar [14:49:45] maybe just copy your debian directory [14:49:50] re clone [14:49:54] copy it back in [14:49:58] and make a new changeset [14:50:12] yeah need to do that [14:50:17] New review: Jeremyb; "realm==production yields a labs dblist?" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55059 [14:50:31] New patchset: Aude; "(bug 45005) Redirect wikidata.org to www.wikidata.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [14:50:44] jeremyb_: how is it now? [14:51:01] * aude tested it and it works with en.*** [14:51:01] New patchset: Hashar; "Inital deb packaging" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55068 [14:51:20] en.**.org or **.org [14:51:36] ottomata: na it still detect the parent Change-Id and bind the parent commit to the abandoned change. I need to change all the change-id in the master branch [14:52:12] New patchset: Aude; "(bug 45005) Redirect wikidata.org to www.wikidata.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [14:52:22] hm? [14:52:29] if you recloned it wouldn't detect any change id [14:52:29] eek, tab [14:52:40] i'm saying, start over [14:52:57] git clone ssh://otto@gerrit.wikimedia.org:29418/operations/debs/python-statsd [14:53:11] cp -r /tmp/debian python-statsd [14:53:11] cd python-statsd [14:53:11] git add debian [14:53:12] git-review [14:53:19] (git commit ^^) [14:53:37] what I mean is the commit for the current `master` has a change-id which points to an abandoned change. So adding a new follow up commit will still let Gerrit query the parent change-id [14:53:42] and bind it to the abandoned change [14:53:47] I need to force push I guess [14:53:51] aude: no good. maybe i need to just do it myself :-) [14:54:14] hm? 
naw if you re clone it won't [14:54:22] just abandon all the changes in gerrit now [14:54:29] and make a new commit, it will get a new change id [14:54:32] New patchset: Aude; "(bug 45005) Redirect wikidata.org to www.wikidata.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [14:54:39] ottomata: that is what I just did with https://gerrit.wikimedia.org/r/55068 :-] [14:54:42] the change ids are generated locally and then committed to gerrit [14:54:51] did you actually reclone? [14:54:53] what i have now works for me [14:54:53] ottomata: it still has master as a parent with hash a an abandoned change id [14:55:24] rm -r python-statsd [14:55:24] git clone ssh://otto@gerrit.wikimedia.org:29418/operations/debs/python-statsd [14:55:24] ... [14:55:28] jeremyb_: if it's still wrong, we are open to suggestions :) [14:55:45] * jeremyb_ also has an urge to fix hashar's thing himself [14:55:57] heh [14:56:22] * aude not sure what happens here, though if you visit https://wikidata.org [14:56:24] ottomata: that is irrelevant :-] the tip of master will still be https://gerrit.wikimedia.org/r/#/c/55040/ which I have pushed directly. [14:56:46] that gets redirected to https://www.wikidata.org even with this configuration? or where does that get handeld? [14:56:49] handled? [14:56:59] OH [14:57:01] you pushed it directly [14:57:11] wait [14:57:12] no you didn't [14:57:16] you pushed it for review [14:57:27] hmm, oh [14:57:30] i'm confused [14:57:31] can you grant me force push on https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/python-statsd,access [14:57:48] ottomata: I will alter the change-id field of the laster commit in master [14:57:54] that should fix my issue [14:58:47] hrmmmmm, Tim-away is the last one to touch this [14:59:16] aude: so, what if en.wikidata.org is harder than wikidata.org ? do one first and then followup? [14:59:43] jeremyb_: ? [14:59:52] ottomata: changing the change-id of the latest master commit will trick Gerrit. It will no more be able to find a parent change if I git-review :-D [14:59:53] k done [14:59:55] we just want to redirect everything for now [15:00:10] hrmmmmm [15:00:17] * jeremyb_ hates this [15:00:27] Change abandoned: Hashar; "(no reason)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55068 [15:00:36] we might have a fancy something to redirect for wikidata items only on language [15:00:40] not anytime soon [15:01:08] New patchset: Hashar; "Inital deb packaging" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069 [15:01:10] ok, i think i know what I'm going to do [15:01:15] * hashar rolls the drums [15:01:21] ok? [15:01:29] ottomata: solved \O/ [15:01:44] ottomata: you can get rid of the git force push on https://gerrit.wikimedia.org/r/#/admin/projects/operations/debs/python-statsd,access [15:02:09] done [15:02:43] tim suggests not to use our Special:ItemByTitle page but need to come up with something more tailored for what we need [15:03:14] * aude not sure what and it doesn't exist yet, so not worry about it now [15:03:45] New review: Hashar; "Faidon, Andrew," [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069 [15:04:30] ottomata: wonderful thank you! Now I have a proper change to review https://gerrit.wikimedia.org/r/#/c/55069/ :-] [15:06:09] er? [15:06:21] python-statsd? [15:06:27] yehhhh [15:06:38] build on my own based on Andrew tutorial! 
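
For reference, the git-buildpackage workflow being pieced together above — import the upstream tarball, publish the extra branches, then send the packaging for review — looks roughly like this. The tarball and remote names are taken from the discussion; everything else is a sketch, not the exact commands hashar ran:

    # Import upstream 1.5.8; --pristine-tar stores the delta needed to
    # regenerate the exact orig.tar.gz later.
    git-import-orig --pristine-tar ../python-statsd_1.5.8.orig.tar.gz

    # Publish the branches gbp depends on (this is the push that needed the
    # Create Reference grant), plus the upstream/* tags.
    git push gerrit upstream:refs/heads/upstream
    git push gerrit pristine-tar:refs/heads/pristine-tar
    git push gerrit --tags

    # The debian/ packaging itself still goes through Gerrit as a normal change.
    git-review
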
[15:06:46] hm, hashar, if you use —git-export-dir when you build the pacakge, i think nose shouldn't pollute anything, right? [15:06:49] what are we going to use that for? [15:06:58] ottomata: maybe :-] [15:08:36] paravoid: stated is a nodejs frontend to graphite, I don't think we are going to use it. Nonetheless Zuul has a dependency over python-statsd . Currently it will try to install it using pip [15:09:00] I know what statsd is [15:09:10] hi paravoid [15:09:14] but I'm wondering what you need it for [15:09:17] hi aude! [15:09:32] merely to avoid install the module via pip [15:09:44] I haven't yet found out how to bypass pip entirely [15:11:33] aude: this is more complicated than I originally thought. (I'm running across things that are already in the repo that maybe work now or maybe don't. and I have to figure out how the current stuff works) [15:12:00] aude: maybe poke me again in 2 hours if i haven't touched it by then [15:12:01] like what? [15:12:04] ok [15:12:10] * jeremyb_ has to do other stuff now [15:12:15] ok [15:12:15] aude: see postrewrites.conf [15:12:21] * aude looks [15:13:36] jeremyb_: that looks like it just normalizes the subdomains and such [15:13:41] yeah [15:13:50] and handles mobile [15:13:58] en.m.wikipedia.org ? or something [15:14:02] * aude not sure [15:14:13] having mobile wikidata is @todo someday also :) [15:15:16] * aude not sure where mobile is handled [15:19:19] mobile isn't apache at all [15:19:30] so it doesn't need mentioning in apache conf [15:20:18] (errr, well I guess that's probably wrong) [15:20:29] it is apache [15:20:41] anyway, bye [15:21:54] jeremyb_: ah, ok [15:22:08] later [15:31:08] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 199 seconds [15:31:28] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 204 seconds [15:32:26] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 19 seconds [15:33:17] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [15:45:04] !g 54524 [15:45:04] https://gerrit.wikimedia.org/r/#q,54524,n,z [15:55:33] I am out for now, will be back tonight [16:03:31] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [16:03:32] New patchset: Mark Bergsma; "Make "varnish" be the default instance name, instead of the hostname" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55079 [16:04:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55079 [16:31:21] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [16:32:01] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 198 seconds [16:38:59] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [16:49:59] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [16:50:19] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [16:51:40] paravoid: you there? 
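
On the --git-export-dir remark near the top of this stretch: building from an exported copy keeps build and test leftovers (nose output, .pyc files) out of the git checkout. A typical invocation with the gbp tooling of the time — a sketch under those assumptions, not the command actually used:

    # Export the tree to ../build-area and build there, unsigned;
    # --git-pristine-tar rebuilds the orig tarball from the pristine-tar branch.
    git-buildpackage --git-pristine-tar --git-export-dir=../build-area -us -uc
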
[16:54:20] New patchset: Pyoungmeister; "adding coreutils to base::puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55085 [16:56:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55085 [17:00:13] yes [17:00:40] notpeter: [17:00:54] paravoid: ah, I figured out my issue [17:00:58] it would seem that in lucid [17:01:03] timeout is in a package called timeout [17:01:08] not coreutils [17:01:19] soooo, puppet is borked on all lucid nodes [17:01:19] but [17:01:22] I can fix this :) [17:04:17] New patchset: Pyoungmeister; "install timeout on all lucid nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55088 [17:06:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55088 [17:08:29] !log install timeout package on all lucid nodes to get puppet back on track [17:08:38] Logged the message, notpeter [17:10:19] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:15:42] paravoid: and the version of timeout in lucid has no -k [17:15:44] wooo! [17:16:45] they're not exactly compatible [17:17:44] man, I've given myself *so much rope* to hang myself with on this one [17:18:21] hahaha [17:18:33] New review: Matthias Mullie; "Do not merge before Tue March 26" [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/55049 [17:18:47] I'm so good to me :) [17:19:59] New patchset: Pyoungmeister; "removing -k option from puppet wrapper" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55090 [17:20:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55090 [17:23:13] !log purging wikispecies.net from squid [17:23:18] Logged the message, Master [17:23:44] !log running puppet on all lucid nodes via salt [17:23:50] Logged the message, notpeter [17:24:03] this might overload stafford, btw :) [17:30:47] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 184 seconds [17:31:08] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 189 seconds [17:31:58] !log csteipp synchronized php-1.21wmf12/includes/Title.php 'update for loadPageData()' [17:32:03] Logged the message, Master [17:39:15] !log csteipp rebuilt wikiversions.cdb and synchronized wikiversions files: Update test2wiki to 1.21wmf12 [17:39:21] Logged the message, Master [17:42:30] * aude trying to move a page on test 2 :) [17:42:34] New patchset: Mark Bergsma; "Reduce esams cache size, and compare persistent to file backend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55093 [17:42:55] aude: :) [17:43:01] I just got a stacktrace.... :/ [17:43:05] LinkCache doesn't know redirect status of this title: Cool_Hand_Luke1 [17:43:32] yeah, same here [17:43:38] (well, Window2, ;) ) [17:44:03] that one might have to do with my use of lua [17:44:05] on that page [17:44:09] * aude tries different page [17:44:25] New patchset: Mark Bergsma; "Reduce esams cache size, and compare persistent to file backend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55093 [17:44:49] different page worked [17:44:53] oh? 
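
Summarizing the timeout detour above: on lucid, /usr/bin/timeout comes from a separate timeout package rather than coreutils, and that older version has no -k/--kill-after, so the puppet wrapper has to stick to the portable form. A quick check, with the final line only a placeholder for whatever the wrapper really runs:

    # Which package owns timeout here? ('timeout' on lucid, coreutils later,
    # per the discussion above.)
    dpkg -S "$(command -v timeout)"

    # Portable form: a plain duration, no -k/--kill-after.
    timeout 1800 puppetd --onetime --verbose --no-daemonize
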
[17:44:54] we might have an issue with the wikidata lua [17:44:57] New patchset: CSteipp; "Update test2wiki to wmf12" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55096 [17:45:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55093 [17:45:20] i had a script error anyway with my lua (from yesterday) and code probably didn't like that [17:45:23] when i moved the page [17:45:54] not a blocker but something we obviously need to fix and test with our lua stuff [17:45:59] New review: CSteipp; "Deployed" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/55096 [17:46:15] http://paste.lisp.org/display/136153 [17:46:32] oh, that's different [17:46:33] Change merged: CSteipp; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55096 [17:46:53] i was able to successfully move a different page [17:47:14] moving stuff fine [17:49:43] https://test2.wikipedia.org/wiki/Cool_Hand_Luke is not nice though [17:49:49] The revision #0 of the page named "Cool Hand Luke" does not exist. [17:50:00] New patchset: Pyoungmeister; "proper comparisson for lsbrelease" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55097 [17:50:13] No valid null revision produced in Title::moveToInternal [17:50:20] when i tried to move that out of the way [17:50:47] http://pastebin.com/fKxkvE1k [17:51:06] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55097 [18:02:25] New patchset: Mark Bergsma; "Add cp3003 to the esams upload pool" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55101 [18:04:43] * mark puts a stick in jenkins' ass [18:05:29] New review: Spage; "ACUX moved the needle, so let's deliver the better experience to several thousand new users until we..." 
[operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/54725 [18:06:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55101 [18:07:05] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Thu Mar 21 18:06:59 UTC 2013 [18:07:10] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54725 [18:07:14] RECOVERY - Puppet freshness on cp1014 is OK: puppet ran at Thu Mar 21 18:07:05 UTC 2013 [18:07:14] RECOVERY - Puppet freshness on blondel is OK: puppet ran at Thu Mar 21 18:07:08 UTC 2013 [18:07:14] RECOVERY - Puppet freshness on es4 is OK: puppet ran at Thu Mar 21 18:07:08 UTC 2013 [18:07:14] RECOVERY - Puppet freshness on sq63 is OK: puppet ran at Thu Mar 21 18:07:08 UTC 2013 [18:07:14] RECOVERY - Puppet freshness on singer is OK: puppet ran at Thu Mar 21 18:07:08 UTC 2013 [18:08:24] RECOVERY - Puppet freshness on amssq42 is OK: puppet ran at Thu Mar 21 18:08:14 UTC 2013 [18:08:24] RECOVERY - Puppet freshness on amssq41 is OK: puppet ran at Thu Mar 21 18:08:14 UTC 2013 [18:08:24] RECOVERY - Puppet freshness on amssq43 is OK: puppet ran at Thu Mar 21 18:08:15 UTC 2013 [18:08:25] RECOVERY - Puppet freshness on amssq44 is OK: puppet ran at Thu Mar 21 18:08:16 UTC 2013 [18:08:25] RECOVERY - Puppet freshness on amssq45 is OK: puppet ran at Thu Mar 21 18:08:19 UTC 2013 [18:08:25] RECOVERY - Puppet freshness on amssq46 is OK: puppet ran at Thu Mar 21 18:08:19 UTC 2013 [18:08:26] RECOVERY - Puppet freshness on amssq47 is OK: puppet ran at Thu Mar 21 18:08:21 UTC 2013 [18:08:34] RECOVERY - Puppet freshness on amssq48 is OK: puppet ran at Thu Mar 21 18:08:29 UTC 2013 [18:08:45] RECOVERY - Puppet freshness on amssq49 is OK: puppet ran at Thu Mar 21 18:08:35 UTC 2013 [18:10:52] RECOVERY - Puppet freshness on amssq50 is OK: puppet ran at Thu Mar 21 18:10:42 UTC 2013 [18:11:04] RECOVERY - Puppet freshness on knsq16 is OK: puppet ran at Thu Mar 21 18:10:57 UTC 2013 [18:11:54] RECOVERY - Puppet freshness on amssq51 is OK: puppet ran at Thu Mar 21 18:11:53 UTC 2013 [18:11:54] RECOVERY - Puppet freshness on knsq17 is OK: puppet ran at Thu Mar 21 18:11:53 UTC 2013 [18:13:25] RECOVERY - Puppet freshness on amssq52 is OK: puppet ran at Thu Mar 21 18:13:19 UTC 2013 [18:13:46] RECOVERY - Puppet freshness on knsq18 is OK: puppet ran at Thu Mar 21 18:13:41 UTC 2013 [18:14:35] RECOVERY - Puppet freshness on amssq53 is OK: puppet ran at Thu Mar 21 18:14:25 UTC 2013 [18:14:46] RECOVERY - Puppet freshness on knsq19 is OK: puppet ran at Thu Mar 21 18:14:35 UTC 2013 [18:16:26] RECOVERY - Puppet freshness on knsq20 is OK: puppet ran at Thu Mar 21 18:16:15 UTC 2013 [18:16:35] RECOVERY - Puppet freshness on amssq54 is OK: puppet ran at Thu Mar 21 18:16:26 UTC 2013 [18:18:04] RECOVERY - Puppet freshness on amssq55 is OK: puppet ran at Thu Mar 21 18:18:02 UTC 2013 [18:19:05] RECOVERY - Puppet freshness on amssq56 is OK: puppet ran at Thu Mar 21 18:18:59 UTC 2013 [18:19:34] RECOVERY - Puppet freshness on knsq21 is OK: puppet ran at Thu Mar 21 18:19:32 UTC 2013 [18:20:35] RECOVERY - Puppet freshness on amssq57 is OK: puppet ran at Thu Mar 21 18:20:25 UTC 2013 [18:20:58] New patchset: Pyoungmeister; "add puppet run at reboot as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55103 [18:21:54] RECOVERY - Puppet freshness on knsq22 is OK: puppet ran at Thu Mar 21 18:21:52 UTC 2013 [18:22:09] robh: i want to finish deploying mw1209-1220 [18:22:54] RECOVERY - Puppet freshness on knsq23 is OK: 
puppet ran at Thu Mar 21 18:22:52 UTC 2013 [18:23:16] RECOVERY - Puppet freshness on amssq58 is OK: puppet ran at Thu Mar 21 18:23:04 UTC 2013 [18:23:54] RECOVERY - Puppet freshness on knsq24 is OK: puppet ran at Thu Mar 21 18:23:47 UTC 2013 [18:24:15] RECOVERY - Puppet freshness on amssq59 is OK: puppet ran at Thu Mar 21 18:24:06 UTC 2013 [18:25:55] RECOVERY - Puppet freshness on amssq60 is OK: puppet ran at Thu Mar 21 18:25:43 UTC 2013 [18:27:04] RECOVERY - Puppet freshness on amssq61 is OK: puppet ran at Thu Mar 21 18:27:02 UTC 2013 [18:27:44] RECOVERY - Puppet freshness on knsq27 is OK: puppet ran at Thu Mar 21 18:27:43 UTC 2013 [18:29:14] RECOVERY - Puppet freshness on knsq28 is OK: puppet ran at Thu Mar 21 18:29:08 UTC 2013 [18:29:14] RECOVERY - Puppet freshness on amssq62 is OK: puppet ran at Thu Mar 21 18:29:13 UTC 2013 [18:30:15] RECOVERY - Puppet freshness on knsq29 is OK: puppet ran at Thu Mar 21 18:30:04 UTC 2013 [18:43:08] Jenkins is giving "checkstyle FAILURE (voting)" on E3's unrelated commits, e.g. https://gerrit.wikimedia.org/r/#/c/55108/ :( [18:44:40] AFAIK Jenkins has generally been broken in the wmf branches for a while [18:44:52] hasharAW: Thoughts [18:44:54] ? [18:45:08] o treally? [18:45:30] ah yeah [18:45:41] the jshint reporting is wrong, we have a bug about it [18:46:03] So we should remove jenkins as a reviewer and +2? [18:46:20] I guess [18:46:28] maybe the branch has broken JS too [18:46:29] https://integration.wikimedia.org/ci/job/mediawiki-core-jslint/4452/checkstyleResult/ [18:49:21] I am off again [18:54:26] cmjohnson1: cool, sorry, was on phone with the auto dudes [18:54:31] my car arrives tomorrow \o/ [18:54:50] awesome...np...i have been bugging peter about it [18:55:04] now you can go drive the pch with the top down [18:59:05] cmjohnson1: So you have them in use now? [18:59:14] robh: i am running sync-common on 1 host atm....they are all in dsh groups. that is it all [18:59:32] hrmmm.... i am paranoid, and i log when i am mid deploy on apaches [18:59:44] since now those in dsh nodes will throw errors for any dev attempting to deploy anything [19:00:20] they wont serve anythign on site until they are in pybal, so you wont break anything of course [19:00:33] right...yeah pybal is the last thing i want to do [19:00:40] but eventually someone is going to run an update, see a bunch of shit throwing errors, and possibly have a minor panic attack ;] [19:01:25] cmjohnson1: So I am not sure that sync common does the docroot [19:01:40] though i am pretty sure the puppet update runs docroot sync, i would just verify it exists before pushing it live. [19:02:20] agreed...how long does sync-common normally take? [19:02:28] i am only running on one host [19:03:03] awhile. [19:03:09] like 3 minutes or so [19:03:44] cmjohnson1: if you really wanna see what its doing [19:03:56] you are also able to cat the script, and rip out the actual sync commands [19:04:04] which you can then add the verbose flag to usually [19:04:13] just kind of helps early on to see all the shit its really doing. 
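
Following the "cat the script" advice just above, one low-risk way to see what sync-common actually runs on a single new apache before trusting it fleet-wide — assuming it is a shell script on the host's PATH, which may not match the real layout:

    # Pull out the rsync invocations the wrapper hides...
    grep -n rsync "$(command -v sync-common)"

    # ...or trace one run on this host and keep the output for reference.
    bash -x "$(command -v sync-common)" 2>&1 | tee /tmp/sync-common.trace
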
[19:04:26] i am getting some permission denied msgs [19:04:29] rsync: send_files failed to open "/php-1.21wmf11/.git/modules/extensions/MWSearch/index.lock" (in common): Permission denied (13) [19:05:00] ahh, that is a permission error [19:05:08] so you can fix this by going to fenari as root [19:05:15] and chowning that file to mwdeploy [19:05:52] oh wait [19:05:54] cmjohnson1: here ya go [19:06:08] its in /home/wikipedia/common/ then follow the path [19:06:42] You shouldn't need to chown stuff on fenari [19:06:47] set-group-write should do it I think [19:06:49] RoanKattouw: yep [19:06:52] thats why i said wait [19:06:57] was pulling the actual command [19:06:58] =] [19:07:06] chmod g+rw /home/wikipedia/common/php-1.21wmf11/.git [19:07:14] cmjohnson1: that plus rest of path after the .git [19:07:30] that will let wikidev handle it [19:07:39] then you can rerun the sync [19:07:52] (though since its .git crap i dont think it reaaaly matter, but i may be wrong) [19:08:05] best to fix false positive alerts [19:08:43] RoanKattouw: didnt say i was wrong ;] [19:08:47] so i must be right! [19:09:05] haha [19:09:16] cmjohnson1: if it wasn't already clear by now, if you are touching mediawiki, it used to be hunt down brion to help [19:09:19] now its find roan. [19:09:33] good times [19:09:37] now that he and tim are on wildly differing time zones its much easier to get help. [19:09:39] heh [19:09:50] Even more wildly differing :) [19:10:00] Actually -- we're probably closer now than we were before [19:10:05] we need an ops based person in australia. [19:10:17] we have the EU timezones and the US [19:10:32] need AU or south pacific. [19:10:46] cmjohnson1: we're gonna need you to move. [19:10:50] hahaha [19:11:04] Actually I'm not sure fixing the error for that .git file is important [19:11:12] It probably isn't, unless it's causing the remainder of the sync to fail [19:11:22] i think its really not [19:11:28] its an excess error, wont affect apache use [19:11:34] but causes folks to wonder why when they sync [19:11:39] Well [19:11:42] (iirc) [19:11:49] The sync script in the other direction should handle this I think [19:11:58] if run on a deployment host [19:12:02] Like, when doing a push sync [19:12:04] versus the end client [19:12:07] When you're doing a pull sync, it can be weird [19:12:09] Exactly [19:12:11] New patchset: Dzahn; "add a contact group "parsoid" with Roan as member and add it to notifications for parsoid LVS monitoring checks (RT-4318)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55119 [19:12:16] RoanKattouw: ^ [19:12:24] Most likely you were running it while someone else was actively messing with git on fenari [19:12:55] mutante: Cool. But AFAIK my phone provider is still listed wrong in contacts.cfg [19:13:18] RoanKattouw: eh, yea, true. 
is a change in private repo though [19:13:25] Yes [19:13:36] But it presumably needs to be changed or it won't work :) [19:14:20] New review: Catrope; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55119 [19:15:11] New patchset: Dzahn; "add a contact group "parsoid" with Roan as member and add it to notifications for parsoid LVS monitoring checks (RT-4318)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55119 [19:15:17] heh, that was almost edit conflict, of course, spaces [19:15:35] cmjohnson1: the reason we say it doesnt matter if its fixed [19:15:37] but we should fix is [19:15:48] the .git crap is for git updates on the directory structure, usually from fenari [19:16:03] so the end apaches not getting a file in the .git subdirectory is meh, not a big deal [19:16:13] but it spams the sync output with errors and thus is non-ideal [19:16:24] but will that cause the sync to fail? [19:16:29] i dont think so [19:16:30] I don't think so [19:16:39] nope,ignored it in the past [19:16:53] yea, i just didnt confirm it today, so didnt wanna say 'nah' [19:16:53] ;] [19:16:59] 'Rob said it would work!' [19:17:01] it seems to take an eternity to run [19:17:06] it probably just needs an --ignore line or something [19:17:09] well, on the first run its pulling everyting [19:17:13] so it'll be a bit. [19:17:16] mutante: It already ignores .git/object [19:17:27] mutante: But it can't ignore the other metadata because Special:Version needs it [19:17:38] ah,ok.. [19:17:53] i only see it throw errors now on lockfiles [19:17:59] (regularly that is) [19:18:03] Right [19:18:07] That's harmless [19:18:09] yep [19:18:59] do you have a loop (already created) to run this on all the other servers? [19:21:02] if you run the sync-common from fenari it does it to all apaches. [19:21:35] which, once you have the first apache running [19:21:38] you can just do to push to the rest [19:21:47] running sync-common on fenari syncs to all apaches, but honestly thats ok. [19:21:56] if anything it ensures everything is the same ;] [19:22:27] cool [19:23:28] so once you have all the files synced, we can add it to pybal with the False setting [19:23:34] which means you can watch the lvs server do the tests if its ok [19:23:37] without it attempting to pool it [19:25:08] k [19:25:42] brb [19:35:23] PROBLEM - MySQL Recent Restart on db1052 is CRITICAL: NRPE: Command check_mysql_recent_restart not defined [19:35:23] PROBLEM - MySQL disk space on db1052 is CRITICAL: NRPE: Command check_disk_6_3 not defined [19:35:42] PROBLEM - MySQL Replication Heartbeat on db1052 is CRITICAL: NRPE: Command check_mysql_slave_heartbeat not defined [19:35:53] PROBLEM - Full LVS Snapshot on db1052 is CRITICAL: NRPE: Command check_lvs not defined [19:35:53] PROBLEM - MySQL Slave Delay on db1052 is CRITICAL: NRPE: Command check_mysql_slave_delay not defined [19:36:12] PROBLEM - mysqld processes on db1052 is CRITICAL: NRPE: Command check_mysqld not defined [19:36:12] PROBLEM - MySQL Idle Transactions on db1052 is CRITICAL: NRPE: Command check_mysql_idle_transactions not defined [19:36:12] PROBLEM - MySQL Slave Running on db1052 is CRITICAL: NRPE: Command check_mysql_slave_running not defined [19:37:13] New review: Platonides; "hashar, the second all-labs.dblist should be all.dblist" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55059 [19:43:44] is someone working on db1052 ? 
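
On the pybal step mentioned above: new apaches are normally added to the LVS server list with enabled set to False, so PyBal health-checks them without pooling them. The entry format below follows the one-dict-per-line pybal pool files; the file path and weight are guesses, not the real configuration:

    # Append a depooled entry; flip 'enabled' to True only after the host
    # passes its checks and serves test requests correctly.
    echo "{ 'host': 'mw1209.eqiad.wmnet', 'weight': 10, 'enabled': False }" \
        >> /srv/pybal/pools/apaches    # path is hypothetical
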
[19:49:06] !log mflaschen synchronized wmf-config/CommonSettings.php 'E3 deployment' [19:49:12] Logged the message, Master [19:49:38] rsync errors in the mw12?? range [19:49:59] cmjohnson1: Is that the range you were working on? --^^ [19:50:16] mw1209-1220 [19:50:16] RoanKattouw: that's all new machines [19:51:45] !log mflaschen synchronized php-1.21wmf11/extensions/E3Experiments/ 'E3 deploy' [19:51:50] Logged the message, Master [19:51:55] cmjohnson1, looks like a match [19:52:34] !log mflaschen synchronized php-1.21wmf11/extensions/GettingStarted/ 'E3 deploy' [19:52:41] Logged the message, Master [19:52:58] !log mflaschen synchronized php-1.21wmf11/extensions/GuidedTour/ 'E3 deploy' [19:53:05] Logged the message, Master [19:53:05] Done [19:55:30] LeslieCarr: Did you get your answer about CORS + wikidata.org? It was hopefully going to fix https://bugzilla.wikimedia.org/show_bug.cgi?id=41847#c30 and the surrounding issues. [19:55:38] It seems wikidata.org _still_ isn't properly redirecting. [19:59:23] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [19:59:43] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:00:07] hrm [20:00:47] https://wikidata.org/wiki/Wikidata:Main_Page and other forms load without 301'ing as they should. [20:01:18] More importantly, https://wikidata.org/wiki/Special:UserLogin and https://www.wikidata.org/wiki/Special:UserLogin are both valid. [20:01:25] So people log in to the wrong domain, essentially. [20:01:30] that's not what https://bugzilla.wikimedia.org/show_bug.cgi?id=45005 say that should be done [20:02:13] roankattouw: sync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1536) [generator=3.0.9] [20:02:17] I believe the Wikidata folks finally gave up on having just "wikidata.org" and now just want things to work properly (with www). [20:02:24] is that common? ^^ [20:02:34] so https://bugzilla.wikimedia.org/show_bug.cgi?id=45005 [20:02:38] ? [20:02:55] Basically. [20:03:16] https://bugzilla.wikimedia.org/show_bug.cgi?id=41847 is the longer discussion, but resulted in duping up to 45005. [20:03:29] cmjohnson1: Happens occasionally [20:03:33] that's more of a dev thing, and not an ops thing [20:04:14] "You'll need to go stand in that line over there." ;-) [20:06:02] sorry [20:06:22] It's fine. It'll get cleaned up eventually.
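A quick way to see what the wikidata.org discussion above is about: HEAD requests against the bare domain show whether it answers directly (200) or redirects (301) to www. A minimal sketch:

    # both of these currently answer on the bare domain instead of redirecting, which is the complaint above
    curl -sI https://wikidata.org/wiki/Wikidata:Main_Page | head -n 3
    curl -sI https://wikidata.org/wiki/Special:UserLogin | head -n 3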
[20:10:02] RECOVERY - Puppet freshness on ms5 is OK: puppet ran at Thu Mar 21 20:10:01 UTC 2013 [20:10:02] RECOVERY - Puppet freshness on db1002 is OK: puppet ran at Thu Mar 21 20:10:01 UTC 2013 [20:13:12] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [20:13:13] PROBLEM - Puppet freshness on db1017 is CRITICAL: Puppet has not run in the last 10 hours [20:13:13] PROBLEM - Puppet freshness on cp1001 is CRITICAL: Puppet has not run in the last 10 hours [20:13:13] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [20:13:13] PROBLEM - Puppet freshness on db40 is CRITICAL: Puppet has not run in the last 10 hours [20:13:13] PROBLEM - Puppet freshness on db45 is CRITICAL: Puppet has not run in the last 10 hours [20:13:13] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [20:13:14] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours [20:13:15] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [20:13:15] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [20:13:15] PROBLEM - Puppet freshness on sq53 is CRITICAL: Puppet has not run in the last 10 hours [20:13:16] PROBLEM - Puppet freshness on sq83 is CRITICAL: Puppet has not run in the last 10 hours [20:13:33] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 181 seconds [20:13:44] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 186 seconds [20:14:12] PROBLEM - Puppet freshness on db1018 is CRITICAL: Puppet has not run in the last 10 hours [20:14:12] PROBLEM - Puppet freshness on db1021 is CRITICAL: Puppet has not run in the last 10 hours [20:14:12] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [20:14:12] PROBLEM - Puppet freshness on db1039 is CRITICAL: Puppet has not run in the last 10 hours [20:14:12] PROBLEM - Puppet freshness on db53 is CRITICAL: Puppet has not run in the last 10 hours [20:14:12] PROBLEM - Puppet freshness on db67 is CRITICAL: Puppet has not run in the last 10 hours [20:14:13] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [20:14:13] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [20:14:14] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [20:14:14] PROBLEM - Puppet freshness on nescio is CRITICAL: Puppet has not run in the last 10 hours [20:14:14] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [20:14:15] PROBLEM - Puppet freshness on snapshot4 is CRITICAL: Puppet has not run in the last 10 hours [20:14:16] PROBLEM - Puppet freshness on sq49 is CRITICAL: Puppet has not run in the last 10 hours [20:14:16] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [20:14:16] PROBLEM - Puppet freshness on sq81 is CRITICAL: Puppet has not run in the last 10 hours [20:15:03] damn puppet [20:15:05] New patchset: Hashar; "zuul: migrate to the new status url" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55134 [20:15:13] New patchset: Catrope; "Temporary hack to make maintenance scripts work on tin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55135 [20:15:14] hashar: sorry :/ [20:16:22] notpeter: so you have been elected to merge in the very simple 
https://gerrit.wikimedia.org/r/55134 :-) [20:16:39] notpeter: that makes Zuul to report the new Zuul portal at https://integration.wikimedia.org/zuul/ [20:16:51] New review: Lcarr; "Can you instead of changing the role class change which class tin is a member of?" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55135 [20:17:28] shorter and more awesome ! [20:17:29] :) [20:18:44] !log olivneh synchronized php-1.21wmf11/extensions/NavigationTiming 'Enable NavigationTiming on mobile alpha and beta' [20:18:51] Logged the message, Master [20:18:55] !log mlitn synchronized php-1.21wmf11/extensions/ArticleFeedbackv5 'Update ArticleFeedbackv5 to master' [20:19:00] !log olivneh synchronized php-1.21wmf12/extensions/NavigationTiming 'Enable NavigationTiming on mobile alpha and beta' [20:19:00] Logged the message, Master [20:19:06] Logged the message, Master [20:19:25] !log mlitn synchronized php-1.21wmf12/extensions/ArticleFeedbackv5 'Update ArticleFeedbackv5 to master' [20:19:31] Logged the message, Master [20:19:31] New patchset: Catrope; "Make tin a maintenance server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55137 [20:20:54] RECOVERY - Puppet freshness on aluminium is OK: puppet ran at Thu Mar 21 20:20:49 UTC 2013 [20:24:25] Is anyone from WMF going to the openstack summit? [20:24:32] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 191 seconds [20:24:43] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 197 seconds [20:24:50] marktraceur: you can tell them I'd like to meet them [20:24:53] mark: i would lassume labs folks like Ryan_Lane and stuff [20:25:04] marktraceur: maybe ask in #wikimedia-labs [20:25:46] I think Ryan is slated to go there. [20:26:22] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 208 seconds [20:26:23] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 208 seconds [20:26:27] marktraceur: I'll be there [20:26:50] i'll be presenting, but not on anything wikimedia related [20:27:13] yay, marktraceur was asking on my behalf, because I want to meet any wikimedia folks attending, Ryan_Lane [20:27:31] so I look forward to meeting you [20:27:34] thanks marktraceur [20:28:00] anteaya: I'll be there. I'll be presenting on the user committee survey results with tim bell and jc martin [20:28:16] if you haven't met them, you should. they're really great [20:28:16] fantastic [20:28:44] I look forward to it, I can't comprehend the schedule so I will just find you on irc and make sure I hit your talk [20:28:53] heh [20:28:57] what day? [20:29:05] tuesday, i think? 
[20:29:06] lemme see [20:29:23] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [20:29:23] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [20:29:33] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 23 seconds [20:29:33] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [20:29:33] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [20:29:33] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [20:29:42] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:43] anteaya: yeah, tuesday [20:29:51] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [20:29:53] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [20:29:53] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:29:53] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [20:30:11] eep [20:30:15] hrmmmmmm. [20:30:16] why you kill rendering [20:30:16] robh: [20:30:16] er? [20:30:19] wait, are those all new? [20:30:26] Ryan_Lane: k, I just hope it isn't around lunch since I am part of a panel at that time, but I'll keep in touch [20:30:30] I don't see it on sched [20:30:32] i think those are the new ones, and they arent in pybal [20:30:41] cmjohnson1: what is the range you are working again? [20:30:42] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.892 second response time [20:30:51] 1209-1220 [20:30:52] 1209-1220 ..those are not new [20:30:55] so nm, those arent new [20:31:05] but we just did a sync to them, hrmm [20:31:16] cmjohnson1: i dont think this is your fault [20:31:23] but i think this is what both mutante and i saw the other day [20:31:27] for a couple days running [20:31:31] run a sync-common-all [20:31:38] and watch some apaches fall over then come back... [20:31:39] no i don't - it looks like they had a oom and crash [20:31:41] (not good) [20:31:42] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 62836 bytes in 5.742 second response time [20:31:43] anteaya: maybe it's this? http://openstacksummitapril2013.sched.org/event/f0a0040a52cb60b60d1fc74c2b26d320 [20:31:48] i don't actually think it's your fault cmjohnson1 [20:32:05] if you'd like, i can blame you anyways though ? [20:32:05] LeslieCarr: dont think triggered by the file sync? [20:32:10] well it could be [20:32:15] right, not cmjohnson1 fault [20:32:15] if it were my fault...it would be a lot more ;-] [20:32:25] but it happened for same thing that he did, and i did two days ago, and daniel did yesteday [20:32:27] but i don't see why it would cause the oom error [20:32:34] well in that case, whtat's up with that script [20:32:35] Ryan_Lane: keynote, cool that should be easy to find, thanks [20:32:36] i dont either, its a file transfer =[ [20:32:46] and it didnt used to, but its very suspect now its happened 3 times [20:32:51] well, assuming it's that slot. [20:32:53] suddenly its a pattern. [20:33:05] and no one does anything [20:33:07] and they come back right? [20:33:13] oh, i see, it's a script+forkbomb ! [20:33:15] (or did folks reboot them this time?) [20:33:19] yep came back on its own this time [20:33:25] i checked but nope [20:33:35] yep, this is fubar and odd [20:33:38] but self fixing. 
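Whether a box really was hit by the OOM killer is visible in the kernel log, which may help when this pattern recurs. A minimal sketch, assuming stock Ubuntu log locations (the paths are not quoted from the channel):

    # run on one of the affected scalers, e.g. mw1153
    dmesg | grep -iE 'out of memory|killed process' | tail
    grep -i oom-killer /var/log/kern.log | tail    # or /var/log/syslog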
[20:33:57] ie: we really need to fix it now that it keeps happening, just not sure how its triggering yet [20:34:33] cmjohnson1: So, now to add to pybal [20:34:41] we are goign to add them all with the setting to False [20:34:44] because I am paranoid [20:34:53] now, this you have to be root on fenari [20:35:09] and the pybal configuration files are located in /home/wikipedia/conf/pybal [20:35:20] these files are updated on the lvs servers when they change, so its live [20:35:20] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:35:30] ie: be VERY careful on these files or you can fuck the site. [20:35:40] but damn, why is that still not right? [20:35:42] grr [20:35:52] gah looks dead again [20:35:52] :( [20:36:07] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 62836 bytes in 0.232 second response time [20:36:20] ETIMEDOUT on for example https://upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Staugustinescanterburyjustusmellituslaurencegraves.jpg/203px-Staugustinescanterburyjustusmellituslaurencegraves.jpg [20:36:27] im wating them repool [20:36:30] watching even [20:36:40] LeslieCarr: now it seems ok again. [20:36:45] scary brittle. [20:37:01] ok that went through now \o/ [20:37:06] cmjohnson1: So, lets start again. [20:37:08] :( [20:37:13] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.812 second response time [20:37:20] We are going to add a single new server to pybal configuration with its setting to false [20:37:28] !log restarting gmond on mw1153 [20:37:30] want me to comment out the rest? [20:37:34] Logged the message, Mistress of the network gear. [20:37:34] got another timeout on https://upload.wikimedia.org/wikipedia/commons/thumb/9/9a/Staugustinescanterburyjustusmellituslaurencegraves.jpg/204px-Staugustinescanterburyjustusmellituslaurencegraves.jpg … i'll assume it'll take a couple mins to totally clear up? [20:37:38] cmjohnson1: did you add them all? [20:37:58] no.. i have the addition done though [20:38:06] lets just add the one [20:38:14] with false [20:38:23] hey, i see a possible problem [20:38:30] these are both in appserversa dn imagescalers [20:38:40] what are? [20:38:44] those servers? [20:38:58] mw1153-1159 [20:39:07] LeslieCarr: in both pybal, both puppet, or both in both? [20:39:07] mw1153 to mw1160 even [20:39:09] lots of boths... [20:39:10] in pybal [20:39:19] New review: Catrope; "No, because that uninstalls other stuff which breaks git-deploy, which is used for Parsoid." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55135 [20:39:27] site.pp # mw1153-1160 are imagescalers (precise) [20:40:04] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [20:40:10] its in rendering in pybal [20:40:12] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.445 second response time [20:40:14] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time [20:40:18] LeslieCarr: how are they in appservers? 
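The "add it with False" step described above boils down to appending one stanza to the relevant file under /home/wikipedia/conf/pybal and then watching pybal's checks from the LVS box. A sketch, with the stanza syntax written from memory (treat the exact format, subdirectory, file name and weight as illustrative):

    # on fenari, as root -- these files go live as soon as they are written, so edit carefully
    $EDITOR /home/wikipedia/conf/pybal/eqiad/apaches    # 'apaches' per the discussion above; the subdir is a guess
    #   add a line roughly like:
    #   { 'host': 'mw1209.eqiad.wmnet', 'weight': 10, 'enabled': False }
    # then on lvs1003, watch the runcommand and proxyfetch checks for the new host
    tail -f /var/log/pybal | grep mw1209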
[20:40:22] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.604 second response time [20:40:34] unless she just pulled out of another pybal file [20:40:34] binasher: just from looking at pybal on lvs1003 [20:40:35] does anyone know where the check for an "informative user-agent" is implemented? [20:40:46] LeslieCarr: in the pybal config file? [20:40:48] not looking at the config files, just at what pybal thinks [20:41:03] what the fuck… negative memory in use http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Image+scalers+eqiad&m=cpu_report&s=by+name&mc=2&g=mem_report [20:41:21] brion: often happens after a ganglia aggregator proc restarts [20:41:24] also, is lulz ;) [20:41:26] nice [20:41:28] LeslieCarr: that's not what "ipvsadm -l" shows on lvs1003 [20:41:31] accuracy ftw :D [20:41:37] they're only in rendering [20:41:38] yea, it shows them as just scalers that i see. [20:41:48] what binasher said [20:41:50] looks as it should [20:41:50] oh shoot, 1153 not 1053 [20:41:51] doh [20:41:58] stupid numbers [20:42:08] sorry, misread [20:42:11] cmjohnson1: So, back to pybal, add the one, and you want to login to lvs1003 afterwards [20:42:17] add it as false, and write the file [20:42:20] and it goes live immediately [20:42:23] so be very careful. [20:42:42] then you login as root on lvs1003 [20:42:50] !log cmjohnson Started syncing Wikimedia installation... : [20:42:54] hmm a shot (oom) convert om mw1157 [20:42:57] Logged the message, Master [20:42:58] and then to see the tests for it: tail -f /var/log/pybal | grep servername [20:43:06] wait cmjohnson1 [20:43:08] brion: http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05220.html , see last message in the thread too. [20:43:09] what sync? [20:43:13] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [20:43:20] (i thought we synced already?) [20:43:23] i didn't add that [20:43:33] heh [20:43:34] right, the logbot did cuz you ran a command [20:43:44] what did you run? ;] [20:44:12] that was the sync-common-all finishing [20:44:18] ahh, ok [20:44:37] so yea, add with false to 'apaches' pybal config file [20:44:39] guess the cgroup limits are working anyways [20:44:54] write file, then ssh into lvs1003 (cuz thats the lvs server that handles internal lvs loads in eqiad) [20:45:11] and run that tail command, and you can see pybal does two kinds of checks [20:45:57] cmjohnson1: the two checks are the runcommand and proxyfetch [20:46:08] the proxyfetch is ensuring it'll actually serve data properly [20:46:14] and on mw1209 it is failing. [20:46:18] i see them both...the fetch failed [20:46:20] 2013-03-21 20:44:07.477565 [api ProxyFetch] mw1209.eqiad.wmnet (disabled/partially up/not pooled): Fetch failed, 0.002 s [20:46:29] so, that means some part of the apache stuff isnt right [20:46:40] puppet runs on mw1209 without errors? [20:46:42] Can I be honest here and tell you guys that the image scaler outage was my doing? I didn't think it would cause them all to topple over like that [20:46:54] What did you do? [20:47:05] oh? [20:47:39] I think we've discussed this issue once or twice before on a mailing list but it was just a theory and not really possible [20:47:47] Let me share the code I ran [20:47:58] wait, was this in shell? [20:48:01] New review: Hashar; "Done by using puppet '::site' to populate WMF_DATACENTER. I have not used the multiversion shell ..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/55058 [20:48:08] or something anyone who takes this can spam and topple them? [20:48:14] cuz if its the latter, we shouldnt paste in here. [20:48:18] other channel please [20:48:24] this is public logged and I dont wanna have others doing this [20:48:28] johnduhart: thank you for letting us know - because that makes more sense than sync-common [20:48:33] It's really easy to do, I was able to do with with an EC2 instance [20:48:45] please move this discussion to the other channel [20:48:49] ok, then dont put it in the pub channel, lets not help other folks break it [20:48:50] Yeah we should discuss this in private [20:48:55] i dunno if john can join that other one. [20:49:06] RobH: You administer it, don't you? ;) [20:49:07] robh: there are some issues with packages being downgraded on mw1209...too much to list [20:49:17] let's make a temporary one [20:49:20] one sec, i'll make a chan and ask you guys to join [20:49:23] OK [20:49:25] someone PM john with temp one [20:49:30] and paste it in our private channelf or us [20:50:33] Wait I have two now from RobH and LeslieCarr [20:50:41] robhs [20:50:43] bleh. [20:50:48] go where im at, apegoes is here [20:50:49] bad robh [20:50:51] we outvoted her [20:50:51] oh okay [20:50:52] k [20:51:04] PROBLEM - LVS HTTP IPv4 on foundation-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [20:51:04] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [20:51:18] Change abandoned: Hashar; "hmm will use $MW_COMMON_SOURCE/multiversion/MWRealm.sh as brad said :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55058 [20:51:19] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [20:51:26] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [20:51:26] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [20:51:26] PROBLEM - LVS HTTP IPv4 on wikinews-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [20:51:26] PROBLEM - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [20:51:28] !log mlitn synchronized php-1.21wmf11/extensions/ArticleFeedbackv5 'Update ArticleFeedbackv5 to master' [20:51:32] LeslieCarr: ^ [20:51:33] AHHH [20:51:34] Logged the message, Master [20:51:34] WTF [20:51:36] PROBLEM - LVS HTTP IPv4 on wikivoyage-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out [20:51:41] nooooes [20:51:42] oh shit [20:51:44] that's a lot more [20:52:02] RECOVERY - LVS HTTP IPv4 on foundation-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 62676 bytes in 0.006 second response time [20:52:04] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15548 bytes in 0.014 second response time [20:52:12] no good [20:52:13] !log mlitn synchronized php-1.21wmf12/extensions/ArticleFeedbackv5 'Update ArticleFeedbackv5 to master' [20:52:15] mlitn [20:52:19] Logged the message, Master [20:52:23] LeslieCarr: No,, can't be [20:52:26] who is mlitn ? 
[20:52:29] Oh wait [20:52:32] He touched wmf11 [20:52:33] too [20:52:42] LeslieCarr: Matthias (AFT team) [20:52:44] * apergos twitches [20:52:48] matthiasmullie: on this chan [20:52:58] whatever you did, please undo [20:53:11] asap [20:53:19] RECOVERY - LVS HTTP IPv4 on wikinews-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 62676 bytes in 0.006 second response time [20:53:20] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 95235 bytes in 0.029 second response time [20:53:25] New patchset: Hashar; "(bug 41285) adapt `foreachwiki` for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55059 [20:53:27] RECOVERY - LVS HTTP IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15540 bytes in 0.015 second response time [20:53:29] they seem to be coming back. [20:53:33] RECOVERY - LVS HTTP IPv4 on wikivoyage-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 37750 bytes in 0.016 second response time [20:53:33] undoing may not be ideal [20:53:41] ie: may just knock them over again [20:53:43] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 62682 bytes in 0.065 second response time [20:53:43] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 62682 bytes in 0.017 second response time [20:54:32] matthiasmullie: you there ? [20:54:37] I'm here [20:54:47] RobH: LeslieCarr: should I undo, or wait? [20:55:04] if things are back, and its not breaking shit now [20:55:06] i say leave it. [20:55:08] 1 second [20:55:09] (I sync-dir'ed extensions/ArticleFeedbackv5 on php-1.21wmf11 & 12) [20:55:25] 20:50 UTC you did the sync , yes matthias ? [20:55:48] LeslieCarr: seems about right, yes [20:56:03] it's caused a major uptick in network utilization, i'd say let's check it out for about 5-10 more minutes and make sure it's not a fluke -- if the utilization stays up, then we will revert [20:56:09] on the api appservers [20:56:21] +1 LeslieCarr [20:56:23] LeslieCarr: got this on fenari: [20:56:34] mlitn@fenari:/home/wikipedia/common$ for f in php-1.21wmf{11,12}/extensions/ArticleFeedbackv5; do sync-dir $f 'Update ArticleFeedbackv5 to master'; done; [20:56:35] No syntax errors detected in /home/wikipedia/common/php-1.21wmf11/extensions/ArticleFeedbackv5 [20:56:36] copying to apaches [20:56:38] mw1210: rsync: mkdir "/apache/common-local/php-1.21wmf11/extensions/ArticleFeedbackv5" failed: No such file or directory (2) [20:56:39] mw1210: rsync error: error in file IO (code 11) at main.c(605) [Receiver=3.0.9] [20:56:40] mw1211: rsync: mkdir "/apache/common-local/php-1.21wmf11/extensions/ArticleFeedbackv5" failed: No such file or directory (2) [20:56:41] mw1211 …. [20:56:43] !log fenari disappeared from dsh group mediawiki-installation :-( [20:56:44] binasher/ notpeter any db errors ? [20:56:50] Logged the message, Master [20:56:51] matthiasmullie: that's because those mw's are new. [20:56:53] LeslieCarr: no [20:56:58] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.080 second response time [20:57:32] that's good at least :) [20:57:35] !log kaldari synchronized php-1.21wmf12/extensions/PageTriage/modules/ext.pageTriage.views.list 'syncing PageTriage js on wmf12' [20:57:40] Logged the message, Master [20:57:50] !log pulling mw1070, may have network issues [20:57:55] Logged the message, Master [20:58:22] paravoid: Did you push the ruby-jsduck package yet? 
(you merged it yesterday but doesn't appear to be live yet) [20:58:51] PROBLEM - Apache HTTP on mw1170 is CRITICAL: Connection refused [21:00:37] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [21:02:12] !log cmjohnson Finished syncing Wikimedia installation... : [21:02:17] Logged the message, Master [21:03:58] matthiasmullie: looks like it's about back to normal [21:04:24] LeslieCarr: that's great news [21:04:31] any idea what happened? [21:04:35] matthiasmullie: considering the fact that we just had an imagescaler outage, can you maybe give ops a ping and a request in this channel saying "hey, i want to deploy some new aft5 code" so we know what's up [21:04:58] not certain [21:05:17] !log Added fenari back to /etc/dsh/groups/mediawiki-installation per https://rt.wikimedia.org/Ticket/Display.html?id=4794 [21:05:27] Logged the message, Mr. Obvious [21:06:07] LeslieCarr: I no longer need to do anything today though; next time I'll give a shout here [21:06:49] thank you - two outages in a row gives leslie a heart attack! [21:07:50] matthiasmullie: Is AFTv5 being reenabled? [21:08:21] Susan: not today - probably next tuesday [21:09:11] New patchset: Hashar; "(bug 41285) adapt `foreachwiki` for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55059 [21:09:16] I killed the blacklist category. [21:10:20] * Damianz gives Leslie a glass of whisky instead of a heart attack [21:10:56] Susan: what exactly do you mean? removed that category from all articles that had it? [21:11:03] Yes. [21:11:11] :) [21:11:16] It is dead. [21:11:16] mmm whisky-attack [21:11:18] https://en.wikipedia.org/wiki/Category:Article_Feedback_Blacklist [21:11:18] makes sense, since it should only appear on whitelisted categories [21:11:23] Right. [21:11:45] There are still two AFTv5 categories being used. [21:11:48] I'll merge them sometime. [21:12:52] New review: Hashar; "PS2 was a rebase to get rid of an abandoned dependency." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55059 [21:13:07] Susan: always wondered why there were 2 :) [21:13:35] Maybe I'll rename them both to a title of my choosing. [21:13:48] Heh. [21:13:54] I guess that would break the config, though. [21:14:05] lol indeed ^^ [21:28:55] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.099 second response time [21:30:04] Krinkle: restarting Zuul aren't you ? :-] [21:30:54] hashar: yeah, restart instead of reload. [21:30:57] sorry [21:31:10] how long will it take? [21:31:15] it's taking forever. [21:31:32] usually you just need reload [21:31:37] I know that [21:31:44] so why restarting it ? :-] [21:31:48] I didn't mean to [21:31:52] ahh :-] [21:32:36] krinkle: so yeah zuul waits for all current jobs to complete [21:32:41] ok [21:32:51] but still process incoming events / triggers [21:32:53] that should be ok [21:33:07] it will simply save the jobs that need triggering and resume them on restart [21:33:09] (hopefully) [21:33:13] Yeah [21:33:23] I am not sure what happen on restart, maybe the queue is lost. I have no idea if it is written to disk [21:33:35] on reload that definitely work ( see SIGUSR1 in the code and doc ) [21:34:09] Why would it go into "queue-only mode preparing for reconfigureexit: " if it wouldn't save it [21:34:12] I have canceled a phpcs job, that one take ages [21:34:25] hashar: define ages? 
We run it on every patch set [21:34:39] ah, regressions [21:34:49] on change-merge there is a phpcs run on all of mediawiki core [21:34:53] I should probably get rid of it [21:34:58] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with args zuul-server [21:35:02] and simply build it once a day [21:35:23] so on exit: [21:35:25] 2013-03-21 21:34:03,289 DEBUG zuul.Scheduler: Queue length is 28 [21:35:26] 2013-03-21 21:34:03,289 DEBUG zuul.Scheduler: Saving queue [21:35:26] ;) [21:35:30] it must save it somewhere! [21:36:38] the cloud [21:36:53] state dir [21:37:08] I heard the cloud saves everything, you don't need to worry anymore [21:37:20] Krinkle: queue is saved in /var/lib/zuul/queue.pickle the dir is defined by 'state_dir' in zuul.conf [21:37:28] greg-g: yeah that is what worry me :-] [21:37:33] I need to rsync again with my brother [21:37:50] hashar: then we can just change #wikimedia-operations to #wikimedia-cloud-amz [21:38:16] !log authdns-update for terbium [21:38:22] Logged the message, RobH [21:38:49] Krinkle: somehow zuul was stalled again it seems :-( [21:39:23] No, that was me re-restarting it because my first one timed out, so it didn't start it. [21:39:29] https://gerrit.wikimedia.org/r/#/c/55152/ had its job triggered 22 minutes after submission [21:39:33] ahh [21:39:34] ok [21:39:35] and a second before I ran that, I guess it suddendly started automatically [21:39:47] Perhaps puppet doing service ensure=>running at the same time [21:39:57] it was down for 5 minutes [21:40:03] All fine now. [21:40:08] PROBLEM - zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with args zuul-server [21:40:10] \O/ [21:40:12] ahh [21:40:16] I hate that report [21:40:21] New patchset: Pyoungmeister; "setting up terbium, eqiad hume-ish host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55155 [21:40:31] hashar: Uh? That sounds bad. [21:40:39] But I don't see 2 [21:40:41] Krinkle: issue in the icinga monitoring [21:40:45] there is only 1 [21:40:48] ok [21:40:56] hashar: What does nagios do? [21:41:15] (I'm saying nagios, because I can only spell icinga properly when I copypaste it) [21:41:19] it run a nagios script known as check_procs [21:41:29] and expect exactly ONE process named zuul-server [21:41:39] Yes [21:41:50] Where does it go wrong? [21:42:01] I guess it find itself :-] [21:42:04] if ( check_procs ) { actual = 1 } else { actual = 2 } [21:42:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55134 [21:42:18] mutante: danke :-] [21:42:19] oh, so it is always critical? [21:42:27] Not just when its down or restarting [21:42:28] maybe [21:42:42] New patchset: Kaldari; "Turn on Thanks extension for testing on test.wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55156 [21:43:06] hashar: np. Applying configuration version '1363902150' [21:43:18] nic [21:43:19] e [21:43:33] I will reload zuul once that is done [21:43:46] mutante: zuul_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with args zuul-server [21:43:52] mutante: want me to fill a RT for it ? [21:44:02] New patchset: RobH; "terbium added" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55157 [21:44:07] I think check_procs find itself in addition to the service it is supposed to monitor [21:44:14] check_procs -a look at the argument list [21:44:25] ngrob [21:44:27] RobH: [21:44:32] rob [21:44:32] ? 
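On the queue question a little further up: per hashar, the queue is written to queue.pickle under Zuul's state_dir. A quick way to confirm where that lands (the zuul.conf path is an assumption, not quoted from the channel):

    # on gallium
    grep -n state_dir /etc/zuul/zuul.conf     # config path assumed; adjust to wherever puppet puts it
    ls -l /var/lib/zuul/queue.pickle          # saved queue, per the discussion above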
[21:44:39] cna you not add a def for terbium [21:44:42] Change merged: Bsitu; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55156 [21:44:45] I just did so in my patchset [21:44:50] and then they'll conflict [21:44:53] and then i'll be sad [21:45:01] why are you adding stuff im working on >_< [21:45:11] because I'm an asshole [21:45:16] hashar: notice: Finished catalog run in 132.98 seconds [21:45:17] hrmm, how do i revert just one file in git [21:45:26] https://gerrit.wikimedia.org/r/#/c/55155/ [21:45:26] hashar: yes, i can fix that [21:45:35] you can just delete it out of your copy of site.pp [21:45:41] and to an --amend [21:45:43] and push again [21:45:45] hashar: * Reloading Zuul zuul :) [21:45:51] mutante: I remember we talked about that already. I think we only fixed the check_jenkins one and forgot the zuul one [21:45:58] k [21:46:02] yea, that's also what i though [21:46:03] t [21:46:27] I knew we talked about it face to face [21:46:32] that is what surprised me [21:46:34] i wanted the fancy git command to just update the file back but not the rest [21:46:35] ;] [21:46:39] but the fix has been made only on jenkins :-] [21:46:40] !log running puppet and reloading zuul on gallium. enables hashar's new shiny https://integration.wikimedia.org/zuul/ [21:46:44] (mostly cuz i dunno it ;) [21:46:46] Logged the message, Master [21:47:11] but i did the not git crazy way [21:47:24] I'm not sure there is a way to do it from inside of gerrit, tbh [21:47:26] RobH: rewriting your patch entirely? [21:47:27] hashar: Queue only mode: preparing to reconfigure, queue length: 19 [21:47:36] mutante: yeah it is safe now :-] [21:47:36] New patchset: RobH; "terbium added" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55157 [21:47:49] i guess i could git reset it, that would work on a specific file [21:47:50] hashar: :) [21:47:53] so meh, could have done that. [21:47:57] would have been cleaner, oh well [21:48:03] eh [21:48:04] well [21:48:06] thank you! [21:48:12] you are a gentleman and a scholar [21:48:19] no worries, thx for making the output more logical. [21:48:19] mutante: thanks again :-] [21:48:30] for balacning both hume and terbium that is [21:48:34] mutante: do you want a RT for the check_zuul ? [21:48:40] yeah, I like this way of doing it [21:48:49] hashar: no, don't worry about it [21:48:54] :-] [21:49:21] so is gerrit/zul/jenkins on fritz? [21:49:30] on gallium [21:49:34] or will my patience be rewarded with nerification [21:49:36] ? [21:49:41] verification even. [21:49:43] !log kaldari synchronized wmf-config/InitialiseSettings.php 'Syncing change to InitialiseSettings for Thanks ext' [21:49:49] Logged the message, Master [21:50:00] !log kaldari synchronized wmf-config/CommonSettings.php 'Syncing change to CommonSettings for Thanks ext' [21:50:06] Logged the message, Master [21:50:55] RobH: oh, btw, you can trigger jenkins rechecks by just putting "recheck" into a comment [21:51:05] uhh, i dont have the initial check [21:51:15] i see peter's in puppet validate, but not mine. [21:51:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55155 [21:51:23] so wouldnt recheckign just be adding more to the queue? [21:51:28] hashar: ? [21:52:16] RobH: Zuul is reloading :-D [21:52:20] so it is not processing events [21:52:27] waiting for current build to finish [21:52:29] yea, i was about to piong and ask that [21:52:31] ^_^ [21:52:41] piong! [21:52:46] ahhhhh [21:52:49] typos are my pal. 
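The "revert just one file" question in the exchange above has a standard answer: check out that single path from the revision you want, leave the rest of the tree alone, and fold it into the pending change. A sketch (site.pp path as it appears in operations/puppet; whether you restore from HEAD or HEAD~1 depends on whether the unwanted edit is already committed):

    git checkout HEAD -- manifests/site.pp    # or HEAD~1 if the edit is already in the commit being amended
    git commit --amend
    # then re-push the amended patchset to Gerrit as usual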
[21:52:50] * hashar shoot self [21:53:05] i should get local puppet validation running like mutante [21:53:17] so when this happens i can be all 'i checked this shit locally, merge!' [21:53:25] puppet-lint, i am going to write a mail [21:53:48] then mark can come yell at me that bypassing zuul isnt a real solution ;] [21:53:52] RobH: Just attackclone the grit repo pushmerge, then rubygem the lymphnode js shawarma module [21:54:12] lol [21:54:14] that was an impressive bunch of nonsense. [21:54:28] binasher: you know the centralnotice v2.3 patch you reviewed -- it's now been merged into core and I'm wondering when you might have a couple of minutes to apply it to meta and the test wikis? it's not high priority; I'm just trying to schedule things [21:54:29] and now i want shawarma. [21:54:46] notpeter: damn youuuuuuu [21:54:49] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds [21:54:56] that's actually how you install html9 responsive boilerstrapJS [21:54:58] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [21:54:59] http://html9responsiveboilerstrapjs.com/ [21:56:26] notpeter: you should turn that into a module [21:56:40] mutante: seems likely [21:56:49] I'll get on that [21:57:39] i'll get you a toolserver account to verify it builds on Solaris [21:59:04] hashar: interestingly: [21:59:06] root@gallium:~# /usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 -a zuul-server [21:59:10] PROCS OK: 1 process with args 'zuul-server' [21:59:32] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55157 [22:01:11] hashar: oh, haha, you know, i think stuff like this triggers it: [22:01:18] tail -f /var/log/zuul/zuul.log [22:01:21] when a human does it [22:01:37] that certainly is another command with zuul in the args:p [22:01:56] robh: will i need to sync again after puppet updates on mw1209-1220? [22:02:01] i am assuming yes [22:02:02] let me turn that into the complete cmdline, as with jenkins, using --ereg-argument-array [22:02:11] cmjohnson1: hrmm, i dont think so actually [22:02:20] puppet fires those syncs as well [22:02:27] so its doing them again. [22:02:47] the scripts exist independent of that cuz we need an immediate fire option when mediawiki updates or apache updates go live [22:03:03] since we cannot afford to let them wait for puppets 30 minute call in interval [22:03:19] this is also why ryan is tinkering with salt on cluster [22:03:27] as its made for immediate fire changes [22:04:13] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours [22:04:26] may explain the slow puppet run on these [22:08:13] PROBLEM - Puppet freshness on cp1026 is CRITICAL: Puppet has not run in the last 10 hours [22:12:44] New patchset: Dzahn; "fix NRPE check commands for zuul-server, use the complete cmdline instead of just checking string in args, avoid getting warnings for 2 running procs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55162 [22:15:40] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55162 [22:16:00] New review: Dzahn; "root@gallium:~# /usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 --ereg-argument-array '^/usr/bin/p..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55162 [22:20:42] mutante: nice :-] [22:29:49] anyone know why beta cluster is down? http://en.wikipedia.beta.wmflabs.org/ was up earlier today. hashar? [22:29:59] .. 
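The check_procs fix merged above works by matching the full command line rather than any process whose argument list merely contains "zuul-server" (which is how a human running tail -f on zuul.log could trip the alert). Roughly its shape, with an illustrative regex; the real one in change 55162 is truncated above and not reproduced here:

    # run on gallium; -w 1:1 -c 1:1 means exactly one matching process is OK
    /usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 \
        --ereg-argument-array '^/usr/bin/python .*zuul-server'    # regex is illustrative only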
[22:30:02] pooor cluster [22:30:29] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55119 [22:31:00] chrismcmahon: yeah the apaches returns error 500 [22:31:32] kaldari: I just noticed the change in mediawiki-config so I don't know the context, but it looks like it got deployed rather fast. After only 1 commit in Gerrit merged by a collegue. And deployed several hours after initial creation to the cluster. [22:31:47] Shouldn't new deployed extensions be reviewed slightly more? [22:32:05] Not you in particular, but it seems like a bad trend that is getting more common. [22:32:42] it was reviewed by several people, and is a very tiny extension [22:33:03] I know which is why I'm not making a big deal out of it. [22:33:12] But it could've added a database table and would've gotten deployed just the same. [22:33:13] robh: puppet updates w/zero errors on all but when i check lvs1003 proxyfetch still failing [22:33:48] ^ only on mw1209 [22:33:48] Krinkle: if it involved any schema changes, I would have proceeded very differently as I hope anyone would [22:33:50] hrmm [22:34:01] kaldari: yes, I know you would. [22:34:05] speaking of puppet... i wonder what's up with neon [22:34:23] kaldari: But it worries me that it happens right because of you instead of the process. [22:34:57] I'd say new extensions should pass some kind of formal review, a schema change is an obvious thing that needs it. [22:34:58] but is that the only thing? [22:35:08] A XSS vector can be anywhere. [22:35:09] Krinkle: I was wondering about that myself, like should there be a more formal process for approving deployment of new extensions [22:35:25] I don't think anything can exempt the process, no matter how simple the extension. [22:35:55] cmjohnson1: checkign mw1209 now [22:36:10] just trying a manual sync-common first [22:36:15] kaldari: great :) [22:36:34] kaldari Krinkle one option we discussed in SF not long ago was deploying to beta labs before deploying to any production wiki. that'd be nice for testing. [22:37:07] Krinkle, chrismcmahon: sounds reasonable, perhaps something should be written up on wikitech [22:37:39] And linked to from https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1d:_new_extension [22:37:41] Sure, but (at least in the current state) beta is not part of any process as there is nothing to do. I mean yes one can install on labs, look at it, blink twice, and move on. That's better then nothing, but it's not solid yet. [22:37:43] https://wikitech.wikimedia.org/wiki/How_to_do_a_configuration_change#Install_a_new_extension_on_a_wiki [22:38:07] jeremyb_: neon just finished a sucessful puppet run, whats up [22:38:08] kaldari: Nah, how to deploy code shouldn't include what to do before reaching that point [22:38:12] that would clobber the manual imho [22:38:28] If you're deploying, the code itself should have no question about it. [22:38:44] kaldari: but did you follow https://wikitech.wikimedia.org/wiki/How_to_do_a_configuration_change#Install_a_new_extension_on_a_wiki [22:38:56] mutante: wooooo. amazing. RT 4727 [22:39:00] etc. [22:39:33] PROBLEM - Puppet freshness on db66 is CRITICAL: Puppet has not run in the last 10 hours [22:39:44] jeremyb_: ack, the mysql related error is gone meanwhile. 
fixed by Leslie [22:39:55] Krinkle: I believe we followed all of those instructions [22:39:57] jeremyb_: the initial reason for that bug, still there [22:40:00] afaik [22:41:37] Krinkle: actually we still need to add it to make-wmf-branch, but I think we've done everything else [22:41:59] And we did deploy and test before turning it on in the config [22:42:34] jeremyb_: contactgroups.cfg on icinga prod server: it does not look like before or after gerrit 53499, but confirmed robla is not on it anylonger. resolving 4724 [22:43:04] robla: now you're really not on the contacts for analytics monitoring anylonger.. confirmed on neon [22:43:55] !log bsitu Started syncing Wikimedia installation... : Update Echo, PageTriage to master, Add Thanks [22:44:00] Logged the message, Master [22:44:15] gerrrit so slow [22:45:15] mutante: roan's there? [22:45:19] in parsoid [22:46:27] jeremyb_: yes [22:46:51] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.217 second response time [22:46:51] RECOVERY - Apache HTTP on mw1215 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.209 second response time [22:46:51] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.224 second response time [22:47:09] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.236 second response time [22:47:19] RECOVERY - Apache HTTP on mw1212 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.216 second response time [22:47:48] mutante: cool :) [22:47:49] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.224 second response time [22:47:58] Thehelpfulone: you has reply [22:48:30] * jeremyb_ runs away for a bit [22:49:08] jeremyb_, okay, well I know they were trying to set up some sort of box for Martin from the RT tickets, but not sure if that was done [22:49:58] Thehelpfulone: we tried to contact Martin but had trouble reaching him [22:50:21] mutante, ah, when was this? [22:50:48] cmjohnson1: so they are recovering it seems [22:50:50] hrmm [22:50:54] notpeter: nah [22:51:03] those arent in pybal [22:51:05] Thehelpfulone: i don't know, but from what i heard multiple times within the last week(s) [22:51:07] pybal proxyfetch fails. [22:51:09] After the initial problems someone brought it up on Wikimedia-L and we got a response on the Bugzilla ticket the next day IIRC [22:51:31] lemme add one of those into pybal and see [22:51:41] mutante, who's been trying to contact him, because I'd really like to convince them that keeping everyone in the loop via that Bugzilla ticket is a *very good* idea [22:51:48] Thehelpfulone: that was only after nicole poked i think. [22:51:50] RobH: they won't pool [22:51:52] i really have to go. bbiab [22:51:58] jeremyb_, yeah, see you in a bit [22:51:59] notpeter: why, salt? [22:52:09] notpeter: cuz thats the step we are at, pooling. [22:52:19] Thehelpfulone: woosters [22:52:30] heh [22:52:40] in fact i sent him an email only yesterday [22:52:51] 2013-03-21 22:51:05.247092 [api ProxyFetch] mw1209.eqiad.wmnet (disabled/partially up/not pooled): Fetch failed, 0.040 s [22:52:54] only partially up [22:53:12] ARGH [22:53:15] notpeter: nevermind dude [22:53:45] woosters, could you update the Bugzilla ticket with those sort of things? 
If we know you're having trouble contacting him we can ask people from WMDE to poke him (Nicole was able to get him to respond within a day) [22:53:50] whoever added mw1209 added it to the wrong thing [22:53:53] thus it fails [22:54:22] thehelpfulone: i am planning to call him tomorrow actually [22:54:50] that was my error let me fix [22:54:55] oh, yeah, they're in the wrong pool [22:55:03] cmjohnson1: should be in apache not api [22:55:09] although i don't think that's the issue [22:55:12] it's just a page request [22:55:19] and both api and apache should respond the same [22:55:27] notpeter: yep, i saw it, and i think its wrong, but not the reason for the failure [22:55:33] cool [22:55:43] cuz yea, i expect the fetch to be the same, except maybe the pybal api fetch looks for specific something [22:55:46] we'll know in a moment. [22:55:50] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.242 second response time [22:56:02] woosters, ah okay, but could you put a progress update today on Bugzilla (e.g. do you have his SSH key, has he looked at the database yet to give an ETA yet, what's happened in the last month with regards to OTRS)? https://bugzilla.wikimedia.org/show_bug.cgi?id=22622 [22:57:39] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.210 second response time [22:57:59] Krinkle: I'm not sure why new extensions should be singled out for extra scrutiny though. Security and performance problems can come from any code. If anything, it seems that new core code should be the most scrutinized (and any schema changes). [22:58:33] RobH: it's trying to use localhost for mysql [22:58:40] there are some mediawiki confs missing [22:58:47] Sorry! This site is experiencing technical difficulties. [22:58:47] Try waiting a few minutes and reloading. [22:58:47] (Can't contact the database server: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2) (localhost)) [22:58:56] wtf. [22:59:11] sync-common is all the config files i thought [22:59:19] which i also thought puppet running would fire off [22:59:22] bleh. [23:00:39] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.208 second response time [23:00:49] RECOVERY - Apache HTTP on mw1211 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.232 second response time [23:00:50] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.229 second response time [23:01:02] robh: okay put in the right pool [23:01:05] kaldari: Though the process should be better documented and better enforced, I believe what's available right now: [23:01:07] * create a bug [23:01:16] (for review and for deploy) [23:01:21] that's basically the process [23:02:07] CC operations and other devs. Give it enough time and once someone says its reviewed it can be deployed I suppose. [23:02:28] Krinkle: I though that was only the process for people who couldn't deploy their own extensions. I've never followed that process before. [23:02:45] Nobody can review their own extension. [23:02:53] deploy yes, but that needs to be approved first. [23:03:01] Which you can't do yourself. [23:03:09] ANd should be on record imho. [23:03:13] At least in a way that others can know [23:03:19] besides in Gerrit? [23:03:44] Today (again just pointing it, not making any claims) it was created, pushed and merged by only your team afaik and whomever you decided to cc in gerrit. 
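On the localhost-MySQL error from mw1209 a little earlier: the symptom is MediaWiki falling back to a local socket because the wmf-config tree never arrived, so the check and the fix both happen on the apache itself. A sketch (directory as quoted in the rsync errors above; whether sync-common needs root there is an assumption):

    # on mw1209: is the config tree actually there?
    ls /apache/common-local/wmf-config/ | head    # or /usr/local/apache/common-local, as in the beta paths later in this log
    # if not, pull the whole common tree again, as discussed above
    sudo sync-common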
[23:03:46] !log bsitu Finished syncing Wikimedia installation... : Update Echo, PageTriage to master, Add Thanks [23:03:54] Logged the message, Master [23:03:57] and depolyed by you a few hours later [23:04:37] so how, specifically, should that be changed? [23:04:38] If it were in bugzilla people cc'ed on "Extension setup" in bugzilla (or rather those that should be CC-ed) will get notified [23:04:53] and it gets explicitly named as being "Review and deploy my extension" [23:05:12] instead of "Initial comimit" followed by "sync wmf-config "Yep, deploying it"" [23:05:37] kaldari: Well, it's not ideal, but even when not changed that process is already better. [23:05:59] how so better? [23:06:21] Krinkle: there is one potential problem with that. What if Reedy sees the bug and deploys the extension before we want it deployed? :) [23:06:35] Better because that way people who are CC-ed in that bugzilla component and/or see the bug in IRC can look into it. [23:06:47] kaldari: TALK [23:06:55] I don't see the problem. [23:07:18] I'm more worried bout the review, not the deployment. [23:07:34] Saying "This needs to be deployed with my team active, do not deploy without me" [23:07:39] is easy. [23:07:58] How about just an email to wikitech-l asking for review? Wouldn't that work just as well? [23:08:11] Sure, in addition, that's great. [23:08:38] I'd use the bug to keep track of high level details (assignee, status, deadline) [23:08:53] Krinkle: Seems like a lot of overhead for some cases [23:09:16] and quick commit with the commit to mediawiki-config installing it and marking it resolved. [23:09:26] kaldari: You proposed the mailing, not me. [23:09:35] New review: Dzahn; "this won't work yet because if something is critical => true," [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55119 [23:09:36] Creating a bug, assigning it and resolving it when done. how is that overhead? [23:09:54] You agree one needs to do something (anything) to get review though, right? [23:09:57] sending an email to Wikitech-l is easier [23:09:59] Pick your poison :) [23:10:22] kaldari: Yes, but that doesn't live forever and it is useful to get an overal status on what things are at [23:10:23] but I agree more eyes is better [23:10:46] Also easier to reference (a bug id, or some mailing list mirror somewhere) [23:10:47] I think https://gerrit.wikimedia.org/r/#/c/44843 and https://gerrit.wikimedia.org/r/#/c/52546/ could have used more eyes as well ;) [23:11:15] not that the code was bad, but just to get more feedback from people [23:11:38] with "more eyes is better" I come back to the idea of deploying to beta cluster first. [23:11:42] kaldari: I'd argue differently [23:11:49] kaldari: It is a revert [23:12:01] I could've -2 the original commit if I'd seen *that* [23:12:26] true, reverts are different [23:13:02] The second one, has been thoroughly tested by Paul Irish to fix our repaint issues. [23:13:15] And reviewed by a co-author of the framework. [23:13:28] I agree it was merged quickly, but not my me. [23:13:31] who's Paul Irish? Sorry if that's a dumb question, I've been up on 6 too long. [23:13:47] Not sure what you meant with the second link. How did that need more eyes? [23:14:26] Krinkle: I would have been nice if authors of big JS applications had known about it and been able to test against it. [23:14:31] I = It [23:14:59] oh that's the Google Chrome guy [23:15:00] kaldari: Did "Buffer cssText in addEmbeddedCSS" cause problems? 
[23:15:05] lots [23:15:18] I heard no such thing [23:15:21] #wikimedia-dev [23:15:26] Tell me about it ^ [23:15:34] I did, yesterday [23:15:45] well, one of them at least :) [23:16:12] the others I haven't had time to diagnose in detail: https://bugzilla.wikimedia.org/show_bug.cgi?id=46401 [23:16:49] Ironically, it caused repaint issues for PageTriage that didn't exist before [23:17:41] !log robh Started syncing Wikimedia installation... : [23:17:49] Logged the message, Master [23:21:38] am i supposed to be getting errors on test2? [23:22:25] i see it's a recently enabled extension from kaldari [23:22:46] can you elaborate? [23:23:07] The new extension isn't enabled on test2 [23:23:28] i wouldn't be so sure about that :) [23:23:31] * jeremyb_ stabs this pastebin [23:23:36] hmm [23:23:47] !log robh Finished syncing Wikimedia installation... : [23:23:56] Logged the message, Master [23:24:02] New patchset: Dr0ptp4kt; "Adding IPs for existing partner Celcom and upcoming partner Vimpelcom." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55220 [23:24:21] \o/ [23:24:59] kaldari: http://dpaste.com/1030666/plain/ [23:25:18] how is vimpelcom upcoming??? [23:25:24] New review: Brion VIBBER; "Looks good!" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55220 [23:25:24] there's been a press release [23:25:42] in fact i just edited the FAQ a day or two ago [23:25:49] Hey, does anyone know what server / db server OTRS runs on? [23:25:51] jeremyb_: it's not live yet except a testing range, but yes i think we announced it's coming up [23:25:56] csteipp: yes [23:26:07] jeremyb_: Is it all by itself? [23:26:08] csteipp: williams [23:26:10] csteipp: no [23:26:31] binasher will know the latest with the DBs. i can get you info from some months ago [23:26:43] Can I get a +2 on https://gerrit.wikimedia.org/r/55220 ? I promise I'm not trying to be l33t, brion made me do it. [23:26:43] there's about 5 boxes in the SQL cluster [23:26:55] that gets me going in the right direction. thanks! [23:27:04] we need the ip range updates live for tonight please :) [23:27:20] needs a merge, and whatever other poking it might need to be ready for puppet production [23:27:52] csteipp: what's up? [23:28:22] jeremyb_: fixing... [23:28:27] Just trying to evaluate bug 46439 [23:28:29] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [23:28:29] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [23:28:30] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [23:28:30] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [23:28:38] (request for admins to have full sql access) [23:28:45] RobH: yo [23:29:13] csteipp: oh. well in that case I can tell you it shares hosts with gerrit [23:29:14] ? [23:29:21] brion: sup? [23:29:37] RobH: he wants a merge on https://gerrit.wikimedia.org/r/55220 [23:29:38] RobH: wanna +2 that for me? [23:29:58] do i need to go and verify these ips are legit or can i just say 'brion is good enough' [23:30:01] im gonna go with the latter. [23:30:11] i have root dude [23:30:17] New review: Diederik; "No, calling x-cs is incorrect as well, cs is an acronym for carrier short but we are now using the m..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/52606 [23:30:18] and if its wrong, im gonna lie and tell folks you said i would be kciked out if i didnt ;] [23:30:19] hehehe [23:30:29] New review: Dzahn; "http://bgp.he.net/net/183.171.160.0/24" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55220 [23:30:39] New review: RobH; "just like all the other ip additions" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/55220 [23:30:49] * jeremyb_ wonders where you might be kicked from :P [23:30:49] PROBLEM - check google safe browsing for wikinews.org on google is CRITICAL: Connection timed out [23:30:53] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55220 [23:31:05] greg-g, chrismcmahon: Looks like we need to add the new extension to wmf12 even though it isn't enabled on any wmf12 wikis [23:31:30] brion: merged into production on puppet [23:31:38] does this need to hit in less than 30 minutes? [23:31:43] RECOVERY - check google safe browsing for wikinews.org on google is OK: HTTP OK: HTTP/1.1 200 OK - 3884 bytes in 1.163 second response time [23:31:46] RobH: for our reference, is merged all we need (and then the wait for puppet) or was there anything else behind the schenes we have to ask for? [23:31:50] nah has to be ready for like 9pm [23:31:54] it'll sync way before that [23:31:56] so its +2 and merge on gerrit [23:32:01] awesome thanks [23:32:03] then the merge on sockpuppet/stafford to push live [23:32:05] kaldari: so, it is deployed now on wmf11 (thus everywhere) and working fine? [23:32:09] the latter of that is root level shell [23:32:24] https://wikitech.wikimedia.org/wiki/Git#Public_repo [23:32:28] ah right it has to be pulled into production [23:32:35] we'll try to remember this :) [23:32:45] then yea, if its not a omg this needs to happen now [23:32:47] and can wait 30 [23:32:52] its just good to go after that and they call in [23:33:09] yup [23:33:37] kaldari: I was just about to pack up and head home.... [23:34:05] greg-g: it's deployed to wmf11 and working fine. It's only turned on on test.wiki. Somehow it's broken test2 though even though it's not enabled there.... [23:34:08] kaldari: there was an E3 rename a little while ago that brought down beta as well. [23:34:48] heh [23:35:00] Perhaps the fact that it's in the extension list, but doesn't exist there is causing some problem with the message cache building [23:35:28] greg-g: this seems like a pretty serious pitfall I was not aware of [23:35:40] interesting [23:35:46] unless it's just a fluke [23:36:08] yeah, wikidata had a similar issue this morning (though, my knowledge on the subject is high level, it may be totally different, but it sounds similar) [23:36:31] so, you need the code to be deployed to wmf12, even though nothing is running off of wmf12 right now? 
[23:36:34] (other than test2) [23:37:08] yeah, for some reason test2 is requesting the new extension files even though the extension isn't enabled on test2 [23:37:15] this is rather strange [23:37:25] https://test2.wikipedia.org/ [23:37:35] Warning: include_once(/home/wikipedia/common/php-1.21wmf12/extensions/Thanks/Thanks.php): failed to open stream: No such file or directory [23:37:38] yuck [23:37:42] 'wmgUseThanks' => array( [23:37:43] 'default' => false, [23:37:43] 'testwiki' => true, [23:37:43] ), [23:37:54] ideally, that should never happen [23:38:16] seems to be related to maintenance/mergeMessageFileList.php [23:38:54] kaldari hashar says what seems to be similar in labs just now re: beta: (05:06:08 PM) hashar: Fatal Error : Failed opening required '/usr/local/apache/common-local/php-master/extensions/E3Experiments/Experiments.php' [23:39:29] yeah, deploy the extension to wmf12, but keep it only turned on on test.wiki [23:39:34] I think that makes sense [23:39:43] this whole day is a cluster [23:40:05] greg-g: ++ [23:40:22] chrismcmahon: to the cluster comment or the suggestion? :) [23:40:41] greg-g: s/cluster/adventure/ [23:41:24] so, I'm curious why test2 is wanting the extension at all, can we figure that part out separately? [23:41:42] ideally, we wouldn't change test2 other than to do the fixes to the move bug [23:42:58] greg-g, chrismcmahon: better debug fast, it'll be fixed in a couple minutes [23:43:35] kaldari: because you're pushing it to wmf12, right? [23:43:41] yeah [23:43:42] k [23:43:48] unless you want me to wait [23:43:56] nah, do it [23:44:12] AaronSchulz has another one-line patch to fix the page move bug, hopefully :) [23:44:15] and we want to test that [23:48:44] unfortunately, we're getting to be a little too late in the day to deploy the fix out to phase1&phase2 wikis :/ [23:50:44] greg-g: we have something of a policy of not deploying on Fridays but this might be an exception [23:55:10] greg-g, chrismcmahon: no luck with deploying it to wmf12: error: unable to unlink old '.gitmodules' (Permission denied) [23:55:34] ? [23:55:38] What did you run to get that? [23:55:47] kaldari@fenari:/home/wikipedia/common/php-1.21wmf12$ git pull [23:55:57] Meh [23:55:58] Looking [23:56:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:56:30] wtf [23:56:37] The php-1.21wmf12 directory isn' [23:56:40] t group-writeable [23:57:05] kaldari: Try now [23:57:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time [23:57:46] RoanKattouw: I love how you just appear out of nowhere to solve problems, like some geek version of batman :) [23:57:51] hahaha [23:58:15] that worked [23:59:39] hmm, test2 is still borked: https://test2.wikipedia.org/
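For reference, the fix applied just above amounts to making the branch checkout group-writable so git can replace files such as .gitmodules, then retrying the pull (a sketch; Roan may have scoped the chmod differently):

    # on fenari, in /home/wikipedia/common
    chmod g+w php-1.21wmf12
    cd php-1.21wmf12 && git pull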