[00:01:14] Reedy: this in mediawiki-config like the docroot for noc, right [00:01:43] or just local [00:01:59] yeah, mediawiki-config [00:12:13] New patchset: Dzahn; "add GPG key for hexmode to sign mw packages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32165 [00:12:37] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32165 [00:13:03] hexmode: if you want to update it in the future, you can just send gerrit patches [00:13:33] :) [00:15:10] Mmmmm DEADBEEF [00:20:02] so... sync-common now just calls "scap-1" and thats all? [00:20:14] did that also change recently? [00:21:04] Not very recently.. [00:21:12] ok [00:21:50] New patchset: Anomie; "(bug 41133) Allow per-realm dblists and wikiversions.dat" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167 [00:21:50] New patchset: Anomie; "Add ability for switching for eqiad-specific configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30792 [00:24:10] New patchset: Anomie; "(bug 41133) Allow per-realm dblists and wikiversions.dat" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/32168 [00:27:58] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [00:28:02] New patchset: Anomie; "(bug 41133) Allow per-realm dblists and wikiversions.dat" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167 [00:33:59] New patchset: CSteipp; "Update WV config based on their current LocalSettings config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32096 [00:35:10] New review: CSteipp; "typo" [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32096 [00:37:26] New patchset: CSteipp; "Update WV config based on their current LocalSettings config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32096 [00:40:25] PROBLEM - MySQL Slave Running on 
db1003 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Cant create database frwikivoyage: database exists on quer [00:40:38] PROBLEM - MySQL Slave Running on db1035 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Cant create database frwikivoyage: database exists on quer [00:41:01] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: CRIT replication delay 250 seconds [00:41:01] * AaronSchulz looks at binasher [00:41:10] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 258 seconds [00:41:10] PROBLEM - MySQL Slave Running on db1010 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Cant create database frwikivoyage: database exists on quer [00:41:17] maaaaan [00:41:31] mutante: "if not exists" is your friend :) [00:42:07] * AaronSchulz loves to use that in his sql patches, heh [00:42:08] but i was supposed to create them manually this time, we discused this here [00:42:13] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 321 seconds [00:43:06] i'm fixing repl [00:43:28] mutante: forgot set sql_log_bin = 0 when running the create db on a master? 
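[editor's note] The replication break above came from a CREATE DATABASE that ran on the master and was replayed on replicas where frwikivoyage already existed. The two remedies mentioned in the channel ("if not exists" and setting sql_log_bin = 0 for the session) can be sketched as a small statement builder. This is only an illustration: the helper name create_db_statements and its skip_binlog flag are made up for this note, and it only prints SQL rather than connecting to any server.

```python
import re


def create_db_statements(db_name, skip_binlog=True):
    """Return the SQL statements for creating a database by hand on a master.

    Hypothetical helper, not part of any Wikimedia tooling.
    """
    # Only allow simple identifiers so the name can be quoted safely.
    if not re.fullmatch(r"[A-Za-z0-9_]+", db_name):
        raise ValueError(f"unexpected database name: {db_name!r}")

    statements = []
    if skip_binlog:
        # Keep this session's statements out of the binary log so that
        # replicas never replay the CREATE DATABASE.
        statements.append("SET sql_log_bin = 0;")
    # IF NOT EXISTS makes the statement harmless where the database
    # has already been created manually.
    statements.append(f"CREATE DATABASE IF NOT EXISTS `{db_name}`;")
    if skip_binlog:
        # Re-enable binary logging for the rest of the session.
        statements.append("SET sql_log_bin = 1;")
    return statements


if __name__ == "__main__":
    for statement in create_db_statements("frwikivoyage"):
        print(statement)
```

Either guard alone would probably have avoided the alerts here. Using both is the cautious choice when the same database is created manually on the master and on each replica.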
[00:43:44] RECOVERY - MySQL Slave Running on db1003 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:44:20] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay 0 seconds [00:44:28] RECOVERY - MySQL Slave Running on db1010 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:45:09] binasher: no, using the same files from earlier [00:45:15] SET SQL_LOG_BIN = 0; [00:45:31] RECOVERY - MySQL Slave Running on db1035 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:45:31] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 0 seconds [00:45:53] mutante: the files didn't include the "create database" though, did they? [00:46:08] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [00:47:37] binasher: no they didnt, gotcha, my notes said to do it before dropping anything and last time they were created by addwiki [00:56:31] !log demon synchronized php-1.21wmf3/extensions/ExtensionDistributor/ExtensionDistributor_body.php 'Debugging' [00:56:37] Logged the message, Master [01:05:44] !log demon synchronized php-1.21wmf3/extensions/ExtensionDistributor/ExtensionDistributor_body.php 'Debugging' [01:05:50] Logged the message, Master [01:13:20] !log demon synchronized php-1.21wmf3/extensions/ExtensionDistributor/ExtensionDistributor_body.php 'Revert debugging' [01:13:33] Logged the message, Master [01:19:57] binasher: done, incl. db34 [01:28:52] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [01:34:20] New patchset: Dzahn; "re-enabling frwikivoyage Apache config" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32176 [01:36:05] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32176 [01:37:35] omg... [01:37:50] now i cant pull the apache config on fenari? 
[01:37:51] error: insufficient permission for adding an object to repository database .git/objects [01:37:54] fatal: failed to write object [01:39:11] 12K drwxrwxr-x 154 hashar wikidev 12K 2012-11-06 23:49 objects [01:39:20] owned by hashar?! [01:39:40] less than an hour ago..man [01:39:52] lol [01:39:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 288 seconds [01:39:58] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 242 seconds [01:40:05] root ftw [01:40:13] mysql slave ........ [01:40:38] @replag [01:40:50] none of the servers we touched :p [01:40:54] good ;) [01:41:11] sooo, fix apache git repo... [01:43:16] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [01:45:42] dzahn is doing a graceful restart of all apaches [01:46:02] !log dzahn gracefulled all apaches [01:46:10] Logged the message, Master [01:46:10] sooo.done.. [01:46:25] woop [01:48:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 13 seconds [01:48:54] !log reedy synchronized all.dblist 'frwikivoyage' [01:48:59] Logged the message, Master [01:58:55] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [01:58:55] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [01:58:55] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [02:00:54] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 305 seconds [02:00:54] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 306 seconds [02:03:57] New patchset: Hoo man; "(bug 41840) Disable 'editsection' pref for wikibase" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32179 [02:08:27] New review: Danny B.; "The correct way is to fix the wrong code of ns-0 which causes that and not to disable the preference..." 
[operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/32179 [02:12:15] New review: Hoo man; "I don't think this needs any further discussion, this is the least bad way to workaround this until ..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/32179 [02:17:01] New review: Hoo man; "The pref. is currently set by 2 users on wikidatawiki:" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/32179 [02:17:34] New review: Danny B.; "You want to change the behavior with much bigger impact than to properly fix the single-namespace is..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/32179 [02:19:18] New review: Danny B.; "Argumenting with the current number of users with such preference is invalid, because obviously when..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/32179 [02:20:04] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [02:23:20] New patchset: Reedy; "Readonly for wikivoyage wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32180 [02:28:55] !log LocalisationUpdate completed (1.21wmf3) at Wed Nov 7 02:28:51 UTC 2012 [02:28:58] Logged the message, Master [02:31:46] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [02:34:57] Change abandoned: Hoo man; "Worked around in the common.css" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32179 [02:41:58] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [02:49:00] !log LocalisationUpdate completed (1.21wmf2) at Wed Nov 7 02:49:00 UTC 2012 [02:49:10] Logged the message, Master [02:49:41] New patchset: CSteipp; "Disable wikivoyage sites in apache" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32181 [02:56:43] PROBLEM - Varnish traffic 
logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:05:04] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [03:10:47] RECOVERY - Puppet freshness on erzurumi is OK: puppet ran at Wed Nov 7 03:10:30 UTC 2012 [03:12:02] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [03:34:25] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:39:22] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [03:48:23] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [03:48:40] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [04:28:40] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [04:42:42] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [04:45:15] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [04:49:23] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [04:54:17] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [05:21:17] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [06:18:03] New patchset: Stefan.petrea; "Openssl Package needed to be installed for jenkins At wmf-analytics we are trying to build libanon for CI (the same applies to libcidr and udp-filters, but will treat those in separate reviews)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32190 [06:35:19] New patchset: Stefan.petrea; "Openssl Package needed to be installed for jenkins At wmf-analytics we are trying to build libanon for CI (the same 
applies to libcidr and udp-filters, but will treat those in separate reviews)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32192 [06:36:23] Change abandoned: Stefan.petrea; "branch isn't as good as I wanted it to be(some files got deleted and stuff, I had just 2 files modif..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32190 [06:49:46] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [07:07:26] New review: Dzahn; "per chris" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/32181 [07:07:26] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32181 [07:13:49] dzahn is doing a graceful restart of all apaches [07:14:26] !log dzahn gracefulled all apaches [07:14:35] Logged the message, Master [07:31:02] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [07:32:10] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [07:53:25] New patchset: Matthias Mullie; "Update WV config based on their current LocalSettings config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32096 [08:18:37] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [08:18:37] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [08:18:37] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [09:45:41] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:45:42] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [09:56:45] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:56:46] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:05:11] PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:05:11] PROBLEM - HTTP on kaulen 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:05:12] RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:05:20] RECOVERY - HTTP on kaulen is OK: HTTP OK HTTP/1.1 200 OK - 461 bytes in 0.010 seconds [10:25:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32060 [10:28:48] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [10:31:17] New patchset: Mark Bergsma; "Make every frontend only talk to its own backend, for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32197 [10:31:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32197 [10:33:04] New patchset: Mark Bergsma; "Revert "mysql 5.5 compat"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32198 [10:33:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32198 [10:59:05] New patchset: Mark Bergsma; "Install upload Varnish on cp3004 as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32202 [10:59:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32202 [11:17:03] !log Pooled cp3004 with weight 1 [11:17:03] Logged the message, Master [11:30:04] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [11:31:36] New patchset: J; "Bug 41826 add wgmEnableTimedText" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32205 [11:33:27] New patchset: J; "Bug 41826 add wgmEnableTimedText" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32205 [11:48:28] mark: care to explain? [11:49:10] you're now hashing not just on the URL, but also on a header... 
which is not available when purging [11:49:31] so instead you should not adjust the hashing, but add a new header on which you can Vary: [11:49:39] that has drawbacks as well, but it doesn't break purging [11:50:07] thanks, that's way less dramatic and way more helpful than "you just broke purging" [11:50:38] but less brain exercise [11:53:34] mark: by the way -- this is unrelated by always confused me -- how come static assets from bits have x-cache-hit/miss headers from squids? is the setup varnish -> squid -> mw? [11:53:44] or am i misreading it? [11:54:30] they do? [11:55:27] let me check and produce an example before i make a complete idiot of myself [11:55:34] ok [11:57:03] ah nevermind, strontium is varnish, right? [11:57:12] yes [11:57:57] !log Increased weight of cp3004 to 9 [11:58:02] Logged the message, Master [11:58:15] nm then [12:00:24] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [12:00:24] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:00:24] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [12:35:16] it's funny how I replied about vary: cookie/accept-language in that RT ticket [12:35:26] and how it's a bad idea even if you normalize it [12:35:55] 5 days ago [12:42:40] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [12:52:39] !log Starting load testing of cp3003 and cp3004 [12:52:42] Logged the message, Master [13:02:43] did gerrit die? 
[13:06:03] probably [13:06:10] it'll come back again in a couple of minutes [13:06:58] it's back [13:11:12] !log reedy synchronized php-1.21wmf3/extensions/TimedMediaHandler/ [13:11:18] Logged the message, Master [13:12:40] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [13:15:56] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31009 [13:15:59] Reedy: looks like I overlooked a variable typo in the code that only gets triggered in production: can i get one more sync of TMH [13:16:34] lol [13:16:53] I should write a script to do this.. [13:17:18] :) [13:17:55] It's a brainless activity to do it, and with the extension name you can make a generic Update FOO to master message [13:18:51] i might do that this afternoon [13:19:05] can even make it auto approve itself... [13:26:39] !log reedy synchronized php-1.21wmf3/extensions/TimedMediaHandler/ [13:26:47] Logged the message, Master [13:33:38] heh, 9 lines of bash [13:33:47] just need to finish the auto approve bit [14:21:27] Reedy: nice, looks like I would need your new script right away, can you sync TMH once more? [14:44:57] j^: we [14:44:57] whee [14:45:00] first part works first time [14:45:03] New patchset: Silke Meyer; "Added puppet files for Wikidata on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30593 [14:46:11] !log reedy synchronized php-1.21wmf3/extensions/TimedMediaHandler/ [14:46:13] Logged the message, Master [14:57:52] are tmh1/tmh2 running the latest code? 
somehow tmh2 looks a bit idle from http://ganglia.wikimedia.org/latest/?c=Video%20scalers%20pmtpa&h=tmh2.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [15:06:57] New patchset: Mark Bergsma; "Set thread_pools to 2, and reduce max thread count to a saner level" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32222 [15:07:29] New patchset: Hashar; "rake disable colors on non TTY" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30062 [15:07:38] New review: Hashar; "rebased" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/30062 [15:08:06] pooor gerrit interface :/ [15:08:06] Service Temporarily Unavailable [15:08:16] ^demon: Gerrit having issue again apparently [15:08:21] why does gerrit's stability suck so much [15:08:22] though it only last for a few seconds [15:08:38] I am wondering if it is related to the Precise upgrade [15:08:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32222 [15:08:52] you're relating everything to precise upgrades [15:09:01] <^demon> It's unrelated to precise. [15:09:10] <^demon> It started a few days before we upgraded. [15:09:15] <^demon> And I've not tracked down why yet. [15:23:25] !log reactivating tele2 ipv6 peering and re-exporting routes to it [15:23:31] Logged the message, Master [15:56:36] New patchset: Hydriz; "Removed duplicate Wikivoyage section for wgSitename." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32223 [16:01:34] New review: Hydriz; "-1 for now to get some small issues fixed first." 
[operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/31949 [16:17:41] New patchset: J; "Bug 41826 add wgmEnableLocalTimedText" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32205 [16:39:27] PROBLEM - Host cp3019 is DOWN: PING CRITICAL - Packet loss = 100% [16:40:13] RECOVERY - Host cp3019 is UP: PING OK - Packet loss = 0%, RTA = 117.29 ms [16:41:30] New patchset: Alex Monk; "(bug 41841) Add wikidata and wikivoyage to wgNoFollowDomainExceptions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32226 [16:49:02] New patchset: Demon; "Enable TMH on commonswiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32231 [16:49:30] oh wow [16:49:30] already? [16:49:30] New review: Demon; "For merging in ~10m when the deployment window opens." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/32231 [16:50:05] <^demon> paravoid: In 10 minutes, yeah. [16:50:51] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [16:51:47] Scary stuffs! [16:51:55] indeed [16:52:03] ^demon: before we do deployment to commons, we need TMH master synced to 1.21wmf3, https://gerrit.wikimedia.org/r/#/c/32205/ merged to mediawiki-config [16:52:19] <^demon> Turns out I'm enabling it right at lunch time, which is *fantastic* [16:52:22] <^demon> :) [16:52:22] A lot of moving parts [16:53:50] New patchset: Reedy; "Move everything else over to 1.21wmf3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32232 [16:54:10] New review: J; "would also want to enable transcoding on commons. " [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/32231 [16:54:16] Change abandoned: Reedy; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32180 [16:55:26] <^demon> Reedy: Gonna have to rebase 32232, since it depends on abandoned 32180. 
[16:55:39] Yup, just fixing it [16:56:06] New patchset: Reedy; "Move everything else over to 1.21wmf3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32232 [16:56:14] j^: have we done any estimates regarding the scalability of tmh1/2? [16:56:20] that's better [16:56:24] how many transcoding hits can we sustain? [16:56:34] <^demon> Rebase button on change with abandoned dependency should rebase against destination branch. [16:56:38] <^demon> Would make most sense imho, rather than giving a stupid error [16:56:52] mmm, very much so [16:59:20] paravoid: its hard to predict, but initially it will have a backlog, and given the changes in support we might see more video uploads [16:59:55] +there is room for using them a bit better [17:00:26] but might need to add more transcoding boxes if a lot of new videos are uploaded [17:02:43] New patchset: Demon; "Enable TMH on commonswiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32231 [17:02:53] <^demon> j^: PS2 has your transcode fix. [17:03:13] New patchset: Mark Bergsma; "Reduce thread limits" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32234 [17:03:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32234 [17:04:24] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32205 [17:05:55] New patchset: Demon; "Enable TMH on commonswiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32231 [17:06:16] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32231 [17:07:49] thanks ^demon! [17:08:03] <^demon> No problem. Syncing config now. [17:08:20] !log demon synchronized wmf-config/CommonSettings.php 'Enabling TMH for commonswiki' [17:08:25] Logged the message, Master [17:08:43] Can i get 'transcode-status' permissions on commons? 
[17:08:43] !log demon synchronized wmf-config/InitialiseSettings.php 'Enabling TMH for commonswiki' [17:08:46] Logged the message, Master [17:09:30] j^: mdale: what's the reason for restricting 'transcode-status' on any given wiki? [17:09:37] <^demon> j^: We can't assign individual permissions. Only group on commons with that permission right now is admins. [17:10:10] morebots: the special page is costly to generate [17:10:31] ^demon: are staff admins? [17:10:46] <^demon> If you're part of the global staff group, which not everybody is. [17:10:49] <^demon> (I'm not, for one) [17:11:16] hmm if its something your granfathered into... I should be good ;) [17:11:40] I can check on en .. one sec. [17:12:09] There's likely commons admins around [17:12:36] nope coming up blank for me. [17:12:36] http://en.wikipedia.org/wiki/Special:TimedMediaHandler [17:12:49] <^demon> Generally speaking: hiding a special page because it's expensive sounds like the wrong route to go. I'd much rather cache it and show it to other users (and maybe restrict re-generating it to a permission, if it's really that bad) [17:13:02] ^demon: agreed. [17:13:04] completely hiding seems strange [17:13:16] or do you mean you don't have permission to view X hiding? [17:13:49] <^demon> Man, this page is taking forever to load. There *really* needs to be some level of caching here. [17:13:49] ^demon: that is what we do for the per-page transcode status.. show to all .. restrict reseting transcode jobs to users with that status [17:14:29] j^ wrote it.. maybe not using all the MediaWiki caching tricks ? [17:15:18] or maybe we need more indexes on the transcode state table? [17:15:45] <^demon> Possibly. I don't have TMH installed locally yet so I haven't run explain on it. [17:15:47] * robla enters new bug for caching this page [17:15:59] the table can't be that large already, can it? 
:/ [17:16:18] dont think its that slow [17:16:23] just did not spend time testing it [17:16:47] if you want to dig up the queries, we can quickly EXPLAIN them and verify [17:17:00] <^demon> It was really slow for me on enwiki just now. [17:17:06] <^demon> I'm seeing some queries inside of a foreach(), which can't be fast :\ [17:17:09] and its not using cache since i wanted to see the current state [17:17:28] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:17:48] <^demon> We're doing a count(*) on the image table inside of a foreach. [17:17:55] <^demon> I'll be $5 that's it. [17:17:55] o_0 [17:18:15] <^demon> Line 156-161. [17:18:44] https://bugzilla.wikimedia.org/show_bug.cgi?id=41854 [17:18:57] "Bug 41854 - Cache expensive elements in Special:TimedMediaHandler " [17:21:24] img_media_type isn't indexed [17:21:53] and then there's an unknown extra conditional [17:22:08] I'll just dump it on the bug robla opened [17:23:09] http://commons.wikimedia.org/wiki/Commons:Media_help is a bit outdated now [17:23:36] array( 'transcode_key' => $key ), [17:23:36] isn't indexed [17:24:55] we should add an index for it in that case [17:25:15] Yeah, just left a comment to that effect [17:25:25] It'll be cheap to do with few rows in the table [17:25:39] RECOVERY - Puppet freshness on arsenic is OK: puppet ran at Wed Nov 7 17:25:32 UTC 2012 [17:26:06] <^demon> We should look at doing this slightly differently. I'd rather query more things at once and then sort at the application level, rather than iterating and doing queries. 
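[editor's note] The suggestion above (fetch more in one query and sort it out in the application, instead of querying inside a foreach) can be sketched with an in-memory SQLite table. This is not the TimedMediaHandler code: the table layout is simplified, the key values are made up, and SQLite stands in for MySQL. Only the names transcode, transcode_key and transcode_key_idx come from the discussion.

```python
import sqlite3


def make_demo_db():
    """Build a tiny stand-in for the transcode table (simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transcode ("
        " transcode_id INTEGER PRIMARY KEY,"
        " transcode_key TEXT NOT NULL)"
    )
    # Index on the column used for filtering and grouping.
    conn.execute("CREATE INDEX transcode_key_idx ON transcode (transcode_key)")
    rows = [("160p.ogv",), ("360p.webm",), ("360p.webm",),
            ("480p.webm",), ("160p.ogv",), ("360p.webm",)]
    conn.executemany("INSERT INTO transcode (transcode_key) VALUES (?)", rows)
    return conn


def counts_in_loop(conn, keys):
    """One COUNT(*) query per key: the pattern being criticised."""
    counts = {}
    for key in keys:
        (n,) = conn.execute(
            "SELECT COUNT(*) FROM transcode WHERE transcode_key = ?", (key,)
        ).fetchone()
        counts[key] = n
    return counts


def counts_grouped(conn, keys):
    """A single GROUP BY query, with results matched to keys in Python."""
    counts = {key: 0 for key in keys}
    query = "SELECT transcode_key, COUNT(*) FROM transcode GROUP BY transcode_key"
    for key, n in conn.execute(query):
        if key in counts:
            counts[key] = n
    return counts


if __name__ == "__main__":
    demo = make_demo_db()
    wanted = ["160p.ogv", "360p.webm", "480p.webm", "720p.webm"]
    print(counts_in_loop(demo, wanted))
    print(counts_grouped(demo, wanted))
```

Both functions return the same counts. The grouped version issues one query regardless of how many keys are enabled, which is the saving being discussed.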
[17:26:35] There's a few more obfuscated queries based on where the source comes from [17:27:34] Change merged: CSteipp; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32096 [17:29:00] <^demon> I did explains on the image queries: http://p.defau.lt/?iyE2Kf2tqwL2AKcFATLAUw [17:29:44] i uploaded a WebM video to commons http://commons.wikimedia.org/wiki/File:2012-07-18_Market_Street_-_San_Francisco.webm [17:31:21] adding an indexed file_extension field to image table would make that query easier. [17:31:39] PROBLEM - Host arsenic is DOWN: PING CRITICAL - Packet loss = 100% [17:32:52] https://gerrit.wikimedia.org/r/32236 adds an index on trancode_key [17:34:16] i get logged out of gerrit all the time today [17:36:36] !log Creating transcode_key_idx on all transcode tables on all wikis [17:36:43] Logged the message, Master [17:37:11] <^demon> Reedy: Done on enwiki yet? [17:37:24] yup, it's onto ga via foreachwiki [17:37:38] j^ Market Street file not playing in IE7, is that expected? IE7 shows the first frame only. [17:37:39] <^demon> Oooh, I think I found a way to eliminate 4x queries in one of the iterations :) [17:37:46] <^demon> The new index works for this. [17:38:21] chrismcmahon: yes will not play until Ogg transcode is ready [17:38:36] display looks nice, though [17:38:39] Hmm, I guess the TMH table needs adding to addWiki, /me makes a TODO [17:39:01] Reedy: what does that mean? [17:39:13] so it creates the transcode table on new wikis [17:41:32] can we add me to a group that can see: http://en.wikipedia.org/wiki/Special:TimedMediaHandler ... user "mdale" [17:41:59] and for http://commons.wikimedia.org/wiki/Special:TimedMediaHandler ;) [17:42:02] Why is it empty for anons? ;) [17:42:26] <^demon> j^: How does http://p.defau.lt/?0sDneLyhr_iOCCa6BdSFqg look? [17:42:33] New review: Andrew Bogott; "This is just the design that I was aiming for -- good work!" 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/30593 [17:43:19] !log racking ms-be6 (720xd) [17:43:24] Logged the message, Master [17:43:26] 179 videos [17:43:26] 179 Ogg videos [17:43:33] Wow, what an awesome pag ;) [17:44:26] mdale: Only admins have that on enwiki... [17:44:45] <^demon> We already discussed this :p [17:44:45] ;) [17:45:56] Should we create a group then? [17:46:00] Potentially annoying enwiki? [17:46:00] New patchset: Dzahn; "add wildcard SSL cert for wikivoyage.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32240 [17:46:08] ^demon: with those changes you think its safe for anonymous users? [17:46:26] <^demon> All we've done is slap an index on it. [17:46:37] <^demon> We still need to clean up some of these queries-in-foreaches. [17:46:55] New review: Dzahn; "RT-3696 (includes .m.)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/32240 [17:46:55] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32240 [17:47:20] And maybe add some wfProfile calls [17:47:34] <^demon> mdale: If you can test out http://p.defau.lt/?0sDneLyhr_iOCCa6BdSFqg on your local install, that would reduce the number of queries by count($wgEnabledTranscodeSet) times. [17:48:01] <^demon> (So on enwiki, 4x less queries in getStats() :)) [17:49:52] ^demon: Hunk #1 FAILED at 132. [17:50:01] thats against master? [17:50:33] <^demon> It wasn't but easily rebases. [17:50:33] <^demon> I'll put a new patch [17:50:54] <^demon> Against master: http://p.defau.lt/?TFvE4xxm0Sg1Wx5_TBfBEw [17:56:56] <^demon> j^: I've got to prep for a meeting. If that works on your install, we'll get it merged & deployed soon. [17:58:31] ^demon: ah that pastebin adds html to that page impossible to just patch can you just git review it to gerrit? 
[17:59:49] <^demon> j^: https://gerrit.wikimedia.org/r/#/c/32243/ [18:01:00] New patchset: Dzahn; "add SSL proxy config for wikivoyage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32244 [18:03:40] New patchset: Dzahn; "add SSL proxy config for wikivoyage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32244 [18:05:46] RECOVERY - Host arsenic is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [18:05:53] ^demon: thanks for your help today! [18:06:04] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK HTTP/1.1 200 OK - 633 bytes in 0.056 seconds [18:06:06] <^demon> No problem :) [18:06:15] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32244 [18:07:09] robla: i ended up starting to write a script to do extension updates to master in wmf branches... [18:08:26] just need to make it submit them too... [18:14:57] ^demon: your patch lost keys will rework it a bit [18:15:26] <^demon> Yeah, I wasn't 100% sure if I had it. Got the general idea of what I was getting at though? [18:17:57] the first part does not work like that, the second one with group by is k [18:18:39] * robla disappears into a meeting for a bit. [18:19:52] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [18:19:52] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [18:19:52] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:22:16] PROBLEM - HTTPS on ssl1 is CRITICAL: Connection refused [18:22:21] dang, there is an issue with the nginx config on ssl1, looking [18:30:36] RECOVERY - HTTPS on ssl1 is OK: OK - Certificate will expire on 08/22/2015 22:23. [18:42:20] !log racking new ms-be6 to row B sdtpa [18:42:28] Logged the message, Master [18:43:11] New review: Dzahn; "role class ++, installing openssl - don't see a problem. Hashar, if this makes sense to you on integ..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/32192 [18:45:12] New review: Ottomata; "Naw, I talked with Stefan this morning." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/32192 [18:46:42] New review: Dzahn; "ah,ok, i also talked to him a bit last night and actually recommended to do a role class. It sounded..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/32192 [18:49:15] New review: Dzahn; "[[RT:3879]]" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/32192 [18:50:47] heisenbugs. test/test2 recently produced js errors on nearly every page for IE7. now that I'm looking for them, they're not there. found one, but it looks exceptional. http://test.wikipedia.org/wiki/Foobar [18:54:07] New review: Dzahn; "we already merged the redirect, but looks like it is missing a ServerAlias" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/31302 [18:57:20] New patchset: Dzahn; "add missing ServerAlias for wlm.wikimedia.org redirect to toolserver" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32250 [18:58:45] New review: Dzahn; "needed for change 31303 to work. and that is what 31302 depends on" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/32250 [18:58:45] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/32250 [19:00:06] dzahn is doing a graceful restart of all apaches [19:00:26] !log dzahn gracefulled all apaches [19:00:34] Logged the message, Master [19:02:18] New review: Dzahn; "needed a ServerAlias: change 32250" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/31303 [19:03:49] New review: Dzahn; "should work now soon" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/31302 [19:06:45] mutante: how much more are you doing? 
it's time for us to finish off the 1.20wmf3 deployment [19:08:09] Reedy: assuming we get the all clear here, are you ready to switch eveyrthing else to wmf3? [19:14:13] robla: go ahead, stopping for now [19:19:32] 07 18:51:30 < rfaulkner> Amit is asking if anybody here might know a wikipedia zero feature enabled today? [19:20:36] Yes. Wondering who merged the WP zero changes earlier? Who I need to contact when we're ready to shut off the test? [19:20:52] did you check gerrit and the SAL? ;) [19:21:08] do you know yet what time they should be turned off? [19:23:22] akapoor: i see nothing relevant in the SAL [19:23:35] how did you get it deployed to begin with? [19:23:44] !sal | akapoor [19:23:44] akapoor: https://labsconsole.wikimedia.org/wiki/Server_Admin_Log see it and you will know all you need [19:25:48] !log re-adding arsenic to bits cache pool [19:25:55] Logged the message, notpeter [19:37:08] anyone seen Reedy? [19:38:08] I'm here [19:38:24] I'd done 50% of the work already ;) [19:38:51] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32232 [19:39:06] Woah [19:39:40] j^: I've got 2 TMH related php warnings and 1 fatal (though 243 occurances in the last 1000 log lines) [19:39:48] Make that 2 fatals [19:41:48] akapoor: ? [19:42:06] paravoid: what's with copper/zinc swift? [19:42:09] seems to be down [19:42:43] hm? [19:43:37] I've never touched that cluster, does it even work? [19:44:18] not anymore, I think there was some hardware problem or something [19:45:56] zinc seems down [19:46:11] do you actively use that? does it make sense to put effort into reviving that cluster? [19:46:48] !log point wlm.wikimedia.org to the cluster rather than stand-alone host yttrium [19:46:51] Logged the message, Master [19:48:25] Reedy: can you provide more info? 
[19:48:42] j^: https://bugzilla.wikimedia.org/show_bug.cgi?id=41860 [19:48:53] The 2 fatals have stack traces [19:49:06] The 2 warnings should be simple enough, as they're missing constructor parameters [19:50:02] paravoid: I run tests against it [19:50:20] okay [19:50:22] I don't know how much work it is to keep that running [19:50:29] I'll try fixing it [19:50:44] and then I should probably upgrade it at some point [19:50:45] it's also good for latency tests ;) [19:50:51] to match what we have in production [19:51:05] New patchset: Asher; "disable zero for orange congo and botswana" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32252 [19:51:06] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else to 1.21wmf3 [19:51:12] Logged the message, Master [19:51:30] New patchset: Reedy; "Fix abwiki typoo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32253 [19:52:19] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32252 [19:52:33] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32253 [19:53:18] !log powercycling zinc, unresponsive to network and serial console [19:53:23] Logged the message, Master [19:53:27] New patchset: Asher; "Revert "Revert "mysql 5.5 compat"" - going to deploy now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32254 [19:53:59] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32254 [19:56:18] RECOVERY - Host zinc is UP: PING OK - Packet loss = 0%, RTA = 26.70 ms [19:56:23] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [19:56:28] Logged the message, Master [19:56:50] New patchset: Reedy; "wm3 -> wmf3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32255 [19:57:05] PROBLEM - NTP on zinc is CRITICAL: NTP CRITICAL: Offset unknown [19:57:35] AaronSchulz: better? 
[19:57:36] Reedy: argh....I thought we had some checks to keep that sort of thing from happening [19:57:53] we do, but it's a silly warning.. which seems to carry on [19:57:58] https://bugzilla.wikimedia.org/show_bug.cgi?id=41861 [19:58:04] logged a bug to get jenkins to say no [19:58:11] PROBLEM - Swift HTTP on zinc is CRITICAL: Connection refused [19:58:11] j^: tmh2 has lots of avconv running for yesterday and they day before; no ffmpeg2theora [19:58:18] paravoid: not really :/ [19:58:47] paravoid: must be before the memory issue was fixed, you can killall -9 avconv in that case [19:59:38] RECOVERY - NTP on zinc is OK: NTP OK: Offset 0.02785778046 secs [20:00:06] j^: did that, now there are newly run avconv [20:02:49] Reedy: https://gerrit.wikimedia.org/r/#/c/32256 [20:03:13] paravoid: and top still low? [20:03:32] sorry? [20:03:44] paravoid: never mind, ganglia was just slow, now load goes up [20:03:44] there are short-lived avconvs now, I think it's just processing jobs normally [20:05:21] paravoid: hmm, using zinc seems to work [20:05:47] yeah, I restarted swift there [20:05:49] do you use both as proxies? [20:06:16] just one, zinc [20:06:16] j^: all ok from your side too? [20:06:16] I was using copper [20:06:16] AaronSchulz: the cluster also has magnesium btw. [20:06:20] New patchset: Asher; "temporarily disabling session multiwrite to redis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32257 [20:06:22] not sure if a proxy runs there, I can check [20:06:51] yeah I know about mg [20:07:04] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32257 [20:07:05] or perhaps I should call it hg ;) [20:07:24] paravoid: tmh2 looks ok now if that was your question [20:07:51] yeah [20:07:52] so if https://gerrit.wikimedia.org/r/#/c/32256 gets merged it might be good to sync tmh master to wmf3 so those errors are gone [20:08:36] j^: just in case this was addressed to me, ops (incl. 
myself) don't do MW deploys [20:08:46] j^: MediaTransformError needs 3 parameters [20:08:49] so the one you changed needs a 3rd [20:08:54] and the one above also needs a third adding [20:08:59] New patchset: SPQRobin; "Add wikivoyage to missing.php for future redirects to incubator" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32258 [20:08:59] !log asher synchronized wmf-config/CommonSettings.php 'disabling session multiwrite to redis on mc1' [20:09:05] Logged the message, Master [20:10:15] robla: Static analysis of our code would be great ;) [20:13:43] Reedy: ah like that, reworking patch [20:14:03] New patchset: jan; "Add puppet config for PHP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29975 [20:14:10] New patchset: jan; "Add puppet config for PHP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29975 [20:15:11] New patchset: jan; "Refactor the webserver classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30147 [20:16:31] New review: jan; "Per IRC-discussion: I removed all classes that are not for refactoring at moment. The other classes ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29975 [20:18:16] New review: Faidon; "Better :) So, how about naming php php::common, since that's what it is? Note that there's no requir..." 
[operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/29975 [20:18:52] New patchset: Asher; "using redis on mc1-16 for sessions" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32262 [20:18:57] Reedy: pushed [20:20:33] New patchset: jan; "Add puppet config for PHP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32263 [20:20:56] New patchset: jan; "Add puppet config for PHP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32263 [20:29:49] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [20:30:17] AaronSchulz: so, does it work for you? [20:30:44] paravoid: copper? [20:30:59] using zinc as the proxy does work [20:31:02] !log reedy synchronized php-1.21wmf3/extensions/TimedMediaHandler [20:31:11] Logged the message, Master [20:31:25] j^: Done... Hopefully that clears stuff up... [20:33:20] was that related to the fatal https://bugzilla.wikimedia.org/show_bug.cgi?id=41860 ? [20:34:19] To which fatal? :p [20:34:30] That's just what I grabbed from our error logs [20:34:48] !log wikivoyage imports: "nl" - done [20:34:53] Logged the message, Master [20:34:56] but its happening rarely? [20:35:06] .mid ? 
[20:35:19] hmm [20:35:51] one of the fatals was accounting for about 25% of the error lines in the last 1000 [20:36:06] just waiting for them to filter out [20:37:38] AFK for 10-15 [20:42:07] !log added new redis-2.6.3 pkg to precise-wikimedia and upgraded mc1 [20:42:14] Logged the message, Master [20:42:27] binasher: it bothers me that those "ITEM TOO BIG" are for mostly the same set of files [20:42:36] I wonder which files those are [20:42:55] yeah [20:43:44] it seems wrong that parser output from articles never exceeds the 1M limit, metadata for some media files does [20:44:32] it looks like job queue deadlocks are gone with the switch to wmf3 [20:45:23] New patchset: Asher; "change default redis pkg version to 2:2.6.3-wmf1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32305 [20:45:27] yay [20:45:44] binasher: still lots of those old swift errors, I was hoping removing -n would magically fix them [20:45:46] !log wikivoyage imports: "sv" - done [20:45:54] Logged the message, Master [20:45:58] AaronSchulz: can you live hack in something to dump the contents of one of those file keys? [20:46:16] AaronSchulz: are they still mostly just on zhwiki? [20:46:37] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32305 [20:47:23] binasher: we should at least split the main file cache from the metadata one [20:48:28] its all or nothing now, so the existence/size/hash/... checks for some commons file used on some wikis keep resulting in db queries [20:50:38] splitting sounds logical [20:50:38] can we also impose size limits on metadata? [20:50:39] !log aaron synchronized php-1.21wmf3/includes/filerepo/file/LocalFile.php 'debug logging.' 
[20:50:39] Logged the message, Master [20:51:07] binasher: see mw-log/temp-debug.log [20:51:16] * AaronSchulz gets cold and should eat [20:51:30] ahh, yes, djvu files, makes sense [20:51:38] New patchset: Cmjohnson; "Correcting MAC for ms-be6 to reflect H/W change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32307 [20:52:34] is the "metadata" really the document? [20:52:35] cmjohnson1: oooh [20:52:42] arrived finally? [20:52:44] \o/ [20:53:11] exciting...should have for you soon [20:53:17] New review: Cmjohnson; "Looks good to me but needs +2" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/32307 [20:53:31] can you +2 paravoid? [20:53:39] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32307 [20:54:19] !log upgraded redis on mc1-16 [20:54:20] done, running puppet on brewster now [20:54:25] Logged the message, Master [20:54:53] done [20:55:13] AaronSchulz: any issue with me deploying https://gerrit.wikimedia.org/r/#/c/32262/1/wmf-config/CommonSettings.php now? then i want to swap the order later today [20:56:13] thx [20:57:33] New patchset: Anomie; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32167 [20:57:43] New patchset: Anomie; "Allow per-realm and per-datacenter configuration" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/32168 [20:57:52] Change abandoned: Anomie; "After working on bug 41133 and finding out how many other things need the same sort of infrastructur..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30792 [20:59:21] !log wikivoyage imports: "ru" - done [20:59:25] Logged the message, Master [21:00:33] New review: Anomie; "This changeset includes the changes from Ie0bc55fb5e4a70739030809e7b061ce54ee79ccf, which was abando..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/32167 [21:02:37] New review: Anomie; "The main change here is that the code for detecting the realm is moved from mediawiki-config's Commo..." [operations/mediawiki-multiversion] (master) C: 0; - https://gerrit.wikimedia.org/r/32168 [21:05:41] paravoid: i am just waiting on the idrac license and will than will be able to load basic OS...it's all yours from there. [21:05:49] alredy cfg'd the raid to jbod [21:05:57] great [21:06:07] I think apergos would prefer to handle that though [21:06:17] so it might have to wait until tomorrow morning [21:06:29] I'm happy to take it (but it would indeed be tomorrow) [21:06:29] it's late for me too and I need to finish what I'm doing now :) [21:06:36] ohhi [21:06:43] as soon as drac is set up [21:06:44] sorry for the notify, didn't expect you here :) [21:06:48] no worries [21:06:51] we got poured on [21:07:00] rained tear gas on us, then storm [21:07:12] lovely ^ [21:07:28] I guess they used the water cannon at some point (fools, they only needed to wait for the storm to hit) but I didn't see it in action, just parked [21:08:20] so anyways, after getting everything below waist drenched, (and getting dispersed by tear gas on top of that), I came home around 10 pm [21:08:28] yay for warm pjs [21:09:15] someone really needs to teach these guys *other* crowd dispersal tactics, they only seem to know one and it involves chemicals [21:09:51] how are things here anyways? 
[21:10:01] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32262 [21:10:45] !log asher synchronized wmf-config/CommonSettings.php 'enabling session multiwrite to redis on mc1-16' [21:10:52] Logged the message, Master [21:12:47] apergos: so the 720xd came in...just waiting on iDRAC license from Dell and I will have everything I need to get it ready for you for the morning [21:13:48] that is awesome [21:14:08] thanks for getting that ready [21:14:18] awesome indeed [21:16:22] binasher: no objections [21:20:47] New patchset: Asher; "testing persistent connectiosn with pecl-memcached" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32309 [21:22:06] binasher: "connectiosn"? [21:22:11] * AaronSchulz ducks [21:22:40] if that breaks things.. [21:22:43] haha [21:24:16] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32309 [21:24:21] :p [21:26:06] Now you've done it. [21:26:54] !log asher synchronized wmf-config/mc.php 'persistent connections for memcached-pecl' [21:27:02] Logged the message, Master [21:27:06] someone is on thin ice now [21:28:32] that seems to have increased the failure rate, oh well [21:28:57] New patchset: Asher; "Revert "testing persistent connectiosn with pecl-memcached"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32310 [21:29:38] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32310 [21:29:51] *snap* *crackle* *pop* [21:30:18] !log asher synchronized wmf-config/mc.php 'disable persistent connections for memcached-pecl' [21:30:18] * AaronSchulz should really get some profile calls in for mc [21:30:22] rice krispies? 
[21:30:24] Logged the message, Master [21:30:30] AaronSchulz: yes plz [21:31:03] i wonder why all the server failures with persist enabled [21:31:18] so before we just had GET profiling [21:31:44] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours [21:42:53] AaronSchulz: splitting it up might be good, but having a count + perf metric for all memcached calls regardless of type would be good [21:46:17] New patchset: Jgreen; "add user awight, add admins::fr-tech, add admins::fr-tech to most fr hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32312 [21:47:11] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/32312 [21:50:53] PROBLEM - MySQL Recent Restart on db64 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:50:53] PROBLEM - MySQL disk space on db64 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:20] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:47] PROBLEM - SSH on db64 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:47] PROBLEM - MySQL Slave Delay on db64 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:51:47] PROBLEM - Full LVS Snapshot on db64 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:05] PROBLEM - MySQL Slave Running on db64 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:05] PROBLEM - MySQL Idle Transactions on db64 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:05] PROBLEM - mysqld processes on db64 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:57:48] !log removing strontium from eqiad bits cache pool for upgrade to precise [21:57:53] Logged the message, notpeter [21:59:22] New patchset: Aaron Schulz; "Set pruning cutoff for file journal table." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32316 [22:01:23] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:01:23] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [22:01:23] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [22:01:29] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32316 [22:02:07] !log aaron synchronized wmf-config/filebackend.php [22:02:12] Logged the message, Master [22:08:12] PROBLEM - SSH on strontium is CRITICAL: Connection refused [22:08:12] PROBLEM - Varnish HTTP bits on strontium is CRITICAL: Connection refused [22:08:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32067 [22:08:35] !log powercycling db64 (frozen) [22:08:38] also mgmt [22:08:39] Logged the message, Master [22:10:59] PROBLEM - NTP on db64 is CRITICAL: NTP CRITICAL: No response from NTP server [22:13:14] RECOVERY - SSH on db64 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:13:35] RECOVERY - MySQL Idle Transactions on db64 is OK: OK longest blocking idle transaction sleeps for seconds [22:13:41] RECOVERY - MySQL Slave Running on db64 is OK: OK replication [22:13:50] RECOVERY - MySQL Recent Restart on db64 is OK: OK seconds since restart [22:13:50] RECOVERY - MySQL disk space on db64 is OK: DISK OK [22:14:17] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay seconds [22:14:44] RECOVERY - MySQL Slave Delay on db64 is OK: OK replication delay seconds [22:14:44] RECOVERY - Full LVS Snapshot on db64 is OK: OK no full LVM snapshot volumes [22:16:42] RECOVERY - mysqld processes on db64 is OK: PROCS OK: 1 process with command name mysqld [22:17:35] RECOVERY - NTP on db64 is OK: NTP OK: Offset -0.0006093978882 secs [22:20:17] PROBLEM - MySQL Slave Running on db64 is CRITICAL: CRIT 
replication Slave_IO_Running: No Slave_SQL_Running: No Last_Error: Rollback done for prepared transaction because its XID was not in the [22:20:44] RECOVERY - SSH on strontium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:20:53] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: CRIT replication delay 2054 seconds [22:23:02] * AaronSchulz gets more gerrit 503s [22:23:19] I wonder if there is some sort of error log for that [22:25:38] AaronSchulz: chad said it was some kind of intermittent error which he did realy investigate yet [22:25:44] AaronSchulz: might be gerrit restarting [22:26:25] or the Apache Proxy automatically sending 503 for a certain period of time whenever one request has failed [22:26:33] hashar: so is jenkins on tempfs now? [22:27:07] hey asher, db64 crashed, but not during an import, just like that [22:27:14] had to powercycle [22:27:29] PROBLEM - NTP on strontium is CRITICAL: NTP CRITICAL: Offset unknown [22:27:56] mutante: thanks, taking a look [22:28:37] !log asher synchronized wmf-config/db.php 'pulling db64' [22:28:43] Logged the message, Master [22:29:40] AaronSchulz: Tim has setup some tempFS mount [22:30:02] AaronSchulz: and some symbolic link. But since the jobs are setup to unlink the sqlite database, it is probably no more using tmpfs now :/ [22:30:15] AaronSchulz: I am not entirely sure what tim did [22:30:54] AaronSchulz: also have to reproduce an issue where the Ext-Wikibase job ends up eating all the box memory when running update.php ;-/ [22:31:14] PROBLEM - Host strontium is DOWN: PING CRITICAL - Packet loss = 100% [22:31:53] AaronSchulz: the filesystem is still really slow though :-( [22:31:55] why would unlinking make it stop using it? 
I assume he'd change the path that the dbs are created to the tmpfs mount [22:32:07] yeah he changed the path (checked that) [22:32:30] so the $WORKSPACE/data now points to something in the tempfs [22:32:53] should adapt the ant script to enforce that [22:34:37] logged as https://bugzilla.wikimedia.org/show_bug.cgi?id=41868 [22:35:47] RECOVERY - Host strontium is UP: PING OK - Packet loss = 0%, RTA = 27.75 ms [22:35:57] mutante: i'm going thru the binlong on db34 to see where to reslave db64, it crashed mid-transaction [22:36:06] thanks [22:39:01] AaronSchulz: there's a few of Invalid username or access key. in /usr/local/apache/common-local/php-1.21wmf3/includes/filebackend/SwiftFileBackend.php on line 1440 in the logs [22:39:58] mutante: hi :-]  If you ever end up having nothing else do under "rt duty" feel free to poke my pending changes in puppet :-] [22:40:04] ok, db64 is catching up on replication now [22:40:09] RECOVERY - MySQL Slave Running on db64 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [22:40:18] mutante: once its caught up, you can go ahead with the import [22:40:29] mutante: list at https://gerrit.wikimedia.org/r/#/q/is:open+project:operations/puppet+owner:hashar,n,z , really only look at the most recently update changes. Two of them are basic apache configuration change for the CI server + some html update. 
;-] [22:40:46] mutante: but I guess you are busy with wikivoyage today ;-] [22:41:32] !log re-pooling strontium [22:41:37] Logged the message, notpeter [22:43:05] PROBLEM - MySQL Slave Delay on db64 is CRITICAL: CRIT replication delay 3083 seconds [22:47:45] New patchset: Asher; "move redis to the front of the session multiwrite array" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32323 [22:49:24] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32323 [22:49:33] * AaronSchulz hears the ice crack ;) [22:51:02] !log asher synchronized wmf-config/CommonSettings.php 'move redis to the front of the session multiwrite array' [22:51:02] woo [22:51:05] Logged the message, Master [22:52:57] AaronSchulz: did you get a chance to add any profiling to the pecl bagostuff [22:53:08] RECOVERY - Varnish HTTP bits on strontium is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 0.053 seconds [22:53:13] i'm wondering when we should turn multiwrite [22:53:14] off [22:54:47] I had a patch that adds get/gets profiling [22:57:56] RECOVERY - MySQL Slave Delay on db64 is OK: OK replication delay 14 seconds [22:58:59] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay 0 seconds [22:59:40] !log asher synchronized wmf-config/db.php 'returning db64 to s3' [22:59:46] Logged the message, Master [23:02:03] !log removing niobium from eqiad bits cache pool for upgrade to precise [23:02:10] Logged the message, notpeter [23:06:31] PROBLEM - Host niobium is DOWN: PING CRITICAL - Packet loss = 100% [23:08:35] AaronSchulz: merged it [23:12:21] RECOVERY - Host niobium is UP: PING OK - Packet loss = 0%, RTA = 26.81 ms [23:14:09] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [23:16:26] PROBLEM - SSH on niobium is CRITICAL: Connection refused [23:16:42] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: Connection refused [23:24:59] RECOVERY - SSH on niobium is OK: SSH OK - 
OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:35:41] PROBLEM - Host niobium is DOWN: PING CRITICAL - Packet loss = 100% [23:38:27] RECOVERY - Host niobium is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [23:43:06] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK HTTP/1.1 200 OK - 635 bytes in 0.053 seconds [23:43:12] !log repooling niobium [23:43:18] Logged the message, notpeter [23:47:20] do we have differential rate limiting to the web API for people that request higher than default limits? [23:48:58] !log removing palladium from eqiad bits cache pool for upgrade to precise [23:49:01] Logged the message, notpeter [23:54:15] PROBLEM - Host palladium is DOWN: PING CRITICAL - Packet loss = 100% [23:57:22] anyone, anyone, Bueller? [23:58:02] robla: I don't quite understand your question [23:59:14] RoanKattouw: so...got a request from a researcher saying they're hitting a limit to how many requests a (minute? hour? etc?...I'll find out), and would like to see that increased