[00:07:17] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 0d 15h 4m 24s [00:29:47] springle: welcome back! fyi, the masses are getting antsy about dewiki. maybe you have some ideas of the current status (see #wikimedia-labs) [00:30:39] recentchanges is getting new entries (so it's replicating) but apparently some tables are missing lots of rows [00:44:43] (03CR) 10Addshore: [C: 031] Fix Wikibase noc symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97903 (owner: 10Aude) [00:57:04] jeremyb: pt-table-sync process keeps losing connection to that labs db box. hope to figure out why today [00:57:26] springle: k, thanks [00:58:25] ha. oom killer [00:59:19] you mean table's too big? :) [01:01:47] the sync is batched, so no... possibly sync + some large/slow txn backing connections up and spiking mysqld mem usage [01:02:58] ohhhh, i was thinking pt-table-sync itself was being killed [01:03:12] anyway, enjoy digging :) [01:03:15] ah :) nope, mysqld [01:10:21] (03PS1) 10Springle: Reduce mysqld footprints temporarily for investigation. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98462 [01:12:28] (03CR) 10Springle: [C: 032] Reduce mysqld footprints temporarily for investigation. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98462 (owner: 10Springle) [01:16:48] !log restarting labsdb1002 mysqld processes with 25% smaller buffer pools. kernel OOM killer striking. 
needs investigation [01:17:07] Logged the message, Master [01:36:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [02:08:15] !log LocalisationUpdate completed (1.23wmf4) at Mon Dec 2 02:08:15 UTC 2013 [02:08:32] Logged the message, Master [02:14:57] !log LocalisationUpdate completed (1.23wmf5) at Mon Dec 2 02:14:57 UTC 2013 [02:15:12] Logged the message, Master [02:21:20] (03PS1) 10Springle: depool pc1002 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98465 [02:22:12] (03CR) 10Springle: [C: 032] depool pc1002 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98465 (owner: 10Springle) [02:23:27] !log springle synchronized wmf-config/db-eqiad.php 'depool pc1002 for upgrade' [02:23:43] Logged the message, Master [02:34:08] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [02:36:56] (03PS1) 10Springle: switch pc1002 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/98466 [02:37:48] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Dec 2 02:37:48 UTC 2013 [02:38:00] (03CR) 10Springle: [C: 032] switch pc1002 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/98466 (owner: 10Springle) [02:38:04] Logged the message, Master [02:49:12] (03PS3) 10Tim Starling: Remove "Your cache administrator is nobody" joke. [operations/puppet] - 10https://gerrit.wikimedia.org/r/95147 (owner: 10Mattflaschen) [02:49:21] (03CR) 10Tim Starling: [C: 032] Remove "Your cache administrator is nobody" joke. [operations/puppet] - 10https://gerrit.wikimedia.org/r/95147 (owner: 10Mattflaschen) [02:49:35] (03CR) 10Tim Starling: [V: 032] Remove "Your cache administrator is nobody" joke. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/95147 (owner: 10Mattflaschen) [02:49:53] (03PS1) 10Springle: repool pc1002 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98467 [02:50:18] (03CR) 10Springle: [C: 032] repool pc1002 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98467 (owner: 10Springle) [02:51:36] !log springle synchronized wmf-config/db-eqiad.php 'repool pc1002 after upgrade, max_connections lowered during warm up' [02:51:49] Logged the message, Master [02:58:16] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 9d 17h 39m 59s [02:59:16] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 1m 0s [02:59:36] RECOVERY - Puppet freshness on pc1002 is OK: puppet ran at Mon Dec 2 02:59:35 UTC 2013 [03:00:16] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 0m 37s [03:01:16] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 1m 0s [03:02:16] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 2m 0s [03:03:16] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 3m 0s [03:04:27] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 4m 0s [03:05:17] RECOVERY - Puppet freshness on pc1002 is OK: puppet ran at Mon Dec 2 03:05:08 UTC 2013 [03:07:27] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 0d 18h 4m 34s [03:28:54] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 1m 0s [03:29:14] something is wrong with this picture [03:29:57] well, two things, most likely [03:30:45] puppetd -tv works [03:30:47] so maybe only one thing [04:17:12] * jeremyb digs up some logs for TimStarling [04:17:12] 22 08:42:53 < apergos> Nov 22 08:40:44 neon icinga: Warning: The results of service 'Puppet freshness' on host 
'snapshot1' are stale by 0d 0h 0m 54s (threshold=0d 3h 0m 0s). I'm forcing an immediate check of the service. how is 54 seconds past the threshold?? [04:17:30] 24 06:12:53 < ori-l> wtf? [04:19:12] i'm glad my contribution to debugging this issue was not lost in the abyss of time, jeremyb [04:20:09] ori-l: took a few secs for me to realize the full impact of paravoid's ignore rule [04:21:41] Reedy has also commented on this issue [04:23:10] was it similarly helpful? [04:25:53] define command{ [04:25:53] command_name puppet-FAIL [04:25:53] command_line echo "No successful Puppet run for $SERVICEDURATION$" && exit 2 [04:25:53] } [04:26:21] I'm not sure if this is how freshness is meant to be used [04:30:14] "If the check results is found to be stale, Icinga will force an active check of the host or service by executing the command specified by in the host or service definition." [04:30:57] right, it's the time since it hit critical [04:32:48] $LASTSERVICEOK$ is closer [04:33:10] http://nagios.sourceforge.net/docs/3_0/macrolist.html#lastserviceok [04:35:22] you mean http://docs.icinga.org/latest/en/macrolist.html#macrolist-lastserviceok [04:35:33] i better call my lawyer [04:35:48] rotfl [04:38:30] maybe the active check is scheduled, but not run for a while [04:38:41] and before the active check actually gets run, the passive result comes in [04:38:53] so it goes OK -> OK -> CRITICAL [05:10:13] ganglia broken: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [05:17:00] SAL has Faidon's "restarted gmond on ms-fe1001/2, both were stuck 6h ago and we lost all swift eqiad's metrics for that period" for 14:49 on Nov 29, which more or less lines up [05:17:46] 14:49 - 6h is 8:49, last legitimate update for that metric was 7:30 [05:45:12] were you going to fix the Puppet alert?
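Putting the pieces above together, the `puppet-FAIL` fake-check plus the proposed swap of `$SERVICEDURATION$` (time since the service went CRITICAL) for `$LASTSERVICEOK$` (timestamp of the last OK result), the rewritten command definition might look like this. This is only a sketch of what was being discussed, not the actual patch; the eventual change appears later in the log as Gerrit change 98476 ("Saner copy for Puppet freshness alerts"), whose contents are not shown here.

```
define command{
    command_name    puppet-FAIL
    # $LASTSERVICEOK$ is a timestamp, not a duration, so the message is
    # rephrased around "last successful run at" rather than "no run for".
    command_line    echo "Last successful Puppet run was at $LASTSERVICEOK$" && exit 2
}
```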
[05:45:34] if not, I'll just replace the macro with $LASTSERVICEOK$ and rephrase it so it makes sense [05:50:51] we should never do active puppet checks, so that's a problem [05:51:03] morning -ish [06:01:06] (03PS1) 10Springle: depool pc1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98470 [06:02:21] PROBLEM - Puppet freshness on pc1003 is CRITICAL: No successful Puppet run for 9d 20h 44m 1s [06:03:17] (03CR) 10Springle: [C: 032] depool pc1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98470 (owner: 10Springle) [06:04:34] !log springle synchronized wmf-config/db-eqiad.php 'depool pc1003 for upgrade' [06:04:47] Logged the message, Master [06:07:40] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 0d 21h 4m 47s [06:10:14] TimStarling: that would be you & me [06:10:31] buffer = 4194304 breaks ganglia on lucid hosts [06:10:38] Starting Ganglia Monitor Daemon: /etc/ganglia/gmond.conf:54: no such option 'buffer' [06:10:41] Parse error for '/etc/ganglia/gmond.conf' [06:11:01] right... [06:18:54] (03PS1) 10Faidon Liambotis: ganglia: fix config for lucid hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/98473 [06:21:19] (03CR) 10Faidon Liambotis: [V: 032] ganglia: fix config for lucid hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/98473 (owner: 10Faidon Liambotis) [06:21:26] (03CR) 10Faidon Liambotis: [C: 032] ganglia: fix config for lucid hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/98473 (owner: 10Faidon Liambotis) [06:22:18] TimStarling: did you read about the page cache issue? [06:22:43] yes, so that is fixed now? 
[06:22:52] not really [06:22:56] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=cp1052.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=mem_report&c=Text+caches+eqiad [06:23:00] so that's the control [06:23:04] still running with 3.2 [06:23:13] the kernel drops more and more cache as the days pass [06:23:27] but it still hasn't gone to the point where it drops everything all the time [06:23:34] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=cp1065.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=mem_report&c=Text+caches+eqiad [06:23:37] and that's 3.11 [06:24:05] that's a 3.2 that hasn't been rebooted: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=cp1055.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=mem_report&c=Text+caches+eqiad [06:24:09] it's fascinating [06:24:10] so we just need to upgrade the kernel on all the varnish servers? [06:24:32] well, yes [06:24:40] I'm just curious on why this happens [06:24:54] what makes it worsen as days pass [06:25:03] why does it happen on just these boxes, etc. 
well, you could do a git blame on the relevant kernel source files [06:25:39] http://article.gmane.org/gmane.linux.kernel.mm/99926 <- this is what I suspect [06:26:03] and in general, the whole patchset: http://thread.gmane.org/gmane.linux.kernel.mm/99921 [06:26:57] (03PS1) 10Springle: switch pc1003 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/98474 [06:26:59] I think domas would love this [06:27:56] (03CR) 10Springle: [C: 032] switch pc1003 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/98474 (owner: 10Springle) [06:28:36] !log fixed ganglia for misc eqiad (possibly others); see {{Gerrit|Icc5376505}} [06:28:49] Logged the message, Master [06:30:43] ah thank you for the gmond conf fix, otherwise I would be looking at that right now [06:41:31] (03PS1) 10Ori.livneh: Saner copy for Puppet freshness alerts [operations/puppet] - 10https://gerrit.wikimedia.org/r/98476 [06:48:04] argh [06:48:06] fuck you ganglia [06:49:53] (03PS1) 10Springle: repool pc1003 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98477 [06:50:19] springle: oh? switching more dbs to mariadb? [06:50:21] nice! [06:52:05] :) [06:52:20] (03CR) 10Springle: [C: 032] repool pc1003 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98477 (owner: 10Springle) [06:53:25] !log springle synchronized wmf-config/db-eqiad.php 'repool pc1003 after upgrade, max_connections lowered during warm up' [06:53:39] Logged the message, Master [06:58:27] no ganglia at all? [06:59:08] sigh [07:00:43] I restarted ganglia-monitor everywhere just now [07:01:36] There was an error collecting ganglia data (127.0.0.1:8654): XML error: Invalid document end at 1 [07:01:39] nice [07:05:17] ok, we're back [07:05:52] !log upgrade/reboot db1046 m2 slave [07:06:08] Logged the message, Master [07:07:15] springle: that db47 raid error, I couldn't find a ticket [07:07:53] because it's meant to be decommed any moment.
still waiting for amaranth to switch masters [07:09:04] ah [07:15:29] (03PS1) 10Springle: upgrade db1046 [operations/puppet] - 10https://gerrit.wikimedia.org/r/98479 [07:16:36] (03CR) 10Springle: [C: 032] upgrade db1046 [operations/puppet] - 10https://gerrit.wikimedia.org/r/98479 (owner: 10Springle) [07:39:24] PROBLEM - DPKG on db1023 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:40:24] RECOVERY - DPKG on db1023 is OK: All packages OK [08:24:40] paravoid: the three patchsets needs another round of applause after I did a rebase.. [08:36:20] paravoid, seems like we are on for deploying some of the stuff at 11 SF [08:38:48] i plan to push out updated the new version of zero that gets proxy configuration, deploy some config changes for META to have patrolled revisions (including db update for that) [08:39:28] okay [08:39:39] yurik: do you need anything from me? [08:40:05] paravoid, yep - once i get zero config stuff out, we can get the new landing page code in [08:40:17] cool [08:40:28] config stuff won't take long (i hope) [08:40:39] so can we work together around 11:30SF time? [08:41:32] (03Abandoned) 10Yurik: Added relative redirect workaround until its fixed ext [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97540 (owner: 10Yurik) [08:41:54] uhm, that's in 11 hours from now [08:42:12] yes, even before midday workday in SF :) [08:42:19] paravoid, if you want, i can deploy the stuff now :) [08:42:24] I'd rather to not work a 14h day unless absolutely necessary :) [08:42:55] deploy what? 
[08:43:50] paravoid, https://gerrit.wikimedia.org/r/#/c/97107/ for starters would be great [08:44:35] paravoid, followed by https://gerrit.wikimedia.org/r/#/c/97115/ [08:45:16] it won't enable things until varnish's patch https://gerrit.wikimedia.org/r/#/c/97122/ [08:45:55] the first one is V-2 [08:46:31] right, because it requires a file in /usr/local/apache/common/w/mobileredirect.php [08:47:49] I can do the vhost one now [08:47:57] the mediawiki-config... I'll defer to you and/or Reedy [08:47:59] vhost one shouldn't go out until this one [08:48:10] all is needed is a link to ../mobileredirect.php [08:48:18] i just don't have root on tin [08:48:26] why do you need root for this? [08:48:45] because i can't creat a link in commons i think [08:49:39] ln -s mobilelanding.php ../mobilelanding.php [08:49:53] I'm sorry, I don't understand [08:49:58] link from where to where? [08:50:20] a link from /usr/local/apache/common/w/mobileredirect.php to ../mobileredirect.php [08:50:38] sec [08:50:51] sorry, from w [08:51:08] oh yes, i said everything correctly [08:51:15] got confused for a sec [08:51:42] why don't you put mobileredirect inside /w/ like the others are? 
[08:52:04] because take a look - they are all links [08:52:07] look at extract2 [08:52:09] oh, hm, extract2 isn't [08:52:13] its designed exactly the same way [08:52:28] anyway, you can create symlinks if you want, the whole dir is g+rwx [08:52:29] which is what i have modeled it on [08:52:35] but I think we could benefit from another review [08:52:38] let's wait for Reedy [08:52:52] he's the expert in those things [08:53:11] paravoid, the code was approved by him, and we won't change varnish until he looks at it [08:53:16] again [08:53:37] I don't see a CR+1/2 in r97107 [08:53:56] paravoid, https://gerrit.wikimedia.org/r/#/c/96654/ [08:54:01] let's wait a few hours, it's going to be deployed before you wake up [08:54:17] csteipp +1 it, and reedy +2 it even though he had doubts [08:54:28] we won't deploy it via varnish [08:54:37] but we will be able to start testing that wget gets it [08:54:42] I'm aware of that [08:54:47] properly from tin [08:55:37] I'm grateful you want to see this deployed [08:55:38] as for the dir - incorrect, i can't (for some unknown reason) perform ln -s ../mobilelanding.php mobilelanding.php [08:55:53] ln: failed to create symbolic link `mobilelanding.php': Permission denied [08:56:10] current dir: yurik@tin:/usr/local/apache/common/w$ [08:56:19] wait an hour or three for sam to wake up [08:56:30] you do realize its 4am here :) [08:56:31] it's been like that for many months now [08:56:39] yes, go to sleep [08:56:47] by the time you wake up, it will be deployed [08:56:51] lol [08:56:56] ok, sounds good [08:57:29] again, let's do this, I'd love that, but let's not be _that_ impatient ;-) [08:57:36] just don't do varnish just yet [08:57:39] yep yep [08:57:53] I'll wait for you to do varnish [08:57:58] ok [08:58:02] I'll wait for you before I do varnish that is [08:58:18] ok damn, i thought you were giving me +2 on puppets [08:58:24] :P [08:59:38] paravoid, and soon (maybe) zero landing will look like this: 
http://api.beta.wmflabs.org/w/index.php?title=Special:ZeroRatedMobileAccess&X-CS=250-99&useformat=mobile [09:07:42] i wonder if ori-l is around... [09:08:12] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 1d 0h 5m 19s [09:08:30] even he is and maybe even working, it's Sunday after midnight [09:13:15] Wouldn't that be monday then >.> [10:28:10] (03CR) 10Faidon Liambotis: [C: 032] tcpircbot: tabs to spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/98455 (owner: 10Matanya) [10:31:29] (03CR) 10Faidon Liambotis: [C: 032] varnish: whitespace & lint cleanups [operations/puppet] - 10https://gerrit.wikimedia.org/r/97910 (owner: 10Matanya) [10:35:55] (03PS1) 10Faidon Liambotis: varnish: lint fixups [operations/puppet] - 10https://gerrit.wikimedia.org/r/98493 [10:36:20] (03CR) 10Faidon Liambotis: [C: 032] varnish: lint fixups [operations/puppet] - 10https://gerrit.wikimedia.org/r/98493 (owner: 10Faidon Liambotis) [11:50:09] (03PS1) 10Springle: unbreak puppet run on pc[123] [operations/puppet] - 10https://gerrit.wikimedia.org/r/98498 [11:51:56] (03CR) 10Springle: [C: 032] unbreak puppet run on pc[123] [operations/puppet] - 10https://gerrit.wikimedia.org/r/98498 (owner: 10Springle) [11:53:53] (03PS1) 10QChris: Backup geowiki's data-private bare repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/98499 [11:54:31] (03CR) 10QChris: "This is less invasive variant of" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98499 (owner: 10QChris) [11:55:36] (03CR) 10QChris: "Trying again to get geowiki backup working in:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97021 (owner: 10Faidon Liambotis) [12:08:17] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 1d 3h 5m 24s [12:35:07] PROBLEM - check_mysql on db1008 is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [12:35:49] !log stopped mysql on db1008 to clone a database [12:35:55] hey Jeff_Green [12:36:01] hey paravoid 
[12:36:02] Logged the message, Master [12:36:04] early isn't it [12:36:24] ~7:30AM my time yeah [12:36:42] do you know what's the status with rhodium? [12:36:49] there's a nagios alert [12:36:53] Offline Content Generation - Collection [12:36:55] CRITICAL [12:37:09] hmm. no. looking [12:38:45] garg. local network issue. back in a sec. [12:40:07] PROBLEM - check_mysql on db1008 is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [12:40:25] (03CR) 10Matthias Mullie: Enable Flow discussions on a few test wiki pages (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94106 (owner: 10Spage) [12:44:49] ACKNOWLEDGEMENT - check_mysql on db1008 is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) Jeff_Green cloning a db [12:47:07] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 6d 7h 8m 8s [12:52:51] paravoid: I disabled that rhodium icinga check for now [12:54:56] it's checking for http on tcp 17080, but the relevant daemon isn't fully configured yet. [13:05:23] RECOVERY - Puppet freshness on rhodium is OK: puppet ran at Mon Dec 2 13:05:22 UTC 2013 [13:17:18] (03PS1) 10Odder: Raise $wgRateLimit for rollback for editors on dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98510 [13:30:04] RECOVERY - check_mysql on db1008 is OK: Uptime: 2373297 Threads: 2 Questions: 11765132 Slow queries: 13730 Opens: 34041 Flush tables: 2 Open tables: 64 Queries per second avg: 4.957 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [13:43:07] (03PS2) 10Aude: Fix Wikibase noc symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97903 [13:43:08] (03PS7) 10Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 [14:09:18] Reedy: around? 
[14:22:11] (03PS1) 10Krinkle: admins.pp: Update SSH pub key for user 'krinkle' [operations/puppet] - 10https://gerrit.wikimedia.org/r/98518 [14:23:47] (03CR) 10Faidon Liambotis: [C: 04-1] "You need to ensure => absent the old key (like your MB 2011 key is). Keys that don't exist in the puppet manifests are, unfortunately, not" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98518 (owner: 10Krinkle) [14:24:56] (03PS2) 10Krinkle: admins.pp: Update SSH pub key for user 'krinkle' [operations/puppet] - 10https://gerrit.wikimedia.org/r/98518 [14:26:28] (03PS3) 10Krinkle: admins.pp: Update SSH pub key for user 'krinkle' [operations/puppet] - 10https://gerrit.wikimedia.org/r/98518 [14:28:34] (03CR) 10Faidon Liambotis: [C: 032] "Verified via video call" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98518 (owner: 10Krinkle) [14:30:46] (03PS1) 10Akosiaris: Centralize puppet reports and file buckets [operations/puppet] - 10https://gerrit.wikimedia.org/r/98519 [14:33:43] (03PS8) 10Aude: Enable Wikidata build on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 [14:36:02] (03CR) 10Akosiaris: [C: 032] Centralize puppet reports and file buckets [operations/puppet] - 10https://gerrit.wikimedia.org/r/98519 (owner: 10Akosiaris) [14:41:09] (03CR) 10Aude: "to the extent that I am able to test this, it produces a proper ExtensionMessages file for both production and labs realm." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 (owner: 10Aude) [14:44:57] (03PS9) 10Aude: Enable Wikidata build on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 [14:56:00] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:56:24] Coren: ^^^ [14:56:57] paravoid: Hm. Should place it in maintenance. [14:57:15] (awaiting cmjohnson1 moving stuff around in meatspace) [14:57:34] Coren: can you review https://gerrit.wikimedia.org/r/#/c/98307/ ? 
Coren: also, can you comment on https://gerrit.wikimedia.org/r/#/c/84288/ ? [14:58:16] * Coren will look at both. [14:58:48] thanks :) [15:00:48] (03CR) 10Raimond Spekking: [C: 031] Raise $wgRateLimit for rollback for editors on dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98510 (owner: 10Odder) [15:01:19] (03CR) 10coren: [C: 04-1] "I suppose that, strictly speaking, it's a -1. It seems a little silly to me to iterate over a one-liner change that needs to be done some" [operations/puppet] - 10https://gerrit.wikimedia.org/r/84288 (owner: 10DrTrigon) [15:04:19] coren: I am free to move the disk shelves anytime ? [15:05:42] cmjohnson1: Should be; I was about to check all four to make sure they were powered off. [15:06:09] And they do seem to be. [15:06:56] coren: labstore1001 and 1002...can they be taken down at all? [15:07:15] I will have to relocate 1 of them [15:08:11] I will have to relocate labstore1001 and its arrays to C3 [15:08:42] cmjohnson1: They're powered down now; you can play with them to your heart's content. 1001 and 1002 need to be together with the shelves though (for obvious reason); but 1003 and 1004 can be split apart if you want to. [15:08:59] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 1d 6h 6m 6s [15:09:10] In fact, since one will be a slave of the other, it might even be best if they were in separate rows (but not necessary). [15:09:25] coren we ditched labstore1003 and 4 for labsdb1004 and 5.... [15:11:45] coren: to confirm...labstore1001 will move racks from C2 to C3 so I have space to add the disk shelf for both 1001 and 1002 [15:12:12] cmjohnson1: Yes, yes, sorry I was referring to their old names. :-) [15:12:27] cmjohnson1: Yes, that sounds good to me. [15:14:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [15:15:35] (03CR) 10coren: [C: 031] "Looks okay to me; or at least it looks like it's trying to do the same thing."
[operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 (owner: 10Faidon Liambotis) [15:19:04] Coren: that doesn't sound very confident to me :) [15:19:35] should we wait for Ryan? [15:20:18] this is in the labs domain, it's likely I won't be able to do much more about it (and that work I did, I did on a weekend) [15:20:44] paravoid: No, I'm cool about the patch proper -- it's doing the right thing. Like I said, my concern is that you're doing the same thing puppet is but it's not guaranteed that puppet has all the live config. [15:55:58] (03CR) 10Hashar: "Andrew, I am wondering whether that has broken the cron job that generate the public keys on labstore1 https://bugzilla.wikimedia.org/show" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98030 (owner: 10Andrew Bogott) [16:21:50] <^d> Hmm, wonder why arsenic doesn't have an /a/common/ [16:21:52] <^d> Curious [16:28:27] (03CR) 10Akosiaris: "Minor nitpicks and one serious comment. $::site is global scope AFAIK. There are however some rules I can not find matching ferm rules for" (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 (owner: 10Faidon Liambotis) [16:29:17] akosiaris: the ldap::server iptables rules are not applied anywhere [16:29:22] PROBLEM - Varnish HTTP text-backend on cp1065 is CRITICAL: Connection timed out [16:29:25] I just killed them while I was at it [16:29:43] [245204.819597] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) [16:29:46] [245206.762253] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) [16:29:48] (03CR) 10Ottomata: [C: 032 V: 032] Initial Debian packaging [operations/debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/97848 (owner: 10Ottomata) [16:29:49] [245208.768831] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) [16:29:52] PROBLEM - Varnish traffic logger on cp1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:29:52] PROBLEM - Varnish HTCP daemon on cp1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:29:52] grumble grumble [16:29:55] and that's with 3.11 [16:30:07] fuck you, XFS [16:30:15] paravoid: ah ok then :-) [16:30:29] bblack: that's another bug [16:31:04] :) [16:32:52] PROBLEM - Host cp1065 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:31] !log rebooting cp1065, usual XFS deadlock [16:33:32] RECOVERY - Host cp1065 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:33:45] Logged the message, Master [16:38:25] Coren: hey, what's up with labs stuff? Specifically: the de db replication thingy and andre__ mentioned accessing his bugzilla test instance is intermittent (right andre?) [16:38:34] I guess I should ask over in labs... [16:38:43] greg-g: Here works too. :-) [16:55:58] (03PS3) 10Yuvipanda: toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 [16:56:22] Coren: this Patchset should silence jenkins [16:56:59] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 (owner: 10Yuvipanda) [16:59:38] ^d and ottomata: switching conversation about Cirrus deploy here because ottomata is here [16:59:46] <^d> manybubbles: Ok, yeah so I'll take care of sync'ing the Cirrus files. [16:59:48] coool [16:59:48] (03PS4) 10Yuvipanda: toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 [17:00:06] ^d and ottomata: sweet. [17:00:13] I'll watch the warnings and run the rebuild [17:00:24] ottomata: we're on terbium because arsenic is busted [17:00:37] and we're not going to be as mean to terbium as we were in the past [17:00:42] ha, ok [17:00:45] so maybe we don't really need arsenic any more any way [17:00:50] so, step one is syncing some mediawiki stuff?
[17:00:52] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 (owner: 10Yuvipanda) [17:00:57] what's that for? just config changes? [17:01:04] stupid, stupid jenkins! [17:01:21] <^d> ottomata: Bunch of changes in Cirrus' master we want live. [17:01:30] ottomata: yeah, that [17:01:50] we need to sync a week and a half of work to make Cirrus less whiny to users [17:02:11] now it'll just whine to the logs if something is wrong [17:02:23] (03CR) 10Chad: [C: 032] Show "using new search engine" when using Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97939 (owner: 10Manybubbles) [17:02:43] ^d: +1 for "new" inflation [17:02:43] ah ok [17:02:48] cool [17:03:00] also, we can build the index on the job queue. [17:03:14] and we count links from Elasticsearch rather than the db [17:03:27] which should stop cirrus from doing its only long running query [17:03:31] (03PS5) 10Yuvipanda: toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 [17:04:20] (03Merged) 10jenkins-bot: Show "using new search engine" when using Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97939 (owner: 10Manybubbles) [17:06:11] !log demon synchronized wmf-config/CirrusSearch-common.php [17:06:27] Logged the message, Master [17:07:37] !log demon synchronized php-1.23wmf4/extensions/CirrusSearch 'Cirrus to master' [17:07:53] Logged the message, Master [17:08:09] (03PS6) 10Jforrester: Enable VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 [17:08:54] (03PS1) 10Chad: Turn Cirrus back on secondary for all wikis that had it before [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98543 [17:09:17] !log demon synchronized php-1.23wmf5/extensions/CirrusSearch 'Cirrus to master' [17:09:33] Logged the 
message, Master [17:10:11] ^d and ottomata: looks like I'm good to go for rebuilding test2wiki? [17:10:38] <^d> Yeah, you should be set now for test2wiki. [17:10:55] <^d> And when it's set, we'll merge 98543 ^ [17:10:56] akosiaris: I put yer name there, be warned [17:11:02] (in topic for rt duty) [17:11:06] =] [17:11:24] RobH: ok :-) [17:14:15] (03CR) 10Aude: "not quite sure WikipediaMobileFirefoxOS is supposed to be changed, or at least it needs more explanation." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98543 (owner: 10Chad) [17:14:33] ^d: [17:15:09] is cirrus being used in mobile firefox app? [17:16:11] ^d: rebuilt - testing [17:16:33] <^d> aude: I'm assuming that uses the API, right? [17:16:40] no idea [17:16:46] i just saw it in your patch [17:16:53] might be a rebase gone bad [17:16:56] <^d> mobile firefox? [17:17:00] yeah [17:17:25] ^d: everything looks good [17:17:29] <^d> yay [17:17:33] <^d> I'll merge my other thing now [17:17:49] (03CR) 10Chad: [C: 032] Turn Cirrus back on secondary for all wikis that had it before [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98543 (owner: 10Chad) [17:17:58] (03Merged) 10jenkins-bot: Turn Cirrus back on secondary for all wikis that had it before [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98543 (owner: 10Chad) [17:18:05] Fyay [17:18:06] -F [17:18:33] greg-g: -F means force? [17:18:59] <^d> yay -F [17:19:58] yeah, what is the deal with mobilefirefox os? [17:20:03] ^d: ^^ [17:20:16] well, jobs are running now [17:20:28] <^d> I don't know anything about mobile firefox :p [17:20:40] it was part of your commit somehow! [17:20:49] you changed a firefox submodule [17:20:53] <^d> Oh dammit! [17:20:59] * aude wonder if its part of a rebase [17:21:03] <^d> I blame the mobile team! 
:p [17:21:06] !log demon synchronized wmf-config/InitialiseSettings.php 'Cirrus on all the wikis (that had it before)' [17:21:21] Logged the message, Master [17:21:27] <^d> I didn't sync it, will fix. [17:21:35] k [17:21:58] anyway, yay to have cirrus enabled again! [17:22:40] (03PS1) 10Chad: Fix submodule reference change that snuck into If5b3a27a [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98545 [17:22:59] cool -F [17:23:35] manybubbles: we are about to do step 6? [17:23:39] 5.  Merge restore all wikis that had Cirrus before to secondaries. [17:23:39] 6.  Sync that. [17:23:40] <^d> git diff HEAD..HEAD~2 looks good now. [17:23:59] (03CR) 10Chad: [C: 032 V: 032] Fix submodule reference change that snuck into If5b3a27a [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98545 (owner: 10Chad) [17:24:10] ^d: starting step 7, actually [17:24:17] looks like some of the update jobs are hanging [17:25:14] ^d: can you look at the pool counter and see if we're hanging on it? [17:26:41] <^d> Hahahaha! [17:26:47] <^d> Poolcounter log uses localized messages. [17:26:57] nice! [17:27:02] <^d> 2013-12-02 17:26:51 mw1080 cswiki: Při čekání na zámek vypršel časový limit [17:27:04] ha [17:27:17] <^d> 2013-12-02 17:22:36 mw1216 eowiki: Tempolimo atingita dum atendo de ŝlosado [17:27:37] yeah, I wonder why we're taking so long.... [17:27:46] ^d: can you disable search updates for now and see if it clears up? [17:30:34] ottomata: I can't find elastic10XX in ganglia any more.... 
[17:30:46] !log demon updated /a/common to {{Gerrit|I507a72cca}}: Fix submodule reference change that snuck into If5b3a27a [17:31:02] Logged the message, Master [17:31:11] !log LocalisationUpdate completed (1.23wmf4) at Mon Dec 2 17:31:11 UTC 2013 [17:31:14] hmmm [17:31:26] Logged the message, Master [17:31:27] !log demon synchronized wmf-config/CommonSettings.php 'Search update off for cirrus wikis' [17:31:31] <^d> manybubbles: ^ [17:31:40] yeah ung [17:31:42] Logged the message, Master [17:31:46] manybubbles: sometimes that happens to me in ganglia [17:31:48] !log cp301[12].esams - puppet temporarily disabled, custom crash handler vmod in place to try to catch an error in the next couple of hours [17:31:51] the hosts don't show up in search [17:31:51] hm [17:32:09] Logged the message, Master [17:32:18] ^d: thanks. I'm wondering if this is a side effect of trying to count links without having the schema built for it yet. [17:32:24] let me see about flushing the job queue for those wikis [17:32:29] <^d> Possibly. [17:32:50] hmm, ah I don't even see the elasticsearch cluster option anymore [17:32:51] sigh [17:33:13] aghhh analytics cluster ganglia is messed up too [17:33:14] sighhhh [17:36:04] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1308: active_shards: 3636: relocating_shards: 2: initializing_shards: 4: unassigned_shards: 8 [17:36:04] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1308: active_shards: 3636: relocating_shards: 2: initializing_shards: 4: unassigned_shards: 8 [17:36:13] uh [17:36:35] grrrr! 
[17:36:38] why you critical [17:37:03] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1312: active_shards: 3664: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [17:37:04] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1312: active_shards: 3664: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [17:37:04] almost too verbose for me [17:37:08] too verbose [17:37:11] surely [17:37:14] status: red to green is all I got [17:37:14] on the list to fix [17:37:19] verbosity without clarity [17:37:29] * greg-g nods [17:37:31] critical because we're adding lots of new shards faster than it can allocate them [17:38:46] manybubbles: that's cool, at least [17:38:51] oh, manybubbles... [17:38:56] did we remove the elastic* nodes from site.pp? [17:39:03] i don't see them in my production HEAD... [17:39:29] I don't remember doing that [17:39:36] ah there they are [17:39:38] just testsearch [17:39:40] ok dunno what was up with that [17:39:45] weird, editor cache, dunno [17:39:46] nm [17:41:34] manybubbles: weird [17:41:35] https://gist.github.com/ottomata/7753319 [17:41:37] looking into it... [17:47:21] huh, hm [17:47:25] sysctl values aren't set right on this node [17:47:26] hmmm [17:47:29] !log LocalisationUpdate completed (1.23wmf5) at Mon Dec 2 17:47:29 UTC 2013 [17:47:43] Logged the message, Master [17:47:57] weird [17:48:15] ^d: so now that search update is off all the cirrus jobs are noops [17:48:18] which is cool [17:48:26] but the ones that started are just stuck [17:48:30] can we kill them? [17:48:48] or get a stack trace and then kill them? [17:49:40] <^d> Are they the only cirrus jobs? 
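[editor's note] The PROBLEM/RECOVERY alerts above dump every field of Elasticsearch's `_cluster/health` response, which is the "verbosity without clarity" being complained about. A minimal Python sketch of the triage the check is doing: the field names match the real `_cluster/health` API, but `classify()` and its one-line summary format are invented here for illustration.

```python
# Field names below match Elasticsearch's real _cluster/health response;
# classify() and its one-line summary format are invented for illustration.

def classify(health: dict) -> tuple:
    """Map a _cluster/health body to a (nagios_state, short_summary) pair."""
    state = {"green": "OK", "yellow": "WARNING", "red": "CRITICAL"}.get(
        health.get("status"), "UNKNOWN")
    # Lead with the status; only surface the shard counters that explain
    # a non-green state, instead of dumping every field into the alert.
    movers = ("relocating_shards", "initializing_shards", "unassigned_shards")
    details = ", ".join("%s=%d" % (k, health[k]) for k in movers if health.get(k))
    summary = health.get("status", "unknown") + (" (%s)" % details if details else "")
    return state, summary

# The red cluster from the 17:36 alert above:
state, summary = classify({
    "status": "red", "number_of_nodes": 12, "number_of_data_nodes": 12,
    "active_primary_shards": 1308, "active_shards": 3636,
    "relocating_shards": 2, "initializing_shards": 4, "unassigned_shards": 8,
})
```

As manybubbles explains below, the red status during the rebuild just means shards were being created faster than they could be allocated, so the same data point recovers to green a minute later.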
<^d> If so, we could do it with a one-liner in eval.php I'm sure. [17:50:21] better to get a stack trace so I have something to go on, but if we can't then yeah, kill kill [17:51:54] <^d> hmm [17:53:01] at this point I'm trying to figure out what got these queries stuck [17:53:16] and I'm grasping at straws [17:53:34] so we should probably roll back, but I want something to work with or I'll have little hope to fix this [17:56:04] I suppose we're safe as is but it isn't a good place to be with jobs just doing nothing [17:56:39] <^d> How rebuilt are we? [17:57:39] manybubbles: I ran puppet on a couple of elastic nodes [17:57:43] aaand now, ganglia is back [17:57:55] had something to do with procps needing a kick to pick up sysctl values [17:57:57] dunno... [17:58:16] ottomata: k. [17:58:30] ^d: I can't rebuild without jobs and we've made them all noops [17:58:44] and the queue is stuck on those jobs that are, well, stuck [17:58:50] I thought we timed out our jobs [17:59:19] <^d> They do. [17:59:37] <^d> I'm going to turn updates back on. [17:59:46] <^d> Actually, going to flush all jobs first. [17:59:52] <^d> Then turn them back on. [18:00:19] k [18:00:23] give it a shot [18:00:35] let me know when you've flushed [18:04:04] ottomata: you installed logster on cp3003, right? [18:04:28] ottomata: logster pulled logcheck (as it's a dependency) and that produces cronspam via a cronjob [18:04:35] ottomata: so... disable? :-) [18:05:26] (03CR) 10coren: [C: 032] "Looks like that'll do the trick." [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 (owner: 10Yuvipanda) [18:05:48] paravoid, yeah i'm making sure I understand where to put the varnishkafkalogster parser via puppet [18:05:58] looking... [18:06:01] <^d> manybubbles: On which wikis had you done things yet? [18:06:03] <^d> All of them?
^d: I've --startOver'ed all of their indices [18:06:48] so _if_ my theory is correct then this should be ok [18:07:05] if it isn't we should pull the plug again [18:07:33] :( [18:07:47] I should stop bragging about you guys, I'm jinxing you. [18:07:54] wha, so paravoid, logcheck gets installed and automatically starts sending emails? [18:08:02] I'll be happy if we can figure out what is going on, though [18:08:10] ottomata: yes [18:08:15] psh, ok [18:09:05] iiiii think that is the wrong dependency.... [18:09:06] no? [18:09:28] i think I should have put logtail [18:09:33] Replaces: logcheck (<= 1.1.1-9), logcheck (<= 1.1.1-9) [18:09:46] hmm [18:09:47] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 1d 9h 6m 54s [18:09:58] hm no [18:10:06] i guess that's right according to etsy/logster [18:10:07] readme [18:11:13] hm. [18:11:22] will look into that paravoid, I don't think it needs 'logcheck' i think it needs 'logtail' [18:11:45] okay [18:12:16] anyway, in the meantime i've uninstalled [18:12:25] thanks [18:13:13] ^d: hmmm what happens if a job is killed and it has a pool counter lock? [18:13:23] <^d> Not a clue. [18:13:29] I mean, it can't free it. [18:13:31] crap [18:13:44] let me go look. [18:13:50] it might just eat the lock [18:13:52] like forever [18:14:37] <^d> I just finished my 1-liner to drop all cirrus jobs. [18:14:52] <^d> foreach( $myWikis as $w ) { JobQueueGroup::singleton( $w )->get( 'cirrusSearchDeletePages' )->delete(); JobQueueGroup::singleton( $w )->get( 'cirrusSearchLinksUpdate' )->delete(); JobQueueGroup::singleton( $w )->get( 'cirrusSearchUpdatePages' )->delete(); } [18:15:22] <^d> With $myWikis being all our wikis.
k [18:16:01] all the stuck "claimed" jobs are still stuck [18:19:35] <^d> Oh, those aren't claimed :\ [18:21:13] ^d: looks like when a connection is terminated pool counter will drop all locks held by the client [18:21:15] which is good [18:21:21] so it is safe to use in the context [18:21:23] thank god [18:21:39] ^d: the jobs are gone! [18:21:43] <^d> :) [18:21:50] what did you do, sorcerer? [18:23:01] <^d> That one-liner :) [18:23:10] it just took a while to do it? [18:23:16] <^d> No. [18:23:35] <^d> I just was reading more docs before I did. [18:23:38] huh. so it went from "cirrusSearchLinksUpdate: 144 queued; 124 claimed (124 active, 0 abandoned)" to none/just the waiting ones [18:23:40] ah [18:23:52] so, if you are comfortable, lets try turning it back on [18:24:01] and send me the docs:) [18:24:21] <^d> JobQueue / JobQueueGroup classes in mediawiki :) [18:24:32] <^d> Source is the best docs ;-) [18:25:21] yeah [18:25:35] why I was reading the c in pool counter.... [18:26:07] so, want to try turning it back on? [18:26:47] we have 35 minutes left. I vote if we can get it back on in ten minutes and everything is still good in 10 minutes we leave it on, watch, and I investigate potential hanging [18:27:01] 35 mins of deploy window left? [18:27:08] ottomata: yeah [18:27:09] yeah [18:27:15] agree with your analysis, man [18:27:17] manybubbles: [18:27:19] we can build the index outside the window, but we need to get out of the way [18:27:25] before the end of the window [18:27:25] aye [18:27:32] no more syncing files unless we have an emergency [18:27:42] and it is secondary everywhere, so its ok if the indexes aren't built yet? [18:27:46] yeah [18:27:48] aye k [18:28:00] they won't be built for a while. not sure how long yet, but a while [18:28:09] once I get to run the job queue on it we'll know. [18:28:26] <^d> I'll remove my livehack and turn updates back on. [18:28:33] so there's no actual syncing left to do, right?
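[editor's note] The property manybubbles verified in the PoolCounter source — every lock a client holds is dropped when its connection terminates, so a killed job cannot leak a lock forever — can be modelled in a few lines. This is a toy Python sketch of that behaviour, not the real C daemon, and the client/lock names are made up.

```python
class LockServer:
    """Toy model of a PoolCounter-style lock daemon: every lock is tied to
    the client connection that acquired it."""

    def __init__(self):
        self.locks = {}  # lock key -> client id holding it

    def acquire(self, client, key):
        if key in self.locks:
            return False  # held elsewhere; the real daemon queues/limits instead
        self.locks[key] = client
        return True

    def release(self, client, key):
        if self.locks.get(key) == client:
            del self.locks[key]

    def on_disconnect(self, client):
        # The crucial property: a terminated connection (e.g. a killed
        # runJobs process) automatically releases everything it held.
        for key in [k for k, c in self.locks.items() if c == client]:
            del self.locks[key]

server = LockServer()
server.acquire("runJobs-4242", "cirrusSearchLinksUpdate:itwiki")
server.on_disconnect("runJobs-4242")  # job killed mid-run; lock not leaked
```

Without the `on_disconnect` hook, the "it might just eat the lock, like forever" scenario above would be exactly what happens.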
[18:28:54] you're just making sure things are good as is, just in case something isn't, so we can roll back in our window if we have to? [18:29:15] ottomata: ^d has to remove his uncommitted hack that he used to turn us off [18:29:22] yeah [18:29:31] I want to roll back within our window so I don't bother whoever is next [18:29:35] <^d> Well, I committed to tin :p [18:29:42] <^d> Just didn't bother pushing to gerrit. [18:30:03] !log demon synchronized wmf-config/CommonSettings.php 'Search updates back on for Cirrus' [18:30:18] Logged the message, Master [18:31:48] <^d> brb in 5-10mins. If you have to emergency turn it off again before I'm back just comment the "$wgDisableSearchUpdate = false" on line 874 in CommonSettings again [18:32:00] yup [18:32:33] (03PS1) 10Ottomata: Depending on logtail package, not logcheck [operations/debs/logster] (debian) - 10https://gerrit.wikimedia.org/r/98565 [18:32:57] (03CR) 10Ottomata: [C: 032 V: 032] Depending on logtail package, not logcheck [operations/debs/logster] (debian) - 10https://gerrit.wikimedia.org/r/98565 (owner: 10Ottomata) [18:34:45] !log added logster deb, installed on cp3003 for testing, will puppetize shortly [18:35:00] Logged the message, Master [18:35:42] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Dec 2 18:35:42 UTC 2013 [18:35:57] Logged the message, Master [18:41:57] <^d> back [18:45:51] ^d: stuff is getting stuck again, I think [18:46:10] so rollback is in order - or just leave the search updates off [18:46:13] <^d> Grr, why, I wonder. [18:46:18] I haven't a clue. [18:46:27] happening on itwiki but not mw.org [18:46:39] <^d> Stuck, how? What sort of symptoms are you seeing? [18:46:40] guh [18:46:56] ^d: cirrusSearchLinksUpdate: 510 queued; 281 claimed (281 active, 0 abandoned) [18:47:01] just sitting there forever [18:47:29] <^d> Just itwiki? [18:48:04] coren: do you want me to set the raid cfg in labstore1001 and 1002? 
if you do add it to the ticket plz [18:49:16] <^d> manybubbles: Maybe AaronSchulz can help us :) [18:49:21] <^d> Since its jobqueue [18:49:46] <^d> (Or we could index not using the jobqueue, for wikis that aren't huge) [18:50:08] ^d: we could index them with search update off [18:50:28] ^d: so I can't tell exactly, but it looks like it really is only working on a few wikis [18:50:32] <^d> Let's do that. [18:50:49] :( [18:50:56] cmjohnson1: I can do it myself if you're busy; it's no harder for me. [18:51:30] ^d: can you turn the updates back off? we're safe if we stay like that [18:51:33] nothing gets stuck [18:51:49] <^d> Doing it now. [18:52:07] paravoid, still awake? :) [18:52:13] yes [18:52:14] coren, I am in there anyway, i can do it [18:52:21] (03PS1) 10Chad: Disable search updates for Cirrus wikis for the time being [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98571 [18:52:33] (03CR) 10Chad: [C: 032 V: 032] Disable search updates for Cirrus wikis for the time being [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98571 (owner: 10Chad) [18:52:48] Reedy: here now? :) [18:52:56] ^d: [02-Dec-2013 18:20:00] Fatal error: Call to undefined method CirrusSearchConnection::setTimeout() at /usr/local/apache/common-local/php-1.23wmf4/extensions/CirrusSearch/includes/CirrusSearchUpdater.php on line 136 [18:53:02] wasn't appearing in my fatal monitor [18:53:07] but that is it, I'm sure [18:53:17] !log demon synchronized wmf-config/CommonSettings.php 'Search updates back off for Cirrus wikis :(' [18:53:20] cmjohnson1: I need JBOD on that; in practice, it means raid 0 of just one disk 48 times. You *sure* you don't want me to do it from the command line instead? :-) [18:53:33] Logged the message, Master [18:53:34] <^d> manybubbles: We needed to update Elastica, yes? [18:53:39] looks like it [18:53:42] <^d> fml. [18:53:44] anyone wants to hold my hand while i update database? 
[18:53:47] but we're over budget [18:53:58] coren: ah yeah ...go ahead that would prolly be better :-P [18:53:58] so maybe rollback and start again in another clear slot [18:53:59] paravoid: Yeaah [18:54:20] greg-g: we think we've found something but it is time to roll back [18:54:21] finally :) [18:54:22] cmjohnson1: I mean, just sayin'. :-) [18:54:31] manybubbles: :( [18:54:35] I mean :) but :( [18:54:53] Reedy: can you review (or even deploy?) https://gerrit.wikimedia.org/r/#/c/97107/ ? [18:54:53] cmjohnson1: As long as you see all four shelves from the BIOS I'm golden. [18:55:07] yurik needs it and I wanted your opinion [18:55:19] paravoid needs it too! [18:56:08] RobH: Have time today to coach me re: renaming analytics servers? [18:56:34] greg-g, as part of today's depl, i need to change meta's db schema to allow for flagged revs extension. Do i need to get db ops involved? [18:56:38] <^d> manybubbles: Ah, it's just 1.23wmf4 wikis. [18:56:43] <^d> wmf5 is on master already. [18:56:48] ^d: magic [18:56:51] <^d> Easily fixed. [18:56:56] <^d> That explains which ones work. [18:56:57] coren: labstore1001 sees the 2 shelf attached & 1002 sees it's 2 shelves. I moved 1001 to rack C3. Also fixed network cfg [18:57:25] yurik: yeah [18:57:34] yurik: that's be springle, who isn't online yet [18:57:38] cmjohnson1: Wait, no, all four should be daisy chained, with 1001 on one controller and 1002 on the other. [18:58:04] (03CR) 10Reedy: [C: 04-1] "Target of mobileredirect.php symlink doesn't exist" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [18:58:07] greg-g, what tz is he in? [18:58:08] paravoid: Not when it's broken [18:58:14] yurik: australia [18:58:57] Reedy, this is modeled on extract2 [18:59:00] yurik: Sure [18:59:02] But the file doesn't exist [18:59:09] but i can't create a link! [18:59:12] cmjohnson1: Damn, did you just move 1001 *away* from 1002? 
I thought you just moved them /together/ [18:59:16] reedy@tin:/a/common/docroot/wwwportal/w$ ls -al /usr/local/apache/common/w/mobileredirect.php [18:59:16] ls: cannot access /usr/local/apache/common/w/mobileredirect.php: No such file or directory [18:59:31] Reedy, of course ! [18:59:35] take a look at that dir [18:59:37] No [18:59:39] I don't need to [18:59:39] it has extract2 [18:59:43] I know full well [18:59:50] extract2 is a link to ..\extract2.php [18:59:52] Yes [18:59:53] I know full well [18:59:56] i can't create that link [18:59:57] You need to add it to the mediawik-config repo [18:59:58] yurik: https://wikitech.wikimedia.org/wiki/Schema_changes [18:59:59] review [19:00:00] merge [19:00:02] git pull onto tin [19:00:04] sync-docroot [19:00:13] coren: yeah..that is what I said this morning. I didn't realize you need all together. that's 12U together. lemme see if I have the space [19:00:25] paravoid: may I re-enable cross wiki banner hiding in CentralNotice? [19:00:27] I am not sure if I have cables long enough to daisy chain of that together either [19:00:39] (03Abandoned) 10MaxSem: Serialize special page updates [operations/puppet] - 10https://gerrit.wikimedia.org/r/95876 (owner: 10MaxSem) [19:00:46] mwalker: is the caching fix deployed? [19:00:54] mwalker: and did you run it by ori-l? :) [19:01:01] !log demon synchronized php-1.23wmf4/extensions/Elastica 'Fix missing code in Elastica on 1.23wmf4 wikis' [19:01:02] (03PS1) 10Manybubbles: Turn cirrus back off [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98575 [19:01:02] caching fix is deployed [19:01:05] reedy@tin:/usr/local/apache/common/w$ sudo -u mwdeploy touch reedytest.php [19:01:05] reedy@tin:/usr/local/apache/common/w$ [19:01:05] <^d> manybubbles: Should be fixed now ^ [19:01:08] cmjohnson1: If you lack the cables, we can add shelves at some undefined point in the future so long as I have at least two now. 
[19:01:09] That also works [19:01:13] But is the wrong way to do it [19:01:16] Logged the message, Master [19:01:18] ah [19:01:56] <^d> So, I think our fatal's fixed, and we're still with search updates off. [19:01:59] <^d> Which is safe :) [19:02:06] cmjohnson1: Yeah, I obviously misunderstood what you meant when you told me you had to move 1001 to have room, I expected you meant room to put them and the shelves together. [19:02:36] (03Abandoned) 10Manybubbles: Turn cirrus back off [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98575 (owner: 10Manybubbles) [19:03:22] drawback to not actually "talking" it's nbd, I can fit in C3, they're going to have to go high on the rack which sux but it's the only way [19:04:46] greg-g and ^d and ottomata: so, yeah, that should be it. looks like fatalmonitor was filtering out my fatals..... [19:04:55] huh [19:05:14] ori-l: about 10 days ago paravoid disabled cross wiki banner hiding because I wasn't caching the hide requests for some reason; and because I was slamming the mobile cache -- I am now caching, and will no longer make calls to the mobile cache from the desktop -- may I re-enable the feature? [19:05:51] mwalker: sorry I wasn't clear; I referred you to ori-l because of the client-side implications of that feature [19:06:09] I know it wouldn't be an issue for the cache infrastructure, I'm not worrying about that [19:06:09] greg-g, ^d, and ottomata: nope, actually I was just looking at the wrong fatals [19:06:25] heh [19:06:49] turns out fluorine:/a/mw-log/fatal.log and fenari:/home/wikipedia/syslog/apache.log are unique [19:06:53] <^d> manybubbles: fatalmonitor is just some fancy one-linering of normal logs :) [19:07:05] ^d: but no fatals! 
[19:07:10] no wonder I couldn't find it [19:07:19] well, some fatals I guess [19:07:28] but not the fatal fatals that fataled us [19:07:28] <^d> (Also, the apache log is now on fluorine too, thanks ori :) [19:07:35] paravoid: soo, when do you think we can stop using swift in tampa all together? [19:08:14] greg-g: soooo, now that we know what we broke, can we have another crack at it some time? [19:08:28] manybubbles: third times a charm? [19:08:35] AaronSchulz: our strategy is still unclear at this point [19:08:47] greg-g: try try again [19:08:50] manybubbles: wed 11-1, if that's not too soon [19:09:06] AaronSchulz: there are some ideas of keeping one floor in tampa with one copy of one db per shard/swift/other mission critical data [19:09:10] greg-g: nothing would be too soon [19:09:11] AaronSchulz: until the new DC arrives, that is [19:09:13] manybubbles: 'tis the only way that works [19:09:23] that'a 11-1 east coast or west coast? [19:09:27] manybubbles: ok, then technically, 2-3 is open today, but that's late for you [19:09:27] (03PS1) 10Yurik: Added link to ../mobilelanding.php in w/ [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98576 [19:09:30] 2-3 pacific [19:09:33] k [19:09:42] ottomata: 11-1 pacific, sorry [19:09:43] I can do 2-3 today. wait, isn't that during our meeting? [19:09:51] er, gah, sorry [19:09:53] Reedy, ^ https://gerrit.wikimedia.org/r/#/c/98576/ [19:09:53] 3-4 [19:10:01] VE is 2-3 anyways [19:10:08] I can do it. For science! [19:10:26] k [19:10:30] <^d> Let's do it during our meeting :D [19:10:36] pretty much all we need to do is revert ^d's turning off search updates and see if everything is unbroken [19:10:39] <^d> Then everyone will be there if things break. [19:11:01] chock full day [19:11:08] ^d: I'm going to run your one liner to clear the list of stuck jobs. something is weird when they fail like this. they get forever stuck in the queue [19:11:13] greg-g, should i go with zero? 
<^d> manybubbles: Okie dokie. Just make sure to define $myWikis :) [19:11:35] (03CR) 10Ori.livneh: [C: 032] Saner copy for Puppet freshness alerts [operations/puppet] - 10https://gerrit.wikimedia.org/r/98476 (owner: 10Ori.livneh) [19:11:41] yurik: do you have your schema change done? [19:11:50] greg-g, separate issue [19:12:12] i want to get master out (solves new m. issues for ops) [19:12:27] and once that's out, want to get some config changed [19:12:31] and deploy rev flags [19:12:37] but rev flags is on low burner [19:13:18] yurik, i'm here [19:13:52] ugh, disconnected, what I tried to say before: [19:14:02] yurik: ok, you said 'need' before, and I interpreted it as such, sorry [19:14:23] greg-g, that is the requirement to get flagged revisions on meta [19:14:50] ok, so there's a lot of things floating around here, can you give an explicit list of what's going out now and later? [19:15:00] sure [19:15:01] ^d: done [19:16:43] <^d> ugh, memcached-serious is complaining about mwNN boxes (*not* mwNNNN) [19:16:52] <^d> "SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY" [19:17:08] tim has prepared a libmemcached patch about this [19:17:11] it's blocked on me [19:17:39] <^d> Ah, didn't know that. Will ignore then :) [19:17:57] and I told him I'd prioritize it [19:17:59] two weeks ago... [19:18:10] sorry, I'll have a look soon [19:19:29] ok, I would like to 1) get the latest master of zero synced 2) get https://gerrit.wikimedia.org/r/#/c/98576/ pushed (unless Reedy objects), 3) get https://gerrit.wikimedia.org/r/#/c/97107/ [19:19:37] <^d> manybubbles: Also, `tail -f runJobs.log | grep -i cirrus` is useful. [19:19:38] ^d: hey one great thing. we didn't fail in anyone's face! [19:19:54] ^d: was doing that [19:19:57] <^d> :) [19:20:00] it was showing starting and not finishing [19:20:07] but I thought it was stuck, not crashed! [19:20:16] <^d> They're all returning good now :) [19:20:39] noops.... [19:20:42] but really really fast!
[19:20:43] greg-g, also, time permitting, i would like to deploy https://gerrit.wikimedia.org/r/#/c/95662/ which needs db schema change on meta [19:21:12] <^d> So, when we gonna attempt updates again? Now? 2? [19:21:36] ^d: right after the platform team meeting [19:21:42] <^d> Sounds good. [19:21:57] (03PS1) 10Chad: Revert "Disable search updates for Cirrus wikis for the time being" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98581 [19:22:04] ^d: you synced out the elastic plugin for wmf4, right? [19:22:14] (03CR) 10Chad: "Not merging yet, just wanted to prep for later today." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98581 (owner: 10Chad) [19:22:17] <^d> Yeah. [19:22:22] (03PS1) 10coren: Tool Labs: install pep8 in dev environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/98582 [19:23:15] (03CR) 10coren: [C: 032] "Simple package add." [operations/puppet] - 10https://gerrit.wikimedia.org/r/98582 (owner: 10coren) [19:23:23] greg-g, to do the schema changes, i need to follow this: http://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/a8723d447344c57a4f40b52eae076c683201a11a/wmf-config%2FInitialiseSettings.php#L10138 [19:23:37] greg-g, should i start the zero depl? 
[19:23:58] yurik: man, chill :) [19:24:11] you're bombing greg with questions :) [19:24:31] paravoid, i'm listing my intended steps, per his requset :) [19:24:43] already half an hour behind on the depl window [19:24:55] not that i'm worried :) [19:32:13] (03PS1) 10Ori.livneh: Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 [19:34:19] (03PS2) 10Ori.livneh: Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 [19:34:34] (03CR) 10jenkins-bot: [V: 04-1] Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 (owner: 10Ori.livneh) [19:36:23] (03PS3) 10Ori.livneh: Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 [19:37:18] (03CR) 10jenkins-bot: [V: 04-1] Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 (owner: 10Ori.livneh) [19:39:44] (03CR) 10jenkins-bot: [V: 04-1] Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 (owner: 10Ori.livneh) [19:40:13] (03PS4) 10Ori.livneh: Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 [19:41:42] greg-g, should i start or is there something else being deployed? [19:43:31] paravoid: thanks for looking into the potential Varnish/Parsoid issue! [19:43:42] gwicke: no worries [19:44:27] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:45:08] yurik: sorry, you can start, nothing else is on the calendar [19:45:17] ok, here i go [19:45:22] dr0ptp4kt, deploying zero ext [19:45:27] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [19:49:21] yurik, like i said, i am taking off for lunch. call me if you get into an emergency situation. 
[19:51:22] (03PS1) 10Hashar: Tool Labs: install pyflakes in dev environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/98594 [19:52:24] gwicke: thanks for the ping :) [19:53:53] !log yurik synchronized php-1.23wmf4/extensions/ZeroRatedMobileAccess/ [19:54:08] Logged the message, Master [19:54:10] paravoid: sure ;) [19:59:03] (03CR) 10MaxSem: [C: 032] Added link to ../mobilelanding.php in w/ [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98576 (owner: 10Yurik) [19:59:35] (03PS4) 10Yurik: Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 [20:00:14] (03Merged) 10jenkins-bot: Added link to ../mobilelanding.php in w/ [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98576 (owner: 10Yurik) [20:00:38] (03CR) 10jenkins-bot: [V: 04-1] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:02:04] !log yurik synchronized php-1.23wmf5/extensions/ZeroRatedMobileAccess/ [20:02:20] Logged the message, Master [20:06:21] (03CR) 10coren: [C: 032] "Simple enough. Though if you have flakes in your py, I'd consult a professional." [operations/puppet] - 10https://gerrit.wikimedia.org/r/98594 (owner: 10Hashar) [20:09:15] (03PS1) 10Ottomata: Adding varnishkafka::monitoring class to send stats to Ganglia. [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/98601 [20:10:38] !log yurik synchronized w/mobilelanding.php [20:10:53] Logged the message, Master [20:11:02] ottomata: did you ever work out how to use git-deploy to deploy jar files? [20:11:17] I want to start using elasticsearch's icu plugin but it is jar files. 
[20:11:24] we deploy plugins with git deploy already [20:11:25] (03PS1) 10Ori.livneh: (WIP) rewrite mwprof in Go [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/98602 [20:11:30] PROBLEM - Puppet freshness on cp3012 is CRITICAL: Last successful Puppet run was Mon Dec 2 17:10:45 2013 [20:11:57] (03CR) 10Ori.livneh: [C: 032] Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 (owner: 10Ori.livneh) [20:12:07] (03PS5) 10Yurik: Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 [20:12:16] (03CR) 10jenkins-bot: [V: 04-1] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:15:16] (03PS6) 10MaxSem: Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:15:22] (03CR) 10jenkins-bot: [V: 04-1] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:15:29] bleh [20:15:30] PROBLEM - Puppet freshness on cp3011 is CRITICAL: Last successful Puppet run was Mon Dec 2 17:15:19 2013 [20:15:54] Reedy, do you know why https://gerrit.wikimedia.org/r/#/c/97107/ could still be failing? [20:16:09] i just synced the w/mobileportal.php file [20:18:41] MaxSem, btw, go ahead with your depl then [20:18:49] thanks [20:19:16] hashar, any idea why https://gerrit.wikimedia.org/r/#/c/97107/ is failing? 
[20:20:03] yurik: looking [20:20:08] thx [20:20:35] because it's a symlink to a file that exists outside of the repository [20:20:42] but it has a .php extension, so jenkins attempts to check it [20:20:50] except it can't resolve the link [20:21:20] jenkins should probably exempt symlinks [20:21:32] yurik: well the file does not exist [20:21:49] hashar, i just synced it to all the prod servers [20:24:30] yurik: i don't get it [20:24:38] yurik: where is the code of mobileredirect.php ? [20:25:44] hashar, the code is in the same place as extract2.php - in the root of mediawiki-config repo [20:25:48] yurik: anyway the jenkins check can be ignored in that corner case [20:25:56] how is it synced if it's not merged [20:26:10] yurik: and the repo is missing the /w/mobileredirect.php anyway [20:27:20] hashar, pull the repo [20:27:21] its there [20:28:09] hashar, https://gerrit.wikimedia.org/r/#/c/98576/ [20:28:12] na no mobileredirect.php for me [20:28:39] hey AaronSchulz [20:28:52] hashar, if you think all is good here, could you +2 :) [20:28:53] is there a bug for the mismatch of timeout settings between varnish and php? [20:29:04] yurik: as I said, there is no mobileredirect.php [20:29:04] not that I know of [20:29:22] hashar, i'm confused - where are you looking? [20:29:30] paravoid: want to file one or should I? [20:29:35] I'll just fix it [20:29:38] though the varnish timeout is wall-clock and the php one CPU based [20:29:50] so they are not totally easy to "match" perfectly [20:30:05] but yeah, they probably should not be hugely different [20:30:06] right now it's 30 vs. 180 [20:30:26] (03CR) 10Hashar: [C: 04-1] "Jenkins fails the php lint because the symbolic link of mobileredirect.php points to a full path which is not on the box." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:30:30] so clearly wallclock vs. 
CPU is not the issue :) [20:30:31] right...heh, I recall CheckUsers getting use out of high timeout ;) [20:31:03] paravoid: I'm just speaking generally [20:31:19] there is also apache timeout mixed in there somewhere [20:31:25] I got a Special:Contributions page during the weekend that consistently takes 45s to load [20:31:32] hashar, which box are you talking about? I modeled mobilelanding approach on extract2 - and its done exactly the same way unless i messed it up somehow [20:32:50] yurik: do you know fatal.log has lots of [02-Dec-2013 15:29:56] Fatal error: Maximum execution time of 180 seconds exceeded at /usr/local/apache/common-local/php-1.23wmf4/extensions/ZeroRatedMobileAccess/includes/PageRenderingHooks.php on line 225 ? [20:33:06] clearly we need to increase the threshold! [20:33:12] ori-l, that's not good [20:33:14] works with jobqueue [20:33:19] ori-l: do you know if there is a bug for ganglia-based alerts? [20:33:20] ori-l, is that a recent thing? [20:33:36] gwicke: we have graphite-based alerts now, I'd suggest that instead :) [20:33:45] gwicke: not sure, but i second paravoid [20:33:56] yurik: in the git repo, you are introducing a symbolic link that points to /w/mobileredirect.php which is not added by that change. [20:34:00] I see, how can I use those on the typical ganglia host stats? [20:34:18] gwicke: if it's for core system metrics, then yes, use ganglia [20:34:26] this is re https://bugzilla.wikimedia.org/show_bug.cgi?id=57265 [20:34:52] yurik: so there are two issues in the change. The first can be ignored which is that the symlink points to a file using a full path, that is not going to exist on jenkins. 
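[editor's note] The wall-clock vs CPU distinction above is easy to demonstrate: Varnish's first_byte_timeout measures elapsed time, while PHP's max_execution_time counts CPU time (on Linux), so a request that is merely blocked waiting burns the former but not the latter. A quick illustration, with Python standing in for PHP:

```python
import time

# Sleeping (waiting on a slow backend, a lock, the network) advances
# wall-clock time but consumes almost no CPU time, which is why a
# wall-clock proxy timeout and a CPU-based script timeout can never be
# matched one-to-one.
wall_start = time.perf_counter()   # wall-clock
cpu_start = time.process_time()    # CPU time of this process

time.sleep(0.2)                    # stands in for "blocked, doing no work"

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
```

Here `wall_elapsed` is at least 0.2s while `cpu_elapsed` stays near zero, so a 30s wall-clock proxy timeout can fire long before a 180s CPU-time script limit does, and vice versa for a CPU-bound request.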
So the lint check can be ignored [20:34:56] (03PS1) 10Faidon Liambotis: varnish: adjust first_byte_timeout to 180s (text) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98612 [20:35:08] (03PS1) 10Jforrester: Add "betar" label to VisualEditor links on eswiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98613 [20:35:25] yurik: but still /usr/local/apache/common/w/mobileredirect.php does not exist since it is not in the operations/mediawiki-config.git repo, neither in master nor in the patchset you wrote. I guess you forgot to git add it [20:35:43] yurik: timeouts started Oct. 25th [20:36:05] yurik: https://dpaste.de/M78C/raw [20:36:37] hashar, but what about https://gerrit.wikimedia.org/r/#/c/98576/ ??? that's the patch where I added w/mobilelanding.php [20:36:51] and I sync-common-file it already [20:37:46] (03CR) 10Faidon Liambotis: [C: 032 V: 032] varnish: adjust first_byte_timeout to 180s (text) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98612 (owner: 10Faidon Liambotis) [20:38:27] ori-l: is the plan to migrate to graphite, or does it make sense to open a bug for ganglia alerts? [20:39:09] yurik: does it add mobileredirect.php ? [20:40:54] hashar, i'm an idiot, thank you! it should have been mobileredirect.php :( [20:41:17] gwicke: the plan is to migrate or duplicate all application metrics to graphite; dunno re: system metrics, though there's a case to be made there [20:41:22] but it does make sense to open a bug, yes [20:41:51] what ori-l said [20:42:01] graphite is also nicer for alerts because you can apply functions [20:42:38] yeah, I'm just concerned about the things graphite does not currently cover [20:42:44] (03PS1) 10MarkTraceur: Enable GWToolset on betacommons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98684 [20:42:51] gwicke: yeah, it's a fair point; no reason to be blocked [20:42:52] bd808: Want to review/merge ^^ ? 
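The wall-clock vs. CPU distinction discussed above is easy to demonstrate. A minimal Python sketch (illustrative only — not the varnish or PHP implementation): time spent blocked, e.g. waiting on a slow backend, advances the wall clock but barely touches CPU time, which is why a 30s `first_byte_timeout` and a 180s `max_execution_time` guard different failure modes and can't be matched exactly.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) consumed by calling fn()."""
    w0, c0 = time.monotonic(), time.process_time()
    fn()
    return time.monotonic() - w0, time.process_time() - c0

# A backend stalled on I/O (modeled here by sleep) burns wall-clock time
# but almost no CPU: varnish's first_byte_timeout (wall-clock) would
# eventually fire, while a CPU-based execution limit would barely advance.
wall, cpu = measure(lambda: time.sleep(0.2))
```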
[20:43:10] paravoid: re: system metrics, been hearing good things about https://github.com/BrightcoveOS/Diamond [20:43:15] me too [20:43:16] dreamhost uses it [20:43:23] (03CR) 10Jforrester: [C: 04-1] "Do not merge until community discussion is complete." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98613 (owner: 10Jforrester) [20:43:25] https://bugzilla.wikimedia.org/show_bug.cgi?id=57882 [20:43:28] I've mentioned it some time ago [20:44:06] garg. icinga is my fault. fixing [20:44:12] there's so much work to do with graphite, argh [20:44:17] (03PS1) 10Hashar: beta: let jenkins-deploy restart Parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/98685 [20:44:28] ori-l: one step at a time :) [20:44:31] gwicke: thanks! CC'd self [20:44:43] ori-l: tiny steps are best :-] [20:44:49] hmm yeah, what paravoid said. [20:45:08] !log maxsem synchronized php-1.23wmf4/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/#/c/95636/' [20:45:21] ori-l: I thought maybe we could use two separate graphite / statsd instances. One for mw profiling and another one for misc jobs. That might help. [20:45:23] Logged the message, Master [20:45:30] though I have no idea how resource intensive graphite is [20:45:31] ori-l: you know ganglia can write to carbon too, right? [20:45:59] (03PS7) 10Yurik: Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 [20:46:34] !log maxsem synchronized php-1.23wmf5/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/#/c/95636/' [20:46:50] Logged the message, Master [20:47:19] paravoid: ganglia can write to carbon, graphite can write to rrds, ganglia-web can use graphite to render graphs, statsd can write to ganglia and graphite, collectd can write to statsd and graphite and ganglia, etc. [20:47:22] (03CR) 10BryanDavis: [C: 04-1] "Added Dan to review so he can tell us what settings need to be added."
(031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98684 (owner: 10MarkTraceur) [20:47:41] the flexibility is nice on the one hand but points to the lack of a good, standard full stack [20:47:50] you forgot bucky [20:47:58] and logster maybe [20:48:04] MaxSem, are you done with depl? [20:48:04] and skyline [20:48:11] yurik, yup [20:48:30] i will finish up by getting https://gerrit.wikimedia.org/r/#/c/97107/ deployed... finally :) [20:49:17] MaxSem, can you +2 it pls? [20:49:50] (03CR) 10MaxSem: [C: 032] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:50:06] MaxSem: you will have to force merge that patch [20:50:31] MaxSem: jenkins attempts to php -l the symbolic link which points to /usr/local/apache/common/w/mobilelanding.php . That does not exist on Jenkins servers :/ [20:51:02] jenkins still hasn't said anything this time [20:51:37] yurik: yeah it is stuck in a traffic jam. l10n changes are kicking in that overload jenkins every day around this time. [20:51:59] yurik: you can tell on the first graph at https://integration.wikimedia.org/zuul/ [20:52:11] yurik: there is a green line that shows the # of patchsets created [20:52:43] paravoid: do you have a few minutes to review a sudo privilege for beta please? https://gerrit.wikimedia.org/r/#/c/98685/ :) [20:52:52] hashar, is it possible to make it ignore them? [20:53:08] (03CR) 10jenkins-bot: [V: 04-1] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:53:27] MaxSem: yeah potentially we could eventually filter out symbolic links pointing to some non-relative paths.
[20:53:30] (03CR) 10jenkins-bot: [V: 04-1] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:53:38] hashar, I mean i18n [20:54:03] MaxSem: haven't found a way to ignore it [20:54:22] MaxSem: if it was just me I would put all the i18n files out of the code repos :-D [20:54:33] (03CR) 10Yurik: [C: 032 V: 032] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:54:47] yurik: are you going to look at the timeout issue? [20:55:04] ori-l, yes, sorry, want to finish getting links deployed [20:55:20] and btw, i briefly looked - nothing apparent, will need to investigate [20:55:23] ok, i'm going to look at something else and wanted to make sure it didn't disappear [21:02:52] !log yurik synchronized docroot and w [21:03:08] Logged the message, Master [21:05:14] (03PS6) 10Yurik: for m.wikipedia.org and zero.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/97115 [21:06:36] greg-g, ok, i think i'm done for now [21:07:23] paravoid, https://gerrit.wikimedia.org/r/#/c/97115 is ready for deployment -- virtual host updates [21:07:42] yurik: thanks [21:07:50] all other prereqs have been sorted out i think [21:08:09] greg-g, so who should i bug about db update on meta? [21:08:45] (this is not a new db schema - this is adding an existing extension to metawiki) [21:09:10] yurik: did you read that schema changes page I linked? [21:09:48] sean is the man for those [21:10:05] PROBLEM - Puppet freshness on sq37 is CRITICAL: Last successful Puppet run was Sun 01 Dec 2013 06:02:06 AM UTC [21:10:46] (03CR) 10Dan-nl: "not sure how these would be added or what values would be used. below are the values i have in LocalSettings.php.
INSTALL has further info" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98684 (owner: 10MarkTraceur) [21:14:53] (03CR) 10Ottomata: [C: 032 V: 032] Correctly calculate escape buffer size [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/98134 (owner: 10Edenhill) [21:14:54] greg-g, yes, i read it, but all the bugs in bugzilla talk about changing the schema, not deploying existing extensions that are in production to a new server. Should I add it to bugzilla also? [21:15:32] s/server/wiki [21:16:37] yurik: then that might be a reedy type thing [21:17:00] Reedy, is it your type thing? :) [21:17:24] What schema change? [21:18:28] (03CR) 10Ottomata: [C: 032 V: 032] Added %{VCL_Log:key}x support [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/98135 (owner: 10Edenhill) [21:20:55] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [21:21:59] (03CR) 10Ottomata: [C: 032 V: 032] Tag column reader was used incorrectly [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/98136 (owner: 10Edenhill) [21:23:35] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:25:03] (03PS1) 10Ottomata: log.statistics.file now defaults to /tmp/varnishkafka.stats.json [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/98694 [21:25:12] nooo [21:25:23] (03CR) 10Ottomata: [C: 032 V: 032] log.statistics.file now defaults to /tmp/varnishkafka.stats.json [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/98694 (owner: 10Ottomata) [21:25:27] haha [21:25:27] paravoid [21:25:30] that's just in the software [21:25:31] not in debian [21:25:38] debian will put it in /var/cache/varnishkafka [21:25:50] the software can't write to /var/cache on its own [21:25:59] ees bad? [21:26:01] doesn't matter, don't use /tmp [21:26:11] cwd? [21:26:24] what's wrong with /var/cache/varnishkafka/ ?
[21:26:40] because the software doesn't create that directory [21:26:47] you can't run it with that default unless you do some stuff first [21:26:57] I don't understand [21:27:01] (03PS1) 10RobH: RT:6428 testsearch1XXX to logstash1XXX RT:6428 deploying new logstash servers [operations/dns] - 10https://gerrit.wikimedia.org/r/98695 [21:27:03] have the package make that dir? [21:28:03] (03CR) 10RobH: [C: 032] RT:6428 testsearch1XXX to logstash1XXX RT:6428 deploying new logstash servers [operations/dns] - 10https://gerrit.wikimedia.org/r/98695 (owner: 10RobH) [21:28:15] bleh i put in wrong syntax to link rt ticket, oh well [21:28:25] yes [21:28:27] paravoid [21:28:30] the package will make that dir [21:28:31] RobH: woot! thanks [21:28:33] and set that as the default it installs [21:28:45] but, i need to set a hardcoded default in the code [21:28:53] that will work if someone just compiles and runs varnishkafka [21:28:57] oh [21:29:03] but we are not going to deploy it like that? [21:29:05] no [21:29:13] ah, then I don't care [21:29:14] :P [21:29:18] the .conf file that deb ships will set it to /var/cache/varnishkafka [21:29:18] ori-l: yea, im going to get the systems reinstalled with an os for you, then its up to you guys [21:29:31] ie: they will be installed and talking to puppet/salt/etc [21:29:32] ok phew :p [21:30:08] RobH: yep! can i just ask that the disk setup be the same? [21:30:35] yea, i assumed it was going to be [21:30:43] cool, thank you [21:30:46] im doing a replace of testsearch with logstash ;] [21:30:51] easiest reallocation ever.
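The compiled-default vs. packaged-default split discussed above can be sketched as a small resolution function. The two paths come from the conversation; the fallback logic is an assumption about how such defaults could coexist, not varnishkafka's actual source:

```python
import os

COMPILED_DEFAULT = "/tmp/varnishkafka.stats.json"  # bare source-build fallback
PACKAGED_DEFAULT = "/var/cache/varnishkafka/varnishkafka.stats.json"  # set by the deb's .conf

def stats_path(configured=None):
    """An explicit log.statistics.file setting always wins; otherwise
    prefer the packaged location if its directory exists (the package
    creates it at install time), else fall back to the compiled-in
    default so a plain `make && ./varnishkafka` still runs."""
    if configured:
        return configured
    if os.path.isdir(os.path.dirname(PACKAGED_DEFAULT)):
        return PACKAGED_DEFAULT
    return COMPILED_DEFAULT
```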
[21:31:02] (well, and deleting some, but still) [21:33:19] !log ignore any alerts for testsearch1XXX, I just decommissioned them, but icinga hasn't updated quite yet [21:33:35] Logged the message, RobH [21:35:05] PROBLEM - Host testsearch1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:25] PROBLEM - Host testsearch1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:42] PROBLEM - Host testsearch1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:18] now that would not be cool of me at all [21:36:23] if those hosts generated pages [21:36:27] but as they were test, they do not ;] [21:36:43] (so no one point out to apergos I skipped the neon update step ;) [21:39:08] aww man [21:39:15] who left uncommitted network changes on the row c stack =p [21:40:13] cmjohnson1: Are you actively making changes on row C network settings now? (I have no idea who it was, but there are few folks who would) [21:40:26] I am making changes yes [21:40:30] haha [21:40:32] me too [21:40:36] so yea, i wont commit mine [21:40:38] as im done [21:40:46] you do yer thing, im just changing port descriptions [21:40:53] just know they are there when you commit [21:40:54] i am done now...i can commit if okay with you? [21:40:56] yep [21:40:59] do it =] [21:41:01] cool [21:41:02] thx [21:41:29] i thought someone left uncommitted stuff, then i realized the list of folks who touch that is myself, you, mark and leslie, and none of those folks do that [21:41:58] heya paravoid, are there packet loss esams issues happening right now? i'm seeing lots of varnishkafka produce errors [21:42:00] haha..yeah I figured it was you [21:42:49] coren: labstore1001/1002 are finished. I was able to see all 4 disk shelves from each controller [21:43:08] latency looks like it's ~140 ms [21:46:28] cmjohnson1: Yeay you! It looks like we may end up doing some DC visits together; I'll buy you a beer for all the trouble.
[21:46:33] :-) [21:46:50] (03PS1) 10Andrew Bogott: Added 'adminadd' tool to auto-generate new user entries. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98700 [21:46:54] sounds like a plan [21:48:28] Moar storage. :-) [21:52:26] (03PS1) 10RobH: RT: 6428 deploy logstash1001-1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/98704 [21:53:56] (03CR) 10RobH: [C: 04-1] "corrections for dhcpd file pending" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98704 (owner: 10RobH) [21:54:50] (03PS2) 10RobH: RT: 6428 deploy logstash1001-1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/98704 [21:56:48] (03CR) 10RobH: [C: 032] RT: 6428 deploy logstash1001-1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/98704 (owner: 10RobH) [21:57:52] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [21:58:16] (03CR) 10Catrope: [C: 032] Fix server name for labs parsoid (deployment-parsoid3) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94062 (owner: 10Catrope) [21:58:24] (03CR) 10Catrope: [C: 032] Enable VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 (owner: 10Jforrester) [21:58:29] (03Merged) 10jenkins-bot: Fix server name for labs parsoid (deployment-parsoid3) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94062 (owner: 10Catrope) [21:58:30] (03CR) 10Catrope: [C: 032] Remove cruft from wmgVisualEditorDisableForAnons no longer needed [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94472 (owner: 10Jforrester) [21:58:37] ottomata: 140 from tampa to esams you mean [21:58:37] (03CR) 10Catrope: [C: 032] Enable VisualEditor as opt-in on svwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97711 (owner: 10Jforrester) [21:58:38] (03Merged) 10jenkins-bot: Enable VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 
10https://gerrit.wikimedia.org/r/96473 (owner: 10Jforrester) [21:58:43] (03CR) 10Catrope: [C: 032] Enable VisualEditor as opt-in on Swedish Wikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97714 (owner: 10Jforrester) [21:58:59] (03Merged) 10jenkins-bot: Remove cruft from wmgVisualEditorDisableForAnons no longer needed [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94472 (owner: 10Jforrester) [21:59:03] (03Merged) 10jenkins-bot: Enable VisualEditor as opt-in on svwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97711 (owner: 10Jforrester) [21:59:22] (03Merged) 10jenkins-bot: Enable VisualEditor as opt-in on Swedish Wikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97714 (owner: 10Jforrester) [21:59:51] ottomata: miniscule packet loss but a lot of jitter [22:01:54] asking again: Anyone have a puppet manifest/template that they feel best exemplifies awesomeness? [22:02:37] Snaps: ^^ [22:04:31] !log catrope synchronized visualeditor-default.dblist 'Enable VisualEditor by default on 102 wikis' [22:04:37] (03PS2) 10Andrew Bogott: Added 'adminadd' tool to auto-generate new user entries. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98700 [22:04:46] Logged the message, Master [22:05:28] !log catrope synchronized wmf-config/InitialiseSettings.php 'Only activate VisualEditor in the User namespace on svwiktionary' [22:05:44] Logged the message, Master [22:06:14] !log catrope synchronized visualeditor.dblist 'Enable VisualEditor on svwiktionary and sewikimedia' [22:06:27] Logged the message, Master [22:07:39] Huhm [22:13:23] (03PS3) 10Andrew Bogott: Added 'adminadd' tool to auto-generate new user entries. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/98700 [22:19:51] !log reedy synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [22:20:06] Logged the message, Master [22:20:14] !log reedy updated /a/common to {{Gerrit|Idc406aa68}}: Created mobile portal m.wikipedia.org and zero.wikipedia.org [22:20:18] (03PS1) 10Reedy: Update interwiki cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98711 [22:20:29] Logged the message, Master [22:20:35] (03CR) 10Reedy: [C: 032] Update interwiki cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98711 (owner: 10Reedy) [22:20:43] Reedy: Uhm.... mind not doing config syncs during a scheduled deployment window? [22:20:58] (03Merged) 10jenkins-bot: Update interwiki cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98711 (owner: 10Reedy) [22:33:18] !log catrope synchronized php-1.23wmf4/extensions/VisualEditor 'Update VE for cherry-pick' [22:33:34] Logged the message, Master [22:33:35] !log catrope synchronized php-1.23wmf5/extensions/VisualEditor 'Update VE for cherry-pick' [22:33:49] Logged the message, Master [22:37:39] PROBLEM - puppet disabled on testsearch1001 is CRITICAL: Connection refused by host [22:38:00] <^d> That seems...wrong? [22:38:09] <^d> testsearch1001 shouldn't exist anymore. [22:38:10] PROBLEM - DPKG on testsearch1001 is CRITICAL: Connection refused by host [22:38:10] PROBLEM - Disk space on testsearch1001 is CRITICAL: Connection refused by host [22:38:29] PROBLEM - RAID on testsearch1001 is CRITICAL: Connection refused by host [22:38:49] (03PS1) 10Manybubbles: Reenable CirrusSearch's updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98714 [22:39:06] it still exists in icinga? [22:39:46] RobH: ^^ re testsearch1001 [22:40:00] is this: 13:35 < RobH> im doing a replace of testsearch with logstash ;] [22:40:03] ? 
[22:40:28] just be careful and don't let it join the production search cluster:) [22:41:09] I'm sure you have it under control though [22:41:10] do you guys not read log [22:41:14] i admin logged it would alert [22:41:17] =p [22:41:24] (03CR) 10Chad: "Dupe of https://gerrit.wikimedia.org/r/#/c/98581/?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98714 (owner: 10Manybubbles) [22:41:57] (03Abandoned) 10Manybubbles: Reenable CirrusSearch's updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98714 (owner: 10Manybubbles) [22:46:04] neon is mid puppet update [22:46:12] has been awhile [22:46:34] ori-l: So I have all three of your new hosts installed. I'm having them call into puppetmaster now for their initial runs [22:46:45] RobH: <3! thank you! [22:47:32] quite welcome [22:47:43] so once the initial run is done, i'll assign the ticket to you [22:47:50] and you can resolve or use as you see fit. [22:48:06] yep [22:48:52] and while i babysit these puppet runs, its lunchtime \o/ (because the last time i looked at a clock it was 11:30) [22:49:09] RECOVERY - Disk space on testsearch1001 is OK: DISK OK [22:49:17] damn you icinga [22:49:29] RECOVERY - RAID on testsearch1001 is OK: OK: optimal, 1 logical, 2 physical [22:49:39] RECOVERY - puppet disabled on testsearch1001 is OK: OK [22:50:09] RECOVERY - DPKG on testsearch1001 is OK: All packages OK [22:50:19] PROBLEM - NTP on testsearch1001 is CRITICAL: NTP CRITICAL: Offset unknown [22:50:22] puppet disabled is OK , means it's enabled [22:50:27] heh [22:50:58] yea cept testsearch is gone, i removed from the db but neon takes a very long time to run puppet [22:51:29] yep, got that, i just meant that check in general [22:51:47] it's fairly new we check for puppet being disabled.. 
[22:51:52] as opposed to freshness checks [22:53:29] yep [23:00:52] (03CR) 10Chad: [C: 032] Revert "Disable search updates for Cirrus wikis for the time being" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98581 (owner: 10Chad) [23:01:06] (03Merged) 10jenkins-bot: Revert "Disable search updates for Cirrus wikis for the time being" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98581 (owner: 10Chad) [23:04:01] !log demon synchronized wmf-config/CommonSettings.php 'Cirrus wikis get searchupdate (take 2)' [23:04:17] Logged the message, Master [23:05:09] hey all; just an fyi -- fundraising just went up 100% [23:11:12] \O/ [23:12:28] PROBLEM - Puppet freshness on cp3012 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 05:10:45 PM UTC [23:13:24] !log demon synchronized php-1.23wmf4/extensions/Elastica [23:13:39] Logged the message, Master [23:14:28] PROBLEM - Puppet freshness on elastic1008 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:01 PM UTC [23:14:28] PROBLEM - Puppet freshness on mw1069 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:06 PM UTC [23:14:28] PROBLEM - Puppet freshness on mw1027 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:11 PM UTC [23:14:28] PROBLEM - Puppet freshness on mw1143 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:01 PM UTC [23:14:28] PROBLEM - Puppet freshness on mw28 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:11 PM UTC [23:15:28] PROBLEM - Puppet freshness on cp4002 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:08 PM UTC [23:15:28] PROBLEM - Puppet freshness on db1050 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:32 PM UTC [23:15:28] PROBLEM - Puppet freshness on db1040 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:08 PM UTC [23:15:28] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 
08:14:57 PM UTC [23:15:28] PROBLEM - Puppet freshness on db49 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:42 PM UTC [23:15:28] PROBLEM - Puppet freshness on mw1003 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:52 PM UTC [23:15:28] PROBLEM - Puppet freshness on mw1066 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:52 PM UTC [23:15:29] PROBLEM - Puppet freshness on mw1021 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:13 PM UTC [23:15:29] PROBLEM - Puppet freshness on mw104 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:13 PM UTC [23:15:30] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:13 PM UTC [23:15:30] PROBLEM - Puppet freshness on mw1150 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:47 PM UTC [23:15:31] PROBLEM - Puppet freshness on mw1153 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:57 PM UTC [23:15:31] PROBLEM - Puppet freshness on mw1154 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:58 PM UTC [23:15:32] PROBLEM - Puppet freshness on mw1118 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:08 PM UTC [23:15:32] PROBLEM - Puppet freshness on mw1155 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:13 PM UTC [23:15:33] PROBLEM - Puppet freshness on mw1173 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:26 PM UTC [23:15:33] PROBLEM - Puppet freshness on mw1205 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:32 PM UTC [23:15:34] PROBLEM - Puppet freshness on mw1204 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:42 PM UTC [23:15:34] PROBLEM - Puppet freshness on mw79 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:26 PM UTC [23:15:35] PROBLEM - Puppet freshness on search1012 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:57 PM 
UTC [23:15:35] PROBLEM - Puppet freshness on sq52 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:08 PM UTC [23:15:36] PROBLEM - Puppet freshness on sq63 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:13 PM UTC [23:15:36] PROBLEM - Puppet freshness on srv193 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:32 PM UTC [23:15:37] PROBLEM - Puppet freshness on osm-cp1001 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:03 PM UTC [23:16:28] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:16:11 PM UTC [23:16:28] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:45 PM UTC [23:16:28] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:55 PM UTC [23:16:28] PROBLEM - Puppet freshness on cp3011 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 05:15:19 PM UTC [23:16:28] PROBLEM - Puppet freshness on db1021 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:16:01 PM UTC [23:16:54] so that was prolly due to its daemon restarting [23:17:00] since i decommissioned and added new hosts [23:17:05] dunno though... [23:18:25] (03PS6) 10Yurik: Apply FlaggedRevs to metawiki for W0. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95662 (owner: 10Dr0ptp4kt) [23:18:28] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:17:33 PM UTC [23:18:28] PROBLEM - Puppet freshness on cp4015 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:17:48 PM UTC [23:18:28] PROBLEM - Puppet freshness on cp1018 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:18:14 PM UTC [23:18:28] PROBLEM - Puppet freshness on cp1016 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:17:23 PM UTC [23:18:28] PROBLEM - Puppet freshness on cp4014 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:17:23 PM UTC [23:19:28] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:18:45 PM UTC [23:19:28] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:19:11 PM UTC [23:19:28] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:18:40 PM UTC [23:19:28] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:18:19 PM UTC [23:19:28] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:18:50 PM UTC [23:20:28] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:20:07 PM UTC [23:20:28] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:19:42 PM UTC [23:20:28] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:20:07 PM UTC [23:20:28] PROBLEM - Puppet freshness on cp1062 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:19:37 PM UTC [23:20:28] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:19:52 PM UTC [23:22:28] PROBLEM - 
Puppet freshness on aluminium is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:23 PM UTC [23:22:28] PROBLEM - Puppet freshness on amssq59 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:28 PM UTC [23:22:28] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:23 PM UTC [23:22:28] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:23 PM UTC [23:22:28] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:43 PM UTC [23:22:28] PROBLEM - Puppet freshness on db1053 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:22:14 PM UTC [23:22:28] PROBLEM - Puppet freshness on manutius is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:44 PM UTC [23:22:29] PROBLEM - Puppet freshness on db36 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:22:04 PM UTC [23:22:29] PROBLEM - Puppet freshness on magnesium is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:54 PM UTC [23:22:30] PROBLEM - Puppet freshness on mchenry is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:54 PM UTC [23:22:30] PROBLEM - Puppet freshness on mc1011 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:54 PM UTC [23:22:31] PROBLEM - Puppet freshness on mw1001 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:22:14 PM UTC [23:22:31] PROBLEM - Puppet freshness on mw1022 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:59 PM UTC [23:22:32] PROBLEM - Puppet freshness on mw1031 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:23 PM UTC [23:22:32] PROBLEM - Puppet freshness on mw114 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:59 PM UTC [23:22:33] PROBLEM - Puppet freshness on mw1145 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:59 PM UTC [23:22:33] 
PROBLEM - Puppet freshness on mw1080 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:38 PM UTC [23:22:34] PROBLEM - Puppet freshness on mw1178 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:38 PM UTC [23:22:34] PROBLEM - Puppet freshness on mw1200 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:59 PM UTC [23:22:35] PROBLEM - Puppet freshness on mw53 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:49 PM UTC [23:22:35] PROBLEM - Puppet freshness on mw64 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:28 PM UTC [23:22:36] PROBLEM - Puppet freshness on sq78 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:23 PM UTC [23:22:36] PROBLEM - Puppet freshness on srv240 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:59 PM UTC [23:22:37] PROBLEM - Puppet freshness on srv270 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:54 PM UTC [23:22:37] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:22:04 PM UTC [23:23:06] eek [23:23:11] ori-l, do you know why zero extension is not showing up in graphite? [23:23:21] did i do that with my gerrit checkin??? 
[23:23:27] wow [23:23:28] PROBLEM - Puppet freshness on analytics1017 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:22:24 PM UTC [23:23:28] PROBLEM - Puppet freshness on db31 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:23:01 PM UTC [23:23:28] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:23:11 PM UTC [23:23:28] PROBLEM - Puppet freshness on db1011 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:23:06 PM UTC [23:23:28] PROBLEM - Puppet freshness on elastic1001 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:23:11 PM UTC [23:24:29] how about a single alert for XX hosts with freshness < GOOD and a dependency to the individual alerts? [23:25:19] looks false positive [23:25:23] e.g. puppet runs [23:25:28] PROBLEM - Puppet freshness on cp1009 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:25:10 PM UTC [23:25:28] PROBLEM - Puppet freshness on cp1053 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:24:49 PM UTC [23:25:28] PROBLEM - Puppet freshness on db1015 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:25:10 PM UTC [23:25:28] PROBLEM - Puppet freshness on cp4009 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:24:43 PM UTC [23:25:28] PROBLEM - Puppet freshness on chromium is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:25:10 PM UTC [23:25:50] do you run puppet in a cron or keep the long-running daemonized puppet? [23:26:27] greg-g: hi, who is in today's lightning deploy window?
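The "single alert with dependencies" idea suggested above can be sketched as a tiny aggregation function. This is hypothetical — a real Icinga setup would express it with a parent service plus servicedependency definitions, not Python:

```python
from datetime import datetime, timedelta

def freshness_summary(last_runs, now, threshold=timedelta(hours=2)):
    """Collapse per-host puppet freshness into one parent check: CRITICAL
    with the list of stale hosts if any host hasn't completed a run
    within `threshold`, OK otherwise. Individual host alerts would be
    declared dependent on this parent so a mass event (like a restarted
    icinga daemon) notifies once instead of hundreds of times."""
    stale = sorted(h for h, t in last_runs.items() if now - t > threshold)
    return ("CRITICAL" if stale else "OK", stale)
```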
[23:26:28] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:25:56 PM UTC
[23:26:28] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:26:06 PM UTC
[23:26:28] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:25:56 PM UTC
[23:26:28] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:26:01 PM UTC
[23:26:28] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:26:06 PM UTC
[23:26:56] se4598: myself for one
[23:27:28] PROBLEM - Puppet freshness on arsenic is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:27:07 PM UTC
[23:27:28] PROBLEM - Puppet freshness on cp1008 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:27:12 PM UTC
[23:27:28] PROBLEM - Puppet freshness on db1049 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:27:02 PM UTC
[23:27:28] PROBLEM - Puppet freshness on cp1019 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:26:52 PM UTC
[23:27:28] PROBLEM - Puppet freshness on db64 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:26:57 PM UTC
[23:29:21] se4598: mwalker and that's it
[23:29:29] mlitn: do you currently have time to do https://gerrit.wikimedia.org/r/98073 at the next lightning depl. window?
[23:30:28] it's already recovering, i just killed the bot for less spam here
[23:30:45] cajoel: cron
[23:31:00] (PS1) Mwalker: Changing banner expiration to 10 months [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98723
[23:31:17] base/puppet.cron.erb ..
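The answer at 23:30:45 is that puppet runs from cron (templated by base/puppet.cron.erb) rather than as a long-running daemon. A sketch of what such a crontab entry looks like; the minute values are placeholders and this is not the template's actual contents:

```
# Run the puppet agent twice an hour as a one-shot process instead of
# a daemon; staggering the minutes per host spreads load on the
# puppetmaster. Minutes shown are illustrative.
7,37 * * * * root /usr/bin/puppet agent --onetime --no-daemonize > /dev/null 2>&1
```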
[23:31:34] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[23:32:14] restarts icinga-wm
[23:34:31] puppet makes neon really busy, hey, who else is on neon
[23:34:38] i was
[23:35:00] ok, then that made sense
[23:35:38] ^d: I'm here for a bit!
[23:35:52] <^d> I'm already reindexing :D
[23:35:56] <^d> It's working wonderfully now.
[23:36:19] (PS1) Jgreen: add new install location for drush to sudoers config [operations/puppet] - https://gerrit.wikimedia.org/r/98724
[23:36:29] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 11:29:05 PM UTC
[23:36:41] sweet! you making jobs flow?
[23:36:56] <^d> Jobs a'running
[23:37:06] <^d> It's sooooo much faster this way :D
[23:37:41] (CR) Jgreen: [C: 032 V: 031] add new install location for drush to sudoers config [operations/puppet] - https://gerrit.wikimedia.org/r/98724 (owner: Jgreen)
[23:37:58] ^d: you doing the two passes?
[23:38:14] <^d> I'm doing the first pass on everything now.
[23:38:36] may want to add --forceUpdate --skipLinks --indexOnSkip
[23:38:45] it's in the readme :)
[23:38:56] it should skip a bunch of stuff and make the jobs run faster
[23:39:53] might not matter though, if the jobs are running fast enough
[23:40:21] <^d> Yeah I just fixed those.
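The flags suggested at 23:38:36 belong to CirrusSearch's forceSearchIndex.php maintenance script (as manybubbles says, they are in its readme). A hedged sketch of what the first-pass invocation would look like; the wiki name and the mwscript wrapper are assumptions, not taken from this log:

```
# First reindex pass: force-write every page's search document even if
# unchanged (--forceUpdate), skip the expensive links computation
# (--skipLinks) but still index the page when skipping (--indexOnSkip).
mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php \
    --wiki=enwiki --forceUpdate --skipLinks --indexOnSkip
```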
[23:40:27] <^d> I noticed I was missing something :)
[23:40:50] PROBLEM - MySQL Processlist on db1019 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 302 copy to table, 0 statistics
[23:40:59] PROBLEM - MySQL Processlist on db1003 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 234 copy to table, 2 statistics
[23:40:59] PROBLEM - MySQL Processlist on db1010 is CRITICAL: CRIT 0 unauthenticated, 1 locked, 227 copy to table, 4 statistics
[23:42:31] ^d: neat that with one process spewing jobs it doesn't even make a dent in the job queue. you could spin up five and probably not change it
[23:44:09] RECOVERY - Puppet freshness on cp4013 is OK: puppet ran at Mon Dec 2 23:44:02 UTC 2013
[23:45:20] ori-l: do you know about this check? https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=tungsten&service=HTTP+5xx+req%2Fmin
[23:45:22] ^d: you can really see how fast the links update part is now.
also, I should stop spawning a job for updating links where there are no links to update
[23:45:29] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 11:44:02 PM UTC
[23:46:00] RECOVERY - MySQL Processlist on db1010 is OK: OK 0 unauthenticated, 0 locked, 17 copy to table, 2 statistics
[23:46:11] mutante: faidon provisioned it; it is an actual problem
[23:46:19] (the thing that the alert is reporting, I mean)
[23:46:49] RECOVERY - MySQL Processlist on db1019 is OK: OK 0 unauthenticated, 0 locked, 1 copy to table, 0 statistics
[23:46:59] RECOVERY - MySQL Processlist on db1003 is OK: OK 0 unauthenticated, 0 locked, 1 copy to table, 1 statistics
[23:47:23] <^d> manybubbles: I wonder if that was us ^
[23:47:38] (PS1) Ori.livneh: Specify managehome => false for "/nonexistent" $HOMEs [operations/puppet] - https://gerrit.wikimedia.org/r/98729
[23:47:39] (PS1) Ori.livneh: Add logstash100[1-3] to site.pp & add bd808 & aaron as sudo per RT 6366 [operations/puppet] - https://gerrit.wikimedia.org/r/98730
[23:47:42] ori-l: thx, found it, i see it uses check_graphite
[23:47:46] <^d> I sped up then slowed down a bit.
[23:48:05] * ^d will keep an eye
[23:48:11] ^d: the mysql recovery?
[23:48:19] mutante: is it alright with you if i add the accounts in site.pp initially? i explained my rationale in RT 6366
[23:48:24] <^d> Yes, the problem then recovery.
[23:48:34] <^d> I wonder if we made indexing toooooo efficient on our side ;-)
[23:48:48] <^d> To where it's possible to overload things like the database :)
[23:49:19] we did it to elasticsearch two weeks ago
[23:49:30] so it is possible. what rate is it spitting out?
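The tungsten alert discussed at 23:45:20 and 23:47:42 is driven by check_graphite, which at its core fetches recent datapoints for reqstats.5xx and compares them against the thresholds visible in the alert output (warn=250, crit=500). A minimal, self-contained Python sketch of that comparison logic; the function name and sample data are illustrative, not check_graphite's actual code:

```python
# Thresholds taken from the alert output above; everything else is a
# hedged sketch of a graphite-backed Nagios-style check.
WARN, CRIT = 250.0, 500.0

def check_5xx(datapoints):
    """Return a Nagios-style status for reqstats.5xx datapoints.

    datapoints: list of (value, timestamp) pairs, the shape graphite's
    JSON render API returns; None values (gaps in the series) are
    ignored so a missing sample can't trip the alert.
    """
    values = [v for v, _ in datapoints if v is not None]
    if not values:
        return "UNKNOWN"
    latest = values[-1]
    if latest >= CRIT:
        return "CRITICAL"
    if latest >= WARN:
        return "WARNING"
    return "OK"

# A burst of 610 5xx/min exceeds crit=500, so this reports CRITICAL.
print(check_5xx([(120.0, 1), (None, 2), (610.0, 3)]))
```

In production the real plugin pulls the series over HTTP from graphite's render endpoint; only the threshold comparison is shown here.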
[23:49:30] (CR) Ori.livneh: [C: 032] Specify managehome => false for "/nonexistent" $HOMEs [operations/puppet] - https://gerrit.wikimedia.org/r/98729 (owner: Ori.livneh)
[23:49:45] <^d> Each thread is doing about 125/s, I had 2 threads.
[23:49:52] btw, the second pass runs even faster because you skip parsing.
[23:49:58] <^d> When I briefly tried a third thread, I saw the mysql panic so I backed off.
[23:49:59] rather, the jobs run faster
[23:50:06] the generator doesn't
[23:50:07] <^d> Yeah
[23:50:27] (PS2) Mwalker: Changing banner expiration to 10 months [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98723
[23:50:50] the second pass is almost as fast in my development environment as the job generator
[23:51:05] it does much less with the db though, so it _should_ be safer
[23:52:34] mutante: poke
[23:52:36] ori-l: hold on, graphite crashed my browser :P
[23:52:40] k
[23:54:55] ^d: about to go, but, honestly, if the job infrastructure executes so many jobs that ours don't look like a blip, what are the odds that we're a blip on mysql? I'm sure it's possible but I don't think likely now that we're "just" doing page views.
[23:55:03] hiding now
[23:55:13] <^d> I think we're fine and I was just being paranoid :)
[23:57:27] ori-l: the second part convinced me more than the first. :)
[23:58:37] ori-l: yea, go ahead, don't need to reinvent things right now..
keep it simple is fine
[23:58:49] RECOVERY - Puppet freshness on cp4013 is OK: puppet ran at Mon Dec 2 23:58:44 UTC 2013
[23:58:52] (CR) Mwalker: [C: 032] Changing banner expiration to 10 months [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98723 (owner: Mwalker)
[23:59:47] ori-l: all i was saying is it seems better to me not having to touch site.pp each time a user changes on some node.. but don't worry now, we do it everywhere