[00:07:17] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 0d 15h 4m 24s [00:29:47] springle: welcome back! fyi, the masses are getting antsy about dewiki. maybe you have some ideas of the current status (see #wikimedia-labs) [00:30:39] recentchanges is getting new entries (so it's replicating) but apparently some tables are missing lots of rows [00:44:43] (03CR) 10Addshore: [C: 031] Fix Wikibase noc symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97903 (owner: 10Aude) [00:57:04] jeremyb: pt-table-sync process keeps losing connection to that labs db box. hope to figure out why today [00:57:26] springle: k, thanks [00:58:25] ha. oom killer [00:59:19] you mean table's too big? :) [01:01:47] the sync is batched, so no... possibly sync + some large/slow txn backing connections up and spiking mysqld mem usage [01:02:58] ohhhh, i was thinking pt-table-sync itself was being killed [01:03:12] anyway, enjoy digging :) [01:03:15] ah :) nope, mysqld [01:10:21] (03PS1) 10Springle: Reduce mysqld footprints temporarily for investigation. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98462 [01:12:28] (03CR) 10Springle: [C: 032] Reduce mysqld footprints temporarily for investigation. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98462 (owner: 10Springle) [01:16:48] !log restarting labsdb1002 mysqld processes with 25% smaller buffer pools. kernel OOM killer striking. 
needs investigation [01:17:07] Logged the message, Master [01:36:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [02:08:15] !log LocalisationUpdate completed (1.23wmf4) at Mon Dec 2 02:08:15 UTC 2013 [02:08:32] Logged the message, Master [02:14:57] !log LocalisationUpdate completed (1.23wmf5) at Mon Dec 2 02:14:57 UTC 2013 [02:15:12] Logged the message, Master [02:21:20] (03PS1) 10Springle: depool pc1002 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98465 [02:22:12] (03CR) 10Springle: [C: 032] depool pc1002 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98465 (owner: 10Springle) [02:23:27] !log springle synchronized wmf-config/db-eqiad.php 'depool pc1002 for upgrade' [02:23:43] Logged the message, Master [02:34:08] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [02:36:56] (03PS1) 10Springle: switch pc1002 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/98466 [02:37:48] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Dec 2 02:37:48 UTC 2013 [02:38:00] (03CR) 10Springle: [C: 032] switch pc1002 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/98466 (owner: 10Springle) [02:38:04] Logged the message, Master [02:49:12] (03PS3) 10Tim Starling: Remove "Your cache administrator is nobody" joke. [operations/puppet] - 10https://gerrit.wikimedia.org/r/95147 (owner: 10Mattflaschen) [02:49:21] (03CR) 10Tim Starling: [C: 032] Remove "Your cache administrator is nobody" joke. [operations/puppet] - 10https://gerrit.wikimedia.org/r/95147 (owner: 10Mattflaschen) [02:49:35] (03CR) 10Tim Starling: [V: 032] Remove "Your cache administrator is nobody" joke. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/95147 (owner: 10Mattflaschen) [02:49:53] (03PS1) 10Springle: repool pc1002 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98467 [02:50:18] (03CR) 10Springle: [C: 032] repool pc1002 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98467 (owner: 10Springle) [02:51:36] !log springle synchronized wmf-config/db-eqiad.php 'repool pc1002 after upgrade, max_connections lowered during warm up' [02:51:49] Logged the message, Master [02:58:16] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 9d 17h 39m 59s [02:59:16] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 1m 0s [02:59:36] RECOVERY - Puppet freshness on pc1002 is OK: puppet ran at Mon Dec 2 02:59:35 UTC 2013 [03:00:16] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 0m 37s [03:01:16] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 1m 0s [03:02:16] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 2m 0s [03:03:16] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 3m 0s [03:04:27] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 4m 0s [03:05:17] RECOVERY - Puppet freshness on pc1002 is OK: puppet ran at Mon Dec 2 03:05:08 UTC 2013 [03:07:27] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 0d 18h 4m 34s [03:28:54] PROBLEM - Puppet freshness on pc1002 is CRITICAL: No successful Puppet run for 0d 0h 1m 0s [03:29:14] something is wrong with this picture [03:29:57] well, two things, most likely [03:30:45] puppetd -tv works [03:30:47] so maybe only one thing [04:17:12] * jeremyb digs up some logs for TimStarling [04:17:12] 22 08:42:53 < apergos> Nov 22 08:40:44 neon icinga: Warning: The results of service 'Puppet freshness' on host 
'snapshot1' are stale by 0d 0h 0m 54s (threshold=0d 3h 0m 0s). I'm forcing an immediate check of the service. how is 54 seconds past the threshold?? [04:17:30] 24 06:12:53 < ori-l> wtf? [04:19:12] i'm glad my contribution to debugging this issue was not lost in the abyss of time, jeremyb [04:20:09] ori-l: took a few secs for me to realize the full impact of paravoid's ignore rule [04:21:41] Reedy has also commented on this issue [04:23:10] was it similarly helpful? [04:25:53] define command{ [04:25:53] command_name puppet-FAIL [04:25:53] command_line echo "No successful Puppet run for $SERVICEDURATION$" && exit 2 [04:25:53] } [04:26:21] I'm not sure if this is how freshness is meant to be used [04:30:14] "If the check results is found to be stale, Icinga will force an active check of the host or service by executing the command specified by in the host or service definition." [04:30:57] right, it's the time since it hit critical [04:32:48] $LASTSERVICEOK$ is closer [04:33:10] http://nagios.sourceforge.net/docs/3_0/macrolist.html#lastserviceok [04:35:22] you mean http://docs.icinga.org/latest/en/macrolist.html#macrolist-lastserviceok [04:35:33] i better call my lawyer [04:35:48] rotfl [04:38:30] maybe the active check is scheduled, but not run for a while [04:38:41] and before the active check actually gets run, the passive result comes in [04:38:53] so it goes OK -> OK -> CRITICAL [05:10:13] ganglia broken: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [05:17:00] SAL has Faidon's "restarted gmond on ms-fe1001/2, both were stuck 6h ago and we lost all swift eqiad's metrics for that period" for 14:49 on Nov 29, which more or less lines up [05:17:46] 14:49 - 6h is 8:49, last legitimate update for that metric was 7:30 [05:45:12] were you going to fix the Puppet alert?
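Putting the pieces above together, the `puppet-FAIL` fake-check plus the proposed swap of `$SERVICEDURATION$` (time since the service went CRITICAL) for `$LASTSERVICEOK$` (timestamp of the last OK result), the rewritten command definition might look like this. This is only a sketch of what was being discussed, not the actual patch; the eventual change appears later in the log as Gerrit change 98476 ("Saner copy for Puppet freshness alerts"), whose contents are not shown here.

```
define command{
    command_name    puppet-FAIL
    # $LASTSERVICEOK$ is a timestamp, not a duration, so the message is
    # rephrased around "last successful run at" rather than "no run for".
    command_line    echo "Last successful Puppet run was at $LASTSERVICEOK$" && exit 2
}
```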
[05:45:34] if not, I'll just replace the macro with $LASTSERVICEOK$ and rephrase it so it makes sense [05:50:51] we should never do active puppet checks, so that's a problem [05:51:03] morning -ish [06:01:06] (03PS1) 10Springle: depool pc1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98470 [06:02:21] PROBLEM - Puppet freshness on pc1003 is CRITICAL: No successful Puppet run for 9d 20h 44m 1s [06:03:17] (03CR) 10Springle: [C: 032] depool pc1003 for upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98470 (owner: 10Springle) [06:04:34] !log springle synchronized wmf-config/db-eqiad.php 'depool pc1003 for upgrade' [06:04:47] Logged the message, Master [06:07:40] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 0d 21h 4m 47s [06:10:14] TimStarling: that would be you & me [06:10:31] buffer = 4194304 breaks ganglia on lucid hosts [06:10:38] Starting Ganglia Monitor Daemon: /etc/ganglia/gmond.conf:54: no such option 'buffer' [06:10:41] Parse error for '/etc/ganglia/gmond.conf' [06:11:01] right... [06:18:54] (03PS1) 10Faidon Liambotis: ganglia: fix config for lucid hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/98473 [06:21:19] (03CR) 10Faidon Liambotis: [V: 032] ganglia: fix config for lucid hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/98473 (owner: 10Faidon Liambotis) [06:21:26] (03CR) 10Faidon Liambotis: [C: 032] ganglia: fix config for lucid hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/98473 (owner: 10Faidon Liambotis) [06:22:18] TimStarling: did you read about the page cache issue? [06:22:43] yes, so that is fixed now? 
[06:22:52] not really [06:22:56] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=cp1052.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=mem_report&c=Text+caches+eqiad [06:23:00] so that's the control [06:23:04] still running with 3.2 [06:23:13] the kernel drops more and more cache as the days pass [06:23:27] but it still hasn't gone to the point where it drops everything all the time [06:23:34] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=cp1065.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=mem_report&c=Text+caches+eqiad [06:23:37] and that's 3.11 [06:24:05] that's a 3.2 that hasn't been rebooted: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=cp1055.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=mem_report&c=Text+caches+eqiad [06:24:09] it's fascinating [06:24:10] so we just need to upgrade the kernel on all the varnish servers? [06:24:32] well, yes [06:24:40] I'm just curious on why this happens [06:24:54] what makes it worsen as days pass [06:25:03] why does it happen on just these boxes, etc. 
well, you could do a git blame on the relevant kernel source files [06:25:39] http://article.gmane.org/gmane.linux.kernel.mm/99926 <- this is what I suspect [06:26:03] and in general, the whole patchset: http://thread.gmane.org/gmane.linux.kernel.mm/99921 [06:26:57] (03PS1) 10Springle: switch pc1003 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/98474 [06:26:59] I think domas would love this [06:27:56] (03CR) 10Springle: [C: 032] switch pc1003 to mariadb [operations/puppet] - 10https://gerrit.wikimedia.org/r/98474 (owner: 10Springle) [06:28:36] !log fixed ganglia for misc eqiad (possibly others); see {{Gerrit|Icc5376505}} [06:28:49] Logged the message, Master [06:30:43] ah thank you for the gmond conf fix, otherwise I would be looking at that right now [06:41:31] (03PS1) 10Ori.livneh: Saner copy for Puppet freshness alerts [operations/puppet] - 10https://gerrit.wikimedia.org/r/98476 [06:48:04] argh [06:48:06] fuck you ganglia [06:49:53] (03PS1) 10Springle: repool pc1003 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98477 [06:50:19] springle: oh? switching more dbs to mariadb? [06:50:21] nice! [06:52:05] :) [06:52:20] (03CR) 10Springle: [C: 032] repool pc1003 after upgrade [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98477 (owner: 10Springle) [06:53:25] !log springle synchronized wmf-config/db-eqiad.php 'repool pc1003 after upgrade, max_connections lowered during warm up' [06:53:39] Logged the message, Master [06:58:27] no ganglia at all? [06:59:08] sigh [07:00:43] I restarted ganglia-monitor everywhere just now [07:01:36] There was an error collecting ganglia data (127.0.0.1:8654): XML error: Invalid document end at 1 [07:01:39] nice [07:05:17] ok, we're back [07:05:52] !log upgrade/reboot db1046 m2 slave [07:06:08] Logged the message, Master [07:07:15] springle: that db47 raid error, I couldn't find a ticket [07:07:53] because it's meant to be decommed any moment.
still waiting for amaranth to switch masters [07:09:04] ah [07:15:29] (03PS1) 10Springle: upgrade db1046 [operations/puppet] - 10https://gerrit.wikimedia.org/r/98479 [07:16:36] (03CR) 10Springle: [C: 032] upgrade db1046 [operations/puppet] - 10https://gerrit.wikimedia.org/r/98479 (owner: 10Springle) [07:39:24] PROBLEM - DPKG on db1023 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:40:24] RECOVERY - DPKG on db1023 is OK: All packages OK [08:24:40] paravoid: the three patchsets needs another round of applause after I did a rebase.. [08:36:20] paravoid, seems like we are on for deploying some of the stuff at 11 SF [08:38:48] i plan to push out updated the new version of zero that gets proxy configuration, deploy some config changes for META to have patrolled revisions (including db update for that) [08:39:28] okay [08:39:39] yurik: do you need anything from me? [08:40:05] paravoid, yep - once i get zero config stuff out, we can get the new landing page code in [08:40:17] cool [08:40:28] config stuff won't take long (i hope) [08:40:39] so can we work together around 11:30SF time? [08:41:32] (03Abandoned) 10Yurik: Added relative redirect workaround until its fixed ext [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97540 (owner: 10Yurik) [08:41:54] uhm, that's in 11 hours from now [08:42:12] yes, even before midday workday in SF :) [08:42:19] paravoid, if you want, i can deploy the stuff now :) [08:42:24] I'd rather to not work a 14h day unless absolutely necessary :) [08:42:55] deploy what? 
[08:43:50] paravoid, https://gerrit.wikimedia.org/r/#/c/97107/ for starters would be great [08:44:35] paravoid, followed by https://gerrit.wikimedia.org/r/#/c/97115/ [08:45:16] it won't enable things until varnish's patch https://gerrit.wikimedia.org/r/#/c/97122/ [08:45:55] the first one is V-2 [08:46:31] right, because it requires a file in /usr/local/apache/common/w/mobileredirect.php [08:47:49] I can do the vhost one now [08:47:57] the mediawiki-config... I'll defer to you and/or Reedy [08:47:59] vhost one shouldn't go out until this one [08:48:10] all is needed is a link to ../mobileredirect.php [08:48:18] i just don't have root on tin [08:48:26] why do you need root for this? [08:48:45] because i can't creat a link in commons i think [08:49:39] ln -s mobilelanding.php ../mobilelanding.php [08:49:53] I'm sorry, I don't understand [08:49:58] link from where to where? [08:50:20] a link from /usr/local/apache/common/w/mobileredirect.php to ../mobileredirect.php [08:50:38] sec [08:50:51] sorry, from w [08:51:08] oh yes, i said everything correctly [08:51:15] got confused for a sec [08:51:42] why don't you put mobileredirect inside /w/ like the others are? 
[08:52:04] because take a look - they are all links [08:52:07] look at extract2 [08:52:09] oh, hm, extract2 isn't [08:52:13] its designed exactly the same way [08:52:28] anyway, you can create symlinks if you want, the whole dir is g+rwx [08:52:29] which is what i have modeled it on [08:52:35] but I think we could benefit from another review [08:52:38] let's wait for Reedy [08:52:52] he's the expert in those things [08:53:11] paravoid, the code was approved by him, and we won't change varnish until he looks at it [08:53:16] again [08:53:37] I don't see a CR+1/2 in r97107 [08:53:56] paravoid, https://gerrit.wikimedia.org/r/#/c/96654/ [08:54:01] let's wait a few hours, it's going to be deployed before you wake up [08:54:17] csteipp +1 it, and reedy +2 it even though he had doubts [08:54:28] we won't deploy it via varnish [08:54:37] but we will be able to start testing that wget gets it [08:54:42] I'm aware of that [08:54:47] properly from tin [08:55:37] I'm grateful you want to see this deployed [08:55:38] as for the dir - incorrect, i can't (for some unknown reason) perform ln -s ../mobilelanding.php mobilelanding.php [08:55:53] ln: failed to create symbolic link `mobilelanding.php': Permission denied [08:56:10] current dir: yurik@tin:/usr/local/apache/common/w$ [08:56:19] wait an hour or three for sam to wake up [08:56:30] you do realize its 4am here :) [08:56:31] it's been like that for many months now [08:56:39] yes, go to sleep [08:56:47] by the time you wake up, it will be deployed [08:56:51] lol [08:56:56] ok, sounds good [08:57:29] again, let's do this, I'd love that, but let's not be _that_ impatient ;-) [08:57:36] just don't do varnish just yet [08:57:39] yep yep [08:57:53] I'll wait for you to do varnish [08:57:58] ok [08:58:02] I'll wait for you before I do varnish that is [08:58:18] ok damn, i thought you were giving me +2 on puppets [08:58:24] :P [08:59:38] paravoid, and soon (maybe) zero landing will look like this: 
http://api.beta.wmflabs.org/w/index.php?title=Special:ZeroRatedMobileAccess&X-CS=250-99&useformat=mobile [09:07:42] i wonder if ori-l is around... [09:08:12] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 1d 0h 5m 19s [09:08:30] even he is and maybe even working, it's Sunday after midnight [09:13:15] Wouldn't that be monday then >.> [10:28:10] (03CR) 10Faidon Liambotis: [C: 032] tcpircbot: tabs to spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/98455 (owner: 10Matanya) [10:31:29] (03CR) 10Faidon Liambotis: [C: 032] varnish: whitespace & lint cleanups [operations/puppet] - 10https://gerrit.wikimedia.org/r/97910 (owner: 10Matanya) [10:35:55] (03PS1) 10Faidon Liambotis: varnish: lint fixups [operations/puppet] - 10https://gerrit.wikimedia.org/r/98493 [10:36:20] (03CR) 10Faidon Liambotis: [C: 032] varnish: lint fixups [operations/puppet] - 10https://gerrit.wikimedia.org/r/98493 (owner: 10Faidon Liambotis) [11:50:09] (03PS1) 10Springle: unbreak puppet run on pc[123] [operations/puppet] - 10https://gerrit.wikimedia.org/r/98498 [11:51:56] (03CR) 10Springle: [C: 032] unbreak puppet run on pc[123] [operations/puppet] - 10https://gerrit.wikimedia.org/r/98498 (owner: 10Springle) [11:53:53] (03PS1) 10QChris: Backup geowiki's data-private bare repository [operations/puppet] - 10https://gerrit.wikimedia.org/r/98499 [11:54:31] (03CR) 10QChris: "This is less invasive variant of" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98499 (owner: 10QChris) [11:55:36] (03CR) 10QChris: "Trying again to get geowiki backup working in:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/97021 (owner: 10Faidon Liambotis) [12:08:17] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 1d 3h 5m 24s [12:35:07] PROBLEM - check_mysql on db1008 is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [12:35:49] !log stopped mysql on db1008 to clone a database [12:35:55] hey Jeff_Green [12:36:01] hey paravoid 
[12:36:02] Logged the message, Master [12:36:04] early isn't it [12:36:24] ~7:30AM my time yeah [12:36:42] do you know what's the status with rhodium? [12:36:49] there's a nagios alert [12:36:53] Offline Content Generation - Collection [12:36:55] CRITICAL [12:37:09] hmm. no. looking [12:38:45] garg. local network issue. back in a sec. [12:40:07] PROBLEM - check_mysql on db1008 is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) [12:40:25] (03CR) 10Matthias Mullie: Enable Flow discussions on a few test wiki pages (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94106 (owner: 10Spage) [12:44:49] ACKNOWLEDGEMENT - check_mysql on db1008 is CRITICAL: Slave IO: No Slave SQL: No Seconds Behind Master: (null) Jeff_Green cloning a db [12:47:07] PROBLEM - Puppet freshness on rhodium is CRITICAL: No successful Puppet run for 6d 7h 8m 8s [12:52:51] paravoid: I disabled that rhodium icinga check for now [12:54:56] it's checking for http on tcp 17080, but the relevant daemon isn't fully configured yet. [13:05:23] RECOVERY - Puppet freshness on rhodium is OK: puppet ran at Mon Dec 2 13:05:22 UTC 2013 [13:17:18] (03PS1) 10Odder: Raise $wgRateLimit for rollback for editors on dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98510 [13:30:04] RECOVERY - check_mysql on db1008 is OK: Uptime: 2373297 Threads: 2 Questions: 11765132 Slow queries: 13730 Opens: 34041 Flush tables: 2 Open tables: 64 Queries per second avg: 4.957 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [13:43:07] (03PS2) 10Aude: Fix Wikibase noc symlink [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97903 [13:43:08] (03PS7) 10Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 [14:09:18] Reedy: around? 
[14:22:11] (03PS1) 10Krinkle: admins.pp: Update SSH pub key for user 'krinkle' [operations/puppet] - 10https://gerrit.wikimedia.org/r/98518 [14:23:47] (03CR) 10Faidon Liambotis: [C: 04-1] "You need to ensure => absent the old key (like your MB 2011 key is). Keys that don't exist in the puppet manifests are, unfortunately, not" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98518 (owner: 10Krinkle) [14:24:56] (03PS2) 10Krinkle: admins.pp: Update SSH pub key for user 'krinkle' [operations/puppet] - 10https://gerrit.wikimedia.org/r/98518 [14:26:28] (03PS3) 10Krinkle: admins.pp: Update SSH pub key for user 'krinkle' [operations/puppet] - 10https://gerrit.wikimedia.org/r/98518 [14:28:34] (03CR) 10Faidon Liambotis: [C: 032] "Verified via video call" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98518 (owner: 10Krinkle) [14:30:46] (03PS1) 10Akosiaris: Centralize puppet reports and file buckets [operations/puppet] - 10https://gerrit.wikimedia.org/r/98519 [14:33:43] (03PS8) 10Aude: Enable Wikidata build on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 [14:36:02] (03CR) 10Akosiaris: [C: 032] Centralize puppet reports and file buckets [operations/puppet] - 10https://gerrit.wikimedia.org/r/98519 (owner: 10Akosiaris) [14:41:09] (03CR) 10Aude: "to the extent that I am able to test this, it produces a proper ExtensionMessages file for both production and labs realm." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 (owner: 10Aude) [14:44:57] (03PS9) 10Aude: Enable Wikidata build on beta labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 [14:56:00] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:56:24] Coren: ^^^ [14:56:57] paravoid: Hm. Should place it in maintenance. [14:57:15] (awaiting cmjohnson1 moving stuff around in meatspace) [14:57:34] Coren: can you review https://gerrit.wikimedia.org/r/#/c/98307/ ? 
Coren: also, can you comment on https://gerrit.wikimedia.org/r/#/c/84288/ ? [14:58:16] * Coren will look at both. [14:58:48] thanks :) [15:00:48] (03CR) 10Raimond Spekking: [C: 031] Raise $wgRateLimit for rollback for editors on dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98510 (owner: 10Odder) [15:01:19] (03CR) 10coren: [C: 04-1] "I suppose that, strictly speaking, it's a -1. It seems a little silly to me to iterate over a one-liner change that needs to be done some" [operations/puppet] - 10https://gerrit.wikimedia.org/r/84288 (owner: 10DrTrigon) [15:04:19] coren: I am free to move the disk shelves anytime ? [15:05:42] cmjohnson1: Should be; I was about to check all four to make sure they were powered off. [15:06:09] And they do seem to be. [15:06:56] coren: labstore1001 and 1002...can they be taken down at all? [15:07:15] I will have to relocate 1 of them [15:08:11] I will have to relocate labstore1001 and its arrays to C3 [15:08:42] cmjohnson1: They're powered down now; you can play with them to your heart's content. 1001 and 1002 need to be together with the shelves though (for obvious reason); but 1003 and 1004 can be split apart if you want to. [15:08:59] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 1d 6h 6m 6s [15:09:10] In fact, since one will be a slave of the other, it might even be best if they were in separate rows (but not necessary). [15:09:25] coren we ditched labstore1003 and 4 for labsdb1004 and 5.... [15:11:45] coren: to confirm...labstore1001 will move racks from C2 to C3 so I have space to add the disk shelf for both 1001 and 1002 [15:12:12] cmjohnson1: Yes, yes, sorry I was referring to their old names. :-) [15:12:27] cmjohnson1: Yes, that sounds good to me. [15:14:09] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [15:15:35] (03CR) 10coren: [C: 031] "Looks okay to me; or at least it looks like it's trying to do the same thing."
[operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 (owner: 10Faidon Liambotis) [15:19:04] Coren: that doesn't sound very confident to me :) [15:19:35] should we wait for Ryan? [15:20:18] this is in the labs domain, it's likely I won't be able to do much more about it (and that work I did, I did on a weekend) [15:20:44] paravoid: No, I'm cool about the patch proper -- it's doing the right thing. Like I said, my concern is that you're doing the same thing puppet is but it's not guaranteed that puppet has all the live config. [15:55:58] (03CR) 10Hashar: "Andrew, I am wondering whether that has broken the cron job that generate the public keys on labstore1 https://bugzilla.wikimedia.org/show" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98030 (owner: 10Andrew Bogott) [16:21:50] <^d> Hmm, wonder why arsenic doesn't have an /a/common/ [16:21:52] <^d> Curious [16:28:27] (03CR) 10Akosiaris: "Minor nitpicks and one serious comment. $::site is global scope AFAIK. There are however some rules I can not find matching ferm rules for" (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98307 (owner: 10Faidon Liambotis) [16:29:17] akosiaris: the ldap::server iptables rules are not applied anywhere [16:29:22] PROBLEM - Varnish HTTP text-backend on cp1065 is CRITICAL: Connection timed out [16:29:25] I just killed them while I was at it [16:29:43] [245204.819597] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) [16:29:46] [245206.762253] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) [16:29:48] (03CR) 10Ottomata: [C: 032 V: 032] Initial Debian packaging [operations/debs/python-kafka] (debian) - 10https://gerrit.wikimedia.org/r/97848 (owner: 10Ottomata) [16:29:49] [245208.768831] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) [16:29:52] PROBLEM - Varnish traffic logger on cp1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:29:52] PROBLEM - Varnish HTCP daemon on cp1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:29:52] grumble grumble [16:29:55] and that's with 3.11 [16:30:07] fuck you, XFS [16:30:15] paravoid: ah ok then :-) [16:30:29] bblack: that's another bug [16:31:04] :) [16:32:52] PROBLEM - Host cp1065 is DOWN: PING CRITICAL - Packet loss = 100% [16:33:31] !log rebooting cp1065, usual XFS deadlock [16:33:32] RECOVERY - Host cp1065 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:33:45] Logged the message, Master [16:38:25] Coren: hey, what's up with labs stuff? Specifically: the de db replication thingy and andre__ mentioned accessing his bugzilla test instance is intermittent (right andre?) [16:38:34] I guess I should ask over in labs... [16:38:43] greg-g: Here works too. :-) [16:55:58] (03PS3) 10Yuvipanda: toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 [16:56:22] Coren: this Patchset should silence jenkins [16:56:59] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 (owner: 10Yuvipanda) [16:59:38] ^d and ottomata: switching conversation about Cirrus deploy here because ottomata is here [16:59:46] <^d> manybubbles: Ok, yeah so I'll take care of sync'ing the Cirrus files. [16:59:48] coool [16:59:48] (03PS4) 10Yuvipanda: toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 [17:00:06] ^d and ottomata: sweet. [17:00:13] I'll watch the warnings and run the rebuild [17:00:24] ottomata: we're on terbium because arsenic is busted [17:00:37] and we're not going to be as mean to terbium as we were in the past [17:00:42] ha, ok [17:00:45] so maybe we don't really need arsenic any more any way [17:00:50] so, step one is syncing some mediawiki stuff?
[17:00:52] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 (owner: 10Yuvipanda) [17:00:57] what's that for? just config changes? [17:01:04] stupid, stupid jenkins! [17:01:21] <^d> ottomata: Bunch of changes in Cirrus' master we want live. [17:01:30] ottomata: yeah, that [17:01:50] we need to sync a week and a half of work to make Cirrus less whiny to users [17:02:11] now it'll just whine to the logs if something is wrong [17:02:23] (03CR) 10Chad: [C: 032] Show "using new search engine" when using Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97939 (owner: 10Manybubbles) [17:02:43] ^d: +1 for "new" inflation [17:02:43] ah ok [17:02:48] cool [17:03:00] also, we can build the index on the job queue. [17:03:14] and we count links from Elasticsearch rather than the db [17:03:27] which should stop cirrus from doing its only long running query [17:03:31] (03PS5) 10Yuvipanda: toollabs: Add proxylistener that runs on the dynamicproxy machine [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 [17:04:20] (03Merged) 10jenkins-bot: Show "using new search engine" when using Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97939 (owner: 10Manybubbles) [17:06:11] !log demon synchronized wmf-config/CirrusSearch-common.php [17:06:27] Logged the message, Master [17:07:37] !log demon synchronized php-1.23wmf4/extensions/CirrusSearch 'Cirrus to master' [17:07:53] Logged the message, Master [17:08:09] (03PS6) 10Jforrester: Enable VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 [17:08:54] (03PS1) 10Chad: Turn Cirrus back on secondary for all wikis that had it before [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98543 [17:09:17] !log demon synchronized php-1.23wmf5/extensions/CirrusSearch 'Cirrus to master' [17:09:33] Logged the 
message, Master [17:10:11] ^d and ottomata: looks like I'm good to go for rebuilding test2wiki? [17:10:38] <^d> Yeah, you should be set now for test2wiki. [17:10:55] <^d> And when it's set, we'll merge 98543 ^ [17:10:56] akosiaris: I put yer name there, be warned [17:11:02] (in topic for rt duty) [17:11:06] =] [17:11:24] RobH: ok :-) [17:14:15] (03CR) 10Aude: "not quite sure WikipediaMobileFirefoxOS is supposed to be changed, or at least it needs more explanation." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98543 (owner: 10Chad) [17:14:33] ^d: [17:15:09] is cirrus being used in mobile firefox app? [17:16:11] ^d: rebuilt - testing [17:16:33] <^d> aude: I'm assuming that uses the API, right? [17:16:40] no idea [17:16:46] i just saw it in your patch [17:16:53] might be a rebase gone bad [17:16:56] <^d> mobile firefox? [17:17:00] yeah [17:17:25] ^d: everything looks good [17:17:29] <^d> yay [17:17:33] <^d> I'll merge my other thing now [17:17:49] (03CR) 10Chad: [C: 032] Turn Cirrus back on secondary for all wikis that had it before [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98543 (owner: 10Chad) [17:17:58] (03Merged) 10jenkins-bot: Turn Cirrus back on secondary for all wikis that had it before [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98543 (owner: 10Chad) [17:18:05] Fyay [17:18:06] -F [17:18:33] greg-g: -F means force? [17:18:59] <^d> yay -F [17:19:58] yeah, what is the deal with mobilefirefox os? [17:20:03] ^d: ^^ [17:20:16] well, jobs are running now [17:20:28] <^d> I don't know anything about mobile firefox :p [17:20:40] it was part of your commit somehow! [17:20:49] you changed a firefox submodule [17:20:53] <^d> Oh dammit! [17:20:59] * aude wonder if its part of a rebase [17:21:03] <^d> I blame the mobile team! 
:p [17:21:06] !log demon synchronized wmf-config/InitialiseSettings.php 'Cirrus on all the wikis (that had it before)' [17:21:21] Logged the message, Master [17:21:27] <^d> I didn't sync it, will fix. [17:21:35] k [17:21:58] anyway, yay to have cirrus enabled again! [17:22:40] (03PS1) 10Chad: Fix submodule reference change that snuck into If5b3a27a [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98545 [17:22:59] cool -F [17:23:35] manybubbles: we are about to do step 6? [17:23:39] 5.  Merge restore all wikis that had Cirrus before to secondaries. [17:23:39] 6.  Sync that. [17:23:40] <^d> git diff HEAD..HEAD~2 looks good now. [17:23:59] (03CR) 10Chad: [C: 032 V: 032] Fix submodule reference change that snuck into If5b3a27a [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98545 (owner: 10Chad) [17:24:10] ^d: starting step 7, actually [17:24:17] looks like some of the update jobs are hanging [17:25:14] ^d: can you look at the pool counter and see if we're hanging on it? [17:26:41] <^d> Hahahaha! [17:26:47] <^d> Poolcounter log uses localized messages. [17:26:57] nice! [17:27:02] <^d> 2013-12-02 17:26:51 mw1080 cswiki: Při čekání na zámek vypršel časový limit [17:27:04] ha [17:27:17] <^d> 2013-12-02 17:22:36 mw1216 eowiki: Tempolimo atingita dum atendo de ŝlosado [17:27:37] yeah, I wonder why we're taking so long.... [17:27:46] ^d: can you disable search updates for now and see if it clears up? [17:30:34] ottomata: I can't find elastic10XX in ganglia any more.... 
[17:30:46] !log demon updated /a/common to {{Gerrit|I507a72cca}}: Fix submodule reference change that snuck into If5b3a27a [17:31:02] Logged the message, Master [17:31:11] !log LocalisationUpdate completed (1.23wmf4) at Mon Dec 2 17:31:11 UTC 2013 [17:31:14] hmmm [17:31:26] Logged the message, Master [17:31:27] !log demon synchronized wmf-config/CommonSettings.php 'Search update off for cirrus wikis' [17:31:31] <^d> manybubbles: ^ [17:31:40] yeah ung [17:31:42] Logged the message, Master [17:31:46] manybubbles: sometimes that happens to me in ganglia [17:31:48] !log cp301[12].esams - puppet temporarily disabled, custom crash handler vmod in place to try to catch an error in the next couple of hours [17:31:51] the hosts don't show up in search [17:31:51] hm [17:32:09] Logged the message, Master [17:32:18] ^d: thanks. I'm wondering if this is a side effect of trying to count links without having the schema built for it yet. [17:32:24] let me see about flushing the job queue for those wikis [17:32:29] <^d> Possibly. [17:32:50] hmm, ah I don't even see the elasticsearch cluster option anymore [17:32:51] sigh [17:33:13] aghhh analytics cluster ganglia is messed up too [17:33:14] sighhhh [17:36:04] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1308: active_shards: 3636: relocating_shards: 2: initializing_shards: 4: unassigned_shards: 8 [17:36:04] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1308: active_shards: 3636: relocating_shards: 2: initializing_shards: 4: unassigned_shards: 8 [17:36:13] uh [17:36:35] grrrr! 
[17:36:38] why you critical [17:37:03] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1312: active_shards: 3664: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [17:37:04] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1312: active_shards: 3664: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [17:37:04] almost too verbose for me [17:37:08] too verbose [17:37:11] surely [17:37:14] status: red to green is all I got [17:37:14] on the list to fix [17:37:19] verbosity without clarity [17:37:29] * greg-g nods [17:37:31] critical because we're adding lots of new shards faster than it can allocate them [17:38:46] manybubbles: that's cool, at least [17:38:51] oh, manybubbles... [17:38:56] did we remove the elastic* nodes from site.pp? [17:39:03] i don't see them in my production HEAD... [17:39:29] I don't remember doing that [17:39:36] ah there they are [17:39:38] just testsearch [17:39:40] ok dunno what was up with that [17:39:45] weird, editor cache, dunno [17:39:46] nm [17:41:34] manybubbles: weird [17:41:35] https://gist.github.com/ottomata/7753319 [17:41:37] looking into it... [17:47:21] huh, hm [17:47:25] sysctl values aren't set right on this node [17:47:26] hmmm [17:47:29] !log LocalisationUpdate completed (1.23wmf5) at Mon Dec 2 17:47:29 UTC 2013 [17:47:43] Logged the message, Master [17:47:57] weird [17:48:15] ^d: so now that search update is off all the cirrus jobs are noops [17:48:18] which is cool [17:48:26] but the ones that started are just stuck [17:48:30] can we kill them? [17:48:48] or get a stack trace and then kill them? [17:49:40] <^d> Are they the only cirrus jobs? 
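[editor's note] The PROBLEM/RECOVERY alerts above dump every field of Elasticsearch's `_cluster/health` response, which is the "verbosity without clarity" being complained about. A minimal Python sketch of the triage the check is doing: the field names match the real `_cluster/health` API, but `classify()` and its one-line summary format are invented here for illustration.

```python
# Field names below match Elasticsearch's real _cluster/health response;
# classify() and its one-line summary format are invented for illustration.

def classify(health: dict) -> tuple:
    """Map a _cluster/health body to a (nagios_state, short_summary) pair."""
    state = {"green": "OK", "yellow": "WARNING", "red": "CRITICAL"}.get(
        health.get("status"), "UNKNOWN")
    # Lead with the status; only surface the shard counters that explain
    # a non-green state, instead of dumping every field into the alert.
    movers = ("relocating_shards", "initializing_shards", "unassigned_shards")
    details = ", ".join("%s=%d" % (k, health[k]) for k in movers if health.get(k))
    summary = health.get("status", "unknown") + (" (%s)" % details if details else "")
    return state, summary

# The red cluster from the 17:36 alert above:
state, summary = classify({
    "status": "red", "number_of_nodes": 12, "number_of_data_nodes": 12,
    "active_primary_shards": 1308, "active_shards": 3636,
    "relocating_shards": 2, "initializing_shards": 4, "unassigned_shards": 8,
})
```

As manybubbles explains below, the red status during the rebuild just means shards were being created faster than they could be allocated, so the same data point recovers to green a minute later.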
<^d> If so, we could do it with a one-liner in eval.php I'm sure. [17:50:21] better to get a stack trace so I have something to go on, but if we can't then yeah, kill kill [17:51:54] <^d> hmm [17:53:01] at this point I'm trying to figure out what got these queries stuck [17:53:16] and I'm grasping at straws [17:53:34] so we should probably roll back, but I want something to work with or I'll have little hope to fix this [17:56:04] I suppose we're safe as is but it isn't a good place to be with jobs just doing nothing [17:56:39] <^d> How rebuilt are we? [17:57:39] manybubbles: I ran puppet on a couple of elastic nodes [17:57:43] aaand now, ganglia is back [17:57:55] had something to do with procps needing a kick to pick up sysctl values [17:57:57] dunno... [17:58:16] ottomata: k. [17:58:30] ^d: I can't rebuild without jobs and we've made them all noops [17:58:44] and the queue is stuck on those jobs that are, well, stuck [17:58:50] I thought we timed out our jobs [17:59:19] <^d> They do. [17:59:37] <^d> I'm going to turn updates back on. [17:59:46] <^d> Actually, going to flush all jobs first. [17:59:52] <^d> Then turn them back on. [18:00:19] k [18:00:23] give it a shot [18:00:35] let me know when you've flushed [18:04:04] ottomata: you installed logster on cp3003, right? [18:04:28] ottomata: logster pulled logcheck (as it's a dependency) and that produces cronspam via a cronjob [18:04:35] ottomata: so... disable? :-) [18:05:26] (03CR) 10coren: [C: 032] "Looks like that'll do the trick." [operations/puppet] - 10https://gerrit.wikimedia.org/r/98352 (owner: 10Yuvipanda) [18:05:48] paravoid, yeah i'm making sure I understand where to put the varnishkafkalogster parser via puppet [18:05:58] looking... [18:06:01] <^d> manybubbles: On which wikis had you done things yet? [18:06:03] <^d> All of them?
^d: I've --startOver'ed all of their indices [18:06:48] so _if_ my theory is correct then this should be ok [18:07:05] if it isn't we should pull the plug again [18:07:33] :( [18:07:47] I should stop bragging about you guys, I'm jinxing you. [18:07:54] wha, so paravoid, logcheck gets installed and automatically starts sending emails? [18:08:02] I'll be happy if we can figure out what is going on, though [18:08:10] ottomata: yes [18:08:15] psh, ok [18:09:05] iiiii think that is the wrong dependency.... [18:09:06] no? [18:09:28] i think I should have put logtail [18:09:33] Replaces: logcheck (<= 1.1.1-9), logcheck (<= 1.1.1-9) [18:09:46] hmm [18:09:47] PROBLEM - Puppet freshness on sq37 is CRITICAL: No successful Puppet run for 1d 9h 6m 54s [18:09:58] hm no [18:10:06] i guess that's right according to etsy/logster [18:10:07] readme [18:11:13] hm. [18:11:22] will look into that paravoid, I don't think it needs 'logcheck' i think it needs 'logtail' [18:11:45] okay [18:12:16] anyway, in the meantime i've uninstalled [18:12:25] thanks [18:13:13] ^d: hmmm what happens if a job is killed and it has a pool counter lock? [18:13:23] <^d> Not a clue. [18:13:29] I mean, it can't free it. [18:13:31] crap [18:13:44] let me go look. [18:13:50] it might just eat the lock [18:13:52] like forever [18:14:37] <^d> I just finished my 1-liner to drop all cirrus jobs. [18:14:52] <^d> foreach( $myWikis as $w ) { JobQueueGroup::singleton( $w )->get( 'cirrusSearchDeletePages' )->delete(); JobQueueGroup::singleton( $w )->get( 'cirrusSearchLinksUpdate' )->delete(); JobQueueGroup::singleton( $w )->get( 'cirrusSearchUpdatePages' )->delete(); } [18:15:22] <^d> With $myWikis being all our wikis.
k [18:16:01] all the stuck "claimed" jobs are still stuck [18:19:35] <^d> Oh, those aren't claimed :\ [18:21:13] ^d: looks like when a connection is terminated pool counter will drop all locks held by the client [18:21:15] which is good [18:21:21] so it is safe to use in the context [18:21:23] thank god [18:21:39] ^d: the jobs are gone! [18:21:43] <^d> :) [18:21:50] what did you do, sorcerer? [18:23:01] <^d> That one-liner :) [18:23:10] it just took a while to do it? [18:23:16] <^d> No. [18:23:35] <^d> I just was reading more docs before I did. [18:23:38] huh. so it went from "cirrusSearchLinksUpdate: 144 queued; 124 claimed (124 active, 0 abandoned)" to none/just the waiting ones [18:23:40] ah [18:23:52] so, if you are comfortable, lets try turning it back on [18:24:01] and send me the docs:) [18:24:21] <^d> JobQueue / JobQueueGroup classes in mediawiki :) [18:24:32] <^d> Source is the best docs ;-) [18:25:21] yeah [18:25:35] why I was reading the c in pool counter.... [18:26:07] so, want to try turning it back on? [18:26:47] we have 35 minutes left. I vote if we can get it back on in ten minutes and everything is still good in 10 minutes we leave it on, watch, and I investigate potential hanging [18:27:01] 35 mins of deploy window left? [18:27:08] ottomata: yeah [18:27:09] yeah [18:27:15] agree with your analysis, man [18:27:17] manybubbles: [18:27:19] we can build the index outside the window, but we need to get out of the way [18:27:25] before the end of the window [18:27:25] aye [18:27:32] no more syncing files unless we have an emergency [18:27:42] and it is secondary everywhere, so its ok if the indexes aren't built yet? [18:27:46] yeah [18:27:48] aye k [18:28:00] they won't be built for a while. not sure how long yet, but a while [18:28:09] once I get to run the job queue on it we'll know. [18:28:26] <^d> I'll remove my livehack and turn updates back on. [18:28:33] so there's no actual syncing left to do, right?
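[editor's note] The property manybubbles verified in the PoolCounter source — every lock a client holds is dropped when its connection terminates, so a killed job cannot leak a lock forever — can be modelled in a few lines. This is a toy Python sketch of that behaviour, not the real C daemon, and the client/lock names are made up.

```python
class LockServer:
    """Toy model of a PoolCounter-style lock daemon: every lock is tied to
    the client connection that acquired it."""

    def __init__(self):
        self.locks = {}  # lock key -> client id holding it

    def acquire(self, client, key):
        if key in self.locks:
            return False  # held elsewhere; the real daemon queues/limits instead
        self.locks[key] = client
        return True

    def release(self, client, key):
        if self.locks.get(key) == client:
            del self.locks[key]

    def on_disconnect(self, client):
        # The crucial property: a terminated connection (e.g. a killed
        # runJobs process) automatically releases everything it held.
        for key in [k for k, c in self.locks.items() if c == client]:
            del self.locks[key]

server = LockServer()
server.acquire("runJobs-4242", "cirrusSearchLinksUpdate:itwiki")
server.on_disconnect("runJobs-4242")  # job killed mid-run; lock not leaked
```

Without the `on_disconnect` hook, the "it might just eat the lock, like forever" scenario above would be exactly what happens.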
[18:28:54] you're just making sure things are good as is, just in case something isn't, so we can roll back in our window if we have to? [18:29:15] ottomata: ^d has to remove his uncommitted hack that he used to turn us off [18:29:22] yeah [18:29:31] I want to roll back within our window so I don't bother whoever is next [18:29:35] <^d> Well, I committed to tin :p [18:29:42] <^d> Just didn't bother pushing to gerrit. [18:30:03] !log demon synchronized wmf-config/CommonSettings.php 'Search updates back on for Cirrus' [18:30:18] Logged the message, Master [18:31:48] <^d> brb in 5-10mins. If you have to emergency turn it off again before I'm back just comment the "$wgDisableSearchUpdate = false" on line 874 in CommonSettings again [18:32:00] yup [18:32:33] (03PS1) 10Ottomata: Depending on logtail package, not logcheck [operations/debs/logster] (debian) - 10https://gerrit.wikimedia.org/r/98565 [18:32:57] (03CR) 10Ottomata: [C: 032 V: 032] Depending on logtail package, not logcheck [operations/debs/logster] (debian) - 10https://gerrit.wikimedia.org/r/98565 (owner: 10Ottomata) [18:34:45] !log added logster deb, installed on cp3003 for testing, will puppetize shortly [18:35:00] Logged the message, Master [18:35:42] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Dec 2 18:35:42 UTC 2013 [18:35:57] Logged the message, Master [18:41:57] <^d> back [18:45:51] ^d: stuff is getting stuck again, I think [18:46:10] so rollback is in order - or just leave the search updates off [18:46:13] <^d> Grr, why, I wonder. [18:46:18] I haven't a clue. [18:46:27] happening on itwiki but not mw.org [18:46:39] <^d> Stuck, how? What sort of symptoms are you seeing? [18:46:40] guh [18:46:56] ^d: cirrusSearchLinksUpdate: 510 queued; 281 claimed (281 active, 0 abandoned) [18:47:01] just sitting there forever [18:47:29] <^d> Just itwiki? [18:48:04] coren: do you want me to set the raid cfg in labstore1001 and 1002? 
if you do add it to the ticket plz [18:49:16] <^d> manybubbles: Maybe AaronSchulz can help us :) [18:49:21] <^d> Since its jobqueue [18:49:46] <^d> (Or we could index not using the jobqueue, for wikis that aren't huge) [18:50:08] ^d: we could index them with search update off [18:50:28] ^d: so I can't tell exactly, but it looks like it really is only working on a few wikis [18:50:32] <^d> Let's do that. [18:50:49] :( [18:50:56] cmjohnson1: I can do it myself if you're busy; it's no harder for me. [18:51:30] ^d: can you turn the updates back off? we're safe if we stay like that [18:51:33] nothing gets stuck [18:51:49] <^d> Doing it now. [18:52:07] paravoid, still awake? :) [18:52:13] yes [18:52:14] coren, I am in there anyway, i can do it [18:52:21] (03PS1) 10Chad: Disable search updates for Cirrus wikis for the time being [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98571 [18:52:33] (03CR) 10Chad: [C: 032 V: 032] Disable search updates for Cirrus wikis for the time being [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98571 (owner: 10Chad) [18:52:48] Reedy: here now? :) [18:52:56] ^d: [02-Dec-2013 18:20:00] Fatal error: Call to undefined method CirrusSearchConnection::setTimeout() at /usr/local/apache/common-local/php-1.23wmf4/extensions/CirrusSearch/includes/CirrusSearchUpdater.php on line 136 [18:53:02] wasn't appearing in my fatal monitor [18:53:07] but that is it, I'm sure [18:53:17] !log demon synchronized wmf-config/CommonSettings.php 'Search updates back off for Cirrus wikis :(' [18:53:20] cmjohnson1: I need JBOD on that; in practice, it means raid 0 of just one disk 48 times. You *sure* you don't want me to do it from the command line instead? :-) [18:53:33] Logged the message, Master [18:53:34] <^d> manybubbles: We needed to update Elastica, yes? [18:53:39] looks like it [18:53:42] <^d> fml. [18:53:44] anyone wants to hold my hand while i update database? 
[18:53:47] but we're over budget [18:53:58] coren: ah yeah ...go ahead that would prolly be better :-P [18:53:58] so maybe rollback and start again in another clear slot [18:53:59] paravoid: Yeaah [18:54:20] greg-g: we think we've found something but it is time to roll back [18:54:21] finally :) [18:54:22] cmjohnson1: I mean, just sayin'. :-) [18:54:31] manybubbles: :( [18:54:35] I mean :) but :( [18:54:53] Reedy: can you review (or even deploy?) https://gerrit.wikimedia.org/r/#/c/97107/ ? [18:54:53] cmjohnson1: As long as you see all four shelves from the BIOS I'm golden. [18:55:07] yurik needs it and I wanted your opinion [18:55:19] paravoid needs it too! [18:56:08] RobH: Have time today to coach me re: renaming analytics servers? [18:56:34] greg-g, as part of today's depl, i need to change meta's db schema to allow for flagged revs extension. Do i need to get db ops involved? [18:56:38] <^d> manybubbles: Ah, it's just 1.23wmf4 wikis. [18:56:43] <^d> wmf5 is on master already. [18:56:48] ^d: magic [18:56:51] <^d> Easily fixed. [18:56:56] <^d> That explains which ones work. [18:56:57] coren: labstore1001 sees the 2 shelf attached & 1002 sees it's 2 shelves. I moved 1001 to rack C3. Also fixed network cfg [18:57:25] yurik: yeah [18:57:34] yurik: that's be springle, who isn't online yet [18:57:38] cmjohnson1: Wait, no, all four should be daisy chained, with 1001 on one controller and 1002 on the other. [18:58:04] (03CR) 10Reedy: [C: 04-1] "Target of mobileredirect.php symlink doesn't exist" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [18:58:07] greg-g, what tz is he in? [18:58:08] paravoid: Not when it's broken [18:58:14] yurik: australia [18:58:57] Reedy, this is modeled on extract2 [18:59:00] yurik: Sure [18:59:02] But the file doesn't exist [18:59:09] but i can't create a link! [18:59:12] cmjohnson1: Damn, did you just move 1001 *away* from 1002? 
I thought you just moved them /together/ [18:59:16] reedy@tin:/a/common/docroot/wwwportal/w$ ls -al /usr/local/apache/common/w/mobileredirect.php [18:59:16] ls: cannot access /usr/local/apache/common/w/mobileredirect.php: No such file or directory [18:59:31] Reedy, of course ! [18:59:35] take a look at that dir [18:59:37] No [18:59:39] I don't need to [18:59:39] it has extract2 [18:59:43] I know full well [18:59:50] extract2 is a link to ..\extract2.php [18:59:52] Yes [18:59:53] I know full well [18:59:56] i can't create that link [18:59:57] You need to add it to the mediawik-config repo [18:59:58] yurik: https://wikitech.wikimedia.org/wiki/Schema_changes [18:59:59] review [19:00:00] merge [19:00:02] git pull onto tin [19:00:04] sync-docroot [19:00:13] coren: yeah..that is what I said this morning. I didn't realize you need all together. that's 12U together. lemme see if I have the space [19:00:25] paravoid: may I re-enable cross wiki banner hiding in CentralNotice? [19:00:27] I am not sure if I have cables long enough to daisy chain of that together either [19:00:39] (03Abandoned) 10MaxSem: Serialize special page updates [operations/puppet] - 10https://gerrit.wikimedia.org/r/95876 (owner: 10MaxSem) [19:00:46] mwalker: is the caching fix deployed? [19:00:54] mwalker: and did you run it by ori-l? :) [19:01:01] !log demon synchronized php-1.23wmf4/extensions/Elastica 'Fix missing code in Elastica on 1.23wmf4 wikis' [19:01:02] (03PS1) 10Manybubbles: Turn cirrus back off [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98575 [19:01:02] caching fix is deployed [19:01:05] reedy@tin:/usr/local/apache/common/w$ sudo -u mwdeploy touch reedytest.php [19:01:05] reedy@tin:/usr/local/apache/common/w$ [19:01:05] <^d> manybubbles: Should be fixed now ^ [19:01:08] cmjohnson1: If you lack the cables, we can add shelves at some undefined point in the future so long as I have at least two now. 
[19:01:09] That also works [19:01:13] But is the wrong way to do it [19:01:16] Logged the message, Master [19:01:18] ah [19:01:56] <^d> So, I think our fatal's fixed, and we're still with search updates off. [19:01:59] <^d> Which is safe :) [19:02:06] cmjohnson1: Yeah, I obviously misunderstood what you meant when you told me you had to move 1001 to have room, I expected you meant room to put them and the shelves together. [19:02:36] (03Abandoned) 10Manybubbles: Turn cirrus back off [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98575 (owner: 10Manybubbles) [19:03:22] drawback to not actually "talking" it's nbd, I can fit in C3, they're going to have to go high on the rack which sux but it's the only way [19:04:46] greg-g and ^d and ottomata: so, yeah, that should be it. looks like fatalmonitor was filtering out my fatals..... [19:04:55] huh [19:05:14] ori-l: about 10 days ago paravoid disabled cross wiki banner hiding because I wasn't caching the hide requests for some reason; and because I was slamming the mobile cache -- I am now caching, and will no longer make calls to the mobile cache from the desktop -- may I re-enable the feature? [19:05:51] mwalker: sorry I wasn't clear; I referred you to ori-l because of the client-side implications of that feature [19:06:09] I know it wouldn't be an issue for the cache infrastructure, I'm not worrying about that [19:06:09] greg-g, ^d, and ottomata: nope, actually I was just looking at the wrong fatals [19:06:25] heh [19:06:49] turns out fluorine:/a/mw-log/fatal.log and fenari:/home/wikipedia/syslog/apache.log are unique [19:06:53] <^d> manybubbles: fatalmonitor is just some fancy one-linering of normal logs :) [19:07:05] ^d: but no fatals! 
[19:07:10] no wonder I couldn't find it [19:07:19] well, some fatals I guess [19:07:28] but not the fatal fatals that fataled us [19:07:28] <^d> (Also, the apache log is now on fluorine too, thanks ori :) [19:07:35] paravoid: soo, when do you think we can stop using swift in tampa all together? [19:08:14] greg-g: soooo, now that we know what we broke, can we have another crack at it some time? [19:08:28] manybubbles: third times a charm? [19:08:35] AaronSchulz: our strategy is still unclear at this point [19:08:47] greg-g: try try again [19:08:50] manybubbles: wed 11-1, if that's not too soon [19:09:06] AaronSchulz: there are some ideas of keeping one floor in tampa with one copy of one db per shard/swift/other mission critical data [19:09:10] greg-g: nothing would be too soon [19:09:11] AaronSchulz: until the new DC arrives, that is [19:09:13] manybubbles: 'tis the only way that works [19:09:23] that'a 11-1 east coast or west coast? [19:09:27] manybubbles: ok, then technically, 2-3 is open today, but that's late for you [19:09:27] (03PS1) 10Yurik: Added link to ../mobilelanding.php in w/ [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98576 [19:09:30] 2-3 pacific [19:09:33] k [19:09:42] ottomata: 11-1 pacific, sorry [19:09:43] I can do 2-3 today. wait, isn't that during our meeting? [19:09:51] er, gah, sorry [19:09:53] Reedy, ^ https://gerrit.wikimedia.org/r/#/c/98576/ [19:09:53] 3-4 [19:10:01] VE is 2-3 anyways [19:10:08] I can do it. For science! [19:10:26] k [19:10:30] <^d> Let's do it during our meeting :D [19:10:36] pretty much all we need to do is revert ^d's turning off search updates and see if everything is unbroken [19:10:39] <^d> Then everyone will be there if things break. [19:11:01] chock full day [19:11:08] ^d: I'm going to run your one liner to clear the list of stuck jobs. something is weird when they fail like this. they get forever stuck in the queue [19:11:13] greg-g, should i go with zero? 
<^d> manybubbles: Okie dokie. Just make sure to define $myWikis :) [19:11:35] (03CR) 10Ori.livneh: [C: 032] Saner copy for Puppet freshness alerts [operations/puppet] - 10https://gerrit.wikimedia.org/r/98476 (owner: 10Ori.livneh) [19:11:41] yurik: do you have your schema change done? [19:11:50] greg-g, separate issue [19:12:12] i want to get master out (solves new m. issues for ops) [19:12:27] and once that's out, want to get some config changed [19:12:31] and deploy rev flags [19:12:37] but rev flags is on low burner [19:13:18] yurik, i'm here [19:13:52] ugh, disconnected, what I tried to say before: [19:14:02] yurik: ok, you said 'need' before, and I interpreted it as such, sorry [19:14:23] greg-g, that is the requirement to get flagged revisions on meta [19:14:50] ok, so there's a lot of things floating around here, can you give an explicit list of what's going out now and later? [19:15:00] sure [19:15:01] ^d: done [19:16:43] <^d> ugh, memcached-serious is complaining about mwNN boxes (*not* mwNNNN) [19:16:52] <^d> "SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY" [19:17:08] tim has prepared a libmemcached patch about this [19:17:11] it's blocked on me [19:17:39] <^d> Ah, didn't know that. Will ignore then :) [19:17:57] and I told him I'd prioritize it [19:17:59] two weeks ago... [19:18:10] sorry, I'll have a look soon [19:19:29] ok, I would like to 1) get the latest master of zero synced 2) get https://gerrit.wikimedia.org/r/#/c/98576/ pushed (unless Reedy objects), 3) get https://gerrit.wikimedia.org/r/#/c/97107/ [19:19:37] <^d> manybubbles: Also, `tail -f runJobs.log | grep -i cirrus` is useful. [19:19:38] ^d: hey one great thing. we didn't fail in anyone's face! [19:19:54] ^d: was doing that [19:19:57] <^d> :) [19:20:00] it was showing starting and not finishing [19:20:07] but I thought it was stuck, not crashed! [19:20:16] <^d> They're all returning good now :) [19:20:39] noops.... [19:20:42] but really really fast!
[19:20:43] greg-g, also, time permitting, i would like to deploy https://gerrit.wikimedia.org/r/#/c/95662/ which needs db schema change on meta [19:21:12] <^d> So, when we gonna attempt updates again? Now? 2? [19:21:36] ^d: right after the platform team meeting [19:21:42] <^d> Sounds good. [19:21:57] (03PS1) 10Chad: Revert "Disable search updates for Cirrus wikis for the time being" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98581 [19:22:04] ^d: you synced out the elastic plugin for wmf4, right? [19:22:14] (03CR) 10Chad: "Not merging yet, just wanted to prep for later today." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98581 (owner: 10Chad) [19:22:17] <^d> Yeah. [19:22:22] (03PS1) 10coren: Tool Labs: install pep8 in dev environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/98582 [19:23:15] (03CR) 10coren: [C: 032] "Simple package add." [operations/puppet] - 10https://gerrit.wikimedia.org/r/98582 (owner: 10coren) [19:23:23] greg-g, to do the schema changes, i need to follow this: http://git.wikimedia.org/blob/operations%2Fmediawiki-config.git/a8723d447344c57a4f40b52eae076c683201a11a/wmf-config%2FInitialiseSettings.php#L10138 [19:23:37] greg-g, should i start the zero depl? 
[19:23:58] yurik: man, chill :) [19:24:11] you're bombing greg with questions :) [19:24:31] paravoid, i'm listing my intended steps, per his requset :) [19:24:43] already half an hour behind on the depl window [19:24:55] not that i'm worried :) [19:32:13] (03PS1) 10Ori.livneh: Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 [19:34:19] (03PS2) 10Ori.livneh: Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 [19:34:34] (03CR) 10jenkins-bot: [V: 04-1] Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 (owner: 10Ori.livneh) [19:36:23] (03PS3) 10Ori.livneh: Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 [19:37:18] (03CR) 10jenkins-bot: [V: 04-1] Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 (owner: 10Ori.livneh) [19:39:44] (03CR) 10jenkins-bot: [V: 04-1] Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 (owner: 10Ori.livneh) [19:40:13] (03PS4) 10Ori.livneh: Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 [19:41:42] greg-g, should i start or is there something else being deployed? [19:43:31] paravoid: thanks for looking into the potential Varnish/Parsoid issue! [19:43:42] gwicke: no worries [19:44:27] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:45:08] yurik: sorry, you can start, nothing else is on the calendar [19:45:17] ok, here i go [19:45:22] dr0ptp4kt, deploying zero ext [19:45:27] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [19:49:21] yurik, like i said, i am taking off for lunch. call me if you get into an emergency situation. 
[19:51:22] (03PS1) 10Hashar: Tool Labs: install pyflakes in dev environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/98594 [19:52:24] gwicke: thanks for the ping :) [19:53:53] !log yurik synchronized php-1.23wmf4/extensions/ZeroRatedMobileAccess/ [19:54:08] Logged the message, Master [19:54:10] paravoid: sure ;) [19:59:03] (03CR) 10MaxSem: [C: 032] Added link to ../mobilelanding.php in w/ [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98576 (owner: 10Yurik) [19:59:35] (03PS4) 10Yurik: Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 [20:00:14] (03Merged) 10jenkins-bot: Added link to ../mobilelanding.php in w/ [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98576 (owner: 10Yurik) [20:00:38] (03CR) 10jenkins-bot: [V: 04-1] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:02:04] !log yurik synchronized php-1.23wmf5/extensions/ZeroRatedMobileAccess/ [20:02:20] Logged the message, Master [20:06:21] (03CR) 10coren: [C: 032] "Simple enough. Though if you have flakes in your py, I'd consult a professional." [operations/puppet] - 10https://gerrit.wikimedia.org/r/98594 (owner: 10Hashar) [20:09:15] (03PS1) 10Ottomata: Adding varnishkafka::monitoring class to send stats to Ganglia. [operations/puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/98601 [20:10:38] !log yurik synchronized w/mobilelanding.php [20:10:53] Logged the message, Master [20:11:02] ottomata: did you ever work out how to use git-deploy to deploy jar files? [20:11:17] I want to start using elasticsearch's icu plugin but it is jar files. 
[20:11:24] we deploy plugins with git deploy already [20:11:25] (03PS1) 10Ori.livneh: (WIP) rewrite mwprof in Go [operations/software/mwprof] - 10https://gerrit.wikimedia.org/r/98602 [20:11:30] PROBLEM - Puppet freshness on cp3012 is CRITICAL: Last successful Puppet run was Mon Dec 2 17:10:45 2013 [20:11:57] (03CR) 10Ori.livneh: [C: 032] Add mwprof module [operations/puppet] - 10https://gerrit.wikimedia.org/r/98585 (owner: 10Ori.livneh) [20:12:07] (03PS5) 10Yurik: Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 [20:12:16] (03CR) 10jenkins-bot: [V: 04-1] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:15:16] (03PS6) 10MaxSem: Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:15:22] (03CR) 10jenkins-bot: [V: 04-1] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:15:29] bleh [20:15:30] PROBLEM - Puppet freshness on cp3011 is CRITICAL: Last successful Puppet run was Mon Dec 2 17:15:19 2013 [20:15:54] Reedy, do you know why https://gerrit.wikimedia.org/r/#/c/97107/ could still be failing? [20:16:09] i just synced the w/mobileportal.php file [20:18:41] MaxSem, btw, go ahead with your depl then [20:18:49] thanks [20:19:16] hashar, any idea why https://gerrit.wikimedia.org/r/#/c/97107/ is failing? 
[20:20:03] yurik: looking [20:20:08] thx [20:20:35] because it's a symlink to a file that exists outside of the repository [20:20:42] but it has a .php extension, so jenkins attempts to check it [20:20:50] except it can't resolve the link [20:21:20] jenkins should probably exempt symlinks [20:21:32] yurik: well the file does not exist [20:21:49] hashar, i just synced it to all the prod servers [20:24:30] yurik: i don't get it [20:24:38] yurik: where is the code of mobileredirect.php ? [20:25:44] hashar, the code is in the same place as extract2.php - in the root of mediawiki-config repo [20:25:48] yurik: anyway the jenkins check can be ignored in that corner case [20:25:56] how is it synced if it's not merged [20:26:10] yurik: and the repo is missing the /w/mobileredirect.php anyway [20:27:20] hashar, pull the repo [20:27:21] its there [20:28:09] hashar, https://gerrit.wikimedia.org/r/#/c/98576/ [20:28:12] na no mobileredirect.php for me [20:28:39] hey AaronSchulz [20:28:52] hashar, if you think all is good here, could you +2 :) [20:28:53] is there a bug for the mismatch of timeout settings between varnish and php? [20:29:04] yurik: as I said, there is no mobileredirect.php [20:29:04] not that I know of [20:29:22] hashar, i'm confused - where are you looking? [20:29:30] paravoid: want to file one or should I? [20:29:35] I'll just fix it [20:29:38] though the varnish timeout is wall-clock and the php one CPU based [20:29:50] so they are not totally easy to "match" perfectly [20:30:05] but yeah, they probably should not be hugely different [20:30:06] right now it's 30 vs. 180 [20:30:26] (03CR) 10Hashar: [C: 04-1] "Jenkins fails the php lint because the symbolic link of mobileredirect.php points to a full path which is not on the box." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:30:30] so clearly wallclock vs. 
CPU is not the issue :) [20:30:31] right...heh, I recall CheckUsers getting use out of high timeout ;) [20:31:03] paravoid: I'm just speaking generally [20:31:19] there is also apache timeout mixed in there somewhere [20:31:25] I got a Special:Contributions page during the weekend that consistently takes 45s to load [20:31:32] hashar, which box are you talking about? I modeled mobilelanding approach on extract2 - and its done exactly the same way unless i messed it up somehow [20:32:50] yurik: do you know fatal.log has lots of [02-Dec-2013 15:29:56] Fatal error: Maximum execution time of 180 seconds exceeded at /usr/local/apache/common-local/php-1.23wmf4/extensions/ZeroRatedMobileAccess/includes/PageRenderingHooks.php on line 225 ? [20:33:06] clearly we need to increase the threshold! [20:33:12] ori-l, that's not good [20:33:14] works with jobqueue [20:33:19] ori-l: do you know if there is a bug for ganglia-based alerts? [20:33:20] ori-l, is that a recent thing? [20:33:36] gwicke: we have graphite-based alerts now, I'd suggest that instead :) [20:33:45] gwicke: not sure, but i second paravoid [20:33:56] yurik: in the git repo, you are introducing a symbolic link that points to /w/mobileredirect.php which is not added by that change. [20:34:00] I see, how can I use those on the typical ganglia host stats? [20:34:18] gwicke: if it's for core system metrics, then yes, use ganglia [20:34:26] this is re https://bugzilla.wikimedia.org/show_bug.cgi?id=57265 [20:34:52] yurik: so there are two issues in the change. The first can be ignored which is that the symlink points to a file using a full path, that is not going to exist on jenkins. 
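[editor's note] The wall-clock vs CPU distinction above is easy to demonstrate: Varnish's first_byte_timeout measures elapsed time, while PHP's max_execution_time counts CPU time (on Linux), so a request that is merely blocked waiting burns the former but not the latter. A quick illustration, with Python standing in for PHP:

```python
import time

# Sleeping (waiting on a slow backend, a lock, the network) advances
# wall-clock time but consumes almost no CPU time, which is why a
# wall-clock proxy timeout and a CPU-based script timeout can never be
# matched one-to-one.
wall_start = time.perf_counter()   # wall-clock
cpu_start = time.process_time()    # CPU time of this process

time.sleep(0.2)                    # stands in for "blocked, doing no work"

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start
```

Here `wall_elapsed` is at least 0.2s while `cpu_elapsed` stays near zero, so a 30s wall-clock proxy timeout can fire long before a 180s CPU-time script limit does, and vice versa for a CPU-bound request.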
So the lint check can be ignored [20:34:56] (03PS1) 10Faidon Liambotis: varnish: adjust first_byte_timeout to 180s (text) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98612 [20:35:08] (03PS1) 10Jforrester: Add "betar" label to VisualEditor links on eswiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98613 [20:35:25] yurik: but still /usr/local/apache/common/w/mobileredirect.php does not exist since it is not in the operations/mediawiki-config.git repo, neither in master nor in the patchset you wrote. I guess you forgot to git add it [20:35:43] yurik: timeouts started Oct. 25th [20:36:05] yurik: https://dpaste.de/M78C/raw [20:36:37] hashar, but what about https://gerrit.wikimedia.org/r/#/c/98576/ ??? that's the patch where I added w/mobilelanding.php [20:36:51] and I sync-common-file it already [20:37:46] (03CR) 10Faidon Liambotis: [C: 032 V: 032] varnish: adjust first_byte_timeout to 180s (text) [operations/puppet] - 10https://gerrit.wikimedia.org/r/98612 (owner: 10Faidon Liambotis) [20:38:27] ori-l: is the plan to migrate to graphite, or does it make sense to open a bug for ganglia alerts? [20:39:09] yurik: does it add mobileredirect.php ? [20:40:54] hashar, i'm an idiot, thank you! it should have been mobileredirect.php :( [20:41:17] gwicke: the plan is to migrate or duplicate all application metrics to graphite; dunno re: system metrics, though there's a case to be made there [20:41:22] but it does make sense to open a bug, yes [20:41:51] what ori-l said [20:42:01] graphite is also nicer for alerts because you can apply functions [20:42:38] yeah, I'm just concerned about the things graphite does not currently cover [20:42:44] (03PS1) 10MarkTraceur: Enable GWToolset on betacommons [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98684 [20:42:51] gwicke: yeah, it's a fair point; no reason to be blocked [20:42:52] bd808: Want to review/merge ^^ ? 
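The wall-clock vs. CPU distinction discussed above is easy to demonstrate. A minimal Python sketch (illustrative only — not the varnish or PHP implementation): time spent blocked, e.g. waiting on a slow backend, advances the wall clock but barely touches CPU time, which is why a 30s `first_byte_timeout` and a 180s `max_execution_time` guard different failure modes and can't be matched exactly.

```python
import time

def measure(fn):
    """Return (wall_seconds, cpu_seconds) consumed by calling fn()."""
    w0, c0 = time.monotonic(), time.process_time()
    fn()
    return time.monotonic() - w0, time.process_time() - c0

# A backend stalled on I/O (modeled here by sleep) burns wall-clock time
# but almost no CPU: varnish's first_byte_timeout (wall-clock) would
# eventually fire, while a CPU-based execution limit would barely advance.
wall, cpu = measure(lambda: time.sleep(0.2))
```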
[20:43:10] paravoid: re: system metrics, been hearing good things about https://github.com/BrightcoveOS/Diamond [20:43:15] me too [20:43:16] dreamhost uses it [20:43:23] (03CR) 10Jforrester: [C: 04-1] "Do not merge until community discussion is complete." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98613 (owner: 10Jforrester) [20:43:25] https://bugzilla.wikimedia.org/show_bug.cgi?id=57882 [20:43:28] I've mentioned it some time ago [20:44:06] garg. icinga is my fault. fixing [20:44:12] there's so much work to do with graphite, argh [20:44:17] (03PS1) 10Hashar: beta: let jenkins-deploy restart Parsoid [operations/puppet] - 10https://gerrit.wikimedia.org/r/98685 [20:44:28] ori-l: one step at a time :) [20:44:31] gwicke: thanks! CC'd self [20:44:43] ori-l: tiny steps are best :-] [20:44:49] hmm yeah, what paravoid said. [20:45:08] !log maxsem synchronized php-1.23wmf4/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/#/c/95636/' [20:45:21] ori-l: I thought maybe we could use two separate graphite / statsd instances. One for mw profiling and another one for misc jobs. That might help. [20:45:23] Logged the message, Master [20:45:30] though I have no idea how resource intensive graphite is [20:45:31] ori-l: you know ganglia can write to carbon too, right? [20:45:59] (03PS7) 10Yurik: Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 [20:46:34] !log maxsem synchronized php-1.23wmf5/extensions/MobileFrontend/ 'https://gerrit.wikimedia.org/r/#/c/95636/' [20:46:50] Logged the message, Master [20:47:19] paravoid: ganglia can write to carbon, graphite can write to rrds, ganglia-web can use graphite to render graphs, statsd can write to ganglia and graphite, collectd can write to statsd and graphite and ganglia, etc. [20:47:22] (03CR) 10BryanDavis: [C: 04-1] "Added Dan to review so he can tell us what settings need to be added."
(031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98684 (owner: 10MarkTraceur) [20:47:41] the flexibility is nice on the one hand but points to the lack of a good, standard full stack [20:47:50] you forgot bucky [20:47:58] and logster maybe [20:48:04] MaxSem, are you done with depl? [20:48:04] and skyline [20:48:11] yurik, yup [20:48:30] i will finish up by getting https://gerrit.wikimedia.org/r/#/c/97107/ deployed... finally :) [20:49:17] MaxSem, can you +2 it pls? [20:49:50] (03CR) 10MaxSem: [C: 032] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:50:06] MaxSem: you will have to force merge that patch [20:50:31] MaxSem: jenkins attempts to php -l the symbolic link which points to /usr/local/apache/common/w/mobilelanding.php . That does not exist on Jenkins servers :/ [20:51:02] jenkins still hasn't said anything this time [20:51:37] yurik: yeah it is stuck in a traffic jam. l10n changes are kicking in that overload jenkins every day around this time. [20:51:59] yurik: you can tell on the first graph at https://integration.wikimedia.org/zuul/ [20:52:11] yurik: there is a green line that shows the # of patchsets created [20:52:43] paravoid: do you have a few minutes to review a sudo privilege for beta please? https://gerrit.wikimedia.org/r/#/c/98685/ :) [20:52:52] hashar, is it possible to make it ignore them? [20:53:08] (03CR) 10jenkins-bot: [V: 04-1] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:53:27] MaxSem: yeah potentially we could eventually filter out symbolic links pointing to some non-relative paths.
[20:53:30] (03CR) 10jenkins-bot: [V: 04-1] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:53:38] hashar, I mean i18n [20:54:03] MaxSem: haven't found a way to ignore it [20:54:22] MaxSem: if it was just me I would put all the i18n files out of the code repos :-D [20:54:33] (03CR) 10Yurik: [C: 032 V: 032] Created mobile portal m.wikipedia.org and zero.wikipedia.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97107 (owner: 10Yurik) [20:54:47] yurik: are you going to look at the timeout issue? [20:55:04] ori-l, yes, sorry, want to finish getting links deployed [20:55:20] and btw, i briefly looked - nothing apparent, will need to investigate [20:55:23] ok, i'm going to look at something else and wanted to make sure it didn't disappear [21:02:52] !log yurik synchronized docroot and w [21:03:08] Logged the message, Master [21:05:14] (03PS6) 10Yurik: for m.wikipedia.org and zero.wikipedia.org [operations/apache-config] - 10https://gerrit.wikimedia.org/r/97115 [21:06:36] greg-g, ok, i think i'm done for now [21:07:23] paravoid, https://gerrit.wikimedia.org/r/#/c/97115 is ready for deployment -- virtual host updates [21:07:42] yurik: thanks [21:07:50] all other prereqs have been sorted out i think [21:08:09] greg-g, so who should i bug about db update on meta? [21:08:45] (this is not a new db schema - this is adding an existing extension to metawiki) [21:09:10] yurik: did you read that schema changes page I linked? [21:09:48] sean is the man for those [21:10:05] PROBLEM - Puppet freshness on sq37 is CRITICAL: Last successful Puppet run was Sun 01 Dec 2013 06:02:06 AM UTC [21:10:46] (03CR) 10Dan-nl: "not sure how these would be added or what values would be used. below are the values i have in LocalSettings.php.
INSTALL has further info" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98684 (owner: 10MarkTraceur) [21:14:53] (03CR) 10Ottomata: [C: 032 V: 032] Correctly calculate escape buffer size [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/98134 (owner: 10Edenhill) [21:14:54] greg-g, yes, i read it, but all the bugs in bugzilla talk about changing the schema, not deploying existing extensions that are in production to a new server. Should I add it to bugzilla also? [21:15:32] s/server/wiki [21:16:37] yurik: then that might be a reedy type thing [21:17:00] Reedy, is it your type thing? :) [21:17:24] What schema change? [21:18:28] (03CR) 10Ottomata: [C: 032 V: 032] Added %{VCL_Log:key}x support [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/98135 (owner: 10Edenhill) [21:20:55] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [21:21:59] (03CR) 10Ottomata: [C: 032 V: 032] Tag column reader was used incorrectly [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/98136 (owner: 10Edenhill) [21:23:35] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:25:03] (03PS1) 10Ottomata: log.statistics.file now defaults to /tmp/varnishkafka.stats.json [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/98694 [21:25:12] nooo [21:25:23] (03CR) 10Ottomata: [C: 032 V: 032] log.statistics.file now defaults to /tmp/varnishkafka.stats.json [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/98694 (owner: 10Ottomata) [21:25:27] haha [21:25:27] paravoid [21:25:30] that's just in the software [21:25:31] not in debian [21:25:38] debian will put it in /var/cache/varnishkafka [21:25:50] the software can't write to /var/cache on its own [21:25:59] ees bad? [21:26:01] doesn't matter, don't use /tmp [21:26:11] cwd? [21:26:24] what's wrong with /var/cache/varnishkafka/ ?
[21:26:40] because the software doesn't create that directory [21:26:47] you can't run it with that default unless you do some stuff first [21:26:57] I don't understand [21:27:01] (03PS1) 10RobH: RT:6428 testsearch1XXX to logstash1XXX RT:6428 deploying new logstash servers [operations/dns] - 10https://gerrit.wikimedia.org/r/98695 [21:27:03] have the package make that dir? [21:28:03] (03CR) 10RobH: [C: 032] RT:6428 testsearch1XXX to logstash1XXX RT:6428 deploying new logstash servers [operations/dns] - 10https://gerrit.wikimedia.org/r/98695 (owner: 10RobH) [21:28:15] bleh i put in wrong syntax to link rt ticket, oh well [21:28:25] yes [21:28:27] paravoid [21:28:30] the package will make that dir [21:28:31] RobH: woot! thanks [21:28:33] and set that as the default it installs [21:28:45] but, i need to set a hardcoded default in the code [21:28:53] that will work if someone just compiles and runs varnishkafka [21:28:57] oh [21:29:03] but we are not going to deploy it like that? [21:29:05] no [21:29:13] ah, then I don't care [21:29:14] :P [21:29:18] the .conf file that deb ships will set it to /var/cache/varnishkafka [21:29:18] ori-l: yea, im going to get the systems reinstalled with an os for you, then its up to you guys [21:29:31] ie: they will be installed and talking to puppet/salt/etc [21:29:32] ok phew :p [21:30:08] RobH: yep! can i just ask that the disk setup be the same? [21:30:35] yea, i assumed it was going to be [21:30:43] cool, thank you [21:30:46] im doing a replace of testsearch with logstash ;] [21:30:51] easiest reallocation ever.
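The compiled-default vs. packaged-default split discussed above can be sketched as a small resolution function. The two paths come from the conversation; the fallback logic is an assumption about how such defaults could coexist, not varnishkafka's actual source:

```python
import os

COMPILED_DEFAULT = "/tmp/varnishkafka.stats.json"  # bare source-build fallback
PACKAGED_DEFAULT = "/var/cache/varnishkafka/varnishkafka.stats.json"  # set by the deb's .conf

def stats_path(configured=None):
    """An explicit log.statistics.file setting always wins; otherwise
    prefer the packaged location if its directory exists (the package
    creates it at install time), else fall back to the compiled-in
    default so a plain `make && ./varnishkafka` still runs."""
    if configured:
        return configured
    if os.path.isdir(os.path.dirname(PACKAGED_DEFAULT)):
        return PACKAGED_DEFAULT
    return COMPILED_DEFAULT
```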
[21:31:02] (well, and deleting some, but still) [21:33:19] !log ignore any alerts for testsearch1XXX, I just decommissioned them, but icinga hasn't updated quite yet [21:33:35] Logged the message, RobH [21:35:05] PROBLEM - Host testsearch1001 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:25] PROBLEM - Host testsearch1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:35:42] PROBLEM - Host testsearch1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:18] now that would not be cool of me at all [21:36:23] if those hosts generated pages [21:36:27] but as they were test, they do not ;] [21:36:43] (so no one point out to apergos I skipped the neon update step ;) [21:39:08] aww man [21:39:15] who left uncommitted network changes on the row c stack =p [21:40:13] cmjohnson1: Are you actively making changes on row C network settings now? (I have no idea who it was, but there are few folks who would) [21:40:26] I am making changes yes [21:40:30] haha [21:40:32] me too [21:40:36] so yea, i wont commit mine [21:40:38] as im done [21:40:46] you do yer thing, im just changing port descriptions [21:40:53] just know they are there when you commit [21:40:54] i am done now...i can commit if okay with you? [21:40:56] yep [21:40:59] do it =] [21:41:01] cool [21:41:02] thx [21:41:29] i thought someone left uncommitted stuff, then i realized the list of folks who touch that is myself, you, mark and leslie, and none of those folks do that [21:41:58] heya paravoid, are there packet loss esams issues happening right now? i'm seeing lots of varnishkafka produce errors [21:42:00] haha..yeah I figured it was you [21:42:49] coren: labstore1001/1002 are finished. I was able to see all 4 disk shelves from each controller [21:43:08] latency looks like it's ~140 ms [21:46:28] cmjohnson1: Yeay you! It looks like we may end up doing some DC visits together; I'll buy you a beer for all the trouble.
[21:46:33] :-) [21:46:50] (03PS1) 10Andrew Bogott: Added 'adminadd' tool to auto-generate new user entries. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98700 [21:46:54] sounds like a plan [21:48:28] Moar storage. :-) [21:52:26] (03PS1) 10RobH: RT: 6428 deploy logstash1001-1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/98704 [21:53:56] (03CR) 10RobH: [C: 04-1] "corrections for dhcpd file pending" [operations/puppet] - 10https://gerrit.wikimedia.org/r/98704 (owner: 10RobH) [21:54:50] (03PS2) 10RobH: RT: 6428 deploy logstash1001-1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/98704 [21:56:48] (03CR) 10RobH: [C: 032] RT: 6428 deploy logstash1001-1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/98704 (owner: 10RobH) [21:57:52] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [21:58:16] (03CR) 10Catrope: [C: 032] Fix server name for labs parsoid (deployment-parsoid3) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94062 (owner: 10Catrope) [21:58:24] (03CR) 10Catrope: [C: 032] Enable VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96473 (owner: 10Jforrester) [21:58:29] (03Merged) 10jenkins-bot: Fix server name for labs parsoid (deployment-parsoid3) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94062 (owner: 10Catrope) [21:58:30] (03CR) 10Catrope: [C: 032] Remove cruft from wmgVisualEditorDisableForAnons no longer needed [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94472 (owner: 10Jforrester) [21:58:37] ottomata: 140 from tampa to esams you mean [21:58:37] (03CR) 10Catrope: [C: 032] Enable VisualEditor as opt-in on svwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97711 (owner: 10Jforrester) [21:58:38] (03Merged) 10jenkins-bot: Enable VisualEditor as default to "phase 3" Wikipedias [operations/mediawiki-config] - 
10https://gerrit.wikimedia.org/r/96473 (owner: 10Jforrester) [21:58:43] (03CR) 10Catrope: [C: 032] Enable VisualEditor as opt-in on Swedish Wikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97714 (owner: 10Jforrester) [21:58:59] (03Merged) 10jenkins-bot: Remove cruft from wmgVisualEditorDisableForAnons no longer needed [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94472 (owner: 10Jforrester) [21:59:03] (03Merged) 10jenkins-bot: Enable VisualEditor as opt-in on svwiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97711 (owner: 10Jforrester) [21:59:22] (03Merged) 10jenkins-bot: Enable VisualEditor as opt-in on Swedish Wikimedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/97714 (owner: 10Jforrester) [21:59:51] ottomata: miniscule packet loss but a lot of jitter [22:01:54] asking again: Anyone have a puppet manifest/template that they feel best exemplifies awesomeness? [22:02:37] Snaps: ^^ [22:04:31] !log catrope synchronized visualeditor-default.dblist 'Enable VisualEditor by default on 102 wikis' [22:04:37] (03PS2) 10Andrew Bogott: Added 'adminadd' tool to auto-generate new user entries. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98700 [22:04:46] Logged the message, Master [22:05:28] !log catrope synchronized wmf-config/InitialiseSettings.php 'Only activate VisualEditor in the User namespace on svwiktionary' [22:05:44] Logged the message, Master [22:06:14] !log catrope synchronized visualeditor.dblist 'Enable VisualEditor on svwiktionary and sewikimedia' [22:06:27] Logged the message, Master [22:07:39] Huhm [22:13:23] (03PS3) 10Andrew Bogott: Added 'adminadd' tool to auto-generate new user entries. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/98700 [22:19:51] !log reedy synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [22:20:06] Logged the message, Master [22:20:14] !log reedy updated /a/common to {{Gerrit|Idc406aa68}}: Created mobile portal m.wikipedia.org and zero.wikipedia.org [22:20:18] (03PS1) 10Reedy: Update interwiki cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98711 [22:20:29] Logged the message, Master [22:20:35] (03CR) 10Reedy: [C: 032] Update interwiki cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98711 (owner: 10Reedy) [22:20:43] Reedy: Uhm.... mind not doing config syncs during a scheduled deployment window? [22:20:58] (03Merged) 10jenkins-bot: Update interwiki cache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98711 (owner: 10Reedy) [22:33:18] !log catrope synchronized php-1.23wmf4/extensions/VisualEditor 'Update VE for cherry-pick' [22:33:34] Logged the message, Master [22:33:35] !log catrope synchronized php-1.23wmf5/extensions/VisualEditor 'Update VE for cherry-pick' [22:33:49] Logged the message, Master [22:37:39] PROBLEM - puppet disabled on testsearch1001 is CRITICAL: Connection refused by host [22:38:00] <^d> That seems...wrong? [22:38:09] <^d> testsearch1001 shouldn't exist anymore. [22:38:10] PROBLEM - DPKG on testsearch1001 is CRITICAL: Connection refused by host [22:38:10] PROBLEM - Disk space on testsearch1001 is CRITICAL: Connection refused by host [22:38:29] PROBLEM - RAID on testsearch1001 is CRITICAL: Connection refused by host [22:38:49] (03PS1) 10Manybubbles: Reenable CirrusSearch's updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98714 [22:39:06] it still exists in icinga? [22:39:46] RobH: ^^ re testsearch1001 [22:40:00] is this: 13:35 < RobH> im doing a replace of testsearch with logstash ;] [22:40:03] ? 
[22:40:28] just be careful and don't let it join the production search cluster:) [22:41:09] I'm sure you have it under control though [22:41:10] do you guys not read log [22:41:14] i admin logged it would alert [22:41:17] =p [22:41:24] (03CR) 10Chad: "Dupe of https://gerrit.wikimedia.org/r/#/c/98581/?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98714 (owner: 10Manybubbles) [22:41:57] (03Abandoned) 10Manybubbles: Reenable CirrusSearch's updates [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98714 (owner: 10Manybubbles) [22:46:04] neon is mid puppet update [22:46:12] has been awhile [22:46:34] ori-l: So I have all three of your new hosts installed. I'm having them call into puppetmaster now for their initial runs [22:46:45] RobH: <3! thank you! [22:47:32] quite welcome [22:47:43] so once the initial run is done, i'll assign the ticket to you [22:47:50] and you can resolve or use as you see fit. [22:48:06] yep [22:48:52] and while i babysit these puppet runs, its lunchtime \o/ (because the last time i looked at a clock it was 11:30) [22:49:09] RECOVERY - Disk space on testsearch1001 is OK: DISK OK [22:49:17] damn you icinga [22:49:29] RECOVERY - RAID on testsearch1001 is OK: OK: optimal, 1 logical, 2 physical [22:49:39] RECOVERY - puppet disabled on testsearch1001 is OK: OK [22:50:09] RECOVERY - DPKG on testsearch1001 is OK: All packages OK [22:50:19] PROBLEM - NTP on testsearch1001 is CRITICAL: NTP CRITICAL: Offset unknown [22:50:22] puppet disabled is OK , means it's enabled [22:50:27] heh [22:50:58] yea cept testsearch is gone, i removed from the db but neon takes a very long time to run puppet [22:51:29] yep, got that, i just meant that check in general [22:51:47] it's fairly new we check for puppet being disabled.. 
[22:51:52] as opposed to freshness checks [22:53:29] yep [23:00:52] (03CR) 10Chad: [C: 032] Revert "Disable search updates for Cirrus wikis for the time being" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98581 (owner: 10Chad) [23:01:06] (03Merged) 10jenkins-bot: Revert "Disable search updates for Cirrus wikis for the time being" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/98581 (owner: 10Chad) [23:04:01] !log demon synchronized wmf-config/CommonSettings.php 'Cirrus wikis get searchupdate (take 2)' [23:04:17] Logged the message, Master [23:05:09] hey all; just an fyi -- fundraising just went up 100% [23:11:12] \O/ [23:12:28] PROBLEM - Puppet freshness on cp3012 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 05:10:45 PM UTC [23:13:24] !log demon synchronized php-1.23wmf4/extensions/Elastica [23:13:39] Logged the message, Master [23:14:28] PROBLEM - Puppet freshness on elastic1008 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:01 PM UTC [23:14:28] PROBLEM - Puppet freshness on mw1069 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:06 PM UTC [23:14:28] PROBLEM - Puppet freshness on mw1027 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:11 PM UTC [23:14:28] PROBLEM - Puppet freshness on mw1143 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:01 PM UTC [23:14:28] PROBLEM - Puppet freshness on mw28 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:11 PM UTC [23:15:28] PROBLEM - Puppet freshness on cp4002 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:08 PM UTC [23:15:28] PROBLEM - Puppet freshness on db1050 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:32 PM UTC [23:15:28] PROBLEM - Puppet freshness on db1040 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:08 PM UTC [23:15:28] PROBLEM - Puppet freshness on ms-fe1002 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 
08:14:57 PM UTC [23:15:28] PROBLEM - Puppet freshness on db49 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:42 PM UTC [23:15:28] PROBLEM - Puppet freshness on mw1003 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:52 PM UTC [23:15:28] PROBLEM - Puppet freshness on mw1066 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:52 PM UTC [23:15:29] PROBLEM - Puppet freshness on mw1021 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:13 PM UTC [23:15:29] PROBLEM - Puppet freshness on mw104 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:13 PM UTC [23:15:30] PROBLEM - Puppet freshness on bast1001 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:13 PM UTC [23:15:30] PROBLEM - Puppet freshness on mw1150 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:47 PM UTC [23:15:31] PROBLEM - Puppet freshness on mw1153 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:57 PM UTC [23:15:31] PROBLEM - Puppet freshness on mw1154 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:58 PM UTC [23:15:32] PROBLEM - Puppet freshness on mw1118 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:08 PM UTC [23:15:32] PROBLEM - Puppet freshness on mw1155 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:13 PM UTC [23:15:33] PROBLEM - Puppet freshness on mw1173 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:26 PM UTC [23:15:33] PROBLEM - Puppet freshness on mw1205 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:32 PM UTC [23:15:34] PROBLEM - Puppet freshness on mw1204 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:42 PM UTC [23:15:34] PROBLEM - Puppet freshness on mw79 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:26 PM UTC [23:15:35] PROBLEM - Puppet freshness on search1012 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:57 PM 
UTC [23:15:35] PROBLEM - Puppet freshness on sq52 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:08 PM UTC [23:15:36] PROBLEM - Puppet freshness on sq63 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:13 PM UTC [23:15:36] PROBLEM - Puppet freshness on srv193 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:14:32 PM UTC [23:15:37] PROBLEM - Puppet freshness on osm-cp1001 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:03 PM UTC [23:16:28] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:16:11 PM UTC [23:16:28] PROBLEM - Puppet freshness on amssq47 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:45 PM UTC [23:16:28] PROBLEM - Puppet freshness on cp1017 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:15:55 PM UTC [23:16:28] PROBLEM - Puppet freshness on cp3011 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 05:15:19 PM UTC [23:16:28] PROBLEM - Puppet freshness on db1021 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:16:01 PM UTC [23:16:54] so that was prolly due to its daemon restarting [23:17:00] since i decommissioned and added new hosts [23:17:05] dunno though... [23:18:25] (03PS6) 10Yurik: Apply FlaggedRevs to metawiki for W0. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95662 (owner: 10Dr0ptp4kt) [23:18:28] PROBLEM - Puppet freshness on amssq37 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:17:33 PM UTC [23:18:28] PROBLEM - Puppet freshness on cp4015 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:17:48 PM UTC [23:18:28] PROBLEM - Puppet freshness on cp1018 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:18:14 PM UTC [23:18:28] PROBLEM - Puppet freshness on cp1016 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:17:23 PM UTC [23:18:28] PROBLEM - Puppet freshness on cp4014 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:17:23 PM UTC [23:19:28] PROBLEM - Puppet freshness on amssq51 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:18:45 PM UTC [23:19:28] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:19:11 PM UTC [23:19:28] PROBLEM - Puppet freshness on analytics1022 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:18:40 PM UTC [23:19:28] PROBLEM - Puppet freshness on amssq56 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:18:19 PM UTC [23:19:28] PROBLEM - Puppet freshness on analytics1002 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:18:50 PM UTC [23:20:28] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:20:07 PM UTC [23:20:28] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:19:42 PM UTC [23:20:28] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:20:07 PM UTC [23:20:28] PROBLEM - Puppet freshness on cp1062 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:19:37 PM UTC [23:20:28] PROBLEM - Puppet freshness on analytics1019 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:19:52 PM UTC [23:22:28] PROBLEM - 
Puppet freshness on aluminium is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:23 PM UTC [23:22:28] PROBLEM - Puppet freshness on amssq59 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:28 PM UTC [23:22:28] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:23 PM UTC [23:22:28] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:23 PM UTC [23:22:28] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:43 PM UTC [23:22:28] PROBLEM - Puppet freshness on db1053 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:22:14 PM UTC [23:22:28] PROBLEM - Puppet freshness on manutius is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:44 PM UTC [23:22:29] PROBLEM - Puppet freshness on db36 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:22:04 PM UTC [23:22:29] PROBLEM - Puppet freshness on magnesium is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:54 PM UTC [23:22:30] PROBLEM - Puppet freshness on mchenry is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:54 PM UTC [23:22:30] PROBLEM - Puppet freshness on mc1011 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:54 PM UTC [23:22:31] PROBLEM - Puppet freshness on mw1001 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:22:14 PM UTC [23:22:31] PROBLEM - Puppet freshness on mw1022 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:59 PM UTC [23:22:32] PROBLEM - Puppet freshness on mw1031 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:23 PM UTC [23:22:32] PROBLEM - Puppet freshness on mw114 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:59 PM UTC [23:22:33] PROBLEM - Puppet freshness on mw1145 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:59 PM UTC [23:22:33] 
PROBLEM - Puppet freshness on mw1080 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:38 PM UTC [23:22:34] PROBLEM - Puppet freshness on mw1178 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:38 PM UTC [23:22:34] PROBLEM - Puppet freshness on mw1200 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:59 PM UTC [23:22:35] PROBLEM - Puppet freshness on mw53 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:49 PM UTC [23:22:35] PROBLEM - Puppet freshness on mw64 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:28 PM UTC [23:22:36] PROBLEM - Puppet freshness on sq78 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:23 PM UTC [23:22:36] PROBLEM - Puppet freshness on srv240 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:59 PM UTC [23:22:37] PROBLEM - Puppet freshness on srv270 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:21:54 PM UTC [23:22:37] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:22:04 PM UTC [23:23:06] eek [23:23:11] ori-l, do you know why zero extension is not showing up in graphite? [23:23:21] did i do that with my gerrit checkin??? 
[23:23:27] wow [23:23:28] PROBLEM - Puppet freshness on analytics1017 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:22:24 PM UTC [23:23:28] PROBLEM - Puppet freshness on db31 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:23:01 PM UTC [23:23:28] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:23:11 PM UTC [23:23:28] PROBLEM - Puppet freshness on db1011 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:23:06 PM UTC [23:23:28] PROBLEM - Puppet freshness on elastic1001 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:23:11 PM UTC [23:24:29] how about a single alert for XX hosts with freshness < GOOD and a dependency to the individual alerts? [23:25:19] looks false positive [23:25:23] e.g. puppet runs [23:25:28] PROBLEM - Puppet freshness on cp1009 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:25:10 PM UTC [23:25:28] PROBLEM - Puppet freshness on cp1053 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:24:49 PM UTC [23:25:28] PROBLEM - Puppet freshness on db1015 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:25:10 PM UTC [23:25:28] PROBLEM - Puppet freshness on cp4009 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:24:43 PM UTC [23:25:28] PROBLEM - Puppet freshness on chromium is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:25:10 PM UTC [23:25:50] do you run puppet in a cron or keep the long-running daemonized puppet? [23:26:27] greg-g: hi, who is in today's lightning deploy window?
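The "single alert with dependencies" idea suggested above can be sketched as a tiny aggregation function. This is hypothetical — a real Icinga setup would express it with a parent service plus servicedependency definitions, not Python:

```python
from datetime import datetime, timedelta

def freshness_summary(last_runs, now, threshold=timedelta(hours=2)):
    """Collapse per-host puppet freshness into one parent check: CRITICAL
    with the list of stale hosts if any host hasn't completed a run
    within `threshold`, OK otherwise. Individual host alerts would be
    declared dependent on this parent so a mass event (like a restarted
    icinga daemon) notifies once instead of hundreds of times."""
    stale = sorted(h for h, t in last_runs.items() if now - t > threshold)
    return ("CRITICAL" if stale else "OK", stale)
```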
[23:26:28] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:25:56 PM UTC
[23:26:28] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:26:06 PM UTC
[23:26:28] PROBLEM - Puppet freshness on amssq50 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:25:56 PM UTC
[23:26:28] PROBLEM - Puppet freshness on amssq60 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:26:01 PM UTC
[23:26:28] PROBLEM - Puppet freshness on analytics1021 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:26:06 PM UTC
[23:26:56] se4598: myself for one
[23:27:28] PROBLEM - Puppet freshness on arsenic is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:27:07 PM UTC
[23:27:28] PROBLEM - Puppet freshness on cp1008 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:27:12 PM UTC
[23:27:28] PROBLEM - Puppet freshness on db1049 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:27:02 PM UTC
[23:27:28] PROBLEM - Puppet freshness on cp1019 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:26:52 PM UTC
[23:27:28] PROBLEM - Puppet freshness on db64 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 08:26:57 PM UTC
[23:29:21] se4598: mwalker and that's it
[23:29:29] mlitn: do you currently have time to do https://gerrit.wikimedia.org/r/98073 at the next lightning depl. window?
[23:30:28] it's already recovering, i just killed the bot for less spam here
[23:30:45] cajoel: cron
[23:31:00] (PS1) Mwalker: Changing banner expiration to 10 months [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98723
[23:31:17] base/puppet.cron.erb ..
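The answer at 23:30:45 is that puppet runs from cron (templated by base/puppet.cron.erb) rather than as a long-running daemon. A sketch of what such a crontab entry looks like; the minute values are placeholders and this is not the template's actual contents:

```
# Run the puppet agent twice an hour as a one-shot process instead of
# a daemon; staggering the minutes per host spreads load on the
# puppetmaster. Minutes shown are illustrative.
7,37 * * * * root /usr/bin/puppet agent --onetime --no-daemonize > /dev/null 2>&1
```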
[23:31:34] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[23:32:14] restarts icinga-wm
[23:34:31] puppet makes neon really busy, hey, who else is on neon
[23:34:38] i was
[23:35:00] ok, then that made sense
[23:35:38] ^d: I'm here for a bit!
[23:35:52] <^d> I'm already reindexing :D
[23:35:56] <^d> It's working wonderfully now.
[23:36:19] (PS1) Jgreen: add new install location for drush to sudoers config [operations/puppet] - https://gerrit.wikimedia.org/r/98724
[23:36:29] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 11:29:05 PM UTC
[23:36:41] sweet! you making jobs flow?
[23:36:56] <^d> Jobs a'running
[23:37:06] <^d> It's sooooo much faster this way :D
[23:37:41] (CR) Jgreen: [C: 032 V: 031] add new install location for drush to sudoers config [operations/puppet] - https://gerrit.wikimedia.org/r/98724 (owner: Jgreen)
[23:37:58] ^d: you doing the two passes?
[23:38:14] <^d> I'm doing the first pass on everything now.
[23:38:36] may want to add --forceUpdate --skipLinks --indexOnSkip
[23:38:45] it's in the readme :)
[23:38:56] it should skip a bunch of stuff and make the jobs run faster
[23:39:53] might not matter though, if the jobs are running fast enough
[23:40:21] <^d> Yeah I just fixed those.
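The flags suggested at 23:38:36 belong to CirrusSearch's forceSearchIndex.php maintenance script (as manybubbles says, they are in its readme). A hedged sketch of what the first-pass invocation would look like; the wiki name and the mwscript wrapper are assumptions, not taken from this log:

```
# First reindex pass: force-write every page's search document even if
# unchanged (--forceUpdate), skip the expensive links computation
# (--skipLinks) but still index the page when skipping (--indexOnSkip).
mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php \
    --wiki=enwiki --forceUpdate --skipLinks --indexOnSkip
```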
[23:40:27] <^d> I noticed I was missing something :)
[23:40:50] PROBLEM - MySQL Processlist on db1019 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 302 copy to table, 0 statistics
[23:40:59] PROBLEM - MySQL Processlist on db1003 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 234 copy to table, 2 statistics
[23:40:59] PROBLEM - MySQL Processlist on db1010 is CRITICAL: CRIT 0 unauthenticated, 1 locked, 227 copy to table, 4 statistics
[23:42:31] ^d: neat that with one process spewing jobs it doesn't even make a dent in the job queue. you could spin up five and probably not change it
[23:44:09] RECOVERY - Puppet freshness on cp4013 is OK: puppet ran at Mon Dec 2 23:44:02 UTC 2013
[23:45:20] ori-l: do you know about this check? https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=tungsten&service=HTTP+5xx+req%2Fmin
[23:45:22] ^d: you can really see how fast the links update part is now.
also, I should stop spawning a job for updating links where there are no links to update
[23:45:29] PROBLEM - Puppet freshness on cp4013 is CRITICAL: Last successful Puppet run was Mon 02 Dec 2013 11:44:02 PM UTC
[23:46:00] RECOVERY - MySQL Processlist on db1010 is OK: OK 0 unauthenticated, 0 locked, 17 copy to table, 2 statistics
[23:46:11] mutante: faidon provisioned it; it is an actual problem
[23:46:19] (the thing that the alert is reporting, I mean)
[23:46:49] RECOVERY - MySQL Processlist on db1019 is OK: OK 0 unauthenticated, 0 locked, 1 copy to table, 0 statistics
[23:46:59] RECOVERY - MySQL Processlist on db1003 is OK: OK 0 unauthenticated, 0 locked, 1 copy to table, 1 statistics
[23:47:23] <^d> manybubbles: I wonder if that was us ^
[23:47:38] (PS1) Ori.livneh: Specify managehome => false for "/nonexistent" $HOMEs [operations/puppet] - https://gerrit.wikimedia.org/r/98729
[23:47:39] (PS1) Ori.livneh: Add logstash100[1-3] to site.pp & add bd808 & aaron as sudo per RT 6366 [operations/puppet] - https://gerrit.wikimedia.org/r/98730
[23:47:42] ori-l: thx, found it, i see it uses check_graphite
[23:47:46] <^d> I sped up then slowed down a bit.
[23:48:05] * ^d will keep an eye
[23:48:11] ^d: the mysql recovery?
[23:48:19] mutante: is it alright with you if i add the accounts in site.pp initially? i explained my rationale in RT 6366
[23:48:24] <^d> Yes, the problem then recovery.
[23:48:34] <^d> I wonder if we made indexing toooooo efficient on our side ;-)
[23:48:48] <^d> To where it's possible to overload things like the database :)
[23:49:19] we did it to elasticsearch two weeks ago
[23:49:30] so it is possible. what rate is it spitting out?
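The tungsten alert discussed at 23:45:20 and 23:47:42 is driven by check_graphite, which at its core fetches recent datapoints for reqstats.5xx and compares them against the thresholds visible in the alert output (warn=250, crit=500). A minimal, self-contained Python sketch of that comparison logic; the function name and sample data are illustrative, not check_graphite's actual code:

```python
# Thresholds taken from the alert output above; everything else is a
# hedged sketch of a graphite-backed Nagios-style check.
WARN, CRIT = 250.0, 500.0

def check_5xx(datapoints):
    """Return a Nagios-style status for reqstats.5xx datapoints.

    datapoints: list of (value, timestamp) pairs, the shape graphite's
    JSON render API returns; None values (gaps in the series) are
    ignored so a missing sample can't trip the alert.
    """
    values = [v for v, _ in datapoints if v is not None]
    if not values:
        return "UNKNOWN"
    latest = values[-1]
    if latest >= CRIT:
        return "CRITICAL"
    if latest >= WARN:
        return "WARNING"
    return "OK"

# A burst of 610 5xx/min exceeds crit=500, so this reports CRITICAL.
print(check_5xx([(120.0, 1), (None, 2), (610.0, 3)]))
```

In production the real plugin pulls the series over HTTP from graphite's render endpoint; only the threshold comparison is shown here.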
[23:49:30] (CR) Ori.livneh: [C: 032] Specify managehome => false for "/nonexistent" $HOMEs [operations/puppet] - https://gerrit.wikimedia.org/r/98729 (owner: Ori.livneh)
[23:49:45] <^d> Each thread is doing about 125/s, I had 2 threads.
[23:49:52] btw, the second pass runs even faster because you skip parsing.
[23:49:58] <^d> When I briefly tried a third thread, I saw the mysql panic so I backed off.
[23:49:59] rather, the jobs run faster
[23:50:06] the generator doesn't
[23:50:07] <^d> Yeah
[23:50:27] (PS2) Mwalker: Changing banner expiration to 10 months [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98723
[23:50:50] the second pass is almost as fast in my development environment as the job generator
[23:51:05] it does much less with the db though, so it _should_ be safer
[23:52:34] mutante: poke
[23:52:36] ori-l: hold on, graphite crashed my browser :P
[23:52:40] k
[23:54:55] ^d: about to go, but, honestly, if the job infrastructure executes so many jobs that ours don't look like a blip, what are the odds that we're a blip on mysql? I'm sure it's possible but I don't think likely now that we're "just" doing page views.
[23:55:03] hiding now
[23:55:13] <^d> I think we're fine and I was just being paranoid :)
[23:57:27] ori-l: the second part convinced me more than the first. :)
[23:58:37] ori-l: yea, go ahead, don't need to reinvent things right now..
keep it simple is fine
[23:58:49] RECOVERY - Puppet freshness on cp4013 is OK: puppet ran at Mon Dec 2 23:58:44 UTC 2013
[23:58:52] (CR) Mwalker: [C: 032] Changing banner expiration to 10 months [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/98723 (owner: Mwalker)
[23:59:47] ori-l: all i was saying is it seems better to me not having to touch site.pp each time a user changes on some node.. but don't worry now, we do it everywhere