[00:00:29] bloody puppet reverting all my changes [00:00:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [00:00:33] confirmed i don't see *.m.wikipedia.org [00:01:13] bleh. [00:01:13] did anyone mail out to the mobile team that this change was happening ? [00:01:15] blllllllllaaaaaargggg [00:01:25] tfinc: we were emailed by the mobile team to fix it ;] [00:01:26] New patchset: Pyoungmeister; "adding null search shard for log silence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53489 [00:01:40] but nope, cuz it should have been seemless ;_; [00:02:01] (it affected more than just mobile, change was for all https) [00:02:19] New review: Pyoungmeister; "rebasing this was annoying, so I just made a new patchset here: https://gerrit.wikimedia.org/r/#/c/5..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52547 [00:02:25] to be fair the thread was started by brion to mobile-tech and ops [00:02:34] subject " [Ops] Can somebody fix SSL on non-Wikipedia mobile sites?" [00:02:44] and we did [00:02:45] (assuming the [Ops] part didn't happen on the mobile-tech list though ) [00:02:51] and in the process, broke wikipedia mobile [00:02:51] New patchset: Tim Starling; "Use latest php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53490 [00:02:54] haha [00:03:14] yup. i see it. "subject: Can somebody fix SSL on non-Wikipedia mobile sites?" [00:03:25] we followed directions of that email to the letter [00:03:26] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53490 [00:03:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53489 [00:03:52] what's special about mw1043, does not have /etc/apache2/wmf dir [00:03:55] brion didnt say 'by the way, please dont break wikipedia' [00:04:01] I deployed a few other puppet changes just now [00:04:05] TimStarling: did you just merg [00:04:06] thanks! [00:04:25] HTTPS monitoring and search configuration [00:05:40] TimStarling: the https monitoring was me, thx dude [00:05:41] "Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request." when doing various things...several wikis [00:05:43] otrs wiki, enwiki [00:05:55] IDk if things are known - happening randomly [00:06:01] fixes on reload or two usually [00:06:16] dzahn is doing a graceful restart of all apaches [00:06:17] "The requested URL /wiki/Special:Contributions/Rjd0060 was not found on this server." [00:06:30] wikipedia says no. [00:06:34] did someone change the apache configuration? [00:06:42] Aaron|laptop: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate [00:06:42] Content-Language: en [00:06:42] Vary: Accept-Encoding,Cookie [00:06:42] Expires: Thu, 01 Jan 1970 00:00:00 GMT [00:06:52] TimStarling: I added to redirects about 8 hours ago [00:06:52] i just got a 404 from http://meta.wikimedia.org/wiki/Main_Page as a logged in user [00:06:55] so was prolly someone else. [00:06:58] !log dzahn gracefulled all apaches [00:07:07] Logged the message, Master [00:07:08] robh: so, it would be ideal to have checks for every top level domain [00:07:15] mind adding a ticket for that? [00:07:22] bonus points for actually adding the checks [00:07:26] icinga checks ya mean? [00:07:29] yes [00:07:33] TimStarling: i added this redirect https://gerrit.wikimedia.org/r/#/c/53478/3/redirects.conf [00:08:32] mw1136: Action 'configtest' failed. 
[00:08:33] mw1136: The Apache error log may have more information. [00:08:33] mw1136: Your apache2 configuration is broken, so we're not restarting it for you. [00:09:14] it does not have /etc/apache2/wmf [00:09:23] 17:12 < mutante> what's special about mw1043, does not have /etc/apache2/wmf dir [00:09:36] robla: remove from all of analytics or just what you mentioned? [00:09:46] (nagios) [00:10:37] nothing special about it [00:10:41] $ dsh -g apaches -cM 'test -e /etc/apache2/wmf || echo help' | wc -l [00:10:41] 195 [00:10:41] bugzilla down? I'm not getting any response from the server, all connections time out before getting any response at all (chrome's ERR_CONNECTION_TIMED_OUT page is served) [00:10:42] jeremyb_: go ahead and remove me from the analytics group in nagios (are you doing this?) [00:10:45] 195 servers are broken now [00:11:08] robla: i can make the commit at least :) [00:11:40] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [00:11:40] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [00:11:57] bugzilla WFM [00:11:58] TimStarling: wikimedia-task-appserver ... ouch [00:12:11] wow, that's not good. meta.wikimedia.org ERR_NAME_NOT_RESOLVED [00:12:13] New patchset: Ryan Lane; "Unified is missing *.m.wikipedia.org, use original" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53494 [00:12:14] same for commons and enwiki [00:12:40] !log manually creating missing /etc/apache2/wmf symlinks on servers that lack them [00:12:47] Logged the message, Master [00:13:47] TimStarling: php is partially uninstalled on a lot of those servers as well [00:13:59] so, if i installed wikimedia-task-appserver, it would install new php packages [00:14:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53494 [00:14:24] binasher: example? [00:15:02] mw1123 was one, but i just reinstalled mediawiki-task-appserver, which reinstalled the missing php packages [00:15:27] mutante: example of a server missing mediawiki-task-appserver? [00:15:36] you mean wikimedia-task-appserver [00:15:38] TimStarling: see mw1125 [00:15:42] binasher: mw1136 [00:15:46] yeah, that [00:16:24] k, my issue is unrelated for some reason I'm unable to make any http or ssh request in any browser or terminal. yet irc is working fine. [00:16:36] (even ping google.com failed) [00:16:40] ignore me [00:16:47] !log depooling ssl1 ssl2 ssl1001 ssl1002 ssl3001 ssl3002 [00:16:53] Logged the message, Master [00:16:54] it's thereon 1125 [00:17:46] root@mw1125:~# dpkg -l | grep wikimedia-task-appserver [00:17:47] rc wikimedia-task-appserver 2.7-1 Wikimedia application server [00:18:12] I see [00:18:23] rc php-apc 3.1.7-1 APC (Alternative PHP Cache) module for PHP 5 [00:18:24] rc php5-memcached 2.1.0-2~wmf+precise1 memcached extension module for PHP5, uses libmemcached [00:18:25] rc php5-mysql 5.3.10-1ubuntu3.5+wmf1 MySQL module for php5 [00:18:26] etc [00:18:55] what is removing it? puppet? [00:19:09] is all this symlink stuff related to the 404s being reported? [00:19:15] mutante? 
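A minimal sketch of the dsh sweep being described here — count the appservers that lack /etc/apache2/wmf, then recreate the link on the ones that miss it. The symlink target below is an assumption used only to make the example concrete; the real target depends on how the apache config checkout is laid out on the appservers.

    # count appservers missing the wmf config dir (as pasted above)
    dsh -g apaches -cM 'test -e /etc/apache2/wmf || echo missing' | wc -l

    # recreate the link where it is absent; /usr/local/apache/conf is an
    # assumed target for illustration only
    dsh -g apaches -cM 'test -e /etc/apache2/wmf || sudo ln -s /usr/local/apache/conf /etc/apache2/wmf'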
[00:19:31] looks like it [00:19:36] Start-Date: 2013-03-12 23:53:28 [00:19:37] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install php5-common=5.3.10-1ubuntu3.4+wmf1 [00:19:39] Downgrade: php5-common:amd64 (5.3.10-1ubuntu3.5+wmf1, 5.3.10-1ubuntu3.4+wmf1) [00:19:42] Remove: wikimedia-task-appserver:amd64 (2.7-1) [00:19:52] jeremyb_: that plus not having a functional php install [00:19:59] from var/log/apt/history.log [00:20:08] wikimedia-task-appserver should be arch-independent [00:20:23] the control file does not depend on a particular version of PHP [00:20:30] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install wikimedia-task-appserver [00:20:33] Upgrade: wikimedia-task-appserver:amd64 (2.6-1, 2.7-1) [00:20:41] the upgrade of appserver still worked [00:20:48] did someone just change that? [00:20:50] but then the php5 install removed it .. [00:21:12] ridiculous [00:21:39] wikimedia-task-appserver should not be depending on particular versions of things [00:22:25] so either we need to downgrade wikimedia-task-appserver or php5 [00:22:54] can someone try to beat me to that? [00:25:38] too hard? [00:26:15] Ryan_Lane: do you need anything from me to roll that change back ? I'm going to be leaving in about 15min [00:26:39] tfinc: no [00:26:49] TimStarling: I'd help, but I'm solving an unrelated thing [00:27:08] Ryan_Lane: it's ok, I think the site is mostly up [00:27:31] not sure why [00:27:42] I guess we can fix this properly [00:27:47] maybe the broken servers are mostly depooled? [00:28:04] and the other errors are due to depool limits? [00:28:06] i think so [00:28:53] am I correct that the package is architecture-dependent? [00:29:14] Architecture: all [00:29:20] in the control file [00:30:06] apt-cache show wikimedia-task-appserver also shows Architecture: all [00:30:07] !log pooling ssl1 ssl2 ssl1001 ssl1002 ssl3001 ssl3002 [00:30:13] Logged the message, Master [00:30:19] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 00:30:18 UTC 2013 [00:30:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [00:31:21] apt-cache showpkg says: [00:31:25] Dependencies: [00:31:26] 2.7-1 - apache2-mpm-prefork (0 (null)) librsvg2-bin (0 (null)) dvipng (0 (null)) gsfonts (0 (null)) ocaml (0 (null)) ploticus (0 (null)) php5-mysql (0 (null)) php5-curl (0 (null)) php5-xmlrpc (0 (null)) php5-cli (0 (null)) php-apc (0 (null)) php-wikidiff2 (0 (null)) php5-fss (0 (null)) php5-geoip (0 (null)) libapache2-mod-php5 (0 (null)) file (0 (null)) djvulibre-bin (0 (null)) tidy (2 20070821) libtidy-0.99-0 (2 20070821) php- [00:31:26] pear (0 (null)) rsync (0 (null)) make (0 (null)) xpdf-utils (0 (null)) libtiff-tools (0 (null)) texlive (0 (null)) texlive-bibtex-extra (0 (null)) texlive-font-utils (0 (null)) texlive-fonts-extra (0 (null)) texlive-lang-croatian (0 (null)) texlive-lang-cyrillic (0 (null)) texlive-lang-czechslovak (0 (null)) texlive-lang-danish (0 (null)) texlive-lang-dutch (0 (null)) texlive-lang-finnish (0 (null)) texlive-lang-french (0 (null [00:31:34] )) texlive-lang-german (0 (null)) texlive-lang-greek (0 (null)) texlive-lang-hungarian (0 (null)) texlive-lang-italian (0 (null)) texlive-lang-latin (0 (null)) texlive-lang-mongolian (0 (null)) texlive-lang-norwegian (0 (null)) texlive-lang-other (0 (null)) texlive-lang-polish (0 (null)) texlive-lang-portuguese (0 (null)) texlive-lang-spanish (0 (null)) texlive-lang-swedish (0 (null)) texlive-lang-vietnamese (0 (null)) texlive- 
[00:31:34] latex-extra (0 (null)) texlive-math-extra (0 (null)) texlive-pictures (0 (null)) texlive-pstricks (0 (null)) texlive-publishers (0 (null)) php5-redis (0 (null)) php5-memcached (0 (null)) libmemcached10 (0 (null)) php5-igbinary (0 (null)) lilypond (0 (null)) timidity (0 (null)) imagemagick (0 (null)) [00:31:37] !log depooling ssl3 ssl4 ssl1003 ssl1004 ssl3003 [00:31:39] sorry for flood [00:31:44] Logged the message, Master [00:32:29] example srv291, i can just fix it by installing wikimedia-task-appserver, if these php5 versions are ok: 5.3.10-1ubuntu3.5+wmf1 [00:32:52] should i just do that on all that have the package "rc"? [00:32:54] yes, that is the right version [00:33:09] yes [00:33:11] mutante: go for it [00:34:32] so if wikimedia-task-appserver doesn't depend on 3.4+wmf1, how did it come to be uninstalled? [00:35:24] !log pooling ssl3 ssl4 ssl1003 ssl1004 ssl3003 [00:35:32] Logged the message, Master [00:37:49] -_- [00:38:11] New patchset: Ryan Lane; "Only modify mobile certificate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53496 [00:38:16] dpkg mysteries [00:38:24] I can't wait to combine all of our certs and get rid of all of these damn lb services [00:38:26] hey [00:38:30] I'm back :) [00:38:42] what's up? [00:39:06] TimStarling: binasher, it is "ii" on all now [00:39:17] all in "apaches" [00:39:22] paravoid: looks like you missed all the fun [00:39:25] !log depooling ssl1 ssl2 ssl1001 ssl1002 ssl3001 ssl3002 [00:39:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53496 [00:39:31] Logged the message, Master [00:39:49] :( [00:39:50] New patchset: Ram; "Bug: 45266 Use sequence numbers instead of timestamps" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299 [00:40:30] from dpkg.log on mw1125, here is the story [00:41:06] 22:51: upgrade wikimedia-task-appserver. [00:41:29] from 2.6 to 2.7 due to mutante's work [00:41:39] dpkg.log isn't exactly the most readable thing in the world :) [00:41:50] /var/log/apt/history.log [00:42:03] 23:48: upgrade PHP from 3.4 to 3.5 due to me manually running apt-get [00:42:22] is everything okay now? [00:42:40] 23:53: downgrade PHP to 3.4 and remove wikimedia-task-appserver due to puppet run [00:42:52] looks ok, wikimedia-task-appserver and php packages are installed on servers in apaches dsh group [00:43:02] does apache need restarting? [00:43:18] PHP 3.5 and 3.5? ;) [00:43:49] for a new php version? [00:43:51] I'd guess so [00:43:57] I was getting tired of typing 5.3.10-1ubuntu3.5+wmf1 [00:44:01] TimStarling: let me do that, that way we also confirm all of them now have the /wmf dir [00:44:03] you know what I mean [00:44:08] dzahn is doing a graceful restart of all apaches [00:44:54] !log dzahn gracefulled all apaches [00:44:56] Logged the message, Master [00:45:11] I know why puppet downgraded PHP to 3.4, I fixed it in https://gerrit.wikimedia.org/r/#/c/53490/ [00:45:12] Ryan_Lane: paravoid: looks like gluster has async datacenter replication. and an s3 compatible reset api now! let's use it instead of ceph. 
cc: Aaron|laptop [00:45:15] all i see is "VIP not configured on lo" messages [00:45:21] I do not know why it removed wikimedia-task-appserver [00:45:30] so no other errors like missing the complete apache-sanity-check [00:46:02] New patchset: Ryan Lane; "Fix typo in certname" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53497 [00:46:06] Ryan_Lane: it works perfectly, since we just copied everything into ceph, not we can do ceph -> gluster [00:46:13] \o/ [00:46:17] *now we [00:46:36] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53497 [00:46:39] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 190 seconds [00:46:39] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 190 seconds [00:46:50] as long as we use quorum support we won't have to worry about split brains [00:46:58] and it's not like we need speed [00:46:58] Aaron|laptop: "everything" [00:47:02] or a sane filesystem design [00:47:09] PROBLEM - HTTPS on ssl1 is CRITICAL: Connection refused [00:47:23] though apparently the gluster folks say it's faster than ceph [00:47:25] +1 for ceph to gluster migration! [00:47:28] icinga-wm: I know, I know [00:47:29] PROBLEM - HTTPS on ssl2 is CRITICAL: Connection refused [00:47:43] and it supports "cloudification of applications" [00:47:45] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install php5-common=5.3.10-1ubuntu3.4+wmf1 [00:47:53] but first, a blog post about migrating to ceph [00:47:56] Aaron|laptop: does it really say that somewhere? :) [00:47:58] those ssl alerts don't look good [00:48:01] then we switch to gluster [00:48:02] don't you love -y? [00:48:21] Ryan_Lane: http://www.slideshare.net/Gluster/introduction-to-glusterfs-webinar-september-2011 [00:48:22] binasher: then we switch back to zfs? [00:48:28] Ryan_Lane: ssl alerts? [00:48:35] slide 19 [00:48:39] I remember now that I used to make a habit of using --no-remove with -y so that apt-get is less likely to uninstall your whole system [00:48:39] paravoid: so, I installed the new unified cert [00:48:40] zfs + nfs [00:49:08] paravoid: wonderfully enough it was missing: *.m.wikipedia.org, mediawiki.org and *.m.mediawiki.org [00:49:11] RECOVERY - HTTPS on ssl1 is OK: OK - Certificate will expire on 01/20/2016 12:00. [00:49:16] ! [00:49:24] yeah [00:49:29] RECOVERY - HTTPS on ssl2 is OK: OK - Certificate will expire on 01/20/2016 12:00. [00:49:53] New patchset: Jeremyb; "stop paging robla for analytics (icinga)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53499 [00:50:23] paravoid: so, I'm mostly reverting that now [00:50:40] and by mostly I mean I'm reverting it for mobile [00:50:49] we reverted mediawiki.org earlier [00:51:00] ok, if you're on top of this :) [00:51:10] yep [00:51:16] apt trying to destroy the world: http://paste.tstarling.com/p/iMSruW.html [00:51:25] hahaha [00:51:34] I love how the first one is texlive-lang-greek [00:51:41] why doesn't the HTTPS check use ISO 8601? :( [00:52:13] "libapache2-mod-php5" do we use that? [00:52:23] that would be the php apache module [00:52:25] jeremyb_: why don't we have lots of checks? :) [00:52:32] so, err, yes, we kinda use that :) [00:52:33] I think robla knows that [00:52:38] * robla shoulda put a smiley on that [00:52:40] okay :) [00:52:50] sorry :) [00:52:54] * TimStarling gives paravoid an amazon voucher to buy a sarcasm detector [00:52:59] it's 3am [00:53:02] can I get one of those too?? 
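Tim's --no-remove habit is the crux of what went wrong here: a pinned downgrade run with -y and --force-yes will quietly remove whatever conflicts with it, in this case wikimedia-task-appserver. A minimal sketch, assuming a Debian/Ubuntu appserver, of the more defensive invocation plus a quick way to spot packages left in the removed-but-config-remains ("rc") state afterwards:

    # downgrade php5-common without letting apt remove dependent packages;
    # the version string is the one quoted in the log above
    apt-get install -y --no-remove php5-common=5.3.10-1ubuntu3.4+wmf1

    # list packages dpkg now considers removed (only config files left, state "rc")
    dpkg -l | awk '/^rc/ {print $2}'

    # reinstall the task package across the group, as was done during the cleanup
    dsh -g apaches -cM 'dpkg -l wikimedia-task-appserver 2>/dev/null | grep -q ^rc && sudo apt-get install -y wikimedia-task-appserver'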
[00:53:06] You can buy them at 3am [00:53:07] only $99 with free shipping! [00:53:10] Amazon is open all the time! [00:53:12] and I'm really in no place to detect sarcasms [00:53:18] but the voucher is only for $15 :( [00:53:18] no prob....I give Tim crap for not putting enough smilies on his stuff [00:53:31] does amazon prime ship free to athens? [00:53:59] Probably not [00:55:08] heh, i just talked to apergos about that .. http://askville.amazon.com/Amazon-Ship-Greece/AnswerViewer.do?requestId=88175419 [00:55:58] doesitshipto.com ..heh [00:56:40] Will it ship? [00:58:09] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 00:58:02 UTC 2013 [00:58:22] I'm going to start running updateCollation.php since the emergency is apparently over [00:58:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [00:58:45] I'm glad I changed puppet to depend on the latest PHP, that stopped it from constantly reverting our site fixing efforts [00:59:21] Going to run them all simultaneously? [01:01:50] sorry for missing the fun :( [01:03:05] Reedy: yes [01:03:14] just going to use 6 screens on hume [01:03:18] paravoid my buddy my paaalllll!!!!! [01:03:24] ottomata: no way [01:03:25] go away [01:03:26] hahah [01:03:27] it's 3am :) [01:03:32] hahah [01:03:33] you spoke up [01:03:35] your fault [01:03:36] hahah [01:03:44] well, for a site wide outage [01:03:49] oh, i just signed on [01:03:50] not for reviewing puppet manifests :) [01:03:59] didn't notice :), of course of course [01:04:06] haha [01:04:19] :-) [01:05:03] !log running updateCollation.php on all uca-* wikis in 6 screens on hume [01:05:20] TimStarling: anything i can do? [01:05:56] no, I think it's all sorted, thanks for coming in [01:05:59] good night [01:06:04] thanks [01:07:14] ottomata: review my puppet change :) [01:07:20] ok where? [01:07:45] ottomata: https://gerrit.wikimedia.org/r/53499 [01:07:50] i even made it relevant to you [01:08:31] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53499 [01:08:45] done! [01:09:27] !log pooling ssl1 ssl2 ssl1001 ssl1002 ssl3001 ssl3002 [01:09:34] danke :) [01:09:47] !log depooling ssl3 ssl3 ssl1003 ssl1004 ssl3003 [01:09:48] ottomata: (sockpuppet too?) [01:10:09] yup [01:10:25] so that regex in apache-sanity-check now matches eqiad hosts, but still not all of them, because the last part of the regex was (and has been before) (1\21\) but we also have hosts with lo:LVS ending in .22 [01:11:54] ok. en.m is fixed [01:12:22] but unfortunately it no longer works for anything other than wikipedia [01:12:29] RECOVERY - MySQL Slave Delay on db55 is OK: OK replication delay 0 seconds [01:12:29] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay 0 seconds [01:12:45] !log pooling ssl3 ssl3 ssl1003 ssl1004 ssl3003 [01:14:29] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [01:15:11] RECOVERY - MySQL Slave Delay on db50 is OK: OK replication delay 0 seconds [01:28:41] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 01:28:30 UTC 2013 [01:29:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [01:30:42] New review: Dzahn; "18:19 < mutante> so that regex in apache-sanity-check now matches eqiad hosts, but still not all of ..." 
[operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/49231 [01:37:07] New review: Dzahn; "RT-4676, approved and waiting period is over" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/52671 [01:37:17] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52671 [01:42:55] New patchset: Ottomata; "Outputting integers in python settings.py for metrics api" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53507 [01:43:31] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53507 [01:44:52] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:47:05] marktraceur: you can ssh to gallium now, enjoy [01:47:36] New patchset: Ottomata; "Missing \ in regex in e3 metrics api settings.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53508 [01:48:00] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53508 [01:48:09] New review: Dzahn; "21 1H IN PTR rendering.svc.pmtpa.wmnet." [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/49231 [01:49:39] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [01:49:39] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [01:58:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 01:58:54 UTC 2013 [01:59:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [02:07:50] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [02:08:29] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 194 seconds [02:08:39] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 199 seconds [02:29:31] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 02:29:22 UTC 2013 [02:29:31] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [02:30:32] !log LocalisationUpdate completed (1.21wmf11) at Wed Mar 13 02:30:32 UTC 2013 [02:31:29] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 194 seconds [02:31:39] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 198 seconds [02:42:59] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [02:54:13] !log LocalisationUpdate completed (1.21wmf10) at Wed Mar 13 02:54:13 UTC 2013 [03:00:10] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 03:00:05 UTC 2013 [03:00:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [03:16:29] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [03:16:39] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [03:30:40] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 03:30:35 UTC 2013 [03:31:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [03:32:39] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 187 seconds [03:32:40] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 187 seconds [03:52:32] mutante: Thanks! 
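On the apache-sanity-check regex mentioned above (and quoted in the r/49231 review): the last part of the pattern only accepted LVS service addresses ending in .21, while some eqiad appservers carry a lo:LVS address ending in .22. A rough sketch of the idea, assuming the check greps the loopback interface for the service VIP — the 10.2.x.x subnet and the exact grep are illustrative, not the script's actual code:

    # accept either .21 or .22 as the final octet of the LVS service IP on lo
    ip addr show dev lo | grep -E 'inet 10\.2\.[0-9]+\.(21|22)/'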
[04:01:10] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 04:01:00 UTC 2013 [04:01:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [04:03:31] !log tstarling synchronized php-1.21wmf10/includes/Collation.php [04:04:33] !log tstarling synchronized php-1.21wmf11/includes/Collation.php [04:06:46] New patchset: Krinkle; "Integration: Update index, fix discrepencies, move wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [04:07:08] New patchset: Krinkle; "Integration: Update index, fix discrepencies, move to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [04:08:15] New review: Krinkle; "When deploying: Update paths in jobs that publish artefacts in the document root (nightly snapshots)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [04:13:39] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [04:13:39] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [04:20:56] New review: Hashar; "Note that I am currently refactoring the contint manifests and will drop all the web materials from ..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53513 [04:27:08] !log (about 00:42) - dsh installing wikimedia-task-appserver where it was 'rc' after php puppet upgrade [04:31:39] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 04:31:37 UTC 2013 [04:32:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [04:32:36] mutante: you might want to warn Tim about that :] I have seen he changed php to ensure => latest [04:33:38] hashar: he was present at that time and knows, i just noticed i forget to log during site outages :p [04:33:48] :-] [04:33:49] when it would be most interesting [04:34:07] so we are fine :-] [04:34:22] yea, it was a couple hours ago [04:34:34] and thank you to have added marktraceur account on gallium! [04:34:42] wikimedia-task-appserver packages had been removed on a lot of servers :p [04:36:16] mutante: and replaced by puppet ? [04:36:28] I mean, is the content of wikimedia-task-appserver obsolete nowadays? [04:36:48] no, it's not obsolete. it has stuff like apache-graceful-all in it [04:37:17] ahhh and scap-2 iirc [04:37:24] i built a new version of it to fix one line in apache-graceful-all [04:37:57] so now you can use it again. it did not restart Apaches in eqiad due to a regex in apache-sanity-check [04:39:08] good finding [04:39:20] would it make sense to put that script and some others in puppet? [04:40:37] which others? [04:40:49] the package upgrade of -appserver worked, but puppet had ensure on a specific PHP version [04:40:56] hashar: https://gerrit.wikimedia.org/r/#/c/53490/1/modules/applicationserver/manifests/packages.pp [04:41:32] jeremyb_: well I mean the scripts in the wikimedia-task-appserver [04:41:57] hashar: then this is what happened: 17:28 < mutante> Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install php5-common=5.3.10-1ubuntu3.4+wmf117:28 < mutante> Downgrade: php5-common:amd64 (5.3.10-1ubuntu3.5+wmf1, 5.3.10-1ubuntu3.4+wmf1)17:28 < mutante> Remove: wikimedia-task-appserver:amd64 (2.7-1) [04:42:33] after 23:39 Tim: updating PHP to php5-3.5+wmf1 for new ICU [04:42:49] doh [04:44:25] hashar: now to the other question. package or puppet. there seem to be different opinions. 
I asked Mark and he said it was an in-between thing but i should still just package for that small fix [04:45:03] I must agree with him :-] [04:45:13] you don't want to refactor everything just for a tiny regex update [04:45:19] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 04:45:12 UTC 2013 [04:45:35] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [04:45:50] stop flapping! [04:46:29] yeah that keeps happening on several boxes [04:47:03] should theoretically be impossible [04:47:13] time should only go in one direction! [04:47:33] heh, fair [04:47:34] maybe there's some permissions error on neon? [04:47:57] so it records the recovery but wherever the trap info is persisted is readonly? [04:48:19] RECOVERY - MySQL Slave Delay on db68 is OK: OK replication delay 0 seconds [04:48:28] wild speculation :) [04:48:29] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [04:49:43] and that db68 notification is kind of a duplicate too [04:52:23] yeah, but at least it's that way because someone made it be that way [04:52:24] ticket please for the "time should only go in one direction" : [04:52:34] ok! :) [04:52:37] ty [04:54:06] hashar: the answer to that would be either "nagios service dependencies" and/or "icinga check_cluster" [04:54:43] oh yeah ticket [04:54:45] ops-request ? [04:54:48] http://docs.icinga.org/latest/en/clusters.html | http://nagios.sourceforge.net/docs/3_0/dependencies.html [04:54:52] yes [04:54:54] go ahead and do it jeremyb_ :-) [04:55:26] but slave delay != replication heartbeat [04:55:31] first i have to reply to Ryan_Lane [04:55:39] ? [04:55:49] in email? :) [04:55:52] or in the bug? heh [04:55:52] and i have to restart my local laptop ntpd before that! [04:55:53] yes [04:56:23] mutante: gotta think about how the dependencey between slave delay / replication heartbeat. I guess we do not need to warn about the delay if the replication is dead [04:57:17] hashar: yea, that is what would be defined in a define servicedependency{ in Nagios .. that would be the traditional Nagios way that i expect to be the same in Icinga, but then there is also the newer check_cluster stuff by Icinga [04:57:23] Ryan_Lane: sent. which bug did you mean? [04:57:44] about renaming sub-sub domains [04:58:09] well, I just made a duplicate :) [04:58:46] Is it worth doing a push this week (tomorrow?) and just get them all done? [05:01:19] hashar: ideas what we need for more jenkins speed? [05:01:40] it went up to a couple minutes earlier today [05:02:19] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [05:02:22] but robh said he could see it having open slots [05:02:26] ..somewhere [05:02:35] mutante: I need Zuul to be updated, that is blocked by a python module dependency I have packaged. [05:03:26] mutante: and there is a nasty bug where all the jobs have been completed by Zuul wait for quite a long time before reporting the change back in Jenkins. I haven't tracked it yet [05:04:35] i wonder what queue RT 3481 is [05:04:57] must be procurement i guess [05:05:27] (see mutante @ https://bugzilla.wikimedia.org/38763#c3 ) [05:05:46] hashar: ugh, i see, thanks! it sounds like a bunch of packaging work too .. feel you [05:06:16] mutante: the packaging is on Faidon todo list :-] [05:06:22] then I will update zuul \O/ [05:07:25] hashar: :) [05:07:34] jeremyb_: the queue is called domains [05:08:19] errrr [05:08:25] mutante: so why can't i see it? 
:) [05:08:59] do you know what your groups are? [05:09:16] oh, do you have office wiki? [05:09:25] i doesn't [05:09:42] but you and i both know that i've been on domains tickets before [05:09:57] yes [05:10:30] i agree, the permissions must have changed [05:10:45] but i am not aware removing that [05:11:12] i can guess what the reason might be though, it has legal on it [05:11:27] and sometimes stuff like auth codes to transfer domains [05:12:13] but i can still (now) see domains [05:12:23] i just modified 4416 [05:12:26] in the last minute [05:12:54] eh? really, but you cant see 3481?? [05:13:05] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [05:13:05] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [05:13:05] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [05:13:05] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [05:13:05] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [05:13:07] really! [05:13:55] ooooh, duuh, my fault [05:14:08] there are 2 RT tickets linked in that BZ bug [05:14:18] 3481 is procurement :p [05:15:51] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 05:15:39 UTC 2013 [05:16:06] reason for permissions there: ssl certificate requests / ask robh [05:16:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [05:16:40] RECOVERY - MySQL Slave Delay on db57 is OK: OK replication delay 0 seconds [05:16:40] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [05:17:03] it is stuff we pay for, like hardware, so it was moved there [05:17:36] right. i guess maybe it's the sort of thing that should have a pair of tickets and one not in procurement [05:18:19] why not, i am mostly pro duplicate tickets and linking them [05:18:37] and real duplicates => "merge into" instead of deleting [05:19:05] but it's closed? [05:19:12] * jeremyb_ waits for the page to load [05:19:30] the one for the planet cert? yea, it's closed and resolved [05:19:53] planet has a *.planet.wm.org wildcard just for itself [05:20:41] yeah, i see now. but it's mixed content so only a little useful... [05:20:52] (and entirely public content of course) [05:20:57] cant fix when including images from blogs [05:21:02] on an aggregator [05:21:18] right, i figured it was impossible to fix [05:21:25] or at least not worth it [05:21:38] yea, but it even has a HTTPSEverywhere rule and it's own wildcard .. [05:21:50] to prevent those HTTPS tickets, heh [05:22:03] Change abandoned: Hashar; "Abandoning for now since that is not how it should be done and I probably dont have the time to real..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51313 [05:22:24] randomly adds that it has IPv6 support :p [05:23:11] oh, btw, did en.planet get stuck again ? hrm, looks like it.. 
sigh [05:23:31] yeah, the bug was reopened [05:24:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 181 seconds [05:25:29] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 200 seconds [05:31:43] !log clean cache and run planet update on en.planet [05:31:46] root@zirconium:/# rm -rf /var/cache/planet/en/* && sudo -u planet /usr/bin/planet -v /usr/share/planet-venus/wikimedia/en/config.ini [05:32:34] sigh, sigh, it used to work so well, no idea where the utf8/ascii issue comes from yet [05:32:45] it's late.. leaving [05:42:11] mutante: hashar: RT 4727, 4728 [05:42:21] cant click [05:42:24] !rt 4727 [05:42:25] http://rt.wikimedia.org/Ticket/Display.html?id=4727 [05:46:09] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 05:46:05 UTC 2013 [05:46:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [05:48:30] jeremyb_: I don't even know how to CC myself on a RT :-] [05:48:47] on both? [05:48:56] you have to go to people. but i can do it for you :) [05:50:24] hashar [05:50:24] ahh [05:50:27] maybe it works [05:50:41] !rt 4728 [05:50:41] http://rt.wikimedia.org/Ticket/Display.html?id=4728 [05:51:01] hah, you don't have to comment! [05:51:45] * hashar whistles ( http://integration.mediawiki.org/coverage/ ) [05:52:00] PROBLEM - Puppet freshness on kuo is CRITICAL: Puppet has not run in the last 10 hours [05:53:32] I should wake up at 5 am everyday [05:53:59] hashar: look at the tickets again :) [05:54:23] admincc ? [05:54:28] New patchset: Hashar; "contint: xdebug + code coverage directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53531 [05:54:35] wtf, how is it 2am [05:54:59] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [05:55:10] /tmp/hudson8310835417058319495.sh: line 2: 11537 Segmentation fault php tests/phpunit/phpunit.php --coverage-html /srv/org/mediawiki/integration/coverage/ [05:55:12] hashar: means you get comments too not just correspondence. AIUI [05:55:12] seriously [05:55:27] hashar: last one, 4676, latest comment.. ehm.. i told mholmquist to proxy via fenari, but i know gallium has public IP anyways, heh.. oh well.. ttyl :) [05:55:40] hashar: but i cc'd a different hashar than you did [05:55:42] jeremyb_: thanks, i will look again tomorrow [05:55:56] mutante: did planet finish? [05:56:06] mutante: you're SF now, right? [05:56:51] unfortunately, no. UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128) [05:56:58] mutante: yeah will figure it out with marktraceur later on :-] [05:56:59] yes, i am, and thats why its 11PM [05:57:16] actually it is PDT now vs. PST [05:57:34] hashar: I can do either one :) [05:57:37] * hashar looks for a few more million dollars to emigrate to SF :-] [05:57:39] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 24 seconds [05:57:47] mutante: gute nacht [05:58:07] marktraceur: you might want to proxy via eqiad bastion :) We will eventually remove gallium public IP address [05:58:10] mutante: bonne nuit! [05:58:17] Hm, sure. [05:58:29] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [05:58:43] hashar: ooh, are we? 
nice, that's what i just commented but wasnt sure [05:58:57] mutante: I should fill a ticket about it :-] [05:59:11] marktraceur: i pasted a .ssh/config snippet for you to proxy nicely [05:59:16] mutante: and get some frontend proxy service [05:59:23] but hey, it is too late, you want to head bed daniel [05:59:33] mutante: I have a general solution in my config already, I think [05:59:41] ok, yea, good night. out [06:01:59] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [06:06:24] breakfast time [06:11:59] New review: Krinkle; "We're syncing in operations. I'm happy to split it up but I don't want to wait for 12 reviews. I've ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [06:13:00] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [06:15:59] PROBLEM - Puppet freshness on capella is CRITICAL: Puppet has not run in the last 10 hours [06:15:59] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [06:16:49] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 06:16:44 UTC 2013 [06:17:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [06:20:00] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [06:47:30] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 06:47:19 UTC 2013 [06:47:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [06:52:29] PROBLEM - Squid on brewster is CRITICAL: Connection refused [06:56:53] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 194 seconds [06:56:53] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 194 seconds [07:04:52] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 13 seconds [07:04:52] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 5 seconds [07:15:29] RECOVERY - Squid on brewster is OK: TCP OK - 0.027 second response time on port 8080 [07:15:46] !log removed the rotated squid logs on brewster again, 2.2gb worht that filled / (again), restarted squid [07:17:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 07:17:52 UTC 2013 [07:18:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [07:26:00] apergos: doesn't brewster has a /a or some other mount point to write logs too ? [07:26:03] that will save the slash [07:26:13] no [07:26:17] oh and hello :-] [07:26:22] not things in /var/log [07:26:34] hello :-D [07:26:58] but root filesystem over there is teeny tiny which is an issue [07:27:26] 5..6gb [07:46:17] apergos: isn't there some additional disk space to create a partition for /var/log ? [07:46:18] or even just /var/log/squid :-] [07:46:31] lvm for the win! 
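A minimal sketch of the LVM-backed log partition being suggested for brewster — assuming a volume group with free extents exists; the volume group name and size are purely illustrative, and squid should be stopped and the existing logs moved aside before doing this for real:

    # carve out a dedicated logical volume for squid logs
    lvcreate -L 20G -n squidlogs brewster-vg
    mkfs.ext4 /dev/brewster-vg/squidlogs

    # mount it over the log directory and persist the mount
    mount /dev/brewster-vg/squidlogs /var/log/squid
    echo '/dev/brewster-vg/squidlogs /var/log/squid ext4 defaults 0 2' >> /etc/fstab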
[07:46:52] I think someone looked at that already [07:46:58] there is /srv but eww putting logs there [07:48:29] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 07:48:20 UTC 2013 [07:48:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [08:19:02] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [08:19:39] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 08:19:36 UTC 2013 [08:20:31] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [08:25:59] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [08:25:59] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [08:55:19] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 08:55:18 UTC 2013 [08:55:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [09:26:09] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 09:26:00 UTC 2013 [09:26:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [09:42:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [09:43:30] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 201 seconds [09:56:30] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 09:56:26 UTC 2013 [09:57:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [10:09:42] New review: Faidon; "I know it's a lot to ask, but I'd like the geoip manifest to be moved into a module first. Let's not..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53422 [10:11:03] hashar is a machine [10:12:28] New review: Faidon; "Approved but I'd like to merge when you're around, ping me on IRC." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/53424 [10:13:05] New review: Faidon; "Yay! Ping me on IRC to merge." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/53423 [10:16:17] New review: Faidon; "Trivial enough, so approved." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/53531 [10:27:09] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 10:27:03 UTC 2013 [10:27:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [10:28:40] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [10:28:53] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds [10:54:19] greg-g: ping [10:57:29] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 10:57:27 UTC 2013 [10:58:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [11:20:43] !log reenabled mail delivery to wikibugs [11:21:29] hmm, how come i no longer can log when i used to? [11:26:01] morebots is gone. [11:26:22] ah, i thought it was logmsgbot [11:26:50] no, morebots is the one that logs it on wiki [11:26:56] is it gone on purpose or accidentally? 
[11:27:02] accidentally i guess [11:27:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 11:27:50 UTC 2013 [11:28:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [11:42:06] !log restarted morebots [11:42:13] Logged the message, Master [11:52:53] log we <3 apergos ;-) [11:55:07] aww [11:58:29] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 11:58:27 UTC 2013 [11:59:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [12:07:22] !log reenabled mail delivery to wikibugs 30 mins ago (disabled automatically for excessive bounces) [12:07:28] hehe [12:07:29] Logged the message, Master [12:07:33] ah [12:07:41] it's just slow [12:12:17] New review: Faidon; "I'm not terribly excited with the idea of random API boxes doing image transformations. I'm okay wit..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52707 [12:16:43] Anyone mind if I update Scribunto in wmf11 quick? Quick bugfix (bug 46031) that should be made before the big rollout later today, two sync-files. [12:18:10] better ask reedy? [12:28:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 12:28:52 UTC 2013 [12:29:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [12:30:13] !log anomie synchronized php-1.21wmf11/extensions/Scribunto/common/Common.php 'Fix for bug 46031' [12:30:20] Logged the message, Master [12:30:34] !log anomie synchronized php-1.21wmf11/extensions/Scribunto/common/Hooks.php 'Fix for bug 46031' [12:30:40] Logged the message, Master [12:44:00] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [12:47:39] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:49] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [12:52:19] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused [12:54:39] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 12:54:31 UTC 2013 [12:55:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [13:25:09] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 13:25:03 UTC 2013 [13:25:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [13:32:59] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 182 seconds [13:33:49] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 196 seconds [13:45:50] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [13:45:59] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [13:47:46] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [13:48:35] i hate puppet modules [13:53:36] <^demon> mark: me too. 
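For context on the module wrangling above: with the 2.7-era autoloader the pain is mostly getting the class-to-file mapping right, which is what the "empty init.pp" and "fix module structure" commits further down are about. A bare-bones sketch of the layout the parser expects — the instance.pp path follows the error quoted just below; everything else is illustrative:

    modules/ganglia/
        manifests/
            init.pp                  # top-level class ganglia (added empty in a later commit)
            configuration.pp         # class ganglia::configuration
            monitor/
                aggregator/
                    instance.pp      # define ganglia::monitor::aggregator::instance
        files/                       # served as puppet:///modules/ganglia/<path>
        templates/                   # used via template('ganglia/<file>.erb')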
[13:55:39] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 13:55:30 UTC 2013 [13:56:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [13:59:04] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:02:56] i hate puppet's parser, too [14:03:12] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:04:54] greg-g: ping [14:06:05] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:06:49] wtf is the problem [14:10:25] should I have a look? [14:10:59] only if you're willing to fix puppet bugs ;p [14:11:02] what's the issue? [14:11:28] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:11:49] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 13 seconds [14:12:39] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [14:12:40] I can't do... summation of integers in an assignment? [14:13:22] $portnr = $ganglia::configuration::base_port + $id [14:13:26] that isn't allowed? [14:14:52] uhm [14:14:54] it should be [14:15:05] what's the error? [14:15:19] 14:11:45 err: Could not parse for environment production: Syntax error at '('; expected ')' at /var/lib/jenkins/jobs/operations-puppet-validate/workspace/modules/ganglia/manifests/monitor/aggregator/instance.pp:11 [14:15:37] huh [14:15:51] that ( is gone [14:16:27] no it's not [14:16:45] indeed it is not [14:16:50] from PS65 [14:17:00] seems I didn't save the file [14:17:18] still stupid it doesn't work in the selector [14:17:29] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:18:08] yay [14:26:00] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 14:25:53 UTC 2013 [14:26:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [14:37:00] New patchset: Demon; "Redoing this as a maven project" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53572 [14:40:08] New review: Demon; "This was done after much thought...I believe maven is a superior build system for a couple of reasons:" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53572 [14:42:00] New review: Demon; "Still on the todo list:" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53572 [14:45:29] I'd like to deploy a fix for https://bugzilla.wikimedia.org/show_bug.cgi?id=45861 at around 16.00 utc unless someone objects [14:50:25] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:56:06] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:56:29] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 14:56:23 UTC 2013 [14:56:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [15:05:08] New patchset: Mark Bergsma; "Attempt to avoid conflicts with the new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53575 
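The parse failure above turned out to be a stale, unsaved file, but the underlying point stands: the 2.7 parser accepts arithmetic in a plain assignment yet not inline inside a selector, so the sum has to be computed into a variable first and only the variable used in the selector. A small sketch — the assignment line follows the snippet quoted above; the parameters and the selector body are illustrative:

    define ganglia::monitor::aggregator::instance($id, $monitored_site) {
        # arithmetic is fine in an assignment...
        $portnr = $ganglia::configuration::base_port + $id

        # ...but an expression can't sit inside a selector, so select on a
        # plain variable and interpolate the precomputed port afterwards
        $conffile = $monitored_site ? {
            'eqiad' => "/etc/ganglia/aggregator-eqiad-${portnr}.conf",
            default => "/etc/ganglia/aggregator-${monitored_site}-${portnr}.conf",
        }
    }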
[15:06:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [15:07:45] New patchset: Mark Bergsma; "Attempt to avoid conflicts with the new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53575 [15:08:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53575 [15:09:40] New patchset: Mark Bergsma; "Revert "Attempt to avoid conflicts with the new Ganglia module"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53576 [15:09:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53576 [15:14:00] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [15:14:01] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [15:14:01] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [15:14:01] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [15:14:01] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [15:16:53] New patchset: Mark Bergsma; "Attempt to avoid conflicts with the new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53578 [15:18:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53578 [15:22:09] New patchset: Mark Bergsma; "Move the upstart job for a normal gmond to ganglia::monitor::service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53579 [15:22:09] schmir: greg-g is the guy you want. [15:23:20] New patchset: Mark Bergsma; "Move the upstart job for a normal gmond to ganglia::monitor::service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53579 [15:23:41] yes, looks like he's away however. I only need someone who can watch the load on the cluster answering api.php requests...since the the deployment may increase the load on those machine [15:24:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53579 [15:25:21] schmir: He should be around in the next hour or so. [15:25:28] San Francisco is just waking up. [15:25:56] hmm. ok. will wait for him then. 
[15:26:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 15:26:50 UTC 2013 [15:27:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [15:29:17] New patchset: Mark Bergsma; "Fix dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53580 [15:30:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53580 [15:31:01] New patchset: Mark Bergsma; "Make manutius a new-style Ganglia aggregator" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53581 [15:31:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53581 [15:35:21] New patchset: Mark Bergsma; "Actually make manutius a new, non-labs style Ganglia aggregator" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53582 [15:36:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53582 [15:46:40] modules/wikidata_singlenode [15:46:41] sigh [15:48:20] hashar: we have an incredibly simple "testswarm" module [15:48:31] I think I merged that expecting more in it [15:48:38] is this still the case? [15:48:45] I don't remember what that is [15:48:48] in conf call right now [15:48:58] k [15:52:02] New patchset: coren; "New tools:: class to configure Tool Labs servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53587 [15:52:59] PROBLEM - Puppet freshness on kuo is CRITICAL: Puppet has not run in the last 10 hours [15:53:48] New patchset: Mark Bergsma; "Rename new module temporarily to resolve conflicts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53588 [15:55:50] New patchset: Mark Bergsma; "Rename instantiation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53590 [15:55:59] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [15:56:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53588 [15:56:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53590 [15:57:12] New review: Faidon; "(5 comments)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53587 [15:57:29] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 15:57:23 UTC 2013 [15:57:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [15:57:38] schmir: 16:00 is 9am here, there aren't many ops and/or platform people around, actually, none that do deployments [15:58:52] New patchset: Mark Bergsma; "Change paths for file sources as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53591 [15:58:59] schmir: in the future, please suggest a time in a place that I'll see it (ie: on the relevant bug or via email) instead of via IRC at an hour when I am asleep :) [15:59:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53591 [16:01:03] hi greg-g [16:01:10] g'morning [16:01:51] paravoid: testswarm has been phased out for now, so its manifest should be really small nowadays [16:01:55] so, there isn't anyone around who does deployments at this hour, it is a bit early for San Franciscans :) [16:02:10] Reedy might be? 
[16:02:21] he's in SF right now [16:02:43] unless he flew back last night, in which case, he's probably still on SF time ;) [16:02:59] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [16:03:12] hashar: we just have the systemuser class [16:03:12] I'm fine with doing this later... [16:03:20] hashar: can we just move it under something else or drop it? [16:03:45] New patchset: Mark Bergsma; "Add top level init.pp, empty" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53592 [16:04:08] paravoid: testswarm is going to be the next contint sprint :-] [16:04:27] schmir: the deployments calendar is here: https://wikitech.wikimedia.org/wiki/Deployments [16:04:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53592 [16:04:50] schmir: looks like between 20:00-21:00 is available, does that work for you? [16:05:27] schmir: OR, probably better, when Reedy does get in, talk with him, since he's doing the Scribunto to all wikis from 17:00-18:00 [16:05:38] he might be able to lump it in with that [16:05:43] I can deploy that stuff on my own. I need someone to look at the cluster [16:05:57] gotcha [16:06:24] so, just so I'm clear (I don't know the whole story of how it all works): the code changes you're going to do just live on some machines not in the WMF cluster? [16:06:34] paravoid: the clean out patch was https://gerrit.wikimedia.org/r/#/c/47665/ . I have extracted the system user out of the huge contint manifests [16:06:43] on pdf1, pdf2 and pdf3 [16:06:47] paravoid: I can move it back to the contint module if you want :-] [16:07:33] probably a better idea [16:07:56] New patchset: Mark Bergsma; "Fix module structure" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53593 [16:08:16] schmir: are pdf1 etc on the WMF cluster? [16:08:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53593 [16:08:58] paravoid: filled a bug as a reminder ( https://bugzilla.wikimedia.org/show_bug.cgi?id=46069 ) will take care of it this week :) [16:09:31] daughter time, bbl :-] [16:10:16] greg-g: ambiguous question :) [16:10:18] greg-g: yes, be we manage software updates for the pdf rendering software on these machines [16:10:30] s/be/but/ [16:10:30] paravoid: how would you phrase it? :) [16:10:43] what do you want to know? [16:11:00] if the pdf1etc machines are run by WMF ops [16:11:04] * mark stabs puppet's authors [16:11:08] New patchset: Mark Bergsma; "Fix variable reference" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53594 [16:12:12] then, second part: schmir has necessary privs to deploy on those machines (which he does). Then third: the machines which we would need to watch would be pdf1-3 or the other WMF cluster machines? [16:12:30] it would be the machines serving api.php [16:13:04] * greg-g nods [16:13:05] thanks [16:13:48] ops have access but they're not really maintained [16:13:59] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [16:14:01] they're kind of their own thing but in the same admin domain, sadly [16:14:02] gotcha [16:14:14] interesting, a neither here nor there type deal [16:14:22] so, if they blow up, who get's the pager message? 
[16:14:24] a "it sucks" kind of deal [16:14:36] I just need someone to look at the api.php machines [16:15:10] schmir: ok, looking over at the ops area, there doesn't seem to be anyone in yet (they tend to get in at 10am Pacific, heh) [16:15:38] oh, paravoid ! I just put your nick to your name! [16:15:49] heh [16:16:02] mark is also around [16:16:05] paravoid: can you watch those machines/be ready to help while schmir does the deploy to pdf1-3? [16:16:09] not sure if apergos is still around [16:16:22] not sure what to watch exactly [16:16:28] me neither :) [16:16:41] sounds like cpu spike potential [16:16:47] but I could be wrong [16:16:48] still here [16:16:59] PROBLEM - Puppet freshness on capella is CRITICAL: Puppet has not run in the last 10 hours [16:16:59] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [16:17:43] reading the scrollback [16:17:48] I don't want to overload those machines. [16:18:20] hello apergos, we haven't met, I'm Greg Grossmeier. Started Feb. 19th. [16:19:13] heya [16:19:28] must be a timeone thing, I'm here sometimes in the vening though [16:20:00] New patchset: Mark Bergsma; "Rename new Ganglia module again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53595 [16:20:06] I'm also on the platform team, not ops, but yeah [16:20:13] heh pdf1 already looks unhappy :-P [16:20:19] "great" [16:20:28] https://ganglia.wikimedia.org/latest/?c=PDF%20servers%20pmtpa&h=pdf1.wikimedia.org&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [16:20:30] mark: having fun? :P [16:20:47] I guess we hope they don't get worse.... [16:20:59] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [16:21:02] pdf2 is on the other hand a happy camper... [16:21:09] hmm [16:21:16] pdf1 doesn't matter. we call api.php with different parameters. that's what matters. actually the load on the pdf cluster should decrease a bit [16:21:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53594 [16:21:42] apergos: do you know which machines would be affected by an increase in api.php requests? [16:21:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53595 [16:22:02] the api eqiad cluster I guess [16:22:24] ah, I see [16:22:32] * greg-g is still learning how it is all set up on the machine side [16:22:35] ah [16:22:41] they seem to be doing just fine right now [16:22:45] New review: coren; "Thing is, "tools" is the actual name of the project. :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53587 [16:22:45] does anyone have a ganglia url for that? [16:22:53] just a sec though cause I should look at the ms-be12 warning [16:23:00] https://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [16:23:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:02] Mar 13 16:07:03 ms-be12 puppet-agent[30421]: Skipping run of Puppet configuration client; administratively disabled; use 'puppet Puppet configuration client --enable' to re-enable. [16:24:04] why would that be [16:24:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.319 second response time [16:27:15] paravoid: did you do anything on ms-be12 recently? 
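The "administratively disabled" notice pasted above means someone ran 'puppet agent --disable' on ms-be12 at some point and never re-enabled it; the garbled hint in that syslog line corresponds to 'puppet agent --enable'. A minimal sketch of the cleanup that follows, assuming root on the affected host (exact output differs across the 2.7-era agents in use):
    puppet agent --enable           # clear the administrative-disable lock
    puppet agent --test --verbose   # force an immediate run so the freshness check recovers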
(I expect not but want to double check) [16:27:19] no [16:27:24] k thanks [16:27:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 16:27:51 UTC 2013 [16:28:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [16:31:31] New patchset: Mark Bergsma; "include ganglia_new::configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53597 [16:32:06] New patchset: coren; "New toollabs:: class to config Tool Labs servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53587 [16:32:42] !log cleaned up borked puppet agent on ms-be12 and re-enabled/restarted it [16:32:48] Logged the message, Master [16:33:32] greg-g: back, sorry for the interruption [16:34:46] apergos: no worries [16:35:09] RECOVERY - Puppet freshness on ms-be12 is OK: puppet ran at Wed Mar 13 16:34:58 UTC 2013 [16:35:23] so, apergos and schmir, could you two work out the plan for the updates to pdf1-3 and resulting monitoring of the api cluster? [16:35:40] what is the time frame for this? [16:35:50] schmir wants to have that done sooner rather than later, preferably [16:35:59] well I mean within the next hour say? [16:36:04] basically, it is an update that makes Lua scripting NOT break collections/pdfs :) [16:36:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53597 [16:36:07] yeah, ideally [16:36:08] that would be fine for me [16:36:11] ok geat [16:36:13] *great [16:36:22] apergos: ms-be11/12 don't have python-swiftclient installed [16:36:48] that's bizarre [16:37:13] apergos: so, should I do it now? [16:37:19] sure [16:37:37] is there a [16:37:44] link to the changeset or anything? [16:37:51] schmir: [16:38:32] http://mwlib.readthedocs.org/en/latest/changelog.html [16:39:06] which are the new and old versions? [16:39:14] !log upgraded mwlib to 0.15.1. [16:39:22] Logged the message, Master [16:39:34] ok which is the old version? :-D [16:39:39] 0.14.3 is the old version [16:39:40] :) [16:39:49] ok [16:40:13] New patchset: Mark Bergsma; "include network::constants" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53598 [16:41:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53598 [16:41:58] New review: coren; "(4 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53587 [16:46:08] New patchset: Mark Bergsma; "Fix notification" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53600 [16:46:39] schmir: pdf1 is suddenly very idle indeed...is that expected? [16:46:56] https://ganglia.wikimedia.org/latest/?c=PDF%20servers%20pmtpa&h=pdf1.wikimedia.org&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [16:47:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53600 [16:49:49] no [16:50:38] The POST request to http://pdf1.wikimedia.org:8080/mw-serve/ failed ($2). [16:50:49] just tested on enwiki [16:51:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 16:51:49 UTC 2013 [16:52:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [16:53:47] now this error message: [16:53:48] at least the api.php cluster won't overload that way [16:53:51] "An error occurred on the render server: system overloaded. please try again later. 
" [16:53:59] heh [16:54:51] right [16:55:47] hrmm [16:55:51] New patchset: Mark Bergsma; "Fix gmond.conf expansion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53602 [16:57:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53602 [16:58:23] schmir: do we need to get some help or are you OK? [16:58:54] I don't really know these pdf boxes, a cursory look through the usual logs (syslog etc) shows nothing of use [16:58:56] I'm ok. thanks. [16:59:00] ok cool [16:59:02] k [16:59:23] I don't want to panic before it is customary to do so. ;) [17:00:21] well you can always get in some extra training for panicking [17:00:24] could come in handy... [17:00:57] apergos: :P [17:21:02] New patchset: Mark Bergsma; "Resolve upstart job issues" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53607 [17:22:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53607 [17:22:19] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 17:22:13 UTC 2013 [17:22:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [17:23:38] !log downgrade to 0.14.3 [17:23:45] Logged the message, Master [17:26:29] apergos, greg-g: ok, that didn't work. thanks for your help. I will have to look into this. [17:26:47] ok. [17:26:49] schmir: did you revert? [17:26:54] oh, I see the log [17:26:59] ah you've backed out [17:27:00] yes [17:27:08] I see the box is back to busy [17:27:19] thanks schmir, let me know when you want to try again, let the bug know as appropriate [17:28:57] greg-g: yes, sure. thanks. [17:36:43] mchnery seems to be lazying? https://bugzilla.wikimedia.org/show_bug.cgi?id=43936#c3 [17:43:39] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 197 seconds [17:43:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 201 seconds [17:44:48] New patchset: Andrew Bogott; "Added a basic nginx module and one (labs) use case." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [17:48:39] New review: Ottomata; "Hm, I think that we shouldn't have WMF specific stuff inside of generic modules. The role class and..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [17:50:32] New review: Andrew Bogott; "Yeah, that's reasonable. Anyway, this patch is untested, definitely not ready for merge." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/43886 [17:52:30] New review: Jeremyb; "merged in If721e402094a62a8c9b069a35 to master @ operations/debs/ircecho" [operations/software] (master) - https://gerrit.wikimedia.org/r/53197 [17:52:49] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 17:52:43 UTC 2013 [17:53:27] schmir == ralf ? [17:53:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [17:54:12] jeremyb_: yes [17:54:31] hah, training in panicking [17:55:20] * jeremyb_ was in scrollback [17:56:55] Ryan_Lane: hey! Did the HTTPS certificate issue affect bits in any way? [17:57:00] no [17:57:12] do you mean from mobile? [17:57:21] there's no m.bits, right? [17:57:26] nope. [17:57:28] so, no [17:57:29] Ryan_Lane: I've apps crashing with 'java.security.cert.CertPathValidatorException: Trust anchor for certification path not found. ' when trying to access bits [17:57:31] (for eventlogging) [17:57:38] no [17:57:39] you have one now doing that? [17:57:40] not m. 
[17:57:47] ragesoss's is doing that right now [17:57:53] and i've error reports from a few other people too [17:58:03] it does have a new cert [17:58:11] let's make sure the trust chain is proper [17:58:20] okay. [17:58:26] wfm [17:58:38] Verify return code: 0 (ok) [17:58:44] I bet I know the problem [17:58:49] PROBLEM - MySQL Replication Heartbeat on db71 is CRITICAL: CRIT replication delay 424831 seconds [17:58:54] subject=/C=US/ST=California/L=San Francisco/O=Wikimedia Foundation, Inc./CN=*.wikipedia.org [17:59:00] is there some way I can get ragesoss to verify it as well? [17:59:03] that in itself isn't an issue [17:59:11] he's getting the same error on test.wikipedia.org as well (on his phone) [17:59:37] maybe Java doesn't handle the SANs properly? [17:59:50] or the way that specific app is made? [18:00:03] we should have a separate certificate for bits anyway [18:00:07] just CN=bits [18:00:11] for performance reasons [18:00:11] New patchset: Jgreen; "adding X-Spam-Score header patch per RT #4713" [operations/software/otrs] (master) - https://gerrit.wikimedia.org/r/53611 [18:00:12] both are possible. is that different from our previous cert? [18:00:24] test works for me [18:00:26] the one with all the sans is huuuge [18:00:27] paravoid: because the cert itself is smaller? [18:00:31] paravoid: yep [18:00:39] Jeff_Green: \o/ [18:00:41] test.m works for me after clearing my Browser data. [18:00:45] yes [18:01:22] jeremyb_: I look forward to closing that ticket :-P [18:01:23] yeah, but the crash you are seeing is from EventLogging, not from test. [18:01:31] paravoid: we have a unified cert to handle mobile properly [18:01:40] hmm, that is possible this is an Android / Java bug. [18:01:41] bits has a separate ip though [18:01:47] I can verify that. [18:01:57] Jeff_Green: did we figure out more about how things work? was there ever a cron? [18:02:06] paravoid: yeah, true [18:02:23] we could indeed split that apart [18:02:36] jeremyb_: i haven't figured it out yet, it looked like what there is is part of OTRS [18:02:59] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours [18:03:20] paravoid: when you did the verify, did you verify the entire chain? [18:03:31] yes [18:03:36] we had an issue yesterday where the incorrect chain was added [18:03:50] maybe the java app has that cached somewhere? [18:04:13] openjdk doesn't support san iirc [18:04:18] -_- [18:04:35] and sun JDK? [18:05:45] New patchset: Dzahn; "add Jeff's apache-fast-test script to repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53612 [18:05:49] ragesoss: are you using ubuntu? [18:06:21] mutante: \o/ [18:07:16] Ryan_Lane: for these errors? no, Android. [18:07:34] ah [18:08:15] seems this is an android bug [18:08:44] New review: Ottomata; "(4 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49710 [18:08:45] ragesoss: which android version are you running? 
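For anyone reproducing the chain check being debated here, a quick sketch with stock openssl against bits (the hostname comes from the crash reports above; the grep patterns are only for readability):
    # overall verification result for the chain the server actually sends
    echo | openssl s_client -connect bits.wikimedia.org:443 2>/dev/null | grep 'Verify return code'
    # list the SAN entries on the leaf certificate, since SAN handling is the suspect
    echo | openssl s_client -connect bits.wikimedia.org:443 2>/dev/null \
        | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'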
[18:08:53] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds [18:08:53] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 194 seconds [18:08:53] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else to 1.21wmf11 [18:09:03] Logged the message, Master [18:09:10] http://code.google.com/p/android/issues/detail?id=17680&can=1&q=reporter%3Anathan%2Cjanrain.com&colspec=ID%20Type%20Status%20Owner%20Summary%20Stars [18:09:15] there's one bug for that [18:09:18] it affects 2.1 [18:09:52] Ryan_Lane: 4.2.2 (CM 10.1 nightly) [18:09:57] heh [18:10:03] well, that wouldn't be the problem, then [18:10:15] New patchset: Dzahn; "add Jeff's apache-fast-test script to repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53612 [18:10:36] hmm. gerrit has said "review in progress" for about 10 minutes on this stupid ticket. what does that even mean? [18:10:52] anomie: Are you about? [18:11:21] Reedy- yes [18:11:38] Jeff_Green: i think all it means is be patient.. it has been slow but still worked after a couple minutes [18:12:09] patience is not my strong suit... [18:12:37] New patchset: Reedy; "Rest of the wikiss to 1.21wmf11" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53615 [18:13:15] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53615 [18:13:16] Change merged: Jgreen; [operations/software/otrs] (master) - https://gerrit.wikimedia.org/r/53611 [18:13:34] 22:11 < hashar> mutante: I need Zuul to be updated, that is blocked by a python module dependency I have packaged. [18:13:37] 22:12 < hashar> mutante: and there is a nasty bug where all the jobs have been completed by Zuul wait for quite a long time before reporting the change back in Jenkins. I haven't tracked it yet [18:13:41] Jeff_Green: ^ [18:14:00] oic [18:14:07] New patchset: Jgreen; ".gitreview file for software/otrs repo" [operations/software/otrs] (master) - https://gerrit.wikimedia.org/r/53616 [18:14:15] heh [18:14:27] Change merged: Jgreen; [operations/software/otrs] (master) - https://gerrit.wikimedia.org/r/53616 [18:14:35] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53612 [18:14:37] nowai [18:14:42] otrs git repo!?!? [18:14:44] i think that was the most overhead I've ever endured for a 15 character code change [18:14:47] hahaha [18:15:12] Reedy: just don't bring it up ok! [18:15:15] * Jeff_Green dies. [18:15:47] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable scribunto EVERYWHERE' [18:15:53] anomie: ^^ [18:15:54] Logged the message, Master [18:16:23] \o/ [18:16:49] Jeff_Green: i put your script in ./files/misc/scripts/ [18:16:59] and apache-graceful-all works again [18:17:25] New patchset: Pyoungmeister; "lucene-production.php: moving all search traffic to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53617 [18:17:54] mutante: nice [18:20:00] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:21:46] New patchset: Reedy; "Enable Scribunto EVERYWHERES" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53618 [18:22:22] anomie: I seem to recall Tim saying there was something else that needed doing after.. 
[18:23:19] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 18:23:10 UTC 2013 [18:23:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [18:23:39] Reedy- Nothing that I know of, unless maybe it's to run whatever script checks for titles that began with "Module:" that are now hidden. [18:26:06] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53618 [18:26:24] apergos, greg-g: I'd like to try a new version at 21:00 UTC (in 2.5 hours) if that's possible...I'll ping you before [18:26:46] I"ll be gone (basically I"m gone now, it's > 8pm here) [18:27:01] so you should rope in someone in the sf timezzone [18:27:01] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [18:27:01] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [18:27:04] schmir: [18:27:52] apergos: ok. [18:28:41] apergos: who do you recommend I pull in? ;) [18:29:10] cu later [18:29:24] mm whoever has time and a bit of energy [18:29:29] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.59004514493 (gt 8.0) [18:29:30] ok :) [18:29:54] there's no one who owns those boxes so.... [18:30:11] this is an unscheduled (not on the official list) deployment is it? [18:30:18] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53617 [18:30:38] mutante: Ryan_Lane LeslieCarr just pinging you because you're people I've chatted with, for no other reason: One of you mind be around at 2pm today to watch the pdf cluster (pdf1-3) when schmir does another deploy attempt to fix https://bugzilla.wikimedia.org/show_bug.cgi?id=45861 [18:31:06] apergos: right, I wanted to get him on the official calendar, but, he wanted to just do it right now, timezones are also an issue here, I think [18:31:29] so this means it's a sort of 'ask and hope' situation [18:31:32] that's all [18:31:33] yep [18:31:36] exactly [18:31:47] !log py synchronized wmf-config/lucene-production.php 'temp moving all earch traffic to pmtpa for upgrades in eqiad' [18:31:49] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 13.9276916923 (gt 8.0) [18:31:53] Logged the message, Master [18:32:01] greg-g: so 2pm UTC-7 is going to be a certificate push [18:32:09] can you hold off for 1 more hour to 3pm UTC-7 ? [18:32:13] LeslieCarr: good to know [18:32:31] I hope, he's left irc (not the easiest to coordinate with) [18:32:35] I'm not sure what we'd be able to help with [18:32:47] we don't really know anything about this cluster [18:32:52] * greg-g nods [18:32:54] we couldn't do much besides powercycling/rebooting [18:32:58] but we can watch as much as possible [18:33:06] And point and laugh? [18:33:21] mostly, I didn't want him to do it this morning when it was just me in the office :) [18:33:36] :) [18:34:11] checking out cp1015.eqiad.wmnet [18:34:17] The pdf servers are very much a black box [18:34:36] a black, hardy running box [18:34:39] kind of annoyed with the situation since he said "who do I coordinate with" and then just assumes I'll be ok with whatever he suggests... oh well :) [18:34:42] hardy? 
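The script anomie alludes to above, for titles that now collide with the Module: namespace, is presumably MediaWiki's namespaceDupes.php maintenance script; a sketch of running it through the cluster's mwscript wrapper, with the wiki name purely illustrative:
    mwscript namespaceDupes.php --wiki=enwiki         # dry run: report pages shadowed by the new namespace
    mwscript namespaceDupes.php --wiki=enwiki --fix   # actually rename the conflicting titles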
wow [18:34:43] unpackaged and unpuppetized [18:34:46] yeah [18:34:51] great [18:34:57] it makes $INSERT_DIETY_HERE cry [18:35:10] I hate when $INSERT_DIETY_HERE cries [18:35:42] greg-g: all that really can be done from this time, is testing some pdf building [18:36:05] yeah, which I was doing when the pdf cluster flat lined :) [18:36:13] lols [18:36:24] I presume they're trying to fix the Scribunto problems? [18:36:27] yep [18:36:36] where are you with today's deploy, btw? [18:36:42] (on that note) [18:36:45] pgehres: looks like it was one of the appservers depooling and repooling [18:36:55] thanks LeslieCarr! [18:36:57] greg-g: done, nothing happened [18:37:18] Reedy: sweet, good deal. [18:38:01] ok, I'm going to not think about this for a while and finish this email [18:38:02] looks actually like a few may be overloaded ? [18:39:00] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:41:51] FYI: Request: GET http://meta.wikimedia.org/wiki/Grants:WM_US-DC/Wiki_Loves_Monuments_2012_USA/Report, from 10.64.0.129 via cp1015.eqiad.wmnet (squid/2.7.STABLE9) to () [18:41:52] Error: ERR_CANNOT_FORWARD, errno [No Error] at Wed, 13 Mar 2013 18:41:15 GMT [18:41:59] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - 3500 bytes in 0.151 second response time [18:41:59] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:03] hm, I'm timing out trying to save a fairly simple article at http://www.mediawiki.org/w/index.php?title=Extension:NavigationTiming&action=submit [18:42:09] PROBLEM - Apache HTTP on mw1161 is CRITICAL: Connection timed out [18:42:09] PROBLEM - Apache HTTP on mw1104 is CRITICAL: Connection timed out [18:42:09] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:09] PROBLEM - Apache HTTP on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:09] PROBLEM - Apache HTTP on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:10] PROBLEM - Apache HTTP on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:10] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:11] PROBLEM - Apache HTTP on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:11] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:12] PROBLEM - Apache HTTP on mw1101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:12] PROBLEM - Apache HTTP on mw1164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:13] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:13] PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:14] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:14] PROBLEM - Apache HTTP on mw1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:15] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:16] eek [18:42:20] gah as i was mentioning the apaches are overloaded [18:42:21] Request: GET http://commons.wikimedia.org/wiki/Category:SVG_localized_Wikipedia_globe_logos, from 10.64.0.134 via cp1014.eqiad.wmnet (squid/2.7.STABLE9) to () [18:42:21] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at 
Wed, 13 Mar 2013 18:41:24 GMT [18:42:25] mentioning out loud [18:42:33] and there's the phone paging [18:42:33] PROBLEM - LVS HTTP IPv4 on wikidata-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out - 3352 bytes in 0.063 second response time [18:43:07] PROBLEM - LVS HTTP IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - 3476 bytes in 0.066 second response time [18:43:08] so from my point of view, lvs1003 is attempting to connect to the apaches pool and actually getting many timeouts [18:43:09] PROBLEM - Apache HTTP on mw1040 is CRITICAL: Connection timed out [18:43:10] PROBLEM - Apache HTTP on mw1099 is CRITICAL: Connection timed out [18:43:10] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:10] PROBLEM - Apache HTTP on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:10] PROBLEM - Apache HTTP on mw1051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:12] LeslieCarr: Page loaded again for me. [18:43:13] from its point of view apaches are overloaded [18:43:19] siebrand: it will be sporadic [18:43:28] oki. [18:43:31] it actually is unable to depool all the apaches it desires because the minimum pool limit [18:43:35] * greg-g walks away slowly [18:43:42] i blame the pope. [18:43:44] so either we have overcome some tipping point of business - or something is wrong [18:43:54] ooh whats going on [18:44:00] Oh, that's very possible. Some stampede? [18:44:24] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.047 second response time [18:44:24] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [18:44:24] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [18:44:24] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.078 second response time [18:44:24] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time [18:44:25] pope announced ? [18:44:36] yes [18:44:44] i mean maybe that's the problem :) [18:44:54] it was last time [18:45:13] No name yet [18:46:00] bits had a huge spike in http req/s [18:46:12] mobile too [18:46:22] Ryan_Lane: that is indeed a shit ton [18:46:25] New patchset: Pyoungmeister; "Revert "lucene-production.php: moving all search traffic to pmtpa"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53619 [18:46:26] oh pretty much everything [18:46:29] Ryan_Lane: saw a ton of api traffic due to search, peter had switched search from eqiad to pmtpa (may be related or not) [18:46:36] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53619 [18:46:37] question, loking if there was a deploy [18:46:58] there were a few deploys [18:47:10] scribunto to everywhere, PDFs and the PDF rollback [18:47:14] (pediapress) [18:47:15] could be just pope [18:47:17] robla: AaronSchulz has there beena deploy ? 
[18:47:22] !log py synchronized wmf-config/lucene-production.php 'moving all search traffic back to eqaid' [18:47:23] ah thanks jeremyb [18:47:25] :) [18:47:28] Logged the message, Master [18:47:32] not that I know of [18:47:33] https://graphite.wikimedia.org/render/?title=HTTP%20Requests/sec%20%28excludes%20bits.wikimedia.org:%20css/js%29%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=color%28cactiStyle%28alias%28scale%28reqstats.requests,%220.01666%22%29,%20%22requests/sec%22%29%29,%22blue%22%29 [18:47:43] could scribunto have any impact on this ? [18:47:47] greg-g: ^ [18:47:56] seems like a record to me [18:48:18] I backed out search change [18:48:33] robla: yeah, paying attention to the verbal discussions going on [18:48:34] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.37103404255 (gt 8.0) [18:48:41] it appears that lvs is getting timeouts from the appserver apaches [18:49:35] saw a lot of issues with S2 db's [18:49:57] looks like enwikitionary related traffic [18:50:27] we are seeing about a 2gig spike in traffic [18:50:38] yeah, HTTP req/s is very increased [18:50:39] maybe mre 1.5 [18:50:41] for info, if it can help: Lua editor don't start when editing a module (on fr:) [18:50:42] (stupid graphs) [18:50:45] there's definitely a pope effect [18:50:51] unsure if it's related to the outage though [18:50:54] so, maybe we haven't kept up on our apache requisitioning ? [18:50:55] yeah [18:50:58] we call it popedotting [18:51:21] it wouldn't be related to the rever to the pdf cluster, right LeslieCarr ? [18:51:26] no [18:51:42] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [18:51:50] nope, not pdf cluster related [18:51:53] afaict [18:51:58] i mean it is theoretically possible [18:52:47] https://gdash.wikimedia.org/dashboards/apimethods/deploys [18:52:55] just that the pdf feature calls the api, so it was down for ~20 minutes around 9:00am today [18:53:23] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Application%2520servers%2520eqiad&tab=m&vn= [18:53:27] wow visual editor is awful with regards to time [18:53:46] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 18:53:38 UTC 2013 [18:53:56] s/so/and/ [18:53:56] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [18:54:03] please use UTC times always [18:54:10] "9 am" is very confusing [18:54:16] 16:00 UTC [18:54:27] sorry, was talking with rob and typing at the same time [19:00:30] petabytes :p but this must be a ganglia error http://ganglia.wikimedia.org/latest/graph.php?r=2hr&z=xlarge&me=Wikimedia&m=cpu_report&s=by+name&mc=2&g=network_report [19:00:40] and an hour earlier [19:05:24] Error connecting to 10.64.0.13: Too many connections [19:05:45] that's at Wed Mar 13 18:42:25 UTC 2013 [19:05:48] a shit-ton of them [19:06:08] also: Error connecting to 10.64.0.6: Too many connections [19:07:03] stampede? [19:08:04] If you report this error to the Wikimedia System Administrators, please include the details below. [19:08:04] Request: GET http://en.wikipedia.org/wiki/Alberto_Vargas, from 10.64.0.127 via cp1014.eqiad.wmnet (squid/2.7.STABLE9) to () [19:08:04] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Wed, 13 Mar 2013 18:41:31 GMT [19:08:06] !log site was popedotted ! [19:08:12] Logged the message, Mistress of the network gear. 
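The "Too many connections" errors above are MySQL refusing new clients because the servers at 10.64.0.13 and 10.64.0.6 hit their max_connections ceiling during the spike; a quick way to confirm from any host with client access (credentials omitted):
    mysql -h 10.64.0.13 -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"
    mysql -h 10.64.0.13 -e "SHOW GLOBAL VARIABLES LIKE 'max_connections'"
    mysql -h 10.64.0.13 -e "SHOW FULL PROCESSLIST" | wc -l   # rough count of open connections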
[19:08:16] thanks AaronBale - just got it ? [19:08:30] few mins ago [19:08:44] took awhile to find the right place [19:09:35] cool thanks for reporting :) [19:10:09] eep mark, looks like we were basically at the edge before too [19:10:10] :-/ [19:10:23] so, that "redo amsterdam network" that we had been putting off ... [19:13:53] he's being announced right now [19:15:14] the pope is apparently a jesuit [19:15:30] so, our thought right now is that this is caused by mobile [19:15:46] jorge brogolio, buenos aires [19:15:51] there's a substantial increase in mobile requests [19:16:07] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: Connection timed out [19:16:08] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:08] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:08] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:08] PROBLEM - LVS HTTP IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:08] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:08] PROBLEM - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:09] New patchset: coren; "New toollabs:: class to config Tool Labs servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53587 [19:16:13] oh a latin american pope! that's a first! [19:16:21] errr, jorge bergoglio* [19:16:25] LeslieCarr: well the first jesuit too [19:16:35] oo [19:16:47] i would look up this inofrmation but i refuse to increase the load [19:16:48] They named the new pope? [19:16:53] yep [19:17:04] hah [19:17:06] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: Connection timed out [19:17:06] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: Connection timed out [19:17:06] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:06] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:06] PROBLEM - Frontend Squid HTTP on knsq22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:07] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:07] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:08] PROBLEM - LVS HTTP IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:08] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:08] Coren: yap, and provably caused a high servers load [19:17:09] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:10] Huh. Ouelette gets passed over again. That guy has /got/ to be bitter by now. [19:17:14] LeslieCarr: all esams [19:17:18] you can tell cuz of all the pages [19:17:27] thanks, lesse if another link is overloaded [19:17:29] coren: i would agree [19:17:33] > (cur | prev) 2013-03-13T19:13:10‎ MTVarro (talk | contribs)‎ m . . (12,224 bytes) (0)‎ . . 
(MTVarro moved page Jorge Bergoglio to Pope Francis) (undo) [19:18:04] ? [19:18:14] grrr, looks like the x-link was spiked up to max again for a second [19:18:26] * Coren idly wonders if the new one is going to be just as primitive and outdated as the previous one. [19:18:33] so do all the cardinals have their popename decided in advance, in case they're chosen ? [19:18:35] the new pope is Jorge Mario Bergoglio [19:18:39] (argentina) [19:18:47] as far as I know [19:18:48] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:18:48] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:18:48] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:18:51] also can someone get me food cuz i can't leave the desk i think for a little bit :( [19:19:32] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 46586 bytes in 3.781 second response time [19:19:32] RECOVERY - Frontend Squid HTTP on knsq22 is OK: HTTP OK: HTTP/1.0 200 OK - 654 bytes in 5.652 second response time [19:19:38] !give LeslieCarr food [19:19:43] PROBLEM - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:43] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:19:53] PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is CRITICAL: Connection timed out [19:20:32] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 44618 bytes in 5.527 second response time [19:20:42] PROBLEM - LVS HTTP IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:47] jeremyb_: shouldn't it be Francis I? [19:21:00] odder: errr, don't ask me! :) [19:21:01] odder: not until there is the II :P [19:21:09] RECOVERY - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1029 bytes in 4.214 second response time [19:21:09] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:21:16] Alchimista: BBC says Francis I [19:21:18] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:21:21] yeah, Francis I [19:21:28] a new name [19:21:33] and he's a jesuit [19:21:36] jesuit? [19:21:49] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 52930 bytes in 2.043 second response time [19:21:49] RECOVERY - LVS HTTP IPv4 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 56284 bytes in 5.066 second response time [19:21:58] PROBLEM - LVS HTTP IPv4 on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:04] so is the pope dossing wikipedia now? [19:22:07] odder: i'm not sure, at least here, on portugal, he his beeng called Francis. And was announced has Francis, not Francis I :P [19:22:10] can we stop addbot? 
[19:22:18] adding wikidata links like crazy [19:22:37] yes, something like +100 edits/min [19:22:41] asking in labs channel [19:22:48] I think it's run there [19:22:48] kill it [19:22:49] Alchimista: John Paul II was also announced as "John Paul" [19:22:50] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:22:51] RECOVERY - LVS HTTP IPv4 on bits.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3906 bytes in 1.510 second response time [19:22:54] (was working on fr: today) [19:22:59] heh. you're assuming I know how to kill it ;) [19:23:03] PROBLEM - LVS HTTPS IPv4 on bits.esams.wikimedia.org is CRITICAL: Connection timed out [19:23:03] PROBLEM - Varnish HTCP daemon on cp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:04] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:04] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:04] PROBLEM - SSH on cp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:04] PROBLEM - Varnish HTTP upload-frontend on cp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:04] PROBLEM - Varnish HTTP upload-backend on cp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:12] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:23:13] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 846 bytes in 1.182 second response time [19:23:20] if it can help we can block it on fr: [19:23:21] Ryan_Lane: globally block IP address? [19:23:30] or firewall it away :) [19:23:48] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 788 bytes in 0.742 second response time [19:23:52] that'll also turn off vandalism stuff [19:24:00] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [19:24:00] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:24:00] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [19:24:00] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:24:00] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:24:00] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [19:24:01] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [19:24:02] and addbot isn't causing this issue [19:24:03] arf, css is gone [19:24:08] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:09] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:24:18] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 19:24:09 UTC 2013 [19:24:27] have to go to ear [19:24:32] ear/eat [19:24:47] (sorry LeslieCarr) [19:24:58] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [19:24:58] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 44250 bytes in 3.062 second 
response time [19:25:08] RECOVERY - LVS HTTPS IPv4 on bits.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3914 bytes in 8.043 second response time [19:25:08] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:08] PROBLEM - LVS HTTP IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:08] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [19:25:59] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:26:07] RECOVERY - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 63825 bytes in 2.032 second response time [19:26:08] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 64283 bytes in 2.874 second response time [19:26:08] RECOVERY - LVS HTTP IPv4 on wikisource-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 48780 bytes in 3.823 second response time [19:26:08] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 67055 bytes in 9.431 second response time [19:26:08] RECOVERY - LVS HTTP IPv4 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 56282 bytes in 9.865 second response time [19:26:39] !log reedy synchronized wmf-config/InitialiseSettings.php 'Disable Collection extension' [19:26:45] Logged the message, Master [19:26:51] AaronSchulz: around? [19:27:01] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [19:27:01] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [19:27:14] RECOVERY - LVS HTTP IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 73593 bytes in 3.286 second response time [19:27:14] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 56754 bytes in 7.131 second response time [19:27:14] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 47059 bytes in 7.234 second response time [19:27:14] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 53406 bytes in 7.375 second response time [19:27:14] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 49255 bytes in 8.588 second response time [19:27:14] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 52930 bytes in 9.250 second response time [19:27:58] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:28:07] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 46586 bytes in 2.546 second response time [19:28:08] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 67530 bytes in 8.330 second response time [19:28:08] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 44621 bytes in 8.742 second response time [19:28:08] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [19:29:01] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 74064 bytes in 2.130 second response time [19:29:10] PROBLEM - 
Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:29:10] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 95759 bytes in 5.297 second response time [19:29:10] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:29:17] PROBLEM - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:08] RECOVERY - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 95296 bytes in 7.766 second response time [19:30:15] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:58] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [19:32:11] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [19:32:11] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [19:33:08] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:08] PROBLEM - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:06] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [19:35:18] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 47059 bytes in 5.007 second response time [19:35:18] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 67533 bytes in 5.369 second response time [19:35:18] PROBLEM - LVS HTTPS IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:13] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:16] RECOVERY - LVS HTTPS IPv4 on upload.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 788 bytes in 4.045 second response time [19:36:17] PROBLEM - NTP on cp1022 is CRITICAL: NTP CRITICAL: No response from NTP server [19:36:17] RECOVERY - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 63190 bytes in 7.858 second response time [19:36:17] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:17] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:17] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:06] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [19:37:16] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 95759 bytes in 8.586 second response time [19:37:16] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:16] PROBLEM - LVS HTTP IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:21] New patchset: Diederik; "Adding David and Dan to Analytics contact group, removing Fabrice." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53624 [19:37:31] LeslieCarr: still esams... 
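The recurring "Varnish traffic logger" PROBLEM/RECOVERY lines are a Nagios process-count check flapping as varnishncsa loggers die and get restarted under load; the check is roughly equivalent to the stock check_procs plugin below, with the expected count of 3 inferred from the OK messages and the thresholds illustrative:
    /usr/lib/nagios/plugins/check_procs -C varnishncsa -c 3:3
    # exits CRITICAL when fewer (or more) than three varnishncsa processes run on the cache host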
[19:38:13] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [19:38:13] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [19:38:16] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 63644 bytes in 5.808 second response time [19:38:16] RECOVERY - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 95295 bytes in 8.611 second response time [19:38:20] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:05] hrm, getting packet loss between eqiad and esams, going to switch preferred transit [19:39:07] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:39:12] thanks paravoid for poking me [19:39:16] RECOVERY - LVS HTTP IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 73592 bytes in 8.687 second response time [19:39:17] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:17] PROBLEM - LVS HTTPS IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:27] PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:55] mark: can i deactivate your changes and commit (with them deactivateD) ? [19:40:06] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 67057 bytes in 2.123 second response time [19:40:07] RECOVERY - LVS HTTPS IPv4 on upload.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 788 bytes in 5.679 second response time [19:40:13] erm [19:40:16] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 74061 bytes in 7.292 second response time [19:40:16] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:16] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:17] let's just deploy them, one sec [19:40:46] LeslieCarr: deploy if you're ready [19:41:11] !log deploying a possibly mobile front end traffic effecting ipv6 apache request firewall filter [19:41:17] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 49255 bytes in 3.579 second response time [19:41:18] RECOVERY - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1029 bytes in 1.536 second response time [19:41:18] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:19] Logged the message, Mistress of the network gear. 
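Before the preferred transit path gets flipped, the eqiad-esams loss LeslieCarr mentions can be quantified per hop with mtr from a host on one side toward the other; ESAMS_HOST below is a placeholder for any box on the Amsterdam side:
    mtr --report --report-cycles 100 "$ESAMS_HOST"   # per-hop loss and latency over 100 probes
    ping -q -c 100 "$ESAMS_HOST"                     # end-to-end loss percentage only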
[19:42:16] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 44620 bytes in 5.779 second response time [19:42:17] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 67530 bytes in 6.976 second response time [19:42:17] PROBLEM - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:08] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [19:43:09] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:43:09] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 56757 bytes in 4.268 second response time [19:43:16] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:07] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [19:44:18] !log switched preferred path between esams and eqiad [19:44:20] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:20] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:23] Logged the message, Mistress of the network gear. [19:45:07] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:45:07] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:45:07] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 63649 bytes in 3.488 second response time [19:45:17] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 52930 bytes in 8.282 second response time [19:45:18] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:17] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 44617 bytes in 3.331 second response time [19:46:17] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 74064 bytes in 5.312 second response time [19:46:19] PROBLEM - HTTP on fenari is CRITICAL: Connection timed out [19:47:17] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:47:17] RECOVERY - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 63189 bytes in 4.732 second response time [19:47:26] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:48:17] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 95757 bytes in 6.968 second response time [19:49:27] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.056 second response time [19:49:28] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:50:16] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:51:08] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:51:18] RECOVERY - Varnish 
traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [19:52:07] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [19:52:07] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [19:54:39] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 19:54:35 UTC 2013 [19:54:57] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [19:55:09] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:55:09] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [19:55:29] James_F|Away: http://www.yousaytomato.biz/ [19:56:26] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 186 seconds [19:57:26] mutante: Ha. Interesting. :-) [19:58:07] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [19:58:26] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [19:58:28] Ryan_Lane: so i checked, and it doesn't seem to be a HTTP library specific problem (both libraries I used cause problems on ragesoss' phone). Should we just wait another hour for the new cert to see if that fixes it? [19:58:38] Ryan_Lane: is there debug info I can get from ragesoss' phone that can help you? [19:58:47] YuviPanda: the new cert will not fix it [19:59:05] hmm, because this has been happening only since yesterday :| [19:59:09] YuviPanda: is this hitting any domain other than wikimedia.org or wikipedia.org? [19:59:17] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:59:20] the new certificate is basically the same as the old [19:59:25] err [19:59:26] RECOVERY - Host europium is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [19:59:28] the commons app crash is from hitting bits [19:59:32] the one we're getting today is the same as this one [20:00:24] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [20:00:25] tfinc: can you test the commons app on your phone? [20:00:38] YuviPanda: the wikipedia app works perfectly fine [20:00:49] Ryan_Lane: so it works fine for *most* people. On some devices it does not. [20:00:57] Ryan_Lane: sure [20:01:32] YuviPanda: then it's a library issue somewhere [20:02:00] Ryan_Lane: don't think so - the commons app and the Wikipedia app use *completely* different libraries (one uses Apache HTTPClient & URLConnection, other uses Webkit's) [20:02:03] Ryan_Lane: is this a new CA certificate? [20:02:09] YuviPanda: it's likely not the http library, but the TLS/SSL libraries [20:02:11] Ryan_Lane: perhaps certificate trust chain broken somewhere? [20:02:11] paravoid: yes [20:02:13] hmm [20:02:27] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:02:29] paravoid mentioned he tested the chain [20:02:34] hmm [20:02:35] I did [20:02:42] Ryan_Lane: YuviPanda http://commons.wikimedia.org/wiki/File:Wikimedia_hiring_flyer.jpeg <-- worked great [20:02:43] no issues [20:02:48] but maybe the CA cert is relatively new and not included in old OS/certificate stores? 
[20:02:49] and if it's working on some devices and not others, then it's not related to the cert [20:02:53] considering that the libraries didn't change, I suppose it could be something with the cert? [20:03:02] paravoid: same root as the old [20:03:07] RECOVERY - SSH on europium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:03:17] YuviPanda: the cert is different in that the wildcard is in the SAN [20:04:15] i'm checking out cp1022 - looks like it may have died in the midst of all this excitement [20:04:32] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [20:05:12] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [20:05:12] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [20:05:22] YuviPanda: some older versions of android are known to have this problem [20:05:36] but those are really old versions [20:05:38] but ragesoss is on 4.1 (or 4.0)? [20:05:43] he's on a nightly [20:05:52] paravoid@serenity:~/wikimedia/puppet/files/ssl$ openssl x509 -in star.wikimedia.org.pem -noout -issuer [20:05:54] ragesoss: are you on CM nightly? [20:05:55] issuer= /C=US/O=Equifax/OU=Equifax Secure Certificate Authority [20:05:57] paravoid@serenity:~/wikimedia/puppet/files/ssl$ openssl x509 -in unified.wikimedia.org.pem -noout -issuer [20:06:01] issuer= /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert High Assurance CA-3 [20:06:03] not the same root [20:06:24] Yuvi: yes, cm nightly [20:06:30] hm [20:06:33] let me check something [20:07:12] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [20:07:13] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:07:46] paravoid: ah. sorry. I was talking about star.wikipedia.org, as I was thinking mobile [20:07:51] Ryan_Lane: but I have another crash report (with exception info showing same issue), and he is running a no-name chinese brand Android with 2.3.4 (not that old either) [20:07:51] but for bits, yeah, different cert [20:08:03] !log cp1022 unresponsive, power cycling [20:08:07] YuviPanda: what does the exception show? [20:08:09] Logged the message, Mistress of the network gear. [20:08:12] let me pastebin [20:08:26] OS/version? [20:08:29] can they also give a screenshot of the certificate shown by the site? [20:08:35] paravoid: android 2.3.4 ;) [20:08:48] Ryan_Lane: http://pastebin.com/MUpkgDyq [20:08:54] that's from the other user [20:08:59] ragesoss: can you take a screenshot? [20:09:11] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [20:09:31] Ryan_Lane: that is for bits. [20:09:53] the source of the error is at at org.wikimedia.commons.EventLog$LogTask.doInBackground(EventLog.java:40), which calls out to bits. 
(for EventLogging) [20:10:09] I wonder if the phone is missing the trust store [20:10:12] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [20:10:18] or if the app is looking in the wrong place for it [20:10:33] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [20:11:11] RECOVERY - SSH on cp1022 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:11:12] RECOVERY - Varnish HTCP daemon on cp1022 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [20:11:12] RECOVERY - Varnish HTTP upload-backend on cp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.439 second response time [20:11:21] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [20:11:37] YuviPanda: are some of these folks hitting esams and others hitting eqiad, or are all in a specific region? [20:12:23] one moment, getting jcmish here, she's been talking to them [20:13:13] am here [20:13:26] hey jcmish [20:13:34] Ryan_Lane was asking about geographical location of people facing problems [20:13:38] approximately, at least [20:13:39] do you know? [20:14:04] don't but I can get that YuviPanda and Ryan_Lane [20:14:10] ok [20:14:12] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [20:14:17] sage is in the us, right? [20:14:37] here in France all seems back and fine [20:14:39] Ryan_Lane: yup [20:14:47] near DC [20:14:55] (but not in DC) [20:15:01] ah. I see a problem [20:15:18] github uses the same root CA [20:15:18] http://pastebin.com/CgD7xwNh [20:15:37] it's not a device issue [20:15:38] hah [20:15:45] it's sets of users hitting the same server [20:15:50] source hash ;) [20:16:00] yeah :) [20:16:00] \o/ [20:16:01] good catch [20:16:05] that whooshed completely over my head :) [20:16:07] !log depooling ssl1001 [20:16:12] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:16:15] Logged the message, Master [20:16:17] I *really* need to update install_certificate [20:16:19] but yay :) [20:16:24] and nagios alerts [20:16:25] so that it regenerates the chained cert [20:16:33] that too [20:16:34] well, that won't help here [20:16:37] due to sh [20:16:42] unless we use nrpe for it [20:16:44] nagios to all SSL boxes [20:16:47] yeah [20:16:51] https://www.dropbox.com/s/15gx5bo2y8icqn5/Screenshot_2013-03-13-16-12-43.png [20:17:05] paravoid: it requires nrpe [20:17:10] Ryan_Lane: ^ [20:17:12] screenshot [20:17:13] why? [20:17:13] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [20:17:13] PROBLEM - MySQL Slave Delay on db71 is CRITICAL: CRIT replication delay 432931 seconds [20:17:20] https://www.dropbox.com/s/5h97oaj2hn618r4/Screenshot_2013-03-13-16-12-52.png [20:17:20] but this is for integration.mediawiki.org [20:17:23] ah, multiple ips [20:17:23] because the ips are bound to lo on them? [20:17:34] yeah yeah [20:17:41] ragesoss: : is the first one integration.mw.org? [20:17:52] can we also do SNI with smaller certificates? [20:18:00] ragesoss: or second one? [20:18:02] both are same [20:18:03] New patchset: Andrew Bogott; "Added a basic nginx module and two (labs) use cases." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [20:18:05] paravoid: that would be nice, yes [20:18:06] ragesoss: ah, ok [20:18:08] please pretty please [20:18:12] just scrolled [20:18:14] we can do that now :) [20:18:16] and drop all those lb IPs! [20:18:19] since we upgraded to precise [20:18:30] paravoid: yes, I was saying that just yesterday :) [20:18:31] btw http://www.isg.rhul.ac.uk/tls/ [20:18:33] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 18 seconds [20:18:34] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 3.98059028777 [20:18:36] no need for all those ips [20:18:37] fresh [20:18:52] Ryan_Lane: so, this should resolve itself soon? :) [20:19:04] paravoid: hahaha. so basically SSL is completely fucked now [20:19:09] yep [20:19:11] RECOVERY - Varnish HTTP upload-frontend on cp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.030 second response time [20:19:11] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [20:19:12] \o/ [20:19:16] there's no way to avoid attacks [20:19:34] well, that sucks [20:19:52] YuviPanda: i'm stepping out for lunch. let me know if you need anything testing and i'll do it when i get back [20:20:01] Change abandoned: Andrew Bogott; "Replaced by a newer patch, https://gerrit.wikimedia.org/r/#/c/43886/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44712 [20:20:06] !log reedy synchronized wmf-config/InitialiseSettings.php 'Re-enable collection' [20:20:09] YuviPanda: ok, can people check if this is still occurring? [20:20:12] tfinc: think Ryan_Lane spotted the problem :) will do. [20:20:13] Logged the message, Master [20:20:14] Ryan_Lane: okay [20:20:18] ragesoss: try again? [20:20:21] I think that should have fixed it [20:20:24] jcmish: can you poke people? [20:20:34] let me check that in esams too [20:20:35] YuviPanda: yup doing it now [20:20:42] the app? [20:20:43] jcmish: thanks jcmish :) [20:20:48] ragesoss: yup [20:20:50] shouldn't crash anymore [20:21:12] esams is good [20:21:36] !log repooling ssl1001 [20:21:42] Logged the message, Master [20:22:02] works! [20:22:02] \o/ [20:22:02] wheee! [20:22:11] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [20:22:23] I'm glad we found this now [20:22:33] otherwise when we changed out the cert the problem would have disappeared [20:22:40] because that file would have been regenerated [20:22:57] ah [20:22:57] and then we would have *really* been confused [20:23:10] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [20:23:14] so, any reason to not revert bits to the old certificate? [20:23:14] so this was just a configuration issue on one of the ssl terminators?
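What made this confusing is captured just above: LVS source-hashes clients onto terminators, so a stale chained cert on a single box (ssl1001 here) only affects the users hashed to it and looks device-specific. A sketch of the kind of per-backend check that surfaces the odd one out by connecting to each terminator directly instead of the service IP; the host list and SNI name are illustrative, not the actual pool definition:

    # Sketch: with source hashing, a bad chained cert on one terminator only bites the
    # clients hashed to it, so check each backend directly rather than the LVS service IP.
    # Host list and SNI name are illustrative, not the real pool.
    for host in ssl1001 ssl1002 ssl1003 ssl1004; do
        printf '%-10s ' "$host"
        echo | openssl s_client -connect "${host}:443" -servername bits.wikimedia.org 2>/dev/null \
            | openssl x509 -noout -issuer
    done

The later talk of pointing Nagios at every SSL box via NRPE is essentially the monitored, permanent version of the same idea.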
[20:23:19] YuviPanda: yes [20:23:21] paravoid: no [20:23:32] paravoid: we're installing the new one in like an hour [20:23:39] but why on bits was my point [20:23:42] ah [20:23:46] I see what you mean [20:24:07] we can, yes [20:25:10] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 20:25:02 UTC 2013 [20:26:00] Thanks Ryan_Lane, paravoid, jcmish, ragesoss :) [20:26:00] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [20:26:10] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [20:26:30] yw [20:30:32] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [20:33:05] New patchset: RobH; "pushing the updated unified certificate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53629 [20:34:52] LeslieCarr: https://meta.wikimedia.org/wiki/Www.wikimedia.org_template/temp # You can click "Preview HTML" here to see what it might look like. [20:35:16] I moved some icons around for symmetry. The bottom row is a whole lot of RGB, but I think overall this version looks good. [20:37:11] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [20:38:30] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 20:38:20 UTC 2013 [20:39:00] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [20:49:16] apergos, greg-g: Thank you both for helping out with the PDF/Collections deployment and roll back. [20:51:05] Susan: no problem, I just kinda watched ;) [20:51:54] And honed your panicking skills, I hope! [20:52:20] PROBLEM - MySQL Slave Delay on db1009 is CRITICAL: CRIT replication delay 207 seconds [20:52:21] PROBLEM - MySQL Replication Heartbeat on db1009 is CRITICAL: CRIT replication delay 203 seconds [20:53:20] Susan: my deep breathing ability helps :) [20:54:11] RECOVERY - MySQL Slave Delay on db1009 is OK: OK replication delay 0 seconds [20:54:11] RECOVERY - MySQL Replication Heartbeat on db1009 is OK: OK replication delay 0 seconds [20:55:52] New patchset: Ryan Lane; "Remove the root certificate from the unified chain" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53632 [20:56:10] paravoid: ^^ [20:56:18] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53629 [20:56:19] got to go [20:56:44] quick review? [20:56:48] it's really small :) [20:56:53] and we're switching out the root [20:56:55] err [20:56:56] the cert [20:57:00] it's the best time to do this [20:58:24] btw, you can do openssl x509 -in ... 
-hash and -issuer_hash and let it figure out the chain itself [20:58:28] instead of doing it manually [20:58:42] ca-certificates creates symlinks with the hash in /etc/ssl/certs even [20:58:54] change looks good [20:59:06] leaving [20:59:08] ttyl [21:00:28] New patchset: Pyoungmeister; "Redeploying this now that site is no longer breaking" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53633 [21:00:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53632 [21:03:23] New patchset: Ryan Lane; "Changing mobile to use the unified cert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53635 [21:04:57] !log setting first 2 ssl servers per dc site to false in pybal [21:05:03] Logged the message, RobH [21:05:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53635 [21:05:32] greg-g: hi, can I try to upgrade mwlib again? [21:05:53] schmir: not right now, no [21:06:46] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53633 [21:07:36] schmir: the schedule is up to date here: https://wikitech.wikimedia.org/wiki/Deployments please pick a window that is available and preferrably with decent Pacific timezone overlap [21:08:08] !log py synchronized wmf-config/lucene-production.php 'temp moving all earch traffic to pmtpa for upgrades in eqiad' [21:08:16] Logged the message, Master [21:08:31] schmir: (reload, I just fixed a type/failed update for tomorrow) [21:09:01] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 21:08:55 UTC 2013 [21:09:04] s/type/typo/ # fitting [21:09:58] greg-g: sorry, I don't understand that table. [21:10:00] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [21:10:16] greg-g: let us discuss in the bugtracker [21:10:32] schmir: this is probably easier. It is a table of dates and times that are taken [21:11:20] schmir: I would like you to find a time (1hr window) that is not already claimed, but yet is still overlapping somewhat with the Pacific timezone, to do your deployment. [21:12:44] New patchset: Ram; "Bug: 45266 Use sequence numbers instead of timestamps" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299 [21:13:54] !log repooling the first two ssl servers per site [21:14:01] Logged the message, RobH [21:15:27] !log depooling ssl servers 3 and 4 in each dc site [21:15:35] Logged the message, RobH [21:15:48] New review: Ram; "Testing is complete. Ready for review." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299 [21:17:11] RECOVERY - Puppet freshness on search27 is OK: puppet ran at Wed Mar 13 21:17:06 UTC 2013 [21:21:48] !log repooling all ssl servers (except broken ssl3004) in all sites. new unified certificate is now in use. [21:21:54] Logged the message, RobH [21:24:30] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:26:24] TimStarling: you up and about yet? 
I'm thinking we should coordinate the Collections extension update to some time that works for both you and schmir [21:26:43] ...which would probably not be the best time here, but that might be ok [21:28:14] I'm here now [21:29:00] I just need someone to tell me if the new mwlib overloads the api.php cluster [21:29:30] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [21:29:31] it's very unlikely [21:29:49] the API cluster handles a significant proportion of all parsing [21:30:49] should I try it now? [21:31:06] how many requests per second do you normally get? [21:31:34] schmir: what did you change between the deployment this morning and now? [21:33:02] it probably doesn't matter, just enable it and we'll see what happens [21:34:02] robla: https://github.com/pediapress/mwlib/commit/a3e55933c5f9e11bc7d51a1c7608ff252a1d90a2 [21:34:21] is the profiling collector down? [21:34:28] well, we're actually open this afternoon (mobile moved their stuff to tomorrow due to the #popedotting) [21:34:29] TimStarling: ok. I'll do that now. thanks. [21:34:55] umm...ok [21:35:02] is there an incident report/postmortem/retrospective on the popedotting? [21:35:05] I guess [21:35:12] sumanah: not that I've seen [21:35:21] RECOVERY - NTP on europium is OK: NTP OK: Offset -0.09331905842 secs [21:36:37] greg-g: https://wikitech.wikimedia.org/wiki/Incident_documentation and https://wikitech.wikimedia.org/wiki/Incident_response#Post_mortem [21:37:29] drdee: the locke replacement (gadolinium) is online now in eqiad [21:37:31] updated ticket. [21:38:09] !log on professor: restarted collector [21:38:16] Logged the message, Master [21:38:31] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.0676309322 (gt 8.0) [21:39:30] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 21:39:27 UTC 2013 [21:40:01] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [21:40:11] PROBLEM - Varnish HTTP mobile-backend on cp1041 is CRITICAL: Connection refused [21:42:30] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.93404378571 [21:42:40] TimStarling: Was there something else to do after enabling Scribunto everywhere? [21:43:32] !log upgraded mwlib to 0.15.3 [21:43:40] Logged the message, Master [21:45:11] RECOVERY - Varnish HTTP mobile-backend on cp1041 is OK: HTTP OK: HTTP/1.1 200 OK - 696 bytes in 0.016 second response time [21:45:20] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 192 seconds [21:45:30] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 197 seconds [21:45:45] schmir: You're trying the upgrade again? [21:46:43] Susan: according to the pp-pdf1 message, he just did [21:46:57] Right. That's why I was asking. [21:47:19] Susan: yes, I've upgraded the whole pdf cluster [21:47:38] Reedy: champagne? [21:49:00] things look good here: https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=API+application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:49:25] but pdf3 doesn't look good: https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=PDF+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:49:34] Eep. [21:49:43] schmir: ^^ [21:50:03] TimStarling: Any movement on global Scribunto modules? 
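Picking up paravoid's -hash / -issuer_hash tip from a little earlier: a chained cert can be assembled or sanity-checked by walking issuer hashes against the <hash>.0 symlinks that ca-certificates keeps in /etc/ssl/certs. A rough sketch only, not the install_certificate script mentioned above; the leaf file name comes from the log, everything else is illustrative:

    #!/bin/bash
    # Rough sketch: walk from a leaf cert up to its root by matching each cert's
    # issuer_hash against the <hash>.0 symlinks that ca-certificates maintains.
    cert=unified.wikimedia.org.pem
    while true; do
        subject_hash=$(openssl x509 -in "$cert" -noout -hash)
        issuer_hash=$(openssl x509 -in "$cert" -noout -issuer_hash)
        # A self-signed root has subject == issuer, so the two hashes match.
        [ "$subject_hash" = "$issuer_hash" ] && break
        next="/etc/ssl/certs/${issuer_hash}.0"
        if [ ! -e "$next" ]; then
            echo "issuer ${issuer_hash} not in the local store" >&2
            break
        fi
        openssl x509 -in "$next" -noout -subject
        cert=$next
    done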
[21:50:21] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 30 seconds [21:50:28] I'm also not sure the licensing question re: modules in wiki pages was ever looked at or resolved. [21:51:04] nothing I've seen on the gloval scributo stuff yet [21:51:47] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.59301945736 (gt 8.0) [21:52:07] schmir: those pdf servers seem a bit overworked generally, is that normal? [21:52:07] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [21:53:25] greg-g: yes, they are pretty busy at times [21:53:33] greg-g: https://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&m=cpu_report&s=by+name&c=PDF+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:53:37] Try a longer time span ;) [21:53:37] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [21:53:57] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [21:54:28] greg-g: this looks all good to me [21:55:20] Reedy: :P [21:55:26] schmir: ok, good [21:55:35] that test page for the bug worked !! yay! [22:11:40] New patchset: RobH; "redirection for softwarewikipedia.net per rt4672" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53686 [22:12:27] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [22:13:26] jenkins... so slow [22:16:29] New patchset: awjrichards; "Update X-Analytics handling to new k/v pair spec" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52606 [22:17:07] Reedy: ping? [22:17:12] <^demon> RobH: It's not scaling nicely :\ [22:17:36] <^demon> We prolly need to look at slaves eventually. [22:19:04] * Damianz makes chad his slave [22:20:35] ^demon: so i can see in the project that it ran a build against my changeset [22:20:39] I'm reminded of . [22:20:41] but it never actually updated my changeset [22:20:45] https://integration.mediawiki.org/ci/view/All/job/operations-apache-config-lint/changes [22:20:51] https://gerrit.wikimedia.org/r/#/c/53686/1 [22:20:57] ^demon: any ideas? [22:21:06] <^demon> Sooo mannny pingggs. [22:21:15] yes, but im the most important~! [22:21:16] ;] [22:21:48] <^demon> No clue. [22:22:09] where is hashar when I need him! [22:23:32] hrmm, so i think it passed, but since i have no failed oens in the history to compare output, i have no idea. [22:23:52] oh, yes i do, #4 failed. [22:23:59] manual merge time. [22:25:11] New review: RobH; "can confirm it passed jenkins testing @ https://integration.mediawiki.org/ci/view/All/job/operations..." [operations/apache-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/53686 [22:25:11] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53686 [22:27:51] thanks everyone. I'm going to leave in a few minutes...unless someone objects (greg-g, TimStarling, anyone else?) [22:28:19] <^demon> RobH: zuul is hella backed up :\ http://integration.mediawiki.org/zuul/status [22:28:57] that handles the write back to gerrit pages? [22:29:03] cuz i can see it did the test build [22:30:24] robh is doing a graceful restart of all apaches [22:30:33] every time i do this it makes me nervous. [22:30:44] !log robh gracefulled all apaches [22:30:50] Logged the message, Master [22:31:06] mutante: So, why doesnt the script restart api servers? [22:31:19] i would think it needs to, so now i need to use dsh to manually restart eqiad based api apaches? 
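On the question just above about the restart script skipping the API apaches, dsh against the relevant node group is the usual manual fallback. A hypothetical sketch; the group name "api_appserver" and the sudo/configtest wrapping are assumptions, not anything confirmed in this log:

    # Hypothetical sketch: graceful only the API apaches via dsh.
    # "api_appserver" is an assumed dsh group name; the configtest guard is a
    # safety measure, not confirmed behaviour of the wrapper script.
    dsh -g api_appserver -M -c -- \
        'sudo apache2ctl configtest && sudo apache2ctl graceful'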
[22:31:21] schmir: ok, bye [22:31:37] re popedotting: I don't really see a big spike in mysql connection count [22:31:38] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.135 second response time [22:31:48] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 2.22684938462 [22:32:00] I see some possible evidence of internal network saturation in the form of missing ganglia data [22:32:16] and that could cause mysql client connection timeouts [22:34:17] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 183 seconds [22:34:46] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 187 seconds [22:37:29] New patchset: Reedy; "Remove commented simplewikibooks entries" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53690 [22:37:51] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53690 [22:39:37] New patchset: Reedy; "Bug 46083 - viwiktionary has $wgLogo defined twice" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53692 [22:40:05] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53692 [22:41:19] am I the only one who gets annoyed by having all of one sort of server in the same rack? [22:43:53] I've got an excuse because it's not my job to think about it, but that's the first time I thought about the risk of homogeneous racks [22:44:25] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [22:44:43] !log demon synchronized php-1.21wmf11/extensions/Scribunto/ 'Updating Scribunto to master' [22:44:49] Logged the message, Master [22:47:23] New patchset: Reedy; "Bug 45866 - Babel configuration for min.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53693 [22:47:43] TimStarling: no you are not [22:47:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53693 [22:48:15] TimStarling: but i'll let you and mark fight that one out ;) he does have a point that with our number of servers if a row that contains 1/2 or 1/3 of the servers goes out we're not really able to handle it [22:51:35] New patchset: Reedy; "Bug 45841 - Localizing sitename for dv.wikt" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53694 [22:52:10] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53694 [22:53:04] !log reedy synchronized wmf-config/ [22:53:12] Logged the message, Master [22:56:16] New patchset: Pyoungmeister; "Re-de-deploying this, as upgrades are complete." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53695 [22:58:37] for the record, re-de-deploying is a thing [22:58:59] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53695 [23:00:01] !log py synchronized wmf-config/lucene-production.php 'moving all search traffic back to eqiad. upgrades over.' 
[23:00:07] Logged the message, Master [23:00:36] notpeter: http://en.wiktionary.org/wiki/redeployment [23:00:45] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 8 seconds [23:00:50] notpeter: that's my second new word for the day, after "popedotting" [23:01:17] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [23:01:30] hahaha, awesome [23:01:34] being dotted because the pope got redeployed [23:01:36] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:02:02] I find it funny [23:02:13] because we wouldn't even notice any kind of slashdotting [23:02:19] ha! [23:02:26] only popes can drive enough traffic for us to notice [23:02:30] and michael jackson [23:02:32] we get realworlddotted [23:02:38] "the Pope and Michael Jackson" [23:02:38] hahaha, exactly [23:02:42] my new magical realism novel [23:03:15] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53696 [23:03:47] didn't we get Bieberdotted a while back? :) [23:05:09] !log py synchronized wmf-config/lucene-production.php 'moving all search traffic back to eqiad. upgrades over.' [23:05:16] Logged the message, Master [23:11:13] xyzram: hey [23:11:23] howdy [23:11:26] I have deployed your patch to lucene-search-2 and https://gerrit.wikimedia.org/r/#/c/52547/ [23:11:29] well [23:11:45] I wrote a new identical patchset because I really really hate re-basing things [23:12:00] so, it's all live in prod now [23:12:13] and the error logs are *much* less spammy [23:12:36] Great! Glad to know it didn't crash and burn :-) [23:12:47] nope, all worked well [23:13:33] That slimness of log files will be a big help tracking down the remaining issues with search. [23:13:56] Thanks for the update! [23:15:07] xyzram: yep! may I abandon your version of that patchset? [23:15:25] Surely. [23:15:30] * Reedy rebases notpeter [23:16:01] Reedy: for 4 line changes.... man, not worth fighting the rebase fight ;) [23:16:22] Cherrypick is sometimes useful. [23:16:38] Change abandoned: Pyoungmeister; "this was done in a different patchset because I was not willing to rebase. but this change is live!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52547 [23:16:58] git cherry ... [23:17:09] git fetch ssh://reedy@gerrit.wikimedia.org:29418/operations/puppet refs/changes/47/52547/2 && git cherry-pick FETCH_HEAD [23:17:11] copy paste! [23:17:36] !log demon Started syncing Wikimedia installation...
: [23:17:42] Logged the message, Master [23:18:14] I tried going the cherrypicking route [23:18:16] and then got annoyed [23:18:32] and then used my laptop's buffers to fix the problem :) [23:19:07] New patchset: RobH; "wikimedia.us redirect per rt4416" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53697 [23:20:47] New patchset: Reedy; "Update config/dblist locations" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53698 [23:21:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53698 [23:23:09] !log no one sync apaches, we are doing crazy magic on them [23:23:16] Logged the message, RobH [23:23:24] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53697 [23:27:07] New patchset: RobH; "Revert "wikimedia.us redirect per rt4416"" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53699 [23:28:15] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53699 [23:28:35] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [23:31:57] New review: RobH; "So I attempted to steal this bit of regex and use it in my wikimedia.us patchset. It failed, with t..." [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/53403 [23:38:45] !log deploying change 53464 to OpenStackManager on wikitech [23:38:51] Logged the message, Master [23:41:20] !log demon Finished syncing Wikimedia installation... : [23:41:27] Logged the message, Master [23:43:48] !log installing package upgrades on zirconium [23:43:54] Logged the message, Master [23:44:54] New patchset: RobH; "redirection for wikimedia.us to wikimedia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53703 [23:45:31] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53703 [23:48:56] New patchset: RobH; "Revert "redirection for wikimedia.us to wikimedia.org"" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53704 [23:49:13] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53704 [23:49:41] well, that was a failure. [23:49:49] im too tired to deal with regex now it seems. [23:55:53] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
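As an aside on the "Varnish traffic logger" PROBLEM/RECOVERY noise that runs through this whole log: it is a process-count check flapping between 2 and 3 varnishncsa processes, i.e. the standard check_procs plugin or something very close to it. The plugin path and thresholds below are guesses, not the actual Icinga service definition:

    # Approximate form of the flapping check; path and thresholds are assumptions.
    /usr/lib/nagios/plugins/check_procs -C varnishncsa -c 3:3
    # -> PROCS CRITICAL: 2 processes with command name varnishncsa   (when one dies)
    # -> PROCS OK: 3 processes with command name varnishncsa         (after recovery)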