[00:00:29] bloody puppet reverting all my changes [00:00:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [00:00:33] confirmed i don't see *.m.wikipedia.org [00:01:13] bleh. [00:01:13] did anyone mail out to the mobile team that this change was happening ? [00:01:15] blllllllllaaaaaargggg [00:01:25] tfinc: we were emailed by the mobile team to fix it ;] [00:01:26] New patchset: Pyoungmeister; "adding null search shard for log silence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53489 [00:01:40] but nope, cuz it should have been seemless ;_; [00:02:01] (it affected more than just mobile, change was for all https) [00:02:19] New review: Pyoungmeister; "rebasing this was annoying, so I just made a new patchset here: https://gerrit.wikimedia.org/r/#/c/5..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52547 [00:02:25] to be fair the thread was started by brion to mobile-tech and ops [00:02:34] subject " [Ops] Can somebody fix SSL on non-Wikipedia mobile sites?" [00:02:44] and we did [00:02:45] (assuming the [Ops] part didn't happen on the mobile-tech list though ) [00:02:51] and in the process, broke wikipedia mobile [00:02:51] New patchset: Tim Starling; "Use latest php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53490 [00:02:54] haha [00:03:14] yup. i see it. "subject: Can somebody fix SSL on non-Wikipedia mobile sites?" [00:03:25] we followed directions of that email to the letter [00:03:26] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53490 [00:03:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53489 [00:03:52] what's special about mw1043, does not have /etc/apache2/wmf dir [00:03:55] brion didnt say 'by the way, please dont break wikipedia' [00:04:01] I deployed a few other puppet changes just now [00:04:05] TimStarling: did you just merg [00:04:06] thanks! [00:04:25] HTTPS monitoring and search configuration [00:05:40] TimStarling: the https monitoring was me, thx dude [00:05:41] "Additionally, a 404 Not Found error was encountered while trying to use an ErrorDocument to handle the request." when doing various things...several wikis [00:05:43] otrs wiki, enwiki [00:05:55] IDk if things are known - happening randomly [00:06:01] fixes on reload or two usually [00:06:16] dzahn is doing a graceful restart of all apaches [00:06:17] "The requested URL /wiki/Special:Contributions/Rjd0060 was not found on this server." [00:06:30] wikipedia says no. [00:06:34] did someone change the apache configuration? [00:06:42] Aaron|laptop: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate [00:06:42] Content-Language: en [00:06:42] Vary: Accept-Encoding,Cookie [00:06:42] Expires: Thu, 01 Jan 1970 00:00:00 GMT [00:06:52] TimStarling: I added to redirects about 8 hours ago [00:06:52] i just got a 404 from http://meta.wikimedia.org/wiki/Main_Page as a logged in user [00:06:55] so was prolly someone else. [00:06:58] !log dzahn gracefulled all apaches [00:07:07] Logged the message, Master [00:07:08] robh: so, it would be ideal to have checks for every top level domain [00:07:15] mind adding a ticket for that? [00:07:22] bonus points for actually adding the checks [00:07:26] icinga checks ya mean? [00:07:29] yes [00:07:33] TimStarling: i added this redirect https://gerrit.wikimedia.org/r/#/c/53478/3/redirects.conf [00:08:32] mw1136: Action 'configtest' failed. 
[00:08:33] mw1136: The Apache error log may have more information. [00:08:33] mw1136: Your apache2 configuration is broken, so we're not restarting it for you. [00:09:14] it does not have /etc/apache2/wmf [00:09:23] 17:12 < mutante> what's special about mw1043, does not have /etc/apache2/wmf dir [00:09:36] robla: remove from all of analytics or just what you mentioned? [00:09:46] (nagios) [00:10:37] nothing special about it [00:10:41] $ dsh -g apaches -cM 'test -e /etc/apache2/wmf || echo help' | wc -l [00:10:41] 195 [00:10:41] bugzilla down? I'm not getting any response from the server, all connections time out before getting any response at all (chrome's ERR_CONNECTION_TIMED_OUT page is served) [00:10:42] jeremyb_: go ahead and remove me from the analytics group in nagios (are you doing this?) [00:10:45] 195 servers are broken now [00:11:08] robla: i can make the commit at least :) [00:11:40] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [00:11:40] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [00:11:57] bugzilla WFM [00:11:58] TimStarling: wikimedia-task-appserver ... ouch [00:12:11] wow, that's not good. meta.wikimedia.org ERR_NAME_NOT_RESOLVED [00:12:13] New patchset: Ryan Lane; "Unified is missing *.m.wikipedia.org, use original" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53494 [00:12:14] same for commons and enwiki [00:12:40] !log manually creating missing /etc/apache2/wmf symlinks on servers that lack them [00:12:47] Logged the message, Master [00:13:47] TimStarling: php is partially uninstalled on a lot of those servers as well [00:13:59] so, if i installed wikimedia-task-appserver, it would install new php packages [00:14:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53494 [00:14:24] binasher: example? [00:15:02] mw1123 was one, but i just reinstalled mediawiki-task-appserver, which reinstalled the missing php packages [00:15:27] mutante: example of a server missing mediawiki-task-appserver? [00:15:36] you mean wikimedia-task-appserver [00:15:38] TimStarling: see mw1125 [00:15:42] binasher: mw1136 [00:15:46] yeah, that [00:16:24] k, my issue is unrelated for some reason I'm unable to make any http or ssh request in any browser or terminal. yet irc is working fine. [00:16:36] (even ping google.com failed) [00:16:40] ignore me [00:16:47] !log depooling ssl1 ssl2 ssl1001 ssl1002 ssl3001 ssl3002 [00:16:53] Logged the message, Master [00:16:54] it's thereon 1125 [00:17:46] root@mw1125:~# dpkg -l | grep wikimedia-task-appserver [00:17:47] rc wikimedia-task-appserver 2.7-1 Wikimedia application server [00:18:12] I see [00:18:23] rc php-apc 3.1.7-1 APC (Alternative PHP Cache) module for PHP 5 [00:18:24] rc php5-memcached 2.1.0-2~wmf+precise1 memcached extension module for PHP5, uses libmemcached [00:18:25] rc php5-mysql 5.3.10-1ubuntu3.5+wmf1 MySQL module for php5 [00:18:26] etc [00:18:55] what is removing it? puppet? [00:19:09] is all this symlink stuff related to the 404s being reported? [00:19:15] mutante? 
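A minimal sketch of the dsh sweep being described here — count the appservers that lack /etc/apache2/wmf, then recreate the link on the ones that miss it. The symlink target below is an assumption used only to make the example concrete; the real target depends on how the apache config checkout is laid out on the appservers.

    # count appservers missing the wmf config dir (as pasted above)
    dsh -g apaches -cM 'test -e /etc/apache2/wmf || echo missing' | wc -l

    # recreate the link where it is absent; /usr/local/apache/conf is an
    # assumed target for illustration only
    dsh -g apaches -cM 'test -e /etc/apache2/wmf || sudo ln -s /usr/local/apache/conf /etc/apache2/wmf'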
[00:19:31] looks like it [00:19:36] Start-Date: 2013-03-12 23:53:28 [00:19:37] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install php5-common=5.3.10-1ubuntu3.4+wmf1 [00:19:39] Downgrade: php5-common:amd64 (5.3.10-1ubuntu3.5+wmf1, 5.3.10-1ubuntu3.4+wmf1) [00:19:42] Remove: wikimedia-task-appserver:amd64 (2.7-1) [00:19:52] jeremyb_: that plus not having a functional php install [00:19:59] from var/log/apt/history.log [00:20:08] wikimedia-task-appserver should be arch-independent [00:20:23] the control file does not depend on a particular version of PHP [00:20:30] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install wikimedia-task-appserver [00:20:33] Upgrade: wikimedia-task-appserver:amd64 (2.6-1, 2.7-1) [00:20:41] the upgrade of appserver still worked [00:20:48] did someone just change that? [00:20:50] but then the php5 install removed it .. [00:21:12] ridiculous [00:21:39] wikimedia-task-appserver should not be depending on particular versions of things [00:22:25] so either we need to downgrade wikimedia-task-appserver or php5 [00:22:54] can someone try to beat me to that? [00:25:38] too hard? [00:26:15] Ryan_Lane: do you need anything from me to roll that change back ? I'm going to be leaving in about 15min [00:26:39] tfinc: no [00:26:49] TimStarling: I'd help, but I'm solving an unrelated thing [00:27:08] Ryan_Lane: it's ok, I think the site is mostly up [00:27:31] not sure why [00:27:42] I guess we can fix this properly [00:27:47] maybe the broken servers are mostly depooled? [00:28:04] and the other errors are due to depool limits? [00:28:06] i think so [00:28:53] am I correct that the package is architecture-dependent? [00:29:14] Architecture: all [00:29:20] in the control file [00:30:06] apt-cache show wikimedia-task-appserver also shows Architecture: all [00:30:07] !log pooling ssl1 ssl2 ssl1001 ssl1002 ssl3001 ssl3002 [00:30:13] Logged the message, Master [00:30:19] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 00:30:18 UTC 2013 [00:30:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [00:31:21] apt-cache showpkg says: [00:31:25] Dependencies: [00:31:26] 2.7-1 - apache2-mpm-prefork (0 (null)) librsvg2-bin (0 (null)) dvipng (0 (null)) gsfonts (0 (null)) ocaml (0 (null)) ploticus (0 (null)) php5-mysql (0 (null)) php5-curl (0 (null)) php5-xmlrpc (0 (null)) php5-cli (0 (null)) php-apc (0 (null)) php-wikidiff2 (0 (null)) php5-fss (0 (null)) php5-geoip (0 (null)) libapache2-mod-php5 (0 (null)) file (0 (null)) djvulibre-bin (0 (null)) tidy (2 20070821) libtidy-0.99-0 (2 20070821) php- [00:31:26] pear (0 (null)) rsync (0 (null)) make (0 (null)) xpdf-utils (0 (null)) libtiff-tools (0 (null)) texlive (0 (null)) texlive-bibtex-extra (0 (null)) texlive-font-utils (0 (null)) texlive-fonts-extra (0 (null)) texlive-lang-croatian (0 (null)) texlive-lang-cyrillic (0 (null)) texlive-lang-czechslovak (0 (null)) texlive-lang-danish (0 (null)) texlive-lang-dutch (0 (null)) texlive-lang-finnish (0 (null)) texlive-lang-french (0 (null [00:31:34] )) texlive-lang-german (0 (null)) texlive-lang-greek (0 (null)) texlive-lang-hungarian (0 (null)) texlive-lang-italian (0 (null)) texlive-lang-latin (0 (null)) texlive-lang-mongolian (0 (null)) texlive-lang-norwegian (0 (null)) texlive-lang-other (0 (null)) texlive-lang-polish (0 (null)) texlive-lang-portuguese (0 (null)) texlive-lang-spanish (0 (null)) texlive-lang-swedish (0 (null)) texlive-lang-vietnamese (0 (null)) texlive- 
[00:31:34] latex-extra (0 (null)) texlive-math-extra (0 (null)) texlive-pictures (0 (null)) texlive-pstricks (0 (null)) texlive-publishers (0 (null)) php5-redis (0 (null)) php5-memcached (0 (null)) libmemcached10 (0 (null)) php5-igbinary (0 (null)) lilypond (0 (null)) timidity (0 (null)) imagemagick (0 (null)) [00:31:37] !log depooling ssl3 ssl4 ssl1003 ssl1004 ssl3003 [00:31:39] sorry for flood [00:31:44] Logged the message, Master [00:32:29] example srv291, i can just fix it by installing wikimedia-task-appserver, if these php5 versions are ok: 5.3.10-1ubuntu3.5+wmf1 [00:32:52] should i just do that on all that have the package "rc"? [00:32:54] yes, that is the right version [00:33:09] yes [00:33:11] mutante: go for it [00:34:32] so if wikimedia-task-appserver doesn't depend on 3.4+wmf1, how did it come to be uninstalled? [00:35:24] !log pooling ssl3 ssl4 ssl1003 ssl1004 ssl3003 [00:35:32] Logged the message, Master [00:37:49] -_- [00:38:11] New patchset: Ryan Lane; "Only modify mobile certificate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53496 [00:38:16] dpkg mysteries [00:38:24] I can't wait to combine all of our certs and get rid of all of these damn lb services [00:38:26] hey [00:38:30] I'm back :) [00:38:42] what's up? [00:39:06] TimStarling: binasher, it is "ii" on all now [00:39:17] all in "apaches" [00:39:22] paravoid: looks like you missed all the fun [00:39:25] !log depooling ssl1 ssl2 ssl1001 ssl1002 ssl3001 ssl3002 [00:39:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53496 [00:39:31] Logged the message, Master [00:39:49] :( [00:39:50] New patchset: Ram; "Bug: 45266 Use sequence numbers instead of timestamps" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299 [00:40:30] from dpkg.log on mw1125, here is the story [00:41:06] 22:51: upgrade wikimedia-task-appserver. [00:41:29] from 2.6 to 2.7 due to mutante's work [00:41:39] dpkg.log isn't exactly the most readable thing in the world :) [00:41:50] /var/log/apt/history.log [00:42:03] 23:48: upgrade PHP from 3.4 to 3.5 due to me manually running apt-get [00:42:22] is everything okay now? [00:42:40] 23:53: downgrade PHP to 3.4 and remove wikimedia-task-appserver due to puppet run [00:42:52] looks ok, wikimedia-task-appserver and php packages are installed on servers in apaches dsh group [00:43:02] does apache need restarting? [00:43:18] PHP 3.5 and 3.5? ;) [00:43:49] for a new php version? [00:43:51] I'd guess so [00:43:57] I was getting tired of typing 5.3.10-1ubuntu3.5+wmf1 [00:44:01] TimStarling: let me do that, that way we also confirm all of them now have the /wmf dir [00:44:03] you know what I mean [00:44:08] dzahn is doing a graceful restart of all apaches [00:44:54] !log dzahn gracefulled all apaches [00:44:56] Logged the message, Master [00:45:11] I know why puppet downgraded PHP to 3.4, I fixed it in https://gerrit.wikimedia.org/r/#/c/53490/ [00:45:12] Ryan_Lane: paravoid: looks like gluster has async datacenter replication. and an s3 compatible reset api now! let's use it instead of ceph. 
cc: Aaron|laptop [00:45:15] all i see is "VIP not configured on lo" messages [00:45:21] I do not know why it removed wikimedia-task-appserver [00:45:30] so no other errors like missing the complete apache-sanity-check [00:46:02] New patchset: Ryan Lane; "Fix typo in certname" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53497 [00:46:06] Ryan_Lane: it works perfectly, since we just copied everything into ceph, not we can do ceph -> gluster [00:46:13] \o/ [00:46:17] *now we [00:46:36] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53497 [00:46:39] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 190 seconds [00:46:39] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 190 seconds [00:46:50] as long as we use quorum support we won't have to worry about split brains [00:46:58] and it's not like we need speed [00:46:58] Aaron|laptop: "everything" [00:47:02] or a sane filesystem design [00:47:09] PROBLEM - HTTPS on ssl1 is CRITICAL: Connection refused [00:47:23] though apparently the gluster folks say it's faster than ceph [00:47:25] +1 for ceph to gluster migration! [00:47:28] icinga-wm: I know, I know [00:47:29] PROBLEM - HTTPS on ssl2 is CRITICAL: Connection refused [00:47:43] and it supports "cloudification of applications" [00:47:45] Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install php5-common=5.3.10-1ubuntu3.4+wmf1 [00:47:53] but first, a blog post about migrating to ceph [00:47:56] Aaron|laptop: does it really say that somewhere? :) [00:47:58] those ssl alerts don't look good [00:48:01] then we switch to gluster [00:48:02] don't you love -y? [00:48:21] Ryan_Lane: http://www.slideshare.net/Gluster/introduction-to-glusterfs-webinar-september-2011 [00:48:22] binasher: then we switch back to zfs? [00:48:28] Ryan_Lane: ssl alerts? [00:48:35] slide 19 [00:48:39] I remember now that I used to make a habit of using --no-remove with -y so that apt-get is less likely to uninstall your whole system [00:48:39] paravoid: so, I installed the new unified cert [00:48:40] zfs + nfs [00:49:08] paravoid: wonderfully enough it was missing: *.m.wikipedia.org, mediawiki.org and *.m.mediawiki.org [00:49:11] RECOVERY - HTTPS on ssl1 is OK: OK - Certificate will expire on 01/20/2016 12:00. [00:49:16] ! [00:49:24] yeah [00:49:29] RECOVERY - HTTPS on ssl2 is OK: OK - Certificate will expire on 01/20/2016 12:00. [00:49:53] New patchset: Jeremyb; "stop paging robla for analytics (icinga)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53499 [00:50:23] paravoid: so, I'm mostly reverting that now [00:50:40] and by mostly I mean I'm reverting it for mobile [00:50:49] we reverted mediawiki.org earlier [00:51:00] ok, if you're on top of this :) [00:51:10] yep [00:51:16] apt trying to destroy the world: http://paste.tstarling.com/p/iMSruW.html [00:51:25] hahaha [00:51:34] I love how the first one is texlive-lang-greek [00:51:41] why doesn't the HTTPS check use ISO 8601? :( [00:52:13] "libapache2-mod-php5" do we use that? [00:52:23] that would be the php apache module [00:52:25] jeremyb_: why don't we have lots of checks? :) [00:52:32] so, err, yes, we kinda use that :) [00:52:33] I think robla knows that [00:52:38] * robla shoulda put a smiley on that [00:52:40] okay :) [00:52:50] sorry :) [00:52:54] * TimStarling gives paravoid an amazon voucher to buy a sarcasm detector [00:52:59] it's 3am [00:53:02] can I get one of those too?? 
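Tim's --no-remove habit is the crux of what went wrong here: a pinned downgrade run with -y and --force-yes will quietly remove whatever conflicts with it, in this case wikimedia-task-appserver. A minimal sketch, assuming a Debian/Ubuntu appserver, of the more defensive invocation plus a quick way to spot packages left in the removed-but-config-remains ("rc") state afterwards:

    # downgrade php5-common without letting apt remove dependent packages;
    # the version string is the one quoted in the log above
    apt-get install -y --no-remove php5-common=5.3.10-1ubuntu3.4+wmf1

    # list packages dpkg now considers removed (only config files left, state "rc")
    dpkg -l | awk '/^rc/ {print $2}'

    # reinstall the task package across the group, as was done during the cleanup
    dsh -g apaches -cM 'dpkg -l wikimedia-task-appserver 2>/dev/null | grep -q ^rc && sudo apt-get install -y wikimedia-task-appserver'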
[00:53:06] You can buy them at 3am [00:53:07] only $99 with free shipping! [00:53:10] Amazon is open all the time! [00:53:12] and I'm really in no place to detect sarcasms [00:53:18] but the voucher is only for $15 :( [00:53:18] no prob....I give Tim crap for not putting enough smilies on his stuff [00:53:31] does amazon prime ship free to athens? [00:53:59] Probably not [00:55:08] heh, i just talked to apergos about that .. http://askville.amazon.com/Amazon-Ship-Greece/AnswerViewer.do?requestId=88175419 [00:55:58] doesitshipto.com ..heh [00:56:40] Will it ship? [00:58:09] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 00:58:02 UTC 2013 [00:58:22] I'm going to start running updateCollation.php since the emergency is apparently over [00:58:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [00:58:45] I'm glad I changed puppet to depend on the latest PHP, that stopped it from constantly reverting our site fixing efforts [00:59:21] Going to run them all simultaneously? [01:01:50] sorry for missing the fun :( [01:03:05] Reedy: yes [01:03:14] just going to use 6 screens on hume [01:03:18] paravoid my buddy my paaalllll!!!!! [01:03:24] ottomata: no way [01:03:25] go away [01:03:26] hahah [01:03:27] it's 3am :) [01:03:32] hahah [01:03:33] you spoke up [01:03:35] your fault [01:03:36] hahah [01:03:44] well, for a site wide outage [01:03:49] oh, i just signed on [01:03:50] not for reviewing puppet manifests :) [01:03:59] didn't notice :), of course of course [01:04:06] haha [01:04:19] :-) [01:05:03] !log running updateCollation.php on all uca-* wikis in 6 screens on hume [01:05:20] TimStarling: anything i can do? [01:05:56] no, I think it's all sorted, thanks for coming in [01:05:59] good night [01:06:04] thanks [01:07:14] ottomata: review my puppet change :) [01:07:20] ok where? [01:07:45] ottomata: https://gerrit.wikimedia.org/r/53499 [01:07:50] i even made it relevant to you [01:08:31] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53499 [01:08:45] done! [01:09:27] !log pooling ssl1 ssl2 ssl1001 ssl1002 ssl3001 ssl3002 [01:09:34] danke :) [01:09:47] !log depooling ssl3 ssl3 ssl1003 ssl1004 ssl3003 [01:09:48] ottomata: (sockpuppet too?) [01:10:09] yup [01:10:25] so that regex in apache-sanity-check now matches eqiad hosts, but still not all of them, because the last part of the regex was (and has been before) (1\21\) but we also have hosts with lo:LVS ending in .22 [01:11:54] ok. en.m is fixed [01:12:22] but unfortunately it no longer works for anything other than wikipedia [01:12:29] RECOVERY - MySQL Slave Delay on db55 is OK: OK replication delay 0 seconds [01:12:29] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay 0 seconds [01:12:45] !log pooling ssl3 ssl3 ssl1003 ssl1004 ssl3003 [01:14:29] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [01:15:11] RECOVERY - MySQL Slave Delay on db50 is OK: OK replication delay 0 seconds [01:28:41] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 01:28:30 UTC 2013 [01:29:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [01:30:42] New review: Dzahn; "18:19 < mutante> so that regex in apache-sanity-check now matches eqiad hosts, but still not all of ..." 
[operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/49231 [01:37:07] New review: Dzahn; "RT-4676, approved and waiting period is over" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/52671 [01:37:17] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52671 [01:42:55] New patchset: Ottomata; "Outputting integers in python settings.py for metrics api" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53507 [01:43:31] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53507 [01:44:52] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [01:47:05] marktraceur: you can ssh to gallium now, enjoy [01:47:36] New patchset: Ottomata; "Missing \ in regex in e3 metrics api settings.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53508 [01:48:00] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53508 [01:48:09] New review: Dzahn; "21 1H IN PTR rendering.svc.pmtpa.wmnet." [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/49231 [01:49:39] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [01:49:39] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [01:58:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 01:58:54 UTC 2013 [01:59:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [02:07:50] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [02:08:29] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 194 seconds [02:08:39] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 199 seconds [02:29:31] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 02:29:22 UTC 2013 [02:29:31] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [02:30:32] !log LocalisationUpdate completed (1.21wmf11) at Wed Mar 13 02:30:32 UTC 2013 [02:31:29] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 194 seconds [02:31:39] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 198 seconds [02:42:59] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [02:54:13] !log LocalisationUpdate completed (1.21wmf10) at Wed Mar 13 02:54:13 UTC 2013 [03:00:10] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 03:00:05 UTC 2013 [03:00:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [03:16:29] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [03:16:39] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [03:30:40] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 03:30:35 UTC 2013 [03:31:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [03:32:39] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 187 seconds [03:32:40] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 187 seconds [03:52:32] mutante: Thanks! 
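On the apache-sanity-check regex mentioned above (and quoted in the r/49231 review): the last part of the pattern only accepted LVS service addresses ending in .21, while some eqiad appservers carry a lo:LVS address ending in .22. A rough sketch of the idea, assuming the check greps the loopback interface for the service VIP — the 10.2.x.x subnet and the exact grep are illustrative, not the script's actual code:

    # accept either .21 or .22 as the final octet of the LVS service IP on lo
    ip addr show dev lo | grep -E 'inet 10\.2\.[0-9]+\.(21|22)/'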
[04:01:10] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 04:01:00 UTC 2013 [04:01:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [04:03:31] !log tstarling synchronized php-1.21wmf10/includes/Collation.php [04:04:33] !log tstarling synchronized php-1.21wmf11/includes/Collation.php [04:06:46] New patchset: Krinkle; "Integration: Update index, fix discrepencies, move wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [04:07:08] New patchset: Krinkle; "Integration: Update index, fix discrepencies, move to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [04:08:15] New review: Krinkle; "When deploying: Update paths in jobs that publish artefacts in the document root (nightly snapshots)." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [04:13:39] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [04:13:39] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [04:20:56] New review: Hashar; "Note that I am currently refactoring the contint manifests and will drop all the web materials from ..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53513 [04:27:08] !log (about 00:42) - dsh installing wikimedia-task-appserver where it was 'rc' after php puppet upgrade [04:31:39] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 04:31:37 UTC 2013 [04:32:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [04:32:36] mutante: you might want to warn Tim about that :] I have seen he changed php to ensure => latest [04:33:38] hashar: he was present at that time and knows, i just noticed i forget to log during site outages :p [04:33:48] :-] [04:33:49] when it would be most interesting [04:34:07] so we are fine :-] [04:34:22] yea, it was a couple hours ago [04:34:34] and thank you to have added marktraceur account on gallium! [04:34:42] wikimedia-task-appserver packages had been removed on a lot of servers :p [04:36:16] mutante: and replaced by puppet ? [04:36:28] I mean, is the content of wikimedia-task-appserver obsolete nowadays? [04:36:48] no, it's not obsolete. it has stuff like apache-graceful-all in it [04:37:17] ahhh and scap-2 iirc [04:37:24] i built a new version of it to fix one line in apache-graceful-all [04:37:57] so now you can use it again. it did not restart Apaches in eqiad due to a regex in apache-sanity-check [04:39:08] good finding [04:39:20] would it make sense to put that script and some others in puppet? [04:40:37] which others? [04:40:49] the package upgrade of -appserver worked, but puppet had ensure on a specific PHP version [04:40:56] hashar: https://gerrit.wikimedia.org/r/#/c/53490/1/modules/applicationserver/manifests/packages.pp [04:41:32] jeremyb_: well I mean the scripts in the wikimedia-task-appserver [04:41:57] hashar: then this is what happened: 17:28 < mutante> Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install php5-common=5.3.10-1ubuntu3.4+wmf117:28 < mutante> Downgrade: php5-common:amd64 (5.3.10-1ubuntu3.5+wmf1, 5.3.10-1ubuntu3.4+wmf1)17:28 < mutante> Remove: wikimedia-task-appserver:amd64 (2.7-1) [04:42:33] after 23:39 Tim: updating PHP to php5-3.5+wmf1 for new ICU [04:42:49] doh [04:44:25] hashar: now to the other question. package or puppet. there seem to be different opinions. 
I asked Mark and he said it was an in-between thing but i should still just package for that small fix [04:45:03] I must agree with him :-] [04:45:13] you don't want to refactor everything just for a tiny regex update [04:45:19] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 04:45:12 UTC 2013 [04:45:35] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [04:45:50] stop flapping! [04:46:29] yeah that keeps happening on several boxes [04:47:03] should theoretically be impossible [04:47:13] time should only go in one direction! [04:47:33] heh, fair [04:47:34] maybe there's some permissions error on neon? [04:47:57] so it records the recovery but wherever the trap info is persisted is readonly? [04:48:19] RECOVERY - MySQL Slave Delay on db68 is OK: OK replication delay 0 seconds [04:48:28] wild speculation :) [04:48:29] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [04:49:43] and that db68 notification is kind of a duplicate too [04:52:23] yeah, but at least it's that way because someone made it be that way [04:52:24] ticket please for the "time should only go in one direction" : [04:52:34] ok! :) [04:52:37] ty [04:54:06] hashar: the answer to that would be either "nagios service dependencies" and/or "icinga check_cluster" [04:54:43] oh yeah ticket [04:54:45] ops-request ? [04:54:48] http://docs.icinga.org/latest/en/clusters.html | http://nagios.sourceforge.net/docs/3_0/dependencies.html [04:54:52] yes [04:54:54] go ahead and do it jeremyb_ :-) [04:55:26] but slave delay != replication heartbeat [04:55:31] first i have to reply to Ryan_Lane [04:55:39] ? [04:55:49] in email? :) [04:55:52] or in the bug? heh [04:55:52] and i have to restart my local laptop ntpd before that! [04:55:53] yes [04:56:23] mutante: gotta think about how the dependencey between slave delay / replication heartbeat. I guess we do not need to warn about the delay if the replication is dead [04:57:17] hashar: yea, that is what would be defined in a define servicedependency{ in Nagios .. that would be the traditional Nagios way that i expect to be the same in Icinga, but then there is also the newer check_cluster stuff by Icinga [04:57:23] Ryan_Lane: sent. which bug did you mean? [04:57:44] about renaming sub-sub domains [04:58:09] well, I just made a duplicate :) [04:58:46] Is it worth doing a push this week (tomorrow?) and just get them all done? [05:01:19] hashar: ideas what we need for more jenkins speed? [05:01:40] it went up to a couple minutes earlier today [05:02:19] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [05:02:22] but robh said he could see it having open slots [05:02:26] ..somewhere [05:02:35] mutante: I need Zuul to be updated, that is blocked by a python module dependency I have packaged. [05:03:26] mutante: and there is a nasty bug where all the jobs have been completed by Zuul wait for quite a long time before reporting the change back in Jenkins. I haven't tracked it yet [05:04:35] i wonder what queue RT 3481 is [05:04:57] must be procurement i guess [05:05:27] (see mutante @ https://bugzilla.wikimedia.org/38763#c3 ) [05:05:46] hashar: ugh, i see, thanks! it sounds like a bunch of packaging work too .. feel you [05:06:16] mutante: the packaging is on Faidon todo list :-] [05:06:22] then I will update zuul \O/ [05:07:25] hashar: :) [05:07:34] jeremyb_: the queue is called domains [05:08:19] errrr [05:08:25] mutante: so why can't i see it? 
:) [05:08:59] do you know what your groups are? [05:09:16] oh, do you have office wiki? [05:09:25] i doesn't [05:09:42] but you and i both know that i've been on domains tickets before [05:09:57] yes [05:10:30] i agree, the permissions must have changed [05:10:45] but i am not aware removing that [05:11:12] i can guess what the reason might be though, it has legal on it [05:11:27] and sometimes stuff like auth codes to transfer domains [05:12:13] but i can still (now) see domains [05:12:23] i just modified 4416 [05:12:26] in the last minute [05:12:54] eh? really, but you cant see 3481?? [05:13:05] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [05:13:05] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [05:13:05] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [05:13:05] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [05:13:05] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [05:13:07] really! [05:13:55] ooooh, duuh, my fault [05:14:08] there are 2 RT tickets linked in that BZ bug [05:14:18] 3481 is procurement :p [05:15:51] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 05:15:39 UTC 2013 [05:16:06] reason for permissions there: ssl certificate requests / ask robh [05:16:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [05:16:40] RECOVERY - MySQL Slave Delay on db57 is OK: OK replication delay 0 seconds [05:16:40] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [05:17:03] it is stuff we pay for, like hardware, so it was moved there [05:17:36] right. i guess maybe it's the sort of thing that should have a pair of tickets and one not in procurement [05:18:19] why not, i am mostly pro duplicate tickets and linking them [05:18:37] and real duplicates => "merge into" instead of deleting [05:19:05] but it's closed? [05:19:12] * jeremyb_ waits for the page to load [05:19:30] the one for the planet cert? yea, it's closed and resolved [05:19:53] planet has a *.planet.wm.org wildcard just for itself [05:20:41] yeah, i see now. but it's mixed content so only a little useful... [05:20:52] (and entirely public content of course) [05:20:57] cant fix when including images from blogs [05:21:02] on an aggregator [05:21:18] right, i figured it was impossible to fix [05:21:25] or at least not worth it [05:21:38] yea, but it even has a HTTPSEverywhere rule and it's own wildcard .. [05:21:50] to prevent those HTTPS tickets, heh [05:22:03] Change abandoned: Hashar; "Abandoning for now since that is not how it should be done and I probably dont have the time to real..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51313 [05:22:24] randomly adds that it has IPv6 support :p [05:23:11] oh, btw, did en.planet get stuck again ? hrm, looks like it.. 
sigh [05:23:31] yeah, the bug was reopened [05:24:41] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 181 seconds [05:25:29] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 200 seconds [05:31:43] !log clean cache and run planet update on en.planet [05:31:46] root@zirconium:/# rm -rf /var/cache/planet/en/* && sudo -u planet /usr/bin/planet -v /usr/share/planet-venus/wikimedia/en/config.ini [05:32:34] sigh, sigh, it used to work so well, no idea where the utf8/ascii issue comes from yet [05:32:45] it's late.. leaving [05:42:11] mutante: hashar: RT 4727, 4728 [05:42:21] cant click [05:42:24] !rt 4727 [05:42:25] http://rt.wikimedia.org/Ticket/Display.html?id=4727 [05:46:09] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 05:46:05 UTC 2013 [05:46:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [05:48:30] jeremyb_: I don't even know how to CC myself on a RT :-] [05:48:47] on both? [05:48:56] you have to go to people. but i can do it for you :) [05:50:24] hashar [05:50:24] ahh [05:50:27] maybe it works [05:50:41] !rt 4728 [05:50:41] http://rt.wikimedia.org/Ticket/Display.html?id=4728 [05:51:01] hah, you don't have to comment! [05:51:45] * hashar whistles ( http://integration.mediawiki.org/coverage/ ) [05:52:00] PROBLEM - Puppet freshness on kuo is CRITICAL: Puppet has not run in the last 10 hours [05:53:32] I should wake up at 5 am everyday [05:53:59] hashar: look at the tickets again :) [05:54:23] admincc ? [05:54:28] New patchset: Hashar; "contint: xdebug + code coverage directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53531 [05:54:35] wtf, how is it 2am [05:54:59] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [05:55:10] /tmp/hudson8310835417058319495.sh: line 2: 11537 Segmentation fault php tests/phpunit/phpunit.php --coverage-html /srv/org/mediawiki/integration/coverage/ [05:55:12] hashar: means you get comments too not just correspondence. AIUI [05:55:12] seriously [05:55:27] hashar: last one, 4676, latest comment.. ehm.. i told mholmquist to proxy via fenari, but i know gallium has public IP anyways, heh.. oh well.. ttyl :) [05:55:40] hashar: but i cc'd a different hashar than you did [05:55:42] jeremyb_: thanks, i will look again tomorrow [05:55:56] mutante: did planet finish? [05:56:06] mutante: you're SF now, right? [05:56:51] unfortunately, no. UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128) [05:56:58] mutante: yeah will figure it out with marktraceur later on :-] [05:56:59] yes, i am, and thats why its 11PM [05:57:16] actually it is PDT now vs. PST [05:57:34] hashar: I can do either one :) [05:57:37] * hashar looks for a few more million dollars to emigrate to SF :-] [05:57:39] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 24 seconds [05:57:47] mutante: gute nacht [05:58:07] marktraceur: you might want to proxy via eqiad bastion :) We will eventually remove gallium public IP address [05:58:10] mutante: bonne nuit! [05:58:17] Hm, sure. [05:58:29] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [05:58:43] hashar: ooh, are we? 
nice, that's what i just commented but wasnt sure [05:58:57] mutante: I should fill a ticket about it :-] [05:59:11] marktraceur: i pasted a .ssh/config snippet for you to proxy nicely [05:59:16] mutante: and get some frontend proxy service [05:59:23] but hey, it is too late, you want to head bed daniel [05:59:33] mutante: I have a general solution in my config already, I think [05:59:41] ok, yea, good night. out [06:01:59] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [06:06:24] breakfast time [06:11:59] New review: Krinkle; "We're syncing in operations. I'm happy to split it up but I don't want to wait for 12 reviews. I've ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [06:13:00] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [06:15:59] PROBLEM - Puppet freshness on capella is CRITICAL: Puppet has not run in the last 10 hours [06:15:59] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [06:16:49] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 06:16:44 UTC 2013 [06:17:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [06:20:00] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [06:47:30] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 06:47:19 UTC 2013 [06:47:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [06:52:29] PROBLEM - Squid on brewster is CRITICAL: Connection refused [06:56:53] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 194 seconds [06:56:53] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 194 seconds [07:04:52] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 13 seconds [07:04:52] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 5 seconds [07:15:29] RECOVERY - Squid on brewster is OK: TCP OK - 0.027 second response time on port 8080 [07:15:46] !log removed the rotated squid logs on brewster again, 2.2gb worht that filled / (again), restarted squid [07:17:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 07:17:52 UTC 2013 [07:18:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [07:26:00] apergos: doesn't brewster has a /a or some other mount point to write logs too ? [07:26:03] that will save the slash [07:26:13] no [07:26:17] oh and hello :-] [07:26:22] not things in /var/log [07:26:34] hello :-D [07:26:58] but root filesystem over there is teeny tiny which is an issue [07:27:26] 5..6gb [07:46:17] apergos: isn't there some additional disk space to create a partition for /var/log ? [07:46:18] or even just /var/log/squid :-] [07:46:31] lvm for the win! 
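A minimal sketch of the LVM-backed log partition being suggested for brewster — assuming a volume group with free extents exists; the volume group name and size are purely illustrative, and squid should be stopped and the existing logs moved aside before doing this for real:

    # carve out a dedicated logical volume for squid logs
    lvcreate -L 20G -n squidlogs brewster-vg
    mkfs.ext4 /dev/brewster-vg/squidlogs

    # mount it over the log directory and persist the mount
    mount /dev/brewster-vg/squidlogs /var/log/squid
    echo '/dev/brewster-vg/squidlogs /var/log/squid ext4 defaults 0 2' >> /etc/fstab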
[07:46:52] I think someone looked at that already [07:46:58] there is /srv but eww putting logs there [07:48:29] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 07:48:20 UTC 2013 [07:48:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [08:19:02] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [08:19:39] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 08:19:36 UTC 2013 [08:20:31] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [08:25:59] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [08:25:59] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [08:55:19] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 08:55:18 UTC 2013 [08:55:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [09:26:09] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 09:26:00 UTC 2013 [09:26:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [09:42:50] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [09:43:30] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 201 seconds [09:56:30] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 09:56:26 UTC 2013 [09:57:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [10:09:42] New review: Faidon; "I know it's a lot to ask, but I'd like the geoip manifest to be moved into a module first. Let's not..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53422 [10:11:03] hashar is a machine [10:12:28] New review: Faidon; "Approved but I'd like to merge when you're around, ping me on IRC." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/53424 [10:13:05] New review: Faidon; "Yay! Ping me on IRC to merge." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/53423 [10:16:17] New review: Faidon; "Trivial enough, so approved." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/53531 [10:27:09] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 10:27:03 UTC 2013 [10:27:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [10:28:40] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [10:28:53] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds [10:54:19] greg-g: ping [10:57:29] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 10:57:27 UTC 2013 [10:58:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [11:20:43] !log reenabled mail delivery to wikibugs [11:21:29] hmm, how come i no longer can log when i used to? [11:26:01] morebots is gone. [11:26:22] ah, i thought it was logmsgbot [11:26:50] no, morebots is the one that logs it on wiki [11:26:56] is it gone on purpose or accidentally? 
[11:27:02] accidentally i guess [11:27:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 11:27:50 UTC 2013 [11:28:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [11:42:06] !log restarted morebots [11:42:13] Logged the message, Master [11:52:53] log we <3 apergos ;-) [11:55:07] aww [11:58:29] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 11:58:27 UTC 2013 [11:59:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [12:07:22] !log reenabled mail delivery to wikibugs 30 mins ago (disabled automatically for excessive bounces) [12:07:28] hehe [12:07:29] Logged the message, Master [12:07:33] ah [12:07:41] it's just slow [12:12:17] New review: Faidon; "I'm not terribly excited with the idea of random API boxes doing image transformations. I'm okay wit..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52707 [12:16:43] Anyone mind if I update Scribunto in wmf11 quick? Quick bugfix (bug 46031) that should be made before the big rollout later today, two sync-files. [12:18:10] better ask reedy? [12:28:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 12:28:52 UTC 2013 [12:29:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [12:30:13] !log anomie synchronized php-1.21wmf11/extensions/Scribunto/common/Common.php 'Fix for bug 46031' [12:30:20] Logged the message, Master [12:30:34] !log anomie synchronized php-1.21wmf11/extensions/Scribunto/common/Hooks.php 'Fix for bug 46031' [12:30:40] Logged the message, Master [12:44:00] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [12:47:39] PROBLEM - Host mw27 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:49] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [12:52:19] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused [12:54:39] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 12:54:31 UTC 2013 [12:55:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [13:25:09] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 13:25:03 UTC 2013 [13:25:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [13:32:59] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 182 seconds [13:33:49] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 196 seconds [13:45:50] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [13:45:59] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [13:47:46] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [13:48:35] i hate puppet modules [13:53:36] <^demon> mark: me too. 
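For context on the module wrangling above: with the 2.7-era autoloader the pain is mostly getting the class-to-file mapping right, which is what the "empty init.pp" and "fix module structure" commits further down are about. A bare-bones sketch of the layout the parser expects — the instance.pp path follows the error quoted just below; everything else is illustrative:

    modules/ganglia/
        manifests/
            init.pp                  # top-level class ganglia (added empty in a later commit)
            configuration.pp         # class ganglia::configuration
            monitor/
                aggregator/
                    instance.pp      # define ganglia::monitor::aggregator::instance
        files/                       # served as puppet:///modules/ganglia/<path>
        templates/                   # used via template('ganglia/<file>.erb')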
[13:55:39] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 13:55:30 UTC 2013 [13:56:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [13:59:04] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:02:56] i hate puppet's parser, too [14:03:12] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:04:54] greg-g: ping [14:06:05] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:06:49] wtf is the problem [14:10:25] should I have a look? [14:10:59] only if you're willing to fix puppet bugs ;p [14:11:02] what's the issue? [14:11:28] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:11:49] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 13 seconds [14:12:39] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [14:12:40] I can't do... summation of integers in an assignment? [14:13:22] $portnr = $ganglia::configuration::base_port + $id [14:13:26] that isn't allowed? [14:14:52] uhm [14:14:54] it should be [14:15:05] what's the error? [14:15:19] 14:11:45 err: Could not parse for environment production: Syntax error at '('; expected ')' at /var/lib/jenkins/jobs/operations-puppet-validate/workspace/modules/ganglia/manifests/monitor/aggregator/instance.pp:11 [14:15:37] huh [14:15:51] that ( is gone [14:16:27] no it's not [14:16:45] indeed it is not [14:16:50] from PS65 [14:17:00] seems I didn't save the file [14:17:18] still stupid it doesn't work in the selector [14:17:29] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:18:08] yay [14:26:00] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 14:25:53 UTC 2013 [14:26:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [14:37:00] New patchset: Demon; "Redoing this as a maven project" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53572 [14:40:08] New review: Demon; "This was done after much thought...I believe maven is a superior build system for a couple of reasons:" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53572 [14:42:00] New review: Demon; "Still on the todo list:" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53572 [14:45:29] I'd like to deploy a fix for https://bugzilla.wikimedia.org/show_bug.cgi?id=45861 at around 16.00 utc unless someone objects [14:50:25] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:56:06] New patchset: Mark Bergsma; "Initial attempt at a new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [14:56:29] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 14:56:23 UTC 2013 [14:56:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [15:05:08] New patchset: Mark Bergsma; "Attempt to avoid conflicts with the new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53575 
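The parse failure above turned out to be a stale, unsaved file, but the underlying point stands: the 2.7 parser accepts arithmetic in a plain assignment yet not inline inside a selector, so the sum has to be computed into a variable first and only the variable used in the selector. A small sketch — the assignment line follows the snippet quoted above; the parameters and the selector body are illustrative:

    define ganglia::monitor::aggregator::instance($id, $monitored_site) {
        # arithmetic is fine in an assignment...
        $portnr = $ganglia::configuration::base_port + $id

        # ...but an expression can't sit inside a selector, so select on a
        # plain variable and interpolate the precomputed port afterwards
        $conffile = $monitored_site ? {
            'eqiad' => "/etc/ganglia/aggregator-eqiad-${portnr}.conf",
            default => "/etc/ganglia/aggregator-${monitored_site}-${portnr}.conf",
        }
    }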
[15:06:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53569 [15:07:45] New patchset: Mark Bergsma; "Attempt to avoid conflicts with the new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53575 [15:08:43] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53575 [15:09:40] New patchset: Mark Bergsma; "Revert "Attempt to avoid conflicts with the new Ganglia module"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53576 [15:09:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53576 [15:14:00] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [15:14:01] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [15:14:01] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [15:14:01] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [15:14:01] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [15:16:53] New patchset: Mark Bergsma; "Attempt to avoid conflicts with the new Ganglia module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53578 [15:18:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53578 [15:22:09] New patchset: Mark Bergsma; "Move the upstart job for a normal gmond to ganglia::monitor::service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53579 [15:22:09] schmir: greg-g is the guy you want. [15:23:20] New patchset: Mark Bergsma; "Move the upstart job for a normal gmond to ganglia::monitor::service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53579 [15:23:41] yes, looks like he's away however. I only need someone who can watch the load on the cluster answering api.php requests...since the the deployment may increase the load on those machine [15:24:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53579 [15:25:21] schmir: He should be around in the next hour or so. [15:25:28] San Francisco is just waking up. [15:25:56] hmm. ok. will wait for him then. 
[15:26:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 15:26:50 UTC 2013 [15:27:30] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [15:29:17] New patchset: Mark Bergsma; "Fix dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53580 [15:30:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53580 [15:31:01] New patchset: Mark Bergsma; "Make manutius a new-style Ganglia aggregator" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53581 [15:31:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53581 [15:35:21] New patchset: Mark Bergsma; "Actually make manutius a new, non-labs style Ganglia aggregator" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53582 [15:36:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53582 [15:46:40] modules/wikidata_singlenode [15:46:41] sigh [15:48:20] hashar: we have an incredibly simple "testswarm" module [15:48:31] I think I merged that expecting more in it [15:48:38] is this still the case? [15:48:45] I don't remember what that is [15:48:48] in conf call right now [15:48:58] k [15:52:02] New patchset: coren; "New tools:: class to configure Tool Labs servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53587 [15:52:59] PROBLEM - Puppet freshness on kuo is CRITICAL: Puppet has not run in the last 10 hours [15:53:48] New patchset: Mark Bergsma; "Rename new module temporarily to resolve conflicts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53588 [15:55:50] New patchset: Mark Bergsma; "Rename instantiation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53590 [15:55:59] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [15:56:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53588 [15:56:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53590 [15:57:12] New review: Faidon; "(5 comments)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53587 [15:57:29] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 15:57:23 UTC 2013 [15:57:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [15:57:38] schmir: 16:00 is 9am here, there aren't many ops and/or platform people around, actually, none that do deployments [15:58:52] New patchset: Mark Bergsma; "Change paths for file sources as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53591 [15:58:59] schmir: in the future, please suggest a time in a place that I'll see it (ie: on the relevant bug or via email) instead of via IRC at an hour when I am asleep :) [15:59:38] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53591 [16:01:03] hi greg-g [16:01:10] g'morning [16:01:51] paravoid: testswarm has been phased out for now, so its manifest should be really small nowadays [16:01:55] so, there isn't anyone around who does deployments at this hour, it is a bit early for San Franciscans :) [16:02:10] Reedy might be? 
[16:02:21] he's in SF right now [16:02:43] unless he flew back last night, in which case, he's probably still on SF time ;) [16:02:59] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [16:03:12] hashar: we just have the systemuser class [16:03:12] I'm fine with doing this later... [16:03:20] hashar: can we just move it under something else or drop it? [16:03:45] New patchset: Mark Bergsma; "Add top level init.pp, empty" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53592 [16:04:08] paravoid: testswarm is going to be the next contint sprint :-] [16:04:27] schmir: the deployments calendar is here: https://wikitech.wikimedia.org/wiki/Deployments [16:04:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53592 [16:04:50] schmir: looks like between 20:00-21:00 is available, does that work for you? [16:05:27] schmir: OR, probably better, when Reedy does get in, talk with him, since he's doing the Scribunto to all wikis from 17:00-18:00 [16:05:38] he might be able to lump it in with that [16:05:43] I can deploy that stuff on my own. I need someone to look at the cluster [16:05:57] gotcha [16:06:24] so, just so I'm clear (I don't know the whole story of how it all works): the code changes you're going to do just live on some machines not in the WMF cluster? [16:06:34] paravoid: the clean out patch was https://gerrit.wikimedia.org/r/#/c/47665/ . I have extracted the system user out of the huge contint manifests [16:06:43] on pdf1, pdf2 and pdf3 [16:06:47] paravoid: I can move it back to the contint module if you want :-] [16:07:33] probably a better idea [16:07:56] New patchset: Mark Bergsma; "Fix module structure" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53593 [16:08:16] schmir: are pdf1 etc on the WMF cluster? [16:08:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53593 [16:08:58] paravoid: filled a bug as a reminder ( https://bugzilla.wikimedia.org/show_bug.cgi?id=46069 ) will take care of it this week :) [16:09:31] daughter time, bbl :-] [16:10:16] greg-g: ambiguous question :) [16:10:18] greg-g: yes, be we manage software updates for the pdf rendering software on these machines [16:10:30] s/be/but/ [16:10:30] paravoid: how would you phrase it? :) [16:10:43] what do you want to know? [16:11:00] if the pdf1etc machines are run by WMF ops [16:11:04] * mark stabs puppet's authors [16:11:08] New patchset: Mark Bergsma; "Fix variable reference" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53594 [16:12:12] then, second part: schmir has necessary privs to deploy on those machines (which he does). Then third: the machines which we would need to watch would be pdf1-3 or the other WMF cluster machines? [16:12:30] it would be the machines serving api.php [16:13:04] * greg-g nods [16:13:05] thanks [16:13:48] ops have access but they're not really maintained [16:13:59] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [16:14:01] they're kind of their own thing but in the same admin domain, sadly [16:14:02] gotcha [16:14:14] interesting, a neither here nor there type deal [16:14:22] so, if they blow up, who get's the pager message? 
[16:14:24] a "it sucks" kind of deal [16:14:36] I just need someone to look at the api.php machines [16:15:10] schmir: ok, looking over at the ops area, there doesn't seem to be anyone in yet (they tend to get in at 10am Pacific, heh) [16:15:38] oh, paravoid ! I just put your nick to your name! [16:15:49] heh [16:16:02] mark is also around [16:16:05] paravoid: can you watch those machines/be ready to help while schmir does the deploy to pdf1-3? [16:16:09] not sure if apergos is still around [16:16:22] not sure what to watch exactly [16:16:28] me neither :) [16:16:41] sounds like cpu spike potential [16:16:47] but I could be wrong [16:16:48] still here [16:16:59] PROBLEM - Puppet freshness on capella is CRITICAL: Puppet has not run in the last 10 hours [16:16:59] PROBLEM - Puppet freshness on search27 is CRITICAL: Puppet has not run in the last 10 hours [16:17:43] reading the scrollback [16:17:48] I don't want to overload those machines. [16:18:20] hello apergos, we haven't met, I'm Greg Grossmeier. Started Feb. 19th. [16:19:13] heya [16:19:28] must be a timeone thing, I'm here sometimes in the vening though [16:20:00] New patchset: Mark Bergsma; "Rename new Ganglia module again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53595 [16:20:06] I'm also on the platform team, not ops, but yeah [16:20:13] heh pdf1 already looks unhappy :-P [16:20:19] "great" [16:20:28] https://ganglia.wikimedia.org/latest/?c=PDF%20servers%20pmtpa&h=pdf1.wikimedia.org&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [16:20:30] mark: having fun? :P [16:20:47] I guess we hope they don't get worse.... [16:20:59] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [16:21:02] pdf2 is on the other hand a happy camper... [16:21:09] hmm [16:21:16] pdf1 doesn't matter. we call api.php with different parameters. that's what matters. actually the load on the pdf cluster should decrease a bit [16:21:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53594 [16:21:42] apergos: do you know which machines would be affected by an increase in api.php requests? [16:21:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53595 [16:22:02] the api eqiad cluster I guess [16:22:24] ah, I see [16:22:32] * greg-g is still learning how it is all set up on the machine side [16:22:35] ah [16:22:41] they seem to be doing just fine right now [16:22:45] New review: coren; "Thing is, "tools" is the actual name of the project. :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53587 [16:22:45] does anyone have a ganglia url for that? [16:22:53] just a sec though cause I should look at the ms-be12 warning [16:23:00] https://ganglia.wikimedia.org/latest/?c=API%20application%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [16:23:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:02] Mar 13 16:07:03 ms-be12 puppet-agent[30421]: Skipping run of Puppet configuration client; administratively disabled; use 'puppet Puppet configuration client --enable' to re-enable. [16:24:04] why would that be [16:24:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.319 second response time [16:27:15] paravoid: did you do anything on ms-be12 recently? 
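The "administratively disabled" notice pasted above means someone ran 'puppet agent --disable' on ms-be12 at some point and never re-enabled it; the garbled hint in that syslog line corresponds to 'puppet agent --enable'. A minimal sketch of the cleanup that follows, assuming root on the affected host (exact output differs across the 2.7-era agents in use):
    puppet agent --enable           # clear the administrative-disable lock
    puppet agent --test --verbose   # force an immediate run so the freshness check recovers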
(I expect not but want to double check) [16:27:19] no [16:27:24] k thanks [16:27:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 16:27:51 UTC 2013 [16:28:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [16:31:31] New patchset: Mark Bergsma; "include ganglia_new::configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53597 [16:32:06] New patchset: coren; "New toollabs:: class to config Tool Labs servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53587 [16:32:42] !log cleaned up borked puppet agent on ms-be12 and re-enabled/restarted it [16:32:48] Logged the message, Master [16:33:32] greg-g: back, sorry for the interruption [16:34:46] apergos: no worries [16:35:09] RECOVERY - Puppet freshness on ms-be12 is OK: puppet ran at Wed Mar 13 16:34:58 UTC 2013 [16:35:23] so, apergos and schmir, could you two work out the plan for the updates to pdf1-3 and resulting monitoring of the api cluster? [16:35:40] what is the time frame for this? [16:35:50] schmir wants to have that done sooner rather than later, preferably [16:35:59] well I mean within the next hour say? [16:36:04] basically, it is an update that makes Lua scripting NOT break collections/pdfs :) [16:36:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53597 [16:36:07] yeah, ideally [16:36:08] that would be fine for me [16:36:11] ok geat [16:36:13] *great [16:36:22] apergos: ms-be11/12 don't have python-swiftclient installed [16:36:48] that's bizarre [16:37:13] apergos: so, should I do it now? [16:37:19] sure [16:37:37] is there a [16:37:44] link to the changeset or anything? [16:37:51] schmir: [16:38:32] http://mwlib.readthedocs.org/en/latest/changelog.html [16:39:06] which are the new and old versions? [16:39:14] !log upgraded mwlib to 0.15.1. [16:39:22] Logged the message, Master [16:39:34] ok which is the old version? :-D [16:39:39] 0.14.3 is the old version [16:39:40] :) [16:39:49] ok [16:40:13] New patchset: Mark Bergsma; "include network::constants" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53598 [16:41:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53598 [16:41:58] New review: coren; "(4 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53587 [16:46:08] New patchset: Mark Bergsma; "Fix notification" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53600 [16:46:39] schmir: pdf1 is suddenly very idle indeed...is that expected? [16:46:56] https://ganglia.wikimedia.org/latest/?c=PDF%20servers%20pmtpa&h=pdf1.wikimedia.org&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [16:47:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53600 [16:49:49] no [16:50:38] The POST request to http://pdf1.wikimedia.org:8080/mw-serve/ failed ($2). [16:50:49] just tested on enwiki [16:51:59] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 16:51:49 UTC 2013 [16:52:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [16:53:47] now this error message: [16:53:48] at least the api.php cluster won't overload that way [16:53:51] "An error occurred on the render server: system overloaded. please try again later. 
" [16:53:59] heh [16:54:51] right [16:55:47] hrmm [16:55:51] New patchset: Mark Bergsma; "Fix gmond.conf expansion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53602 [16:57:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53602 [16:58:23] schmir: do we need to get some help or are you OK? [16:58:54] I don't really know these pdf boxes, a cursory look through the usual logs (syslog etc) shows nothing of use [16:58:56] I'm ok. thanks. [16:59:00] ok cool [16:59:02] k [16:59:23] I don't want to panic before it is customary to do so. ;) [17:00:21] well you can always get in some extra training for panicking [17:00:24] could come in handy... [17:00:57] apergos: :P [17:21:02] New patchset: Mark Bergsma; "Resolve upstart job issues" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53607 [17:22:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53607 [17:22:19] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 17:22:13 UTC 2013 [17:22:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [17:23:38] !log downgrade to 0.14.3 [17:23:45] Logged the message, Master [17:26:29] apergos, greg-g: ok, that didn't work. thanks for your help. I will have to look into this. [17:26:47] ok. [17:26:49] schmir: did you revert? [17:26:54] oh, I see the log [17:26:59] ah you've backed out [17:27:00] yes [17:27:08] I see the box is back to busy [17:27:19] thanks schmir, let me know when you want to try again, let the bug know as appropriate [17:28:57] greg-g: yes, sure. thanks. [17:36:43] mchnery seems to be lazying? https://bugzilla.wikimedia.org/show_bug.cgi?id=43936#c3 [17:43:39] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 197 seconds [17:43:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 201 seconds [17:44:48] New patchset: Andrew Bogott; "Added a basic nginx module and one (labs) use case." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [17:48:39] New review: Ottomata; "Hm, I think that we shouldn't have WMF specific stuff inside of generic modules. The role class and..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [17:50:32] New review: Andrew Bogott; "Yeah, that's reasonable. Anyway, this patch is untested, definitely not ready for merge." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/43886 [17:52:30] New review: Jeremyb; "merged in If721e402094a62a8c9b069a35 to master @ operations/debs/ircecho" [operations/software] (master) - https://gerrit.wikimedia.org/r/53197 [17:52:49] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 17:52:43 UTC 2013 [17:53:27] schmir == ralf ? [17:53:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [17:54:12] jeremyb_: yes [17:54:31] hah, training in panicking [17:55:20] * jeremyb_ was in scrollback [17:56:55] Ryan_Lane: hey! Did the HTTPS certificate issue affect bits in any way? [17:57:00] no [17:57:12] do you mean from mobile? [17:57:21] there's no m.bits, right? [17:57:26] nope. [17:57:28] so, no [17:57:29] Ryan_Lane: I've apps crashing with 'java.security.cert.CertPathValidatorException: Trust anchor for certification path not found. ' when trying to access bits [17:57:31] (for eventlogging) [17:57:38] no [17:57:39] you have one now doing that? [17:57:40] not m. 
[17:57:47] ragesoss's is doing that right now [17:57:53] and i've error reports from a few other people too [17:58:03] it does have a new cert [17:58:11] let's make sure the trust chain is proper [17:58:20] okay. [17:58:26] wfm [17:58:38] Verify return code: 0 (ok) [17:58:44] I bet I know the problem [17:58:49] PROBLEM - MySQL Replication Heartbeat on db71 is CRITICAL: CRIT replication delay 424831 seconds [17:58:54] subject=/C=US/ST=California/L=San Francisco/O=Wikimedia Foundation, Inc./CN=*.wikipedia.org [17:59:00] is there some way I can get ragesoss to verify it as well? [17:59:03] that in itself isn't an issue [17:59:11] he's getting the same error on test.wikipedia.org as well (on his phone) [17:59:37] maybe Java doesn't handle the SANs properly? [17:59:50] or the way that specific app is made? [18:00:03] we should have a separate certificate for bits anyway [18:00:07] just CN=bits [18:00:11] for performance reasons [18:00:11] New patchset: Jgreen; "adding X-Spam-Score header patch per RT #4713" [operations/software/otrs] (master) - https://gerrit.wikimedia.org/r/53611 [18:00:12] both are possible. is that different from our previous cert? [18:00:24] test works for me [18:00:26] the one with all the sans is huuuge [18:00:27] paravoid: because the cert itself is smaller? [18:00:31] paravoid: yep [18:00:39] Jeff_Green: \o/ [18:00:41] test.m works for me after clearing my Browser data. [18:00:45] yes [18:01:22] jeremyb_: I look forward to closing that ticket :-P [18:01:23] yeah, but the crash you are seeing is from EventLogging, not from test. [18:01:31] paravoid: we have a unified cert to handle mobile properly [18:01:40] hmm, that is possible this is an Android / Java bug. [18:01:41] bits has a separate ip though [18:01:47] I can verify that. [18:01:57] Jeff_Green: did we figure out more about how things work? was there ever a cron? [18:02:06] paravoid: yeah, true [18:02:23] we could indeed split that apart [18:02:36] jeremyb_: i haven't figured it out yet, it looked like what there is is part of OTRS [18:02:59] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours [18:03:20] paravoid: when you did the verify, did you verify the entire chain? [18:03:31] yes [18:03:36] we had an issue yesterday where the incorrect chain was added [18:03:50] maybe the java app has that cached somewhere? [18:04:13] openjdk doesn't support san iirc [18:04:18] -_- [18:04:35] and sun JDK? [18:05:45] New patchset: Dzahn; "add Jeff's apache-fast-test script to repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53612 [18:05:49] ragesoss: are you using ubuntu? [18:06:21] mutante: \o/ [18:07:16] Ryan_Lane: for these errors? no, Android. [18:07:34] ah [18:08:15] seems this is an android bug [18:08:44] New review: Ottomata; "(4 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49710 [18:08:45] ragesoss: which android version are you running? 
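For anyone reproducing the chain check being debated here, a quick sketch with stock openssl against bits (the hostname comes from the crash reports above; the grep patterns are only for readability):
    # overall verification result for the chain the server actually sends
    echo | openssl s_client -connect bits.wikimedia.org:443 2>/dev/null | grep 'Verify return code'
    # list the SAN entries on the leaf certificate, since SAN handling is the suspect
    echo | openssl s_client -connect bits.wikimedia.org:443 2>/dev/null \
        | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'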
[18:08:53] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds [18:08:53] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 194 seconds [18:08:53] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else to 1.21wmf11 [18:09:03] Logged the message, Master [18:09:10] http://code.google.com/p/android/issues/detail?id=17680&can=1&q=reporter%3Anathan%2Cjanrain.com&colspec=ID%20Type%20Status%20Owner%20Summary%20Stars [18:09:15] there's one bug for that [18:09:18] it affects 2.1 [18:09:52] Ryan_Lane: 4.2.2 (CM 10.1 nightly) [18:09:57] heh [18:10:03] well, that wouldn't be the problem, then [18:10:15] New patchset: Dzahn; "add Jeff's apache-fast-test script to repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53612 [18:10:36] hmm. gerrit has said "review in progress" for about 10 minutes on this stupid ticket. what does that even mean? [18:10:52] anomie: Are you about? [18:11:21] Reedy- yes [18:11:38] Jeff_Green: i think all it means is be patient.. it has been slow but still worked after a couple minutes [18:12:09] patience is not my strong suit... [18:12:37] New patchset: Reedy; "Rest of the wikiss to 1.21wmf11" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53615 [18:13:15] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53615 [18:13:16] Change merged: Jgreen; [operations/software/otrs] (master) - https://gerrit.wikimedia.org/r/53611 [18:13:34] 22:11 < hashar> mutante: I need Zuul to be updated, that is blocked by a python module dependency I have packaged. [18:13:37] 22:12 < hashar> mutante: and there is a nasty bug where all the jobs have been completed by Zuul wait for quite a long time before reporting the change back in Jenkins. I haven't tracked it yet [18:13:41] Jeff_Green: ^ [18:14:00] oic [18:14:07] New patchset: Jgreen; ".gitreview file for software/otrs repo" [operations/software/otrs] (master) - https://gerrit.wikimedia.org/r/53616 [18:14:15] heh [18:14:27] Change merged: Jgreen; [operations/software/otrs] (master) - https://gerrit.wikimedia.org/r/53616 [18:14:35] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53612 [18:14:37] nowai [18:14:42] otrs git repo!?!? [18:14:44] i think that was the most overhead I've ever endured for a 15 character code change [18:14:47] hahaha [18:15:12] Reedy: just don't bring it up ok! [18:15:15] * Jeff_Green dies. [18:15:47] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable scribunto EVERYWHERE' [18:15:53] anomie: ^^ [18:15:54] Logged the message, Master [18:16:23] \o/ [18:16:49] Jeff_Green: i put your script in ./files/misc/scripts/ [18:16:59] and apache-graceful-all works again [18:17:25] New patchset: Pyoungmeister; "lucene-production.php: moving all search traffic to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53617 [18:17:54] mutante: nice [18:20:00] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:21:46] New patchset: Reedy; "Enable Scribunto EVERYWHERES" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53618 [18:22:22] anomie: I seem to recall Tim saying there was something else that needed doing after.. 
[18:23:19] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 18:23:10 UTC 2013 [18:23:29] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [18:23:39] Reedy- Nothing that I know of, unless maybe it's to run whatever script checks for titles that began with "Module:" that are now hidden. [18:26:06] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53618 [18:26:24] apergos, greg-g: I'd like to try a new version at 21:00 UTC (in 2.5 hours) if that's possible...I'll ping you before [18:26:46] I"ll be gone (basically I"m gone now, it's > 8pm here) [18:27:01] so you should rope in someone in the sf timezzone [18:27:01] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [18:27:01] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours [18:27:04] schmir: [18:27:52] apergos: ok. [18:28:41] apergos: who do you recommend I pull in? ;) [18:29:10] cu later [18:29:24] mm whoever has time and a bit of energy [18:29:29] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.59004514493 (gt 8.0) [18:29:30] ok :) [18:29:54] there's no one who owns those boxes so.... [18:30:11] this is an unscheduled (not on the official list) deployment is it? [18:30:18] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53617 [18:30:38] mutante: Ryan_Lane LeslieCarr just pinging you because you're people I've chatted with, for no other reason: One of you mind be around at 2pm today to watch the pdf cluster (pdf1-3) when schmir does another deploy attempt to fix https://bugzilla.wikimedia.org/show_bug.cgi?id=45861 [18:31:06] apergos: right, I wanted to get him on the official calendar, but, he wanted to just do it right now, timezones are also an issue here, I think [18:31:29] so this means it's a sort of 'ask and hope' situation [18:31:32] that's all [18:31:33] yep [18:31:36] exactly [18:31:47] !log py synchronized wmf-config/lucene-production.php 'temp moving all earch traffic to pmtpa for upgrades in eqiad' [18:31:49] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 13.9276916923 (gt 8.0) [18:31:53] Logged the message, Master [18:32:01] greg-g: so 2pm UTC-7 is going to be a certificate push [18:32:09] can you hold off for 1 more hour to 3pm UTC-7 ? [18:32:13] LeslieCarr: good to know [18:32:31] I hope, he's left irc (not the easiest to coordinate with) [18:32:35] I'm not sure what we'd be able to help with [18:32:47] we don't really know anything about this cluster [18:32:52] * greg-g nods [18:32:54] we couldn't do much besides powercycling/rebooting [18:32:58] but we can watch as much as possible [18:33:06] And point and laugh? [18:33:21] mostly, I didn't want him to do it this morning when it was just me in the office :) [18:33:36] :) [18:34:11] checking out cp1015.eqiad.wmnet [18:34:17] The pdf servers are very much a black box [18:34:36] a black, hardy running box [18:34:39] kind of annoyed with the situation since he said "who do I coordinate with" and then just assumes I'll be ok with whatever he suggests... oh well :) [18:34:42] hardy? 
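The script anomie alludes to above, for titles that now collide with the Module: namespace, is presumably MediaWiki's namespaceDupes.php maintenance script; a sketch of running it through the cluster's mwscript wrapper, with the wiki name purely illustrative:
    mwscript namespaceDupes.php --wiki=enwiki         # dry run: report pages shadowed by the new namespace
    mwscript namespaceDupes.php --wiki=enwiki --fix   # actually rename the conflicting titles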
wow [18:34:43] unpackaged and unpuppetized [18:34:46] yeah [18:34:51] great [18:34:57] it makes $INSERT_DIETY_HERE cry [18:35:10] I hate when $INSERT_DIETY_HERE cries [18:35:42] greg-g: all that really can be done from this time, is testing some pdf building [18:36:05] yeah, which I was doing when the pdf cluster flat lined :) [18:36:13] lols [18:36:24] I presume they're trying to fix the Scribunto problems? [18:36:27] yep [18:36:36] where are you with today's deploy, btw? [18:36:42] (on that note) [18:36:45] pgehres: looks like it was one of the appservers depooling and repooling [18:36:55] thanks LeslieCarr! [18:36:57] greg-g: done, nothing happened [18:37:18] Reedy: sweet, good deal. [18:38:01] ok, I'm going to not think about this for a while and finish this email [18:38:02] looks actually like a few may be overloaded ? [18:39:00] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [18:41:51] FYI: Request: GET http://meta.wikimedia.org/wiki/Grants:WM_US-DC/Wiki_Loves_Monuments_2012_USA/Report, from 10.64.0.129 via cp1015.eqiad.wmnet (squid/2.7.STABLE9) to () [18:41:52] Error: ERR_CANNOT_FORWARD, errno [No Error] at Wed, 13 Mar 2013 18:41:15 GMT [18:41:59] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - 3500 bytes in 0.151 second response time [18:41:59] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:03] hm, I'm timing out trying to save a fairly simple article at http://www.mediawiki.org/w/index.php?title=Extension:NavigationTiming&action=submit [18:42:09] PROBLEM - Apache HTTP on mw1161 is CRITICAL: Connection timed out [18:42:09] PROBLEM - Apache HTTP on mw1104 is CRITICAL: Connection timed out [18:42:09] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:09] PROBLEM - Apache HTTP on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:09] PROBLEM - Apache HTTP on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:10] PROBLEM - Apache HTTP on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:10] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:11] PROBLEM - Apache HTTP on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:11] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:12] PROBLEM - Apache HTTP on mw1101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:12] PROBLEM - Apache HTTP on mw1164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:13] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:13] PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:14] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:14] PROBLEM - Apache HTTP on mw1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:15] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:16] eek [18:42:20] gah as i was mentioning the apaches are overloaded [18:42:21] Request: GET http://commons.wikimedia.org/wiki/Category:SVG_localized_Wikipedia_globe_logos, from 10.64.0.134 via cp1014.eqiad.wmnet (squid/2.7.STABLE9) to () [18:42:21] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at 
Wed, 13 Mar 2013 18:41:24 GMT [18:42:25] mentioning out loud [18:42:33] and there's the phone paging [18:42:33] PROBLEM - LVS HTTP IPv4 on wikidata-lb.pmtpa.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.0 504 Gateway Time-out - 3352 bytes in 0.063 second response time [18:43:07] PROBLEM - LVS HTTP IPv6 on wikidata-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - 3476 bytes in 0.066 second response time [18:43:08] so from my point of view, lvs1003 is attempting to connect to the apaches pool and actually getting many timeouts [18:43:09] PROBLEM - Apache HTTP on mw1040 is CRITICAL: Connection timed out [18:43:10] PROBLEM - Apache HTTP on mw1099 is CRITICAL: Connection timed out [18:43:10] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:10] PROBLEM - Apache HTTP on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:10] PROBLEM - Apache HTTP on mw1051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:12] LeslieCarr: Page loaded again for me. [18:43:13] from its point of view apaches are overloaded [18:43:19] siebrand: it will be sporadic [18:43:28] oki. [18:43:31] it actually is unable to depool all the apaches it desires because the minimum pool limit [18:43:35] * greg-g walks away slowly [18:43:42] i blame the pope. [18:43:44] so either we have overcome some tipping point of business - or something is wrong [18:43:54] ooh whats going on [18:44:00] Oh, that's very possible. Some stampede? [18:44:24] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.047 second response time [18:44:24] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [18:44:24] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [18:44:24] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.078 second response time [18:44:24] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time [18:44:25] pope announced ? [18:44:36] yes [18:44:44] i mean maybe that's the problem :) [18:44:54] it was last time [18:45:13] No name yet [18:46:00] bits had a huge spike in http req/s [18:46:12] mobile too [18:46:22] Ryan_Lane: that is indeed a shit ton [18:46:25] New patchset: Pyoungmeister; "Revert "lucene-production.php: moving all search traffic to pmtpa"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53619 [18:46:26] oh pretty much everything [18:46:29] Ryan_Lane: saw a ton of api traffic due to search, peter had switched search from eqiad to pmtpa (may be related or not) [18:46:36] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53619 [18:46:37] question, loking if there was a deploy [18:46:58] there were a few deploys [18:47:10] scribunto to everywhere, PDFs and the PDF rollback [18:47:14] (pediapress) [18:47:15] could be just pope [18:47:17] robla: AaronSchulz has there beena deploy ? 
[18:47:22] !log py synchronized wmf-config/lucene-production.php 'moving all search traffic back to eqaid' [18:47:23] ah thanks jeremyb [18:47:25] :) [18:47:28] Logged the message, Master [18:47:32] not that I know of [18:47:33] https://graphite.wikimedia.org/render/?title=HTTP%20Requests/sec%20%28excludes%20bits.wikimedia.org:%20css/js%29%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=1&lineMode=connected&target=color%28cactiStyle%28alias%28scale%28reqstats.requests,%220.01666%22%29,%20%22requests/sec%22%29%29,%22blue%22%29 [18:47:43] could scribunto have any impact on this ? [18:47:47] greg-g: ^ [18:47:56] seems like a record to me [18:48:18] I backed out search change [18:48:33] robla: yeah, paying attention to the verbal discussions going on [18:48:34] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 8.37103404255 (gt 8.0) [18:48:41] it appears that lvs is getting timeouts from the appserver apaches [18:49:35] saw a lot of issues with S2 db's [18:49:57] looks like enwikitionary related traffic [18:50:27] we are seeing about a 2gig spike in traffic [18:50:38] yeah, HTTP req/s is very increased [18:50:39] maybe mre 1.5 [18:50:41] for info, if it can help: Lua editor don't start when editing a module (on fr:) [18:50:42] (stupid graphs) [18:50:45] there's definitely a pope effect [18:50:51] unsure if it's related to the outage though [18:50:54] so, maybe we haven't kept up on our apache requisitioning ? [18:50:55] yeah [18:50:58] we call it popedotting [18:51:21] it wouldn't be related to the rever to the pdf cluster, right LeslieCarr ? [18:51:26] no [18:51:42] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [18:51:50] nope, not pdf cluster related [18:51:53] afaict [18:51:58] i mean it is theoretically possible [18:52:47] https://gdash.wikimedia.org/dashboards/apimethods/deploys [18:52:55] just that the pdf feature calls the api, so it was down for ~20 minutes around 9:00am today [18:53:23] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Application%2520servers%2520eqiad&tab=m&vn= [18:53:27] wow visual editor is awful with regards to time [18:53:46] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 18:53:38 UTC 2013 [18:53:56] s/so/and/ [18:53:56] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [18:54:03] please use UTC times always [18:54:10] "9 am" is very confusing [18:54:16] 16:00 UTC [18:54:27] sorry, was talking with rob and typing at the same time [19:00:30] petabytes :p but this must be a ganglia error http://ganglia.wikimedia.org/latest/graph.php?r=2hr&z=xlarge&me=Wikimedia&m=cpu_report&s=by+name&mc=2&g=network_report [19:00:40] and an hour earlier [19:05:24] Error connecting to 10.64.0.13: Too many connections [19:05:45] that's at Wed Mar 13 18:42:25 UTC 2013 [19:05:48] a shit-ton of them [19:06:08] also: Error connecting to 10.64.0.6: Too many connections [19:07:03] stampede? [19:08:04] If you report this error to the Wikimedia System Administrators, please include the details below. [19:08:04] Request: GET http://en.wikipedia.org/wiki/Alberto_Vargas, from 10.64.0.127 via cp1014.eqiad.wmnet (squid/2.7.STABLE9) to () [19:08:04] Error: ERR_CANNOT_FORWARD, errno (11) Resource temporarily unavailable at Wed, 13 Mar 2013 18:41:31 GMT [19:08:06] !log site was popedotted ! [19:08:12] Logged the message, Mistress of the network gear. 
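The "Too many connections" errors above are MySQL refusing new clients because the servers at 10.64.0.13 and 10.64.0.6 hit their max_connections ceiling during the spike; a quick way to confirm from any host with client access (credentials omitted):
    mysql -h 10.64.0.13 -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'"
    mysql -h 10.64.0.13 -e "SHOW GLOBAL VARIABLES LIKE 'max_connections'"
    mysql -h 10.64.0.13 -e "SHOW FULL PROCESSLIST" | wc -l   # rough count of open connections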
[19:08:16] thanks AaronBale - just got it ? [19:08:30] few mins ago [19:08:44] took awhile to find the right place [19:09:35] cool thanks for reporting :) [19:10:09] eep mark, looks like we were basically at the edge before too [19:10:10] :-/ [19:10:23] so, that "redo amsterdam network" that we had been putting off ... [19:13:53] he's being announced right now [19:15:14] the pope is apparently a jesuit [19:15:30] so, our thought right now is that this is caused by mobile [19:15:46] jorge brogolio, buenos aires [19:15:51] there's a substantial increase in mobile requests [19:16:07] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: Connection timed out [19:16:08] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:08] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:08] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:08] PROBLEM - LVS HTTP IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:08] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:08] PROBLEM - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:09] New patchset: coren; "New toollabs:: class to config Tool Labs servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53587 [19:16:13] oh a latin american pope! that's a first! [19:16:21] errr, jorge bergoglio* [19:16:25] LeslieCarr: well the first jesuit too [19:16:35] oo [19:16:47] i would look up this inofrmation but i refuse to increase the load [19:16:48] They named the new pope? [19:16:53] yep [19:17:04] hah [19:17:06] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: Connection timed out [19:17:06] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: Connection timed out [19:17:06] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:06] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:06] PROBLEM - Frontend Squid HTTP on knsq22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:07] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:07] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:08] PROBLEM - LVS HTTP IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:08] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:08] Coren: yap, and provably caused a high servers load [19:17:09] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:10] Huh. Ouelette gets passed over again. That guy has /got/ to be bitter by now. [19:17:14] LeslieCarr: all esams [19:17:18] you can tell cuz of all the pages [19:17:27] thanks, lesse if another link is overloaded [19:17:29] coren: i would agree [19:17:33] > (cur | prev) 2013-03-13T19:13:10‎ MTVarro (talk | contribs)‎ m . . (12,224 bytes) (0)‎ . . 
(MTVarro moved page Jorge Bergoglio to Pope Francis) (undo) [19:18:04] ? [19:18:14] grrr, looks like the x-link was spiked up to max again for a second [19:18:26] * Coren idly wonders if the new one is going to be just as primitive and outdated as the previous one. [19:18:33] so do all the cardinals have their popename decided in advance, in case they're chosen ? [19:18:35] the new pope is Jorge Mario Bergoglio [19:18:39] (argentina) [19:18:47] as far as I know [19:18:48] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:18:48] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:18:48] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:18:51] also can someone get me food cuz i can't leave the desk i think for a little bit :( [19:19:32] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 46586 bytes in 3.781 second response time [19:19:32] RECOVERY - Frontend Squid HTTP on knsq22 is OK: HTTP OK: HTTP/1.0 200 OK - 654 bytes in 5.652 second response time [19:19:38] !give LeslieCarr food [19:19:43] PROBLEM - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:43] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:19:53] PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is CRITICAL: Connection timed out [19:20:32] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 44618 bytes in 5.527 second response time [19:20:42] PROBLEM - LVS HTTP IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:47] jeremyb_: shouldn't it be Francis I? [19:21:00] odder: errr, don't ask me! :) [19:21:01] odder: not until there is the II :P [19:21:09] RECOVERY - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1029 bytes in 4.214 second response time [19:21:09] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:21:16] Alchimista: BBC says Francis I [19:21:18] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out [19:21:21] yeah, Francis I [19:21:28] a new name [19:21:33] and he's a jesuit [19:21:36] jesuit? [19:21:49] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 52930 bytes in 2.043 second response time [19:21:49] RECOVERY - LVS HTTP IPv4 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 56284 bytes in 5.066 second response time [19:21:58] PROBLEM - LVS HTTP IPv4 on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:04] so is the pope dossing wikipedia now? [19:22:07] odder: i'm not sure, at least here, on portugal, he his beeng called Francis. And was announced has Francis, not Francis I :P [19:22:10] can we stop addbot? 
[19:22:18] adding wikidata links like crazy [19:22:37] yes, something like +100 edits/min [19:22:41] asking in labs channel [19:22:48] I think it's run there [19:22:48] kill it [19:22:49] Alchimista: John Paul II was also announced as "John Paul" [19:22:50] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:22:51] RECOVERY - LVS HTTP IPv4 on bits.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3906 bytes in 1.510 second response time [19:22:54] (was working on fr: today) [19:22:59] heh. you're assuming I know how to kill it ;) [19:23:03] PROBLEM - LVS HTTPS IPv4 on bits.esams.wikimedia.org is CRITICAL: Connection timed out [19:23:03] PROBLEM - Varnish HTCP daemon on cp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:04] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:04] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:23:04] PROBLEM - SSH on cp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:04] PROBLEM - Varnish HTTP upload-frontend on cp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:04] PROBLEM - Varnish HTTP upload-backend on cp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:12] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:23:13] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 846 bytes in 1.182 second response time [19:23:20] if it can help we can block it on fr: [19:23:21] Ryan_Lane: globally block IP address? [19:23:30] or firewall it away :) [19:23:48] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 788 bytes in 0.742 second response time [19:23:52] that'll also turn off vandalism stuff [19:24:00] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [19:24:00] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:24:00] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [19:24:00] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:24:00] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:24:00] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [19:24:01] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [19:24:02] and addbot isn't causing this issue [19:24:03] arf, css is gone [19:24:08] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:09] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:24:18] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 19:24:09 UTC 2013 [19:24:27] have to go to ear [19:24:32] ear/eat [19:24:47] (sorry LeslieCarr) [19:24:58] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [19:24:58] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 44250 bytes in 3.062 second 
response time [19:25:08] RECOVERY - LVS HTTPS IPv4 on bits.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 3914 bytes in 8.043 second response time [19:25:08] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:08] PROBLEM - LVS HTTP IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:08] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [19:25:59] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:26:07] RECOVERY - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 63825 bytes in 2.032 second response time [19:26:08] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 64283 bytes in 2.874 second response time [19:26:08] RECOVERY - LVS HTTP IPv4 on wikisource-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 48780 bytes in 3.823 second response time [19:26:08] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 67055 bytes in 9.431 second response time [19:26:08] RECOVERY - LVS HTTP IPv4 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 56282 bytes in 9.865 second response time [19:26:39] !log reedy synchronized wmf-config/InitialiseSettings.php 'Disable Collection extension' [19:26:45] Logged the message, Master [19:26:51] AaronSchulz: around? [19:27:01] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [19:27:01] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [19:27:14] RECOVERY - LVS HTTP IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 73593 bytes in 3.286 second response time [19:27:14] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 56754 bytes in 7.131 second response time [19:27:14] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 47059 bytes in 7.234 second response time [19:27:14] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 53406 bytes in 7.375 second response time [19:27:14] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 49255 bytes in 8.588 second response time [19:27:14] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 52930 bytes in 9.250 second response time [19:27:58] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:28:07] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 46586 bytes in 2.546 second response time [19:28:08] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 67530 bytes in 8.330 second response time [19:28:08] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 44621 bytes in 8.742 second response time [19:28:08] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [19:29:01] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 74064 bytes in 2.130 second response time [19:29:10] PROBLEM - 
Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:29:10] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 95759 bytes in 5.297 second response time [19:29:10] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:29:17] PROBLEM - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:08] RECOVERY - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 95296 bytes in 7.766 second response time [19:30:15] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:58] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [19:32:11] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [19:32:11] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [19:33:08] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:08] PROBLEM - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:06] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [19:35:18] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 47059 bytes in 5.007 second response time [19:35:18] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 67533 bytes in 5.369 second response time [19:35:18] PROBLEM - LVS HTTPS IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:13] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [19:36:16] RECOVERY - LVS HTTPS IPv4 on upload.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 788 bytes in 4.045 second response time [19:36:17] PROBLEM - NTP on cp1022 is CRITICAL: NTP CRITICAL: No response from NTP server [19:36:17] RECOVERY - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 63190 bytes in 7.858 second response time [19:36:17] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:17] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:17] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:06] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 1 process with command name varnishncsa [19:37:16] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 95759 bytes in 8.586 second response time [19:37:16] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:16] PROBLEM - LVS HTTP IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:21] New patchset: Diederik; "Adding David and Dan to Analytics contact group, removing Fabrice." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53624 [19:37:31] LeslieCarr: still esams... 
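The recurring "Varnish traffic logger" PROBLEM/RECOVERY lines are a Nagios process-count check flapping as varnishncsa loggers die and get restarted under load; the check is roughly equivalent to the stock check_procs plugin below, with the expected count of 3 inferred from the OK messages and the thresholds illustrative:
    /usr/lib/nagios/plugins/check_procs -C varnishncsa -c 3:3
    # exits CRITICAL when fewer (or more) than three varnishncsa processes run on the cache host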
[19:38:13] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [19:38:13] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [19:38:16] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 63644 bytes in 5.808 second response time [19:38:16] RECOVERY - LVS HTTP IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 95295 bytes in 8.611 second response time [19:38:20] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:05] hrm, getting packet loss between eqiad and esams, going to switch preferred transit [19:39:07] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:39:12] thanks paravoid for poking me [19:39:16] RECOVERY - LVS HTTP IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 73592 bytes in 8.687 second response time [19:39:17] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:17] PROBLEM - LVS HTTPS IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:27] PROBLEM - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:55] mark: can i deactivate your changes and commit (with them deactivateD) ? [19:40:06] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 67057 bytes in 2.123 second response time [19:40:07] RECOVERY - LVS HTTPS IPv4 on upload.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 788 bytes in 5.679 second response time [19:40:13] erm [19:40:16] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 74061 bytes in 7.292 second response time [19:40:16] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:16] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:17] let's just deploy them, one sec [19:40:46] LeslieCarr: deploy if you're ready [19:41:11] !log deploying a possibly mobile front end traffic effecting ipv6 apache request firewall filter [19:41:17] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 49255 bytes in 3.579 second response time [19:41:18] RECOVERY - LVS HTTPS IPv4 on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1029 bytes in 1.536 second response time [19:41:18] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:19] Logged the message, Mistress of the network gear. 
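Before the preferred transit path gets flipped, the eqiad-esams loss LeslieCarr mentions can be quantified per hop with mtr from a host on one side toward the other; ESAMS_HOST below is a placeholder for any box on the Amsterdam side:
    mtr --report --report-cycles 100 "$ESAMS_HOST"   # per-hop loss and latency over 100 probes
    ping -q -c 100 "$ESAMS_HOST"                     # end-to-end loss percentage only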
[19:42:16] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 44620 bytes in 5.779 second response time [19:42:17] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 67530 bytes in 6.976 second response time [19:42:17] PROBLEM - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:08] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [19:43:09] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:43:09] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 56757 bytes in 4.268 second response time [19:43:16] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:07] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [19:44:18] !log switched preferred path between esams and eqiad [19:44:20] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:20] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:23] Logged the message, Mistress of the network gear. [19:45:07] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:45:07] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:45:07] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 63649 bytes in 3.488 second response time [19:45:17] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 52930 bytes in 8.282 second response time [19:45:18] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:17] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 44617 bytes in 3.331 second response time [19:46:17] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 74064 bytes in 5.312 second response time [19:46:19] PROBLEM - HTTP on fenari is CRITICAL: Connection timed out [19:47:17] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:47:17] RECOVERY - LVS HTTP IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 63189 bytes in 4.732 second response time [19:47:26] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:48:17] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 95757 bytes in 6.968 second response time [19:49:27] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.056 second response time [19:49:28] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:50:16] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:51:08] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:51:18] RECOVERY - Varnish 
traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [19:52:07] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [19:52:07] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [19:54:39] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 19:54:35 UTC 2013 [19:54:57] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [19:55:09] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:55:09] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [19:55:29] James_F|Away: http://www.yousaytomato.biz/ [19:56:26] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 186 seconds [19:57:26] mutante: Ha. Interesting. :-) [19:58:07] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [19:58:26] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [19:58:28] Ryan_Lane: so i checked, and it doesn't seem to be a HTTP library specific problem (both libraries I used cause problems on ragesoss' phone). Should we just wait another hour for the new cert to see if that fixes it? [19:58:38] Ryan_Lane: is there debug info I can get from ragesoss' phone that can help you? [19:58:47] YuviPanda: the new cert will not fix it [19:59:05] hmm, because this has been happening only since yesterday :| [19:59:09] YuviPanda: is this hitting any domain other than wikimedia.org or wikipedia.org? [19:59:17] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [19:59:20] the new certificate is basically the same as the old [19:59:25] err [19:59:26] RECOVERY - Host europium is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [19:59:28] the commons app crash is from hitting bits [19:59:32] the one we're getting today is the same as this one [20:00:24] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [20:00:25] tfinc: can you test the commons app on your phone? [20:00:38] YuviPanda: the wikipedia app works perfectly fine [20:00:49] Ryan_Lane: so it works fine for *most* people. On some devices it does not. [20:00:57] Ryan_Lane: sure [20:01:32] YuviPanda: then it's a library issue somewhere [20:02:00] Ryan_Lane: don't think so - the commons app and the Wikipedia app use *completely* different libraries (one uses Apache HTTPClient & URLConnection, other uses Webkit's) [20:02:03] Ryan_Lane: is this a new CA certificate? [20:02:09] YuviPanda: it's likely not the http library, but the TLS/SSL libraries [20:02:11] Ryan_Lane: perhaps certificate trust chain broken somewhere? [20:02:11] paravoid: yes [20:02:13] hmm [20:02:27] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:02:29] paravoid mentioned he tested the chain [20:02:34] hmm [20:02:35] I did [20:02:42] Ryan_Lane: YuviPanda http://commons.wikimedia.org/wiki/File:Wikimedia_hiring_flyer.jpeg <-- worked great [20:02:43] no issues [20:02:48] but maybe the CA cert is relatively new and not included in old OS/certificate stores? 
[20:02:49] and if it's working on some devices and not others, then it's not related to the cert [20:02:53] considering that the libraries didn't change, I suppose it could be something with the cert? [20:03:02] paravoid: same root as the old [20:03:07] RECOVERY - SSH on europium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:03:17] YuviPanda: the cert is different in that the wildcard is in the SAN [20:04:15] i'm checking out cp1022 - looks like it may have died in the midst of all this excitement [20:04:32] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [20:05:12] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [20:05:12] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [20:05:22] YuviPanda: some older versions of android are known to have this problem [20:05:36] but those are really old versions [20:05:38] but ragesoss is on 4.1 (or 4.0)? [20:05:43] he's on a nightly [20:05:52] paravoid@serenity:~/wikimedia/puppet/files/ssl$ openssl x509 -in star.wikimedia.org.pem -noout -issuer [20:05:54] ragesoss: are you on CM nightly? [20:05:55] issuer= /C=US/O=Equifax/OU=Equifax Secure Certificate Authority [20:05:57] paravoid@serenity:~/wikimedia/puppet/files/ssl$ openssl x509 -in unified.wikimedia.org.pem -noout -issuer [20:06:01] issuer= /C=US/O=DigiCert Inc/OU=www.digicert.com/CN=DigiCert High Assurance CA-3 [20:06:03] not the same root [20:06:24] Yuvi: yes, cm nightly [20:06:30] hm [20:06:33] let me check something [20:07:12] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [20:07:13] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:07:46] paravoid: ah. sorry. I was talking about star.wikipedia.org, as I was thinking mobile [20:07:51] Ryan_Lane: but I have another crash report (with exception info showing same issue), and he is running a no-name chinese brand Android with 2.3.4 (not that old either) [20:07:51] but for bits, yeah, different cert [20:08:03] !log cp1022 unresponsive, power cycling [20:08:07] YuviPanda: what does the exception show? [20:08:09] Logged the message, Mistress of the network gear. [20:08:12] let me pastebin [20:08:26] OS/version? [20:08:29] can they also give a screenshot of the certificate shown by the site? [20:08:35] paravoid: android 2.3.4 ;) [20:08:48] Ryan_Lane: http://pastebin.com/MUpkgDyq [20:08:54] that's from the other user [20:08:59] ragesoss: can you take a screenshot? [20:09:11] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [20:09:31] Ryan_Lane: that is for bits. [20:09:53] the source of the error is at at org.wikimedia.commons.EventLog$LogTask.doInBackground(EventLog.java:40), which calls out to bits. 
(for EventLogging) [20:10:09] I wonder if the phone is missing the trust store [20:10:12] PROBLEM - Host cp1022 is DOWN: PING CRITICAL - Packet loss = 100% [20:10:18] or if the app is looking in the wrong place for it [20:10:33] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [20:11:11] RECOVERY - SSH on cp1022 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:11:12] RECOVERY - Varnish HTCP daemon on cp1022 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [20:11:12] RECOVERY - Varnish HTTP upload-backend on cp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.439 second response time [20:11:21] RECOVERY - Host cp1022 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [20:11:37] YuviPanda: are some of these folks hitting esams and others hitting eqiad, or are all in a specific region? [20:12:23] one moment, getting jcmish here, she's been talking to them [20:13:13] am here [20:13:26] hey jcmish [20:13:34] Ryan_Lane was asking about geographical location of people facing problems [20:13:38] approximately, at least [20:13:39] do you know? [20:14:04] don't but I can get that YuviPanda and Ryan_Lane [20:14:10] ok [20:14:12] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [20:14:17] sage is in the us, right? [20:14:37] here in France all seems back and fine [20:14:39] Ryan_Lane: yup [20:14:47] near DC [20:14:55] (but not in DC) [20:15:01] ah. I see a problem [20:15:18] github uses the same root CA [20:15:18] http://pastebin.com/CgD7xwNh [20:15:37] it's not a device issue [20:15:38] hah [20:15:45] it's sets of users hitting the same server [20:15:50] source hash ;) [20:16:00] yeah :) [20:16:00] \o/ [20:16:01] good catch [20:16:05] that whooshed completely over my head :) [20:16:07] !log depooling ssl1001 [20:16:12] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:16:15] Logged the message, Master [20:16:17] I *really* need to update install_certificate [20:16:19] but yay :) [20:16:24] and nagios alerts [20:16:25] so that it regenerates the chained cert [20:16:33] that too [20:16:34] well, that won't help here [20:16:37] due to sh [20:16:42] unless we use nrpe for it [20:16:44] nagios to all SSL boxes [20:16:47] yeah [20:16:51] https://www.dropbox.com/s/15gx5bo2y8icqn5/Screenshot_2013-03-13-16-12-43.png [20:17:05] paravoid: it requires nrpe [20:17:10] Ryan_Lane: ^ [20:17:12] screenshot [20:17:13] why? [20:17:13] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [20:17:13] PROBLEM - MySQL Slave Delay on db71 is CRITICAL: CRIT replication delay 432931 seconds [20:17:20] https://www.dropbox.com/s/5h97oaj2hn618r4/Screenshot_2013-03-13-16-12-52.png [20:17:20] but this is for integration.mediawiki.org [20:17:23] ah, multiple ips [20:17:23] because the ips are bound to lo on them? [20:17:34] yeah yeah [20:17:41] ragesoss: : is the first one integration.mw.org? [20:17:52] can we also do SNI with smaller certificates? [20:18:00] ragesoss: or second one? [20:18:02] both are same [20:18:03] New patchset: Andrew Bogott; "Added a basic nginx module and two (labs) use cases." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886 [20:18:05] paravoid: that would be nice, yes [20:18:06] ragesoss: ah, ok [20:18:08] please pretty please [20:18:12] just scrolled [20:18:14] we can do that now :) [20:18:16] and drop all those lb IPs! [20:18:19] since we upgraded to precise [20:18:30] paravoid: yes, I was saying that just yesterday :) [20:18:31] btw http://www.isg.rhul.ac.uk/tls/ [20:18:33] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 18 seconds [20:18:34] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 3.98059028777 [20:18:36] no need for all those ips [20:18:37] fresh [20:18:52] Ryan_Lane: so, this should resolve itself soon? :) [20:19:04] paravoid: hahaha. so basically SSL is completely fucked now [20:19:09] yep [20:19:11] RECOVERY - Varnish HTTP upload-frontend on cp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 673 bytes in 0.030 second response time [20:19:11] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [20:19:12] \o/ [20:19:16] there's no way to avoid attacks [20:19:34] well, that sucks [20:19:52] YuviPanda: i'm stepping out for lunch. let me know if you need anything testing and i'll do it when i get back [20:20:01] Change abandoned: Andrew Bogott; "Replaced by a newer patch, https://gerrit.wikimedia.org/r/#/c/43886/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44712 [20:20:06] !log reedy synchronized wmf-config/InitialiseSettings.php 'Re-enable collection' [20:20:09] YuviPanda: ok, can people check if this is still occurring? [20:20:12] tfinc: think Ryan_Lane spotted the problem :) will do. [20:20:13] Logged the message, Master [20:20:14] Ryan_Lane: okay [20:20:18] ragesoss: try again? [20:20:21] I think that should have fixed it [20:20:24] jcmish: can you poke people? [20:20:34] let me check that in esams too [20:20:35] YuviPanda: yup doing it now [20:20:42] the app? [20:20:43] jcmish: thanks jcmish :) [20:20:48] ragesoss: yup [20:20:50] shouldn't crash anymore [20:21:12] esams is good [20:21:36] !log repooling ssl1001 [20:21:42] Logged the message, Master [20:22:02] works! [20:22:02] \o/ [20:22:02] wheee! [20:22:11] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [20:22:23] I'm glad we found this now [20:22:33] otherwise when we changed out the cert the problem would have disappeared [20:22:40] because that file would have been regenerated [20:22:57] ah [20:22:57] and then we would have *really* been confused [20:23:10] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [20:23:14] so, any reason to not revert bits to the old certificate? [20:23:14] so this was just a configuration issue on one of the ssl terminators?
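What made this confusing is captured just above: LVS source-hashes clients onto terminators, so a stale chained cert on a single box (ssl1001 here) only affects the users hashed to it and looks device-specific. A sketch of the kind of per-backend check that surfaces the odd one out by connecting to each terminator directly instead of the service IP; the host list and SNI name are illustrative, not the actual pool definition:

    # Sketch: with source hashing, a bad chained cert on one terminator only bites the
    # clients hashed to it, so check each backend directly rather than the LVS service IP.
    # Host list and SNI name are illustrative, not the real pool.
    for host in ssl1001 ssl1002 ssl1003 ssl1004; do
        printf '%-10s ' "$host"
        echo | openssl s_client -connect "${host}:443" -servername bits.wikimedia.org 2>/dev/null \
            | openssl x509 -noout -issuer
    done

The later talk of pointing Nagios at every SSL box via NRPE is essentially the monitored, permanent version of the same idea.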
[20:23:19] YuviPanda: yes [20:23:21] paravoid: no [20:23:32] paravoid: we're installing the new one in like an hour [20:23:39] but why on bits was my point [20:23:42] ah [20:23:46] I see what you mean [20:24:07] we can, yes [20:25:10] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 20:25:02 UTC 2013 [20:26:00] Thanks Ryan_Lane, paravoid, jcmish, ragesoss :) [20:26:00] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [20:26:10] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [20:26:30] yw [20:30:32] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [20:33:05] New patchset: RobH; "pushing the updated unified certificate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53629 [20:34:52] LeslieCarr: https://meta.wikimedia.org/wiki/Www.wikimedia.org_template/temp # You can click "Preview HTML" here to see what it might look like. [20:35:16] I moved some icons around for symmetry. The bottom row is a whole lot of RGB, but I think overall this version looks good. [20:37:11] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [20:38:30] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 20:38:20 UTC 2013 [20:39:00] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [20:49:16] apergos, greg-g: Thank you both for helping out with the PDF/Collections deployment and roll back. [20:51:05] Susan: no problem, I just kinda watched ;) [20:51:54] And honed your panicking skills, I hope! [20:52:20] PROBLEM - MySQL Slave Delay on db1009 is CRITICAL: CRIT replication delay 207 seconds [20:52:21] PROBLEM - MySQL Replication Heartbeat on db1009 is CRITICAL: CRIT replication delay 203 seconds [20:53:20] Susan: my deep breathing ability helps :) [20:54:11] RECOVERY - MySQL Slave Delay on db1009 is OK: OK replication delay 0 seconds [20:54:11] RECOVERY - MySQL Replication Heartbeat on db1009 is OK: OK replication delay 0 seconds [20:55:52] New patchset: Ryan Lane; "Remove the root certificate from the unified chain" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53632 [20:56:10] paravoid: ^^ [20:56:18] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53629 [20:56:19] got to go [20:56:44] quick review? [20:56:48] it's really small :) [20:56:53] and we're switching out the root [20:56:55] err [20:56:56] the cert [20:57:00] it's the best time to do this [20:58:24] btw, you can do openssl x509 -in ... 
-hash and -issuer_hash and let it figure out the chain itself [20:58:28] instead of doing it manually [20:58:42] ca-certificates creates symlinks with the hash in /etc/ssl/certs even [20:58:54] change looks good [20:59:06] leaving [20:59:08] ttyl [21:00:28] New patchset: Pyoungmeister; "Redeploying this now that site is no longer breaking" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53633 [21:00:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53632 [21:03:23] New patchset: Ryan Lane; "Changing mobile to use the unified cert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53635 [21:04:57] !log setting first 2 ssl servers per dc site to false in pybal [21:05:03] Logged the message, RobH [21:05:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53635 [21:05:32] greg-g: hi, can I try to upgrade mwlib again? [21:05:53] schmir: not right now, no [21:06:46] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53633 [21:07:36] schmir: the schedule is up to date here: https://wikitech.wikimedia.org/wiki/Deployments please pick a window that is available and preferrably with decent Pacific timezone overlap [21:08:08] !log py synchronized wmf-config/lucene-production.php 'temp moving all earch traffic to pmtpa for upgrades in eqiad' [21:08:16] Logged the message, Master [21:08:31] schmir: (reload, I just fixed a type/failed update for tomorrow) [21:09:01] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 21:08:55 UTC 2013 [21:09:04] s/type/typo/ # fitting [21:09:58] greg-g: sorry, I don't understand that table. [21:10:00] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [21:10:16] greg-g: let us discuss in the bugtracker [21:10:32] schmir: this is probably easier. It is a table of dates and times that are taken [21:11:20] schmir: I would like you to find a time (1hr window) that is not already claimed, but yet is still overlapping somewhat with the Pacific timezone, to do your deployment. [21:12:44] New patchset: Ram; "Bug: 45266 Use sequence numbers instead of timestamps" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299 [21:13:54] !log repooling the first two ssl servers per site [21:14:01] Logged the message, RobH [21:15:27] !log depooling ssl servers 3 and 4 in each dc site [21:15:35] Logged the message, RobH [21:15:48] New review: Ram; "Testing is complete. Ready for review." [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53299 [21:17:11] RECOVERY - Puppet freshness on search27 is OK: puppet ran at Wed Mar 13 21:17:06 UTC 2013 [21:21:48] !log repooling all ssl servers (except broken ssl3004) in all sites. new unified certificate is now in use. [21:21:54] Logged the message, RobH [21:24:30] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [21:26:24] TimStarling: you up and about yet? 
I'm thinking we should coordinate the Collections extension update to some time that works for both you and schmir [21:26:43] ...which would probably not be the best time here, but that might be ok [21:28:14] I'm here now [21:29:00] I just need someone to tell me if the new mwlib overloads the api.php cluster [21:29:30] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [21:29:31] it's very unlikely [21:29:49] the API cluster handles a significant proportion of all parsing [21:30:49] should I try it now? [21:31:06] how many requests per second do you normally get? [21:31:34] schmir: what did you change between the deployment this morning and now? [21:33:02] it probably doesn't matter, just enable it and we'll see what happens [21:34:02] robla: https://github.com/pediapress/mwlib/commit/a3e55933c5f9e11bc7d51a1c7608ff252a1d90a2 [21:34:21] is the profiling collector down? [21:34:28] well, we're actually open this afternoon (mobile moved their stuff to tomorrow due to the #popedotting) [21:34:29] TimStarling: ok. I'll do that now. thanks. [21:34:55] umm...ok [21:35:02] is there an incident report/postmortem/retrospective on the popedotting? [21:35:05] I guess [21:35:12] sumanah: not that I've seen [21:35:21] RECOVERY - NTP on europium is OK: NTP OK: Offset -0.09331905842 secs [21:36:37] greg-g: https://wikitech.wikimedia.org/wiki/Incident_documentation and https://wikitech.wikimedia.org/wiki/Incident_response#Post_mortem [21:37:29] drdee: the locke replacement (gadolinium) is online now in eqiad [21:37:31] updated ticket. [21:38:09] !log on professor: restarted collector [21:38:16] Logged the message, Master [21:38:31] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.0676309322 (gt 8.0) [21:39:30] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Wed Mar 13 21:39:27 UTC 2013 [21:40:01] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [21:40:11] PROBLEM - Varnish HTTP mobile-backend on cp1041 is CRITICAL: Connection refused [21:42:30] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.93404378571 [21:42:40] TimStarling: Was there something else to do after enabling Scribunto everywhere? [21:43:32] !log upgraded mwlib to 0.15.3 [21:43:40] Logged the message, Master [21:45:11] RECOVERY - Varnish HTTP mobile-backend on cp1041 is OK: HTTP OK: HTTP/1.1 200 OK - 696 bytes in 0.016 second response time [21:45:20] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 192 seconds [21:45:30] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 197 seconds [21:45:45] schmir: You're trying the upgrade again? [21:46:43] Susan: according to the pp-pdf1 message, he just did [21:46:57] Right. That's why I was asking. [21:47:19] Susan: yes, I've upgraded the whole pdf cluster [21:47:38] Reedy: champagne? [21:49:00] things look good here: https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=API+application+servers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:49:25] but pdf3 doesn't look good: https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&s=by+name&c=PDF+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:49:34] Eep. [21:49:43] schmir: ^^ [21:50:03] TimStarling: Any movement on global Scribunto modules? 
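Picking up paravoid's -hash / -issuer_hash tip from a little earlier: a chained cert can be assembled or sanity-checked by walking issuer hashes against the <hash>.0 symlinks that ca-certificates keeps in /etc/ssl/certs. A rough sketch only, not the install_certificate script mentioned above; the leaf file name comes from the log, everything else is illustrative:

    #!/bin/bash
    # Rough sketch: walk from a leaf cert up to its root by matching each cert's
    # issuer_hash against the <hash>.0 symlinks that ca-certificates maintains.
    cert=unified.wikimedia.org.pem
    while true; do
        subject_hash=$(openssl x509 -in "$cert" -noout -hash)
        issuer_hash=$(openssl x509 -in "$cert" -noout -issuer_hash)
        # A self-signed root has subject == issuer, so the two hashes match.
        [ "$subject_hash" = "$issuer_hash" ] && break
        next="/etc/ssl/certs/${issuer_hash}.0"
        if [ ! -e "$next" ]; then
            echo "issuer ${issuer_hash} not in the local store" >&2
            break
        fi
        openssl x509 -in "$next" -noout -subject
        cert=$next
    done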
[21:50:21] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 30 seconds [21:50:28] I'm also not sure the licensing question re: modules in wiki pages was ever looked at or resolved. [21:51:04] nothing I've seen on the gloval scributo stuff yet [21:51:47] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 8.59301945736 (gt 8.0) [21:52:07] schmir: those pdf servers seem a bit overworked generally, is that normal? [21:52:07] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [21:53:25] greg-g: yes, they are pretty busy at times [21:53:33] greg-g: https://ganglia.wikimedia.org/latest/?r=year&cs=&ce=&m=cpu_report&s=by+name&c=PDF+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:53:37] Try a longer time span ;) [21:53:37] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [21:53:57] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [21:54:28] greg-g: this looks all good to me [21:55:20] Reedy: :P [21:55:26] schmir: ok, good [21:55:35] that test page for the bug worked !! yay! [22:11:40] New patchset: RobH; "redirection for softwarewikipedia.net per rt4672" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53686 [22:12:27] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [22:13:26] jenkins... so slow [22:16:29] New patchset: awjrichards; "Update X-Analytics handling to new k/v pair spec" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52606 [22:17:07] Reedy: ping? [22:17:12] <^demon> RobH: It's not scaling nicely :\ [22:17:36] <^demon> We prolly need to look at slaves eventually. [22:19:04] * Damianz makes chad his slave [22:20:35] ^demon: so i can see in the project that it ran a build against my changeset [22:20:39] I'm reminded of . [22:20:41] but it never actually updated my changeset [22:20:45] https://integration.mediawiki.org/ci/view/All/job/operations-apache-config-lint/changes [22:20:51] https://gerrit.wikimedia.org/r/#/c/53686/1 [22:20:57] ^demon: any ideas? [22:21:06] <^demon> Sooo mannny pingggs. [22:21:15] yes, but im the most important~! [22:21:16] ;] [22:21:48] <^demon> No clue. [22:22:09] where is hashar when I need him! [22:23:32] hrmm, so i think it passed, but since i have no failed oens in the history to compare output, i have no idea. [22:23:52] oh, yes i do, #4 failed. [22:23:59] manual merge time. [22:25:11] New review: RobH; "can confirm it passed jenkins testing @ https://integration.mediawiki.org/ci/view/All/job/operations..." [operations/apache-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/53686 [22:25:11] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53686 [22:27:51] thanks everyone. I'm going to leave in a few minutes...unless someone objects (greg-g, TimStarling, anyone else?) [22:28:19] <^demon> RobH: zuul is hella backed up :\ http://integration.mediawiki.org/zuul/status [22:28:57] that handles the write back to gerrit pages? [22:29:03] cuz i can see it did the test build [22:30:24] robh is doing a graceful restart of all apaches [22:30:33] every time i do this it makes me nervous. [22:30:44] !log robh gracefulled all apaches [22:30:50] Logged the message, Master [22:31:06] mutante: So, why doesnt the script restart api servers? [22:31:19] i would think it needs to, so now i need to use dsh to manually restart eqiad based api apaches? 
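On the question just above about the restart script skipping the API apaches, dsh against the relevant node group is the usual manual fallback. A hypothetical sketch; the group name "api_appserver" and the sudo/configtest wrapping are assumptions, not anything confirmed in this log:

    # Hypothetical sketch: graceful only the API apaches via dsh.
    # "api_appserver" is an assumed dsh group name; the configtest guard is a
    # safety measure, not confirmed behaviour of the wrapper script.
    dsh -g api_appserver -M -c -- \
        'sudo apache2ctl configtest && sudo apache2ctl graceful'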
[22:31:21] schmir: ok, bye [22:31:37] re popedotting: I don't really see a big spike in mysql connection count [22:31:38] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.135 second response time [22:31:48] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 2.22684938462 [22:32:00] I see some possible evidence of internal network saturation in the form of missing ganglia data [22:32:16] and that could cause mysql client connection timeouts [22:34:17] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 183 seconds [22:34:46] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 187 seconds [22:37:29] New patchset: Reedy; "Remove commented simplewikibooks entries" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53690 [22:37:51] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53690 [22:39:37] New patchset: Reedy; "Bug 46083 - viwiktionary has $wgLogo defined twice" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53692 [22:40:05] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53692 [22:41:19] am I the only one who gets annoyed by having all of one sort of server in the same rack? [22:43:53] I've got an excuse because it's not my job to think about it, but that's the first time I thought about the risk of homogeneous racks [22:44:25] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [22:44:43] !log demon synchronized php-1.21wmf11/extensions/Scribunto/ 'Updating Scribunto to master' [22:44:49] Logged the message, Master [22:47:23] New patchset: Reedy; "Bug 45866 - Babel configuration for min.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53693 [22:47:43] TimStarling: no you are not [22:47:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53693 [22:48:15] TimStarling: but i'll let you and mark fight that one out ;) he does have a point that with our number of servers if a row that contains 1/2 or 1/3 of the servers goes out we're not really able to handle it [22:51:35] New patchset: Reedy; "Bug 45841 - Localizing sitename for dv.wikt" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53694 [22:52:10] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53694 [22:53:04] !log reedy synchronized wmf-config/ [22:53:12] Logged the message, Master [22:56:16] New patchset: Pyoungmeister; "Re-de-deploying this, as upgrades are complete." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53695 [22:58:37] for the record, re-de-deploying is a thing [22:58:59] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53695 [23:00:01] !log py synchronized wmf-config/lucene-production.php 'moving all search traffic back to eqiad. upgrades over.' 
[23:00:07] Logged the message, Master [23:00:36] notpeter: http://en.wiktionary.org/wiki/redeployment [23:00:45] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 8 seconds [23:00:50] notpeter: that's my second new word for the day, after "popedotting" [23:01:17] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [23:01:30] hahaha, awesome [23:01:34] being dotted because the pope got redeployed [23:01:36] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [23:02:02] I find it funny [23:02:13] because we wouldn't even notice any kind of slashdotting [23:02:19] ha! [23:02:26] only popes can drive enough traffic for us to notice [23:02:30] and michael jackson [23:02:32] we get realworlddotted [23:02:38] "the Pope and Michael Jackson" [23:02:38] hahaha, exactly [23:02:42] my new magical realism novel [23:03:15] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53696 [23:03:47] didn't we get Bieberdotted a while back? :) [23:05:09] !log py synchronized wmf-config/lucene-production.php 'moving all search traffic back to eqiad. upgrades over.' [23:05:16] Logged the message, Master [23:11:13] xyzram: hey [23:11:23] howdy [23:11:26] I have deployed your patch to lucene-search-2 and https://gerrit.wikimedia.org/r/#/c/52547/ [23:11:29] well [23:11:45] I wrote a new identical patchset because I really really hate re-basing things [23:12:00] so, it's all live in prod now [23:12:13] and the error logs are *much* less spammy [23:12:36] Great! Glad to know it didn't crash and burn :-) [23:12:47] nope, all worked well [23:13:33] That slimness of log files will be a big help tracking down the remaining issues with search. [23:13:56] Thanks for the update! [23:15:07] xyzram: yep! may I abandon your version of that patchset? [23:15:25] Surely. [23:15:30] * Reedy rebases notpeter [23:16:01] Reedy: for 4 line changes.... man, not worth fighting the rebase fight ;) [23:16:22] Cherrypick is sometimes useful. [23:16:38] Change abandoned: Pyoungmeister; "this was done in a different patchset because I was not willing to rebase. but this change is live!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52547 [23:16:58] git cherry ... [23:17:09] git fetch ssh://reedy@gerrit.wikimedia.org:29418/operations/puppet refs/changes/47/52547/2 && git cherry-pick FETCH_HEAD [23:17:11] copy paste! [23:17:36] !log demon Started syncing Wikimedia installation...
: [23:17:42] Logged the message, Master [23:18:14] I tried going the cherrypicking route [23:18:16] and then got annoyed [23:18:32] and then used my laptop's buffers to fix the problem :) [23:19:07] New patchset: RobH; "wikimedia.us redirect per rt4416" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53697 [23:20:47] New patchset: Reedy; "Update config/dblist locations" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53698 [23:21:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53698 [23:23:09] !log no one sync apaches, we are doing crazy magic on them [23:23:16] Logged the message, RobH [23:23:24] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53697 [23:27:07] New patchset: RobH; "Revert "wikimedia.us redirect per rt4416"" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53699 [23:28:15] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53699 [23:28:35] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [23:31:57] New review: RobH; "So I attempted to steal this bit of regex and use it in my wikimedia.us patchset. It failed, with t..." [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/53403 [23:38:45] !log deploying change 53464 to OpenStackManager on wikitech [23:38:51] Logged the message, Master [23:41:20] !log demon Finished syncing Wikimedia installation... : [23:41:27] Logged the message, Master [23:43:48] !log installing package upgrades on zirconium [23:43:54] Logged the message, Master [23:44:54] New patchset: RobH; "redirection for wikimedia.us to wikimedia.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53703 [23:45:31] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53703 [23:48:56] New patchset: RobH; "Revert "redirection for wikimedia.us to wikimedia.org"" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53704 [23:49:13] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53704 [23:49:41] well, that was a failure. [23:49:49] im too tired to deal with regex now it seems. [23:55:53] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
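As an aside on the "Varnish traffic logger" PROBLEM/RECOVERY noise that runs through this whole log: it is a process-count check flapping between 2 and 3 varnishncsa processes, i.e. the standard check_procs plugin or something very close to it. The plugin path and thresholds below are guesses, not the actual Icinga service definition:

    # Approximate form of the flapping check; path and thresholds are assumptions.
    /usr/lib/nagios/plugins/check_procs -C varnishncsa -c 3:3
    # -> PROCS CRITICAL: 2 processes with command name varnishncsa   (when one dies)
    # -> PROCS OK: 3 processes with command name varnishncsa         (after recovery)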