[00:16:17] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1043 and shifting watchlist q's to db1050'
[00:16:17] !log shutting down mysql on db1043, converting to mariadb
[00:16:17] LeslieCarr:
[00:16:17] er
[00:16:17] LeslieCarr: spence is the ishmael server
[00:16:17] please don't fully decom
[00:16:17] at least for now
[00:16:17] just de-nagios
[00:16:17] RECOVERY - MySQL Slave Delay on db39 is OK: OK replication delay 0 seconds
[00:16:18] oh
[00:16:18] oops
[00:16:18] i had turned it off
[00:16:18] ishmael ?
[00:16:18] is that unpuppetized ?
[00:16:18] hrm
[00:16:18] wait it's not
[00:16:18] binasher: ishmael redirects to labs
[00:16:18] or no fenari
[00:16:18] https://ishmael.wikimedia.org/
[00:16:18] ishmael.wikimedia.org is an alias for noc.wikimedia.org.
[00:16:18] yes
[00:16:18] so, that's not spence
[00:16:18] ishmael runs on spence
[00:16:18] PROBLEM - mysqld processes on db1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[00:16:18] ok, is this something i can puppetize ? :)
[00:16:18] that would be oh so very awesome!
[00:16:18] noc is just being used as an https ldap auth protected proxy
[00:16:19] so it's just the php or are there other secret processes ?
[00:16:19] !log pgehres Started syncing Wikimedia installation... : Updating i18n for DonationInterface-langonly
[00:16:19] no, i started the scap about 20 minutes ago
[00:16:20] RECOVERY - mysqld processes on db1043 is OK: PROCS OK: 1 process with command name mysqld
[00:16:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51312
[00:16:20] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: CRIT replication delay 525 seconds
[00:16:21] New patchset: Jeremyb; "RT 4615 - mywot.com site verification for the wikivoyages" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51595
[00:16:21] LeslieCarr: its just the php, there's stuff running on every db for it but that part actually is puppetized
[00:16:21] hehe
[00:16:21] oh i love unpuppetized unpackaged stuff
[00:16:21] and the php i sometimes make lots of little changes to
[00:16:21] ok, so would you rather package it or just put the files straightup into puppet ?
[00:16:21] moving it to a deb would make that a huge pain in the ass
[00:16:21] putting the files straight into puppet would be easy for me (lazy me)
[00:16:21] mark has requested not to put webapps files in puppet tho
[00:16:21] and do you want from github or just copied from there?
[00:16:21] shhh, now that you said his name you'll ping him ;)
[00:16:21] maybe what should be puppetized is just the config file and some kind of git-deploy based thing
[00:16:21] that keeps recurring, doesn't it
[00:16:22] lots of web apps that need to get deployed and neither puppet nor packages being really suitable
[00:16:22] what's in github should be generally in sync with what's on spence, i pushed recently
[00:16:22] that is why we need to write a debian package don't we ?
[00:16:22] just liked or bugzilla?
[00:16:22] s/liked/like
[00:16:22] just like for bugzilla?
[00:16:22] is git deploy right ?
[00:16:22] for this occasion ?
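For reference, the "files straight into puppet" option being weighed above amounts to a recursive file resource. A minimal sketch, with the module name and paths invented purely for illustration (this is not the actual ishmael manifest):

    # Throwaway local test of the approach; hypothetical module layout.
    cat > /tmp/ishmael-test.pp <<'EOF'
    file { '/srv/ishmael':
      ensure  => directory,
      recurse => true,            # ship the whole web tree out of the module
      owner   => 'www-data',
      group   => 'www-data',
      source  => 'puppet:///modules/ishmael/web',
    }
    EOF
    puppet apply --noop /tmp/ishmael-test.pp    # dry run; shows what would change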
[00:16:22] I have no idea how useful git-deploy is right now
[00:16:22] dpkg isn't right
[00:16:22] not even if it is properly working
[00:16:22] regardless of what is right
[00:16:22] I agree
[00:16:22] debs for small web apps sounds wrong even to me :)
[00:16:22] puppet supports rsync:// instead of just file:// so maybe we could setup something based on distributing things via rsync that originate from different wmf git repos
[00:16:22] my feeling right now while we are in limbo is to put it in puppet because it's better than nothing ?
[00:16:22] paravoid: oh really? :-]
[00:16:22] what about small python modules ? *grin*
[00:16:22] hrm, damn, i was hoping that github had rsync
[00:16:22] LeslieCarr: what I do for some of my projects is that I manually git pull on the prod box
[00:16:22] i'd probably rather put it in our operations-software repo
[00:16:22] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay 0 seconds
[00:16:22] !log pgehres Finished syncing Wikimedia installation... : Updating i18n for DonationInterface-langonly
[00:16:22] like the Zuul gateway (which handles the Gerrit/Jenkins interaction).
[00:16:22] if it's rsync based, live hack changes need to go into a repo
[00:16:22] that seems like a good idea
[00:16:22] and i rarely push up to github
[00:16:22] the code is installed by puppet using git::clone() that triggers a restart of the Zuul process
[00:16:22] the conf for it I deploy from a second repository, zuul-config
[00:16:22] which I just have to git pull
[00:16:22] works well since that is just one software on ONE server
[00:16:23] how is parsoid deployed?
[00:16:23] via tin
[00:16:23] using git-deploy
[00:16:23] i was talking with paravoid about this too, I heard that they were using git-deploy
[00:16:23] yeah
[00:16:23] ok, so git-deploy is in use
[00:16:23] Analytics is thinking about this too
[00:16:23] we want to productionize limn soon
[00:16:23] been made by Roan / Ryan Lane who originally thought about the slot system that git-deploy uses
[00:16:23] so I guess parsoid has been used as a proof of concept
[00:16:23] its a NodeJS webapp, (parsoid uses NodeJS too?)
[00:16:23] so one would just go to tin then /srv/deployment/Parsoid
[00:16:23] git pull && git deploy start && git deploy rsync
[00:16:23] though the npm modules there need to be updated manually
[00:16:23] is it in a git repo ?
[00:16:23] LeslieCarr: Parsoid is
[00:16:23] the npm dependencies are not
[00:16:23] ok
[00:16:23] (as far as I know)
[00:16:23] we investigated that yesterday with marktraceur
[00:16:23] they are in a separate config git repo
[00:16:23] since another project needed some node.js modules to be deployed, we looked at how parsoid has been deployed
[00:16:23] on tin
[00:16:24] ahh
[00:16:24] hashar: Yeah, RoanKattouw let me know today they're in a config repo
[00:16:24] see https://wikitech.wikimedia.org/view/Parsoid
[00:16:24] good
[00:16:24] marktraceur: I guess that means the git repo we created yesterday should be discarded, isn't it?
[00:16:24] gwicke: that link is broken :(
[00:16:24] Heh, wikitech was migrated, wasn't it?
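The Parsoid deploy flow quoted above, spelled out as the shell session it implies (the hostname suffix and the exact casing of the path are assumptions; the three git commands are verbatim from the log):

    ssh tin.eqiad.wmnet            # the deployment host mentioned above
    cd /srv/deployment/Parsoid     # path as given in the log
    git pull                       # bring the checkout to the revision to deploy
    git deploy start               # open a deployment slot
    git deploy rsync               # push the tree out to the targets
    # per the discussion, the npm modules alongside this still need manual updates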
[00:16:24] the link works for me
[00:16:24] Ryan_Lane: https://wikitech.wikimedia.org/view/Parsoid ends up being broken :( https://labsconsole.wikimedia.org/wiki/Parsoid
[00:16:24] got a redirect to labsconsole
[00:16:24] hashar: the modules in production are different from the ones in testing
[00:16:24] so we would need a different branch or simply a different repository
[00:16:25] oh /srv/deployment/parsoid/config/ contains the node modules and is a local to tin git repository
[00:16:25] maybe two branches will work :]
[00:16:25] Gluster down for anybody else?
[00:16:25] or two repos, up to you
[00:16:25] hashar: a different repo is simpler for now I think
[00:16:25] up to you :-]
[00:16:25] the tin one is really tin-local
[00:16:25] we just needed a way to fetch the npm dependencies without having to use npm itself :-D
[00:16:25] came up with the idea of sticking them in a Gerrit repo so Jenkins can happily fetch all the npm dependencies
[00:16:25] yeah, and to fix a revision of the dependencies so that we can tell apart regressions from libs from those in our code
[00:16:25] did not think that much about how to handle the dev/prod differences
[00:16:25] dschoon^
[00:16:26] yeah! i've been reading through parsoid, hashar gwicke
[00:16:26] hmm
[00:16:26] but i've been in meetings -- i'll have some questions tomorrow :)
[00:16:26] would be nice to have all of that sorted out soon
[00:16:26] PROBLEM - Packetloss_Average on emery is CRITICAL: STALE
[00:16:26] no one reads their email
[00:16:26] :)
[00:16:26] hashar: it's not broken
[00:16:26] define "it" ?
[00:16:26] hashar: use wikitech-old.wikimedia.org for now
[00:16:26] dschoon: great, will be happy to chat about it ;)
[00:16:26] PROBLEM - Packetloss_Average on locke is CRITICAL: STALE
[00:16:26] Ryan_Lane: that is because the content is being migrated?
[00:16:26] yes
[00:16:26] and post migration https://wikitech.wikimedia.org/view/Parsoid would properly rewrite to the labsconsole page having the content of that page ?
[00:16:26] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[00:16:26] Ryan_Lane: the mail you sent just confused me, sorry :-(
[00:16:26] Ryan_Lane: jet lag is preventing me from properly parsing the english emails
[00:16:26] no
[00:16:26] it'll be wikitech
[00:16:26] ok now I am no more understanding anything haha
[00:16:26] hm
[00:16:26] actually...
[00:16:26] it shouldn't redirect to labsconsole
[00:16:26] I will just wait for you to complete the migration and then send the cheerful comments about it :-]
[00:16:27] oh. now it isn't
[00:16:36] LeslieCarr: so I am not sure how you will end up preparing the ishmael files :-]
[00:16:36] me neither now
[00:16:36] :-/
[00:16:36] LeslieCarr: seems there are lot of different ways to do it :-]
[00:16:36] but all are wrong!
[00:16:36] LeslieCarr: if you ever bring that discussion up, I will be more than happy to attend
[00:16:37] that would interest me for a few projects I am leading
[00:16:38] RECOVERY - Host search23 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms
[00:16:38] mark: are you around?
[00:16:39] he's in a meeting and doesn't seem to have a laptop in front of him
[00:16:39] not his laptop anyway
[00:16:39] so that would be a no
[00:16:39] PROBLEM - MySQL Slave Delay on db39 is CRITICAL: CRIT replication delay 196 seconds
[00:16:39] he is staring at a can of coke actually
[00:16:39] :D
[00:16:39] mark: When you get a sec, just let me know if you still need your OTRS account. It's set for closure.
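One way to get the "npm dependencies pinned in a git repo" arrangement described above, sketched with a hypothetical repo URL and paths (the real config repo lives locally on tin, per the log):

    # Fetch the pinned node_modules from a dedicated deps repo instead of npm:
    git clone https://gerrit.wikimedia.org/r/p/parsoid/deploy-config.git config   # hypothetical repo
    ln -s "$PWD/config/node_modules" Parsoid/node_modules
    # Jenkins can now fetch everything from Gerrit, and a library regression
    # shows up as a commit in the deps repo rather than a surprise from npm.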
[00:16:40] PROBLEM - Puppet freshness on search23 is CRITICAL: Puppet has not run in the last 10 hours
[00:17:13] New patchset: Asher; "Revert "pulling db1043 and shifting watchlist q's to db1050"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51609
[00:17:34] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Fri Mar 1 00:17:22 UTC 2013
[00:18:31] !log restarted glusterd service on labstore1-4
[00:18:36] (and yes, I know log is broken)
[00:20:08] oh.. hah, i didn't even notice when i !log'd earlier
[00:21:32] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 00:21:26 UTC 2013
[00:21:32] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[00:22:06] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 00:21:57 UTC 2013
[00:22:32] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[00:22:52] RECOVERY - MySQL Slave Delay on db39 is OK: OK replication delay 0 seconds
[00:26:37] New review: Dzahn; "so if it's generated for en.voy it won't work for the other languages, but i don't see any harm ther..." [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/51595
[00:26:55] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51595
[00:27:58] New patchset: Pyoungmeister; "fixing tmh role class to work with new jobrunner defs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51611
[00:28:02] RECOVERY - Puppet freshness on search23 is OK: puppet ran at Fri Mar 1 00:27:57 UTC 2013
[00:28:15] AaronSchulz: can you look at ^^
[00:28:18] I think that it's right....
[00:28:29] mutante: Tim is working on merging all the docroots I think
[00:32:00] paravoid: ah,ok, i was going to talk to Reedy once they are out of meeting
[00:34:29] mutante: we are never going to get out of that meeting
[00:34:41] probably already have ordered pizza *grin*
[00:36:10] New patchset: MaxSem; "Bug 45256: make upload endpoint for Commons local" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51612
[00:38:20] !log dzahn synchronized ./docroot/wikivoyage.org/mywotf83102384bfcd1e52152.html
[00:38:36] jeremyb_: ^
[00:42:07] danke
[00:42:53] notpeter: looks fine
[00:43:05] AaronSchulz: cool
[00:43:15] thanks!
[00:43:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51611
[00:45:58] huh, still no morebots
[00:47:53] !log COME AT ME, MOREBOTS
[00:48:20] "Send them to me, come on!"
[00:51:29] New patchset: Rfaulk; "mod. update config info for E3 user metrics deployment." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51615
[00:52:07] ottomata: https://gerrit.wikimedia.org/r/#/c/51615/
[00:52:36] there are some new credentials however that need to be setup
[00:55:15] hokay looking!
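The single-file sync logged by dzahn above would have looked something like this on the deployment host (sync-file usage reconstructed from memory; verify against the deployment docs before relying on it):

    cd /home/wikipedia/common     # assumed working copy path on fenari
    sync-file docroot/wikivoyage.org/mywotf83102384bfcd1e52152.html 'RT 4615 - mywot.com site verification'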
[00:55:33] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[00:57:20] a the mysql things, hm
[00:57:21] New patchset: Ryan Lane; "Rewrite any non-controller url" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51616
[00:58:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51616
[01:01:33] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:07:04] !log csteipp synchronized php-1.21wmf10/includes
[01:13:20] binasher: We seem to be getting a lot of errors displayed to users relating to trying to get locks in memcached
[01:13:21] https://bugzilla.wikimedia.org/show_bug.cgi?id=43516
[01:13:49] well that was fail
[01:18:42] Reedy: could the procs that got the lock be throwing exceptions before unlocking or something?
[01:19:00] or do they just take a long time?
[01:19:29] Not sure
[01:19:41] Things seem to work on refresh
[01:24:03] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:27:51] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:28:57] New patchset: Rfaulk; "mod. update config info for E3 user metrics deployment." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51615
[01:31:51] PROBLEM - MySQL disk space on db9 is CRITICAL: DISK CRITICAL - free space: / 179 MB (2% inode=91%):
[01:35:25] What's using db9 now?
[01:36:45] bugzilla for some stuff? :/
[01:43:01] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa
[01:49:51] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:00:46] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[02:11:16] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours
[02:12:16] PROBLEM - Puppet freshness on mw1048 is CRITICAL: Puppet has not run in the last 10 hours
[02:12:16] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Puppet has not run in the last 10 hours
[02:16:05] New patchset: Dzahn; "turn bugzilla_report.php into template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51629
[02:18:27] New patchset: Dzahn; "turn bugzilla_report.php into template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51629
[02:20:22] New patchset: Dzahn; "turn bugzilla_report.php into template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51629
[02:21:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51629
[02:31:14] !log LocalisationUpdate completed (1.21wmf10) at Fri Mar 1 02:31:13 UTC 2013
[02:31:16] PROBLEM - Puppet freshness on cp1011 is CRITICAL: Puppet has not run in the last 10 hours
[02:32:53] New patchset: coren; "Support for autogenerated sudoers on tools project" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51631
[02:35:42] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours
[02:36:43] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours
[02:37:42] PROBLEM - Puppet freshness on db38 is CRITICAL: Puppet has not run in the last 10 hours
[02:38:04] New review: Faidon; "sudoers in a dotfile in a random place in the filesystem which may or may not be mounted?" [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/51631
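The varnishncsa PROCS alerts above are classic check_procs output; a check along these lines (the exact threshold is assumed from the "3 processes" recovery text) would produce them:

    /usr/lib/nagios/plugins/check_procs -c 3:3 -C varnishncsa
    # PROCS OK: 3 processes with command name 'varnishncsa'
    # anything other than exactly 3 matching processes goes CRITICAL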
[02:44:52] RECOVERY - MySQL disk space on db9 is OK: DISK OK
[02:45:16] Change abandoned: coren; "Not needed, after all." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51631
[03:19:28] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:21:08] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:27:58] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[03:28:18] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.004 second response time
[04:08:29] RECOVERY - Puppet freshness on tmh1 is OK: puppet ran at Fri Mar 1 04:08:19 UTC 2013
[04:08:29] RECOVERY - Puppet freshness on tmh1001 is OK: puppet ran at Fri Mar 1 04:08:25 UTC 2013
[04:09:09] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 04:08:59 UTC 2013
[04:09:09] RECOVERY - Puppet freshness on tmh1002 is OK: puppet ran at Fri Mar 1 04:09:06 UTC 2013
[04:09:21] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[04:09:49] RECOVERY - Puppet freshness on tmh2 is OK: puppet ran at Fri Mar 1 04:09:47 UTC 2013
[04:10:29] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 04:10:28 UTC 2013
[04:11:19] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[04:11:39] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 04:11:38 UTC 2013
[04:12:19] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[04:12:39] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 04:12:30 UTC 2013
[04:13:19] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[04:13:19] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 04:13:18 UTC 2013
[04:14:19] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[04:14:29] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 04:14:28 UTC 2013
[04:15:19] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[04:20:01] this is fun!
[04:20:21] (spence ^^)
[04:24:34] New patchset: Parent5446; "(bug 39380) Enabling secure login (HTTPS)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322
[04:48:53] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:49:03] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:57:43] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 1.359 second response time
[04:57:53] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[05:25:07] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:25:57] PROBLEM - SSH on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:26:47] RECOVERY - SSH on palladium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[05:26:57] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 637 bytes in 0.001 second response time
[06:22:43] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[06:29:43] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 06:29:35 UTC 2013
[06:29:44] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[06:29:53] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 06:29:52 UTC 2013
[06:30:44] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[06:30:44] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 06:30:38 UTC 2013
[06:31:43] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[06:32:03] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 06:31:55 UTC 2013
[06:32:45] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[06:39:03] PROBLEM - Varnish HTTP bits on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:39:53] RECOVERY - Varnish HTTP bits on strontium is OK: HTTP OK: HTTP/1.1 200 OK - 637 bytes in 0.002 second response time
[06:45:03] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 06:45:02 UTC 2013
[06:45:44] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[06:58:02] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[06:58:22] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[07:06:32] I skimmed scrollback.
[07:07:32] So the server admin log has gone missing.
[07:07:38] As has the bot.
[07:09:28] yes and it will likely stay that way til ryan sorts out the wikitech migration
[07:10:13] Susan: wikitech-old.wm.o/view/SAL
[07:10:41] apergos: I'm a little behind on wikitech-l.
[07:10:48] Was there anything there about this?
[07:10:54] yes, plenty
[07:11:20] how can there possibly be a connection refused from neon to neon? (icinga==neon. above)
[07:12:11] apergos: can you set an extended sticky downtime for puppet freshness on spence? (unless you know something about why it's flapping?)
[07:12:30] jeremyb_: No, there wasn't.
[07:12:33] LeslieCarr was talking today about possibly shutting spence down entirely today
[07:14:36] Susan: how do you figure?
[07:14:38] http://lists.wikimedia.org/pipermail/wikitech-l/2013-February/066948.html
[07:14:44] among others
[07:15:49] jeremyb_: Do you think it's reasonable to assume that if I've replied to a thread, I've read (or at least skimmed) the previous posts in that thread?
[07:16:34] 01 07:12:29 < Susan> jeremyb_: No, there wasn't.
[07:16:38] what does that mean?
[07:16:49] I was discussing apergos' comment.
[07:17:18] I don't know what happened with the migration. Neither does the list, apparently.
[07:17:34] the migration is still going
[07:17:49] the import is still running at this very minute
[07:18:02] Susan: you should search the history of this channel for !log
[07:18:13] jeremyb_: Why?
[07:18:23] apergos: Really? It's been running for hours? :-/
[07:18:27] What's it doing?
[07:18:52] importing wiki pages
[07:19:19] jeremyb_: I've found most of your answers this evening to be unhelpful.
[07:19:26] I'm not sure if this is intentional.
[07:19:26] importDump.php is what it's doing
[07:19:37] exactly that, importing wiki pages
[07:19:47] There are 906 content pages on wikitech.
[07:19:54] It takes 12 hours to import them?
[07:19:54] and a lot of revisions
[07:19:55] yeah, i thought i was pretty accurate :)
[07:20:03] yes. it takes longer than 12 hours apparently.
[07:20:18] All right.
[07:20:26] Well, I filed a bug about the SAL having gone missing.
[07:20:29] Susan: !log because it may be more enlightening than the mailing list which you claim has had no update at all. (which may be right)
[07:20:51] gah, don't have my bugzilla creds here :/
[07:21:16] I only claim that because it happens to be true.
[07:21:31] 57,000ish revisions.
[07:21:33] "which may be right"
[07:21:42] Which seems high.
[07:21:48] jeremyb_: You may be helpful.
[07:21:55] apparently Susan's bug is https://bugzilla.wikimedia.org/45595
[07:22:08] * jeremyb_ will CC later when at the other computer
[07:23:23] it's a 6.1G dump
[07:23:27] with 60k or so revisions
[07:23:37] I wouldn't have assumed it would have taken so long
[07:23:48] no, me neither
[07:24:03] it would almost have been worth doing it the other way:
[07:24:07] dump labsconsole
[07:24:12] truncate all tables
[07:24:20] import the wikitech dump and all tables
[07:24:33] now use importdump.php for the labsconsole wiki :-P
[07:24:46] not so easy to do
[07:25:22] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:25:30] does anyone bother reading email?
[07:25:39] the SAL is accessible at wikitech-old.wikimedia.org
[07:25:52] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:26:01] 01 07:10:13 < jeremyb_> Susan: wikitech-old.wm.o/view/SAL
[07:26:01] It wasn't clear to me that it takes half a day to import a wiki.
[07:26:16] all wikis are created equal!
[07:26:18] if you *really* need the stuff that's been logged since the migration started we have irc logd
[07:26:18] Still not really clear why that is.
[07:26:20] *logs
[07:26:39] meh. it's 6.1G of xml dumps
[07:26:42] it's a matter of things being logged in the meantime
[07:26:46] and 60k revisions
[07:26:50] we'll live without for a day
[07:26:54] I'm just hoping http://wikitech.wikimedia.org/view/SAL will work again someday. So I filed a bug. I'm not in a rush.
[07:26:59] apergos: as I mentioned in the email, I'm going to log them
[07:27:11] yes, I saw your email
[07:27:12] Susan: what do you mean work again someday?
[07:27:31] Susan: https://wikitech-old.wikimedia.org/view/SAL
[07:27:35] I mean "it's a matter of seeing the things that haven't been logged in the meantime'
[07:27:39] is that what you want?
[07:27:45] Susan: it's being announced several times, days in advance and today again
[07:27:56] Susan: and you still complain and file bugs?
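A quick way to sanity-check the page and revision counts being thrown around (the dump filename is assumed from the labswiki dump naming that appears later in this log):

    ls -lh labswiki-20130228.xml.gz                        # ~3.1G compressed, ~6.1G raw
    zcat labswiki-20130228.xml.gz | grep -c '<page>'       # page count
    zcat labswiki-20130228.xml.gz | grep -c '<revision>'   # ~57-60k expected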
[07:28:00] Ryan_Lane: http://wikitech.wikimedia.org/view/SAL isn't the server admin log.
[07:28:08] oh god
[07:28:15] Susan: don't be so dramatic
[07:28:32] I'm not burning my bras, J.
[07:28:50] paravoid: Susan's special :)
[07:29:11] Susan: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20130228.txt http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20130301.txt
[07:29:15] search for !log
[07:29:24] I skimmed scrollback. I saw.
[07:29:43] why does every little thing have to be such a giant deal?
[07:30:02] it's a bigger deal that we can't edit the docs right now
[07:30:14] I'm not sure why you think I'm invested in this.
[07:30:23] because you're bitching about it
[07:30:29] if you don't care, don't say anything
[07:30:37] while already being told both over mail and irc
[07:30:56] As I said, I didn't realize it took so long to import the dump.
[07:31:02] Which I guess is why the server admin log hasn't been pulled in?
[07:31:05] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho
[07:31:15] RECOVERY - MySQL disk space on neon is OK: DISK OK
[07:31:44] yes.
[07:31:47] yep
[07:31:50] Okay.
[07:32:12] I wouldn't think that it would need every revision to show it at all, but that's a possibility
[07:32:13] If I file a bug about morebots having gone missing, is that just going to piss people off as well?
[07:32:25] there's no need
[07:32:28] Okay.
[07:32:43] this was all detailed in the initial email
[07:33:12] I read the initial e-mail. I thought the import would take like an hour.
[07:33:25] well, based on 1600 pages, I thought it would too
[07:33:32] I didn't realize the dump was going to be 6GB
[07:34:12] that's 6gb of bz2 right?
[07:34:22] * apergos would boggle if it were 6gb of 7z
[07:34:31] I followed up a number of times saying it was going to take longer. I honestly thought it would be done by now, based on the progress it had made since the earlier one
[07:34:36] apergos: it's 3.1GB gz
[07:34:42] 6.1GB uncompressed
[07:34:44] Ryan_Lane: Did you only e-mail engineering-l?
[07:34:46] oh, so raw 6gb
[07:34:50] heck that's tiny :-D
[07:34:53] hm. did I?
[07:34:56] I looked at wikitech-l. I didn't see any follow-up.
[07:35:13] there's 2 followups to wikitech-l
[07:35:55] Thu, 28 Feb 2013 17:01:46 -0800 it went to wikitech-l
[07:36:12] same subject 'Wikitech migration is underway'
[07:36:25] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[07:36:35] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 0.002 second response time
[07:36:54] I feel really stupid, but I don't see that e-mail anywhere.
[07:36:57] http://lists.wikimedia.org/pipermail/wikitech-l/2013-February/date.html
[07:37:01] http://lists.wikimedia.org/pipermail/wikitech-l/2013-March/date.html
[07:37:22] * Ryan_Lane groans
[07:37:24] I was also looking for "merging" from the previous thread, but I don't see any e-mails with "migration" in them.
[07:37:33] jeremyb_: leslie was talking about shutting down spence yes but I had to restart icinga for unknown reason yesterday and asher expressed some concern about possibly missing monitoring checks so we'll likely keep it around for a little bit at least
[07:37:36] I really should have disabled the backups in the cron
[07:37:53] apergos: i didn't know about the restart
[07:38:03] New review: Nikerabbit; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51542
[07:38:07] yeah I logged it by hand on the... sal :-P
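Spelling out the "other way" migration apergos floated above (dump labsconsole, truncate, load the wikitech dump wholesale) in rough shell form; the database names and dump filenames are hypothetical and this is an untested sketch:

    mysqldump labswiki > labsconsole-backup.sql                            # 1. save labsconsole first
    mysql labswiki -e 'TRUNCATE page; TRUNCATE revision; TRUNCATE text;'   # 2. (plus the remaining tables)
    mysql labswiki < wikitech-alltables.sql                                # 3. load the wikitech SQL dump directly
    zcat labsconsole-pages.xml.gz | php maintenance/importDump.php --wiki=labswiki   # 4. importDump only the smaller wiki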
[07:38:10] it did an xml dump while doing the import
[07:38:18] it did? :-D
[07:38:28] :)
[07:38:31] ah priceless
[07:38:32] yep. I'm sure that didn't help at all
[07:39:01] http://lists.wikimedia.org/pipermail/labs-l/2013-February/000929.html
[07:39:04] http://lists.wikimedia.org/pipermail/labs-l/2013-February/000930.html
[07:39:07] http://lists.wikimedia.org/pipermail/labs-l/2013-March/000931.html
[07:39:09] Susan: ^
[07:39:45] Okay.
[07:39:54] I wonder why it didn't show up in wikitech-l archives?
[07:40:04] Thank you. I'm not subscribed to labs-l, so I completely missed those messages.
[07:40:08] I checked my sent label and it's in the recipient list
[07:40:20] right, i see it in the recipient list too
[07:40:24] It's not in pipermail and it never hit my inbox.
[07:40:32] so wait it's on lab-l and not the wikitech-l archive?
[07:40:34] Ryan_Lane: Thank you very much for sending those e-mails.
[07:40:37] so weird
[07:40:42] yw
[07:41:13] I'm sorry it seems like I'm being an asshole. I'm really not trying to be. The first thing I did was check wikitech-l for a status update.
[07:41:29] Anyway, got it now. (Thanks jeremyb_.)
[07:41:48] well wikitech-l should get that update somehow
[07:41:50] grr
[07:42:01] Susan: you should get a labsconsole account and subscribe to labs-l :) :)
[07:42:23] I have something like 5,000 unread wikitech-l e-mails. :-/
[07:42:31] well other people have mailed wikitech-l and it worked...
[07:43:34] I got one copy of those instead of two, my copy came via engineering
[07:43:37] meh
[07:43:42] heh
[07:44:49] well... care to try forwarding/sending one message with all the info?
[07:48:09] it's only *really* important for ops/deployers
[07:48:28] I have a pretty good feeling they got an update
[07:50:42] Clearly the answer is for me to be subscribed to the engineering list. ;-)
[07:51:00] Sorry again about the confusion. I didn't mean to poke the bear with that bug.
[07:52:45] or maybe the answer is not SMTP
[07:53:22] (no delivery guarantees)
[07:53:24] Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/51542/1/manifests/bots.pp
[07:54:00] Susan: no worries
[07:54:15] Nikerabbit: what about it?
[07:54:20] ah
[07:54:21] shit
[07:54:22] heh
[07:54:29] hah
[07:54:43] the commit msg was right!
[07:55:08] Nikerabbit: Nice catch. :-)
[07:55:15] indeed
[07:56:18] New patchset: Ryan Lane; "Fix typo in labs adminbot config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51639
[07:56:50] wonder if there are others (hope not!)
[08:01:01] I only updated a couple by hand
[08:01:06] then used a sed for the rest
[08:01:21] Susan: I sent another update to wikitech-l
[08:02:03] hm
[08:02:17] I haven't received a wikitech-l post on this email account in days
[08:02:23] I wonder if my emails are getting stuck in moderation
[08:04:00] maybe related to the phpmyadmin mess?
[08:04:07] wild guess
[08:04:13] unlikely
[08:04:13] hm. I think unrelated
[08:04:22] Ryan_Lane: check your spam folder
[08:05:02] not in there
[08:05:39] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[08:05:43] not being moderated
[08:05:48] weird
[08:05:54] did my email make it to wikitech-l?
[08:05:56] I just sent one
[08:05:59] no
[08:06:02] wtf
[08:07:48] I'll send it from my personal account
[08:07:51] I know that works
[08:08:09] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 08:07:59 UTC 2013
[08:08:19] here we go again
[08:08:24] jeremyb_: ?
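The "used a sed for the rest" bulk edit mentioned above would look roughly like this (the search pattern is hypothetical; the actual typo fixed in change 51639 isn't shown in this log):

    cd operations/puppet
    grep -rl 'labsconsole.wikimedia.org' manifests/ templates/ | \
        xargs sed -i 's/labsconsole\.wikimedia\.org/wikitech.wikimedia.org/g'
    git diff    # eyeball the result; a stray hit is exactly how typos like this slip in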
[08:08:42] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[08:08:43] Ryan_Lane: spence is going to flap a few times soon
[08:08:46] ah
[08:09:21] I got caught in wikitech-l archives. The year 2002 was pretty interesting.
[08:09:27] :D
[08:09:39] Ryan_Lane: so now where will the wikitech archive go?
[08:09:45] Ryan_Lane: just got it
[08:09:45] are there dumps?
[08:09:48] Nemo_bis: archive?
[08:09:53] I'm importing the full history
[08:09:55] http://lists.wikimedia.org/pipermail/wikitech-l/2013-March/067084.html worked.
[08:10:06] Ryan_Lane: oh, but that's from the other account
[08:10:15] Nemo_bis: this is wikitech.wikimedia.org I'm talking about
[08:10:23] Susan: he resent from a different email
[08:10:28] Ryan_Lane: the deleted pages
[08:10:41] gone forever
[08:10:50] that's not an acceptable answer
[08:10:53] which was the point of deleting them ;)
[08:11:05] not that we actually deleted many
[08:11:26] do xml dumps also import deleted pages?
[08:11:44] or export them, even?
[08:11:45] there were many deleted pages, in fact
[08:11:57] no it doesn't, that's why it needs to be done separately
[08:12:01] none that non-admins could see
[08:12:10] heh
[08:12:12] no they do not export deleted pages
[08:12:26] if ops decided to delete pages, they did so because it was complete crap
[08:12:32] Ryan_Lane: wrong
[08:12:53] the old wiki is still there
[08:13:01] is or will
[08:13:07] it won't be forever
[08:13:11] apergos: is there a dump of the true wikitech
[08:13:17] Ryan_Lane: you see
[08:13:34] bring it up on the list
[08:13:40] the dump won't have the deleted pages. someone would have to mysqldump the text and revision and page tables
[08:13:43] my opinion is that if we deleted it, there's no reason to save it
[08:14:15] others may have a different opinion and we'll save it forever
[08:14:30] Ryan_Lane: you obviously didn't even look at https://wikitech-old.wikimedia.org/index.php?title=Special:Log/delete&offset=&limit=500&type=delete&user=
[08:14:30] there's definitely some deleted stuff that should never come back or be saved
[08:14:50] "we'll save it forever" = provide a full dup somewhere?
[08:14:52] *dump
[08:16:15] almost all of the deleted pages are spam
[08:16:20] or double or broken redirects
[08:16:29] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 08:16:27 UTC 2013
[08:16:39] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[08:17:01] others are deleted with reason "Author request"
[08:17:21] Blatant censorship.
[08:17:24] :D
[08:17:58] Nemo_bis: Are you looking for a dump for Internet Archive or yourself or something else or?
[08:18:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51639
[08:19:22] Susan: whatever
[08:19:55] easiest and best for everyone would of course be importing the archive table, but Ryan_Lane doesn't want to
[08:20:24] well, I'd really rather not put stuff back out in the world that we've been asked to remove
[08:20:25] or spam
[08:20:28] it can just be published somewhere
[08:20:37] there's no spam in there
[08:20:40] nor libel
[08:20:44] there's a shit-ton of spam
[08:20:47] and nothing anyone "asked to delete"
[08:20:49] wrong
[08:20:55] it was just page-move vandalism
[08:21:01] 16:36, 27 May 2012 http://wikitech-old.wikimedia.org/view/User:Krinkle (http://wikitech-old.wikimedia.org/view/User_talk:Krinkle | http://wikitech-old.wikimedia.org/view/Special:Contributions/Krinkle) deleted page http://wikitech-old.wikimedia.org/edit/Talk:Switch_master?redlink=1 (Author request)
[08:21:24] that's himself
[08:21:38] again, bring it up on the list
[08:21:43] I already did
[08:21:47] you even answered me
[08:22:00] I did?
[08:22:20] apergos: can you just dump that table please?
[08:22:33] I don't want to waste everyone's time with useless discussions, it's a very simple thing to do
[08:22:46] Ryan_Lane: http://lists.wikimedia.org/pipermail/wikitech-l/2013-February/067003.html
[08:23:00] ah. right. indeed
[08:23:02] soooo is that only on the linode instance or is it somewhere else now?
[08:23:08] ah. bikeshedding.
[08:23:10] linode instance
[08:23:19] Nemo_bis: let me ask you a serious question
[08:23:22] yeah well I can't apparently get on there any more
[08:23:29] Nemo_bis: how do you think the enwiki community would respond to this request?
[08:24:10] Is the request to the en.wiki community providing a copy of en.wiki's archive table or wikitech's? :-)
[08:24:22] I would actually like a copy of the dataset1 page (now deleted) since it kept a record of kernel issues on that hardware
[08:24:24] were deleted pages of wikivoyage saved?
[08:24:59] apergos: undelete the page after the migration, and export that page, then
[08:25:09] ok but I can't get to the linode instance
[08:25:09] FWIW, I think moving the archive table isn't necessary. But I also think some of those server page deletions were a bit... :-/.
[08:25:19] "A Klee painting named 'Angelus Novus' shows an angel looking as though he is about to move away from something he is fixedly contemplating. His eyes are staring, his mouth is open, his wings are spread. This is how one pictures the angel of history. His face is turned toward the past. Where we perceive a chain of events, he sees one single catastrophe which keeps piling wreckage and hurls it in front of his feet.
[08:25:19] "The angel would like to stay, awaken the dead, and make whole what has been smashed. But a storm is blowing in from Paradise; it has got caught in his wings with such a violence that the angel can no longer close them. The storm irresistibly propels him into the future to which his back is turned, while the pile of debris before him grows skyward. This storm is what we call progress." (Walter Benjamin, On the Concept of History)
[08:25:25] Hi ori-l.
[08:26:07] apergos: I think Ryan means to export from wikitech-old.wikimedia.org.
[08:26:21] yeah, that would be it
[08:26:27] hey
[08:27:20] en wiki as a community would say 'huh we have monthly dumps with those old pages in em, if you need them get them from there'
[08:27:45] The archive table isn't publicly dumped.
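What "mysqldump the text and revision and page tables" would look like for preserving the deleted material (the db name is assumed; in this MediaWiki era, deleted revision metadata lives in the archive table, with the text rows still in text):

    mysqldump --single-transaction labswiki archive > labswiki-archive.sql
    mysqldump --single-transaction labswiki text revision page > labswiki-content-tables.sql
    # private snapshot only -- this is exactly the data that never goes in public dumps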
[08:27:48] nope
[08:27:58] And the en.wiki community would go ape-shit if someone suggested dumping it.
[08:28:03] indeed
[08:28:04] but pages that were around for awhile and recently deleted would be in an older dump
[08:28:05] As would WMF's legal team.
[08:28:22] dumping it --> dropping the table or making it public
[08:28:27] I wonder how hard it would be to have wikitech be dumped regularly and the dumps downloadable
[08:28:31] (as an aside)
[08:28:37] apergos: way ahead of you
[08:28:38] Either would be unimaginably dramatic.
[08:28:45] are you doing it?
[08:29:04] for instance: http://wikitech.wikimedia.org/dumps/labswiki-20130228.xml.gz
[08:29:16] I need to rename the dump name, but whatever
[08:29:34] we also need to deal with SAL
[08:29:39] because now the dumps are going to be 6GB
[08:29:44] does it have the other tables?
[08:29:53] which other tabled?
[08:29:54] I've got a mule and her name is Sal...
[08:29:55] tables*
[08:30:02] well that looks like it's xml
[08:30:05] yeah
[08:30:16] I'm not making the database dumps public
[08:30:18] just xml/images
[08:30:19] so the tables except for page/text/revision
[08:30:32] without of course the private tables
[08:30:45] xml dumps don't dump private tables, right?
[08:30:58] no
[08:31:07] then, no, no private tables ;)
[08:31:26] your questions were starting to worry me :D
[08:31:27] I'm asking about the non-private tables that are not page/text/revision
[08:31:39] what about them?
[08:32:01] otherwise someone who wants to set up eg a local copy must either rebuildall or importdump.php, which as we just discovered is horribly slow
[08:32:32] I plan on purging all the SAL revisions
[08:32:58] and setting up a new SAL system
[08:49:33] I am interested in the details when you are ready to discuss them (well I'm interested in the general plan, probably not the details)
[08:50:27] apergos: one idea is to use
[08:50:30] http://www.mediawiki.org/wiki/Extension:EventLogging
[08:51:39] so this means each entry is stored separately and no more giant text entries eh?
[08:51:44] yep
[08:51:55] and we'd have an interface to display the SAL
[08:52:02] well that was my next q
[08:52:26] if I want to look at last June cause I remember doing something in swift there (for example) there will be an easy way to get there?
[08:52:44] ideally we'd be able to put in a date range
[08:52:52] and it would show entries for that range
[08:53:03] maybe with some paging, if we're lucky :)
[08:56:33] that would work for me.
[09:02:40] it would be searchable? (full text I guess, I use that alot)
[09:02:58] um. no clue
[09:03:04] how do you actually search it now?
[09:03:16] i think it would be written by notryan
[09:03:27] aka hashar+ori
[09:03:31] mw search
[09:03:41] and you get realistic results from that? :D
[09:03:48] yes, believe it or not
[09:04:08] i usually search with ctrl-f (browser's find)
[09:04:16] I do that if I know the time frame
[09:04:33] (because I can load the archive of that 3 or 6 months up and check it)
[09:04:37] right. so once a year i go back to a previous archive
[09:05:06] not that i'm representative in any way
[09:07:03] idk what else to do for https://bugzilla.wikimedia.org/44777
[09:07:16] can i get a sanitized copy of kaulen's BZ conf somehow?
[09:07:54] (not going to use it tonight anyway so no rush. i was just rereading the patch)
[09:10:26] Antoine and I were discussing about getting it done next week..
[09:10:33] does importDump need to load an entire page (and all revisions) into memory before it imports them?
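The date-range lookup being wished for here, as a sketch against a hypothetical one-row-per-entry SAL store (the table and column names are entirely invented; EventLogging's actual schema would differ):

    mysql sal_db -e "
      SELECT entry_ts, nick, message
      FROM   sal_entries
      WHERE  entry_ts BETWEEN '2012-06-01' AND '2012-07-01'
        AND  message LIKE '%swift%'
      ORDER BY entry_ts;"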
[09:11:17] Reedy: that's a reply to me? ok...
[09:11:22] No
[09:11:23] when does everyone start going home?
[09:11:28] hah
[09:11:33] oh, you mean SAL?
[09:11:36] yeah
[09:11:49] well going home question still standds
[09:11:51] well, it's not actually importing anything reight now
[09:11:52] stands*
[09:11:53] Most of the bz files aren't readable
[09:11:59] to non roots
[09:12:01] right*
[09:12:07] revisions count isn't increasing
[09:12:14] As for going home. Daniel Kinzler goes home tomorrow..
[09:12:27] Reedy: hence i was assuming there would be some sanitizing involved :)
[09:12:50] For BZ stuff, I'd probably poke mutante
[09:12:58] k
[09:13:32] no afaik it doesn't
[09:13:38] lemme look at what it does, it's been ages
[09:14:38] ah
[09:14:43] it's doing site stats update
[09:14:48] and other things
[09:15:46] * jeremyb_ wonders if Ryan_Lane is selling popcorn to all the spectators :)
[09:16:37] well, this just looks really inefficient :)
[09:16:59] write(5, "c\0\0\0\3UPDATE /* SiteStatsUpdate::doUpdate 127.0.0.1 */ `site_stats` SET ss_total_edits=ss_total_edits+1", 103) = 103
[09:18:42] oh tell me it doesn't have to compute 'good pages' now from scratch
[09:18:49] but still that's "only" 6k pages
[09:18:56] what's good?
[09:19:10] used to be 'has [[ ]] in it'
[09:19:14] hm.
[09:19:22] in main namespace too
[09:19:35] well, ss_total_edits is total number of edits performed
[09:19:45] I'm sure it's doing good pages too
[09:19:50] figures
[09:20:04] where is that popcorn, I could use some right about now
[09:20:11] it's also updating the search index
[09:20:22] sounds like the rebuildall piece
[09:20:59] which means it's nearing the end?
[09:22:11] yes
[09:22:19] I can't find where it does that in here tbh
[09:22:20] well, except there's still missing pages
[09:22:29] so if it's not like that it would be...
[09:22:42] I mean in theory what it should do is update site stats after every page.
[09:22:44] oh, maybe that's the jobs being run
[09:23:01] jobqueue? do you have that going?
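The write() line quoted above is what you'd get from attaching strace to the import process, roughly:

    strace -p "$(pgrep -f importDump.php)" -e trace=write -s 200
    # -s 200 widens string capture enough to read the SQL inside the packets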
[09:23:02] hm
[09:23:03] no
[09:23:49] I guess I could actually look at the code
[09:23:53] SET ss_total_edits=ss_total_edits+1 see that looks like 'new page added' or 'new edit done'
[09:24:16] not any sort of rebuild afterwards
[09:26:44] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51612
[09:29:55] ah
[09:30:09] !log maxsem synchronized wmf-config/InitialiseSettings.php 'make upload endpoint for Commons local https://gerrit.wikimedia.org/r/#/c/51612/'
[09:30:13] seems that the importer will periodically update the stats
[09:30:28] when showing the report
[09:32:22] $page->doEditUpdates
[09:32:31] finally found it geez
[09:33:49] it seems kind of absurd to do an update for every single edit
[09:34:04] when it knows how many revisions it's imported
[09:34:10] parses the text, Update the links tables and other secondary data for each revision
[09:34:28] well to count good it has to render the revision
[09:34:58] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 182 seconds
[09:35:02] ugh
[09:35:04] to see if there is indeed a link in there
[09:35:17] and to update the links tables etc
[09:35:20] right
[09:35:37] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 208 seconds
[09:36:11] It might do that on every page instead of revision (it *should* do that just for current rev)
[09:36:20] I would have to look closer at the code
[09:38:00] anyways yes it's equivalent to shoving in just page/rev/text and then rebuild all, just about as painful
[09:38:11] * Ryan_Lane nods
[09:38:36] hungry. need calories
[09:38:59] but first...
[09:39:09] I need to go to bed :)
[09:40:54] are you subscribed to wikitech-l with your wikimedia addy?
[09:41:09] should be
[09:41:28] I have a wikitech-l label with email in it
[09:41:37] last update was yesterday
[09:42:05] no
[09:42:15] I see rlane gmail
[09:42:19] but not rlane wikimedia
[09:42:26] yeah
[09:42:46] apergos: but he sent mail 2 and 3 days ago from wikimedia and it went through
[09:43:46] I see none from rlane @ wm
[09:43:55] only from gmail that went to wikitech-l
[09:44:21] looking again...
[09:44:45] ryan you might wanna subscribe from that address and just nomail it
[09:45:41] ok. bedtime for me
[09:45:43] * Ryan_Lane waves
[09:46:43] night
[09:48:39] http://lists.wikimedia.org/pipermail/labs-l/2012-December/000573.html never showed up on wikitech-l
[09:48:56] also one of Susan's pet projects :)
[09:49:34] οη ςος
[09:49:35] er
[09:49:36] oh wow
[09:49:40] * jeremyb_ really needs to go to bed (worse than Ryan_Lane)
[09:49:45] git there
[09:49:46] bye
[09:49:49] night
[09:55:19] Action to take for postings from non-members for which no explicit action is defined. 'discard'
[09:55:38] that is why his emails didn't wind up in a moderator queue (and I guess we get so much spam that's justified)
[09:56:07] now that that mystery is resolved, I can go hunt for food
[09:57:13] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[10:14:53] New patchset: Silke Meyer; "Add Babel extension in Wikidata test repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51648
[10:24:27] so for now we just have no SAL at all?
[10:24:42] logmsgbot: who do you talk to?
[10:24:51] for right now we pretend to log in here...
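"Shoving in just page/rev/text and then rebuild all" has a supported-ish spelling: importDump.php has a --no-updates flag to skip the per-revision secondary-data updates (confirm the flag exists in the MediaWiki version at hand), with the rebuilds done once afterwards:

    zcat labswiki-20130228.xml.gz | php maintenance/importDump.php --no-updates --wiki=labswiki
    php maintenance/rebuildall.php --wiki=labswiki                # links tables etc., one pass
    php maintenance/initSiteStats.php --update --wiki=labswiki    # recount stats once instead of per edit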
[10:25:16] and ryan collects the entries on his next wake shift and stuffs em into the log (because by then import will be done)
[10:27:47] sounds scary
[10:28:03] gotta be done
[10:30:49] !log Hey Ryan!
[10:31:06] we'll toss the logspam of course :-P
[10:38:13] apergos, are you in Europe?
[10:38:18] yes
[10:56:01] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[11:02:01] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:06:07] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[11:18:46] grrrr went to update the swift deployment schedule.. of course can't >_<
[12:03:26] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 187 seconds
[12:04:06] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 197 seconds
[12:11:26] PROBLEM - Puppet freshness on knsq28 is CRITICAL: Puppet has not run in the last 10 hours
[12:12:26] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Puppet has not run in the last 10 hours
[12:12:26] PROBLEM - Puppet freshness on mw1048 is CRITICAL: Puppet has not run in the last 10 hours
[12:16:32] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[12:21:32] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 28 seconds
[12:22:02] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[12:31:32] PROBLEM - Puppet freshness on cp1011 is CRITICAL: Puppet has not run in the last 10 hours
[12:33:56] New review: Silke Meyer; "Working. Thanks!" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/50451
[12:36:32] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours
[12:37:32] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours
[12:38:32] PROBLEM - Puppet freshness on db38 is CRITICAL: Puppet has not run in the last 10 hours
[13:00:50] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[13:01:00] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[13:20:40] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:20:41] PROBLEM - Apache HTTP on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:20:41] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:20:41] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:20:41] PROBLEM - Apache HTTP on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:20:41] PROBLEM - Apache HTTP on mw1064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:20:50] PROBLEM - Apache HTTP on mw1102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:20:50] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:20:50] PROBLEM - Apache HTTP on mw1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:20:50] PROBLEM - Apache HTTP on mw1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:21:00] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:21:00] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:21:30] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time
[13:21:31] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.915 second response time
[13:21:31] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.182 second response time
[13:21:31] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.773 second response time
[13:21:41] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.095 second response time
[13:21:41] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.284 second response time
[13:21:41] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.496 second response time
[13:21:41] RECOVERY - Apache HTTP on mw1067 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.093 second response time
[13:21:41] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.676 second response time
[13:21:50] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.464 second response time
[13:21:50] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.789 second response time
[13:22:00] RECOVERY - Apache HTTP on mw1093 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.058 second response time
[13:32:53] RECOVERY - MySQL disk space on neon is OK: DISK OK
[13:33:03] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho
[13:36:33] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 1 seconds
[13:37:03] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[14:42:31] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[14:43:00] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host
[15:16:02] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho
[15:16:22] RECOVERY - MySQL disk space on neon is OK: DISK OK
[15:42:00] apergos, there's some problem with performance logging, eg https://gdash.wikimedia.org/dashboards/totalphp/ or https://gdash.wikimedia.org/dashboards/mobext/
[15:42:55] great... I wonder how any of that works :-/
[15:44:15] served from fenari
[15:45:11] mmm https://wikitech-old.wikimedia.org/view/Gdash
[15:45:49] ah it's not on the new one yet
[15:46:08] I had already made my way to graphite
[15:47:35] found in conf file:
[15:47:37] # this doesn't belong here, shh.
[15:48:02] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 185 seconds
[15:48:32] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 191 seconds
[15:50:12] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:51:04] that is so not true! wtf
[15:52:42] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:53:58] PROBLEM - Varnish HTTP bits on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:54:08] ooohhh
[15:54:10] Mar 1 15:37:37 professor kernel: [15121880.709440] ECC/ChipKill ECC error.
[15:54:20] Mar 1 15:37:37 professor kernel: [15121880.754278] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x1042087c0
[15:54:21] Mar 1 15:37:37 professor kernel: [15121880.821986] EDAC MC0: CE page 0x104208, offset 0x7c0, grain 0, syndrome 0x3884, row 0, channel 0, label "": amd64_edac
[15:54:22] bummer
[15:54:48] RECOVERY - Varnish HTTP bits on palladium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time
[15:55:08] that's from much later though
[15:55:48] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:56:09] dmesg kinda full of them though
[15:56:40] that Professor Kernel guy is a jerk, apparently:P
[15:56:57] pretty much
[15:57:09] uhhh
[15:57:18] I think I gotta report this and open a ticket
[15:57:47] in the meantime there might be some process that died the death that I could restart for now
[15:58:08] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.846 second response time
[16:08:44] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 16:08:30 UTC 2013
[16:09:09] PROBLEM - HTTP on fenari is CRITICAL: Connection timed out
[16:09:28] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[16:10:19] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 16:10:13 UTC 2013
[16:10:28] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[16:10:28] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time
[16:10:39] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[16:11:18] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 16:11:15 UTC 2013
[16:11:29] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[16:12:18] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 16:12:13 UTC 2013
[16:12:28] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[16:13:09] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 16:12:59 UTC 2013
[16:13:29] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[16:13:48] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 16:13:40 UTC 2013
[16:14:28] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[16:16:08] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.054 second response time
[16:19:49] ok well complete fail
[16:19:56] tried restarting gmond on carbon, no help
[16:20:19] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:20:40] since ganglia monitoring over there seems to have the problem
[16:21:08] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 5.625 second response time
[16:21:34] hm. looks like some pages didn't import?
[16:22:21] including the damn SAL
[16:22:58] Ryan_Lane: yay you are here
[16:23:14] ganglia for misc hosts pmtpa is apparently out to lunch
[16:23:18] I tried restarting gmond on carbon, this didn't help it
[16:23:34] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[16:23:37] I'm wading through puppet and not finding anything very quickly...
[16:23:47] (yeah some pages didn't make it in)
[16:25:31] why on carbon?
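Back to the ECC noise from professor: confirming the pattern before filing the hardware ticket is a one-liner, with edac-util as an optional refinement:

    dmesg | grep -iE 'edac|ecc' | tail -20    # how chatty, and which MC/row/channel
    edac-util -v    # per-DIMM error counters, if the edac-utils package is installed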
[16:25:42] is carbon the ganglia host now? [16:25:42] isn't that the aggregator for misc? [16:26:16] looks like it's one of them [16:26:26] uh huh [16:27:12] oh [16:27:14] and ms1004? [16:27:20] oh, those are eqiad [16:27:22] meh [16:27:33] want tampa [16:27:35] spence [16:27:47] oh no seriously? [16:28:34] trying over there [16:30:11] meh that did not help [16:30:48] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [16:32:01] re-running the import [16:32:23] oh man [16:32:57] meh. it'll go fast since it imported basically everything already [16:34:03] I cannot see anything in these dang logs that is at all relevant [16:34:14] (to ganglia misc being out) [16:34:38] Ryan_Lane: there's a bug for which Special:Import sometimes imports duplicate revisions, I hope it's not in the script to :) [16:34:44] *too [16:35:02] yeah, hopefully that's not the case ;) [16:35:07] it's nearly done right now [16:35:11] so I'd imagine it's not [16:35:41] Nemo_bis, Special:Import is not the best idea with this dump size;) [16:35:57] it wouldn't work at all, in fact [16:36:04] MaxSem: I know, but maybe the logic for timestamps etc. is the same [16:36:04] importDump.php is what I'm using [16:36:18] not that it's a great thing to use either [16:40:37] hahaha [16:40:43] no one would claim that [17:01:33] I am stuck [17:02:16] udp and/or tcp (I don't know which one I want here) on 8649 does not show traffic from the misc nodes on spence [17:03:40] udp shows me getting stuff from: owa2/3, ms1/2 and that's about it [17:03:53] oh a morebots!! welcome back [17:04:01] morebots: hi there [17:04:02] I am a logbot running on wikitech-static. [17:04:02] Messages are logged to labsconsole.wikimedia.org/wiki/Server_Admin_Log. [17:04:02] To log a message, type !log . [17:04:07] !log test [17:04:08] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [17:04:22] -_- [17:04:29] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [17:04:50] Ryan_Lane, don't tell me you had any hope ^_^ [17:05:11] :-D [17:05:18] I did because I tested this yesterday ;) [17:05:36] !log test [17:05:50] (, HTTPRedirectError('Only HTTP connections are supported', 'https://wikitech.wikimedia.org/w/api.php'), ) [17:05:52] bullshit [17:07:10] ah [17:07:15] wrong hostname [17:07:19] hahaha [17:07:28] that's going to kill us for a while isn't it [17:07:38] I tested it before I changed it ;) [17:07:38] probably, yes [17:07:40] !test [17:07:40] $1 [17:07:48] !log test [17:07:53] Logged the message, Master [17:07:54] -_- [17:07:56] \o/ [17:08:06] yay [17:14:33] ugh, now the lvs hosts are not showing new data [17:14:44] and I still have no idea what is wrong on spence [17:15:31] hi morebots!
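(On the "udp and/or tcp" question above: gmond sends metric traffic over UDP; TCP on 8649 is only used when gmetad or another poller pulls gmond's XML state. A minimal sketch for chasing this kind of ganglia silence, assuming the stock port and illustrative interface/host names:

    # on a misc node: is gmond actually sending?
    tcpdump -n -i eth0 udp port 8649
    # on the aggregator: is anything arriving from that node?
    tcpdump -n -i eth0 udp port 8649 and src host brewster
    # is the aggregator joined to the multicast group it should be listening on?
    ip maddr show dev eth0 | grep 239.192

If the senders and the listener disagree about the multicast group, the packets go out and are simply never heard, which is what the next finding suggests.)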
[17:25:38] brewster sends udp to 239.192.0.8 but spence listening as 239.192.0.1 apparently never sees it [17:25:52] (I guess same for the rest, that host was chosen at random) [17:31:11] <^demon> !log upgrading gerrit to 2.5.2-1506-g278aa9a on manganese/formey [17:31:16] Logged the message, Master [17:32:30] hey [17:32:38] PROBLEM - MySQL Replication Heartbeat on db1049 is CRITICAL: CRIT replication delay 277 seconds [17:33:18] PROBLEM - MySQL Slave Delay on db1049 is CRITICAL: CRIT replication delay 238 seconds [17:35:38] RECOVERY - MySQL Replication Heartbeat on db1049 is OK: OK replication delay 30 seconds [17:36:04] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [17:36:14] RECOVERY - MySQL Slave Delay on db1049 is OK: OK replication delay 2 seconds [17:36:25] RECOVERY - MySQL disk space on neon is OK: DISK OK [17:39:57] <^demon> !log rescanned all trackingids for gerrit on manganese [17:40:03] Logged the message, Master [17:46:54] * jeremyb_ fixes LeslieCarr's OTRS Tickets :) [18:02:35] New patchset: Hashar; "beta: syslog-ng on deployment-bastion host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51668 [18:06:45] ok so how is it that these work now: https://gdash.wikimedia.org/dashboards/totalphp/ [18:06:51] https://gdash.wikimedia.org/dashboards/mobext/ [18:12:45] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [18:13:38] also that message is a lie, puppet ran on spence 5 minutes ago [18:16:09] New patchset: Hashar; "beta: migrate udp2log to -bastion" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51669 [18:16:24] AaronSchulz: morning :-] Got a simple +2 for ya https://gerrit.wikimedia.org/r/51669 [18:16:26] Nagios is tainting it [18:17:24] hashar: does that even need a review? :) [18:17:24] apergos: I guess the icinga host does not receive the SNMP trap? Maybe some iptables rule is not up to date [18:17:30] AaronSchulz: probably not ahah [18:17:42] seems harmless [18:17:46] New review: Hashar; "Aaron says so" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/51669 [18:18:02] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51669 [18:19:03] grgrg [18:19:21] mark, I tried restarting gmond on spence. that didn't seem to do anything. when I tcpdump udp on 8649 on brewster (random misc host) I see packets going. when I do that on spence I don't see em from brewster, just [18:19:25] ms1/2, owa2/3 [18:19:34] at that point I didn't know what to do next [18:19:50] could not find anything in any log of course that was helpful to me [18:20:06] might be iptables [18:21:01] well I looked at that and it seemed to have accept all for the right ips [18:21:05] (on spence) [18:22:44] apergos: I wonder why mw1001 has more cpu use than the others...it's always seemed that way [18:23:20] job runner? 
[18:23:25] PROBLEM - Host knsq22 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:35] PROBLEM - Host maerlant is DOWN: PING CRITICAL - Packet loss = 100% [18:23:43] yes [18:23:55] RECOVERY - Host knsq22 is UP: PING WARNING - Packet loss = 93%, RTA = 103.74 ms [18:24:27] RECOVERY - Host maerlant is UP: PING WARNING - Packet loss = 37%, RTA = 95.37 ms [18:25:02] New patchset: Pyoungmeister; "adding hooper and tarin as pmtpa misc ganglia aggregators" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51670 [18:27:02] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51670 [18:29:00] mark: how hard would it be to get geoiplookup to give a redirect to foo.wikimedia.org/${countrycode} ? [18:32:33] New patchset: Ottomata; "Adding puppet Limn module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49710 [18:32:50] I didn't forget about you, just distracted by meeting setup [18:33:04] that is really weird about 1001 all right [18:34:39] New patchset: Pyoungmeister; "correction: hooper is external" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51671 [18:37:36] !log installing package upgrades on fenari [18:37:41] Logged the message, Master [18:37:55] so AaronSchulz when I actually look at the load (uptime) on mw1001 vs 1002 I see about the same load [18:38:15] the difference is the # of cpus [18:38:20] or cores or whatever [18:38:24] total cores on 1002: 24 [18:38:29] total cores on 1001: 12 [18:39:06] !log installing package upgrades on bast1001 [18:39:11] Logged the message, Master [18:39:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51671 [18:39:40] oh look the misc cluster is reporting now [18:39:49] still wish I knew what was borked on spence [18:42:40] # search disabled [18:42:40] search1000x: *tspart1 *tspart2 en-titles* de-titles* ja-titles* it-titles* sv-titles* pl-titles* pt-titles* es-titles* zh-titles* nl-titles* ru-titles* fr-titles* commonswiki.spell commonswiki.nspart1.hl commonswiki.nspart1 commonswiki.nspart2.hl commonswiki.nspart2 *.related jawiki.nspart1.hl jawiki.nspart2.hl zhwiki.nspart1.hl zhwiki.nspart2.hl [18:42:56] Anyone know why these are mapped to a non-existent host ? [18:43:14] notpeter? ^^ [18:43:17] (maybe) [18:43:38] Ok, I'll check with him. [18:44:11] I know there was some shuffling of stuff around yesterday to make searchpool4 be pmtpa only via a virtual.. lemme find that email actually [18:44:45] We essentially have a virtual search-pool5 shard, but with the spelling and highlight indices for pool4 and pool5 sharing the same servers. The pool4 wikis are using the new setup in pmtpa [18:44:52] New review: Brion VIBBER; "I don't have +2 in this repo, but it looks ok to me." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/51291 [18:45:09] that's all I know [18:45:58] xyzram: they were disabled when i got here :) [18:46:51] all righty then :-D [18:47:01] "legacy"... gotta love it [18:47:08] That's generating a huge number of exception stack traces in the logs: [18:47:17] 880406 today [18:47:31] on search16 [18:47:50] ohhhh, wow.
took me several days to figure out xyzram was ram [18:48:08] PROBLEM - Varnish HTTP bits on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:48:26] and I still don't know who jeremyb_ is :-) [18:48:48] PROBLEM - SSH on niobium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:48:49] Not staff ;) [18:49:00] well we've interacted on bugzilla i think [18:49:17] yes about using tabs vs. spaces. [18:49:28] i don't remember [18:50:00] New review: Demon; "You do now ;-)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51291 [18:50:08] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 3.41787657143 [18:50:23] ^demon's funny [18:51:08] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 2.71190478261 [18:51:20] Ryan_Lane: imagine importing something rally large [18:51:22] xyzram: yeah, if you want to create a real way to disable wikis, I'm all for it :) [18:51:22] *really [18:51:30] we had someone trying en wp current revs only [18:51:34] (ginormous) [18:51:37] * Ryan_Lane shudders [18:51:41] heh heh [18:54:26] I swear every time I do file migrations I keep seeing things with "Jewish Cemetery" in them [18:54:38] at least for s4 [18:54:54] weird [18:55:12] jeremyb_: oh noes, fixing tickets ? [18:55:21] jeremyb_: that sounds like then i have to do tickets! eep! [18:56:08] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [18:56:18] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [18:58:17] LeslieCarr: no, it's just when you mail then it says it's from you. so fixing it so the "customer" is not leslie. and then also dealing with some [18:58:22] oh okay [18:58:23] :) [18:58:25] yay [18:58:32] !log apergos: restarted icinga (not sure why it stopped, log file simply said 'caught TERM signal', in any case it complained that Error: Could not create external command file '/var/lib/nagios/rw/nagios.cmd' as named pipe ) [18:58:37] Logged the message, Master [18:58:48] !log reedy synchronized php-1.21wmf10/extensions/WikimediaMaintenance [18:58:54] Logged the message, Master [18:58:57] !log reedy synchronized php-1.21wmf10/cache/interwiki.cdb 'Updating 1.21wmf10 interwiki cache' [19:02:02] — [18:59:02] Logged the message, Master [18:59:03] !log reedy synchronized php-1.21wmf10/cache/interwiki.cdb 'Updating 1.21wmf10 interwiki cache' [18:59:09] Logged the message, Master [18:59:17] !log binasher: rebuilding db35 from a hotbackup of db55 (s5); testing a new build of mariadb 5.5.29 [18:59:19] !log restart nagios-nrpe,gmond,apache,salt-minion on bastion hosts [18:59:22] Logged the message, Master [18:59:27] Logged the message, Master [18:59:47] !log binasher: resharding search-pool4 in pmtpa - restarted lsearchd on all local nodes and indexers on searchidx2 [18:59:52] Logged the message, Master [18:59:53] !log disabled editing on wikitech.wikimedia.org [18:59:59] Logged the message, Master [19:00:01] !log changing A record for wikitech.wm.o and changing labsconsole.wm.o into a cname to wikitech [19:00:06] Logged the message, Master [19:00:13] ttps://gerrit.wikimedia.org/r/51542 [19:00:13] [21:12:16] [19:00:16] bleh [19:00:22] !log olivneh synchronized php-1.21wmf10/extensions/EventLogging 'Updates to JavaScript API' [19:00:27] Logged the message, Master [19:00:37] !log mutante: create missing dump directory on streber, enables "RT-shredder" plugin which comes with RT since 3.8 [19:00:43] Logged the message, Master [19:00:45] !log olivneh synchronized
php-1.21wmf10/extensions/GuidedTour 'Update to split test (1/3)' [19:00:50] Logged the message, Master [19:00:51] !log olivneh synchronized php-1.21wmf10/extensions/E3Experiments 'Update to split test (2/3)' [19:00:55] Logged the message, Master [19:00:57] !log olivneh synchronized php-1.21wmf10/extensions/GettingStarted 'Update to split test (3/3)' [19:01:03] Logged the message, Master [19:01:14] !log Jeff_Green: stopped gmetad on neon for testing [19:01:19] Logged the message, Master [19:01:22] !log tstarling synchronized wmf-config/InitialiseSettings.php [19:01:27] Logged the message, Master [19:01:38] RECOVERY - SSH on niobium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:01:39] !log notpeter: upgrading mariadb boxes to 5.5.29 [19:01:44] Logged the message, Master [19:01:51] !log asher synchronized wmf-config/lucene.php 'sending search-pool4 traffic to pmtpa, where the index/host distribution has been rebalanced' [19:01:57] Logged the message, Master [19:01:58] RECOVERY - Varnish HTTP bits on niobium is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [19:02:03] !log LeslieCarr: powering off spence [19:02:09] Logged the message, Master [19:02:14] !log pgehres synchronized php-1.21wmf10/extensions/DonationInterface/ 'Updating DonationInterface-langonly' [19:02:19] Logged the message, Master [19:02:20] sbernardin: It was db29 that died right? [19:02:26] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1043 and shifting watchlist q's to db1050' [19:02:30] !log powering back on spence [19:02:31] Logged the message, Master [19:02:36] Logged the message, Mistress of the network gear. [19:02:38] (since i forgot to "log" that when i did that) [19:02:40] hehehe [19:02:41] !log binasher: shutting down mysql on db1043, converting to mariadb [19:02:52] Logged the message, Master [19:02:53] !log pgehres Started syncing Wikimedia installation... : Updating i18n for DonationInterface-langonly [19:02:58] Logged the message, Master [19:03:03] !log pgehres Finished syncing Wikimedia installation... : Updating i18n for DonationInterface-langonly [19:03:08] Logged the message, Master [19:03:13] !log restarted glusterd service on labstore1-4 [19:03:16] ah while you are there, do you have any idea why spence would have no longer been receiving the ganglia data from the misc cluster? [19:03:18] Logged the message, Master [19:03:25] !log dzahn synchronized ./docroot/wikivoyage.org/mywotf83102384bfcd1e52152.html [19:03:28] I mean folks added other hosts for monitoring now so it's not broken [19:03:31] Logged the message, Master [19:03:37] but it bugs me that I don't know why it broke [19:03:38] !log db9 - zipping old sql dumps of drupal and civicrm to prevent full disk [19:03:40] !log csteipp synchronized php-1.21wmf10/includes [19:03:43] Logged the message, Master [19:03:43] not sure - though should spence have been getting the data ? [19:03:45] LeslieCarr: [19:03:47] Logged the message, Master [19:03:48] was it for ganglios monitoring ? [19:03:48] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:03:49] !log LocalisationUpdate completed (1.21wmf10) at Fri Mar 1 02:31:13 UTC 2013 [19:03:55] Logged the message, Master [19:03:57] well yeah it was the aggregator for misc [19:04:11] oh !
[19:04:12] hehe [19:04:13] !log maxsem synchronized wmf-config/InitialiseSettings.php 'make upload endpoint for Commons local https://gerrit.wikimedia.org/r/#/c/51612/' [19:04:14] and when it stopped working right there was no monitoring [19:04:18] oh?? [19:04:19] Logged the message, Master [19:04:24] that it was the aggregator [19:04:28] PROBLEM - SSH on arsenic is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:04:30] i should have fixed that [19:04:37] and that's that [19:04:41] mark ^^ arsenic [19:04:44] yay ryan [19:04:48] !log create missing backup dir on streber to enable RT_Shredder plugin [19:04:53] Logged the message, Master [19:05:48] bleh, was db27 [19:06:48] !log install package upgrades on streber [19:06:53] Logged the message, Master [19:06:53] New patchset: Ottomata; "Adding puppet Limn module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49710 [19:06:55] were there local images on wikitech? [19:07:05] apergos: local images? [19:07:17] images uploaded there [19:07:20] all images were local i suppose [19:07:20] yeah [19:07:21] lots [19:07:25] ah you got em [19:07:28] yep [19:07:32] sweet [19:08:22] RECOVERY - SSH on arsenic is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:08:38] RECOVERY - Varnish HTTP bits on arsenic is OK: HTTP OK: HTTP/1.1 200 OK - 635 bytes in 0.001 second response time [19:08:52] RobH: it was db27 [19:09:14] paravoid: I believe this can be closed, yes? https://rt.wikimedia.org/Ticket/Display.html?id=4183 [19:09:38] (1) isn't done yet [19:09:47] hhhmmm, ok [19:09:53] I have the patch in a branch at home, just needs deploying [19:10:14] I'm taking it [19:10:20] (the ticket) [19:10:48] ok, cool [19:12:08] New review: Brion VIBBER; "Woohoo! Merged." [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/51291 [19:12:09] Change merged: Brion VIBBER; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51291 [19:13:30] !log install package upgrades on marmontel [19:13:35] Logged the message, Master [19:15:26] New patchset: Hashar; "adapts lucene classes for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51677 [19:15:39] <^demon> notpeter: So, it looks like the indices are corrupted for napwiki and vecwikisource. [19:16:08] :((((( [19:16:17] <^demon> "Error sending index update records of napwiki to indexer at searchidx1001" [19:16:18] <^demon> ... [19:16:25] <^demon> org.apache.lucene.index.CorruptIndexException: failed to locate current segments_N file [19:16:48] those wikis sound small [19:17:05] <^demon> napwiki has 600k revs across 26k total pages. [19:17:09] <^demon> So not the tiniest. [19:17:18] bigger than wikitech [19:17:31] our new favorite benchmark [19:17:42] <^demon> "Oh, that's not bad...wikitech was worse" ;-) [19:18:38] ugh [19:24:16] search doesn't work on the new wikitech, is it known? [19:24:51] at least prefix search [19:26:03] ^demon: I'll trash them and re-import [19:26:11] can you shoot me a ticket so that I don't forget? [19:26:18] <^demon> RT? Sure. [19:27:59] !log maxsem synchronized wmf-config/mobile.php 'https://gerrit.wikimedia.org/r/#/c/51291/' [19:28:04] Logged the message, Master [19:29:14] <^demon> notpeter: rt 4622. [19:29:17] New patchset: Hashar; "adapts lucene classes for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51677 [19:30:21] ^demon: thanks! [19:30:34] <^demon> yw, and thank you.
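(Since the wikitech import keeps coming up: the path in use is the maintenance script, not Special:Import. A minimal sketch of the invocation, run from the MediaWiki install directory, with a hypothetical dump filename:

    # stream the XML dump in, reporting progress every 1000 revisions
    php maintenance/importDump.php --report=1000 wikitech-dump.xml
    # rebuild derived tables afterwards so imported pages show up properly
    php maintenance/rebuildrecentchanges.php

importDump.php reads the dump as a stream rather than through a web upload, which is why it copes with dump sizes where, as MaxSem says above, Special:Import wouldn't work at all; it is still slow, and as apergos notes, not a great thing to use either.)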
[19:31:32] !log DNS update - add CNAME for status.wp/status.wm [19:31:37] Logged the message, Master [19:36:09] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51609 [19:36:37] mutante: never liked the status.wp idea [19:36:48] every project is going to ask for status.foo :) [19:37:18] !log asher synchronized wmf-config/db-eqiad.php 'returning db1043 to s1 watchlist service' [19:37:23] Logged the message, Master [19:37:33] Ryan_Lane: also I have readded my ugly hack to get a central syslog on beta. syslog-ng and rsyslog conflicts :/ I already brought it up a few months ago but thought maybe we could reconsider it https://gerrit.wikimedia.org/r/#/c/51668/ [19:39:29] paravoid: hmm.. seemed like a very common typo.. but i see that point too..hmm [19:39:57] ok, convinced. let's not do it [19:40:04] I don't care that much :) [19:44:24] New patchset: Hashar; "adapts lucene classes for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51677 [19:45:02] mutante: yeah lets stick to status.wikimedia.org :-] [19:45:08] New patchset: Tim Starling; "Set favicon for donatewiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51683 [19:45:09] mutante: people will just have to remember about it :-] [19:46:17] hashar: ok, if you also agree then i'm reverting it all, and also removing the Apache redirect.. you requested it [19:46:31] ;-D [19:46:34] and it's been a while ago.. also just because cleaning up RT [19:46:49] are you guys having a RT sprint? [19:46:55] would be happy to comment on my pending RT [19:47:18] we did yesterday [19:47:29] hashar: 1449 [19:47:46] !rt 1449 [19:47:47] http://rt.wikimedia.org/Ticket/Display.html?id=1449 [19:48:05] (still haven't found out how to find a ticket in the RT interface haha) [19:48:06] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51683 [19:48:40] TimStarling: here is the updated favicon for donatewiki :-] [19:48:44] hashar: i'll show you :) [19:48:49] mutante: paravoid: and status.wikipedia then needs an HTTPS everywhere update :) [19:49:06] New patchset: Andrew Bogott; "Refactor mediawiki::singlenode and wikidata::singlenode into modules." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50451 [19:49:30] andrewbogott: I LOVE YOU! [19:49:31] hashar: mooooaar modules [19:49:35] hashar: I'm about to merge it; that was just a rebase :) [19:49:51] hashar: thanks for reminding me of the bot feature :) [19:49:54] !rt 1449 | mutante [19:49:54] mutante: http://rt.wikimedia.org/Ticket/Display.html?id=1449 [19:51:45] New patchset: Ottomata; "Adding puppet Limn module."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/49710 [19:51:54] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50451 [19:52:23] New patchset: Tim Starling; "Make the parser cache expire at 30 days again" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51684 [19:53:31] New patchset: Cmjohnson; "decommissioning db27" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51685 [19:53:52] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/51684 [19:54:22] notpeter: You now have free rein to murder mediawiki.pp [19:54:44] (There's still some stuff in there but I don't know what it's for) [19:55:21] !log tstarling synchronized wmf-config/InitialiseSettings.php 'parser cache max age 30d' [19:55:21] !log dzahn synchronized ./docroot/wikivoyage.org/block_ab2c3.html [19:55:26] Logged the message, Master [19:55:31] Logged the message, Master [19:56:23] andrewbogott: thanks! [19:57:17] TimStarling: change of heart? [19:59:02] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 253 seconds [19:59:12] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [19:59:32] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [19:59:50] New patchset: Hashar; "adapts lucene classes for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51677 [20:01:37] !log dzahn synchronized ./docroot/wikivoyage.org/block_8708f.html [20:01:43] Logged the message, Master [20:01:57] Change abandoned: Cmjohnson; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51685 [20:02:32] !log blocked en.wikivoyage from bugmenot [20:02:36] Logged the message, Master [20:02:37] jeremyb_: Your site has now been blocked from the bugmenot system. We reserve the right to unblock the site if this system has been abused in any manner.
[20:02:40] it works [20:03:02] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [20:05:13] mutante: wikidata [20:06:14] that seems like everything [20:06:21] (looked at sitematrix) [20:08:52] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 20:08:47 UTC 2013 [20:09:02] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [20:09:22] RECOVERY - Puppet freshness on knsq28 is OK: puppet ran at Fri Mar 1 20:09:19 UTC 2013 [20:10:02] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 20:09:55 UTC 2013 [20:10:03] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [20:11:03] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 20:10:53 UTC 2013 [20:11:03] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [20:11:22] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 20:11:16 UTC 2013 [20:12:02] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [20:12:32] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 20:12:28 UTC 2013 [20:13:02] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [20:13:14] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Mar 1 20:13:05 UTC 2013 [20:14:02] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [20:17:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:17:52] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 187 seconds [20:18:02] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 189 seconds [20:18:05] lol [20:19:37] geez [20:20:07] icinga-wm, you're stupid [20:20:16] conflicted I'd say [20:20:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.420 second response time [20:20:33] icinga-wm doesn't even get voice now :D [20:20:52] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:21:02] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 1 seconds [20:25:22] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 182 seconds [20:25:32] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 184 seconds [20:32:21] New patchset: Hashar; "adapts lucene classes for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51677 [20:33:05] hashar, will Lucene on beta melt down just like the production one?:} [20:33:29] RECOVERY - MySQL disk space on neon is OK: DISK OK [20:33:29] RECOVERY - ircecho_service_running on neon is OK: PROCS OK: 3 processes with args ircecho [20:33:39] MaxSem: yeah that is the plan :-] [20:33:57] MaxSem: xyzram is working on the java side to make it suck less [20:34:11] MaxSem: and peter is highly interested in having some proper monitoring of the search boxes [20:34:17] which needs a ganglia plugin probably [20:34:22] as well as a nagios check_url plugin [20:34:28] might sprint that next week [20:34:41] Just port it to solr and use my plugin;) [20:34:47] yeah [20:34:55] so we had a discussion about solr / elasticsearch [20:35:00] guess they will set up both [20:35:06] and find out which one should be picked [20:35:10] and choose the winner? [20:35:22] why I wasn't invited, BTW? 
[20:35:23] or stick to lucene *grin* [20:35:33] that was some random talk in the office [20:35:43] feel free to bring up the subject on ops list [20:35:52] or get CT to make you the product manager for it heheh [20:35:57] 1) I'm not subscribed to it [20:36:14] 1) ask to be subscribed hehe [20:36:28] 2) there are talks about offloading my stuff on ^demon|away, hehe [20:36:35] yeah [20:36:53] the Java guys sub team in platform engineering :-] [20:37:04] we are going to take over the whole engineering department [20:37:41] * MaxSem euthanizes hashar before he becomes the evil overlord [20:37:47] New patchset: Pyoungmeister; "migration to mediawiki module done!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51691 [20:38:13] MaxSem: request denied. [20:38:56] New patchset: Pyoungmeister; "migration to mediawiki module done!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51691 [20:41:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51691 [20:45:26] notpeter: on search box, isn't /a/ supposed to be a mount ? [20:46:43] nm [20:46:44] found out [20:46:55] I can connect to them and that is indeed at the root [20:46:57] grbmblb ;-] [20:47:01] mooooaar hacks [20:56:29] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [20:58:10] hashar: ok :) [21:00:29] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 19 seconds [21:00:39] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 7 seconds [21:01:42] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [21:02:02] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [21:04:20] hey andre__ [21:04:36] hi jeremyb_ [21:04:56] New patchset: Pyoungmeister; "rename mediawiki_new to mediawiki" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51694 [21:05:04] \o/ [21:07:40] jeremyb_: Re attachment forward issue: Would you like some info about the Bugzilla conf, and what I've tried to find out in the code so far? Wondering if email might be best for this [21:08:38] phone... [21:10:20] New patchset: Cmjohnson; "decom'ing db27" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51696 [21:10:24] Hey notpeter, would the query in this comment be sane to run on meta: https://bugzilla.wikimedia.org/show_bug.cgi?id=43913#c3 [21:11:51] (no rush... just trying to see if we should write a script to do it in batches, or if we can do it all in one query) [21:12:23] andre__: you have something written already? [21:12:27] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51696 [21:12:46] andre__: you could mail or RT [21:13:12] lemme take a look [21:17:37] New patchset: Dzahn; "remove status.wp redirect (RT-1449)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/51702 [21:18:11] heh [21:23:42] mark: did you see my geoip question? [21:23:52] 01 18:28:59 < jeremyb_> mark: how hard would it be to get geoiplookup to give a redirect to foo.wikimedia.org/${countrycode} ?
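(jeremyb_'s geoiplookup question is, at its simplest, a country lookup plus a Location header. A hypothetical CGI-style sketch using the stock geoiplookup CLI; foo.wikimedia.org is his placeholder from the question, and this is not how any production service here is actually implemented:

    #!/bin/sh
    # REMOTE_ADDR is set by the web server; pull out the two-letter country code
    CC=$(geoiplookup "$REMOTE_ADDR" | sed -n 's/^GeoIP Country Edition: \([A-Z][A-Z]\),.*/\1/p')
    # 302 to the per-country URL, with an invented default when the lookup fails
    printf 'Status: 302 Found\r\nLocation: https://foo.wikimedia.org/%s\r\n\r\n' "${CC:-US}"

In practice this would more likely live in the cache layer, since both varnish and nginx have GeoIP hooks, rather than forking a process per request.)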
[21:24:46] jeremyb_, sent [21:24:46] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/51702 [21:26:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51694 [21:30:40] andre__: irc should be #mozwebtools @ irc.mozilla.org [21:30:43] unless it's changed [21:30:51] that was 2-3 years ago [21:31:22] it's #bugzilla now [21:31:26] I know, I lurk there. [21:39:49] New patchset: Pyoungmeister; "fixing the rest of the shit I forgot..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51779 [21:42:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51779 [21:43:42] New patchset: Lcarr; "Adding in nagios monitor classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51780 [21:45:05] andre__: channel was just responding to the mail :) [21:47:03] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51780 [21:47:09] didn't get that, sorry [21:50:39] New patchset: Asher; "mariadb query optimizer tuning" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51781 [21:51:33] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51781 [21:52:32] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [21:56:16] !log taking wtp1001 down to replace w/new box [21:56:21] Logged the message, Master [21:56:46] roankattouw ^ [21:57:09] cmjohnson1: OK cool [21:57:23] cmjohnson1: It's all depooled and ready to go down. Lemme know when the new one is up and I'll spin it up [21:57:31] ok..cool [21:59:43] PROBLEM - Host wtp1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:01:24] New patchset: Cmjohnson; "adding new mac address for wtp1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51783 [22:04:03] notpeter: https://bugzilla.wikimedia.org/show_bug.cgi?id=45619 [22:04:03] woo [22:04:41] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51783 [22:07:52] does anyone please know where /a is defined in puppet? [22:08:35] seems to be in site.pp :-] [22:09:16] andre__: oh, you wrote trying to find out on the upstream channel. i thought you were trying to find out what the channel was. :P [22:09:35] andre__: will finish reading sometime later [22:11:59] New patchset: Hashar; "adapts lucene classes for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51677 [22:12:10] !log DNS update - remove status cname [22:12:15] Logged the message, Master [22:13:03] PROBLEM - Puppet freshness on mw1048 is CRITICAL: Puppet has not run in the last 10 hours [22:13:03] PROBLEM - Puppet freshness on mw1050 is CRITICAL: Puppet has not run in the last 10 hours [22:14:28] PROBLEM - Puppet freshness on db27 is CRITICAL: Puppet has not run in the last 10 hours [22:14:59] andre__: got the list of bz ticket numbers ? [22:15:17] !log auth-dns update [22:15:22] Logged the message, Master [22:15:50] mutante, of what? [22:16:01] need context [22:16:02] andre__: do you use whines? 
[22:16:05] jeremyb_, no [22:16:27] hrmmm, ok :) [22:17:02] andre__: the ones we identified in the RT-BZ triaging meeting [22:17:47] New patchset: Asher; "making eqiad search pool4 awesome" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51785 [22:18:46] mutante: either check your emails from 3 hours ago, or take: https://bugzilla.wikimedia.org/buglist.cgi?bug_id=42234,45343,43628 [22:19:20] andre__: got the email! tyvm [22:19:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51785 [22:20:21] notpeter: https://bugzilla.wikimedia.org/show_bug.cgi?id=42234 [22:22:08] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 182 seconds [22:22:09] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 183 seconds [22:22:49] New review: Andrew Bogott; "Looks OK, but will need to be moved to play nice with the new singlenode module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51648 [22:26:56] PROBLEM - Host search-pool5.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [22:28:06] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 10 seconds [22:28:06] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 11 seconds [22:31:56] PROBLEM - Puppet freshness on cp1011 is CRITICAL: Puppet has not run in the last 10 hours [22:33:55] !log authdns update [22:34:00] Logged the message, Master [22:36:56] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [22:37:56] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [22:38:56] PROBLEM - Puppet freshness on db38 is CRITICAL: Puppet has not run in the last 10 hours [22:42:17] New patchset: Jgreen; "add lcarr to icinga fundraising notification group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51790 [22:42:27] -1000! [22:44:54] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51790 [22:45:59] LeslieCarr: fundraising is so awesome that you get 8 copies of each failure [22:46:16] assuming they are not related to a specific box [22:46:22] New patchset: Pyoungmeister; "one more fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51791 [22:46:31] noooooes! [22:46:49] i need to make a time machine and warn my past self about saying yes to this! [22:47:32] LeslieCarr: or you and Jeff_Green can figure out service groups [22:47:46] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51791 [22:49:07] !log Depooled wtp1004 so Gabriel can use it for benchmarking [22:49:12] Logged the message, Mr. Obvious [22:49:33] New patchset: Dzahn; "move all monitoring related stuff into new monitoring directory and include that in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51792 [22:51:05] PROBLEM - Puppet freshness on db27 is CRITICAL: Puppet has not run in the last 10 hours [22:51:50] !log Temporarily installing apachebench on wtp1004 per Gabriel's request [22:51:55] Logged the message, Mr. Obvious [22:53:24] !log Make that apache2-utils, that's the actual name of the package [22:53:25] Logged the message, Mr. 
Obvious [22:54:32] New review: Dzahn; "hashar, check out this jenkins failure" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51792 [22:55:01] mutante: 22:49:42 err: Could not parse for environment production: No file(s) found for import of 'nagios.pp' at /var/lib/jenkins/jobs/operations-puppet-validate/workspace/manifests/base.pp:8 [22:55:01] [22:55:10] mutante: someone moved the nagios.pp file maybe? [22:55:22] yes, that is what the change is about [22:55:31] i renamed files, and then jenkins can't find them [22:55:31] mutante: or base.pp is included before nagios.pp (in site.pp) [22:55:44] hmm [22:56:38] hashar: https://gerrit.wikimedia.org/r/#/c/50936/ :) [22:56:55] New patchset: Pyoungmeister; "adding db69 and db70 as pmtpa enwiki slaves" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51795 [22:59:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51795 [23:02:00] mutante: hmm [23:02:11] mutante: so yeah in base.pp you import nagios.pp [23:02:17] should be import misc/nagios.pp [23:02:19] (I think) [23:02:52] New review: Hashar; "Base.pp should probably get updated too:" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/51792 [23:03:04] New patchset: Andrew Bogott; "Added optional wiki_name setting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51796 [23:03:05] New patchset: Andrew Bogott; "Fix up the mediawiki extension class a bit." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51797 [23:03:05] New patchset: Andrew Bogott; "Switch the openstack manifest to use webserver::php5." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51798 [23:03:05] New patchset: Andrew Bogott; "Rearrange webserver dependencies so that openstack and mediawiki::singlenode classes can play nice together." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51799 [23:26:38] New patchset: Hashar; "adapts lucene classes for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51677 [23:28:35] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [23:29:05] PROBLEM - ircecho_service_running on neon is CRITICAL: Connection refused by host [23:29:36] !log dropping limesurvey from db9, backup in db9:/a/backup_dumps [23:29:41] Logged the message, notpeter [23:30:17] YAY [23:30:19] die shitty software [23:30:28] \o/ [23:30:54] :D [23:34:18] New patchset: Dzahn; "move all monitoring related stuff into new monitoring directory and include that in site.pp and remove ./misc/nagios.pp (merge with nagios.pp)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51792 [23:34:44] hashar: duh, thanks, we had nagios.pp AND misc/nagios.pp [23:34:55] with just 2 monitoring group definitions in it [23:36:48] New patchset: Demon; "wikiversions.cdb for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47564 [23:38:59] yeah you are welcome :-] [23:39:04] maybe I should apply for an ops position [23:39:11] :-D [23:39:18] yeah [23:39:25] mutante: these tiny errors are really cumbersome, I hate them.
[23:39:31] usually take me half an hour to spot them [23:39:42] and whenever I show it to someone else, he will instantly find the answer :( [23:39:44] crazy [23:39:52] like semicolons versus colons [23:39:53] that is how it works :p [23:40:00] with some fonts you can barely tell the difference [23:41:25] yeah, and now i have fatal: Unable to read current working directory: No such file or directory [23:41:34] but yeah:) fixing [23:42:12] New patchset: Dzahn; "move all monitoring related stuff into new monitoring directory and include that in site.pp and remove ./misc/nagios.pp (merge with nagios.pp)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51792 [23:46:14] heya hashar, you there? [23:52:49] jeremyb_: around? [23:58:57] !log authdns update [23:59:02] Logged the message, Master
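(Closing the loop on the nagios.pp parse failure above: the change hashar suggests, hedged with his own "(I think)", is a one-line path fix in manifests/base.pp, since puppet's old import statement resolves paths relative to the importing file:

    # manifests/base.pp
    import 'misc/nagios.pp'   # was: import 'nagios.pp', per hashar's suggestion above

import splices another manifest in at parse time, so a stale path fails the whole parse, which is why jenkins rejected the patchset with "No file(s) found for import of 'nagios.pp'" rather than anything subtler.)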