[00:00:04] !log reedy synchronized langlist [00:00:06] Logged the message, Master [00:00:13] hrmm [00:00:17] Reedy: it doesnt seem to have it. [00:01:22] Reedy: ok, its on langlist file [00:01:28] is it added in svn and such per the wikitech page? [00:01:33] http://wikitech.wikimedia.org/view/Add_a_project [00:01:39] svn? :p [00:01:45] well, i guess its git now [00:01:47] hrmm [00:01:53] i think the auth-dns just pulls from langlist [00:01:56] i wonder why it didnt pull [00:02:12] It seems to have [00:02:17] OpenDNS is showing correct entries for it [00:02:20] and I can access it [00:02:32] hrmm, i killed negative cache, but my dig didnt work [00:02:36] lemme see what i messed up [00:02:52] Snowolf: Seriously? :p [00:03:07] Reedy: meh, i put wikimedia in my dig [00:03:10] old habits [00:03:12] London, England, UK [00:03:12] wikipedia-lb.esams.wikimedia.org [00:03:12] wikipedia-lb.wikimedia.org [00:03:12] 91.198.174.225 [00:03:13] heh [00:03:14] Reedy: you should be all set yes? [00:03:17] Reedy: hmm? I get emails :D [00:03:56] Snowolf: it only became accessible a few minutes ago.. And DNS takes time to propagate unless you force [00:04:11] 00:01, 7 February 2013 User account Hoo man (Talk | contribs) was created automatically [00:04:11] 00:00, 7 February 2013 User account Snowolf (Talk | contribs) was created automatically [00:04:11] 00:00, 7 February 2013 User account Leinad (Talk | contribs) was created automatically [00:04:11] yea, i had to kill a negative cached entry for it [00:05:08] Reedy: need anything else? im about to run out and wont be back online for about an hour. [00:05:19] Nope, that's great thanks :) [00:05:21] I dont wanna leave you hanging since there arent other open about =] [00:05:23] cool [00:05:36] Reedy: not forced, I just clicked the link [00:05:47] lols
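The check being fumbled above can be sketched with dig; a minimal example, assuming the standard dig tool (min.wikipedia.org — the wiki created later in this log — stands in for the new project, and ns0.wikimedia.org is one of Wikimedia's authoritative nameservers):

```bash
# Ask an authoritative server directly, sidestepping any negative-cache
# entry held by the local resolver (hostname is illustrative):
dig +short min.wikipedia.org @ns0.wikimedia.org

# The same query against the default resolver can keep returning NXDOMAIN
# until the negative-cache TTL expires, even after the langlist update:
dig +short min.wikipedia.org
```

[00:06:28] Wikimania is in hong kong?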
ruddy hell, I thought the us was far enough :P [00:06:52] Damianz: Welcome to nearly a year ago ;) [00:07:48] I've not bought a new pc recently so no reason to look up mythical creatures on wikipedia :P yay for being outdated [00:12:21] Must be really slow [00:20:05] !log reedy synchronized php-1.21wmf9/extensions/WikimediaMaintenance [00:20:06] Logged the message, Master [00:30:22] New patchset: Reedy; "Add new wikis to small.dblist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47819 [00:30:56] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47819 [00:32:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47799 [00:35:49] !log reedy synchronized wmf-config/InitialiseSettings.php [00:35:50] Logged the message, Master [00:40:54] New patchset: Reedy; "Add wgMetaNamespace for plwikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47820 [00:41:07] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47820 [00:41:39] !log reedy synchronized wmf-config/InitialiseSettings.php [00:41:40] Logged the message, Master [00:51:49] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [00:51:49] PROBLEM - Puppet freshness on ocg1 is CRITICAL: Puppet has not run in the last 10 hours [00:51:49] PROBLEM - Puppet freshness on lardner is CRITICAL: Puppet has not run in the last 10 hours [01:00:49] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours [01:06:49] PROBLEM - Puppet freshness on kuo is CRITICAL: Puppet has not run in the last 10 hours [01:06:49] PROBLEM - Puppet freshness on xenon is CRITICAL: Puppet has not run in the last 10 hours [01:09:49] PROBLEM - Puppet freshness on titanium is CRITICAL: Puppet has not run in the last 10 hours [01:10:52] PROBLEM - Puppet freshness on ocg2 is CRITICAL: Puppet has not run in the last 10 hours [01:11:46] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [01:12:49] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [01:13:52] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [01:14:46] PROBLEM - Puppet freshness on mexia is CRITICAL: Puppet has not run in the last 10 hours [01:15:49] PROBLEM - Puppet freshness on caesium is CRITICAL: Puppet has not run in the last 10 hours [01:15:50] PROBLEM - Puppet freshness on cerium is CRITICAL: Puppet has not run in the last 10 hours [01:17:46] PROBLEM - Puppet freshness on mchenry is CRITICAL: Puppet has not run in the last 10 hours [01:18:49] PROBLEM - Puppet freshness on wtp1 is CRITICAL: Puppet has not run in the last 10 hours [01:50:27] New patchset: Krinkle; "Fix sawikiquote wgLogo." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47825 [01:51:55] Change merged: Krinkle; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47825 [01:53:56] !log krinkle synchronized wmf-config/InitialiseSettings.php 'I9fd98309' [01:53:58] Logged the message, Master [01:57:58] !log Force-running puppet on wtp1 to figure out why it's broken [01:57:59] Logged the message, Mr. Obvious [02:02:14] New review: Catrope; "This is broken. You forgot to remove git-core from existing manifests, so some servers now doubly de..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/37247 [02:03:23] ottomata: Are you around? [02:04:06] mw27: ssh: connect to host mw27 port 22: Connection timed out [02:04:07] srv266: ssh: connect to host srv266 port 22: Connection timed out [02:04:07] srv278: ssh: connect to host srv278 port 22: Connection timed out [02:04:10] mw1041: ssh: connect to host mw1041 port 22: Connection timed out [02:04:11] RoanKattouw: I assume this is snafu? [02:04:16] Yeah, ignore those [02:04:24] srv278 is almost always broken [02:06:39] New patchset: Demon; "Removing git-core manifest from other servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47826 [02:09:14] New review: Catrope; "This is not enough. You also need to fix invocations of Package[git-core] like in parsoid.pp" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/47826 [02:10:27] When wmf9 got put onto fenari, it looks like the submodule within GuidedTour was dropped. [02:10:41] Is it alright if I do a sync-dir to fix it. [02:11:18] Go ahead [02:14:06] thx RoanKattouw. ori-l changed all the steps in "How_to_deploy_code" to use git submodule update --init --recursive , but it's easy to forget [02:14:17] Right [02:14:19] Y [02:14:25] Yeah you guys are the first to use nested submodules [02:15:23] We're just using one submodule. It's everyone else who's nesting. ;) [02:15:26] But, yeah. [02:18:13] It's prompting me for passwords when I run sync-dir (similar to last time). [02:18:26] I'm going to email RT, since I think I sshed correctly and everything should be right. [02:18:35] spagewmf, can you do this one? I already did the submodule update. [02:18:45] huh [02:18:47] Ahm [02:18:55] sure, happy to [02:19:02] Mind if I ask you some questions about that, superm401 ? [02:19:10] (the password issue) [02:19:20] RoanKattouw, sure, go ahead. [02:19:32] Want me to pastebin it? [02:19:39] Please do [02:20:08] Please also run ssh-add -l in the same shell and tell me what the output is (describe, don't paste) [02:21:30] http://pastebin.ca/2311334 [02:22:06] It's an RSA private key with fingerprint. [02:22:42] superm401, sync-dir php-1.21wmf9/extensions/GuidedTour/modules/externals/mediawiki.libs.guiders/mediawiki.libs.guiders.submodule ; OK? [02:22:57] spagewmf, yep, that should do it. [02:23:17] Were those six lines all you got? [02:23:32] !log spage synchronized php-1.21wmf9/extensions/GuidedTour/modules/externals/mediawiki.libs.guiders/mediawiki.libs.guiders.submodule 'sync GuidedTour submodule omitted from 1.21wmf9' [02:23:33] Logged the message, Master [02:24:17] I might have control-Ced. I think they all should have the same underlying cause. [02:24:56] Thanks, spagewmf. [02:25:12] np. https://test2.wikipedia.org/wiki/Main_Page?tour=test WFM
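The deploy step being discussed can be sketched as follows; the staging directory is assumed from paths that appear elsewhere in this log, and the agent check is the one RoanKattouw asks for:

```bash
# Nested submodules need --recursive; a plain `git submodule update`
# only initialises the first level and silently drops the inner one.
cd /home/wikipedia/common/php-1.21wmf9   # staging path assumed from this log
git submodule update --init --recursive

# If sync-dir starts prompting for passwords, first confirm the forwarded
# ssh agent actually holds a key:
ssh-add -l
```

[02:27:53] !log LocalisationUpdate completed (1.21wmf9) at Thu Feb 7 02:27:52 UTC 2013 [02:27:54] Logged the message, Master [02:28:28] New patchset: Demon; "Removing git-core manifest from other servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47826 [02:31:20] New review: Catrope; "Looks good to me. There is one remaining use of git-core, but that's in misc/beta.pp" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/47826 [02:34:05] New review: Demon; "That's part of a require, should be fine." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/47826 [02:35:13] New review: Catrope; "D'oh, right. Roan can't read grep output."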
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/47826 [02:52:26] !log LocalisationUpdate completed (1.21wmf8) at Thu Feb 7 02:52:26 UTC 2013 [02:52:28] Logged the message, Master [03:09:12] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [03:19:51] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [04:02:47] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [04:02:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:02:47] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [04:02:47] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [04:04:52] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [07:42:31] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [08:08:36] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [08:30:53] New review: Hashar; "Then I think it complains about the change being closed isnt it ?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47538 [08:38:56] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 181 seconds [08:38:56] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds [08:50:06] New patchset: Spage; "Add a logbot class for #wikimedia-e3 channel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46672 [08:54:30] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [08:55:51] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [09:45:36] PROBLEM - MySQL Replication Heartbeat on db1022 is CRITICAL: CRIT replication delay 186 seconds [09:45:53] PROBLEM - MySQL Slave Delay on db1022 is CRITICAL: CRIT replication delay 190 seconds [09:46:11] PROBLEM - MySQL Replication Heartbeat on db46 is CRITICAL: CRIT replication delay 224 seconds [09:46:20] PROBLEM - MySQL Slave Delay on db46 is CRITICAL: CRIT replication delay 227 seconds [09:51:08] RECOVERY - MySQL Slave Delay on db1022 is OK: OK replication delay 15 seconds [09:52:20] RECOVERY - MySQL Replication Heartbeat on db1022 is OK: OK replication delay 0 seconds [09:54:35] RECOVERY - MySQL Replication Heartbeat on db46 is OK: OK replication delay 0 seconds [09:54:53] RECOVERY - MySQL Slave Delay on db46 is OK: OK replication delay 0 seconds [10:08:25] !log jenkins : regenerating jobs to make phpcs to show the sniff codes being used. 
{{gerrit|47847}} [10:08:27] Logged the message, Master [10:53:12] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [10:53:12] PROBLEM - Puppet freshness on lardner is CRITICAL: Puppet has not run in the last 10 hours [10:53:12] PROBLEM - Puppet freshness on ocg1 is CRITICAL: Puppet has not run in the last 10 hours [11:02:12] PROBLEM - Puppet freshness on marmontel is CRITICAL: Puppet has not run in the last 10 hours [11:08:04] PROBLEM - Puppet freshness on xenon is CRITICAL: Puppet has not run in the last 10 hours [11:08:04] PROBLEM - Puppet freshness on kuo is CRITICAL: Puppet has not run in the last 10 hours [11:11:03] PROBLEM - Puppet freshness on titanium is CRITICAL: Puppet has not run in the last 10 hours [11:12:06] PROBLEM - Puppet freshness on ocg2 is CRITICAL: Puppet has not run in the last 10 hours [11:13:09] PROBLEM - Puppet freshness on tola is CRITICAL: Puppet has not run in the last 10 hours [11:14:03] PROBLEM - Puppet freshness on celsus is CRITICAL: Puppet has not run in the last 10 hours [11:15:06] PROBLEM - Puppet freshness on wtp1001 is CRITICAL: Puppet has not run in the last 10 hours [11:16:09] PROBLEM - Puppet freshness on mexia is CRITICAL: Puppet has not run in the last 10 hours [11:17:12] PROBLEM - Puppet freshness on cerium is CRITICAL: Puppet has not run in the last 10 hours [11:17:13] PROBLEM - Puppet freshness on caesium is CRITICAL: Puppet has not run in the last 10 hours [11:19:09] PROBLEM - Puppet freshness on mchenry is CRITICAL: Puppet has not run in the last 10 hours [11:20:13] PROBLEM - Puppet freshness on wtp1 is CRITICAL: Puppet has not run in the last 10 hours [11:44:44] paravoid: I had a google hangout with ottomata yesterday about his puppet manifests for kraken haproxy [11:45:10] paravoid: basically, I have instructed him to use a role class that holds the wikimedia settings and a module with a parameterized class :-D [11:45:30] I am not sure whether he sent a patch [11:46:04] will see :) [11:46:04] hey [11:46:08] yeah [11:46:13] it's not in an ideal state imho [11:46:26] he explained to me that it's merely temporary [11:46:34] apparently they will remove haproxy entirely [11:46:53] so I guess if that looks like something not too bad, you might want to be liberal :-] [11:47:17] he is definitely not going to write a full haproxy module for sure. His aim is to make the current haproxy hack work again then work on a replacement [11:48:09] yeah, I don't agree with that [11:48:12] we'll see [11:48:16] yeah :-] [11:48:18] that is what I told him
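The pattern hashar prescribes above, as a throwaway sketch; all class and parameter names are invented for illustration, and only the shape — a generic parameterized module class wrapped by a role class that carries the Wikimedia-specific settings — comes from the discussion:

```bash
# Invented names throughout; runnable with a local puppet install.
cat > /tmp/role_pattern.pp <<'EOF'
# Generic, reusable module class: takes its tunables as parameters.
class haproxy( $listen_port = 8080 ) {
  package { 'haproxy': ensure => installed }
}
# Role class: the only place Wikimedia-specific values live.
class role::kraken::haproxy {
  class { 'haproxy': listen_port => 9090 }
}
include role::kraken::haproxy
EOF
puppet apply --noop /tmp/role_pattern.pp
```

[11:48:40] anyway, could you possibly review a hack for beta ?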
/usr/local/apache is defined twice :( [11:48:41] paravoid: anyway, if you could please look at yet another hack for beta [11:48:43] grr [11:48:45] https://gerrit.wikimedia.org/r/#/c/45115/2/modules/applicationserver/manifests/config/apache.pp,unified [11:49:17] ugh [11:49:38] I hate /data [11:49:46] ;-] [11:50:20] that won't work [11:50:27] besides being ugly as hell, it just won't work [11:50:33] there's no ordering guarantee [11:50:34] :( [11:50:46] that one may be defined first and nfs::apache::labs later [11:50:47] ahh [11:51:02] and no, don't put the if !defined there too :) [11:51:12] so I am not sure how to fix that :( [11:51:31] I thought about mounting /data/project/apache on /usr/local/apache [11:51:53] but the mount command would not let us mount a subdir (/data/project/apache), only the main volume /data/project [11:53:00] ah maybe I can switch based on $::realm [11:53:06] uugh [11:53:13] prod would ensure directory, and labs would require nfs::apache::labs [11:53:20] such poor abstractions [11:53:56] nfs::apache::labs needs to go for starters [11:54:05] it's a misnomer [11:55:15] maybe I could move the /usr/local/apache definition to the role::applicationserver class ? [11:55:38] applicationserver::config::apache you mean? [11:56:30] or I can move the symlink for beta in applicationserver::config::apache yes [11:56:45] I am not sure if that realm varying stuff should be in the role class or the module [11:57:45] applicationserver::config::apache really needs to be split up [11:58:18] oh, maybe not [11:58:33] the sync apache config part isn't suitable for labs either, is it? [12:00:41] yeah that is broken too [12:00:45] I got a patch for it IIRC [12:02:41] New patchset: Hashar; "beta: /usr/local/apache dupe definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45115 [12:03:05] paravoid: updated to have /usr/local/apache definition at the same place (aka under applicationserver::config::apache ) [12:03:12] and the patch to disable mwsync in beta is https://gerrit.wikimedia.org/r/#/c/47398/ [12:04:05] getting a snack, brb [12:04:20] ugh I hate how labs has /data [12:04:27] it really diverges from production [12:05:05] hashar: the if trickery is bad as it is, let's not do it in two places [12:05:18] put everything into applicationserver::config::apache I'd say, in the same if
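The failure mode paravoid describes is easy to reproduce in isolation; a throwaway sketch, with the resource bodies invented — only the collision and the unreliable !defined guard come from the discussion:

```bash
cat > /tmp/dupe.pp <<'EOF'
class prod_apache { file { '/usr/local/apache': ensure => directory } }
class labs_apache { file { '/usr/local/apache': ensure => directory } }
include prod_apache
include labs_apache
EOF
# Compilation aborts with a duplicate-definition error for
# File['/usr/local/apache']. Wrapping one side in
# if !defined(File['/usr/local/apache']) only helps if that side happens
# to be evaluated second -- puppet gives no ordering guarantee there,
# which is exactly paravoid's objection.
puppet apply --noop /tmp/dupe.pp
```

[12:07:57] so does someone know why amaranth and more are unreachable from Europe since 3.20 UTC as reported by DaB.? [12:08:14] nagios only reported some varnish problem around that time [12:08:28] amaranth isn't managed by us [12:08:45] it's toolserver/WMDE people [12:09:12] paravoid: like https://gerrit.wikimedia.org/r/#/c/45115/3/modules/applicationserver/manifests/config/apache.pp,unified ? [12:09:27] yeah [12:09:27] got rid of the nfs::apache::labs at the same time [12:09:36] not too excited of that either, but at least let's do it in one place [12:09:39] so it can be factored out later [12:09:41] paravoid: in SAL I see RobH restarted it several times [12:10:00] whenever we get git-deploy in beta, the files will be in /srv/ that would be nicer [12:10:07] (no more /data/project ! ) [12:10:20] Nemo_bis: when?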
[12:11:13] anyway this looks more like a network problem [12:11:17] it's physically located in our infrastructure [12:11:22] not really, no, it looks like the box died [12:11:39] ah better then [12:12:02] I know nothing, it's what he said :) DaB|Uni> It is not possible to reach any host in tampa from the toolserver-cluster [12:12:29] lunch brb [12:17:39] New patchset: Hashar; "sync mediawiki only in production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47398 [12:19:34] New review: Hashar; "Rebased on https://gerrit.wikimedia.org/r/#/c/45115 which also introduced a if realm." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47398 [12:28:18] out for the rest of the afternoon [12:28:22] though I will connect tonight [13:19:46] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds [13:19:55] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 183 seconds [13:26:02] New patchset: Dzahn; "generate separate Apache sites for each planet language from template instead of a single for all" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47862 [13:30:39] New patchset: Dzahn; "generate separate Apache sites for each planet language from template instead of a single for all" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47862 [13:32:54] New patchset: MaxSem; "Advanced Solr monitoring script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47111 [13:33:12] New patchset: Dzahn; "generate separate Apache sites for each planet language from template instead of a single for all" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47862 [13:34:37] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47862 [13:36:45] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 182 seconds [13:37:12] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 191 seconds [13:48:04] New patchset: Dzahn; "remove old planet (pre-venus) puppet class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47864 [13:49:49] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47864 [13:56:48] New patchset: Dzahn; "use a simple Apache ports.conf with SSL (NameVirtualHost *:443) enabled without this being in default the first VirtualHost will take precedence" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47865 [13:57:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47865 [13:59:50] Reedy: you created minwiki, but it is not listed at interwikimap yet. can you fix it? [14:03:36] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [14:03:37] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [14:03:37] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:03:37] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:05:42] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [14:34:12] <^demon> paravoid: Hi. Roan spotted some problems with the "put git on all servers" change. I submitted a followup: https://gerrit.wikimedia.org/r/#/c/47826/ [14:34:16] mutante: could you also merge https://gerrit.wikimedia.org/r/#/c/47223/ by any chance? 
:) [14:34:50] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47826 [14:35:06] ^demon: I got an Internal Server Error from gerrit when I clicked submit; second time worked [14:35:24] <^demon> Hmm, I'll check the log. [14:36:03] <^demon> Wow, never seen that before. [14:36:22] <^demon> http://p.defau.lt/?0N8WjmdGzTy1KfNY9Y4Crw [14:36:26] Nemo_bis: ok, yes. i meant to run it more often than just daily, maybe i would not have made it hourly, but...actually.. why not [14:36:37] ;) [14:36:57] <^demon> paravoid: Weridddd, it didn't leave any comments from you. [14:37:03] I was thinking of 2-3 h, but I see it takes only 2 min to complete... [14:37:07] <^demon> *weirdddd, even [14:37:09] New review: Dzahn; "yes, just once daily was a little slow to update" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/47223 [14:37:10] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47223 [14:37:41] New review: Demon; "Some weird bug prevented Faidon's comments from being posted. Will file upstream." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47826 [14:38:02] I didn't put anything in the comments field fwiw [14:38:20] just hit +2 and publish & submit [14:39:23] <^demon> Yeah, but it should've still left the default comments. [14:39:27] <^demon> It didn't leave anything. [14:39:53] RECOVERY - Puppet freshness on kuo is OK: puppet ran at Thu Feb 7 14:39:34 UTC 2013 [14:40:31] nod [14:40:56] RECOVERY - Puppet freshness on titanium is OK: puppet ran at Thu Feb 7 14:40:48 UTC 2013 [14:43:56] RECOVERY - Puppet freshness on caesium is OK: puppet ran at Thu Feb 7 14:43:40 UTC 2013 [14:46:11] RECOVERY - Puppet freshness on xenon is OK: puppet ran at Thu Feb 7 14:45:57 UTC 2013 [14:47:23] RECOVERY - Puppet freshness on lardner is OK: puppet ran at Thu Feb 7 14:47:08 UTC 2013 [14:48:00] RECOVERY - Puppet freshness on mchenry is OK: puppet ran at Thu Feb 7 14:47:43 UTC 2013 [14:48:00] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Feb 7 14:47:52 UTC 2013 [14:48:57] New patchset: Ryan Lane; "Remove keystone patches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47870 [14:49:02] RECOVERY - Puppet freshness on wtp1001 is OK: puppet ran at Thu Feb 7 14:48:38 UTC 2013 [14:49:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47870 [14:51:26] RECOVERY - Puppet freshness on ocg1 is OK: puppet ran at Thu Feb 7 14:51:07 UTC 2013 [14:52:02] RECOVERY - Puppet freshness on marmontel is OK: puppet ran at Thu Feb 7 14:51:29 UTC 2013 [14:55:02] RECOVERY - Puppet freshness on ocg2 is OK: puppet ran at Thu Feb 7 14:54:55 UTC 2013 [14:55:29] RECOVERY - Puppet freshness on tola is OK: puppet ran at Thu Feb 7 14:55:18 UTC 2013 [14:56:32] RECOVERY - Puppet freshness on cerium is OK: puppet ran at Thu Feb 7 14:56:16 UTC 2013 [15:00:01] is interwikimap (missing min prefix) fixed automatically or need i to create a bug? [15:00:26] RECOVERY - Puppet freshness on wtp1 is OK: puppet ran at Thu Feb 7 15:00:06 UTC 2013 [15:01:02] RECOVERY - Puppet freshness on constable is OK: puppet ran at Thu Feb 7 15:00:41 UTC 2013 [15:03:26] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Feb 7 15:03:18 UTC 2013 [15:17:55] New review: Faidon; "I just realized that this is already packaged in Debian, although a version behind (0.8) as the late..." 
[operations/debs/python-jsonschema] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47662 [15:39:49] !log reedy synchronized php-1.21wmf9/cache/interwiki.cdb 'Updating 1.21wmf9 interwiki cache' [15:39:50] Logged the message, Master [15:40:11] !log reedy synchronized php-1.21wmf8/cache/interwiki.cdb 'Updating 1.21wmf8 interwiki cache' [15:40:12] Logged the message, Master [15:44:27] !log Rebooting cr1-esams [15:44:28] Logged the message, Master [15:46:05] !log reedy synchronized wmf-config/InitialiseSettings.php [15:46:06] Logged the message, Master [15:47:00] New patchset: Reedy; "Fix incubator dbname in geocrumbs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47871 [15:47:51] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47871 [15:52:26] New patchset: Reedy; "Add wikidata.dblist to ease running of maintenance scripts under all wikidata-wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47872 [15:52:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47872 [16:07:02] RECOVERY - Host ms-be11 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [16:08:52] New patchset: Reedy; "Add wikidata dblist to noc conf" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47875 [16:09:15] paravoid, or mutante, here's another puppet question for you [16:09:35] if I'm making a role class that uses a module of the same name [16:09:46] e.g. role::kraken and modules/kraken [16:10:00] I have to be really careful about how I define and include classes, due to puppet's autoloader [16:10:05] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47875 [16:10:14] role::kraken::proxy cannot include a class from the kraken module called kraken::proxy [16:10:20] should I: [16:10:42] 1. name the role something different [16:10:42] or [16:10:42] 2. fully qualify the module class include: ::kraken::proxy
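Option 2 from the list above can be sketched like this; the class bodies are placeholders, and the leading :: is the whole point:

```bash
cat > /tmp/qualify.pp <<'EOF'
class kraken::proxy { notify { 'module class kraken::proxy': } }
class role::kraken::proxy {
  # A bare "include kraken::proxy" here can be resolved relative to the
  # role::kraken namespace; the leading :: forces the top-level module class.
  include ::kraken::proxy
}
include role::kraken::proxy
EOF
puppet apply /tmp/qualify.pp
```

[16:10:46] PROBLEM - swift-account-server on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:10:46] PROBLEM - swift-container-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:10:46] PROBLEM - SSH on ms-be11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:10:47] PROBLEM - swift-object-updater on ms-be11 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:10:55] PROBLEM - swift-container-auditor on ms-be11 is CRITICAL: Connection refused by host [16:10:56] PROBLEM - swift-object-replicator on ms-be11 is CRITICAL: Connection refused by host [16:10:56] PROBLEM - swift-container-replicator on ms-be11 is CRITICAL: Connection refused by host [16:10:56] Anyone know why root seems to own nearly everything in /home/wikipedia/common/docroot/noc/conf now?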
:/ [16:11:13] PROBLEM - swift-account-reaper on ms-be11 is CRITICAL: Connection refused by host [16:11:14] PROBLEM - swift-object-auditor on ms-be11 is CRITICAL: Connection refused by host [16:11:23] PROBLEM - swift-account-auditor on ms-be11 is CRITICAL: Connection refused by host [16:12:07] PROBLEM - swift-account-replicator on ms-be11 is CRITICAL: Connection refused by host [16:12:07] PROBLEM - swift-container-server on ms-be11 is CRITICAL: Connection refused by host [16:12:07] PROBLEM - swift-object-server on ms-be11 is CRITICAL: Connection refused by host [16:15:30] New patchset: ArielGlenn; "ms-be ssd layout changed to reflect h710 controllers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47876 [16:15:40] ignore those, I'm installing [16:16:08] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47876 [16:20:49] PROBLEM - Host ms-be11 is DOWN: PING CRITICAL - Packet loss = 100% [16:26:40] RECOVERY - Host ms-be11 is UP: PING OK - Packet loss = 0%, RTA = 2.82 ms [16:34:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.954 seconds [16:37:36] New patchset: Reedy; "Bug 44741 - kowikiversity, minwiki, and tswiki using SVG instead of PNG for $wgLogo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47879 [16:38:11] !log reedy synchronized wmf-config/InitialiseSettings.php [16:38:12] Logged the message, Master [16:38:50] Oo, RobH, whatever happened with your battle 1007? [16:38:53] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 206 seconds [16:39:29] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 221 seconds [16:45:11] ottomata: spent all day, never finished, its still being bitchy [16:45:21] i'll resume later today [16:45:21] aye ok, cool, just wondering! [16:45:23] no worries, thank you! [16:46:39] New patchset: Ottomata; "Adding very simple haproxy module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47881 [16:47:44] PROBLEM - SSH on ms-be11 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:48:08] paravoid, step 1! super simple haproxy module: [16:48:08] https://gerrit.wikimedia.org/r/#/c/47881/ [16:48:20] let me know if I should make more files (install.pp, service.pp, whatever) [16:51:20] RECOVERY - SSH on ms-be11 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [16:55:29] New patchset: Ottomata; "passwords.pp - Adding empty passwords::analytics class, s/svn/git/ in comment." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/47883 [16:56:06] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47883 [16:57:39] RECOVERY - swift-object-server on ms-be11 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [16:57:39] RECOVERY - swift-container-server on ms-be11 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [16:58:05] RECOVERY - swift-account-server on ms-be11 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [16:58:05] RECOVERY - swift-container-updater on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [16:58:05] RECOVERY - swift-object-updater on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [16:58:14] RECOVERY - swift-account-auditor on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [16:58:14] RECOVERY - swift-container-auditor on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:58:32] RECOVERY - swift-object-auditor on ms-be11 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [16:58:33] RECOVERY - swift-container-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [16:58:41] RECOVERY - swift-account-reaper on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [16:58:41] RECOVERY - swift-object-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [16:59:17] RECOVERY - swift-account-replicator on ms-be11 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:06:51] hurry up git review! [17:06:51] New patchset: RobH; "mw75-mw80 new image scalers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47884 [17:06:57] that takes forever... [17:07:20] cmjohnson1: So you wanna review my change and +1 it before I merge? [17:07:38] It is easy change, but thought you may wanna review so you are involved in all steps =] [17:08:04] yep [17:09:44] robh: what does line 1524 do? [17:10:13] if its mw75 or 76 it makes it a ganglia aggregator [17:10:28] as each service group (api, apache, scaler) has to have two aggregators per datacenter. [17:10:43] example below it on 1533 [17:11:41] ok cool..i was just wondering what was different for mw75 and 76 and why not for the 77-80 [17:11:58] yea we only want two aggregators per cluster type [17:12:04] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47884 [17:12:05] s/want/need [17:12:21] ah..cool [17:12:23] cmjohnson1: oh cool, when did we give you =2? 
[17:12:25] +2 even [17:12:31] awhile ago [17:12:31] (i agree with it, just didnt know it happened) [17:12:35] \o/ [17:12:47] i'll merge on sockpuppet [17:12:53] k..cool [17:13:19] cmjohnson1: Ok, so those are in site.pp [17:13:32] so once you have them racked, you guys can update dhcpd and either do the install [17:13:34] or hand off to me for it [17:13:36] either or [17:14:01] (keep me in loop either way and I'll follow along) [17:14:09] i will update dhcp and hand off...lots of h/w and manual labor stuff going on here today [17:14:17] actually, you dont have to update dhcpd [17:14:22] that regex is missing ^ at the start [17:14:29] and $hostname should be $::hostname [17:14:59] but all those regexes near there are missing that [17:15:03] marktraceur: » if $hostname =~ /^mw7[56]$/ { to » if $::hostname =~ /^mw7[56]$/ { [17:15:04] ? [17:15:09] yep, i just copied from nearby [17:15:14] but i can correct them all if thats what you mean [17:15:20] Um [17:15:30] ack [17:15:33] yes that's what I mean [17:15:33] sorry marktraceur [17:15:44] mark: ok, i'll fix them all right now =] [17:15:47] Ah. [17:15:50] because technically, something like testmw75 matches [17:17:00] mark: should all instances of $hostname be $::hostname or just the ganglia aggregator matches? [17:17:21] all references (not assignments) [17:17:25] but don't change them all in one huge commit [17:17:50] ok... so just change say... all the mw servers for pmtpa ? [17:18:07] also note the $ at the end [17:18:14] yeah [17:18:22] $ at the end? [17:18:29] "^" means: "there shouldn't be anything before this" [17:18:36] "$" means: nothing after this [17:18:43] so ^mw70$ matches JUST mw70 [17:18:49] whereas mw70 matches testmw7000 too [17:19:49] hrmm, my mw75 stanza has the ^$ [17:19:54] but not the $::hostname [17:19:59] i'll fix them all and ping you to look [17:20:03] well, all mw tampa. [17:20:40] ok [17:21:48] heh, everyone has put $hostname except for the two other entries i assume mark did properly [17:21:52] i am amused [17:22:16] every nonproper instance gives a log warning [17:22:31] well, now that i know, as i do other small commits I will fix them a bit at a time [17:22:37] and ensure once i fix they dont bork.
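mark's anchoring rules are easy to check from a shell, with grep -E standing in for puppet's regex matching:

```bash
# ^ pins the start and $ pins the end, so only the exact hostnames match;
# without the anchors, a hostname like testmw7500 would match mw7[56] too.
for h in mw75 mw76 mw77 testmw7500; do
  if echo "$h" | grep -qE '^mw7[56]$'; then
    echo "$h: ganglia aggregator"
  else
    echo "$h: not an aggregator"
  fi
done
```

[17:23:43] New patchset: RobH; "fixing hostname entries for ganglia aggregator logic" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47886 [17:24:19] mark: ^ I think thats what you are looking for, please let me know if not. [17:25:24] cmjohnson1: So yea, once you guys have them all wired, I can take over via mgmt for the install. I'll also need to know the port # ranges they are plugged into so I can set the vlan and label the ports. [17:25:38] i am making the ticket now [17:25:47] awesome, just assign to me when done and ping it to me =] [17:30:11] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 182 seconds [17:30:44] robh: rt4488 [17:30:56] Change abandoned: CSteipp; "Enabling this on a smaller set to start" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47540 [17:31:09] cmjohnson1: you mean image scalers in the email to CT [17:31:11] not api i think. [17:31:28] yes...i can correct that [17:32:28] i already sent another email [17:32:35] cuz even though we are bringing up new scalers first [17:32:47] we are unable to fall back to tampa during this time anyhow without overloading apaches [17:32:52] its just how it is.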
[17:33:29] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [17:33:38] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 1 seconds [17:35:20] bleeeeeh, asw-d1-sdtpa is a foundry [17:35:23] >_< [17:36:20] New patchset: ArielGlenn; "fix up swift partitions on ssds for ms-be* hosts with H710s" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47889 [17:36:56] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 25 seconds [17:37:04] RobH: commented [17:37:32] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 19 seconds [17:37:34] ahhh, i see [17:37:40] mark: thx! [17:41:30] Change abandoned: ArielGlenn; "should do only for ms-be11" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47889 [17:42:59] bugger. [17:43:05] i just pushed two commits, blehhhh [17:43:17] sweet, remote rejection, whew [17:43:23] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [17:43:57] heh [17:45:33] New patchset: ArielGlenn; "ms-be11 setup swift partitions sda3/sdb3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47892 [17:46:09] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47892 [17:46:12] New patchset: RobH; "fixing hostname entries for ganglia aggregator logic" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47886 [17:46:47] <^demon> paravoid: Mind reviewing a packaging update I did? [17:47:02] mark: ^ Updated the patchset with your comment in mind [17:52:14] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 184 seconds [17:52:59] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 186 seconds [17:54:15] cmjohnson1: So are mw75-85 ready to go then (since i finished network ticket) [17:54:16] ? [18:02:02] robh: yes...dracs are cfg'd but you need to add mac's etc [18:02:10] awesome [18:02:12] thanks dude [18:02:20] even added asset tag [18:05:03] !log Rebooting re1.cr1-eqiad [18:05:04] Logged the message, Master [18:06:31] cmjohnson1: it was mw75 to what? [18:06:38] cuz 84 and 85 seem to not have drac response [18:08:04] 85 [18:08:06] let me check [18:08:09] k [18:09:20] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [18:10:35] robh: k...those 2 were not connected [18:13:52] cmjohnson1: So in the future, you can for loop retrieval of macs [18:14:01] example: for mw in mw{75..85}; do MAC=$(ssh root@$mw.mgmt.pmtpa.wmnet racadm getsysinfo | awk '/^NIC.Embedded.1-1-1/ { print $4 }'); echo -e "host $mw {\n\thardware ethernet $MAC;\n\tfixed-address $mw.pmtpa.wmnet;\n}\n"; done > macs.txt [18:14:16] oh..awesome [18:14:19] that will pull all the stanzas in a formatted output to cut and paste into lease file. [18:14:33] easy = better [18:14:34] heh [18:16:34] New patchset: RobH; "added mw75-mw85 to lease file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47898 [18:16:58] cmjohnson1: so yea, copy that someplace, then the biggest favor you can do yourself is memorize how the command is structured [18:17:04] for when you have to write it from scratch ;] [18:17:25] anything you have to enter more than 8 times should be in a loop! ;]
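RobH's one-liner, reflowed with comments for readability; same commands and host range, nothing changed:

```bash
# For each new server, pull the embedded-NIC MAC from its DRAC and emit
# a ready-to-paste dhcpd host stanza; collect them all in macs.txt.
for mw in mw{75..85}; do
  MAC=$(ssh root@$mw.mgmt.pmtpa.wmnet racadm getsysinfo \
        | awk '/^NIC.Embedded.1-1-1/ { print $4 }')
  echo -e "host $mw {\n\thardware ethernet $MAC;\n\tfixed-address $mw.pmtpa.wmnet;\n}\n"
done > macs.txt
```

[18:17:35] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::a [18:17:53] mark: You are aware or working on that? ^ or should we be concerned?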
[18:17:56] that's me [18:18:21] cool [18:18:47] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::7 [18:18:49] PROBLEM - Host wikisource-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::5 [18:18:49] PROBLEM - Host wikisource-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::5 [18:18:50] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::7 [18:18:52] New review: RobH; "yay apaches!" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/47898 [18:18:53] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47898 [18:18:56] PROBLEM - Host foundation-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::9 [18:18:58] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::2 [18:18:59] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::2 [18:19:05] gah [18:19:09] what is happening [18:19:12] mark: you may wanna send a page to tell folks ;] [18:19:12] a master switch [18:19:22] cuz they are all on PST paging hours, yet traveling in the EU [18:19:23] heh. [18:19:23] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::4 [18:19:24] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::3 [18:19:24] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::6 [18:19:25] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::a [18:19:26] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::8 [18:19:41] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [18:19:41] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::0 [18:19:50] ah [18:19:58] why is it so fucking slow [18:19:59] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1 [18:20:02] it went in seconds last time I did that [18:20:17] PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::b [18:20:19] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [18:20:35] this is useless [18:20:44] PROBLEM - Host foundation-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::9 [18:21:20] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6_https is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::8 [18:21:29] PROBLEM - Host wikidata-lb.eqiad.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:861:ed1a::12 [18:21:48] RECOVERY - Host upload-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 117.67 ms [18:21:49] RECOVERY - Host wikinews-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 116.74 ms [18:21:50] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 116.37 ms [18:21:56] RECOVERY - Host wikidata-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [18:22:05] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 
0%, RTA = 115.42 ms [18:22:15] RECOVERY - Host bits-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.98 ms [18:22:22] RECOVERY - Host foundation-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 115.75 ms [18:22:37] New patchset: Lcarr; "fixing nagios-nrpe-server to be same as package installed version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47899 [18:22:50] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 116.85 ms [18:22:53] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 115.65 ms [18:22:53] RECOVERY - Host wikisource-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 115.76 ms [18:22:54] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 115.41 ms [18:22:59] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 114.69 ms [18:23:17] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47899 [18:23:21] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 115.18 ms [18:24:52] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.21 ms [18:25:22] RECOVERY - Host wikisource-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.19 ms [18:25:40] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.82 ms [18:25:57] RECOVERY - Host bits-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.25 ms [18:27:09] RECOVERY - Host foundation-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 113.14 ms [18:27:34] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_ipv6_https is UP: PING OK - Packet loss = 0%, RTA = 114.24 ms [18:32:27] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [18:34:49] New patchset: Lcarr; "init does not like retry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47901 [18:35:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47901 [18:36:33] mutante, is this something you could review? [18:36:35] https://gerrit.wikimedia.org/r/#/c/47881/ [18:42:45] !log Rebooted re0.cr1-eqiad [18:42:46] Logged the message, Master [18:42:57] RECOVERY - MySQL disk space on neon is OK: DISK OK [18:44:24] New patchset: Hashar; "Adding very simple haproxy module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47881 [18:44:57] New review: Hashar; "I have passed the manifests through puppet-lint :-D" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/47881 [18:53:08] robh: on a5-sdtpa....there are 36 servers now...i can use up to 40 u but will need to use y cables for 2. [18:56:17] robh ^ and power will be fine for 40 servers [18:57:39] do you have the y cables to use? [18:57:42] cuz yea, lets do that. [18:57:43] yes [18:57:55] sounds good [18:57:57] cool [18:58:03] the mw75-85 are installing now [18:58:22] great news...adding rails for the others now [18:58:39] also on that note we have 9 servers left over...i added it to the ticket [18:58:46] not sure where to put them [18:59:01] we can put them into d2 when we decom some apaches there shortly.
[18:59:12] d2-sdtpa [18:59:19] kk...can you add that to the ticket [18:59:26] for history [18:59:26] thats the batch i initially wanted to decom, but found older ones [18:59:31] whats the # again? [18:59:38] 9 [18:59:48] i mean rt # [18:59:54] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 195 seconds [19:00:01] 4436 [19:00:04] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 198 seconds [19:00:06] :-P [19:01:00] cmjohnson1: Ok, updated, so pick out the 9 oldest in d2-sdtpa [19:01:04] and create a ticket to decom them [19:01:17] then we'll outline in ticket steps to decom and go from there. [19:04:38] kk [19:12:27] ok, mw75-80 OS installed, doing initial puppet runs now. [19:38:04] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [19:38:31] PROBLEM - Apache HTTP on mw76 is CRITICAL: Connection refused [19:38:58] PROBLEM - Apache HTTP on mw79 is CRITICAL: Connection refused [19:38:59] PROBLEM - Apache HTTP on mw75 is CRITICAL: Connection refused [19:39:16] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [19:39:25] PROBLEM - Apache HTTP on mw78 is CRITICAL: Connection refused [19:39:25] PROBLEM - Apache HTTP on mw80 is CRITICAL: Connection refused [19:39:52] PROBLEM - Apache HTTP on mw77 is CRITICAL: Connection refused [19:45:11] puppet is so slow..... [19:45:22] takes multiple runs to fully update a new apache. [19:45:59] cmjohnson1: they are 90% ready to go [19:46:11] soon we'll be able to pull the old ones [19:46:37] RECOVERY - Apache HTTP on mw77 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.486 second response time [19:46:57] once these runs complete, we'll add these into the node group lists and also into pybal/lvs [19:47:04] RECOVERY - Apache HTTP on mw76 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.489 second response time [19:47:22] RECOVERY - Apache HTTP on mw75 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.186 second response time [19:47:23] RECOVERY - Apache HTTP on mw79 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.213 second response time [19:47:49] RECOVERY - Apache HTTP on mw78 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.386 second response time [19:47:49] RECOVERY - Apache HTTP on mw80 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.510 second response time [19:49:36] cmjohnson1: you about?
[19:54:23] I was going to have you go ahead and repool the new ones to test, but I am doing it now, will put in here what i do so you can see in backread [19:54:39] editing /home/wikipedia/conf/pybal/pmtpa/rendering [19:54:49] rather than yank the old ones without pooling new first to test, i put in mw75 [19:55:01] it appears to be passing all tests, so seems ok [19:56:24] !log adding mw75-mw80 into tampa rendering pool [19:56:25] Logged the message, RobH [20:00:56] bleh, forgot to update ganglia [20:01:12] New patchset: RobH; "new image scaler ganglia aggregators" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47910 [20:02:13] robh: back [20:02:38] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47910 [20:02:50] cmjohnson1: no worries, just pushing new servers into service [20:03:09] they are now pooled on LVS successfully, but I neglected to update ganglia's configuration on which servers are aggregators for it [20:03:14] which I just committed and merging now [20:04:15] nagios has them under monitoring as well [20:04:24] cool [20:04:27] once i can see them in ganglia, im going to remove the old servers from pybal [20:04:52] this all used to be so much more painful. [20:05:45] hahahaha, if i have foxyproxy enabled it borks my ganglia graphs [20:05:47] as they load from pmtpa.wmnet =/ [20:05:58] ok, so ganglia is updated [20:06:04] !log Rebooting cr1-sdtpa [20:06:05] Logged the message, Master [20:06:50] mark: is the proper way to remove servers from a pybal to disable them first and let pybal recognize that before removing entirely? [20:06:57] or can i just delete the lines entirely [20:07:02] the latter [20:07:15] cool, i was going to err to the former if you werent about =] [20:07:41] !log removing srv219-224 from rendering config in pmtpa [20:07:42] Logged the message, RobH
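A hedged sketch of the pybal pool file being edited here; the one-entry-per-line format is recalled from contemporary wikitech documentation rather than shown in this log, and weights are illustrative:

```bash
# /home/wikipedia/conf/pybal/pmtpa/rendering -- one Python-style dict per
# backend (format assumed, weights illustrative):
#   {'host': 'mw75.pmtpa.wmnet', 'weight': 10, 'enabled': True }
#   {'host': 'srv219.pmtpa.wmnet', 'weight': 10, 'enabled': True }
#
# Temporary depool: flip 'enabled' to False and let pybal pick it up.
# Permanent removal (what gets logged above): delete the line outright.
sed -i "/'host': 'srv219.pmtpa.wmnet'/d" /home/wikipedia/conf/pybal/pmtpa/rendering
```

[20:07:55] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.19) [20:07:55] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [20:07:55] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [20:07:55] PROBLEM - Host ps1-a5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.5) [20:07:55] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [20:07:56] PROBLEM - Host ps1-a3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.3) [20:07:56] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [20:07:57] PROBLEM - Host ps1-c1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.20) [20:07:57] PROBLEM - Host ps1-c2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.21)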
[20:08:07] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [20:08:08] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [20:08:08] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [20:08:19] mark: i only worried for 3 seconds then realized it couldnt have been me ;] [20:08:28] but my heartrate did spike for a moment, heh [20:08:51] that made me nervous for a second [20:09:12] cmjohnson1: Ok, So srv219-srv224 are no longer in any kind of reference. They have been put into decomissioning.pp and can be wiped [20:09:29] you have no idea how much easier it is to add shit to cluster now than a few years ago [20:09:53] so much better. [20:10:05] nope but lets not look bad ...heh [20:10:09] back [20:10:17] can you imagine how much RobH was complaining THEN [20:10:18] ;-) [20:10:25] he's happy now! [20:10:27] mark: complain, i just let you do it ;] [20:10:41] it had to be 10x worse [20:10:52] cmjohnson1: So we killed off a bunch of apaches, so mw81-85 will go back into the apache pool [20:10:59] i do hope that router is coming back [20:11:03] I'll handle getting it done, because you are still cabling [20:11:14] cmjohnson1: but you'll get to do these steps soon enough in eqiad. [20:11:15] you don't have mgmt in tampa now [20:11:20] heh [20:11:26] well, then i may take a lunch break instead. [20:11:54] mark: its times like this you really love eqiad isnt it? [20:11:55] robh: yep good time to take a break [20:12:08] yes [20:13:48] robh: i am taking 219-224 offline now [20:13:54] never to return again [20:14:06] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: kuwiktionary back to 1.21wmf8 [20:14:07] Logged the message, Master [20:14:43] cmjohnson1: good enough, they arent in anything now, so shouldnt even see alerts [20:14:48] if nagios updated, checking... [20:15:12] mark: nagios is affected by what yer doin right? [20:15:21] something is off now [20:15:25] New patchset: Demon; "Roll back ku/sr wikis to 1.21wmf8 - langconverter breakage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47914 [20:15:40] yea, i lost nagios [20:15:47] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47914 [20:16:35] <^demon> I can't hit fenari either. [20:16:42] Ok, nice, getting nagios alerts via phone, heh [20:17:09] I imagine this is entirely network related, and thus mark is looking into it as we speak. [20:17:15] yes [20:17:28] not that I can get in [20:18:55] Hi everyone: this is a very random question, but: if you wanted to have a MediaWiki site where every page loaded in less than one second, accessed from anywhere in the world, around how much would it cost to set up? [20:19:06] Let's say traffic was fairly minimal. [20:19:31] Just a ballpark figure would be fine - like, thousands of dollars? Tens of thousands? [20:20:16] hi yaron: we just had a site outage, so I suspect folks will be a bit distracted here [20:20:30] Ooh... that's pretty bad timing on my part. [20:20:39] no worries [20:21:05] just letting you know. there was a break in the action, so it looked a fine time to ask [20:21:48] i'm starting to miss foundry [20:22:15] i dunno how to reply to that [20:22:18] its so unexpected. [20:23:06] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: All ku/sr wikis back to 1.21wmf8 - lang converter breakage [20:23:07] Logged the message, Master [20:23:15] what's going on? 
[20:23:19] page-o-rama [20:23:30] apergos: network issue that mark was working on [20:23:33] i think. [20:23:42] it was just fixed, so no postmortem yet [20:23:49] (ie: i may be wrong) [20:24:05] <^demon> RobH: srv219-224 are still being worked on, right? [20:24:15] ^demon: uhh, they are gone [20:24:23] why? still in nagios? [20:24:24] <^demon> Oh, they're still in dsh groups I think. [20:24:33] yea, im still updating them [20:24:39] i lost connectivity mid edit [20:24:42] <^demon> Ah ok, no worries then. [20:27:13] ^demon: sorry about that [20:27:20] I should have notified you guys so you knew node lists were in flux [20:27:36] or i should have done this when you werent in deploy window ;] [20:28:32] <^demon> Well, we weren't in a normal window. [20:28:34] ^demon: So, yea i just pushed the new mw75-mw80 into the node files [20:28:43] well, i also didnt bother to check if you were, so heh [20:28:44] <^demon> We were dealing with breakage on all the ku & sr language wikis. [20:28:47] <^demon> :) [20:28:58] so if you pushed something that would affect rendering servers [20:29:03] you may wanna just do it again to be safe. [20:29:19] (otherwise they will update with puppet run anyhow) [20:29:52] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: Being paranoid, node lists were in flux during last sync [20:29:53] Logged the message, Master [20:30:09] woo paranoia! \o/ [20:34:35] New patchset: RobH; "mw81-85 added as application servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47923 [20:35:37] New review: RobH; "" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/47923 [20:35:38] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47923 [20:50:55] robla: is this a better time to ask a random question? [20:52:21] I think things are better [20:52:25] (I'm not the person your question is for, though) :) [20:53:13] Okay, well, here goes... [20:54:13] Hi everyone (again): this is a very random question, but: if you wanted to have a MediaWiki site where every page loaded in less than one second, accessed from anywhere in the world, and it had fairly minimal traffic, around how much would it cost, very roughly, to set up? [20:54:54] I am not sure who would answer that. [20:55:01] <^demon> Wouldn't a VPS work if the traffic's not bad? [20:55:26] Since you can't control the 'last leg' of the network, under 1 second load times as a metric is a little insane [20:55:27] ^demon: sure - whatever worked is fine. [20:55:37] <^demon> A decent mid-sized VPS with some appropriate caching should be able to serve a low-traffic wiki pretty fast. [20:55:53] hey root / ops . Could we get a review of https://gerrit.wikimedia.org/r/#/c/47795/ that tweak a sudo rule for the beta cluster [20:55:57] less than 1 second worldwide seems unreasonable. [20:56:04] i dont see that happening for any site ever... [20:56:13] Damianz, RobH: okay, that's good to know. [20:56:16] https://gerrit.wikimedia.org/r/#/c/44548/ changed a sudo invocation and that breaks beta heavily :-] [20:56:17] (i just hear our india based devs complain too much ;) [20:56:42] Well, instead of "under 1 second", how about just "fast". [20:57:07] then what ^demon suggests seems the most cost effective for a low traffic site. [20:57:33] or some CDN if it goes to mid level traffic i guess [20:57:42] i have no idea the costs on that kinda thing. [20:58:03] RobH: alright; I can look that up. 
[20:58:05] It also depends on your read vs write load I guess, since read wise you can just cdn it - write you gotta handle [20:58:23] well, when he said page load, not page update, i assumed a heavy read [20:58:28] but its an assumption, indeed. [20:58:36] <^demon> yaron: Caching caching caching. Use memcached and apc and you should get pretty far :) [20:58:46] Yes - I assume it'll be 100:1 read vs. write, or something like that. [20:59:21] ^demon: okay, that makes sense. [20:59:46] New patchset: SPQRobin; "Enable Narayam on the new Sanskrit Wikiquote" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48015 [21:00:03] Well, if no one knows any specific costs, that's fine - I can look up the CDN stuff. Thanks everyone for your help! [21:09:10] !log olivneh synchronized php-1.21wmf9/extensions/EventLogging [21:09:11] Logged the message, Master [21:10:13] !log olivneh synchronized php-1.21wmf8/extensions/EventLogging [21:10:14] Logged the message, Master [21:11:56] !log olivneh synchronized php-1.21wmf9/extensions/PostEdit [21:11:57] Logged the message, Master [21:12:35] !log olivneh synchronized php-1.21wmf8/extensions/PostEdit [21:12:37] Logged the message, Master [21:18:26] hi dudes, multichill over in #mediawiki is asking about amaranth, a toolserver machine [21:18:45] it seems to be unreachable. I can get into the mgmt console, but am not able to start a serial connection [21:19:29] RobH, i'm going to ask you since you've been around a bunch this week :p [21:19:33] should I try to reboot it? [21:19:34] 2013-02-07 15.45 < DaBPunkt> ok, got RT-id 4487 [21:19:35] see if it comes up? [21:19:51] hrmm, lemme take a look at what it is [21:19:54] http://nagios.toolserver.org/cgi-bin/status.cgi?host=amaranth [21:20:09] ottomata: It is based out of Tampa, so a reboot can be done [21:20:13] RobH already restarted it a few times lately [21:20:22] and if it doesnt come back, we have cmjohnson1 and sbernardin on site [21:20:26] Nemo_bis: uhh, i did? [21:20:32] yes :p [21:20:37] i dont recall touching this machine in a long time.... [21:21:15] ok, i'm going to reboot and see what happens [21:21:31] i dont see an admin log of anyone touching it since before last july [21:21:40] ottomata: admin log it =] [21:21:55] if it doesnt come back, well, i dunno what to do next cuz i dont have root to toolserver boxes. [21:22:03] hmm, RobH [21:22:04] -> reset /SYS [21:22:04] Are you sure you want to reset /SYS (y/n)? y [21:22:04] Performing hard reset on /SYS failed [21:22:04] reset: Internal error [21:22:17] ahh, sounds like the SP is borked [21:22:38] ottomata: we need to ask cmjohnson1 or sbernardin to remove all power from the chassis (pull power cables) and then plug back in [21:23:00] when the [e|i]lom gets in that state, its the only fix [21:23:08] that may fix it [21:23:28] (or it may not, and the system may be having hardware issues, which is unfortunate since its long out of warranty) [21:23:50] heh, was about to tell him to make a ticket for them [21:23:53] i'll do it [21:25:13] ottomata: So who reported that its not working? [21:25:23] multichill in #mediawiki [21:25:46] ok, well, it looks like it has to have the power pulled and plugged back in, and then a reboot test to see if it works [21:25:58] you'll want to drop a ticket into the pmtpa queue for onsite [21:26:08] then you can touch base with sbernardin or cmjohnson1 about it [21:26:16] (chris is only in tampa this week) [21:26:34] AH, adium crashing [21:26:36] bleh, you just lost all my answer.
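(To make ^demon's "caching caching caching" advice concrete: on a single low-traffic VPS, an object cache (memcached) plus the PHP opcode cache (APC, on the PHP 5.x of this era) do most of the work; on the MediaWiki side that is roughly $wgMainCacheType = CACHE_MEMCACHED in LocalSettings.php. A minimal, hypothetical Puppet sketch, assuming Debian/Ubuntu package names; the class name is invented for illustration:)

    # Hypothetical single-VPS wiki caching setup; class and package
    # names are assumptions, not taken from the log.
    class wiki_vps_cache {
        # memcached backs MediaWiki's object and parser caches
        package { 'memcached':
            ensure => present,
        }
        service { 'memcached':
            ensure  => running,
            enable  => true,
            require => Package['memcached'],
        }
        # APC opcode cache avoids recompiling PHP on every request
        package { 'php-apc':
            ensure => present,
        }
    }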
[21:26:44] ok, well, it looks like it has to have the power pulled and plugged back in, and then a reboot test to see if it works [21:26:47] you'll want to drop a ticket into the pmtpa queue for onsite [21:26:51] then you can touch base with sbernardin or cmjohnson1 about it [21:26:53] (chris is only in tampa this week) [21:26:57] hokeydokey [21:26:58] can do [21:27:01] heh =] [21:27:18] can I CC multichill on the ticket? [21:27:21] So if they do the power thing, and it still is borked, then the server is prolly fubar [21:27:33] seems reasonable to me (to cc him) [21:27:40] ok cool [21:27:55] also, luckily, and you may wanna mention it [21:28:06] if the server is fubar, its got a disk array attached [21:28:19] and while i have no idea how they partitioned the TS server, one would hope they put data on the disk shelf. [21:28:37] so if its important data, we may be able to recover (If the system is toast) [21:33:15] !log olivneh synchronized php-1.21wmf9/extensions/E3Experiments [21:33:16] Logged the message, Master [21:34:33] !log olivneh synchronized php-1.21wmf8/extensions/E3Experiments [21:34:34] Logged the message, Master [21:37:04] cmjohnson1, when you get to it: [21:37:07] https://rt.wikimedia.org/Ticket/Display.html?id=4489 [21:37:19] i am looking at it now [21:37:24] multichill mentioned that this is kinda urgent because its a replication slave, and the longer it is down the harder it is to bring up [21:37:25] cool! [21:37:26] thank you [21:38:43] yep..working on it now [21:39:51] !log performing a hard reboot on amaranth [21:39:52] Logged the message, Master [21:40:01] ottomata, I don't think the CC worked for the ticket though [21:40:25] I didn't CC, i just emailed toolserver-l manually [21:40:43] oh I was referring to "can I CC multichill on the ticket?" [21:42:50] yeah, I asked him for his email [21:42:58] he told me to just notify toolserver-l [21:43:20] hrmm [21:43:28] how the hell does ganglia know when a server is decommissioned. [21:43:38] im not seeing the hook so far, and my old crap is still in ganglia. [21:45:20] soooo, 120 international pages, eh? [21:45:25] that's not so fun [21:45:44] we need some way to very quickly disable the paging [21:47:43] Ryan_Lane: uhh, isnt incoming when roaming still normal cost? [21:47:50] i thought its only outgoing international the US carriers charge for [21:48:08] (unless you just forwarded all your stuff to a local sim i suppose) [21:48:50] ah. right [21:49:08] though its still annoying as shit ;] [21:49:12] I'm used to the US where they screw you [21:49:30] in particular since i bet you guys left yourselfs on PST paging times [21:49:32] heh [21:49:36] yourselves even [21:49:39] ottomata: you should be able to get into console now on amaranth but it does not appear the OS is coming up. [21:49:41] i can haz spellingz [21:50:15] cmjohnson1, same deal [21:50:16] -> start /SP/AgentInfo/Console [21:50:16] start: Invalid target /SP/AgentInfo/Console [21:50:23] !log mw81-mw85 in pmtpa apache service [21:50:25] Logged the message, RobH [21:50:33] ottomata: eh? [21:50:40] i thought you tried to reboot it and it failed [21:50:42] not tried to console. [21:50:45] i did both [21:50:50] first console, no good [21:50:50] ok, does the reboot work now? [21:50:51] then reboot [21:50:55] should I try cmjohnson1?
[21:51:44] ottomata: you should try [21:51:48] it sounds like its not going to work anyhow [21:52:22] ja [21:52:22] you can also do a last ditch reset of the SP again just to be sure, reset /SP [21:52:27] it takes about 3 minutes for it to come back [21:52:32] reboot looks happier [21:52:32] -> reset /SYS [21:52:32] Are you sure you want to reset /SYS (y/n)? y [21:52:32] Performing hard reset on /SYS [21:52:37] /SP [21:52:37] ? [21:52:40] wikitech says /SYS [21:52:46] so SYS is the system [21:52:48] and if that didnt work [21:52:54] i would try rebooting the service processor [21:52:56] (elom) [21:52:57] oh, hm [21:52:59] i see [21:53:01] but it seems ok [21:53:04] ya ok [21:53:06] so i wouldnt do that, try serial now? [21:53:08] ottomata you are using the wrong console command [21:53:13] start /SP/console [21:53:13] oh? [21:53:25] cmjohnson1: well, it has the command he lists for that server [21:53:27] as its an elom [21:53:34] but i would try the ilom command as well ottomata [21:53:44] -> start /SP/console [21:53:44] Are you sure you want to start /SP/console (y/n)? y [21:53:44] Serial console is in use. [21:53:51] cmjohnson1: you on it? [21:53:54] hmm, no [21:53:56] yep..i am ... [21:53:56] just another prompt [21:54:02] oh [21:54:04] ok [21:54:07] ottomata: it said it was in use, thats different ;] [21:54:16] hehe, by chris! [21:54:16] i see! [21:54:21] cmjohnson1: so you can hand it back to ottomata [21:54:27] who can babysit the post via serial now [21:54:32] and see whats up OS wise [21:54:39] i can't get off it now [21:54:41] haha [21:54:44] ottomata: cmjohnson1 has a ton of shit onsite, so we wanna free him back up [21:54:49] yeah np [21:55:00] cmjohnson1: esc + shift + 9 [21:55:05] ? [21:55:22] cool [21:55:22] thx [21:55:26] ottomata all yours [21:55:35] ottomata: normally chris is happy to follow up and bring stuff back and all [21:55:41] but since he is in tampa, and his time is limited, etc... [21:56:34] its cool! [21:56:41] glad it came back [21:56:43] ok, in console [21:56:44] cuz if it didnt, it would suck. [21:56:52] oo booting! [21:57:10] ottomata: so historically, a single member of toolserver admins (river) has had both root on toolserver and root on wmf cluster [21:57:21] so if the OS came back as fubar, river handled it [21:57:51] if the OS comes up as fubar now, i honestly have no idea whose problem it is. I dont think any wmf ops have root on those systems, and we cannot give our mgmt login info out to any of the current toolserver admins [21:58:04] river had root because river was a long time sysadmin volunteer for us when we had no paid staff [21:58:06] !log olivneh synchronized php-1.21wmf8/extensions/RelatedArticles/ 'Fixes 44761' [21:58:07] Logged the message, Master [21:58:31] in other words, im glad this happened on your week of triage and not mine ;] [21:58:49] (that was mean, but amused me anyhow) [21:58:51] hah [21:58:54] !log olivneh synchronized php-1.21wmf9/extensions/RelatedArticles/ 'Fixes 44761' [21:58:55] Logged the message, Master [21:59:06] csteipp: ^ [21:59:15] weee, new apaches in service, in the wrong datacenter, but meh [21:59:17] aye, welp, i emailed toolserver-l with the status [21:59:20] ori-l: Thanks! [21:59:21] not much more I can do, eh? [21:59:26] np. [21:59:29] i havent pushed new servers in the apache pool online in a while, i wanted to do the first ones in the non-primary center. [21:59:50] ottomata: if it doesnt post, you could do sysadmin work on it, but im not sure if thats legit use of your time.
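(RobH's earlier question about how ganglia knows a server is decommissioned never gets an answer in the log. One plausible pattern, sketched here purely as an illustration, is for the monitoring manifests to consult the same decommissioning list, so retired hosts drop out on the next puppet run:)

    # Illustrative only: skip monitoring for decommissioned hosts.
    # Assumes $decommissioned_servers is in scope (see the earlier sketch)
    # and that a ganglia::monitor class exists; both are assumptions.
    if ! ($::hostname in $decommissioned_servers) {
        include ganglia::monitor
    }

(Whatever the real hook, RobH's "old crap is still in ganglia" complaint is the symptom you get when one consumer of the list lags the others.)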
[22:01:22] oh it booted to login prompt [22:02:14] and pings [22:02:22] New patchset: Mattflaschen; "Enable GuidedTour on more Wikipedias, and outreach." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48031 [22:02:30] ottomata: dunno what else we can do cept admin log its back and responsive to ping [22:02:58] aye right thanks for reminder [22:03:56] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48031 [22:04:21] !log pulled power and rebooted amaranth toolserver for multichill. Its back online. multichill will notify whomever needs to do more work. [22:04:23] Logged the message, Master [22:38:25] Deploying GuidedTour now [22:39:58] !log mw82 pooled into tampa apaches [22:39:59] Logged the message, RobH [22:47:28] !log mflaschen Started syncing Wikimedia installation... : Deploy GuidedTour [22:47:30] Logged the message, Master [22:49:40] how many db servers do we have on maria now? still just that one? [22:49:46] *mariadb [22:50:27] Uh-oh: "srv205.pmtpa.wmnet: ssh: connect to host srv205.pmtpa.wmnet port 22: Connection timed out" [22:50:31] Will it recover from that? [22:50:48] superm401: Yes. [22:51:00] There are usually a couple of hosts that are out of rotation for which scap / sync time out. [22:51:25] Thanks, ori-l [22:51:29] ...and everybody hopes that they will resync when coming back online :P [22:52:48] New review: Faidon; "Package ensure => present, not 'installed'. Service needs hasstatus => true, hasrestart => true." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/47881 [22:54:49] New patchset: Jalexander; "Add JP to Wikimedia Shop link config for sidebar" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48038 [22:59:15] New review: Ottomata; "The module is small because that's all I need. It would be complicated to try and make it smart eno..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47881 [23:00:19] ottomata: reminder re: https://gerrit.wikimedia.org/r/#/c/47621/ if you're doing puppet stuff [23:00:57] New patchset: Ottomata; "Adding very simple haproxy module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47881 [23:01:23] cool, can do ori-l [23:01:28] Thanks [23:01:35] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47621 [23:02:30] !log mflaschen Finished syncing Wikimedia installation... : Deploy GuidedTour [23:02:31] Logged the message, Master [23:09:45] New patchset: Ottomata; "Adding very simple haproxy module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47881 [23:11:32] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47881 [23:16:05] New patchset: Ottomata; "Adding kraken role and module. Including role::kraken::proxy on analytics1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48041 [23:17:51] Change abandoned: Ottomata; "Redoing this here:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47741 [23:20:26] New patchset: Ori.livneh; "Imported Upstream version 0.8.0" [operations/debs/python-jsonschema] (master) - https://gerrit.wikimedia.org/r/48042 [23:25:46] !log manually cleared arp and turned the management interface of asw-c-eqiad off and on to solve a reachability issue [23:25:47] Logged the message, Mistress of the network gear.
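(Faidon's review of the haproxy module above spells out the requested fixes almost verbatim. Applied to the resources in question, they would look roughly like this; a sketch of what the review asks for, not the code as merged, and the module layout is assumed:)

    # Sketch of the fixes requested in Faidon's review; layout assumed.
    class haproxy {
        package { 'haproxy':
            ensure => present,     # preferred over the 'installed' alias
        }
        service { 'haproxy':
            ensure     => running,
            hasstatus  => true,    # init script implements 'status'
            hasrestart => true,    # init script implements 'restart'
            require    => Package['haproxy'],
        }
    }

(Without hasstatus/hasrestart, Puppet of this era would grep the process table and stop/start the service instead of using the init script, which is why reviewers asked for them routinely.)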
[23:27:26] New patchset: Ori.livneh; "Imported Upstream version 0.8.0" [operations/debs/python-jsonschema] (master) - https://gerrit.wikimedia.org/r/48043 [23:27:26] New patchset: Ori.livneh; "Imported Debian patch 0.8.0-1" [operations/debs/python-jsonschema] (master) - https://gerrit.wikimedia.org/r/48044 [23:27:26] New patchset: Ori.livneh; "Backported to precise-wikimedia." [operations/debs/python-jsonschema] (master) - https://gerrit.wikimedia.org/r/48045 [23:28:10] New review: Ori.livneh; "Faidon, thanks! I did as you suggested. It was too confusing to try and rebase that on top of this c..." [operations/debs/python-jsonschema] (master) C: 0; - https://gerrit.wikimedia.org/r/47662 [23:28:19] Change abandoned: Ori.livneh; "(no reason)" [operations/debs/python-jsonschema] (master) - https://gerrit.wikimedia.org/r/47662 [23:28:31] Very sorry for gerrit spam. [23:31:14] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Enabling GuidedTour on additional wikis' [23:31:15] Logged the message, Master [23:36:55] New patchset: Ottomata; "Adding kraken role and module. Including role::kraken::proxy on analytics1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48041 [23:38:40] New patchset: Ottomata; "Adding kraken role and module. Including role::kraken::proxy on analytics1001." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48041 [23:46:49] New patchset: Lcarr; "commenting out row a5 smokeping because it is currently empty" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48048 [23:49:01] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48048 [23:59:01] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47879