[00:01:45] New review: Bhartshorne; "see inline comments." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/1797 [00:02:45] Change restored: Catrope; "Bah, apparently I can't amend abandoned changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1794 [00:02:57] New patchset: Catrope; "WIP for breaking out puppet-specific hooks to puppet.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1794 [00:03:23] New review: Catrope; "(no comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/1794 [00:13:12] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797 [00:14:29] maplebed: can you check this out ? [00:14:40] * maplebed looks [00:18:08] I'd rename COPYSCRIPT to RESTORESCRIPT to match the pattern of SAVESCRIPT. [00:18:28] there's a mix of tabs and spaces (also in the gmetad start script) [00:19:11] also, just for sanity, I'd copy the files before starting the daemon instead of after. [00:19:18] I know we said it works either way, but it feels better that way. [00:19:37] the tabs and spaces were in the script to start with [00:19:40] so i just left them in [00:19:43] but i can convert them to tabs [00:19:59] patch set 3 coming soon…. [00:20:01] up to you. [00:21:48] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797 [00:22:25] gah! [00:22:38] ? [00:22:46] both the restore script and the start script echo 'restoring ... done' [00:22:48] :( [00:23:00] sorry i didn't catch that last time. [00:23:20] same with the save script. 
[00:27:45] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797 [00:30:34] New patchset: Asher; "test new varnish pkgs on mobile cp servers out of production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1798 [00:31:41] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1798 [00:31:41] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1798 [00:42:20] PROBLEM - DPKG on cp1042 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:46:40] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [00:48:31] New patchset: Asher; "fix for varnish pkg testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1799 [00:49:06] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1799 [00:49:06] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1799 [00:52:00] RECOVERY - DPKG on cp1042 is OK: All packages OK [00:56:30] RECOVERY - mobile traffic loggers on cp1042 is OK: PROCS OK: 2 processes with command name varnishncsa [01:00:53] New patchset: Asher; "putting backend instances of new mobile varnish servers into production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1800 [01:02:01] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1800 [01:02:02] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1800 [01:32:40] New patchset: Lcarr; "Puppetizing ganglia and gangliaweb Puppetizing automatic saving and restoration of rrd's from tmpfs to disk Modifying gmetad startup to import rrd's" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/1797 [01:36:01] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1797 [01:36:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1797 [01:37:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1797 [01:46:35] New patchset: Lcarr; "fixing type in ganglia.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1801 [01:46:50] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1801 [01:46:55] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1801 [02:05:01] !log LocalisationUpdate completed (1.18) at Fri Jan 6 02:05:01 UTC 2012 [02:05:03] Logged the message, Master [02:20:05] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [02:56:55] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 677s [03:01:15] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 689s [03:42:59] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Fri Jan 6 03:42:38 UTC 2012 [04:19:34] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [04:19:34] RECOVERY - Disk space on es1004 is OK: DISK OK [04:46:48] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [05:52:41] could i get a look at /cache/interwiki.cdb ? [05:52:52] for https://bugzilla.wikimedia.org/31042 [05:53:59] its on noc already i'm pretty sure [05:54:41] hmm maybe not [05:55:07] i'm pretty sure it's not the first time i've looked for it :/ [05:55:28] is there a reason not to close the paren? 
[05:55:29] > // For transwiki import [05:55:30] ini_set( 'user_agent', 'Wikimedia internal server fetcher (noc@wikimedia.org' ); [07:26:10] Nemo_bis: where did brion say? [07:26:25] jeremyb, I think on bugzilla [07:26:50] anyway, it seems like the solution is to add to special.dblist [07:27:12] but i'd want to compare interwiki.cdb before/after adding to special.dblist [07:27:24] which is generated by https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/WikimediaMaintenance/dumpInterwiki.php?view=markup [07:28:32] * jeremyb runs away [08:00:50] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 205 MB (2% inode=60%): /var/lib/ureadahead/debugfs 205 MB (2% inode=60%): [08:10:40] RECOVERY - Disk space on srv222 is OK: DISK OK [10:00:49] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 429355 MB (3% inode=99%): [10:04:29] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 413604 MB (3% inode=99%): [10:08:19] RECOVERY - MySQL slave status on es1004 is OK: OK: [10:34:09] PROBLEM - Puppet freshness on cp3001 is CRITICAL: Puppet has not run in the last 10 hours [10:35:09] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours [10:46:53] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [10:46:53] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours [10:46:53] PROBLEM - Puppet freshness on sq68 is CRITICAL: Puppet has not run in the last 10 hours [10:46:53] PROBLEM - Puppet freshness on sq67 is CRITICAL: Puppet has not run in the last 10 hours [10:46:53] PROBLEM - Puppet freshness on sq70 is CRITICAL: Puppet has not run in the last 10 hours [10:52:43] PROBLEM - Puppet freshness on niobium is CRITICAL: Puppet has not run in the last 10 hours [10:53:43] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours [10:54:43] PROBLEM - Puppet freshness on 
cp3002 is CRITICAL: Puppet has not run in the last 10 hours [12:29:31] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [12:46:00] PROBLEM - Puppet freshness on db1003 is CRITICAL: Puppet has not run in the last 10 hours [13:21:30] ACKNOWLEDGEMENT - Host db43 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #2170 [13:24:00] ACKNOWLEDGEMENT - Host db41 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn was used for OWA, re-claimed by CT for miscellaneous tomfoolery [13:30:54] ACKNOWLEDGEMENT - Host db19 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn http://rt.wikimedia.org/Ticket/Display.html?id=2034 [13:39:24] ACKNOWLEDGEMENT - Host dataset1 is DOWN: CRITICAL - Host Unreachable (208.80.152.166) daniel_zahn RT #1345 [13:52:14] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [14:06:03] New review: Dzahn; "reverting to use NFS path like before (for now)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1643 [14:06:03] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1643 [14:18:42] New patchset: Dzahn; "require Class varnish::packages, instead of Package[varnish3] in Mount, because this broke puppet fe. on sq67" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1802 [14:19:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1802 [14:28:01] New review: Dzahn; "fixed dependency problem ?!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1802 [14:28:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1802 [14:30:12] RECOVERY - Puppet freshness on sq67 is OK: puppet ran at Fri Jan 6 14:29:51 UTC 2012 [14:30:23] New review: Dzahn; "yes, it did. 
puppet runs again on sq67" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1802 [14:32:42] RECOVERY - Puppet freshness on sq68 is OK: puppet ran at Fri Jan 6 14:32:27 UTC 2012 [14:33:12] RECOVERY - Puppet freshness on sq70 is OK: puppet ran at Fri Jan 6 14:33:10 UTC 2012 [14:39:12] RECOVERY - Puppet freshness on sq69 is OK: puppet ran at Fri Jan 6 14:38:57 UTC 2012 [14:41:12] RECOVERY - Puppet freshness on arsenic is OK: puppet ran at Fri Jan 6 14:40:46 UTC 2012 [14:44:42] RECOVERY - Puppet freshness on cp3002 is OK: puppet ran at Fri Jan 6 14:44:33 UTC 2012 [14:46:42] RECOVERY - Puppet freshness on niobium is OK: puppet ran at Fri Jan 6 14:46:24 UTC 2012 [14:56:42] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Fri Jan 6 14:56:19 UTC 2012 [15:14:03] hi [15:15:02] I need some help for a password problem [15:16:59] I'm a sysop on French wikisource, I have lost my password and I have no email address to ask a new password [15:23:12] I'm a sysop on French wikisource and I have lost my password, is it possible to have some help please ? [15:23:38] Do you have a committed identity on your user page, or any way at all for us to verify that you are who you claim to be? [15:23:43] use your email to reset your password [15:23:57] @ RoanKattouw : yes [15:25:14] I am Marc on Wikisource. You can verify here : http://fr.wikisource.org/w/index.php?title=Utilisateur:Jean-Baptiste&diff=1919043&oldid=1919039 [15:26:35] Jean-Baptiste is a « sockpuppet » that I use on Wikisource. [15:27:11] KO [15:27:30] Can you make an edit to User talk:Catrope as Jean-Baptiste on frwikisource? [15:27:38] ok [15:28:02] ah, I didn't notice that marcjb , sorry [15:29:50] Alright [15:30:02] marcjb: Does Jean-Baptiste have an e-mail address set [15:30:04] ? 
[15:30:54] yes [15:30:56] Meh, no it doesn't [15:31:09] Ah, does now [15:31:10] I just add an emailaddress [15:31:15] OK, I will set that e-mail address for Marc as well then [15:31:49] great [15:35:56] OK, done [15:36:01] Marc now has an e-mail address [15:36:08] You should be able to request a new password now [15:36:59] it works [15:37:16] thank you very much [15:37:38] and sorry for the inconvenience [15:38:59] I will be careful now. Bonne journée [15:44:34] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 33556 - ArticleFeedback settings on Chinese wikipedia' [15:44:35] Logged the message, Master [15:45:10] !log srv220 / is at 100% usage [15:45:12] Logged the message, Master [15:45:44] ^ There's 2.6G in /tmp if someone wants to clear it [15:46:28] !log Removed /tmp/mw-cache-1.17 and /tmp/mw-cache-1.17-test on srv220 [15:46:29] Logged the message, Mr. Obvious [15:47:20] There's a bunch of 100MB files in /tmp , seems to be related to gs (ghostscript) [15:47:25] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bug 33556 - ArticleFeedback settings on Chinese wikipedia' [15:47:26] Logged the message, Master [15:50:12] !log Removing gs_* files in /tmp on srv220 that are >30 min old [15:50:14] Logged the message, Mr. Obvious [15:50:29] There, now it's got 1.3GB of free space [16:02:29] you are aware of dumps.wm.o being down? [16:03:30] Nope, but confirmed [16:03:31] apergos, ^ [16:03:49] thx [16:04:41] Hmm, out traffic dropped off significantly 15 minutes or so ago [16:05:09] !log HTTP server (lighttpd?) 
seems to be down on dataset2 [16:05:10] Logged the message, Master [16:05:15] !log restarted lilghty on dataset2 [16:05:16] Logged the message, Master [16:05:24] Reedy: then just restart the movie-torrent-server ;) [16:05:30] you logged a second too late, I already had restarted it [16:05:34] :D [16:05:44] WTF [16:05:48] Apache is NOT INSTALLED on dataset2 [16:05:53] The SAL says I was a minute before you [16:05:53] Oh [16:05:55] Lighty [16:05:56] no and it shouldn't be [16:05:56] nm [16:05:58] Lololol [16:06:06] it's already dealt with. go away :-P [16:06:09] xD [16:06:13] I'm sorry :) [16:06:16] hehe [16:06:27] Giftpflanze, fixed [16:06:34] every couple weeks or so it keels over [16:06:47] Yeah, not really a big issue [16:06:51] RoanKattouw: If you are bored, could you check for the ipv6-entry of upload and if it is still announced to some parts of the world? [16:06:58] thx :) [16:07:06] I have no functional knowledge of DNS, sorry [16:07:23] who has if I may ask? [16:07:26] afk (that is, officially not paying attention) now. I've decided to try to reclaim my evenings this year [16:07:55] DaBPunkt: notpeter [16:08:02] apergos: That's one of these new-years-promisses? ;) [16:08:11] Any of the actual ops people when they're about :p [16:08:28] jeremyb: thanks [16:08:46] yeah, "resolutions" [16:08:55] apergos: 6pm? [16:08:59] yup [16:09:06] a question about the dumps: are they all from the same point of time and is the underlying data changing while dumped or frozen? [16:09:14] underlying data changes [16:09:18] we can't afford to lock tables [16:09:20] DaBPunkt: are you saying the ipv6 entry disappeared recently? [16:09:23] mh [16:09:29] each step is done serially for any given project [16:09:56] jeremyb: some evenings might be later and some earlier (like tonight), all depends. [16:10:00] mutante: some parts of the world get an ipv6-entry, but the server itself is NOT listing on ipv6, which causes troubles [16:10:00] but none really late. 
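(Editor's note on the 15:50:12 !log entry, "Removing gs_* files in /tmp on srv220 that are >30 min old": the log does not show which command was used. The following is only a minimal Python sketch of that kind of cleanup. The function names, the dry-run default, and the use of modification time as the age test are assumptions made for illustration.)

```python
import fnmatch
import os
import time


def stale_files(directory, pattern="gs_*", max_age_seconds=30 * 60, now=None):
    """List regular files in `directory` matching `pattern` whose mtime is older than the cutoff."""
    now = time.time() if now is None else now
    stale = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip directories and symlinks; only plain files are candidates.
            if not entry.is_file(follow_symlinks=False):
                continue
            if not fnmatch.fnmatch(entry.name, pattern):
                continue
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > max_age_seconds:
                stale.append(entry.path)
    return sorted(stale)


def remove_stale(directory, dry_run=True, **kwargs):
    """Delete the stale files, or only report them when dry_run is True (the default)."""
    paths = stale_files(directory, **kwargs)
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths
```

(Calling remove_stale("/tmp") first with the default dry_run=True prints nothing and deletes nothing; it only returns the candidate paths, which seems a reasonable check before running it with dry_run=False on a host whose root filesystem is full.)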
[16:10:23] not ideal, but thanks for the info [16:10:30] bye [16:12:00] DaBPunkt: ok, i can confirm we got reports via mail about the server not listening on v6. and the current task is that LVS does not support it. it has been forwarded to the right people already. (the part about some "some parts of the world" i'm not sure about though) [16:12:31] mutante: we got an email to, that's the reason I asked [16:16:20] DaBPunkt: yesterday it was considered to temp. disable the ipv6 dns announcements as a workaround until the issue with LVS is fixed. but all i can say right now is that its being worked on [16:17:18] ok. The problem for me is that I can not test it myself, because the entry was never announced to me [16:17:21] DaBPunkt: so that may explain the "some parts of the world".. caching etc.. [16:18:00] mutante: mm, I was told that the entry was or is only announced to some parts of the world [16:18:24] whitelisted servers [16:21:54] DaBPunkt: fyi, you can check using hurricane electric's recursor, dig @74.82.42.42 aaaa upload.esams.wikimedia.org [16:22:32] thanks [16:27:02] New patchset: Hashar; "WikipediaMobile: add direct link to latest nightly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1803 [16:27:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1803 [16:48:37] New patchset: Demon; "Fix my public key, was off-by-one" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1804 [16:48:51] New patchset: Demon; "For the last time, fixing my public key. I swear this is it. (With a new comment too)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1805 [16:49:05] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1805 [16:50:10] New review: Demon; "Cherry picked from test: https://gerrit.wikimedia.org/r/#change,1219" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/1805 [16:50:31] New review: Demon; "Stupid change, but was needed for the dependency to cleanly merge." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/1804 [17:03:13] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Fri Jan 6 17:02:44 UTC 2012 [18:00:21] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1804 [18:00:22] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1804 [18:01:55] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1805 [18:01:56] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1805 [18:05:31] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1803 [18:05:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1803 [18:09:17] so, what's wrong? [18:09:58] malafaya, wrong with what? [18:10:10] in my case, with pt.wiktionary [18:10:21] it says it's experiencing technical difficulties alreayd for 5 minute [18:10:33] Unknown error (10.0.6.32) [18:10:41] When doing what? [18:11:02] submitting a page [18:11:09] That's the cliuster master [18:11:12] db22 [18:11:12] i have been refreshing cause I don't want to lose the data [18:11:52] mutante, about? [18:12:00] or domas [18:12:32] eh, yeah..trying to open pt.wikt [18:12:36] Ganglia graphs for db22 just stopped and it see [18:12:40] i get an error, while saving on de.wiki [18:12:42] works for me [18:13:00] Anybody else having errors from Commons? 
[18:13:14] Other sites are [18:13:17] mutante, reporduced [18:13:23] cannot contact database server [18:13:31] db22 won't respond to ping [18:13:33] checking db22 on nagios [18:13:50] ganglia graphs just stopped [18:14:04] me too getting erorrs on Commons, StevenW [18:14:05] Isn't CA on s4? [18:14:06] appears down [18:14:22] https://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?host=db22 [18:14:32] arr, drop the s [18:14:34] Can you get serial to it? [18:14:42] trying [18:18:55] Reedy, usually when I come here, the issues are already being taken care of. It's the first time I'm the first reporter ;) [18:19:05] PROBLEM - Host db22 is DOWN: PING CRITICAL - Packet loss = 100% [18:19:49] malafaya: quicker than nagios .. looks like failed disk.. hold on [18:20:14] (Cannot contact the database server: Unknown error (10.0.6.32)) [18:20:26] @info db22 [18:20:26] Krinkle: [db22: s4] 10.0.6.32 [18:20:29] right [18:20:33] Krinkle, disk/raid failure [18:20:41] mw.org save gave that [18:20:50] Yeah [18:20:56] Isn't S4 CA? [18:21:13] @info s4 [18:21:13] Krinkle: [s4] db31: 10.0.6.41, db22: 10.0.6.32, db33: 10.0.6.43 [18:21:18] @info centralauth [18:21:18] Krinkle: [centralauth: s7] db37: 10.0.6.47, db18: 10.0.6.28, db16: 10.0.6.26 [18:21:25] nope [18:21:29] who pulled the plug out? [18:21:46] dbbot's copy of noc-wm may be outdated [18:22:22] Reedy: nope, still on S7 [18:22:23] 'centralauth' => 's7', [18:22:27] http://noc.wikimedia.org/conf/db.php.txt [18:22:28] right [18:22:32] it's cause it's commons then :p [18:23:09] Reedy: Hm.. save worked regardless [18:23:18] even tough I wasn't' redirected after save [18:24:11] save where? [18:24:14] mw.org [18:24:23] save on mw gave me that db-error [18:24:29] * sumanah came to report the same thing [18:24:45] * DarkoNeko restes the logs [18:24:51] a disk failed on db22 a little while ago. 
and while the failed disk should be replaced something went wrong [18:25:05] since everyone's joining, I guess everyone's havint he same issue :) [18:25:07] its being worked on [18:25:08] man I hate spell correction sometimes on Mac. It changes "eventhough" (missing space) to "even tough" and wasnt' (misplaced a apostrophe) instead of moving it from after t to between n and t, it adds a new one between n and t and leaves the one after resulting in wasn't'. [18:25:13] !log Commons having db issues, db22 (s4 master) has a disk issue [18:25:15] the page saves go through but the you get the error [18:25:15] Logged the message, Master [18:25:37] @info mediawikiwiki [18:25:37] Krinkle: [mediawikiwiki: DEFAULT (s3)] db39: 10.0.6.49, db34: 10.0.6.44, db25: 10.0.6.35 [18:25:52] Reedy: how come mw.org on s4 is suffering from s4 db failing ? [18:25:56] s3* s4* [18:26:12] How do we do requests to commons? via db access I guess? [18:26:13] @info ptwikt [18:26:13] malafaya: Unknown identifier (ptwikt) [18:26:24] @info ptwiktionary [18:26:24] Reedy: [ptwiktionary: DEFAULT (s3)] db39: 10.0.6.49, db34: 10.0.6.44, db25: 10.0.6.35 [18:26:29] thanks :) [18:26:35] heh I tried to delete a page on nlwiki and got the db error [18:26:45] nlwiki is on s2 [18:27:21] getting another error now [18:27:37] press the any key [18:27:43] !log asher synchronized wmf-config/db.php 'setting s4 to read only, preparing to make db31 master' [18:27:44] Logged the message, Master [18:28:02] That's the error I got, db in read-only [18:28:09] * Romaine suggests to use the "solve" key [18:28:11] Just hang on for a few minutes [18:28:26] look, just don't try again until they sorted it ? ^^; [18:28:34] ^ [18:29:12] was the collected amount of money too low? [18:29:20] or too late [18:29:33] hur hur hur. 
[18:31:19] database ops are working on promoting a slave to the new master [18:31:27] it's working again [18:31:29] approximate downtime like 10 minutes [18:31:50] at the same time technician in dc is still checking hardware of the failed db [18:31:55] PROBLEM - RAID on db23 is CRITICAL: CRITICAL: logical devices: 1 defunct [18:32:24] the current effect is that commons is read-only [18:32:45] I see all the addicted wikimedian panicking :) [18:33:18] addicted? Wiki is a way of life [18:33:37] uuung *refresh* I can't *refresh* edit *refresh* [18:33:50] * sumanah hits save on all the mw.org edits she had queued up [18:37:37] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [18:40:58] !log ben synchronized wmf-config/db.php [18:40:58] Logged the message, Master [18:42:27] RECOVERY - RAID on db23 is OK: OK: 1 logical device(s) checked [18:44:49] everything should be back to normal, please try again [18:44:53] thanks for your patience [18:45:06] database admins moved a slave to be the new master [18:45:44] thx mutante [18:45:45] cu [18:46:34] thanks mutante for this & for the email [18:47:17] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [19:24:31] New patchset: Asher; "db51 -> s4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1806 [19:24:46] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1806 [19:24:46] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1806 [19:24:56] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1806 [19:24:56] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1806 [20:27:24] RECOVERY - Host db22 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [20:44:44] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours [20:46:58] PROBLEM - MySQL disk space on db22 is CRITICAL: Connection refused by host [20:52:08] PROBLEM - DPKG on db22 is CRITICAL: Connection refused by host [20:53:48] PROBLEM - Puppet freshness on cp1044 is CRITICAL: Puppet has not run in the last 10 hours [20:53:58] PROBLEM - Disk space on db22 is CRITICAL: Connection refused by host [20:54:36] !ops Prodego is a fuckshit. ĥ̸̬̬̥̩̮͓͖̩̹͕̜̠̹̔͆ͥ͑͛̌ͬ͛ͩ̈́ͪ̀̕ͅͅe̶̫͉̭̖̮̯̭͍̞̥ͪͧͧ͂̏̈́ͭͯ͒́ͨ̆̏ͩ̅̽ͤ̚͝l̗̯̯͖̟̺̻̭̤̲̭̺͈͖̲̍͂̈ͯ̌̇̿̽̀lͬͭ̌́̍̊̒͗ͥ́͞҉͍̭͉̩͈͓̻͙̤͍̰̯͍̝̜̥͔o̙͚̱̞̜̹̤̭̪̾ͦ̐ͥͥ̄͝ͅ [20:55:44] Huh? Does !__ ops even work here? [20:57:04] no [20:57:18] only if someones irc client stalks it [20:57:58] right it works for me :D [20:58:09] but I don't really care in this chan [20:59:40] Yeah, I stalk it. That's why I asked :P [21:01:10] I wonder what a fuckshit is [21:01:27] * derp hugs petan [21:02:13] IDK, but it sounds like something a high schooler would say. They're always making up words like that :P [21:02:45] or it's a word from some foreign language which just accidentally contains English swearwords. 
[21:05:05] I don't have anything as a stalkword in any of the channels [21:05:46] apergos, you must be so gutted you missed the DB server problems [21:06:06] right around when they started up I had gone to make food [21:06:17] and when I checked in later there were sveral people working on it [21:06:25] no need for me to be here too [21:06:58] RECOVERY - MySQL disk space on db22 is OK: DISK OK [21:06:59] as mark points out from time to time, we don't all have to be here all the time. the idea is that everyone covers half of the time so there are always a few folks around [21:07:16] I'm really trying to put that into practice this year [21:08:00] speaking of which... afk :-P [21:08:51] !log asher synchronized wmf-config/db.php 'adding db51 as an s4 slave' [21:08:52] Logged the message, Master [21:11:58] RECOVERY - DPKG on db22 is OK: All packages OK [21:13:48] RECOVERY - Disk space on db22 is OK: DISK OK [21:13:58] New patchset: Asher; "upgrading db22 to new pkgs / fully puppetized config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1807 [21:13:58] RECOVERY - RAID on db22 is OK: OK: 1 logical device(s) checked [21:14:38] guys, is it realy that hard to send me a ping when you change the master of a cluster? [21:14:51] Merry Christmas! [21:15:43] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1807 [21:15:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1807 [21:16:39] DaBPunkt: it was logged to the server admin log - sorry though.. it can get hectic when something like the commons master dies [21:17:08] the new server and binlog position can be found there [21:17:16] binasher: no problem [21:32:28] PROBLEM - Host es1002 is DOWN: PING CRITICAL - Packet loss = 100% [21:40:32] DaBPunkt: is there a prescribed way to send the ping? 
you obviously know about the SAL but also, db.php is easily fetched anonymously so maybe you could watch it for changes? [21:42:02] also, maybe put a comment right in db.php telling people to tell you? [21:42:17] jeremyb: I normaly read this channel, but I was afk during that time; but a simple "DaBPunkt: We changed the master of s4" would be enough for me to notice when I come back :-) [21:42:33] DaBPunkt: you're not always here [21:44:30] that's true. You could send a mail to admins@toolserver.org, but that's maybe too much trouble [21:46:28] RECOVERY - Host es1002 is UP: PING OK - Packet loss = 0%, RTA = 30.86 ms [21:46:28] PROBLEM - Host db1029 is DOWN: PING CRITICAL - Packet loss = 100% [21:47:28] RECOVERY - Host db1029 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [22:16:40] New patchset: Lcarr; "removed statically spec'ed cluster from nickel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1808 [22:16:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1808 [22:17:03] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1808 [22:17:03] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1808 [22:19:12] New patchset: Ryan Lane; "Enabling ipv6 proxy on ssl3001 to re-enable upload ipv6 proxy." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1809 [22:19:27] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1809 [22:19:37] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1809 [22:19:38] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1809 [22:39:11] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [22:40:21] RECOVERY - SSH on es1002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:53:41] RECOVERY - Puppet freshness on es1002 is OK: puppet ran at Fri Jan 6 22:53:38 UTC 2012 [22:55:11] PROBLEM - Puppet freshness on db1003 is CRITICAL: Puppet has not run in the last 10 hours [22:58:53] !log asher synchronized wmf-config/db.php 'adding db22 back to s4' [22:58:54] Logged the message, Master [23:05:43] RECOVERY - NTP on es1002 is OK: NTP OK: Offset -0.02113974094 secs [23:31:43] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 85, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/2/3: down - psw1-eqiad:xe-0/1/0 (cable #TBD)BR [23:33:17] New patchset: Lcarr; "adding in new ganglia site for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1810 [23:39:18] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1810 [23:39:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1810 [23:41:33] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 88, down: 0, dormant: 0, excluded: 0, unused: 0 [23:47:03] RECOVERY - RAID on es1002 is OK: OK: State is Optimal, checked 2 logical device(s) [23:49:03] RECOVERY - MySQL disk space on es1002 is OK: DISK OK [23:55:13] RECOVERY - DPKG on es1002 is OK: All packages OK [23:55:22] anyone hanging out here a ganglia expert ? [23:56:33] RECOVERY - Disk space on es1002 is OK: DISK OK
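
(Editor's note on the 21:40:32 suggestion that db.php "is easily fetched anonymously so maybe you could watch it for changes": a minimal Python sketch of such a watcher follows. It only detects that the published file changed, by comparing SHA-256 digests; it does not parse the file or identify the new master. The URL is the one quoted in the channel at 18:22:27; whether it is still served, and all function names, are assumptions.)

```python
import hashlib
import urllib.request
from typing import Callable, Optional, Tuple

# URL as quoted in the channel; its continued availability is an assumption.
DB_CONFIG_URL = "http://noc.wikimedia.org/conf/db.php.txt"


def digest(text: str) -> str:
    """Return the SHA-256 hex digest of the config text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fetch(url: str = DB_CONFIG_URL, timeout: float = 10.0) -> str:
    """Download the published config file as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def check_once(
    previous_digest: Optional[str],
    fetch_text: Callable[[], str] = fetch,
) -> Tuple[bool, str]:
    """Fetch the file once and report (changed, new_digest).

    The first call (previous_digest is None) never reports a change;
    it only establishes the baseline digest.
    """
    current = digest(fetch_text())
    changed = previous_digest is not None and current != previous_digest
    return changed, current
```

(A cron job or loop could store the returned digest between runs and send a notification, for instance a mail to the address mentioned at 21:44:30, whenever changed is True. Any edit to the file triggers it, not only a master switch, so a human still has to read the diff or the server admin log.)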