[00:01:50] New patchset: Bhartshorne; "only check thumb buckets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3304
[00:02:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3304
[00:02:04] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3304
[00:02:07] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3304
[00:06:10] New patchset: Bhartshorne; "add in pid checking so only one copy runs at a time. add in code to only check thumb buckets, not all buckets." [operations/software] (master) - https://gerrit.wikimedia.org/r/3305
[00:07:01] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3305
[00:07:03] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/3305
[00:50:29] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (36014)
[01:05:41] New patchset: Ryan Lane; "Adding roots and demon manganese" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3306
[01:05:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3306
[01:05:57] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3306
[01:06:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3306
[01:07:21] Change abandoned: Demon; "Abandoning in favor of https://gerrit.wikimedia.org/r/#change,3306" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3024
[01:09:02] New patchset: Ryan Lane; "Revert "Adding roots and demon manganese"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3307
[01:09:15] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3307
[01:09:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3307
[01:09:16] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3307
[01:16:00] Ryan_Lane: Are you around?
[01:16:10] yes
[01:16:13] not for long, though
[01:16:27] what's up?
[01:16:57] RD: ...?
[01:17:00] OTRS received an email from an ISP saying that some IPs might be blocked from accessing the site. Not sure who would be the one to look at this
[01:17:13] it's possible
[01:17:21] Yes
[01:17:31] https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom&TicketID=6491421 if you want to look, otherwise they list two IPs
[01:17:41] I don't have time
[01:17:46] I won't be on until tomorrow
[01:17:56] Is this your 'department'? :)
[01:18:09] well, I'm ops
[01:18:23] Should I bother somebody else, or just drop you a note about it for when you are free
[01:18:42] if you want it looked at now, I recommend finding another ops person
[01:18:58] otherwise I have it open in my browser, but I can't promise any fast response time
[01:20:24] OK, well if any ops want to see if 203.176.199.5 & 203.176.199.10 are blocked from accessing sites.. email from ISP @ https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom&TicketID=6490827 ..otherwise, thanks Ryan_Lane :)
[01:21:05] neither are blocked
[01:21:26] if they were blocked it would be via squid
[01:21:34] the config has neither of those addresses
[01:21:48] also, they would be able to tell because they'd get a response from squid, saying they were blocked
[01:24:25] OK
[01:24:32] I replied. They must just be confused
[01:24:43] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours
[01:29:04] guess I had time after all :)
[01:29:06] heh
[01:29:14] Thanks :)
[01:29:23] yw
[01:55:23] uhm..job_queue on enwiki up to> 37
[01:55:27] k
[02:19:18] Hi Tim, I know you are crazy busy but do you have an idea when you would have some time to do the final code review of the udp-filter?
[02:30:29] drdee: it's going to be a little while
[03:03:43] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[03:10:50] !log running puppet on aluminium
[03:10:54] Logged the message, Master
[03:11:04] RECOVERY - Puppet freshness on aluminium is OK: puppet ran at Wed Mar 21 03:10:59 UTC 2012
[03:16:06] !log powercycling magnesium - down and just "init: tty4 main" on mgmt, frozen
[03:16:11] Logged the message, Master
[03:18:35] !log magnesium - "..drive on port B of the Serial ATA controller is operating outside of normal specifications.. Strike F1 key to continue"..
[03:18:38] Logged the message, Master
[03:19:19] RECOVERY - Host magnesium is UP: PING OK - Packet loss = 0%, RTA = 27.70 ms
[03:25:19] oh, i see, maplebed already made RT-2669 for it
[03:29:31] !log magnesium - shutting down, has existing RT-2669 to replace disk
[03:29:34] Logged the message, Master
[03:32:22] PROBLEM - Host magnesium is DOWN: PING CRITICAL - Packet loss = 100%
[03:34:04] ACKNOWLEDGEMENT - Host magnesium is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT-2669
[03:44:49] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[03:51:48] !log added "lez" to langlist and running authdns-update, for lez.wikipedia per RT-2665
[03:51:51] Logged the message, Master
[05:24:37] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds
[05:24:55] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds
[06:39:47] PROBLEM - MySQL Replication Heartbeat on db46 is CRITICAL: CRIT replication delay 321 seconds
[06:40:23] PROBLEM - MySQL Replication Heartbeat on db1040 is CRITICAL: CRIT replication delay 357 seconds
[06:41:53] RECOVERY - MySQL Replication Heartbeat on db46 is OK: OK replication delay 0 seconds
[06:42:11] PROBLEM - Disk space on search1016 is CRITICAL: DISK CRITICAL - free space: /a 4212 MB (3% inode=99%):
[06:42:20] PROBLEM - Disk space on search1015 is CRITICAL: DISK CRITICAL - free space: /a 3548 MB (3% inode=99%):
[06:43:48] PROBLEM - MySQL Slave Delay on db1040 is CRITICAL: CRIT replication delay 234 seconds
[06:45:36] RECOVERY - MySQL Replication Heartbeat on db1040 is OK: OK replication delay 0 seconds
[06:45:54] RECOVERY - MySQL Slave Delay on db1040 is OK: OK replication delay 0 seconds
[07:34:45] PROBLEM - Host ms-be4 is DOWN: PING CRITICAL - Packet loss = 100%
[07:41:48] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[07:48:40] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123
[07:49:25] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out
[07:49:52] PROBLEM - Lucene on search3 is CRITICAL: Connection timed out
[07:58:27] !log restarted lsearchd on search3 and 9
[07:58:30] Logged the message, Master
[08:05:02] !log ms-be4 down but can't powercycle it yet..Unable to establish LAN session / ipmitool /ipmi_mgmt
[08:05:06] Logged the message, Master
[08:23:27] apergos: could you try the powercycle command just to confirm if it doesn't work for you either? last time i had these weird issues as well and others said "works for me", all of a sudden it worked for me too
[08:33:13] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection refused
[08:33:29] arg
[08:37:25] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.006 second response time on port 8123
[08:37:42] !log stopped/started lsearchd on search9
[08:37:45] Logged the message, Master
[08:38:37] RECOVERY - Lucene on search9 is OK: TCP OK - 0.002 second response time on port 8123
[08:58:34] !log rebooting ms-be4
[08:58:37] Logged the message, Master
[09:01:22] RECOVERY - Host ms-be4 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[09:05:07] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out
[09:49:52] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[09:51:58] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[09:59:55] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[09:59:55] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[10:29:46] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[10:33:40] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123
[10:41:10] RECOVERY - MySQL Slave Delay on db36 is OK: OK replication delay 27 seconds
[10:42:49] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds
[10:48:40] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[10:50:12] I would have, mutante, but I was doing my monthly trip to the government office to see about my residence permit
[10:50:22] 15 months after submission: not done
[10:50:36] the woman next to me in line said she has been waiting a year and a half
[10:50:40] nothing for her either
[10:51:17] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.005 second response time on port 8123
[11:14:41] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[11:16:29] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.001 second response time on port 8123
[11:50:23] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:52:29] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[12:01:47] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[12:03:44] RECOVERY - LVS Lucene on search-pool1.svc.pmtpa.wmnet is OK: TCP OK - 0.003 second response time on port 8123
[12:23:50] PROBLEM - LVS Lucene on search-pool1.svc.pmtpa.wmnet is CRITICAL: Connection timed out
[12:25:59] !log disabling notifications for search-pool1
[12:26:02] Logged the message, and now dispaching a T1000 to your position to terminate you.
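Aside: the flapping "LVS Lucene" alerts above come down to a timed TCP connect against port 8123 on the service address. A minimal Python sketch of that kind of probe, purely illustrative (this is not the actual Nagios plugin; the host, port, and timeout simply mirror the alert text):

```python
# Rough illustration of a TCP service check like the "LVS Lucene" alert above.
# Not the production monitoring code; host/port/timeout mirror the alert text.
import socket
import time

def check_tcp(host="search-pool1.svc.pmtpa.wmnet", port=8123, timeout=10.0):
    start = time.time()
    try:
        # Open a plain TCP connection and report how long it took.
        with socket.create_connection((host, port), timeout=timeout):
            return "TCP OK - %.3f second response time on port %d" % (time.time() - start, port)
    except (socket.timeout, OSError) as exc:
        return "CRITICAL - %s" % exc
```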
[12:38:41] PROBLEM - Puppet freshness on linne is CRITICAL: Puppet has not run in the last 10 hours
[12:38:41] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[13:07:03] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[13:12:18] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours
[13:37:27] New patchset: Demon; "Also add the ability to sudo as gerrit2 so I can inspect the git repos easily" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3309
[13:37:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3309
[13:46:21] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[14:01:11] New review: Mark Bergsma; "Yeah. And have full access to operations/puppet and thus full root on the cluster. Of course that wo..." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/3309
[14:04:30] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 205 MB (2% inode=61%): /var/lib/ureadahead/debugfs 205 MB (2% inode=61%):
[14:08:42] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 105 MB (1% inode=61%): /var/lib/ureadahead/debugfs 105 MB (1% inode=61%):
[14:26:25] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 181 seconds
[14:26:34] RECOVERY - Disk space on srv221 is OK: DISK OK
[14:26:43] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 185 seconds
[15:36:14] New patchset: Pyoungmeister; "htis is where I was supposed to set dontBreakEverything=1 ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3311
[15:36:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3311
[15:38:04] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 171 MB (2% inode=61%): /var/lib/ureadahead/debugfs 171 MB (2% inode=61%):
[15:41:49] RECOVERY - Disk space on search1016 is OK: DISK OK
[15:42:07] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[15:42:16] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[15:44:13] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 215 MB (3% inode=61%): /var/lib/ureadahead/debugfs 215 MB (3% inode=61%):
[15:44:31] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[15:48:34] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=61%): /var/lib/ureadahead/debugfs 199 MB (2% inode=61%):
[15:54:52] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=61%): /var/lib/ureadahead/debugfs 199 MB (2% inode=61%):
[15:56:49] RECOVERY - Disk space on srv223 is OK: DISK OK
[15:56:58] RECOVERY - Disk space on srv221 is OK: DISK OK
[15:56:58] RECOVERY - Disk space on srv219 is OK: DISK OK
[15:59:13] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=61%): /var/lib/ureadahead/debugfs 199 MB (2% inode=61%):
[16:01:01] RECOVERY - Disk space on srv224 is OK: DISK OK
[16:07:37] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 119 MB (1% inode=61%): /var/lib/ureadahead/debugfs 119 MB (1% inode=61%):
[16:16:10] RECOVERY - Disk space on srv222 is OK: DISK OK
[16:18:07] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 271 MB (3% inode=61%): /var/lib/ureadahead/debugfs 271 MB (3% inode=61%):
[16:22:28] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 99 MB (1% inode=61%): /var/lib/ureadahead/debugfs 99 MB (1% inode=61%):
[16:26:49] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 271 MB (3% inode=61%): /var/lib/ureadahead/debugfs 271 MB (3% inode=61%):
[16:37:19] RECOVERY - Disk space on srv222 is OK: DISK OK
[16:45:25] RECOVERY - Lucene on search9 is OK: TCP OK - 0.002 second response time on port 8123
[16:45:43] RECOVERY - Disk space on srv219 is OK: DISK OK
[16:57:22] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out
[17:14:19] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3311
[17:14:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3311
[17:17:46] RECOVERY - Lucene on search3 is OK: TCP OK - 0.002 second response time on port 8123
[17:18:04] RECOVERY - Lucene on search9 is OK: TCP OK - 0.001 second response time on port 8123
[17:23:37] can someone please do sudo kill -9 11000 on search1 and then restart lsearchd? This thread seems to be stuck
[17:24:19] On it
[17:24:46] rainman-sr: How do I restart lsearchd?
[17:24:54] nm got it
[17:24:59] sudo /etc/init.d/lsearch restart
[17:25:01] thanks
[17:25:03] Done
[17:25:28] oki, thanks, looks good
[17:25:54] speaking of search...
[17:25:59] any idea what was up last night?
[17:26:14] it paged repeatedly (and I presume someone in europe fixed it each time)
[17:26:58] nop, no idea, i guess it was the usual, processes going out of memory
[17:27:41] http://nagios.wikimedia.org/nagios/cgi-bin/history.cgi?host=search-pool1.svc.pmtpa.wmnet&service=LVS+Lucene
[17:28:16] RECOVERY - Disk space on search1015 is OK: DISK OK
[17:41:29] maplebed, yep, i suspect this is because the majority of pool1 went offline by going out of memory
[17:43:17] maplebed: ping pong :)
[17:43:36] maplebed: will you be able to review a minor change to redirect swift syslogs https://gerrit.wikimedia.org/r/#change,2820
[17:43:55] maplebed: it amends a previous change you have reviewed and deployed :)
[17:46:22] ryan_lane: I received the new DIMM. Let me know when you want to add it to virt3
[17:46:32] is it both dimms?
[17:50:13] hashar: looking
[17:54:19] maplebed: that is a message I caught by looking at the main syslog file
[17:54:22] hashar: that looks good to me. Shall I review and merge?
[17:54:28] yup
[17:54:40] maybe want to SIGHUP the syslog process so it reads the file
[17:54:40] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2820
[17:54:43] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2820
[17:54:45] do you know where it needs to be deployed?
[17:54:51] oh that
[17:54:54] i.e. on the swift hosts or on the syslog aggregator?
[17:55:03] I think it is on the aggregator
[17:55:21] as I understand it, that is the one receiving everything and then applying those filters to dispatch messages into various log files
[17:55:29] though I have no idea where that aggregator is
[17:55:32] I think so too.
[17:55:42] me neither! but reading the swift syslog configs will tell met.
[17:55:44] me.
[17:55:49] (meanwhile, SIGHUP might not be the proper way, probably want to use something like /etc/init.d/syslog-ng reload)
[17:58:07] PROBLEM - MySQL disk space on db59 is CRITICAL: DISK CRITICAL - free space: /a 16116 MB (2% inode=99%):
[17:58:15] hashar: thanks for the change!
[17:58:29] it's deployed now.
[17:58:44] and puppet kicked syslog-ng for me
[17:59:42] awesome
[17:59:44] puppet is SOO smart
[17:59:46] PROBLEM - Disk space on db59 is CRITICAL: DISK CRITICAL - free space: /a 16129 MB (2% inode=99%):
[17:59:56] it is going to replace the whole staff one day :-)
[18:00:02] maplebed: thanks! heading out now :)
[18:00:05] will be back later
[18:19:48] hi binasher, would you have some time today / tomorrow to create a slave db for the Portuguese and Hindi Wikipedias?
[18:24:07] drdee_: what do you mean?
[18:24:31] like db42 for the English, can we have a similar setup for the Portuguese and Hindi Wikipedias?
[18:25:09] in particular, the analytics team needs a replica of the recentchanges table for both wikis
[18:25:13] what are the actual wiki names?
[18:25:28] 1 sec
[18:26:28] this sort of request generally needs planning around hardware acquisition and budget impact
[18:27:11] http://pt.wikipedia.org/ and http://hi.wikipedia.org/
[18:28:18] mmmmmmm, and just replicating the recentchanges table and use db42 or db1012 (i think, we have several slaves during the summer of research)
[18:28:48] there isn't room to add two additional mysql instances on db42
[18:28:50] or one additional
[18:29:30] this would generally involve creating slaves of s2 and s3 on separate servers, covering most wikis
[18:29:36] perhaps a mysql slave isn't what you want
[18:30:41] perhaps you should dump recentchanges from those two wikis, import into tables, and regularly run against an idle eqiad slave select * from recentchanges where id > xxx to update your snapshot tables
[18:31:19] okay, that also sounds good
[18:31:26] (and way more doable! :))
[18:31:49] final question: how can we get access to an eqiad slave?
[18:33:07] that's a good question :) you can't from internproxy but we really need to replace that and it isn't suited for jobs that hit production dbs either. you have general cluster access, right?
[18:33:34] no, i don't
[18:41:02] i thought you had access to fenari, emery, locke, bayes, etc?
[18:42:51] New patchset: Pyoungmeister; "actually just needed for search11 (once more disk for search1015 and 1016)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3315
[18:43:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3315
[18:43:18] oh sorry, yes for emery, locke and bayes, fenari not sure
[18:43:26] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay seconds
[18:43:27] how do you get to bayes?
[18:43:33] please tell me you don't ssh directly...
[18:43:35] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay seconds
[18:43:41] (sorry for lurking)
[18:44:47] PROBLEM - Host ms-be3 is DOWN: PING CRITICAL - Packet loss = 100%
[18:45:51] apergos: i think i might have, are you going to kill me now :)
[18:46:13] no but we will plead with you to go through a bastion host (ie fenari or bast1001) pretty please
[18:46:24] sure, i'll improve my life
[18:47:48] ok, gone to eat a late dinner, talk to folks later
[18:48:05] PROBLEM - Host db24 is DOWN: PING CRITICAL - Packet loss = 100%
[18:48:59] RECOVERY - Host db24 is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms
[18:51:06] drdee_: ok, so you've got access to hosts in the cluster and you can connect to any of the dbs as the wikiadmin user
[18:51:33] am i a wikiadmin user?
[18:51:46] No, *the* wikiadmin user
[18:51:48] !log brought db24 back up after hang, and reslaving, but leaving out of db.php. just replicating until a replacement s2 snapshot host is built
[18:51:51] Logged the message, Master
[18:51:53] The DB username we use is 'wikiadmin'
[18:52:03] check
[18:52:28] ok that answers my question about why it suddenly went down, *now* I can go eat
[18:52:34] RoanKattouw: do you think drdee_ should write php and use the mw lb classes for analytics related stuff?
[18:52:44] apergos: ah, yeah it was discussed in -tech
[18:52:51] thanks for checking into it
[18:53:02] Depends, what does he need?
[18:53:07] Master or slave conns
[18:53:16] thanks for making it happy.
[18:54:01] slave conns.. in this case slave conns to dbs in eqiad that aren't used by the site. there is a db-secondary.php that can be included to rewrite wgLBFactoryConf['sectionLoads'] with the eqiad slaves
[18:54:16] Hmm well
[18:54:28] You could do that as a maintenance script I guess
[18:54:52] drdee_: if you feel php'ish, you can write mw maintenance style scripts that will handle the db parts for you
[18:54:58] if you prefer python
[18:55:05] Then you can just grab the conn details
[18:55:24] Grab the username and password (we'll tell you how to get that in private), and pick an eqiad slave for each cluster
[18:55:30] you can find the wikiadmin credentials somewhere.. yeah
[18:55:35] awesome!
[18:55:36] (for python that is, PHP handles this for you)
[18:55:49] thanks guys!
[18:56:02] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 2957 seconds
[18:56:07] use http://noc.wikimedia.org/dbtree/ to select a very bottom level slave
[18:56:11] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 2938 seconds
[18:56:47] will do
[18:57:02] One that has a four-digit number
[18:57:05] RECOVERY - Host ms-be3 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms
[18:57:27] ok
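The snapshot approach binasher describes at 18:30 (dump recentchanges once, then periodically pull rows with an id greater than the last one seen from an idle eqiad slave) could look roughly like the following in Python, for the non-PHP route mentioned above. This is a sketch only, assuming a MySQL driver such as PyMySQL; the slave host, database name, and credential handling are placeholders (the real wikiadmin credentials are handed over privately, as noted), and the id column name follows MediaWiki's recentchanges schema (rc_id).

```python
# Illustrative sketch, not production code: host, database, and credential
# handling are placeholders for the details discussed above.
import pymysql  # assumes a MySQL driver such as PyMySQL is available

SLAVE_HOST = "db10xx.eqiad.wmnet"  # hypothetical: a bottom-level (four-digit) slave picked from dbtree
WIKI_DB = "ptwiki"                 # or "hiwiki"

def fetch_new_recentchanges(last_seen_id, user, password):
    """Return recentchanges rows newer than the last snapshotted rc_id."""
    conn = pymysql.connect(host=SLAVE_HOST, user=user, password=password, db=WIKI_DB)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT * FROM recentchanges WHERE rc_id > %s ORDER BY rc_id",
                (last_seen_id,),
            )
            return cur.fetchall()  # append these rows to the local snapshot table
    finally:
        conn.close()
```

Run periodically from a cluster host (reached via a bastion such as fenari or bast1001, as requested above), keeping track of the highest rc_id written so far.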
[18:59:40] !log rebooted ms-be3 after it crashed.
[18:59:43] Logged the message, Master
[19:00:23] AaronSchulz: do you have a labs account yet?
[19:10:25] maplebed: everyone with a svn account should...
[19:17:20] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3315
[19:17:23] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3315
[19:31:00] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 25 seconds
[19:31:18] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds
[19:35:21] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 206 MB (2% inode=61%): /var/lib/ureadahead/debugfs 206 MB (2% inode=61%):
[19:35:21] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 221 MB (3% inode=61%): /var/lib/ureadahead/debugfs 221 MB (3% inode=61%):
[19:40:49] New review: Ryan Lane; "added some inline comments." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/3309
[19:43:36] RECOVERY - Disk space on db59 is OK: DISK OK
[19:43:54] RECOVERY - MySQL disk space on db59 is OK: DISK OK
[19:45:51] RECOVERY - Disk space on srv221 is OK: DISK OK
[19:51:42] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[19:52:09] RECOVERY - Disk space on srv222 is OK: DISK OK
[19:53:39] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[20:01:45] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[20:01:45] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[20:19:36] RECOVERY - Puppet freshness on aluminium is OK: puppet ran at Wed Mar 21 20:19:23 UTC 2012
[21:00:59] !log shutting down virt3 to replace dimms
[21:01:03] Logged the message, Master
[21:03:24] RobH: are you in the colo today?
[21:03:43] no, i was at the hospital until 3:30
[21:03:49] so my day is shot
[21:03:56] lame. (the hospital part, that is)
[21:04:02] will be there all day tomorrow, from 930 to 1830
[21:04:03] ok.
[21:04:08] minimum
[21:04:22] PROBLEM - Host virt3 is DOWN: PING CRITICAL - Packet loss = 100%
[21:06:25] * jeremyb keeps an eye out for free, unmetered wikipedia access on hospital wifi ;)
[21:08:31] !log swapped 2 DIMMS in virt3 (b2 and b5)
[21:08:35] Logged the message, Master
[21:08:37] \o/
[21:08:47] booting now
[21:10:27] ryan_lane: no DIMM errors...good to go
[21:10:33] at least in post
[21:10:38] great
[21:10:39] thanks
[21:10:42] yeah, I'm watching bootup
[21:10:47] looks good
[21:11:34] RECOVERY - Host virt3 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[22:40:31] PROBLEM - Puppet freshness on linne is CRITICAL: Puppet has not run in the last 10 hours
[22:40:31] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours
[23:04:03] Change abandoned: Demon; "Had Ryan just run a few things manually for me. We'll script it up later." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3309
[23:08:43] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[23:34:31] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[23:47:34] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours