[00:01:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:03:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [00:03:56] greg-g, chrismcmahon: OK, now I'm stumped. The files are there, but it still errors saying they don't exist. [00:04:07] #1. The files do exist [00:04:16] #2. It shouldn't be requesting them to start with [00:04:59] bizarre. [00:05:35] WTF [00:05:52] thanks mutante [00:06:52] New patchset: MaxSem; "Rm old debug group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55228 [00:07:38] maybe I need to sync-dir on it! [00:09:02] that should fix it [00:09:12] !log kaldari synchronized php-1.21wmf12/extensions/Thanks 'syncing Thanks ext' [00:09:18] Logged the message, Master [00:10:24] kaldari: No that's not it, I found it [00:10:30] That error is a literal string in a generated file [00:10:46] csteipp: as of 2012-07-20: db48 is the otrs master, db49 and db1048 slave from it. db1048 is the gerrit master, db1046 and db48 slave from it. [00:10:48] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [00:11:56] New review: Anomie; "> I have not used the multiversion shell script cause I am" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55058 [00:12:08] jeremyb_, are you sure OTRS shares with gerrit? [00:12:22] "Gerrit is installed on manganese in the prefix /var/lib/gerrit" [00:12:29] OTRS is on williams by itself I think? [00:12:32] Thehelpfulone: your point? [00:12:33] right [00:12:43] but neither one uses a DB on localhost [00:13:15] Chris' comment on Bugzilla was "It looks like there are other critical [00:13:15] services on that server/db, so the impact would be significant if an attack was [00:13:15] successful." - so I was looking for what else ran on williams [00:13:22] New review: Anomie; "Still, why not use the getRealmSpecificFilename function to resolve the dblist filename?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55059 [00:13:48] RECOVERY - Puppet freshness on professor is OK: puppet ran at Fri Mar 22 00:13:45 UTC 2013 [00:14:00] RECOVERY - Puppet freshness on nescio is OK: puppet ran at Fri Mar 22 00:13:50 UTC 2013 [00:14:11] RECOVERY - Puppet freshness on sanger is OK: puppet ran at Fri Mar 22 00:14:05 UTC 2013 [00:14:50] RECOVERY - Puppet freshness on cp1026 is OK: puppet ran at Fri Mar 22 00:14:43 UTC 2013 [00:14:59] RECOVERY - Puppet freshness on snapshot4 is OK: puppet ran at Fri Mar 22 00:14:52 UTC 2013 [00:15:30] !log aaron synchronized php-1.21wmf12/includes/Title.php 'deployed cef327e945e2593fe0291880a7d1976cc5a2f248' [00:15:38] Logged the message, Master [00:15:38] RECOVERY - Puppet freshness on db66 is OK: puppet ran at Fri Mar 22 00:15:32 UTC 2013 [00:16:40] !log catrope synchronized wmf-config/ExtensionMessages-1.21wmf12.php 'Remove PHP errors' [00:16:48] Logged the message, Master [00:17:08] RECOVERY - Puppet freshness on colby is OK: puppet ran at Fri Mar 22 00:17:04 UTC 2013 [00:18:21] RoanKattouw: is it likely that we did something wrong during the deployment that caused this? [00:18:34] Well [00:18:38] Kind of [00:18:51] Are you behind the deployment of the Thanks extension? 
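The fix RoanKattouw is pointing at is the ExtensionMessages regeneration step from the deployment instructions, which has to be rerun (and committed) whenever a new extension such as Thanks is added to extension-list. A rough sketch of that step, assuming the mwscript wrapper and the mergeMessageFileList.php maintenance script; the flag names and paths here are from memory and not confirmed:

    # regenerate the per-branch message file list after adding the extension
    # to wmf-config/extension-list (paths are illustrative)
    mwscript mergeMessageFileList.php --wiki=aawiki \
        --list-file=/home/wikipedia/common/wmf-config/extension-list \
        --output=/home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf12.php

    # sanity-check the generated file before committing and pushing it out
    php -l /home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf12.php
    sync-file wmf-config/ExtensionMessages-1.21wmf12.php 'regenerate for Thanks'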
[00:18:55] yep [00:19:02] Yeah then you did something wrong [00:19:20] I'm pretty sure the instructions tell you to regenerate ExtensionMessages-1.21wmfNN.php and check the results [00:19:34] And the results looked something like: [00:20:03] http://pastebin.com/NHUXmPbg [00:21:33] damn that paste link is slow [00:22:02] New patchset: Catrope; "Add Thanks to extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55231 [00:22:15] kaldari: Also, that ---^^ [00:22:29] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55231 [00:22:55] Alright [00:23:08] Benny said he had done that [00:23:09] AaronSchulz: I'm being dragged out of here but I think things are reasonably stable now [00:23:15] He had [00:23:19] But it wasn't committed [00:23:22] ah [00:23:46] ok [00:24:03] OK, I'm out of here. If more things fall apart (given how today has gone I wouldn't be surprised), text me, phone number is on officewiki [00:24:09] RoanKattouw: Thanks for your help! [00:30:11] kaldari: I'm done too, but I'll look in on test2 early tomorrow. (and thanks for fixing PageTriage, that was an unexpected breakage) [00:30:33] thank you! [00:54:47] !log olivneh synchronized php-1.21wmf11/extensions/NavigationTiming 'NavigationTiming: log mobile mode' [00:54:55] Logged the message, Master [01:05:40] New patchset: Ori.livneh; "Correct entry-point in E3Experiments extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55232 [01:08:44] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55232 [01:11:03] !log olivneh synchronized wmf-config/CommonSettings.php 'Temporarily accepting either E3Experiments.php or Experiments.php as extension entry point to facilitate clean-up' [01:11:10] Logged the message, Master [01:17:54] !log olivneh synchronized php-1.21wmf11/extensions/E3Experiments 'Updating E3Experiment extensions's entry point' [01:18:02] Logged the message, Master [01:18:08] !log olivneh synchronized php-1.21wmf12/extensions/E3Experiments 'Updating E3Experiment extensions's entry point' [01:18:14] Logged the message, Master [01:19:42] New patchset: Ori.livneh; "Remove temporary workaround for E3 entry-point migration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55235 [01:20:40] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55235 [01:23:29] !log olivneh synchronized wmf-config/CommonSettings.php 'Remove temporary workaround for E3 entry-point migration (1/2)' [01:23:36] Logged the message, Master [01:24:14] !log olivneh synchronized wmf-config/extension-list 'Remove temporary workaround for E3 entry-point migration (2/2)' [01:24:20] Logged the message, Master [01:28:05] !log Finished correcting E3Experiments extension's entry point from 'Experiments.php' (which does not match the submodule name) to 'E3Experiments.php'. Updated extension-list, CommonSettings.php, and the extension itself in both wmf11 and 12. [01:28:11] Logged the message, Master [01:31:46] New patchset: Yurik; "Mobile dflt redir to RU for Vimpelcom Beeline (m & zero)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55236 [01:47:30] New review: Ori.livneh; "I don't think it would make sense (for me, anyway) to have this be a role that only specific machine..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/50306 [02:04:18] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [02:30:32] !log LocalisationUpdate completed (1.21wmf11) at Fri Mar 22 02:30:31 UTC 2013 [02:30:39] Logged the message, Master [02:32:38] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 185 seconds [02:33:22] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 230 seconds [02:37:40] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1043, oops' [02:37:48] Logged the message, Master [02:39:12] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [02:42:33] !log LocalisationUpdate completed (1.21wmf12) at Fri Mar 22 02:42:32 UTC 2013 [02:42:40] Logged the message, Master [02:45:45] PROBLEM - MySQL disk space on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:45:45] PROBLEM - MySQL Recent Restart on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:45:46] PROBLEM - Full LVS Snapshot on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:02] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:02] PROBLEM - MySQL Slave Delay on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:36] PROBLEM - SSH on db1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:36] PROBLEM - MySQL Slave Running on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:36] PROBLEM - MySQL Idle Transactions on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:46:36] PROBLEM - mysqld processes on db1050 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
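For context on the next few entries: when a slave such as db1050 crashes, the usual first response is to pull it out of the MediaWiki load-balancer config and push that change out, which is what asher's !log lines below record. A minimal sketch of that workflow, with the deployment-checkout path as an assumption:

    # on the deployment host, in the common config checkout (path assumed)
    cd /home/wikipedia/common
    $EDITOR wmf-config/db-eqiad.php    # comment the crashed slave out of its shard's server list
    sync-file wmf-config/db-eqiad.php 'pulling db1050, crashed'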
[02:46:50] !log asher synchronized wmf-config/db-eqiad.php 'returning db1043' [02:46:57] Logged the message, Master [02:48:44] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1050, crashed' [02:48:51] Logged the message, Master [02:50:16] !log power cycling db1050, unresponsive on serial console [02:50:21] Logged the message, Master [02:51:57] !log on searchidx1001: started incremental indexer, apparently it died on March 21 at 02:08 when it ran at the same time as a cron job import [02:52:04] Logged the message, Master [02:53:22] PROBLEM - Host db1050 is DOWN: PING CRITICAL - Packet loss = 100% [02:58:03] RECOVERY - Host db1050 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [03:01:03] New patchset: Asher; "db1050 died" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55241 [03:01:30] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55241 [03:02:33] PROBLEM - Host db1050 is DOWN: CRITICAL - Plugin timed out after 15 seconds [03:04:56] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay seconds [03:05:06] RECOVERY - Host db1050 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [03:05:17] RECOVERY - MySQL Slave Delay on db1050 is OK: OK replication delay seconds [03:05:17] RECOVERY - SSH on db1050 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:05:17] RECOVERY - MySQL Slave Running on db1050 is OK: OK replication [03:05:17] RECOVERY - MySQL Idle Transactions on db1050 is OK: OK longest blocking idle transaction sleeps for seconds [03:05:37] RECOVERY - MySQL Recent Restart on db1050 is OK: OK seconds since restart [03:05:37] RECOVERY - Full LVS Snapshot on db1050 is OK: OK no full LVM snapshot volumes [03:05:37] RECOVERY - MySQL disk space on db1050 is OK: DISK OK [03:07:17] RECOVERY - mysqld processes on db1050 is OK: PROCS OK: 1 process with command name mysqld [03:18:36] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 184 seconds [03:18:37] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 184 seconds [03:21:26] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [03:21:27] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [03:53:17] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds [03:53:35] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [03:54:40] Anyone awake that can do a security review of a small C++ program for me? [03:56:31] Tomorrow, then [04:09:13] PROBLEM - Puppet freshness on mw44 is CRITICAL: Puppet has not run in the last 10 hours [04:10:44] RECOVERY - Puppet freshness on analytics1001 is OK: puppet ran at Fri Mar 22 04:10:39 UTC 2013 [04:37:41] New patchset: Yurik; "Rearranged and consolidated carrier detection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55242 [04:39:53] New patchset: Yurik; "(bug 46430) Mobile dflt redir to RU for Vimpelcom Beeline (m & zero)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55236 [04:40:33] New patchset: Yurik; "Rearranged and consolidated carrier detection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55242 [04:52:10] yuck, what a day [04:53:01] greg-g: heh [04:53:16] TimStarling: what's died mean? not in process list at all? 
should be pretty easy to get a nagios check for that [04:53:32] (searchidx1001) [04:53:53] (err, icinga [04:53:54] ) [04:54:01] https://xkcd.com/859/ [04:58:33] New review: Ori.livneh; "Well, the base directory names on stat1 and stat1001 ought to match, so I think we want the base nam..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54116 [05:12:06] jeremyb_: it's in a shell script restart loop with set -e [05:12:24] the shell script runs the updater, which updates all wikis and then exits [05:12:41] then the shell script sleeps for 15 minutes then starts the index update again [05:13:13] I found the relevant entries in /var/log/account/pacct [05:13:32] the indexer finished, exited, then after 15 minutes, started again [05:13:59] then after 18 seconds, java exited, and the shell script exited immediately afterwards, so java probably exited with a non-zero exit status [05:14:10] which was fatal because of the set -e [05:14:48] so you can't just monitor for the presence of the java process since it will be absent for 15 minute periods [05:17:00] what would be ideal is an occasional check (say hourly) of the last modification timestamp of /a/search/indexes/status/enwiki [05:17:38] if it's more than, say, 12 hours, it would alert [05:17:44] more than 12 hours ago [05:19:01] TimStarling: (( $(date +%s) - $(stat -c %Y /a/search/indexes/status/enwiki) < 43200 )) [05:19:32] er, > [06:07:58] PROBLEM - Puppet freshness on mw1099 is CRITICAL: Puppet has not run in the last 10 hours [06:10:58] PROBLEM - Puppet freshness on mw1001 is CRITICAL: Puppet has not run in the last 10 hours [06:10:58] PROBLEM - Puppet freshness on mw1103 is CRITICAL: Puppet has not run in the last 10 hours [06:13:58] PROBLEM - Puppet freshness on cp1001 is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on db1017 is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on db40 is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on db45 is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours [06:13:59] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [06:14:00] PROBLEM - Puppet freshness on sq53 is CRITICAL: Puppet has not run in the last 10 hours [06:14:00] PROBLEM - Puppet freshness on sq83 is CRITICAL: Puppet has not run in the last 10 hours [06:14:58] PROBLEM - Puppet freshness on db1018 is CRITICAL: Puppet has not run in the last 10 hours [06:14:59] PROBLEM - Puppet freshness on db1021 is CRITICAL: Puppet has not run in the last 10 hours [06:14:59] PROBLEM - Puppet freshness on db1039 is CRITICAL: Puppet has not run in the last 10 hours [06:14:59] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [06:14:59] PROBLEM - Puppet freshness on db53 is CRITICAL: Puppet has not run in the last 10 hours [06:14:59] PROBLEM - Puppet freshness on db67 is CRITICAL: Puppet has not run in the last 10 hours [06:14:59] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [06:15:00] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [06:15:01] PROBLEM - Puppet 
freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [06:15:01] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [06:15:01] PROBLEM - Puppet freshness on sq49 is CRITICAL: Puppet has not run in the last 10 hours [06:15:02] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [06:15:02] PROBLEM - Puppet freshness on sq81 is CRITICAL: Puppet has not run in the last 10 hours [06:28:58] PROBLEM - Puppet freshness on cp3022 is CRITICAL: Puppet has not run in the last 10 hours [07:01:28] New review: Hashar; "Great! :-D" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55162 [07:01:33] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [07:53:34] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 185 seconds [07:54:03] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 188 seconds [08:04:06] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [08:04:37] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [08:06:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:07:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [08:55:07] PROBLEM - MySQL Slave Delay on db57 is CRITICAL: CRIT replication delay 217 seconds [08:55:19] PROBLEM - MySQL Replication Heartbeat on db57 is CRITICAL: CRIT replication delay 222 seconds [08:58:08] RECOVERY - MySQL Slave Delay on db57 is OK: OK replication delay 0 seconds [08:58:18] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [09:07:56] !log Recreating Solr index [09:08:02] Logged the message, Master [09:11:49] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 189 seconds [09:12:00] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 193 seconds [09:15:50] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [09:15:59] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [09:28:42] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [09:28:42] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [09:28:42] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [09:28:42] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [09:36:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [10:11:21] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [10:15:21] PROBLEM - Puppet freshness on mw1093 is CRITICAL: Puppet has not run in the last 10 hours [10:15:21] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [10:16:22] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [10:25:23] PROBLEM - Puppet freshness on db1049 is CRITICAL: Puppet has not run in the last 10 hours [10:28:24] PROBLEM - Puppet freshness on mw1065 is CRITICAL: Puppet has not run in the last 10 hours [11:04:24] Hey there, could somebody please pass me ( aklapper@ ) the content 
for the last two weeks in Bugzilla's "audit_log" table? I'd appreciate it. [12:04:21] PROBLEM - Puppet freshness on constable is CRITICAL: Puppet has not run in the last 10 hours [13:36:32] could somebody please pass me ( aklapper@wm ) the content for the last two weeks in bugzilla.wikimedia.org's "audit_log" table? [13:47:27] New patchset: Demon; "Consolidate gallium and formey replication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55259 [13:58:18] hey guys, did any of you get the emails david and I sent to the & analytics lists yesterday? Subject:    Re: Packet loss on oxygen and locke now [13:58:36] we both got the 'Your message to Ops awaits moderator approval' reply from *-bounces [14:10:04] PROBLEM - Puppet freshness on mw44 is CRITICAL: Puppet has not run in the last 10 hours [14:12:04] PROBLEM - Puppet freshness on mw1092 is CRITICAL: Puppet has not run in the last 10 hours [14:13:06] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours [14:18:04] PROBLEM - Puppet freshness on mw62 is CRITICAL: Puppet has not run in the last 10 hours [14:19:00] New patchset: Ottomata; "puppetize haproxy (for brewster), the very basics (RT-4660)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52578 [14:20:52] New patchset: Ottomata; "puppetize haproxy (for brewster), the very basics (RT-4660)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52578 [14:24:54] PROBLEM - Varnish traffic logger on cp3003 is CRITICAL: Timeout while attempting connection [14:27:16] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:00] mark, you round? wanted to ask about gadolinium varnishncsa [14:30:10] yes [14:30:18] is gadolinium in pmtpa or eqiad? [14:30:22] in eqiad [14:30:32] and oxygen is too, right? [14:30:35] yes [14:30:39] all the element names are [14:30:42] ah ok [14:30:43] good to know [14:31:10] ok, we haven't heard from asher yet, but how about I just remove the extra varnishncsa instance and consume from the stream then [14:31:20] i've got 4 analytics boxes consuming from that stream too, and they've never had a problem [14:31:33] sounds good [14:31:45] how much is the sum of all log traffic now? [14:31:53] i've heard it was approaching 1 Gbps [14:32:02] oo, haven't checked [14:32:25] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 82.86 ms [14:37:41] i've received and read the packet loss emails [14:37:56] but I don't have much to say about it other than "so can we disable the logging on the nginx servers then?" :) [14:40:08] yeah that's what I want to do too, i guess we gotta here from analytics folks [14:45:04] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 199 seconds [14:45:04] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds [14:45:16] mark, re how much traffic, if I watch ifconfig, I see about 0.4 Gbps, which is about what ganglia shows too [14:45:24] 50 M bytes/sec [14:45:39] right [14:45:42] but now it's not peak [14:45:54] above 60-70 it'll start to get problematic [14:46:03] packet loss :) [14:46:21] we def see packet loss during peak times [14:46:39] some of that may also be from the cache servers [14:46:44] is that due to the nic not being able to handle it? 
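The 0.4 Gbps / 50 MB/s figure quoted just above comes from watching interface counters; the same estimate can be taken straight from the kernel statistics. A small sketch (the interface name is an assumption):

    IFACE=eth0                                   # whichever NIC carries the log stream
    RX1=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
    sleep 10
    RX2=$(cat /sys/class/net/$IFACE/statistics/rx_bytes)
    BPS=$(( (RX2 - RX1) / 10 ))                  # bytes per second over the sample window
    echo "$(( BPS / 1000000 )) MB/s, $(( BPS * 8 / 1000000 )) Mbit/s"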
[14:46:46] who are also becoming quite congested [14:46:59] i think that's right, according to what erik z is noticing [14:46:59] that is from temporarily full traffic queues during traffic bursts [14:47:07] only portions of hosts show packet loss [14:47:15] not the entire stream [14:47:16] that's entirely possible [14:47:19] do you have some examples? [14:47:24] I was the one that said the approaching 1gbps [14:47:27] it was 70+ when I looked [14:48:04] see Erik Z's latest email, but I will share a google doc with you with his numbers... [14:48:10] hm, maybe that was locke [14:48:16] * paravoid looks again [14:48:32] oxygen max week is 61.9MB/s [14:48:41] https://docs.google.com/a/wikimedia.org/spreadsheet/ccc?key=0AtLjsFovAGuvdGpuUnhHLWtBZDlkaXdjM2hOTWVZOXc#gid=0 [14:48:44] month 63.7MB/s [14:48:59] locke month is 65.3MB/s [14:49:04] this is the monthly average difference between seq numbers by host group [14:49:13] ssls are way off, but we know why [14:49:35] because the module is broken [14:49:37] we know that for years [14:49:39] upload pmtpas are 20% off, but that might be due to the fact that they don't have much load ? [14:49:45] I wonder why we still collect those streams [14:49:48] good q! [14:49:49] yeah [14:50:00] we're waiting to here from analytics people now to see if we can just remove them [14:50:11] it doesn't work anyway [14:50:25] even after my patches, it bumps the seq number for every *field* [14:50:32] we have what, 10 fields? [14:50:45] that's 90% seq lost [14:50:58] iirc, haven't looked it at for a year [14:51:04] aye, its ok [14:51:15] i think we shoudl just remove it, since they are duplicate logs anyway, as we talked about in that RT ticket [14:51:23] from that spreadsheet I take that recently everything is quite ok except for ssl [14:51:24] am I wrong? :) [14:51:28] if people want SSL stats then we can start logging x-f-proto or just set up a new stream [14:51:43] I think that would be valuable [14:51:49] x-f-proto [14:51:49] naw, since these are large averages, if the number is off by more than a few from 1000 it can indicate something is wrong [14:51:56] see erik z's latest email [14:52:06] - upload squids series knsq* (knams) have 2-3% loss in Feb/Mar, [14:52:16] upload squids in pmtpa have 23% loss since Nov 2012 but their load has dropped to almost zero so that doesn't weigh in on Ganglia trend [14:52:31] not sure if this is significant: [14:52:32] - text squids have < 0.1% loss in recent months [14:53:02] check this too: [14:53:02] http://ganglia.wikimedia.org/latest/graph_all_periods.php?title=Packet+Loss+Average&vl=%25&x=&n=&hreg%5B%5D=(locke%7Cemery%7Coxygen)&mreg%5B%5D=packet_loss_average>ype=line&glegend=show&aggregate=1 [14:53:17] look at the yearly one [14:53:41] i'm pretty sure the diff in packet loss between emery/locke and oxygen is due to SSLs, since oxygen is not receiving logs from nginx [14:54:00] but, i have no idea why this started happening in october [14:54:03] yeah november is around when we moved upload to eqiad [14:54:31] aye, makes sense, so that 20% loss is probably not a worry [14:54:33] probably just error [14:55:01] but also, relevant to the high traffic convo [14:55:25] for the past few weeks we've been seeing spikes of loss during high load time on oxygen and locke sometimes [14:55:34] during the popedot? 
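Erik Z's loss percentages are derived from the per-host sequence counters that udp2log stamps on each line. A hedged sketch of the same calculation over a sampled log file, assuming the sending host is field 1 and its sequence number is field 2, and that the counter does not wrap inside the sample:

    awk '{
        if (!($1 in min)) min[$1] = $2
        max[$1] = $2; seen[$1]++
    }
    END {
        for (h in seen) {
            expected = max[h] - min[h] + 1
            printf "%s expected=%d seen=%d loss=%.2f%%\n",
                   h, expected, seen[h], 100 * (expected - seen[h]) / expected
        }
    }' sampled.log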
[14:55:41] yes, but more regular [14:55:46] http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&title=Packet+Loss+Average&vl=%25&x=&n=&hreg[]=%28locke%7Cemery%7Coxygen%29&mreg[]=packet_loss_average>ype=line&glegend=show&aggregate=1 [14:56:02] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [14:56:03] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [14:56:50] those oxygen spikes started happening backin January and have only been getting worse [14:57:13] oxygen is running 17 different IP based filters for wikipedia zero, we've been meaning to turn them all into a single filter using the X-CS header [14:57:25] but Evan Rosen is not confident about using it yet [15:00:20] anyyyyyway [15:00:29] hopefully we'll take out ssls from the stream soon [15:00:41] I will make gadolinium use the multicast stream (although we should hear from asher on that too) [15:00:56] what does asher have to do with this anyway? [15:01:06] graphite? [15:01:06] he set up the multicast stream [15:01:23] he just told me not to use it since it would be a spof for the udp2log boxes [15:01:40] that's why there were gonna be two [15:02:13] we discussed this, remember? [15:02:19] you were gonna setup two boxes, one in pmtpa, one in eqiad [15:02:27] and everything would consume from those [15:02:45] sounds familiar…:) [15:03:36] if our udp2log boxes are in eqiad (or at least, oxygen and gadolinium), that is a spof for the 2 of them, which I think is why asher told me not to use it [15:03:49] yeah [15:03:52] but there should be one in pmtpa as well [15:04:11] but the eqiad udp2log boxes wouldn't be able to use it, woudl they? [15:04:25] (i thought mulitcast between the dcs was a little funky) [15:04:37] multicast in pmtpa is the biggest problem [15:04:40] not between DCs [15:04:52] but I think we agreed that in pmtpa we'd have udp2log just forward instead of multicast [15:04:54] or something like that [15:05:00] now, with 60 MB/s that's gonna be a bit of a problem [15:05:11] need 10G interfaces, or multiple GigE [15:05:38] ok ja, hm [15:06:21] well, unless you think otherwise, i'm not going to push on making changes right now other than using multicast stream on gadolinium, and eventually removing locke unicast altogether. [15:06:28] we should have the varnishncsa replacement be capable of sending to multiple senders [15:06:35] that would be cool [15:06:36] oxygen would be a spof for oxygen udp2log and gadolinium [15:06:38] but [15:06:41] i think its ok [15:06:46] gadolinium and oxygen run different filters anyway [15:06:50] and there's no redundancy there [15:06:57] so each one is its own spof anyway :p [15:07:08] yeah then it doesn't matter [15:07:49] plus, oxygen has been running the multicast stream flawlessly for how long now? long time. [15:07:55] if it dies, we'll scramble to get another one up [15:07:58] socat I think? 
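For reference, the socat relay being described is roughly a one-liner of this shape; the port and multicast group below are placeholders rather than the production values:

    # re-emit the unicast udp2log feed arriving on :8420 to a multicast group
    socat -u UDP4-RECV:8420 UDP4-DATAGRAM:239.128.0.112:8420

    # a consumer (another udp2log box, an analytics host) joins the group like so
    socat -u UDP4-RECV:8420,ip-add-membership=239.128.0.112:eth0 STDOUT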
[15:08:01] yeah [15:08:02] socat [15:08:08] I setup a multicast relay on some box a long time ago [15:08:12] I think asher used that [15:08:15] yeah, its socat [15:08:23] i was saying that its been running well for a long time [15:08:31] sure, there's not much to it ;) [15:08:35] sure it is due to fail sometime, but we'll just scramble when it happens [15:08:43] heh [15:08:46] traffic loggers fail all the time [15:08:50] every time I reboot a box [15:08:53] varnish comes up at boot [15:08:59] and varnishncsas don't start until the next puppet run [15:09:03] haha, yeah [15:09:05] no i mean [15:09:05] hah [15:09:06] they also memleak, get killed, restarted, etc [15:09:12] i just mean the socat instance [15:09:15] i know [15:09:16] just saying [15:09:18] oh oh [15:09:18] yeah [15:09:19] totally [15:09:24] you're not getting 100% anyway ;) [15:09:26] yeah [15:09:36] i've caught varnish running but not writing to shmlog [15:09:36] if/when we want to add another udp2log box in pmtpa, we can put effort into adding multicast there [15:09:41] so no logging [15:09:43] ha [15:09:57] or really, it was writing to a different shmlog, but it doesn't matter [15:10:06] things like that ;) [15:32:36] New patchset: Ottomata; "gadolinium now uses udp2log multicast relay." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55269 [15:40:16] heya, paravoid: theoretical puppet q for you [15:40:32] since i'm removing a varnishncsa logger [15:40:52] i thoguht i'd go ahead and make the varnish::logging define removable via an ensure => absent parameter [15:41:10] but, i think there's a chicken and egg issue with that [15:41:10] https://gist.github.com/ottomata/5222247 [15:41:27] the service has to require that the init.d file exists [15:41:33] in order to start it [15:41:39] but in the reverse scenario [15:41:55] the init.d file has to exist in order for puppet to stop the service (if ensure == absent) [15:42:16] but, if ensure is absent, and the service depends on the file resource [15:42:25] then puppet will remove the init.d file before it tries to stop the service [15:42:27] force the order after the definitions [15:42:31] hm [15:42:33] conditional? [15:42:36] either via Service[...] -> File[...] [15:42:41] or with the plussignment operator [15:42:58] if absent { service ==> file } else { file -> service } [15:42:58] ? [15:43:14] like Service[...] { require +> File[...] } [15:43:34] yeah, i thought of the order thing, just seems weird! [15:43:35] ok [15:43:39] do you have a prefernce? [15:43:41] no :) [15:43:43] k [15:43:51] i think I like plussignment better [15:46:47] hmmmm, actually, no, i think this doesn't work either, i mean [15:46:57] it will work in the case where the file had actually existed before [15:47:05] but if someone adds a new varnish machine [15:47:23] and ensure => absent [15:48:13] the service will be stopped without the init.d file in place [15:48:25] paravoid^ :) [15:58:25] mark, i've done frontend conf deployments about 3 times now, each time babysat by peter…I feel more confident about it, especially since this is such a small change [15:58:35] mind if I go ahead, and if I break things come screaming to you? :D [15:59:27] sorry, frontend what? [16:03:36] removing gadolinium [16:03:43] squid, varnishncsa, nginx [16:06:08] ok [16:08:38] PROBLEM - Puppet freshness on mw1099 is CRITICAL: Puppet has not run in the last 10 hours [16:10:25] New patchset: Ottomata; "gadolinium now uses udp2log multicast relay." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/55269 [16:10:56] gonna do the squids first [16:11:35] PROBLEM - Puppet freshness on mw1001 is CRITICAL: Puppet has not run in the last 10 hours [16:11:35] PROBLEM - Puppet freshness on mw1103 is CRITICAL: Puppet has not run in the last 10 hours [16:14:13] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55269 [16:14:36] PROBLEM - Puppet freshness on cp1001 is CRITICAL: Puppet has not run in the last 10 hours [16:14:37] PROBLEM - Puppet freshness on db1017 is CRITICAL: Puppet has not run in the last 10 hours [16:14:37] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours [16:14:37] PROBLEM - Puppet freshness on db40 is CRITICAL: Puppet has not run in the last 10 hours [16:14:37] PROBLEM - Puppet freshness on db45 is CRITICAL: Puppet has not run in the last 10 hours [16:14:37] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours [16:14:37] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [16:14:38] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours [16:14:39] PROBLEM - Puppet freshness on sq53 is CRITICAL: Puppet has not run in the last 10 hours [16:14:39] PROBLEM - Puppet freshness on sq83 is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on db1021 is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on db1018 is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on db67 is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on hooft is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on db53 is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours [16:15:36] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [16:15:37] PROBLEM - Puppet freshness on db1039 is CRITICAL: Puppet has not run in the last 10 hours [16:15:37] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours [16:15:38] PROBLEM - Puppet freshness on sq49 is CRITICAL: Puppet has not run in the last 10 hours [16:15:38] PROBLEM - Puppet freshness on sq81 is CRITICAL: Puppet has not run in the last 10 hours [16:15:40] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [16:15:40] PROBLEM - Puppet freshness on sq80 is CRITICAL: Puppet has not run in the last 10 hours [16:17:59] mark, ~how many frontend varnish servers are there? (ones that would have been running gadolinium varnishncsa?) [16:18:40] 23 I think [16:20:32] mk, not so bad, i didn't do the ensure => absent thing I was talking with paravoid about because it would make puppet angry in some cases [16:20:49] so I might have to manually log into them and stop ncsa and remove teh init.d file :/ [16:21:06] is there a dsh group for them? is that dangerous to do with dsh? [16:23:26] puppet angry? 
:) [16:23:31] no there's no up to date dsh group for them [16:24:51] yeah, due to resource deps [16:25:01] if I remove the init.d file, the serivce won't stop [16:25:11] if i stop the service before removing the init.d file [16:25:20] puppet will error on new varnish instance (that don't have the init.d file) [16:25:27] because the service won't really exist [16:26:09] ok, i've not used dsh on the prod cluster at all, is there any documentation of how/where? can I updated it with a frontend varnish group? [16:26:16] or shoudl I just find a list of them and do it manually? [16:29:35] PROBLEM - Puppet freshness on cp3022 is CRITICAL: Puppet has not run in the last 10 hours [16:31:32] hey folks, we're doing some reorg on the Wikipedia Zero varnish configurations [16:31:41] i want to double-check something real quick with someone who knows varnish better :) [16:31:59] https://gerrit.wikimedia.org/r/#/c/55242/2/templates/varnish/mobile-frontend.inc.vcl.erb <- this changes some of our detection to first set some headers, then compare based on them [16:32:09] wanna make sure that actually works as expected before i go approving it :) [16:32:37] PROBLEM - Packetloss_Average on gadolinium is CRITICAL: CRITICAL: packet_loss_average is 20.4659275 (gt 8.0) [16:32:48] New patchset: Demon; "Updating gerrit to 2.6-rc0-7-g6e5cc39" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/55271 [16:33:11] New review: Demon; "War can be obtained from: https://integration.wikimedia.org/nightly/gerrit/wmf/gerrit-2.6-rc0-7-g6e5..." [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/55271 [16:35:20] This is war. [16:35:22] New review: Brion VIBBER; "Looks ok to me, but I'd like someone with more Varnish experience to confirm that setting a header, ..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55242 [16:35:52] war? what is it good for? [16:43:39] New review: Brion VIBBER; "Should work. :)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55236 [16:48:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55236 [16:52:12] whee [16:52:51] New review: Mark Bergsma; "Yes, that works." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/55242 [16:52:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55242 [16:53:36] thanks mark :D [16:54:22] If anybody from the ops team has two minutes, could you please pass me ( aklapper@wm ) the content for the last two weeks in bugzilla.wikimedia.org's "audit_log" table? thanks in advance... [17:02:24] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [17:02:34] RECOVERY - Packetloss_Average on gadolinium is OK: OK: packet_loss_average is 0.143514 [17:05:21] andre__: are you on the ops mailing list? [17:05:46] andre__: I could send it there if you want (no one is in the office yet, not sure about remote workers right now) [17:06:42] greg-g, I am, but this something quick and dirty that does not feel like worth an email, really :) [17:06:51] * Damianz wonders how greg-g knows no one is in the office, if there is no one in the office.... inception... is the cat dead or alive in the box... [17:06:54] but thanks :) [17:07:10] Damianz: no one in ops, the lazy... err [17:07:18] ;) [17:07:21] I have two things that require ops that I'd love to sort out today still. 
Let's see if timezones will collide or not :P [17:09:28] greg-g: [17:09:29] um [17:09:37] there's some of us in the office [17:09:51] though now i am not sure if i should have admitted that ;) [17:10:08] plus, all the european folks are up early :) [17:10:22] New patchset: Mark Bergsma; "Remove hit_for_pass code in generic Wikimedia VCL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55277 [17:12:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55277 [17:13:10] LeslieCarr: oh, I didn't see you! :) [17:18:37] lesliecarr: i received a new uplink module today...not sure what to do w/it rt4456 [17:19:07] hi. deployment question - i just did a big varnish reconfig (merged), how will i know when it gets deployed? [17:19:13] cmjohnson1: an ex4500 uplink module ? [17:19:21] or an ex4200 ? [17:19:31] 4500 [17:19:40] * brion waves yurik  [17:19:56] * yurik has been waved [17:20:01] like for stacking ? [17:20:02] hrmm...nope lesliecarr [17:20:03] 4200 [17:20:09] oh [17:20:09] according to the ticket [17:20:10] :) [17:20:17] 4200 is for spare purposes [17:20:26] since we had all that drama after asw-c-eqiad's one went bad [17:20:50] * cmjohnson1 put's it back in the box for storage [17:20:55] hmm.. brion, whom should i ping re q above [17:21:15] mark did the merge, let's ask him [17:21:29] wasn't sure if mark is mark :) [17:23:54] yurik: it was merged and has been deployed by puppet around now [17:24:01] \o/ [17:24:04] yei! [17:24:13] i tested it on one box, it seemed fine [17:24:20] puppet should normally do the others within 30 mins, so about now [17:24:22] yurik: go ahead and do some spoofing tests, make sure it all looks right [17:24:24] mark, two q: is it possible for me to do that? [17:24:28] no [17:24:33] :) [17:24:34] i meant not to pro [17:24:36] production [17:24:40] but as a test [17:24:40] in labs, yes [17:24:54] is there an easy setup to do the whole deployment? [17:24:56] hashar is working on setting up mobile varnish in the beta cluster [17:25:00] it's not fully finished yet [17:25:03] super! [17:25:04] so I think you'll be able to soon [17:25:10] will i be able to fake my ip? [17:25:21] no... don't think so [17:25:24] i think we still don't have a labs setup for zero, or it's partially done…. yurik ask jcmish about that, she's our mobile qa person [17:25:37] this is very important for us IMO [17:25:55] otherwise i have no idea if i broke partners ... they won't be happy :) [17:26:09] ideally we should have some way to throw fake traffic at it and test yeah [17:26:39] unit tests too.... mmm [17:27:57] :) to dream the impossible dream [17:28:06] ok, so the state of the art - change puppet file and bug mark. Future goal - jcmish & hashar will set up a test env [17:28:29] sounds about right [17:28:52] s/mark/ops/ [17:28:54] RobH, mutante_away, and LeslieCarr are also folks who can push stuff. [17:29:03] although I'm the one who does most varnish stuff at present, indeed [17:29:04] there should probably be a list somewhere, i just ping random people until someone responds ;) [17:29:23] brion: why'd you out me? ;) [17:29:53] everyone in ops as a full time position has root. [17:30:02] that means they can push stuff, but not they should ;] [17:30:05] not true [17:30:19] hehe [17:30:36] i have root and should *definitely* not push things because i don't know what i'm doing there [17:30:51] paravoid: not true? someone is full time in ops without root? 
[17:31:00] i keep to the download server mostly :) [17:31:05] brion: I thought we took your shell away years ago? [17:31:14] RoanKattouw: sssssh :) [17:31:15] i was shocked too [17:31:18] haha [17:31:20] brion left, a few weeks before he was rehired I took away his shell [17:31:27] i figured if we gave it back to him he would quit [17:31:32] lol [17:31:33] then at some point we reenabled it and now he has root again too [17:31:35] more shell, mo problems [17:31:56] RobH: coren [17:31:58] yeah i try not to get mixed up in ops stuff cause it's a never-ending rabbit hole and i've got enough on my plate programming :) [17:32:02] danese was like, "please disable brion's shell, he doesn't want it anymore, it's a liability!" [17:32:02] RobH: and CT if you want to be precise :-) [17:32:03] I see [17:32:09] and I was like "uh ok if you insist" [17:32:11] paravoid: Ahh, coren [17:32:19] ok, well, 80% of ops has root. [17:32:30] i really shouldn't have root, though shell is useful for debugging sometimes [17:32:37] There is also exactly one person in most other team that has root [17:32:45] brion: if you really dont need it, we can remove your key out of roots authorized keys [17:32:47] (Tim in platform, Brion in mobile, and myself in features) [17:32:53] cuz honestly, if you dont need it or want, safer to yank. [17:32:56] We're moving towards a pattern here :) [17:33:00] (can leave you will full deployment like most devs) [17:33:03] RobH: otto is kind of platform too [17:33:04] well, not most, some. [17:33:10] coren what? [17:33:12] !log aaron synchronized php-1.21wmf12/includes/WikiPage.php 'deployed 1862a57aef08432e8b1194a54dc179649f5cae63' [17:33:17] tomasz has root too, surprisingly. [17:33:18] Coren: i mistakenly said all ops have root [17:33:20] Logged the message, Master [17:33:22] Ah. [17:33:24] but you and ct do not [17:33:36] tomasz is not surprising if you know where he came from ;) [17:33:47] but nowadays probably shouldn't have it anymore indeed [17:33:48] plus he was about at stillman [17:33:49] i only use root to set file permissions on download, can i get it limited to that server? or should we just fix the file permissions once and for all ;) [17:34:01] brion: we can give you sudo on that server alone [17:34:03] brion: yeah we should provide better methods for that ;) [17:34:06] which is slightly better [17:34:08] \o/ that'd be awesome [17:34:14] can you file an RT ticket with the things you'd need? [17:34:20] :) [17:34:21] yay [17:34:28] just leave me in wikidev and a sudo on dataset2 (or whatever download is now) [17:34:44] let's see if i still have my rt credentials [17:34:46] yeah that works [17:34:59] atleast then if your key is stolen its just bad [17:35:02] instead of very bad. [17:35:04] ;P [17:35:10] :) [17:35:24] yep [17:35:29] * LeslieCarr gives brion the crown of security! [17:35:35] i try to keep my laptop secure but you never know :) [17:36:35] tfinc: hi [17:37:58] https://rt.wikimedia.org/Ticket/Display.html?id=4798 [17:38:01] enjoy :D [17:38:50] greg-g: anyway, the sync finished minutes ago [17:39:16] brion: is there a reason mobile related files arent owned by wikidev group? [17:39:26] hmmmm [17:39:30] actually that might be simplest fix [17:39:33] cuz if they were, then you wouldnt need sudo at all. [17:39:41] cuz sudo to do file crap is pretty wide open sudo [17:39:44] lemme check quick [17:39:44] yeah [17:40:15] cuz yea, it seems your access needs are pretty much dev standard if we have the proper file permissions [17:40:23] dev-deploy standard. 
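The "fix the file permissions once and for all" option being discussed amounts to handing the release trees to the wikidev group; a sketch using the directories named a little further down in the log:

    cd /data/xmldatadumps/public
    sudo chgrp -R wikidev android iOS win8
    sudo chmod -R g+w android iOS win8
    # setgid on the directories so future uploads inherit the group
    sudo find android iOS win8 -type d -exec chmod g+s {} +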
[17:41:15] ok they're mostly not group-writable. hah! [17:41:15] i can fix that [17:41:38] coolness [17:42:49] once thats done, i think its as simple as moving your include user from the admins::roots to admins::mortals [17:42:53] i hate that group name [17:43:15] and yanking your key out of private repo copy of roots authorized [17:43:20] see they're things like this: :P [17:43:21] drwxr-xr-x 3 tfinc root 151 2012-12-18 23:13 iOS [17:43:49] so gotta beat up tomasz to upload proper permissions ;] [17:43:59] how are those files placed? [17:44:21] no not beat up tomasz, we should provide a proper method for that [17:44:31] yes, i was joking. [17:44:37] brion: there has been talk about a new download server for mediawiki at the platform meetup [17:44:42] probably we should coordinate that effort with you guys [17:44:45] hence the ;] [17:44:52] spiff [17:45:01] RobH: we scp them in [17:45:10] probably just have bad default umasks [17:45:27] ok i'm not touching the mediawiki releases so i don't have to touch that directory [17:45:47] ok i think i go them all -- android, iOS, and win8 in /data/xmldatadumps/public/ [17:45:55] *got [17:46:05] is the best solution here to set puppet to enforce the umask on directories [17:46:06] ? [17:46:17] * brion fixed the permissions on the relevant dirs [17:46:23] * RobH isnt sure what the common practice is, since it seems most sync scripts are what fix permissions  [17:46:33] brion: yep, but they will mess up again no? [17:46:37] ah [17:46:46] well if we don't upload as root it should go better :) [17:47:03] ok, then is that all ya needed and yer good to lose root now? [17:47:04] that whole uploads structure for the data dumps and other is pretty wacky [17:47:11] RobH: yep good to go [17:47:21] cool, I'll yank it and start on it. [17:47:22] oh lemme confirm quick i can log into mw* as myself, i think i should [17:47:36] if you cannot now (you can since yer included in roots) [17:47:45] you will be later (when included in mortals, who all have mwdeploy access) [17:48:01] brion@fenari:~$ ssh mw1044 [17:48:02] Permission denied (publickey). [17:48:10] why you gotta prove me wrong? [17:48:14] haha [17:48:22] and that's with -A forwarding [17:48:25] thats odd. [17:48:32] cuz i am identical in setup to you and can do it! [17:48:44] * RobH investigates [17:48:57] let's make sure i have the right mortal key in there [17:49:23] should be ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDPe5ARdfajt7cDlcK6Fn3uFf5d5hvFdefqdr3L4Q2qeojQYioEvgcbZfVXRzpoSuPPx1cl/tDZCdfYityJiZWaE3T+gDZqYh/zO4M/JkiRp0vfnHKQeRbW7ledlitPKi9ZoEGE0e8FX17V9DNxnSolI3wBrEOOHxmBnnqS2Q04bM1/MRuMH/jxkcOWEp/SG5TOJtlSqKMAOrui7vU0gycQ9Kn6bwB0csuRA2IUwAnn07oVlCoBLR4nDTzj+iXF9j3aB2nyuZE0huXJM4ys3oL5CSDVTDow42vLyH4jwMlugxsgC2QBwUuCPLGz0uTVOvdFG5PstXBEWJnr6lL/0D13 brion@Hawkeye.local [17:49:33] key => "AAAAB3NzaC1yc2EAAAADAQABAAABAQDPe5ARdfajt7cDlcK6Fn3uFf5d5hvFdefqdr3L4Q2qeojQYioEvgcbZfVXRzpoSuPPx1cl/tDZCdfYityJiZWaE3T+gDZqYh/zO4M/JkiRp0vfnHKQeRbW7ledlitPKi9ZoEGE0e8FX17V9DNxnSolI3wBrEOOHxmBnnqS2Q04bM1/MRuMH/jxkcOWEp/SG5TOJtlSqKMAOrui7vU0gycQ9Kn6bwB0csuRA2IUwAnn07oVlCoBLR4nDTzj+iXF9j3aB2nyuZE0huXJM4ys3oL5CSDVTDow42vLyH4jwMlugxsgC2QBwUuCPLGz0uTVOvdFG5PstXBEWJnr6lL/0D13"; [17:49:39] yeah looks right [17:49:40] funky [17:49:53] hrmm [17:50:07] can't get into bast1001 either [17:50:14] but fenari lets me in fine [17:50:28] that is super odd. 
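A quick way to do the check RobH is about to make by hand, i.e. which machines actually received the user's key; the host names are just examples pulled from the conversation:

    for h in bast1001 mw1044 fenari; do
        printf '%s: ' "$h"
        ssh root@"$h" 'test -s /home/brion/.ssh/authorized_keys && echo key present || echo no key' \
            2>/dev/null || echo unreachable
    done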
[17:50:35] computers… man [17:50:41] no srsly [17:50:49] damn unix don't make no sense [17:50:54] right now your cluster account has identical permissions to mine [17:51:53] weeeeeeeeeird [17:52:08] i can root into bast1001 and su to brion :P:P [17:52:40] interesting. [17:52:46] So on the apaches, your home dir has no .ssh [17:52:52] thus no authorized key [17:53:08] aaaaah fun [17:53:20] am i in the wrong subgroup so it's not copying me around? [17:53:29] or are those on an nfs *shudder* [17:53:58] well, you have [17:53:59] if $enabled == "true" and $manage_home { [17:54:02] versus my [17:54:08] if $manage_home { [17:54:14] hmmmm [17:54:21] i compare to say reedy [17:54:25] who matches my just if $manage_home { [17:56:01] am i not enabled? :P [17:56:02] i could just change it [17:56:05] but you are enabled [17:56:08] so im not sure wtf [17:56:18] class brion inherits baseaccount { [17:56:18] $username = "brion" [17:56:18] $realname = "Brion Vibber" [17:56:19] does that vary based on host? or is it global? [17:56:19] $uid = 500 [17:56:21] $enabled = true [17:56:25] huh [17:56:27] is listed as a global in the admins.pp [17:56:31] haha uid 500 [17:56:35] nice eh? [17:56:46] so i could just change it to match mine [17:56:53] that's what i win for configuring the first tampa servers in '04 [17:56:56] but then i think im just glossing over a larger issue in the admins.pp setup. [17:57:13] if jeff wasnt on vacation i would totally ping him [17:57:23] he had to redo a lot of this for fundraising, so has a firm grasp [17:57:34] paravoid: You have any knowlege of how our admins.pp is structured? [17:57:55] holy shit this year will be the 10th anniversary of my conversion to mac os x [17:58:01] i got my powerbook in '03 [17:58:15] apple should buy you a black turtleneck [17:58:21] srsly [17:58:52] i have bought far too many macs since then :) [17:59:01] lemme see if these other enabled variable folks have ssh keys copied properly [17:59:55] example, gwicke [18:00:02] and he does indeed have ssh key copied to apaches [18:00:36] brion: can you try ssh'ing right now to bast1001 as brion ? [18:00:52] maybe with a -vv ? [18:01:06] Permission denied (publickey). [18:01:16] yea, i see failed pubkey [18:01:26] LeslieCarr: its going to fail [18:01:28] LeslieCarr: https://gist.github.com/brion/5223414 [18:01:32] his home direcotry has no authorized keys [18:01:36] no .ssh subdir. [18:01:38] ah [18:01:39] hrm [18:01:40] so [18:01:44] so his key isnt copying [18:01:46] * brion .sshudders in tterror [18:01:56] when it is actually copying for other users with same permission sets in admins.pp [18:02:13] (thus it shouldnt be the enabled flag thing) [18:02:16] maybe. [18:02:28] actually it's not copying for everyone [18:02:31] jeluf for example [18:02:44] Why is that? [18:02:44] or demon [18:03:06] i'm guessing same reason it's not for brion [18:03:25] * Damianz yells 'nobody knows' quietly [18:04:01] shows him added to users swith proper uid [18:04:17] LeslieCarr: so other folks know about this [18:04:20] and im not crazy, thats good. [18:04:27] but does anyone know why? =] [18:05:31] hrm, perhaps the manage_home flag ? 
[18:05:40] Change merged: Ryan Lane; [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/55271 [18:05:46] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55103 [18:05:58] LeslieCarr: look at gwicke user [18:06:05] his is identical syntax to brion's [18:06:09] and his key is in place [18:06:37] oh nm demon has his .ssh directory [18:06:45] not enough caffeine this morn [18:07:36] i dont wanna take away brion's root access since its the only access that works, heh. [18:08:59] PROBLEM - Puppet freshness on mw1077 is CRITICAL: Puppet has not run in the last 10 hours [18:09:33] well, he was at one time set to disabled [18:09:41] i wonder if that flag set something someplace that changing it back doesnt fix [18:09:59] PROBLEM - Puppet freshness on mw1104 is CRITICAL: Puppet has not run in the last 10 hours [18:10:19] so could just yank the enabled part and see if fixes, but that just is shot in dark. [18:10:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:38] LeslieCarr: whadda think? [18:10:47] you made mistake of offering a view, now you are involved, bwahahahahaa [18:11:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [18:12:17] haha [18:12:19] i'm looking through [18:12:22] tparscal also doesn't work [18:12:24] why is that ? [18:12:28] ahha [18:12:32] we never disabled trevor [18:12:35] so there goes my idea [18:12:41] tparscal if $enabled == "true" and $manage_home [18:12:42] its something else, so odd. [18:12:43] that's why [18:12:53] $manage_home isn't true on bast1001 because it doesn't have nfs [18:12:59] PROBLEM - Puppet freshness on mw1043 is CRITICAL: Puppet has not run in the last 10 hours [18:13:01] but look at gwicke! [18:13:12] hrm [18:13:18] gwicke y u mess this up ? [18:13:25] heh, he iddnt, his user is perfect! [18:13:32] all user accounts should strive to be like it. [18:13:34] hrm? [18:13:48] gwicke: we are having issues with brion's key not copying correctly across the cluster [18:13:59] PROBLEM - Puppet freshness on mw1080 is CRITICAL: Puppet has not run in the last 10 hours [18:14:00] and his account syntax is identical to yours, so your name came up [18:14:08] we're trying to de-escalate me from root to a normal user :) [18:14:18] as both accounts are handled the same in puppet syntax [18:14:18] ah, I don't have root [18:14:31] we never got rid of sara's access (i.e. user = false) [18:14:31] indeed [18:14:40] not my fault ;) [18:14:42] lesliecarr: are you going to be around for a little while? eq is troubleshooting the cross connect. [18:14:44] hrmm [18:14:50] little bit [18:14:56] i wonder if its something to actually do with what account you are listed under [18:15:01] 40 minutes i think before lappy is taken away [18:15:05] ie: brion in roots at end includes [18:15:08] but that shouldnt matter [18:15:16] okay [18:15:16] as it handles the roots users key copies the same as the mortals [18:15:31] only difference is roots user key is, independently of this file, inserted into the private repo [18:15:40] i thought. 
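The guard-clause difference under discussion (some accounts wrapped in "if $enabled == "true" and $manage_home", others only in "if $manage_home") is easy to eyeball with grep from a checkout of operations/puppet; the manifest path here is an assumption:

    grep -n 'inherits baseaccount' manifests/admins.pp | head
    grep -n -B2 -A6 'class brion inherits baseaccount' manifests/admins.pp
    grep -n 'manage_home' manifests/admins.pp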
[18:15:50] though there may be something else I am missing in that, hence this issue [18:15:59] New patchset: Lcarr; "making sara enabled = false since she has left" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55294 [18:16:00] PROBLEM - Puppet freshness on mw1003 is CRITICAL: Puppet has not run in the last 10 hours [18:16:08] LeslieCarr: I don't see any other roots with the enabled flag specifically [18:16:25] (i see it for disabled for roots no longer workign here) [18:16:40] yeah, enabled should automatically be true [18:16:47] but the home directory [18:16:48] hrm [18:16:53] i think that may be the ticket [18:17:10] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55294 [18:17:11] i dunno [18:17:12] i try removing it [18:17:16] looking at my user [18:17:19] i have if manage home [18:17:22] and my key is there. [18:17:26] same with gwic [18:17:51] haha you all thought de-rooting me would make your lives easier, not add more work ;) [18:19:00] PROBLEM - Puppet freshness on virt5 is CRITICAL: Puppet has not run in the last 10 hours [18:19:07] LeslieCarr: are you changing something? [18:19:14] i dont wanna merge conflict [18:19:21] so if you are i wont [18:19:23] just changed something [18:19:30] if you git pull [18:19:34] that should be upt o date [18:20:00] PROBLEM - Puppet freshness on mw1089 is CRITICAL: Puppet has not run in the last 10 hours [18:20:00] PROBLEM - Puppet freshness on mw59 is CRITICAL: Puppet has not run in the last 10 hours [18:20:59] PROBLEM - Puppet freshness on mw1129 is CRITICAL: Puppet has not run in the last 10 hours [18:23:13] YuviPanda: i wonder if these will not get deleted http://commons.wikimedia.org/wiki/File:Center_for_Sustainable_Landscapes_at_Phipps_Conservatory,_Pittsburgh,_Pennsylvania_-_16.jpeg [18:23:24] oops [18:23:26] wrong channel [18:23:37] hehe [18:23:59] PROBLEM - Puppet freshness on mw1141 is CRITICAL: Puppet has not run in the last 10 hours [18:24:36] greg-g - did Wikidata folks say when they need the review to be done for that search key rebuild? [18:24:50] interesting those are accurate issues but their motd's claim that puppet was run 7 minutes ago [18:25:19] woosters: within 2 weeks, I believe [18:25:22] ok. thks [18:25:56] speak of the devil... [18:26:13] Denny_WMDE: the review of the search key rebuild, you all needed that by....? [18:26:43] two weeks ago, as usual :) [18:26:48] but whenever it is ready is fine too [18:26:54] whatever comes first [18:26:58] woosters: ^ [18:27:18] got it [18:27:29] search has been broken now for a month or so, so a few days there or here won't hurt [18:30:00] PROBLEM - Puppet freshness on virt7 is CRITICAL: Puppet has not run in the last 10 hours [18:31:59] PROBLEM - Puppet freshness on mw1016 is CRITICAL: Puppet has not run in the last 10 hours [18:31:59] RECOVERY - Puppet freshness on mw59 is OK: puppet ran at Fri Mar 22 18:31:55 UTC 2013 [18:32:41] are we using libvmod for varnish? (variables) [18:32:43] https://github.com/varnish/libvmod-var [18:39:20] mutante: is wikpedia.org supposed to work? [18:39:28] " [18:39:29] We're already challenging the wikpedia.org domain name." [18:39:48] (Mike Godwin, 2009-01-07) [18:42:56] Nemo_bis: did not have a ticket for it, is not in DNS, but .. whois shows we own it. creating ticket to add and redirect it. thanks [18:46:22] mutante: how about wikispecies.org? 
whois services seem confused about it [18:46:57] and it was free in 2009-10-30 [18:47:13] heya paravoid, I need to manually remove the gadolinium varnishncsa i instances from the frontend varnish machines [18:47:28] i'm not sure how to find out which ones they are [18:47:33] mark said there wasn't a complete dsh group [18:47:42] i'm not sure where our dsh stuff is defined anyway [18:47:49] Nemo_bis: "We are in the process of obtaining wikispecies.org" that quote is just one day old [18:48:10] Nemo_bis: we have a ticket for that RT-4445 and it is being worked on [18:48:58] mutante: ah, weird :) [18:49:04] let me tell Sj [18:50:04] it's "definted" in a file on fenari called /etc/dsh/groups/ [18:52:01] ottomata: ^^ [18:52:26] binasher: i was considering starting a wikitech page with something like guidelines for developers who want code running in wmf production [18:52:56] can i steal some of your quotes and make an (incomplete) chart with the request latency (and the actual latency numbers blank for now) [18:53:36] LeslieCarr: ok, but make sure step 1) is rewrite mediawiki from scratch [18:53:44] haha [18:53:58] well i was thinking something that people will actually obey instead of what we want ;) [18:57:12] thanks LeslieCarr [18:57:18] how do I know which group I should look at? [18:57:23] i guess mark said it was incomplete anyway [18:57:25] hm, i guess puppet, eh? [18:59:47] o my goodness [18:59:57] role::cache::upload has a big if (eqiad) hack [18:59:57] haha [19:06:09] LeslieCarr, can you advise? [19:06:19] i need to remove a varnishncsa instance [19:06:32] i was gonna make puppet be smart about it, but it was hacky and would have caused problems [19:06:35] so i just removed the definition [19:06:40] now I need to go and remove the instance manually [19:06:58] i've got regexes from site.pp that I think would match all of the hosts I need to find [19:07:25] is there a better way to find these hosts rather than expanding the regex and checking if the matching hosts exist? 
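Since there is no complete dsh group for this, one rough way to answer ottomata's question is to ask every cache host directly whether it still runs a varnishncsa instance pointed at gadolinium. A minimal sketch using salt (the 'cp*' target glob is an assumption based on the cache hostnames seen in this log):

    # list cache hosts that still have a varnishncsa instance logging to gadolinium
    salt 'cp*' cmd.run "ps auxww | grep '[v]arnishncsa' | grep gadolinium"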
[19:14:42] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 183 seconds [19:14:52] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 188 seconds [19:17:13] PROBLEM - Apache HTTP on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] PROBLEM - Apache HTTP on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] PROBLEM - Apache HTTP on mw1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] PROBLEM - Apache HTTP on mw1188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:23] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:23] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:23] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:35] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:35] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:35] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:42] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:42] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:42] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:42] PROBLEM - Apache HTTP on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:42] PROBLEM - Apache HTTP on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:43] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:44] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:45] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:45] PROBLEM - Apache HTTP on mw1176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:45] PROBLEM - Apache HTTP on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:46] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:46] PROBLEM - Apache HTTP on mw1093 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:47] PROBLEM - Apache HTTP on mw1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:48] PROBLEM - Apache HTTP on mw1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:48] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:49] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:49] PROBLEM - Apache HTTP on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:49] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:50] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:52] PROBLEM - Apache HTTP on mw1069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:52] PROBLEM - Apache HTTP on mw1025 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds [19:17:52] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:52] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:52] PROBLEM - Apache HTTP on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:18:47] robh: notpeter ^ [19:18:59] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [19:18:59] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time [19:18:59] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [19:18:59] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [19:18:59] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.159 second response time [19:18:59] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [19:19:00] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.142 second response time [19:19:00] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.152 second response time [19:19:01] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.138 second response time [19:19:02] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.144 second response time [19:19:02] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.157 second response time [19:19:03] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time [19:19:03] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [19:19:03] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [19:19:04] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time [19:19:04] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [19:19:05] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time [19:19:05] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 61130 bytes in 0.180 second response time [19:19:07] bleh [19:19:20] cmjohnson1: just pinging cuz you know we are about? 
[19:19:26] or did you do something to trigger that ;] [19:19:30] nope [19:19:56] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.303 second response time [19:19:56] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [19:19:56] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.094 second response time [19:19:56] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.094 second response time [19:19:56] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.198 second response time [19:19:56] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.380 second response time [19:19:57] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.045 second response time [19:19:57] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [19:19:58] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.107 second response time [19:19:58] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [19:19:59] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.083 second response time [19:20:00] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.048 second response time [19:20:00] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [19:20:01] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [19:20:01] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [19:20:02] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.069 second response time [19:20:02] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.992 second response time [19:20:02] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.091 second response time [19:20:12] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [19:20:27] really? [19:21:56] yeah....not sure what happened [19:22:31] not my doing :-P [19:26:42] Change abandoned: Krinkle; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54798 [19:29:16] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [19:29:16] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [19:29:16] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [19:29:16] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [19:29:53] paravoid: Can you push the package before the end of today? 
[19:32:09] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [19:32:18] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [19:32:32] !log removed unicast udp2logging to gadolinium; gadolinium will consume from multicast stream [19:32:38] Logged the message, Master [19:34:26] PROBLEM - Varnish traffic logger on cp3003 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [19:35:14] New patchset: Aklapper; "Comment query for urgent issues. Not working as expected yet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55301 [19:39:35] New review: Hashar; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55259 [19:43:15] New review: Demon; "(1 comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55259 [19:48:10] New review: Krinkle; "(1 comment)" [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/54691 [19:48:15] New patchset: Yurik; "Unified default language redirect from m. & zero." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [19:48:26] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55259 [19:48:31] New review: Andrew Bogott; "(2 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [19:49:28] New patchset: Andrew Bogott; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [19:49:57] New review: Krinkle; "(1 comment)" [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/54691 [19:51:46] New review: Yurik; "DO NOT MERGE until reviewed by dfoy & brion" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55302 [19:52:08] New patchset: Hashar; "erb expander for testing purposes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55304 [19:54:13] yurik: can't you vote CR-2 on patch sets ? [19:54:31] yurik: ah no, it is operations/puppet. Forget me :) [19:54:38] hashar: nope [19:54:52] hashar: are you the one setting up a test env for me? :) [19:55:12] yurik: probably not. What do you want to test? [19:55:13] * yurik badly needs a way to test puppets [19:55:22] ahhhh [19:55:30] basically i need a way to test that when i break the varnish script [19:55:33] so you can use a virtual instance on your local computer [19:55:35] i really really break it [19:55:42] or use a labs instance :) [19:55:56] hashar: sure, but it seems a bit complex at the time [19:56:09] we have a puppet class to let your instance fetch from a local directory [19:56:10] https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [19:56:10] basically my goal - deploy puppets and pretend that i am coming from ip XXX [19:56:28] New review: Brion VIBBER; "Zero doesn't work on HTTPS because the certs are wrong (and if it did work, I don't think the carrie..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [19:56:38] hashar: the zuul_service check still doesnt work.. sigh.. for some reason the NRPE file was not updated on gallium .. i'm looking now..hrmm [19:56:56] mutante: did nagios-nrpe restart properly ? 
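For the "test my puppet changes before they hit production" goal discussed above, the usual route is a labs instance with the self-hosted puppetmaster class from the linked wikitech page. The iteration loop then looks roughly like this; the checkout path is an assumption about what that class configures:

    # on a self-hosted-puppetmaster labs instance: pull the change under test,
    # dry-run it, then apply it for real (checkout path may differ)
    cd /var/lib/git/operations/puppet && sudo git pull
    sudo puppet agent --test --noop
    sudo puppet agent --test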
[19:57:14] hashar: yes, but the config does not have the new command [19:57:16] mutante: btw if you know how to find out when a process has started, I am willing to learn :] [19:57:18] ah [19:57:38] there is a timer in the output of "ps" [19:57:46] START TIME [19:58:00] ah ps [19:58:06] ps aux [19:58:07] I will one day have to learn how to use it [19:58:34] If you don't feel like ps, just check the ctime on the pid dir in /proc heh [19:58:35] hashar: would love to meet w you next week (you are in SF, right?) to see how i can test varnish & puppets [19:58:36] Mar08 1:30 /usr/sbin/nrpe -c /etc/icinga/nrpe.cfg -d [19:58:40] mutante: so that looks a bit old [19:58:55] hashar: grep zuul /etc/nagios/nrpe_local.cfg [19:59:00] mutante: ah no nrpe does nothing but receiving the commands I guess so unrelated [19:59:40] the question is more why that file has not been updated after we changed both checkcommands templates, icinga AND Nagios even [19:59:44] yurik: hashar isn't in SF, he's in France [19:59:52] bummer [20:00:05] yurik: I would love to fly over to SF to meet you :-] [20:00:20] hehe, hashar, i think NYC is closer :D [20:00:33] yurik: but that would cost a bit of money and make my wife angry 8-) We can google hangout next week if you want :D [20:00:44] that's what i was thinking [20:00:58] ahh NYC even better timezone wise. Your mornings are my afternoon, that works well [20:01:11] true, but i'm in SF next week [20:01:26] unlucky [20:01:38] Damianz: ah /proc would work. thanks! [20:02:06] hashar: /etc/icinga/nrpe_local.cfg has the new command, /etc/nagios/nrpe_local.cfg does not :p [20:02:20] nagios is obsolete isn't it ? [20:02:25] hehe. hashar, just to give you a heads up - my goal is to have a setup where i can deploy new varnish settings and run a few unit? tests pretending to come from a set of IPs, checking the results [20:02:42] hashar: but.. no.. because that's the one it uses :p [20:02:47] apparently [20:02:52] mutante: I would reload / restart nrpe [20:03:33] mutante: /etc/icinga/nrpe_local.cfg is included by /etc/icinga/nrpe.cfg which is passed as a parameter to the /usr/sbin/nrpe currently running since Mar 08 [20:03:36] hashar: puppet does that [20:03:40] ah [20:03:41] notice: /Stage[main]/Nrpe::Packages/File[/etc/icinga/nrpe_local.cfg]/content: content changed '{md5}a7b3f5d6672f0e9471b076b729f0e884' to '{md5}1e1c9657962ef70b5859356c4b6393f1' [20:03:45] info: /Stage[main]/Nrpe::Packages/File[/etc/icinga/nrpe_local.cfg]: Scheduling refresh of Service[nagios-nrpe-server] [20:03:52] ah [20:03:56] so hmm [20:04:03] puppet is bugged ? :-]]]]]]]] [20:04:05] Is this a right channel to ask why wikinews.com does not redirect to wikinews.org? [20:04:22] odder: we are a non profit, use the .org :-] [20:04:29] odder: yea :p [20:05:09] yurik: you probably want to use a labs instance, though I am not sure how you will forge an IP :-] [20:05:10] dig wikinews.com [20:05:10] hashar: I was just going through the list of squatted domains from internal, and noticed this one is ours already, but it doesn't seem to redirect properly. 
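On the "when did this process start" question a little earlier, two quick ways alongside the ps output already pasted; the PID is only an example:

    # full start timestamp and elapsed time for a given pid
    ps -o pid,lstart,etime,cmd -p 1234
    # or, as suggested above, look at the /proc entry's timestamps
    ls -ld /proc/1234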
[20:05:16] * Damianz uses the other window [20:05:39] odder: confirmed, and thanks for reporting, turning into ops ticket [20:05:48] :-D [20:05:50] so far we just have .de and .org [20:07:28] odder: i can PM you a list of domains we have tickets for if you feel like comparing all [20:07:48] hashar: i was hoping varnish has some ip value somewhere that I can set from either the real IP or from some magic header, and later I will use that ip to check against the ip ranges with the ~ operator [20:08:01] the one i have was created by going through our DNS and checking which are not in Apache [20:08:09] but the ones that are not in our DNS might be missing [20:08:25] instead of client.ip ~ (acl list) [20:09:30] RECOVERY - Puppet freshness on virt7 is OK: puppet ran at Fri Mar 22 20:09:28 UTC 2013 [20:09:55] yurik: I have no idea :-] You might want to ask mark about it [20:10:27] mark is MIA [20:10:29] :) [20:10:32] yurik: I also know it exist a varnish test suite that could potentially fit your needs. [20:10:39] mutante: sure [20:10:45] oh! that might be good :) [20:10:51] yurik: well friday night in Europe. He must be having some beers with relatives :-] [20:11:07] great. both ops are on the other side of the planet :( [20:11:17] It's the good side [20:11:21] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [20:11:22] right [20:11:30] see, even puppet complains [20:13:06] yurik: note that I am not part of ops :-] [20:13:28] despite a common misconception that I am part of that team [20:13:54] hehe [20:15:19] hashar, ops-shmops, close enough [20:15:30] hashar just keeps jenkins in his box :D [20:16:22] PROBLEM - Puppet freshness on mw1093 is CRITICAL: Puppet has not run in the last 10 hours [20:18:42] hashar is honorary ops [20:19:25] when Asher joined, some people started to ask me to fix Innodb on some random sql server [20:19:41] cause I used to be known by the first name of "Ashar" [20:19:47] I must confuse everyone :( [20:21:41] hashar: I was confused at first when they hired Asher Feldman, I was wondering whether he and Ashar Voultoiz (sp?) were different people for a while [20:22:03] silly Roan [20:22:32] they better not hire another brion [20:22:32] RoanKattouw: you were not alone, at the berlin hackaton in 2011, some people congratulated me to have joined the foundation ;D [20:22:45] RoanKattouw: does moving test2wiki pages give you trouble? [20:22:48] brion: that is explicitly forbidden by HR [20:22:53] :) [20:23:19] AaronSchulz: RoanKattouw: that page move issue we had with wmf12, is it covered by a PHPUnit test? That would be nice to have [20:23:44] hashar: It was replag-specific [20:23:54] If you ran the unit tests with simulated replag you might get it [20:24:04] AaronSchulz: I haven't tried [20:24:23] brion: Hey we already have five Roberts and we used to have four Andrews, so .... ;) [20:24:40] you need roberts@lists.wm.org? [20:24:45] haha [20:24:58] namediversity@lists.wm.o [20:25:20] heh [20:26:21] PROBLEM - Puppet freshness on db1049 is CRITICAL: Puppet has not run in the last 10 hours [20:27:55] RoanKattouw: can you try? [20:28:22] PROBLEM - Puppet freshness on mw1065 is CRITICAL: Puppet has not run in the last 10 hours [20:28:55] RoanKattouw: ah replay :-/ We don't have a system to emulate that, niklas filled a but about it a few months ago. 
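The "pretend I'm coming from IP XXX" idea above would ultimately need a hook in the VCL itself (some header the config trusts instead of client.ip), but the black-box side of such a test is simple. Everything below is an assumption about how that hook might look, not how the production VCL behaves: the header name, test host and expected redirect are all illustrative.

    # hypothetical black-box check: ask a test cache to treat the request as
    # coming from a carrier IP and see whether the zero redirect fires.
    # X-Forwarded-For is only a stand-in for whatever header the VCL would trust.
    curl -sI -H 'Host: en.zero.wikipedia.org' \
         -H 'X-Forwarded-For: 203.0.113.10' \
         http://test-cache.example.org/ | grep -i '^Location:'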
[20:29:06] AaronSchulz: Sure [20:29:51] AaronSchulz: Successfully moved [[Test]] to [[Test2]] [20:29:57] Checking DB to verify it got it right [20:31:20] arghg [20:31:29] /usr/local/bin/sql is broken [20:31:43] It needs to run mwscript as Apache but wikiadmin_pass as a wikidev [20:31:43] So sudo'ing it doesn't work eitehr [20:33:02] RoanKattouw: stop breaking the maint scripts :-p [20:33:07] Whoa wtf [20:33:12] `which sql` works [20:33:37] It's probably still using /h/w/bin/sql ? [20:34:48] AaronSchulz: Yup looks fine [20:34:57] I moved it back on top of the redirect too, looks good [20:37:14] RoanKattouw: it works unless your name is Greg or Chris [20:37:14] haha [20:37:51] !log aaron synchronized php-1.21wmf12/includes/Title.php 'Reverted follow-up fixes too' [20:37:58] Logged the message, Master [20:37:58] update the bug title "Page moves totally broken for people named Greg or Chris" [20:46:09] !log aaron synchronized php-1.21wmf12/includes/WikiPage.php 'deployed e372b635ae6d6a589f6d87f14fe4c9452d7a1b4d' [20:46:15] Logged the message, Master [20:48:46] New patchset: Diederik; "Added domain referer info to blog query, and ignore search and preview urls." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55390 [20:51:10] !log aaron synchronized php-1.21wmf12/includes/Title.php 'deployed 0f4914bb4ff2d13cb124cad94dc5ded663123a18' [20:51:16] another [20:51:16] Logged the message, Master [20:52:51] \o/ [20:52:53] greg-g: https://gerrit.wikimedia.org/r/#/c/43445/ [20:54:24] thanks much, AaronSchulz [20:59:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52578 [20:59:51] New patchset: Aaron Schulz; "Revert "Roll back all wikis to php-1.21wmf11 due to bug 46397"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55392 [21:00:37] New patchset: Yurik; "Unified default language redirect from m. & zero." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [21:01:46] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55392 [21:03:16] !log aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Set all non-Wikipedias back to 1.21wmf12 again. [21:03:23] Logged the message, Master [21:06:47] csteipp: https://gerrit.wikimedia.org/r/#/c/55389/ [21:06:52] New patchset: Ottomata; "Removing nginx from udp2log webrequest stream." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55394 [21:07:14] ottomata: thanks, merged the haproxy class [21:07:44] yay, danke! [21:08:25] ottomata: http://en.wiktionary.org/wiki/gern_geschehen [21:08:29] New review: Ottomata; "This is pending some more discussion, I'm just getting it ready. See RT 859" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55394 [21:09:01] gern geschehen [21:09:04] not heard that one! [21:09:36] it's kind of like a 3-way handshake. bitte, danke, gern geschehen :) [21:11:54] Is notpeter around ? [21:12:12] Someone needs to start the incrementalupdater on the search index box. [21:12:35] xyzram: You don't have shell access to the search cluster? [21:12:36] csteipp: rebased [21:13:03] xyzram: don't you have sudo rights as lsearch user ? [21:13:29] don't know the password for that user [21:13:42] hi notpeter [21:13:44] <^demon> We don't use passwords in prod for that, so no :) [21:13:45] what is the box name ? 
[21:13:49] summoned him [21:14:12] searchidx1001 [21:14:16] I guess any root could restart it [21:14:51] started [21:14:52] it is all about sudo -l -u lsearch; /a/search/lucene.jobs.sh incremental-update [21:14:57] you are the boss :-] [21:15:06] it's just: [21:15:06] root@searchidx1001:~# killall -g java [21:15:06] root@searchidx1001:~# /etc/init.d/lucene-search-2 start [21:15:07] root@searchidx1001:~# sudo -u lsearch /a/search/lucene.jobs.sh inc-updater-start [21:15:07] notpeter: would it make sense to convert that to an upstart job ? [21:15:20] so we could get: start lucene-inc-updater [21:15:29] and let upstart/puppet make sure it is always running [21:15:42] sure, could do [21:15:44] it needs to be started after the lucene daemon [21:15:53] but that could be done with dependencies [21:16:11] for the record, the commands can be found in https://wikitech.wikimedia.org/wiki/Lucene#Adding_new_wikis [21:16:14] yeah I guess both puppet and upstart handle dependencies [21:16:25] that's the general restarting the indexer method [21:16:38] I mean, I'd be inclined to not use time to make that better [21:16:43] while you are around, search was broken in beta because lucene-search-2 service was not running on the search box [21:16:45] <^demon> Letting wikidev users sudo as lsearch might be nice, so we could restart that. [21:16:45] and put time into setting up something better like solr ;) [21:16:48] despite puppet having ensure => running :( [21:17:16] hashar: that's lame [21:17:21] ^demon: yep, makes sense [21:17:25] notpeter: well if you invest some time to make the current lucene search better, you will end up freeing more time to work on solr [21:17:48] fyi: 02:52 Tim: on searchidx1001: started incremental indexer, apparently it died on March 21 at 02:08 when it ran at the same time as a cron job import [21:18:04] hashar: sure. I guess I don't see kicking it from time to time as a big time investment [21:18:06] notpeter: maybe I will write the upstart job for the inc updater. [21:18:11] sure! [21:18:13] go for it :) [21:18:18] i like upstart [21:18:31] <^demon> Hmm, killing those labswikimedia wikis seems to have upset something too. [21:18:39] <^demon> Tons of "Error getting snapshot for index en_labswikimedia.hl java.lang.RuntimeException: Index en_labswikimedia.hl doesn't exist" [21:18:52] I have put in my hundreds and hundreds of hours keeping search from not falling over. I no longer have energy left to invest in it... [21:18:56] <^demon> (substitute your favorite labswikimedia wiki) [21:19:14] notpeter: yeah i can understand that [21:19:30] Shirley: creating a wiki is like creating a baby, yes you should have a good reason to create one, but if you don't for whatever reason, you should have an _extra_ good reason for killing one [21:19:35] notpeter: you need a new pet project :) [21:19:48] hashar: I have plenty :) [21:20:05] but it seems like i can't get rid of this one ;) [21:20:15] I was seeing those errors about labs indexes even before; not sure why our production box even knows about labs [21:20:48] notpeter: at least you are not alone now! I can provide some first level support [21:20:59] and xyzram + ^demon knows about the java side pretty well [21:21:02] installs otrscivicrmmingle on hashar1001 [21:21:14] <^demon> xyzram: They aren't new labs. They're old labs (which was sorta on-cluster) [21:21:21] yep! am grateful for people who speak java to be helping on this now :) [21:21:24] <^demon> Recently deleted. 
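Until the proposed upstart job lands, the guarantee the incremental updater needs is just "start it if it isn't running". A rough, cron-able version of that; the pgrep pattern is an assumption about what the updater's command line looks like once started:

    # keep the lucene incremental updater alive until upstart manages it
    # (pgrep pattern is an assumption about the process's command line)
    pgrep -f 'lucene.jobs.sh inc' >/dev/null || \
        sudo -u lsearch /a/search/lucene.jobs.sh inc-updater-start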
[21:21:26] mutante: we all have our nightmare projects it seems :-]]] [21:21:56] we need a lucene-task-force@wikimedia.org mailing list [21:21:59] hehe [21:22:04] s/lucene/search/ [21:22:15] ^demon: I don't see any references to labs in the config files anywhere [21:22:40] New review: Dfoy; "m.wikipedia.org should redirect to the landing page in the same manner that zero.wikipedia.org redir..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302 [21:22:40] <^demon> Yeah, I noticed. They do seem to be in the indexes/index/* directory though. [21:22:50] https://github.com/duckduckgo/duckduckgo [21:22:59] <- search that already uses wikipedia :) [21:23:22] http://duckduckhack.com/ [21:25:35] Labs before Labs. [21:26:16] <^demon> xyzram: Ah, I think I got it. It's from RMIMessengerImpl.getIndexTimestamp(). It's trying to get the status of the existing index, but fails when it tries to load the index id (since config doesn't know about it). [21:26:27] <^demon> Removing the indices should probably clear that up. [21:27:02] Ah, great! [21:27:53] <^demon> notpeter: Removing `find /a/search -type d -name '*labs*'` from searchidx1001 should stop the "index not found" errors. [21:32:54] New review: Ottomata; "Not a bad point. " [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49710 [21:33:24] notpeter: I think I have the upstart job :-]  Will test it out on beta [21:35:27] ^demon: yeah, those look pretty rm-able [21:35:35] !g I84292345f5e4135e03c524a02dd7eb34ee9119cb [21:35:35] https://gerrit.wikimedia.org/r/#q,I84292345f5e4135e03c524a02dd7eb34ee9119cb,n,z [21:36:54] RECOVERY - Puppet freshness on constable is OK: puppet ran at Fri Mar 22 21:36:44 UTC 2013 [21:37:44] ^demon: ran that on searchidx2 [21:37:49] the non-active one, and errors still be flooding [21:38:26] New patchset: Hashar; "lucene: upstart job for incremental updater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55406 [21:38:46] Maybe restart lsearchd in case it's still holding on to the file descriptors ? [21:39:08] <^demon> Probably is. [21:39:14] I did so [21:39:14] or [21:39:16] I stopped it [21:39:18] rm'd [21:39:19] and started [21:39:57] <^demon> Ah, missed a couple maybe. Shouldn't have done -type d. [21:40:01] !log killing nagios-nrpe on gallium and testing if puppet restores it properly [21:40:06] <^demon> `find . -name '*labs*'` [21:40:08] Logged the message, Master [21:40:23] <^demon> s/\./\/a\/search\// [21:40:54] and yeah [21:40:56] status [21:40:57] and links [21:41:15] New review: Hashar; "Testing it out on labs instance deployment-searchidx01" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/55406 [21:41:18] <^demon> It's prolly those status files. [21:41:23] <^demon> If I had to guess. 
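Spelling out ^demon's suggestion above as the full cleanup on an index host: the stop/remove/start sequence is the one already used on searchidx2 in this exchange, and dropping -type d also catches the status and link files he points at below.

    # remove every leftover labswikimedia index artifact, not just directories,
    # with the daemon stopped so nothing holds the files open
    /etc/init.d/lucene-search-2 stop
    find /a/search -depth -name '*labs*' -exec rm -rf {} +
    /etc/init.d/lucene-search-2 start
    sudo -u lsearch /a/search/lucene.jobs.sh inc-updater-start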
[21:41:50] hashar: ignore gallium monitoring for a minute [21:42:03] mutante: sure :] [21:42:04] PROBLEM - jenkins_service_running on gallium is CRITICAL: Connection refused by host [21:42:05] still erros :/ [21:42:17] |log jenkins died on gallium :( [21:42:32] hah [21:42:38] *grin* [21:42:57] soo, puppet does fix nagios-nrpe if it's stopped [21:43:05] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java -jar /usr/share/jenkins/jenkins.war [21:43:05] RECOVERY - zuul_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/local/bin/zuul-server [21:43:06] and, surprise, it runs as user icinga now [21:43:14] <^demon> notpeter: And of course that stacktrace is fucking useless. [21:43:14] and look at that zuul recovery [21:43:19] woo! [21:43:41] java: a dsl for turning xml files into large stacktraces ;) [21:43:48] hashar: wee, it was running as user "4294967295" again :/ [21:44:03] init script cant stop it then, just claims it does [21:44:15] ah [21:44:21] kill, let puppet fix it, runs as correct user, zuul check also fixed :p [21:44:37] you might want to dsh on all box to find out any nrpe process still running with a bad uid [21:44:52] good idea, ok [21:45:19] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=gallium [21:48:57] mutante: great! [21:49:12] New patchset: Hashar; "lucene: upstart job for incremental updater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55406 [21:49:16] mutante: thank you very much [21:50:01] notpeter: I am going to quote you on "java: a dsl for turning xml files into large stacktraces ;)" [21:50:04] New patchset: Ottomata; "Adding puppet Limn module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49710 [21:50:29] Your quip ' java: a dsl for turning xml files into large stacktraces ;)' has been added. [21:50:36] you are going to be famous in bugzilla ! [21:50:45] <^demon> Trying to chase a real stacktrace when all you were given was 2 useful lines + a load of crap -> not a friday funday. [21:51:13] hmm, Java is actually a dgl, there is nothing specific about it [21:52:15] New patchset: Hashar; "lucene: upstart job for incremental updater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55406 [21:52:15] RoanKattouw: why does git push origin wmf/1.21wmf12 give me a password prompt? [21:52:32] git remote says it uses the https url, hrm [21:52:38] git remote -v ? [21:53:05] AaronSchulz: On which machine? [21:53:12] hashar: ok :) [21:53:21] RoanKattouw: mine [21:53:26] hashar: yw, have nice weekend [21:54:41] New patchset: Pyoungmeister; "WIP: first bit of stuff for taming the mysql module and making the SANITARIUM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53907 [21:55:24] New patchset: Hashar; "lucene: upstart job for incremental updater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55406 [21:55:26] mutante: I wish i had actual weekends [21:56:16] hashar: :/ [21:56:53] hashar: ouch, nrpe, i used salt to check..and it's like that .. on a whole bunch [21:57:28] you probably want to fill a RT and take care of it on monday :-] [21:57:41] +1:) [21:57:44] notice: Finished catalog run in 35.42 seconds  \O/ [21:57:47] my patch works [21:58:56] New review: Hashar; "I have deployed this patch on deployment-searchidx01.pmtpa.wmflabs and it is now taking care of the ..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/55406 [21:59:22] New patchset: Pyoungmeister; "WIP: first bit of stuff for taming the mysql module and making the SANITARIUM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53907 [21:59:44] notpeter: I have sprinted the upstart job for lucene job inc updater : https://gerrit.wikimedia.org/r/#/c/55406/ added you as a reviewer for it and I got the change deployed on the labsinstance. [22:00:14] notpeter: feel free to play with the instance if you wanna test it out :-] Would let us do something like: # start wmf-lucene-incupdate [22:01:00] I am off now, bed time 11pm [22:01:05] salt '*' cmd.run 'ps aux | grep 4294967295' :p [22:01:14] easy :-] [22:01:23] good night [22:01:31] hashar: cool! thank you! good night [22:02:09] mutante: if you know about salt, wikitech is waiting for you [ https://wikitech.wikimedia.org/wiki/Salt ] [22:02:21] now I am gone *wave* enjoy your weekend [22:02:43] heh, copy/paste from some etherpad that includes bitches in the title :) ok, waves [22:02:54] ;D [22:05:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53907 [22:07:11] !log installing package upgrades on oxygen [22:07:17] Logged the message, Master [22:09:00] !log kill and restart nrpe on oxygen [22:09:07] James_F: stop dragging Roan [22:09:08] Logged the message, Master [22:09:14] PROBLEM - Puppet freshness on mw1085 is CRITICAL: Puppet has not run in the last 10 hours [22:10:13] notpeter: it looks like a lot of puppet agents are getting stuck with the new crons [22:10:17] like mw1085 for example [22:10:19] preciuse hosts [22:10:56] New review: Milimetric; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55390 [22:11:32] LeslieCarr: just looked at it [22:11:39] it was an old puppet agent that wasn't killed [22:12:28] !log killall nrpe via salt, then restart nagios-nrpe-server , runs as wrong user [22:12:34] Logged the message, Master [22:12:44] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: Connection refused by host [22:13:10] LeslieCarr: ^ it was running as 4294967295 pretty much everywehre :p [22:13:28] RECOVERY - MySQL Slave Running on db67 is OK: OK replication [22:13:35] but a simple kill and restart fixes it, no matter if doing manually or letting puppet start it [22:13:44] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay seconds [22:13:44] RECOVERY - MySQL Replication Heartbeat on db67 is OK: OK replication delay seconds [22:14:14] PROBLEM - Puppet freshness on mw1102 is CRITICAL: Puppet has not run in the last 10 hours [22:14:15] PROBLEM - Puppet freshness on mw35 is CRITICAL: Puppet has not run in the last 10 hours [22:14:15] PROBLEM - Puppet freshness on mw1158 is CRITICAL: Puppet has not run in the last 10 hours [22:14:33] fixed [22:14:40] mutante: thanks :) [22:14:45] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:14:46] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:14:47] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:14:47] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:14:49] maybe ishould have killed all the puppets a little more aggresively [22:15:04] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: 
PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:04] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:13] hrmm, let's fix those as well now [22:15:14] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:14] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:14] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:15] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:23] what is the correct number? [22:15:25] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:25] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:37] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:37] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:37] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:37] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:37] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:37] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:38] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:39] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 3 processes with command name varnishncsa [22:15:46] well [22:15:47] 3 [22:15:48] now [22:15:52] yesterday it was 4 [22:15:55] ottomata: you there? [22:16:15] PROBLEM - Puppet freshness on mw1118 is CRITICAL: Puppet has not run in the last 10 hours [22:18:24] RECOVERY - Puppet freshness on mw1118 is OK: puppet ran at Fri Mar 22 22:18:15 UTC 2013 [22:19:53] ja hiii [22:19:58] ah [22:19:59] right [22:20:01] on it [22:20:03] check_procs -w 4:4 -c 4:8 -C varnishncsa [22:20:06] yeah [22:20:07] fixing [22:20:08] cool, thanks:) [22:20:26] NRPE wasnt running right, that's why they all popped up at once now [22:20:32] after i restarted it with correct user [22:20:53] mutante: what did you run to kill puppet agents? [22:21:23] i did not kill puppet agents, i killed "nrpe" processes [22:21:45] kill: salt '*' cmd.run 'killall nrpe' [22:21:46] start: salt '*' cmd.run '/etc/init.d/nagios-nrpe-server start' [22:21:50] oh, gotcha [22:21:56] we were talking about different things [22:21:57] sorry [22:22:32] np:) [22:22:55] New patchset: Ottomata; "Only 3 varnishncsa instances running, fixing logging_monitor." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55419 [22:23:40] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55419 [22:23:56] edit war:) [22:24:34] did you merge that on sockpuppet? [22:24:35] mutante? 
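For reference, the old NRPE check quoted above expected four varnishncsa processes (-w 4:4 -c 4:8); with only three instances per cache host now, the fixed check from the merged change should read along these lines (the exact warn/crit ranges in the actual patch may differ):

    # varnishncsa process-count check scaled down from 4 expected instances to 3
    /usr/lib/nagios/plugins/check_procs -w 3:3 -c 3:6 -C varnishncsa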
[22:24:39] ja [22:24:41] oh ok [22:24:42] ha [22:24:42] cool [22:24:43] thank you [22:25:04] RECOVERY - Puppet freshness on mw1085 is OK: puppet ran at Fri Mar 22 22:24:55 UTC 2013 [22:26:50] last puppet run on random varnish cache box.. 250 minutes ago, running on cp1034 [22:27:38] ok, puppet changed the check, restarted nrpe, and still running as icinga as it should, looks good [22:27:57] icinga-wm: gimme a recovery [22:28:29] AaronSchulz: I should just let him starve? :-) [22:29:02] so it's dragging vs starving now? [22:29:03] !log aaron synchronized php-1.21wmf12/extensions/Translate 'deployed 0aab5984415a94bd214088767ff1de33ab57c4e0' [22:29:09] Logged the message, Master [22:37:47] New patchset: Pyoungmeister; "correcting inline_template for sanitarium db def" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55425 [22:38:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55425 [22:40:55] New patchset: Pyoungmeister; "also need correct typing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55426 [22:41:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55426 [22:41:45] RECOVERY - Puppet freshness on mw35 is OK: puppet ran at Fri Mar 22 22:41:40 UTC 2013 [22:48:35] New patchset: Pyoungmeister; "invoking mysql::config correctly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55428 [22:49:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55428 [22:51:03] notpeter: /usr/local/apache/common/wikiversions.dat has references to labs; do you know who uses that file ? [22:51:26] xyzram: I do not [22:51:29] sorry :( [22:51:45] It has a current timestamp on both index boxes. [22:51:49] xyzram: Every request does [22:51:53] Well, not directly [22:52:11] fenari:/h/w/common/wikiversions.dat [22:52:13] wikiversions.cdb is generated from wikiversions.dat , and on each request we check the .cdb to determine which version of MW should be run for the wiki that's being requested [22:52:45] xyzram: That's because the entire wmf-config dir is synced over to the index boxes, right? Or all of MediaWiki? [22:53:02] (The person who wrote the multiversion subsystem is AaronSchulz BTW) [22:53:25] RoanKattouw: But the dat file is handmade ? [22:53:33] -ish [22:53:58] It's human-editable but I think most of the time people use a script that you can tell "change all wikis matching this pattern to this version" [22:55:10] We're seeing a ton of errors in the lucene logs all related to the labs wikis; but none of the config files has any mention of labs at all; so I'm trying to track down how lucene even knows about labs. [22:55:35] <^demon> We need to remove any last vestigates of labswikimedia. [22:55:45] <^demon> If they're still in wikiversions.dat, we should prolly remove them. [22:56:34] de_labswikimedia php-1.21wmf11 * [22:56:41] readerfeedback_labswikimedia php-1.21wmf11 * [22:56:44] can do? want me to? [22:58:01] New patchset: Dzahn; "remove labs wikis from wikiversions.dat" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55430 [22:58:04] there [22:58:15] are they gone from all.dblist too? 
[22:58:30] otherwise sync-wikiversions will error out for sanity [22:58:39] pattern "labs" not found in all.dblist [22:59:11] there is also wikiversions-labs.dat [23:00:15] Looks like all.dblist at one time had these: de_labswikimedia [23:00:15] en_labswikimedia [23:00:15] flaggedrevs_labswikimedia [23:00:15] liquidthreads_labswikimedia [23:00:15] readerfeedback_labswikimedia [23:00:15] deleted.dblist:de_labswikimedia [23:00:15] deleted.dblist:en_labswikimedia [23:00:15] deleted.dblist:flaggedrevs_labswikimedia [23:00:23] they are in deleted.dblist [23:00:33] those are removed from DNS [23:00:42] shall i merge the above to remove from all.dblist? [23:01:10] I don't see them in all.dblist now. [23:01:28] sorry, i meand wikiversions.dat [23:01:30] meant [23:01:30] New patchset: Pyoungmeister; "fixing up a couple of incorrect variable references" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55432 [23:01:36] xyzram: https://gerrit.wikimedia.org/r/#/c/55430/1/wikiversions.dat [23:02:00] I don't see why not [23:02:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55432 [23:02:14] What's the intent of deleted.dblist ? Just a historical record ? [23:02:21] i just don't remember how to create the .cdb from the .dat [23:02:48] <^demon> `sync-wikiversions` [23:03:31] New review: Dzahn; "these have recently been killed and removed from DNS and apparently it caused problems in labs still..." [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/55430 [23:03:32] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55430 [23:04:13] !log sync-wikiversions [23:04:17] wikiversions.cdb successfully built. [23:04:18] Copying wikiversions dat and cdb files to apaches... [23:04:19] Logged the message, Master [23:04:31] oh wait, forgot a merge on fenari :p [23:04:37] !log dzahn rebuilt wikiversions.cdb and synchronized wikiversions files: [23:04:43] Logged the message, Master [23:05:46] untracked files in /h/w/common/ ..meh [23:05:56] !log dzahn rebuilt wikiversions.cdb and synchronized wikiversions files: [23:06:02] Logged the message, Master [23:06:04] ok, done [23:06:13] just rsync errors on a single box: mw1209 [23:06:24] mw1209: rsync: mkstemp "/usr/local/apache/common-local/.wikiversions.cdb.AKTF2s" failed: Permission denied (13) [23:06:47] xyzram: did that change things? [23:07:43] 2013-03-22 23:07:24,684 [RMI TCP Connection(7709)-10.0.3.14] WARN org.wikimedia.lsearch.interoperability.RMIMessengerImpl - Error getting snapshot for index readerfeedback_labswikimedia [23:07:58] No change. [23:09:20] :/ i don't know where it would be looking for those snapshots [23:10:01] I know where it's looking, they are not there, the mystery is _why_ it is looking for those. [23:11:42] New patchset: Pyoungmeister; "have to call templates correctly..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55433 [23:12:58] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55433 [23:13:32] searching through all in /h/w/common in fenari for "labs", i just get "labswiki" in all-labs.dblist, then ones we just talked about in deleted.dblist, wikiversions.dat~ [23:13:45] readerfeedback is only in deleted.dblist .hrm [23:14:37] Yes, that's what I see -- pretty psychic Java code. [23:22:05] New patchset: Pyoungmeister; "oh yeah, need the datadir..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/55434 [23:22:37] xyzram: Where are you running that code? Where is it getting wikiversions.cdb from? [23:23:37] searchidx2 and searchidx1001; looks like /usr/local/apache/common has those files [23:24:11] xyzram: Did they actually get updated there? [23:24:20] I think those hosts are in the dsh list [23:24:34] -rw-r--r-- 1 mwdeploy mwdeploy 23418 Mar 22 23:05 /usr/local/apache/common/wikiversions.dat [23:24:34] -rw-r--r-- 1 mwdeploy mwdeploy 79229 Mar 22 23:05 /usr/local/apache/common/wikiversions.cd [23:24:38] New patchset: Pyoungmeister; "oh yeah, need the datadir..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55434 [23:24:42] <^demon> They get updated via cron, not dsh. [23:24:48] Ooooh [23:24:49] <^demon> (Which I think is kinda silly, tbh) [23:26:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55434 [23:31:32] New patchset: Pyoungmeister; "testing multiple mysql instances per host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55439 [23:32:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55439 [23:32:59] !log restarting hadoop setting yarn.nodemanager.resource.memory-mb to 16G [23:33:04] oops, wrong logbot [23:33:05] Logged the message, Master [23:36:06] New patchset: Danny B.; "cswikinews: Set autopatrolled group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55441 [23:47:46] New patchset: Pyoungmeister; "let's see just how abusable puppet is...." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55442 [23:48:35] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55442 [23:50:31] New patchset: Dzahn; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [23:50:42] New review: Dzahn; "fix 59 x "tab character found"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [23:51:32] New patchset: Pyoungmeister; "insufficiently abusable, it would seem..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55443 [23:52:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55443 [23:53:00] New patchset: Dzahn; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [23:56:31] New patchset: Dzahn; "Rework the RT manifests so it can be installed in Labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026
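The sweep described above, making sure nothing deployed still references the deleted labswikimedia wikis and then pushing the rebuilt wikiversions files, boils down to roughly this on the deploy host. The /home/wikipedia/common path is the expanded form of the fenari /h/w/common path mentioned earlier, and the sync message is only an example:

    # look for stragglers in the deployed config, then rebuild and push
    # wikiversions (the search index hosts pick the files up via cron)
    grep -rn 'labswikimedia' /home/wikipedia/common/*.dblist \
        /home/wikipedia/common/wikiversions.dat
    sync-wikiversions 'remove deleted labswikimedia wikis'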