[00:02:00] (PS1) Dzahn: ytterbium, first include accounts then do all the rest (sudo user), attempt to avoid dependency problem for Group[500] [operations/puppet] - https://gerrit.wikimedia.org/r/81159
[00:08:28] (PS2) Dzahn: ytterbium, have group wikidev, include accounts then do all the rest (sudo user), attempt to avoid dependency problem for Group[500] [operations/puppet] - https://gerrit.wikimedia.org/r/81159
[00:09:40] (PS3) Dzahn: ytterbium, have group wikidev, include accounts then do all the rest (sudo user), attempt to avoid dependency problem for Group[500] [operations/puppet] - https://gerrit.wikimedia.org/r/81159
[00:10:40] (CR) Dzahn: [C: 2] "user demon requires group wikidev wherever he goes" [operations/puppet] - https://gerrit.wikimedia.org/r/81159 (owner: Dzahn)
[00:15:03] <^d> Hehe :)
[00:16:07] ^d: crontab: user `gerrit2' unknown
[00:16:19] trying to setup some crontabs now
[00:16:26] but it will work on second run
[00:16:30] <^d> Dammit. Ok, so let's abort for today.
[00:17:10] well, just fixing the puppet run so you have your user, right:)
[00:17:21] <^d> Well gerrit2 is an ldap user.
[00:17:29] ah
[00:17:53] a) i meant your own user, it couldn't create demon
[00:18:03] <^d> Ah :)
[00:18:05] then when i fixed that puppet continues so we see b)
[00:18:09] about gerrit2 user
[00:18:18] and i thought it's just the order in puppet again
[00:18:34] well now it finishes puppet again
[00:18:46] and you have /home/demon/
[00:19:30] <^d> I can login just fine and sudo now :)
[00:19:33] <^d> So thanks!
[00:19:35] cool
[00:19:52] and yea, you got the remaining LDAP hookup issue then
[00:20:16] <^d> I'll sort that tomorrow, no big deal.
[00:20:21] kk,cool
[00:23:22] PROBLEM - RAID on db35 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:12] RECOVERY - RAID on db35 is OK: OK: 1 logical device(s) checked
[00:35:22] PROBLEM - RAID on db35 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:36:12] RECOVERY - RAID on db35 is OK: OK: 1 logical device(s) checked
[01:01:06] <^d> mutante: Could you do one quick thing?
[01:02:55] <^d> Or another root who has a second to do something quick :)
[01:05:10] ^d: what is it, a trap?:)
[01:05:32] <^d> `chown -R gerritslave:gerritslave /srv/ssd/gerrit` on lanthanum?
[01:06:16] ah, i see lots of stuff there now, yea
[01:06:34] replication needs that user, nod
[01:06:58] <^d> Yep
[01:07:14] !log chown'ing /srv/ssd/gerrit to gerritslave for replication on lanthanum
[01:07:18] done
[01:07:21] Logged the message, Master
[01:08:45] <^d> Thanks.
[01:12:47] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:13:37] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[01:15:17] yw, cya
[01:20:47] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours
[01:29:33] ^d: Uhh
[01:29:46] ssh: connect to host gerrit.wikimedia.org port 29418: Connection refused
[01:29:46] fatal: Could not read from remote repository.
[01:29:46] Please make sure you have the correct access rights
[01:29:46] and the repository exists.
[01:30:09] <^d> Sorry. Queue was empty so I restarted to pick up some changes.
[01:30:22] Ahh
[01:30:28] Just badly timed on my part ;)
[01:38:23] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[01:41:33] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
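The Group[500] workaround at the top of this hour is a classic Puppet ordering problem: the user resource was applied before the group it needs existed, so the first run failed and only the second succeeded. A minimal sketch of making that ordering explicit; the resources here are hypothetical illustrations, not the actual r81159 change:

    # Hypothetical sketch: force the group to exist before the user needing it.
    puppet apply --noop -e '
      group { "wikidev":
        ensure => present,
      }
      user { "demon":
        ensure  => present,
        gid     => "wikidev",
        require => Group["wikidev"],  # without this edge, ordering is luck
      }
    '

With the require edge (or an equivalent chaining arrow), the first Puppet run no longer trips over the missing group.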
[01:53:59] (PS1) Spage: Remove unused UseVForm settings variables. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/81174
[02:07:21] !log LocalisationUpdate completed (1.22wmf14) at Tue Aug 27 02:07:20 UTC 2013
[02:07:30] Logged the message, Master
[02:09:01] (CR) Mattflaschen: [C: 1] "Looks good to me. Let's deploy this Thursday." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/81174 (owner: Spage)
[02:12:42] !log LocalisationUpdate completed (1.22wmf13) at Tue Aug 27 02:12:42 UTC 2013
[02:12:48] Logged the message, Master
[02:13:28] (PS1) Demon: Until we switch over, still replicate all from manganese [operations/puppet] - https://gerrit.wikimedia.org/r/81175
[02:21:16] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Aug 27 02:21:16 UTC 2013
[02:21:22] Logged the message, Master
[04:15:58] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours
[04:21:58] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours
[04:28:58] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours
[04:32:58] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[04:43:13] (PS2) Mattflaschen: Set group for /srv/mediawiki on singlenode mediawiki [operations/puppet] - https://gerrit.wikimedia.org/r/79955
[04:43:13] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours
[04:43:22] (CR) Mattflaschen: "(2 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/79955 (owner: Mattflaschen)
[04:44:13] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours
[04:47:13] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours
[04:47:13] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours
[04:47:13] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:50:13] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[04:50:13] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours
[04:53:13] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours
[04:53:13] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours
[04:56:13] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
[04:58:13] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours
[04:59:13] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours
[05:00:13] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours
[05:00:13] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours
[05:02:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:03:13] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours
[05:03:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[05:53:27] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:20:28] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours
[07:00:34] (PS1) Ori.livneh: Track event counts in Graphite [operations/puppet] - https://gerrit.wikimedia.org/r/81182
[07:01:32] (CR) Ori.livneh: "Ryan, this patch would add hafnium to the list of EventLogging deployment targets. Is there anything I need to do on the host to prepare i" [operations/puppet] - https://gerrit.wikimedia.org/r/81182 (owner: Ori.livneh)
[07:19:07] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:19:27] (CR) Hashar: [C: 1] "Ends up being very easy to enable :-] Whenever you +2 this, could you check the authentication is working properly on beta?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/81148 (owner: CSteipp)
[07:19:57] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[07:33:22] hello
[07:34:10] !log jenkins: upgrading pep8 on gallium {{bug|53352}}
[07:34:15] Logged the message, Master
[07:34:57] (Restored) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/60866 (owner: Hashar)
[07:35:02] (PS6) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/60866
[07:35:43] (CR) jenkins-bot: [V: -1] Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/60866 (owner: Hashar)
[07:36:07] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/60866 (owner: Hashar)
[07:38:59] (CR) Hashar: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/71968 (owner: Hashar)
[07:39:02] (PS14) Hashar: contint: publish Zuul git repositories [operations/puppet] - https://gerrit.wikimedia.org/r/71968
[07:40:04] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours
[08:49:29] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 185 seconds
[08:49:29] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 185 seconds
[08:52:29] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds
[08:52:29] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds
[09:24:21] * hashar I am going to restart Jenkins in a few minutes for plugins upgrades
[09:27:19] !log restarting Jenkins for plugins upgrade
[09:27:25] Logged the message, Master
[09:28:43] !log jenkins: Failed to exec '/usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java (deleted)' No such file or directory holy crap
[09:28:49] Logged the message, Master
[09:49:19] jenkins not back yet?
[09:49:22] hashar: ?
[09:49:26] * aude can wait
[09:49:39] aude: nop :(
[09:49:44] k
[09:49:59] once in a while it refuses to start quickly
[09:49:59] where did java go?
[09:50:07] probably got upgraded
[09:50:11] it is still there :)
[09:50:13] hmmm
[09:50:20] jenkins is still parsing the conf
[09:50:26] its disk seems to be veryyyyy slow
[09:50:37] varnishkafka seems to work fine now
[09:50:58] and I have no clue how to find out which part is slowing the box I/O :(
[09:51:26] iotop?
[09:51:39] yup
[09:51:49] gives me a roughly 461.32 K/s
[09:52:00] though I have no clue from where it is reading hehe
[09:52:32] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:52:38] how many i/o requests a second?
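An aside on the "java ... (deleted)" failure above: the plugin upgrade window coincided with an openjdk package upgrade, so the running process was left holding an unlinked binary. A sketch of how to spot that state on a box (the jenkins pattern is illustrative):

    # /proc shows " (deleted)" when a running binary was replaced on disk.
    ls -l /proc/"$(pgrep -of jenkins)"/exe

    # Or list files that are open but already deleted (link count 0):
    lsof +L1 | grep -i jenkins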
[09:54:34] mark: doesn't show up in iotop, couldn't find that in vmstat/top :(
[09:54:45] anyway, completed!
[09:54:49] !log jenkins restarted :-]
[09:54:54] Logged the message, Master
[09:55:13] jenkins has to read a myriad of tiny files
[09:55:56] hashar: iostat -kx 3
[09:56:06] wasn't that the box with SSD(s)?
[09:56:20] ah will have to remember about iostat :-]
[09:56:25] yeah we have SSD for the jenkins workspace
[09:56:32] but the configuration files are still on the hdd
[09:56:45] (as well as all the history I will have to wipe one day)
[09:56:53] configuration files surely can't be that much to read?
[09:56:57] is there like 100 gigs of configuration?:P
[09:57:24] it's a java app, could be
[09:57:32] roughly 1400 config files
[09:57:37] + 1400 state files
[09:57:59] maybe you can have mapreduce job to parse the configs in hadoop
[09:58:08] that would be the java way
[09:58:11] or I could drop jenkins :-]
[09:58:26] still, 15ms/seek worst case * 1400 files = 21 second:)
[09:58:34] will probably start working on replacing jenkins next year
[10:00:44] with what?
[10:00:52] with a shell script
[10:01:12] bah, can it be ruby at least?
[10:02:45] well Zuul is now using gearman to distribute jobs to some workers
[10:02:55] which could be jenkins or whatever gearman client :-]
[10:15:01] PROBLEM - Disk space on cp1059 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 12437 MB (3% inode=99%): /srv/sdb3 13577 MB (4% inode=99%):
[10:22:36] hrm
[10:22:40] that's gonna be annoying
[10:46:07] (PS1) Ori.livneh: Add solr::decommission class & apply it to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/81202
[10:48:24] Nikerabbit / MaxSem: ^^ does that look all right?
[10:49:51] ori-l, shouldn't you stop the service before killing the files?
[10:50:16] that's boring and predictable
[10:50:36] (CR) Faidon: [C: 1] "This will do for now" [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/79927 (owner: Ottomata)
[10:51:30] (PS2) Ori.livneh: Add solr::decommission class & apply it to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/81202
[10:51:49] ori-l, and you're deleting the files _before_ that
[10:52:17] hm?
[10:52:37] i changed the 'before' to a 'require'
[10:52:37] (CR) MaxSem: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/81202 (owner: Ori.livneh)
[10:53:20] puppet antonyms
[10:56:50] (CR) MaxSem: [C: 1] Add solr::decommission class & apply it to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/81202 (owner: Ori.livneh)
[10:57:19] thanks
[10:58:03] (CR) Edenhill: [C: 1] "Looks good, implementation is exactly what I would've done!" [operations/software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/78782 (owner: Faidon)
[11:02:18] (PS1) Hashar: sudo right for hashar on lanthanum (Jenkins slave) [operations/puppet] - https://gerrit.wikimedia.org/r/81203
[11:03:14] (CR) Hashar: "Pending RT https://rt.wikimedia.org/Ticket/Display.html?id=5677" [operations/puppet] - https://gerrit.wikimedia.org/r/81203 (owner: Hashar)
[11:13:54] don't give hashar root, he'd just break stuff, he always says it himself
[11:14:37] Coren: there's a bunch of new unowned tickets :)
[11:18:32] is someone around that can deploy update for wikidata ?
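The "puppet antonyms" quip above refers to the before/require metaparameters, which express the same dependency edge from opposite ends. A hedged sketch with hypothetical resource names, not the actual solr::decommission code, of the stop-the-service-then-remove-its-files pattern MaxSem was asking about:

    # Stop the service first, then remove its data; the same edge can be
    # written as 'require' on the file or 'before' on the service.
    puppet apply --noop -e '
      service { "solr":
        ensure => stopped,
        enable => false,
      }
      file { "/var/lib/solr":
        ensure  => absent,
        recurse => true,
        force   => true,
        require => Service["solr"],  # equivalently: before => File[...] on the service
      }
    '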
[11:18:33] https://gerrit.wikimedia.org/r/#/c/81198/
[11:18:33] (CR) Nikerabbit: "(3 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/81202 (owner: Ori.livneh)
[11:18:50] it's a bit time sensitive, sooner better
[11:20:26] ori-l: added comments
[11:20:50] aude: I could potentially handle it
[11:20:59] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours
[11:21:04] hashar: would be great
[11:21:08] it's simple but important
[11:22:24] (PS1) Faidon: wikipedia.com.il -> wikipedia.co.il [operations/dns] - https://gerrit.wikimedia.org/r/81211
[11:22:36] aude: we still have some wiki on wmf/1.22wmf13
[11:23:04] (CR) Faidon: [C: 2] wikipedia.com.il -> wikipedia.co.il [operations/dns] - https://gerrit.wikimedia.org/r/81211 (owner: Faidon)
[11:23:36] (PS2) Faidon: Åland, Guernsey, Isle of Man and Jersey to esams [operations/dns] - https://gerrit.wikimedia.org/r/80970
[11:23:55] (CR) Faidon: [C: 2] Åland, Guernsey, Isle of Man and Jersey to esams [operations/dns] - https://gerrit.wikimedia.org/r/80970 (owner: Faidon)
[11:26:23] hashar: that's fine
[11:26:28] the old code is fine
[11:26:39] new code is on wikidata and wikivoyage
[11:26:39] (PS2) Faidon: Add all Asian countries in the list [operations/dns] - https://gerrit.wikimedia.org/r/80974
[11:26:40] (PS2) Faidon: Middle-East to esams [operations/dns] - https://gerrit.wikimedia.org/r/80972
[11:26:41] (PS2) Faidon: Switch Central/South Asia to esams [operations/dns] - https://gerrit.wikimedia.org/r/80973
[11:26:42] patch got merged
[11:26:42] (PS2) Faidon: Africa to esams [operations/dns] - https://gerrit.wikimedia.org/r/80971
[11:26:48] thanks
[11:27:05] got to deploy it now
[11:27:08] great
[11:29:57] !log hashar synchronized php-1.22wmf14/extensions/DataValues 'Update DataValues {{gerrit|81198}}, requested by aude'
[11:30:03] Logged the message, Master
[11:30:13] aude: should be deployed now
[11:30:24] thank you!!!!
[11:33:09] aude: if that is a work for you, I will head out to grab a snack
[11:33:35] that's fine
[11:33:42] we will have some more patches in a bit
[11:34:56] * aude also getting lunch
[11:35:51] going to get a snack then commute to my coworking place
[11:35:55] should be back in roughly an hour
[12:17:31] re
[12:28:10] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100%
[12:29:10] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[12:43:28] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown
[12:46:48] PROBLEM - DPKG on search28 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:48:28] RECOVERY - NTP on mw31 is OK: NTP OK: Offset -0.0006822347641 secs
[14:01:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[14:16:52] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours
[14:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:22:52] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours
[14:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[14:29:52] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours
[14:31:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[14:33:52] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[14:39:20] hashar: Reedy if around we need one more update (should be the last!)
[14:39:25] https://gerrit.wikimedia.org/r/#/c/81233/ and https://gerrit.wikimedia.org/r/#/c/81237/ are needed
[14:40:54] aude: ok :/
[14:41:01] ok, thanks
[14:41:51] * aude wants to get back to coding new stuff and not fixing these bugs
[14:42:21] aude: will do the wmf13 one first
[14:42:27] ok, that's more important
[14:44:00] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours
[14:45:00] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours
[14:48:00] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours
[14:48:00] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours
[14:48:00] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours
[14:48:44] aude: the wmf13 one got merged https://gerrit.wikimedia.org/r/#/c/81237/
[14:48:49] updating and syncing
[14:48:53] thanks! :)
[14:49:02] * hashar cross fingers
[14:50:10] yeahhh live hacks
[14:51:00] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[14:51:00] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours
[14:51:20] aude: syncing
[14:51:29] :)
[14:51:30] !log hashar synchronized php-1.22wmf13/extensions/Wikibase 'https://gerrit.wikimedia.org/r/#/c/81237/'
[14:51:36] Logged the message, Master
[14:51:51] (PS3) BBlack: Add *_delta stats for vhtcpd ganglia. [operations/puppet] - https://gerrit.wikimedia.org/r/80151 (owner: BryanDavis)
[14:52:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:52:44] now the wmf14 one
[14:52:53] (CR) BBlack: [C: 2] Add *_delta stats for vhtcpd ganglia.
[operations/puppet] - https://gerrit.wikimedia.org/r/80151 (owner: BryanDavis)
[14:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[14:54:00] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours
[14:54:00] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours
[14:54:14] !log hashar synchronized php-1.22wmf14/extensions/WikibaseDataModel 'https://gerrit.wikimedia.org/r/#/c/81233/'
[14:54:20] Logged the message, Master
[14:54:36] !log hashar synchronized php-1.22wmf14/extensions/Wikibase 'https://gerrit.wikimedia.org/r/#/c/81233/'
[14:54:41] Logged the message, Master
[14:54:51] aude: I have deployed both
[14:55:11] aude: no fatal /exception (yet) :-D
[14:55:13] * aude owes you a beer
[14:55:43] yeah will have to book a flight to Berlin to reclaim it
[14:57:00] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
[14:59:00] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours
[14:59:52] :)
[15:00:00] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours
[15:01:00] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours
[15:01:00] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours
[15:01:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:02:39] (PS1) Mark Bergsma: Initial version of PROXY support for Varnish [operations/debs/varnish] (patches/proxy-support) - https://gerrit.wikimedia.org/r/81244
[15:03:03] (Abandoned) Mark Bergsma: Initial version of PROXY support for Varnish [operations/debs/varnish] (patches/proxy-support) - https://gerrit.wikimedia.org/r/80982 (owner: Mark Bergsma)
[15:03:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[15:04:00] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours
[15:12:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:13:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time
[15:19:43] (CR) Faidon: [C: 1] "You're awesome :)" [operations/debs/varnish] (patches/proxy-support) - https://gerrit.wikimedia.org/r/81244 (owner: Mark Bergsma)
[15:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[15:26:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[15:35:01] (PS1) Ottomata: Updating tracked CDH4 version to 4.3.1 [operations/puppet] - https://gerrit.wikimedia.org/r/81245
[15:36:11] (CR) Ottomata: [C: 2 V: 2] Updating tracked CDH4 version to 4.3.1 [operations/puppet] - https://gerrit.wikimedia.org/r/81245 (owner: Ottomata)
[15:50:14] (CR) Andrew Bogott: [C: 1] "Thanks for writing this! What will happen to..."
[operations/puppet] - https://gerrit.wikimedia.org/r/80577 (owner: Faidon)
[15:52:36] ottomata: hey, i'm here, and in the same city all day
[15:53:02] awesooomme!
[15:53:11] let's do it! I am all prepped!
[15:53:17] here's a quick q for you
[15:53:40] I have some bash scripts I've used to make partitioning all of these nodes easier
[15:53:56] hehehe
[15:53:57] ok
[15:54:08] they aren't partman because it was complicated and we weren't sure of how the partitions would end up looking when we did this the first time
[15:54:08] (CR) Andrew Bogott: [C: 1] "Have you run a test to make sure that these different permissions don't mess with web access to mediawiki? If so then I'm happy to merge " [operations/puppet] - https://gerrit.wikimedia.org/r/79955 (owner: Mattflaschen)
[15:54:27] (PS4) Andrew Bogott: rake task to generate puppet documentation [operations/puppet] - https://gerrit.wikimedia.org/r/77090 (owner: Hashar)
[15:54:37] i'm thinking of just putting them on wikitech?
[15:54:48] also partman is complicated … does anyone know how it works? ;)
[15:54:51] https://github.com/wikimedia/analytics-kraken/blob/master/bin/setup-scripts/disks/datanodes_dell_r720.sh
[15:54:51] https://github.com/wikimedia/analytics-kraken/blob/master/bin/setup-scripts/disks/namenode.sh
[15:54:54] haha, hardly
[15:55:11] well we had made the software git repo for things like scripts
[15:55:14] should I put the scripts on wikitech, or just link to them?
[15:55:16] hmmmmmMMM
[15:55:20] so i'd link to them
[15:55:21] software git repo.....
[15:55:27] ssh://lcarr@gerrit.wikimedia.org:29418/operations/software.git
[15:55:34] well without the lcarr bit :)
[15:55:35] cloning
[15:55:45] (CR) Andrew Bogott: [C: 2] rake task to generate puppet documentation [operations/puppet] - https://gerrit.wikimedia.org/r/77090 (owner: Hashar)
[15:56:53] hmmm, ok, i'll make a subdir there…hadoop? analytics?
[15:58:12] (CR) Faidon: "So, if I'm user "faidon" and try to login to RT, it will first try an LDAP bind and if that fails it'll fallback to my old password. So, w" [operations/puppet] - https://gerrit.wikimedia.org/r/80577 (owner: Faidon)
[15:58:31] hadoop ?
[15:58:34] k
[15:58:41] well, there will be a similar script for kafka I think?
[15:58:48] hrm
[15:58:56] maybe "partitioning"
[15:59:05] then one labeled for hadoop and one labeled for kafka ?
[16:00:01] k
[16:02:48] cleaning these up a bit
[16:02:58] i don't have one for journalnodes yet, we'll have to talk about that in a sec
[16:04:52] ok, LeslieCarr, one thing i'm noticing
[16:04:55] i just reinstalled these nodes, right?
[16:05:01] I didn't delete any of the non root partitions yet
[16:05:08] so the old partitions are still there, going to have to delete them
[16:05:30] the scripts I had before were very particular about how things were, they should mostly work, but they rely on fdisk prompts happening in the proper order, etc.
[16:05:36] ok
[16:05:43] want to share screen for this bit and just go through it? it will probably be just a lot of back and forth
[16:05:46] sound good
[16:05:48] or do you want me to just get it to a clean state?
[16:05:48] on iron ?
[16:05:58] k, i've actually not shared screen from iron before
[16:06:03] do I just start one there?
[16:06:05] forward my ssh key?
[16:06:10] yeah
[16:06:14] ottomata: better use parted
[16:06:18] and GPT partitions
[16:06:23] it's also a bit easier to script
[16:06:28] does it have to be done on boot?
[16:06:32] sorry
[16:06:33] on install?
[16:06:39] parted?
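A sketch of what a parted-based replacement for those prompt-driven fdisk scripts could look like; parted's -s flag is non-interactive, which removes the prompt-ordering fragility mentioned above, and GPT labels avoid fdisk's 2 TB MBR limit. The device letters are only an example:

    #!/bin/bash
    # One full-size GPT partition per data disk, with no interactive prompts.
    for dev in /dev/sd{e,f,g,h,i,j}; do
        parted -s -a optimal "$dev" mklabel gpt mkpart primary ext2 0% 100%
    done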
[16:06:42] no
[16:06:43] yes ok
[16:06:51] hm
[16:06:51] ok
[16:06:57] well, hm
[16:07:03] fdisk can't do disks > 2 TB
[16:07:18] ohhh parted sorry yes yes
[16:07:25] not partman, parted ;)
[16:07:36] remember this, i think there was a reason why I didn't use parted, but can't remember right now....
[16:08:10] ok thanks
[16:08:30] LeslieCarr: hm, should I take the time now to write parted scripts? or should we just do this since we have the time together now?
[16:08:36] ceph uses sgdisk
[16:09:08] why don't we do this together
[16:09:25] though i have never written parted scripts before, so it may be mostly me looking over your virtual shoulder
[16:09:36] parted? I haven't either really
[16:09:42] i barely remember looking into using it last summer
[16:09:43] that's all
[16:09:52] it's just the commands you type in the cli
[16:09:56] like fdisk really
[16:10:09] look at sgdisk too, it might be nicer indeed
[16:10:45] mkpart part-type [fs-type] …
[16:10:48] fs-type can't be ext3?!
[16:10:49] bwa
[16:10:57] ext2
[16:11:13] eh, sgdisk not installed by default :/
[16:11:46] manifests/swift.pp: command => "parted -s -a optimal ${title} mklabel gpt mkpart swift-${dev_suffix} 0% 100% && mkfs -t xfs -i size=512 -L swift-${dev_suffix} ${dev}",
[16:11:53] OH PUPPETIZED
[16:11:57] OOOoooO
[16:11:58] hardly
[16:12:47] uh ok, that is a nice way to do it
[16:13:10] that actually is pretty awesome
[16:13:39] ok, LeslieCarr, I'm going to add something for that to the hadoop.pp role class
[16:13:48] a bit dangerous but nice indeed
[16:13:51] (if someone wants to tell me to put it somewhere other than role class, speak now)
[16:13:53] be careful so it won't wipe your data ;)
[16:13:56] haha
[16:13:57] yeah
[16:14:32] (CR) Demon: [C: 1] "I thought I could outsmart gerrit, but that wasn't the case. I broke replication so this needs merging." [operations/puppet] - https://gerrit.wikimedia.org/r/81175 (owner: Demon)
[16:14:53] ^d: hey, quick question
[16:15:01] <^d> Shoot
[16:15:19] there's both https://github.com/wikimedia/operations-software-varnish-varnishkafka & https://github.com/wikimedia/varnishkafka
[16:15:42] the latter is because of a stanza you added to gerrit.pp per my request
[16:15:59] they have both the same content and they have a few differences
[16:16:19] hmmmmmmmmMMM i might add that to the cdh4 module, if I can make it generic enough!
[16:17:09] hrm, make it a class parameter, default to false, and then make all the options more parameters ?
[16:17:13] the latter doesn't have the "our actual code" banner
[16:17:16] and has Issues enabled
[16:18:22] <^d> paravoid: Both easily fixed.
[16:19:01] <^d> Issues and wikis disabled, description updated.
[16:20:07] manually?
[16:20:19] I mean, I don't mind, I just want to know what the process is so I don't have to ping you :)
[16:20:29] and bother you
[16:20:34] <^d> Yeah. They're supposed to set automatically when the repos are created via the github plugin, but this was a manual setup anyway :)
[16:20:36] LeslieCarr: yeah, i was going to make it a define and have users use it manually
[16:20:43] don't want to make people automatically use it
[16:20:45] cool
[16:20:45] but aahhhhh
[16:20:45] so
[16:20:48] <^d> paravoid: That stuff can be edited via https://github.com/wikimedia/varnishkafka/settings
[16:20:50] the thing is though
[16:20:53] this will take a lot of testing
[16:20:58] to make sure I get it right
[16:21:00] in labs, etc.
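On the "be careful so it won't wipe your data" point: when that swift.pp-style exec is generalized into a define, the usual protection is a guard that refuses to relabel a disk which already carries a filesystem or partition-table signature. A sketch of such a guard in shell; this is a hypothetical wrapper, not the actual module code:

    #!/bin/bash
    # Format a disk only if blkid finds no existing signature on it.
    dev="$1"
    if blkid "$dev" >/dev/null 2>&1; then
        echo "refusing to touch $dev: existing signature found" >&2
        exit 1
    fi
    parted -s -a optimal "$dev" mklabel gpt mkpart primary ext2 0% 100% \
        && mkfs -t ext3 "${dev}1"

In Puppet terms this is the check an unless/onlyif condition on the exec resource would perform, which is what keeps repeated runs from reformatting a disk that already holds data.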
[16:21:06] probably more than just today
[16:21:16] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours
[16:21:26] a parted define?
[16:21:31] that shouldn't really be used much
[16:21:43] totally, but if I add it I want it to work for sure
[16:22:26] i'd want to test a few different cases, etc.
[16:22:28] so ummmmmm
[16:22:47] i'll def do this, but LeslieCarr, maybe we can just proceed with repaving today without this?
[16:23:09] ok
[16:23:23] because we finally are both here at the same time? ;)
[16:23:47] yeah
[16:23:48] exactly
[16:23:57] and we kinda scheduled for this to happen today, and here we are
[16:24:11] so, let's share a screen on iron, and start partitioning stuff manually using my scripts, ja?
[16:24:25] btw, where is iron? .eqiad? .wikimedia.org?
[16:24:26] i'm trying to log in
[16:24:30] .wikimedia
[16:25:02] thaaar we go
[16:25:03] cool
[16:25:19] do I need to do anything other than start a named screen?
[16:25:23] i just did
[16:25:25] screen -S hadoop
[16:25:27] join if you can?
[16:26:23] ok
[16:26:36] hrm, that started one
[16:26:41] are you logged in as you or root ?
[16:26:52] haha, i need coffee, i did -S
[16:26:53] hehe
[16:27:04] but -x didn't work either
[16:27:05] * ^d looks around for something to bribe opsen with
[16:27:05] root
[16:27:09] hmmm
[16:27:13] what did -x say?
[16:27:14] there we go
[16:27:16] great
[16:27:23] what's your window size ?
[16:27:45] kinda big
[16:27:47] what should I use?
[16:28:16] 160x50 ?
[16:28:22] trying to find where to set that
[16:28:42] oh, i use a mac, so i just pull the window to resize :)
[16:28:43] kinda new to this shared screen thing
[16:28:44] oh ok
[16:28:47] i just did that too
[16:28:49] did that work?
[16:28:59] well i don't see anything cutting off weirdly
[16:29:00] so yay
[16:29:04] ok cool
[16:29:07] woot
[16:29:13] ahhhhhhh
[16:29:13] ok
[16:29:16] hehe
[16:29:19] :)
[16:29:22] ^d: for?
[16:29:30] ok so, namenodes really just need a single mount
[16:29:36] PROBLEM - MySQL Replication Heartbeat on db35 is CRITICAL: CRIT replication delay 730316 seconds
[16:29:40] <^d> paravoid: https://gerrit.wikimedia.org/r/#/c/81175/
[16:29:45] i've got it mirrored raid 1
[16:29:49] just for redundancy
[16:30:11] so, just before you signed on, I did mkfs.ext3 on /dev/md2
[16:30:26] cool
[16:31:18] (CR) Faidon: [C: 2] Until we switch over, still replicate all from manganese [operations/puppet] - https://gerrit.wikimedia.org/r/81175 (owner: Demon)
[16:31:47] cool, ok so that's the primary namenode, let's go do the same thing for the secondary
[16:31:51] the secondary was never set up before
[16:31:52] cool
[16:31:57] which was 1009, right ?
[16:31:57] so we'll need to add md2 there
[16:31:58] yup
[16:32:34] ah ok, so, the ciscos were the datanodes a looong time ago
[16:32:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:32:38] before we had the dells
[16:32:41] so we need to delete a bunch of partitions
[16:32:58] awesome
[16:33:00] <^d> Heh, I decided to upload a picture. https://www.mediawiki.org/wiki/File:HowToBribeOps.jpg
[16:33:14] fdisk ?
[16:33:25] ^d: done, if you didn't see that :)
[16:33:31] <^d> I did, thanks!
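The creation of the md2 array on the secondary is mentioned just below without the command itself; a minimal mdadm sketch of the RAID1 mirror being described, where the device names come from the conversation but the exact invocation is an assumption:

    # Mirror two partitions into /dev/md2 for the namenode metadata,
    # then put ext3 on the array, as was done on the primary.
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sde1 /dev/sdf1
    mkfs.ext3 /dev/md2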
[16:33:47] yup, just pasted in a scripted thang
[16:34:02] cool, so we have /dev/sde-sdj free
[16:34:10] we're just going to use sde and sdf for the raid
[16:34:19] cool
[16:34:28] <^d> !log reloaded gerrit replication plugin to pick up changes
[16:34:33] Logged the message, Master
[16:34:55] i just made sde1 and sdf1
[16:35:08] now creating md2 with them
[16:35:17] cool
[16:35:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.139 second response time
[16:35:58] yay mkfs
[16:36:36] cool, thar we go
[16:36:46] ok, now let's go do some data nodes :)
[16:37:07] so, in this case, all of the partitions here should already be made
[16:37:10] hrm ?
[16:37:15] that's interesting
[16:37:17] well
[16:37:22] because these were datanodes before
[16:37:25] so
[16:37:29] we'll just mkfs them to wipe them?
[16:37:30] sound ok?
[16:37:38] cool
[16:37:40] oh
[16:37:41] the ssh
[16:37:46] yeah
[16:37:48] bwer
[16:38:06] oh
[16:38:07] ssh key
[16:38:07] ?
[16:38:11] noo
[16:38:15] yeah, no agent ?
[16:38:23] maybe it expired?
[16:38:43] i do have a timeout on my local agent, but i thought it remained during the session
[16:38:44] hm
[16:38:59] not sure how this works with screen
[16:39:09] if I detach, log out, log back in and reattach?
[16:39:09] hmm
[16:39:12] it shouldn't time out that quickly
[16:39:15] should work i think
[16:39:32] nope
[16:39:38] argh
[16:39:54] i have a 5 minute timeout locally, but I never have had a problem with an already logged in session before
[16:39:56] hm
[16:39:59] let's close this screen
[16:40:04] wanna just forward yours and start one?
[16:40:09] sure
[16:40:11] k
[16:40:47] screen named hadoop
[16:41:59] hmm, still no good?
[16:42:06] oh
[16:42:07] i'm already in
[16:42:15] i joined the screen
[16:42:16] with -x
[16:42:17] right?
[16:42:24] look at prompt
[16:42:30] ha
[16:42:32] doh
[16:42:33] hah
[16:42:34] k cool
[16:42:35] :)
[16:42:42] great yeah
[16:42:43] so
[16:42:53] datanodes just use jbod
[16:42:54] we're using
[16:42:57] /dev/sd{c,d,e,f,g,h,i,j,k,l}1
[16:43:03] cool
[16:43:20] i'm going to run a forloop on those
[16:43:28] and mkfs.ext3 & in the background
[16:43:32] this will take a while, since the disks are big
[16:43:43] cool
[16:43:48] backgrounding means we don't have to wait for each one in serial
[16:44:35] i'm also running tune2fs -m 0 to grab all of the blocks on the disk
[16:44:39] do we want to open a new window and do another one at the same time ?
[16:44:44] we don't need any reserved os stuff, or whatever it is ext3 does
[16:44:44] yeah
[16:45:14] so, LeslieCarr, I use iTerm 2
[16:45:31] does it do something special ?
[16:45:35] and I know dsh is good for stuff like this, but sometimes its a little tricky to script things around ssh keys etc.
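A sketch of the backgrounded format loop just described; the disk set comes from the conversation, while the wait and the placement of the tune2fs pass are assumptions about how such a script might be arranged:

    #!/bin/bash
    # Format all ten JBOD data partitions in parallel rather than serially.
    for dev in /dev/sd{c,d,e,f,g,h,i,j,k,l}1; do
        mkfs.ext3 "$dev" &
    done
    wait  # block until every background mkfs finishes

    # Drop ext3's default 5% root-reserved blocks; HDFS data partitions
    # don't need the reservation, so reclaim the space for block storage.
    for dev in /dev/sd{c,d,e,f,g,h,i,j,k,l}1; do
        tune2fs -m 0 "$dev"
    done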
[16:45:36] yeah
[16:45:38] i just use plain terminal
[16:45:46] you can send keyboard input to all tabs/screens at a time
[16:45:52] so, if I log into all the other datanodes
[16:45:52] oh cool
[16:45:53] I can do this all at once
[16:45:57] why don't you do that
[16:45:59] k
[16:46:10] it'll be just the same thing I just did
[16:47:16] journal nodes
[16:47:18] ok right so journalnodes
[16:47:19] yeah
[16:47:30] so, we need to run a quorum of journalnodes
[16:47:40] this is to support the standby namenode and HA
[16:47:53] they sync the hdfs metadata changes from the primary namenode to the standby namenode
[16:48:06] they keep a journal of the edits as well, and need their own partitions
[16:48:17] ok, so which machines are we thinking for those ?
[16:48:19] Toby has a hadoop friend, and I asked him about that
[16:48:31] i'm not sure actually, i was thinking of running them on 3 of the datanodes
[16:48:39] but, that means we might need to save a disk or two for them
[16:48:50] and who knows if that will affect performance on those nodes
[16:48:53] hrm
[16:49:04] i'd rather have them on other nodes somewhere
[16:49:05] do we have some extra ciscos ?
[16:49:11] ha, yes we do
[16:49:14] hmmmmm
[16:49:16] not a bad idea
[16:49:19] we could put them there for now
[16:49:28] but, i hate the idea of wasting the ciscos just for journalnodes
[16:49:36] we also have the 3 zookeeper nodes
[16:49:41] oh, maybe low performance misc ?
[16:49:42] oh
[16:49:43] and I would naturally just put these there
[16:49:46] yeah, they are low perf misc
[16:49:48] but
[16:49:48] here's the issue with that
[16:49:51] what's the downside of that ?
[16:49:54] we are using the debian version of zookeeper
[16:50:02] which conflicts with the cdh4 hadoop dependency
[16:50:09] oh and the hadoop version has its own stuff
[16:50:12] so we can't install hadoop deps on the same nodes as zookeeper
[16:50:13] yeah
[16:50:19] cdh4 has their own package of zookeeper
[16:50:24] effectively either package will work
[16:50:28] i had forgotten about that ...
[16:50:30] but the cdh4 packages are set up to depend on their zks
[16:50:39] so apt will get all pissy
[16:50:52] yeah pretty annoying, actually
[16:51:35] i mean, it seems stupid to get new machines just because of a package conflict, right?
[16:51:38] but i dunno what else to do
[16:51:56] i'd go ahead and run the journalnodes alongside of a few datanodes, but i just don't like how that makes things inconsistent
[16:52:12] right now it is all homogeneous and sane
[16:52:24] yeah, inconsistent in disk size and possibly performance....
[16:52:27] especially since we have to partition them differently
[16:52:30] yeah
[16:52:40] RobH: do we have 3 extra low performance boxes sitting around ?
[16:52:40] we might have to configure them specially then too
[16:52:41] dunno
[16:52:56] cmjohnson1: ^^ ?
[16:53:34] LeslieCarr: i'm going to go ahead and partition the data nodes as is
[16:53:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:43] and we can either find new servers for journalnodes, or for now just use ciscos
[16:53:51] i'm pretty sure we can move journalnodes later
[16:53:54] might be cluster downtime, but that'll be ok
[16:54:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[16:57:55] paravoid: for web auth for LDAP we generally use the CN, and not the uid
[16:58:10] can we rename users and change the RT config to use CN?
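Back on the zookeeper conflict a few lines up: when two apt origins ship the same package name, pinning is one conventional escape hatch. This is a sketch only, and the Cloudera origin label is an assumption rather than a verified value:

    # Check which origins offer zookeeper and at what priority:
    apt-cache policy zookeeper

    # Prefer the CDH build on Hadoop nodes (origin label assumed):
    cat > /etc/apt/preferences.d/cdh4-zookeeper <<'EOF'
    Package: zookeeper*
    Pin: release o=Cloudera
    Pin-Priority: 1001
    EOF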
[16:58:23] that would suck
[16:58:37] why's that?
[16:58:39] this kind of explains why my CN is "Faidon" instead of my full name
[16:58:42] I hate that
[16:58:47] every other web application uses CN
[16:59:03] why does that explain that?
[16:59:06] mine is Ryan Lane
[16:59:31] I'm not the only one that's confused (or was when I was hired), half of the users have CNs like that
[16:59:52] was that before account creation was through wikitech?
[17:00:04] akosiaris, bblack, BryanDavis etc.
[17:00:32] why is uid bad?
[17:00:51] it's not that uid is bad
[17:00:53] having to type your full name into a username prompt is confusing, is my opinion
[17:01:09] people are used to that for mediawiki
[17:01:18] yeah, i agree its kinda weird
[17:01:22] mainly because of spaces
[17:01:23] and that was the #1 most requested thing when I was setting it up
[17:01:29] i mean, i know its normal
[17:01:40] but nowhere else has spaces in log in names
[17:01:49] the field is still called "Username" in mediawiki
[17:01:51] having random web applications take different usernames is incredibly confusing
[17:02:21] I would have made my wikitech account name "bd808" if I had understood what it was for
[17:02:27] to me, cn is the full name, uid is the username
[17:02:33] if you want to put your full name into uid, that's fine
[17:02:39] no, it's not
[17:02:52] because then using ssh is basically impossible
[17:03:20] do we want UTF8 ssh usernames?
[17:04:56] ⓦⓗⓨ ⓝⓞⓣ?
[17:05:01] ori-l: :)
[17:05:22] blergh
[17:05:25] people want to use their project user names on wikitech
[17:05:48] LeslieCarr: Sorry, was on route to office and such
[17:05:52] maybe we need better help docs on the user creation page on wikitech
[17:06:00] you need 3 low perf misc boxen in eqiad?
[17:06:09] so the question then is whether you want to align web apps with wikitech or SSH
[17:06:18] wikitech
[17:06:40] which is the same as gerrit
[17:07:05] logging in with my full name is entirely weird to me tbh, but if you're asking if RT can use cn, yes it can
[17:07:29] having a consistent log-in scheme is good. web = full name, shell = shell account name
[17:07:35] maybe wikitech should use the uid for login and then display the displayName attribute throughout the site ;-)
[17:07:48] harder than you'd imagine ;)
[17:08:01] yeah, I kinda guessed that...
[17:08:02] we'd need to make gerrit and every other app do that as well
[17:08:22] (PS1) Reedy: Make puppet cronjob to run AbuseFilter/maintenance/purgeOldLogIPData.php [operations/puppet] - https://gerrit.wikimedia.org/r/81257
[17:08:26] speaking of that, can I finally have my cn changed? :)
[17:08:29] heh
[17:08:36] it's possible, but hard
[17:08:39] gerrit uses that as the git full name on merges
[17:08:43] ^d: ^^
[17:08:48] chad also wants his changed
[17:08:58] RT uses it as the email address name on outgoing emails through the web interface
[17:08:59] (PS3) Ori.livneh: Add solr::decommission class & apply it to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/81202
[17:09:00] (CR) Reedy: [C: -1] "Need to confirm run frequency" [operations/puppet] - https://gerrit.wikimedia.org/r/81257 (owner: Reedy)
[17:09:01] what do you want as your name? full name?
[17:09:07] so the field there is really the full name
[17:09:10] * Ryan_Lane nods
[17:09:31] it's not only what I want, it's that we don't want RT to send mails as "Faidon that's just unprofessional :)
[17:09:43] :)
[17:09:58] <^d> Huh?
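For readers following the cn-versus-uid argument: the two are simply different attributes on the same LDAP entry, one a human-readable common name and one a short POSIX login. A sketch of comparing them for a single account; the server and base DN here are placeholders, not the real directory:

    # Show both naming attributes for one user entry.
    ldapsearch -x -H ldap://ldap.example.org \
        -b 'ou=people,dc=example,dc=org' '(uid=demon)' cn uid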
[17:10:05] i have two small commits that need mergin': https://gerrit.wikimedia.org/r/#/c/75284/ (fixes EL's Ganglia metrics) & https://gerrit.wikimedia.org/r/81202 (rids vanadium of solr)
[17:10:06] ^d: let's try to change some usernames
[17:10:07] * ^d missed some scrollback
[17:10:09] up for it?
[17:10:10] <^d> Oh sweet.
[17:10:11] <^d> Sure.
[17:10:15] use me! use me!
[17:10:18] should be 3 steps
[17:10:26] * ^d pushes in front of paravoid
[17:10:29] paravoid: hah. you want to possibly be broken for part of the day? :)
[17:10:37] technically this is my relaxing week
[17:10:44] I was banned from the channel yesterday
[17:10:46] ^d: so, we need to rename your user in ldap, then in mediawiki and gerrit
[17:10:46] that's even better
[17:10:50] paravoid: hahaha
[17:10:54] Ryan_Lane: break his username then please
[17:10:57] he shouldnt be working!
[17:11:00] paravoid: are you supposed to be on vacation?
[17:11:04] kind of? I guess?
[17:11:09] :D
[17:11:14] you are the worst vacationer ever.
[17:11:17] ^d: ok, what do you want your cn to be?
[17:11:45] Ryan_Lane: I'd really like to have policies about that...
[17:11:52] paravoid: policies about what?
[17:11:59] if we want to integrate LDAP more with other stuff, we really should keep data properly
[17:12:03] RobH: so ottomata and i need 3 journal nodes, we can use ciscos but low perf misc should be better
[17:12:05] like, if we want to do the NDA thing we were saying
[17:12:08] we need to keep the full name somewhere
[17:12:11] Ryan_Lane: paravoid policies about taking vacation!
[17:12:18] paravoid: ah. right. indeed
[17:12:22] <^d> Ryan_Lane: "Chad"
[17:12:27] ^d: ok
[17:12:31] Ryan_Lane: depending on where you're changing names, there was also some talk of removing (WMF) from my gerrit username (same thing?)
[17:12:41] ebernhardson: yep
[17:12:53] let's make sure this won't break ^d first, though :)
[17:12:55] so, ^d's cn should really be set to his full name :)
[17:13:06] <^d> I don't want my full name :(
[17:13:09] heh
[17:13:15] I have mixed feelings on that
[17:13:26] <^d> Because my last name sucks?
[17:13:36] ^d: dude, have you seen mine? :P
[17:13:36] beats 'livneh'
[17:13:38] I'd rather have an NDA field
[17:13:49] or have it put in the user's description field
[17:14:06] why do you two mind that though?
[17:14:16] is it because it's also the username? or for other reasons?
[17:14:18] because they may not work for the foundation forever
[17:14:35] and they should be free to have their username as whatever they want
[17:14:42] this is the same reason I banned (WMF) usernames
[17:14:49] heh, and we're back in square one
[17:14:57] why?
[17:15:03] I'd like to have the full name in cn, I don't care what people put in their usernames :)
[17:15:44] some people like using pseudonyms
[17:15:50] phew, reading scroll back :)
[17:15:53] RobH: yeah
[17:16:03] <^d> Yeah, not everyone uses a real name in their CN.
[17:16:08] <^d> Or a full name.
[17:16:12] if they are avail that would be nice, i'd use 3 of the low perf misc servers we already have…but there are complicated package conflicts
[17:16:16] ok, they need eqiad i assume, hrmm
[17:16:18] ja
[17:16:23] I think everyone universally hates OITs username policy
[17:16:33] I do too
[17:16:39] ottomata: I think I have them, let me take a look
[17:16:44] k danke
[17:16:51] pseudonyms are fine, but canonical name to me is the canonical name, not the display or user name
[17:16:57] hello, rlane!
[17:17:01] YuviPanda: howdy
[17:17:02] <^d> [First initial][last name] is so....corporate.
[17:17:31] paravoid: well, we could have used displayname.
[17:17:33] nothing productive, just poking fun of your OIT approved username :) (still stuck on the nginx package though :( )
[17:17:45] <^d> We could have used displayname, but we didn't.
[17:17:53] <^d> And last time I brought that up you said it would confuse ppl.
[17:17:55] ottomata: So my low performance are single cpu 8gb hosts usually, thats ok?
[17:18:10] i think so, how many disks?
[17:18:27] usually dual, smaller disks
[17:18:29] 250gb
[17:18:31] paravoid: heh. you've hit the hardest problem in auth you know, right? :)
[17:18:43] I know :)
[17:18:46] and one of the three in computer science
[17:18:54] along with conference wifi?
[17:18:57] :D
[17:19:08] cmjohnson1: You about? I'm spare server hunting, but with the influx of relocated servers to eqiad I rather just chat with you about it.
[17:19:40] ok cool, thanks RobH
[17:19:44] those should be fine...but
[17:19:54] before you go and allocate them, lemme check a couple of things and confirm
[17:19:58] my point is, I'd like us to encode the information people would put in their email address, somewhere
[17:19:58] i might ask a question on a hadoop mailing list
[17:20:08] the name before the email address I mean
[17:20:13] I'm hesitant to have a display name attribute or any other attribute that lets a user set a real name, unless we're specifically managing it
[17:20:21] ottomata: Ok, I have some older 'high performance' misc that arent high performance anymore
[17:20:29] so you may end up with those, which is slightly better than stock misc
[17:20:40] ^d doesn't send mails as "Chad " for example
[17:20:41] just put in procurement ticket when you need them and explain why, etc...
[17:20:43] (CR) MaxSem: "I wonder what happens to monitor_service() from the role class..." [operations/puppet] - https://gerrit.wikimedia.org/r/81202 (owner: Ori.livneh)
[17:21:05] anyway
[17:21:11] ok cool, will do, thanks RobH
[17:21:18] let's run trial renames now and decide later?
[17:21:22] can I help?
[17:21:56] well, I think we know the manual steps
[17:22:02] it's a matter of automating it
[17:22:14] we'll also need the manual step for RT later, too
[17:22:20] so that we can add that into the automation
[17:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:22:45] RT updates the "real name" field automatically, but yes, "username" needs to stay the same or be manually changed
[17:23:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[17:23:25] yep
[17:23:29] same with gerrit and mediawiki
[17:23:52] ottomata: can i tap you for a couple of small puppet changes?
[17:24:03] it's definitely possible to have a real name field, but I worry about people using it to impersonate others
[17:24:47] ^d: ok, ready to try?
[17:24:58] <^d> s/try/break shit/
[17:25:00] <^d> Yup :)
[17:26:32] paravoid: http://www.elasticsearch.com/blog/welcome-jordan-logstash/
[17:26:47] <^d> Ryan_Lane: I logged out of gerrit. I'm afraid of something going wonky if I change the database while I'm logged in still ;-)
[17:27:23] ok, renamed in ldap and MW
[17:27:31] you'll handle gerrit?
[17:27:32] \o/!
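The LDAP half of that rename is not shown in the log; a sketch of what changing a cn attribute can look like, assuming cn is not part of the entry's DN. All names, hosts, and the admin bind DN here are placeholders:

    # Rewrite the common-name attribute on an existing entry.
    ldapmodify -x -H ldap://ldap.example.org \
        -D 'cn=admin,dc=example,dc=org' -W <<'EOF'
    dn: uid=demon,ou=people,dc=example,dc=org
    changetype: modify
    replace: cn
    cn: Chad
    EOF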
[17:27:50] that's *awesome* news
[17:27:55] bd808: see above
[17:28:11] ^d: let's make a page somewhere in wikitech listing the manual steps so that we can do this alone :)
[17:28:18] Ryan_Lane too
[17:28:28] hm, LeslieCarr, RobH
[17:28:36] elasticsearch hired the author of logstash
[17:28:43] and kibana
[17:28:44] so we have 1 un-allocated low perf misc server already in analytics
[17:28:46] analytics1026
[17:28:46] <^d> Ryan_Lane: Yeah, done in gerrit
[17:28:51] maybe we only need two more?
[17:28:52] (different authors)
[17:28:53] ^d: ok, try it out :)
[17:29:00] that's really great news
[17:29:01] ottomata: even better =]
[17:29:04] k cool
[17:29:10] <^d> A-ha! It worked!
[17:29:15] i only mentally flagged those for you, nothing is officially flagged yet
[17:29:17] http://www.elasticsearch.org/content/uploads/2013/08/Screen-Shot-2013-07-11-at-5.00.28-PM.png
[17:29:23] did they do some major ui work?
[17:29:25] because that looks way nicer
[17:29:31] so you can change reqs as needed, and those changes that reduce overall request are encouraged ;]
[17:29:33] there's http://three.kibana.org/
[17:29:38] which is entirely client side
[17:29:40] <^d> Ryan_Lane: See https://gerrit.wikimedia.org/r/#/c/75777/ for example.
[17:29:55] surprisingly www.kibana.org doesn't mention it though
[17:30:02] and says "written in ruby" instead
[17:30:03] wow. that looks really nice
[17:30:07] although v3 is entirely HTML
[17:30:13] we should really be using this :)
[17:30:13] RobH: https://rt.wikimedia.org/Ticket/Display.html?id=5678
[17:30:16] and javascript, obviously
[17:30:32] <^d> Ryan_Lane: And logging in to wikitech worked too.
[17:30:36] \o/
[17:30:42] paravoid: so, want to be renamed now?
[17:30:48] yes
[17:30:52] ori-l uMMmmmmmmmmm they are quick ones? :)
[17:30:58] <^d> Ryan_Lane: Lemme find him in gerrit.
[17:31:04] ottomata: I assume these need to be on the analytics vlan and thus cannot share metal with non analytics servers?
[17:31:17] are you documenting the steps?
[17:31:18] cuz if its a low overhead and doesn't need to be on same vlan, we can disperse the service
[17:31:24] paravoid: Faidon Liambotis ?
[17:31:26] if you are, we should add an "ldap users" section
[17:31:28] but if it needs same vlan, then yea has to be bare metal stand alone
[17:31:31] ok
[17:31:38] paravoid: yeah, we're going to record the steps
[17:31:43] it's relatively easy
[17:31:58] <^d> Just has to be done in a certain order :)
[17:32:06] ldap, then gerrit/mediawiki
[17:32:06] (man our chat rooms have to be confusing as hell to folks who aren't used to parsing out 2-7 different conversations)
[17:32:10] <^d> Gerrit's most likely to go wonky if we don't :)
[17:32:13] paravoid: ES hired Sissel? Awesome
[17:32:21] bd808: yep
[17:32:37] bd808: and kibana is under github.com/elasticsearch/ too
[17:32:43] it is confusing to me who is used to it
[17:32:46] ElasticSearch will take over the universe
[17:32:49] ottomata: yes. https://gerrit.wikimedia.org/r/#/c/75284/ , https://gerrit.wikimedia.org/r/#/c/81202/ , https://gerrit.wikimedia.org/r/#/c/81182/
[17:32:50] RobH: ottomata yeah, needs bare metal
[17:33:11] ^d: I'm ready for faidon
[17:33:15] paravoid: please log out of gerrit
[17:33:18] just in case
[17:33:31] gerrit is known to be an asshole
[17:33:38] done
[17:34:01] I have another woe too
[17:34:04] <^d> Ryan_Lane: To "Faidon Liambotis"?
[17:34:07] yes please
[17:34:15] done in ldap/wikitech
[17:34:28] RobH, yeah needs to be in same vlan
[17:34:51] these talk with hadoop namenode when there are any hdfs file changes
[17:35:30] (CR) Ottomata: [C: 2 V: 2] Tweak 'collect_every' and 'name_match' in EL's Ganglia module [operations/puppet] - https://gerrit.wikimedia.org/r/75284 (owner: Ori.livneh)
[17:35:51] hallelujah
[17:35:54] <^d> paravoid: Ok, log back into gerrit.
[17:36:13] (CR) Ottomata: [C: 2 V: 2] Add solr::decommission class & apply it to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/81202 (owner: Ori.livneh)
[17:36:29] ^d: worked
[17:36:31] :D
[17:36:51] <^d> Sweet. Ok, so this process totally works and we can document it now.
[17:36:56] <^d> And maybe one day script it.
[17:36:57] ooo, ori-l, gonna want to talk to you about graphite/statsd sometime later :)
[17:37:45] sure
[17:37:55] i'm just getting to know it myself
[17:37:57] is this one ready to be merged? i see your unanswered q for Ryan
[17:39:22] oh, maybe hold off on that one then
[17:39:32] though it should be ok i think
[17:40:07] <^d> Ryan_Lane: Did you start documenting this somewhere?
[17:40:17] yeah i'll just scp the deployment target dir over from vanadium to bootstrap things
[17:40:47] ^d: nope
[17:40:49] one sec
[17:40:54] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours
[17:41:00] I wonder where we should put this
[17:41:21] maybe https://wikitech.wikimedia.org/wiki/Renaming_users ?
[17:41:27] and link to it from the main page?
[17:41:39] <^d> Sounds good enough to me.
[17:41:48] under the "Other" section?
[17:41:49] ok
[17:43:45] ottomata: shld be ok
[17:44:19] k
[17:44:34] (PS2) Ottomata: Track event counts in Graphite [operations/puppet] - https://gerrit.wikimedia.org/r/81182 (owner: Ori.livneh)
[17:44:39] (CR) Ottomata: [C: 2 V: 2] Track event counts in Graphite [operations/puppet] - https://gerrit.wikimedia.org/r/81182 (owner: Ori.livneh)
[17:45:18] binasher: 1 0.000499 query-m: UPDATE `abuse_filter` SET af_hit_count=af_hit_count+N WHERE af_id = 'X'
[17:45:19] (PS1) Faidon Liambotis: Switch ops to passwordless sudo in labs [operations/puppet] - https://gerrit.wikimedia.org/r/81267
[17:45:20] 2 5.018608 Parser::parse-WikitextContent::getParserOutput
[17:45:21] 4 0.001795 query-m: COMMIT
[17:45:24] great, ori-l done
[17:45:31] binasher: better or worse? :)
[17:45:35] ottomata: sweet, thanks very much
[17:45:40] Aaron|home: saw the mail, thanks a lot for the details and the work
[17:45:55] Ryan_Lane: ^^^
[17:45:56] * Aaron|home tries to raise the bus factor slightly
[17:46:01] <^d> Ryan_Lane: Added to mainpage.
[17:46:21] ^d: I'm adding the docs right now
[17:47:24] PROBLEM - Apache HTTP on mw131 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50053 bytes in 0.169 second response time
[17:47:35] (CR) Faidon Liambotis: [C: 2] "Per discussion with Ryan." [operations/puppet] - https://gerrit.wikimedia.org/r/81267 (owner: Faidon Liambotis)
[17:47:52] ^d: added
[17:48:10] paravoid: thanks
[17:48:16] Can someone fix /usr/local/apache/common-local/php-1.22wmf12/extensions/CirrusSearch on mw131? Seems to be owned by root. Trying to work out the source of an error in the apache logs which is coming from mw131 apparently
[17:50:07] binasher: I also found out why prefs get saved twice sometimes. It's via the API, a watchlist token is populated on demand on prefs load.
The calling code loads the prefs then saves them, so happens twice. [17:50:32] none of that would matter much if GadgetHooks::getPreferences didn't randomly take so long on commons [17:51:14] but it sucks since token population deletes/inserts the whole prefs again before that slow function (which for the life of me I don't know why is slow) [17:52:00] Aaron|home: getParserOutput before commit? noooooo! [17:52:26] and after a counter...that's the best [17:53:08] Aaron|home: have you looked at the full tx the `math` replace is a part of yet? [17:53:45] Reedy: fixin [17:53:45] Aaron|home, in principle GadgetHooks::getPreferences should take long only if cache is expired (one per 24h or when a gadget definition gets edited) - does it appear to happen more often? [17:53:49] (03PS1) 10Ori.livneh: Remove 'solr::decommission'; qualify 'eventlogging' include [operations/puppet] - 10https://gerrit.wikimedia.org/r/81271 [17:54:59] (03CR) 10Ottomata: [C: 032 V: 032] Remove 'solr::decommission'; qualify 'eventlogging' include [operations/puppet] - 10https://gerrit.wikimedia.org/r/81271 (owner: 10Ori.livneh) [17:55:07] ori-l: which change did you want me to look at again? [17:55:13] I don't see one in my queue [17:55:18] (03CR) 10Bsitu: [C: 032] Enable job queue to process web and email notifs on testwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/61647 (owner: 10Bsitu) [17:55:19] it was about git-deploy [17:55:36] assuming it's getting the grain applied, it should just work [17:55:39] Ryan_Lane: i added a git-deploy target for EventLogging, was wondering if anything extra needed to be done [17:55:46] coool, i'll try it [17:55:48] binasher: an example is http://pastebin.com/WbSK4WdN [17:55:53] let me make sure its prereqs are installed by puppet [17:56:00] <^d> Ryan_Lane: Added my bits. [17:56:12] note that the REPLACE calls are actually nested in the parse call, not strictly after as the log might make it seem...but still [17:56:29] ^demon|away: sweet [17:56:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:09] (03Merged) 10jenkins-bot: Enable job queue to process web and email notifs on testwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/61647 (owner: 10Bsitu) [17:57:11] Aaron|home: FileBackendStore in the middle… yikes [17:57:14] thanks ottomata [17:57:15] ori-l: it's possible that it's going to be missing some packages [17:57:27] it = git deploy, salt, or eventlogging? [17:57:29] I'm not sure the best way to add these in puppet [17:57:33] git deploy [17:57:45] it needs python-redis [17:57:49] and git-core [17:57:55] Aaron|home: is there any reason the math replace queries couldn't be auto commit? 
tx level atomicity seems unnecessary here [17:58:01] I could add these to the deployment target [17:58:04] definition [17:58:17] i'll just install them by hand for now [17:58:44] PROBLEM - DPKG on analytics1026 is CRITICAL: Timeout while attempting connection [17:59:01] hm, yeah, I'll add these to the deployment definition [17:59:10] which will ensure any deployment target has the dependencies [17:59:22] k [17:59:52] binasher: I was thinking of using onTransactionIdle(), which would be autocommit [18:00:17] batching the REPLACES and the file stores would be sweet though [18:00:24] RECOVERY - Apache HTTP on mw131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.382 second response time [18:00:32] AaronSchulz: agreed [18:00:36] * Aaron|home looks at bd808 ;) [18:00:50] (03PS1) 10Jgreen: make OTRS spamassassin even more strict [operations/puppet] - 10https://gerrit.wikimedia.org/r/81274 [18:01:02] * bd808 hides under desk [18:01:44] bd808: Point at bugzilla! [18:02:24] PROBLEM - Disk space on analytics1027 is CRITICAL: Connection refused by host [18:02:24] PROBLEM - RAID on analytics1026 is CRITICAL: Connection refused by host [18:02:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.396 second response time [18:02:34] PROBLEM - SSH on analytics1026 is CRITICAL: Connection refused [18:02:34] PROBLEM - Disk space on analytics1026 is CRITICAL: Connection refused by host [18:02:35] PROBLEM - SSH on analytics1027 is CRITICAL: Connection refused [18:02:51] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Enable job queue to process web and email notifs on testwiki/test2wiki' [18:02:54] PROBLEM - DPKG on analytics1027 is CRITICAL: Connection refused by host [18:02:54] PROBLEM - RAID on analytics1027 is CRITICAL: Connection refused by host [18:02:57] Logged the message, Master [18:03:21] (03PS1) 10Ryan Lane: Ensure deployment targets have dependencies [operations/puppet] - 10https://gerrit.wikimedia.org/r/81276 [18:03:26] !log bsitu synchronized wmf-config/CommonSettings.php 'Enable job queue to process web and email notifs on testwiki/test2wiki' [18:03:31] Logged the message, Master [18:03:44] I need to stop wondering where to start and just dive into the deep end on the whole media management pipeline [18:04:38] I was thinking about what things are the most actionable [18:04:46] bd808: bug 53400 might be an OK start [18:04:59] !log updated Parsoid to 84fac157 [18:05:04] Logged the message, Master [18:05:33] basically, writeToDatabase() should at least use onTransactionIdle() and put the REPLACE in a callback/closure there [18:05:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:06:16] (03CR) 10Jgreen: [C: 032 V: 031] make OTRS spamassassin even more strict [operations/puppet] - 10https://gerrit.wikimedia.org/r/81274 (owner: 10Jgreen) [18:06:59] bd808: in terms of RfC'ish stuff, the thumbnail coalescing isn't too bad of a place to dig into [18:07:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.633 second response time [18:07:25] (03CR) 10Ryan Lane: [C: 032] Ensure deployment targets have dependencies [operations/puppet] - 10https://gerrit.wikimedia.org/r/81276 (owner: 10Ryan Lane) [18:07:26] what about the whole "chunked uploads suck" theme?
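Aaron's onTransactionIdle() idea, reduced to plain SQL in a shell wrapper (the demo table, database, and hash values are all invented; assumes a local MySQL you can write to): keep the main transaction short and commit it, then fire the REPLACE afterwards as its own autocommitted statement, so a slow or failed side write can no longer bloat the main transaction:

    mysql -u root test <<'SQL'
    CREATE TABLE IF NOT EXISTS math_demo (
      math_inputhash  VARBINARY(16) PRIMARY KEY,
      math_outputhash VARBINARY(16) NOT NULL
    );
    START TRANSACTION;  -- the main write tx: only the essential edit writes belong here
    -- ... page/revision writes would happen here ...
    COMMIT;
    -- the "onTransactionIdle" part: a standalone autocommitted side write
    REPLACE INTO math_demo VALUES
      (UNHEX('00112233445566778899aabbccddeeff'),
       UNHEX('ffeeddccbbaa99887766554433221100'));
    SQL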
[18:07:28] the amount of code change needed wouldn't be that huge, though some of it would be some small varnish module code [18:07:50] ori-l: if you just run puppet the dependencies should now be installed [18:08:26] bd808: the whole issue of large uploads bothers me because it relates to a bunch of problems that are hard to fix without rewriting everything (or horribly hacking around with job queue + persistent locks) [18:08:58] we can make large uploads work better for the first stage of the pipeline (upload) though re-upload, move, delete, restore will still suck horribly [18:09:31] Ryan_Lane: thanks, doing so [18:09:54] I'm not 100% sure, but I think roblaAWAY is open to major rewrite type projects [18:09:55] that said, if videos tend to just be uploaded once and not changed, and it's badly wanted, it could be worth it I suppose [18:10:22] well, there are different levels of "huge rewrites" ;) [18:10:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:11:59] bd808: I think for someone new to MW, the thumbnail thing is a better place to get started rather than going down that rabbit hole just yet (which still scares me after all these years) [18:12:30] * bd808 listens to sage advice [18:12:38] of course, if the priority for the quarter was already decided, I guess you don't have much choice though ;) [18:12:45] thumbnail coalescing? [18:13:03] (03PS4) 10Dzahn: Replace public key for jamesofur [operations/puppet] - 10https://gerrit.wikimedia.org/r/79304 (owner: 10Jalexander) [18:13:14] PROBLEM - DPKG on virt2 is CRITICAL: Timeout while attempting connection [18:13:15] paravoid: whatever you call it, fudging vcl_hash to group them for PURGEs [18:13:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.068 second response time [18:13:36] maybe not using swift/ceph anymore for this and not having 7 copies of everything [18:13:42] ah, that [18:13:49] so, I was looking a bit at that in the past [18:13:56] I think the real priority at this point is "do things to make multimedia less sucky" [18:13:57] remember the linear search issue? [18:14:04] PROBLEM - NTP on analytics1027 is CRITICAL: NTP CRITICAL: No response from NTP server [18:14:07] yes [18:14:15] but smoothing problems in the upload path seems to be a recurring theme [18:14:28] so Tim was saying that he didn't expect this to be a huge problem, how many thumbs can a file have [18:14:35] then you know what I pointed out? [18:14:44] PROBLEM - NTP on analytics1026 is CRITICAL: NTP CRITICAL: No response from NTP server [18:14:45] PDF and multi-page TIFFs [18:14:54] 1000-page PDF with 3-4 thumb sizes [18:15:00] that's not uncommon at all [18:15:03] heh [18:15:21] our djvu/pdf handling sucks too [18:15:21] there's a few wikis that use that a lot [18:15:24] ori-l: let me know if that doesn't make a new target work [18:15:28] bd808: oh, wait, I told you that already [18:15:28] arwikisource I think? [18:15:41] like loading the whole text as metadata and slowing down category views [18:15:44] I want to make adding a new target as simple as just having the puppet class added [18:15:47] * Aaron|home only fixed the OOM aspect of that [18:15:49] questions of the grammatical form "how many ___ could possibly ___..."
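The vcl_hash "fudging" paravoid describes has a simple core: make every thumbnail size of a file hash to one cache key derived from the original, so a single PURGE evicts all sizes. A shell rendering of just the key idea (URL layout invented; the real logic would live in VCL):

    # all sizes of Foo.jpg reduce to the same key once the /NNNpx- component is stripped
    for thumb in /w/thumb/a/ab/Foo.jpg/120px-Foo.jpg \
                 /w/thumb/a/ab/Foo.jpg/800px-Foo.jpg; do
        echo "$thumb" | sed -E 's#/[0-9]+px-[^/]+$##'
    done
    # both iterations print /w/thumb/a/ab/Foo.jpg: one key, one PURGE, every size gone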
are prayers to sauron [18:16:11] but the solution could be handling pdf/tiff/djvu entirely differently [18:16:14] paravoid: if nothing else, one could make an exception by file extension and use the old system for those [18:16:18] right [18:16:21] heh [18:16:24] and fix that crap later [18:16:29] but yeah, this needs to be done with care [18:16:47] we don't want to get caught in the spiderweb of having to redo everything though, but break things into bits [18:16:53] (03PS1) 10Bsitu: Enable job queue to process notifs for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81278 [18:17:07] I'll leave that up to the people actually doing the work :) [18:17:19] (03CR) 10Dzahn: [C: 032] "jamesofour = jalexander, https://office.wikimedia.org/wiki/User:Jalexander and IRC ~jamesur@wikimedia/Jamesofur" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79304 (owner: 10Jalexander) [18:17:29] I'm merely pointing out the issue [18:17:46] (03PS1) 10RobH: testing smokeping role before modifying it [operations/puppet] - 10https://gerrit.wikimedia.org/r/81279 [18:17:52] paravoid: sure [18:18:06] PROBLEM - Check status of defined EventLogging jobs on hafnium is CRITICAL: Connection refused by host [18:18:29] but yeah, not having to store millions of tiny thumb files into media storage would be hugely appreciated [18:19:01] I'm of the naive opinion that "we" need to document the use cases and acceptance tests, evaluate current impl and design next-gen solution. [18:19:19] Then we need to figure out how to build that solution in smallish chunks [18:19:40] but I'm also talking out of my ass as to the specifics [18:20:01] (03CR) 10RobH: [C: 032] testing smokeping role before modifying it [operations/puppet] - 10https://gerrit.wikimedia.org/r/81279 (owner: 10RobH) [18:21:36] RECOVERY - SSH on analytics1026 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:21:37] RECOVERY - SSH on analytics1027 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:22:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.701 second response time [18:23:42] Learning to do all of this remote is hard. At $DAYJOB-1 I would have already locked a group in a conf room with a whiteboard and tried to make some progress [18:24:21] you can always call a google hangout meeting [18:24:25] virtually locked in a room [18:24:33] (03CR) 10Dzahn: "notice: /Stage[main]/Accounts::Jamesofur/Ssh_authorized_key[jalexander@wikimedia.org2]/ensure: created" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79304 (owner: 10Jalexander) [18:24:49] LeslieCarr, not enough resolution for whiteboard:/ [18:25:39] LeslieCarr: true enough. But still have to herd cats into a pile and feed them an agenda [18:25:56] the whole wall is a whiteboard now on some hallways [18:26:24] I'm not sure I can pull that off yet. I tried to get started last week but things only moved a few centimeters [18:27:07] mostly because I don't grok the problem space yet I think. And the roster of key players [18:29:47] yes still have to do a lot of that :) [18:29:53] and despite multiple timezones [18:30:26] RECOVERY - RAID on analytics1026 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:32:17] Aaron|home & paravoid: thanks for looping me in on this discussion.
I'll try to make some sense out of it [18:32:36] RECOVERY - Disk space on analytics1026 is OK: DISK OK [18:32:42] * bd808 needs lunch first though [18:32:56] RECOVERY - RAID on analytics1027 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:33:11] (03CR) 10Dzahn: "not all but most of the RT users follow the "asmith" scheme. (first letter of first name, last name). how about the group memberships, we " [operations/puppet] - 10https://gerrit.wikimedia.org/r/80577 (owner: 10Faidon Liambotis) [18:33:26] RECOVERY - Disk space on analytics1027 is OK: DISK OK [18:34:27] Aaron|home: ping [18:38:57] * Aaron|home pongs [18:39:59] Aaron|home: high priority jobs in testwiki and test2wiki are processed almost immediately, is that normal? [18:41:57] bsitu: I guess it can happen, I wouldn't count on it though [18:42:06] there is nothing special about those queues afaik [18:46:45] Aaron|home: thx! I did see jobs in the runJobs.log, so they were processed. It used to take more than 5 minutes, so I wanted to make sure there is nothing special, :) [18:46:46] Ryan_Lane: [ERROR ] The return failed for job 20130827184259456139 'deploy_redis.returner' (on hafnium) [18:46:49] not sure what that means [18:47:01] hm [18:47:01] that's /var/log/upstart/salt-minion.log, fwiw [18:47:03] one sec [18:47:51] * Ryan_Lane grumbles [18:48:02] seems that it's necessary to sync the modules for new hosts as well [18:48:07] I need to figure out how to automate that [18:48:10] (03PS1) 10Ottomata: Setting up analytics102[678] as Hadoop JournalNodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81284 [18:48:14] ori-l: try now [18:48:37] RECOVERY - NTP on analytics1026 is OK: NTP OK: Offset -0.01771819592 secs [18:48:56] (03CR) 10Ottomata: [C: 032 V: 032] Setting up analytics102[678] as Hadoop JournalNodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81284 (owner: 10Ottomata) [18:49:14] Ryan_Lane: doing so with force, getting: [18:49:17] # WARN : --force enabling rolling out same thing you had when you started [18:49:17] fatal: Unknown commit none/master [18:49:30] but it continues to run [18:49:30] (03CR) 10Bsitu: [C: 032] Enable job queue to process notifs for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81278 (owner: 10Bsitu) [18:49:47] on hafnium: [WARNING ] Unable to import "softwareproperties.ppa": No module named softwareproperties.ppa [18:50:02] that warning is normal [18:50:06] RECOVERY - NTP on analytics1027 is OK: NTP OK: Offset -0.01477134228 secs [18:50:11] (03Merged) 10jenkins-bot: Enable job queue to process notifs for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81278 (owner: 10Bsitu) [18:50:15] it'll go away when we upgrade salt [18:50:33] i think it worked now [18:51:05] yep. earlier failure was likely due to modules/returners/pillars not being sync'd [18:51:34] I need to have some way of triggering those to be sync'd when the grain is added [18:52:21] yes, it worked [18:53:06] is there a reason not to sync them everywhere? [18:53:29] so that any host could rapidly become a deployment target for anything? [18:53:40] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Enable job queue to process notifs for all wikis' [18:53:45] Logged the message, Master [18:53:56] it's not a problem, but I worry about salt being a SPOF for puppet [18:54:22] ah, i see [18:54:38] [ERROR ] Targeted grain "deployment_target" not found <-- hm. 
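Ryan's diagnosis above — the master-side modules/returners/pillars hadn't been synced to the new minion, so the deploy returner failed — has a stock one-line fix. A sketch with an example target (the exact minion id is a guess):

    # on the new deployment target itself:
    sudo salt-call saltutil.sync_all
    # or pushed from the salt master to just that minion:
    sudo salt 'hafnium.eqiad.wmnet' saltutil.sync_all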
I wonder why that shows on hosts that don't have that grain [18:54:47] that's a weird error [18:55:41] I'd imagine that should just be ignored [18:56:05] hm. let me see if salt blocks if the master is down when salt-call saltutil.sync_all is called [18:56:58] seems so [19:01:25] (03PS1) 10Ottomata: Fixing typo in role::analytics::common class name [operations/puppet] - 10https://gerrit.wikimedia.org/r/81285 [19:01:39] (03CR) 10Ottomata: [C: 032 V: 032] Fixing typo in role::analytics::common class name [operations/puppet] - 10https://gerrit.wikimedia.org/r/81285 (owner: 10Ottomata) [19:02:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:02:53] (03PS1) 10Lcarr: adding another common typo of "commmon" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81286 [19:03:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [19:03:27] (03CR) 10Lcarr: [C: 032 V: 032] adding another common typo of "commmon" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81286 (owner: 10Lcarr) [19:04:26] !log mlitn Started syncing Wikimedia installation... : Updating ArticleFeedbackv5, Echo, PageTriage and Thanks [19:04:32] Logged the message, Master [19:10:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:15] (03PS1) 10Lcarr: merging submodule? [operations/puppet] - 10https://gerrit.wikimedia.org/r/81288 [19:12:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.161 second response time [19:12:52] ottomata: ^^ that look right ? [19:13:24] perfect! [19:13:45] (03CR) 10Lcarr: [C: 032] merging submodule? [operations/puppet] - 10https://gerrit.wikimedia.org/r/81288 (owner: 10Lcarr) [19:14:07] running puppet-merge [19:14:12] now let's see if it's all up to date [19:14:13] (03CR) 10Akosiaris: "The truth is that the config as is now is kind of misleading. It gives the impression that it will force HTTPS but it does not (at least n" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80314 (owner: 10Dzahn) [19:14:16] oh noes [19:14:55] no good? [19:15:00] http://pastebin.com/PNVF7nBs [19:15:15] LOCAL CHANGES!? [19:15:18] bwah! [19:17:02] weird [19:17:03] uh [19:17:06] it looks like it worked [19:17:23] hmm [19:17:27] it worked on stafford [19:17:48] why did you work on stafford puppet merge [19:18:01] oh, wait [19:18:01] sorry [19:18:03] !log mlitn Finished syncing Wikimedia installation... : Updating ArticleFeedbackv5, Echo, PageTriage and Thanks [19:18:09] Logged the message, Master [19:18:10] the sockpuppet puppet working copy has moved, right? [19:18:15] not in /root/puppet anymore [19:18:24] oh yeah, it should be in /var/lib iirc [19:18:29] same place as stafford - /var/lib/git/operations/puppet/modules/cdh4 [19:18:42] yeah ok i see the local changes now [19:18:42] well without the cdh4 bit [19:18:46] no idea why there are any local changes [19:18:49] i'm going to git reset them [19:18:53] sounds good [19:18:58] so, stafford worked [19:19:00] because its just a merge hook [19:19:01] that's weird [19:19:06] it's the local git submodule update that failed [19:19:06] yeah, but sockpuppet didn't ...
[19:19:16] merge worked in both places [19:19:23] just sockpuppet git submodule update failed [19:19:42] there we go [19:19:42] Submodule path 'modules/cdh4': checked out '5bde42ead8e72fb23a33cb7efd1c90f8b43e746d' [19:19:43] weird [19:19:46] no idea why that happened [19:19:53] but, ok! [19:19:55] now we are good to go [19:20:02] shall we proceed? back to an26 screen? [19:20:10] (also, should we move back to PM or do this here?) [19:20:24] we can do this here, others can veto if we're being too chatty [19:20:30] k [19:20:39] ok lets try again! running puppet on an26 [19:20:47] fingers crossed [19:20:50] so far nothing's broken :) [19:21:18] (03CR) 10Chad: "Yep, this is exactly how you update a submodule. Congrats on coming to the dark side :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81288 (owner: 10Lcarr) [19:21:25] yay! [19:22:04] berp [19:22:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:44] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:46] who is handling search now? [19:22:55] notit! [19:22:57] ah, got it, me! [19:22:58] <^d> cmjohnson1: Sup yo? [19:22:59] oh no!~ [19:23:00] haha [19:23:00] not search! [19:23:02] hehe [19:23:08] haha, thought leslie was talking to me [19:23:12] :) [19:23:25] <^d> cmjohnson1: But if you're talking about old search, I'd be hard pressed to care! I'm trying to kill that damn thing ;-) [19:23:28] have a h/w issue (dimm) on search1002...going to update bios ...i am pretty sure it will depool itself but wanna make sure [19:23:45] ahha, so $::cdh4::hadoop::dfs_journalnode_edits_dir is undef [19:24:17] <^d> cmjohnson1: What's the IP on that box? [19:24:21] yeah, missed that in the production configs [19:24:22] fixing [19:24:24] (03PS1) 10Ottomata: Setting dfs_journalnode_edits_dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/81361 [19:24:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.613 second response time [19:24:40] search1002.eqiad.wmnet has address 10.64.32.109 [19:24:44] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [19:24:45] (03CR) 10Ottomata: [C: 032 V: 032] Setting dfs_journalnode_edits_dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/81361 (owner: 10Ottomata) [19:24:48] it's ok, i'm just following the settings through the module [19:24:49] ^d search1002.eqiad.wmnet has address 10.64.32.109 [19:24:51] which is good [19:24:57] <^d> cmjohnson1: Oh yeah, it's all lvs'd. Nevermind, you're right. [19:25:01] (03PS1) 10Hashar: contint: python-httplib2 for pywikibot/core tests [operations/puppet] - 10https://gerrit.wikimedia.org/r/81362 [19:25:02] ^d https://wikitech.wikimedia.org/wiki/Search#Cluster_Host_Hardware_Failure [19:25:02] <^d> Should just depool. [19:25:08] k running puppet again [19:25:10] cool [19:25:15] that says it depools itself [19:25:31] ^d thx for looking [19:25:37] 5th time's the charm [19:26:03] <^d> cmjohnson1: np. Thanks for the heads up.
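For the record, the recovery on sockpuppet's working copy above amounts to the following (path quoted from the conversation; a sketch of the sequence, not a transcript of the exact commands run):

    cd /var/lib/git/operations/puppet
    git status                    # shows the unexplained local changes
    git reset --hard HEAD         # discard them
    git submodule update --init   # re-checkout the pinned modules/cdh4 commit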
[19:26:20] !log search1002 going down for bios update [19:26:25] Logged the message, Master [19:27:21] (03Abandoned) 10Hashar: contint: python-httplib2 for pywikibot/core tests [operations/puppet] - 10https://gerrit.wikimedia.org/r/81362 (owner: 10Hashar) [19:27:44] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:15] icinga-wm: you should know by now we don't care about pdf1 [19:28:34] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [19:28:43] hahah [19:28:44] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:29:32] LeslieCarr, is pdf1 what we want to replace as part of that hackathon? [19:29:43] yes, pdf1-3 [19:29:57] make the pdf rendering something that can be packaged, deployed properly on machines , etc [19:30:05] hey it's going! [19:30:06] cuz otherwise it's totally dying in december [19:30:08] yes [19:30:12] ottomata: can you unsplit the window ? [19:30:18] i thought I did [19:30:23] so it's more screen real estate [19:30:24] hrm [19:30:28] yeah, it is small for me too [19:30:30] dunno how to fix though [19:30:33] hrm [19:30:36] me neither :-/ [19:30:40] lemme look up screen documentation! [19:31:32] hmmmm, you know, maybe I didn't need analytics::common for these [19:31:33] hm [19:31:36] yay [19:31:42] ctrl+a F [19:31:54] oh yeay! [19:32:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:35] (03PS1) 10Ottomata: Don't need role::analytics::common for JournalNodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81365 [19:33:24] (03CR) 10Ottomata: [C: 032 V: 032] Don't need role::analytics::common for JournalNodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81365 (owner: 10Ottomata) [19:33:25] ^d...do you find it odd that icinga didn't notice search1002 down? [19:33:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.318 second response time [19:33:38] cmjohnson1: "icinga-wm: PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% " ? [19:33:52] hrm..missed that thx leslie [19:34:10] np, some assholes are doing hadoop stuff and littering up the channel [19:34:25] haha [19:34:27] :-P [19:34:35] going to run puppet on an27 [19:34:36] ottomata: interesting that analytics common class doesn't change the salt grain to cluster: analytics from misc [19:34:38] should we fix that ? [19:34:47] oh? [19:34:48] sure! [19:35:04] i haven't actually used salt yet, so I don't know much bout it [19:35:14] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:35:16] do it in role::analytics [19:35:51] ah doesn't set realm i think is where salt gets it [19:36:14] wait it did, didn't it? [19:36:14] grains: [19:36:15] - cluster: misc [19:36:15] + cluster: analytics [19:36:30] yeah it didn't set it in analytics common [19:37:01] oh yeah, its in role::analytics [19:37:04] not role::analytics::common [19:37:04] # ganglia cluster name. [19:37:04] $cluster = "analytics" [19:37:07] that's it, right? [19:37:12] yep [19:37:20] hrm [19:37:30] oh though that just sets it at the role::analytics scope [19:37:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:37] i think it might need to be $::cluster ? [19:38:08] hate variable scopes [19:38:42] naw i think it won't let you set a var like that [19:38:46] it worked though, right?
[19:39:22] well it did work now :) just wish it worked on initial run or if you only made the common role [19:39:27] without setting it twice [19:39:56] it worked on initial run with role::analytics included, right? [19:40:52] it was the run with the 178 include role::analytics::hadoop::client [19:41:03] not initial run [19:41:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.438 second response time [19:41:54] or actually the role::analytics class sets it [19:41:55] yeah [19:42:42] LeslieCarr: ah. crap. I know why [19:42:55] because nothing is specifically removing the grain [19:43:04] and it allows multiple values to be set for it [19:43:14] some grains should only allow one value, though [19:43:27] ah [19:43:27] let me open an rt for that and take a look at it [19:43:30] cool [19:43:32] thanks Ryan_Lane [19:43:41] mystery solved! [19:43:43] i have no idea what you guys are talking about but cool! [19:43:46] thanks! [19:43:47] :) [19:44:05] k waiting on an27, an26 is looking goooOOOood [19:44:49] yay [19:45:07] LeslieCarr: you can set the grain manually [19:45:16] i think both are happy ottomata ? [19:45:29] http://docs.saltstack.com/ref/modules/all/salt.modules.grains.html#salt.modules.grains.setval [19:45:47] yes! [19:45:48] cool [19:45:50] which will work until I fix it in puppet [19:45:54] making a commit for namenodes and datanodes now :) [19:46:58] (03PS1) 10Ottomata: Including hadoop classes on NameNodes and DataNodes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/81366 [19:47:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:43] (03CR) 10Ottomata: [C: 032 V: 032] Including hadoop classes on NameNodes and DataNodes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/81366 (owner: 10Ottomata) [19:47:44] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:27] bwerrrrrr ssh key! [19:48:30] LeslieCarr :p [19:48:44] gah [19:49:00] s'ok we are done with an26 and 27 for now anyway [19:49:01] new screen! [19:49:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:49:25] but lets get to namenodes quick! before cron does! :p [19:49:32] ok [19:49:34] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [19:49:37] restarted a hadoop named screen [19:50:44] k [19:50:52] do it! [19:50:54] puppet! [19:51:22] hahaha [19:51:30] i just made a "gaaah" sound at hte coffeeshop [19:51:34] haha [19:51:38] some people looked at me funny [19:51:42] hha [19:51:52] ok fingers crossed, this is going to be coool! [19:52:02] it shoudl format the namenode, which will cause the journalnodes to start doing their thang [19:52:45] i hope hope this will all work, i'm not exactly sure what's going to happen on the standby namenode when we run this….. this whole standby thing kinda has a chicken vs egg issue [19:52:58] i've done this before, but it was a while ago now :p [19:53:11] hehe [19:54:46] bah format failed! [19:54:47] i saw that [19:55:02] hah. damn it [19:55:10] that's set in the config [19:55:19] I don't know why it's showing up in both grains [19:55:26] sigh, failed dependencies! bad! [19:55:34] I just realized that as I went to fix it [19:55:43] it's probably cached [19:55:46] ottomata: well let's run the command manually and see what's up ? [19:55:47] which host is this? 
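The grains.setval stopgap Ryan links above looks like this in practice (target id from the conversation, grain name and value as discussed; stock salt execution functions):

    # inspect what the minion currently reports, then pin it by hand until puppet manages it:
    sudo salt 'analytics1026.eqiad.wmnet' grains.item cluster
    sudo salt 'analytics1026.eqiad.wmnet' grains.setval cluster analytics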
[19:55:50] ok once this finishes running, we'll run the format... [19:55:51] yup [19:55:52] :) [19:55:53] hehehe [19:55:58] jinx! [19:56:10] the change I made isn't a bad idea, but I didn't need to make it :D [19:56:19] <^d> Ryan_Lane: How does Thursday from 3-4 sound? [19:56:28] ^d: that's fine [19:56:31] <^d> mmk. [19:56:50] hmm LeslieCarr, it might be because 1028 doesn't exist and I added it to the list [19:56:50] 13/08/27 19:56:28 FATAL namenode.NameNode: Exception in namenode join [19:56:51] java.lang.IllegalArgumentException: Unable to construct journal, qjournal://analytics1026.eqiad.wmnet:8485;analytics1027.eqiad.wmnet:8485;analytics1028.eqiad.wmnet:8485/kraken [19:56:52] i'll remove it [19:56:52] ah [19:56:54] ok [19:57:16] <^d> greg-g: I'm taking a thursday deploy window from 3-4 for gerrit downtime. Should take only a fraction of the time. Calendar is open :) [19:58:19] (03PS1) 10Ottomata: Not including analytics1028 as a journalnode until it actually exists. [operations/puppet] - 10https://gerrit.wikimedia.org/r/81370 [19:58:27] LeslieCarr, ottomata: which system was showing the incorrect grain? [19:58:35] (03CR) 10Ottomata: [C: 032 V: 032] Not including analytics1028 as a journalnode until it actually exists. [operations/puppet] - 10https://gerrit.wikimedia.org/r/81370 (owner: 10Ottomata) [19:59:12] Ryan_Lane: it was analytics1026 when it had only class role::analytics::common on it [19:59:17] LeslieCarr: lets try puppet again [19:59:27] LeslieCarr: analytics1026 didn't have common on it [19:59:32] until we applied the journalnode stuff [19:59:34] which one was it ? [19:59:35] it was just standard before that [19:59:37] (03CR) 10MaxSem: [C: 032] Update $wgMFRemovableClasses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80494 (owner: 10MaxSem) [19:59:39] ah [19:59:40] ok [19:59:50] hrm, analytics1009 just changed its grain [19:59:52] (03PS1) 10Ryan Lane: Allow grains to be set with a single value [operations/puppet] - 10https://gerrit.wikimedia.org/r/81372 [19:59:53] ^d: sounds good [19:59:57] puppet is running on its own! [19:59:59] gah! [20:00:00] :p [20:00:01] is it correct now? [20:00:10] yeah [20:00:12] (03CR) 10MaxSem: [C: 032] Instruct robots to not index Wikipedia Zero. No deploy before 25-June-2013. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/69420 (owner: 10Dr0ptp4kt) [20:00:13] * Ryan_Lane nods [20:00:18] it was previously cached [20:00:23] the master caches results for a while [20:01:05] (03Merged) 10jenkins-bot: Update $wgMFRemovableClasses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80494 (owner: 10MaxSem) [20:02:22] ahhh, i'm remembering how the standby works now, lesliecarr [20:02:27] it overrides the -format exec [20:02:27] (03Merged) 10jenkins-bot: Instruct robots to not index Wikipedia Zero. No deploy before 25-June-2013. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/69420 (owner: 10Dr0ptp4kt) [20:02:35] to bootstrap the standby from the running primary namenode [20:02:52] so, once the primary is up, running puppet on standby should behave [20:04:36] !log maxsem synchronized w/robots.php 'https://gerrit.wikimedia.org/r/69420' [20:04:41] Logged the message, Master [20:04:48] ok [20:05:12] yurik, ^^^ deployed the robots.txt change [20:05:25] MaxSem, thx, checking [20:07:19] hrm [20:07:22] why'd it fail this time ? [20:07:39] or does it take a while and we need to extend the timeout ? [20:07:50] hm still format failure [20:07:52] no don't think so [20:08:24] ha, or maybe so?
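That IllegalArgumentException is just the namenode balking at a qjournal:// URI naming a host it can't reach. A quick pre-flight check of the quorum members before retrying (hosts and port taken from the traceback):

    for h in analytics1026 analytics1027 analytics1028; do
        nc -vz -w 3 "${h}.eqiad.wmnet" 8485 || echo "${h}: not a usable journalnode"
    done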
[20:08:24] hm [20:08:27] looks like it is [20:08:28] yeah [20:08:56] i don't think it should take that long though [20:08:56] cool, "timeout" [20:09:17] we could strace -p the process to see if it's doing stuff [20:10:35] (03PS1) 10Chad: Switch Gerrit from manganese to ytterbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/81374 [20:10:47] (03PS1) 10CSteipp: Remove redirect from http to https from loginwiki [operations/apache-config] - 10https://gerrit.wikimedia.org/r/81375 [20:11:05] (03CR) 10jenkins-bot: [V: 04-1] Switch Gerrit from manganese to ytterbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/81374 (owner: 10Chad) [20:11:37] hm i dunno LeslieCarr [20:11:58] that doesn't seem right to me [20:12:00] (03PS2) 10Chad: Switch Gerrit from manganese to ytterbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/81374 [20:12:12] it would seem like it should be doing something [20:12:33] http://man7.org/linux/man-pages/man2/futex.2.html (in case you didn't know futex either) [20:13:47] ori-l: https://gerrit.wikimedia.org/r/#/c/81372/ since you wrote some of this code... [20:13:51] yeah i googled that too :) [20:14:01] i'm going to kill that and try to run manually [20:14:23] oh look! [20:14:25] it was waiting for input! [20:14:29] haha [20:14:32] this is because it only has 2 nodes [20:14:39] (03PS1) 10RobH: rt4670 we don't own this domain, nor is it pointed at our nameservers, removing support [operations/dns] - 10https://gerrit.wikimedia.org/r/81376 [20:14:49] ahha [20:14:51] (03CR) 10Chad: [C: 04-1] "Don't merge til Thursday, just wanted to get it ready." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81374 (owner: 10Chad) [20:14:51] :) [20:15:20] coool [20:15:21] and [20:15:24] (03CR) 10RobH: [C: 032] rt4670 we don't own this domain, nor is it pointed at our nameservers, removing support [operations/dns] - 10https://gerrit.wikimedia.org/r/81376 (owner: 10RobH) [20:15:48] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:59] Ryan_Lane: reviewing [20:16:00] look! [20:16:03] journal! [20:16:09] on an26 (and 27) [20:16:12] great! [20:16:20] yay [20:16:23] namenode starting up too! [20:16:39] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [20:16:42] btw, puppet ran on all the datanodes already [20:16:42] :p [20:16:51] i'm tailing logs on them [20:16:57] they are all trying to connect to the namenode [20:17:02] they should succeed once it comes up [20:17:24] (03CR) 10Dzahn: "RT #4670" [operations/dns] - 10https://gerrit.wikimedia.org/r/81376 (owner: 10RobH) [20:18:03] hehe [20:18:11] so should we run the format manually on 1009 ? [20:18:34] hmm, dunnoooooooo [20:18:37] i dunno! [20:18:57] let's run puppet and see what happens [20:19:36] hashar: should i make it a bug? authdns-gen-zones: command not found in operations-dns-lint [20:19:45] hmm that's a long puppet run on an09! [20:19:48] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:19:52] haha yup!
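Two stock commands cover this triage: checking whether the stuck format is actually doing work, and re-running it without the confirmation prompt puppet's exec could never answer. The pgrep pattern below is illustrative, and while -nonInteractive does exist in Hadoop 2.x-era CDH4, whether it fits the puppetized exec here is untested:

    # is the process doing I/O, or just parked on a futex?
    sudo strace -f -p "$(pgrep -f 'namenode -format' | head -1)"
    # re-run the format while skipping the interactive prompt:
    sudo -u hdfs hdfs namenode -format -nonInteractive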
[20:20:00] yup same deal [20:20:19] so, LeslieCarr this time [20:20:23] instead of running namenode -format [20:20:24] i'm running [20:20:27] namenode -bootstrapStandby [20:20:36] bootstrapStandby … got it :) [20:20:42] (03CR) 10Ori.livneh: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81372 (owner: 10Ryan Lane) [20:20:44] hm you know [20:20:49] i think this didn't used to prompt like this [20:20:50] ^ Ryan_Lane [20:20:53] and for future namenode #3 we will do bootstrap standby as well [20:20:56] this is a slightly newer version [20:20:57] also, this time wasn't because anything is wrong [20:21:00] cdh4.3.1 [20:21:06] we were running 4.2.1 before [20:21:07] yeah [20:21:11] hrm, is there a command line option ? [20:21:12] right exactly [20:21:29] great, that worked [20:21:30] runnign puppet [20:21:38] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [20:21:42] dunno [20:22:17] AH [20:22:24] and! [20:22:31] we have to promote the primary namenode to active [20:22:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:43] since we aren't using automatic failover [20:22:49] it makes you do it manually [20:22:57] sorry, this is so messy, we are doing a few nonstandard things here [20:23:02] ottomata: 4.3.1 ? weren't we tracking 4.2.1 only ? [20:23:03] its gone super smoothly in labs usually [20:23:09] we were, but i upped it this morning [20:23:20] s'pok? [20:23:24] s'ok? [20:23:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [20:23:38] i suppose. You know better :-) [20:24:11] k great :) [20:24:11] how's the repavement? [20:24:14] LeslieCarr: I am now running [20:24:23] hdfs haadmin -transitionToActive analytics1010-eqiad-wmnet [20:24:48] see bottom of this bit of the docs: [20:24:49] hrm, might have to do it manually … [20:24:52] https://github.com/wikimedia/puppet-cdh4#adding-high-availability-to-a-running-cluster [20:25:46] cool [20:26:07] paravoid, good! [20:26:13] good to hear [20:26:22] hadoop is almost up and running, 100% repaved [20:26:27] the kafka 0.7.2 brokers are still running [20:26:38] but there are temp ciscos running 0.8 we can play with for now [20:26:57] i'll repave the 0.7.2 machines once we have a 0.8 .deb in our apt [20:27:04] and can apply the proper kafka puppet stuff [20:27:18] LeslieCarr: there we go! puppet is finally liking this [20:27:21] RobH: btw, http://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines [20:27:23] since an10 is now active [20:27:31] it can create the hdfs filesystem hierarchy [20:27:32] crossing fingers... [20:27:33] RobH: long commit msg lines make me sad [20:27:39] ottomata: I 'll have a look at this 0.8 kafka deb [20:27:48] oh the branch t hing I mentioned? [20:27:48] k [20:27:51] much obliged! [20:27:58] ottomata: do we still have 0.7 producers? [20:28:03] no [20:28:05] they are paved! [20:28:10] so yeah, its basically flattened [20:28:15] i guess i can reinstall whenever then :) [20:28:22] i just can't do anything with them til we're ready [20:28:28] so the 0.7 brokers are kinda useless now? [20:28:34] true dat [20:28:51] wire speak? [20:28:51] i mean, ja i could flatten them anytime [20:28:53] i suppose [20:29:02] pssh, that's american man! [20:29:19] coool! LeslieCarr, looking goooood! 
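Pulling the manual-HA commands from the last few minutes into one sequence (service id as used in the log; the final -getServiceState check is an addition here, though it is stock haadmin):

    sudo -u hdfs hdfs namenode -bootstrapStandby   # on the standby (an1009): copy state from the primary
    hdfs haadmin -transitionToActive analytics1010-eqiad-wmnet    # promote the primary by hand
    hdfs haadmin -getServiceState analytics1010-eqiad-wmnet       # should now print "active"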
[20:29:23] on a completely unrelated note, have a look at http://bsonspec.org/ [20:29:29] yay [20:29:33] but let's leave this for the next version, too late now imho :) [20:29:46] (and too many format work from magnus' side) [20:29:49] hehe [20:29:50] much even [20:29:57] yay datanodes! [20:29:59] let's check it! [20:30:10] 180 TB! [20:30:10] woot! [20:30:15] https://github.com/mongodb/libbson/releases/tag/0.2.0 was released last friday [20:30:15] 10 datanodes [20:30:16] woot! [20:30:28] think of all the NSA data we can save on that! [20:30:31] WHaooooo bson [20:30:31] cool [20:30:38] paravoid: are you saying i should have put that one line down? [20:30:50] RobH: no, see the doc I pointed you at [20:30:54] and just been more terse [20:30:56] yes, i read it [20:30:59] and im asking you to clarify. [20:31:05] first line 50 chars, then whitespace line, then body [20:31:09] first line more terse, be more verbose on following [20:31:11] yes, ok [20:31:14] interesting! i wonder if order matters in bson... [20:31:14] and the RT using an RT: NNNN [20:31:25] which makes it clickable in gerrit as well [20:31:32] it used to do just rt2131 [20:31:37] i recall, but it doesnt now [20:31:42] and was pointed out to me yea [20:32:00] LeslieCarr: [20:32:01] ssh -N bast1001.wikimedia.org -L 50070:analytics1010.eqiad.wmnet:50070 [20:32:03] then navigate to [20:32:10] http://localhost:50070 [20:32:26] awesooome, with journalnodes showing up and everything! [20:32:36] nice work ottomata [20:32:38] now we just need that extra journalnode :) poke poke RobH [20:32:38] (03CR) 10Dzahn: [C: 031] Remove redirect from http to https from loginwiki [operations/apache-config] - 10https://gerrit.wikimedia.org/r/81375 (owner: 10CSteipp) [20:32:42] yay! [20:32:47] ? [20:32:48] and LeslieCarr :) [20:33:15] it's ok RobH , we should have thought about it and asked earlier [20:33:24] 179.06 TB ??? [20:33:28] i have no idea what you are talkin about [20:33:28] woot [20:33:30] yeah no worries RobH [20:33:34] https://rt.wikimedia.org/Ticket/Display.html?id=5678 [20:33:36] yes you do! [20:33:51] meh, not on procurement right now ;p [20:33:58] hehe, no probs [20:34:08] we've been running hadoop with no standby namenode for a long time now [20:34:18] no reason we can't wait a bit for 3rd journalnode to make the quorum complete [20:34:19] :p [20:34:37] i'll get back to it either later today or tomorrow am [20:35:40] k danke, no hurry [20:35:50] (03CR) 10Reedy: [C: 031] Remove redirect from http to https from loginwiki [operations/apache-config] - 10https://gerrit.wikimedia.org/r/81375 (owner: 10CSteipp) [20:35:56] LeslieCarr: we now need to do the hive/oozie/hue stuff, but we need to wait for the other node for that too [20:36:00] so uuuummmmmmmmmm [20:36:02] i think we are done for the day! [20:36:16] cool [20:36:17] yay! [20:36:18] :) [20:38:16] yayyay! [20:38:35] thank you! we did it! it was not quite as smooth as I would ahve liked, but lets blame −1 journalnode, eh? 
:p [20:38:57] hehehe [20:39:00] :) [20:39:03] we can blame that ;) [20:39:09] it was a great learning experience [20:41:57] !log demon synchronized php-1.22wmf13/extensions/CentralAuth 'CentralAuth to master' [20:42:02] Logged the message, Master [20:42:12] (03PS1) 10Akosiaris: Add an archival pool for backup [operations/puppet] - 10https://gerrit.wikimedia.org/r/81385 [20:42:34] !log demon synchronized php-1.22wmf14/extensions/CentralAuth 'CentralAuth to master' [20:42:40] Logged the message, Master [20:44:28] (03CR) 10Chad: [C: 031] Remove redirect from http to https from loginwiki [operations/apache-config] - 10https://gerrit.wikimedia.org/r/81375 (owner: 10CSteipp) [20:45:29] (03CR) 10Akosiaris: [C: 032] Add an archival pool for backup [operations/puppet] - 10https://gerrit.wikimedia.org/r/81385 (owner: 10Akosiaris) [20:49:43] (03PS1) 10Edenhill: Let log.data.copy default to true. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/81388 [20:49:44] (03PS1) 10Edenhill: Added optional %{}t formatting to %t [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/81389 [20:49:45] (03PS1) 10Edenhill: Added optional secondary formatter: format.key [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/81390 [20:51:19] (03CR) 10Ryan Lane: [C: 032] Remove redirect from http to https from loginwiki [operations/apache-config] - 10https://gerrit.wikimedia.org/r/81375 (owner: 10CSteipp) [20:51:55] (03PS1) 10Akosiaris: Adding blog.wikimedia.org to backup schedule [operations/puppet] - 10https://gerrit.wikimedia.org/r/81393 [20:54:01] (03CR) 10Ottomata: [C: 032 V: 032] Installing kafka-mirror init.d and default scripts. [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 (owner: 10Ottomata) [20:54:04] (03CR) 10Akosiaris: [C: 032] Adding blog.wikimedia.org to backup schedule [operations/puppet] - 10https://gerrit.wikimedia.org/r/81393 (owner: 10Akosiaris) [20:57:56] mutante: re """authdns-gen-zones: command not found in operations-dns-lint""" [20:58:17] mutante: I talked to paravoid about it. He is going to write a puppet class to have the gdsnd tools installed on jenkins [20:58:32] mutante: that will magically make the test to run something and hopefully they will be passing :-] [21:00:45] hashar: thank you:) i already made one , bug 53422 [21:01:01] sounds cool, especially the magical part [21:01:12] ottomata: what's the issue with the zookeeper conflict? [21:01:16] scapping... [21:01:24] mutante: thanks :) [21:02:51] !log deployed change 81375 to remove the http to https redirect from loginwiki [21:02:56] sleeep time [21:02:57] Logged the message, Master [21:07:28] paravoid [21:07:38] its just a package name conflict [21:07:39] afaik [21:07:50] the cloudera hadoop packages depend on the cloudera zookeeper package [21:07:56] and we are suing the debian zookeeper package [21:07:58] using* [21:08:19] which package is the one that you can't install? [21:08:21] and on which machine? [21:08:39] and do we want to install clouder zookeeper anywhere or do we want to use debian zookeeper everywhere? [21:09:18] could we just create a dummy cloudera-named zookeeper package that falls back to the version of zookeeper that we want? 
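Ryan's dummy-package idea is exactly what equivs exists for. A sketch — the version string is invented, chosen only to clear CDH4's zookeeper (>= 3.4.0) dependency check while shipping nothing:

    sudo apt-get install -y equivs
    cat > zookeeper-placeholder <<'EOF'
    Section: misc
    Priority: optional
    Standards-Version: 3.9.2
    Package: zookeeper
    Version: 3.4.5+dummy1
    Description: empty placeholder satisfying CDH4's zookeeper dependency
     The real ZooKeeper classes come from the distro package or a jar elsewhere.
    EOF
    equivs-build zookeeper-placeholder
    sudo dpkg -i zookeeper_3.4.5+dummy1_all.deb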
[21:10:20] hmmmm [21:10:50] I mean, I don't know what the issue is, I just saw the procurement RT [21:10:54] yeah [21:10:54] well [21:10:58] if it wasn't for this issue [21:11:02] and a package conflict seems like a bad reason to procure hardware :) [21:11:08] ja agree [21:11:17] but ja if it wasn't for this issue [21:11:18] i'd install hadoop-hdfs-journalnode on the three zookeeper nodes [21:11:25] an23,an24,an25 [21:11:50] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [21:11:55] Logged the message, Master [21:12:29] wait [21:12:40] a simple dist-upgrade on an23 wants to remove zookeeperd and upgrade zookeeper [21:12:56] so that's already an issue, journalnode has nothing to do with this? [21:13:07] .. [21:13:52] ok so in our apt, we have [21:13:55] zookeeper_3.4.5+20-1.cdh4.3.1.p0.76~precise-cdh4.3.1_all.deb [21:14:02] also wants to upgrade java, blergh [21:14:23] installed is 3.3.5+dfsg1-1ubuntu1 [21:14:40] cdh4 zookeeper installs this [21:14:41] ./usr/lib/zookeeper/zookeeper.jar -> zookeeper-3.4.5-cdh4.3.1.jar [21:15:12] ubuntu zookeeper has /usr/share/java/zookeeper-3.3.5.jar [21:15:27] i'm pretty sure the 3.3.5 jar would work fine for the hadoop stuff [21:15:37] but if we force installed the ubuntu one [21:15:40] we'd at least have to symlink [21:15:48] so that hadoop would know where to look [21:16:03] do we use cdh4's zookeeper anywhere? [21:16:16] not the server, no [21:16:18] hadoop uses it [21:16:20] yeah but i think that zookeeper_3.4.5 is a dependency of some cdh4 package [21:16:25] ight [21:16:26] right [21:16:42] right, we talked about this before when we were doing the zk puppet module [21:16:46] we said we would use the ubuntu one [21:16:52] because other people in the org would need zk too [21:16:59] and we didn't want to make them have to use the cdh4 module [21:17:00] hadoop depends: zookeeper (>= 3.4.0) [21:17:14] hadoop-0.20-mapreduce-zkfc too [21:17:34] hbase depends: zookeeper (>= 3.3.1) [21:17:34] the zookeeper server packages of the different dists work totally differently too [21:17:40] so the puppetization of the two are different [21:17:44] but that's just for the servers [21:17:48] yeah i remember that [21:18:09] there were quite a lot of changes... binary names, directory names... [21:18:21] so ja, we talked about this and basically just said: welp, i guess we won't install hadoop and zk servers on the same nodes [21:18:22] oh well [21:18:34] hence the procurement ticket :p [21:18:40] well [21:18:42] we didn't say exactly that [21:18:48] haha, ok maybe I said that [21:18:54] we said that a zookeeper cluster could be useful for other services in general [21:19:06] and that it might make sense to build it outside of analytics [21:19:36] and have it as an internal service for whatever other service [21:19:42] (we were thinking solrcloud at the time) [21:19:50] (03PS1) 10Akosiaris: Variable should not be in double quotes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81399 [21:19:50] sure! soooo, what, should we do a procurement for non-analytics zk nodes ? [21:20:05] and then we can just use the nodes we already have for the hadoop journalnode stuff?
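The symlink shuffle ottomata mentions, concretely (both paths quoted from the conversation; untested, and moot if the placeholder-package route is taken instead):

    # keep Ubuntu's zookeeper and point Hadoop's expected jar path at its jar:
    sudo ln -sf /usr/share/java/zookeeper-3.3.5.jar /usr/lib/zookeeper/zookeeper.jar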
[21:21:58] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [21:22:36] I think so, although it does sound like a pointless shuffling around [21:24:11] well, now is a good time for it, since hadoop is kinda out of commission for a bit [21:24:15] (03CR) 10Akosiaris: [C: 032] Variable should not be in double quotes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81399 (owner: 10Akosiaris) [21:24:19] should def do before we get kafka up [21:24:51] paravoid, how should I proceed? comment on and close the ticket? send email to ops list and ask if that is what we should do? [21:25:36] the procurement ticket will need approval anyway [21:25:43] by mark I think [21:26:00] so just add a comment there and let the one who'll approve decide [21:26:14] I think a non-analytics zookeeper cluster sounds best to me [21:26:19] mark and ken would work together on budget and approval of non budgeted servers yeah [21:26:21] zookeeper is a very useful piece of software [21:26:33] I can imagine various uses for it [21:28:06] ok cool, yeah I think that would be better too [21:31:22] paravoid, also, when we set up Kafka in other datacenters [21:31:26] those will need zookeeper too [21:31:49] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [21:31:55] Logged the message, Master [21:31:58] fun [21:33:03] k, updated RT ticket [21:33:05] https://rt.wikimedia.org/Ticket/Display.html?id=5678&results=21f3a10c2f51031c1055cc844d2c083b [21:33:07] (03PS1) 10Ori.livneh: Add self ('olivneh') to professor.pmtpa.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/81401 [21:33:11] binasher: ^ [21:34:08] (03PS2) 10Asher: Add self ('olivneh') to professor.pmtpa.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/81401 (owner: 10Ori.livneh) [21:34:33] woot, thanks. [21:34:37] (03CR) 10Asher: [C: 032 V: 032] Add self ('olivneh') to professor.pmtpa.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/81401 (owner: 10Ori.livneh) [21:36:12] (03PS1) 10coren: Fix webserver class to use truth value for ssl [operations/puppet] - 10https://gerrit.wikimedia.org/r/81403 [21:38:07] (03PS1) 10RobH: RT:2640 fixing missed quoting in case statement [operations/puppet] - 10https://gerrit.wikimedia.org/r/81404 [21:38:21] Coren: want a second set of eyes on patchset or you have it? [21:38:50] RobH: That touches too many things for self +2 IMO. Please to check. [21:38:59] will do! [21:39:59] (03CR) 10RobH: [C: 031] "glad someone spotted this cuz it was driving me insane earlier" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81403 (owner: 10coren) [21:40:06] Coren: looks good to me [21:40:10] (03PS1) 10Dzahn: fix missing quotes in realm case, do not quote booleans [operations/puppet] - 10https://gerrit.wikimedia.org/r/81406 [21:40:13] heh [21:40:17] mutante: ^ changeset i just mentioned to you [21:40:44] (03CR) 10RobH: [C: 032] RT:2640 fixing missed quoting in case statement [operations/puppet] - 10https://gerrit.wikimedia.org/r/81404 (owner: 10RobH) [21:40:58] (03CR) 10Dzahn: [C: 032] fix missing quotes in realm case, do not quote booleans [operations/puppet] - 10https://gerrit.wikimedia.org/r/81406 (owner: 10Dzahn) [21:41:33] (03CR) 10coren: [C: 032] "If RobH likes it, it's good enough for me.
:-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81403 (owner: 10coren) [21:41:36] ori-l: notice: /Stage[main]/Accounts::Olivneh/Unixaccount[Ori Livneh]/User[olivneh]/ensure: created [21:41:38] (03PS2) 10coren: Fix webserver class to use truth value for ssl [operations/puppet] - 10https://gerrit.wikimedia.org/r/81403 [21:41:43] Coren: im on sockpuppet, i'll merge for you there [21:41:48] (03CR) 10coren: [C: 032] "Bah, rebase." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81403 (owner: 10coren) [21:41:55] (03CR) 10Dzahn: [C: 031] "yes please thx" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81403 (owner: 10coren) [21:41:56] baaaahhhhh rebase! [21:42:54] Pushed, merged. [21:43:22] you puppet-merged too i see? (cuz its not waiting for me) [21:43:23] yay [21:43:25] thanks Coren, this happened before and drove people nuts:) [21:43:26] thx for that [21:44:03] Yeah, I have a script on my own box that stuffs the keys in my agent and goes to do the puppet-merge. :-) [21:44:19] !log Running extensions/Wikibase/repo/maintenance/rebuildPropertyInfo.php against wikidatawiki in screen on terbium [21:44:26] Logged the message, Master [21:49:36] PROBLEM - HTTP on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 545 bytes in 0.383 second response time [21:49:55] !log disregard netmon1001 alerts, its my project server [21:50:01] Logged the message, RobH [21:51:55] (03PS1) 10Dzahn: single quote the realm in the labs/prod case [operations/puppet] - 10https://gerrit.wikimedia.org/r/81408 [21:51:57] Coren: RobH , then those as well [21:52:25] (03CR) 10RobH: [C: 031] single quote the realm in the labs/prod case [operations/puppet] - 10https://gerrit.wikimedia.org/r/81408 (owner: 10Dzahn) [21:52:58] so the initial puppet run on new instances is automated yes? [21:53:09] i logged in and tried to sudo or login as root and no dice [21:53:28] can just be little old me with no special rights and puppet not running.... [21:54:09] yea... there it goes [21:54:13] finally works. [21:54:23] (03CR) 10Dzahn: [C: 032] single quote the realm in the labs/prod case [operations/puppet] - 10https://gerrit.wikimedia.org/r/81408 (owner: 10Dzahn) [21:57:00] <^demon|away> manybubbles: Your reorg code is live on beta, working well. [21:57:09] !log kaldari synchronized wmf-config/InitialiseSettings.php 'syncing InitialiseSettings.php to try to fix fatal error from MobileFrontend' [21:57:15] Logged the message, Master [22:06:06] !log kaldari synchronized php-1.22wmf13/extensions/MobileFrontend 'syncing MobileFrontend to try to address fatal error' [22:06:11] Logged the message, Master [22:07:29] ori-l: I was showing ryan http://fedmsg.readthedocs.org/en/latest/ before [22:07:40] re: 0mq [22:07:54] but redis/0mq/kafka seem unsuitable for external users, don't they?
[22:08:01] so it seems kind of orthogonal [22:08:19] i would take redis off that list, but thats just me [22:09:10] well, i guess if we are still considering external users as "hosted on godaddy shared hosting" maybe cant remove redis :( [22:09:19] paravoid: re RT #5681, we already had some of those separetely, looking [22:09:21] paravoid: they're two separate questions -- (1) how do you feed the data from the mediawikis to a host on the cluster [22:09:25] as separate tickets i mean [22:09:25] (03PS1) 10RobH: fixing smokeping to use standard [operations/puppet] - 10https://gerrit.wikimedia.org/r/81412 [22:09:26] and (2) how do you expose it to the outside world [22:09:38] right [22:09:44] for (2) I think XMPP might be "the worst except all others", to paraphrase churchill [22:10:12] for (1) the basic point I was trying to make is that we have options [22:11:30] the advantage of XMPP over MQs designed for internal use is that it copes well with badly-behaving clients [22:11:34] people that are hosted on godaddy would never be using this [22:11:53] (03CR) 10RobH: [C: 032] fixing smokeping to use standard [operations/puppet] - 10https://gerrit.wikimedia.org/r/81412 (owner: 10RobH) [22:11:53] using what? [22:12:06] (03PS1) 10Dzahn: remove vikipedia.org and .com, RT #4673, RT #4674, RT #5681 [operations/dns] - 10https://gerrit.wikimedia.org/r/81414 [22:12:07] any feed thing we discuss [22:12:46] we shouldn't even consider them [22:12:50] * paravoid doesn't understand the godaddy reference [22:12:59] yes, i'm a bit confused too [22:13:12] "well, i guess if we are still considering external users as "hosted on godaddy shared hosting" maybe cant remove redis :(" [22:13:15] (03PS2) 10Dzahn: remove vikipedio.org and vikipedio.com, RT #4673, RT #4674, RT #5681 [operations/dns] - 10https://gerrit.wikimedia.org/r/81414 [22:13:19] we shouldn't be constrained by the technical choices available to hosted wikis? i don't think we have to, given that the transport is configurable and extensible [22:14:27] * Ryan_Lane nods [22:20:15] !log kaldari synchronized php-1.22wmf13/extensions/MobileFrontend/includes/formatters/ExtractFormatter.php 'syncing live-hack to try to address fatal error' [22:20:21] Logged the message, Master [22:23:56] (03CR) 10CSteipp: "I'll obviously defer to ops if they want to support this or not, but I'll argue that there is an expectation that the bits of text colored" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80314 (owner: 10Dzahn) [22:26:06] csteipp: https://bugzilla.wikimedia.org/show_bug.cgi?id=53424 [22:27:08] (03CR) 10Lcarr: "With https it's trivial to impersonate another person -- you just change your name in etherpad." [operations/puppet] - 10https://gerrit.wikimedia.org/r/80314 (owner: 10Dzahn) [22:30:00] heh:) [22:30:36] csteipp: it was for you:) [22:32:09] !log purging login.wikimedia.org from text varnish [22:32:14] Logged the message, Master [22:34:24] PROBLEM - Host netmon1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:39:04] csteipp: btw... you can also change your color in the interface. Not just your name. [22:39:29] (03PS1) 10Ryan Lane: Add newer pmtpa virt nodes to netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/81418 [22:39:53] akosiaris1: Yep, I just think grabbing someone else's token and using that is worse. [22:39:56] PROBLEM - DPKG on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:42:40] csteipp: only if you use it to do something you can not otherwise do. 
[22:44:56] RECOVERY - Host netmon1001 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [22:47:16] PROBLEM - SSH on netmon1001 is CRITICAL: Connection refused [22:48:48] !log reedy synchronized php-1.22wmf14/extensions/CentralAuth/maintenance/ [22:48:54] Logged the message, Master [22:58:16] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay 0 seconds [22:58:35] !log kaldari synchronized php-1.22wmf13/extensions/MobileFrontend/ 'updating MobileFrontend to master for wmf13' [22:58:41] Logged the message, Master [22:59:46] PROBLEM - NTP on netmon1001 is CRITICAL: NTP CRITICAL: No response from NTP server [22:59:56] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error You cannot ALTER a log table if logging is enabled on query [23:01:26] you now have a new "show other bug" link on bugzilla tickets that show other bugs in the same component [23:07:00] (03PS1) 10Dzahn: RT #4671, RT #5681, remove quickipedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/81424 [23:07:08] PROBLEM - Host netmon1001 is DOWN: PING CRITICAL - Packet loss = 100% [23:07:51] uhhhh, I don't wanna know [23:07:58] (03CR) 10Dzahn: "telling legal for their records" [operations/dns] - 10https://gerrit.wikimedia.org/r/81414 (owner: 10Dzahn) [23:08:08] RECOVERY - SSH on netmon1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:08:18] RECOVERY - Host netmon1001 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [23:10:29] !log kaldari synchronized php-1.22wmf14/extensions/MobileFrontend/ 'updating MobileFrontend to master for wmf14' [23:10:34] Logged the message, Master [23:17:28] !log reedy synchronized php-1.22wmf13/extensions/CentralAuth [23:17:34] Logged the message, Master [23:18:13] !log reedy synchronized php-1.22wmf14/extensions/CentralAuth [23:18:19] Logged the message, Master [23:20:28] RECOVERY - HTTP on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.021 second response time [23:24:38] RECOVERY - NTP on netmon1001 is OK: NTP OK: Offset -0.01024508476 secs [23:26:38] PROBLEM - HTTP on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 545 bytes in 0.385 second response time [23:35:10] (03PS1) 10RobH: include curl in smokeping class [operations/puppet] - 10https://gerrit.wikimedia.org/r/81431 [23:36:00] (03CR) 10RobH: [C: 032] include curl in smokeping class [operations/puppet] - 10https://gerrit.wikimedia.org/r/81431 (owner: 10RobH) [23:36:32] (03PS1) 10RobH: Revert "include curl in smokeping class" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81432 [23:36:59] (03Abandoned) 10RobH: Revert "include curl in smokeping class" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81432 (owner: 10RobH) [23:37:42] (03PS1) 10RobH: introduced a typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/81433 [23:38:25] (03CR) 10RobH: [C: 032 V: 032] introduced a typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/81433 (owner: 10RobH) [23:42:30] RECOVERY - HTTP on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 3886 bytes in 1.431 second response time [23:52:18] !log bsitu synchronized php-1.22wmf14/extensions/Echo 'Update Echo to master' [23:52:24] Logged the message, Master [23:52:59] !log bsitu synchronized php-1.22wmf13/extensions/Echo 'Update Echo to master' [23:53:10] Logged the message, Master