[00:02:00] (PS1) Dzahn: ytterbium, first include accounts then do all the rest (sudo user), attempt to avoid dependency problem for Group[500] [operations/puppet] - https://gerrit.wikimedia.org/r/81159
[00:08:28] (PS2) Dzahn: ytterbium, have group wikidev, include accounts then do all the rest (sudo user), attempt to avoid dependency problem for Group[500] [operations/puppet] - https://gerrit.wikimedia.org/r/81159
[00:09:40] (PS3) Dzahn: ytterbium, have group wikidev, include accounts then do all the rest (sudo user), attempt to avoid dependency problem for Group[500] [operations/puppet] - https://gerrit.wikimedia.org/r/81159
[00:10:40] (CR) Dzahn: [C: 2] "user demon requires group wikidev wherever he goes" [operations/puppet] - https://gerrit.wikimedia.org/r/81159 (owner: Dzahn)
[00:15:03] <^d> Hehe :)
[00:16:07] ^d: crontab: user `gerrit2' unknown
[00:16:19] trying to setup some crontabs now
[00:16:26] but it will work on second run
[00:16:30] <^d> Dammit. Ok, so let's abort for today.
[00:17:10] well, just fixing the puppet run so you have your user, right:)
[00:17:21] <^d> Well gerrit2 is an ldap user.
[00:17:29] ah
[00:17:53] a) i meant your own user, it couldn't create demon
[00:18:03] <^d> Ah :)
[00:18:05] then when i fixed that puppet continues so we see b)
[00:18:09] about gerrit2 user
[00:18:18] and i thought it's just the order in puppet again
[00:18:34] well now it finishes puppet again
[00:18:46] and you have /home/demon/
[00:19:30] <^d> I can login just fine and sudo now :)
[00:19:33] <^d> So thanks!
[00:19:35] cool
[00:19:52] and yea, you got the remaining LDAP hookup issue then
[00:20:16] <^d> I'll sort that tomorrow, no big deal.
[00:20:21] kk,cool
[00:23:22] PROBLEM - RAID on db35 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:24:12] RECOVERY - RAID on db35 is OK: OK: 1 logical device(s) checked
[00:35:22] PROBLEM - RAID on db35 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:36:12] RECOVERY - RAID on db35 is OK: OK: 1 logical device(s) checked
[01:01:06] <^d> mutante: Could you do one quick thing?
[01:02:55] <^d> Or another root who has a second to do something quick :)
[01:05:10] ^d: what is it, a trap?:)
[01:05:32] <^d> `chown -R gerritslave:gerritslave /srv/ssd/gerrit` on lanthanum?
[01:06:16] ah, i see lots of stuff there now, yea
[01:06:34] replication needs that user, nod
[01:06:58] <^d> Yep
[01:07:14] !log chown'ing /srv/ssd/gerrit to gerritslave for replication on lanthanum
[01:07:18] done
[01:07:21] Logged the message, Master
[01:08:45] <^d> Thanks.
[01:12:47] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:13:37] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[01:15:17] yw, cya
[01:20:47] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours
[01:29:33] ^d: Uhh
[01:29:46] ssh: connect to host gerrit.wikimedia.org port 29418: Connection refused
[01:29:46] fatal: Could not read from remote repository.
[01:29:46] Please make sure you have the correct access rights
[01:29:46] and the repository exists.
[01:30:09] <^d> Sorry. Queue was empty so I restarted to pick up some changes.
[01:30:22] Ahh
[01:30:28] Just badly timed on my part ;)
[01:38:23] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[01:41:33] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
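The Group[500] workaround at the top of this hour is a classic Puppet ordering problem: the user resource was applied before the group it needs existed, so the first run failed and only the second succeeded. A minimal sketch of making that ordering explicit; the resources here are hypothetical illustrations, not the actual r81159 change:

    # Hypothetical sketch: force the group to exist before the user needing it.
    puppet apply --noop -e '
      group { "wikidev":
        ensure => present,
      }
      user { "demon":
        ensure  => present,
        gid     => "wikidev",
        require => Group["wikidev"],  # without this edge, ordering is luck
      }
    '

With the require edge (or an equivalent chaining arrow), the first Puppet run no longer trips over the missing group.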
[01:53:59] (PS1) Spage: Remove unused UseVForm settings variables. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/81174
[02:07:21] !log LocalisationUpdate completed (1.22wmf14) at Tue Aug 27 02:07:20 UTC 2013
[02:07:30] Logged the message, Master
[02:09:01] (CR) Mattflaschen: [C: 1] "Looks good to me. Let's deploy this Thursday." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/81174 (owner: Spage)
[02:12:42] !log LocalisationUpdate completed (1.22wmf13) at Tue Aug 27 02:12:42 UTC 2013
[02:12:48] Logged the message, Master
[02:13:28] (PS1) Demon: Until we switch over, still replicate all from manganese [operations/puppet] - https://gerrit.wikimedia.org/r/81175
[02:21:16] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Aug 27 02:21:16 UTC 2013
[02:21:22] Logged the message, Master
[04:15:58] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours
[04:21:58] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours
[04:28:58] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours
[04:32:58] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[04:43:13] (PS2) Mattflaschen: Set group for /srv/mediawiki on singlenode mediawiki [operations/puppet] - https://gerrit.wikimedia.org/r/79955
[04:43:13] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours
[04:43:22] (CR) Mattflaschen: "(2 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/79955 (owner: Mattflaschen)
[04:44:13] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours
[04:47:13] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours
[04:47:13] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours
[04:47:13] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:50:13] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[04:50:13] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours
[04:53:13] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours
[04:53:13] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours
[04:56:13] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
[04:58:13] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours
[04:59:13] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours
[05:00:13] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours
[05:00:13] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours
[05:02:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:03:13] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours
[05:03:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[05:53:27] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:20:28] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours
[07:00:34] (PS1) Ori.livneh: Track event counts in Graphite [operations/puppet] - https://gerrit.wikimedia.org/r/81182
[07:01:32] (CR) Ori.livneh: "Ryan, this patch would add hafnium to the list of EventLogging deployment targets. Is there anything I need to do on the host to prepare i" [operations/puppet] - https://gerrit.wikimedia.org/r/81182 (owner: Ori.livneh)
[07:19:07] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:19:27] (CR) Hashar: [C: 1] "Ends up being very easy to enable :-] Whenever you +2 this, could you check the authentication is working properly on beta?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/81148 (owner: CSteipp)
[07:19:57] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[07:33:22] hello
[07:34:10] !log jenkins: upgrading pep8 on gallium {{bug|53352}}
[07:34:15] Logged the message, Master
[07:34:57] (Restored) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/60866 (owner: Hashar)
[07:35:02] (PS6) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/60866
[07:35:43] (CR) jenkins-bot: [V: -1] Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/60866 (owner: Hashar)
[07:36:07] (Abandoned) Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/60866 (owner: Hashar)
[07:38:59] (CR) Hashar: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/71968 (owner: Hashar)
[07:39:02] (PS14) Hashar: contint: publish Zuul git repositories [operations/puppet] - https://gerrit.wikimedia.org/r/71968
[07:40:04] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours
[08:49:29] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 185 seconds
[08:49:29] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 185 seconds
[08:52:29] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds
[08:52:29] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds
[09:24:21] * hashar I am going to restart Jenkins in a few minutes for plugins upgrades
[09:27:19] !log restarting Jenkins for plugins upgrade
[09:27:25] Logged the message, Master
[09:28:43] !log jenkins: Failed to exec '/usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java (deleted)' No such file or directory holy crap
[09:28:49] Logged the message, Master
[09:49:19] jenkins not back yet?
[09:49:22] hashar: ?
[09:49:26] * aude can wait
[09:49:39] aude: nop :(
[09:49:44] k
[09:49:59] once in a while it refuses to start quickly
[09:49:59] where did java go?
[09:50:07] probably got upgraded
[09:50:11] it is still there :)
[09:50:13] hmmm
[09:50:20] jenkins is still parsing the conf
[09:50:26] its disk seems to be veryyyyy slow
[09:50:37] varnishkafka seems to work fine now
[09:50:58] and I have no clue how to find out which part is slowing the box I/O :(
[09:51:26] iotop?
[09:51:39] yup
[09:51:49] gives me a roughly 461.32 K/s
[09:52:00] though I have no clue from where it is reading hehe
[09:52:32] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:52:38] how many i/o requests a second?
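An aside on the "java ... (deleted)" failure above: the plugin upgrade window coincided with an openjdk package upgrade, so the running process was left holding an unlinked binary. A sketch of how to spot that state on a box (the jenkins pattern is illustrative):

    # /proc shows " (deleted)" when a running binary was replaced on disk.
    ls -l /proc/"$(pgrep -of jenkins)"/exe

    # Or list files that are open but already deleted (link count 0):
    lsof +L1 | grep -i jenkins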
[09:54:34] mark: doesn't show up in iotop, couldn't find that in vmstat/top :(
[09:54:45] anyway, completed!
[09:54:49] !log jenkins restarted :-]
[09:54:54] Logged the message, Master
[09:55:13] jenkins has to read a myriad of tiny files
[09:55:56] hashar: iostat -kx 3
[09:56:06] wasn't that the box with SSD(s)?
[09:56:20] ah will have to remember about iostat :-]
[09:56:25] yeah we have SSD for the jenkins workspace
[09:56:32] but the configuration files are still on the hdd
[09:56:45] (as well as all the history I will have to wipe one day)
[09:56:53] configuration files surely can't be that much to read?
[09:56:57] is there like 100 gigs of configuration?:P
[09:57:24] it's a java app, could be
[09:57:32] roughly 1400 config files
[09:57:37] + 1400 state files
[09:57:59] maybe you can have mapreduce job to parse the configs in hadoop
[09:58:08] that would be the java way
[09:58:11] or I could drop jenkins :-]
[09:58:26] still, 15ms/seek worst case * 1400 files = 21 second:)
[09:58:34] will probably start working on replacing jenkins next year
[10:00:44] with what?
[10:00:52] with a shell script
[10:01:12] bah, can it be ruby at least?
[10:02:45] well Zuul is now using gearman to distribute jobs to some workers
[10:02:55] which could be jenkins or whatever gearman client :-]
[10:15:01] PROBLEM - Disk space on cp1059 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 12437 MB (3% inode=99%): /srv/sdb3 13577 MB (4% inode=99%):
[10:22:36] hrm
[10:22:40] that's gonna be annoying
[10:46:07] (PS1) Ori.livneh: Add solr::decommission class & apply it to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/81202
[10:48:24] Nikerabbit / MaxSem: ^^ does that look all right?
[10:49:51] ori-l, shouldn't you stop the service before killing the files?
[10:50:16] that's boring and predictable
[10:50:36] (CR) Faidon: [C: 1] "This will do for now" [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/79927 (owner: Ottomata)
[10:51:30] (PS2) Ori.livneh: Add solr::decommission class & apply it to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/81202
[10:51:49] ori-l, and you're deleting the files _before_ that
[10:52:17] hm?
[10:52:37] i changed the 'before' to a 'require'
[10:52:37] (CR) MaxSem: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/81202 (owner: Ori.livneh)
[10:53:20] puppet antonyms
[10:56:50] (CR) MaxSem: [C: 1] Add solr::decommission class & apply it to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/81202 (owner: Ori.livneh)
[10:57:19] thanks
[10:58:03] (CR) Edenhill: [C: 1] "Looks good, implementation is exactly what I would've done!" [operations/software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/78782 (owner: Faidon)
[11:02:18] (PS1) Hashar: sudo right for hashar on lanthanum (Jenkins slave) [operations/puppet] - https://gerrit.wikimedia.org/r/81203
[11:03:14] (CR) Hashar: "Pending RT https://rt.wikimedia.org/Ticket/Display.html?id=5677" [operations/puppet] - https://gerrit.wikimedia.org/r/81203 (owner: Hashar)
[11:13:54] don't give hashar root, he'd just break stuff, he always says it himself
[11:14:37] Coren: there's a bunch of new unowned tickets :)
[11:18:32] is someone around that can deploy update for wikidata ?
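The "puppet antonyms" quip above refers to the before/require metaparameters, which express the same dependency edge from opposite ends. A hedged sketch with hypothetical resource names, not the actual solr::decommission code, of the stop-the-service-then-remove-its-files pattern MaxSem was asking about:

    # Stop the service first, then remove its data; the same edge can be
    # written as 'require' on the file or 'before' on the service.
    puppet apply --noop -e '
      service { "solr":
        ensure => stopped,
        enable => false,
      }
      file { "/var/lib/solr":
        ensure  => absent,
        recurse => true,
        force   => true,
        require => Service["solr"],  # equivalently: before => File[...] on the service
      }
    '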
[11:18:33] https://gerrit.wikimedia.org/r/#/c/81198/
[11:18:33] (CR) Nikerabbit: "(3 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/81202 (owner: Ori.livneh)
[11:18:50] it's a bit time sensitive, sooner better
[11:20:26] ori-l: added comments
[11:20:50] aude: I could potentially handle it
[11:20:59] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours
[11:21:04] hashar: would be great
[11:21:08] it's simple but important
[11:22:24] (PS1) Faidon: wikipedia.com.il -> wikipedia.co.il [operations/dns] - https://gerrit.wikimedia.org/r/81211
[11:22:36] aude: we still have some wiki on wmf/1.22wmf13
[11:23:04] (CR) Faidon: [C: 2] wikipedia.com.il -> wikipedia.co.il [operations/dns] - https://gerrit.wikimedia.org/r/81211 (owner: Faidon)
[11:23:36] (PS2) Faidon: Åland, Guernsey, Isle of Man and Jersey to esams [operations/dns] - https://gerrit.wikimedia.org/r/80970
[11:23:55] (CR) Faidon: [C: 2] Åland, Guernsey, Isle of Man and Jersey to esams [operations/dns] - https://gerrit.wikimedia.org/r/80970 (owner: Faidon)
[11:26:23] hashar: that's fine
[11:26:28] the old code is fine
[11:26:39] new code is on wikidata and wikivoyage
[11:26:39] (PS2) Faidon: Add all Asian countries in the list [operations/dns] - https://gerrit.wikimedia.org/r/80974
[11:26:40] (PS2) Faidon: Middle-East to esams [operations/dns] - https://gerrit.wikimedia.org/r/80972
[11:26:41] (PS2) Faidon: Switch Central/South Asia to esams [operations/dns] - https://gerrit.wikimedia.org/r/80973
[11:26:42] patch got merged
[11:26:42] (PS2) Faidon: Africa to esams [operations/dns] - https://gerrit.wikimedia.org/r/80971
[11:26:48] thanks
[11:27:05] got to deploy it now
[11:27:08] great
[11:29:57] !log hashar synchronized php-1.22wmf14/extensions/DataValues 'Update DataValues {{gerrit|81198}}, requested by aude'
[11:30:03] Logged the message, Master
[11:30:13] aude: should be deployed now
[11:30:24] thank you!!!!
[11:33:09] aude: if that is a work for you, I will head out to grab a snack
[11:33:35] that's fine
[11:33:42] we will have some more patches in a bit
[11:34:56] * aude also getting lunch
[11:35:51] going to get a snack then commute to my coworking place
[11:35:55] should be back in roughly an hour
[12:17:31] re
[12:28:10] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100%
[12:29:10] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[12:43:28] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown
[12:46:48] PROBLEM - DPKG on search28 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:48:28] RECOVERY - NTP on mw31 is OK: NTP OK: Offset -0.0006822347641 secs
[14:01:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[14:16:52] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours
[14:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:22:52] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours
[14:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[14:29:52] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours
[14:31:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[14:33:52] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours
[14:39:20] hashar: Reedy if around we need one more update (should be the last!)
[14:39:25] https://gerrit.wikimedia.org/r/#/c/81233/ and https://gerrit.wikimedia.org/r/#/c/81237/ are needed
[14:40:54] aude: ok :/
[14:41:01] ok, thanks
[14:41:51] * aude wants to get back to coding new stuff and not fixing these bugs
[14:42:21] aude: will do the wmf13 one first
[14:42:27] ok, that's more important
[14:44:00] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours
[14:45:00] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours
[14:48:00] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours
[14:48:00] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours
[14:48:00] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours
[14:48:44] aude: the wmf13 one got merged https://gerrit.wikimedia.org/r/#/c/81237/
[14:48:49] updating and syncing
[14:48:53] thanks! :)
[14:49:02] * hashar cross fingers
[14:50:10] yeahhh live hacks
[14:51:00] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours
[14:51:00] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours
[14:51:20] aude: syncing
[14:51:29] :)
[14:51:30] !log hashar synchronized php-1.22wmf13/extensions/Wikibase 'https://gerrit.wikimedia.org/r/#/c/81237/'
[14:51:36] Logged the message, Master
[14:51:51] (PS3) BBlack: Add *_delta stats for vhtcpd ganglia. [operations/puppet] - https://gerrit.wikimedia.org/r/80151 (owner: BryanDavis)
[14:52:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:52:44] now the wmf14 one
[14:52:53] (CR) BBlack: [C: 2] Add *_delta stats for vhtcpd ganglia.
[operations/puppet] - https://gerrit.wikimedia.org/r/80151 (owner: BryanDavis)
[14:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[14:54:00] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours
[14:54:00] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours
[14:54:14] !log hashar synchronized php-1.22wmf14/extensions/WikibaseDataModel 'https://gerrit.wikimedia.org/r/#/c/81233/'
[14:54:20] Logged the message, Master
[14:54:36] !log hashar synchronized php-1.22wmf14/extensions/Wikibase 'https://gerrit.wikimedia.org/r/#/c/81233/'
[14:54:41] Logged the message, Master
[14:54:51] aude: I have deployed both
[14:55:11] aude: no fatal /exception (yet) :-D
[14:55:13] * aude owes you a beer
[14:55:43] yeah will have to book a flight to Berlin to reclaim it
[14:57:00] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
[14:59:00] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours
[14:59:52] :)
[15:00:00] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours
[15:01:00] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours
[15:01:00] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours
[15:01:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:02:39] (PS1) Mark Bergsma: Initial version of PROXY support for Varnish [operations/debs/varnish] (patches/proxy-support) - https://gerrit.wikimedia.org/r/81244
[15:03:03] (Abandoned) Mark Bergsma: Initial version of PROXY support for Varnish [operations/debs/varnish] (patches/proxy-support) - https://gerrit.wikimedia.org/r/80982 (owner: Mark Bergsma)
[15:03:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[15:04:00] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours
[15:12:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:13:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time
[15:19:43] (CR) Faidon: [C: 1] "You're awesome :)" [operations/debs/varnish] (patches/proxy-support) - https://gerrit.wikimedia.org/r/81244 (owner: Mark Bergsma)
[15:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[15:26:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[15:35:01] (PS1) Ottomata: Updating tracked CDH4 version to 4.3.1 [operations/puppet] - https://gerrit.wikimedia.org/r/81245
[15:36:11] (CR) Ottomata: [C: 2 V: 2] Updating tracked CDH4 version to 4.3.1 [operations/puppet] - https://gerrit.wikimedia.org/r/81245 (owner: Ottomata)
[15:50:14] (CR) Andrew Bogott: [C: 1] "Thanks for writing this! What will happen to..."
[operations/puppet] - https://gerrit.wikimedia.org/r/80577 (owner: Faidon)
[15:52:36] ottomata: hey, i'm here, and in the same city all day
[15:53:02] awesooomme!
[15:53:11] let's do it! I am all prepped!
[15:53:17] here's a quick q for you
[15:53:40] I have some bash scripts I've used to make partitioning all of these nodes easier
[15:53:56] hehehe
[15:53:57] ok
[15:54:08] they aren't partman because it was complicated and we weren't sure of how the partitions would end up looking when we did this the first time
[15:54:08] (CR) Andrew Bogott: [C: 1] "Have you run a test to make sure that these different permissions don't mess with web access to mediawiki? If so then I'm happy to merge " [operations/puppet] - https://gerrit.wikimedia.org/r/79955 (owner: Mattflaschen)
[15:54:27] (PS4) Andrew Bogott: rake task to generate puppet documentation [operations/puppet] - https://gerrit.wikimedia.org/r/77090 (owner: Hashar)
[15:54:37] i'm thinking of just putting them on wikitech?
[15:54:48] also partman is complicated … does anyone know how it works? ;)
[15:54:51] https://github.com/wikimedia/analytics-kraken/blob/master/bin/setup-scripts/disks/datanodes_dell_r720.sh
[15:54:51] https://github.com/wikimedia/analytics-kraken/blob/master/bin/setup-scripts/disks/namenode.sh
[15:54:54] haha, hardly
[15:55:11] well we had made the software git repo for things like scripts
[15:55:14] should I put the scripts on wikitech, or just link to them?
[15:55:16] hmmmmmMMM
[15:55:20] so i'd link to them
[15:55:21] software git repo.....
[15:55:27] ssh://lcarr@gerrit.wikimedia.org:29418/operations/software.git
[15:55:34] well without the lcarr bit :)
[15:55:35] cloning
[15:55:45] (CR) Andrew Bogott: [C: 2] rake task to generate puppet documentation [operations/puppet] - https://gerrit.wikimedia.org/r/77090 (owner: Hashar)
[15:56:53] hmmm, ok, i'll make a subdir there…hadoop? analytics?
[15:58:12] (CR) Faidon: "So, if I'm user "faidon" and try to login to RT, it will first try an LDAP bind and if that fails it'll fallback to my old password. So, w" [operations/puppet] - https://gerrit.wikimedia.org/r/80577 (owner: Faidon)
[15:58:31] hadoop ?
[15:58:34] k
[15:58:41] well, there will be a similar script for kafka I think?
[15:58:48] hrm
[15:58:56] maybe "partitioning"
[15:59:05] then one labeled for hadoop and one labeled for kafka ?
[16:00:01] k
[16:02:48] cleaning these up a bit
[16:02:58] i don't have one for journalnodes yet, we'll have to talk about that in a sec
[16:04:52] ok, LeslieCarr, one thing i'm noticing
[16:04:55] i just reinstalled these nodes, right?
[16:05:01] I didn't delete any of the non root partitions yet
[16:05:08] so the old partitions are still there, going to have to delete them
[16:05:30] the scripts I had before were very particular about how things were, they should mostly work, but they rely on fdisk prompts happening in the proper order, etc.
[16:05:36] ok
[16:05:43] want to share screen for this bit and just go through it? it will probably be just a lot of back and forth
[16:05:46] sound good
[16:05:48] or do you want me to just get it to a clean state?
[16:05:48] on iron ?
[16:05:58] k, i've actually not shared screen from iron before
[16:06:03] do I just start one there?
[16:06:05] forward my ssh key?
[16:06:10] yeah
[16:06:14] ottomata: better use parted
[16:06:18] and GPT partitions
[16:06:23] it's also a bit easier to script
[16:06:28] does it have to be done on boot?
[16:06:32] sorry
[16:06:33] on install?
[16:06:39] parted?
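A sketch of what a parted-based replacement for those prompt-driven fdisk scripts could look like; parted's -s flag is non-interactive, which removes the prompt-ordering fragility mentioned above, and GPT labels avoid fdisk's 2 TB MBR limit. The device letters are only an example:

    #!/bin/bash
    # One full-size GPT partition per data disk, with no interactive prompts.
    for dev in /dev/sd{e,f,g,h,i,j}; do
        parted -s -a optimal "$dev" mklabel gpt mkpart primary ext2 0% 100%
    done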
[16:06:42] no
[16:06:43] yes ok
[16:06:51] hm
[16:06:51] ok
[16:06:57] well, hm
[16:07:03] fdisk can't do disks > 2 TB
[16:07:18] ohhh parted sorry yes yes
[16:07:25] not partman, parted ;)
[16:07:36] remember this, i think there was a reason why I didn't use parted, but can't remember right now....
[16:08:10] ok thanks
[16:08:30] LeslieCarr: hm, should I take the time now to write parted scripts? or should we just do this since we have the time together now?
[16:08:36] ceph uses sgdisk
[16:09:08] why don't we do this together
[16:09:25] though i have never written parted scripts before, so it may be mostly me looking over your virtual shoulder
[16:09:36] parted? I haven't either really
[16:09:42] i barely remember looking into using it last summer
[16:09:43] that's all
[16:09:52] it's just the commands you type in the cli
[16:09:56] like fdisk really
[16:10:09] look at sgdisk too, it might be nicer indeed
[16:10:45] mkpart part-type [fs-type] …
[16:10:48] fs-type can't be ext3?!
[16:10:49] bwa
[16:10:57] ext2
[16:11:13] eh, sgdisk not installed by default :/
[16:11:46] manifests/swift.pp: command => "parted -s -a optimal ${title} mklabel gpt mkpart swift-${dev_suffix} 0% 100% && mkfs -t xfs -i size=512 -L swift-${dev_suffix} ${dev}",
[16:11:53] OH PUPPETIZED
[16:11:57] OOOoooO
[16:11:58] hardly
[16:12:47] uh ok, that is a nice way to do it
[16:13:10] that actually is pretty awesome
[16:13:39] ok, LeslieCarr, I'm going to add something for that to the hadoop.pp role class
[16:13:48] a bit dangerous but nice indeed
[16:13:51] (if someone wants to tell me to put it somewhere other than role class, speak now)
[16:13:53] be careful so it won't wipe your data ;)
[16:13:56] haha
[16:13:57] yeah
[16:14:32] (CR) Demon: [C: 1] "I thought I could outsmart gerrit, but that wasn't the case. I broke replication so this needs merging." [operations/puppet] - https://gerrit.wikimedia.org/r/81175 (owner: Demon)
[16:14:53] ^d: hey, quick question
[16:15:01] <^d> Shoot
[16:15:19] there's both https://github.com/wikimedia/operations-software-varnish-varnishkafka & https://github.com/wikimedia/varnishkafka
[16:15:42] the latter is because of a stanza you added to gerrit.pp per my request
[16:15:59] they have both the same content and they have a few differences
[16:16:19] hmmmmmmmmMMM i might add that to the cdh4 module, if I can make it generic enough!
[16:17:09] hrm, make it a class parameter, default to false, and then make all the options more parameters ?
[16:17:13] the latter doesn't have the "our actual code" banner
[16:17:16] and has Issues enabled
[16:18:22] <^d> paravoid: Both easily fixed.
[16:19:01] <^d> Issues and wikis disabled, description updated.
[16:20:07] manually?
[16:20:19] I mean, I don't mind, I just want to know what the process is so I don't have to ping you :)
[16:20:29] and bother you
[16:20:34] <^d> Yeah. They're supposed to set automatically when the repos are created via the github plugin, but this was a manual setup anyway :)
[16:20:36] LeslieCarr: yeah, i was going to make it a define and have users use it manually
[16:20:43] don't want to make people automatically use it
[16:20:45] cool
[16:20:45] but aahhhhh
[16:20:45] so
[16:20:48] <^d> paravoid: That stuff can be edited via https://github.com/wikimedia/varnishkafka/settings
[16:20:50] the thing is though
[16:20:53] this will take a lot of testing
[16:20:58] to make sure I get it right
[16:21:00] in labs, etc.
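On the "be careful so it won't wipe your data" point: when that swift.pp-style exec is generalized into a define, the usual protection is a guard that refuses to relabel a disk which already carries a filesystem or partition-table signature. A sketch of such a guard in shell; this is a hypothetical wrapper, not the actual module code:

    #!/bin/bash
    # Format a disk only if blkid finds no existing signature on it.
    dev="$1"
    if blkid "$dev" >/dev/null 2>&1; then
        echo "refusing to touch $dev: existing signature found" >&2
        exit 1
    fi
    parted -s -a optimal "$dev" mklabel gpt mkpart primary ext2 0% 100% \
        && mkfs -t ext3 "${dev}1"

In Puppet terms this is the check an unless/onlyif condition on the exec resource would perform, which is what keeps repeated runs from reformatting a disk that already holds data.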
[16:21:06] probably more than just today
[16:21:16] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours
[16:21:26] a parted define?
[16:21:31] that shouldn't really be used much
[16:21:43] totally, but if I add it I want it to work for sure
[16:22:26] i'd want to test a few different cases, etc.
[16:22:28] so ummmmmm
[16:22:47] i'll def do this, but LeslieCarr, maybe we can just proceed with repaving today without this?
[16:23:09] ok
[16:23:23] because we finally are both here at the same time? ;)
[16:23:47] yeah
[16:23:48] exactly
[16:23:57] and we kinda scheduled for this to happen today, and here we are
[16:24:11] so, let's share a screen on iron, and start partitioning stuff manually using my scripts, ja?
[16:24:25] btw, where is iron? .eqiad? .wikimedia.org?
[16:24:26] i'm trying to log in
[16:24:30] .wikimedia
[16:25:02] thaaar we go
[16:25:03] cool
[16:25:19] do I need to do anything other than start a named screen?
[16:25:23] i just did
[16:25:25] screen -S hadoop
[16:25:27] join if you can?
[16:26:23] ok
[16:26:36] hrm, that started one
[16:26:41] are you logged in as you or root ?
[16:26:52] haha, i need coffee, i did -S
[16:26:53] hehe
[16:27:04] but -x didn't work either
[16:27:05] * ^d looks around for something to bribe opsen with
[16:27:05] root
[16:27:09] hmmm
[16:27:13] what did -x say?
[16:27:14] there we go
[16:27:16] great
[16:27:23] what's your window size ?
[16:27:45] kinda big
[16:27:47] what should I use?
[16:28:16] 160x50 ?
[16:28:22] trying to find where to set that
[16:28:42] oh, i use a mac, so i just pull the window to resize :)
[16:28:43] kinda new to this shared screen thing
[16:28:44] oh ok
[16:28:47] i just did that too
[16:28:49] did that work?
[16:28:59] well i don't see anything cutting off weirdly
[16:29:00] so yay
[16:29:04] ok cool
[16:29:07] woot
[16:29:13] ahhhhhhh
[16:29:13] ok
[16:29:16] hehe
[16:29:19] :)
[16:29:22] ^d: for?
[16:29:30] ok so, namenodes really just need a single mount
[16:29:36] PROBLEM - MySQL Replication Heartbeat on db35 is CRITICAL: CRIT replication delay 730316 seconds
[16:29:40] <^d> paravoid: https://gerrit.wikimedia.org/r/#/c/81175/
[16:29:45] i've got it mirrored raid 1
[16:29:49] just for redundancy
[16:30:11] so, just before you signed on, I did mkfs.ext3 on /dev/md2
[16:30:26] cool
[16:31:18] (CR) Faidon: [C: 2] Until we switch over, still replicate all from manganese [operations/puppet] - https://gerrit.wikimedia.org/r/81175 (owner: Demon)
[16:31:47] cool, ok so that's the primary namenode, let's go do the same thing for the secondary
[16:31:51] the secondary was never set up before
[16:31:52] cool
[16:31:57] which was 1009, right ?
[16:31:57] so we'll need to add md2 there
[16:31:58] yup
[16:32:34] ah ok, so, the ciscos were the datanodes a looong time ago
[16:32:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:32:38] before we had the dells
[16:32:41] so we need to delete a bunch of partitions
[16:32:58] awesome
[16:33:00] <^d> Heh, I decided to upload a picture. https://www.mediawiki.org/wiki/File:HowToBribeOps.jpg
[16:33:14] fdisk ?
[16:33:25] ^d: done, if you didn't see that :)
[16:33:31] <^d> I did, thanks!
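The creation of the md2 array on the secondary is mentioned just below without the command itself; a minimal mdadm sketch of the RAID1 mirror being described, where the device names come from the conversation but the exact invocation is an assumption:

    # Mirror two partitions into /dev/md2 for the namenode metadata,
    # then put ext3 on the array, as was done on the primary.
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sde1 /dev/sdf1
    mkfs.ext3 /dev/md2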
[16:33:47] yup, just pasted in a scripted thang
[16:34:02] cool, so we have /dev/sde-sdj free
[16:34:10] we're just going to use sde and sdf for the raid
[16:34:19] cool
[16:34:28] <^d> !log reloaded gerrit replication plugin to pick up changes
[16:34:33] Logged the message, Master
[16:34:55] i just made sde1 and sdf1
[16:35:08] now creating md2 with them
[16:35:17] cool
[16:35:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.139 second response time
[16:35:58] yay mkfs
[16:36:36] cool, thar we go
[16:36:46] ok, now let's go do some data nodes :)
[16:37:07] so, in this case, all of the partitions here should already be made
[16:37:10] hrm ?
[16:37:15] that's interesting
[16:37:17] well
[16:37:22] because these were datanodes before
[16:37:25] so
[16:37:29] we'll just mkfs them to wipe them?
[16:37:30] sound ok?
[16:37:38] cool
[16:37:40] oh
[16:37:41] the ssh
[16:37:46] yeah
[16:37:48] bwer
[16:38:06] oh
[16:38:07] ssh key
[16:38:07] ?
[16:38:11] noo
[16:38:15] yeah, no agent ?
[16:38:23] maybe it expired?
[16:38:43] i do have a timeout on my local agent, but i thought it remained during the session
[16:38:44] hm
[16:38:59] not sure how this works with screen
[16:39:09] if I detach, log out, log back in and reattach?
[16:39:09] hmm
[16:39:12] it shouldn't time out that quickly
[16:39:15] should work i think
[16:39:32] nope
[16:39:38] argh
[16:39:54] i have a 5 minute timeout locally, but I never have had a problem with an already logged in session before
[16:39:56] hm
[16:39:59] let's close this screen
[16:40:04] wanna just forward yours and start one?
[16:40:09] sure
[16:40:11] k
[16:40:47] screen named hadoop
[16:41:59] hmm, still no good?
[16:42:06] oh
[16:42:07] i'm already in
[16:42:15] i joined the screen
[16:42:16] with -x
[16:42:17] right?
[16:42:24] look at prompt
[16:42:30] ha
[16:42:32] doh
[16:42:33] hah
[16:42:34] k cool
[16:42:35] :)
[16:42:42] great yeah
[16:42:43] so
[16:42:53] datanodes just use jbod
[16:42:54] we're using
[16:42:57] /dev/sd{c,d,e,f,g,h,i,j,k,l}1
[16:43:03] cool
[16:43:20] i'm going to run a forloop on those
[16:43:28] and mkfs.ext3 & in the background
[16:43:32] this will take a while, since the disks are big
[16:43:43] cool
[16:43:48] backgrounding means we don't have to wait for each one in serial
[16:44:35] i'm also running tune2fs -m 0 to grab all of the blocks on the disk
[16:44:39] do we want to open a new window and do another one at the same time ?
[16:44:44] we don't need any reserved os stuff, or whatever it is ext3 does
[16:44:44] yeah
[16:45:14] so, LeslieCarr, I use iTerm 2
[16:45:31] does it do something special ?
[16:45:35] and I know dsh is good for stuff like this, but sometimes its a little tricky to script things around ssh keys etc.
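A sketch of the backgrounded format loop just described; the disk set comes from the conversation, while the wait and the placement of the tune2fs pass are assumptions about how such a script might be arranged:

    #!/bin/bash
    # Format all ten JBOD data partitions in parallel rather than serially.
    for dev in /dev/sd{c,d,e,f,g,h,i,j,k,l}1; do
        mkfs.ext3 "$dev" &
    done
    wait  # block until every background mkfs finishes

    # Drop ext3's default 5% root-reserved blocks; HDFS data partitions
    # don't need the reservation, so reclaim the space for block storage.
    for dev in /dev/sd{c,d,e,f,g,h,i,j,k,l}1; do
        tune2fs -m 0 "$dev"
    done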
[16:45:36] yeah
[16:45:38] i just use plain terminal
[16:45:46] you can send keyboard input to all tabs/screens at a time
[16:45:52] so, if I log into all the other datanodes
[16:45:52] oh cool
[16:45:53] I can do this all at once
[16:45:57] why don't you do that
[16:45:59] k
[16:46:10] it'll be just the same thing I just did
[16:47:16] journal nodes
[16:47:18] ok right so journalnodes
[16:47:19] yeah
[16:47:30] so, we need to run a quorum of journalnodes
[16:47:40] this is to support the standby namenode and HA
[16:47:53] they sync the hdfs metadata changes from the primary namenode to the standby namenode
[16:48:06] they keep a journal of the edits as well, and need their own partitions
[16:48:17] ok, so which machines are we thinking for those ?
[16:48:19] Toby has a hadoop friend, and I asked him about that
[16:48:31] i'm not sure actually, i was thinking of running them on 3 of the datanodes
[16:48:39] but, that means we might need to save a disk or two for them
[16:48:50] and who knows if that will affect performance on those nodes
[16:48:53] hrm
[16:49:04] i'd rather have them on other nodes somewhere
[16:49:05] do we have some extra ciscos ?
[16:49:11] ha, yes we do
[16:49:14] hmmmmm
[16:49:16] not a bad idea
[16:49:19] we could put them there for now
[16:49:28] but, i hate the idea of wasting the ciscos just for journalnodes
[16:49:36] we also have the 3 zookeeper nodes
[16:49:41] oh, maybe low performance misc ?
[16:49:42] oh
[16:49:43] and I would naturally just put these there
[16:49:46] yeah, they are low perf misc
[16:49:48] but
[16:49:48] here's the issue with that
[16:49:51] what's the downside of that ?
[16:49:54] we are using the debian version of zookeeper
[16:50:02] which conflicts with the cdh4 hadoop dependency
[16:50:09] oh and the hadoop version has its own stuff
[16:50:12] so we can't install hadoop deps on the same nodes as zookeeper
[16:50:13] yeah
[16:50:19] cdh4 has their own package of zookeeper
[16:50:24] effectively either package will work
[16:50:28] i had forgotten about that ...
[16:50:30] but the cdh4 packages are set up to depend on their zks
[16:50:39] so apt will get all pissy
[16:50:52] yeah pretty annoying, actually
[16:51:35] i mean, it seems stupid to get new machines just because of a package conflict, right?
[16:51:38] but i dunno what else to do
[16:51:56] i'd go ahead and run the journalnodes alongside of a few datanodes, but i just don't like how that makes things inconsistent
[16:52:12] right now it is all homogeneous and sane
[16:52:24] yeah, inconsistent in disk size and possibly performance....
[16:52:27] especially since we have to partition them differently
[16:52:30] yeah
[16:52:40] RobH: do we have 3 extra low performance boxes sitting around ?
[16:52:40] we might have to configure them specially then too
[16:52:41] dunno
[16:52:56] cmjohnson1: ^^ ?
[16:53:34] LeslieCarr: i'm going to go ahead and partition the data nodes as is
[16:53:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:53:43] and we can either find new servers for journalnodes, or for now just use ciscos
[16:53:51] i'm pretty sure we can move journalnodes later
[16:53:54] might be cluster downtime, but that'll be ok
[16:54:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[16:57:55] paravoid: for web auth for LDAP we generally use the CN, and not the uid
[16:58:10] can we rename users and change the RT config to use CN?
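Back on the zookeeper conflict a few lines up: when two apt origins ship the same package name, pinning is one conventional escape hatch. This is a sketch only, and the Cloudera origin label is an assumption rather than a verified value:

    # Check which origins offer zookeeper and at what priority:
    apt-cache policy zookeeper

    # Prefer the CDH build on Hadoop nodes (origin label assumed):
    cat > /etc/apt/preferences.d/cdh4-zookeeper <<'EOF'
    Package: zookeeper*
    Pin: release o=Cloudera
    Pin-Priority: 1001
    EOF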
[16:58:23] that would suck
[16:58:37] why's that?
[16:58:39] this kind of explains why my CN is "Faidon" instead of my full name
[16:58:42] I hate that
[16:58:47] every other web application uses CN
[16:59:03] why does that explain that?
[16:59:06] mine is Ryan Lane
[16:59:31] I'm not the only one that's confused (or was when I was hired), half of the users have CNs like that
[16:59:52] was that before account creation was through wikitech?
[17:00:04] akosiaris, bblack, BryanDavis etc.
[17:00:32] why is uid bad?
[17:00:51] it's not that uid is bad
[17:00:53] having to type your full name into a username prompt is confusing, is my opinion
[17:01:09] people are used to that for mediawiki
[17:01:18] yeah, i agree its kinda weird
[17:01:22] mainly because of spaces
[17:01:23] and that was the #1 most requested thing when I was setting it up
[17:01:29] i mean, i know its normal
[17:01:40] but nowhere else has spaces in log in names
[17:01:49] the field is still called "Username" in mediawiki
[17:01:51] having random web applications take different usernames is incredibly confusing
[17:02:21] I would have made my wikitech account name "bd808" if I had understood what it was for
[17:02:27] to me, cn is the full name, uid is the username
[17:02:33] if you want to put your full name into uid, that's fine
[17:02:39] no, it's not
[17:02:52] because then using ssh is basically impossible
[17:03:20] do we want UTF8 ssh usernames?
[17:04:56] ⓦⓗⓨ ⓝⓞⓣ?
[17:05:01] ori-l: :)
[17:05:22] blergh
[17:05:25] people want to use their project user names on wikitech
[17:05:48] LeslieCarr: Sorry, was on route to office and such
[17:05:52] maybe we need better help docs on the user creation page on wikitech
[17:06:00] you need 3 low perf misc boxen in eqiad?
[17:06:09] so the question then is whether you want to align web apps with wikitech or SSH
[17:06:18] wikitech
[17:06:40] which is the same as gerrit
[17:07:05] logging in with my full name is entirely weird to me tbh, but if you're asking if RT can use cn, yes it can
[17:07:29] having a consistent log-in scheme is good. web = full name, shell = shell account name
[17:07:35] maybe wikitech should use the uid for login and then display the displayName attribute throughout the site ;-)
[17:07:48] harder than you'd imagine ;)
[17:08:01] yeah, I kinda guessed that...
[17:08:02] we'd need to make gerrit and every other app do that as well
[17:08:22] (PS1) Reedy: Make puppet cronjob to run AbuseFilter/maintenance/purgeOldLogIPData.php [operations/puppet] - https://gerrit.wikimedia.org/r/81257
[17:08:26] speaking of that, can I finally have my cn changed? :)
[17:08:29] heh
[17:08:36] it's possible, but hard
[17:08:39] gerrit uses that as the git full name on merges
[17:08:43] ^d: ^^
[17:08:48] chad also wants his changed
[17:08:58] RT uses it as the email address name on outgoing emails through the web interface
[17:08:59] (PS3) Ori.livneh: Add solr::decommission class & apply it to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/81202
[17:09:00] (CR) Reedy: [C: -1] "Need to confirm run frequency" [operations/puppet] - https://gerrit.wikimedia.org/r/81257 (owner: Reedy)
[17:09:01] what do you want as your name? full name?
[17:09:07] so the field there is really the full name
[17:09:10] * Ryan_Lane nods
[17:09:31] it's not only what I want, it's that we don't want RT to send mails as "Faidon that's just unprofessional :)
[17:09:43] :)
[17:09:58] <^d> Huh?
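For readers following the cn-versus-uid argument: the two are simply different attributes on the same LDAP entry, one a human-readable common name and one a short POSIX login. A sketch of comparing them for a single account; the server and base DN here are placeholders, not the real directory:

    # Show both naming attributes for one user entry.
    ldapsearch -x -H ldap://ldap.example.org \
        -b 'ou=people,dc=example,dc=org' '(uid=demon)' cn uid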
[17:10:05] i have two small commits that need mergin': https://gerrit.wikimedia.org/r/#/c/75284/ (fixes EL's Ganglia metrics) & https://gerrit.wikimedia.org/r/81202 (rids vanadium of solr)
[17:10:06] ^d: let's try to change some usernames
[17:10:07] * ^d missed some scrollback
[17:10:09] up for it?
[17:10:10] <^d> Oh sweet.
[17:10:11] <^d> Sure.
[17:10:15] use me! use me!
[17:10:18] should be 3 steps
[17:10:26] * ^d pushes in front of paravoid
[17:10:29] paravoid: hah. you want to possibly be broken for part of the day? :)
[17:10:37] technically this is my relaxing week
[17:10:44] I was banned from the channel yesterday
[17:10:46] ^d: so, we need to rename your user in ldap, then in mediawiki and gerrit
[17:10:46] that's even better
[17:10:50] paravoid: hahaha
[17:10:54] Ryan_Lane: break his username then please
[17:10:57] he shouldnt be working!
[17:11:00] paravoid: are you supposed to be on vacation?
[17:11:04] kind of? I guess?
[17:11:09] :D
[17:11:14] you are the worst vacationer ever.
[17:11:17] ^d: ok, what do you want your cn to be?
[17:11:45] Ryan_Lane: I'd really like to have policies about that...
[17:11:52] paravoid: policies about what?
[17:11:59] if we want to integrate LDAP more with other stuff, we really should keep data properly
[17:12:03] RobH: so ottomata and i need 3 journal nodes, we can use ciscos but low perf misc should be better
[17:12:05] like, if we want to do the NDA thing we were saying
[17:12:08] we need to keep the full name somewhere
[17:12:11] Ryan_Lane: paravoid policies about taking vacation!
[17:12:18] paravoid: ah. right. indeed
[17:12:22] <^d> Ryan_Lane: "Chad"
[17:12:27] ^d: ok
[17:12:31] Ryan_Lane: depending on where you're changing names, there was also some talk of removing (WMF) from my gerrit username (same thing?)
[17:12:41] ebernhardson: yep
[17:12:53] let's make sure this won't break ^d first, though :)
[17:12:55] so, ^d's cn should really be set to his full name :)
[17:13:06] <^d> I don't want my full name :(
[17:13:09] heh
[17:13:15] I have mixed feelings on that
[17:13:26] <^d> Because my last name sucks?
[17:13:36] ^d: dude, have you seen mine? :P
[17:13:36] beats 'livneh'
[17:13:38] I'd rather have an NDA field
[17:13:49] or have it put in the user's description field
[17:14:06] why do you two mind that though?
[17:14:16] is it because it's also the username? or for other reasons?
[17:14:18] because they may not work for the foundation forever
[17:14:35] and they should be free to have their username as whatever they want
[17:14:42] this is the same reason I banned (WMF) usernames
[17:14:49] heh, and we're back in square one
[17:14:57] why?
[17:15:03] I'd like to have the full name in cn, I don't care what people put in their usernames :)
[17:15:44] some people like using pseudonyms
[17:15:50] phew, reading scroll back :)
[17:15:53] RobH: yeah
[17:16:03] <^d> Yeah, not everyone uses a real name in their CN.
[17:16:08] <^d> Or a full name.
[17:16:12] if they are avail that would be nice, i'd use 3 of the low perf misc servers we already have…but there are complicated package conflicts
[17:16:16] ok, they need eqiad i assume, hrmm
[17:16:18] ja
[17:16:23] I think everyone universally hates OITs username policy
[17:16:33] I do too
[17:16:39] ottomata: I think I have them, let me take a look
[17:16:44] k danke
[17:16:51] pseudonyms are fine, but canonical name to me is the canonical name, not the display or user name
[17:16:57] hello, rlane!
[17:17:01] YuviPanda: howdy
[17:17:02] <^d> [First initial][last name] is so....corporate.
[17:17:31] paravoid: well, we could have used displayname.
[17:17:33] nothing productive, just poking fun of your OIT approved username :) (still stuck on the nginx package though :( )
[17:17:45] <^d> We could have used displayname, but we didn't.
[17:17:53] <^d> And last time I brought that up you said it would confuse ppl.
[17:17:55] ottomata: So my low performance are single cpu 8gb hosts usually, thats ok?
[17:18:10] i think so, how many disks?
[17:18:27] usually dual, smaller disks
[17:18:29] 250gb
[17:18:31] paravoid: heh. you've hit the hardest problem in auth you know, right? :)
[17:18:43] I know :)
[17:18:46] and one of the three in computer science
[17:18:54] along with conference wifi?
[17:18:57] :D
[17:19:08] cmjohnson1: You about? I'm spare server hunting, but with the influx of relocated servers to eqiad I rather just chat with you about it.
[17:19:40] ok cool, thanks RobH
[17:19:44] those should be fine...but
[17:19:54] before you go and allocate them, lemme check a couple of things and confirm
[17:19:58] my point is, I'd like us to encode the information people would put in their email address, somewhere
[17:19:58] i might ask a question on a hadoop mailing list
[17:20:08] the name before the email address I mean
[17:20:13] I'm hesitant to have a display name attribute or any other attribute that lets a user set a real name, unless we're specifically managing it
[17:20:21] ottomata: Ok, I have some older 'high performance' misc that arent high performance anymore
[17:20:29] so you may end up with those, which is slightly better than stock misc
[17:20:40] ^d doesn't send mails as "Chad " for example
[17:20:41] just put in procurement ticket when you need them and explain why, etc...
[17:20:43] (CR) MaxSem: "I wonder what happens to monitor_service() from the role class..." [operations/puppet] - https://gerrit.wikimedia.org/r/81202 (owner: Ori.livneh)
[17:21:05] anyway
[17:21:11] ok cool, will do, thanks RobH
[17:21:18] let's run trial renames now and decide later?
[17:21:22] can I help?
[17:21:56] well, I think we know the manual steps
[17:22:02] it's a matter of automating it
[17:22:14] we'll also need the manual step for RT later, too
[17:22:20] so that we can add that into the automation
[17:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:22:45] RT updates the "real name" field automatically, but yes, "username" needs to stay the same or be manually changed
[17:23:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[17:23:25] yep
[17:23:29] same with gerrit and mediawiki
[17:23:52] ottomata: can i tap you for a couple of small puppet changes?
[17:24:03] it's definitely possible to have a real name field, but I worry about people using it to impersonate others
[17:24:47] ^d: ok, ready to try?
[17:24:58] <^d> s/try/break shit/
[17:25:00] <^d> Yup :)
[17:26:32] paravoid: http://www.elasticsearch.com/blog/welcome-jordan-logstash/
[17:26:47] <^d> Ryan_Lane: I logged out of gerrit. I'm afraid of something going wonky if I change the database while I'm logged in still ;-)
[17:27:23] ok, renamed in ldap and MW
[17:27:31] you'll handle gerrit?
[17:27:32] \o/!
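The LDAP half of that rename is not shown in the log; a sketch of what changing a cn attribute can look like, assuming cn is not part of the entry's DN. All names, hosts, and the admin bind DN here are placeholders:

    # Rewrite the common-name attribute on an existing entry.
    ldapmodify -x -H ldap://ldap.example.org \
        -D 'cn=admin,dc=example,dc=org' -W <<'EOF'
    dn: uid=demon,ou=people,dc=example,dc=org
    changetype: modify
    replace: cn
    cn: Chad
    EOF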
[17:27:50] that's *awesome* news
[17:27:55] bd808: see above
[17:28:11] ^d: let's make a page somewhere in wikitech listing the manual steps so that we can do this alone :)
[17:28:18] Ryan_Lane too
[17:28:28] hm, LeslieCarr, RobH
[17:28:36] elasticsearch hired the author of logstash
[17:28:43] and kibana
[17:28:44] so we have 1 un-allocated low perf misc server already in analytics
[17:28:46] analytics1026
[17:28:46] <^d> Ryan_Lane: Yeah, done in gerrit
[17:28:51] maybe we only need two more?
[17:28:52] (different authors)
[17:28:53] ^d: ok, try it out :)
[17:29:00] that's really great news
[17:29:01] ottomata: even better =]
[17:29:04] k cool
[17:29:10] <^d> A-ha! It worked!
[17:29:15] i only mentally flagged those for you, nothing is officially flagged yet
[17:29:17] http://www.elasticsearch.org/content/uploads/2013/08/Screen-Shot-2013-07-11-at-5.00.28-PM.png
[17:29:23] did they do some major ui work?
[17:29:25] because that looks way nicer
[17:29:31] so you can change reqs as needed, and those changes that reduce overall request are encouraged ;]
[17:29:33] there's http://three.kibana.org/
[17:29:38] which is entirely client side
[17:29:40] <^d> Ryan_Lane: See https://gerrit.wikimedia.org/r/#/c/75777/ for example.
[17:29:55] surprisingly www.kibana.org doesn't mention it though
[17:30:02] and says "written in ruby" instead
[17:30:03] wow. that looks really nice
[17:30:07] although v3 is entirely HTML
[17:30:13] we should really be using this :)
[17:30:13] RobH: https://rt.wikimedia.org/Ticket/Display.html?id=5678
[17:30:16] and javascript, obviously
[17:30:32] <^d> Ryan_Lane: And logging in to wikitech worked too.
[17:30:36] \o/
[17:30:42] paravoid: so, want to be renamed now?
[17:30:48] yes
[17:30:52] ori-l uMMmmmmmmmmm they are quick ones? :)
[17:30:58] <^d> Ryan_Lane: Lemme find him in gerrit.
[17:31:04] ottomata: I assume these need to be on the analytics vlan and thus cannot share metal with non analytics servers?
[17:31:17] are you documenting the steps?
[17:31:18] cuz if its a low overhead and doesn't need to be on same vlan, we can disperse the service
[17:31:24] paravoid: Faidon Liambotis ?
[17:31:26] if you are, we should add an "ldap users" section
[17:31:28] but if it needs same vlan, then yea has to be bare metal stand alone
[17:31:31] ok
[17:31:38] paravoid: yeah, we're going to record the steps
[17:31:43] it's relatively easy
[17:31:58] <^d> Just has to be done in a certain order :)
[17:32:06] ldap, then gerrit/mediawiki
[17:32:06] (man our chat rooms have to be confusing as hell to folks who aren't used to parsing out 2-7 different conversations)
[17:32:10] <^d> Gerrit's most likely to go wonky if we don't :)
[17:32:13] paravoid: ES hired Sissel? Awesome
[17:32:21] bd808: yep
[17:32:37] bd808: and kibana is under github.com/elasticsearch/ too
[17:32:43] it is confusing to me who is used to it
[17:32:46] ElasticSearch will take over the universe
[17:32:49] ottomata: yes. https://gerrit.wikimedia.org/r/#/c/75284/ , https://gerrit.wikimedia.org/r/#/c/81202/ , https://gerrit.wikimedia.org/r/#/c/81182/
[17:32:50] RobH: ottomata yeah, needs bare metal
[17:33:11] ^d: I'm ready for faidon
[17:33:15] paravoid: please log out of gerrit
[17:33:18] just in case
[17:33:31] gerrit is known to be an asshole
[17:33:38] done
[17:34:01] I have another woe too
[17:34:04] <^d> Ryan_Lane: To "Faidon Liambotis"?
[17:34:07] yes please
[17:34:15] done in ldap/wikitech
[17:34:28] RobH, yeah needs to be in same vlan
[17:34:51] these talk with hadoop namenode when there are any hdfs file changes
[17:35:30] (CR) Ottomata: [C: 2 V: 2] Tweak 'collect_every' and 'name_match' in EL's Ganglia module [operations/puppet] - https://gerrit.wikimedia.org/r/75284 (owner: Ori.livneh)
[17:35:51] hallelujah
[17:35:54] <^d> paravoid: Ok, log back into gerrit.
[17:36:13] (CR) Ottomata: [C: 2 V: 2] Add solr::decommission class & apply it to vanadium [operations/puppet] - https://gerrit.wikimedia.org/r/81202 (owner: Ori.livneh)
[17:36:29] ^d: worked
[17:36:31] :D
[17:36:51] <^d> Sweet. Ok, so this process totally works and we can document it now.
[17:36:56] <^d> And maybe one day script it.
[17:36:57] ooo, ori-l, gonna want to talk to you about graphite/statsd sometime later :)
[17:37:45] sure
[17:37:55] i'm just getting to know it myself
[17:37:57] is this one ready to be merged? i see your unanswered q for Ryan
[17:39:22] oh, maybe hold off on that one then
[17:39:32] though it should be ok i think
[17:40:07] <^d> Ryan_Lane: Did you start documenting this somewhere?
[17:40:17] yeah i'll just scp the deployment target dir over from vanadium to bootstrap things
[17:40:47] ^d: nope
[17:40:49] one sec
[17:40:54] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours
[17:41:00] I wonder where we should put this
[17:41:21] maybe https://wikitech.wikimedia.org/wiki/Renaming_users ?
[17:41:27] and link to it from the main page?
[17:41:39] <^d> Sounds good enough to me.
[17:41:48] under the "Other" section?
[17:41:49] ok
[17:43:45] ottomata: shld be ok
[17:44:19] k
[17:44:34] (PS2) Ottomata: Track event counts in Graphite [operations/puppet] - https://gerrit.wikimedia.org/r/81182 (owner: Ori.livneh)
[17:44:39] (CR) Ottomata: [C: 2 V: 2] Track event counts in Graphite [operations/puppet] - https://gerrit.wikimedia.org/r/81182 (owner: Ori.livneh)
[17:45:18] binasher: 1 0.000499 query-m: UPDATE `abuse_filter` SET af_hit_count=af_hit_count+N WHERE af_id = 'X'
[17:45:19] (PS1) Faidon Liambotis: Switch ops to passwordless sudo in labs [operations/puppet] - https://gerrit.wikimedia.org/r/81267
[17:45:20] 2 5.018608 Parser::parse-WikitextContent::getParserOutput
[17:45:21] 4 0.001795 query-m: COMMIT
[17:45:24] great, ori-l done
[17:45:31] binasher: better or worse? :)
[17:45:35] ottomata: sweet, thanks very much
[17:45:40] Aaron|home: saw the mail, thanks a lot for the details and the work
[17:45:55] Ryan_Lane: ^^^
[17:45:56] * Aaron|home tries to raise the bus factor slightly
[17:46:01] <^d> Ryan_Lane: Added to mainpage.
[17:46:21] ^d: I'm adding the docs right now
[17:47:24] PROBLEM - Apache HTTP on mw131 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50053 bytes in 0.169 second response time
[17:47:35] (CR) Faidon Liambotis: [C: 2] "Per discussion with Ryan." [operations/puppet] - https://gerrit.wikimedia.org/r/81267 (owner: Faidon Liambotis)
[17:47:52] ^d: added
[17:48:10] paravoid: thanks
[17:48:16] Can someone fix /usr/local/apache/common-local/php-1.22wmf12/extensions/CirrusSearch on mw131? Seems to be owned by root. Trying to work out the source of an error in the apache logs which is coming from mw131 apparently
[17:50:07] binasher: I also found out why prefs get saved twice sometimes. It's via the API, a watchlist token is populated on demand on prefs load.
The calling code loads the prefs then saves them, so happens twice. [17:50:32] none of that would matter much if GadgetHooks::getPreferences didn't randomly take so long on commons [17:51:14] but it sucks since token population deletes/inserts the whole prefs again before that slow function (which for the life of me I don't know why is slow) [17:52:00] Aaron|home: getParserOutput before commit? noooooo! [17:52:26] and after a counter...that's the best [17:53:08] Aaron|home: have you looked at the full tx the `math` replace is a part of yet? [17:53:45] Reedy: fixin [17:53:45] Aaron|home, in principle GadgetHooks::getPreferences should take long only if cache is expired (one per 24h or when a gadget definition gets edited) - does it appear to happen more often? [17:53:49] (03PS1) 10Ori.livneh: Remove 'solr::decommission'; qualify 'eventlogging' include [operations/puppet] - 10https://gerrit.wikimedia.org/r/81271 [17:54:59] (03CR) 10Ottomata: [C: 032 V: 032] Remove 'solr::decommission'; qualify 'eventlogging' include [operations/puppet] - 10https://gerrit.wikimedia.org/r/81271 (owner: 10Ori.livneh) [17:55:07] ori-l: which change did you want me to look at again? [17:55:13] I don't see one in my queue [17:55:18] (03CR) 10Bsitu: [C: 032] Enable job queue to process web and email notifs on testwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/61647 (owner: 10Bsitu) [17:55:19] it was about git-deploy [17:55:36] assuming it's getting the grain applied, it should just work [17:55:39] Ryan_Lane: i added a git-deploy target for EventLogging, was wondering if anything extra needed to be done [17:55:46] coool, i'll try it [17:55:48] binasher: an example is http://pastebin.com/WbSK4WdN [17:55:53] let me make sure its prereqs are installed by puppet [17:56:00] <^d> Ryan_Lane: Added my bits. [17:56:12] note that the REPLACE calls are actually nested in the parse call, not strictly after as the log might make it seem...but still [17:56:29] ^demon|away: sweet [17:56:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:09] (03Merged) 10jenkins-bot: Enable job queue to process web and email notifs on testwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/61647 (owner: 10Bsitu) [17:57:11] Aaron|home: FileBackendStore in the middle… yikes [17:57:14] thanks ottomata [17:57:15] ori-l: it's possible that it's going to be missing some packages [17:57:27] it = git deploy, salt, or eventlogging? [17:57:29] I'm not sure the best way to add these in puppet [17:57:33] git deploy [17:57:45] it needs python-redis [17:57:49] and git-core [17:57:55] Aaron|home: is there any reason the math replace queries couldn't be auto commit? 
tx level atomicity seems unnecessary here [17:58:01] I could add these to the deployment target [17:58:04] definition [17:58:17] i'll just install them by hand for now [17:58:44] PROBLEM - DPKG on analytics1026 is CRITICAL: Timeout while attempting connection [17:59:01] hm, yeah, I'll add these to the deployment definition [17:59:10] which will ensure any deployment target has the dependencies [17:59:22] k [17:59:52] binasher: I was thinking of using onTransactionIdle(), which would be autocommit [18:00:17] batching the REPLACES and the file stores would be sweet though [18:00:24] RECOVERY - Apache HTTP on mw131 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.382 second response time [18:00:32] AaronSchulz: agreed [18:00:36] * Aaron|home looks at bd808 ;) [18:00:50] (03PS1) 10Jgreen: make OTRS spamassassin even more strict [operations/puppet] - 10https://gerrit.wikimedia.org/r/81274 [18:01:02] * bd808 hides under desk [18:01:44] bd808: Point at bugzilla! [18:02:24] PROBLEM - Disk space on analytics1027 is CRITICAL: Connection refused by host [18:02:24] PROBLEM - RAID on analytics1026 is CRITICAL: Connection refused by host [18:02:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.396 second response time [18:02:34] PROBLEM - SSH on analytics1026 is CRITICAL: Connection refused [18:02:34] PROBLEM - Disk space on analytics1026 is CRITICAL: Connection refused by host [18:02:35] PROBLEM - SSH on analytics1027 is CRITICAL: Connection refused [18:02:51] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Enable job queue to process web and email notifs on testwiki/test2wiki' [18:02:54] PROBLEM - DPKG on analytics1027 is CRITICAL: Connection refused by host [18:02:54] PROBLEM - RAID on analytics1027 is CRITICAL: Connection refused by host [18:02:57] Logged the message, Master [18:03:21] (03PS1) 10Ryan Lane: Ensure deployment targets have dependencies [operations/puppet] - 10https://gerrit.wikimedia.org/r/81276 [18:03:26] !log bsitu synchronized wmf-config/CommonSettings.php 'Enable job queue to process web and email notifs on testwiki/test2wiki' [18:03:31] Logged the message, Master [18:03:44] I need to stop wondering where to start and just dive into the deep end on the whole media management pipeline [18:04:38] I was thinking about what things are the most actionable [18:04:46] bd808: bug 53400 might be an OK start [18:04:59] !log updated Parsoid to 84fac157 [18:05:04] Logged the message, Master [18:05:33] basically, writeToDatabase() should at least use onTransactionIdle() and put the REPLACE in a callback/closure there [18:05:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:06:16] (03CR) 10Jgreen: [C: 032 V: 031] make OTRS spamassassin even more strict [operations/puppet] - 10https://gerrit.wikimedia.org/r/81274 (owner: 10Jgreen) [18:06:59] bd808: in terms of RfC'ish stuff, the thumbnail coalescing isn't too bad of a place to dig into [18:07:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.633 second response time [18:07:25] (03CR) 10Ryan Lane: [C: 032] Ensure deployment targets have dependencies [operations/puppet] - 10https://gerrit.wikimedia.org/r/81276 (owner: 10Ryan Lane) [18:07:26] what about the whole "chunked uploads suck" theme?
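Aaron's onTransactionIdle() idea, reduced to plain SQL in a shell wrapper (the demo table, database, and hash values are all invented; assumes a local MySQL you can write to): keep the main transaction short and commit it, then fire the REPLACE afterwards as its own autocommitted statement, so a slow or failed side write can no longer bloat the main transaction:

    mysql -u root test <<'SQL'
    CREATE TABLE IF NOT EXISTS math_demo (
      math_inputhash  VARBINARY(16) PRIMARY KEY,
      math_outputhash VARBINARY(16) NOT NULL
    );
    START TRANSACTION;  -- the main write tx: only the essential edit writes belong here
    -- ... page/revision writes would happen here ...
    COMMIT;
    -- the "onTransactionIdle" part: a standalone autocommitted side write
    REPLACE INTO math_demo VALUES
      (UNHEX('00112233445566778899aabbccddeeff'),
       UNHEX('ffeeddccbbaa99887766554433221100'));
    SQL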
[18:07:28] the amount of code change needed wouldn't be that huge, though some of it would be some small varnish module code [18:07:50] ori-l: if you just run puppet the dependencies should now be installed [18:08:26] bd808: the whole issue of large uploads bothers me because it relates to a bunch of problems that are hard to fix without rewriting everything (or horribly hacking around with job queue + persistent locks) [18:08:58] we can make large uploads work better for the first stage of the pipeline (upload) though re-upload, move, delete, restore will still suck horribly [18:09:31] Ryan_Lane: thanks, doing so [18:09:54] I'm not 100% sure, but I think roblaAWAY is open to major rewrite type projects [18:09:55] that said, if videos tend to just be uploaded once and not changed, and it's badly wanted, it could be worth it I suppose [18:10:22] well, there are different levels of "huge rewrites" ;) [18:10:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:11:59] bd808: I think for someone new to MW, the thumbnail thing is a better place to get started rather than going down that rabbit hole just yet (which still scares me after all these years) [18:12:30] * bd808 listens to sage advice [18:12:38] of course, if the priority for the quarter was already decided, I guess you don't have much choice though ;) [18:12:45] thumbnail coalescing? [18:13:03] (03PS4) 10Dzahn: Replace public key for jamesofur [operations/puppet] - 10https://gerrit.wikimedia.org/r/79304 (owner: 10Jalexander) [18:13:14] PROBLEM - DPKG on virt2 is CRITICAL: Timeout while attempting connection [18:13:15] paravoid: whatever you call it, fudging vcl_hash to group them for PURGEs [18:13:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.068 second response time [18:13:36] maybe not using swift/ceph anymore for this and not having 7 copies of everything [18:13:42] ah, that [18:13:49] so, I was looking a bit at that in the past [18:13:56] I think the real priority at this point is "do things to make multimedia less sucky" [18:13:57] remember the linear search issue? [18:14:04] PROBLEM - NTP on analytics1027 is CRITICAL: NTP CRITICAL: No response from NTP server [18:14:07] yes [18:14:15] but smoothing problems in the upload path seems to be a recurring theme [18:14:28] so Tim was saying that he didn't expect this to be a huge problem, how many thumbs can a file have [18:14:35] then you know what I pointed out? [18:14:44] PROBLEM - NTP on analytics1026 is CRITICAL: NTP CRITICAL: No response from NTP server [18:14:45] PDF and multi-page TIFFs [18:14:54] 1000-page PDF with 3-4 thumb sizes [18:15:00] that's not uncommon at all [18:15:03] heh [18:15:21] our djvu/pdf handling sucks too [18:15:21] there's a few wikis that use that a lot [18:15:24] ori-l: let me know if that doesn't make a new target work [18:15:28] bd808: oh, wait, I told you that already [18:15:28] arwikisource I think? [18:15:41] like loading the whole text as metadata and slowing down category views [18:15:44] I want to make adding a new target as simple as just having the puppet class added [18:15:47] * Aaron|home only fixed the OOM aspect of that [18:15:49] questions of the grammatical form "how many ___ could possibly ___..."
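The vcl_hash "fudging" paravoid describes has a simple core: make every thumbnail size of a file hash to one cache key derived from the original, so a single PURGE evicts all sizes. A shell rendering of just the key idea (URL layout invented; the real logic would live in VCL):

    # all sizes of Foo.jpg reduce to the same key once the /NNNpx- component is stripped
    for thumb in /w/thumb/a/ab/Foo.jpg/120px-Foo.jpg \
                 /w/thumb/a/ab/Foo.jpg/800px-Foo.jpg; do
        echo "$thumb" | sed -E 's#/[0-9]+px-[^/]+$##'
    done
    # both iterations print /w/thumb/a/ab/Foo.jpg: one key, one PURGE, every size gone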
are prayers to sauron [18:16:11] but the solution could be handling pdf/tiff/djvu entirely differently [18:16:14] paravoid: if nothing else, one could make an exception by file extension and use the old system for those [18:16:18] right [18:16:21] heh [18:16:24] and fix that crap later [18:16:29] but yeah, this needs to be done with care [18:16:47] we don't want to get caught in the spiderweb of having to redo everything though, but break things into bits [18:16:53] (03PS1) 10Bsitu: Enable job queue to process notifs for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81278 [18:17:07] I'll leave that up to the people actually doing the work :) [18:17:19] (03CR) 10Dzahn: [C: 032] "jamesofour = jalexander, https://office.wikimedia.org/wiki/User:Jalexander and IRC ~jamesur@wikimedia/Jamesofur" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79304 (owner: 10Jalexander) [18:17:29] I'm merely pointing out the issue [18:17:46] (03PS1) 10RobH: testing smokeping role before modifying it [operations/puppet] - 10https://gerrit.wikimedia.org/r/81279 [18:17:52] paravoid: sure [18:18:06] PROBLEM - Check status of defined EventLogging jobs on hafnium is CRITICAL: Connection refused by host [18:18:29] but yeah, not having to store millions of tiny thumb files into media storage would be hugely appreciated [18:19:01] I'm of the naive opinion that "we" need to document the use cases and acceptance tests, evaluate current impl and design next-gen solution. [18:19:19] Then we need to figure out how to build that solution in smallish chunks [18:19:40] but I'm also talking out of my ass as to the specifics [18:20:01] (03CR) 10RobH: [C: 032] testing smokeping role before modifying it [operations/puppet] - 10https://gerrit.wikimedia.org/r/81279 (owner: 10RobH) [18:21:36] RECOVERY - SSH on analytics1026 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:21:37] RECOVERY - SSH on analytics1027 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:22:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.701 second response time [18:23:42] Learning to do all of this remote is hard. At $DAYJOB-1 I would have already locked a group in a conf room with a whiteboard and tried to make some progress [18:24:21] you can always call a google hangout meeting [18:24:25] virtually locked in a room [18:24:33] (03CR) 10Dzahn: "notice: /Stage[main]/Accounts::Jamesofur/Ssh_authorized_key[jalexander@wikimedia.org2]/ensure: created" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79304 (owner: 10Jalexander) [18:24:49] LeslieCarr, not enough resolution for whiteboard:/ [18:25:39] LeslieCarr: true enough. But still have to herd cats into a pile and feed them an agenda [18:25:56] the whole wall is a whiteboard now on some hallways [18:26:24] I'm not sure I can pull that off yet. I tried to get started last week but things only moved a few centimeters [18:27:07] mostly because I don't grok the problem space yet I think. And the roster of key players [18:29:47] yes still have to do a lot of that :) [18:29:53] and despite multiple timezones [18:30:26] RECOVERY - RAID on analytics1026 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:32:17] Aaron|home & paravoid: thanks for looping me in on this discussion.
I'll try to make some sense out of it [18:32:36] RECOVERY - Disk space on analytics1026 is OK: DISK OK [18:32:42] * bd808 needs lunch first though [18:32:56] RECOVERY - RAID on analytics1027 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:33:11] (03CR) 10Dzahn: "not all but most of the RT users follow the "asmith" scheme. (first letter of first name, last name). how about the group memberships, we " [operations/puppet] - 10https://gerrit.wikimedia.org/r/80577 (owner: 10Faidon Liambotis) [18:33:26] RECOVERY - Disk space on analytics1027 is OK: DISK OK [18:34:27] Aaron|home: ping [18:38:57] * Aaron|home pongs [18:39:59] Aaron|home: high priority jobs in testwiki and test2wiki are processed almost immediately, is that normal? [18:41:57] bsitu: I guess it can happen, I wouldn't count on it though [18:42:06] there is nothing special about those queues afaik [18:46:45] Aaron|home: thx! I did see jobs in the runJobs.log, so they were processed. It used to take more than 5 minutes, so I wanted to make sure there is nothing special, :) [18:46:46] Ryan_Lane: [ERROR ] The return failed for job 20130827184259456139 'deploy_redis.returner' (on hafnium) [18:46:49] not sure what that means [18:47:01] hm [18:47:01] that's /var/log/upstart/salt-minion.log, fwiw [18:47:03] one sec [18:47:51] * Ryan_Lane grumbles [18:48:02] seems that it's necessary to sync the modules for new hosts as well [18:48:07] I need to figure out how to automate that [18:48:10] (03PS1) 10Ottomata: Setting up analytics102[678] as Hadoop JournalNodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81284 [18:48:14] ori-l: try now [18:48:37] RECOVERY - NTP on analytics1026 is OK: NTP OK: Offset -0.01771819592 secs [18:48:56] (03CR) 10Ottomata: [C: 032 V: 032] Setting up analytics102[678] as Hadoop JournalNodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81284 (owner: 10Ottomata) [18:49:14] Ryan_Lane: doing so with force, getting: [18:49:17] # WARN : --force enabling rolling out same thing you had when you started [18:49:17] fatal: Unknown commit none/master [18:49:30] but it continues to run [18:49:30] (03CR) 10Bsitu: [C: 032] Enable job queue to process notifs for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81278 (owner: 10Bsitu) [18:49:47] on hafnium: [WARNING ] Unable to import "softwareproperties.ppa": No module named softwareproperties.ppa [18:50:02] that warning is normal [18:50:06] RECOVERY - NTP on analytics1027 is OK: NTP OK: Offset -0.01477134228 secs [18:50:11] (03Merged) 10jenkins-bot: Enable job queue to process notifs for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81278 (owner: 10Bsitu) [18:50:15] it'll go away when we upgrade salt [18:50:33] i think it worked now [18:51:05] yep. earlier failure was likely due to modules/returners/pillars not being sync'd [18:51:34] I need to have some way of triggering those to be sync'd when the grain is added [18:52:21] yes, it worked [18:53:06] is there a reason not to sync them everywhere? [18:53:29] so that any host could rapidly become a deployment target for anything? [18:53:40] !log bsitu synchronized wmf-config/InitialiseSettings.php 'Enable job queue to process notifs for all wikis' [18:53:45] Logged the message, Master [18:53:56] it's not a problem, but I worry about salt being a SPOF for puppet [18:54:22] ah, i see [18:54:38] [ERROR ] Targeted grain "deployment_target" not found <-- hm. 
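Ryan's diagnosis above — the master-side modules/returners/pillars hadn't been synced to the new minion, so the deploy returner failed — has a stock one-line fix. A sketch with an example target (the exact minion id is a guess):

    # on the new deployment target itself:
    sudo salt-call saltutil.sync_all
    # or pushed from the salt master to just that minion:
    sudo salt 'hafnium.eqiad.wmnet' saltutil.sync_all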
I wonder why that shows on hosts that don't have that grain [18:54:47] that's a weird error [18:55:41] I'd imagine that should just be ignored [18:56:05] hm. let me see if salt blocks if the master is down when salt-call saltutil.sync_all is called [18:56:58] seems so [19:01:25] (03PS1) 10Ottomata: Fixing typo in role::analytics::common class name [operations/puppet] - 10https://gerrit.wikimedia.org/r/81285 [19:01:39] (03CR) 10Ottomata: [C: 032 V: 032] Fixing typo in role::analytics::common class name [operations/puppet] - 10https://gerrit.wikimedia.org/r/81285 (owner: 10Ottomata) [19:02:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:02:53] (03PS1) 10Lcarr: adding another common typo of "commmon" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81286 [19:03:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [19:03:27] (03CR) 10Lcarr: [C: 032 V: 032] adding another common typo of "commmon" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81286 (owner: 10Lcarr) [19:04:26] !log mlitn Started syncing Wikimedia installation... : Updating ArticleFeedbackv5, Echo, PageTriage and Thanks [19:04:32] Logged the message, Master [19:10:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:12:15] (03PS1) 10Lcarr: merging submodule? [operations/puppet] - 10https://gerrit.wikimedia.org/r/81288 [19:12:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.161 second response time [19:12:52] ottomata: ^^ that look right ? [19:13:24] perfect! [19:13:45] (03CR) 10Lcarr: [C: 032] merging submodule? [operations/puppet] - 10https://gerrit.wikimedia.org/r/81288 (owner: 10Lcarr) [19:14:07] running puppet-merge [19:14:12] now let's see if it's all up to date [19:14:13] (03CR) 10Akosiaris: "The truth is that the config as is now is kind of misleading. It gives the impression that it will force HTTPS but it does not (at least n" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80314 (owner: 10Dzahn) [19:14:16] oh noes [19:14:55] no good? [19:15:00] http://pastebin.com/PNVF7nBs [19:15:15] LOCAL CHANGES!? [19:15:18] bwah! [19:17:02] weird [19:17:03] uh [19:17:06] it looks like it worked [19:17:23] hmm [19:17:27] it worked on stafford [19:17:48] why did you work on stafford puppet merge [19:18:01] oh, wait [19:18:01] sorry [19:18:03] !log mlitn Finished syncing Wikimedia installation... : Updating ArticleFeedbackv5, Echo, PageTriage and Thanks [19:18:09] Logged the message, Master [19:18:10] the sockpuppet puppet working copy has moved, right? [19:18:15] not in /root/puppet anymore [19:18:24] oh yeah, it should be in /var/lib iirc [19:18:29] same place as stafford - /var/lib/git/operations/puppet/modules/cdh4 [19:18:42] yeah ok i see the local changes now [19:18:42] well without the cdh4 bit [19:18:46] no idea why there are any local changes [19:18:49] i'm going to git reset them [19:18:53] sounds good [19:18:58] so, stafford worked [19:19:00] because its just a merge hook [19:19:01] that's weird [19:19:06] it's the local git submodule update that failed [19:19:06] yeah, but sockpuppet didn't ...
[19:19:16] merge worked in both places [19:19:23] just sockpuppet git submodule update failed [19:19:42] there we go [19:19:42] Submodule path 'modules/cdh4': checked out '5bde42ead8e72fb23a33cb7efd1c90f8b43e746d' [19:19:43] weird [19:19:46] no idea why that happened [19:19:53] but, ok! [19:19:55] now we are good to go [19:20:02] shall we proceed? back to an26 screen? [19:20:10] (also, should we move back to PM or do this here?) [19:20:24] we can do this here, others can veto if we're being too chatty [19:20:30] k [19:20:39] ok lets try again! running puppet on an26 [19:20:47] fingers crossed [19:20:50] so far nothing's broken :) [19:21:18] (03CR) 10Chad: "Yep, this is exactly how you update a submodule. Congrats on coming to the dark side :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81288 (owner: 10Lcarr) [19:21:25] yay! [19:22:04] berp [19:22:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:44] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:46] who is handling search now? [19:22:55] notit! [19:22:57] ah, got it, me! [19:22:58] <^d> cmjohnson1: Sup yo? [19:22:59] oh no!~ [19:23:00] haha [19:23:00] not search! [19:23:02] hehe [19:23:08] haha, thought leslie was talking to me [19:23:12] :) [19:23:25] <^d> cmjohnson1: But if you're talking about old search, I'd be hard pressed to care! I'm trying to kill that damn thing ;-) [19:23:28] have a h/w issue (dimm) on search1002...going to update bios ...i am pretty sure it will depool itself but wanna make sure [19:23:45] ahha, so $::cdh4::hadoop::dfs_journalnode_edits_dir is undef [19:24:17] <^d> cmjohnson1: What's the IP on that box? [19:24:21] yeah, missed that in the production configs [19:24:22] fixing [19:24:24] (03PS1) 10Ottomata: Setting dfs_journalnode_edits_dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/81361 [19:24:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.613 second response time [19:24:40] search1002.eqiad.wmnet has address 10.64.32.109 [19:24:44] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [19:24:45] (03CR) 10Ottomata: [C: 032 V: 032] Setting dfs_journalnode_edits_dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/81361 (owner: 10Ottomata) [19:24:48] it's ok, i'm just following the settings through the module [19:24:49] ^d search1002.eqiad.wmnet has address 10.64.32.109 [19:24:51] which is good [19:24:57] <^d> cmjohnson1: Oh yeah, it's all lvs'd. Nevermind, you're right. [19:25:01] (03PS1) 10Hashar: contint: python-httplib2 for pywikibot/core tests [operations/puppet] - 10https://gerrit.wikimedia.org/r/81362 [19:25:02] ^d https://wikitech.wikimedia.org/wiki/Search#Cluster_Host_Hardware_Failure [19:25:02] <^d> Should just depool. [19:25:08] k running puppet again [19:25:10] cool [19:25:15] that says it depools itself [19:25:31] ^d thx for looking [19:25:37] 5th time's the charm [19:26:03] <^d> cmjohnson1: np. Thanks for the heads up.
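For the record, the recovery on sockpuppet's working copy above amounts to the following (path quoted from the conversation; a sketch of the sequence, not a transcript of the exact commands run):

    cd /var/lib/git/operations/puppet
    git status                    # shows the unexplained local changes
    git reset --hard HEAD         # discard them
    git submodule update --init   # re-checkout the pinned modules/cdh4 commit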
[19:26:20] !log search1002 going down for bios update [19:26:25] Logged the message, Master [19:27:21] (03Abandoned) 10Hashar: contint: python-httplib2 for pywikibot/core tests [operations/puppet] - 10https://gerrit.wikimedia.org/r/81362 (owner: 10Hashar) [19:27:44] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:15] icinga-wm: you should know by now we don't care about pdf1 [19:28:34] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [19:28:43] hahah [19:28:44] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:29:32] LeslieCarr, is pdf1 what we want to replace as part of that hackathon? [19:29:43] yes, pdf1-3 [19:29:57] make the pdf rendering something that can be packaged, deployed properly on machines , etc [19:30:05] hey it's going! [19:30:06] cuz otherwise it's totally dying in december [19:30:08] yes [19:30:12] ottomata: can you unsplit the window ? [19:30:18] i thought I did [19:30:23] so it's more screen real estate [19:30:24] hrm [19:30:28] yeah, it is small for me too [19:30:30] dunno how to fix though [19:30:33] hrm [19:30:36] me neither :-/ [19:30:40] lemme look up screen documentation! [19:31:32] hmmmm, you know, maybe I didn't need analytics::common for these [19:31:33] hm [19:31:36] yay [19:31:42] ctrl+a F [19:31:54] oh yeay! [19:32:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:35] (03PS1) 10Ottomata: Don't need role::analytics::common for JournalNodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81365 [19:33:24] (03CR) 10Ottomata: [C: 032 V: 032] Don't need role::analytics::common for JournalNodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81365 (owner: 10Ottomata) [19:33:25] ^d...do you find it odd that icinga didn't notice search1002 down? [19:33:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.318 second response time [19:33:38] cmjohnson1: "icinga-wm: PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% " ? [19:33:52] hrm..missed that thx leslie [19:34:10] np, some assholes are doing hadoop stuff and littering up the channel [19:34:25] haha [19:34:27] :-P [19:34:35] going to run puppet on an27 [19:34:36] ottomata: interesting that analytics common class doesn't change the salt grain to cluster: analytics from misc [19:34:38] should we fix that ? [19:34:47] oh? [19:34:48] sure! [19:35:04] i haven't actually used salt yet, so I don't know much bout it [19:35:14] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:35:16] do it in role::analytics [19:35:51] ah doesn't set realm i think is where salt gets it [19:36:14] wait it did, didn't it? [19:36:14] grains: [19:36:15] - cluster: misc [19:36:15] + cluster: analytics [19:36:30] yeah it didn't set it in analytics common [19:37:01] oh yeah, its in role::analytics [19:37:04] not role::analytics::common [19:37:04] # ganglia cluster name. [19:37:04] $cluster = "analytics" [19:37:07] that's it, right? [19:37:12] yep [19:37:20] hrm [19:37:30] oh though that just sets it at the role::analytics scope [19:37:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:37] i think it might need to be $::cluster ? [19:38:08] hate variable scopes [19:38:42] naw i think it won't let you set a var like that [19:38:46] it worked though, right?
[19:39:22] well it did work now :) just wish it worked on initial run or if you only made the common role [19:39:27] without setting it twice [19:39:56] it worked on initial run with role::analytics included, right? [19:40:52] it was the run with the 178 include role::analytics::hadoop::client [19:41:03] not initial run [19:41:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.438 second response time [19:41:54] or actually the role::analytics class sets it [19:41:55] yeah [19:42:42] LeslieCarr: ah. crap. I know why [19:42:55] because nothing is specifically removing the grain [19:43:04] and it allows multiple values to be set for it [19:43:14] some grains should only allow one value, though [19:43:27] ah [19:43:27] let me open an rt for that and take a look at it [19:43:30] cool [19:43:32] thanks Ryan_Lane [19:43:41] mystery solved! [19:43:43] i have no idea what you guys are talking about but cool! [19:43:46] thanks! [19:43:47] :) [19:44:05] k waiting on an27, an26 is looking goooOOOood [19:44:49] yay [19:45:07] LeslieCarr: you can set the grain manually [19:45:16] i think both are happy ottomata ? [19:45:29] http://docs.saltstack.com/ref/modules/all/salt.modules.grains.html#salt.modules.grains.setval [19:45:47] yes! [19:45:48] cool [19:45:50] which will work until I fix it in puppet [19:45:54] making a commit for namenodes and datanodes now :) [19:46:58] (03PS1) 10Ottomata: Including hadoop classes on NameNodes and DataNodes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/81366 [19:47:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:43] (03CR) 10Ottomata: [C: 032 V: 032] Including hadoop classes on NameNodes and DataNodes. [operations/puppet] - 10https://gerrit.wikimedia.org/r/81366 (owner: 10Ottomata) [19:47:44] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:27] bwerrrrrr ssh key! [19:48:30] LeslieCarr :p [19:48:44] gah [19:49:00] s'ok we are done with an26 and 27 for now anyway [19:49:01] new screen! [19:49:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:49:25] but lets get to namenodes quick! before cron does! :p [19:49:32] ok [19:49:34] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [19:49:37] restarted a hadoop named screen [19:50:44] k [19:50:52] do it! [19:50:54] puppet! [19:51:22] hahaha [19:51:30] i just made a "gaaah" sound at hte coffeeshop [19:51:34] haha [19:51:38] some people looked at me funny [19:51:42] hha [19:51:52] ok fingers crossed, this is going to be coool! [19:52:02] it shoudl format the namenode, which will cause the journalnodes to start doing their thang [19:52:45] i hope hope this will all work, i'm not exactly sure what's going to happen on the standby namenode when we run this….. this whole standby thing kinda has a chicken vs egg issue [19:52:58] i've done this before, but it was a while ago now :p [19:53:11] hehe [19:54:46] bah format failed! [19:54:47] i saw that [19:55:02] hah. damn it [19:55:10] that's set in the config [19:55:19] I don't know why it's showing up in both grains [19:55:26] sigh, failed dependencies! bad! [19:55:34] I just realized that as I went to fix it [19:55:43] it's probably cached [19:55:46] ottomata: well let's run the command manually and see what's up ? [19:55:47] which host is this? 
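The grains.setval stopgap Ryan links above looks like this in practice (target id from the conversation, grain name and value as discussed; stock salt execution functions):

    # inspect what the minion currently reports, then pin it by hand until puppet manages it:
    sudo salt 'analytics1026.eqiad.wmnet' grains.item cluster
    sudo salt 'analytics1026.eqiad.wmnet' grains.setval cluster analytics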
[19:55:50] ok once this finishes running, we'll run the format... [19:55:51] yup [19:55:52] :) [19:55:53] hehehe [19:55:58] jinx! [19:56:10] the change I made isn't a bad idea, but I didn't need to make it :D [19:56:19] <^d> Ryan_Lane: How does Thursday from 3-4 sound? [19:56:28] ^d: that's fine [19:56:31] <^d> mmk. [19:56:50] hmm LeslieCarr, it might be because 1028 doesn't exist and I added it to the list [19:56:50] 13/08/27 19:56:28 FATAL namenode.NameNode: Exception in namenode join [19:56:51] java.lang.IllegalArgumentException: Unable to construct journal, qjournal://analytics1026.eqiad.wmnet:8485;analytics1027.eqiad.wmnet:8485;analytics1028.eqiad.wmnet:8485/kraken [19:56:52] i'll remove it [19:56:52] ah [19:56:54] ok [19:57:16] <^d> greg-g: I'm taking a thursday deploy window from 3-4 for gerrit downtime. Should take only a fraction of the time. Calendar is open :) [19:58:19] (03PS1) 10Ottomata: Not including analytics1028 as a journalnode until it actually exists. [operations/puppet] - 10https://gerrit.wikimedia.org/r/81370 [19:58:27] LeslieCarr, ottomata: which system was showing the incorrect grain? [19:58:35] (03CR) 10Ottomata: [C: 032 V: 032] Not including analytics1028 as a journalnode until it actually exists. [operations/puppet] - 10https://gerrit.wikimedia.org/r/81370 (owner: 10Ottomata) [19:59:12] Ryan_Lane: it was analytics1026 when it had only class role::analytics::common on it [19:59:17] LeslieCarr: lets try puppet again [19:59:27] LeslieCarr: analytics1026 didn't have common on it [19:59:32] until we applied the journalnode stuff [19:59:34] which one was it ? [19:59:35] it was just standard before that [19:59:37] (03CR) 10MaxSem: [C: 032] Update $wgMFRemovableClasses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80494 (owner: 10MaxSem) [19:59:39] ah [19:59:40] ok [19:59:50] hrm, analytics1009 just changed its grain [19:59:52] (03PS1) 10Ryan Lane: Allow grains to be set with a single value [operations/puppet] - 10https://gerrit.wikimedia.org/r/81372 [19:59:53] ^d: sounds good [19:59:57] puppet is running on its own! [19:59:59] gah! [20:00:00] :p [20:00:01] is it correct now? [20:00:10] yeah [20:00:12] (03CR) 10MaxSem: [C: 032] Instruct robots to not index Wikipedia Zero. No deploy before 25-June-2013. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/69420 (owner: 10Dr0ptp4kt) [20:00:13] * Ryan_Lane nods [20:00:18] it was previously cached [20:00:23] the master caches results for a while [20:01:05] (03Merged) 10jenkins-bot: Update $wgMFRemovableClasses [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80494 (owner: 10MaxSem) [20:02:22] ahhh, i'm remembering how the standby works now, lesliecarr [20:02:27] it overrides the -format exec [20:02:27] (03Merged) 10jenkins-bot: Instruct robots to not index Wikipedia Zero. No deploy before 25-June-2013. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/69420 (owner: 10Dr0ptp4kt) [20:02:35] to bootstrap the standby from the running primary namenode [20:02:52] so, once the primary is up, running puppet on standby should behave [20:04:36] !log maxsem synchronized w/robots.php 'https://gerrit.wikimedia.org/r/69420' [20:04:41] Logged the message, Master [20:04:48] ok [20:05:12] yurik, ^^^ deployed the robots.txt change [20:05:25] MaxSem, thx, checking [20:07:19] hrm [20:07:22] why'd it fail this time ? [20:07:39] or does it take a while and we need to extend the timeout ? [20:07:50] hm still format failure [20:07:52] no don't think so [20:08:24] ha, or maybe so?
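That IllegalArgumentException is just the namenode balking at a qjournal:// URI naming a host it can't reach. A quick pre-flight check of the quorum members before retrying (hosts and port taken from the traceback):

    for h in analytics1026 analytics1027 analytics1028; do
        nc -vz -w 3 "${h}.eqiad.wmnet" 8485 || echo "${h}: not a usable journalnode"
    done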
[20:08:24] hm [20:08:27] looks like it is [20:08:28] yeah [20:08:56] i don't think it should take that long though [20:08:56] cool, "timeout" [20:09:17] we could strace -p the process to see if it's doing stuff [20:10:35] (03PS1) 10Chad: Switch Gerrit from manganese to ytterbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/81374 [20:10:47] (03PS1) 10CSteipp: Remove redirect from http to https from loginwiki [operations/apache-config] - 10https://gerrit.wikimedia.org/r/81375 [20:11:05] (03CR) 10jenkins-bot: [V: 04-1] Switch Gerrit from manganese to ytterbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/81374 (owner: 10Chad) [20:11:37] hm i dunno LeslieCarr [20:11:58] that doesn't seem right to me [20:12:00] (03PS2) 10Chad: Switch Gerrit from manganese to ytterbium [operations/puppet] - 10https://gerrit.wikimedia.org/r/81374 [20:12:12] it would seem like it should be doing something [20:12:33] http://man7.org/linux/man-pages/man2/futex.2.html (in case you didn't know futex either) [20:13:47] ori-l: https://gerrit.wikimedia.org/r/#/c/81372/ since you wrote some of this code... [20:13:51] yeah i googled that too :) [20:14:01] i'm going to kill that and try to run manually [20:14:23] oh look! [20:14:25] it was waiting for input! [20:14:29] haha [20:14:32] this is because it only has 2 nodes [20:14:39] (03PS1) 10RobH: rt4670 we don't own this domain, nor is it pointed at our nameservers, removing support [operations/dns] - 10https://gerrit.wikimedia.org/r/81376 [20:14:49] ahha [20:14:51] (03CR) 10Chad: [C: 04-1] "Don't merge til Thursday, just wanted to get it ready." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81374 (owner: 10Chad) [20:14:51] :) [20:15:20] coool [20:15:21] and [20:15:24] (03CR) 10RobH: [C: 032] rt4670 we don't own this domain, nor is it pointed at our nameservers, removing support [operations/dns] - 10https://gerrit.wikimedia.org/r/81376 (owner: 10RobH) [20:15:48] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:59] Ryan_Lane: reviewing [20:16:00] look! [20:16:03] journal! [20:16:09] on an26 (and 27) [20:16:12] great! [20:16:20] yay [20:16:23] namenode starting up too! [20:16:39] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [20:16:42] btw, puppet ran on all the datanodes already [20:16:42] :p [20:16:51] i'm tailing logs on them [20:16:57] they are all trying to connect to the namenode [20:17:02] they should succeed once it comes up [20:17:24] (03CR) 10Dzahn: "RT #4670" [operations/dns] - 10https://gerrit.wikimedia.org/r/81376 (owner: 10RobH) [20:18:03] hehe [20:18:11] so should we run the format manually on 1009 ? [20:18:34] hmm, dunnoooooooo [20:18:37] i dunno! [20:18:57] let's run puppet and see what happens [20:19:36] hashar: should i make it a bug? authdns-gen-zones: command not found in operations-dns-lint [20:19:45] hmm that's a long puppet run on an09! [20:19:48] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:19:52] haha yup!
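Two stock commands cover this triage: checking whether the stuck format is actually doing work, and re-running it without the confirmation prompt puppet's exec could never answer. The pgrep pattern below is illustrative, and while -nonInteractive does exist in Hadoop 2.x-era CDH4, whether it fits the puppetized exec here is untested:

    # is the process doing I/O, or just parked on a futex?
    sudo strace -f -p "$(pgrep -f 'namenode -format' | head -1)"
    # re-run the format while skipping the interactive prompt:
    sudo -u hdfs hdfs namenode -format -nonInteractive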
[20:20:00] yup same deal [20:20:19] so, LeslieCarr this time [20:20:23] instead of running namenode -format [20:20:24] i'm running [20:20:27] namenode -bootstrapStandby [20:20:36] bootstrapStandby … got it :) [20:20:42] (03CR) 10Ori.livneh: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81372 (owner: 10Ryan Lane) [20:20:44] hm you know [20:20:49] i think this didn't used to prompt like this [20:20:50] ^ Ryan_Lane [20:20:53] and for future namenode #3 we will do bootstrap standby as well [20:20:56] this is a slightly newer version [20:20:57] also, this time wasn't because anything is wrong [20:21:00] cdh4.3.1 [20:21:06] we were running 4.2.1 before [20:21:07] yeah [20:21:11] hrm, is there a command line option ? [20:21:12] right exactly [20:21:29] great, that worked [20:21:30] runnign puppet [20:21:38] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [20:21:42] dunno [20:22:17] AH [20:22:24] and! [20:22:31] we have to promote the primary namenode to active [20:22:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:43] since we aren't using automatic failover [20:22:49] it makes you do it manually [20:22:57] sorry, this is so messy, we are doing a few nonstandard things here [20:23:02] ottomata: 4.3.1 ? weren't we tracking 4.2.1 only ? [20:23:03] its gone super smoothly in labs usually [20:23:09] we were, but i upped it this morning [20:23:20] s'pok? [20:23:24] s'ok? [20:23:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time [20:23:38] i suppose. You know better :-) [20:24:11] k great :) [20:24:11] how's the repavement? [20:24:14] LeslieCarr: I am now running [20:24:23] hdfs haadmin -transitionToActive analytics1010-eqiad-wmnet [20:24:48] see bottom of this bit of the docs: [20:24:49] hrm, might have to do it manually … [20:24:52] https://github.com/wikimedia/puppet-cdh4#adding-high-availability-to-a-running-cluster [20:25:46] cool [20:26:07] paravoid, good! [20:26:13] good to hear [20:26:22] hadoop is almost up and running, 100% repaved [20:26:27] the kafka 0.7.2 brokers are still running [20:26:38] but there are temp ciscos running 0.8 we can play with for now [20:26:57] i'll repave the 0.7.2 machines once we have a 0.8 .deb in our apt [20:27:04] and can apply the proper kafka puppet stuff [20:27:18] LeslieCarr: there we go! puppet is finally liking this [20:27:21] RobH: btw, http://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines [20:27:23] since an10 is now active [20:27:31] it can create the hdfs filesystem hierarchy [20:27:32] crossing fingers... [20:27:33] RobH: long commit msg lines make me sad [20:27:39] ottomata: I 'll have a look at this 0.8 kafka deb [20:27:48] oh the branch t hing I mentioned? [20:27:48] k [20:27:51] much obliged! [20:27:58] ottomata: do we still have 0.7 producers? [20:28:03] no [20:28:05] they are paved! [20:28:10] so yeah, its basically flattened [20:28:15] i guess i can reinstall whenever then :) [20:28:22] i just can't do anything with them til we're ready [20:28:28] so the 0.7 brokers are kinda useless now? [20:28:34] true dat [20:28:51] wire speak? [20:28:51] i mean, ja i could flatten them anytime [20:28:53] i suppose [20:29:02] pssh, that's american man! [20:29:19] coool! LeslieCarr, looking goooood! 
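Pulling the manual-HA commands from the last few minutes into one sequence (service id as used in the log; the final -getServiceState check is an addition here, though it is stock haadmin):

    sudo -u hdfs hdfs namenode -bootstrapStandby   # on the standby (an1009): copy state from the primary
    hdfs haadmin -transitionToActive analytics1010-eqiad-wmnet    # promote the primary by hand
    hdfs haadmin -getServiceState analytics1010-eqiad-wmnet       # should now print "active"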
[20:29:23] on a completely unrelated note, have a look at http://bsonspec.org/ [20:29:29] yay [20:29:33] but let's leave this for the next version, too late now imho :) [20:29:46] (and too many format work from magnus' side) [20:29:49] hehe [20:29:50] much even [20:29:57] yay datanodes! [20:29:59] let's check it! [20:30:10] 180 TB! [20:30:10] woot! [20:30:15] https://github.com/mongodb/libbson/releases/tag/0.2.0 was released last friday [20:30:15] 10 datanodes [20:30:16] woot! [20:30:28] think of all the NSA data we can save on that! [20:30:31] WHaooooo bson [20:30:31] cool [20:30:38] paravoid: are you saying i should have put that one line down? [20:30:50] RobH: no, see the doc I pointed you at [20:30:54] and just been more terse [20:30:56] yes, i read it [20:30:59] and im asking you to clarify. [20:31:05] first line 50 chars, then whitespace line, then body [20:31:09] first line more terse, be more verbose on following [20:31:11] yes, ok [20:31:14] interesting! i wonder if order matters in bson... [20:31:14] and the RT using an RT: NNNN [20:31:25] which makes it clickable in gerrit as well [20:31:32] it used to do just rt2131 [20:31:37] i recall, but it doesnt now [20:31:42] and was pointed out to me yea [20:32:00] LeslieCarr: [20:32:01] ssh -N bast1001.wikimedia.org -L 50070:analytics1010.eqiad.wmnet:50070 [20:32:03] then navigate to [20:32:10] http://localhost:50070 [20:32:26] awesooome, with journalnodes showing up and everything! [20:32:36] nice work ottomata [20:32:38] now we just need that extra journalnode :) poke poke RobH [20:32:38] (03CR) 10Dzahn: [C: 031] Remove redirect from http to https from loginwiki [operations/apache-config] - 10https://gerrit.wikimedia.org/r/81375 (owner: 10CSteipp) [20:32:42] yay! [20:32:47] ? [20:32:48] and LeslieCarr :) [20:33:15] it's ok RobH , we should have thought about it and asked earlier [20:33:24] 179.06 TB ??? [20:33:28] i have no idea what you are talkin about [20:33:28] woot [20:33:30] yeah no worries RobH [20:33:34] https://rt.wikimedia.org/Ticket/Display.html?id=5678 [20:33:36] yes you do! [20:33:51] meh, not on procurement right now ;p [20:33:58] hehe, no probs [20:34:08] we've been running hadoop with no standby namenode for a long time now [20:34:18] no reason we can't wait a bit for 3rd journalnode to make the quorum complete [20:34:19] :p [20:34:37] i'll get back to it either later today or tomorrow am [20:35:40] k danke, no hurry [20:35:50] (03CR) 10Reedy: [C: 031] Remove redirect from http to https from loginwiki [operations/apache-config] - 10https://gerrit.wikimedia.org/r/81375 (owner: 10CSteipp) [20:35:56] LeslieCarr: we now need to do the hive/oozie/hue stuff, but we need to wait for the other node for that too [20:36:00] so uuuummmmmmmmmm [20:36:02] i think we are done for the day! [20:36:16] cool [20:36:17] yay! [20:36:18] :) [20:38:16] yayyay! [20:38:35] thank you! we did it! it was not quite as smooth as I would ahve liked, but lets blame −1 journalnode, eh? 
:p [20:38:57] hehehe [20:39:00] :) [20:39:03] we can blame that ;) [20:39:09] it was a great learning experience [20:41:57] !log demon synchronized php-1.22wmf13/extensions/CentralAuth 'CentralAuth to master' [20:42:02] Logged the message, Master [20:42:12] (03PS1) 10Akosiaris: Add an archival pool for backup [operations/puppet] - 10https://gerrit.wikimedia.org/r/81385 [20:42:34] !log demon synchronized php-1.22wmf14/extensions/CentralAuth 'CentralAuth to master' [20:42:40] Logged the message, Master [20:44:28] (03CR) 10Chad: [C: 031] Remove redirect from http to https from loginwiki [operations/apache-config] - 10https://gerrit.wikimedia.org/r/81375 (owner: 10CSteipp) [20:45:29] (03CR) 10Akosiaris: [C: 032] Add an archival pool for backup [operations/puppet] - 10https://gerrit.wikimedia.org/r/81385 (owner: 10Akosiaris) [20:49:43] (03PS1) 10Edenhill: Let log.data.copy default to true. [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/81388 [20:49:44] (03PS1) 10Edenhill: Added optional %{}t formatting to %t [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/81389 [20:49:45] (03PS1) 10Edenhill: Added optional secondary formatter: format.key [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/81390 [20:51:19] (03CR) 10Ryan Lane: [C: 032] Remove redirect from http to https from loginwiki [operations/apache-config] - 10https://gerrit.wikimedia.org/r/81375 (owner: 10CSteipp) [20:51:55] (03PS1) 10Akosiaris: Adding blog.wikimedia.org to backup schedule [operations/puppet] - 10https://gerrit.wikimedia.org/r/81393 [20:54:01] (03CR) 10Ottomata: [C: 032 V: 032] Installing kafka-mirror init.d and default scripts. [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/79927 (owner: 10Ottomata) [20:54:04] (03CR) 10Akosiaris: [C: 032] Adding blog.wikimedia.org to backup schedule [operations/puppet] - 10https://gerrit.wikimedia.org/r/81393 (owner: 10Akosiaris) [20:57:56] mutante: re """authdns-gen-zones: command not found in operations-dns-lint""" [20:58:17] mutante: I talked to paravoid about it. He is going to write a puppet class to have the gdsnd tools installed on jenkins [20:58:32] mutante: that will magically make the test to run something and hopefully they will be passing :-] [21:00:45] hashar: thank you:) i already made one , bug 53422 [21:01:01] sounds cool, especially the magical part [21:01:12] ottomata: what's the issue with the zookeeper conflict? [21:01:16] scapping... [21:01:24] mutante: thanks :) [21:02:51] !log deployed change 81375 to remove the http to https redirect from loginwiki [21:02:56] sleeep time [21:02:57] Logged the message, Master [21:07:28] paravoid [21:07:38] its just a package name conflict [21:07:39] afaik [21:07:50] the cloudera hadoop packages depend on the cloudera zookeeper package [21:07:56] and we are suing the debian zookeeper package [21:07:58] using* [21:08:19] which package is the one that you can't install? [21:08:21] and on which machine? [21:08:39] and do we want to install clouder zookeeper anywhere or do we want to use debian zookeeper everywhere? [21:09:18] could we just create a dummy cloudera-named zookeeper package that falls back to the version of zookeeper that we want? 
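Ryan's dummy-package idea is exactly what equivs exists for. A sketch — the version string is invented, chosen only to clear CDH4's zookeeper (>= 3.4.0) dependency check while shipping nothing:

    sudo apt-get install -y equivs
    cat > zookeeper-placeholder <<'EOF'
    Section: misc
    Priority: optional
    Standards-Version: 3.9.2
    Package: zookeeper
    Version: 3.4.5+dummy1
    Description: empty placeholder satisfying CDH4's zookeeper dependency
     The real ZooKeeper classes come from the distro package or a jar elsewhere.
    EOF
    equivs-build zookeeper-placeholder
    sudo dpkg -i zookeeper_3.4.5+dummy1_all.deb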
[21:10:20] hmmmm [21:10:50] I mean, I don't know what the issue is, I just saw the procurement RT [21:10:54] yeah [21:10:54] well [21:10:58] if it wasn't for this issue [21:11:02] and a package conflict seems like a bad reason to procure hardware :) [21:11:08] ja agree [21:11:17] but ja if it wasn't for this issue [21:11:18] i'd install hadoop-hdfs-journalnode on the three zookeeper nodes [21:11:25] an23,an24,an25 [21:11:50] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [21:11:55] Logged the message, Master [21:12:29] wait [21:12:40] a simple dist-upgrade on an23 wants to remove zookeeperd and upgrade zookeeper [21:12:56] so that's already an issue, journalnode has nothing to do with this? [21:13:07] .. [21:13:52] ok so in our apt, we have [21:13:55] zookeeper_3.4.5+20-1.cdh4.3.1.p0.76~precise-cdh4.3.1_all.deb [21:14:02] also wants to upgrade java, blergh [21:14:23] installed is 3.3.5+dfsg1-1ubuntu1 [21:14:40] cdh4 zookeeper installs this [21:14:41] ./usr/lib/zookeeper/zookeeper.jar -> zookeeper-3.4.5-cdh4.3.1.jar [21:15:12] ubuntu zookeeper has /usr/share/java/zookeeper-3.3.5.jar [21:15:27] i'm pretty sure the 3.3.5 jar would work fine for the hadoop stuff [21:15:37] but if we force installed the ubuntu one [21:15:40] we'd at least have to symlink [21:15:48] so that hadoop would know where to look [21:16:03] do we use cdh4's zookeeper anywhere? [21:16:16] not the server, no [21:16:18] hadoop uses it [21:16:20] yeah but i think that zookeeper_3.4.5 is a dependency of some cdh4 package [21:16:25] ight [21:16:26] right [21:16:42] right, we talked about this before when we were doing the zk puppet module [21:16:46] we said we would use the ubuntu one [21:16:52] because other people in the org would need zk too [21:16:59] and we didn't want to make them have to use the cdh4 module [21:17:00] hadoop depends: zookeeper (>= 3.4.0) [21:17:14] hadoop-0.20-mapreduce-zkfc too [21:17:34] hbase depends: zookeeper (>= 3.3.1) [21:17:34] the zookeeper server packages of the different dists work totally differently too [21:17:40] so the puppetization of the two are different [21:17:44] but that's just for the servers [21:17:48] yeah i remember that [21:18:09] there were quite a lot of changes... binary names, directory names... [21:18:21] so ja, we talked about this and basically just said: welp, i guess we won't install hadoop and zk servers on the same nodes [21:18:22] oh well [21:18:34] hence the procurement ticket :p [21:18:40] well [21:18:42] we didn't say exactly that [21:18:48] haha, ok maybe I said that [21:18:54] we said that a zookeeper cluster could be useful for other services in general [21:19:06] and that it might make sense to build it outside of analytics [21:19:36] and have it as an internal service for whatever other service [21:19:42] (we were thinking solrcloud at the time) [21:19:50] (03PS1) 10Akosiaris: Variable should not be in double quotes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81399 [21:19:50] sure! soooo, what, should we do a procurement for non-analytics zk nodes ? [21:20:05] and then we can just use the nodes we already have for the hadoop journalnode stuff?
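The symlink shuffle ottomata mentions, concretely (both paths quoted from the conversation; untested, and moot if the placeholder-package route is taken instead):

    # keep Ubuntu's zookeeper and point Hadoop's expected jar path at its jar:
    sudo ln -sf /usr/share/java/zookeeper-3.3.5.jar /usr/lib/zookeeper/zookeeper.jar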
[21:21:58] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [21:22:36] I think so, although it does sound like a pointless shuffling around [21:24:11] well, now is a good time for it, since hadoop is kinda out of commission for a bit [21:24:15] (03CR) 10Akosiaris: [C: 032] Variable should not be in double quotes [operations/puppet] - 10https://gerrit.wikimedia.org/r/81399 (owner: 10Akosiaris) [21:24:19] should def do before we get kafka up [21:24:51] paravoid, how should I proceed? comment on and close the ticket? send email to ops list and ask if that is what we should do? [21:25:36] the procurement ticket will need approval anyway [21:25:43] by mark I think [21:26:00] so just add a comment there and let the one who'll approve decide [21:26:14] I think a non-analytics zookeeper cluster sounds best to me [21:26:19] mark and ken would work together on budget and approval of non budgeted servers yeah [21:26:21] zookeeper is a very useful piece of software [21:26:33] I can imagine various uses for it [21:28:06] ok cool, yeah I think that would be better too [21:31:22] paravoid, also, when we set up Kafka in other datacenters [21:31:26] those will need zookeeper too [21:31:49] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [21:31:55] Logged the message, Master [21:31:58] fun [21:33:03] k, updated RT ticket [21:33:05] https://rt.wikimedia.org/Ticket/Display.html?id=5678&results=21f3a10c2f51031c1055cc844d2c083b [21:33:07] (03PS1) 10Ori.livneh: Add self ('olivneh') to professor.pmtpa.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/81401 [21:33:11] binasher: ^ [21:34:08] (03PS2) 10Asher: Add self ('olivneh') to professor.pmtpa.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/81401 (owner: 10Ori.livneh) [21:34:33] woot, thanks. [21:34:37] (03CR) 10Asher: [C: 032 V: 032] Add self ('olivneh') to professor.pmtpa.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/81401 (owner: 10Ori.livneh) [21:36:12] (03PS1) 10coren: Fix webserver class to use truth value for ssl [operations/puppet] - 10https://gerrit.wikimedia.org/r/81403 [21:38:07] (03PS1) 10RobH: RT:2640 fixing missed quoting in case statement [operations/puppet] - 10https://gerrit.wikimedia.org/r/81404 [21:38:21] Coren: want a second set of eyes on patchset or you have it? [21:38:50] RobH: That touches too many things for self +2 IMO. Please to check. [21:38:59] will do! [21:39:59] (03CR) 10RobH: [C: 031] "glad someone spotted this cuz it was driving me insane earlier" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81403 (owner: 10coren) [21:40:06] Coren: looks good to me [21:40:10] (03PS1) 10Dzahn: fix missing quotes in realm case, do not quote booleans [operations/puppet] - 10https://gerrit.wikimedia.org/r/81406 [21:40:13] heh [21:40:17] mutante: ^ changeset i just mentioned to you [21:40:44] (03CR) 10RobH: [C: 032] RT:2640 fixing missed quoting in case statement [operations/puppet] - 10https://gerrit.wikimedia.org/r/81404 (owner: 10RobH) [21:40:58] (03CR) 10Dzahn: [C: 032] fix missing quotes in realm case, do not quote booleans [operations/puppet] - 10https://gerrit.wikimedia.org/r/81406 (owner: 10Dzahn) [21:41:33] (03CR) 10coren: [C: 032] "If RobH likes it, it's good enough for me.
:-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81403 (owner: 10coren) [21:41:36] ori-l: notice: /Stage[main]/Accounts::Olivneh/Unixaccount[Ori Livneh]/User[olivneh]/ensure: created [21:41:38] (03PS2) 10coren: Fix webserver class to use truth value for ssl [operations/puppet] - 10https://gerrit.wikimedia.org/r/81403 [21:41:43] Coren: im on sockpuppet, i'll merge for you there [21:41:48] (03CR) 10coren: [C: 032] "Bah, rebase." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81403 (owner: 10coren) [21:41:55] (03CR) 10Dzahn: [C: 031] "yes please thx" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81403 (owner: 10coren) [21:41:56] baaaahhhhh rebase! [21:42:54] Pushed, merged. [21:43:22] you puppet-merged too i see? (cuz its not waiting for me) [21:43:23] yay [21:43:25] thanks Coren, this happened before and drove people nuts:) [21:43:26] thx for that [21:44:03] Yeah, I have a script on my own box that stuffs the keys in my agent and goes to do the puppet-merge. :-) [21:44:19] !log Running extensions/Wikibase/repo/maintenance/rebuildPropertyInfo.php against wikidatawiki in screen on terbium [21:44:26] Logged the message, Master [21:49:36] PROBLEM - HTTP on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 545 bytes in 0.383 second response time [21:49:55] !log disregard netmon1001 alerts, its my project server [21:50:01] Logged the message, RobH [21:51:55] (03PS1) 10Dzahn: single quote the realm in the labs/prod case [operations/puppet] - 10https://gerrit.wikimedia.org/r/81408 [21:51:57] Coren: RobH , then those as well [21:52:25] (03CR) 10RobH: [C: 031] single quote the realm in the labs/prod case [operations/puppet] - 10https://gerrit.wikimedia.org/r/81408 (owner: 10Dzahn) [21:52:58] so the initial puppet run on new instances is automated yes? [21:53:09] i logged in and tried to sudo or login as root and no dice [21:53:28] can just be little old me with no special rights and puppet not running.... [21:54:09] yea... there it goes [21:54:13] finally works. [21:54:23] (03CR) 10Dzahn: [C: 032] single quote the realm in the labs/prod case [operations/puppet] - 10https://gerrit.wikimedia.org/r/81408 (owner: 10Dzahn) [21:57:00] <^demon|away> manybubbles: Your reorg code is live on beta, working well. [21:57:09] !log kaldari synchronized wmf-config/InitialiseSettings.php 'syncing InitialiseSettings.php to try to fix fatal error from MobileFrontend' [21:57:15] Logged the message, Master [22:06:06] !log kaldari synchronized php-1.22wmf13/extensions/MobileFrontend 'syncing MobileFrontend to try to address fatal error' [22:06:11] Logged the message, Master [22:07:29] ori-l: I was showing ryan http://fedmsg.readthedocs.org/en/latest/ before [22:07:40] re: 0mq [22:07:54] but redis/0mq/kafka seem unsuitable for external users, don't they?
[22:08:01] so it seems kind of orthogonal [22:08:19] i would take redis off that list, but thats just me [22:09:10] well, i guess if we are still considering external users as "hosted on godaddy shared hosting" maybe cant remove redis :( [22:09:19] paravoid: re RT #5681, we already had some of those separetely, looking [22:09:21] paravoid: they're two separate questions -- (1) how do you feed the data from the mediawikis to a host on the cluster [22:09:25] as separate tickets i mean [22:09:25] (03PS1) 10RobH: fixing smokeping to use standard [operations/puppet] - 10https://gerrit.wikimedia.org/r/81412 [22:09:26] and (2) how do you expose it to the outside world [22:09:38] right [22:09:44] for (2) I think XMPP might be "the worst except all others", to paraphrase churchill [22:10:12] for (1) the basic point I was trying to make is that we have options [22:11:30] the advantage of XMPP over MQs designed for internal use is that it copes well with badly-behaving clients [22:11:34] people that are hosted on godaddy would never be using this [22:11:53] (03CR) 10RobH: [C: 032] fixing smokeping to use standard [operations/puppet] - 10https://gerrit.wikimedia.org/r/81412 (owner: 10RobH) [22:11:53] using what? [22:12:06] (03PS1) 10Dzahn: remove vikipedia.org and .com, RT #4673, RT #4674, RT #5681 [operations/dns] - 10https://gerrit.wikimedia.org/r/81414 [22:12:07] any feed thing we discuss [22:12:46] we shouldn't even consider them [22:12:50] * paravoid doesn't understand the godaddy reference [22:12:59] yes, i'm a bit confused too [22:13:12] "well, i guess if we are still considering external users as "hosted on godaddy shared hosting" maybe cant remove redis :(" [22:13:15] (03PS2) 10Dzahn: remove vikipedio.org and vikipedio.com, RT #4673, RT #4674, RT #5681 [operations/dns] - 10https://gerrit.wikimedia.org/r/81414 [22:13:19] we shouldn't be constrained by the technical choices available to hosted wikis? i don't think we have to, given that the transport is configurable and extensible [22:14:27] * Ryan_Lane nods [22:20:15] !log kaldari synchronized php-1.22wmf13/extensions/MobileFrontend/includes/formatters/ExtractFormatter.php 'syncing live-hack to try to address fatal error' [22:20:21] Logged the message, Master [22:23:56] (03CR) 10CSteipp: "I'll obviously defer to ops if they want to support this or not, but I'll argue that there is an expectation that the bits of text colored" [operations/puppet] - 10https://gerrit.wikimedia.org/r/80314 (owner: 10Dzahn) [22:26:06] csteipp: https://bugzilla.wikimedia.org/show_bug.cgi?id=53424 [22:27:08] (03CR) 10Lcarr: "With https it's trivial to impersonate another person -- you just change your name in etherpad." [operations/puppet] - 10https://gerrit.wikimedia.org/r/80314 (owner: 10Dzahn) [22:30:00] heh:) [22:30:36] csteipp: it was for you:) [22:32:09] !log purging login.wikimedia.org from text varnish [22:32:14] Logged the message, Master [22:34:24] PROBLEM - Host netmon1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:39:04] csteipp: btw... you can also change your color in the interface. Not just your name. [22:39:29] (03PS1) 10Ryan Lane: Add newer pmtpa virt nodes to netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/81418 [22:39:53] akosiaris1: Yep, I just think grabbing someone else's token and using that is worse. [22:39:56] PROBLEM - DPKG on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:42:40] csteipp: only if you use it to do something you can not otherwise do. 
[22:44:56] RECOVERY - Host netmon1001 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [22:47:16] PROBLEM - SSH on netmon1001 is CRITICAL: Connection refused [22:48:48] !log reedy synchronized php-1.22wmf14/extensions/CentralAuth/maintenance/ [22:48:54] Logged the message, Master [22:58:16] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay 0 seconds [22:58:35] !log kaldari synchronized php-1.22wmf13/extensions/MobileFrontend/ 'updating MobileFrontend to master for wmf13' [22:58:41] Logged the message, Master [22:59:46] PROBLEM - NTP on netmon1001 is CRITICAL: NTP CRITICAL: No response from NTP server [22:59:56] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error You cannot ALTER a log table if logging is enabled on query [23:01:26] you now have a new "show other bug" link on bugzilla tickets that show other bugs in the same component [23:07:00] (03PS1) 10Dzahn: RT #4671, RT #5681, remove quickipedia.org [operations/dns] - 10https://gerrit.wikimedia.org/r/81424 [23:07:08] PROBLEM - Host netmon1001 is DOWN: PING CRITICAL - Packet loss = 100% [23:07:51] uhhhh, I don't wanna know [23:07:58] (03CR) 10Dzahn: "telling legal for their records" [operations/dns] - 10https://gerrit.wikimedia.org/r/81414 (owner: 10Dzahn) [23:08:08] RECOVERY - SSH on netmon1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [23:08:18] RECOVERY - Host netmon1001 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [23:10:29] !log kaldari synchronized php-1.22wmf14/extensions/MobileFrontend/ 'updating MobileFrontend to master for wmf14' [23:10:34] Logged the message, Master [23:17:28] !log reedy synchronized php-1.22wmf13/extensions/CentralAuth [23:17:34] Logged the message, Master [23:18:13] !log reedy synchronized php-1.22wmf14/extensions/CentralAuth [23:18:19] Logged the message, Master [23:20:28] RECOVERY - HTTP on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 454 bytes in 0.021 second response time [23:24:38] RECOVERY - NTP on netmon1001 is OK: NTP OK: Offset -0.01024508476 secs [23:26:38] PROBLEM - HTTP on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 545 bytes in 0.385 second response time [23:35:10] (03PS1) 10RobH: include curl in smokeping class [operations/puppet] - 10https://gerrit.wikimedia.org/r/81431 [23:36:00] (03CR) 10RobH: [C: 032] include curl in smokeping class [operations/puppet] - 10https://gerrit.wikimedia.org/r/81431 (owner: 10RobH) [23:36:32] (03PS1) 10RobH: Revert "include curl in smokeping class" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81432 [23:36:59] (03Abandoned) 10RobH: Revert "include curl in smokeping class" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81432 (owner: 10RobH) [23:37:42] (03PS1) 10RobH: introduced a typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/81433 [23:38:25] (03CR) 10RobH: [C: 032 V: 032] introduced a typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/81433 (owner: 10RobH) [23:42:30] RECOVERY - HTTP on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 3886 bytes in 1.431 second response time [23:52:18] !log bsitu synchronized php-1.22wmf14/extensions/Echo 'Update Echo to master' [23:52:24] Logged the message, Master [23:52:59] !log bsitu synchronized php-1.22wmf13/extensions/Echo 'Update Echo to master' [23:53:10] Logged the message, Master