[00:00:05] Hmm, this conference wifi must be crappy then [00:00:09] Sorry to bother you :( [00:03:49] binasher: http://people.apache.org/~joestein/kafka-0.8.0-beta1-candidate1/RELEASE_NOTES.html [00:04:17] andrewbogott_afk: why doesn't operations/debs/kafka have a master branch? [00:04:59] drdee: thx! [00:05:19] <^demon> RoanKattouw: Conference wifi is always crappy :) [00:05:32] bloody difficult to log in nowadays [00:05:42] getting logged out and it complains about cookies [00:12:44] s/conference//g [00:16:00] It was actually good down in the hacker lounge [00:16:16] But I'm now up in the speaker lounge (the hacker lounge got too crowded) where it's worse [00:18:37] which conference are you at? [00:22:26] Open Source Bridge in Portland [00:24:32] RoanKattouw: ah, gotcha :) have fun [00:28:50] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: No successful Puppet run in the last 10 hours [00:56:18] !log updated Parsoid to 0445108 [00:56:27] Logged the message, Master [01:02:29] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.00465965271 secs [01:21:21] New patchset: Catrope; "Hopefully fix the Parsoid Varnishes not showing up as such in Ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69443 [01:26:23] New patchset: Mattflaschen; "Add remaining skins (Cologne Blue and Modern) to sync script." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69444 [01:32:04] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.003474831581 secs [01:49:52] !log catrope Started syncing Wikimedia installation... : Updating VisualEditor to master [01:50:03] Logged the message, Master [02:04:26] !log catrope Finished syncing Wikimedia installation... : Updating VisualEditor to master [02:04:35] Logged the message, Master [02:07:43] !log LocalisationUpdate completed (1.22wmf7) at Wed Jun 19 02:07:42 UTC 2013 [02:07:51] Logged the message, Master [02:14:07] !log LocalisationUpdate completed (1.22wmf6) at Wed Jun 19 02:14:07 UTC 2013 [02:14:16] Logged the message, Master [02:15:14] !log catrope synchronized php-1.22wmf6/resources/startup.js 'touch' [02:15:22] Logged the message, Master [02:15:37] !log catrope synchronized php-1.22wmf7/resources/startup.js 'touch' [02:15:45] Logged the message, Master [02:18:52] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jun 19 02:18:52 UTC 2013 [02:19:01] Logged the message, Master [02:56:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [03:28:16] New review: Krinkle; "(1 comment)" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/68135 [03:32:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:33:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [03:45:06] http://puppetlabs.com/security/cve/cve-2013-3567/ [03:49:21] New review: Ori.livneh; "(1 comment)" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/68135 [03:49:27] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:27] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:27] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:27] PROBLEM
- Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [03:49:27] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:28] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:28] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:29] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:29] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:30] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [03:49:30] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:35] go go go icinga [04:17:56] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [04:21:57] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours [04:58:24] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [04:58:43] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [05:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:27:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [05:56:27] New patchset: Aklapper; "Add product name to component output in Weekly Bugzilla Report mail" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69457 [06:31:13] PROBLEM - search indices - check lucene status page on search20 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60051 bytes in 0.110 second response time [07:46:07] hello [07:47:43] apergos: here I am again :) [07:47:58] morning [07:53:09] apergos: I had a patch for beta to reenable commons as a foreign repo ( https://gerrit.wikimedia.org/r/#/c/62606/3 ) seems it is not needed anymore though [07:53:15] (related to instant commons) [07:53:37] ah yep [07:53:53] seems it is no longer needed, though I have no idea how instant commons and foreign repos work together :( [07:53:57] ah [07:54:02] ok so this is easy [07:54:50] in Setup.php there is a stanza [07:55:59] and all it does is add the foreign api repo for you (but in case there are things like [07:56:09] doh disconnection :D [07:56:16] 'apibase' => WebRequest::detectProtocol() === 'https' ? [07:56:18] yeah [07:56:29] or other little changes that we don't see [07:56:38] it's good to use the version in Setup [07:56:47] the beta enwiki has two foreign file repos: http://paste.openstack.org/show/38931/ [07:57:04] one ForeignDBViaLBRepo which is the beta commons a [07:57:05] yes [07:57:08] I was saying that [07:57:14] and another one which is ForeignAPIRepo to production [07:57:15] (10:54:50 πμ) apergos: in Setup.php there is a stanza [07:57:15] (10:54:56 πμ) apergos: if ( $wgUseInstantCommons ) { [07:57:33] which just adds the foreignapirepo for commons for you [07:57:52] http://paste.openstack.org/show/38931/ :D [07:58:07] the advantage is it's shorter and it is kept up to date with things like the https check pasted above [07:58:12] which is why I like to use it [07:59:00] um did you mean to repaste the same link?
[07:59:17] I forgot I pasted it already haha [07:59:41] so one potential issue is that the beta commonswiki has no wgForeignFileRepos (it gives array() ) [07:59:42] so when mw looks for an image it checks the local repo (local db for the row) then checks the next repo in the array = foreigndb which means it checks that db for a row [08:00:02] that's fine, beta commonswiki should only have the local repo [08:00:40] ... then it would check the third repo = api repo, which means it would request the file descr page via api [08:00:50] and if that fails it will whine and say there's nothing to be found [08:01:53] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.003840208054 secs [08:01:54] * hashar looks at production [08:02:32] what foreign repo would commons have? it should only retrieve files from local... [08:03:05] so in prod, the wikis lookup via the database [08:03:14] and commons obviously does not lookup anything [08:03:30] on beta I was wondering whether we should set the wiki to lookup the local database ( ForeignDBViaLBRepo ) [08:03:36] commons will of course look up file description pages in the local table [08:03:37] and have the beta commons fall back to the production commons [08:03:40] which is what we do [08:04:03] no, I don't think we want any foreigndb for commons [08:04:14] on beta, it should check its local db file table and that's it [08:04:43] the other wikis can have instant commons as their third repo which covers them [08:05:14] ok [08:05:19] but then there are only two repos on the wikis [08:05:25] oh maybe 3 sorry [08:05:26] three [08:05:33] their local one being the first ? [08:05:35] local, beta commons, "real" commons [08:05:36] yep [08:06:33] Change abandoned: Hashar; "Per discussion with Ariel, this is no more needed, beta is properly configured right now. Wikis loo..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62606 [08:06:34] thank you ! [08:06:40] sure! [08:33:05] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.001609921455 secs [08:36:04] New patchset: Hashar; "beta: wgAbuseFilterCentralDB = labswiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69461 [08:37:13] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69461 [08:58:10] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [09:07:38] New review: Akosiaris; "Apart from the inline comments i also noticed the following. After commenting the first line with th..." [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [09:36:27] New patchset: Hashar; "beta: properly override $wgAbuseFilterCentralDB" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69465 [09:36:34] New review: Hashar; "moved to InitialiseSettings-labs.php with https://gerrit.wikimedia.org/r/69465" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69461 [09:37:09] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69465 [09:49:16] New review: Hashar; "and that does not work, need to use a wmg setting." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69461
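The lookup order hashar and apergos settle on above (local file table first, then the beta-commons DB repo, then production commons via InstantCommons) is a plain first-match-wins chain. A toy sketch of just that ordering, with illustrative names only; the real thing is MediaWiki's PHP LocalRepo / ForeignDBViaLBRepo / ForeignAPIRepo classes, the last one added by the Setup.php $wgUseInstantCommons stanza quoted earlier:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

interface FileRepo {
    // Returns e.g. a file description page URL if this repo has the title.
    Optional<String> findFile(String title);
}

class RepoChain {
    private final List<FileRepo> repos;

    // e.g. new RepoChain(localRepo, betaCommonsDbRepo, productionApiRepo)
    RepoChain(FileRepo... repos) {
        this.repos = Arrays.asList(repos);
    }

    // Each repo is consulted in order; the first hit wins, so a local file
    // shadows one on beta commons, which shadows production commons. If all
    // three miss, MediaWiki "will whine and say there's nothing to be found".
    Optional<String> findFile(String title) {
        for (FileRepo repo : repos) {
            Optional<String> found = repo.findFile(title);
            if (found.isPresent()) {
                return found;
            }
        }
        return Optional.empty();
    }
}
```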
[09:55:09] New patchset: Hashar; "beta: $wmg for AbuseFilterCentralDB" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69466 [09:56:11] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69466 [09:59:26] !log hashar synchronized wmf-config/CommonSettings.php '{{gerrit|69466}} beta: $wmg for AbuseFilterCentralDB' [09:59:35] Logged the message, Master [10:00:00] !log hashar synchronized wmf-config '{{gerrit|69466}} beta: $wmg for AbuseFilterCentralDB' [10:00:10] Logged the message, Master [10:10:27] New patchset: Hashar; "beta: enable global abuse filters on all wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69469 [10:11:07] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69469 [10:16:16] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours [10:29:15] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: No successful Puppet run in the last 10 hours [11:03:50] New review: Daniel Kinzler; "(1 comment)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/65443 [11:39:22] New patchset: QChris; "Take advantage of hook-bugzillas new event mechanism" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69475 [11:39:22] New patchset: QChris; "Turn off hooks-bugzilla legacy event handling" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69476 [11:40:52] New review: QChris; "Bringing this in does not hurt older hooks-bugzilla plugins." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69475 [11:42:30] New review: QChris; "This change should only be merged once the new hooks-bugzilla is in place, as it disables adding bugzil..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/69476 [11:58:20] New patchset: Nemo bis; "Add User:Odder to English Planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69482 [12:28:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:29:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.161 second response time [12:34:40] New review: Ottomata; "(1 comment)" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [12:36:00] New patchset: Ottomata; "Initial commit of Kafka Puppet module for Apache Kafka 0.8" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [12:51:37] New patchset: Odder; "Adding a couple of blogs from [[m:Planet Wikimedia]]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69488 [12:59:48] New review: Nemo bis; "I already submitted I882f7f5a" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69488 [13:16:41] New review: Odder; "I15ea12d adds a couple of blogs at the same time."
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/69482 [13:20:39] @notify binasher [13:20:39] I'll let you know when I see binasher around here [13:23:26] Change abandoned: Nemo bis; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69482 [13:50:04] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [13:50:04] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:04] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:04] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:04] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:05] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:05] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:06] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [13:50:06] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:07] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:07] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [13:51:06] !log powering down carbon to replace bad disk [13:51:15] Logged the message, Master [13:52:54] PROBLEM - Host carbon is DOWN: CRITICAL - Host Unreachable (208.80.154.10) [13:54:58] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68829 [14:07:42] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [14:18:22] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [14:22:22] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours [14:35:22] !log powering down mw1171 to replace hard drive and reinstalling [14:35:30] Logged the message, Master [14:54:09] !log update & reboot boron [14:54:18] Logged the message, Master [14:57:04] !log powering down and relocating searchidx1001 [14:57:13] Logged the message, Master [14:59:03] RECOVERY - Host mw1171 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [14:59:53] PROBLEM - Host searchidx1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:11] !log authdns update [15:01:13] PROBLEM - RAID on mw1171 is CRITICAL: Connection refused by host [15:01:19] Logged the message, Master [15:01:23] PROBLEM - SSH on mw1171 is CRITICAL: Connection refused [15:01:33] PROBLEM - Disk space on mw1171 is CRITICAL: Connection refused by host [15:01:33] PROBLEM - twemproxy process on mw1171 is CRITICAL: Connection refused by host [15:01:43] PROBLEM - Apache HTTP on mw1171 is CRITICAL: Connection refused [15:02:03] PROBLEM - DPKG on mw1171 is CRITICAL: Connection refused by host [15:06:13] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:06:53] New patchset: Hashar; "vary wmerrors.ini 'fatal_log_file' per realm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [15:06:58] apergos: ^^^ [15:07:03] RECOVERY - Disk space on mc15 is OK: DISK OK [15:07:20] New review: Hashar; "added missing 'include role::applicationserver::configuration' :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [15:07:50] New review: Hashar; "tested on labs:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [15:13:30] PROBLEM - NTP on mw1171 is CRITICAL: NTP CRITICAL: No response from NTP server [15:14:07] New review: Hashar; "Got tested with Ariel on labs. Both of us have to leave so we will get that merged and deployed tomo..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [15:14:20] RECOVERY - SSH on mw1171 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [15:37:28] RECOVERY - Host searchidx1001 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [15:40:18] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.004 second response time [15:43:56] New patchset: Andrew Bogott; "Move ldap into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [15:47:18] RECOVERY - Disk space on mw1171 is OK: DISK OK [15:47:28] RECOVERY - DPKG on mw1171 is OK: All packages OK [15:47:38] RECOVERY - RAID on mw1171 is OK: OK: no RAID installed [15:48:28] RECOVERY - NTP on mw1171 is OK: NTP OK: Offset -0.05871582031 secs [15:49:21] PROBLEM - Apache HTTP on mw1171 is CRITICAL: Connection refused [15:52:28] New patchset: Andrew Bogott; "Fix string continuation and a few pep8 issues." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69496 [15:53:04] New patchset: Andrew Bogott; "Fix string continuation and a few pep8 issues." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69496 [15:54:38] New review: Andrew Bogott; "Quick merge so that future patches don't get blocked." 
[operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/69496 [15:54:38] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69496 [15:55:03] New review: Andrew Bogott; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [15:57:18] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [16:01:54] New patchset: Reedy; "(bug 49639) Change local time zone for ko.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68947 [16:02:14] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68947 [16:07:24] RECOVERY - Host search1004 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [16:07:25] RECOVERY - Host search1003 is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms [16:07:25] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [16:07:25] PROBLEM - search indices - check lucene status page on search1004 is CRITICAL: Connection refused [16:07:34] PROBLEM - search indices - check lucene status page on search1001 is CRITICAL: Connection refused [16:07:35] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:07:35] RECOVERY - Host search1006 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [16:07:44] PROBLEM - Lucene on search1004 is CRITICAL: Connection refused [16:07:44] RECOVERY - Host search1005 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:07:44] PROBLEM - Lucene on search1002 is CRITICAL: Connection refused [16:07:44] PROBLEM - search indices - check lucene status page on search1003 is CRITICAL: Connection refused [16:07:54] PROBLEM - search indices - check lucene status page on search1002 is CRITICAL: Connection refused [16:07:54] PROBLEM - NTP on search1004 is CRITICAL: NTP CRITICAL: Offset unknown [16:10:24] PROBLEM - search indices - check lucene status page on search1005 is CRITICAL: Connection refused [16:10:34] PROBLEM - search indices - check lucene status page on search1006 is CRITICAL: Connection refused [16:11:35] RECOVERY - twemproxy process on mw1171 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [16:11:55] RECOVERY - NTP on search1004 is OK: NTP OK: Offset -0.01002562046 secs [16:13:14] PROBLEM - Lucene on search1006 is CRITICAL: Connection refused [16:13:24] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [16:13:25] PROBLEM - Lucene on search1005 is CRITICAL: Connection refused [16:13:25] PROBLEM - Lucene on search1003 is CRITICAL: Connection refused [16:15:44] RECOVERY - Puppet freshness on mw1171 is OK: puppet ran at Wed Jun 19 16:15:39 UTC 2013 [16:38:04] RECOVERY - Host search1013 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:38:04] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:38:14] RECOVERY - Host search1012 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [16:38:14] RECOVERY - Host search1009 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [16:38:14] RECOVERY - Host search1008 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:38:14] RECOVERY - Host search1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:38:14] RECOVERY - Host search1014 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:38:15] RECOVERY - Host search1018 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [16:38:15] RECOVERY - Host search1021 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:38:16] RECOVERY - Host 
search1016 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [16:38:16] RECOVERY - Host search1019 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:38:17] RECOVERY - Host search1010 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [16:38:24] PROBLEM - search indices - check lucene status page on search1009 is CRITICAL: Connection refused [16:38:24] PROBLEM - Lucene on search1016 is CRITICAL: Connection refused [16:38:24] RECOVERY - Host search1017 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [16:38:24] RECOVERY - Host search1020 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [16:38:24] RECOVERY - Host search1011 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [16:38:34] PROBLEM - Lucene on search1014 is CRITICAL: Connection refused [16:38:34] PROBLEM - Lucene on search1017 is CRITICAL: Connection refused [16:38:34] PROBLEM - NTP on search1019 is CRITICAL: NTP CRITICAL: Offset unknown [16:38:37] rise search rise! [16:38:44] PROBLEM - NTP on search1018 is CRITICAL: NTP CRITICAL: Offset unknown [16:39:44] RECOVERY - NTP on search1018 is OK: NTP OK: Offset -0.008357286453 secs [16:40:24] PROBLEM - search indices - check lucene status page on search1008 is CRITICAL: Connection refused [16:41:05] PROBLEM - search indices - check lucene status page on search1010 is CRITICAL: Connection refused [16:41:05] PROBLEM - search indices - check lucene status page on search1011 is CRITICAL: Connection refused [16:41:05] PROBLEM - search indices - check lucene status page on search1013 is CRITICAL: Connection refused [16:41:14] PROBLEM - search indices - check lucene status page on search1007 is CRITICAL: Connection refused [16:41:14] PROBLEM - search indices - check lucene status page on search1014 is CRITICAL: Connection refused [16:41:14] PROBLEM - search indices - check lucene status page on search1012 is CRITICAL: Connection refused [16:42:34] RECOVERY - NTP on search1019 is OK: NTP OK: Offset -0.01334488392 secs [16:43:24] PROBLEM - Lucene on search1019 is CRITICAL: Connection refused [16:43:34] PROBLEM - Lucene on search1012 is CRITICAL: Connection refused [16:43:34] PROBLEM - Lucene on search1021 is CRITICAL: Connection refused [16:43:34] PROBLEM - Lucene on search1018 is CRITICAL: Connection refused [16:43:34] PROBLEM - Lucene on search1015 is CRITICAL: Connection refused [16:43:34] PROBLEM - Lucene on search1020 is CRITICAL: Connection refused [16:43:35] PROBLEM - Lucene on search1008 is CRITICAL: Connection refused [16:43:35] PROBLEM - Lucene on search1013 is CRITICAL: Connection refused [16:43:36] PROBLEM - Lucene on search1009 is CRITICAL: Connection refused [16:43:54] PROBLEM - Lucene on search1010 is CRITICAL: Connection refused [16:43:55] PROBLEM - Lucene on search1007 is CRITICAL: Connection refused [16:43:55] PROBLEM - Lucene on search1011 is CRITICAL: Connection refused [16:51:39] looks like zero is not deploying today, so if anyone needs it for lightning depl [16:52:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:00] yurik: oh, good to know, thanks! 
[16:53:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.158 second response time [17:06:37] RECOVERY - Host search1024 is UP: PING OK - Packet loss = 0%, RTA = 1.26 ms [17:06:37] RECOVERY - Host search1022 is UP: PING OK - Packet loss = 0%, RTA = 1.36 ms [17:06:37] RECOVERY - Host search1023 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [17:06:37] PROBLEM - search indices - check lucene status page on search1023 is CRITICAL: Connection refused [17:06:37] PROBLEM - search indices - check lucene status page on search1024 is CRITICAL: Connection refused [17:06:56] PROBLEM - Lucene on search1023 is CRITICAL: Connection refused [17:07:26] PROBLEM - Lucene on search1022 is CRITICAL: Connection refused [17:12:06] PROBLEM - Lucene on search1024 is CRITICAL: Connection refused [17:13:07] PROBLEM - RAID on analytics1019 is CRITICAL: Timeout while attempting connection [17:14:16] PROBLEM - Host analytics1019 is DOWN: PING CRITICAL - Packet loss = 100% [17:18:51] mutante, Leslie fixed the netboot issue, i think I need to allow disk booting in the bios again, right? [17:19:01] ottomata: correct, and ..correct [17:19:24] ok, loading bios... [17:19:24] yea, i had it disabled just to see more of the error instead of always ending up in OS [17:19:29] yeah [17:19:34] i think it reinstalled all just fine! [17:19:38] cool [17:20:46] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [17:22:47] RECOVERY - SSH on analytics1020 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:22:56] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [17:24:46] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [17:25:03] are you doing 1019 as well? [17:26:26] puppet design (roles / profiles) video https://www.youtube.com/watch?v=ZpHtOnlSGNY [17:27:16] PROBLEM - DPKG on analytics1019 is CRITICAL: Connection refused by host [17:27:36] PROBLEM - Disk space on analytics1019 is CRITICAL: Connection refused by host [17:27:37] PROBLEM - SSH on analytics1019 is CRITICAL: Connection refused [17:30:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [17:38:05] yo dudes [17:38:08] ever seen this on a fresh reinstall initial puppet run? 
[17:38:08] err: Could not retrieve catalog from remote server: Server hostname 'sockpuppet.pmtpa.wmnet' did not match server certificate; expected sockpuppet.pmtpa.wmnet [17:38:38] RECOVERY - search indices - check lucene status page on search20 is OK: HTTP OK: HTTP/1.1 200 OK - 60075 bytes in 0.115 second response time [17:38:57] PROBLEM - NTP on analytics1019 is CRITICAL: NTP CRITICAL: No response from NTP server [17:47:41] RECOVERY - SSH on analytics1019 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:52:00] PROBLEM - DPKG on mc15 is CRITICAL: Timeout while attempting connection [17:52:38] RECOVERY - DPKG on analytics1020 is OK: All packages OK [17:52:47] RECOVERY - DPKG on mc15 is OK: All packages OK [17:52:57] RECOVERY - Disk space on analytics1020 is OK: DISK OK [17:55:37] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [18:00:28] RECOVERY - DPKG on analytics1019 is OK: All packages OK [18:00:37] RECOVERY - Disk space on analytics1019 is OK: DISK OK [18:08:41] RECOVERY - NTP on analytics1020 is OK: NTP OK: Offset -0.01051580906 secs [18:14:16] !log nikerabbit synchronized php-1.22wmf6/extensions/UniversalLanguageSelector/ 'ULS to master' [18:14:25] Logged the message, Master [18:16:25] RECOVERY - NTP on analytics1019 is OK: NTP OK: Offset -0.01820898056 secs [18:17:27] <^demon> manybubbles: So, did my update rewrite totally break things for you? :) [18:17:53] well I just merged it and the merge was clean. that is something. [18:18:03] but it looks like batch indexing doesn't work [18:18:07] but I can fix that [18:18:26] <^demon> Hmm, worked for me locally. [18:18:49] <^demon> I was curious though...there's already an updateSearchIndex script in core. I'm wondering if we could make it backend-agnostic enough to where it would work for us too. [18:18:56] <^demon> Less codepaths to maintain then. [18:19:45] we might be able to [18:19:58] <^demon> forceSolrIndex kind of annoys me right now since it doesn't use the SearchUpdate codepath. [18:20:13] it'd be worth looking at but we'll have to make sure we can still do the time updates [18:20:22] fair enough - it was before I knew about it [18:20:47] I'm happy so long as we essentially have the same queries powering the updates (and deletes) that can be done in batches. [18:21:38] New patchset: Dzahn; "Adding a couple of blogs from [[m:Planet Wikimedia]]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69488 [18:22:45] ^demon: got it. I just hadn't pulled mediawiki recently [18:22:51] <^demon> :) [18:22:52] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69488 [18:23:01] <^demon> Yeah, it depended on that core change. [18:24:41] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:25:31] RECOVERY - Disk space on mc15 is OK: DISK OK [18:25:47] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/67203 [18:28:07] !log sync-apache and graceful for votewiki rewrites (change 67203) [18:28:21] Logged the message, Master [18:29:33] New patchset: Asher; "pull db64, bad bbu" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69511 [18:30:00] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69511 [18:30:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:37] !log: taking down db64 to replace Raid Controller battery [18:31:26] !log asher synchronized wmf-config/db-pmtpa.php 'pull db64' [18:31:36] Logged the message, Master [18:32:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [18:33:32] PROBLEM - Host db64 is DOWN: PING CRITICAL - Packet loss = 100% [18:34:07] !log nikerabbit synchronized php-1.22wmf7/extensions/UniversalLanguageSelector/ 'ULS to master' [18:34:15] Logged the message, Master [18:34:33] !log updated Parsoid to ddc47d8 [18:34:43] Logged the message, Master [18:35:22] New review: Dzahn; "deployed and graceful'ed" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/67203 [18:38:38] RECOVERY - Host db64 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [18:38:39] heh, gwicke was harassed IRL right after he deployed [18:38:59] "harassed" meaning "interrupted and talked to" [18:41:18] PROBLEM - mysqld processes on db64 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:41:26] !log nikerabbit synchronized php-1.22wmf6/extensions/UniversalLanguageSelector/ 'ULS to master again' [18:41:34] Logged the message, Master [18:43:50] PROBLEM - Varnish HTTP parsoid-backend on cerium is CRITICAL: Connection refused [18:45:40] Oh heh [18:45:44] Forgot to start it back up [18:46:48] RECOVERY - Varnish HTTP parsoid-backend on cerium is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.001 second response time [18:53:28] PROBLEM - NTP on db64 is CRITICAL: NTP CRITICAL: Offset unknown [18:55:42] New patchset: Ottomata; "Including role::analytics::hadoop::worker on analytics1019 and analytics1020" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69518 [18:57:28] RECOVERY - NTP on db64 is OK: NTP OK: Offset -0.0001238584518 secs [18:57:36] New patchset: Ottomata; "Including role::analytics::hadoop::worker on analytics1019 and analytics1020" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69518 [18:58:27] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69518 [18:58:49] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [19:00:15] New patchset: Ottomata; "Adding role/analytics/hive,pig,sqoop.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69521 [19:00:39] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69521 [19:11:12] PROBLEM - Host mc1015 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:39] RECOVERY - Host mc1015 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:14:53] New patchset: Ottomata; "Fixing pig.properties.erb comment" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69523 [19:15:05] Change merged: Ottomata; [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69523 
[19:38:57] ottomata: hey [19:39:08] I'm not sure I understand exactly the situation with the group.id [19:39:28] and I'm more curious than reviewing at this point :) [19:40:11] aye so [19:40:30] a consumer has an associated consumer group [19:40:55] ok [19:41:02] when it consumes data from kafka, it keeps a high water mark stored in zookeeper under the name of the consumer group [19:41:03] so [19:41:04] why can't it be a variable set by the manifest? [19:41:14] it could, but what would we set it to? [19:41:30] a consumer is not a single instance daemon [19:41:39] set it in ::defaults as "test-consumer-group"? [19:41:41] there could be any number of consumers running on any number of boxes [19:41:47] sure i don't mind that [19:41:54] in practice [19:41:56] what will we do? [19:42:02] in practice, probably not even use that file [19:42:09] except for testing [19:42:16] so, what would our file be like? [19:42:25] not all consumers use that file :) [19:42:27] and [19:42:31] for those that ship with kafka [19:42:49] uh, there are two main CLI consumers that ship with kafka [19:42:51] one of them uses that file [19:42:57] the other just accepts a bunch of CLI args [19:43:28] what will we use? [19:43:30] none of the consumers that use that file are good for importing data into HDFS (or into storm), which is what we mainly use [19:43:55] we aren't exactly sure what consumer we're going to use yet, right now we're using this hacky 3rd party thing called kafka-hadoop-consumer [19:43:55] but [19:44:03] if we import straight from kafka into hdfs [19:44:06] (no storm) [19:44:10] we will probably use Camus [19:44:23] https://github.com/linkedin/camus [19:44:39] moar components! [19:44:43] hehe yup! [19:44:44] maybe [19:44:52] use sartre as middleware [19:44:55] we haven't tried it yet, but it is probably the best at what it does [19:45:01] haha [19:45:26] so wait [19:45:35] we'll use librdkafka as a producer and camus as a consumer [19:45:41] most likely, yes [19:45:42] Wikimedia Platform operations, serious stuff | Log: https://wikitech.wikimedia.org/wiki/Server_Admin_Log | Channel logs: https://goo.gl/mDoFg | MediaWiki errors: http://tinyurl.com/n3twd8k | on RT duty: RobH [19:45:45] so the kafka package & module is for brokers? [19:45:46] err [19:45:54] mostly, yes [19:46:24] but we will use a group.id anyway, right? [19:46:40] we will set one, yes, but it won't be read out of the consumer.properties file [19:46:50] and you set the group.id only in the consumer? [19:46:54] yes [19:47:05] because multiple consumers can consume the same topic [19:47:10] so you need multiple high water marks for each one [19:47:18] uhhh a high water mark for each one [19:47:32] multiple high water marks per topic, IDed by group.id [19:47:35] we can still use the CLI for testing though, right? [19:47:37] yes [19:47:41] so that would read this [19:55:58] paravoid? was there more you were going to say there?
[19:56:23] he reached his high-water mark for the topic [19:56:31] haha [19:56:47] * YuviPanda awards ori-l 30 internet points [19:57:32] from what I understand, I think it's either not useful at all so it shouldn't be shipped or it should be shipped and have the group.id customizable so that it's useful as a file when you want to use the cli consumer [19:57:40] with a preference for the latter [19:57:57] but if its use is so limited, I don't think it matters that much, I'll agree to that :) [20:02:29] i don't mind setting a default at all actually [20:02:34] probably no one will ever change i [20:02:35] t [20:02:36] but might as well [20:05:03] New patchset: Ottomata; "Initial commit of Kafka Puppet module for Apache Kafka 0.8" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [20:11:08] RECOVERY - Puppet freshness on analytics1020 is OK: puppet ran at Wed Jun 19 20:11:00 UTC 2013 [20:16:47] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours [20:18:37] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:54] robh: I don't mean to be impatient, but do you need anything else on rt 5274? [20:22:36] uhhh, nope. [20:22:46] im just not back to that queue yet today [20:23:00] i did it yesterday, today im on ops-requests, but i'll jump back and handle your stuff no worries [20:23:04] cuz it blocks you doing real shit. [20:23:16] not all the real shit - just something I want to do [20:31:51] New review: Aaron Schulz; "I'd prefer a more modest increase, like to 100." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66665 [20:32:07] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [20:33:12] I failed to delete Remote.jpg on enwiki, getting "Error deleting file: A non-identical file already exists at "mwstore://local-swift/local-deleted/o/5/5/o55tq356m6n0lea0f7ravhzvaive5ld.jpg"." [20:33:23] " Exception from line 114 of /usr/local/apache/common-local/php-1.22wmf7/includes/upload/UploadStash.php: UploadStash::getFile No user is logged in, files must belong to users" [20:33:34] why O why must this go in the logs? [20:33:57] !log analytics1019 and analytics1020 are fully puppetized. rebalancing them back into the hadoop cluster [20:34:06] Logged the message, Master [20:34:23] ottomata: so what was it with an1020? [20:34:46] the issue [20:34:51] ah, it was deleted... [20:34:58] the error message was fubar I assume [20:35:50] paravoid: I don't know the exact details, but apparently whichever tftp client was being used chooses to communicate on the same udp port number that the daemon sends from, or something like that [20:35:57] which can be random [20:36:03] so the ACL firewall was blocking the traffic [20:36:07] something along those lines [20:36:11] LeslieCarr knows more [20:36:14] tftp uses random src & dst ports [20:36:17] http://i.imgur.com/AiI58RH.png [20:36:35] you can't open it on the firewall unless you do l7 sniffing [20:36:39] that was it all along? [20:36:46] yup [20:36:52] lol [20:37:16] working just fine now :) [20:37:23] the puppetization worked perfectly [20:37:33] i made sure the data partitions were formatted and mounted [20:37:34] ran puppet [20:37:39] and tada! new datanode. [20:37:47] good!
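For anyone who hasn't hit this before, the behaviour LeslieCarr describes above is baked into TFTP itself (RFC 1350): only the initial request is addressed to udp/69, and the server then answers from a freshly chosen transfer ID, i.e. a random source port, which is exactly what a stateless ACL can't anticipate. A minimal sketch of the client side, with a hypothetical server name, that makes the port hop visible:

```java
import java.io.ByteArrayOutputStream;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class TftpTidDemo {
    public static void main(String[] args) throws Exception {
        // RFC 1350 read request (opcode 1): | 00 01 | filename | 00 | mode | 00 |
        ByteArrayOutputStream rrq = new ByteArrayOutputStream();
        rrq.write(0); rrq.write(1);                                    // opcode RRQ
        rrq.write("pxelinux.0".getBytes("US-ASCII")); rrq.write(0);    // file to fetch
        rrq.write("octet".getBytes("US-ASCII")); rrq.write(0);         // transfer mode

        DatagramSocket sock = new DatagramSocket();                    // our random TID
        InetAddress server = InetAddress.getByName("tftp.example.org"); // hypothetical host
        sock.send(new DatagramPacket(rrq.toByteArray(), rrq.size(), server, 69));

        byte[] buf = new byte[516];                                    // 4-byte header + 512-byte block
        DatagramPacket data = new DatagramPacket(buf, buf.length);
        sock.receive(data);

        // The first DATA packet comes back from a *new* random port (the
        // server's transfer TID), not from 69 -- so an ACL that only permits
        // udp/69 lets the request through and then drops the whole transfer.
        System.out.println("server replied from port " + data.getPort());
    }
}
```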
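And to make the earlier group.id thread concrete, here is roughly what a Kafka 0.8 high-level consumer looks like; this is the code path that keeps the per-group high water mark in ZooKeeper (under /consumers/&lt;group.id&gt;/offsets/&lt;topic&gt;/&lt;partition&gt;). The topic name and ZooKeeper address below are made up for illustration:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class GroupIdDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1.example.org:2181"); // hypothetical ensemble
        // Two consumers with different group.ids each see the full topic;
        // two with the same group.id split the partitions between them.
        props.put("group.id", "test-consumer-group");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // One stream for a hypothetical "webrequest" topic.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("webrequest", 1));

        ConsumerIterator<byte[], byte[]> it = streams.get("webrequest").get(0).iterator();
        while (it.hasNext()) {
            // As messages are consumed, the group's offset (the "high water
            // mark" discussed above) is committed back to ZooKeeper.
            System.out.println(new String(it.next().message()));
        }
    }
}
```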
[20:40:19] New patchset: RobH; "RT 5274 shell access for nik everett on search systems" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69569 [20:40:27] "We just completed the deployment of Kafka 0.8 (latest code in 0.8 branch) to production at LinkedIn yesterday. No major issues were found." junrao @ linkedin [20:41:24] I filed it at http://i.imgur.com/AiI58RH.png [20:41:45] dunno if mwstore is inconsistent now [20:43:20] yeah, i opened up ephemeral ports between carbon/brewster and analytics [20:43:25] New review: RobH; "I forgot to change my tabs to spacing, must conform to own standard, rejecting until i fix" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/69569 [20:43:40] and cursed tftpd -- some implementations let you set the source port to be fixed , but not atftpd as far as i could tell [20:45:26] New patchset: RobH; "RT 5274 shell access for nik everett on search systems" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69569 [20:45:35] LeslieCarr: We can certainly change what tftpd we use if it helps. [20:45:59] LeslieCarr: mcast-port? [20:46:01] we only run it on three different servers, so rolling a puppet change is easy enough [20:46:13] and all those servers are opsen impact only services [20:47:53] tcp port [20:48:54] well, there is --port, --mcast-port, --mtftp-port [20:49:08] i mean udp port [20:49:10] stupid tftp [20:49:10] dunno what you're referring to by "source" in this context [20:49:40] mcast-port is "Specify the UDP port to use for multicast transfer." per man [20:50:25] yeah, but we're using unicast. the source port from the tftp server [20:51:10] LeslieCarr: The "port the mtftp server shall listen to for incomming request"? [20:51:21] nope, also, not using multicast [20:51:46] not the port the server is listening on, the source port when it is sending data [20:51:55] LeslieCarr: aint it --port then? [20:52:07] which is the bloody port it listens to [20:52:09] that's the port the server is listening on [20:52:11] New review: RobH; "im not self reviewing, this is one of my latent personalities." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/69569 [20:52:12] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69569 [20:52:15] yeah, that port is fixed [20:52:24] i want the source port when it's transferring to be fixed [20:52:31] ah [20:53:46] LeslieCarr: have you tried using --no-multicast but still try to use the mcast params? [20:53:59] why would that work ? [20:55:39] LeslieCarr: http://www.dinodigusa.com/images/Magic1.gif [20:56:00] !log updating OpenStackManager on virt0 to 3e92211313ed8b73ee44fb395230bafb14dc9bdc [20:56:08] Logged the message, Master [20:57:45] LeslieCarr: I myself have only used tftpd-hpa [21:04:59] !log rebooting tridge due to updates [21:05:07] Logged the message, RobH [21:16:40] PROBLEM - DPKG on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:17:30] RECOVERY - DPKG on mc15 is OK: All packages OK [21:20:20] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:22:21] RECOVERY - Disk space on mc15 is OK: DISK OK [21:28:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69457 [21:31:07] The search eqiad cluster doesn't have any members in ganglia. that seems wrong.
[21:34:13] <^demon> Last week: http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&s=by+name&c=Search+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [21:36:01] New review: Dzahn; "deployed and works. you should have received a test mail. now reports like this:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69457 [21:40:23] on a similar note - has anyone done any digging on how to get stats from JVM applications into Ganglia? It'd be nice to have graphs of some of the juicy stuff solr keeps about itself. [21:41:14] New review: RobH; "mortals means deployment, and yes this kind of thing needs an RT ticket." [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/69313 [21:42:09] New review: RobH; "ie: dont use mortals if you just want bastions, use restricted if you want to put into a group, and ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69313 [21:43:05] <^demon> manybubbles: No, it'd be nice for gerrit too. [21:44:18] <^demon> Hmm, https://github.com/ganglia/jmxetric [21:48:48] jmxtrans also seems nice [21:49:04] and it seems to have a command to build a deb file [21:49:30] the analytics team has a card scheduled to debianize jmxtrans [21:49:38] New patchset: RobH; "RT 5337 stats.wikimedia.org ssl cert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69591 [21:50:10] manybubbles: talk to ottomata about jmxtrans and ganglia [21:50:36] we use it to instrument hadoop [21:50:38] yeah, manybubbles :) we are using that, I had built a .deb once (although it wouldn't pass review as is) [21:50:54] also, i have a really nice puppet module for generating jmxtrans json files [21:51:02] https://github.com/wikimedia/puppet-jmxtrans [21:51:04] ottomata: cool! [21:51:05] also hasn't been through review yet [21:51:08] but it will! [21:51:12] indeed :) [21:52:10] ottomata: what about jmxtrans' own debian build? I'm just curious why we can't use it. [21:53:01] oh maybe we can? [21:53:21] it's been a while since I looked at upstream there, hmm, I think I built my own at the time…maybe not [21:54:14] oh nope, I did use theirs before [21:54:15] it looks like it built me a .deb file [21:54:42] cool! [21:54:45] ohhh sorry i meant we have a puppetize jmxtrans card not a debianize card [21:54:52] https://mingle.corp.wikimedia.org/projects/analytics/cards/133 [21:55:47] cool! I'd love to be able to use this with Solr and all the other JVM things. [21:56:17] paravoid: funny how many bogus listings in swift :) [21:56:42] I think this came up before, though [22:00:05] I remember, you were interested in the list of files [22:00:22] they all seemed to be ones already in the deleted container though [22:01:04] so one interesting thing: solr generates statistics on a per collection basis so we'd have a pile per wiki per server [22:02:05] now that I think about it they might be quite difficult to enumerate up front....
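For the JVM-stats question: what jmxetric and jmxtrans both boil down to is polling JMX MBeans and forwarding the readings to Ganglia (jmxtrans via the JSON query files the puppet module above generates). A minimal sketch of the polling half, assuming a hypothetical Solr host started with JMX remoting enabled on port 9999; the per-core Solr bean name in the comment varies by version, so treat it as illustrative:

```java
import java.lang.management.MemoryMXBean;
import javax.management.JMX;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxPollDemo {
    public static void main(String[] args) throws Exception {
        // Target JVM needs -Dcom.sun.management.jmxremote.port=9999 (etc.) set.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://solr1001.example.org:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // java.lang:type=Memory is a standard bean every JVM exposes.
            MemoryMXBean mem = JMX.newMXBeanProxy(
                    mbsc, new ObjectName("java.lang:type=Memory"), MemoryMXBean.class);
            System.out.println("heap used: " + mem.getHeapMemoryUsage().getUsed());

            // Solr registers per-core beans too; names differ by version,
            // e.g. something like "solr/corename:type=searcher,id=...".
        } finally {
            connector.close();
        }
    }
}
```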
[22:12:11] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: No route to host [22:12:41] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: No route to host [22:12:46] gah, i thought search was muted [22:13:20] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: No route to host [22:13:45] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: No route to host [22:13:49] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: No route to host [22:14:26] * AaronSchulz hopes 0d5ab249ac87dd07db338af26af7a720cf946426 fixed those DB select errors on db64 [22:14:47] oh the downtimes expired [22:14:57] notpeter: do you know when the search moves will be all done ? [22:15:14] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: No route to host [22:22:39] paravoid: hi Faidon :) [22:22:50] paravoid: saw your comments on #68711 [22:24:27] LeslieCarr: not sure. will take a look tonight after velocity [22:24:28] paravoid: I wanted to ask why you consider shlibs not to be necessary. You said this here http://goo.gl/ePsSP and here http://goo.gl/viQfn [22:30:17] cool, i muted for 2 more days notpeter [22:37:05] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [22:49:26] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:54:37] New review: Aklapper; "Thanks, the changed part of the weekly mail looks good and the numbers for components fit to the res..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69457 [23:03:56] PROBLEM - Redis on mc15 is CRITICAL: Connection timed out [23:04:46] RECOVERY - Redis on mc15 is OK: TCP OK - 0.027 second response time on port 6379 [23:28:22] New patchset: AzaToth; "adding gitreview" [operations/debs/ircd-ratbox] (wmf) - https://gerrit.wikimedia.org/r/69607 [23:28:22] New patchset: AzaToth; "flush patches through gbp-pq to normalize them" [operations/debs/ircd-ratbox] (wmf) - https://gerrit.wikimedia.org/r/69608 [23:28:22] New patchset: AzaToth; "adding gbp.conf" [operations/debs/ircd-ratbox] (wmf) - https://gerrit.wikimedia.org/r/69609 [23:28:23] New patchset: AzaToth; "adding patch to disable non-ops channel creation" [operations/debs/ircd-ratbox] (wmf) - https://gerrit.wikimedia.org/r/69610 [23:28:27] Ryan_Lane: there [23:28:56] added it as a config channel.create_ops_only [23:28:59] awesome :) [23:29:02] default false [23:29:12] I'll need to get mark to test this [23:29:13] I removed the voice code as it didn't make any sense [23:29:25] why didn't it make sense? [23:29:33] it's meant to not allow non-ops to talk in channels [23:29:41] if you can't create any channels, and no one will voice you anyway, the flood thingy will trigger correctly anyway [23:29:57] voice is done by chanserv [23:30:00] we don't have chanserv [23:30:35] "chanops and voiced can flood their own channel with impunity" [23:30:59] if no one will get voiced, and there cannot be any chanops due to ops_create_only [23:31:06] the voice change seemed unneeded [23:31:37] or am I missing something here? [23:31:59] Ryan_Lane: voice is done by the ircd usually [23:32:15] although most services packages will have a shortcut to do it [23:33:20] ah. ok. sorry
[23:33:24] just looked at the code again [23:34:08] yeah, I'd imagine this is fine [23:34:30] it built, but I've not tested it [23:34:37] * Ryan_Lane nods [23:34:52] I'll see if I can test it out properly [23:34:55] if not I'll get mark to do so [23:39:21] Ryan_Lane: the first three patches are pretty straightforward [23:39:33] they are ubuntu specific [23:39:46] so if they are unnecessary, who cares? :) [23:39:54] I meant the patchsets I made [23:39:59] ahh [23:40:00] ok [23:40:01] right [23:43:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:44:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [23:50:29] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [23:50:29] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [23:50:29] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [23:50:29] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [23:50:29] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [23:50:30] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [23:50:30] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [23:50:31] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [23:50:31] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [23:50:32] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
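A toy reading of the ircd-ratbox change AzaToth describes above, with invented names rather than anything from the actual C source: channel.create_ops_only (default false) gates channel creation on oper status, and since nothing in that setup ever hands out +o or +v, the "chanops and voiced can flood their own channel with impunity" exemption never fires, so the normal flood protection covers everyone:

```java
// Toy model only; class, field, and method names are hypothetical.
class ChannelPolicy {
    private final boolean createOpsOnly; // channel.create_ops_only, default false

    ChannelPolicy(boolean createOpsOnly) {
        this.createOpsOnly = createOpsOnly;
    }

    // Joining a channel that doesn't exist yet means creating it.
    boolean mayCreate(boolean userIsIrcOper) {
        return userIsIrcOper || !createOpsOnly;
    }

    // Flood limits skip chanops and voiced users; with create_ops_only on
    // and no chanserv granting +o/+v, this exemption is never reachable.
    boolean isFloodExempt(boolean isChanop, boolean isVoiced) {
        return isChanop || isVoiced;
    }
}
```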