[00:00:05] Hmm, this conference wifi must be crappy then [00:00:09] Sorry to bother you :( [00:03:49] binasher: http://people.apache.org/~joestein/kafka-0.8.0-beta1-candidate1/RELEASE_NOTES.html [00:04:17] andrewbogott_afk: why doesn't operations/debs/kafka have a master branch? [00:04:59] drdee: thx! [00:05:19] <^demon> RoanKattouw: Conference wifi is always crappy :) [00:05:32] blody difficult to login nowadays [00:05:42] getting logged out and complains about cookies [00:12:44] s/conference//g [00:16:00] It was actually good down in the hacker lounge [00:16:16] But I'm now up in the speaker lounge (the hacker lounge got too crowded) where it's worse [00:18:37] which conference are you at? [00:22:26] Open Source Bridge in Portland [00:24:32] RoanKattouw: ah, gotcha :) have fun [00:28:50] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: No successful Puppet run in the last 10 hours [00:56:18] !log updated Parsoid to 0445108 [00:56:27] Logged the message, Master [01:02:29] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.00465965271 secs [01:21:21] New patchset: Catrope; "Hopefully fix the Parsoid Varnishes not showing up as such in Ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69443 [01:26:23] New patchset: Mattflaschen; "Add remaining skins (Cologne Blue and Modern) to sync script." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69444 [01:32:04] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.003474831581 secs [01:49:52] !log catrope Started syncing Wikimedia installation... : Updating VisualEditor to master [01:50:03] Logged the message, Master [02:04:26] !log catrope Finished syncing Wikimedia installation... : Updating VisualEditor to master [02:04:35] Logged the message, Master [02:07:43] !log LocalisationUpdate completed (1.22wmf7) at Wed Jun 19 02:07:42 UTC 2013 [02:07:51] Logged the message, Master [02:14:07] !log LocalisationUpdate completed (1.22wmf6) at Wed Jun 19 02:14:07 UTC 2013 [02:14:16] Logged the message, Master [02:15:14] !log catrope synchronized php-1.22wmf6/resources/startup.js 'touch' [02:15:22] Logged the message, Master [02:15:37] !log catrope synchronized php-1.22wmf7/resources/startup.js 'touch' [02:15:45] Logged the message, Master [02:18:52] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jun 19 02:18:52 UTC 2013 [02:19:01] Logged the message, Master [02:56:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [03:28:16] New review: Krinkle; "(1 comment)" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/68135 [03:32:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:33:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [03:45:06] http://puppetlabs.com/security/cve/cve-2013-3567/ [03:49:21] New review: Ori.livneh; "(1 comment)" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/68135 [03:49:27] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:27] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:27] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:27] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [03:49:27] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:28] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:28] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:29] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:29] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:30] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [03:49:30] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [03:49:35] go go go icinga [04:17:56] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [04:21:57] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours [04:58:24] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [04:58:43] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [05:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:27:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [05:56:27] New patchset: Aklapper; "Add product name to component output in Weekly Bugzilla Report mail" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69457 [06:31:13] PROBLEM - search indices - check lucene status page on search20 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60051 bytes in 0.110 second response time [07:46:07] hello [07:47:43] apergos: here I am again :) [07:47:58] morning [07:53:09] apergos: I had a patch for beta to reenable commons as a foreign repo ( https://gerrit.wikimedia.org/r/#/c/62606/3 ) seems it is not needed anymore though [07:53:15] (related to instant commons) [07:53:37] ah yep [07:53:53] seems it is no more needed, though I have no idea how instant commons and foreign repo works together :( [07:53:57] ah [07:54:02] ok so this is easy [07:54:50] in Setup.php there is a stanza [07:55:59] and all it does is add the foreign api repo for you (but in case there are things like [07:56:09] doh disconnection :D [07:56:16] 'apibase' => WebRequest::detectProtocol() === 'https' ? [07:56:18] yeah [07:56:29] or other little changes that we don't see [07:56:38] it's good to use the version in Setup [07:56:47] the beta enwiki has two foreign file repos: http://paste.openstack.org/show/38931/ [07:57:04] one ForeignDBViaLBRepo which is the beta commons a [07:57:05] yes [07:57:08] I was saying that [07:57:14] and another one which is ForeignAPIRepo to production [07:57:15] 0:54:50 πμ) apergos: in Setup.php there is a stanza [07:57:15] (10:54:56 πμ) apergos: if ( $wgUseInstantCommons ) { [07:57:33] which just adds the foreignapirepo for commons for you [07:57:52] http://paste.openstack.org/show/38931/ :D [07:58:07] the advantage is it's shorter and it is kept up to date with things like the https check pasted above [07:58:12] which is why I like to use it [07:59:00] um did you mean to repaste the same link? [07:59:17] I forgot I pasted it already haha [07:59:41] so one potential issue is that the beta commonswiki has no wgForeignFileRepos (it gives array() ) [07:59:42] so when mw looks for an image it checks the local repo (local db for the row) then checks the next repo in the array = foreigndb which means it checks that db for a row [08:00:02] that's fine, beta commonswiki should only have the local repo [08:00:40] ... then it would check the third repo = api repo, which means it would request the file descr page via api [08:00:50] and if that fails it will whine and say there's nothing to be found [08:01:53] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.003840208054 secs [08:01:54] * hashar looks at production [08:02:32] what foreign repo would commons have? it should only retrieve files from local... [08:03:05] so in prod, the wikis lookup via the database [08:03:14] and commons obviously does not lookup anything [08:03:30] on beta I was wondering whether we should set the wiki to lookup the local database ( ForeignDBViaLBRepo ) [08:03:36] commons will of course look up file description pages in the locla table [08:03:37] and have the beta commons to fallback to the production commons [08:03:40] which is what we do [08:04:03] no, I don't think we want any foreigndb for commons [08:04:14] on beta, it should check its local db file table and that's it [08:04:43] the other wikis can have instant commons as their third repo which covers them [08:05:14] ok [08:05:19] but then there is only two repo on the wikis [08:05:25] oh maybe 3 sorry [08:05:26] three [08:05:33] their local one being the first ? [08:05:35] local, beta commons, "real" commons [08:05:36] yep [08:06:33] Change abandoned: Hashar; "Per discussion with Ariel, this is no more needed, beta is properly configured right now. Wikis loo..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62606 [08:06:34] thank you ! [08:06:40] sure! [08:33:05] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.001609921455 secs [08:36:04] New patchset: Hashar; "beta: wgAbuseFilterCentralDB = labswiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69461 [08:37:13] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69461 [08:58:10] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [09:07:38] New review: Akosiaris; "Apart from the inline comments i also noticed the following. After commenting the first line with th..." [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [09:36:27] New patchset: Hashar; "beta: properly override $wgAbuseFilterCentralDB" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69465 [09:36:34] New review: Hashar; "moved to InitialiseSettings-labs.php with https://gerrit.wikimedia.org/r/69465" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69461 [09:37:09] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69465 [09:49:16] New review: Hashar; "and that does not work, need to use a wmg setting." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69461 [09:55:09] New patchset: Hashar; "beta: $wmg for AbuseFilterCentralDB" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69466 [09:56:11] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69466 [09:59:26] !log hashar synchronized wmf-config/CommonSettings.php '{{gerrit|69466}} beta: $wmg for AbuseFilterCentralDB' [09:59:35] Logged the message, Master [10:00:00] !log hashar synchronized wmf-config '{{gerrit|69466}} beta: $wmg for AbuseFilterCentralDB' [10:00:10] Logged the message, Master [10:10:27] New patchset: Hashar; "beta: enable global abuse filters on all wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69469 [10:11:07] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69469 [10:16:16] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours [10:29:15] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: No successful Puppet run in the last 10 hours [11:03:50] New review: Daniel Kinzler; "(1 comment)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/65443 [11:39:22] New patchset: QChris; "Take advantage of hook-bugzillas new event mechanism" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69475 [11:39:22] New patchset: QChris; "Turn off hooks-bugzilla legacy event handling" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69476 [11:40:52] New review: QChris; "Bringing this in does not hurt older hooks-bugzilla plugins." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69475 [11:42:30] New review: QChris; "This change should only merged once the new hooks-bugzilla is in place, as it disables adding bugzil..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/69476 [11:58:20] New patchset: Nemo bis; "Add User:Odder to English Planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69482 [12:28:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:29:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.161 second response time [12:34:40] New review: Ottomata; "(1 comment)" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [12:36:00] New patchset: Ottomata; "Initial commit of Kafka Puppet module for Apache Kafka 0.8" [operations/puppet/kafka] (master) - https://gerrit.wikimedia.org/r/50385 [12:51:37] New patchset: Odder; "Adding a couple of blogs from [[m:Planet Wikimedia]]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69488 [12:59:48] New review: Nemo bis; "I already submitted I882f7f5a" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69488 [13:16:41] New review: Odder; "I15ea12d adds a couple of blogs at the same time." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/69482 [13:20:39] @notify binasher [13:20:39] I'll let you know when I see binasher around here [13:23:26] Change abandoned: Nemo bis; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69482 [13:50:04] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [13:50:04] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:04] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:04] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:04] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:05] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:05] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:06] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [13:50:06] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:07] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [13:50:07] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [13:51:06] !log powering down carbon to replace bad disk [13:51:15] Logged the message, Master [13:52:54] PROBLEM - Host carbon is DOWN: CRITICAL - Host Unreachable (208.80.154.10) [13:54:58] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68829 [14:07:42] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [14:18:22] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [14:22:22] PROBLEM - Puppet freshness on magnesium is CRITICAL: No successful Puppet run in the last 10 hours [14:35:22] !log powering down mw1171 to replace hard drive and reinstalling [14:35:30] Logged the message, Master [14:54:09] !log update & reboot boron [14:54:18] Logged the message, Master [14:57:04] !log powering down and relocating searchidx1001 [14:57:13] Logged the message, Master [14:59:03] RECOVERY - Host mw1171 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [14:59:53] PROBLEM - Host searchidx1001 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:11] !log authdns update [15:01:13] PROBLEM - RAID on mw1171 is CRITICAL: Connection refused by host [15:01:19] Logged the message, Master [15:01:23] PROBLEM - SSH on mw1171 is CRITICAL: Connection refused [15:01:33] PROBLEM - Disk space on mw1171 is CRITICAL: Connection refused by host [15:01:33] PROBLEM - twemproxy process on mw1171 is CRITICAL: Connection refused by host [15:01:43] PROBLEM - Apache HTTP on mw1171 is CRITICAL: Connection refused [15:02:03] PROBLEM - DPKG on mw1171 is CRITICAL: Connection refused by host [15:06:13] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:06:53] New patchset: Hashar; "vary wmerrors.ini 'fatal_log_file' per realm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [15:06:58] apergos: ^^^ [15:07:03] RECOVERY - Disk space on mc15 is OK: DISK OK [15:07:20] New review: Hashar; "added missing 'include role::applicationserver::configuration' :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [15:07:50] New review: Hashar; "tested on labs:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [15:13:30] PROBLEM - NTP on mw1171 is CRITICAL: NTP CRITICAL: No response from NTP server [15:14:07] New review: Hashar; "Got tested with Ariel on labs. Both of us have to leave so we will get that merged and deployed tomo..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/68831 [15:14:20] RECOVERY - SSH on mw1171 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [15:37:28] RECOVERY - Host searchidx1001 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [15:40:18] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 200 OK - 455 bytes in 0.004 second response time [15:43:56] New patchset: Andrew Bogott; "Move ldap into a module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [15:47:18] RECOVERY - Disk space on mw1171 is OK: DISK OK [15:47:28] RECOVERY - DPKG on mw1171 is OK: All packages OK [15:47:38] RECOVERY - RAID on mw1171 is OK: OK: no RAID installed [15:48:28] RECOVERY - NTP on mw1171 is OK: NTP OK: Offset -0.05871582031 secs [15:49:21] PROBLEM - Apache HTTP on mw1171 is CRITICAL: Connection refused [15:52:28] New patchset: Andrew Bogott; "Fix string continuation and a few pep8 issues." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69496 [15:53:04] New patchset: Andrew Bogott; "Fix string continuation and a few pep8 issues." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69496 [15:54:38] New review: Andrew Bogott; "Quick merge so that future patches don't get blocked." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/69496 [15:54:38] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69496 [15:55:03] New review: Andrew Bogott; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69337 [15:57:18] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [16:01:54] New patchset: Reedy; "(bug 49639) Change local time zone for ko.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68947 [16:02:14] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/68947 [16:07:24] RECOVERY - Host search1004 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [16:07:25] RECOVERY - Host search1003 is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms [16:07:25] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [16:07:25] PROBLEM - search indices - check lucene status page on search1004 is CRITICAL: Connection refused [16:07:34] PROBLEM - search indices - check lucene status page on search1001 is CRITICAL: Connection refused [16:07:35] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:07:35] RECOVERY - Host search1006 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [16:07:44] PROBLEM - Lucene on search1004 is CRITICAL: Connection refused [16:07:44] RECOVERY - Host search1005 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:07:44] PROBLEM - Lucene on search1002 is CRITICAL: Connection refused [16:07:44] PROBLEM - search indices - check lucene status page on search1003 is CRITICAL: Connection refused [16:07:54] PROBLEM - search indices - check lucene status page on search1002 is CRITICAL: Connection refused [16:07:54] PROBLEM - NTP on search1004 is CRITICAL: NTP CRITICAL: Offset unknown [16:10:24] PROBLEM - search indices - check lucene status page on search1005 is CRITICAL: Connection refused [16:10:34] PROBLEM - search indices - check lucene status page on search1006 is CRITICAL: Connection refused [16:11:35] RECOVERY - twemproxy process on mw1171 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [16:11:55] RECOVERY - NTP on search1004 is OK: NTP OK: Offset -0.01002562046 secs [16:13:14] PROBLEM - Lucene on search1006 is CRITICAL: Connection refused [16:13:24] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [16:13:25] PROBLEM - Lucene on search1005 is CRITICAL: Connection refused [16:13:25] PROBLEM - Lucene on search1003 is CRITICAL: Connection refused [16:15:44] RECOVERY - Puppet freshness on mw1171 is OK: puppet ran at Wed Jun 19 16:15:39 UTC 2013 [16:38:04] RECOVERY - Host search1013 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:38:04] RECOVERY - Host search1015 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [16:38:14] RECOVERY - Host search1012 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [16:38:14] RECOVERY - Host search1009 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [16:38:14] RECOVERY - Host search1008 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:38:14] RECOVERY - Host search1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:38:14] RECOVERY - Host search1014 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [16:38:15] RECOVERY - Host search1018 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [16:38:15] RECOVERY - Host search1021 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [16:38:16] RECOVERY - Host search1016 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [16:38:16] RECOVERY - Host search1019 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [16:38:17] RECOVERY - Host search1010 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [16:38:24] PROBLEM - search indices - check lucene status page on search1009 is CRITICAL: Connection refused [16:38:24] PROBLEM - Lucene on search1016 is CRITICAL: Connection refused [16:38:24] RECOVERY - Host search1017 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [16:38:24] RECOVERY - Host search1020 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [16:38:24] RECOVERY - Host search1011 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [16:38:34] PROBLEM - Lucene on search1014 is CRITICAL: Connection refused [16:38:34] PROBLEM - Lucene on search1017 is CRITICAL: Connection refused [16:38:34] PROBLEM - NTP on search1019 is CRITICAL: NTP CRITICAL: Offset unknown [16:38:37] rise search rise! [16:38:44] PROBLEM - NTP on search1018 is CRITICAL: NTP CRITICAL: Offset unknown [16:39:44] RECOVERY - NTP on search1018 is OK: NTP OK: Offset -0.008357286453 secs [16:40:24] PROBLEM - search indices - check lucene status page on search1008 is CRITICAL: Connection refused [16:41:05] PROBLEM - search indices - check lucene status page on search1010 is CRITICAL: Connection refused [16:41:05] PROBLEM - search indices - check lucene status page on search1011 is CRITICAL: Connection refused [16:41:05] PROBLEM - search indices - check lucene status page on search1013 is CRITICAL: Connection refused [16:41:14] PROBLEM - search indices - check lucene status page on search1007 is CRITICAL: Connection refused [16:41:14] PROBLEM - search indices - check lucene status page on search1014 is CRITICAL: Connection refused [16:41:14] PROBLEM - search indices - check lucene status page on search1012 is CRITICAL: Connection refused [16:42:34] RECOVERY - NTP on search1019 is OK: NTP OK: Offset -0.01334488392 secs [16:43:24] PROBLEM - Lucene on search1019 is CRITICAL: Connection refused [16:43:34] PROBLEM - Lucene on search1012 is CRITICAL: Connection refused [16:43:34] PROBLEM - Lucene on search1021 is CRITICAL: Connection refused [16:43:34] PROBLEM - Lucene on search1018 is CRITICAL: Connection refused [16:43:34] PROBLEM - Lucene on search1015 is CRITICAL: Connection refused [16:43:34] PROBLEM - Lucene on search1020 is CRITICAL: Connection refused [16:43:35] PROBLEM - Lucene on search1008 is CRITICAL: Connection refused [16:43:35] PROBLEM - Lucene on search1013 is CRITICAL: Connection refused [16:43:36] PROBLEM - Lucene on search1009 is CRITICAL: Connection refused [16:43:54] PROBLEM - Lucene on search1010 is CRITICAL: Connection refused [16:43:55] PROBLEM - Lucene on search1007 is CRITICAL: Connection refused [16:43:55] PROBLEM - Lucene on search1011 is CRITICAL: Connection refused [16:51:39] looks like zero is not deploying today, so if anyone needs it for lightning depl [16:52:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:00] yurik: oh, good to know, thanks! [16:53:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.158 second response time [17:06:37] RECOVERY - Host search1024 is UP: PING OK - Packet loss = 0%, RTA = 1.26 ms [17:06:37] RECOVERY - Host search1022 is UP: PING OK - Packet loss = 0%, RTA = 1.36 ms [17:06:37] RECOVERY - Host search1023 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [17:06:37] PROBLEM - search indices - check lucene status page on search1023 is CRITICAL: Connection refused [17:06:37] PROBLEM - search indices - check lucene status page on search1024 is CRITICAL: Connection refused [17:06:56] PROBLEM - Lucene on search1023 is CRITICAL: Connection refused [17:07:26] PROBLEM - Lucene on search1022 is CRITICAL: Connection refused [17:12:06] PROBLEM - Lucene on search1024 is CRITICAL: Connection refused [17:13:07] PROBLEM - RAID on analytics1019 is CRITICAL: Timeout while attempting connection [17:14:16] PROBLEM - Host analytics1019 is DOWN: PING CRITICAL - Packet loss = 100% [17:18:51] mutante, Leslie fixed the netboot issue, i think I need to allow disk booting in the bios again, right? [17:19:01] ottomata: correct, and ..correct [17:19:24] ok, loading bios... [17:19:24] yea, i had it disabled just to see more of the error instead of always ending up in OS [17:19:29] yeah [17:19:34] i think it reinstalled all just fine! [17:19:38] cool [17:20:46] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [17:22:47] RECOVERY - SSH on analytics1020 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:22:56] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [17:24:46] RECOVERY - Host analytics1019 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [17:25:03] are you doing 1019 as well? [17:26:26] puppet design (roles / profiles) video https://www.youtube.com/watch?v=ZpHtOnlSGNY [17:27:16] PROBLEM - DPKG on analytics1019 is CRITICAL: Connection refused by host [17:27:36] PROBLEM - Disk space on analytics1019 is CRITICAL: Connection refused by host [17:27:37] PROBLEM - SSH on analytics1019 is CRITICAL: Connection refused [17:30:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [17:38:05] yo dudes [17:38:08] ever seen this on a fresh reinstall initial puppet run? [17:38:08] err: Could not retrieve catalog from remote server: Server hostname 'sockpuppet.pmtpa.wmnet' did not match server certificate; expected sockpuppet.pmtpa.wmnet [17:38:38] RECOVERY - search indices - check lucene status page on search20 is OK: HTTP OK: HTTP/1.1 200 OK - 60075 bytes in 0.115 second response time [17:38:57] PROBLEM - NTP on analytics1019 is CRITICAL: NTP CRITICAL: No response from NTP server [17:47:41] RECOVERY - SSH on analytics1019 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:52:00] PROBLEM - DPKG on mc15 is CRITICAL: Timeout while attempting connection [17:52:38] RECOVERY - DPKG on analytics1020 is OK: All packages OK [17:52:47] RECOVERY - DPKG on mc15 is OK: All packages OK [17:52:57] RECOVERY - Disk space on analytics1020 is OK: DISK OK [17:55:37] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [18:00:28] RECOVERY - DPKG on analytics1019 is OK: All packages OK [18:00:37] RECOVERY - Disk space on analytics1019 is OK: DISK OK [18:08:41] RECOVERY - NTP on analytics1020 is OK: NTP OK: Offset -0.01051580906 secs [18:14:16] !log nikerabbit synchronized php-1.22wmf6/extensions/UniversalLanguageSelector/ 'ULS to master' [18:14:25] Logged the message, Master [18:16:25] RECOVERY - NTP on analytics1019 is OK: NTP OK: Offset -0.01820898056 secs [18:17:27] <^demon> manybubbles: So, did my update rewrite totally break things for you? :) [18:17:53] well I just merged it and the merge was clean. that is something. [18:18:03] but it looks like batch indexing doesn't work [18:18:07] but I can fix that [18:18:26] <^demon> Hmm, worked for me locally. [18:18:49] <^demon> I was curious though...there's already an updateSearchIndex script in core. I'm wondering if we could make it backend-agnostic enough to where it would work for us too. [18:18:56] <^demon> Less codepaths to maintain then. [18:19:45] we might be able to [18:19:58] <^demon> forceSolrIndex kind of annoys me right now since it doesn't use the SearchUpdate codepath. [18:20:13] it'd be worth looking at but we'll have to make sure we can still do the time updates [18:20:22] fair enough - it was before I knew about it [18:20:47] I'm happy so long as we essentially have the same queries powering the updates (and deletes) that can be done in batches. [18:21:38] New patchset: Dzahn; "Adding a couple of blogs from [[m:Planet Wikimedia]]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69488 [18:22:45] ^demon: got it. I just hadn't pulled mediawiki recently [18:22:51] <^demon> :) [18:22:52] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69488 [18:23:01] <^demon> Yeah, it depended on that core change. [18:24:41] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:25:31] RECOVERY - Disk space on mc15 is OK: DISK OK [18:25:47] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/67203 [18:28:07] !log sync-apache and graceful for votewiki rewrites (change 67203) [18:28:21] Logged the message, Master [18:29:33] New patchset: Asher; "pull db64, bad bbu" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69511 [18:30:00] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/69511 [18:30:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:37] !log: taking down db64 to replace Raid Controller battery [18:31:26] !log asher synchronized wmf-config/db-pmtpa.php 'pull db64' [18:31:36] Logged the message, Master [18:32:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [18:33:32] PROBLEM - Host db64 is DOWN: PING CRITICAL - Packet loss = 100% [18:34:07] !log nikerabbit synchronized php-1.22wmf7/extensions/UniversalLanguageSelector/ 'ULS to master' [18:34:15] Logged the message, Master [18:34:33] !log updated Parsoid to ddc47d8 [18:34:43] Logged the message, Master [18:35:22] New review: Dzahn; "deployed and graceful'ed" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/67203 [18:38:38] RECOVERY - Host db64 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [18:38:39] heh, gwicke was harassed IRL right after he deployed [18:38:59] "harassed" meaning "interrupted and talked to" [18:41:18] PROBLEM - mysqld processes on db64 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:41:26] !log nikerabbit synchronized php-1.22wmf6/extensions/UniversalLanguageSelector/ 'ULS to master again' [18:41:34] Logged the message, Master [18:43:50] PROBLEM - Varnish HTTP parsoid-backend on cerium is CRITICAL: Connection refused [18:45:40] Oh heh [18:45:44] Forgot to start it back up [18:46:48] RECOVERY - Varnish HTTP parsoid-backend on cerium is OK: HTTP OK: HTTP/1.1 200 OK - 634 bytes in 0.001 second response time [18:53:28] PROBLEM - NTP on db64 is CRITICAL: NTP CRITICAL: Offset unknown [18:55:42] New patchset: Ottomata; "Including role::analytics::hadoop::worker on analytics1019 and analytics1020" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69518 [18:57:28] RECOVERY - NTP on db64 is OK: NTP OK: Offset -0.0001238584518 secs [18:57:36] New patchset: Ottomata; "Including role::analytics::hadoop::worker on analytics1019 and analytics1020" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69518 [18:58:27] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69518 [18:58:49] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [19:00:15] New patchset: Ottomata; "Adding role/analytics/hive,pig,sqoop.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69521 [19:00:39] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69521 [19:11:12] PROBLEM - Host mc1015 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:39] RECOVERY - Host mc1015 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:14:53] New patchset: Ottomata; "Fixing pig.properties.erb comment" [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69523 [19:15:05] Change merged: Ottomata; [operations/puppet/cdh4] (master) - https://gerrit.wikimedia.org/r/69523 [19:38:57] ottomata: hey [19:39:08] I'm not sure I understand exactly the situation with the group.id [19:39:28] and I'm more curious than reviewing at this point :) [19:40:11] aye so [19:40:30] a consumer has an associated consumer group [19:40:55] ok [19:41:02] when it consumes data from kafka, it keeps a high water mark stored in zookeeper under the name of the consumer group [19:41:03] so [19:41:04] why can't it be a variable set by the manifest? [19:41:14] it could, but what would we set it to? [19:41:30] a consumer is not a single instance daemon [19:41:39] set it in ::defaults as "test-consumer-group"? [19:41:41] there could be any number of consumers running on any number of boxes [19:41:47] sure i don't mind that [19:41:54] in practice [19:41:56] what will we do? [19:42:02] in practice, probably not even use that file [19:42:09] except for testing [19:42:16] so, what our file would be like? [19:42:25] not all consumers use that file :) [19:42:27] and [19:42:31] for those that ship with kafka [19:42:49] uh, there are two main CLI consumers that ship with kafka [19:42:51] one of them uses that file [19:42:57] the other just accepts a bunch of CLI args [19:43:28] what will we use? [19:43:30]