[00:38:14] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [00:39:04] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.98 ms [02:08:37] !log LocalisationUpdate completed (1.22wmf22) at Mon Oct 28 02:08:37 UTC 2013 [02:08:58] Logged the message, Master [02:15:57] !log LocalisationUpdate completed (1.23wmf1) at Mon Oct 28 02:15:57 UTC 2013 [02:16:13] Logged the message, Master [02:36:16] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Oct 28 02:36:16 UTC 2013 [02:36:31] Logged the message, Master [04:02:34] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: No successful Puppet run in the last 10 hours [06:00:53] (03PS1) 10Ori.livneh: Remove references to 'olivneh' account from node defs [operations/puppet] - 10https://gerrit.wikimedia.org/r/92267 [06:35:36] (03PS1) 10Ori.livneh: [WIP] Add Graphite module & role [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 [06:35:59] paravoid: ^ [06:36:31] very incomplete. there's an overview of the state of the patch in the commit message [06:38:52] I wonder where python-{carbon,whisper} come from [06:38:56] and python-graphite-web [06:39:14] these aren't Debian/Ubuntu for sure [06:40:25] Debian does have a different python-whisper plus graphite-carbon and graphite-web [06:40:31] different from what we have [06:41:08] some random PPA at some point the past is what I'm guessing [06:41:39] probably; IIRC the listed maintainer is asher@ [06:41:58] !log powercycled ms-be1001, unaccessible via ssh or mgmt [06:42:14] Logged the message, Master [06:42:16] Maintainer: Chris Davis [06:42:27] apergos: oh, I didn't realize -- thanks! [06:42:30] yw [06:42:42] they've been hanging [06:42:47] one of the benefits of checking puppet freshness every day [06:42:47] it's the third one I think [06:42:50] no idea why [06:42:54] nothing obvious [06:43:11] saw some: [2107952.990239] BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:1:17962] [06:43:14] yeah [06:43:18] different processes on each though [06:43:18] same as the others [06:43:34] ori-l: so, we need to either forward port these packages to precise or switch to the Debian ones [06:43:43] I think you can guess my preference :P [06:43:49] they're in precise [06:44:02] in our apt repo, i mean [06:44:23] oh are they [06:44:41] that's how i've been testing things; mediawiki-vagrant uses apt.wikimedia.org [06:44:54] RECOVERY - Disk space on ms-be1001 is OK: DISK OK [06:44:54] RECOVERY - swift-object-server on ms-be1001 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [06:44:54] RECOVERY - swift-account-reaper on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [06:44:55] RECOVERY - swift-container-auditor on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:44:55] RECOVERY - DPKG on ms-be1001 is OK: All packages OK [06:45:04] RECOVERY - swift-container-server on ms-be1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [06:45:04] RECOVERY - swift-object-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [06:45:05] RECOVERY - swift-object-auditor on ms-be1001 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [06:45:07] but i'm in favor of switching to debian ones too [06:45:14] RECOVERY - RAID on ms-be1001 is OK: OK: State is Optimal, checked 14 logical drive(s), 14 physical drive(s) [06:45:14] RECOVERY - swift-container-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [06:45:14] RECOVERY - swift-account-auditor on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [06:45:14] RECOVERY - swift-account-replicator on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [06:45:14] RECOVERY - swift-container-updater on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [06:45:14] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [06:45:24] RECOVERY - swift-account-server on ms-be1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [06:45:30] ok, I see them now [06:45:33] different packages [06:45:34] RECOVERY - swift-object-updater on ms-be1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [06:45:34] RECOVERY - Puppet freshness on ms-be1001 is OK: puppet ran at Mon Oct 28 06:45:31 UTC 2013 [06:45:44] asher@ indeed [06:45:53] other differences too, including versions [06:45:54] oh fun [06:46:12] the whisper one is the Debian one though [06:46:39] so that leaves carbon & web [06:47:02] ceres is also packaged in Debian fwiw [06:47:30] what are you thoughts on ceres btw? [06:47:56] i saw, but graphite is already severely lacking in good documentation and on ceres the interwebs are practically silent [06:48:04] RECOVERY - NTP on ms-be1001 is OK: NTP OK: Offset -0.002200245857 secs [06:48:54] ok [06:48:56] maybe later then [06:49:31] yeah. there are some good resources for scaling carbon-cache, written from experience.. i linked to them in manifests/role/graphite.pp. haven't found anything comparable for ceres yet [06:50:16] http://anatolijd.blogspot.com/2013/06/graphitemegacarbonceres-multi-node.html : "And while being announced two years ago, Ceres comes completely undocumented. But lack of documentation should never stop us to experiment!" [06:50:57] class role::graphite { 1 [06:50:58] class { 'graphite': [06:51:03] does that even work? [06:51:25] if it is, I'll be impressed [06:51:36] at puppet's usual craziness :) [06:51:39] no, it probably needs to be qualified. the role class wrapper was something i added in the course of copying the files over to operations/puppet [06:51:50] oh, ok [06:54:15] ok, looks reasonable in general [06:54:28] I puked every time I saw /opt of course :P [06:55:20] but you anticipated that [06:55:59] at least it's contained in one tree [06:56:20] the Debian packages I was talking about before are of course not like that [06:57:12] well, if some good soul went through the trouble of dotting the is and crossing the ts in the config files to make it work, let's use it [06:57:52] carbon.conf.example has some cheery comment to the effect of 'to use FHS paths, just set these three values', which i dutifully did, and lots of weird things started to break [06:59:19] carbon & whisper are even part of debian stable now [06:59:29] ceres & web are only unstable/testing [06:59:43] but stable probably means that they work [07:17:41] !log delaying one tampa slave per shard during OSC [07:17:58] Logged the message, Master [07:31:07] (03PS2) 10ArielGlenn: removing last vestiges (mgmt ip) for sq31-36, decommed [operations/dns] - 10https://gerrit.wikimedia.org/r/91160 [07:37:03] (03CR) 10ArielGlenn: [C: 032] removing last vestiges (mgmt ip) for sq31-36, decommed [operations/dns] - 10https://gerrit.wikimedia.org/r/91160 (owner: 10ArielGlenn) [07:45:11] nov 2012? [07:45:12] lol [08:13:42] was etherpad updated? I'm noticing weird new behaviours and I wonder if it's because of their new handling of corrupted pads [08:21:13] actually, I'm unable to save anything on any pad [08:22:39] !log etherpad.wikimedia.org seems read-only [08:22:55] Logged the message, Master [08:27:35] filed as https://bugzilla.wikimedia.org/show_bug.cgi?id=56232 [08:28:15] (03PS1) 10ArielGlenn: remove entries for db5,6,7,26,27 long since decommed [operations/dns] - 10https://gerrit.wikimedia.org/r/92272 [08:31:52] Nemo_bis: please check now [08:34:04] PROBLEM - etherpad_lite_process_running on zirconium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^node node_modules/ep_etherpad-lite/node/server.js [08:34:39] mm no good it seems. that's too bad [08:35:07] apergos: same [08:36:32] yeah just a sec [08:37:04] RECOVERY - etherpad_lite_process_running on zirconium is OK: PROCS OK: 1 process with regex args ^node node_modules/ep_etherpad-lite/node/server.js [08:37:12] Nemo_bis: now? [08:38:06] (it seems now to work for me) [08:42:35] apergos: seems to work, let me test a bit more [08:42:41] !log restarted etherpadlite on zirconium, see ticket 6093, it was not saving edits [08:42:56] Logged the message, Master [08:56:09] (03PS1) 10ArielGlenn: remove db5,8,26 from all dsh group files, long since decommed [operations/puppet] - 10https://gerrit.wikimedia.org/r/92273 [08:56:34] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 305 seconds [08:57:04] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 315 seconds [08:58:57] (03CR) 10ArielGlenn: [C: 032] remove db5,8,26 from all dsh group files, long since decommed [operations/puppet] - 10https://gerrit.wikimedia.org/r/92273 (owner: 10ArielGlenn) [09:02:04] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 142 seconds [09:02:34] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay -0 seconds [09:10:09] mark, i tried tons of different options - seems that the moment ESI is enabled, varnish creates tons of worker threads and dies, irregardless of the URL that backend asks it to include [09:23:22] (03PS1) 10ArielGlenn: one more db26 removal from dsh files [operations/puppet] - 10https://gerrit.wikimedia.org/r/92277 [09:28:05] (03CR) 10ArielGlenn: [C: 032] one more db26 removal from dsh files [operations/puppet] - 10https://gerrit.wikimedia.org/r/92277 (owner: 10ArielGlenn) [09:33:56] (03PS1) 10ArielGlenn: removing dhcp entries for arsenic/niobium (reclaimed, see rt #5848) [operations/puppet] - 10https://gerrit.wikimedia.org/r/92278 [09:36:31] (03CR) 10ArielGlenn: [C: 032] removing dhcp entries for arsenic/niobium (reclaimed, see rt #5848) [operations/puppet] - 10https://gerrit.wikimedia.org/r/92278 (owner: 10ArielGlenn) [09:45:57] @replag [09:57:53] (03PS1) 10ArielGlenn: add back palladium mgmt ip [operations/dns] - 10https://gerrit.wikimedia.org/r/92280 [10:08:46] (03PS2) 10ArielGlenn: add back palladium mgmt ip [operations/dns] - 10https://gerrit.wikimedia.org/r/92280 [10:10:11] (03CR) 10ArielGlenn: [C: 032] add back palladium mgmt ip [operations/dns] - 10https://gerrit.wikimedia.org/r/92280 (owner: 10ArielGlenn) [10:49:51] (03PS1) 10Matanya: removing cache clean up patch [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 [11:07:49] apergos: around? [11:08:20] yes (though busy), what's up? [11:09:26] apergos: hi, sorry to interuppt, just need to know if any lucid mysql server still exists [11:09:43] let's see [11:11:55] the m2 shard hosts are apparently lucid [11:12:19] there are a number in tampa as well [11:12:19] oh, darn. thanks a lot apergos [11:12:21] that's it [11:12:29] yw [11:25:27] (03PS1) 10Mark Bergsma: Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/92292 [11:25:28] (03PS1) 10Mark Bergsma: Cleanup [operations/dns] - 10https://gerrit.wikimedia.org/r/92293 [11:26:11] (03CR) 10Mark Bergsma: [C: 032] Tabs to spaces [operations/dns] - 10https://gerrit.wikimedia.org/r/92292 (owner: 10Mark Bergsma) [11:26:25] (03CR) 10Mark Bergsma: [C: 032] Cleanup [operations/dns] - 10https://gerrit.wikimedia.org/r/92293 (owner: 10Mark Bergsma) [11:37:54] PROBLEM - Disk space on wtp1005 is CRITICAL: DISK CRITICAL - free space: / 353 MB (3% inode=76%): [11:43:14] PROBLEM - Parsoid on wtp1005 is CRITICAL: Connection refused [11:44:30] (03PS1) 10ArielGlenn: remove srv151-192,194-234 from dsh groups and add back srv193 [operations/puppet] - 10https://gerrit.wikimedia.org/r/92295 [11:47:58] (03CR) 10ArielGlenn: [C: 032] remove srv151-192,194-234 from dsh groups and add back srv193 [operations/puppet] - 10https://gerrit.wikimedia.org/r/92295 (owner: 10ArielGlenn) [12:26:54] RECOVERY - Disk space on wtp1005 is OK: DISK OK [12:27:14] RECOVERY - Parsoid on wtp1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [12:37:40] (03PS1) 10ArielGlenn: add explicit comments for ranges, hoping to avoid off-by-one [operations/dns] - 10https://gerrit.wikimedia.org/r/92296 [12:40:20] (03CR) 10ArielGlenn: [C: 032] add explicit comments for ranges, hoping to avoid off-by-one [operations/dns] - 10https://gerrit.wikimedia.org/r/92296 (owner: 10ArielGlenn) [12:55:27] (03PS1) 10Mark Bergsma: Cleanup [operations/dns] - 10https://gerrit.wikimedia.org/r/92300 [12:56:30] (03CR) 10Mark Bergsma: [C: 032] Cleanup [operations/dns] - 10https://gerrit.wikimedia.org/r/92300 (owner: 10Mark Bergsma) [12:59:36] thanks for that [13:00:26] I had a patchset ready earlier today to unilaterally remove the kenniset legacy stuff but then thought better of it [13:00:34] so yay [13:05:27] better hold off on the other in-addr.arpa files btw, I have a branch with service IP changes in them [13:26:33] !log restarted elasticsearch nodes to pick up new config [13:26:48] Logged the message, Master [13:53:33] (03PS1) 10ArielGlenn: give cmjohnson perms to ack, disable notifications etc in icinga [operations/puppet] - 10https://gerrit.wikimedia.org/r/92305 [13:56:05] (03CR) 10ArielGlenn: [C: 032] give cmjohnson perms to ack, disable notifications etc in icinga [operations/puppet] - 10https://gerrit.wikimedia.org/r/92305 (owner: 10ArielGlenn) [14:50:57] (03CR) 10Chad: "Actually it will be upstreamed ;-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91879 (owner: 10Odder) [14:51:02] (03PS1) 10Cmjohnson: removing search21-36 from decommissioning.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/92313 [14:52:12] (03CR) 10Cmjohnson: [C: 032] removing search21-36 from decommissioning.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/92313 (owner: 10Cmjohnson) [14:53:54] (03CR) 10Andrew Bogott: [C: 031] "Looks right; will get Faidon to merge." [operations/puppet] - 10https://gerrit.wikimedia.org/r/92288 (owner: 10Matanya) [14:54:14] RECOVERY - Host search29 is UP: PING OK - Packet loss = 0%, RTA = 27.63 ms [14:54:59] paravoid: Shall I merge this? https://gerrit.wikimedia.org/r/#/c/92288/ [14:58:04] (03CR) 10Andrew Bogott: [C: 032] "Yep! Thanks for cleanup." [operations/puppet] - 10https://gerrit.wikimedia.org/r/92079 (owner: 10Matanya) [14:58:32] (03PS2) 10Andrew Bogott: site.pp: removed apache-utils [operations/puppet] - 10https://gerrit.wikimedia.org/r/92079 (owner: 10Matanya) [14:58:59] (03PS1) 10Dereckson: DynamicPageList extension configuration maintenance [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92314 [14:59:43] (03CR) 10Andrew Bogott: [C: 032] site.pp: removed apache-utils [operations/puppet] - 10https://gerrit.wikimedia.org/r/92079 (owner: 10Matanya) [15:01:59] !log removing search21-36 from pybal search_pools [15:02:17] Logged the message, Master [15:56:28] moorning manybubbles|lunc i guess you are not up yet? [15:56:32] on lunch [15:56:32] lucnh! [16:07:34] (03PS1) 10Ottomata: Making elasticsearch ganglia plugin query $ipaddress instead of localhost for ES stats. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92325 [16:09:09] (03PS2) 10Ottomata: Making elasticsearch ganglia plugin query $ipaddress instead of localhost for ES stats. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92325 [16:10:44] manybubbles|lunc: ^ [16:18:38] (03PS1) 10Andrew Bogott: Remove generic::sysfs::enable-rps [operations/puppet] - 10https://gerrit.wikimedia.org/r/92326 [16:23:35] heya LeslieCarr, I know you are super busy with the dc stuff, got anytime to help fix the inter rack multicast issue? [16:23:39] i'm not really sure what the problem is [16:24:30] but, multicast doesn't seem to work between racks, which makes ganglia not really work correctly in the analytics cluster [16:35:18] (03PS2) 10Andrew Bogott: Remove generic::sysfs::enable-rps [operations/puppet] - 10https://gerrit.wikimedia.org/r/92326 [16:35:57] (03CR) 10Andrew Bogott: [C: 032] Remove generic::sysfs::enable-rps [operations/puppet] - 10https://gerrit.wikimedia.org/r/92326 (owner: 10Andrew Bogott) [16:49:58] ottomata: which ganglia group was it again that wasn't working? [16:52:02] analytics, so 239.192.1.32 [16:52:35] if I start a multicast listener on a node in row B [16:52:45] PROBLEM - Host sq44 is DOWN: PING CRITICAL - Packet loss = 100% [16:52:45] and then send to that mutlicast addy from a node in row C [16:52:48] I don't get any traffic [16:52:53] but, I do if I send from a node in Row B [16:52:54] right [16:52:59] vice-versa is the same [16:53:14] multicast emitted from row B is not receieved in row C [16:53:23] I have two ganglia aggregators in the analytics cluster [16:53:32] one on row B (analytics1009) and one on row C (analytics1011) [16:53:35] (03PS1) 10Aude: Temporary logo for Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92333 [16:53:44] metrics generated in row B make it to ganglia fine [16:53:49] metrics generated in row C do not [16:53:59] most of our production nodes are in row C; row B is all ciscos [16:54:33] (03CR) 10Aude: [C: 04-1] "not to deploy until 0:00 UTC, October 29 or after." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92333 (owner: 10Aude) [16:56:15] RECOVERY - Host sq44 is UP: PING OK - Packet loss = 0%, RTA = 31.18 ms [16:56:29] i think it's the analytics ACL that's breaking it [16:56:57] oh yeah? [16:57:35] we ahd a problem with that before, but wahtever was happening before just made the udp2log multicast firehose be duplicated in the analytics cluster [16:57:55] so why do I see all hosts in ganglia now? [16:59:10] (03CR) 10Bartosz Dziewoński: [C: 04-1] "You really do not want to do it this way; the logo URL is put in page HTML and cached for up to 30 days. What you actually want is more al" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92333 (owner: 10Aude) [17:00:27] mark, i dunno [17:00:28] actually [17:00:30] all hosts are there [17:00:36] and even normal metrics seem to make it [17:00:43] its just custom ones that don'e [17:00:44] don't [17:00:59] (03CR) 10Aude: "@Bartosz putting it in css would indeed be better :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92333 (owner: 10Aude) [17:01:09] (03Abandoned) 10Aude: Temporary logo for Wikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92333 (owner: 10Aude) [17:01:14] strange [17:01:22] that's all done over the same multicast channel [17:02:02] when you tested multicast [17:02:07] did you test the same ganglia multicast address? [17:02:10] so [17:02:11] netcat analytics1009.eqiad.wmnet 8649 | grep cpu | wc -l [17:02:11] 416 [17:02:11] because the filter blocks other ranges [17:02:16] netcat analytics1011.eqiad.wmnet 8649 | grep cpu | wc -l [17:02:16] 416 [17:02:17] and [17:02:25] netcat analytics1009.eqiad.wmnet 8649 | grep kafka | wc -l [17:02:25] 1905 [17:02:34] netcat analytics1011.eqiad.wmnet 8649 | grep kafka | wc -l [17:02:34] 4971 [17:02:35] hm [17:03:06] and, yes mark, when I tested I tested both, buuuut oh actually I did not use the ganglia multicast address to test [17:03:09] i just used a random one [17:03:15] i guess if that is not in the acl it won't make it through? [17:03:20] that would have been blocked then yes [17:03:27] ok let me try that again [17:03:31] but, in the output I just pasted [17:03:58] the cpu metrics look all the same on both aggregators [17:04:00] but not the kafka metrics [17:04:05] yeah strange [17:04:12] should test with tcpdump and friends [17:05:46] is the 8649 port in the acl? or just the multicast group addy? [17:09:14] (03CR) 10Aaron Schulz: [C: 032] Switched to JobQueueFederated [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92032 (owner: 10Aaron Schulz) [17:09:32] (03Merged) 10jenkins-bot: Switched to JobQueueFederated [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92032 (owner: 10Aaron Schulz) [17:10:59] gah, tampa [17:11:02] ok this is strange to mark [17:11:16] i tcpdumped traffic on the gmond multicast on both an09 and an11 [17:11:28] (03CR) 10Dzahn: [C: 04-1] "the commit message says db5,6,7 but the change removes db5,7,8" [operations/dns] - 10https://gerrit.wikimedia.org/r/92272 (owner: 10ArielGlenn) [17:11:41] and then counted the occurrences of hostnames [17:11:47] in about 15 seconds of traffic [17:11:48] on an09 [17:11:54] 20 analytics1023.eqiad.wmnet.55418 [17:11:55] 27 analytics1024.eqiad.wmnet.60505 [17:12:07] soryr not those [17:12:09] these: [17:12:09] 6 analytics1021.eqiad.wmnet.35794 [17:12:09] 1 analytics1022.eqiad.wmnet.41099 [17:12:13] and then on an11 [17:12:17] 4345 analytics1021.eqiad.wmnet.53644 [17:12:17] 5928 analytics1022.eqiad.wmnet.38158 [17:12:18] (03PS1) 10Aaron Schulz: Update tampa for 7786c233a30d3c8552c862ab841d7b9dfa6d67be (also fixed gzip conf) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92336 [17:12:35] (03CR) 10Aaron Schulz: [C: 032] Update tampa for 7786c233a30d3c8552c862ab841d7b9dfa6d67be (also fixed gzip conf) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92336 (owner: 10Aaron Schulz) [17:12:47] (03Merged) 10jenkins-bot: Update tampa for 7786c233a30d3c8552c862ab841d7b9dfa6d67be (also fixed gzip conf) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92336 (owner: 10Aaron Schulz) [17:12:49] (03CR) 10Dzahn: [C: 031] "tickets matching the change though, db5/7/8. ack" [operations/dns] - 10https://gerrit.wikimedia.org/r/92272 (owner: 10ArielGlenn) [17:12:49] and conversly [17:12:53] an09 and an10 [17:12:56] (03PS1) 10coren: Labs DB: add recentchanges_userindex view [operations/software] - 10https://gerrit.wikimedia.org/r/92337 [17:12:56] on an09: [17:12:57] 813 analytics1010.eqiad.wmnet.36583 [17:12:57] 292 analytics1010.eqiad.wmnet.52368 [17:13:00] on an11: [17:13:05] 23 analytics1009.eqiad.wmnet.42442 [17:13:05] 4 analytics1010.eqiad.wmnet.47470 [17:13:24] !log aaron synchronized wmf-config/ 'Switched to JobQueueFederated' [17:13:26] (03CR) 10coren: [C: 032 V: 032] Labs DB: add recentchanges_userindex view [operations/software] - 10https://gerrit.wikimedia.org/r/92337 (owner: 10coren) [17:13:32] ah sorry, looks like multiple ports for an10 on an09, but still, same trend [17:13:38] Logged the message, Master [17:13:44] there are more metrics for hosts within the same rack [17:13:45] weird [17:13:59] so you're saying that most cross-rack multicast packets are not arriving, but some do [17:14:24] i think so? what's weirder is that it seems to be specifically custom ganglia metrics [17:14:43] which would indicate that maybe this is not a multicast specific issue, maybe? [17:14:44] not sure [17:14:57] i could be wrong on that, but that's just what i've noticed so far, [17:15:09] or at least, in ganglia the built in metrics seem to be fine on all hosts [17:15:22] yeah weird [17:22:49] ottomata: to be sure, perhaps do some multicast ping testing with the actual ganglia address (range)? [17:23:00] any other mcast address in that ganglia prefix works too [17:23:13] (03PS1) 10Andrew Bogott: Added the system_role module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92338 [17:25:12] hey [17:25:14] interesting problem [17:25:41] i wonder if it disappears if we deactivate the ACL [17:27:59] andrewbogott: or perhaps a "system" module, which could do some other wmf specific/inventory stuff in the future ;) [17:28:11] system::decommission [17:28:22] alarm at office :p.. wee [17:28:27] i don't know, just brainstorming [17:28:37] mark: That's probably better than putting it in the soon-to-exist generic::module [17:28:44] yeah [17:28:49] It still entails a massive search/replace but I guess that won't kill me [17:29:03] I think a "wmf" module is also too generic [17:29:04] s/_/::/g should do it, right? [17:29:21] well. for those system_role lines, yes ;) [17:29:30] kidding [17:29:59] generic module in progress: [17:30:25] (03PS1) 10Andrew Bogott: Add a 'generic' module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92339 [17:30:56] :) [17:34:59] paravoid: could you replace the custom graphite debs in apt w/the official debian packages? i'll update the patch to refer to the debian packages' file paths [17:38:36] And now, a patch that's way to big to read! [17:38:55] (03PS2) 10Andrew Bogott: Added the system module and the system::role class. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92338 [17:39:15] hehe [17:41:22] oops, that was broken, new patch coming up [17:41:25] (03PS3) 10Andrew Bogott: Added the system module and the system::role class. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92338 [17:46:18] paravoid: ping re ^? sorry if this merely rehashes what we talked about yesterday, i just realized afterward that we didn't really plan the next step [17:46:27] ori-l: oh yeah, sorry [17:46:36] it'll need a backport [17:46:52] I will have a look [17:47:51] paravoid: cool, thank you. [17:48:12] not now though [17:48:23] mark, thing systemuser should go in there as well? [17:48:27] s/thing/think/ [17:48:38] (03PS1) 10Mark Bergsma: Repartition ulsfo LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92342 [17:48:39] (03PS1) 10Mark Bergsma: Repartition eqiad LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92343 [17:48:40] (03PS1) 10Mark Bergsma: Repartition esams LVS service IPs [operations/dns] - 10https://gerrit.wikimedia.org/r/92344 [17:48:50] andrewbogott: well... not really [17:49:05] 'k [17:49:11] system::role is a pretty meta, wmf inventory/role kinda thing [17:49:18] system_user is just a low level unix system user [17:49:20] very different [17:50:20] (03CR) 10Manybubbles: [C: 031] "Makes sense to me. I imagine when I put this together I thought the plugin would run on the Elasticsearch machine instead so localhost wo" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92325 (owner: 10Ottomata) [17:50:25] ottomata: thanks! [17:52:56] paravoid: yeah, i didn't expect you to drop everything :) [18:04:22] mark, yeah i did the same test with the ganglia analytics multicast addy [18:04:24] same behavior [18:04:35] traffic only appears for listeners in the same rack [18:13:24] !log reedy synchronized php-1.23wmf1/includes/specials/SpecialContributions.php [18:13:37] Logged the message, Master [18:15:42] (03CR) 10Dzahn: "it was most likely the duplicate bot and start script as you pointed out, but honestly i don't remember the full history of this one." [operations/puppet] - 10https://gerrit.wikimedia.org/r/60359 (owner: 10Dzahn) [18:20:39] (03PS2) 10Andrew Bogott: Add a 'generic' module. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92339 [18:25:04] (03CR) 10Odder: "Weee! :-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91879 (owner: 10Odder) [18:26:08] !log reedy synchronized php-1.23wmf1/extensions/Wikibase [18:26:22] Logged the message, Master [18:39:50] (03PS1) 10Chad: Wikidatawiki gets Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92352 [18:40:05] (03CR) 10Chad: "Prepping for tomorrow" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92352 (owner: 10Chad) [18:45:47] (03CR) 10Chad: [C: 031] delete search.wikimedia.org Apache config file [operations/puppet] - 10https://gerrit.wikimedia.org/r/91132 (owner: 10Dzahn) [18:46:30] (03CR) 10Chad: "Please merge me. Super easy :D" [operations/puppet] - 10https://gerrit.wikimedia.org/r/84743 (owner: 10QChris) [18:46:41] PROBLEM - MySQL Processlist on db1024 is CRITICAL: CRIT 88 unauthenticated, 0 locked, 0 copy to table, 0 statistics [18:46:48] (03PS2) 10Chad: (bug 40941) Increase font size in Gerrit diff messages [operations/puppet] - 10https://gerrit.wikimedia.org/r/91879 (owner: 10Odder) [18:46:55] (03CR) 10Chad: [C: 031] "Please merge me. Super easy :D" [operations/puppet] - 10https://gerrit.wikimedia.org/r/91879 (owner: 10Odder) [18:49:12] PROBLEM - DPKG on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:49:41] RECOVERY - MySQL Processlist on db1024 is OK: OK 6 unauthenticated, 0 locked, 0 copy to table, 2 statistics [18:50:01] RECOVERY - DPKG on searchidx1001 is OK: All packages OK [19:04:19] (03PS1) 10Cmjohnson: Removing search21-36 from lucene.pp (decomm'ing these servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/92355 [19:05:57] (03CR) 10jenkins-bot: [V: 04-1] Removing search21-36 from lucene.pp (decomm'ing these servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/92355 (owner: 10Cmjohnson) [19:06:59] (03PS2) 10Cmjohnson: Removing search21-36 from lucene.pp (decomm'ing these servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/92355 [19:07:56] (03CR) 10jenkins-bot: [V: 04-1] Removing search21-36 from lucene.pp (decomm'ing these servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/92355 (owner: 10Cmjohnson) [19:16:50] ^d: manybubbles Either of your rebuilding any indexes? [19:16:59] Reedy: nop! [19:17:02] nope [19:17:09] not I, at least [19:17:55] There's 3 wikis showing IndexMissingException atm [19:18:28] uh,4 - bhwiktionary, strategywiki, ugwikibooks and bmwikiquote [19:19:05] PROBLEM - Disk space on wtp1011 is CRITICAL: DISK CRITICAL - free space: / 232 MB (2% inode=76%): [19:20:39] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: non wikipedia to 1.23wmf1 [19:20:53] Logged the message, Master [19:23:05] PROBLEM - Parsoid on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:43] Reedy: will check [19:27:31] (03PS1) 10Reedy: non wikipedia to 1.23wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92361 [19:27:47] (03CR) 10Reedy: [C: 032] non wikipedia to 1.23wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92361 (owner: 10Reedy) [19:27:53] (03CR) 10Mwalker: [C: 04-2] "So we can't remove contribution tracking; it's what allows us to internally track and assign IDs to donors as they come through our system" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/91675 (owner: 10Reedy) [19:28:31] (03Merged) 10jenkins-bot: non wikipedia to 1.23wmf1 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/92361 (owner: 10Reedy) [19:29:15] (03CR) 10Mwalker: [C: 031] Make misc::maintenance::foundationwiki cronjobs ensure => absent [operations/puppet] - 10https://gerrit.wikimedia.org/r/91676 (owner: 10Reedy) [19:29:53] Reedy: I'm really not sure where all these wikis came from that we never made indexes for. I made a bunch of indexes a while ago when they were badly autocreated. I'm just going to go through cirrus.dblist and make sure they are all working as expected so I'll be sure everything is clean as of today [19:31:05] RECOVERY - Disk space on wtp1011 is OK: DISK OK [19:34:27] Reedy: all the wikis you mentioned are built except strategywiki - that one is building [19:34:51] nlwikinews [19:34:59] advisorywiki [19:37:03] Reedy: more wikis than I've ever heard of. languages, I didn't know existed. I'm just going to go down the list.... [19:37:31] 846 centralauthed wikis [19:37:35] plus a couple of handfuls more [19:37:46] 879 in total apparently [19:39:10] (03PS3) 10Andrew Bogott: Move generic::gluster* into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/91884 [19:41:13] hey manybubbles & ^d -- Krinkle & I have been meaning to ask you about something. How hard would be to set up a search interface that greps through the MediaWiki namespace of all Wikimedia wikis? [19:41:20] (03CR) 10Andrew Bogott: [C: 032] Move generic::gluster* into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/91884 (owner: 10Andrew Bogott) [19:42:59] whenever someone proposes to modify MediaWiki CSS or JS APIs, there is always the question of how many gadgets, site scripts, etc. exist in the wild that utilize the interface or rely on the selectors [19:44:07] it'd be very useful to be able to answer such questions definitively in reference to the code that exists on the cluster [19:44:19] There's also User:Foo/(skinname|common).(js|css) [19:44:36] hrm, yes, good point! [19:45:28] Certainly doing MediaWiki namespace would be a good start for site stuff, gadgets etc as you said [19:47:14] can't protect everyone... [19:47:39] (03PS1) 10Andrew Bogott: Specify group => 'root' for the logrotate config. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92365 [19:48:22] (03CR) 10Andrew Bogott: [C: 032] Specify group => 'root' for the logrotate config. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92365 (owner: 10Andrew Bogott) [19:50:52] (03PS4) 10Andrew Bogott: Added the system module and the system::role class. [operations/puppet] - 10https://gerrit.wikimedia.org/r/92338 [19:53:37] (03PS3) 10Cmjohnson: Removing search21-36 from lucene.pp (decomm'ing these servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/92355 [19:54:26] (03CR) 10jenkins-bot: [V: 04-1] Removing search21-36 from lucene.pp (decomm'ing these servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/92355 (owner: 10Cmjohnson) [19:55:08] (03Abandoned) 10Cmjohnson: Removing search21-36 from lucene.pp (decomm'ing these servers) [operations/puppet] - 10https://gerrit.wikimedia.org/r/92355 (owner: 10Cmjohnson) [19:56:59] (03PS1) 10Cmjohnson: Removing search21-36 from lucene.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/92366 [19:58:39] (03CR) 10Cmjohnson: [C: 032] Removing search21-36 from lucene.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/92366 (owner: 10Cmjohnson) [19:58:48]