[00:00:39] mysql's 'utf8' wasn't real utf8 until a fairly recent version (5.1 iirc)
[00:00:49] (I ask because I am currently trying to resolve https://bugzilla.wikimedia.org/show_bug.cgi?id=53751)
[00:00:49] (and was surprised at our default)
[00:00:50] it was restricted to the basic multilingual plane
[00:01:24] which is a problem if you want to support certain languages
[00:01:31] ah; yes; that makes sense
[00:01:31] and we do
[00:02:01] do you know off the top of your head how we do case insensitive db lookups then?
[00:03:37] ^d i can start up the lvs thing in about 15 minutes, will you be here then ?
[00:04:23] oh... answer -- we don't do case insensitive lookups...
[00:10:18] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[00:15:38] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 301 seconds
[00:17:38] PROBLEM - MySQL Replication Heartbeat on db1046 is CRITICAL: CRIT replication delay 303 seconds
[00:33:58] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 00:33:52 UTC 2013
[00:34:18] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[00:35:58] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100%
[00:37:18] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms
[00:39:30] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused
[00:40:31] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.412 second response time
[00:46:10] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: No successful Puppet run in the last 10 hours
[00:46:10] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: No successful Puppet run in the last 10 hours
[00:53:10] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: No successful Puppet run in the last 10 hours
[00:53:10] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: No successful Puppet run in the last 10 hours
[00:58:10] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: No successful Puppet run in the last 10 hours
[00:58:10] PROBLEM - Puppet freshness on ms-be1004 is CRITICAL: No successful Puppet run in the last 10 hours
[00:59:10] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: No successful Puppet run in the last 10 hours
[01:02:10] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: No successful Puppet run in the last 10 hours
[01:04:10] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: No successful Puppet run in the last 10 hours
[01:05:10] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: No successful Puppet run in the last 10 hours
[01:06:34] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[01:07:14] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: No successful Puppet run in the last 10 hours
[01:07:44] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 01:07:40 UTC 2013
[01:08:34] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[01:13:55] (CR) TTO: "This restricts uploads to sysops, rather than disabling them altogether... is this what was intended?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86643 (owner: Danny B.)
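On the MySQL 'utf8' exchange at 00:00:39: per the MySQL manual, the `utf8` character set (now also called `utf8mb3`) stores at most three bytes per character, so it only covers the Basic Multilingual Plane (U+0000 to U+FFFF). Full four-byte UTF-8 arrived as a separate character set, `utf8mb4`, in MySQL 5.5.3. The minimal Python sketch below checks whether a string would fit in the three-byte set. The function names are hypothetical and not part of MediaWiki.

```python
from typing import List


def fits_mysql_utf8mb3(text: str) -> bool:
    """True if every character is in the Basic Multilingual Plane (U+0000..U+FFFF),
    i.e. needs at most 3 bytes in UTF-8, which is all MySQL's 'utf8' can store."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def non_bmp_chars(text: str) -> List[str]:
    """Characters that need 4 bytes in UTF-8 and therefore require utf8mb4."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


if __name__ == "__main__":
    for sample in ["Wikipedia", "日本語", "𠀀 (CJK Extension B)", "emoji 😀"]:
        # Widest UTF-8 encoding of any single character in the sample.
        widest = max(len(ch.encode("utf-8")) for ch in sample)
        print(f"{sample!r}: fits={fits_mysql_utf8mb3(sample)}, widest={widest} bytes")
```

Rarer CJK ideographs (such as the Extension B block) live outside the BMP, which is one concrete case of the "certain languages" problem mentioned in the channel.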
[01:33:54] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 01:33:49 UTC 2013
[01:34:34] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[02:07:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[02:16:20] !log LocalisationUpdate completed (1.22wmf19) at Wed Oct 2 02:16:19 UTC 2013
[02:16:35] Logged the message, Master
[02:30:23] !log LocalisationUpdate completed (1.22wmf18) at Wed Oct 2 02:30:23 UTC 2013
[02:30:34] Logged the message, Master
[02:34:00] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 02:33:53 UTC 2013
[02:35:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[02:56:05] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Oct 2 02:56:05 UTC 2013
[02:56:18] Logged the message, Master
[03:02:04] (CR) MZMcBride: "An associated Bugzilla bug would be nice here." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86643 (owner: Danny B.)
[03:03:50] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 03:03:49 UTC 2013
[03:04:00] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[03:34:29] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 03:34:26 UTC 2013
[03:34:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:04:39] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 04:04:37 UTC 2013
[04:04:59] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:34:22] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 04:34:15 UTC 2013
[04:34:52] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:44:32] RECOVERY - MySQL Replication Heartbeat on db1046 is OK: OK replication delay 0 seconds
[05:03:52] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 05:03:43 UTC 2013
[05:04:52] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:01] Intermittent search failures on Commons:
[05:08:06] "An error has occurred while searching: HTTP request timed out."
[05:09:25] i'd !log that
[05:15:16] !log Commons has intermittent search failures: "An error has occurred while searching: HTTP request timed out."
[05:15:28] cc LeslieCarr
[05:15:31] Logged the message, Master
[05:17:27] superm401: no report in bugzilla yet, right?
[05:17:37] greg-g, didn't check or file.
[05:18:29] k
[05:18:31] superm401: what's strange about it?
[05:19:02] Nemo_bis, didn't say it was strange, but it shouldn't happen either. Not sure what you mean.
[05:19:09] superm401: https://gerrit.wikimedia.org/r/#/c/60759/
[05:19:18] we even stopped logging those errors because there were too many
[05:19:36] it's totally normal
[05:19:39] SNAFU
[05:20:16] We could just raise the timeout as a workaround.
[05:20:21] Although fixing it would be nice too. :)
[05:26:37] superm401: and maybe that would kill the lucene hosts completely? :)
[05:27:24] I mean, I'd like someone to work on lucene but it's not simple, that's why they're abandoning it
[05:27:28] Maybe, depending on how much it was raised.
[05:28:09] Lucene has been that broken for many months and probably years, we just didn't notice clearly because there are no logs and it said "0 results" when it failed
[05:28:26] result is that now that search is being improved some think it's worse ;)
[05:28:33] e.g. https://en.wiktionary.org/wiki/Wiktionary:Grease_pit/2013/September#w00t.21_New_search_indexing_with_all_scripts.2Ftemplate_expansion
[05:28:45] (very short-sighted reply warning)
[05:28:53] I don't recall getting either timeouts or spurious 0 results.
[05:29:03] Not saying it never happened, but it wasn't often enough for me to notice.
[05:29:49] And you probably know, but they're not abandoing Lucene, just using a tool that wraps it (solving some of the problems for us).
[05:33:56] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 05:33:54 UTC 2013
[05:34:46] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[05:59:50] (PS1) Yuvipanda: labs-vagrant: Ensure that vagrant homedir is created [operations/puppet] - https://gerrit.wikimedia.org/r/87041
[06:00:48] anyone to merge a trivial patch?
[06:07:07] (PS2) Legoktm: labs-vagrant: Ensure that vagrant homedir is created [operations/puppet] - https://gerrit.wikimedia.org/r/87041 (owner: Yuvipanda)
[06:08:22] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[06:11:24] superm401: the timeouts were not reported as such, the spurious 0 results are not trivial to identify as spurious :)
[06:11:51] Nemo_bis, yeah, sometimes I wouldn't know.
[06:12:04] But a lot of times, I search for stuff I know there should be results for.
[06:14:10] Yes, that's the only case when you can know
[06:14:27] superm401: personally I think something useful to do would be this: https://bugzilla.wikimedia.org/show_bug.cgi?id=43544#c23
[06:15:15] being able to at least grep -c the logs so that we notice if the errors are suddenly an order of magnitude more frequent would be nice, even just for few months :)
[06:15:28] Yeah
[06:21:13] Ryan_Lane: trivial merge of https://gerrit.wikimedia.org/r/87041?
[06:22:58] superm401: you could file a bug for it then ;)
[06:23:17] I'm surely not as good at filing bugs in a focused way as you are
[06:24:20] Nemo_bis, do you know whether what proportion of the errors are timeouts?
[06:25:27] superm401: define "errors"? :)
[06:25:47] Whatever makes it to the mwsearch log.
[06:26:07] superm401: I think you are the one with shell access here? :P
[06:26:15] it may be on fluorine:/a/mw-log/mwsearch.log still
[06:26:21] or wherever it's rotated
[06:27:49] a possible approach, if it's really so huge, would be to just turn on logging for, say, 24h and compare the length of the log to a "standard" one
[06:28:24] No files in that directory with search anywhere in the name.
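Nemo_bis's suggestion at 06:15:15 (grep -c the search log to notice an order-of-magnitude jump) and the 06:27:49 idea (log for a day, then compare against a "standard" log) amount to counting matching lines in two logs and comparing them. A minimal Python sketch, assuming the timeout errors are logged with the same text as the user-facing message quoted at 05:08:06; the real mwsearch log format and the file names are assumptions.

```python
import re
import sys
from typing import Iterable

# Assumption: timeout failures carry the same text as the user-facing error
# quoted in the channel; adjust this to the actual mwsearch log format.
TIMEOUT_PATTERN = re.compile(r"HTTP request timed out")


def count_matches(lines: Iterable[str], pattern: re.Pattern = TIMEOUT_PATTERN) -> int:
    """Like `grep -c PATTERN`: the number of lines containing at least one match."""
    return sum(1 for line in lines if pattern.search(line))


def count_in_file(path: str, pattern: re.Pattern = TIMEOUT_PATTERN) -> int:
    """Count matching lines in a log file, tolerating undecodable bytes."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return count_matches(fh, pattern)


def is_order_of_magnitude_jump(current: int, baseline: int, factor: float = 10.0) -> bool:
    """True if `current` is at least `factor` times `baseline`.
    A zero baseline is treated as 1 so that a burst from nothing still registers."""
    return current >= factor * max(baseline, 1)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python count_timeouts.py baseline.log today.log
    base, cur = count_in_file(sys.argv[1]), count_in_file(sys.argv[2])
    print(f"baseline={base} current={cur} jump={is_order_of_magnitude_jump(cur, base)}")
```

The factor of 10 mirrors the "order of magnitude" wording; in practice a smaller factor might be more useful for catching gradual degradation.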
[06:29:00] hmpf
[06:29:14] where is the archive
[06:29:50] only 5 years old page https://wikitech.wikimedia.org/wiki/MediaWiki_UDP_logging
[06:31:42] (CR) Akosiaris: [C: 2] labs-vagrant: Ensure that vagrant homedir is created [operations/puppet] - https://gerrit.wikimedia.org/r/87041 (owner: Yuvipanda)
[06:31:50] ty, akosiaris
[06:32:19] sigh https://wikitech.wikimedia.org/wiki/Search/UDP_Logger
[06:32:19] wow you are fast :-)
[06:32:47] akosiaris: :D one more patch coming up
[06:32:52] hope there's a timezone chart of all ops
[06:32:55] s/hope/wish
[06:33:39] ahahaha... now that we got Sean from australia we pretty much cover the world :-)
[06:34:38] :D
[06:34:38] true
[06:35:02] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 06:34:55 UTC 2013
[06:35:22] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[06:42:40] Nemo_bis, CCed you on https://bugzilla.wikimedia.org/show_bug.cgi?id=54865
[06:44:46] superm401: thanks
[07:12:17] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[07:12:27] RECOVERY - search indices - check lucene status page on search1022 is OK: HTTP OK: HTTP/1.1 200 OK - 56465 bytes in 0.016 second response time
[07:33:57] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 07:33:56 UTC 2013
[07:34:17] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[08:08:26] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[08:33:56] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 08:33:54 UTC 2013
[08:34:26] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[08:53:56] (PS1) Mattflaschen: Labs: Turn off secure login on loginwiki due to untrusted SSL [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87045
[08:54:33] (PS2) Mattflaschen: Labs: Turn off secure login on loginwiki due to untrusted SSL [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87045
[09:36:06] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 09:36:00 UTC 2013
[09:36:36] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[09:49:26] (CR) Danny B.: "@TTO:" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86643 (owner: Danny B.)
[09:51:10] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours
[09:53:28] aaaaaaaaaaaaaaaarghhh
[09:53:36] mw1125 is out of sync
[09:58:36] ?
[09:59:25] MaxSem: ?
[09:59:32] I see errors on it that were resolved by yesterday's deploy and are not coming from other boxes anymore
[09:59:46] running sync-common
[10:00:08] it's in dsh groups, wonder how it would have failed
[10:01:46] happens at times
[10:10:36] (PS2) TTO: skwikisource: Disable upload [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86643 (owner: Danny B.)
[10:13:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:13:52] DatabaseInstaller::setupSchemaVars: unexpected DB connection error
[10:14:03] that is sooo useful :-D
[10:30:20] !log Resynched MW on mw1125, looked like slightly out of sync
[10:30:31] Logged the message, Master
[10:34:01] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 10:33:58 UTC 2013
[10:34:41] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[10:46:06] (PS1) Hashar: contint: browsertests needs php5-sqlite [operations/puppet] - https://gerrit.wikimedia.org/r/87056
[10:46:44] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: No successful Puppet run in the last 10 hours
[10:46:44] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: No successful Puppet run in the last 10 hours
[10:53:44] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: No successful Puppet run in the last 10 hours
[10:53:44] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: No successful Puppet run in the last 10 hours
[10:54:48] (CR) MaxSem: [C: -1] "Please provide a link to bug requesting this." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86643 (owner: Danny B.)
[10:57:50] (CR) Danny B.: "What is the sense of creating of a new bug and immediately closing it as fixed? No prob to do it, but seems like unnecessary bureaucracy t" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86643 (owner: Danny B.)
[10:58:44] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: No successful Puppet run in the last 10 hours
[10:58:44] PROBLEM - Puppet freshness on ms-be1004 is CRITICAL: No successful Puppet run in the last 10 hours
[10:59:44] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: No successful Puppet run in the last 10 hours
[11:02:44] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: No successful Puppet run in the last 10 hours
[11:04:44] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: No successful Puppet run in the last 10 hours
[11:05:43] (PS1) Hashar: contint: fetch slave scripts on slaves [operations/puppet] - https://gerrit.wikimedia.org/r/87058
[11:05:44] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: No successful Puppet run in the last 10 hours
[11:06:24] (CR) MaxSem: "Sure, no problem - but then a link to community consensus would be appreciated:)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86643 (owner: Danny B.)
[11:07:43] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[11:08:13] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: No successful Puppet run in the last 10 hours
[11:15:34] (CR) Danny B.: "There is no such link, this configuration is just logical outcome of the status quo (like on other projects):" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86643 (owner: Danny B.)
[11:33:53] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 11:33:44 UTC 2013
[11:34:43] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[11:36:18] (CR) TTO: "In the spirit of fairness, at least post an announcement on the community portal to give anyone a chance to object." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86643 (owner: Danny B.)
[11:40:15] (CR) Danny B.: "OK, as you wish. Although I am actually the only active user there ATM... ;-)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86643 (owner: Danny B.)
[12:07:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[12:34:38] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 12:34:32 UTC 2013
[12:35:38] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:10:27] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:33:57] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 13:33:48 UTC 2013
[13:34:27] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[13:37:56] (CR) Dzahn: [C: -1] "the command => "usermod -a -G Debian-exim nagios is necessary because puppet can't add an existing user to an existing group, it can only " [operations/puppet] - https://gerrit.wikimedia.org/r/86889 (owner: Matanya)
[13:49:53] mutante: thanks for this
[13:50:03] I see a better way to do it
[13:51:19] (PS1) Yuvipanda: Add vagrant user to sudoers by default [operations/puppet] - https://gerrit.wikimedia.org/r/87080
[13:51:32] (CR) jenkins-bot: [V: -1] Add vagrant user to sudoers by default [operations/puppet] - https://gerrit.wikimedia.org/r/87080 (owner: Yuvipanda)
[13:51:48] (PS2) Yuvipanda: Add vagrant user to sudoers by default [operations/puppet] - https://gerrit.wikimedia.org/r/87080
[13:58:00] hmm, do we have any python services deployed in production?
[13:58:12] php is most of our stuff, and then there's parsoid...
[14:03:04] yuvipanda: if you count IRC bots as services ...and monitoring scripts
[14:03:15] find . -name *.py in puppet repo
[14:03:37] hmm
[14:03:52] (PS3) Yuvipanda: Add vagrant user to sudoers by default [operations/puppet] - https://gerrit.wikimedia.org/r/87080
[14:03:53] mutante: was thinking of setting aside some time to write up code for the ShortURL service from the RFC
[14:04:08] YuviPanda: isn't analytics using some python stuff?
[14:04:29] siebrand: I don't know if that is 'in production' as such yet, though.
[14:04:35] metrics is on labs...
[14:04:43] YuviPanda: pywikibot? ;)
[14:04:49] heh :D
[14:05:15] siebrand: puppet is pretty much the only ruby thing in our stack, afaik
[14:05:26] but python is littered here and there, no big service as such uses it
[14:05:31] YuviPanda: nope. All the front end QA is too.
[14:05:45] YuviPanda: (ruby, that is)
[14:05:47] siebrand: ah, right. but I was considering 'things that run in the cluster'
[14:05:54] swift , ldap, salt, ganglia, there are some random pythong scripts in lots of places, but that doesnt make them python services i guess
[14:06:05] yuvipanda: ( hashar: where is the meetbot repo? )
[14:06:08] YuviPanda: puppet doesn't run on the cluster, it configures the cluster?
[14:06:21] siebrand: it runs on each machine in order to configure the cluster
[14:06:25] siebrand: plus there's the puppetmaster too
[14:06:28] yuvipanda: nothing and I don't have the time to work on meetbot. Consider the current install a teaser
[14:06:40] hashar: okay, let me rephrase - where it the meetbot code? :)
[14:06:53] yuvipanda: "I don't have time "sorry
[14:07:13] hashar: to rephrase again - is it something that exists somewhere and you just used it, or did you write it yourself?
[14:07:16] no more questions, I promise!
[14:07:30] I already regret having done that proof of concept cause now people are distracting me and thought meetbot is important
[14:07:31] :(
[14:07:57] well, ok.
[14:08:35] just saying that if the code is somewhere perhaps other people can fix things, set it up in a semi-production state, etc.
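mutante's 14:03:15 suggestion lists the Python files in a checkout of the puppet repo. As typed, the glob is unquoted; if the current directory happens to contain `.py` files, the shell expands `*.py` before find sees it, so `find . -name '*.py'` is the safer spelling. A small Python equivalent (roughly `find ROOT -type f -name '*.py'`), assuming only a local checkout path of your choosing:

```python
import sys
from pathlib import Path
from typing import List


def find_python_files(root: str) -> List[Path]:
    """Recursively collect regular files ending in .py under `root`, sorted."""
    return sorted(p for p in Path(root).rglob("*.py") if p.is_file())


if __name__ == "__main__":
    # Hypothetical usage: python find_py.py /path/to/operations-puppet
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_python_files(root):
        print(path)
```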
[14:08:41] but I understand if you don't have time for it
[14:08:58] yeah none sorry
[14:09:10] but will ping ya in a couple weeks when I puppetize it :-]
[14:09:20] fine, hoard the code :P
[14:09:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[14:09:29] focusing on having browser tests triggered in Gerrit when people send patchsets in Gerrit
[14:10:22] yuvipanda: integration-meetbot .pmtpa.wmflabs
[14:10:34] thank you, that is all that I was looking for :)
[14:10:38] yuvipanda: you are already member of that project and get root
[14:10:44] let me look
[14:10:48] * hashar context switch
[14:11:05] yuvipanda: so basically : install supybot using the ubuntu package
[14:11:34] yuvipanda: MeetBot is a plugin, I have fetched it under /mnt/meetbot
[14:11:49] I see it
[14:11:51] then /mnt/supybot is the base dir for supybot to run into
[14:12:11] you can drop plugins in /mnt/supybot/plugins
[14:12:15] I simply created a symbolic link
[14:12:33] yeah, I understand. MeetBot is just a supybot plugin, and I see where the code is
[14:12:37] which is ugly
[14:13:03] hah! :D
[14:13:05] and the configuration for Meetbot is in /mnt/meetbot/MeetBot/meetingLocalConfig.py
[14:13:09] the file must be named like that
[14:13:14] it is hardcoded inside MeetBot
[14:13:31] note that /mnt/meetbot/MeetBot is a darks repository :-]
[14:13:39] yeah, i noticed the _darcs folder
[14:13:48] should be interesting, have only 'heard' of it before
[14:14:16] then one can start meetbot using the upstart conf I imported from openstack: cat /etc/init/meetbot.conf
[14:14:18] (PS1) Dzahn: change wikimania.wm redirect from 2013 to 2014 [operations/apache-config] - https://gerrit.wikimedia.org/r/87112
[14:14:20] aka start meetbot
[14:14:21] or stop meetbot
[14:14:27] (CR) Cmcmahon: [C: 1] "I'd like to test this right away in beta labs. It would be easy enough to revert if something goes haywire." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87045 (owner: Mattflaschen)
[14:14:45] mm, I see that too!
[14:15:12] i'll pkoe around
[14:15:13] so it is really simple really: create a git repo to mirror the darcs repo (just need the current version)
[14:15:14] thanks hashar!
[14:15:55] write a manifest that provide the upstart conf, install supybot, create a meetbot user, clone the meetbot repo, service { meetbot: ensure => running }
[14:16:04] + a template for meetingLocalConfig.py :D
[14:16:26] Reedy: around? can you tell what the next move is to get this further along to +2? https://gerrit.wikimedia.org/r/#/c/84897/
[14:16:31] * hashar switch back
[14:16:41] hashar: thank you!
[14:17:07] (CR) Dzahn: [C: 1] "i'm not touching the "wikimania.asia" links because 2014 wikimania isn't in Asia" [operations/apache-config] - https://gerrit.wikimedia.org/r/87112 (owner: Dzahn)
[14:18:30] (CR) Dzahn: [C: 2] change wikimania.wm redirect from 2013 to 2014 [operations/apache-config] - https://gerrit.wikimedia.org/r/87112 (owner: Dzahn)
[14:18:36] lolol
[14:20:52] mutante: think you can +2 https://gerrit.wikimedia.org/r/#/c/87080/?
[14:21:27] Reedy: ?
[14:21:37] Still needs a better name...
[14:21:44] The rest of it is simple enough to fix up
[14:21:57] Reedy: paint that bikeshed, man!
[14:22:00] yuvipanda: need to sync and graceful Apaches first..
[14:22:06] mutante: heh, sure!
[14:22:17] and i dont know much vagrant stuff.. well..
[14:22:33] mutante: well, the only person doing vagrant stuff on ops repo is... me
[14:22:39] chrismcmahon: It's more of the name doesn't make much sense...
[14:22:53] $wgExtensionsEntryPointListFile
[14:22:55] s
[14:23:03] mark, if you were around, I'd appreciate another look at that same pybal.conf.erb puppet error on lvs4001
[14:23:11] mutante: can't +2 them myself :)
[14:24:04] Reedy: I don't want to get stuck in loop where every new name gets a -1. that would take months at this rate. what IS a good name? (and who would know?)
[14:24:28] Something that makes sense as to what the global does/is used for
[14:25:28] Reedy: $messagesToNonProdEnvs
[14:25:35] !log sync-apache, apache-graceful-all for wikimania2014 redirect
[14:25:48] Logged the message, Master
[14:32:28] Reedy: $splitMessagesForBetaLabs
[14:34:55] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 14:34:46 UTC 2013
[14:35:17] (PS3) Ottomata: Adding kafka::udp2log::relay define to consume from Kafka and send to udp2log. [operations/puppet] - https://gerrit.wikimedia.org/r/86894
[14:35:25] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[14:41:48] (CR) Dzahn: [C: 2] "i guess there is not really a way around thiis since the vagrant user sets stuff up" [operations/puppet] - https://gerrit.wikimedia.org/r/87080 (owner: Yuvipanda)
[14:42:09] mutante: thanks!
[14:47:08] (CR) Reedy: "(1 comment)" [operations/apache-config] - https://gerrit.wikimedia.org/r/84707 (owner: Reedy)
[14:53:24] (PS6) Reedy: Simplify wikimania apache conf, reuse wikimedia.org docroot. [operations/apache-config] - https://gerrit.wikimedia.org/r/84707
[14:55:04] (PS7) Reedy: Simplify wikimania apache confs, reuse wikimedia.org docroot. [operations/apache-config] - https://gerrit.wikimedia.org/r/84707
[14:55:14] (PS8) Reedy: Simplify wikimania apache confs, reuse wikimedia.org docroot. [operations/apache-config] - https://gerrit.wikimedia.org/r/84707
[15:12:34] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[15:16:45] (PS1) Ottomata: Updating kafka module to latest commit, modifying kafka role to reflect recent changes there. [operations/puppet] - https://gerrit.wikimedia.org/r/87151
[15:18:07] (PS2) Ottomata: Updating kafka module to latest commit, modifying kafka role to reflect recent changes there. [operations/puppet] - https://gerrit.wikimedia.org/r/87151
[15:18:48] (CR) Ottomata: [C: 2 V: 2] Updating kafka module to latest commit, modifying kafka role to reflect recent changes there. [operations/puppet] - https://gerrit.wikimedia.org/r/87151 (owner: Ottomata)
[15:21:33] (CR) Dzahn: [C: -1] "unfortunately this looks like 404s when i tested" [operations/apache-config] - https://gerrit.wikimedia.org/r/84707 (owner: Reedy)
[15:27:05] (PS4) Ottomata: Adding kafka::udp2log::relay define to consume from Kafka and send to udp2log. [operations/puppet] - https://gerrit.wikimedia.org/r/86894
[15:30:52] yuvipanda: an URL shortener sounds like an excellent task for Twisted or (faster) node
[15:31:11] gwicke: indeed, i was thinking of Node
[15:31:23] gwicke: with the idea that 99% of hits or more should be handled by varnish
[15:31:34] yeah
[15:31:44] gwicke: it's not terribly hard to do, I might see if I can spend an hour or so writing it
[15:31:51] a simple node http server gets me something like 7k req/s
[15:32:11] with both ab and the server on my aging laptop
[15:32:51] gwicke: right
[15:33:04] gwicke: only thing I was wondering is the status of npm packages and our ops repo
[15:33:11] gwicke: since I'd need to use node-mysql
[15:33:15] at the least
[15:33:38] we handle that with a contrib/config repo
[15:33:53] gwicke: how so?
[15:34:08] currently we deploy that manually, but want to make it a subrepository instead that is automatically deployed along with the main code
[15:34:10] gwicke: that's the one thing python has going for it, since the packages i'd use there are all in repo.
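On the ShortURL service raised at 14:03:53 and discussed from 15:30:52: the core of such a redirector is small, namely decode a short code to a row ID, look it up, and answer with a redirect that a cache such as Varnish can keep. The sketch below is in Python rather than the Node that was discussed, only to keep these notes in one language. Everything in it is hypothetical: the `/s/` path prefix, base-36 codes over integer row IDs, and the dict standing in for the database query (node-mysql in the Node version).

```python
import string
from typing import Dict, Optional, Tuple

# Hypothetical alphabet: digits + lowercase letters (base 36). Whether the real
# ShortURL service uses this encoding is an assumption.
ALPHABET = string.digits + string.ascii_lowercase
BASE = len(ALPHABET)


def encode_id(n: int) -> str:
    """Encode a non-negative integer row ID as a short base-36 string."""
    if n < 0:
        raise ValueError("id must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_id(code: str) -> int:
    """Inverse of encode_id; rejects anything outside the alphabet."""
    if not code or any(c not in ALPHABET for c in code):
        raise ValueError(f"invalid short code: {code!r}")
    return int(code, BASE)


def resolve(path: str, table: Dict[int, str]) -> Tuple[int, Optional[str]]:
    """Map a request path like '/s/9ix' to (HTTP status, redirect target).
    `table` stands in for the database lookup."""
    prefix = "/s/"
    if not path.startswith(prefix):
        return 404, None
    try:
        row_id = decode_id(path[len(prefix):])
    except ValueError:
        return 404, None
    target = table.get(row_id)
    # A permanent redirect is cacheable, which is what lets a front-end cache
    # absorb most hits, as suggested in the channel.
    return (301, target) if target else (404, None)
```

Keeping `resolve` a pure function leaves the HTTP server, cache headers, and database client as separate, swappable pieces.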
[15:34:12] aaah
[15:34:13] right
[15:34:16] all with git-deploy
[15:34:19] right
[15:34:25] in repo dependencies
[15:34:54] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 15:34:50 UTC 2013
[15:35:04] that will also let us test with the libraries before deploying
[15:35:34] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[15:35:41] gwicke: right
[15:35:46] gwicke: are ops okay with that?
[15:36:22] yes, it is better than what we do right now
[15:36:27] heh
[15:36:46] https://bugzilla.wikimedia.org/show_bug.cgi?id=53723
[15:37:25] discussed this with Ryan, in the context of also supporting Debian packaging this is basically what we came up with
[15:38:17] reading https://www.mediawiki.org/wiki/Parsoid/Packaging now
[15:40:21] yuvipanda: in the longer term I hope that we can package more dependencies too
[15:40:34] gwicke: i guess the nodejs packaging scene isn't that hot..
[15:41:00] gwicke: there are some, but not that many
[15:41:17] iirc there is a tool that automatically packages npm modules
[15:41:34] https://npmjs.org/package/npm2debian
[15:42:39] oh
[15:44:34] * Reedy kicks grrrit-wm
[15:45:22] Reedy: what's happening?
[15:45:34] Reedy: your comments came through on -dev...
[15:46:01] thanks Reedy
[15:46:02] Closest we have to kicking gerrit
[15:46:03] Permission denied (publickey).
[15:46:03] fatal: Could not read from remote repository.
[15:46:03] Please make sure you have the correct access rights
[15:46:03] and the repository exists.
[15:46:44] <^d> Gerrit hides in a castle so you can't kick him :)
[15:47:11] * Reedy just kicks the messenger instead
[15:54:56] paravoid: Have you had any time yet to look over my RfC draft?
[15:56:22] not yet :(
[15:56:37] * bd808 sulks
[15:57:29] paravoid: I'll keep busy today trying to make a better version of purgeDeletedFiles
[15:59:18] sorry...
[16:00:11] no worries. Everybody is busy. If I run out of other things I'll "be bold" and just move it over to the proper namespace
[16:00:15] no need to apologize to bd808, I don't he has feelings.
[16:00:44] s/don't he/don't think he/ # ugh
[16:00:49] bd808: Will it just delete all the files with no prejudice?
[16:01:29] Reedy: Some notes at https://www.mediawiki.org/wiki/User:BDavis_(WMF)/Notes/Finding_Files_To_Purge#purgeChangedFiles
[16:02:07] basically want to expand it to handle things other than deletes and be able to limit htcp broadcast range
[16:03:16] * bd808 ignores greg-g's trolling :P
[16:04:19] bd808: awwwwww
[16:04:43] Anyone know where the squid logs go now? locke has no recent writes in /a/squid and emery has no /var/log/squid - https://wikitech.wikimedia.org/wiki/Squid_logging
[16:05:13] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[16:06:15] you had it almost right
[16:06:22] just combine the two :)
[16:06:25] emery /a/log
[16:08:15] That's got sampled-1000.tsv.log but no sign of the 5XX logs
[16:08:50] (I know, I didn't specify ;))
[16:08:53] oxygen /a/log ?
[16:09:10] reedy@bast1001:~$ ssh -A oxygen
[16:09:10] Permission denied (publickey).
[16:09:15] oh heh
[16:09:28] ottomata1: here?
[16:09:32] I can get into emery, locke, stat1
[16:09:34] logging's hard
[16:09:41] Not into oxygen and stat1001 etc...
[16:10:28] yo
[16:10:49] ottomata1: Reedy is looking for 5xx.log, is it anywhere else but oxygen?
[16:10:54] Reedy, if you want historical
[16:11:01] stat1002:/a/squid/archive
[16:11:05] stat1002.eqiad.wmnet
[16:11:16] Can't get onto stat1002
[16:11:22] need account?
[16:11:25] I'm not sure what's wanted... manybubbles ^^ What're you wanting with the 5XX logs?
[16:11:27] * bd808 is ready to think seriously about a giant logstash instance
[16:11:47] bd808 yes!
[16:11:50] let's talk about that
[16:12:01] It's somewhat amusing I can apparently get onto rand() analytics hosts but not others
[16:12:02] let's build the proper elasticsearch cluster first
[16:12:07] (PS1) Chad: Cirrus: Remove commented officewiki, cawiki to primary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87157
[16:12:16] drdee: It's paravoid's pet project.
[16:12:18] I was thinking of ordering a few more of the ES boxes for logstash
[16:12:25] i have been wanting to look into that hadoop log files
[16:12:29] it's also my pet project :D
[16:12:36] yeahhhhh!
[16:12:37] lets do it!
[16:12:48] we have a procurement ticket where we discuss ES boxes with manybubbles
[16:12:53] cool
[16:12:57] let's figure out the details for these first
[16:13:00] ok
[16:13:19] We want all the fast servers
[16:13:19] Done
[16:13:20] yeah, and Reedy, if you want live logs, you need to get onto oxygen
[16:13:45] Need to wait for manybubbles to reply
[16:14:03] Reedy: sorry, you want to know why I want those 500s I emailed about earlier?
[16:14:06] https://wikitech.wikimedia.org/wiki/Squid_logging is pretty out of date it seems
[16:14:29] Reedy: someone was complaining about seeing gateway timeouts but I couldn't find them in any logs I knew how to look at
[16:14:30] manybubbles: Not necesserily why, just whether you want live and now, or archive, or both
[16:14:35] paravoid:what's the rt ticket?
[16:14:49] As it's different boxes that you need access too for each...
[16:14:54] drdee: 5883
[16:15:01] it's 5883 but please don't distract the conversation with logstash for now
[16:15:06] <^d> Reedy: We getting a 1.22wmf20 today? :)
[16:15:11] paravoid: good call
[16:15:14] ^d: Are we?
[16:15:21] i won't i just want to follow the ticket
[16:15:24] It's Wednesday...
[16:15:30] :)
[16:15:36] <^d> Duh.
[16:15:39] <^d> Wednesday != Thursday [16:15:46] ^d: not normally [16:15:46] heh [16:15:50] 24 hours or so [16:16:31] Reedy: I want them for a particular time period I guess, so archive. I don't imagine I'll be tailing the logs. [16:17:01] manybubbles: then you want stat1002.eqiad.wmnet [16:17:21] I suspect he won't currently have access either then... [16:17:24] if you want an account i guess you gotta make an RT ticket i suppose [16:17:25] ottomata: is that on the Squid_logging page? [16:17:57] Reedy and ottomata: I'll file one then [16:19:53] Squid logging [16:19:57] sounds like a page I have never heard of :p [16:20:30] no sir it is not [16:20:39] xD [16:20:46] * Reedy looks to find the out of date template [16:20:56] manybubbles: FYI, the diagram at the top looks relatively correct [16:21:35] roughly [16:21:37] sort of [16:21:48] drdee, re Cassandra: I have been evaluating it for a while now, and it looks good so far. The plan is to test it for HTML revision storage soon, and if that works fine it could be used more broadly. [16:22:09] Reedy: how diagrams should be [16:28:25] does anyone else here feel like troubleshooting ganglia is like looking for a your keys in a dark room you've never been in before? [16:28:46] or maybe like sticking your hand into an opaque bag full of an unknown substance [16:28:56] ottomata: a dark room full of bats [16:28:57] there must have been an old nickelodeon game show where you had to do that [16:29:30] Jeff_Green: have you had any success swatting your bats? [16:29:46] wearing a Red Shirt and signing up for an away mission? 
[16:29:52] everything seems normal on my side, i even see (some) of the metrics i'm looking for on my cluster's multicast source group [16:29:59] not recently, but in general when I've faught with ganglia I've eventually figured it out [16:30:01] but the metrics aren't on nickel gmetad [16:30:04] yeah [16:30:15] that's happened to me too, usually by poking as far as I have, and then restarting gmetad [16:30:21] but this time no luck so far [16:30:48] how do you mean re seeing them on your cluster's multicast source group? [16:34:53] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 16:34:52 UTC 2013 [16:35:13] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [16:35:53] Jeff_Green: you can telnet or netcat into gmond addies [16:36:04] oic [16:36:29] so you see it being aggregated but that metric doesn't seem to get collected? [16:37:58] and you have other metrics for the host in question successfully making it to nickel? [16:39:19] sorry, i suppose not on the multicast source [16:39:25] i think just into the aggregator addies [16:39:33] it shows the gangali xml that gets sent [16:39:36] so I can see the metrics there [16:39:42] but not at gmetad [16:40:30] but you see other metrics arriving at gmetad for that same cluster? 
[16:41:19] yes, the base ones [16:41:22] just not my extra kafka ones [16:43:56] my first wild guess would be to look closely at all the parameters your collector defines and make sure they're 'proper' [16:47:16] ottomata1: I ran into an issue at some point with reporting a decimal value without type=float or similar, I forget the specifics [16:48:09] i'm hardly defining the values, they are automatically sent by https://github.com/criteo/kafka-ganglia [16:48:16] and, this has worked before [16:48:41] and from nickel, if I do [16:48:45] telnet analytics1011.eqiad.wmnet 8649 | grep kafka [16:48:48] i can see things like [16:48:52] [16:48:54] which is exactly what I want [16:49:44] type=double? [16:50:30] -t, --type=STRING Either string|int8|uint8|int16|uint16|int32|uint32|float|double [16:50:31] looks ok [16:50:35] http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_readme [16:51:05] what's the "double" type? and does it make sense for values like 11.89? [16:51:16] i would have guessed float [16:52:12] double is a bigger float [16:54:16] it seems to spew to /var/log/messages -- maybe there's somethign there? [17:06:54] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [17:07:50] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [17:08:30] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown [17:08:31] PROBLEM - Apache HTTP on mw31 is CRITICAL: Connection refused [17:08:50] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [17:08:55] what's up with mw31? 
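Around 16:50 the question was whether gmetric's `double` type makes sense for a value like 11.89, or whether `float` would fit better ("double is a bigger float"). A minimal sketch, assuming ganglia's float and double map to 32-bit and 64-bit IEEE 754 values the way XDR's do: it packs a number each way and reads it back. The helper names `roundtrip` and `float_error` are hypothetical.

```python
import struct


def roundtrip(value: float, fmt: str) -> float:
    """Encode value as a big-endian IEEE 754 number and decode it again.

    fmt is "f" for a 32-bit float or "d" for a 64-bit double. Big-endian
    order is used because XDR (assumed here as ganglia's wire format) is
    big-endian.
    """
    packed = struct.pack(">" + fmt, value)
    return struct.unpack(">" + fmt, packed)[0]


def float_error(value: float) -> float:
    """Absolute error from storing value as a 32-bit float instead of a double."""
    return abs(roundtrip(value, "f") - roundtrip(value, "d"))
```

A 32-bit float keeps roughly 7 significant decimal digits and a double roughly 15 to 16, so either type can carry 11.89 for graphing; `double` is probably not, by itself, what keeps a metric from reaching gmetad.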
[17:10:31] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.256 second response time [17:27:30] RECOVERY - NTP on mw31 is OK: NTP OK: Offset -0.0006977319717 secs [17:34:20] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 17:34:16 UTC 2013 [17:34:50] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [17:36:35] OH I am having larger ganglia issues than I thought [17:36:40] none of the hadoop metrics are showing up either! [17:36:41] HMMMm [17:36:44] they used to. [17:38:22] are the RRD files there? [17:38:23] ottomata: because you got me looking, i discovered the pmtpa fundraising aggregator boxes were accidentally deaf=yes [17:38:56] hm, my aggregators are deaf=no [17:39:15] * drdee wonders whether ganglia is tone deaf [17:39:16] Jeff_Green: one thing I learned recently is that $ganglia_aggregator has to be one of the first things set in the node def in site.pp [17:39:18] neon would be screaming in /var/log/messages if otherwise [17:39:35] if it isn't the first thing, ganglia.pp won't see it [17:39:37] puppet is weird. [17:40:00] ottomata: it is selectively everything it is [17:40:20] it's exceptions all the way down :-P [17:41:07] ori-l: i never remeber where those rrd files are [17:41:08] do you know? [17:41:22] opnm [17:41:24] found them [17:43:03] bd808: had an issue where a change to the slope setting didn't seem to stick [17:43:33] you can query the RRD file for metadata using rrdtool [17:43:37] ori-l: "didn't stick"? 
[17:43:46] didn't seem to take effect [17:43:59] yeah [17:44:00] * bd808 reads backscroll [17:44:10] ori-l i see rrd files for hadoop metrics, but that's because they existed before [17:44:13] last value is from sept 24 though [17:44:18] kafka ones doen't exist yet [17:45:21] ottomata: also check /var/log/syslog [17:45:25] gmetad logs to syslog [17:45:29] on nickel i mean [17:46:01] yeah, not seeing much relevant [17:46:03] been looking there a bit [17:46:18] hm [17:46:49] hmmmm [17:47:04] i made a change to the analytics cluster confings on sept 24 [17:47:08] moved an aggregator [17:49:06] ori, have you used salt much yet? [17:49:11] if so, where do you run salt commands from? [17:49:52] i've never used it except indirectly via git-deploy, Ryan_Lane is the person to ask [17:50:28] k [17:54:57] ottomata: you run salt commands from sockpuppet [17:55:26] tin can also run a limited set of commands [17:55:27] k cool, i'm getting annoyed with dsh, i just tried to match the analytics hosts in salt, but it looks like it only matched analytics1001 and analytics1002 [17:55:33] is there anyting else I need to do? [17:55:43] maybe their keys aren't accepted? [17:55:47] salt 'analytics10*' test.ping [17:55:48] hm [17:55:49] were they rebuilt or recently built? [17:55:54] many yes [17:56:01] like within the last month or so [17:56:06] heh. there's a ton of unaccepted keys [17:56:27] ottomata: I accepted them [17:56:29] salt-key -L [17:56:38] salt-key -a [17:56:42] salt-key -A [17:56:43] ^^ for all [17:57:10] there's some docs on salt: https://wikitech.wikimedia.org/wiki/Salt [17:57:13] yeah was readin gthat [17:57:19] how do I know which key is accepted? [17:57:25] if they show up in -L are they unaccepted? [17:57:26] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:57:36] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:57:40] there's a list of accepted and unaccepted [17:57:47] I just accepted all the unaccepted ones [17:57:58] ah i see [17:58:12] does it take a while for them to know they are accepted? :p [17:58:22] i get just 5 nodes right now [17:58:23] it takes till their next checkin [17:58:34] ok [17:58:38] which is relatively often [17:58:42] https://github.com/saltstack/salt/issues/5752 [17:58:51] soon enough we won't have to worry about salt keys [17:58:56] oh yeah i remember hearing some discussion of that [17:58:58] they'll just use the puppet key system [18:00:05] Ryan_Lane: since you're hear, have you ever had trouble with ganglia seemingly reporting metrics within a defined cluster, but not showing up at gmetad? [18:00:15] here* [18:00:16] RECOVERY - DPKG on mw1125 is OK: All packages OK [18:00:26] RECOVERY - Disk space on mw1125 is OK: DISK OK [18:00:39] well, gmetad needs to have the cluster defined [18:00:43] yes, it does [18:00:48] thisis the analytics cluster [18:00:50] all of this was working [18:00:54] until I moved an aggregator on sept. 24 [18:01:01] i don't know if that broke it, but that's the last time I see these metrics [18:01:12] I've had this issue with the virt cluster, but that cluster is partially incorrectly defined [18:01:37] hm, howso? [18:01:42] i think mine is defined properly [18:01:53] root@nickel:/etc/ganglia# grep analytics /etc/ganglia/gmetad.conf [18:01:54] data_source "Analytics cluster eqiad" analytics1009.eqiad.wmnet analytics1011.eqiad.wmnet [18:02:27] root@nickel:/etc/ganglia# grep analytics /etc/ganglia/gmetad.conf [18:02:27] data_source "Analytics cluster eqiad" analytics1009.eqiad.wmnet analytics1011.eqiad.wmnet [18:02:27] an11: deaf = no [18:02:27] an09: deaf = no [18:07:23] ottomata: how are 1009 and 1011 configured? [18:07:32] are they listening on those ports? [18:07:45] is that cluster somehow firewalled? 
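The salt exchange between 17:55 and 17:58 depends on one fact: the master only targets minions whose keys it has accepted. That is why `salt 'analytics10*' test.ping` answered for just a few hosts until `salt-key -A` accepted the rest, and even then only after each minion's next check-in. A small sketch of that reasoning, assuming `salt-key -L` prints plain, uncoloured sections headed `Accepted Keys:`, `Unaccepted Keys:` and `Rejected Keys:`, and that the target is a shell-style glob as in the log. The helper names are hypothetical.

```python
from fnmatch import fnmatch
from typing import Dict, List, Tuple


def parse_salt_key_list(output: str) -> Dict[str, List[str]]:
    """Group minion IDs under their section header (e.g. 'Accepted Keys')."""
    sections: Dict[str, List[str]] = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):  # a header such as "Unaccepted Keys:"
            current = line[:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def glob_report(sections: Dict[str, List[str]], target: str) -> Tuple[List[str], List[str]]:
    """Split minions matched by a glob into (reachable, waiting for key acceptance)."""
    accepted = [m for m in sections.get("Accepted Keys", []) if fnmatch(m, target)]
    pending = [m for m in sections.get("Unaccepted Keys", []) if fnmatch(m, target)]
    return accepted, pending
```

Any host that shows up in the second list is one that `salt-key -a <id>` (or `-A` for all) would need to accept before it can answer.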
[18:07:58] yeah, they are mcast ganglia aggregators for the analytics cluster [18:08:10] um, it is sorta firewalled, but it shouldn't be for ganglia... [18:08:11] did you actually check via netstat, or are you just guessing? :) [18:08:13] also, this worked before [18:08:15] and yes, i did [18:08:23] ok [18:08:33] i can even see the metrics by netcatting into the aggregator ports [18:08:47] but if I netcat into the gmetad port, I don't see the new metrics [18:08:52] only the base ones that ganglia comes with [18:09:00] can you connect to them from nickel? [18:09:06] using nc? [18:09:07] i checked one of the rrd files for the missing metrics [18:09:16] and the last update was sept 24, same day I changed one of the aggregators [18:09:19] lemme check [18:09:20] i thikn so [18:09:41] yes [18:09:41] i can [18:09:45] and I can see the metrics that way [18:09:58] maybe gmetad just isn't trying [18:10:06] I'm not really great with ganglia [18:10:13] I tend to just restart things in this case [18:10:14] yeah, it seems no one is :p [18:10:21] yeah i've restarted gmetad a buncha times [18:10:28] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [18:10:29] and it seems that restarting things out of order just causes things to fail [18:10:43] did you restart gmond on the aggregators? [18:10:54] yes [18:11:20] hm, should I be able to see all the same metrics on both aggregators [18:11:26] I have 2 aggregators because they are each on a different subnet [18:11:41] I think both aggregators should work? [18:11:58] every cluster is supposed to have 2 aggregators [18:12:11] I think they are supposed to have the same data [18:12:28] correct [18:12:32] why would it matter that they are on different subnets? [18:12:44] it shouldn't, if multicast is working correctly [18:12:54] mark: I thought we switched away from multicast? 
[18:13:01] not yet [18:13:03] or is that not changed [18:13:03] ah [18:13:05] ok [18:13:25] but multicast should work correctly anyhow ;) [18:13:46] we'd have a number of issues if it wasn't, I'd imagine :) [18:16:38] mark: https://bugzilla.wikimedia.org/show_bug.cgi?id=54821 [18:18:29] when's that irc replacement ready again? ;) [18:20:21] heh [18:20:26] who was working on it? [18:20:30] RobH ? [18:20:33] ohhh [18:20:51] wait. you mean the thing ori/yuvi/petan were working on? [18:21:03] i think [18:21:22] no clue on that. even once that's in use, people will still need irc for backwards compatibility for a while [18:21:46] 3 days [18:24:55] :D [18:25:07] you'll give people 3 days after the new system is up? :) [18:26:08] only because it'll be over a weekend ;p [18:26:19] sorry, am sorta in a meeting [18:26:23] yeah they are reporting different metrics [18:26:31] when I netcat each aggregator port [18:27:01] note that ulsfo doesn't have proper multicast routing yet [18:27:01] root@nickel:/etc/ganglia# netcat analytics1009.eqiad.wmnet 8649 | grep kafka | wc -l [18:27:02] 0 [18:27:02] root@nickel:/etc/ganglia# netcat analytics1011.eqiad.wmnet 8649 | grep kafka | wc -l [18:27:02] 4659 [18:27:05] this is eqiad [18:27:11] ok [18:30:29] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:30:39] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:30:48] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:31:28] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
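At 18:27 the two aggregators for the same cluster disagreed: 0 versus 4659 `kafka` lines from `netcat <host> 8649`, although a cluster's two aggregators are supposed to hold the same data. A minimal sketch for comparing two captured gmond XML dumps host by host, assuming gmond's usual `<HOST NAME=...>`/`<METRIC NAME=...>` layout. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Set, Tuple

Pair = Tuple[str, str]  # (host name, metric name)


def metric_names(xml_bytes: bytes) -> Set[Pair]:
    """Collect every (host, metric) pair from one gmond XML dump."""
    root = ET.fromstring(xml_bytes)
    return {
        (host.get("NAME"), metric.get("NAME"))
        for host in root.iter("HOST")
        for metric in host.iter("METRIC")
    }


def aggregator_diff(dump_a: bytes, dump_b: bytes) -> Tuple[Set[Pair], Set[Pair]]:
    """Return (only_in_a, only_in_b); both are empty when the aggregators agree."""
    a, b = metric_names(dump_a), metric_names(dump_b)
    return a - b, b - a


def count_matching(xml_bytes: bytes, substring: str) -> int:
    """Count distinct metrics whose name contains substring."""
    return sum(1 for _, name in metric_names(xml_bytes) if substring in name)
```

`count_matching` counts metrics, whereas `grep kafka | wc -l` counts lines. Lines carrying extra metadata that happens to mention the same word also match the grep, so the two numbers may not be equal.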
[18:32:18] RECOVERY - DPKG on mw1125 is OK: All packages OK [18:32:18] RECOVERY - Disk space on mw1125 is OK: DISK OK [18:32:28] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [18:32:48] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed [18:33:58] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 18:33:56 UTC 2013 [18:34:29] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [18:36:33] Ryan_Lane: in manifests/role/ipv6relay.pp miredo is called. since it is the only pp that calls it would it be better just to have that class inside? [18:36:54] If you think this is a good idea i'll push it [18:38:22] no [18:38:34] wh not paravoid ? [18:38:47] it's a matter of abstraction [18:39:00] the role class uses the low-level class to install the package [18:39:03] doesn't care of the details [18:39:17] ipv6relay is our role class for relays, miredo is just one component of it [18:40:23] but if miredo is a manifest just wondering around i can't convert it into a module in a sane way [18:40:44] it = ipv6relay [18:41:03] anyway, i see your point paravoid [18:41:25] convert what? 
[18:41:54] i wanted to convert ipv6relay into a module [18:42:16] ipv6relay is a role class, it's likely that we won't convert role classes into modules until much later [18:42:21] it's too messy now [18:42:29] but you can convert miredo if you'd like :) [18:42:56] I know, I have a branch for roles, and a branch for lonly manifest [18:43:01] the current discussion around role classes has been to perhaps convert them altogether as a role module at some point [18:43:15] s/discussion/idea/ [18:43:29] just noted the dependency, before started to move stuff [18:43:42] this is an important note, thanks for this paravoid [18:44:13] sure [18:44:20] I mean nothing is settled yet [18:44:38] but it's complicated enough that I wouldn't dare to touch it now personally :) [18:44:45] (PS2) Reedy: Mobile works fine on wikivoyage. Remove specialcase on [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86347 [18:44:53] moving misc::miredo into a modules/miredo/init.pp should be fine though [18:44:57] any docs on the matter? it would be quite disappinting to do work and then see a whole diffreent approche was chosen [18:45:10] (CR) Reedy: [C: 2] Mobile works fine on wikivoyage. Remove specialcase on [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86347 (owner: Reedy) [18:45:20] (Merged) jenkins-bot: Mobile works fine on wikivoyage. Remove specialcase on [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/86347 (owner: Reedy) [18:45:48] hm, I think only implicitly by https://wikitech.wikimedia.org/wiki/Puppet_coding#Roles [18:46:02] maybe andrewbogott has something better somewhere?
[18:46:19] there's also https://wikitech.wikimedia.org/wiki/Puppet_Todo that might interest you [18:46:30] (PS2) Reedy: Remove mobileRedirect.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85394 [18:47:13] (CR) Reedy: [C: 2] Remove mobileRedirect.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85394 (owner: Reedy) [18:47:22] (Merged) jenkins-bot: Remove mobileRedirect.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85394 (owner: Reedy) [18:47:59] AaronSchulz: do you have some time to check out https://gerrit.wikimedia.org/r/#/c/86745/? [18:50:53] (CR) GWicke: "ping!" [operations/puppet] - https://gerrit.wikimedia.org/r/86745 (owner: GWicke) [18:52:13] (PS7) Reedy: WIP don't deduce sites based on docroot stuff [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85165 [18:55:17] (PS8) Reedy: WIP don't deduce sites based on docroot stuff [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85165 [18:55:34] (CR) jenkins-bot: [V: -1] WIP don't deduce sites based on docroot stuff [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85165 (owner: Reedy) [18:55:52] * andrewbogott catches up [18:57:08] (PS9) Reedy: WIP don't deduce sites based on docroot stuff [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85165 [18:57:14] (CR) Aude: "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85165 (owner: Reedy) [18:57:56] Hm… matanya, I agree that miredo seems a bit trivial for a module, but nonetheless making it into a module is probably the right thing :) [18:58:39] The pages that paravoid linked to are all of the docs at the moment -- if you have questions about specific design situations I can try to add arbitrary answers to the docs.
[18:58:42] already done [18:58:45] pushing now [18:59:12] (CR) Aude: "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85165 (owner: Reedy) [18:59:55] (CR) Reedy: "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85165 (owner: Reedy) [19:01:54] (CR) Aude: "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85165 (owner: Reedy) [19:04:04] (PS1) Matanya: Meredo: Convert into a module. [operations/puppet] - https://gerrit.wikimedia.org/r/87186 [19:04:54] (CR) jenkins-bot: [V: -1] Meredo: Convert into a module. [operations/puppet] - https://gerrit.wikimedia.org/r/87186 (owner: Matanya) [19:07:37] (PS2) Reedy: Update tests to remove docroot setting [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/85172 [19:08:35] (CR) Aaron Schulz: [C: 1] Bug 54406: Speed up Parsoid job processing [operations/puppet] - https://gerrit.wikimedia.org/r/86745 (owner: GWicke) [19:08:57] LeslieCarr: https://gerrit.wikimedia.org/r/#/c/86745/ [19:09:39] this is what happens when you are the only one here [19:11:48] AaronSchulz: thanks, will watch the ganglia graphs now [19:12:02] he didn't merge it :) [19:12:10] (CR) Faidon Liambotis: [C: 2] Bug 54406: Speed up Parsoid job processing [operations/puppet] - https://gerrit.wikimedia.org/r/86745 (owner: GWicke) [19:12:27] ahh, thanks paravoid [19:12:40] we're pretty bad at noticing changesets with no reviewers btw [19:12:52] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [19:12:55] we need to fix this but in the meantime don't hesitate to add reviewers [19:13:01] feel free to add me [19:13:13] paravoid: k, will do next time [19:17:03] haha [19:18:22] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:19:12] RECOVERY - DPKG on mw1125 is OK: All packages OK [19:20:42] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:20:47] matanya, did you do a test run of this on a labs instance? [19:21:03] no, and i did a stupid mistake [19:21:12] forgot the "" in the import [19:21:18] pushing again [19:21:25] just tesing now [19:21:32] AaronSchulz: will the job runner restart on the next puppet run? [19:21:32] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [19:22:28] (CR) Andrew Bogott: [C: -1] "(2 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/87186 (owner: Matanya) [19:24:17] andrewbogott: why still include? [19:25:40] Maybe I misunderstand what 'import' means in that context. [19:25:55] But you want to actually declare that class there, which is what 'include' does isn't it? [19:26:14] I think with modules you never need to 'import' anything. [19:26:24] http://docs.puppetlabs.com/puppet/2.7/reference/lang_import.html [19:28:20] yes, you are right [19:31:08] gwicke: yes, since the service is subscribed to that file in puppet [19:31:29] (CR) Akosiaris: [C: 2] contint: browsertests needs php5-sqlite [operations/puppet] - https://gerrit.wikimedia.org/r/87056 (owner: Hashar) [19:32:22] AaronSchulz: yes, the load has started to rise [19:35:42] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 19:35:35 UTC 2013 [19:35:52] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [19:37:18] AaronSchulz: https://graphite.wikimedia.org/render/?width=1486&height=641&_salt=1371161654.988&from=-24hours&target=stats.job-insert-ParsoidCacheUpdateJob.count&target=stats.job-pop-ParsoidCacheUpdateJob.count does not show much of a change for some reason [19:43:41] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
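The graph linked at 19:37 is a Graphite `/render` URL that plots `stats.job-insert-ParsoidCacheUpdateJob.count` and `stats.job-pop-ParsoidCacheUpdateJob.count` over `-24hours`. A small sketch for building such URLs, with each series passed as its own repeated `target` parameter. Adding `format=json` asks Graphite's render API for raw datapoints instead of a PNG, which can make a small change in job throughput easier to check than reading it off the image. The function name `render_url` is hypothetical; the base URL and metric names come from the log.

```python
from typing import List, Optional
from urllib.parse import urlencode


def render_url(
    base: str,
    targets: List[str],
    since: str = "-24hours",
    fmt: Optional[str] = None,
) -> str:
    """Build a Graphite /render URL with one 'target=' parameter per series."""
    params = [("from", since)] + [("target", t) for t in targets]
    if fmt is not None:
        # "json" (or "csv") returns datapoints instead of a rendered PNG
        params.append(("format", fmt))
    return base.rstrip("/") + "/render/?" + urlencode(params)
```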
[19:44:42] PROBLEM - twemproxy process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:44:51] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:45:32] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:45:49] andrewbogott: I did something stupid, and git complians i'm trying to push two commits [19:45:56] how can i rebase them? [19:46:14] have you tried git fetch --all [19:46:18] yes [19:46:21] RECOVERY - Disk space on mw1125 is OK: DISK OK [19:46:29] i did two commits, he is right [19:46:31] RECOVERY - DPKG on mw1125 is OK: All packages OK [19:46:32] matanya: What I do in that situation is generally start with a clean branch and then cherry-pick the patch that I want. [19:46:35] then git rebase -i origin/master [19:46:41] RECOVERY - twemproxy process on mw1125 is OK: PROCS OK: 1 process with UID = 65534 (nobody), command name nutcracker [19:46:41] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed [19:46:54] or -i origin/production if we're speaking of puppet [19:46:59] gwicke: did puppet actually run? [19:47:50] lol [19:47:57] Wait for the puppet freshness errors [19:48:21] Matanya: A helpful guide written for just this situation: https://wikitech.wikimedia.org/wiki/Help:Git_rebase#Don.27t_panic [19:51:21] PROBLEM - Puppet freshness on virt1000 is CRITICAL: No successful Puppet run in the last 10 hours [19:53:17] (PS1) Matanya: Miredo: Convert into a module.
[operations/puppet] - https://gerrit.wikimedia.org/r/87234 [19:54:21] oh, it failed :( [19:54:46] Yeah, you're missing miredo.conf in that last patch [19:55:13] looks ok otherwise [19:58:07] thanks, i'll fix it [19:58:12] two spaces instead of tabs [19:58:30] also puppet://module/ is wrong [19:58:39] it's module*s* [19:59:08] also align arrows and run puppet-lint in general [20:00:00] er, four spaces [20:00:10] I'll give a proper review quickly [20:01:10] (CR) Faidon Liambotis: [C: -1] "(5 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/87234 (owner: Matanya) [20:12:36] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [20:13:14] (PS2) Matanya: Miredo: Convert into a module. [operations/puppet] - https://gerrit.wikimedia.org/r/87186 [20:13:23] (CR) Ori.livneh: "(2 comments)" [operations/puppet] - https://gerrit.wikimedia.org/r/87234 (owner: Matanya) [20:15:13] AaronSchulz: pretty sure it did [20:15:26] at least we got a load spike [20:15:32] (PS3) Matanya: Miredo: Convert into a module. [operations/puppet] - https://gerrit.wikimedia.org/r/87186 [20:15:36] (PS3) Ryan Lane: Add git clones for repos on deployment systems [operations/puppet] - https://gerrit.wikimedia.org/r/86762 [20:15:44] ori-l: ^^ thoughts on that change? [20:16:33] (CR) Andrew Bogott: "A few inline comments. Also, as Faidon mentioned, please use 4-space indents in new files." [operations/puppet] - https://gerrit.wikimedia.org/r/87186 (owner: Matanya) [20:16:40] (CR) Ori.livneh: "Oh, and the service should set subscribe => File['/etc/miredo.conf']" [operations/puppet] - https://gerrit.wikimedia.org/r/87234 (owner: Matanya) [20:16:57] matanya: pile on!
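One review point at 19:58 was that `puppet://module/` is wrong and should be `modules`. With Puppet's module file serving, a source such as `puppet:///modules/<module>/<file>` (three slashes, i.e. no server name, so the default one is used) is served from that module's `files/` directory. A minimal sketch of a check for that form. The `miredo.conf` source path in the test is hypothetical, and URLs that name a server explicitly are deliberately not handled.

```python
import re

# puppet:///modules/<module>/<path>: the empty authority means "default server".
MODULE_SOURCE = re.compile(
    r"^puppet:///modules/(?P<module>[a-z][a-z0-9_]*)/(?P<path>.+)$"
)


def module_file_path(source: str) -> str:
    """Map a module file source URL to the repo file it is served from."""
    match = MODULE_SOURCE.match(source)
    if match is None:
        raise ValueError(f"not a puppet:///modules/ source: {source!r}")
    return f"modules/{match.group('module')}/files/{match.group('path')}"
```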
[20:17:05] j/k, it's nice to see patches [20:17:10] i am :) [20:17:15] mutante: shred please: https://rt.wikimedia.org/Ticket/Display.html?id=452#txn-133522 [20:17:22] I love when my patch get reviews [20:17:38] better then left on reviewd! [20:17:54] ori-l: my patch was something you asked for, so, yeah... ;) [20:17:58] Ryan_Lane: running late for a doc appt, back in an hour [20:18:02] ok [20:18:03] but will look then! [20:18:06] no rush [20:22:07] (PS4) Matanya: Miredo: Convert into a module. [operations/puppet] - https://gerrit.wikimedia.org/r/87186 [20:23:28] paravoid: ^ ? :) [20:24:24] (CR) Andrew Bogott: "Apart from the tabs vs. spaces thing in init.pp this looks good to me." [operations/puppet] - https://gerrit.wikimedia.org/r/87186 (owner: Matanya) [20:24:29] wait [20:24:40] why are these separate patchsets? [20:24:45] I'm confused [20:24:56] I commented on https://gerrit.wikimedia.org/r/#/c/87234/ but this is https://gerrit.wikimedia.org/r/#/c/87186/ ? [20:25:02] wrong commit [20:25:47] (Abandoned) Matanya: Miredo: Convert into a module. [operations/puppet] - https://gerrit.wikimedia.org/r/87234 (owner: Matanya) [20:30:00] (PS5) Matanya: Miredo: Convert into a module. [operations/puppet] - https://gerrit.wikimedia.org/r/87186 [20:30:23] tabs... [20:30:57] why don't i see them? [20:31:15] hm. salt now has integrated support for logging to logstash... [20:31:20] haha [20:31:30] *hint* *hint* [20:31:57] paravoid: i feel dumb now :) where are those tabs? [20:32:00] I'm wondering if that's a good way to track what the deployment system is doing [20:32:16] matanya: https://gerrit.wikimedia.org/r/#/c/87186/5/modules/miredo/manifests/init.pp [20:32:30] (PS1) Chad: Cleanup my SSH keys [operations/puppet] - https://gerrit.wikimedia.org/r/87256 [20:32:36] I could tag the logs for deployment so that it would be possible to debug the system as it runs [20:32:51] the red arrows, paravoid ?
[20:32:59] yes [20:33:05] (CR) Faidon Liambotis: [C: 2] Cleanup my SSH keys [operations/puppet] - https://gerrit.wikimedia.org/r/87256 (owner: Chad) [20:33:40] <^demon> Oh thanks paravoid, that was fast :) [20:33:46] heh [20:34:26] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 20:34:20 UTC 2013 [20:34:36] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours [20:35:10] sed -i 's/\t/ /g' [20:35:17] (PS6) Matanya: Miredo: Convert into a module. [operations/puppet] - https://gerrit.wikimedia.org/r/87186 [20:35:54] the closing braces are indented wrong now :) [20:36:11] it's so trivial I feel bad telling you to fix it [20:36:16] I can fix it myself if you want :) [20:36:34] my god. i should never push after 23:00 [20:36:47] i'll fix this [20:36:50] :) [20:37:36] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:37:37] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:37:49] paravoid: and yes, I'm hinting that I want logstash :) [20:37:52] only the first one right paravoid ? [20:38:04] logstash is awsome [20:38:42] Ryan_Lane: do you have any logstash running on the wikimedia servers? [20:38:52] ope [20:38:54] *nope [20:40:05] (PS7) Matanya: Miredo: Convert into a module. [operations/puppet] - https://gerrit.wikimedia.org/r/87186 [20:40:18] let's hope this is the last one [20:40:44] paravoid: if i missed something, it is yours now :P [20:41:37] RECOVERY - DPKG on mw1125 is OK: All packages OK [20:42:26] RECOVERY - Disk space on mw1125 is OK: DISK OK [20:43:36] PROBLEM - RAID on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:45:36] PROBLEM - RAID on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:36] PROBLEM - RAID on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
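The tabs-versus-spaces back-and-forth (19:58 to 20:40) comes down to the four-space indentation requested for new manifests. As captured in the log, the `sed -i 's/\t/ /g'` one-liner replaces every tab anywhere on a line with a single space, which flattens nesting. A minimal sketch that instead replaces only leading tabs, four spaces per tab, so nested lines and closing braces keep their relative depth. Lines whose indentation mixes spaces and tabs are not handled, and `puppet-lint`, which was also suggested, remains the real check. The name `retab` is hypothetical.

```python
import re

# One or more tabs at the start of any line (MULTILINE makes ^ match per line).
LEADING_TABS = re.compile(r"^\t+", re.MULTILINE)


def retab(text: str, width: int = 4) -> str:
    """Replace each leading tab with `width` spaces; tabs elsewhere are kept."""
    return LEADING_TABS.sub(lambda m: " " * (width * len(m.group(0))), text)
```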
[20:46:40] Ryan_Lane: do you want me to convince the he.wiki to go https-by-default? [20:46:56] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: No successful Puppet run in the last 10 hours [20:46:56] PROBLEM - Puppet freshness on ms-be1012 is CRITICAL: No successful Puppet run in the last 10 hours [20:48:26] matanya: I'd like as many projects as possible to opt into the beta program [20:49:03] I'll post on the vp [20:50:46] PROBLEM - MySQL Idle Transactions on db1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:57] Ryan_Lane, keystone services are enumerated in… /etc/keystone/keystone.conf? [20:51:13] I guess I see a compute service described in default_catalog.templates, and commented out in keystone.conf [20:51:18] Maybe this all lives in a different place? [20:51:39] matanya: thanks [20:51:56] andrewbogott: keystone itself is configured there, yes [20:52:00] andrewbogott: why do you ask? [20:52:28] andrewbogott: please review my last push if you have time [20:52:29] Want to add yuviproxy endpoint to keystone [20:52:41] RECOVERY - MySQL Idle Transactions on db1021 is OK: OK longest blocking idle transaction sleeps for 0 seconds [20:53:31] PROBLEM - RAID on ms-be3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:32] PROBLEM - RAID on ms-be8 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:53:51] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: No successful Puppet run in the last 10 hours
[20:53:51] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: No successful Puppet run in the last 10 hours
[20:54:41] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 10 copy to table, 629 statistics
[20:54:51] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:55:11] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:55:51] PROBLEM - RAID on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:56:41] RECOVERY - MySQL Processlist on db1021 is OK: OK 1 unauthenticated, 0 locked, 6 copy to table, 19 statistics
[20:56:51] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:57:01] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.095 second response time
[20:57:11] PROBLEM - DPKG on analytics1011 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:57:41] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:58:32] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:58:51] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: No successful Puppet run in the last 10 hours
[20:58:51] PROBLEM - Puppet freshness on ms-be1004 is CRITICAL: No successful Puppet run in the last 10 hours
[20:59:32] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:59:41] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.123 second response time
[20:59:42] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:59:42] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 7.732 second response time
[20:59:51] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: No successful Puppet run in the last 10 hours
[21:00:21] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.050 second response time
[21:00:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.041 second response time
[21:01:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.520 second response time
[21:01:23] on it
[21:01:28] swift outage
[21:01:41] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.514 second response time
[21:02:12] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&g=cpu_report&h=ms-be3.pmtpa.wmnet&c=Swift+pmtpa
[21:02:16] fun
[21:02:51] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: No successful Puppet run in the last 10 hours
[21:02:53] an unusual number of DELETEs going on
[21:04:20] (PS8) Andrew Bogott: Miredo: Convert into a module. [operations/puppet] - https://gerrit.wikimedia.org/r/87186 (owner: Matanya)
[21:04:51] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: No successful Puppet run in the last 10 hours
[21:05:42] thanks andrewbogott
[21:05:51] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: No successful Puppet run in the last 10 hours
[21:06:22] (CR) Andrew Bogott: [C: 1] "Just a few whitespace changes; otherwise this looks good to me. Will let Faidon have the last word." [operations/puppet] - https://gerrit.wikimedia.org/r/87186 (owner: Matanya)
[21:06:48] someone is deleting "United States Reports" en masse
[21:08:51] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: No successful Puppet run in the last 10 hours
[21:11:41] someone is purging djvus en masse
[21:11:48] The_ransom_of_Red_Chief_and_other_O._Henry_stories_for_boys.djvu
[21:12:04] Emily_Dickinson_Poems_-_third_series_%25281896%2529.djvu
[21:13:09] what is djvu?
[21:13:13] file format
[21:13:47] I figured, but… does it connote something else? Any reason it would be reviled?
[21:13:56] too many DELETEs per second
[21:14:00] kills swift
[21:17:13] 350 apparently
[21:17:17] 350?
[21:17:59] AaronSchulz: about?
[21:18:28] I think they had/have 0px thumbnails
[21:18:53] I just to block whoever does that
[21:19:19] DELETEs don't have anything identifying other than the appserver in them
[21:20:31] * AaronSchulz meows
[21:20:33] !log updated parsoid dependencies
[21:20:42] * AaronSchulz was buried in strace debugging some POST requests to ceph
[21:20:45] Logged the message, Master
[21:20:57] hey, any clever ideas on how to identify them?
[21:21:22] add a few log lines when filebackend is about to issue a delete maybe...
[21:21:27] the offender is known
[21:21:30] is it?
[21:21:39] Indeed
[21:21:45] not to me
[21:21:48] has it stopped?
[21:21:51] That's how I knew it was 350
[21:21:54] 350 what?
[21:22:02] it has not
[21:22:08] djvus to be purged
[21:22:20] anyone want to fill me in here?
[21:22:34] this is producing an outage
[21:22:41] User:Dispenser on commons
[21:22:48] MaxSem is currently giving them a kick
[21:22:55] where?
[21:23:14] where is this happening?
[21:23:38] Private channel
[21:24:32] https://commons.wikimedia.org/wiki/User_talk:Dispenser#TIFF_check.3F ?
[21:24:53] Looks like it
[21:27:46] aww /me misses the drama
[21:28:11] !log updated Parsoid to 01ebbfd
[21:28:24] Logged the message, Master
[21:29:55] that sort of stuff is not supposed to be purging files though, hmpf
[21:39:47] AaronSchulz: remind me, is the job queue FIFO or LIFO?
[21:40:24] fifo
[21:40:33] k, thanks
[21:40:35] lifo would be fun ;)
[21:41:28] yeah, would not make sense- was just wondering about cache hits in Parsoid job processing
[21:57:06] (PS9) Faidon Liambotis: Miredo: Convert into a module. [operations/puppet] - https://gerrit.wikimedia.org/r/87186 (owner: Matanya)
[21:57:15] (CR) Faidon Liambotis: [C: 2] miredo: convert into a module [operations/puppet] - https://gerrit.wikimedia.org/r/87186 (owner: Matanya)
[22:01:09] paravoid: I swear I can't get header update POSTs to work in ceph
[22:01:24] at first I thought maybe it was CF, but no luck in curl either
[22:01:37] * AaronSchulz recalls this working in argonaut
[22:02:23] at least ceph health is unbroken in .69
[22:07:52] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[22:18:24] (PS1) Odder: (bug 54229) Add autopatrolled user group on ukwikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/87275
[22:34:32] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 22:34:22 UTC 2013
[22:34:52] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[22:43:04] ori-l: did you get a chance to look at https://gerrit.wikimedia.org/r/#/c/86762/ ?
[22:43:14] there's one issue I see with it
[22:43:36] mode is passed to git::clone, but isn't actually used in git::clone
[22:43:42] but that's an issue in git::clone
[22:44:14] !log csteipp synchronized php-1.22wmf19/includes/specials/SpecialUserlogin.php 'AbortLogin message fix'
[22:44:29] Logged the message, Master
[22:56:20] !log csteipp synchronized php-1.22wmf18/includes/specials/SpecialUserlogin.php
[22:56:32] Logged the message, Master
[23:00:36] Ryan_Lane: oops, I missed your ping
[23:05:20] if you need to merge it go for it, I will review either way, can just submit an additional patch if necessary
[23:05:20] am still a bit tied up
[23:05:22] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 23:03:43 UTC 2013
[23:05:22] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours
[23:15:50] !log demon synchronized php-1.22wmf18/extensions/CirrusSearch 'Cirrus to master'
[23:16:02] Logged the message, Master
[23:16:17] !log demon synchronized php-1.22wmf19/extensions/CirrusSearch 'Cirrus to master'
[23:16:32] Logged the message, Master
[23:17:31] ori-l: no need to merge. It doesn't really do anything. It's just a set of clones for the repos on tin
[23:17:42] it can wait for a proper review
[23:19:55] thanks
[23:33:50] RECOVERY - Puppet freshness on labstore4 is OK: puppet ran at Wed Oct 2 23:33:49 UTC 2013
[23:34:50] PROBLEM - Puppet freshness on labstore4 is CRITICAL: No successful Puppet run in the last 10 hours