[00:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150126T0000). Please do the needful. [00:00:58] jouncebot: You seem to be confused. [00:01:40] jouncebot: refresh [00:01:44] I refreshed my knowledge about deployments. [00:01:49] jouncebot: next [00:01:49] In 15 hour(s) and 58 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150126T1600) [00:11:02] (PS3) Ori.livneh: Import chromium module from mediawiki-vagrant [puppet] - https://gerrit.wikimedia.org/r/186614 [00:11:04] (PS2) Ori.livneh: Add role::ve; apply on osmium [puppet] - https://gerrit.wikimedia.org/r/186620 [00:14:40] (PS3) Ori.livneh: Add role::ve; apply on osmium [puppet] - https://gerrit.wikimedia.org/r/186620 [00:18:14] (PS4) Ori.livneh: Add role::ve; apply on osmium [puppet] - https://gerrit.wikimedia.org/r/186620 [00:18:47] (CR) Ori.livneh: [C: 2] Import chromium module from mediawiki-vagrant [puppet] - https://gerrit.wikimedia.org/r/186614 (owner: Ori.livneh) [00:19:05] (CR) Ori.livneh: [C: 2] Add role::ve; apply on osmium [puppet] - https://gerrit.wikimedia.org/r/186620 (owner: Ori.livneh) [00:21:59] PROBLEM - Ori committing changes on the weekend on palladium is CRITICAL: CRITICAL: Ori committed a change on a weekend [00:25:08] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: puppet fail [00:28:49] PROBLEM - DPKG on osmium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:32:19] RECOVERY - DPKG on osmium is OK: All packages OK [00:33:34] Wikimedia-Mailing-lists, operations: mailman's public list index (listinfo) has the wrong encoding in its Content-Type header - https://phabricator.wikimedia.org/T42971#993984 (saper) This works only for English pages.... For Polish we still have "iso-8859-2" radziecki$ curl -Is "https://lists.wikimedia.org... 
[00:43:49] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [00:51:44] operations: Problems applying role::mediawiki to a fresh Trusty install - https://phabricator.wikimedia.org/T87550#993993 (ori) NEW a:ori [00:56:06] (CR) Ori.livneh: [C: -1] "The bug in HHVM appears to be that it fails to update shutdown handlers registered via register_shutdown_function(). I have noticed it a f" [puppet] - https://gerrit.wikimedia.org/r/186579 (owner: Giuseppe Lavagetto) [01:14:38] (PS1) Ori.livneh: Manage the home of the chromium user [puppet] - https://gerrit.wikimedia.org/r/186738 [01:15:32] (CR) Ori.livneh: [C: 2 V: 2] Manage the home of the chromium user [puppet] - https://gerrit.wikimedia.org/r/186738 (owner: Ori.livneh) [02:13:19] RECOVERY - Ori committing changes on the weekend on palladium is OK: OK: Ori is behaving himself [02:14:45] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 02s) [02:14:49] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-26 02:14:49+00:00 [02:14:59] Logged the message, Master [02:15:06] Logged the message, Master [02:23:04] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 03s) [02:23:10] !log LocalisationUpdate completed (1.25wmf15) at 2015-01-26 02:23:10+00:00 [02:23:12] Logged the message, Master [02:23:18] Logged the message, Master [03:56:38] Wikimedia-General-or-Unknown, operations: Cleanup and delete vewikimedia - https://phabricator.wikimedia.org/T57737#994069 (Glaisher) [04:06:31] (PS1) Andrew Bogott: Specify the DNS server for dnsmasq. 
[puppet] - https://gerrit.wikimedia.org/r/186741 [04:08:02] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jan 26 04:08:01 UTC 2015 (duration 8m 0s) [04:08:10] Logged the message, Master [04:08:49] 21:13 < icinga-wm> RECOVERY - Ori committing changes on the weekend on palladium is OK: OK: Ori is behaving himself [04:08:52] :) :) [04:14:20] andrewbogott_afk: yay? [04:20:01] (PS2) Andrew Bogott: Specify the DNS server for dnsmasq. [puppet] - https://gerrit.wikimedia.org/r/186741 [04:30:04] (PS1) Glaisher: Change enwikinews $wgNamespacesToBeSearchedDefault [mediawiki-config] - https://gerrit.wikimedia.org/r/186742 (https://phabricator.wikimedia.org/T87522) [04:35:35] (PS1) Glaisher: Set $wmgAbuseFilterEmergencyDisableCount to 25 at commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/186743 (https://phabricator.wikimedia.org/T87431) [04:54:08] (PS2) Glaisher: Standardize the name of interface editor group [mediawiki-config] - https://gerrit.wikimedia.org/r/186593 (https://phabricator.wikimedia.org/T85731) [04:55:01] (PS3) Glaisher: Standardize the name of interface editor group [mediawiki-config] - https://gerrit.wikimedia.org/r/186593 (https://phabricator.wikimedia.org/T85731) [06:04:59] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:06:18] PROBLEM - Memcached on virt1000 is CRITICAL: Connection refused [06:07:09] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.107 second response time [06:14:36] paravoid: ^ [06:15:07] I have no laptop [06:15:21] And am too lazy to go up [06:15:30] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 5 failures [06:18:58] !log restarting virt1000 services [06:19:08] RECOVERY - Memcached on virt1000 is OK: TCP OK - 0.000 second response time on port 11000 [06:20:14] paravoid: I'm on virt1000 too -- what are you restarting? 
[06:20:33] pretty much everything :P [06:20:36] andrewbogott: did shinken email you? [06:20:44] YuviPanda: yes [06:20:47] paravoid: was it oom? [06:20:49] I restarted mysql, memcached, nova-conductor, salt-master, apache2 [06:20:51] yes, it oomed [06:20:57] Where referee fees we :) [06:21:14] Uh, I was typing whee with a lot of e [06:21:29] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [06:21:40] paravoid: Any idea why it's started doing this so much? I don't feel like we've actually increased its load recently [06:21:49] grep oom-killer /var/log/syslog [06:21:53] Cron time apparently [06:21:57] note that it's cron.daily time [06:22:29] although the oom-killer seems to have been triggered earlier [06:23:36] hm… db-bak.sh [06:23:55] and mw-xml.sh [06:23:59] so it's doing dumps, that could be taxing [06:25:00] dumps usually are [06:26:53] paravoid: do you think this is a clear case for moving wikitech off of virt1000, or do you have other thoughts about how to settle things? [06:28:31] I haven't really thought about it much tbh [06:28:49] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:49] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:49] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:08] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:08] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:08] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:13] paravoid: fair enough :) [06:32:28] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:38] andrewbogott: are you going to merge the other dns related patch? [06:36:42] What does that do? [06:37:02] You mean https://gerrit.wikimedia.org/r/#/c/186741/? 
[06:37:23] In theory it should do nothing at all. [06:37:37] But I'm not sure enough of that to try it right before bed. [06:39:37] YuviPanda: I'm actually not sure if that patch is right or necessary. /probably/ dnsmasq uses resolv.conf if nothing is specified. [06:39:59] Which would use our recursor0 and recursor1, which is probably correct anyway. [06:41:04] (CR) Andrew Bogott: "I've no idea if this is right. It might slightly improve performance resolving things like foo.wmflabs.org." [puppet] - https://gerrit.wikimedia.org/r/186741 (owner: Andrew Bogott) [06:45:19] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:45:19] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:45:38] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:45:38] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:46:28] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:46:48] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:04:15] (PS13) Anomie: Configure Logstash and Elasticsearch for ApiFeatureUsage [puppet] - https://gerrit.wikimedia.org/r/173336 [07:11:37] (CR) BryanDavis: [C: 1] "Since our current production logstash bottleneck is the logstash elasticsearch cluster and this patch doesn't add any traffic to that syst" [puppet] - https://gerrit.wikimedia.org/r/173336 (owner: Anomie) [07:25:15] (PS14) Anomie: Configure Logstash and Elasticsearch for ApiFeatureUsage [puppet] - https://gerrit.wikimedia.org/r/173336 
[07:37:18] (CR) BryanDavis: [C: 1] Configure Logstash and Elasticsearch for ApiFeatureUsage [puppet] - https://gerrit.wikimedia.org/r/173336 (owner: Anomie) [09:50:08] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 668 [09:55:08] RECOVERY - check_mysql on db1008 is OK: Uptime: 843896 Threads: 2 Questions: 1510481 Slow queries: 5606 Opens: 12981 Flush tables: 2 Open tables: 64 Queries per second avg: 1.789 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [12:35:19] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [12:40:39] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [13:13:04] (PS1) Calak: Create "eliminator" user group on fa.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/186763 (https://phabricator.wikimedia.org/T87348) [13:15:48] (PS2) Calak: Create "eliminator" user group on fa.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/186763 (https://phabricator.wikimedia.org/T87558) [13:17:55] (PS2) JanZerebecki: Enable wikibase change dispatcher and pruning for test.wikidata [puppet] - https://gerrit.wikimedia.org/r/184360 (owner: Aude) [13:19:00] (PS3) JanZerebecki: Enable wikibase change dispatcher and pruning for test.wikidata [puppet] - https://gerrit.wikimedia.org/r/184360 (https://phabricator.wikimedia.org/T87026) (owner: Aude) [13:23:43] (CR) JanZerebecki: [C: 1] Enable wikibase change dispatcher and pruning for test.wikidata [puppet] - https://gerrit.wikimedia.org/r/184360 (https://phabricator.wikimedia.org/T87026) (owner: Aude) [14:22:26] (PS6) QChris: Mail webrequest partition status summaries to analytics ops [puppet] - https://gerrit.wikimedia.org/r/185442 [14:22:28] (PS2) QChris: For stats user's cron jobs on stat1002, make empty MAILTO explicit [puppet] - 
https://gerrit.wikimedia.org/r/186253 [14:22:30] (PS2) QChris: Grant stats user access to private analytics data [puppet] - https://gerrit.wikimedia.org/r/186254 [14:23:07] (CR) jenkins-bot: [V: -1] Mail webrequest partition status summaries to analytics ops [puppet] - https://gerrit.wikimedia.org/r/185442 (owner: QChris) [14:23:55] (CR) jenkins-bot: [V: -1] Grant stats user access to private analytics data [puppet] - https://gerrit.wikimedia.org/r/186254 (owner: QChris) [14:24:14] (CR) QChris: "The V-1 stems from the parent commit. Once the 'Adding the stats user to the group' there is resolved, and this patch set gets rebased, th" [puppet] - https://gerrit.wikimedia.org/r/185442 (owner: QChris) [14:25:58] (CR) QChris: Grant stats user access to private analytics data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/186254 (owner: QChris) [14:27:37] (CR) QChris: Grant stats user access to private analytics data (1 comment) [puppet] - https://gerrit.wikimedia.org/r/186254 (owner: QChris) [14:44:06] Wikidata, Analytics, wikidata-query-service, operations, Services, MediaWiki-General-or-Unknown: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#994455 (JanZerebecki) >>! In T84923#993443, @GWicke wrote: > Since 0mq is not actually durable or replicated this does not cover th... [14:59:55] (CR) Ebrahim: [C: 1] Create "eliminator" user group on fa.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/186763 (https://phabricator.wikimedia.org/T87558) (owner: Calak) [15:29:40] (CR) Steinsplitter: [C: 1] Set $wmgAbuseFilterEmergencyDisableCount to 25 at commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/186743 (https://phabricator.wikimedia.org/T87431) (owner: Glaisher) [16:00:04] manybubbles, anomie, ^d, marktraceur, gi11es: Dear anthropoid, the time has come. 
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150126T1600). [16:00:12] I'm here [16:00:41] Are we actually doing this? [16:00:46] we should [16:01:05] Huh, so says the calendar. [16:01:09] <^d> seems silly [16:01:13] * ^d is walking out the door [16:01:14] gi11es: Will you be stable for long enough to do it? [16:01:16] we might be late for the 'thing, but this is more important [16:01:18] I know we're about to leave [16:01:27] I can wait another 30 minutes [16:01:29] Oh, gwt stuff, yeah [16:01:32] I support you [16:01:36] In spirit [16:06:20] reading back the ticket history, it'll be someone with GWT admin rights at some point today testing that the fix worked anyway [16:06:28] meanwhile they've been told not to use it [16:06:38] so not a risky SWAT [16:07:03] marktraceur: are you doing the SWAT? [16:11:58] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:26] *crickets* [16:21:09] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59725 bytes in 0.327 second response time [16:49:06] !log xtrabackup clone db2016 to db2034 [17:09:50] gi11es: Oh, no, I had left for the summit [17:10:11] And got here 10 minutes late [17:11:06] greg-g, around? [17:11:06] Krenair: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around. [17:11:08] am wondering when we should do that deployment [17:11:24] which is it again? [17:11:30] * greg-g 's brain hasn't booted up fully yet [17:12:02] gwtoolset, yeah [17:14:19] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail [17:18:04] tomorrow afternoon? 
[17:18:13] that seems to be the next scheduled slot on the deployments page [17:18:42] !log amssq42 enabled as non-https text frontend in esams [17:18:57] (PS1) BBlack: enable amssq42 (jessie test) as text backend [puppet] - https://gerrit.wikimedia.org/r/186792 [17:19:18] <_joe_> bblack: whohooo [17:19:20] (CR) BBlack: [C: 2 V: 2] enable amssq42 (jessie test) as text backend [puppet] - https://gerrit.wikimedia.org/r/186792 (owner: BBlack) [17:19:35] <_joe_> our baby jessie is live! [17:19:55] :) [17:20:13] I'm turning it on for non-https only first to grab some stats to compare when SSL is flipped on after [17:21:03] _joe_: thanks for all baby work :) [17:26:44] <_joe_> kart_: praise bblack [17:28:26] <_joe_> now I have to code ruby that works on 1.8, 1.9 and 2.x [17:28:39] <_joe_> a dental extraction would be less painful probably [17:32:19] gi11es, next is this afternoon, 4PM local time [17:32:34] it's under tuesday because that's really midnight to 1AM [17:33:09] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:34:39] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: Puppet has 1 failures [17:48:34] Krenair: gotcha. ^d, RoanKattouw, either of you willing to do that afternoon swat today? 
[17:48:48] RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:49:01] I won't be available, probably [17:49:56] gi11es: Lemme see what I'm doing at 4pm [17:50:20] gi11es: There's a session on front-end standardization at 4:15pm so probably not [17:50:31] alright, thanks for checking [18:01:12] <_joe_> gi11es: whenever you need to do swat, please ping me first [18:01:29] <_joe_> I have a hhvm puppet patch I'd like to ship first [18:02:25] Beta-Cluster, Release-Engineering, operations: Intermittent DNS failures in beta labs regularly trigger a bunch of puppet failures - https://phabricator.wikimedia.org/T87480#994642 (Reedy) [18:02:47] <_joe_> oh nevermind, just saw ori's comment [18:05:41] <^d> Reedy: Why don't we have a wikimania2016wiki yet? :p [18:05:58] ^d: We have dns... Ops haven't merged my patch for the apache config ;) [18:06:02] <^d> Ah [18:06:13] https://gerrit.wikimedia.org/r/#/c/181892/ [18:07:38] _joe_: Any chance you could deal with ^? :) [18:08:36] <_joe_> Reedy: yeah sorry man [18:08:52] heh [18:08:54] <_joe_> gonna merge it after damon is done [18:09:01] sweet, thanks [18:09:03] <_joe_> I am actually interested [18:09:07] yea i wouldnt take down the site when our bosses boss is talking ;D [18:09:08] <_joe_> which is amaizing [18:09:22] robh: where's your sense of adventure!? 
[18:09:42] i lost it in my second site outage [18:09:50] <_joe_> robh: I don't give half of fuck about that :) [18:09:50] never found it after recovery [18:09:55] <_joe_> I'm just interested :) [18:10:14] _joe_: yea but if it did result in a site break, then eventually the question would be 'why weren't you listening to the talk' ;] [18:10:15] <_joe_> for the first time I am hearing interesting things these days [18:10:35] <_joe_> oh well I assume everybody assumes I am in screensaver mode at these events [18:10:43] Beta-Cluster, Release-Engineering, operations: Intermittent DNS failures in beta labs regularly trigger a bunch of puppet failures - https://phabricator.wikimedia.org/T87480#994664 (yuvipanda) [18:11:03] (PS3) Giuseppe Lavagetto: Add wikimania2016.wikimedia.org to ServerAlias [puppet] - https://gerrit.wikimedia.org/r/181892 (https://phabricator.wikimedia.org/T85374) (owner: Reedy) [18:13:52] <_joe_> cscott: I am intersted in working on the imagescaling spin-off, we should pair up on that :) [18:14:06] _joe_: cscott me too :) [18:14:22] <_joe_> YuviPanda: are you here? didn't see you [18:14:33] _joe_: i am! [18:14:48] _joe_: am on the other side of the room. [18:15:03] _joe_: the sparse side [18:15:19] WILD YUVIPANDA SPOTTED [18:15:24] <_joe_> lol [18:15:48] i thought we were going ot have the subdermal tracker implanted in yuvi by now? [18:16:06] you guys were supposed to drug him and get it done the past weekend. [18:16:09] http://yuvi.in/where.html [18:16:22] <_joe_> we drugged ourselves too, and we forgot [18:16:23] stopped updating when it started changing too fast [18:16:32] _joe_: cool. i'm still a little uncertain how this schedule is going to work [18:16:56] <_joe_> cscott: me too, I just hope we can pair a little to try to figure out how to do it right [18:17:18] _joe_: sure, cool. 
[18:17:29] YuviPanda: You like Bangalore :P [18:17:51] JohnLewis: Anyone might think he's Indian or something [18:18:02] Reedy: Indeed [18:18:16] (CR) Giuseppe Lavagetto: [C: 2] Add wikimania2016.wikimedia.org to ServerAlias [puppet] - https://gerrit.wikimedia.org/r/181892 (https://phabricator.wikimedia.org/T85374) (owner: Reedy) [18:30:17] (CR) Ottomata: "Yeha, no this is going to have to be done a little differently, since the stats user is a system user, which are not currently managed via" [puppet] - https://gerrit.wikimedia.org/r/186254 (owner: QChris) [18:33:58] JohnLewis: Reedy I stopped updating it [18:34:09] YuviPanda: why :'( [18:34:20] Bored [18:34:25] I wanted an easy way to stalkfollow you. [18:34:35] Hehe [18:37:36] <_joe_> Reedy: the wikimania2016 is still 404 [18:37:43] <_joe_> do we need to config something else? [18:37:46] _joe_: Yeah, I've not done the mediawiki side of stuff [18:37:51] I'm just building the config now [18:37:52] <_joe_> ok [18:37:55] <_joe_> how conveninet [18:37:55] thanks :) [18:38:04] <_joe_> a new wiki is 3 separate changes [18:38:12] <_joe_> "automation, lean operations" [18:38:30] <^d> It used to be worse :p [18:38:31] a new wiki used to be so much worse! [18:38:33] haha [18:38:56] when i started all the other opsen (read, all 3 of them) just made me the new wiki creator =P it sucked [18:39:00] <_joe_> ^d: it always is better than long ago [18:39:18] robh: you get stuck with all the interesting jobs :p [18:39:27] yea, wiki creation and domain mgmt [18:39:30] sorry, "interesting" :p [18:39:33] i get all the flash. [18:39:50] and SSL certs, can't forget that robh [18:39:55] did i mention how much better it is letting devs push mediawiki changes? its pretty awesome [18:40:09] JohnLewis: but i drink every day specifically to forget ssl. 
[18:40:30] <^d> robh: It was more fun when we had brion just pull master every 2-3 days :p [18:44:35] (CR) BryanDavis: [C: 1] Use / for regex delimiting [mediawiki-config] - https://gerrit.wikimedia.org/r/186616 (owner: Reedy) [18:46:55] (PS1) Reedy: Add wikimania2016wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/186804 [18:47:47] (PS2) Reedy: Add wikimania2016wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/186804 [18:48:12] JohnLewis: ^ CR? :P [18:48:23] chasemp: i want to ask you a q about puppet and admin module and system users! :) [18:48:33] Reedy: k [18:49:33] Otto :) sure [18:49:45] wonder if I should ask you in person after this>>...>.:) [18:49:49] yes I think I will! [18:50:10] (CR) John F. Lewis: [C: 1] "Sane for Reedy at least." [mediawiki-config] - https://gerrit.wikimedia.org/r/186804 (owner: Reedy) [18:50:25] heh [18:50:29] Something will no doubt be wrong [18:50:51] (CR) KartikMistry: "Also, depends on: https://phabricator.wikimedia.org/T87587" [puppet] - https://gerrit.wikimedia.org/r/186538 (owner: KartikMistry) [18:51:18] well that's the luck of you Reedy, but it looks good to me [18:51:26] Sounds good otto [18:52:40] Reedy: found something :p [18:52:43] heh [18:52:53] (CR) John F. 
Lewis: Add wikimania2016wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/186804 (owner: Reedy) [18:53:11] Just a missing config line, but the patch is still good [18:53:42] <_joe_> "sane for reedy" sounds great [18:54:13] (PS3) Reedy: Add wikimania2016wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/186804 [18:54:34] !log Created wikimania2016wiki, not web accessible yet [18:54:45] (CR) Reedy: [C: 2] Add wikimania2016wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/186804 (owner: Reedy) [18:55:30] mmmm, database tables [18:55:47] * ^d wants food tables [18:55:56] (Merged) jenkins-bot: Add wikimania2016wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/186804 (owner: Reedy) [18:56:51] !log sync-docroot on tin for staging and wikimania2016wiki setup [18:58:16] (PS2) Reedy: Use / for regex delimiting [mediawiki-config] - https://gerrit.wikimedia.org/r/186616 [18:58:25] (CR) Reedy: [C: 2] Use / for regex delimiting [mediawiki-config] - https://gerrit.wikimedia.org/r/186616 (owner: Reedy) [18:58:29] ^d: Pull master? You mean trunk? 
[18:58:31] (Merged) jenkins-bot: Use / for regex delimiting [mediawiki-config] - https://gerrit.wikimedia.org/r/186616 (owner: Reedy) [18:58:38] <^d> Fiona: Indeed, yes [18:59:48] !log reedy Synchronized w/404.php: Fix log noise (duration: 00m 07s) [19:10:11] operations: Job queue stats are broken - https://phabricator.wikimedia.org/T87594#994786 (faidon) NEW [19:13:38] !log reedy Synchronized database lists: wikimania2016wiki (duration: 00m 06s) [19:14:12] !log reedy Synchronized wmf-config/: wikimania2016wiki (duration: 00m 07s) [19:14:27] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikimania2016wiki [19:14:27] !log restarted keystone on virt1000, ^d couldn’t log in [19:15:08] Woo, userid 1 [19:15:17] Reedy: me :( [19:15:25] what ID am I then :p [19:15:48] !log reedy Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 06s) [19:15:53] https://wikimania2016.wikimedia.org/w/api.php?action=query&meta=userinfo [19:15:54] Logged the message, Master [19:16:09] oh 3. Curse you reedy [19:16:18] Ops-Access-Requests: Create shell access for Zeljko - RelEng rights - https://phabricator.wikimedia.org/T87597#994821 (RobH) NEW a:zeljkofilipin [19:16:18] <^d> 6 \o/ [19:16:28] "Alan" got no 2 [19:16:52] chasemp: whereru? [19:18:03] Time to go hold some steak [19:30:50] YuviPanda: just a note; might want to re-!log that message if you want it in the SAL [19:31:37] JohnLewis: oh? [19:31:45] !log restarted keystone on virt1000, ^d couldn’t log in [19:31:52] Logged the message, Master [19:31:56] morebots wasn't here when you logged it first time [19:31:57] I am a logbot running on tools-exec-04. [19:31:57] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [19:31:57] To log a message, type !log . [19:32:33] JohnLewis: right. [19:32:36] need to kill that [19:32:48] the bot? 
:o [19:32:56] yeah [19:33:03] and move things to logstash or something [19:33:09] well, a public version of logstash at least [19:33:24] andrewbogott: hmm, so DNS errors seem to have been silenced at least a little bit today [19:33:34] public? ha, keep everything private! It's what the community thinks the WMF does anyway [19:33:45] YuviPanda: I don't think that can be anything but coincidence. We haven't changed anything... [19:33:51] Unless /you/ changed something? [19:33:51] (CR) Reedy: "26th/27th?" [mediawiki-config] - https://gerrit.wikimedia.org/r/186242 (owner: 01tonythomas) [19:34:33] (CR) 01tonythomas: "I'm ready for 26th though :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/186242 (owner: 01tonythomas) [19:34:53] Reedy: wait we're deploying that today? [19:34:56] andrewbogott: I actually did [19:34:58] andrewbogott: 'sysctl -w net.netfilter.nf_conntrack_max=131072' [19:35:06] at about just before the problem started disappearing? [19:35:16] legoktm: Dunno. It didn't specify a date [19:41:38] andrewbogott: think you can follow up on the conntrack stuff? 
[19:41:43] (PS1) Reedy: Update interwiki cache for wikimania2016wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/186810 [19:41:44] I’m going to look at why graphite is broken there [19:41:58] YuviPanda: maybe, except I couldn't hear what akosiaris was saying :( [19:41:59] (CR) Reedy: [C: 2] Update interwiki cache for wikimania2016wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/186810 (owner: Reedy) [19:42:03] (Merged) jenkins-bot: Update interwiki cache for wikimania2016wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/186810 (owner: Reedy) [19:42:53] Analytics, Wikidata, wikidata-query-service, operations, Services, MediaWiki-General-or-Unknown: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#994876 (mobrovac) Re: reliability, [RELP](http://www.rsyslog.com/doc/relp.html) might be of help on the application level. [19:45:18] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:45:19] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:45:19] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:47:14] Jeff_Green: ^ [19:50:18] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:50:18] PROBLEM - check_mysql on payments1002 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [19:50:19] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:50:25] operations: labnet1001 should be able to send stats to tungsten - https://phabricator.wikimedia.org/T87600#994896 (yuvipanda) NEW [19:55:14] akosiaris: if you’re looking at firewally things https://phabricator.wikimedia.org/T87600? 
[19:55:18] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:55:18] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:55:19] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [19:55:41] <_joe_> Jeff_Green: are you going to look at this? [19:56:25] operations: labnet1001 should be able to send stats to tungsten - https://phabricator.wikimedia.org/T87600#994914 (yuvipanda) Other machines in the labs subnet can hit tungsten, and labsnet can hit other machines in the labs subnet [19:57:09] PROBLEM - puppetmaster https on palladium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [19:58:08] PROBLEM - puppetmaster backend https on palladium is CRITICAL: Connection refused [19:58:09] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 33 failures [19:58:19] PROBLEM - puppet last run on virt1005 is CRITICAL: CRITICAL: puppet fail [19:58:19] PROBLEM - etherpad.wikimedia.org HTTPS on zirconium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:28] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: puppet fail [19:58:29] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 12 failures [19:58:29] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: Puppet has 44 failures [19:58:32] icinga-wm: don't do this please [19:58:38] PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: puppet fail [19:58:39] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: Puppet has 11 failures [19:58:39] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 33 failures [19:58:39] PROBLEM - puppet last run on analytics1012 is CRITICAL: CRITICAL: Puppet has 17 failures [19:58:39] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: puppet fail 
[19:58:39] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: Puppet has 8 failures [19:58:40] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: Puppet has 15 failures [19:58:40] PROBLEM - puppet last run on es2006 is CRITICAL: CRITICAL: Puppet has 17 failures [19:58:41] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: Puppet has 25 failures [19:58:41] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: Puppet has 10 failures [19:58:48] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Puppet has 12 failures [19:58:48] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 19 failures [19:58:48] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 3 failures [19:58:49] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Puppet has 11 failures [19:58:49] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: puppet fail [19:58:49] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 20 failures [19:58:49] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Puppet has 67 failures [19:58:54] YuviPanda ^^ palladium needs a kick pls [19:58:59] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet has 81 failures [19:58:59] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [19:58:59] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: Puppet has 18 failures [19:58:59] PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Puppet has 25 failures [19:58:59] PROBLEM - puppet last run on amssq39 is CRITICAL: CRITICAL: Puppet has 25 failures [19:59:00] PROBLEM - puppet last run on amssq37 is CRITICAL: CRITICAL: Puppet has 23 failures [19:59:00] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail [19:59:01] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: Puppet has 7 failures [19:59:01] PROBLEM - puppet last run on analytics1021 is CRITICAL: CRITICAL: puppet fail [19:59:02] 
PROBLEM - puppet last run on amssq50 is CRITICAL: CRITICAL: puppet fail [19:59:02] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Puppet has 9 failures [19:59:03] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 13 failures [19:59:03] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: Puppet has 40 failures [19:59:04] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 44 failures [19:59:19] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: puppet fail [19:59:19] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: Puppet has 74 failures [19:59:19] PROBLEM - puppet last run on mw1085 is CRITICAL: CRITICAL: puppet fail [19:59:19] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: puppet fail [19:59:19] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.062 second response time [19:59:19] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: Puppet has 29 failures [19:59:19] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 23 failures [19:59:20] PROBLEM - puppet last run on lanthanum is CRITICAL: CRITICAL: puppet fail [19:59:21] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 13 failures [19:59:21] PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: Puppet has 24 failures [19:59:22] RECOVERY - etherpad.wikimedia.org HTTPS on zirconium is OK: HTTP OK: HTTP/1.1 200 OK - 28850 bytes in 0.499 second response time [19:59:29] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 63 failures [19:59:29] PROBLEM - puppet last run on hydrogen is CRITICAL: CRITICAL: puppet fail [19:59:29] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Puppet has 16 failures [19:59:29] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: puppet fail [19:59:30] PROBLEM - puppet last run on es1004 is CRITICAL: CRITICAL: Puppet has 7 failures 
[19:59:30] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: Puppet has 78 failures [19:59:30] <_joe_> !log restarted apache on palladium [19:59:33] <_joe_> JohnLewis: done [19:59:34] Logged the message, Master [19:59:37] _joe_: thanks [19:59:38] <_joe_> YuviPanda: I did it [19:59:38] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 3.608 second response time [19:59:39] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet has 72 failures [19:59:39] PROBLEM - puppet last run on db1005 is CRITICAL: CRITICAL: puppet fail [19:59:39] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 32 failures [19:59:39] PROBLEM - puppet last run on mw1058 is CRITICAL: CRITICAL: puppet fail [19:59:39] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: puppet fail [19:59:39] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: Puppet has 35 failures [19:59:40] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Puppet has 15 failures [19:59:49] PROBLEM - puppet last run on labsdb1002 is CRITICAL: CRITICAL: Puppet has 17 failures [19:59:49] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: puppet fail [19:59:50] PROBLEM - puppet last run on elastic1013 is CRITICAL: CRITICAL: Puppet has 20 failures [19:59:50] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Puppet has 17 failures [19:59:50] <_joe_> icinga-wm: shush! 
[19:59:58] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Puppet has 22 failures [19:59:58] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 23 failures [19:59:58] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 22 failures [19:59:58] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: puppet fail [19:59:58] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 23 failures [19:59:59] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: puppet fail [19:59:59] PROBLEM - puppet last run on elastic1010 is CRITICAL: CRITICAL: puppet fail [20:00:00] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: Puppet has 25 failures [20:00:08] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: Puppet has 72 failures [20:00:09] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: puppet fail [20:00:09] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 18 failures [20:00:09] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 22 failures [20:00:09] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail [20:00:09] PROBLEM - puppet last run on amssq45 is CRITICAL: CRITICAL: puppet fail [20:00:18] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [20:00:18] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [20:00:19] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: puppet fail [20:00:19] PROBLEM - check_mysql on payments1003 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [20:00:19] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: Puppet has 38 failures [20:00:19] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: puppet fail [20:00:19] PROBLEM - puppet last run on mw1070 is CRITICAL: CRITICAL: Puppet has 65 failures [20:00:20] PROBLEM - puppet last run on 
ms-be1005 is CRITICAL: CRITICAL: Puppet has 22 failures [20:00:28] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 75 failures [20:00:29] PROBLEM - puppet last run on es1003 is CRITICAL: CRITICAL: puppet fail [20:00:29] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Puppet has 35 failures [20:00:38] PROBLEM - puppet last run on mw1184 is CRITICAL: CRITICAL: puppet fail [20:00:39] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 15 failures [20:00:39] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: puppet fail [20:00:39] PROBLEM - puppet last run on elastic1009 is CRITICAL: CRITICAL: puppet fail [20:00:39] PROBLEM - puppet last run on elastic1016 is CRITICAL: CRITICAL: puppet fail [20:00:39] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: Puppet has 73 failures [20:00:40] !log restarted apache on palladium/strontium, cleared the, created on Jan 23, pid files from puppetmaster [20:00:44] Logged the message, Master [20:00:50] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: Puppet has 77 failures [20:00:58] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: puppet fail [20:00:58] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: puppet fail [20:00:58] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: puppet fail [20:00:59] PROBLEM - puppet last run on mw1035 is CRITICAL: CRITICAL: puppet fail [20:00:59] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: puppet fail [20:00:59] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: puppet fail [20:01:08] PROBLEM - puppet last run on es1005 is CRITICAL: CRITICAL: Puppet has 9 failures [20:01:08] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: puppet fail [20:01:08] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: puppet fail [20:01:08] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: Puppet has 24 failures [20:01:08] PROBLEM - puppet last run on 
ms-be2010 is CRITICAL: CRITICAL: puppet fail [20:01:09] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Puppet has 38 failures [20:01:09] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail [20:01:10] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: puppet fail [20:01:10] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Puppet has 37 failures [20:01:11] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: puppet fail [20:01:13] _joe_: maybe worth killing icinga-wm until things clear up? [20:01:16] <_joe_> akosiaris: I did it already [20:01:18] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: puppet fail [20:01:18] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: puppet fail [20:01:18] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: puppet fail [20:01:19] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet has 49 failures [20:01:19] PROBLEM - puppet last run on amslvs4 is CRITICAL: CRITICAL: puppet fail [20:01:19] PROBLEM - puppet last run on db1019 is CRITICAL: CRITICAL: Puppet has 19 failures [20:01:19] PROBLEM - puppet last run on amssq52 is CRITICAL: CRITICAL: Puppet has 48 failures [20:01:19] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: puppet fail [20:01:28] PROBLEM - puppet last run on mw1096 is CRITICAL: CRITICAL: puppet fail [20:01:29] PROBLEM - puppet last run on cp1069 is CRITICAL: CRITICAL: puppet fail [20:01:29] PROBLEM - puppet last run on erbium is CRITICAL: CRITICAL: Puppet has 27 failures [20:01:29] PROBLEM - puppet last run on analytics1024 is CRITICAL: CRITICAL: puppet fail [20:01:29] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail [20:01:29] PROBLEM - puppet last run on db1058 is CRITICAL: CRITICAL: puppet fail [20:01:29] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: puppet fail [20:01:31] it has a 180+ spam list according to icinga [20:01:37] <_joe_> I mean on palladium [20:01:38] 
PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Puppet has 108 failures [20:01:38] PROBLEM - puppet last run on mw1246 is CRITICAL: CRITICAL: puppet fail [20:01:39] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: puppet fail [20:01:39] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: puppet fail [20:01:39] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: puppet fail [20:01:39] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: puppet fail [20:01:59] JohnLewis: good point, done [20:02:10] <_joe_> akosiaris: what did you exactly do [20:02:19] I'll keep an eye on icinga and poke when it's green [20:02:41] _joe_: sudo service apache2 stop, rm /var/run/puppet/master.pid [20:02:55] _joe_: My bet is that someone did a service puppetmaster start on Jan 23 [20:03:28] <_joe_> akosiaris: after a plain apache restart puppet backend was ok on palladium [20:03:38] <_joe_> akosiaris: well, about that... GRRRR [20:03:38] _joe_: yes, but! [20:03:42] <_joe_> ok, lunch [20:03:47] ok [20:07:21] (03PS1) 10BBlack: Raise and recalculate varnish frontend mallocs [puppet] - 10https://gerrit.wikimedia.org/r/186816 [20:19:58] RECOVERY - puppet last run on wtp1024 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:19:58] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [20:19:58] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:19:59] RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:59] RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:20:00] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:20:00] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is 
currently enabled, last run 57 seconds ago with 0 failures [20:20:00] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:20:01] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [20:20:10] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [20:20:10] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:10] RECOVERY - puppet last run on ytterbium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:20:11] RECOVERY - puppet last run on cp1069 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [20:20:11] RECOVERY - puppet last run on mw1096 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:20:11] RECOVERY - puppet last run on amslvs4 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:20:18] RECOVERY - check_mysql on payments1003 is OK: Uptime: 1147093 Threads: 2 Questions: 1176444 Slow queries: 8320 Opens: 677 Flush tables: 1 Open tables: 46 Queries per second avg: 1.025 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [20:20:19] RECOVERY - puppet last run on amssq58 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:20:19] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:20:19] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:20:19] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:28] RECOVERY - puppet last run on mw1246 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:28] 
RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [20:20:28] RECOVERY - puppet last run on analytics1019 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:20:28] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:20:28] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:20:29] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:29] RECOVERY - puppet last run on db1010 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:20:30] RECOVERY - puppet last run on virt1012 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:20:39] RECOVERY - puppet last run on wtp1021 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:20:49] RECOVERY - puppet last run on mw1036 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:20:49] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:20:49] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:20:49] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:20:49] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:50] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [20:20:50] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:58] RECOVERY - puppet last run on labsdb1001 is OK: OK: 
Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:20:58] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:20:59] RECOVERY - puppet last run on virt1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:21:09] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [20:21:09] RECOVERY - puppet last run on es1009 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [20:21:09] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:21:19] RECOVERY - puppet last run on db1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:22] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:21:28] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:21:28] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:21:40] RECOVERY - puppet last run on mw1130 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [20:21:49] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:25:59] (03PS2) 10BBlack: Raise and recalculate varnish frontend mallocs [puppet] - 10https://gerrit.wikimedia.org/r/186816 [20:35:46] (03PS4) 10Dzahn: Redirect ve.wikimedia.org to wikimedia.org.ve [puppet] - 10https://gerrit.wikimedia.org/r/170925 (owner: 10Glaisher) [20:42:20] 3Wikimedia-Site-requests, operations: Create wiki for Wikimedia Venezuela - https://phabricator.wikimedia.org/T72579#994972 (10Dzahn) [20:42:21] 3Wikimedia-General-or-Unknown, operations: Cleanup and delete vewikimedia - 
https://phabricator.wikimedia.org/T57737#994971 (10Dzahn) [20:43:16] mutante: lovely structure, deleting a wiki is blocked by the creation of a wiki ;) [20:44:08] JohnLewis: true, isn't it ?:) [20:44:19] it is indeed :p [20:44:31] (i would have used a regular "refers to" if i had one, but there is only "blocking") [20:44:51] or how to use the Reference field [20:50:42] 3operations: labnet1001 should be able to send stats to tungsten - https://phabricator.wikimedia.org/T87600#994989 (10akosiaris) 5Open>3Resolved a:3akosiaris labs-in4 filter in cr1-eqiad and cr2-eqiad now has a allow_graphite term that allows port 2003, 2004 (graphite) and 8125 (statsd) to communicate with... [21:03:19] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: puppet fail [21:08:56] (03PS1) 10Hoo man: Remove toolserver IP ConfirmEdit whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186823 [21:11:27] (03CR) 10Reedy: [C: 031] Remove toolserver IP ConfirmEdit whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186823 (owner: 10Hoo man) [21:16:12] (03PS1) 10Chad: Disable all mail on phab-01 test instance [puppet] - 10https://gerrit.wikimedia.org/r/186827 [21:16:37] (03PS1) 10Hoo man: Exempt Item and Property namespaces from ConfirmEdit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186828 (https://phabricator.wikimedia.org/T86453) [21:18:56] (03CR) 10Rush: [C: 032 V: 032] Disable all mail on phab-01 test instance [puppet] - 10https://gerrit.wikimedia.org/r/186827 (owner: 10Chad) [21:21:05] (03PS1) 10RobH: allowing direct to task email for onsite queues [puppet] - 10https://gerrit.wikimedia.org/r/186831 [21:21:52] 3operations, ops-eqiad: testing ticket for emails - https://phabricator.wikimedia.org/T87481#995071 (10emailbot) **`Chase Pettet`** replied via email on `Mon, 26 Jan 2015 13:21:45 -0800` __Subject__: I am a subject > body! 
-------------------------- {None} [21:22:08] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [21:24:02] (03PS2) 10Rush: allowing direct to task email for onsite queues [puppet] - 10https://gerrit.wikimedia.org/r/186831 (owner: 10RobH) [21:24:08] (03CR) 10Rush: [C: 031] allowing direct to task email for onsite queues [puppet] - 10https://gerrit.wikimedia.org/r/186831 (owner: 10RobH) [21:24:14] 3Triagers, Phabricator, operations, Project-Creators: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#995080 (10Jdlrobson) Please can I be added? I'm still trying to work out a good mental model that is compatible with how the mobile team works that wil... [21:25:24] (03CR) 10RobH: [C: 032] allowing direct to task email for onsite queues [puppet] - 10https://gerrit.wikimedia.org/r/186831 (owner: 10RobH) [21:25:57] woo pushing it live now [21:26:27] woo walking into a disaster live now [21:27:56] _joe_: wanna merge https://gerrit.wikimedia.org/r/#/c/185940/ [21:28:18] <_joe_> YuviPanda: later maybe? 
[21:28:24] _joe_: ok [21:28:31] 3operations: Job queue stats are broken - https://phabricator.wikimedia.org/T87594#995082 (10Aklapper) [21:28:59] <_joe_> I am in the middle of a session that I have to follow with attention [21:29:06] _joe_: cool [21:30:17] (03CR) 10Ottomata: "I talked to Chase about this, and I think right now we have two options:" [puppet] - 10https://gerrit.wikimedia.org/r/186254 (owner: 10QChris) [21:30:26] (03PS1) 10Ori.livneh: EventLogging consumer: set respawn limit to unlimited [puppet] - 10https://gerrit.wikimedia.org/r/186835 [21:30:33] (03PS2) 10Yuvipanda: hiera: fix up for labs hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/185940 (owner: 10Giuseppe Lavagetto) [21:31:13] (03CR) 10Ori.livneh: [C: 032 V: 032] EventLogging consumer: set respawn limit to unlimited [puppet] - 10https://gerrit.wikimedia.org/r/186835 (owner: 10Ori.livneh) [21:33:36] 3operations: Add @emailbot to #wmf-nda - https://phabricator.wikimedia.org/T87611#995092 (10chasemp) 3NEW [21:34:01] 3Phabricator, operations: Add @emailbot to #wmf-nda - https://phabricator.wikimedia.org/T87611#995099 (10chasemp) [21:34:54] 3Phabricator, operations: Add @emailbot to #wmf-nda - https://phabricator.wikimedia.org/T87611#995103 (10chasemp) @csteipp, do you have any objection? I can't think of any other way to solve this problem that isn't worse. [21:38:32] (03CR) 10QChris: "> I think I'd prefer to go with the 2nd option. 
It'll be easier to" [puppet] - 10https://gerrit.wikimedia.org/r/186254 (owner: 10QChris) [21:38:34] (03PS1) 10Ottomata: Remove exec that added hdfs user to analytics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/186847 [21:39:55] (03PS1) 10Yuvipanda: Move deployment-prep hiera config into ops/puppet [puppet] - 10https://gerrit.wikimedia.org/r/186852 (https://phabricator.wikimedia.org/T87223) [21:40:40] (03CR) 10Ottomata: [C: 032] Remove exec that added hdfs user to analytics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/186847 (owner: 10Ottomata) [21:41:27] 3Beta-Cluster, operations: Renumber apache user/group to uid=48 - https://phabricator.wikimedia.org/T78076#995130 (10yuvipanda) I'll let @faidon elaborate, but I think we're going to re-number in prod and also try to explicitly set uid/gid for all system users declared in puppet. [21:41:58] (03CR) 10dschwen: [C: 031] "Looks straight forward." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186743 (https://phabricator.wikimedia.org/T87431) (owner: 10Glaisher) [21:41:59] ottomata: removing hadoop logging seems more complex than I expected :D you wanna do it sometime? [21:42:03] YuviPanda: I will, sorry, hard to write lengthy responses right now :) [21:42:06] hah [21:42:09] uhhh [21:42:13] paravoid: yeah, not now, at some point in the future :) [21:42:14] not sure i know how either, hm. [21:42:36] ottomata: seems to be putting jar files in places and creating symlinks and stuff [21:42:49] oh, YuviPanda [21:42:49] set [21:42:50] $gelf_logging_enabled [21:43:19] role::analytics::hadoop::config [21:43:30] ottomata: hah! 
ok [21:44:21] line 59 i think [21:45:23] (03PS1) 10Yuvipanda: hadoop: Disable gelf logging to logstasah [puppet] - 10https://gerrit.wikimedia.org/r/186880 (https://phabricator.wikimedia.org/T87206) [21:45:33] ottomata: ^ [21:46:13] (03PS2) 10Ottomata: hadoop: Disable gelf logging to logstasah [puppet] - 10https://gerrit.wikimedia.org/r/186880 (https://phabricator.wikimedia.org/T87206) (owner: 10Yuvipanda) [21:46:30] hm, YuviPanda, it is possible hadoop daemons need to be restarted to pick up this change [21:46:49] ottomata: hmm, I guess that’ll disrupt jobs? [21:47:01] (03CR) 10Ottomata: [C: 032 V: 032] "We should consider reenabling this when we are able to limit the logs to YARN application logs that we actually want users to be able to q" [puppet] - 10https://gerrit.wikimedia.org/r/186880 (https://phabricator.wikimedia.org/T87206) (owner: 10Yuvipanda) [21:47:05] (03PS7) 10QChris: Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 [21:48:08] (03Abandoned) 10QChris: Grant stats user access to private analytics data [puppet] - 10https://gerrit.wikimedia.org/r/186254 (owner: 10QChris) [21:48:29] ha, just ran into yuvi and talked to him in PERSON [21:48:30] amazing! [21:49:34] (03Abandoned) 10QChris: For stats user's cron jobs on stat1002, make empty MAILTO explicit [puppet] - 10https://gerrit.wikimedia.org/r/186253 (owner: 10QChris) [21:49:43] 3Analytics, operations: Hadoop logs on logstash are being really spammy - https://phabricator.wikimedia.org/T87206#995149 (10Ottomata) Merged, but Hadoop daemons will need to be restarted to pick up this change. If you don't mind waiting, I will likely be restarting all of them soon (hopefully within the next f... 
[21:59:53] (03PS8) 10Ottomata: Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 (owner: 10QChris) [22:00:44] (03CR) 10Ottomata: [C: 032] Mail webrequest partition status summaries to analytics ops [puppet] - 10https://gerrit.wikimedia.org/r/185442 (owner: 10QChris) [22:04:37] 3Beta-Cluster, operations: Set up an alert for unmerged changes in deployment-prep - https://phabricator.wikimedia.org/T87616#995192 (10yuvipanda) 3NEW a:3yuvipanda [22:09:05] (03PS1) 10Jdlrobson: Correct wikidata uri for beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186889 [22:11:21] (03CR) 10John F. Lewis: [C: 04-1] Correct wikidata uri for beta labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186889 (owner: 10Jdlrobson) [22:11:59] PROBLEM - etherpad.wikimedia.org HTTPS on zirconium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:12:59] RECOVERY - etherpad.wikimedia.org HTTPS on zirconium is OK: HTTP OK: HTTP/1.1 200 OK - 28848 bytes in 0.637 second response time [22:24:53] bleh, zirconium cpu wio spikes [22:30:25] (03PS1) 10Yuvipanda: Make standard class's exim including behavior configurable [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) [22:32:05] bd808: I disabled hadoop logging into logstash, but ottomata says you need to restart hadoop to get it to fully take effect, sadly [22:33:26] bd808: lemme know how urgent that is. it takes a bit of effort and coordination to restart some things, but not a huuuge amount. [22:33:33] YuviPanda: cool deal. 
Now get those spares reimaged :) [22:33:41] i plan on doing a hadoop upgrade within a few weeks, [22:33:43] hopefully [22:33:46] maybe a month at most [22:33:52] so it'll get restarted then anyway [22:34:13] ottomata: like I said last night I think we should fix our capacity rather than sniping at log volume [22:35:16] aye ok cool [22:35:20] if you change your mind lemme know [23:20:12] (03PS4) 10QChris: Add logs from 'misc' caches to kafka pipeline [puppet] - 10https://gerrit.wikimedia.org/r/184183 [23:21:42] (03PS5) 10Ottomata: Add logs from 'misc' caches to kafka pipeline [puppet] - 10https://gerrit.wikimedia.org/r/184183 (owner: 10QChris) [23:22:35] (03CR) 10Ottomata: [C: 032] Add logs from 'misc' caches to kafka pipeline [puppet] - 10https://gerrit.wikimedia.org/r/184183 (owner: 10QChris) [23:25:38] 3Phabricator, operations: Add @emailbot to #wmf-nda - https://phabricator.wikimedia.org/T87611#995412 (10RobH) a:3csteipp I've assigned this to Chris for his commentary. Chris: Please provide feedback and then feel free to unassign yourself as owner (or assign to me since I'll be working on this as it gets re... [23:38:17] 3operations: detail hardware requests policy and procedure on wikitech/officewiki - https://phabricator.wikimedia.org/T87626#995452 (10RobH) 3NEW a:3RobH [23:45:53] 3operations, ops-core: install/deploy codfw appservers - https://phabricator.wikimedia.org/T85227#995475 (10RobH) [23:45:53] 3operations, ops-core: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#995474 (10RobH) [23:46:08] 3operations: install/deploy codfw appservers - https://phabricator.wikimedia.org/T85227#941826 (10RobH) [23:47:20] 3operations: Decomission svn.wikimedia.org - https://phabricator.wikimedia.org/T86655#995478 (10RobH) So then I suppose this task should be modified to import SVN into phab? 
[23:48:17] 3operations: Decomission (server) svn.wikimedia.org - by importing svn into phab - https://phabricator.wikimedia.org/T86655#995479 (10RobH) [23:49:22] 3operations: Decomission (server) svn.wikimedia.org - by importing svn into phab - https://phabricator.wikimedia.org/T86655#973496 (10RobH) @Chad, Would you be the person to import this in, or should an ops person take point on this? [23:57:15] <_joe_> !log depooling mw1018 for testing with user changing [23:57:24] Logged the message, Master [23:59:54] bd808: having mark___ explicitly approve on the logstash ticket maybe? robh likes that :)