[00:07:28] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 319 seconds [00:07:28] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 320 seconds [00:08:37] PROBLEM - Backend Squid HTTP on sq36 is CRITICAL: Connection timed out [00:10:27] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [00:10:28] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [00:11:44] hey opsen, who's around? [00:11:58] we're getting reports of the api being down again, same error (appears to be to me) [00:12:01] from europeans [00:12:03] see -tech [00:13:14] mutante, Ryan_Lane^^^ [00:13:30] paravoid: ^^ [00:13:36] meh [00:13:37] fun times [00:13:42] so, I think it may be prudent to move esams to eqiad [00:13:46] yeah agreed [00:13:55] on it [00:14:20] I mean, if the point was to keep the caches warm, that point is moot now [00:14:22] wow. my internet connection sucks [00:14:35] yep [00:14:37] we can keep the one cache warm, that doesn't do us much good :) [00:14:59] heh [00:15:03] well, we're keeping other ones warm too [00:15:11] but the API being down to one node is a bad idea :) [00:15:38] it does seem to work for Parsoid [00:15:53] parsoid? esams? [00:16:05] or pmtpa? [00:16:11] no, eqiad [00:16:20] eqiad is what's working :) [00:16:21] pmtpa is not [00:16:24] yeah, issue is with pmtpa and by extension esams [00:16:27] PROBLEM - SSH on sq36 is CRITICAL: Server answer: [00:16:34] ah, k- nm [00:17:10] !log switching esams from using pmtpa to using eqiad for carp [00:17:16] Logged the message, Master [00:18:22] !log rebooting sq36 [00:18:27] Logged the message, Master [00:18:43] wow you're much faster than I would be at this time [00:18:55] well, I had done it earlier and just didn't deploy ;) [00:19:32] it took me a bit to track it down [00:19:32] since it's generated from a weird config file and php [00:19:50] heh, yeah [00:19:53] but it's under git now [00:19:54] not rcs [00:19:57] that's something! [00:21:08] heh [00:21:08] yep [00:21:08] I'm not sure who did that, but I was happy about it :) [00:21:26] (I did) [00:21:27] RECOVERY - SSH on sq36 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [00:21:38] ah. cool [00:21:43] after my fourth change or so, I couldn't stand it anymore :) [00:21:51] hm. maybe it didn't switch properly [00:21:51] * Ryan_Lane checks the generated files [00:22:09] damn it [00:22:12] that didn't seem to work [00:22:14] oh [00:22:15] right [00:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [00:23:30] hm. that's not correct either [00:23:30] this is such a pain in the ass [00:24:00] ah [00:24:17] can't switch upload-settings to eqiad [00:24:30] upload shouldn't have to do with api? [00:24:38] *anything to do [00:24:46] it doesn't, but I was trying to see why pmtpa hosts are still showing up in the generated files [00:25:58] this reminds me of a pirate with a steering wheel attached to his pants [00:26:12] "arrrr, it's driving me nuts" [00:26:13] root@fenari:/h/w/conf/squid# grep -r sq36 generated/*/*ams* [00:26:28] doesn't look like matching anything but amssq36 [00:26:28] *rimshot* [00:26:31] did you deploy? 
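A note on the grep just above: searching the generated squid configs for "sq36" also matches "amssq36", so a less ambiguous spot-check is to look for the backend's IP address, as suggested just below. A minimal sketch of that kind of search; the generated/ path follows the fenari prompt above, and the default IP is the example quoted below, so treat both as illustrative rather than the exact procedure used.

import os
import sys

def find_ip(root, ip):
    """Walk a config tree and report every line that mentions the given IP."""
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'r', errors='replace') as handle:
                    for lineno, line in enumerate(handle, 1):
                        if ip in line:
                            hits.append((path, lineno, line.strip()))
            except OSError:
                pass   # unreadable file, skip it
    return hits

if __name__ == '__main__':
    ip = sys.argv[1] if len(sys.argv) > 1 else '208.80.152.96'
    for path, lineno, line in find_ip('generated', ip):
        print('%s:%d: %s' % (path, lineno, line))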
[00:26:35] you need to match on IP [00:26:40] ah, right [00:26:42] 208.80.152.96, for instance [00:27:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:27:35] it's too bad mark isn't here. he'd be able to do this in seconds [00:29:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [00:29:27] ah [00:29:27] RECOVERY - Backend Squid HTTP on sq36 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.115 second response time [00:29:34] ugh, I see errors on normal pageviews, too [00:29:51] I think it's necessary to actually change the 'carp_weights' array [00:29:53] shit, site's down [00:29:57] main page too [00:30:19] GET http://commons.wikimedia.org/wiki/File:Dirty_white_pseudomembrane_classically_seen_in_diptheria_2013-07-06_11-07.jpg, from 91.198.174.47 via amssq40.esams.wikimedia.org (squid/2.7.STABLE9) [00:30:29] I'm going to revert my change [00:31:01] 2013/08/30 00:18:14| TCP connection to 10.64.0.137 (10.64.0.137:3128) failed [00:31:04] 2013/08/30 00:18:14| TCP connection to 10.64.0.137 (10.64.0.137:3128) failed [00:31:06] yeah [00:31:07] esams [00:31:13] definitely my change was incorrect [00:31:20] heh [00:31:28] pushing [00:31:33] k [00:31:40] I can't wait till we're only on varnish [00:31:52] this generated config is relatively confusing [00:32:04] paravoid: back up? [00:32:12] well to be fair, pushing a config to all those servers via puppet would take you about half an hour :P [00:32:15] yes [00:32:27] to this deploy thing is much better in this aspect at least :) [00:32:38] oh, yeah. I wouldn't use puppet :) [00:32:53] paravoid, sudo puppetd -tv via salt?:) [00:33:01] riiiiiight [00:33:14] since puppet is so much faster when being hammered by clients [00:33:15] ewww [00:33:18] and stafford never complains [00:33:27] I'd use something more like git deploy [00:33:27] or just use git deploy itself [00:33:37] though it would need to work in esams, too [00:33:48] have fun with the erbs :) [00:33:54] ok, looking at the config again :) [00:34:08] erbs? you mean for varnish? [00:34:13] yeah [00:34:17] yeah, that's problematic [00:34:39] will our current method continue to work in the future, though? [00:34:47] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [00:34:58] ok. looking again for how to do this properly [00:35:35] the bad thing is, I think I did it properly [00:35:40] we may not be able to switch this [00:35:45] the bad thing is, all cp10xx are intenral [00:35:49] eqiad's squids are all on 10. [00:35:50] yeah [00:36:09] ok, then we need to add more API squids [00:36:19] wait, why would POSTs go to one squid? [00:36:29] because there's just one squid left in the pool... [00:36:41] POSTs generally go to the same location anyway [00:36:43] due to CARP [00:36:54] but in this situation all requests are only going to a single node [00:36:58] because there's only one node :) [00:37:14] ok. I'm going to designate a couple squids as API [00:37:24] yep I was about to suggest that [00:37:39] 37 and 59? [00:37:44] how do they look in ganglia... 
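The CARP behaviour discussed above is the crux: cache peers are chosen by a weighted hash over the request URL, so a given URL (and any POST to it) consistently lands on the same backend, and a pool with a single member sends absolutely everything to that one box. Below is a simplified sketch of CARP-style selection, not squid's exact hash function; the lone-pool name is hypothetical, while sq37/sq59 are the squids being designated as API above.

import hashlib

def carp_pick(url, members):
    """members: dict of backend name -> weight; highest combined hash score wins."""
    def score(name, weight):
        digest = hashlib.md5((name + url).encode()).hexdigest()
        return weight * int(digest, 16)
    return max(members, key=lambda name: score(name, members[name]))

lone_pool = {'api-squid.pmtpa': 1.0}                 # hypothetical single remaining API backend
wider_pool = {'api-squid.pmtpa': 1.0,
              'sq37.wikimedia.org': 1.0,             # the two squids designated
              'sq59.wikimedia.org': 1.0}             # as API squids above

url = 'http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo'
print(carp_pick(url, lone_pool))    # always the lone member, whatever the URL
print(carp_pick(url, wider_pool))   # choice now spread across the pool by URL hash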
[00:37:47] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [00:37:47] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [00:37:48] paravoid: that would explain heh [00:37:48] they look fine [00:38:04] I wouldn't see why CARP alone would be the problem unless they were all posts to the same url [00:38:05] paravoid: you're in vim :) [00:38:14] sorry [00:38:15] go ahead [00:38:17] no worries [00:39:13] added 37 and 59 [00:39:13] deploying [00:39:37] I hate taking the site down [00:39:42] at least it wasn't dns this time [00:40:22] at least it's just esams at 1:30am WEST/2:30am CEST/3:30am EEST [00:40:32] Working from home today and getting lots of 502 Bad Gateway errors [00:40:48] wait, in the US? [00:40:52] yes [00:41:07] (we were dealing with esams issues) [00:41:11] what kind of 502 errors? [00:41:14] does it say anything else? [00:41:41] not that I remember. They were very minimal error messages [00:41:42] ok, that should do it [00:42:01] can you get headers next time? [00:42:18] sure [00:42:23] thanks [00:43:03] we're not aware of any issues state-side and haven't heard it from anyone else, so it'd really help [00:43:30] boy this is about to get so much more fun with ulsfo [00:44:11] yeah [00:44:18] looks fine now [00:44:26] sq37 barely noticed [00:44:47] so far :) [00:47:13] speaking of squid, how does the process of its death fare?:) [00:47:56] progressing I think [00:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [01:03:35] Anyone else (EU) getting bits fail? [01:04:15] Meh, just fail on one connection [01:04:18] Disregard [01:20:04] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 687 MB (9% inode=65%): [01:22:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [01:25:14] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 584 MB (8% inode=65%): [01:25:54] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [01:26:43] Server version: 5.1.53-wm-log (mysql-at-facebook-r3753) [01:26:55] How many mysql boxen do we still have not upgraded to mariadb? [01:30:04] RECOVERY - check_disk on db1025 is OK: DISK OK - free space: / 3760 MB (52% inode=65%): [01:46:04] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [02:04:12] Reedy: about 50/50 mysql/mariadb overall. 40/60 on the s* clusters [02:05:23] Is there any near term plan to finish the migration? [02:06:15] 5.1.53 -> 5.5.32 is a pretty big jump, never mind the mariadb changes [02:07:33] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 301 seconds [02:08:33] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay -0 seconds [02:15:03] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 702 MB (9% inode=83%): [02:15:41] full migration is indeed on the near-term todo list [02:16:51] the remaining 5.1 slaves in the s* clusters, definitely. 
swithcing all masters needs more care [02:18:27] !log LocalisationUpdate completed (1.22wmf14) at Fri Aug 30 02:18:27 UTC 2013 [02:18:34] Logged the message, Master [02:20:04] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 625 MB (8% inode=83%): [02:20:13] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [02:22:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:24:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [02:25:03] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 549 MB (7% inode=83%): [02:26:13] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [02:30:03] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 474 MB (6% inode=83%): [02:33:13] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [02:35:03] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 400 MB (5% inode=83%): [02:35:45] Jeff_Green: ^ if you haven't seen already... [02:36:53] (03PS1) 10Kaldari: Using protocol-relative URL for Flickr API [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81890 [02:37:13] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [02:38:01] !log LocalisationUpdate completed (1.22wmf15) at Fri Aug 30 02:38:00 UTC 2013 [02:38:07] Logged the message, Master [02:39:04] Looks like the HTTPS switch broke Flickr importing on Commons. Anyone want to +2 the fix: https://gerrit.wikimedia.org/r/#/c/81890/ [02:40:07] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 321 MB (4% inode=83%): [02:41:38] (03CR) 10Yuvipanda: [C: 031] Using protocol-relative URL for Flickr API [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81890 (owner: 10Kaldari) [02:41:50] (rather useless, i guess) [02:41:51] Reedy: ^ [02:42:17] YuviPanda: Thanks anyway. Every little bit counts... well sort of. [02:42:37] I forgot I didn't have +2 there, and... gave a +1 anyway because I was already there [02:45:07] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 242 MB (3% inode=83%): [02:46:27] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [02:47:27] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [02:50:07] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 167 MB (2% inode=83%): [02:50:27] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [02:50:27] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [02:50:27] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [02:51:06] ACKNOWLEDGEMENT - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 167 MB (2% inode=83%): Matt Walker Ive stopped the job that was filling the error logs until tomorrow. 
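The Flickr fix under review above swaps a hard-coded http:// API URL for a protocol-relative one, so the request scheme follows the page the user is on and HTTPS pages stop making plain-HTTP calls. A small illustration of how a network-path reference resolves; the endpoint path below is illustrative, not necessarily the exact URL in the config.

from urllib.parse import urljoin

# A protocol-relative ("//host/path") URL inherits the scheme of the page it
# is resolved against, which is what makes it safe under the HTTPS switch.
flickr_api = '//api.flickr.com/services/rest/'

print(urljoin('https://commons.wikimedia.org/wiki/Special:UploadWizard', flickr_api))
# -> https://api.flickr.com/services/rest/
print(urljoin('http://commons.wikimedia.org/wiki/Special:UploadWizard', flickr_api))
# -> http://api.flickr.com/services/rest/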
[02:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [02:53:27] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [02:53:27] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [02:53:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Aug 30 02:53:46 UTC 2013 [02:53:52] Logged the message, Master [02:56:27] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [02:56:27] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [02:59:27] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:01:27] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [03:02:27] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [03:03:27] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [03:03:27] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [03:08:07] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [03:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:28] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 303 seconds [03:23:28] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 303 seconds [03:24:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [03:33:00] (03PS1) 10Yurik: Added Orange Madagascar carrier 646-02 [operations/puppet] - 10https://gerrit.wikimedia.org/r/81892 [03:33:27] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds [03:33:28] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay -0 seconds [03:52:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [04:07:10] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [04:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:23:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [04:25:10] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [04:36:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:30] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 349 seconds [04:36:31] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 350 seconds [04:37:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [04:38:31] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 77 seconds [04:38:31] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK 
replication delay 35 seconds [04:52:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:02] (03PS2) 10Reedy: Using protocol-relative URL for Flickr API [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81890 (owner: 10Kaldari) [04:53:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [05:00:08] Reedy: should we sync it? [05:00:15] Probably [05:00:16] i'm in favor [05:00:20] want me to do it? [05:00:23] I was going to, but got distracted [05:00:30] i'll do it [05:00:35] thanks [05:00:40] (03CR) 10Ori.livneh: [C: 032] Using protocol-relative URL for Flickr API [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81890 (owner: 10Kaldari) [05:00:49] (03Merged) 10jenkins-bot: Using protocol-relative URL for Flickr API [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81890 (owner: 10Kaldari) [05:03:54] !log olivneh synchronized wmf-config/CommonSettings.php 'Make Flickr API URL protocol-relative for compatibility with HTTPS' [05:04:00] Logged the message, Master [05:12:22] huh [05:12:42] just the issue, not that you sync'd it [05:21:35] not that odd [05:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [05:23:49] yeah, just curious [05:25:09] i'd have asked if i knew you were around, btw :) [05:25:26] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 346 seconds [05:25:37] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 349 seconds [05:26:32] ori-l: The eye of Sa..... 
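The recurring MySQL Replication Heartbeat alerts on db43 and db58 above, including the slightly negative "-0"/"-1 seconds" readings once a slave catches up, are heartbeat-based lag checks: a row on the master is stamped with the current time, and the slave's delay is "now minus the last replicated timestamp", with tiny negative values just reflecting rounding and clock skew. A minimal sketch of that calculation, assuming a pt-heartbeat-style setup; the thresholds and timestamps below are assumptions, not the production check configuration.

from datetime import datetime, timezone

def replication_delay(last_heartbeat_ts, now=None):
    """Both arguments are timezone-aware datetimes; returns seconds of lag."""
    now = now or datetime.now(timezone.utc)
    return (now - last_heartbeat_ts).total_seconds()

def check(delay, warn=180, crit=300):
    # warn/crit values are assumed; the log only shows CRIT firing above ~300s
    if delay >= crit:
        return 'CRIT replication delay %d seconds' % delay
    if delay >= warn:
        return 'WARN replication delay %d seconds' % delay
    return 'OK replication delay %d seconds' % delay

hb = datetime(2013, 8, 30, 0, 2, 8, tzinfo=timezone.utc)    # last heartbeat replicated to the slave
now = datetime(2013, 8, 30, 0, 7, 28, tzinfo=timezone.utc)  # time of the check
print(check(replication_delay(hb, now)))   # -> CRIT replication delay 320 seconds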
[05:26:54] I mean, yeah, I was doing some house moving stuff (getting moving quotes, looking up utility info, etc) [05:28:26] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [05:28:36] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [05:32:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:36:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.922 second response time [05:44:50] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [05:44:50] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [05:51:37] https://commons.wikimedia.org/w/index.php?title=Commons%3AAdministrators%27_noticeboard&diff=102962667&oldid=102962303 [05:51:58] seems to have fixed it [05:52:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:53:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [06:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [06:36:33] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 328 seconds [06:36:34] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 330 seconds [06:41:33] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 35 seconds [06:41:34] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay 6 seconds [06:50:03] RECOVERY - check_disk on db1008 is OK: DISK OK - free space: / 4654 MB (65% inode=83%): [07:13:29] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 306 seconds [07:13:38] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 308 seconds [07:16:29] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds [07:16:38] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay -1 seconds [07:30:09] (03CR) 10TTO: "Why did you try to merge this without fulfilling the dependency?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80321 (owner: 10Danny B.) [07:41:17] (03CR) 10Akosiaris: [C: 032] require nrpe package before collecting [operations/puppet] - 10https://gerrit.wikimedia.org/r/81676 (owner: 10Akosiaris) [07:43:32] (03PS6) 10TTO: Continuing to clean up InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78637 [07:46:29] (03PS1) 10Akosiaris: Replace position-of-the-moon [operations/puppet] - 10https://gerrit.wikimedia.org/r/81898 [07:46:43] (03CR) 10TTO: "Sorry Reedy, I shouldn't sound so accusing... I guess it was probably jenkins-bot's fault for not being a bit smarter :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80321 (owner: 10Danny B.) [07:48:44] (03PS4) 10TTO: skwiktionary: Set site logo to local file [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80321 (owner: 10Danny B.) [07:49:40] (03PS2) 10TTO: Adjust reupload-own permissions for ckbwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80546 [08:25:34] morning [08:27:15] morning [08:28:14] mogge [08:43:04] I am going to restart jenkins .. 
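The flapping "Puppetmaster HTTPS on stafford" check in this log is a plain HTTPS probe with a 10-second socket timeout that treats a matched status line (a 400, per the "Status line output matched 400" recoveries) as success. A rough sketch of that kind of check; the hostname, port and request path are assumptions rather than the exact production check command.

import http.client
import socket
import ssl

def check_puppetmaster(host, port=8140, expect=400, timeout=10):
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # the puppet CA is not in the system trust store
    ctx.verify_mode = ssl.CERT_NONE
    try:
        conn = http.client.HTTPSConnection(host, port, timeout=timeout, context=ctx)
        conn.request('GET', '/')
        resp = conn.getresponse()
        body = resp.read()
    except (socket.timeout, OSError) as exc:
        return 'CRITICAL - %s' % exc
    if resp.status == expect:
        return 'HTTP OK: Status line output matched %d - %d bytes' % (expect, len(body))
    return 'HTTP CRITICAL: got status %d, expected %d' % (resp.status, expect)

print(check_puppetmaster('stafford.pmtpa.wmnet'))   # hostname is an assumption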
[08:46:10] !log restarting Jenkins for plugins upgrade [08:46:16] Logged the message, Master [09:24:24] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [09:36:35] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 302 seconds [09:36:36] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 304 seconds [09:38:35] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [09:38:36] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [09:41:35] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [09:41:55] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay -0 seconds [10:34:28] !log Moving esams text squid backend traffic from pmtpa to eqiad [10:34:34] Logged the message, Master [10:35:17] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [10:35:24] how? [10:35:32] site's broker [10:35:35] broken [10:35:36] revert [10:35:50] mark: [10:37:14] this was what ryan did yesterday and broke the site again [10:37:14] oh yeah, private ips [10:37:18] the eqiad group is all private IPs [10:37:19] yep [10:37:43] let's just move to varnish [10:37:43] so [10:37:50] Ryan forgot to commit yesterday [10:38:00] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [10:38:00] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [10:38:00] so your commit included the new API squidsn [10:38:08] and your revert reverted those too :) [10:38:11] i know [10:38:14] k [10:38:18] don't care [10:38:19] should I? [10:38:48] unless you're migrating to varnish today :) [10:39:07] we had exactly one API squid and that failed [10:39:12] so API in europe went down [10:39:43] first attempt was to switch to eqiad, when that failed he just designated two random squids as API squids [10:39:43] who cares about the api [10:39:51] which is a fine workaround I think :) [10:40:11] apparently users that came to IRC at 02:30 their time to complain :P [10:46:55] !log reverted [10:47:01] Logged the message, Master [10:48:55] (03PS2) 10TTO: (bug 52997) $wgCategoryCollation to 'uca-ru' on all Russian-language [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79770 (owner: 10Andrey Kiselev) [10:51:52] (03PS3) 10TTO: (bug 52997) $wgCategoryCollation to 'uca-ru' on all Russian-language [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79770 (owner: 10Andrey Kiselev) [10:59:57] so I can't migrate to varnish yet [11:00:14] because there are least 2 of those mediawiki bugs left setting cookies in cacheable responses [11:13:53] (03PS2) 10Akosiaris: Replace position-of-the-moon [operations/puppet] - 10https://gerrit.wikimedia.org/r/81898 [11:16:57] sniff [11:17:04] my awesome position of the moon hack [11:17:21] mark: :-D [11:17:28] are you sure the inline template evaluates on every single catalog generation and isn't cached? :) [11:18:44] good question... gimme a sec.. we will know soon enough. I do expect that it is cached though. But does it matter in this case ? [11:19:25] yes [11:19:43] the point is to update twice a day, if it gets cached either it'll be less or more than that [11:20:36] and why twice a day ? 
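On the "twice a day" question just above: with a puppet run every 30 minutes, a random gate evaluated on each catalog compilation fires about twice a day when it passes with probability 2/48. The exact expression in the position-of-the-moon manifest is not quoted in the log, so the rand(48) < 2 gate below is only an assumption about the general idea; the sketch just sanity-checks the arithmetic.

import random

RUNS_PER_DAY = 48          # one puppet run every 30 minutes

def refreshes_per_day():
    # count how many of the day's runs pass the assumed random gate
    return sum(1 for _ in range(RUNS_PER_DAY) if random.randrange(48) < 2)

days = 10000
avg = sum(refreshes_per_day() for _ in range(days)) / days
print('expected ~2.0 refreshes/day, simulated %.2f' % avg)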
[11:20:41] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_ERR 1 pgs inconsistent: 1 scrub errors [11:20:41] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_ERR 1 pgs inconsistent: 1 scrub errors [11:21:11] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_ERR 1 pgs inconsistent: 1 scrub errors [11:21:29] dunno? :) [11:22:07] instructing pg 3.3772 on osd.104 to repair [11:22:38] ahaha... so maybe it's twice a day on average for all machines ? which I think will still hold true [11:22:41] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 308 seconds [11:22:42] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 310 seconds [11:23:01] right now it's twice a day because it's a random number from 0-12 [11:23:17] or not, wait [11:23:23] I had that figured out at one point :) [11:23:27] puppet runs every 30' [11:23:33] so it's rand 0-47 I think [11:23:41] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [11:23:42] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -0 seconds [11:26:01] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [11:28:41] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK [11:28:42] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK [11:29:11] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK [11:29:37] it's just so it's not delaying each and every puppet run, right [11:29:40] it doesn't really matter [11:31:00] btw... why are we collecting ssh keys on the entire fleet ? I would have expected just the bastion hosts [11:31:18] because we used to login across hosts a lot more than we do now [11:31:57] and: "why not?" [11:32:00] deployments still happen via SSH from deployment hosts [11:32:03] because puppet sucks, that's why [11:32:19] anyway, it was a reasonable compromise [11:33:56] ok.. makes sense... [11:40:58] (03PS1) 10Hashar: tweak memcached limit on beta (89GB -> 15GB) [operations/puppet] - 10https://gerrit.wikimedia.org/r/81905 [11:42:26] (03CR) 10Hashar: "Hey Asher, that is meant to limit the max memory usage of memcached on the beta cluster (instances have 16GB of memory)." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81905 (owner: 10Hashar) [11:43:16] how about using facter variables to discover that? :) [11:45:44] (03PS1) 10Hashar: contint: python-sphinx package [operations/puppet] - 10https://gerrit.wikimedia.org/r/81906 [11:46:17] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [11:50:55] (03CR) 10Andrew Bogott: [C: 031] "This looks fine to me. I'm not sure if labsdebrepo should be included in this patch as well, or if that should be done on a per-instance " [operations/puppet] - 10https://gerrit.wikimedia.org/r/78002 (owner: 10Yuvipanda) [11:51:58] (03CR) 10Andrew Bogott: [C: 031] Route requests based on data from Redis [operations/puppet] - 10https://gerrit.wikimedia.org/r/78025 (owner: 10Yuvipanda) [11:52:54] hello, i wanna help wikimedia, how can i contribute ? I mean in the tech field. [11:53:44] uhm [11:53:47] you can make ceph bug free [11:54:07] hello mark [11:55:09] mark: it's not possible to help with the sysadmin operations ? [11:55:27] well, we have a free git repo of our puppet manifests [11:55:37] you can always suggest improvements and submit patchsets for it [11:55:52] and in Wikimedia Labs they can always use help [11:57:30] mark: yes i see. You mean wikitech right ? 
[11:57:43] yes, it's called wikitech.wikimedia.org these days [12:19:26] cortexA9: and labs user lives in #wikimedia-labs :] [12:19:51] thanks hashar :) [12:20:24] cortexA9: we got an infrastructure to let people create instances, they can in turn have puppet classes applied to them [12:20:49] cortexA9: so eventually you could create a whole new piece of infrastructure on the instances, and if it is ever good enough and suitable for production, have it deployed :-] [12:21:01] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [12:24:47] hashar: not many people in wikimedia-labs, they don't use irc ? [12:27:01] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [12:27:31] cortexA9: wrong timezoning most likely [12:27:39] also you are heading into the weekend [12:27:51] hello p858snake|l [12:27:58] hehe maybe :) [12:34:01] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [12:37:38] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [12:40:13] cortexA9: the european volunteers will start to show up in 3 - 4 hours [12:40:31] the west coast WMF employees should be showing up during this hour [12:40:34] err [12:40:35] east coast [12:40:49] west coast (San Francisco), they would be there in 5-6 hours [12:41:17] oh ok so in about 1 hour :) [12:42:21] different timezones :) [12:43:18] The Wikimedia cycle. [12:46:39] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [12:47:38] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [12:50:38] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [12:50:38] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [12:50:38] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [12:53:38] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [12:53:38] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [12:56:38] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [12:56:38] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [12:59:38] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:01:39] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [13:02:39] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [13:03:38] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [13:03:38] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [13:08:10] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [13:20:50] PROBLEM - Disk space on wtp1014 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=77%): [13:21:46] paravoid: mark: could one of you please merge in a change to the apache contint website https://gerrit.wikimedia.org/r/#/c/71968/ ? [13:22:42] I need to make some git repositories on gallium accessible to Jenkins slaves. 
The change above tweak the apache conf to have the git repo available internally under http://integration.wikimedia.org/zuul/git/ [13:22:50] PROBLEM - Parsoid on wtp1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:02] why do you use hashes instead of selectors all the time? :) [13:24:26] I use role::cache::configuration as a start :D [13:25:38] (03CR) 10Mark Bergsma: "(4 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/71968 (owner: 10Hashar) [13:26:50] RECOVERY - Disk space on wtp1014 is OK: DISK OK [13:27:29] thx ) [13:31:58] mark: what the '@zuul_git_dir' syntax is for ? Is that a shortcut for scope.lookupvar() ? [13:32:16] you need scope.lookupvar only if you want to refer to a var from another scope [13:33:23] (03CR) 10Hashar: "(5 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/71968 (owner: 10Hashar) [13:34:02] (03PS15) 10Hashar: contint: publish Zuul git repositories [operations/puppet] - 10https://gerrit.wikimedia.org/r/71968 [13:34:24] mark ^^^^ should be nicer now [13:35:48] (03CR) 10Mark Bergsma: [C: 032] contint: publish Zuul git repositories [operations/puppet] - 10https://gerrit.wikimedia.org/r/71968 (owner: 10Hashar) [13:38:22] will probably get the jenkins slave in prod next week :) [13:48:27] (03CR) 10Akosiaris: [C: 032] contint: python-sphinx package [operations/puppet] - 10https://gerrit.wikimedia.org/r/81906 (owner: 10Hashar) [13:51:46] hmm [13:51:47] that works [13:51:58] though the apache deny rules are not working hehe [13:56:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [14:08:06] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:36] PROBLEM - Disk space on wtp1009 is CRITICAL: DISK CRITICAL - free space: / 300 MB (3% inode=77%): [14:10:37] PROBLEM - Disk space on wtp1007 is CRITICAL: DISK CRITICAL - free space: / 289 MB (3% inode=77%): [14:10:42] (03PS1) 10Hashar: contint: tweak Zuul git apache rule [operations/puppet] - 10https://gerrit.wikimedia.org/r/81930 [14:11:43] mark: i had to tweak the allow/deny directive to prevent the ScriptAlias from being reachable publicly https://gerrit.wikimedia.org/r/81930 [14:11:44] sorry :( [14:12:02] what, you didn't test in labs?!? [14:12:10] I did [14:12:14] but the test was wrong :] [14:12:26] I think I used the instance-proxy.wmflabs hack [14:12:27] that's gonna be my response from now on [14:12:33] so the request came from 10.0.0./8 probably [14:12:39] "the test was wrong" [14:12:45] "just logging in apparently wasn't enough!" 
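The failure mode described above, where the allow/deny rule looked fine in labs because the test request arrived from 10.0.0.0/8 and so matched the "internal" allow, is easy to reason about with a quick address check: an ACL that allows internal ranges can only be exercised from an external vantage point. A small sketch; the internal range comes from the log, the sample client addresses are illustrative.

import ipaddress

INTERNAL = [ipaddress.ip_network('10.0.0.0/8')]

def is_internal(addr):
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in INTERNAL)

for client in ('10.4.1.23',        # hypothetical labs instance behind the proxy
               '198.51.100.7'):    # stand-in for an external client
    print(client, 'allowed' if is_internal(client) else 'denied')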
[14:13:10] (03CR) 10Mark Bergsma: [C: 032] contint: tweak Zuul git apache rule [operations/puppet] - 10https://gerrit.wikimedia.org/r/81930 (owner: 10Hashar) [14:13:17] labs is nice [14:13:38] sometimes [14:13:42] that saves me a bunch of time to figure out puppet oddities [14:15:36] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [14:15:36] PROBLEM - Disk space on wtp1013 is CRITICAL: DISK CRITICAL - free space: / 240 MB (2% inode=77%): [14:15:46] PROBLEM - Parsoid on wtp1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:36] PROBLEM - Disk space on wtp1012 is CRITICAL: DISK CRITICAL - free space: / 262 MB (2% inode=77%): [14:19:37] RECOVERY - Disk space on wtp1013 is OK: DISK OK [14:21:16] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [14:21:37] PROBLEM - Disk space on wtp1011 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=77%): [14:26:06] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [14:26:36] PROBLEM - Disk space on wtp1016 is CRITICAL: DISK CRITICAL - free space: / 53 MB (0% inode=77%): [14:26:46] PROBLEM - Parsoid on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:16] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.010 second response time [14:27:36] RECOVERY - Disk space on wtp1012 is OK: DISK OK [14:29:46] PROBLEM - Parsoid on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:36] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [14:32:37] RECOVERY - Disk space on wtp1009 is OK: DISK OK [14:33:36] PROBLEM - Disk space on wtp1015 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=77%): [14:34:36] PROBLEM - Parsoid on wtp1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:37:25] !log jenkins: lanthanum slave now has 8 executors and labels hasSlaveScripts hasContintPackages productionSlaves [14:37:31] Logged the message, Master [14:38:57] RECOVERY - Disk space on wtp1011 is OK: DISK OK [14:41:27] RECOVERY - Disk space on wtp1015 is OK: DISK OK [14:42:27] RECOVERY - Disk space on wtp1007 is OK: DISK OK [14:54:17] RECOVERY - Disk space on wtp1016 is OK: DISK OK [15:17:05] jenkins's broken, yo. https://gerrit.wikimedia.org/r/#/c/81938/ [15:24:23] MatmaRex: bah [15:25:27] !log Jenkins: removing labels hasSlaveScripts hasContintPackages productionSlaves from lanthanum slaves, it caught some jobs [15:25:33] Logged the message, Master [15:25:40] MatmaRex: my fault sorry [15:30:32] (03PS1) 10Jgreen: fix cron for otrs (hopefully) [operations/puppet] - 10https://gerrit.wikimedia.org/r/81939 [15:37:01] stupid gerrit [15:37:11] 7 minutes and counting for a piddly review [15:37:28] yuck [15:37:38] luckily all my puppet stuff today is puppetmaster::self heh [15:38:06] one of these days I will go postal on gerrit and forkbomb the server [15:38:13] (03CR) 10Jgreen: [C: 032 V: 031] fix cron for otrs (hopefully) [operations/puppet] - 10https://gerrit.wikimedia.org/r/81939 (owner: 10Jgreen) [15:39:04] garg and to top it off my edits were wrong in ways gerrit would never detect :-( -- is it friday yet? [15:41:02] it is! [15:41:23] (03PS1) 10Jgreen: oops, cron.d not crontab. fixed [operations/puppet] - 10https://gerrit.wikimedia.org/r/81941 [15:42:12] (03CR) 10Jgreen: [C: 032 V: 031] oops, cron.d not crontab. 
fixed [operations/puppet] - 10https://gerrit.wikimedia.org/r/81941 (owner: 10Jgreen) [15:44:59] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [15:44:59] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [16:10:48] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [16:13:58] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [16:29:58] (03CR) 10Dr0ptp4kt: [C: 031] Added Orange Madagascar carrier 646-02 [operations/puppet] - 10https://gerrit.wikimedia.org/r/81892 (owner: 10Yurik) [16:30:33] paravoid, hi, any way to deploy ^ [16:32:01] (03PS1) 10coren: Tool Labs: make webservers proper submit hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/81947 [16:33:53] (03CR) 10coren: [C: 032] "Self +2 FTW" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81947 (owner: 10coren) [16:34:40] what's going on with netmapper? [16:34:56] it's been months now [16:38:03] "netmapper"? [16:39:54] a replacement for all that [16:40:00] yurik: ^^ [16:42:54] yurik: you going to use your deploy window next week for Zero (on wednesday)? [16:45:56] figures [16:47:04] (03PS1) 10Dzahn: promote manybubbles from admins::restricted to mortals pending approval RT #5691 [operations/puppet] - 10https://gerrit.wikimedia.org/r/81953 [16:50:55] (03CR) 10Reedy: "DING! GG." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81953 (owner: 10Dzahn) [17:57:26] PROBLEM - SSH on sq36 is CRITICAL: Server answer: [17:57:38] ugh, that's the one again [17:58:46] PROBLEM - Backend Squid HTTP on sq36 is CRITICAL: Connection timed out [17:58:51] http://ganglia.wikimedia.org/latest/?c=Text%20squids%20pmtpa&h=sq36.wikimedia.org&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [17:59:23] tries to restart it again [18:00:06] !log powercycling sq36 [18:00:11] Logged the message, Master [18:02:27] RECOVERY - SSH on sq36 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:04:36] PROBLEM - Frontend Squid HTTP on sq36 is CRITICAL: Connection refused [18:06:17] (03CR) 10Ryan Lane: [C: 032] Add redis lua library to labsproxy [operations/puppet] - 10https://gerrit.wikimedia.org/r/78002 (owner: 10Yuvipanda) [18:07:05] Ryan_Lane: there are three more patches :) [18:07:54] (03CR) 10Ryan Lane: [C: 032] Route requests based on data from Redis [operations/puppet] - 10https://gerrit.wikimedia.org/r/78025 (owner: 10Yuvipanda) [18:09:37] it's a very yuvi christmas [18:09:58] 2 more to merge :D [18:10:16] what format is the maxmind db in? [18:10:22] is it easily diffable or not? [18:10:48] yurik: heya, ya'll going to use your Zero deploy next wed? [18:11:03] greg-g, not sure yet [18:11:06] k [18:11:37] greg-g, CSV [18:11:56] Ryan_Lane: great, now two more patchsets to go :P [18:12:29] MaxSem: I thought you were kidding, but you're right (I just found this page: https://www.maxmind.com/en/country) [18:12:37] ;) [18:12:43] so yeah, diffable [18:12:53] * greg-g moves along [18:14:16] paravoid, sorry, missed your comment (restarted IRC) - it is still in testing stage [18:15:12] YuviPanda: what in the world does this do? 
https://gerrit.wikimedia.org/r/#/c/80201/2/modules/labsproxy/files/proxy.conf,unified [18:15:34] Ryan_Lane: websocket support? [18:15:43] I... should perhaps add comments :P [18:15:57] where are those variables set? [18:16:09] Ryan_Lane: $http_upgrade is set by nginx [18:16:14] Ryan_Lane: $connection_upgrade is set in that map [18:16:25] $backend and $vhost are set from lua [18:28:19] * YuviPanda pokes Ryan_Lane with a ^ [18:28:33] * Ryan_Lane nods [18:29:17] (03PS3) 10Ryan Lane: Add appropriate support for websocket proxying [operations/puppet] - 10https://gerrit.wikimedia.org/r/80201 (owner: 10Yuvipanda) [18:32:30] (03PS2) 10Yuvipanda: Remove useless proxy_redirect directive [operations/puppet] - 10https://gerrit.wikimedia.org/r/80203 [18:54:04] (03CR) 10Dzahn: [C: 032] "well, if you don't need it .." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81489 (owner: 10Hashar) [18:59:50] (03CR) 10Ryan Lane: [C: 032] Add appropriate support for websocket proxying [operations/puppet] - 10https://gerrit.wikimedia.org/r/80201 (owner: 10Yuvipanda) [18:59:59] sweet, now just one more [19:00:24] (03CR) 10Ryan Lane: [C: 032] Remove useless proxy_redirect directive [operations/puppet] - 10https://gerrit.wikimedia.org/r/80203 (owner: 10Yuvipanda) [19:00:34] \o/ ty, Ryan_Lane [19:00:35] any left? [19:00:36] yw [19:00:41] Ryan_Lane: no, all done for now. [19:00:58] great :) [19:01:09] Ryan_Lane: I'll clean out all the current instances (they use self), and setup a new one [19:01:16] Ryan_Lane: should hopefully be able to do the APi over the weekend :) [19:02:53] cool :) [19:03:00] thanks for all this awesome work! [19:03:15] :D :D [19:17:38] PROBLEM - RAID on db1031 is CRITICAL: CRITICAL: Degraded [19:24:58] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [19:44:14] (03CR) 10Ryan Lane: [C: 032] Allow grains to be set with a single value [operations/puppet] - 10https://gerrit.wikimedia.org/r/81372 (owner: 10Ryan Lane) [20:22:42] !log csteipp synchronized php-1.22wmf15/extensions/LiquidThreads 'bug53320' [20:22:48] Logged the message, Master [20:35:24] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [20:38:02] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [20:38:02] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [21:15:02] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 773 MB (10% inode=65%): [21:20:12] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 717 MB (10% inode=65%): [21:22:06] (03PS1) 10QChris: Fix double encoded characters in gitweb -> gitblit forwards [operations/puppet] - 10https://gerrit.wikimedia.org/r/82044 [21:23:01] !log restarting lucene search indexers on searchidx2 and searchidx1001 [21:23:06] Logged the message, Master [21:25:12] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 667 MB (9% inode=65%): [21:26:52] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [21:27:26] !log dumping/importing viwikivoyage and tyvwiki lucene search indices [21:27:32] Logged the message, Master [21:30:02] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 615 MB (8% inode=65%): [21:35:12] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 565 MB (7% inode=65%): [21:40:11] RECOVERY - check_disk on db1025 is OK: DISK OK - 
free space: / 3819 MB (53% inode=65%): [21:43:50] (03PS1) 10Yuvipanda: Add role to toollabs for generic web proxy [operations/puppet] - 10https://gerrit.wikimedia.org/r/82047 [21:43:54] Coren: ^ [21:44:06] Coren: however, it needs a custom deb, so not sure how to put that on toollabs. [21:44:59] YuviPanda: Trivially. /data/project/.system/deb is a local repo for exactly that. [21:45:05] !log temp. disabling search1015 in pybal [21:45:09] Coren: yeah, so I just cp the deb there? [21:45:10] Logged the message, Master [21:45:30] Coren: andrewbogott_afk built that deb, I should ask him for the source as well. [21:47:01] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [21:48:09] !log on sodium: Class[Backup::Host] is already defined ... [21:48:15] Logged the message, Master [22:00:14] (03Abandoned) 10Ori.livneh: Modify access rules [operations/debs/StatsD] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/82051 (owner: 10Ori.livneh) [22:02:16] mutante / Coren, could one of you +2 https://gerrit.wikimedia.org/r/#/c/82053/ ? i followed the instructions on Wikitech for creating a new deb package and realized a split-second too late that the repository creation command template specified ldap/ops as the owner -- so I'm basically locked out. [22:04:13] ori-l: :-) [22:04:20] (03CR) 10Ryan Lane: [C: 032] Modify access rules [operations/debs/StatsD] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/82053 (owner: 10Ori.livneh) [22:04:32] thanks [22:04:41] (03CR) 10Ryan Lane: [V: 032] Modify access rules [operations/debs/StatsD] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/82053 (owner: 10Ori.livneh) [22:04:43] * ori-l feels like a dork [22:04:52] (03CR) 10coren: "Yeah." [operations/debs/StatsD] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/82053 (owner: 10Ori.livneh) [22:04:55] * YuviPanda gives ori-l a hat. [22:05:18] Ryan beat me to it. [22:05:53] thanks guys [22:06:19] yw [22:19:59] !log after re-pooling search1015 a while ago, now temp. 
disable search1016 while indices are rebuilt for new wikis [22:20:05] Logged the message, Master [22:21:38] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [22:27:38] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:38] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [22:37:56] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [22:46:56] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [22:47:56] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [22:50:56] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [22:50:56] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [22:50:56] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [22:53:56] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [22:53:56] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [22:56:56] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [22:56:56] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [22:59:56] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [23:01:56] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [23:02:56] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [23:03:56] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [23:03:56] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [23:08:56] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours
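The wall of "Puppet freshness" criticals that closes the log is a staleness check: each host reports the time of its last successful puppet run, and anything older than ten hours alerts. A minimal sketch of that logic; the in-memory dict and its timestamps stand in for whatever store the real check queries and are made up for illustration.

from datetime import datetime, timedelta, timezone

THRESHOLD = timedelta(hours=10)

def freshness(last_success, now=None):
    now = now or datetime.now(timezone.utc)
    age = now - last_success
    if age > THRESHOLD:
        return 'CRITICAL: No successful Puppet run in the last 10 hours'
    return 'OK: last successful run %d minutes ago' % (age.total_seconds() // 60)

now = datetime(2013, 8, 30, 23, 8, 56, tzinfo=timezone.utc)
last_runs = {
    'ssl2': datetime(2013, 8, 30, 1, 0, tzinfo=timezone.utc),    # stale, will alert
    'sq37': datetime(2013, 8, 30, 22, 40, tzinfo=timezone.utc),  # recent, will pass
}
for host, ts in sorted(last_runs.items()):
    print(host, freshness(ts, now))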