[00:07:28] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 319 seconds [00:07:28] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 320 seconds [00:08:37] PROBLEM - Backend Squid HTTP on sq36 is CRITICAL: Connection timed out [00:10:27] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [00:10:28] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [00:11:44] hey opsen, who's around? [00:11:58] we're getting reports of the api being down again, same error (appears to be to me) [00:12:01] from europeans [00:12:03] see -tech [00:13:14] mutante, Ryan_Lane^^^ [00:13:30] paravoid: ^^ [00:13:36] meh [00:13:37] fun times [00:13:42] so, I think it may be prudent to move esams to eqiad [00:13:46] yeah agreed [00:13:55] on it [00:14:20] I mean, if the point was to keep the caches warm, that point is moot now [00:14:22] wow. my internet connection sucks [00:14:35] yep [00:14:37] we can keep the one cache warm, that doesn't do us much good :) [00:14:59] heh [00:15:03] well, we're keeping other ones warm too [00:15:11] but the API being down to one node is a bad idea :) [00:15:38] it does seem to work for Parsoid [00:15:53] parsoid? esams? [00:16:05] or pmtpa? [00:16:11] no, eqiad [00:16:20] eqiad is what's working :) [00:16:21] pmtpa is not [00:16:24] yeah, issue is with pmtpa and by extension esams [00:16:27] PROBLEM - SSH on sq36 is CRITICAL: Server answer: [00:16:34] ah, k- nm [00:17:10] !log switching esams from using pmtpa to using eqiad for carp [00:17:16] Logged the message, Master [00:18:22] !log rebooting sq36 [00:18:27] Logged the message, Master [00:18:43] wow you're much faster than I would be at this time [00:18:55] well, I had done it earlier and just didn't deploy ;) [00:19:32] it took me a bit to track it down [00:19:32] since it's generated from a weird config file and php [00:19:50] heh, yeah [00:19:53] but it's under git now [00:19:54] not rcs [00:19:57] that's something! [00:21:08] heh [00:21:08] yep [00:21:08] I'm not sure who did that, but I was happy about it :) [00:21:26] (I did) [00:21:27] RECOVERY - SSH on sq36 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [00:21:38] ah. cool [00:21:43] after my fourth change or so, I couldn't stand it anymore :) [00:21:51] hm. maybe it didn't switch properly [00:21:51] * Ryan_Lane checks the generated files [00:22:09] damn it [00:22:12] that didn't seem to work [00:22:14] oh [00:22:15] right [00:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [00:23:30] hm. that's not correct either [00:23:30] this is such a pain in the ass [00:24:00] ah [00:24:17] can't switch upload-settings to eqiad [00:24:30] upload shouldn't have to do with api? [00:24:38] *anything to do [00:24:46] it doesn't, but I was trying to see why pmtpa hosts are still showing up in the generated files [00:25:58] this reminds me of a pirate with a steering wheel attached to his pants [00:26:12] "arrrr, it's driving me nuts" [00:26:13] root@fenari:/h/w/conf/squid# grep -r sq36 generated/*/*ams* [00:26:28] doesn't look like matching anything but amssq36 [00:26:28] *rimshot* [00:26:31] did you deploy? 
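A note on the grep just above: searching the generated squid configs for "sq36" also matches "amssq36", so a less ambiguous spot-check is to look for the backend's IP address, as suggested just below. A minimal sketch of that kind of search; the generated/ path follows the fenari prompt above, and the default IP is the example quoted below, so treat both as illustrative rather than the exact procedure used.

import os
import sys

def find_ip(root, ip):
    """Walk a config tree and report every line that mentions the given IP."""
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'r', errors='replace') as handle:
                    for lineno, line in enumerate(handle, 1):
                        if ip in line:
                            hits.append((path, lineno, line.strip()))
            except OSError:
                pass   # unreadable file, skip it
    return hits

if __name__ == '__main__':
    ip = sys.argv[1] if len(sys.argv) > 1 else '208.80.152.96'
    for path, lineno, line in find_ip('generated', ip):
        print('%s:%d: %s' % (path, lineno, line))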
[00:26:35] you need to match on IP [00:26:40] ah, right [00:26:42] 208.80.152.96, for instance [00:27:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:27:35] it's too bad mark isn't here. he'd be able to do this in seconds [00:29:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [00:29:27] ah [00:29:27] RECOVERY - Backend Squid HTTP on sq36 is OK: HTTP OK: HTTP/1.0 200 OK - 1249 bytes in 0.115 second response time [00:29:34] ugh, I see errors on normal pageviews, too [00:29:51] I think it's necessary to actually change the 'carp_weights' array [00:29:53] shit, site's down [00:29:57] main page too [00:30:19] GET http://commons.wikimedia.org/wiki/File:Dirty_white_pseudomembrane_classically_seen_in_diptheria_2013-07-06_11-07.jpg, from 91.198.174.47 via amssq40.esams.wikimedia.org (squid/2.7.STABLE9) [00:30:29] I'm going to revert my change [00:31:01] 2013/08/30 00:18:14| TCP connection to 10.64.0.137 (10.64.0.137:3128) failed [00:31:04] 2013/08/30 00:18:14| TCP connection to 10.64.0.137 (10.64.0.137:3128) failed [00:31:06] yeah [00:31:07] esams [00:31:13] definitely my change was incorrect [00:31:20] heh [00:31:28] pushing [00:31:33] k [00:31:40] I can't wait till we're only on varnish [00:31:52] this generated config is relatively confusing [00:32:04] paravoid: back up? [00:32:12] well to be fair, pushing a config to all those servers via puppet would take you about half an hour :P [00:32:15] yes [00:32:27] to this deploy thing is much better in this aspect at least :) [00:32:38] oh, yeah. I wouldn't use puppet :) [00:32:53] paravoid, sudo puppetd -tv via salt?:) [00:33:01] riiiiiight [00:33:14] since puppet is so much faster when being hammered by clients [00:33:15] ewww [00:33:18] and stafford never complains [00:33:27] I'd use something more like git deploy [00:33:27] or just use git deploy itself [00:33:37] though it would need to work in esams, too [00:33:48] have fun with the erbs :) [00:33:54] ok, looking at the config again :) [00:34:08] erbs? you mean for varnish? [00:34:13] yeah [00:34:17] yeah, that's problematic [00:34:39] will our current method continue to work in the future, though? [00:34:47] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [00:34:58] ok. looking again for how to do this properly [00:35:35] the bad thing is, I think I did it properly [00:35:40] we may not be able to switch this [00:35:45] the bad thing is, all cp10xx are intenral [00:35:49] eqiad's squids are all on 10. [00:35:50] yeah [00:36:09] ok, then we need to add more API squids [00:36:19] wait, why would POSTs go to one squid? [00:36:29] because there's just one squid left in the pool... [00:36:41] POSTs generally go to the same location anyway [00:36:43] due to CARP [00:36:54] but in this situation all requests are only going to a single node [00:36:58] because there's only one node :) [00:37:14] ok. I'm going to designate a couple squids as API [00:37:24] yep I was about to suggest that [00:37:39] 37 and 59? [00:37:44] how do they look in ganglia... 
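The CARP behaviour discussed above is the crux: cache peers are chosen by a weighted hash over the request URL, so a given URL (and any POST to it) consistently lands on the same backend, and a pool with a single member sends absolutely everything to that one box. Below is a simplified sketch of CARP-style selection, not squid's exact hash function; the lone-pool name is hypothetical, while sq37/sq59 are the squids being designated as API above.

import hashlib

def carp_pick(url, members):
    """members: dict of backend name -> weight; highest combined hash score wins."""
    def score(name, weight):
        digest = hashlib.md5((name + url).encode()).hexdigest()
        return weight * int(digest, 16)
    return max(members, key=lambda name: score(name, members[name]))

lone_pool = {'api-squid.pmtpa': 1.0}                 # hypothetical single remaining API backend
wider_pool = {'api-squid.pmtpa': 1.0,
              'sq37.wikimedia.org': 1.0,             # the two squids designated
              'sq59.wikimedia.org': 1.0}             # as API squids above

url = 'http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo'
print(carp_pick(url, lone_pool))    # always the lone member, whatever the URL
print(carp_pick(url, wider_pool))   # choice now spread across the pool by URL hash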
[00:37:47] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [00:37:47] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [00:37:48] paravoid: that would explain heh [00:37:48] they look fine [00:38:04] I wouldn't see why CARP alone would be the problem unless they were all posts to the same url [00:38:05] paravoid: you're in vim :) [00:38:14] sorry [00:38:15] go ahead [00:38:17] no worries [00:39:13] added 37 and 59 [00:39:13] deploying [00:39:37] I hate taking the site down [00:39:42] at least it wasn't dns this time [00:40:22] at least it's just esams at 1:30am WEST/2:30am CEST/3:30am EEST [00:40:32] Working from home today and getting lots of 502 Bad Gateway errors [00:40:48] wait, in the US? [00:40:52] yes [00:41:07] (we were dealing with esams issues) [00:41:11] what kind of 502 errors? [00:41:14] does it say anything else? [00:41:41] not that I remember. They were very minimal error messages [00:41:42] ok, that should do it [00:42:01] can you get headers next time? [00:42:18] sure [00:42:23] thanks [00:43:03] we're not aware of any issues state-side and haven't heard it from anyone else, so it'd really help [00:43:30] boy this is about to get so much more fun with ulsfo [00:44:11] yeah [00:44:18] looks fine now [00:44:26] sq37 barely noticed [00:44:47] so far :) [00:47:13] speaking of squid, how does the process of its death fare?:) [00:47:56] progressing I think [00:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:54:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [01:03:35] Anyone else (EU) getting bits fail? [01:04:15] Meh, just fail on one connection [01:04:18] Disregard [01:20:04] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 687 MB (9% inode=65%): [01:22:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [01:25:14] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 584 MB (8% inode=65%): [01:25:54] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [01:26:43] Server version: 5.1.53-wm-log (mysql-at-facebook-r3753) [01:26:55] How many mysql boxen do we still have not upgraded to mariadb? [01:30:04] RECOVERY - check_disk on db1025 is OK: DISK OK - free space: / 3760 MB (52% inode=65%): [01:46:04] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [02:04:12] Reedy: about 50/50 mysql/mariadb overall. 40/60 on the s* clusters [02:05:23] Is there any near term plan to finish the migration? [02:06:15] 5.1.53 -> 5.5.32 is a pretty big jump, never mind the mariadb changes [02:07:33] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 301 seconds [02:08:33] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay -0 seconds [02:15:03] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 702 MB (9% inode=83%): [02:15:41] full migration is indeed on the near-term todo list [02:16:51] the remaining 5.1 slaves in the s* clusters, definitely. 
swithcing all masters needs more care [02:18:27] !log LocalisationUpdate completed (1.22wmf14) at Fri Aug 30 02:18:27 UTC 2013 [02:18:34] Logged the message, Master [02:20:04] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 625 MB (8% inode=83%): [02:20:13] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [02:22:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:24:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [02:25:03] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 549 MB (7% inode=83%): [02:26:13] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [02:30:03] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 474 MB (6% inode=83%): [02:33:13] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [02:35:03] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 400 MB (5% inode=83%): [02:35:45] Jeff_Green: ^ if you haven't seen already... [02:36:53] (03PS1) 10Kaldari: Using protocol-relative URL for Flickr API [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81890 [02:37:13] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [02:38:01] !log LocalisationUpdate completed (1.22wmf15) at Fri Aug 30 02:38:00 UTC 2013 [02:38:07] Logged the message, Master [02:39:04] Looks like the HTTPS switch broke Flickr importing on Commons. Anyone want to +2 the fix: https://gerrit.wikimedia.org/r/#/c/81890/ [02:40:07] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 321 MB (4% inode=83%): [02:41:38] (03CR) 10Yuvipanda: [C: 031] Using protocol-relative URL for Flickr API [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81890 (owner: 10Kaldari) [02:41:50] (rather useless, i guess) [02:41:51] Reedy: ^ [02:42:17] YuviPanda: Thanks anyway. Every little bit counts... well sort of. [02:42:37] I forgot I didn't have +2 there, and... gave a +1 anyway because I was already there [02:45:07] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 242 MB (3% inode=83%): [02:46:27] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [02:47:27] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [02:50:07] PROBLEM - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 167 MB (2% inode=83%): [02:50:27] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [02:50:27] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [02:50:27] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [02:51:06] ACKNOWLEDGEMENT - check_disk on db1008 is CRITICAL: DISK CRITICAL - free space: / 167 MB (2% inode=83%): Matt Walker Ive stopped the job that was filling the error logs until tomorrow. 
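The Flickr fix under review above swaps a hard-coded http:// API URL for a protocol-relative one, so the request scheme follows the page the user is on and HTTPS pages stop making plain-HTTP calls. A small illustration of how a network-path reference resolves; the endpoint path below is illustrative, not necessarily the exact URL in the config.

from urllib.parse import urljoin

# A protocol-relative ("//host/path") URL inherits the scheme of the page it
# is resolved against, which is what makes it safe under the HTTPS switch.
flickr_api = '//api.flickr.com/services/rest/'

print(urljoin('https://commons.wikimedia.org/wiki/Special:UploadWizard', flickr_api))
# -> https://api.flickr.com/services/rest/
print(urljoin('http://commons.wikimedia.org/wiki/Special:UploadWizard', flickr_api))
# -> http://api.flickr.com/services/rest/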
[02:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [02:53:27] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [02:53:27] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [02:53:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Aug 30 02:53:46 UTC 2013 [02:53:52] Logged the message, Master [02:56:27] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [02:56:27] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [02:59:27] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [03:01:27] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [03:02:27] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [03:03:27] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [03:03:27] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [03:08:07] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [03:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:28] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 303 seconds [03:23:28] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 303 seconds [03:24:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [03:33:00] (03PS1) 10Yurik: Added Orange Madagascar carrier 646-02 [operations/puppet] - 10https://gerrit.wikimedia.org/r/81892 [03:33:27] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds [03:33:28] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay -0 seconds [03:52:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:53:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [04:07:10] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [04:22:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:23:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [04:25:10] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [04:36:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:36:30] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 349 seconds [04:36:31] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 350 seconds [04:37:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [04:38:31] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 77 seconds [04:38:31] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK 
replication delay 35 seconds [04:52:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:02] (03PS2) 10Reedy: Using protocol-relative URL for Flickr API [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81890 (owner: 10Kaldari) [04:53:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [05:00:08] Reedy: should we sync it? [05:00:15] Probably [05:00:16] i'm in favor [05:00:20] want me to do it? [05:00:23] I was going to, but got distracted [05:00:30] i'll do it [05:00:35] thanks [05:00:40] (03CR) 10Ori.livneh: [C: 032] Using protocol-relative URL for Flickr API [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81890 (owner: 10Kaldari) [05:00:49] (03Merged) 10jenkins-bot: Using protocol-relative URL for Flickr API [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/81890 (owner: 10Kaldari) [05:03:54] !log olivneh synchronized wmf-config/CommonSettings.php 'Make Flickr API URL protocol-relative for compatibility with HTTPS' [05:04:00] Logged the message, Master [05:12:22] huh [05:12:42] just the issue, not that you sync'd it [05:21:35] not that odd [05:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [05:23:49] yeah, just curious [05:25:09] i'd have asked if i knew you were around, btw :) [05:25:26] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 346 seconds [05:25:37] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 349 seconds [05:26:32] ori-l: The eye of Sa..... 
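The recurring MySQL Replication Heartbeat alerts on db43 and db58 above, including the slightly negative "-0"/"-1 seconds" readings once a slave catches up, are heartbeat-based lag checks: a row on the master is stamped with the current time, and the slave's delay is "now minus the last replicated timestamp", with tiny negative values just reflecting rounding and clock skew. A minimal sketch of that calculation, assuming a pt-heartbeat-style setup; the thresholds and timestamps below are assumptions, not the production check configuration.

from datetime import datetime, timezone

def replication_delay(last_heartbeat_ts, now=None):
    """Both arguments are timezone-aware datetimes; returns seconds of lag."""
    now = now or datetime.now(timezone.utc)
    return (now - last_heartbeat_ts).total_seconds()

def check(delay, warn=180, crit=300):
    # warn/crit values are assumed; the log only shows CRIT firing above ~300s
    if delay >= crit:
        return 'CRIT replication delay %d seconds' % delay
    if delay >= warn:
        return 'WARN replication delay %d seconds' % delay
    return 'OK replication delay %d seconds' % delay

hb = datetime(2013, 8, 30, 0, 2, 8, tzinfo=timezone.utc)    # last heartbeat replicated to the slave
now = datetime(2013, 8, 30, 0, 7, 28, tzinfo=timezone.utc)  # time of the check
print(check(replication_delay(hb, now)))   # -> CRIT replication delay 320 seconds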
[05:26:54] I mean, yeah, I was doing some house moving stuff (getting moving quotes, looking up utility info, etc) [05:28:26] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [05:28:36] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [05:32:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:36:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.922 second response time [05:44:50] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [05:44:50] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [05:51:37] https://commons.wikimedia.org/w/index.php?title=Commons%3AAdministrators%27_noticeboard&diff=102962667&oldid=102962303 [05:51:58] seems to have fixed it [05:52:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:53:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [06:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [06:36:33] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 328 seconds [06:36:34] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 330 seconds [06:41:33] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 35 seconds [06:41:34] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay 6 seconds [06:50:03] RECOVERY - check_disk on db1008 is OK: DISK OK - free space: / 4654 MB (65% inode=83%): [07:13:29] PROBLEM - MySQL Slave Delay on db58 is CRITICAL: CRIT replication delay 306 seconds [07:13:38] PROBLEM - MySQL Replication Heartbeat on db58 is CRITICAL: CRIT replication delay 308 seconds [07:16:29] RECOVERY - MySQL Slave Delay on db58 is OK: OK replication delay 0 seconds [07:16:38] RECOVERY - MySQL Replication Heartbeat on db58 is OK: OK replication delay -1 seconds [07:30:09] (03CR) 10TTO: "Why did you try to merge this without fulfilling the dependency?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80321 (owner: 10Danny B.) [07:41:17] (03CR) 10Akosiaris: [C: 032] require nrpe package before collecting [operations/puppet] - 10https://gerrit.wikimedia.org/r/81676 (owner: 10Akosiaris) [07:43:32] (03PS6) 10TTO: Continuing to clean up InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78637 [07:46:29] (03PS1) 10Akosiaris: Replace position-of-the-moon [operations/puppet] - 10https://gerrit.wikimedia.org/r/81898 [07:46:43] (03CR) 10TTO: "Sorry Reedy, I shouldn't sound so accusing... I guess it was probably jenkins-bot's fault for not being a bit smarter :)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80321 (owner: 10Danny B.) [07:48:44] (03PS4) 10TTO: skwiktionary: Set site logo to local file [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80321 (owner: 10Danny B.) [07:49:40] (03PS2) 10TTO: Adjust reupload-own permissions for ckbwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/80546 [08:25:34] morning [08:27:15] morning [08:28:14] mogge [08:43:04] I am going to restart jenkins .. 
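The flapping "Puppetmaster HTTPS on stafford" check in this log is a plain HTTPS probe with a 10-second socket timeout that treats a matched status line (a 400, per the "Status line output matched 400" recoveries) as success. A rough sketch of that kind of check; the hostname, port and request path are assumptions rather than the exact production check command.

import http.client
import socket
import ssl

def check_puppetmaster(host, port=8140, expect=400, timeout=10):
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # the puppet CA is not in the system trust store
    ctx.verify_mode = ssl.CERT_NONE
    try:
        conn = http.client.HTTPSConnection(host, port, timeout=timeout, context=ctx)
        conn.request('GET', '/')
        resp = conn.getresponse()
        body = resp.read()
    except (socket.timeout, OSError) as exc:
        return 'CRITICAL - %s' % exc
    if resp.status == expect:
        return 'HTTP OK: Status line output matched %d - %d bytes' % (expect, len(body))
    return 'HTTP CRITICAL: got status %d, expected %d' % (resp.status, expect)

print(check_puppetmaster('stafford.pmtpa.wmnet'))   # hostname is an assumption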
[08:46:10] !log restarting Jenkins for plugins upgrade [08:46:16] Logged the message, Master [09:24:24] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [09:36:35] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 302 seconds [09:36:36] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 304 seconds [09:38:35] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [09:38:36] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -1 seconds [09:41:35] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [09:41:55] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay -0 seconds [10:34:28] !log Moving esams text squid backend traffic from pmtpa to eqiad [10:34:34] Logged the message, Master [10:35:17] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [10:35:24] how? [10:35:32] site's broker [10:35:35] broken [10:35:36] revert [10:35:50] mark: [10:37:14] this was what ryan did yesterday and broke the site again [10:37:14] oh yeah, private ips [10:37:18] the eqiad group is all private IPs [10:37:19] yep [10:37:43] let's just move to varnish [10:37:43] so [10:37:50] Ryan forgot to commit yesterday [10:38:00] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [10:38:00] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [10:38:00] so your commit included the new API squidsn [10:38:08] and your revert reverted those too :) [10:38:11] i know [10:38:14] k [10:38:18] don't care [10:38:19] should I? [10:38:48] unless you're migrating to varnish today :) [10:39:07] we had exactly one API squid and that failed [10:39:12] so API in europe went down [10:39:43] first attempt was to switch to eqiad, when that failed he just designated two random squids as API squids [10:39:43] who cares about the api [10:39:51] which is a fine workaround I think :) [10:40:11] apparently users that came to IRC at 02:30 their time to complain :P [10:46:55] !log reverted [10:47:01] Logged the message, Master [10:48:55] (03PS2) 10TTO: (bug 52997) $wgCategoryCollation to 'uca-ru' on all Russian-language [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79770 (owner: 10Andrey Kiselev) [10:51:52] (03PS3) 10TTO: (bug 52997) $wgCategoryCollation to 'uca-ru' on all Russian-language [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79770 (owner: 10Andrey Kiselev) [10:59:57] so I can't migrate to varnish yet [11:00:14] because there are least 2 of those mediawiki bugs left setting cookies in cacheable responses [11:13:53] (03PS2) 10Akosiaris: Replace position-of-the-moon [operations/puppet] - 10https://gerrit.wikimedia.org/r/81898 [11:16:57] sniff [11:17:04] my awesome position of the moon hack [11:17:21] mark: :-D [11:17:28] are you sure the inline template evaluates on every single catalog generation and isn't cached? :) [11:18:44] good question... gimme a sec.. we will know soon enough. I do expect that it is cached though. But does it matter in this case ? [11:19:25] yes [11:19:43] the point is to update twice a day, if it gets cached either it'll be less or more than that [11:20:36] and why twice a day ? 
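On the "twice a day" question just above: with a puppet run every 30 minutes, a random gate evaluated on each catalog compilation fires about twice a day when it passes with probability 2/48. The exact expression in the position-of-the-moon manifest is not quoted in the log, so the rand(48) < 2 gate below is only an assumption about the general idea; the sketch just sanity-checks the arithmetic.

import random

RUNS_PER_DAY = 48          # one puppet run every 30 minutes

def refreshes_per_day():
    # count how many of the day's runs pass the assumed random gate
    return sum(1 for _ in range(RUNS_PER_DAY) if random.randrange(48) < 2)

days = 10000
avg = sum(refreshes_per_day() for _ in range(days)) / days
print('expected ~2.0 refreshes/day, simulated %.2f' % avg)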
[11:20:41] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_ERR 1 pgs inconsistent: 1 scrub errors [11:20:41] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_ERR 1 pgs inconsistent: 1 scrub errors [11:21:11] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_ERR 1 pgs inconsistent: 1 scrub errors [11:21:29] dunno? :) [11:22:07] instructing pg 3.3772 on osd.104 to repair [11:22:38] ahaha... so maybe it's twice a day on average for all machines ? which I think will still hold true [11:22:41] PROBLEM - MySQL Slave Delay on db43 is CRITICAL: CRIT replication delay 308 seconds [11:22:42] PROBLEM - MySQL Replication Heartbeat on db43 is CRITICAL: CRIT replication delay 310 seconds [11:23:01] right now it's twice a day because it's a random number from 0-12 [11:23:17] or not, wait [11:23:23] I had that figured out at one point :) [11:23:27] puppet runs every 30' [11:23:33] so it's rand 0-47 I think [11:23:41] RECOVERY - MySQL Slave Delay on db43 is OK: OK replication delay 0 seconds [11:23:42] RECOVERY - MySQL Replication Heartbeat on db43 is OK: OK replication delay -0 seconds [11:26:01] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [11:28:41] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK [11:28:42] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK [11:29:11] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK [11:29:37] it's just so it's not delaying each and every puppet run, right [11:29:40] it doesn't really matter [11:31:00] btw... why are we collecting ssh keys on the entire fleet ? I would have expected just the bastion hosts [11:31:18] because we used to login across hosts a lot more than we do now [11:31:57] and: "why not?" [11:32:00] deployments still happen via SSH from deployment hosts [11:32:03] because puppet sucks, that's why [11:32:19] anyway, it was a reasonable compromise [11:33:56] ok.. makes sense... [11:40:58] (03PS1) 10Hashar: tweak memcached limit on beta (89GB -> 15GB) [operations/puppet] - 10https://gerrit.wikimedia.org/r/81905 [11:42:26] (03CR) 10Hashar: "Hey Asher, that is meant to limit the max memory usage of memcached on the beta cluster (instances have 16GB of memory)." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81905 (owner: 10Hashar) [11:43:16] how about using facter variables to discover that? :) [11:45:44] (03PS1) 10Hashar: contint: python-sphinx package [operations/puppet] - 10https://gerrit.wikimedia.org/r/81906 [11:46:17] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [11:50:55] (03CR) 10Andrew Bogott: [C: 031] "This looks fine to me. I'm not sure if labsdebrepo should be included in this patch as well, or if that should be done on a per-instance " [operations/puppet] - 10https://gerrit.wikimedia.org/r/78002 (owner: 10Yuvipanda) [11:51:58] (03CR) 10Andrew Bogott: [C: 031] Route requests based on data from Redis [operations/puppet] - 10https://gerrit.wikimedia.org/r/78025 (owner: 10Yuvipanda) [11:52:54] hello, i wanna help wikimedia, how can i contribute ? I mean in the tech field. [11:53:44] uhm [11:53:47] you can make ceph bug free [11:54:07] hello mark [11:55:09] mark: it's not possible to help with the sysadmin operations ? [11:55:27] well, we have a free git repo of our puppet manifests [11:55:37] you can always suggest improvements and submit patchsets for it [11:55:52] and in Wikimedia Labs they can always use help [11:57:30] mark: yes i see. You mean wikitech right ? 
[11:57:43] yes, it's called wikitech.wikimedia.org these days [12:19:26] cortexA9: and labs user lives in #wikimedia-labs :] [12:19:51] thanks hashar :) [12:20:24] cortexA9: we got an infrastructure to let people create instances, they can in turn have puppet classes applied to them [12:20:49] cortexA9: so eventually you could create a whole new piece of infrastructure on the instances, and if it is ever good enough and suitable for production, have it deployed :-] [12:21:01] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [12:24:47] hashar: not many people in wikimedia-labs, they don't use irc ? [12:27:01] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [12:27:31] cortexA9: wrong timezoning most likely [12:27:39] also you are heading into the weekend [12:27:51] hello p858snake|l [12:27:58] hehe maybe :) [12:34:01] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [12:37:38] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [12:40:13] cortexA9: the european volunteers will start to show up in 3 - 4 hours [12:40:31] the west coast WMF employees should be showing up during this hour [12:40:34] err [12:40:35] east coast [12:40:49] west coast (San Francisco), they would be there in 5-6 hours [12:41:17] oh ok so in about 1 hour :) [12:42:21] different timezones :) [12:43:18] The Wikimedia cycle. [12:46:39] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [12:47:38] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [12:50:38] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [12:50:38] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [12:50:38] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [12:53:38] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [12:53:38] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [12:56:38] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [12:56:38] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [12:59:38] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [13:01:39] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [13:02:39] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [13:03:38] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [13:03:38] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [13:08:10] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours [13:20:50] PROBLEM - Disk space on wtp1014 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=77%): [13:21:46] paravoid: mark: could one of you please merge in a change to the apache contint website https://gerrit.wikimedia.org/r/#/c/71968/ ? [13:22:42] I need to make some git repositories on gallium accessible to Jenkins slaves. 
The change above tweak the apache conf to have the git repo available internally under http://integration.wikimedia.org/zuul/git/ [13:22:50] PROBLEM - Parsoid on wtp1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:23:02] why do you use hashes instead of selectors all the time? :) [13:24:26] I use role::cache::configuration as a start :D [13:25:38] (03CR) 10Mark Bergsma: "(4 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/71968 (owner: 10Hashar) [13:26:50] RECOVERY - Disk space on wtp1014 is OK: DISK OK [13:27:29] thx ) [13:31:58] mark: what the '@zuul_git_dir' syntax is for ? Is that a shortcut for scope.lookupvar() ? [13:32:16] you need scope.lookupvar only if you want to refer to a var from another scope [13:33:23] (03CR) 10Hashar: "(5 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/71968 (owner: 10Hashar) [13:34:02] (03PS15) 10Hashar: contint: publish Zuul git repositories [operations/puppet] - 10https://gerrit.wikimedia.org/r/71968 [13:34:24] mark ^^^^ should be nicer now [13:35:48] (03CR) 10Mark Bergsma: [C: 032] contint: publish Zuul git repositories [operations/puppet] - 10https://gerrit.wikimedia.org/r/71968 (owner: 10Hashar) [13:38:22] will probably get the jenkins slave in prod next week :) [13:48:27] (03CR) 10Akosiaris: [C: 032] contint: python-sphinx package [operations/puppet] - 10https://gerrit.wikimedia.org/r/81906 (owner: 10Hashar) [13:51:46] hmm [13:51:47] that works [13:51:58] though the apache deny rules are not working hehe [13:56:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [14:08:06] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [14:10:36] PROBLEM - Disk space on wtp1009 is CRITICAL: DISK CRITICAL - free space: / 300 MB (3% inode=77%): [14:10:37] PROBLEM - Disk space on wtp1007 is CRITICAL: DISK CRITICAL - free space: / 289 MB (3% inode=77%): [14:10:42] (03PS1) 10Hashar: contint: tweak Zuul git apache rule [operations/puppet] - 10https://gerrit.wikimedia.org/r/81930 [14:11:43] mark: i had to tweak the allow/deny directive to prevent the ScriptAlias from being reachable publicly https://gerrit.wikimedia.org/r/81930 [14:11:44] sorry :( [14:12:02] what, you didn't test in labs?!? [14:12:10] I did [14:12:14] but the test was wrong :] [14:12:26] I think I used the instance-proxy.wmflabs hack [14:12:27] that's gonna be my response from now on [14:12:33] so the request came from 10.0.0./8 probably [14:12:39] "the test was wrong" [14:12:45] "just logging in apparently wasn't enough!" 
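The failure mode described above, where the allow/deny rule looked fine in labs because the test request arrived from 10.0.0.0/8 and so matched the "internal" allow, is easy to reason about with a quick address check: an ACL that allows internal ranges can only be exercised from an external vantage point. A small sketch; the internal range comes from the log, the sample client addresses are illustrative.

import ipaddress

INTERNAL = [ipaddress.ip_network('10.0.0.0/8')]

def is_internal(addr):
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in INTERNAL)

for client in ('10.4.1.23',        # hypothetical labs instance behind the proxy
               '198.51.100.7'):    # stand-in for an external client
    print(client, 'allowed' if is_internal(client) else 'denied')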
[14:13:10] (03CR) 10Mark Bergsma: [C: 032] contint: tweak Zuul git apache rule [operations/puppet] - 10https://gerrit.wikimedia.org/r/81930 (owner: 10Hashar) [14:13:17] labs is nice [14:13:38] sometimes [14:13:42] that saves me a bunch of time to figure out puppet oddities [14:15:36] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [14:15:36] PROBLEM - Disk space on wtp1013 is CRITICAL: DISK CRITICAL - free space: / 240 MB (2% inode=77%): [14:15:46] PROBLEM - Parsoid on wtp1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:36] PROBLEM - Disk space on wtp1012 is CRITICAL: DISK CRITICAL - free space: / 262 MB (2% inode=77%): [14:19:37] RECOVERY - Disk space on wtp1013 is OK: DISK OK [14:21:16] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [14:21:37] PROBLEM - Disk space on wtp1011 is CRITICAL: DISK CRITICAL - free space: / 284 MB (3% inode=77%): [14:26:06] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours [14:26:36] PROBLEM - Disk space on wtp1016 is CRITICAL: DISK CRITICAL - free space: / 53 MB (0% inode=77%): [14:26:46] PROBLEM - Parsoid on wtp1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:27:16] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.010 second response time [14:27:36] RECOVERY - Disk space on wtp1012 is OK: DISK OK [14:29:46] PROBLEM - Parsoid on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:36] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [14:32:37] RECOVERY - Disk space on wtp1009 is OK: DISK OK [14:33:36] PROBLEM - Disk space on wtp1015 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=77%): [14:34:36] PROBLEM - Parsoid on wtp1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:37:25] !log jenkins: lanthanum slave now has 8 executors and labels hasSlaveScripts hasContintPackages productionSlaves [14:37:31] Logged the message, Master [14:38:57] RECOVERY - Disk space on wtp1011 is OK: DISK OK [14:41:27] RECOVERY - Disk space on wtp1015 is OK: DISK OK [14:42:27] RECOVERY - Disk space on wtp1007 is OK: DISK OK [14:54:17] RECOVERY - Disk space on wtp1016 is OK: DISK OK [15:17:05] jenkins's broken, yo. https://gerrit.wikimedia.org/r/#/c/81938/ [15:24:23] MatmaRex: bah [15:25:27] !log Jenkins: removing labels hasSlaveScripts hasContintPackages productionSlaves from lanthanum slaves, it caught some jobs [15:25:33] Logged the message, Master [15:25:40] MatmaRex: my fault sorry [15:30:32] (03PS1) 10Jgreen: fix cron for otrs (hopefully) [operations/puppet] - 10https://gerrit.wikimedia.org/r/81939 [15:37:01] stupid gerrit [15:37:11] 7 minutes and counting for a piddly review [15:37:28] yuck [15:37:38] luckily all my puppet stuff today is puppetmaster::self heh [15:38:06] one of these days I will go postal on gerrit and forkbomb the server [15:38:13] (03CR) 10Jgreen: [C: 032 V: 031] fix cron for otrs (hopefully) [operations/puppet] - 10https://gerrit.wikimedia.org/r/81939 (owner: 10Jgreen) [15:39:04] garg and to top it off my edits were wrong in ways gerrit would never detect :-( -- is it friday yet? [15:41:02] it is! [15:41:23] (03PS1) 10Jgreen: oops, cron.d not crontab. fixed [operations/puppet] - 10https://gerrit.wikimedia.org/r/81941 [15:42:12] (03CR) 10Jgreen: [C: 032 V: 031] oops, cron.d not crontab. 
fixed [operations/puppet] - 10https://gerrit.wikimedia.org/r/81941 (owner: 10Jgreen) [15:44:59] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: No successful Puppet run in the last 10 hours [15:44:59] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours [16:10:48] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000 [16:13:58] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [16:29:58] (03CR) 10Dr0ptp4kt: [C: 031] Added Orange Madagascar carrier 646-02 [operations/puppet] - 10https://gerrit.wikimedia.org/r/81892 (owner: 10Yurik) [16:30:33] paravoid, hi, any way to deploy ^ [16:32:01] (03PS1) 10coren: Tool Labs: make webservers proper submit hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/81947 [16:33:53] (03CR) 10coren: [C: 032] "Self +2 FTW" [operations/puppet] - 10https://gerrit.wikimedia.org/r/81947 (owner: 10coren) [16:34:40] what's going on with netmapper? [16:34:56] it's been months now [16:38:03] "netmapper"? [16:39:54] a replacement for all that [16:40:00] yurik: ^^ [16:42:54] yurik: you going to use your deploy window next week for Zero (on wednesday)? [16:45:56] figures [16:47:04] (03PS1) 10Dzahn: promote manybubbles from admins::restricted to mortals pending approval RT #5691 [operations/puppet] - 10https://gerrit.wikimedia.org/r/81953 [16:50:55] (03CR) 10Reedy: "DING! GG." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81953 (owner: 10Dzahn) [17:57:26] PROBLEM - SSH on sq36 is CRITICAL: Server answer: [17:57:38] ugh, that's the one again [17:58:46] PROBLEM - Backend Squid HTTP on sq36 is CRITICAL: Connection timed out [17:58:51] http://ganglia.wikimedia.org/latest/?c=Text%20squids%20pmtpa&h=sq36.wikimedia.org&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [17:59:23] tries to restart it again [18:00:06] !log powercycling sq36 [18:00:11] Logged the message, Master [18:02:27] RECOVERY - SSH on sq36 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:04:36] PROBLEM - Frontend Squid HTTP on sq36 is CRITICAL: Connection refused [18:06:17] (03CR) 10Ryan Lane: [C: 032] Add redis lua library to labsproxy [operations/puppet] - 10https://gerrit.wikimedia.org/r/78002 (owner: 10Yuvipanda) [18:07:05] Ryan_Lane: there are three more patches :) [18:07:54] (03CR) 10Ryan Lane: [C: 032] Route requests based on data from Redis [operations/puppet] - 10https://gerrit.wikimedia.org/r/78025 (owner: 10Yuvipanda) [18:09:37] it's a very yuvi christmas [18:09:58] 2 more to merge :D [18:10:16] what format is the maxmind db in? [18:10:22] is it easily diffable or not? [18:10:48] yurik: heya, ya'll going to use your Zero deploy next wed? [18:11:03] greg-g, not sure yet [18:11:06] k [18:11:37] greg-g, CSV [18:11:56] Ryan_Lane: great, now two more patchsets to go :P [18:12:29] MaxSem: I thought you were kidding, but you're right (I just found this page: https://www.maxmind.com/en/country) [18:12:37] ;) [18:12:43] so yeah, diffable [18:12:53] * greg-g moves along [18:14:16] paravoid, sorry, missed your comment (restarted IRC) - it is still in testing stage [18:15:12] YuviPanda: what in the world does this do? 
https://gerrit.wikimedia.org/r/#/c/80201/2/modules/labsproxy/files/proxy.conf,unified [18:15:34] Ryan_Lane: websocket support? [18:15:43] I... should perhaps add comments :P [18:15:57] where are those variables set? [18:16:09] Ryan_Lane: $http_upgrade is set by nginx [18:16:14] Ryan_Lane: $connection_upgrade is set in that map [18:16:25] $backend and $vhost are set from lua [18:28:19] * YuviPanda pokes Ryan_Lane with a ^ [18:28:33] * Ryan_Lane nods [18:29:17] (03PS3) 10Ryan Lane: Add appropriate support for websocket proxying [operations/puppet] - 10https://gerrit.wikimedia.org/r/80201 (owner: 10Yuvipanda) [18:32:30] (03PS2) 10Yuvipanda: Remove useless proxy_redirect directive [operations/puppet] - 10https://gerrit.wikimedia.org/r/80203 [18:54:04] (03CR) 10Dzahn: [C: 032] "well, if you don't need it .." [operations/puppet] - 10https://gerrit.wikimedia.org/r/81489 (owner: 10Hashar) [18:59:50] (03CR) 10Ryan Lane: [C: 032] Add appropriate support for websocket proxying [operations/puppet] - 10https://gerrit.wikimedia.org/r/80201 (owner: 10Yuvipanda) [18:59:59] sweet, now just one more [19:00:24] (03CR) 10Ryan Lane: [C: 032] Remove useless proxy_redirect directive [operations/puppet] - 10https://gerrit.wikimedia.org/r/80203 (owner: 10Yuvipanda) [19:00:34] \o/ ty, Ryan_Lane [19:00:35] any left? [19:00:36] yw [19:00:41] Ryan_Lane: no, all done for now. [19:00:58] great :) [19:01:09] Ryan_Lane: I'll clean out all the current instances (they use self), and setup a new one [19:01:16] Ryan_Lane: should hopefully be able to do the APi over the weekend :) [19:02:53] cool :) [19:03:00] thanks for all this awesome work! [19:03:15] :D :D [19:17:38] PROBLEM - RAID on db1031 is CRITICAL: CRITICAL: Degraded [19:24:58] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours [19:44:14] (03CR) 10Ryan Lane: [C: 032] Allow grains to be set with a single value [operations/puppet] - 10https://gerrit.wikimedia.org/r/81372 (owner: 10Ryan Lane) [20:22:42] !log csteipp synchronized php-1.22wmf15/extensions/LiquidThreads 'bug53320' [20:22:48] Logged the message, Master [20:35:24] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [20:38:02] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [20:38:02] PROBLEM - Puppet freshness on analytics1011 is CRITICAL: No successful Puppet run in the last 10 hours [21:15:02] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 773 MB (10% inode=65%): [21:20:12] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 717 MB (10% inode=65%): [21:22:06] (03PS1) 10QChris: Fix double encoded characters in gitweb -> gitblit forwards [operations/puppet] - 10https://gerrit.wikimedia.org/r/82044 [21:23:01] !log restarting lucene search indexers on searchidx2 and searchidx1001 [21:23:06] Logged the message, Master [21:25:12] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 667 MB (9% inode=65%): [21:26:52] PROBLEM - Puppet freshness on virt0 is CRITICAL: No successful Puppet run in the last 10 hours [21:27:26] !log dumping/importing viwikivoyage and tyvwiki lucene search indices [21:27:32] Logged the message, Master [21:30:02] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 615 MB (8% inode=65%): [21:35:12] PROBLEM - check_disk on db1025 is CRITICAL: DISK CRITICAL - free space: / 565 MB (7% inode=65%): [21:40:11] RECOVERY - check_disk on db1025 is OK: DISK OK - 
free space: / 3819 MB (53% inode=65%): [21:43:50] (03PS1) 10Yuvipanda: Add role to toollabs for generic web proxy [operations/puppet] - 10https://gerrit.wikimedia.org/r/82047 [21:43:54] Coren: ^ [21:44:06] Coren: however, it needs a custom deb, so not sure how to put that on toollabs. [21:44:59] YuviPanda: Trivially. /data/project/.system/deb is a local repo for exactly that. [21:45:05] !log temp. disabling search1015 in pybal [21:45:09] Coren: yeah, so I just cp the deb there? [21:45:10] Logged the message, Master [21:45:30] Coren: andrewbogott_afk built that deb, I should ask him for the source as well. [21:47:01] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [21:48:09] !log on sodium: Class[Backup::Host] is already defined ... [21:48:15] Logged the message, Master [22:00:14] (03Abandoned) 10Ori.livneh: Modify access rules [operations/debs/StatsD] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/82051 (owner: 10Ori.livneh) [22:02:16] mutante / Coren, could one of you +2 https://gerrit.wikimedia.org/r/#/c/82053/ ? i followed the instructions on Wikitech for creating a new deb package and realized a split-second too late that the repository creation command template specified ldap/ops as the owner -- so I'm basically locked out. [22:04:13] ori-l: :-) [22:04:20] (03CR) 10Ryan Lane: [C: 032] Modify access rules [operations/debs/StatsD] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/82053 (owner: 10Ori.livneh) [22:04:32] thanks [22:04:41] (03CR) 10Ryan Lane: [V: 032] Modify access rules [operations/debs/StatsD] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/82053 (owner: 10Ori.livneh) [22:04:43] * ori-l feels like a dork [22:04:52] (03CR) 10coren: "Yeah." [operations/debs/StatsD] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/82053 (owner: 10Ori.livneh) [22:04:55] * YuviPanda gives ori-l a hat. [22:05:18] Ryan beat me to it. [22:05:53] thanks guys [22:06:19] yw [22:19:59] !log after re-pooling search1015 a while ago, now temp. 
disable search1016 while indices are rebuilt for new wikis [22:20:05] Logged the message, Master [22:21:38] PROBLEM - Puppet freshness on ssl1 is CRITICAL: No successful Puppet run in the last 10 hours [22:27:38] PROBLEM - Puppet freshness on ssl1006 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:38] PROBLEM - Puppet freshness on ssl1008 is CRITICAL: No successful Puppet run in the last 10 hours [22:37:56] PROBLEM - Puppet freshness on cp1044 is CRITICAL: No successful Puppet run in the last 10 hours [22:46:56] PROBLEM - Puppet freshness on ssl1001 is CRITICAL: No successful Puppet run in the last 10 hours [22:47:56] PROBLEM - Puppet freshness on amssq47 is CRITICAL: No successful Puppet run in the last 10 hours [22:50:56] PROBLEM - Puppet freshness on ssl1003 is CRITICAL: No successful Puppet run in the last 10 hours [22:50:56] PROBLEM - Puppet freshness on ssl1005 is CRITICAL: No successful Puppet run in the last 10 hours [22:50:56] PROBLEM - Puppet freshness on ssl4 is CRITICAL: No successful Puppet run in the last 10 hours [22:53:56] PROBLEM - Puppet freshness on cp1043 is CRITICAL: No successful Puppet run in the last 10 hours [22:53:56] PROBLEM - Puppet freshness on ssl1007 is CRITICAL: No successful Puppet run in the last 10 hours [22:56:56] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: No successful Puppet run in the last 10 hours [22:56:56] PROBLEM - Puppet freshness on ssl1002 is CRITICAL: No successful Puppet run in the last 10 hours [22:59:56] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [23:01:56] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: No successful Puppet run in the last 10 hours [23:02:56] PROBLEM - Puppet freshness on ssl1009 is CRITICAL: No successful Puppet run in the last 10 hours [23:03:56] PROBLEM - Puppet freshness on ssl3 is CRITICAL: No successful Puppet run in the last 10 hours [23:03:56] PROBLEM - Puppet freshness on ssl3002 is CRITICAL: No successful Puppet run in the last 10 hours [23:08:56] PROBLEM - Puppet freshness on ssl2 is CRITICAL: No successful Puppet run in the last 10 hours
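The wall of "Puppet freshness" criticals that closes the log is a staleness check: each host reports the time of its last successful puppet run, and anything older than ten hours alerts. A minimal sketch of that logic; the in-memory dict and its timestamps stand in for whatever store the real check queries and are made up for illustration.

from datetime import datetime, timedelta, timezone

THRESHOLD = timedelta(hours=10)

def freshness(last_success, now=None):
    now = now or datetime.now(timezone.utc)
    age = now - last_success
    if age > THRESHOLD:
        return 'CRITICAL: No successful Puppet run in the last 10 hours'
    return 'OK: last successful run %d minutes ago' % (age.total_seconds() // 60)

now = datetime(2013, 8, 30, 23, 8, 56, tzinfo=timezone.utc)
last_runs = {
    'ssl2': datetime(2013, 8, 30, 1, 0, tzinfo=timezone.utc),    # stale, will alert
    'sq37': datetime(2013, 8, 30, 22, 40, tzinfo=timezone.utc),  # recent, will pass
}
for host, ts in sorted(last_runs.items()):
    print(host, freshness(ts, now))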