[00:02:19] (PS2) Faidon Liambotis: Replace Linux RPS setting with a smart mechanism [operations/puppet] - https://gerrit.wikimedia.org/r/95963
[00:22:06] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours
[00:25:06] PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours
[00:33:48] (PS1) Cmjohnson: adding site.pp entry for elastic1001-12 [operations/puppet] - https://gerrit.wikimedia.org/r/95969
[00:34:21] (CR) jenkins-bot: [V: -1] adding site.pp entry for elastic1001-12 [operations/puppet] - https://gerrit.wikimedia.org/r/95969 (owner: Cmjohnson)
[00:36:09] (PS2) Cmjohnson: adding site.pp entry for elastic1001-12 [operations/puppet] - https://gerrit.wikimedia.org/r/95969
[00:37:52] (CR) Cmjohnson: [C: 2] adding site.pp entry for elastic1001-12 [operations/puppet] - https://gerrit.wikimedia.org/r/95969 (owner: Cmjohnson)
[00:37:53] (PS3) Faidon Liambotis: Replace Linux RPS setting with a smart mechanism [operations/puppet] - https://gerrit.wikimedia.org/r/95963
[00:51:55] (CR) Springle: "Given that the job queries are not all equal runtime is dependent on shard size and number of wikis, Nemo's approach seems potentially mor" [operations/puppet] - https://gerrit.wikimedia.org/r/95876 (owner: MaxSem)
[01:12:06] PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours
[01:23:18] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Miscellaneous+pmtpa&h=hume.wikimedia.org&jr=&js=&v=355970&m=Global_JobQueue_length
[01:23:21] wtf
[01:24:16] springle: you know db47 has a failed LD, right?
[01:24:40] paravoid: yep
[01:24:59] ok
[01:25:25] also, do you know I depooled db1050 back on Nov 8th?
[01:25:41] some LVM issue, possibly kernel bug
[01:25:41] db47 is about to be rotated out for decomm
[01:25:44] it's still depooled
[01:25:48] yes
[01:25:51] okay
[01:25:54] just making sure :)
[01:26:09] i let it catch up, but snapshot slaves are al currently (or practically) depooled
[01:26:16] lvm is ok there now
[01:26:30] how did you fix it?
[01:26:42] snapshots will become "the new tampa slaves" where we do stuff like dumps and slow queries
[01:27:03] what do you mean by "snapshots"?
[01:28:12] didn't say i fixed lvm, just that it's ok :) by the time I go to it it had taken new snapshots. they all mounted correctly etc, so i let mysql go and catch up
[01:29:08] snapshots == snapshot slaves
[01:29:11] sorry
[01:36:37] RECOVERY - DPKG on stafford is OK: All packages OK
[01:46:57] (PS1) Faidon Liambotis: search: disable the lucene_search icinga check [operations/puppet] - https://gerrit.wikimedia.org/r/95971
[01:46:58] (PS1) Faidon Liambotis: base: add check_disk check exception for Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/95972
[01:48:19] (CR) Faidon Liambotis: [C: 2] search: disable the lucene_search icinga check [operations/puppet] - https://gerrit.wikimedia.org/r/95971 (owner: Faidon Liambotis)
[01:48:27] (CR) Faidon Liambotis: [C: 2] base: add check_disk check exception for Varnish [operations/puppet] - https://gerrit.wikimedia.org/r/95972 (owner: Faidon Liambotis)
[01:53:06] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours
[01:54:08] PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours
[01:56:19] (PS1) Springle: db69 to s2 master db71 to s3 master depool db57 for decom depool db66 for shipping [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95973
[01:57:25] (CR) Springle: [C: 2] db69 to s2 master db71 to s3 master depool db57 for decom depool db66 for shipping [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95973 (owner: Springle)
[01:58:17] RECOVERY - Disk space on cp1064 is OK: DISK OK
[01:58:33] lol
[02:01:17] RECOVERY - Disk space on cp1059 is OK: DISK OK
[02:03:06] RECOVERY - Disk space on cp1047 is OK: DISK OK
[02:04:20] (CR) Faidon Liambotis: [C: -1] contint: migrate firewall rules to ferm (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[02:05:17] RECOVERY - Disk space on cp1061 is OK: DISK OK
[02:07:11] hmm.. modified stuff on tin /a/common, but only eof newlines removed
[02:07:57] !log LocalisationUpdate completed (1.23wmf3) at Mon Nov 18 02:07:57 UTC 2013
[02:08:14] Logged the message, Master
[02:08:27] oh er no.. actual differences. drat
[02:08:36] RECOVERY - Disk space on cp1050 is OK: DISK OK
[02:08:37] RECOVERY - Disk space on cp1046 is OK: DISK OK
[02:10:16] RECOVERY - Disk space on cp1063 is OK: DISK OK
[02:12:01] what's the best approach dealing with unstaged changes on tin /a/common?
[02:12:08] paravoid: ^ ?
[02:12:35] I'm not sure
[02:12:55] I'm sure TimStarling knows better than I do
[02:15:51] !log LocalisationUpdate completed (1.23wmf4) at Mon Nov 18 02:15:51 UTC 2013
[02:16:06] Logged the message, Master
[02:16:27] springle: maybe stash and re-apply
[02:18:33] (CR) Faidon Liambotis: [C: -1] Setting up varnishkafka on mobile varnish caches (5 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/94169 (owner: Ottomata)
[02:20:06] Aaron|home: that's safe enough? not too dangerous for any half-done background jobs i don't know about...
[02:20:18] i know the merge would work
[02:21:23] * springle tries it
[02:22:58] !log springle synchronized wmf-config/db-pmtpa.php 'pmtpa replication reconfig for decomissioning'
[02:23:11] Logged the message, Master
[02:29:16] RECOVERY - Disk space on cp1048 is OK: DISK OK
[02:32:06] RECOVERY - Disk space on cp1060 is OK: DISK OK
[02:32:31] (CR) Faidon Liambotis: [C: -1] "A quick search shows there's also a mobile version of msnbot, so we should make a generic solution for this." [operations/puppet] - https://gerrit.wikimedia.org/r/95532 (owner: Dr0ptp4kt)
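[Annotation: the stash-and-reapply workflow Aaron suggests above is easy to script. A minimal sketch in Python, assuming a working tree like tin's /a/common; the path and the --ff-only pull are illustrative, not how the deployment tooling actually updated the checkout.]

```python
#!/usr/bin/env python
"""Sketch of the 'stash and re-apply' approach for unstaged changes.

The checkout path is hypothetical; conflict handling is left to the
operator, since `git stash pop` stops and reports any conflicts.
"""
import subprocess

CHECKOUT = '/a/common'  # hypothetical path of the working tree

def git(*args):
    return subprocess.check_output(('git',) + args, cwd=CHECKOUT)

if git('status', '--porcelain').strip():
    git('stash')                # park the unstaged local edits
    git('pull', '--ff-only')    # bring the branch up to date
    git('stash', 'pop')         # re-apply; conflicts surface here
```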
[02:34:25] (PS1) Springle: correct s3 groupLoadsBySection key [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95974
[02:34:58] (CR) Springle: [C: 2] correct s3 groupLoadsBySection key [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95974 (owner: Springle)
[02:35:06] RECOVERY - Disk space on cp1045 is OK: DISK OK
[02:35:51] (CR) Faidon Liambotis: [C: 2] contint: jenkins git config core.packedGitLimit=2G [operations/puppet] - https://gerrit.wikimedia.org/r/95123 (owner: Hashar)
[02:36:06] !log springle synchronized wmf-config/db-eqiad.php 'correct s3 groupLoadsBySection key'
[02:36:18] Logged the message, Master
[02:38:00] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Nov 18 02:37:59 UTC 2013
[02:38:14] Logged the message, Master
[02:39:45] (PS4) Faidon Liambotis: Replace Linux RPS setting with a smarter script [operations/puppet] - https://gerrit.wikimedia.org/r/95963
[02:40:06] RECOVERY - Disk space on cp1051 is OK: DISK OK
[02:42:13] (PS2) Faidon Liambotis: Fix duplicate ensure in swift.pp [operations/puppet] - https://gerrit.wikimedia.org/r/95638 (owner: Akosiaris)
[02:42:27] RECOVERY - Disk space on cp1049 is OK: DISK OK
[02:42:30] (Abandoned) Faidon Liambotis: Fix duplicate ensure in swift.pp [operations/puppet] - https://gerrit.wikimedia.org/r/95638 (owner: Akosiaris)
[02:45:22] (CR) Faidon Liambotis: [C: 2] "Fair enough." [operations/puppet] - https://gerrit.wikimedia.org/r/95090 (owner: Mwalker)
[02:48:06] RECOVERY - Disk space on cp1058 is OK: DISK OK
[02:49:04] !log for index cardinality on nullable fields, testing innodb_stats_method=nulls_ignored on db1049, db1002, db1003, db1004, db1026, db1040, db1041 + reanalyzing tag_summary and change_tag
[02:49:15] Logged the message, Master
[02:50:06] RECOVERY - Disk space on cp1062 is OK: DISK OK
[03:00:17] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer:
[03:02:17] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[03:20:46] !log adding mw1045 mw1070 mw1085 mw1165 back to the apache pool; no evidence (RT tickets, pybal comments) as why they are not pooled
[03:21:00] Logged the message, Master
[03:23:06] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours
[03:26:06] PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours
[04:13:06] PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours
[04:54:06] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours
[04:55:06] PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours
[05:54:59] (CR) Jeremyb: "duplicate port (one is wrong?)" (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[05:55:14] (CR) Jeremyb: [C: -1] contint: migrate firewall rules to ferm [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[05:55:36] PROBLEM - Host ms-be9 is DOWN: PING CRITICAL - Packet loss = 100%
[06:00:22] Hey. So... do you need me to post the error message or do you already know there's an issue?
[06:00:29] ottomata ^
[06:00:46] Request: POST http://en.wikipedia.org/w/index.php?title=List_of_tallest_structures_in_the_world&action=submit, from 208.80.154.136 via cp1055 frontend ([10.2.2.25]:80), Varnish XID 3386140178
[06:00:47] Forwarded for: 108.133.51.103, 208.80.154.136
[06:00:48] Error: 503, Service Unavailable at Mon, 18 Nov 2013 05:59:05 GMT
[06:01:06] Request: POST http://en.wikipedia.org/w/index.php?title=List_of_tallest_structures_in_the_world&action=submit, from 208.80.154.136 via cp1067 frontend ([10.2.2.25]:80), Varnish XID 2827401770
[06:01:08] Forwarded for: 108.133.51.103, 208.80.154.136
[06:01:09] Error: 503, Service Unavailable at Mon, 18 Nov 2013 06:00:29 GMT
[06:05:06] RECOVERY - Host ms-be9 is UP: PING OK - Packet loss = 0%, RTA = 35.46 ms
[06:10:08] Sven_Manguard: it's not really otto mata i think...
[06:10:29] jeremyb: what does on RT duty mean up top then?
[06:10:53] that's not for "the site's broken", outages, etc.
[06:10:59] RT is for less urgent stuff
[06:11:06] site's broken is whoever's available
[06:11:19] see https://wikitech.wikimedia.org/wiki/RT_Triage_Duty#Duty_desk_rotation_-_who_is_next.3F
[06:12:40] the topic is generally trustworthy only if it's been changed in the last ~6 days and also rotations start on mondays (but holidays are often mondays so that maybe messes things up) so if it was set before the last monday then it's probably outdated
[06:13:08] i see no one signed up too recently onwiki
[06:14:27] anyway, about the error...
[06:14:42] Sven_Manguard: that's only on edit? or also reading?
[06:15:22] jeremyb: it was only *that page*
[06:15:36] everything else I loaded eventually, still can't load that page
[06:15:40] but the edit went through
[06:15:48] huh
[06:16:34] Sven_Manguard: you can't load it at all?
[06:16:40] i have no issues with it
[06:18:25] I stopped trying a while ago
[06:19:01] can you try again?
[06:21:24] it seems like that's a consistently expensive page. NewPP limit report agrees with what i saw. at least compared to its own talk page
[06:22:05] Sven_Manguard: pls /j #wikimedia-tech too
[06:22:19] it works now
[06:24:06] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours
[06:27:06] PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours
[06:28:48] !log powercycled ms-be9, it was unresponsive; doing xfs_repair on /dev/sde1 after 'corruption detected' messages on reboot
[06:29:02] Logged the message, Master
[06:32:32] https://gdash.wikimedia.org/dashboards/reqwiki/ don't look so hot
[06:33:55] apergos: could your powercycle be related to the jump in green/blue? https://graphite.wikimedia.org/render/?title=Top%208%20FileBackend%20Methods%20by%20Max%2090th%20Percentile%20Time%20%28ms%29%20log%282%29%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&logBase=2&lineWidth=1&lineMode=connected&target=cactiStyle%28substr%28highestMax%28FileBackendStore.*.tp90,8%29,0,2%29%29
[06:36:41] the timing looks wrong
[06:36:48] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Swift%2520pmtpa&tab=m&vn=&hide-hf=false
[06:38:45] k
[06:39:05] filed bug 57174
[06:41:39] https://gdash.wikimedia.org/dashboards/reqwiki/ seems to be better now
[06:42:04] ok
[07:14:06] PROBLEM - Puppet freshness on terbium is CRITICAL: No successful Puppet run in the last 3 hours
[07:54:56] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:55:06] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 3 hours
[07:55:46] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical
[07:56:06] PROBLEM - Puppet freshness on bast1001 is CRITICAL: No successful Puppet run in the last 3 hours
[09:12:25] (CR) Hashar: "Turns out the real memory hog is git repack-objects, its window memory limit" [operations/puppet] - https://gerrit.wikimedia.org/r/95123 (owner: Hashar)
[09:25:06] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 3 hours
[09:28:06] PROBLEM - Puppet freshness on tin is CRITICAL: No successful Puppet run in the last 3 hours
[09:30:33] (CR) MaxSem: "You might just want to disable HT completely: http://www.fidian.com/problems-only-tyler-has/disabling-hyperthreading" [operations/puppet] - https://gerrit.wikimedia.org/r/95963 (owner: Faidon Liambotis)
[09:31:43] for those who ignore puppet freshness spam: neon & tin:)
[09:32:14] I know neon's problem, looking into tin
[09:32:18] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer:
[09:32:25] :(
[09:32:51] akosiaris: you're there
[09:33:00] poor amslvs1 :(
[09:33:16] I've been trying to figure out bast1001 (presume same as tin) but no joy
[09:33:17] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[09:34:40] apergos: btw we should lower the 1 hour puppet interval today back to 30 mins and see how the infra responds to that load
[09:34:50] maybe even lower
[09:34:52] that would be great
[09:35:02] let's start with 1/2 hour and we can see
[09:35:13] bear in mind neon takes a while to finish up
[09:37:35] tin issue is that puppet-agent can't read from remote server :/ Error 502 on SERVER: 502 Proxy Error. The proxy server received an invalid#015#012response from an upstream server.
[09:38:10] yes, and bast1001
[09:38:35] I should have started with terbium, it looks like a much more run of the mill issue :-/
[09:39:14] hashar, are you a root now?
[09:39:27] MaxSem: nop
[09:39:29] apergos: hashar all these seem the same
[09:39:37] apache proxy module timeouts
[09:39:39] yep
[09:39:43] fixing it now
[09:39:44] * apergos fixes the terbium typo
[09:39:50] oh?
[09:40:25] (Draft1) Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996
[09:40:27] are you just raising the timeout?
[09:40:56] (PS1) ArielGlenn: fix typo ('mode' should have been 'group') [operations/puppet] - https://gerrit.wikimedia.org/r/95997
[09:40:57] (CR) Aude: [C: -1] "this probably breaks localisation cache rebuild on labs" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996 (owner: Aude)
[09:42:06] (CR) ArielGlenn: [C: 2] fix typo ('mode' should have been 'group') [operations/puppet] - https://gerrit.wikimedia.org/r/95997 (owner: ArielGlenn)
[09:42:18] meh... i did that again ?
[09:42:34] easy to fix
[09:42:48] why puppet parser validate does not catch these errors ?
[09:42:56] they are way too simple to not catch
[09:43:16] oh come on .... align correctly :P
[09:43:38] I went into all that trouble to align => :-)
[09:43:46] RECOVERY - Puppet freshness on terbium is OK: puppet ran at Mon Nov 18 09:43:36 UTC 2013
[09:43:47] (PS1) Springle: depool db74 for move to S6 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95998
[09:43:47] :-D
[09:43:55] (PS2) Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996
[09:44:10] (CR) Aude: [C: -1] "this probably breaks localisation cache rebuild on labs" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996 (owner: Aude)
[09:44:13] fine, as a penance I'll go through the whole file and clean up such things, if there are any others :-P
[09:44:26] and terbium is happy so
[09:44:41] (CR) Springle: [C: 2] depool db74 for move to S6 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95998 (owner: Springle)
[09:44:54] huh... I don't wish that to my worst enemy... linting our puppet manifests
[09:46:04] !log springle synchronized wmf-config/db-pmtpa.php 'depool db74 for move to S6'
[09:46:15] Logged the message, Master
[09:46:57] (PS1) Akosiaris: Adjust proxytimeout for puppetmaster's mod_proxy [operations/puppet] - https://gerrit.wikimedia.org/r/96000
[09:48:23] (CR) Akosiaris: [C: 2] Adjust proxytimeout for puppetmaster's mod_proxy [operations/puppet] - https://gerrit.wikimedia.org/r/96000 (owner: Akosiaris)
[09:50:00] I'm only doing alignments and only in the one file
[09:50:06] I'm not *that* crazy
[09:50:29] :-)
[09:53:06] so I was sure it had to be something more complicated than the timeout
[09:53:16] that it must have been timing out for some other obscure reason :-/
[09:53:19] (PS1) Springle: db74 to S6 during pmtpa decom [operations/puppet] - https://gerrit.wikimedia.org/r/96003
[09:54:23] (CR) Springle: [C: 2] db74 to S6 during pmtpa decom [operations/puppet] - https://gerrit.wikimedia.org/r/96003 (owner: Springle)
[09:55:57] Compiled catalog for neon.wikimedia.org in environment production in 7313.52 seconds
[09:56:01] ?????????????
[09:56:22] (PS3) Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996
[09:56:41] (CR) Aude: [C: -1] "this probably breaks localisation cache rebuild on labs" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996 (owner: Aude)
[09:56:50] * aude needs sticky minus 1
[09:56:52] er??
[09:57:16] now that is a lie
[09:57:24] so tin, bast1001, fenari are going to be fixed by the change
[09:57:26] ( 7313.52 seconds )
[09:57:35] because they had compilation times of 300-500 secs
[09:57:43] but neon ????? 2 hours ?
[09:57:45] neon does not take 7000 seconds.
[09:57:46] lol
[09:57:49] that is completely bogus
[09:57:56] i sure hope so
[09:58:08] let's wait for this run and see
[09:58:11] unles omsone changed something recently (last week or two
[09:58:12] )
[09:58:36] to slow it down by an order of magnitude
[09:58:40] *smeone
[09:58:43] grrr... anyways.
[10:00:08] !log xtrabackup db50 to db74
[10:00:17] RECOVERY - Puppet freshness on tin is OK: puppet ran at Mon Nov 18 10:00:12 UTC 2013
[10:00:21] Logged the message, Master
[10:00:28] yippi!
[10:01:44] one down
[10:02:03] (PS1) ArielGlenn: fix alignments in maintenance.pp [operations/puppet] - https://gerrit.wikimedia.org/r/96005
[10:02:05] I'll run neon by hand and watch it
[10:02:06] notice: Finished catalog run in 90.43 seconds
[10:02:12] i already do
[10:02:15] ah ok
[10:02:20] so those 90 secs are on tin
[10:02:23] that is also a lie
[10:02:35] how long was it then?
[10:02:49] compilation took 490 secs
[10:02:58] and 90 secs for the catalog to be applied
[10:03:22] hm that's not exactly fixed alignment
[10:03:54] puppet-lint liked it though (for that only)
[10:04:23] are we doing spaces or tabs, anyone know any more?
[10:05:45] spaces if you are starting from scratch
[10:05:56] otherwise either stick to what the file already has
[10:06:12] or convert it to space-only
[10:06:18] nice ugly mix
[10:06:21] spaces it is
[10:10:24] (PS2) ArielGlenn: fix alignments in maintenance.pp and tabs->spaces [operations/puppet] - https://gerrit.wikimedia.org/r/96005
[10:11:26] PROBLEM - MySQL Processlist on db1006 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 451 statistics
[10:12:06] woo
[10:12:08] (CR) ArielGlenn: [C: 2] fix alignments in maintenance.pp and tabs->spaces [operations/puppet] - https://gerrit.wikimedia.org/r/96005 (owner: ArielGlenn)
[10:12:16] PROBLEM - MySQL Processlist on db1015 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 34 copy to table, 125 statistics
[10:12:20] uh
[10:13:16] RECOVERY - MySQL Processlist on db1015 is OK: OK 0 unauthenticated, 0 locked, 17 copy to table, 1 statistics
[10:13:17] PROBLEM - MySQL Processlist on db1040 is CRITICAL: CRIT 1 unauthenticated, 1 locked, 2 copy to table, 1383 statistics
[10:13:26] PROBLEM - MySQL Processlist on db1006 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 15 copy to table, 1771 statistics
[10:13:45] i suppose those are worrysome ?
[10:13:47] that's bound to give me a merge conflict on that file isn't it
[10:14:58] if you were working in it, odds are very good yep
[10:14:59] watchlist query madness
[10:15:04] oh joy
[10:15:26] PROBLEM - MySQL Processlist on db1040 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:15:27] PROBLEM - MySQL Processlist on db1006 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 14 copy to table, 191 statistics
[10:15:51] although, might be a symptom rather than cause. no MW connection errors yet. killing stuff
[10:15:56] (PS1) Akosiaris: Swap production puppetmasters to use db1001 [operations/puppet] - https://gerrit.wikimedia.org/r/96007
[10:16:07] ooohhhh
[10:16:16] RECOVERY - MySQL Processlist on db1040 is OK: OK 0 unauthenticated, 0 locked, 12 copy to table, 7 statistics
[10:16:21] * apergos wonders if sockpuppet is still the puppetca
[10:16:26] RECOVERY - MySQL Processlist on db1006 is OK: OK 0 unauthenticated, 0 locked, 8 copy to table, 16 statistics
[10:17:38] nope
[10:17:40] it is not
[10:17:44] ah
[10:17:56] and no longer you need to do those weird thingies
[10:18:00] with puppet --ca_server
[10:18:03] yep
[10:18:05] and then again , and then not
[10:18:22] and palladium is the (or a) salt master too?
[10:18:26] just run puppet, logging to palladium, sign the certificate and you are done
[10:18:36] hmmm i got to copy the new_install key ...
[10:19:09] I have made no plans to migrate salt master to palladium
[10:19:16] oh. nm then
[10:19:29] ryan was saying that we should have two
[10:19:32] one per DC
[10:19:34] yes, a syndic
[10:19:58] it's fine on sockpuppet for now
[10:20:06] I suppose it would make sense to reuse the puppetca server in order to avoid
[10:20:12] double CA work
[10:20:38] RECOVERY - Puppet freshness on bast1001 is OK: puppet ran at Mon Nov 18 10:20:34 UTC 2013
[10:21:04] (CR) Akosiaris: [C: 2] Swap production puppetmasters to use db1001 [operations/puppet] - https://gerrit.wikimedia.org/r/96007 (owner: Akosiaris)
[10:22:31] have anybody from ops seen https://bugzilla.wikimedia.org/show_bug.cgi?id=56769 ?
[10:24:12] MaxSem: seems like mathoid is timing out
[10:24:31] oh, mathoid is already live?
[10:24:43] and that is what I was going to ask
[10:25:02] well anyway the app server (whatever it is) times out
[10:25:35] (CR) Nemo bis: "After https://gerrit.wikimedia.org/r/96005 I wonder if it's easier to rebase this or to abandon and resubmit" [operations/puppet] - https://gerrit.wikimedia.org/r/95889 (owner: Nemo bis)
[10:26:09] MaxSem: https://gerrit.wikimedia.org/r/#/c/90733/ is not merged so I assume mathoid is not live
[10:29:39] argggghhhhhhhhh
[10:29:43] oh access denied
[10:29:45] booo
[10:29:49] that is mee
[10:29:51] sorry
[10:29:52] heh
[10:29:54] * MaxSem finds precisely zro wfProfileIn() calls in Math
[10:30:21] holy fuck
[10:30:41] yay
[10:30:59] even if we urgently slap a bunch of them and deploy we will not know if there's a perf regression
[10:31:13] nope
[10:31:46] PROBLEM - Puppetmaster HTTPS on palladium is CRITICAL: Connection refused
[10:31:51] err?
[10:31:56] yeah just saw that
[10:32:28] and again me
[10:32:39] figured as much
[10:32:40] the freaking change to db1001
[10:32:45] ah crap
[10:32:57] Access denied for user 'puppet'@'10.64.0.164'
[10:33:02] the users are there
[10:33:16] just like 'puppet'@'strontium.eqiad.wmnet'
[10:33:36] what the ?
[10:33:44] what's wrong?
[10:34:10] palladium/strontium can't connect to db10001
[10:34:14] what is going on with math?
[10:34:42] 504 Gateway Time-out
[10:34:46] RECOVERY - Puppetmaster HTTPS on palladium is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.943 second response time
[10:35:02] the only difference I see is that users are specified with fqdns and not ips
[10:35:04] IPs
[10:35:29] and I am wondering why it used to work on db9 and not on db1001 ...
[10:35:46] db1001 has skip_name_resolve on
[10:35:50] argh
[10:35:50] heh
[10:35:53] that would do it
[10:35:59] sigh...
[10:36:09] and d9 does not
[10:36:23] any idea why ?
[10:37:11] cause i see a lot of users in mysql.user with FQDNs and not IPs
[10:37:24] yeah I was noticing that
[10:37:33] (which seems odd to me, I'm used to having the ips in there)
[10:37:37] grants on misc suck in general. too long with too many people doing it ad-hoc
[10:38:54] ok I will redefine the users on db9 with IPs and db1001 will pick them up.
[10:40:28] (PS3) Nemo bis: Make the monthly querypages updates not hit each cluster on the same day [operations/puppet] - https://gerrit.wikimedia.org/r/95889
[10:40:31] I feel uneasy using IPs btw. I prefer hostnames.
[10:40:37] i don;t
[10:40:46] why ?
[10:41:01] hostnames make mysql use reverse dns, which can suddenly block all sorts of things if your dns plays up
[10:41:13] makes sense
[10:41:14] and it needs extra network round trips on connect
[10:41:41] i prefer them because they are easier to update and don't stay as stale as ips
[10:41:51] yup, trade off
[10:42:32] ingeneral I would agree with you (in the case where hostnames are not some obscure misc server name but a cname etc)
[10:42:36] but in this case...
[10:43:29] (PS4) Nemo bis: Make the monthly querypages updates not hit each cluster on the same day [operations/puppet] - https://gerrit.wikimedia.org/r/95889
[10:43:30] yeah i get the point. That is not why I am not arguing to change that
[10:50:51] excitingly, that S6 query spike before was not a traffic spike but a side effect of one of the slaves crashing then coming back up with cold caches :( i feel this could be a long night...
[10:52:35] springle: well it may (just may) cheer you up to know that puppet is not longer using db9. One less db there :-)
[10:52:50] that does cheer me up!
[10:54:27] * apergos clears fenari from the puppet freshness whines (hopefully)
[10:54:39] akosiaris: it may also cheer you up to know we're running userstat logging on all db boxes since last week with the aim of cleaning up then locking down grants (among other things)
[10:55:09] That really cheers me up :-) :-) :-)
[10:55:11] have to first find out which stuff is still used. pmtpa going away will make th ejob easier
[10:55:19] it sure will
[10:55:26] RECOVERY - Puppet freshness on fenari is OK: puppet ran at Mon Nov 18 10:55:21 UTC 2013
[10:55:30] yay
[10:57:00] yes!!!!
[10:57:10] neon is applying configuration!!!
[10:57:15] sweet!
[10:57:22] so... The 7030 secs got me wondering
[10:57:24] and that's the last
[10:57:26] RECOVERY - Puppet freshness on neon is OK: puppet ran at Mon Nov 18 10:57:22 UTC 2013
[10:57:29] yes?
[10:57:55] turns out it was the latency of 35 ms between eqiad and pmtpa
[10:58:04] that is why is changed puppet to db1001
[10:58:22] which was needed sooner or later anyways
[10:58:29] yes :-)
[10:58:35] perfect!
[10:58:40] 7k secs was absurd
[10:58:48] so... activerecord sucks ...
[10:58:54] we need to move to puppetdb soon
[10:59:10] hm aren't there space issues?
[10:59:53] or have they fixed that?
[11:00:09] I really don't know... We should evaluate it at least
[11:00:31] Compiled catalog for neon.wikimedia.org in environment production in 324.21 seconds
[11:00:43] huh.. even that was above the default 300 secs timeout
[11:00:45] oh it's the dashboard piece with it,
[11:00:46] RECOVERY - puppet disabled on analytics1011 is OK: OK
[11:00:46] RECOVERY - SSH on analytics1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[11:00:46] RECOVERY - DPKG on analytics1011 is OK: All packages OK
[11:00:46] RECOVERY - RAID on analytics1011 is OK: OK: no disks configured for RAID
[11:00:51] so maybe we'll be ok
[11:00:56] RECOVERY - Host analytics1011 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[11:01:06] RECOVERY - Disk space on analytics1011 is OK: DISK OK
[11:01:09] 324 is quite respectable
[11:01:29] also much closer to all my runs :-D
[11:02:55] moin
[11:03:14] coin
[11:03:16] :P
[11:03:46] amslvs1 holding up, good
[11:05:28] (PS1) Akosiaris: Fix owner/perms ganglia_new::monitor::aggregator [operations/puppet] - https://gerrit.wikimedia.org/r/96012
[11:05:48] (CR) Faidon Liambotis: "We typically disable HT in the BIOS, yes. In this case it doesn't matter much. That blog post's advice though is terrible, though :)" [operations/puppet] - https://gerrit.wikimedia.org/r/95963 (owner: Faidon Liambotis)
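[Annotation: on the skip_name_resolve fallout above (10:32-10:43), redefining hostname-based accounts as IP-based ones can be mechanized. A rough sketch; the account list, privilege level, and password handling are placeholders, not the real db9/db1001 grants.]

```python
#!/usr/bin/env python
"""Sketch: emit IP-based duplicates of hostname-based MySQL accounts.

With skip_name_resolve on, 'user'@'fqdn' entries never match, so each
one needs a 'user'@'ip' twin. Accounts and privileges are examples only.
"""
import socket

ACCOUNTS = [
    ('puppet', 'palladium.eqiad.wmnet'),
    ('puppet', 'strontium.eqiad.wmnet'),
]

for user, host in ACCOUNTS:
    ip = socket.gethostbyname(host)
    # the password hash would be copied from the existing mysql.user row
    print("CREATE USER '%s'@'%s' IDENTIFIED BY PASSWORD '<hash>';" % (user, ip))
    print("GRANT ALL PRIVILEGES ON puppet.* TO '%s'@'%s';" % (user, ip))
```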
[11:06:05] (CR) jenkins-bot: [V: -1] Fix owner/perms ganglia_new::monitor::aggregator [operations/puppet] - https://gerrit.wikimedia.org/r/96012 (owner: Akosiaris)
[11:10:30] (PS2) Akosiaris: Fix owner/perms ganglia_new::monitor::aggregator [operations/puppet] - https://gerrit.wikimedia.org/r/96012
[11:26:37] (PS1) Springle: try to stabilize innodb index cardinality [operations/puppet] - https://gerrit.wikimedia.org/r/96013
[11:29:03] (CR) Springle: [C: 2] try to stabilize innodb index cardinality [operations/puppet] - https://gerrit.wikimedia.org/r/96013 (owner: Springle)
[11:32:45] * paravoid smiles at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&servicestatustypes=28&hoststatustypes=3&serviceprops=2097162&nostatusheader
[11:32:51] almost there :)
[11:37:39] what's up with strontium?
[11:41:17] apergos: alex is on it, it's a check that shouldn't exist
[11:41:43] I guess it would make sense to check that the two back ends on 8141 exist
[11:41:47] but not this check
[11:41:51] right :)
[12:00:44] (PS1) Jack Phoenix: Correct capitalization of "ShoutWiki". [operations/debs/wikistats] - https://gerrit.wikimedia.org/r/96018
[12:01:18] (CR) Akosiaris: [C: 2] Fix owner/perms ganglia_new::monitor::aggregator [operations/puppet] - https://gerrit.wikimedia.org/r/96012 (owner: Akosiaris)
[12:13:57] akosiaris1: can we finish up the ferm stuff for contint ?
[12:14:06] got to grab a snack first though
[12:15:51] hashar: yes
[12:16:06] reviewing smt, will soon get back to you
[12:16:23] akosiaris1: gonna get a snack, will be back in roughly 20 minutes
[12:16:33] I left a comment last night
[12:17:01] and jeremyb left another one too
[12:17:08] oh
[12:19:13] hi, any page documenting how application servers should look like?
[12:20:45] hashar: any ops around atm?
[12:22:13] AzaToth: hashar went to eat, there are some ops here
[12:23:10] matanya: https://gerrit.wikimedia.org/r/#/c/95424/ needs merge, build and aptified
[12:23:49] AzaToth: i'm not in the ops team :)
[12:24:13] matanya: then point me to someone I can poke til the end of time :-P
[12:24:59] AzaToth: I will not give in the ops :) the is a betray
[12:25:03] *that
[12:25:17] hehe
[12:25:18] (CR) Hashar: "Looks like I left some changes in my local repository, sorry :(" (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[12:27:29] (PS3) Hashar: contint: migrate firewall rules to ferm [operations/puppet] - https://gerrit.wikimedia.org/r/95162
[12:27:56] (PS4) Hashar: contint: migrate firewall rules to ferm [operations/puppet] - https://gerrit.wikimedia.org/r/95162
[12:28:46] (CR) Hashar: "I had to do a rebase (PS3). PS4 address issues from PS2: https://gerrit.wikimedia.org/r/#/c/95162/3..4/modules/contint/manifests/firewall" [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[12:28:55] now I get out to get something to eat :]
[12:49:59] (CR) Akosiaris: [C: 1] "LGTM in general, one small note" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/95963 (owner: Faidon Liambotis)
[12:50:36] content => tempate('interface/enable-rps.conf.erb')
[12:50:38] heh I just noticed that
[12:50:53] I missed that entirely
[12:51:03] crappy puppet parser validate
[12:51:08] I remember mark saying that refresh on the upstart job doesn't work well
[12:51:11] not sure though
[12:51:22] could be... just asking
[12:51:29] it's a good question
[12:51:33] not like I trust upstart much
[12:57:56] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:00:46] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical
[13:02:24] I noticed template, I do have a question though about the assigning of the remainders
[13:02:44] in the case of 8 queues and 6 cpus (as an example) I'm not liking the outcome
[13:03:12] the two last queues being assigned to all cpus ?
[13:03:20] yep
[13:03:48] yes it is an interesting cornercase
[13:03:57] I'd rather give half the cpus to one and half to the other or some such
[13:05:53] an even better example is 5 cpus, 8 queues to demonstrate this
[13:06:08] yep
[13:06:11] albeit a non-existent scenario
[13:06:25] well that's why I stuck with an even number ;-)
[13:08:53] I could complain a little that the 'leftover' cpus don't get assigned in the case where the queues are less than the cpus but it's not a performance issue, where this case could be
[13:10:55] it's a good point
[13:11:04] I'm not sure if it's going to be a problem in practice though
[13:11:32] algorithmically we could take the remainder and slice the CPU array with it
[13:11:38] but it'd still could be unevenly sliced
[13:12:08] so we'd have to do yet another division basically and take the remainder again
[13:12:15] it's a bit messy
[13:15:47] you could walk through the list of cpus once, assigning 0 to first extra queue, 1 to second extra queue, 3rd to 3rd extra queue, wrapping around when you are out of extra queues... in the case of only 1 extra queue, all the cpus land on it, in the case of 2 extra queues, each get half, it won't be perfect in all cases but probably 'good enough'
[13:17:34] you could have more CPUs than queues too though
[13:18:03] (PS5) Faidon Liambotis: Replace Linux RPS setting with a smarter script [operations/puppet] - https://gerrit.wikimedia.org/r/95963
[13:18:21] care to show it in code?
[13:18:26] it sounds interesting
[13:18:32] only talking about the end of the first if
[13:18:36] where instead of
[13:18:42] ah
[13:18:44] assign all cpus and hope
[13:18:45] we do that
[13:19:18] doesn't have to be the end of the first if, it could be the whole first if altogether
[13:19:34] yes
[13:20:09] lemme double check that mentally
[13:20:22] show me the code :)
[13:20:32] no
[13:20:32] it's very easy to simulate it
[13:20:39] it can't replace the full if
[13:20:51] cpu_list = range(0, N); rx_queues = range(0, M)
[13:20:55] only the end; the point is we walkt through the cpu list *once*
[13:20:59] then try your code, then print it
[13:22:05] (CR) Manybubbles: "Given that we're still running most of the wikis on lucene-search for the next few months I hate to see us lose monitoring for it. OTOH i" [operations/puppet] - https://gerrit.wikimedia.org/r/95971 (owner: Faidon Liambotis)
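[Annotation: the wrap-around idea apergos describes at 13:15 is easy to simulate along the lines paravoid suggests. A minimal sketch, not the actual interface-rps script from the change under review; the queue/CPU counts are the examples from the discussion.]

```python
def assign_rps(cpus, queues):
    """Spread RX queues over CPUs, handling remainders evenly."""
    n_cpus, n_queues = len(cpus), len(queues)
    mapping = dict((q, []) for q in queues)
    if n_queues <= n_cpus:
        # equal slices first, then hand leftover CPUs out round-robin
        per_q = n_cpus // n_queues
        for i, q in enumerate(queues):
            mapping[q] = list(cpus[i * per_q:(i + 1) * per_q])
        for j, cpu in enumerate(cpus[n_queues * per_q:]):
            mapping[queues[j % n_queues]].append(cpu)
    else:
        # one CPU per queue while they last ...
        for i in range(n_cpus):
            mapping[queues[i]].append(cpus[i])
        # ... then walk the CPU list once, wrapping over the extra queues
        extra = queues[n_cpus:]
        for j, cpu in enumerate(cpus):
            mapping[extra[j % len(extra)]].append(cpu)
    return mapping

if __name__ == '__main__':
    # the 6-CPU/8-queue cornercase from the discussion: the two extra
    # queues end up with half the CPUs each instead of all of them
    for q, c in sorted(assign_rps(range(6), range(8)).items()):
        print('queue %d -> cpus %s' % (q, c))
```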
[13:22:31] manybubbles: hey
[13:22:54] I'm willing to restore it, if you're willing to have a look at the error ;)
[13:23:08] hey, I saw that was already merged but I figure I'd add my two cents.
[13:23:08] zh wikis and a couple of others are saying FAILED
[13:23:12] nice
[13:23:17] for at least 158 days
[13:23:34] I'm willing to live without it given that we've been doing that for half a year
[13:23:34] I've pinged the channel a few times, noone really cares
[13:23:41] that was the second half of the comment
[13:23:44] not sure if you do :)
[13:24:05] I think I care more than most but not enough to spend much time on it.
[13:24:13] heh, fair enough
[13:24:29] we still have the other check in place, so at least we'll get notified if a daemon dies
[13:25:04] that was of a more high level check
[13:25:13] er
[13:25:16] more of a high level check
[13:25:17] damn :)
[13:39:21] (PS4) Aude: Enable Wikidata build on beta labs [WIP] [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996
[13:39:34] (CR) Aude: [C: -1] "this probably breaks localisation cache rebuild on labs" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/95996 (owner: Aude)
[13:39:52] annoying to keep having to apply -1
[13:54:34] paravoid: this is what I meant http://pastebin.com/DERVdteP (ignoring if it's not nice code), just replaced the commented out stanza with the one above it
[13:57:10] akosiaris: trying to clean up certs that mistakenly got copied onto palladium, I get
[13:57:12] err: Could not call revoke: Could not find a serial number for blondel.pmtpa.wmnet.pem
[13:57:29] the cert is there and I guess some other openssl ca file is missing
[13:58:07] paravoid: it is the nudge again :)
[13:58:25] apergos: hmmm
[13:58:27] lemme check
[13:59:15] apergos: puppetca revoke blondel.pmtpa.wmnet
[13:59:15] notice: Revoked certificate with serial 1136
[13:59:17] ?
[14:00:15] hashar: https://github.com/davido/gerrit-wip-plugin might be wanted?
[14:00:24] The .pem suffix i guess ?
[14:02:31] AzaToth: Openstack (#openstack-infra) has a WIP hack on their Gerrit 2.4. Maybe they are founding that development.
[14:02:54] AzaToth: feel free to ask for it by filling a bug against Wikimedia > git/gerrit
[14:03:18] AzaToth: We kind of have that through git review -D but we can NOT go back and forth between the two states. Seems like though it needs "the new change screen". Whatever that is
[14:04:52] on palladium right?
[14:04:56] AzaToth: got confirmation that the plugin is being written for OpenStack so they can upgrade their Gerrit installation
[14:05:05] apergos: yes
[14:05:07] oh I did 'puppetca clean blahblah
[14:05:24] akosiaris: https://groups.google.com/forum/#!topic/repo-discuss/6V769w5zQok "It is not related to DRAFT change/patch set at all. This is something completely different:"
[14:05:26] this is what failed, it should revoke as part of its cleanup
[14:05:30] hashar: I see
[14:06:14] AzaToth: so yeah feature request it, I am sure the VisualEditor team will be happy to have it
[14:06:17] cp1036.wikimedia.org.pem here's another one for you to test with
[14:06:23] k
[14:06:24] same symptom for puppetca clean
[14:07:13] manybubbles elastic search is ready for you. forgot to email yesterdy
[14:07:14] AzaToth: indeed. That for the link.
[14:07:24] apergos: puppetca clean cp1036.wikimedia.org
[14:07:25] notice: Revoked certificate with serial 941
[14:07:25] notice: Removing file Puppet::SSL::Certificate cp1036.wikimedia.org at '/var/lib/puppet/server/ssl/ca/signed/cp1036.wikimedia.org.pem'
[14:07:25] notice: Removing file Puppet::SSL::Certificate cp1036.wikimedia.org at '/var/lib/puppet/server/ssl/certs/cp1036.wikimedia.org.pem'
[14:07:34] stop using the .pem suffix :P
[14:07:35] cmjohnson1: you are my hero!
[14:07:35] ok I'm dumb
[14:07:42] yep saw it just as you typd it :-D
[14:07:49] sorry to waste cycles
[14:07:52] ottomata: can you do the mount points?
[14:07:54] no worries :-)
[14:07:59] ottomata and cmjohnson1: I'm excited!
[14:08:48] akosiaris: what prefix do you want?
[14:08:50] cmjohnson1: can you email which servers are on which racks so I can make a decision about which should be masters?
[14:09:00] k
[14:09:04] .pem is pretty much standard
[14:09:29] AzaToth: yes but nothing to do with puppetca commands
[14:09:47] it takes node names as an argument
[14:09:50] not .pem files
[14:10:30] and I do this from time to time and usually see it immediately
[14:10:31] akosiaris: I don't think I really understand
[14:10:41] which means it's time to Do Something Else for a few minutes
[14:10:44] akosiaris: who should stop using .pem suffix, and why?
[14:11:17] who/what
[14:11:45] AzaToth: he was talking to me
[14:11:46] AzaToth: ok... misunderstanding here... So puppetca commands need node names as arguments. apergos was providing pem files as arguments which conveniently are named .pem
[14:11:56] does this make sense now ?
[14:12:20] and now my error has wasted even more cycles :-D
[14:12:22] ah
[14:12:25] lol
[14:12:26] true
[14:12:35] hehe
[14:13:33] so, any ops around?
[14:13:42] yes
[14:13:53] not a one!
[14:13:57] snicker
[14:14:09] I've no idea whom is ops or not :(
[14:14:22] * apergos raises hand
[14:14:27] * apergos raises akosiaris' hand too
[14:14:30] * akosiaris does too
[14:14:31] 'lol
[14:14:42] akosiaris, apergos: https://gerrit.wikimedia.org/r/#/c/95424/ needs merge, build and aptified
[14:15:24] ok... inserting it in my queue
[14:15:32] ETA required ?
[14:15:39] have you been doing these generally?
[14:15:46] more or less
[14:15:50] ok
[14:15:54] not this specific one
[14:16:02] but hell... how different is it going to be ?
[14:16:08] how much*
[14:16:08] right
[14:16:17] (famous last words)
[14:16:43] hahaha
[14:17:00] ok. 'something different' is going to involve food.
[14:17:02] brb
[14:17:31] akosiaris: after it's installed, I can probably try to build buck again :-P
[14:17:38] on jenkins
[14:18:15] akosiaris: are you planning to add graphios to your nagios/icinga refactor?
[14:18:34] ok, so it is a blocker. Will try to have it ready soon
[14:20:02] matanya: definitely not in a refactoring change. New features must have their own changes
[14:20:58] akosiaris: and in general? would that be a useful feature?
[14:21:56] I 've never seen it being really useful up to now. There are many solutions out there using nagios performance data and all rely on the fact that you do not already have a performance monitoring solution in place
[14:22:11] we already have both ganglia and graphite
[14:22:41] and ganglia has a lot better resolution anyway
[14:22:57] good answer
[14:26:23] so what do you think about going to puppet every half hour now, as we talked about?
[14:26:56] good enough
[14:27:00] let's try it out
[14:28:52] helllooooo
[14:29:50] (CR) Dr0ptp4kt: "Good questions. I'm glad you asked them." [operations/puppet] - https://gerrit.wikimedia.org/r/95532 (owner: Dr0ptp4kt)
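[Annotation: as akosiaris explains above, puppetca wants bare node names, not .pem file names. A hypothetical convenience wrapper could make the argument shape irrelevant; this is a sketch, not an existing tool.]

```python
#!/usr/bin/env python
"""Hypothetical wrapper: let `puppetca clean` accept .pem paths too."""
import os.path
import subprocess
import sys

for arg in sys.argv[1:]:
    node = os.path.basename(arg)
    if node.endswith('.pem'):
        node = node[:-len('.pem')]   # puppetca wants the node name
    subprocess.check_call(['puppetca', 'clean', node])
```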
[14:29:57] (PS1) ArielGlenn: Revert "run puppet once an hour instead of every half hour" [operations/puppet] - https://gerrit.wikimedia.org/r/96025
[14:30:30] (PS2) ArielGlenn: Revert "run puppet once an hour instead of every half hour" [operations/puppet] - https://gerrit.wikimedia.org/r/96025
[14:32:27] ottomata: hello
[14:32:31] haha, who is on RT duty?
[14:32:35] i was not on RT duty last week
[14:32:41] and the topic still says I am this week too!
[14:33:34] (CR) ArielGlenn: [C: 2] Revert "run puppet once an hour instead of every half hour" [operations/puppet] - https://gerrit.wikimedia.org/r/96025 (owner: ArielGlenn)
[14:33:53] ottomata: you are on for ever
[14:34:28] ottomata: bad luck
[14:35:08] (PS2) Dr0ptp4kt: Ensure that Googlebot-Mobile gets redirected to mobile. [operations/puppet] - https://gerrit.wikimedia.org/r/95532
[14:35:38] aww maaan
[14:35:39] ok, in two hours we can see if palladium/strontium are falling over yet
[14:35:52] :-)
[14:35:56] manybubbles: servers are in eh?
[14:35:57] I would expect them to hold up very well
[14:36:12] huh ... i am not so certain
[14:36:34] well they are both r610s with 16gb ram right? which stafford is as well
[14:36:46] ^paravoid see https://gerrit.wikimedia.org/r/95532 and my comments. would you please review and +2 and deploy?
[14:36:55] so two of them ought to be able to handle runs twice as often
[14:37:10] they only have 8 CPUs
[14:37:15] not 16 like stafford
[14:37:21] hrm
[14:37:29] that could be an issue
[14:37:38] well there's always revert revert :-D
[14:38:01] yeah and thankfully now we just add another machine and we are ok
[14:38:14] indeed
[14:38:22] because it would be nice to have us back at every half hour
[14:38:28] dr0ptp4kt: could you wrap the commit message?
[14:38:40] paravoid, yup one sec
[14:38:49] also, typo: UAsaccessing
[14:39:00] MaxSem: around?
[14:39:55] (PS3) Dr0ptp4kt: Ensure that Googlebot-Mobile gets redirected to mobile. [operations/puppet] - https://gerrit.wikimedia.org/r/95532
[14:40:20] paravoid, thanks, fixed typo, too. just resubmitted ^^
[14:41:18] paravoid, yep
[14:41:35] https://gerrit.wikimedia.org/r/#/c/95532/ ?
[14:41:39] yup :)
[14:42:00] you're the original author of that regexp iirc
[14:42:17] and in any case, getting mobile web's take on this sounds like a good idea :)
[14:42:29] nope, that was Asher or Patrick
[14:42:39] oh, heh
[14:43:59] ottomata: yeah!
[14:44:01] hm
[14:44:21] ottomata: I still can't get that code to run properly on my test instance though. that salt issue
[14:44:24] needs some thinking
[14:44:51] see that, manybubbles, getting through monday morning emails, responding to a couple of review things
[14:45:00] then will check that out and see if we can create disk partitions for ya
[14:45:20] thanks!
[14:46:02] (PS5) Manybubbles: Puppet configuration for new elasticsearch servers [operations/puppet] - https://gerrit.wikimedia.org/r/95720
[14:48:20] ottomata: still not able to get puppet to run right on an boxes....tried an1013
[14:48:38] yeah
[14:48:42] networking still doesn't work cmjohnson1
[14:48:50] ah I've been watching those
[14:49:02] https://rt.wikimedia.org/Ticket/Display.html?id=6279
[14:49:06] MaxSem: are you thinking about it now or did you postpone it for some other time or...?
[14:49:15] we need LeslieCarr to fix the ACL for the new subnet
[14:49:18] thinking
[14:49:45] ottomata: looks like I've got other problems on elasticsearch-puppet-tester: load average: 31.49, 17.32, 13.87
[14:49:47] rebooting....
[14:50:01] apergos: get this error http://p.defau.lt/?WuBuszbGwT0R4YHCMhkdmg
[14:50:15] the only problem I can invent about it is that some bot will be redirected to mobile
[14:50:30] cmjohnson1: DNS is not working
[14:50:30] well no ping is going to mean no a lot of other tings too
[14:50:32] *things
[14:50:46] yes, but dr0ptp4kt grepped for such UAs
[14:50:47] also network ?
[14:50:54] on the other hand, proper wiki bots should be hitting api.php and not get redirected
[14:50:55] (apparently)
[14:51:07] if they screen-scrape it is their problem
[14:51:14] paravoid, MaxSem, i'll gist a search script in a moment
[14:51:23] (CR) Ottomata: Setting up varnishkafka on mobile varnish caches (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/94169 (owner: Ottomata)
[14:51:31] couldn't think of search bots that could be broken
[14:51:38] MaxSem, i generally agree...as long as reputable search engines don't encounter problems
[14:51:44] (CR) MaxSem: [C: 1] Ensure that Googlebot-Mobile gets redirected to mobile. [operations/puppet] - https://gerrit.wikimedia.org/r/95532 (owner: Dr0ptp4kt)
[14:51:56] paravoid, ^^^
[14:51:59] hashar: err: /Stage[main]/Base::Firewall/Ferm::Conf[defs]/File[/etc/ferm/conf.d/00_defs]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/base/firewall/defs.labs at /etc/puppet/modules/ferm/manifests/conf.pp:15
[14:52:07] :-((((
[14:52:11] when testing contint in labs... not your fault though
[14:52:12] looking into it
[14:53:42] (CR) Faidon Liambotis: [C: 2] Ensure that Googlebot-Mobile gets redirected to mobile. [operations/puppet] - https://gerrit.wikimedia.org/r/95532 (owner: Dr0ptp4kt)
[14:54:08] dr0ptp4kt: should be live within the next hour
[14:56:04] ottomata: 300000ms = 5 minutes
[14:56:35] ottomata: if it takes 5 minutes for a message to get from esams to eqiad, we're kinda fucked, aren't we
[14:59:09] paravoid, MaxSem, thanks.
[14:59:28] (CR) Dr0ptp4kt: "For analysis use https://gist.github.com/dr0ptp4kt/7529136 on the sampled-1000 log files." [operations/puppet] - https://gerrit.wikimedia.org/r/95532 (owner: Dr0ptp4kt)
[15:00:02] (CR) Ottomata: Puppet configuration for new elasticsearch servers (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/95720 (owner: Manybubbles)
[15:00:28] paravoid: ja, we were just overkilling to make sure we didn't miss something
[15:04:54] ottomata: I can make that change but that common config thing is kind of the opposite of how I was told to do it months ago. I can probably just have a single role if I go that way - no config or anything.
[15:05:16] you can do what you like best, but i think the big issue there is parameterized role classes
[15:07:19] manybubbles: don't make the config class change thing if you don't think you should, or others told you not to
[15:07:39] i find it very convenient, and that's how it looks like the cache.pp role classes work too
[15:07:55] but ja, no parameters on role classes, so that you can use them in labs
[15:07:56] ottomata: that was months ago, so I imagine the advice is out of date. Let me see how it looks
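[Annotation: dr0ptp4kt's gist does the real analysis of the sampled-1000 logs; for flavor, a toy tally of the bot UA in question could look like the sketch below. The file handling and substring match are assumptions, not the gist's actual logic.]

```python
#!/usr/bin/env python
"""Toy UA counter over sampled request logs (not dr0ptp4kt's gist).

Reads log lines from files given on the command line (or stdin); the
user-agent is simply substring-matched, which is enough for a tally.
"""
import fileinput

PATTERN = 'Googlebot-Mobile'   # the UA discussed in the change

hits = total = 0
for line in fileinput.input():
    total += 1
    if PATTERN in line:
        hits += 1
print('%d of %d sampled requests matched %s' % (hits, total, PATTERN))
```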
[15:08:11] manybubbles: it might not be, people give different advice on this :p
[15:08:55] but, i don't actually care how you structure your role classes, I think that is up to your judgement mostly,
[15:09:04] just gotta get rid of the parameters :)
[15:17:31] ah Jeff_Green: are barium and grosley frack (i.e. served by frack puppet)?
[15:17:43] still working my way through last of cleanup list
[15:18:08] apergos: yes. barium is physically in frack, grosley is still in pmtpa's main cluster but tied to frack puppet
[15:18:40] so for grosley do we serve dhcp for it still?
[15:18:46] where by 'we' I mean 'not you'
[15:18:50] hahaha
[15:18:59] it calls my home phone and asks for an IP
[15:19:02] hahaha
[15:19:10] technically yes, but you can tear that down if you want
[15:19:18] nah, I'll leave it in for now
[15:19:28] thanks!
[15:19:34] it's got it's address and we'll slay it as soon as the fundraising goal is met
[15:19:38] ty
[15:19:47] when it's slain we'll kill it good :-)
[15:19:57] yay!
[15:26:18] (PS1) ArielGlenn: remove barium from everywhere and grosley mostly [operations/puppet] - https://gerrit.wikimedia.org/r/96031
[15:30:33] (CR) ArielGlenn: [C: 2] remove barium from everywhere and grosley mostly [operations/puppet] - https://gerrit.wikimedia.org/r/96031 (owner: ArielGlenn)
[15:36:07] (PS1) ArielGlenn: remove last trace of maurus (decommed, last sighted in 2006) [operations/dns] - https://gerrit.wikimedia.org/r/96033
[15:36:31] ottomata: I think I like your way better. all the configuration across all sites fits on one screen
[15:36:46] (CR) ArielGlenn: [C: 2] remove last trace of maurus (decommed, last sighted in 2006) [operations/dns] - https://gerrit.wikimedia.org/r/96033 (owner: ArielGlenn)
[15:41:33] (CR) Akosiaris: [C: -1] "The localhost thingy as said inline does not work." (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[15:41:42] hashar: ^^
[15:41:50] I think I lost my original patch
[15:41:53] or forgot to send it :(
[15:42:47] ahh
[15:42:48] no
[15:43:04] manybubbles: i like it to, because, say you had some other random puppetization (monitoring somewhere? something else?) that needed to know somethign about elastic search configuration
[15:43:05] * hashar blames labs
[15:43:11] but didn't want to actually install elastic search
[15:43:14] *too*
[15:43:21] you could then include just he config class and do stuff with the variables
[15:43:38] akosiaris: that seems to be a labs issue, there is no IPv6 defined for localhost: ::1 ip6-localhost ip6-loopback
[15:43:45] ah, I haven't actually made a ::config thing yet - though it'd be super easy to do. let me just do that anyway
[15:44:06] hashar: not a labs issue...
[15:44:28] either way!
[15:44:35] i mean, i'm not sure my way is the best
[15:44:39] there are many ways to do it
[15:45:25] ottomata: it is like 6 more lines - even if we aren't going to need it it'll force all the configs into a specific place for organization
[15:46:24] (PS1) ArielGlenn: salt for streber just like every other host [operations/puppet] - https://gerrit.wikimedia.org/r/96034
[15:46:47] manybubbles: i'm not in your search labs project
[15:46:50] so I can't log into those hosts
[15:47:00] let me fix it
[15:47:37] (CR) ArielGlenn: [C: 2] salt for streber just like every other host [operations/puppet] - https://gerrit.wikimedia.org/r/96034 (owner: ArielGlenn)
[15:47:50] ottomata: are you Ottomata in labs?
[15:48:06] I've added you if you are
[15:48:07] yes
[15:49:23] manybubbles: also, nc: getaddrinfo: Name or service not known
[15:49:29] for elasticsearch-puppet-tests
[15:49:45] elasticsearch-puppet-tester
[15:49:47] sorry
[15:49:54] (CR) Hashar: "Sounds like a problem with /etc/hosts in labs ::1 is not mapped to `localhost`. Will amend with explicit IP v4 and v6 addresses (option c" [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[15:51:02] manybubbles: elasticsearch-puppet-tester
[15:51:02] has role::elasticsearch included
[15:51:07] instead of role::elasticsearch
[15:51:09] role::elasticsearch::labs
[15:51:17] is that correct?
[15:51:18] 'cause I'm making your changes
[15:51:33] ah ok awesome
[15:51:36] hehe
[15:51:36] (PS5) Hashar: contint: migrate firewall rules to ferm [operations/puppet] - https://gerrit.wikimedia.org/r/95162
[15:51:48] ok i'll wait til you are done with that before I look into your salt problem them
[15:51:49] then
[15:51:53] oook , on to data partitions!
[15:51:59] (PS6) Manybubbles: Puppet configuration for new elasticsearch servers [operations/puppet] - https://gerrit.wikimedia.org/r/95720
[15:52:00] infact
[15:52:03] ^^^
[15:52:25] (CR) Manybubbles: Puppet configuration for new elasticsearch servers (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/95720 (owner: Manybubbles)
[15:55:54] (CR) Akosiaris: [C: 2] contint: migrate firewall rules to ferm [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[15:56:08] hashar: when do you want to merge ? Can I do it now ?
[15:57:47] (PS1) QChris: Enforce clone ownership for geowiki clones [operations/puppet] - https://gerrit.wikimedia.org/r/96037
[15:58:36] (CR) Ottomata: "Aside from that LGTM" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/95720 (owner: Manybubbles)
[15:59:13] akosiaris: yup lets do it
[15:59:24] at worth we loose ssh access on the box :D
[15:59:43] hmmm
[15:59:54] (PS7) Manybubbles: Puppet configuration for new elasticsearch servers [operations/puppet] - https://gerrit.wikimedia.org/r/95720
[15:59:56] in case the ferm rule deny us ssh
[15:59:58] well we always have out of band :-)
[16:00:18] ottomata: so as far as testing my puppet change, I'm still trying to get it working on that tester, but otherwise it seems right
[16:00:24] its just that salt grain
[16:00:44] ok
[16:00:45] also, now that you are happy with the puppet change I can leave the puppet tester machine alone and you can poke it
[16:00:51] if you want
[16:00:53] :)
[16:00:54] can I run puppet there?
[16:00:55] k cool
[16:01:36] hashar: done
[16:02:09] manybubbles: you saw my one comment about the escaped variable in the message?
[16:02:28] ottomata: yeah, and I pushed that
[16:02:36] for review, that is
[16:02:48] oh ok sorry
[16:02:48] danke
[16:03:09] (CR) Manybubbles: Puppet configuration for new elasticsearch servers (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/95720 (owner: Manybubbles)
[16:03:25] didn't comment that I
[16:03:30] I'd done it. just did
[16:04:00] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours
[16:05:56] akosiaris: have you run puppet?
running it rightnow
[16:06:48] ok
[16:06:57] !log gallium: running puppet to apply ferm firewall configuration from {{gerrit|95162}}
[16:07:10] Logged the message, Master
[16:08:06] manybubbles: i added a site.pp entry last night for elastic...remove if not needed.
[16:08:22] more a fyi..so there are no conflicts
[16:08:38] cmjohnson1: ah, sure. let me rebase my puppet changes so they make sense
[16:10:07] (CR) Hashar: "Ran puppet." [operations/puppet] - https://gerrit.wikimedia.org/r/95162 (owner: Hashar)
[16:10:14] akosiaris: I see that palladium and stronium are working pretty hard
[16:10:20] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=palladium.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[16:10:26] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=strontium.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[16:10:30] yes they are
[16:10:34] i am monitoring them
[16:10:43] thoughts?
[16:10:46] puppet 3
[16:10:50] is my thought
[16:10:52] huh
[16:11:11] puppet 3 will require a lot of rewritting before we go there
[16:11:24] ok but a short-term thought is more what I had in mind :-)
[16:11:26] short term it looks fine
[16:11:46] ottomata: so with the mount point, you should probably shut down each node as you go
[16:11:56] there's some 100% spikes here and there
[16:12:08] a lot of unqualified variables
[16:12:19] deprecated behaviour in puppet 3
[16:12:22] shut down each?
[16:12:24] it was deprecated in 2.7
[16:12:31] these nodes aren't in produciton yet, are they?
[16:12:32] yes, which is why we get the warninfs
[16:12:37] *warnings
[16:12:38] looks like they are
[16:12:40] akosiaris: we get warnigns for these though
[16:12:41] surprise!
[16:12:44] so it's easy to spot them
[16:12:49] they are out there
[16:12:55] the unqualified variables in puppet are quite easy to fix though. You can even monitor them by grepping in the syslog
[16:12:56] elastic search is on them?
[16:13:04] ottomata: cmjohnson1 put them in production last night with his puppet change
[16:13:07] bahh
[16:13:07] ha
[16:13:08] ok
[16:13:16] can we wipe it?
[16:13:17] probably not intentionally
[16:13:17] ahhh
[16:13:18] oits ok
[16:13:22] one at a time is ok
[16:13:26] hm
[16:13:27] ok
[16:13:31] so, turn off es
[16:13:39] just on one machine
[16:13:40] do I need to copy existing data?
[16:13:44] no
[16:13:44] it's easy to find them hashar, it's not always easy to fix them, depending on how your classes are writte
[16:13:45] ok
[16:13:46] cool
[16:13:46] n
[16:14:02] /var/lib/elasticsearch ?
[16:15:50] ferm::conf { 'main':
[16:15:50] ensure => present,
[16:15:50] prio => '00',
[16:15:50] # we also have a default DROP around, postpone its usage for later
[16:15:50] source => 'puppet:///modules/base/firewall/main-minimal.conf',
[16:15:50] }
[16:15:56] the postpone still holds ?
[16:16:18] yes, since we used ferm in at least one host (git.d.o) without a proper ruleset
[16:16:25] erm
[16:16:29] .wm.o even :)
[16:16:52] fix that and switch to default DROP
[16:17:11] kind of got lost
[16:17:20] sorry
[16:17:22] git.wikimedia.org has ferm applied ?
[16:17:25] yes
[16:17:28] but not correctly ?
[16:17:30] but only with a single ACCEPT/DROP rule [16:17:42] so it assumes that whatever is not covered is going to be accepted [16:17:44] (iirc) [16:17:52] if you switch to default DROP, you may drop real traffic [16:18:04] so you need to make sure that whatever we use on that box is allowed by the firewall [16:18:43] ah, that is what you mean... ok [16:18:56] but in concept, yes, I'm all for it [16:20:25] 288 dynamic lookup errors in 59 files [16:20:27] not too bad [16:20:53] (03PS1) 10Hashar: contint: firewall out ssh access (restrict to bastion) [operations/puppet] - 10https://gerrit.wikimedia.org/r/96040 [16:21:01] (03CR) 10Hashar: "Ssh filtering did not work because the default policy for the INPUT chain is ACCEPT. Filtering proposed with https://gerrit.wikimedia.org" [operations/puppet] - 10https://gerrit.wikimedia.org/r/95162 (owner: 10Hashar) [16:21:18] (03CR) 10Hashar: "Follow up https://gerrit.wikimedia.org/r/#/c/95162/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96040 (owner: 10Hashar) [16:21:47] akosiaris: https://gerrit.wikimedia.org/r/#/c/96040/ should restrict ssh access on gallium.wikimedia.org from bastion only. [16:22:49] any volunteers to go through that list and qualify variables? :-) [16:25:10] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:11] PROBLEM - ElasticSearch health check on testsearch1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:31] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:31] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:40] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:40] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:40] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:49] manybubbles: ^^^ [16:25:50] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:50] PROBLEM - ElasticSearch health check on testsearch1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:50] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:50] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:51] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:25:51] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:26:00] oh my [16:26:00] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:26:01] PROBLEM - ElasticSearch health check on testsearch1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1217: active_shards: 1217: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 87 [16:26:18] manybubbles: that message is not obvious at all. But you can figure that out later on :D [16:26:28] I have the gid one on my list as part of the account rewrite [16:26:34] I can look at some others too, why not [16:26:43] also that's a lot of icinga spam [16:27:02] hashar: will file a bug about that.... can someone ack that and I'll figure it out [16:27:12] apergos: palladium:/root/deprecated-u [16:27:13] I can't ack :/ [16:27:21] apergos: but let's not wait for the account rewrite [16:27:39] no, I'll do some of them right away (i.e. 
starting tomorrow) [16:27:52] and the account rewrite I want to do with an LDAP export at some point, we agreed to that a few months ago [16:27:54] it will get added to my dailies [16:29:13] :-) [16:30:00] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Nov 18 16:29:52 UTC 2013 [16:30:00] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours [16:30:13] decommed? :-D [16:30:30] what is? [16:30:56] just watching dysprosium's behavior [16:31:11] * apergos places a mental bet [16:31:23] and wins [16:31:38] active checks enabled for puppet freshness, who knows why. [16:32:45] fixed [16:33:33] (03CR) 10Addshore: "Looks fine in relation to using the build and the specific wmg var used." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95996 (owner: 10Aude) [16:34:11] and won bet #2, it is indeed in decommissioned.pp [16:35:08] 3312d2f11f99b1f0d35b340da2dfdaed05564628 Jul 12 2013 [16:36:21] the open rt ticket to find out its status is stalled, I will meh... at least it should become a spare [16:37:56] (03PS1) 10Milimetric: [not ready for review] Productionizing Wikimetrics [operations/puppet] - 10https://gerrit.wikimedia.org/r/96042 [16:45:47] I am off, see you later tonight [16:46:00] * greg-g waves [16:52:17] akosiaris: I'm doing more puppet doc updates… is it safe to say that all references to 'sockpuppet' can be replaced with 'palladium'? [16:52:53] (03PS2) 10Ottomata: Enforce clone ownership for geowiki clones [operations/puppet] - 10https://gerrit.wikimedia.org/r/96037 (owner: 10QChris) [16:52:58] (03CR) 10Ottomata: [C: 032 V: 032] Enforce clone ownership for geowiki clones [operations/puppet] - 10https://gerrit.wikimedia.org/r/96037 (owner: 10QChris) [16:54:17] andrewbogott: btw, see above for puppet deprecated errors; might be useful to have this in mind when modularizing [16:55:17] (03PS1) 10Jgreen: remove civicrm dev sites from aluminium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96051 [16:55:54] oof, any reason I shouldn't just remove any docs about how to manage solaris boxes? [16:59:53] is gerrit/jenkins known-fail at the moment? [17:00:00] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Nov 18 16:59:58 UTC 2013 [17:00:03] oop there it is [17:00:15] (03CR) 10Jgreen: [C: 032 V: 031] remove civicrm dev sites from aluminium [operations/puppet] - 10https://gerrit.wikimedia.org/r/96051 (owner: 10Jgreen) [17:01:10] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:10] RECOVERY - ElasticSearch health check on testsearch1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:24] ^d and ottomata: so what happened with elasticsearch is this: a long while back we set the default number of replicas to 0 and configured prod to use 2. [17:01:30] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:31] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:35] well that didn't take and we didn't have something to warn us of that [17:01:40] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:40] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:40] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:41] now icinga-wm will spam us [17:01:50] RECOVERY - ElasticSearch health check on testsearch1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:50] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:50] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:50] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:50] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:01:51] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:02:00] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:02:00] RECOVERY - ElasticSearch health check on testsearch1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 1304: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [17:02:10] ^d and ottomata: so we lost all our redundancy [17:02:17] uh ohhh [17:02:31] and this morning when ottomata went to move data to a new partition we turned off a node [17:02:31] sooo, did I actually delete data when I trashed /var/lib/es on es1001? [17:02:37] yup [17:02:40] yay! [17:02:44] but that shouldn't have happened [17:02:55] so we'll have to rebuild all those wikis from scratch now [17:03:00] but we were going to do that anyway [17:03:05] bwerp ok [17:03:12] AND we'll need to add the redundancy back today [17:03:18] because, wtf, where did it go? [17:03:32] k [17:03:36] I've updated https://wikitech.wikimedia.org/wiki/Search/New#Stuck_in_red [17:03:55] with some bash/awk/curl foo to get all the unassigned shards assigned again [17:04:10] es1001? [17:04:16] please tell me you meant elastic1001 [17:04:37] yeah, elastic10XX and testsearch100X [17:04:39] and if so, let's avoid the es notation, it'll get way too confusing [17:04:56] there is an es1001 too and it's a different thing [17:04:58] (external store) [17:04:59] /var/lib/elasticsearch is the dir anyway [17:05:03] oh my [17:05:18] <^d> (This is why I said don't use the es name in that e-mail thread :)) [17:05:26] let me go and figure out what happened to our redundancy [17:05:43] :) [17:08:06] hm ok, paravoid that was my irc abbrev [17:08:11] yeah I figured [17:08:20] but "trashed" es1001 sounds omg so bad :) [17:08:23] haha [17:08:49] yes it sure does [17:08:55] paravoid, i have a puppet dependency cycle I don't get [17:08:55] (Exec[git_pull_geowiki-data-private] => Git::Clone[geowiki-data-private] => File[/a/geowiki/data-private] => Exec[git_pull_geowiki-data-private]) [17:08:58] that would give several people a quick heart attack [17:09:51] git::clone { 'geowiki-data-private': [17:09:51] directory => $geowiki_private_data_path, [17:09:51] ... [17:09:51] } [17:09:51] and [17:09:52] file { "$geowiki_private_data_path": [17:09:52] ensure => directory, [17:09:53] require => Git::Clone['geowiki-data-private'], [17:09:53] … [17:09:54] } [17:10:02] <^d> !paste | ottomata [17:10:11] I think you don't need the File [17:10:14] git::clone does that [17:10:18] it doesn't [17:10:20] ^d: I actually prefer it that way :-) [17:10:20] i thought it did too [17:10:30] it does implicitly [17:10:32] i was going to change it, but then I was afraid I would break other things [17:10:40] no, i don't think it does? [17:10:43] for example [17:10:45] when you do "git clone git://foo.git" git will create "foo" [17:10:48] git::clone has a mode param [17:10:50] but it doesn't do anything [17:10:54] yeah but we're worried about permissions [17:11:08] <^d> git::clone sucks.
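The "bash/awk/curl foo" mentioned above lives on the wikitech page; a hedged reconstruction of the idea, with python standing in for the awk part and host/node names as placeholders, looks roughly like this:

```bash
# Sketch of the Search/New#Stuck_in_red helper, not the authoritative
# version: list every UNASSIGNED shard from the cluster state, then
# force-allocate each onto a chosen node via the reroute API.
ES=http://elastic1001:9200   # any cluster member
NODE=elastic1002             # where to force-allocate
curl -s "$ES/_cluster/state" | python -c '
import json, sys
state = json.load(sys.stdin)
for s in state["routing_nodes"]["unassigned"]:
    sys.stdout.write("%s %s\n" % (s["index"], s["shard"]))
' | while read index shard; do
  # allow_primary concedes data loss for shards whose primary copy is
  # gone -- tolerable here, since the plan above is to rebuild the
  # affected wikis from scratch anyway
  curl -s -XPOST "$ES/_cluster/reroute" -d "{
    \"commands\": [ { \"allocate\": {
      \"index\": \"$index\", \"shard\": $shard,
      \"node\": \"$NODE\", \"allow_primary\": true } } ] }"
done
```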
[17:11:19] it does [17:11:38] paravoid, it isn't even used though [17:11:41] yup [17:11:46] it's broken [17:11:49] would it solve this if I implement the mode param right now? [17:11:52] yes [17:11:53] :) [17:11:56] i was going to do that too [17:12:03] oh, well, don't let me stop you... [17:12:06] buuuut, i was afraid i'd break stuff with duplicate File definition errors [17:12:07] haha [17:12:15] (03PS1) 10ArielGlenn: remove technetium and tingxi [operations/dns] - 10https://gerrit.wikimedia.org/r/96055 [17:12:21] since there are people that reference the $directory outside of git::clone [17:12:30] I don't immediately know how to do it, but I'll look. [17:12:30] could wrap them all in an if defined(...) [17:12:34] "git grep" is your friend [17:12:45] oh you want to do it in the exec? :p [17:12:50] hmmmm [17:12:53] no [17:12:58] I meant to fix the call sites [17:13:11] ? [17:13:14] why git grep then? [17:13:19] git grep git::clone [17:13:22] (03CR) 10ArielGlenn: [C: 032] remove technetium and tingxi [operations/dns] - 10https://gerrit.wikimedia.org/r/96055 (owner: 10ArielGlenn) [17:13:27] ohohoh [17:13:27] sorry [17:13:47] for files /in/ a repo, mode is managed by git, right? [17:13:51] thought you were suggesting some crazy git grep + exec or something to fix perms [17:13:53] So we're only talking about the top-level dir? [17:13:56] no :) [17:14:00] andrewbogott: yes [17:14:04] yes [17:14:31] yeah just do a file { $directory: ensure => directory, require => git_clone_$title …} something in there [17:14:37] but you have to check all of the callers of git::clone [17:14:53] and either remove any place that the file { $directory is used [17:15:04] and/or wrap them all in an if (defined(…)) block [17:15:49] who's stopping you? :) [17:16:23] baaaaaaaaaaaaaa because I didn't wanna [17:16:24] OKOOK [17:16:26] I WILL DO IT [17:16:28] ottomata: can just use a different path and title [17:16:29] qchris: ^^ :p [17:16:36] ? [17:16:45] I'm pretty sure. Give me a second, I'll show you [17:16:54] that's what the 4 people before you said and that's how we ended up having this discussion :P [17:17:47] (03PS1) 10Chad: Clean up replica configuration for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 [17:17:56] <^d> manybubbles: ^ [17:18:30] (03PS2) 10Chad: Clean up replica configuration for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 [17:18:57] (03PS3) 10Chad: Clean up elasticsearch replica configuration for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 [17:20:06] (03CR) 10Manybubbles: Clean up elasticsearch replica configuration for all wikis (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 (owner: 10Chad) [17:20:13] (03CR) 10Manybubbles: [C: 031] Clean up elasticsearch replica configuration for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 (owner: 10Chad) [17:20:24] ^d: ^^^ [17:20:27] I don't have +2 [17:20:34] but is less broken, I think [17:20:41] <^d> I shall fix that later too. [17:20:58] (03CR) 10Chad: [C: 032] Clean up elasticsearch replica configuration for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 (owner: 10Chad) [17:21:07] (03Merged) 10jenkins-bot: Clean up elasticsearch replica configuration for all wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96056 (owner: 10Chad) [17:22:00] <^d> Reedy: You have uncommitted changes to docroots on tin.
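The caller audit paravoid describes above is a one-liner; only the repo paths are assumed here:

```bash
# Every call site of git::clone in operations/puppet -- each hit needs
# checking for a competing file { $directory } resource, which must be
# removed or wrapped in if !defined(File[...]) { ... } once git::clone
# manages the directory itself (otherwise: duplicate definition errors).
git grep -n 'git::clone' -- manifests modules
```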
[17:22:18] ergh [17:22:20] paravoid [17:22:22] i stopped doing this before [17:22:32] because labs_vagrant class has [17:22:40] file { '/mnt/vagrant': [17:22:40] recurse => true, [17:22:41] owner => 'vagrant', [17:22:41] group => 'www-data', [17:22:48] and git::clone { 'vagrant': [17:22:48] directory => '/mnt/vagrant/', [17:22:48] (03PS1) 10Andrew Bogott: Implement the $mode param for git::clone [operations/puppet] - 10https://gerrit.wikimedia.org/r/96057 [17:22:52] recurse true! [17:22:55] grr [17:23:12] ottomata: I mean like ^ [17:23:17] Although I haven't tested it yet :) [17:23:25] andrewbogott: not sure about that, but maybe [17:23:29] also, don't forget owner and group [17:23:39] !log demon updated /a/common to {{Gerrit|I3b5eac7fb}}: depool db74 for move to S6 [17:23:40] owner and group are handled by git I believe. [17:23:44] also [17:23:50] you need $title at least in the file name [17:23:53] Logged the message, Master [17:23:55] so that it is unique per use of git::clone [17:24:03] Ah, good point. [17:24:04] andrewbogott: not if they get set incorrectly [17:24:13] puppet is supposed to ensure that this is true [17:24:18] they might get set properly on initial clone [17:24:23] ? [17:24:23] but not later [17:24:32] !log demon synchronized wmf-config/CirrusSearch-common.php 'I114677d0' [17:24:33] The initial clone is done with the same class using the same owner/group... [17:24:34] say someone does [17:24:44] chown boogerman /path/to/clone [17:24:45] manually [17:24:45] Logged the message, Master [17:24:46] You're talking about if it's changed in puppet after the clone exists? [17:24:48] eah [17:24:49] no [17:24:52] not in puppet [17:24:52] ^d and ottomata: adding replicas now [17:24:54] but by someone [17:25:05] anyway, andrewbogott, i have a similar commit coming [17:25:06] oh, well… ok [17:25:11] !log demon synchronized wmf-config/InitialiseSettings.php 'I114677d0' [17:25:13] with changes to the uses [17:25:16] mind if I use mine? [17:25:20] nope [17:25:20] i will try your path things [17:25:21] thing [17:25:25] Logged the message, Master [17:28:28] (03PS1) 10Chad: Fix aggressive splitting setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96059 [17:28:32] (03CR) 10jenkins-bot: [V: 04-1] Fix aggressive splitting setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96059 (owner: 10Chad) [17:28:39] (03PS2) 10Chad: Fix aggressive splitting setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96059 [17:29:19] (03CR) 10Chad: [C: 032] Fix aggressive splitting setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96059 (owner: 10Chad) [17:29:30] (03Merged) 10jenkins-bot: Fix aggressive splitting setting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96059 (owner: 10Chad) [17:29:57] (03PS1) 10Ottomata: Fixing git::clone so that it respects permissions on directory. [operations/puppet] - 10https://gerrit.wikimedia.org/r/96060 [17:30:05] andrewbogott, paravoid ^ [17:30:11] ack .! [17:30:29] (03PS2) 10Ottomata: Fixing git::clone so that it respects permissions on directory [operations/puppet] - 10https://gerrit.wikimedia.org/r/96060 [17:30:29] (avert your eyes paravoid! don't look at the period!) [17:30:30] ok fixed [17:30:36] :p [17:30:58] !log demon synchronized wmf-config/InitialiseSettings.php 'I593f62a5' [17:31:11] Logged the message, Master [17:31:18] ottomata: That looks reasonable to me -- have you tried it? 
:) [17:31:29] !log demon synchronized wmf-config/CirrusSearch-common.php 'I593f62a5' [17:31:32] The if defined seems like overkill at this point, but harmless [17:31:43] Logged the message, Master [17:31:51] andrewbogott: nope! [17:32:00] but, the only place I could see it would break would be the labs_vagrant class [17:32:06] I can test if you're not set up to try... [17:32:09] i was thinking about wrapping it [17:32:13] oo, if you can test please do [17:32:14] thank you [17:32:40] (03PS1) 10Chad: Fixing replica counts one last time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96061 [17:32:44] (03CR) 10jenkins-bot: [V: 04-1] Fixing replica counts one last time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96061 (owner: 10Chad) [17:33:11] (03PS2) 10Chad: Fixing replica counts one last time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96061 [17:35:10] (03CR) 10Manybubbles: [C: 031] Fixing replica counts one last time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96061 (owner: 10Chad) [17:35:23] (03CR) 10Chad: [C: 032] Fixing replica counts one last time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96061 (owner: 10Chad) [17:35:32] (03Merged) 10jenkins-bot: Fixing replica counts one last time [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96061 (owner: 10Chad) [17:36:18] !log demon synchronized wmf-config/CirrusSearch-common.php 'If920e0e9' [17:36:32] Logged the message, Master [17:40:33] anomie: thanks for responding with patches on that CentralAuth bug so quickly [17:40:48] anomie: I'm suspecting it's killing us in all layers of the infrastructure [17:41:51] I might be causing more icinga spam - sorry in advance for when/if it comes [17:41:58] adding new replicas [17:42:05] to recover from my mistake this morning [17:42:06] (03CR) 10Andrew Bogott: [C: 032] "I tested the $mode behavior on labs, it worked fine." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96060 (owner: 10Ottomata) [17:47:10] (03CR) 10Ottomata: "Great, thanks." [operations/puppet] - 10https://gerrit.wikimedia.org/r/96060 (owner: 10Ottomata) [18:03:06] Jeff_Green: do you know what the current PDF box hardware configuration is? I'm trying to put together a RT ticket for requisition [18:06:13] mwalker: checking [18:07:20] paravoid: can you please describe the application server? I'd like to know where the imagescalers should go [18:07:33] sorry, I'm a little busy atm [18:08:19] np, let me know when there is no outage or meeting or whatever :) [18:08:27] heh [18:08:29] good luck with that [18:08:39] this seems to be my daily routine these days [18:09:13] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 2962: relocating_shards: 0: initializing_shards: 21: unassigned_shards: 477 [18:09:13] PROBLEM - ElasticSearch health check on testsearch1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 2962: relocating_shards: 0: initializing_shards: 17: unassigned_shards: 481 [18:09:20] sounds like a lot of fun to be a fireman [18:09:23] mwalker: uptime 1179 days [18:14:21] mwalker: they're pretty modest machines: dual Xeon L5420's, 8GB RAM, single 73GB SAS HDD [18:15:14] PROBLEM - ElasticSearch health check on testsearch1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3188: relocating_shards: 0: initializing_shards: 13: unassigned_shards: 307 [18:16:13] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3204: relocating_shards: 0: initializing_shards: 15: unassigned_shards: 297 [18:19:43] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3236: relocating_shards: 0: initializing_shards: 17: unassigned_shards: 295 [18:19:44] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3236: relocating_shards: 0: initializing_shards: 17: unassigned_shards: 295 [18:19:44] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3236: relocating_shards: 0: initializing_shards: 17: unassigned_shards: 295 [18:20:02] (03PS1) 10Manybubbles: Increase Cirrus pool counter for new servers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96064 [18:20:33] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3255: relocating_shards: 0: initializing_shards: 13: unassigned_shards: 288 [18:20:34] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1304: active_shards: 3255: relocating_shards: 0: initializing_shards: 13: unassigned_shards: 288 [18:20:57] gah [18:22:03] checking [18:23:09] ignore it - sorry [18:23:17] If I do things too fast I think it causes this [18:23:44] this is me adding more replicas and Elasticsearch taking longer than expected to catch up [18:25:31] Jeff_Green: if you replied, I missed it; my IRC client froze; and I just noticed... [18:25:48] mwalker: they're pretty modest machines: dual Xeon L5420's, 8GB RAM, single 73GB SAS HDD [18:26:52] manybubbles: soooo [18:26:58] should I do more, orr waat? [18:27:25] LeslieCarr: can you help meeeeee?
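For reference, the replica bump manybubbles describes above is a dynamic index setting; a sketch with placeholder host and index names (not the exact commands used):

```bash
ES=http://elastic1001:9200
INDEX=enwiki_general          # example index name
# confirm what the index actually has -- the "didn't take" failure mode
# described earlier would show number_of_replicas: 0 here:
curl -s "$ES/$INDEX/_settings?pretty"
# raise it back; ES starts copying shards immediately, which is what the
# initializing/unassigned counts in the icinga output above reflect:
curl -s -XPUT "$ES/$INDEX/_settings" -d '{
  "index": { "number_of_replicas": 2 }
}'
```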
[18:27:53] AaronSchulz_: hey [18:28:24] Jeff_Green: that was probably kick ass back in the day :p -- we're still not sure of load; but we do expect to be CPU and RAM heavy [18:29:00] I'm planning on working off of a scratch local disk; so it would be nice if we had an SSD in there; but it's probably not strictly required [18:30:01] paravoid: I was just reading an email about whether to put MW tarballs in swift [18:30:58] AaronSchulz: I didn't see that [18:32:29] mwalker: for temp files? would shm make sense? [18:33:18] Jeff_Green: probably not; temporary could mean several days [18:33:34] k [18:34:18] (03CR) 10Ottomata: Writing JSON statistics to log file rather than syslog or stderr (031 comment) [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/95473 (owner: 10Ottomata) [18:34:51] (03PS1) 10Reedy: NOT COMMITTING YOUR CHANGES IS BAD, YO [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96066 [18:35:04] ^d: Not my fault, blame ori-l [18:35:14] <^d> :p [18:35:29] (03PS2) 10Reedy: NOT COMMITTING YOUR CHANGES IS BAD, YO [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96066 [18:35:50] that commit summary bug needs fixing ;) [18:36:05] Reedy: it's filed. [18:36:09] (03CR) 10Reedy: [C: 032] Update bits static symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96066 (owner: 10Reedy) [18:36:15] !bug grrrit [18:36:15] https://bugzilla.wikimedia.org/grrrit [18:36:26] MatmaRex: So is bug 1 in our bugzilla ;) [18:36:29] what is that crappy bangcode. [18:36:34] !bugsearch grrrit [18:36:41] seriously? [18:37:00] https://bugzilla.wikimedia.org/show_bug.cgi?id=54372 [18:37:15] Jeff_Green: back to hardware configuration, my initial thought is to take the old configuration and just modernize it; so 2xquad core CPU, 16 or 32 GB RAM, and a small hard drive (SSD or spinny). Cscott pointed out that latex is probably disk heavy; so we should probably prefer SSDs [18:38:01] mwalker: sounds reasonable [18:38:02] ottomata: I'll let you know. I'll have a write up this afternoon about wtf happened. [18:38:11] its unpleasant [18:38:33] ok cool [18:39:08] Jeff_Green: how many should I request? and/or is there a standard configuration we have close to that? [18:39:46] i think just throw it in RT with essentially what you've described here and we'll follow up [18:40:03] kk [18:41:34] (03PS8) 10Manybubbles: Puppet configuration for new elasticsearch servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/95720 [18:42:30] (03CR) 10Ottomata: Setting up varnishkafka on mobile varnish caches (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 (owner: 10Ottomata) [18:42:52] ottomata: so testsearch100[1-3] seem to be freaking out. because: they are still running 0.90.3 and it looks like there is some kind of bug preventing it from working properly [18:43:08] wuh oh! [18:43:17] I'm shuffling the shards off of them [18:43:24] then we need to just shoot them [18:43:38] that last puppet update removes them from the list [18:43:51] ah, let me decommission them. we have a puppet class for that [18:45:34] (03PS9) 10Manybubbles: Puppet configuration for new elasticsearch servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/95720 [18:45:58] yeah k [18:48:19] (03Merged) 10jenkins-bot: Update bits static symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96066 (owner: 10Reedy) [18:48:21] mwalker, Jeff_Green, we currently have 3 boxes and they're running close to full capacity.
new renderer should theoretically be faster, but still I'd say we should start with 4 servers [18:48:35] sounds good [18:49:19] (03PS7) 10Ottomata: Setting up varnishkafka on mobile varnish caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/94169 [18:49:25] (03CR) 10Chad: [C: 031] Increase Cirrus pool counter for new servers [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96064 (owner: 10Manybubbles) [18:49:27] we're pretty close to one of our standard builds, so we could just pull the SSD and toss it into the spare pool if it proves unnecessary [18:49:57] Can they swim? [18:50:11] we dredge them as necessary [18:50:34] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:50:34] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:50:43] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:50:43] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:50:43] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:51:12] so manybubbles, just fyi, I can create the new mounts without deleting the data [18:51:13] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:51:13] RECOVERY - ElasticSearch health check on testsearch1001 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1308: active_shards: 3580: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:51:15] for the future ones [18:51:26] i can just stop elastics and then copy the data into the new partition [18:51:35] ottomata: it _shouldn't_ matter [18:51:41] mostly elastic will throw out the old data anyway [18:51:47] Jeff_Green: MaxSem: https://rt.wikimedia.org/Ticket/Display.html?id=6335 [18:51:53] because it'll rarely decide that that machine should host the same shard [18:51:59] people complain about that bug all the time [18:52:04] but that is how it is [18:52:15] mostly it'll have replicated by the time you bring everything back up [18:52:26] "status" : "green" [18:52:35] ottomata: as you see from the recovery above [18:53:54] "No permission to view ticket" [18:54:10] k, manybubbles you tell me when and what to do :p [18:54:41] MaxSem: hah! even though I CC'd you on it [18:54:50] ottomata: it is now safe to shoot them one at a time like we tried this morning [18:55:20] oh; great [18:55:24] someone triaged it [18:55:30] now I don't have permissions to view it either :p [18:56:04] I luv RT <3 [18:56:27] (03CR) 10Chad: [C: 032] Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 (owner: 10Reedy) [18:56:32] mwalker, or ops who configured it?:P [18:57:33] Procurement queueue? [18:57:45] Jeff_Green: max and I can no longer view the ticket; I'm guessing that's because you moved it to the requisitions queue? if so; can you take the lead on forwarding any questions to me? (or give me permissions or something.) [18:57:46] <^d> Reedy: Sync that when it merges ^ so nobody yells at us :p [18:58:23] I'll add you as requestors [19:00:51] mwalker: better? [19:01:47] Jeff_Green: still no permission to view ticket [19:02:22] move it back to core-ops for now to have the discussion [19:02:36] then just put it in procurement when it's ready for ordering from vendor [19:02:52] alright [19:02:56] the permission thing is because we cant make the vendor part public [19:03:02] but thats all [19:09:05] mutante: can you set a template in RT? if you will, i can resolve rt128 [19:09:46] matanya: permission-wise.. yea [19:10:38] (03CR) 10jenkins-bot: [V: 04-1] Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 (owner: 10Reedy) [19:11:30] stfu grrrit-wm [19:14:20] (03PS4) 10Reedy: Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 [19:14:25] (03CR) 10Reedy: [C: 032] Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 (owner: 10Reedy) [19:16:28] Jeff_Green: if Ryan_Lane sticks to his guns and won't let us use labs as a proving grounds; do you know if we have a spare box we could use until we get actual hardware?
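The one-at-a-time procedure being agreed on here, as a sketch (the service name and the wait loop are assumptions):

```bash
# With replicas in place, one node can safely be taken out at a time:
service elasticsearch stop
# ... create and mount the new data partition; copying the old
# /var/lib/elasticsearch across is optional, since (as noted above)
# the cluster will mostly re-replicate onto the node anyway ...
service elasticsearch start
# wait for green before touching the next node:
until curl -s http://localhost:9200/_cluster/health | grep -q '"status":"green"'; do
  sleep 15
done
```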
[19:16:45] mwalker: we always do [19:16:49] for this purpose ;) [19:17:04] mwalker: the SSD is the only thing we probably don't have handy, [19:17:11] mwalker: add a ticket to RT in procurement queue [19:17:51] ok, we don't need SSDs to start with; it'll likely just be slower -- or maybe not; we don't really know yet :p [19:17:56] ha [19:18:03] throw it on shm :-) [19:18:24] (03PS10) 10Ottomata: Puppet configuration for new elasticsearch servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/95720 (owner: 10Manybubbles) [19:18:27] disk and anything pretending to be disk is so 1998 [19:18:31] (03CR) 10Ottomata: [C: 032 V: 032] Puppet configuration for new elasticsearch servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/95720 (owner: 10Manybubbles) [19:20:18] * AND status = 'new' [19:20:20] so, mutante if you will create the template and help me test a bit i can right the needed puppet code [19:20:30] *write [19:22:39] (03Merged) 10jenkins-bot: Simplify wmfBlockJokerEmails [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/95283 (owner: 10Reedy) [19:22:48] PROBLEM - ElasticSearch health check on testsearch1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137 [19:23:18] PROBLEM - ElasticSearch health check on testsearch1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [19:23:57] !log reedy synchronized wmf-config/CommonSettings.php 'I0921b9e853fdacd554de40f138644d6187ae7efe' [19:24:11] Logged the message, Master [19:24:38] PROBLEM - LVS HTTP IPv4 on search.svc.eqiad.wmnet is CRITICAL: Connection refused [19:25:20] eh? [19:25:33] is that elasticsearch? [19:25:35] I think it is [19:25:38] RECOVERY - LVS HTTP IPv4 on search.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 405 bytes in 0.001 second response time [19:25:52] it was a page, whatever it is [19:25:59] manybubbles, ^d: ping? [19:25:59] it was elastic [19:26:13] seriously, lvs, [19:26:16] that is elasticsearch [19:26:30] ottomata just merged something that includes ::decomission to testsearchNNNN [19:26:31] 11:27 <+icinga-wm> PROBLEM - ElasticSearch health check on testsearch1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137 [19:26:44] it seems testsearch1001 and 1002 went down [19:26:54] they did because of https://gerrit.wikimedia.org/r/95720 [19:27:01] heh, stupid gmail. won't let me delete 110,000 emails? I'll script it through imap, one message at a time [19:27:08] mutante: lvs should have picked up elastic10XX [19:27:36] picked up how? [19:27:38] noone configured it to [19:27:38] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.110 [19:28:02] Ryan_Lane: prepping for your departure? [19:28:04] paravoid: grrr - I thought class { "lvs::realserver": realserver_ips => [ "10.2.2.30" ] } was in the old config [19:28:05] eqiad/search:{ 'host': 'testsearch1001.eqiad.wmnet', 'weight': 30, 'enabled': True } [19:28:07] can you tell me which servers should it be pointed to? [19:28:08] eqiad/search:{ 'host': 'testsearch1002.eqiad.wmnet', 'weight': 30, 'enabled': True } [19:28:10] or just a LOT of spam? [19:28:12] manybubbles: that does not alter the LVS config [19:28:15] cron spam [19:28:21] paravoid: seriously? [19:28:21] i figure we disable them [19:28:29] manybubbles: seriously, on purpose [19:28:33] it's most of my mailbox [19:28:38] this is the realserver part, not the balancer part [19:28:46] paravoid: are you setting False in pybal ? 
edit conflict [19:28:56] I haven't edited anything [19:29:01] Jeff_Green: https://rt.wikimedia.org/Ticket/Display.html?id=6336 is the ticket for the temporary box -- I put it in the procurement queue; so once again I can't read it; but I shouldn't need to have much input in it [19:29:07] manybubbles: what should the new realservers be? [19:29:18] mwalker: ok. i'll shop it around [19:29:26] paravoid: elastic1001-elastic1012 [19:29:40] mutante: you have the file open, are you fixing or should I? [19:29:56] i'm adding the new ones, what about testsearch1003 [19:30:11] mutante: it should be getting retired as well [19:31:01] mutante: it looks like it will get retired when puppet runs for it [19:31:11] !log pybal - eqiad/search: disabling testsearch100[13], adding elastic10[12] [19:31:20] paravoid: is this all though? [19:31:23] Logged the message, Master [19:31:35] mutante: it wasn't, I fixed it [19:31:44] it was 01-12, not 01-02 [19:32:21] yeah 01-12 [19:32:25] oh, i see, we have that many now [19:32:37] yea, i see them all now, thx [19:32:53] (03PS1) 10Jgreen: remove manifests from erzurumi, it's already on frack puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/96072 [19:32:54] fixed now [19:33:30] paravoid and mutante: thanks. can you point me to documentation on this? [19:33:36] I don't want to break it again [19:33:50] manybubbles: https://wikitech.wikimedia.org/wiki/Pybal [19:33:52] https://wikitech.wikimedia.org/wiki/LVS [19:34:04] the latter has more, the first one is just the software [19:34:08] yea, that's the better one [19:34:15] just noticed the HowTo is over there [19:35:06] manybubbles: technically, ottomata caused this :) [19:35:17] huh [19:35:21] is search on mw.org broken? [19:35:25] "We could not complete your search due to a temporary problem. Please try again later." [19:35:31] i tried later, still didn't work. :P [19:35:36] https://www.mediawiki.org/w/index.php?title=Special%3ASearch&profile=default&search=commit+message+guidelines&fulltext=Search [19:35:41] yup, still doesn't work [19:35:46] manybubbles: ^^ [19:36:08] * greg-g sighs [19:36:11] so, why did we decom testsearchNNNN before we tried switching LVS? [19:36:28] this is a weird way of migrating a service [19:37:10] MatmaRex: I'm seeing it [19:37:25] paravoid: I'm brain dead after all the other excitement today. [19:38:02] paravoid: I'll send an email, but they were broken in other ways and needed to go. I thought we'd already added the new nodes to lvs. [19:38:05] do you want to roll back to testsearch ? [19:38:07] Not for any good reason [19:38:07] (03PS1) 10Andrew Bogott: Added a 'no longer puppetmaster' motd on sockpuppet and stafford [operations/puppet] - 10https://gerrit.wikimedia.org/r/96073 [19:38:12] Jeff_Green: ^ [19:38:23] mutante: give me a moment - probably not [19:38:27] andrewbogott: yay! [19:38:44] manybubbles: kk [19:39:05] * andrewbogott logs into sockpuppet to merge that patch... [19:39:24] (03CR) 10Andrew Bogott: [C: 032] Added a 'no longer puppetmaster' motd on sockpuppet and stafford [operations/puppet] - 10https://gerrit.wikimedia.org/r/96073 (owner: 10Andrew Bogott) [19:39:59] PROBLEM - ElasticSearch health check on testsearch1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136 [19:40:13] hehe [19:40:51] so, uh, is this blocking people saving edits for some reasons? [19:40:52] -s [19:40:54] ottomata: wait until you write the outage report before you "hehe" :) [19:41:18] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [19:41:29] hey!
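The distinction that bit everyone here: class { 'lvs::realserver': ... } only makes the backends accept traffic for the service IP; which backends actually receive traffic is decided by pybal's own config, in the same format the log already quotes. After the [19:31:11] !log entry, the eqiad/search pool would read roughly as below -- the weights are guesses; only the host list and enabled flags follow the log:

```
{ 'host': 'testsearch1001.eqiad.wmnet', 'weight': 30, 'enabled': False }
{ 'host': 'testsearch1003.eqiad.wmnet', 'weight': 30, 'enabled': False }
{ 'host': 'elastic1001.eqiad.wmnet', 'weight': 10, 'enabled': True }
{ 'host': 'elastic1002.eqiad.wmnet', 'weight': 10, 'enabled': True }
...
{ 'host': 'elastic1012.eqiad.wmnet', 'weight': 10, 'enabled': True }
```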
[19:41:39] ha [19:41:44] uh, wait this is a real outage? [19:41:45] I'm getting 503s [19:41:46] yes. [19:41:49] these are new search servers? [19:42:04] you merged a patch that decom'ed the old ones [19:42:06] yeah [19:42:08] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [19:42:19] yesshhhh, manybubbles said he put everything on the new ones [19:42:35] ottomata: there was no pybal change [19:42:48] ottomata: so they just went down but still in LVS [19:42:58] oh so search reqs still routed to them? [19:42:59] ottomata: paravoid added all the new ones now.. but it doesn't work [19:43:05] a) you didn't change pybal at the same time, which made pybal look at the old ones, b) the new ones weren't confirmed to be working before we decom'ed the old servers [19:43:27] manybubbles: still working on it? need anything? [19:43:43] ok sorry, i wasn't aware that manybubbles' change could actually bring things down [19:44:17] ottomata: turns out my clever use of 'path =>' doesn't help, we still get conflicts. [19:44:27] search is considered a critical service, and it pages us too [19:44:30] andrewbogott: just on labs_vagrant instances, or elsewhere? [19:44:35] So I guess we need to add some more if (!defined) [19:44:42] mistakes happen, but please don't CR+2/merge if you're not sure how things work :) [19:44:43] well, on palladium and sockpuppet at the moment. [19:44:54] yeah, for sure, sorry about that paravoid [19:45:25] you're writing an outage report now though :) [19:45:32] what sockpuppet? [19:45:35] manybubbles and I were confused this morning, because we didn't actually expect these machines to already be in the elasticsearch cluster [19:45:38] well, after this is over [19:46:02] manybubbles: still here? [19:46:25] when cmjohnson1 puppetized them, elasticsearch was installed and they were added to the cluster, which we wanted to do slowly [19:46:54] <^d> Ok, well that's not how the puppet setup is designed. [19:46:55] i think in the future, for new nodes of any system like this, the base system can/should be puppetized, just not any of the functional softwares [19:47:21] that should've been done under manybubbles watch [19:47:23] so, yeah, future perfects and all, but what's going to happen now to fix the issue? [19:47:46] yea, revert or not [19:47:56] i don't know, hoping manybubbles chimes in! [19:47:58] can we revert? nik said he didn't want to [19:48:00] I'm here [19:48:02] ^d: manybubbles isn't responding to pings, I have no idea if he's deep in debugging or afk; I don't think any ops know anything about elasticsearch, could you please help? [19:48:06] sorry, working around more stuff [19:48:12] oh [19:48:14] there he is :) [19:48:27] so we somehow got unassigned shards [19:48:32] I really really really have no idea how [19:48:35] lots of unassigned shards [19:48:39] and I'm assigning them [19:48:49] using a script I've shoved on wikitech/wiki/Search/New [19:48:54] i have no idea how complicated it is for you guys, but first thought is maybe you just wanna bring testsearch back up for now and then figure out the rest [19:49:11] That should get us better [19:49:33] i have to run in 10 minutes [19:49:37] ottomata: we can bring testsearch back up only if we upgrade elasticsearch on them [19:49:42] ugh [19:49:58] manybubbles: can we just turn off the new ones?
[19:50:06] I see some results from https://www.mediawiki.org/w/index.php?title=Special%3ASearch&profile=default&search=commit+message+guidelines&fulltext=Search at any rate [19:50:09] if you want to bring them all down [19:50:09] will it work if we have testsearch with the old versions and no one using the new versions? [19:50:42] that's fine, i can do that real quick [19:50:42] ottomata: OK, nevermind, this is actually because of a different mistake I think. [19:50:45] at this point all the state has been migrated to elastic1001-1012 and bringing the old ones back online will require migrating it back [19:50:52] ok... [19:50:53] so [19:50:54] what? [19:51:01] better to just get the new ones up? [19:51:16] sounds like that's what he's doing (assigning shards, on the new ones, I assume) [19:51:19] yeah [19:52:00] yeah [19:52:22] at this point we can either switch back to lsearchd as primary on all wikis or wait until this is done [19:52:58] please switch to lsearchd [19:52:59] how long til this is done? [19:53:02] manybubbles: how long will the fix take? and can the switch back to lsearchd happen without any other intervention? [19:53:08] it just finished moving everything back [19:53:09] I'm leaning towards lsearchd until we're good [19:53:12] (03PS1) 10Andrew Bogott: Remove explicit creation of /var/lib/git/operations/software [operations/puppet] - 10https://gerrit.wikimedia.org/r/96076 [19:53:12] ok [19:53:19] lsearchd can be done without my help [19:53:39] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 12: number_of_data_nodes: 12: active_primary_shards: 1312: active_shards: 3592: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:53:52] and now that we're recovered the exceptions have stopped [19:54:28] but it'd still be worth switching back until I can figure out what knocked out all those shards [19:54:45] in fact, that is pretty critical. I can't really trust the system without understanding that [19:54:58] <^d> On it. [19:55:01] alright, who's going to switch us ... thanks [19:55:02] <^d> lsearch thing. [19:55:10] (03CR) 10Andrew Bogott: [C: 032] Remove explicit creation of /var/lib/git/operations/software [operations/puppet] - 10https://gerrit.wikimedia.org/r/96076 (owner: 10Andrew Bogott) [19:55:26] (03PS1) 10Chad: Disable Cirrus on all wikis except test2wiki, back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96077 [19:55:52] (03CR) 10Chad: [C: 032] Disable Cirrus on all wikis except test2wiki, back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96077 (owner: 10Chad) [19:55:58] (03CR) 10Faidon Liambotis: [C: 032] Disable Cirrus on all wikis except test2wiki, back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96077 (owner: 10Chad) [19:56:03] (03Merged) 10jenkins-bot: Disable Cirrus on all wikis except test2wiki, back to lsearchd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/96077 (owner: 10Chad) [19:56:15] ^d: thanks. [19:56:43] <^d> yw [19:56:55] are you deploying too? [19:56:56] !log demon synchronized wmf-config/InitialiseSettings.php 'Disabling Cirrus on all wikis but test2wiki' [19:56:58] there [19:56:59] <^d> Done. [19:56:59] ok you are [19:57:00] :) [19:57:05] thank you [19:57:10] Logged the message, Master [19:57:10] <^d> No problem. [19:57:19] <^d> Left test2wiki in place so we can have some place for debugging.
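The same fields icinga prints in the checks above can be polled by hand to confirm the cluster has settled before Cirrus comes back (the host is an example):

```bash
curl -s 'http://elastic1001:9200/_cluster/health?pretty'
# expect before re-enabling Cirrus as primary:
#   "status" : "green",
#   "relocating_shards" : 0,
#   "initializing_shards" : 0,
#   "unassigned_shards" : 0
```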
happy monday morning, everyone! [19:57:31] people, 12 outages in 12 days [19:57:49] 12 consecutive days [19:57:58] (03PS1) 10Andrew Bogott: Remove explicit creation of /var/lib/git/operations/puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/96078 [19:58:01] yikes sorry all :( [19:58:32] ok, i'm teaching a class today, have to run to make it [19:58:42] paravoid: happy to write outage report, but I won't be able to until tomorrow [19:58:46] or late tonight [19:59:34] ok, i have to run like right now, send me an email with what I need to do [19:59:43] byyeee [20:00:09] well, sooner the better, either chad or nik can also do it [20:00:30] (03CR) 10Andrew Bogott: [C: 032] Remove explicit creation of /var/lib/git/operations/puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/96078 (owner: 10Andrew Bogott) [20:02:28] greg-g: I'm starting now [20:02:35] manybubbles: thanks much man [20:02:39] brbs [20:02:44] I'm still not sure of everything, I'll get what I know [20:13:33] apergos: salt-key -F | grep master [20:13:47] on palladium, which I'd imagine will be the new salt master? [20:14:16] well I dunno [20:14:22] to switch clients to the new master, you'll need to: 1. scp the keys from sockpuppet to the new master [20:14:29] the minion keys, of course [20:14:56] 2. change puppet to point the minions to the new master, listing the new master fqdn and the master's public fingerprint [20:15:00] I have the vague notion that palladium might already have the salt master class in the node manifest [20:15:48] I wonder if I'm going to have to manually toss the master key on all the clients [20:16:05] they cache the pub key I believe [20:18:03] I don't believe you'll need to [20:18:13] I wonder how multi-master is set up [20:18:50] oh [20:18:52] simple: http://docs.saltstack.com/topics/tutorials/multimaster.html [20:19:21] it's really just a matter of pointing the minions at both [20:19:24] so let's do that [20:19:29] then we'll have no salt outage at all [20:19:45] ok [20:20:07] I will do this tomorrow morning (believe it or not I am actually eating dinner now, shocking I know) [20:20:37] unless we have some other disaster. good thing about mornings is they are usually a quiet time so if something does go awry I'll be able to fix it up without inconveniencing folks [20:21:06] the multimaster thing seems good [20:21:37] yep [20:21:41] it's kind of basic [20:22:07] but assuming we keep everything in sync (which puppet does) we'll be fine [20:22:21] do we have shared files? [20:23:07] hm same master private key on bot, ok [20:23:10] *both [20:23:43] nah, puppet installs everything [20:23:55] perfect [20:23:57] only the key data needs to be sync'd [20:24:11] seems, pretty basic like you say [20:24:15] and eventually it'll use puppet's keys [20:24:21] oh ho [20:25:05] so this is not the syndic approach we were talking about earlier [20:25:17] this is just 'get ready to get us the heck off of sockpuppet' [20:25:59] (which is fine, I just want to flag that we do eventually want more than one salt master anyways for spof reasons) [20:26:41] right [20:26:53] I wonder how the master finger stuff works with this [20:27:10] * Ryan_Lane asks in #salt [20:27:30] forgot to rejoin after nome-3 crash [20:27:32] *gnome [20:31:19] hm.
[20:31:43] :-D
[20:31:53] I'm pretty done for the day or I'd poke around in it
[20:32:06] but if I had energy it would be for poking around getting docker running on f20 beta
[20:32:57] seems you sync the master key too
[20:35:10] yeah, I mentioned that above
[20:35:26] one key to rule them all etc
[20:35:47] yep
[20:36:07] but I was wondering how it would work with a syndic, because isn't that with separate keys?
[20:36:25] I have not read any of the docs on that if you can't tell
[20:37:09] (Abandoned) Andrew Bogott: Implement the $mode param for git::clone [operations/puppet] - https://gerrit.wikimedia.org/r/96057 (owner: Andrew Bogott)
[20:38:01] Ryan_Lane: any objection to https://gerrit.wikimedia.org/r/#/c/95699/ ?
[20:38:41] that's fine
[20:39:10] (CR) Andrew Bogott: [C: 2] Removed misc::deployment::scripts class. [operations/puppet] - https://gerrit.wikimedia.org/r/95699 (owner: Andrew Bogott)
[20:39:36] !log disabled a bunch of rt users who no longer login and don't work here anymore (so if you do work here and i messed up, oops ;)
[20:39:49] Logged the message, RobH
[20:44:07] (PS1) RobH: removing access from dsc [operations/puppet] - https://gerrit.wikimedia.org/r/96143
[20:45:11] (PS2) Jgreen: remove manifests from erzurumi, it's already on frack puppet [operations/puppet] - https://gerrit.wikimedia.org/r/96072
[20:45:15] apergos: the syndic is via the minion
[20:45:41] * apergos goes to rtfm
[20:45:45] (CR) Jgreen: [C: 2 V: 1] remove manifests from erzurumi, it's already on frack puppet [operations/puppet] - https://gerrit.wikimedia.org/r/96072 (owner: Jgreen)
[20:46:03] I believe the syndic subscribes and relays
[20:46:19] a passthrough minion running on the one master
[20:46:21] hm
[20:46:44] yep, so you configure the syndic to point at the master
[20:46:50] then accept the syndic on the master
[20:47:28] I'll have to play with that
[20:47:43] (not tomorrow morning on the cluster though :-D)
[20:48:12] heh
[20:48:19] well, I want syndic for esams, ulsfo and labs
[20:48:38] paravoid https://gerrit.wikimedia.org/r/#/c/88261/ :)
[20:48:39] local master per dc?
[20:48:49] except that I don't think we want labs to be in that list
[20:48:55] ori-l: Reedy: et al. fyi, "procurement" queue should be open and usable for you now
[20:48:59] (Abandoned) Jgreen: remove manifests from erzurumi, it's already on frack puppet [operations/puppet] - https://gerrit.wikimedia.org/r/96072 (owner: Jgreen)
[20:49:04] a syndic just syndicates from a master
[20:49:16] so labs instances would point to the labs master
[20:49:24] only one way, ok
[20:49:25] * ori-l procures all the things
[20:49:33] but they'd get syndicated calls from production as well
[20:49:36] right
[20:49:39] so you could call labs from production
[20:49:47] which would be convenient :)
[20:49:51] I dunno why I thought initially it was bidirectional
[20:49:59] yeah, I thought so too
[20:50:06] but this is better
[20:50:09] indeed
[20:51:07] let us never have so many layers of syndics that we have to change the syndic_wait value :-D
[20:53:04] heh
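To make the syndic wiring described above concrete, a minimal sketch with hypothetical hostnames — the syndic daemon runs next to the lower-level master and relays jobs one way, downward, exactly as discussed:

    # on the top-level master, /etc/salt/master must opt in to syndics:
    #   order_masters: True
    # (syndic_wait there controls how long it waits for relayed returns)
    #
    # on the lower-level master (e.g. the labs one), point upward in its
    # own /etc/salt/master:
    #   syndic_master: production-master.example.wmnet
    # then start the relay next to the local salt-master...
    salt-syndic -d
    # ...and accept its key on the top-level master, like any minion
    # (the key id here is hypothetical):
    salt-key -a labs-master-key-id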
[20:53:13] we may have up to 3
[20:53:18] production -> labs
[20:53:24] labs -> project master
[20:53:32] project master? eh?
[20:53:42] that way projects like deployment-prep can point their instances to a master they manage
[20:53:46] oh hm
[20:53:55] and the calls will get chained all the way through
[20:54:19] so many opportunities for someone to break configuration on many labs instances at once :-D
[20:54:32] oh, that's possible right now :D
[20:54:48] puppet would maintain everything anyway ;)
[20:56:29] well it will after the first time we break something for a user
[20:56:31] :-D
[21:04:51] !log deployed Parsoid fd3d6dc
[21:05:06] Logged the message, Master
[21:10:06] (PS1) Jgreen: add list of fundraising hosts to site.pp for reference [operations/puppet] - https://gerrit.wikimedia.org/r/96148
[21:14:51] Jeff_Green: thanks, that list worksforme
[21:14:58] cool
[21:15:14] also yay that two more node entries are gone :-D
[21:17:05] (CR) Jgreen: [C: 2 V: 1] add list of fundraising hosts to site.pp for reference [operations/puppet] - https://gerrit.wikimedia.org/r/96148 (owner: Jgreen)
[21:55:04] (PS1) Jgreen: remove deprecated fundraising config [operations/puppet] - https://gerrit.wikimedia.org/r/96153
[21:59:52] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer:
[22:00:53] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[22:01:29] (CR) Jgreen: [C: 2 V: 1] remove deprecated fundraising config [operations/puppet] - https://gerrit.wikimedia.org/r/96153 (owner: Jgreen)
[22:03:55] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours
[22:30:15] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Nov 18 22:30:05 UTC 2013
[22:30:57] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours
[22:33:28] (PS1) EBernhardson: Disable VisualEditor inside the Flow project [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/96161
[22:35:16] (CR) Cmcmahon: [C: 1] "needed to properly disable VE for Flow on beta labs before deployment" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/96161 (owner: EBernhardson)
[22:37:58] (PS2) EBernhardson: Disable VisualEditor inside the Flow project [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/96161
[22:58:46] (PS1) Jgreen: resurrecting ocg role [operations/puppet] - https://gerrit.wikimedia.org/r/96165
[22:59:27] (PS2) RobH: removing access from dsc [operations/puppet] - https://gerrit.wikimedia.org/r/96143
[22:59:55] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Nov 18 22:59:49 UTC 2013
[22:59:56] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours
[23:02:12] (CR) Jgreen: [C: 2 V: 1] resurrecting ocg role [operations/puppet] - https://gerrit.wikimedia.org/r/96165 (owner: Jgreen)
[23:12:00] Jeff_Green: tabs!
[23:12:09] tabs?
[23:12:15] that file has tabs, we use 4 spaces now :)
[23:12:15] OH i forgot. dammit.
[23:12:18] it was just a revert?
[23:12:27] i'll fix it and add the fancy vim header
[23:15:37] (PS1) Jgreen: !tabs! fixed tabs to spaces [operations/puppet] - https://gerrit.wikimedia.org/r/96167
[23:17:06] paravoid: hi, any hope of getting that patch? we want to deploy in 2 days some code that depends on it. Otherwise we have tons of users seeing an incorrect "free" message
[23:18:26] I'll have a look
[23:18:34] I can't promise anything, it's been pretty busy
[23:21:08] (CR) Jgreen: [C: 2 V: 1] !tabs! fixed tabs to spaces [operations/puppet] - https://gerrit.wikimedia.org/r/96167 (owner: Jgreen)
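The tab cleanup being merged above is a one-liner class of fix; a sketch, with a hypothetical manifest path:

    # Replace every hard tab with four spaces, per the puppet repo's 4-space
    # policy (file path hypothetical; expand -t 4 would work just as well):
    sed -i 's/\t/    /g' manifests/role/ocg.pp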
[23:21:47] (CR) RobH: [C: 2] removing access from dsc [operations/puppet] - https://gerrit.wikimedia.org/r/96143 (owner: RobH)
[23:21:51] mhoover: Probably good for you to subscribe, although you can ignore most of the traffic: https://lists.wikimedia.org/mailman/listinfo/ops
[23:22:09] *shrug* Or you can remain blissfully ignorant; your choice.
[23:22:21] (PS1) Jgreen: sigh. finished fixing tabs-->spaces [operations/puppet] - https://gerrit.wikimedia.org/r/96168
[23:22:36] andrewbogott: i think i was told anyone who has deployment *must* be on that list?
[23:22:50] but that was perhaps 6 months ago, no idea if folks follow that guideline, or if it's still valid.
[23:22:54] mhoover: There you have it! Please subscribe.
[23:23:05] since we announce downtimes
[23:23:14] RobH: Yeah, I'm convinced.
[23:23:14] and many times it's our first line of 'wtf does this server do?'
[23:23:38] but it's also NDA or employee only (but pretty sure all deployers fall into that)
[23:23:42] paravoid: let me know asap if there are any problems with it - we have had tons of complaints about incorrect traffic labeling
[23:23:46] just fyi
[23:23:49] thanks!!!
[23:24:03] how's that homepage thing coming along? :)
[23:27:13] paravoid: funny you asked - i went through several revisions on that one - but it seems it will be a dir with a full install of the wiki, with a mod_rewrite treating all paths as "/" (root). That root will internally be a request to our special page that will redirect to the needed page.
[23:27:20] (CR) Jgreen: [C: 2 V: 1] sigh. finished fixing tabs-->spaces [operations/puppet] - https://gerrit.wikimedia.org/r/96168 (owner: Jgreen)
[23:27:54] yeah, we had agreed to that with Adam weeks ago
[23:28:05] mhoover: also, please request a cloak here: https://spreadsheets.google.com/viewform?hl=en&formkey=dG1FTWV1RnNBVHFOSnExMHF6aUhya2c6MA
[23:28:10] paravoid: hehe, we just met today and he told me that :)))
[23:28:15] So that we can reliably identify you on IRC :)
[23:29:01] how come puppet has an indentation policy opposite to virtually all the other repos? :P
[23:30:04] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Nov 18 23:29:55 UTC 2013
[23:30:55] PROBLEM - Puppet freshness on dysprosium is CRITICAL: No successful Puppet run in the last 3 hours
[23:32:21] andrewbogott: doing
[23:32:40] mhoover: Filling out that google form is step one, step two is 'nag constantly'
[23:32:50] although I can't remember who to nag :(
[23:32:53] heheheeh
[23:32:59] got it
[23:43:25] (PS7) Dzahn: move IRC server to module [operations/puppet] - https://gerrit.wikimedia.org/r/94407
[23:45:46] (CR) Dzahn: [C: 2] move IRC server to module [operations/puppet] - https://gerrit.wikimedia.org/r/94407 (owner: Dzahn)
[23:47:18] andrewbogott mhoover: I suspect the folks listed in https://meta.wikimedia.org/wiki/IRC/Cloaks#People_who_deal_with_Wikimedia_cloaks can help if needed
[23:47:47] barras is the only one online but flagged as away right now
[23:49:09] andrewbogott: ^ and it still works :)
[23:49:13] have them process my request from..... March
[23:49:18] thanks for the labs test the other day, btw
[23:49:41] greg-g, time to wrangle some additional group contacts? :)
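Circling back to the homepage scheme described at 23:27 above: in Apache terms that idea comes down to a couple of rewrite rules. A rough sketch under stated assumptions — the vhost path and the special-page name are stand-ins, not the real ones:

    # Append a fragment to the (hypothetical) vhost that hands every request
    # path to the wiki's entry point, letting a special page compute the real
    # redirect; the RewriteCond prevents a rewrite loop on /w/ itself.
    cat >> /etc/apache2/sites-enabled/wiki.conf <<'EOF'
    RewriteEngine On
    RewriteCond %{REQUEST_URI} !^/w/
    RewriteRule ^ /w/index.php?title=Special:Landing [L]
    EOF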
[23:50:46] Eloquence: a bit :)
[23:51:01] I'm actually no longer an Ubuntu Member (rescinded a while ago, heh)
[23:51:10] andrewbogott: thx
[23:51:28] they're just being nice and letting me keep it until I have the wikimedia one.... :)
[23:51:46] your cloak is a lie. it's a lie!
[23:51:47] ^demon|busy: when you're not busy; if you have time before the end of the day; I have a couple of gerrit repos pending in the request queue
[23:52:09] Eloquence: we're all just facades, make them fun.
[23:52:41] (CR) Dzahn: "i don't know what to do with this patch currently, so i'm removing myself from reviewers. feel free to re-add me once there is a change" [operations/puppet] - https://gerrit.wikimedia.org/r/80760 (owner: Dereckson)
[23:53:43] <^demon|busy> collectoid? Really? /me sighs
[23:54:21] I don't have a cloak :-)
[23:54:26] nor do I want one
[23:56:13] support your favorite project! debian/paravoid
[23:56:23] debian uses OFTC
[23:56:33] * hashar hides
[23:57:18] > 50% of oftc "staff" are Debian people
[23:59:55] RECOVERY - Puppet freshness on dysprosium is OK: puppet ran at Mon Nov 18 23:59:50 UTC 2013